Identifying the Genetic Basis of Complex Traits

Humans differ in many observable qualities, termed "phenotypes", ranging from appearance to disease susceptibility. Many phenotypes are largely determined by each individual's specific "genotype", stored in the 3.2 billion bases (A, C, G, and T) of that individual's DNA sequence. If we could decipher the sequence by identifying which sequence variations affect a given phenotype, it would have a great impact on human life.

In recent years, it has become possible to retrieve an individual's sequence information on a genome-wide scale. However, classical approaches that test for direct genotype-phenotype correlation often fail to identify significant associations. Variation in the DNA sequence can perturb a complex web of interactions among many biological molecules, and the resulting change in the system can lead to phenotypic variation. The complexity of the cellular mechanisms affected by sequence variations, together with environmental factors, makes it difficult to infer the causal relationship between genotype and phenotype.
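To make the classical approach concrete, the following is a minimal sketch of a single-marker association scan on simulated data. It is an illustration only, not a method described in this text: the function name `single_marker_scan`, the sample sizes, the allele frequency, and the effect sizes are all assumptions chosen for the example, and the p-values use a normal approximation to the t-statistic. One variant is given a strong effect and twenty variants are given weak effects; with settings like these, weak effects are unlikely to reach a Bonferroni-corrected threshold even though they contribute to the trait.

```python
import math
import numpy as np

def single_marker_scan(genotypes, phenotype):
    """Return one approximate two-sided p-value per variant.

    genotypes: (n_individuals, n_variants) array of allele counts (0, 1, 2)
    phenotype: (n_individuals,) array holding a quantitative trait
    Assumes every variant is polymorphic in the sample (non-zero variance).
    """
    n = genotypes.shape[0]
    g = genotypes - genotypes.mean(axis=0)   # center each variant
    y = phenotype - phenotype.mean()         # center the trait
    # Pearson correlation between each variant and the phenotype
    r = (g.T @ y) / np.sqrt((g ** 2).sum(axis=0) * (y ** 2).sum())
    # t-statistic for testing r == 0, with n - 2 degrees of freedom
    t = r * np.sqrt((n - 2) / (1.0 - r ** 2))
    # Normal approximation to the t distribution (reasonable for large n)
    return np.array([math.erfc(abs(v) / math.sqrt(2.0)) for v in t])

# Simulated data; all numbers below are illustrative assumptions.
rng = np.random.default_rng(0)
n_individuals, n_variants = 500, 2000
genotypes = rng.binomial(2, 0.3, size=(n_individuals, n_variants)).astype(float)

effects = np.zeros(n_variants)
effects[0] = 1.0        # one variant with a strong effect
effects[1:21] = 0.05    # twenty variants with weak effects
phenotype = genotypes @ effects + rng.normal(size=n_individuals)

p_values = single_marker_scan(genotypes, phenotype)
threshold = 0.05 / n_variants                 # Bonferroni correction
hits = np.flatnonzero(p_values < threshold)
print("variants passing the corrected threshold:", hits)
```

Testing each variant in isolation keeps the scan simple, but the multiple-testing correction grows with the number of variants, which is one reason small effects are hard to detect this way.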

One way to address this challenge is to use high-throughput functional information from individuals, such as RNA expression measurements, quantitative protein profiles, and metabolite levels. The availability of such information raises many interesting questions, including the following:

  1. Why are some people more susceptible to a particular disease than others? Why do individuals respond to drugs differently? What are the genetic and environmental factors involved, and what are their underlying mechanisms?

  2. How do variations in the DNA sequence affect the various levels of gene regulatory networks (transcription, translation, chromatin and signaling)?

  3. By what molecular mechanisms do DNA sequence differences between closely related species lead to their phenotypic differences? How do those differences interact with the environment?

These problems involve many statistical challenges, such as extracting complex and biologically meaningful relationships from noisy, high-dimensional, sparsely sampled data. Our goal is to address these challenges by developing effective machine learning approaches that can translate sophisticated biological processes into robust statistical models, incorporate prior knowledge from multiple sources of genomic data, and learn such models from data efficiently. We believe that these approaches will enable a more comprehensive understanding of disease genetics, potentially leading to the realization of personalized medicine.