Motifs in Genomic Sequences

Once a complete genome sequence is available, how does one discover which portions of the genome are functional, and what their functions are? One computational approach is to find sequence motifs, which are approximately repeated substrings that occur more frequently than expected by chance. Such a motif can be hypothesized to have functional significance, and appropriate laboratory experiments can then be devised to test that hypothesis.

A natural application of this idea arises in the study of gene regulation. One of the challenges currently facing biologists is to understand the mechanisms that regulate how, when, where, and at what rate genes express their products. An important aspect of this challenge is the identification of binding sites in the genome for the proteins involved in such regulation.

One approach to this problem is to deduce the binding sites by considering the regulatory regions of several functionally related genes from a single genome. We have developed practical algorithms that search for statistically overrepresented motifs in such a collection of regulatory regions [1,3,10,11]; these motifs are good candidate binding sites. We also conducted the first comprehensive assessment of 13 such computational motif discovery tools [12].
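To make the idea of statistical overrepresentation concrete, here is a minimal sketch in Python. It is not the method of any of the cited tools: it assumes exact (not approximate) k-mer matches, an independent-bases background estimated from the input itself, and a binomial variance that ignores overlapping occurrences. The function name kmer_z_scores and the toy sequences are invented for illustration.

```python
from collections import Counter
from itertools import product
from math import sqrt


def kmer_z_scores(seqs, k):
    """Score every DNA k-mer by how far its observed count exceeds
    the count expected under an independent-bases background.

    Assumptions (illustrative only): exact matches, background base
    frequencies taken from the input, binomial variance with no
    correction for overlapping occurrences.
    """
    # Background base frequencies estimated from the input sequences.
    base_counts = Counter("".join(seqs))
    total = sum(base_counts[b] for b in "ACGT")
    freq = {b: base_counts[b] / total for b in "ACGT"}

    # Observed count of each k-mer over all window positions.
    observed = Counter(
        s[i:i + k] for s in seqs for i in range(len(s) - k + 1)
    )
    positions = sum(max(len(s) - k + 1, 0) for s in seqs)

    scores = {}
    for kmer in map("".join, product("ACGT", repeat=k)):
        p = 1.0
        for b in kmer:
            p *= freq[b]
        expected = positions * p
        variance = positions * p * (1.0 - p)
        if variance > 0:
            scores[kmer] = (observed[kmer] - expected) / sqrt(variance)
    return scores


# Toy regulatory regions, each containing the planted site TTGACA.
example_seqs = ["CCTTGACAGG", "ATTTGACACT", "GGTTGACATA", "TCTTGACAAC"]
scores = kmer_z_scores(example_seqs, 6)
top_kmer = max(scores, key=scores.get)
print(top_kmer)
```

On these toy sequences the planted site receives the highest score. Real regulatory regions call for a richer background model and for tolerance of mismatches, which this sketch deliberately omits.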

An orthogonal approach deduces binding sites by considering orthologous regulatory regions of a single gene from multiple species. This approach has been called "phylogenetic footprinting". The simple premise underlying phylogenetic footprinting is that selective pressure causes functional elements to evolve at a slower rate than nonfunctional sequences. This means that unusually well-conserved sites among a set of orthologous regulatory regions are excellent candidate binding sites. Given orthologous input sequences and the evolutionary tree relating them, we have developed practical phylogenetic footprinting algorithms that identify the best-conserved sites [2,5,7].
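The premise that conserved sites stand out among orthologous regions can be illustrated with a small Python sketch. It is a deliberate simplification and not the cited algorithms: it ignores the evolutionary tree entirely, picks one sequence as a reference, and scores each reference k-mer by the sum of its smallest Hamming distances to the other sequences. The names hamming and best_conserved_site and the toy sequences are hypothetical.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def best_conserved_site(reference, orthologs, k):
    """Return (total_mismatches, start, kmer) for the reference k-mer
    whose closest match in every ortholog has the fewest mismatches.

    Simplification: no evolutionary tree is used, and every ortholog
    is assumed to be at least k bases long. Ties go to the leftmost
    reference position.
    """
    best = None
    for i in range(len(reference) - k + 1):
        site = reference[i:i + k]
        cost = 0
        for seq in orthologs:
            # Closest window in this ortholog, wherever it occurs.
            cost += min(
                hamming(site, seq[j:j + k])
                for j in range(len(seq) - k + 1)
            )
        if best is None or cost < best[0]:
            best = (cost, i, site)
    return best


# Toy orthologous regions: TTGACA is conserved exactly in the first
# ortholog and with one substitution (TTGACT) in the second.
reference = "GCATTGACAGTC"
orthologs = ["AATTTGACACCA", "CGGTTGACTAAT"]
result = best_conserved_site(reference, orthologs, 6)
print(result)
```

A tree-aware method would instead weigh substitutions by where they fall on the phylogeny, so that agreement between distantly related species counts for more than agreement between close relatives.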

Combining the previous two approaches, we developed one of the first algorithms that finds motifs in complex data sets consisting of multiple functionally related genes from multiple species [9].

Motif discovery tools developed from all these algorithms are freely available to the public and used by scientists around the world. We collaborate with experimental biologists to apply these tools to the discovery of important regulatory mechanisms affecting human health [4,6,8].


  1. M. Blanchette, S. Sinha, "Separating real motifs from their artifacts", Bioinformatics, vol. 17 Suppl 1 (2001) S30-8. Pubmed 11472990.   Supplement.
  2. M. Blanchette, M. Tompa, "Discovery of regulatory elements by a computational method for phylogenetic footprinting", Genome Res., vol. 12 (2002) 739-48. Pubmed 11997340.   Supplement.
  3. J. Buhler, M. Tompa, "Finding motifs using random projections", J. Comput. Biol., vol. 9 (2002) 225-42. Pubmed 12015879.   Supplement.
  4. L. Giacani, C. Godornes, M. Puray-Chavez, C. Guerra-Giraldez, M. Tompa, S. Lukehart, A. Centurion-Lara, "TP0262 is a modulator of promoter activity of tpr Subfamily II genes of Treponema pallidum ssp. pallidum", Mol. Microbiol., vol. 72 (2009) 1087-99. Pubmed 19432808.
  5. S. Neph, M. Tompa, "MicroFootPrinter: a tool for phylogenetic footprinting in prokaryotic genomes", Nucleic Acids Res., vol. 34 (2006) W366-8. Pubmed 16845027.   Supplement.
  6. H. Park, K. Guinn, M. Harrell, R. Liao, M. Voskuil, M. Tompa, G. Schoolnik, D. Sherman, "Rv3133c/dosR is a transcription factor that mediates the hypoxic response of Mycobacterium tuberculosis", Mol. Microbiol., vol. 48 (2003) 833-43. Pubmed 12694625.
  7. A. Prakash, M. Tompa, "Discovery of regulatory elements in vertebrates through comparative genomics", Nat. Biotechnol., vol. 23 (2005) 1249-56. Pubmed 16211068.   Supplement.
  8. M. Shnyreva, W. Weaver, M. Blanchette, S. Taylor, M. Tompa, D. Fitzpatrick, C. Wilson, "Evolutionarily conserved sequence elements that positively regulate IFN-gamma expression in T cells", Proc. Natl. Acad. Sci. U.S.A., vol. 101 (2004) 12622-7. Pubmed 15304658.
  9. S. Sinha, M. Blanchette, M. Tompa, "PhyME: a probabilistic algorithm for finding motifs in sets of orthologous sequences", BMC Bioinformatics, vol. 5 (2004) 170. Pubmed 15511292.
  10. S. Sinha, M. Tompa, "Discovery of novel transcription factor binding sites by statistical overrepresentation", Nucleic Acids Res., vol. 30 (2002) 5549-60. Pubmed 12490723.   Supplement.
  11. M. Tompa, "An exact method for finding short motifs in sequences, with application to the ribosome binding site problem", Proc Int Conf Intell Syst Mol Biol, (1999) 262-71. Pubmed 10786309.
  12. M. Tompa, N. Li, T. Bailey, G. Church, B. De Moor, E. Eskin, A. Favorov, M. Frith, Y. Fu, J. Kent, V. Makeev, A. Mironov, W. Noble, G. Pavesi, G. Pesole, M. Régnier, N. Simonis, S. Sinha, G. Thijs, J. Helden, M. Vandenbogaert, Z. Weng, C. Workman, C. Ye, Z. Zhu, "Assessing computational tools for the discovery of transcription factor binding sites", Nat. Biotechnol., vol. 23 (2005) 137-44. Pubmed 15637633.   Supplement.