University of Washington Computer Science & Engineering
 Motif Discovery Assessment: Data Sets
At the December 2002 Bellairs Workshop on Computational Biology, we discussed how to make the assessment fair and realistic. Two problems stood out:
  • The problem with using real upstream sequences from co-regulated genes is that no one knows what the "correct" answer is: there may be unknown functional sites in the data.
  • The problem with using simulated data is that no one knows the "correct" stochastic model for generating background sequences and planting binding sites: does nature really use Markov chains for background and weight matrices for binding sites?

At the workshop we arrived at the following appealing scheme. The organizers selected organisms and their transcription factors from the TRANSFAC database. Each organism/transcription factor pair gives rise to one data set. For each such pair, TRANSFAC lists the known binding sites and, for each site, the gene in which it occurs, its orientation, and its distance upstream from the transcription or translation start. Suppose, for example, that there are 8 known binding sites distributed over 5 genes. The organizers then chose 5 random genes from the same organism, extracted their upstream sequences, and planted the 8 binding sites in these sequences at the same locations and in the same orientations in which they naturally occur in their own genes. (The organizers could also add a few more random genes to make the problem more realistic: say 7 genes in total, of which only 5 contain the 8 known binding sites and the other 2 contain no planted binding sites.)
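The planting step can be sketched as follows. This is only an illustration under stated assumptions, not the organizers' actual procedure: the `Site` record, the `plant` and `build_data_set` names, and the `n_extra` parameter are hypothetical. It also assumes that each upstream sequence ends immediately before the start position, that a site's distance is counted from its leftmost base to that start, and that a planted site overwrites the bases already there rather than being inserted between them.

```python
import random
from dataclasses import dataclass

_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


@dataclass(frozen=True)
class Site:
    gene: str      # gene in which the site naturally occurs
    sequence: str  # site read 5'->3' on its own strand (assumed convention)
    distance: int  # bases from the site's leftmost base to the start (assumed)
    strand: str    # "+" or "-"


def plant(upstream: str, site: Site) -> str:
    """Overwrite part of `upstream` with `site` at its original offset and orientation."""
    start = len(upstream) - site.distance
    end = start + len(site.sequence)
    if start < 0 or end > len(upstream):
        raise ValueError("site does not fit in this upstream sequence")
    motif = site.sequence if site.strand == "+" else reverse_complement(site.sequence)
    return upstream[:start] + motif + upstream[end:]


def build_data_set(sites, candidates, n_extra=0, rng=None):
    """Plant `sites` into randomly chosen upstream sequences.

    `candidates` maps gene name -> upstream sequence for genes of the same
    organism. `n_extra` additional genes are included with nothing planted.
    """
    rng = rng or random.Random()
    source_genes = sorted({s.gene for s in sites})
    chosen = rng.sample(sorted(candidates), len(source_genes) + n_extra)
    # Pair each source gene with one host gene; leftover hosts stay unplanted.
    host_of = dict(zip(source_genes, chosen))
    data_set = {gene: candidates[gene] for gene in chosen}
    for site in sites:
        host = host_of[site.gene]
        data_set[host] = plant(data_set[host], site)
    return data_set
```

With 8 sites drawn from 5 genes and `n_extra=2`, `build_data_set` returns 7 sequences, 5 of which carry the planted sites, matching the example in the text. Sites that originate in the same gene are planted together in the same host sequence, so their relative spacing is preserved.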

In subsequent discussion with participants, it was agreed that each of these three methods of choosing data sets (real upstream sequences, sequences randomly generated by a Markov chain with binding sites inserted, and randomly chosen upstream sequences with binding sites inserted) has its pros and cons, and that the assessment should include some data sets of all three types.
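For the second type of data set, a background sequence might be generated along the following lines. This is a minimal sketch, not the generator used in the assessment: the chain order, the training sequences, the uniform fallback for unseen contexts, and the function names `train_markov`, `generate`, and `insert_site` are all assumptions made for illustration.

```python
import random
from collections import Counter, defaultdict


def train_markov(sequences, order=3):
    """Count which base follows each length-`order` context in the training data."""
    counts = defaultdict(Counter)
    for seq in sequences:
        for i in range(order, len(seq)):
            counts[seq[i - order:i]][seq[i]] += 1
    return counts


def generate(counts, length, order, rng):
    """Sample a sequence of `length` bases from the trained chain."""
    if not counts:
        raise ValueError("training sequences are too short for this order")
    out = list(rng.choice(sorted(counts)))  # seed with an observed context
    while len(out) < length:
        context = "".join(out[-order:]) if order else ""
        followers = counts.get(context)
        if not followers:
            # Context never seen in training: fall back to a uniform choice.
            out.append(rng.choice("ACGT"))
        else:
            bases, weights = zip(*sorted(followers.items()))
            out.append(rng.choices(bases, weights=weights)[0])
    return "".join(out[:length])


def insert_site(background, site, position):
    """Overwrite `background` with `site` starting at `position`."""
    if position < 0 or position + len(site) > len(background):
        raise ValueError("site does not fit at this position")
    return background[:position] + site + background[position + len(site):]
```

A higher order captures more local sequence composition but needs more training data to estimate reliably. The question raised earlier, whether nature really generates background this way, applies whatever order is chosen.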

To summarize, each participating program must be capable of finding motifs whose instances may occur 0 or more times in each input sequence, and on either of the two DNA strands. Some data sets may contain no planted binding sites at all, and a program will benefit if it is capable of deciding that no significant motifs are present.
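To make these requirements concrete, the sketch below scans a set of sequences for instances of one candidate motif on both strands. It is only an illustration of the expected behaviour (any number of hits per sequence, either strand, possibly no hits at all), not a motif discovery method: the exhaustive scan, the mismatch threshold, and the name `find_instances` are assumptions.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def find_instances(sequences, motif, max_mismatches=0):
    """Return (sequence index, position, strand) for every instance of `motif`.

    Positions are 0-based on the given strand of each input sequence. A
    sequence may contribute zero, one, or many hits, and the result is an
    empty list when the motif does not occur anywhere.
    """
    motif = motif.upper()
    patterns = (("+", motif), ("-", reverse_complement(motif)))
    hits = []
    for index, seq in enumerate(sequences):
        seq = seq.upper()
        for pos in range(len(seq) - len(motif) + 1):
            window = seq[pos:pos + len(motif)]
            for strand, pattern in patterns:
                mismatches = sum(a != b for a, b in zip(window, pattern))
                if mismatches <= max_mismatches:
                    hits.append((index, pos, strand))
    return hits
```

A motif that equals its own reverse complement is reported once per strand at the same position. A real program would also need a significance test so that it can report nothing on a data set with no planted sites.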

The assessment consisted of 56 such data sets. The data sets came from the human, mouse, D. melanogaster, and S. cerevisiae genomes. Participants were told the genome of each data set. The number of sequences and the sequence length varied from data set to data set: there were 1 to 35 sequences per data set, each up to 3000 bp long. The total input size of each data set ranged from 1 Kb to 70 Kb.

