We discussed the problem of how to make the assessment fair and
realistic at the December 2002 Bellairs Workshop on Computational
Biology.
At the workshop we arrived at the following appealing scheme. The organizers selected organisms and their transcription factors from the TRANSFAC database. Each such pair gives rise to one data set. For each such pair, TRANSFAC lists known binding sites together with the gene in which the binding site occurs, its orientation, and its distance upstream from the transcription or translation start. Suppose there are 8 known binding sites collectively in 5 genes. The organizers then chose 5 random genes from the same organism, extracted their upstream sequences, and planted the 8 binding sites in these upstream sequences in the same location and orientation in which they naturally occur in their own genes. (The organizers could also add a few more random genes to the data set to make the problem realistic: say 7 genes in total, of which only 5 contain the 8 known binding sites and the other 2 contain no planted binding sites.)

In subsequent discussion with participants, it was decided that each of these three methods of choosing data sets (real upstream sequences, sequences randomly generated by a Markov chain with binding sites inserted, and randomly chosen upstream sequences with binding sites inserted) has its pros and cons. It was decided to include some data sets of all three types.

To summarize, each participating program must be capable of finding motifs whose instances may occur 0 or more times in each input sequence, and on either of the two DNA strands. Some data sets may contain no planted binding sites at all, and a program benefits if it can decide that no significant motifs are present.

The assessment consisted of 56 such data sets. The data sets came from the human, mouse, D. melanogaster, and S. cerevisiae genomes. Participants were told the genome of each data set.
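
The planting scheme described above (random upstream sequences with known binding sites inserted) can be illustrated with a minimal Python sketch. It is not the organizers' actual procedure: the function names `plant_sites` and `build_dataset`, the `(site, distance, strand)` tuple layout, and the choice to overwrite bases rather than insert them are all assumptions made for illustration. The distance is assumed to be measured from the first base of the site to the end of the upstream sequence.

```python
import random

# Translation table for complementing DNA bases.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def plant_sites(upstream, sites):
    """Overwrite part of an upstream sequence with known binding sites.

    `sites` is a list of (site_sequence, distance_upstream, strand) tuples.
    ASSUMPTION: the distance counts bases from the start of the site to the
    end of the upstream sequence, and planting overwrites existing bases so
    the sequence length is unchanged.
    """
    seq = list(upstream)
    for site, distance, strand in sites:
        motif = site if strand == "+" else reverse_complement(site)
        start = len(seq) - distance
        if start < 0 or start + len(motif) > len(seq):
            raise ValueError("site does not fit in the upstream sequence")
        seq[start:start + len(motif)] = motif
    return "".join(seq)


def build_dataset(sites_by_gene, background, n_decoys=0, rng=None):
    """Plant each gene's sites into a randomly chosen host gene.

    `sites_by_gene` maps each original gene to its list of sites.
    `background` maps candidate host gene names to upstream sequences.
    `n_decoys` extra random genes are added with nothing planted.
    """
    rng = rng or random.Random()
    n_hosts = len(sites_by_gene)
    chosen = rng.sample(sorted(background), n_hosts + n_decoys)
    hosts, decoys = chosen[:n_hosts], chosen[n_hosts:]
    dataset = {}
    for host, sites in zip(hosts, sites_by_gene.values()):
        dataset[host] = plant_sites(background[host], sites)
    for name in decoys:
        dataset[name] = background[name]  # left unmodified
    return dataset
```

With 5 original genes and `n_decoys=2`, `build_dataset` returns 7 sequences, of which only 5 carry planted sites, mirroring the example in the text.
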
The number of sequences and the sequence length varied from data set to data set: there were 1-35 sequences per data set, each up to 3000 bp long. The total input size of each data set ranged from 1 Kb to 70 Kb.