University of Washington Computer Science & Engineering
 Motif Discovery Assessment: Statistics

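For each tool T and each data set D, the accuracy of T on D can be assessed both at the nucleotide level and at the site level. Specifically, at the nucleotide level let

  nTP = the number of nucleotide positions in both known sites and predicted sites,
  nFN = the number of nucleotide positions in known sites but not in predicted sites,
  nFP = the number of nucleotide positions not in known sites but in predicted sites, and
  nTN = the number of nucleotide positions in neither known sites nor predicted sites.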

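We will say that a predicted site overlaps a known site if they overlap by at least 1/4 the length of the known site. (Although this cutoff is somewhat arbitrary, the motivation is that, if an experimentalist were to remove the predicted site, enough of the known site would be deleted so that one might be able to see a difference in expression.) At the site level, then, let

  sTP = the number of known sites overlapped by predicted sites,
  sFN = the number of known sites not overlapped by predicted sites, and
  sFP = the number of predicted sites not overlapped by known sites.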

At either the nucleotide (x=n) or site (x=s) level one can then define

  Sensitivity:               xSn  = xTP / (xTP + xFN)
  Positive predictive value: xPPV = xTP / (xTP + xFP)

The sensitivity gives the fraction of known sites (or site nucleotides) that are predicted, and the positive predictive value gives the fraction of predicted sites (or site nucleotides) that are known.
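The site-level bookkeeping can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the assessment's actual code: sites are taken to be half-open intervals (start, end) on a single sequence, a known site counts as a true positive if at least one predicted site overlaps it by 1/4 of its length or more, and a predicted site counts as a false positive if it overlaps no known site in that sense. The function names and the example coordinates are hypothetical.

```python
from typing import List, Tuple

Site = Tuple[int, int]  # assumed representation: half-open interval [start, end)


def overlap_len(a: Site, b: Site) -> int:
    """Number of positions shared by two intervals (0 if they are disjoint)."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]))


def overlaps_known(predicted: Site, known: Site) -> bool:
    """True if the shared span is at least 1/4 of the known site's length."""
    shared = overlap_len(predicted, known)
    return shared > 0 and 4 * shared >= (known[1] - known[0])


def site_counts(known: List[Site], predicted: List[Site]) -> Tuple[int, int, int]:
    """Return (sTP, sFN, sFP) under the assumptions stated in the lead-in."""
    s_tp = sum(any(overlaps_known(p, k) for p in predicted) for k in known)
    s_fn = len(known) - s_tp
    s_fp = sum(not any(overlaps_known(p, k) for k in known) for p in predicted)
    return s_tp, s_fn, s_fp


def sensitivity(tp: int, fn: int) -> float:
    """Fraction of known items that are predicted; NaN if there are none."""
    return tp / (tp + fn) if tp + fn else float("nan")


def ppv(tp: int, fp: int) -> float:
    """Fraction of predicted items that are known; NaN if there are none."""
    return tp / (tp + fp) if tp + fp else float("nan")


# Hypothetical example: two known sites, three predicted sites.
known_sites = [(10, 20), (50, 62)]
predicted_sites = [(17, 27), (18, 30), (100, 110)]

sTP, sFN, sFP = site_counts(known_sites, predicted_sites)
sSn = sensitivity(sTP, sFN)
sPPV = ppv(sTP, sFP)
```

In this example only the prediction (17, 27) shares enough of the known site (10, 20) to count (3 of 10 positions), so one known site is found, one is missed, and two predictions are unmatched. Note that the cutoff is measured against the known site's length, so the overlap relation is not symmetric.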

At the nucleotide level one can also define

  Specificity: nSP = nTN / (nTN + nFP)

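Finally we consider various single statistics that in some sense average (some of) these quantities. Define the (nucleotide level) performance coefficient as

  nPC = nTP / (nTP + nFN + nFP),

the (nucleotide level) correlation coefficient as

  nCC = (nTP*nTN - nFN*nFP) / sqrt((nTP + nFN)(nTN + nFP)(nTP + nFP)(nTN + nFN)),

and the (site level) average site performance as

  sASP = (sSn + sPPV) / 2.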
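As a rough sketch, the three summary statistics could be computed from the raw counts as follows. The formulas used are assumptions based on the usual definitions: the performance coefficient as nTP over the union of known and predicted positions, the correlation coefficient as the Pearson (Matthews) correlation of the 2x2 confusion table, and the average site performance as the mean of site-level sensitivity and positive predictive value. The function names and example counts are hypothetical, and returning NaN for a zero denominator is a choice made here for illustration.

```python
import math


def performance_coefficient(n_tp: int, n_fn: int, n_fp: int) -> float:
    """nPC = nTP / (nTP + nFN + nFP); NaN when the denominator is zero."""
    denom = n_tp + n_fn + n_fp
    return n_tp / denom if denom else float("nan")


def correlation_coefficient(n_tp: int, n_fn: int, n_fp: int, n_tn: int) -> float:
    """nCC as the Pearson correlation of the confusion table; NaN if undefined."""
    denom = math.sqrt(
        (n_tp + n_fn) * (n_tn + n_fp) * (n_tp + n_fp) * (n_tn + n_fn)
    )
    return (n_tp * n_tn - n_fn * n_fp) / denom if denom else float("nan")


def average_site_performance(s_tp: int, s_fn: int, s_fp: int) -> float:
    """sASP = (sSn + sPPV) / 2; NaN if either component is undefined."""
    if s_tp + s_fn == 0 or s_tp + s_fp == 0:
        return float("nan")
    s_sn = s_tp / (s_tp + s_fn)
    s_ppv = s_tp / (s_tp + s_fp)
    return (s_sn + s_ppv) / 2


# Hypothetical nucleotide-level counts for one tool on one data set.
nPC = performance_coefficient(n_tp=30, n_fn=10, n_fp=20)
nCC = correlation_coefficient(n_tp=30, n_fn=10, n_fp=20, n_tn=940)

# Hypothetical site-level counts.
sASP = average_site_performance(s_tp=1, s_fn=1, s_fp=2)
```

Unlike nPC, the correlation coefficient uses nTN, so it is sensitive to the total amount of sequence that is neither known nor predicted.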

We need a way of summarizing the performance of a given tool over a collection of data sets, where that collection might be all the data sets, or all the yeast data sets, or all the generic data sets, etc. For each tool T, each statistic M, and each collection C of data sets of interest, we summarize T's performance on C as follows. Add nTP, nFP, nFN, nTN, sTP, sFP, and sFN over the data sets in C, and compute the measure M as though C were one large data set. For measures such as Sn and PPV, this is exactly a weighted average, where each term is weighted by its denominator.
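The pooling step, and the claim that it amounts to a denominator-weighted average for Sn, can be checked with a small Python sketch. The per-data-set counts below are invented for illustration, the dictionary layout and function names are hypothetical, and sensitivity is assumed to be xTP / (xTP + xFN), matching its description as the fraction of known sites that are predicted.

```python
from collections import Counter
from typing import Iterable, Mapping


def pool_counts(per_dataset: Iterable[Mapping[str, int]]) -> Counter:
    """Add each count (nTP, sFN, ...) over all data sets in the collection."""
    total: Counter = Counter()
    for counts in per_dataset:
        total.update(counts)  # Counter.update adds values key by key
    return total


def sensitivity(counts: Mapping[str, int], level: str) -> float:
    """xSn for level 'n' or 's'; assumed to be xTP / (xTP + xFN)."""
    tp, fn = counts[level + "TP"], counts[level + "FN"]
    return tp / (tp + fn) if tp + fn else float("nan")


# Hypothetical site-level counts for one tool on two data sets.
collection = [
    {"sTP": 3, "sFN": 1, "sFP": 2},
    {"sTP": 1, "sFN": 5, "sFP": 0},
]

# Summary as described in the text: pool first, then compute the measure.
pooled = pool_counts(collection)
pooled_sSn = sensitivity(pooled, "s")

# The same number obtained as an average of per-data-set sSn values,
# each weighted by its own denominator (sTP + sFN).
weights = [d["sTP"] + d["sFN"] for d in collection]
weighted_sSn = sum(
    w * sensitivity(d, "s") for w, d in zip(weights, collection)
) / sum(weights)
```

Here the per-data-set sensitivities are 3/4 and 1/6, while the pooled value is 4/10; the larger data set pulls the summary toward its own value, which an unweighted mean would not do.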

