  Motif Discovery Assessment: Statistics

For each tool T and each data set D, the accuracy of T on D can be assessed both at the nucleotide level and at the site level. Specifically, at the nucleotide level let

• nTP be the number of nucleotide positions in both known sites and predicted sites,
• nFN be the number of nucleotide positions in known sites but not in predicted sites,
• nFP be the number of nucleotide positions not in known sites but in predicted sites, and
• nTN be the number of nucleotide positions in neither known sites nor predicted sites.
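
As an illustration of the four nucleotide-level counts, here is a minimal sketch for a single sequence. The interval representation (half-open (start, end) pairs), the function name nucleotide_counts, and the seq_len parameter are assumptions made for this example, not part of the assessment itself.

```python
def nucleotide_counts(known, predicted, seq_len):
    """Count nTP, nFN, nFP, nTN for one sequence.

    known, predicted: lists of half-open (start, end) intervals
                      (an assumed representation of sites).
    seq_len: total number of nucleotide positions in the sequence.
    """
    # Expand each list of sites into the set of positions it covers.
    known_pos = {p for start, end in known for p in range(start, end)}
    pred_pos = {p for start, end in predicted for p in range(start, end)}

    n_tp = len(known_pos & pred_pos)   # in both known and predicted sites
    n_fn = len(known_pos - pred_pos)   # in known sites only
    n_fp = len(pred_pos - known_pos)   # in predicted sites only
    n_tn = seq_len - n_tp - n_fn - n_fp  # in neither
    return {"nTP": n_tp, "nFN": n_fn, "nFP": n_fp, "nTN": n_tn}
```

For example, with one known site (10, 20), one predicted site (15, 25), and a sequence of 100 positions, the sketch gives nTP = 5, nFN = 5, nFP = 5, and nTN = 85.
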
We will say that a predicted site overlaps a known site if they overlap by at least 1/4 the length of the known site. (Although this cutoff is somewhat arbitrary, the motivation is that, if an experimentalist were to remove the predicted site, enough of the known site would be deleted so that one might be able to see a difference in expression.) At the site level, then, let
• sTP be the number of known sites overlapped by predicted sites,
• sFN be the number of known sites not overlapped by predicted sites, and
• sFP be the number of predicted sites not overlapped by known sites.
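
The overlap rule and the three site-level counts can be sketched as follows. The half-open (start, end) interval representation and the function names are assumptions for illustration. The sketch also assumes that sFP uses the same 1/4 criterion (a predicted site counts as a false positive if it overlaps no known site by at least 1/4 of that known site's length).

```python
def overlaps(pred, known):
    """True if pred overlaps known by at least 1/4 of the known site's length."""
    shared = min(pred[1], known[1]) - max(pred[0], known[0])
    known_len = known[1] - known[0]
    # Integer comparison avoids floating-point rounding at the cutoff.
    return shared > 0 and 4 * shared >= known_len


def site_counts(known, predicted):
    """Count sTP, sFN, sFP for lists of (start, end) sites."""
    # Known sites overlapped by at least one predicted site.
    s_tp = sum(any(overlaps(p, k) for p in predicted) for k in known)
    # Known sites overlapped by no predicted site.
    s_fn = len(known) - s_tp
    # Predicted sites that overlap no known site.
    s_fp = sum(not any(overlaps(p, k) for k in known) for p in predicted)
    return {"sTP": s_tp, "sFN": s_fn, "sFP": s_fp}
```

With known sites (10, 20) and (50, 62) and predicted sites (17, 27), (60, 70), and (80, 90), the first known site is overlapped by 3 of its 10 positions (enough), while the second is overlapped by only 2 of its 12 positions (not enough), so sTP = 1, sFN = 1, and sFP = 2.
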

At either the nucleotide (x=n) or site (x=s) level one can then define

• Sensitivity: xSn = xTP / (xTP + xFN), and
• Positive Predictive Value: xPPV = xTP / (xTP + xFP).
The sensitivity gives the fraction of known sites (or site nucleotides) that are predicted, and the positive predictive value gives the fraction of predicted sites (or site nucleotides) that are known.

At the nucleotide level one can also define

• Specificity: nSp = nTN / (nTN + nFP).
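
A minimal sketch of these three ratios is given below. The function names are hypothetical, and returning None when a denominator is zero is an assumption of this example; the definitions above do not say how that case should be handled.

```python
def _ratio(numerator, denominator):
    """Return numerator / denominator, or None if the ratio is undefined."""
    # Assumption: an empty denominator yields None rather than an error.
    return numerator / denominator if denominator else None


def sensitivity(tp, fn):
    """xSn = xTP / (xTP + xFN); valid at nucleotide or site level."""
    return _ratio(tp, tp + fn)


def positive_predictive_value(tp, fp):
    """xPPV = xTP / (xTP + xFP); valid at nucleotide or site level."""
    return _ratio(tp, tp + fp)


def specificity(tn, fp):
    """nSp = nTN / (nTN + nFP); nucleotide level only."""
    return _ratio(tn, tn + fp)
```

Specificity is defined only at the nucleotide level because no count of true negative sites (sTN) is defined at the site level.
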
Finally we consider various single statistics that in some sense average (some of) these quantities. Define the (nucleotide level) performance coefficient as
• nPC = nTP / (nTP + nFN + nFP),
the (nucleotide level) correlation coefficient as
• nCC = (nTP·nTN − nFN·nFP) / √((nTP+nFN)(nTN+nFP)(nTP+nFP)(nTN+nFN)),
and the (site level) average site performance as
• sASP = (sSn + sPPV) / 2.
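
The three summary statistics can be sketched directly from their definitions. The function names are hypothetical, and returning None for a zero denominator is an assumption of this example rather than something the definitions specify.

```python
import math


def performance_coefficient(n_tp, n_fn, n_fp):
    """nPC = nTP / (nTP + nFN + nFP)."""
    den = n_tp + n_fn + n_fp
    return n_tp / den if den else None


def correlation_coefficient(n_tp, n_fn, n_fp, n_tn):
    """nCC = (nTP*nTN - nFN*nFP) / sqrt of the product of the four marginals."""
    den = math.sqrt(
        (n_tp + n_fn) * (n_tn + n_fp) * (n_tp + n_fp) * (n_tn + n_fn)
    )
    return (n_tp * n_tn - n_fn * n_fp) / den if den else None


def average_site_performance(s_tp, s_fn, s_fp):
    """sASP = (sSn + sPPV) / 2."""
    if s_tp + s_fn == 0 or s_tp + s_fp == 0:
        return None  # assumption: undefined if either ratio is undefined
    s_sn = s_tp / (s_tp + s_fn)
    s_ppv = s_tp / (s_tp + s_fp)
    return (s_sn + s_ppv) / 2
```

For nTP = 5, nFN = 5, nFP = 5, nTN = 85, this gives nPC = 1/3 and nCC = 400/900 = 4/9. For sTP = 1, sFN = 1, sFP = 2, it gives sASP = (1/2 + 1/3) / 2 = 5/12.
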

We need a way of summarizing the performance of a given tool over a collection of data sets, where that collection might be all the data sets, or all the yeast data sets, or all the generic data sets, etc. For each tool T, each statistic M, and each collection C of data sets of interest, we summarize T's performance on C as follows. Add nTP, nFP, nFN, nTN, sTP, sFP, and sFN over the data sets in C, and compute the measure M as though C were one large data set. For measures such as Sn and PPV, this is exactly a weighted average, where each term is weighted by its denominator.
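
The pooling step can be sketched as follows, together with a check that pooled sensitivity equals the denominator-weighted average of the per-data-set sensitivities. The dictionary layout and the function names are assumptions made for this example, and the counts are invented for illustration only.

```python
from collections import Counter


def pool_counts(per_dataset):
    """Sum each count (nTP, nFP, ..., sFN) over the data sets in a collection."""
    total = Counter()
    for counts in per_dataset:
        total.update(counts)  # Counter.update adds values key by key
    return dict(total)


def pooled_sensitivity(per_dataset):
    """nSn computed as though the collection were one large data set."""
    total = pool_counts(per_dataset)
    return total["nTP"] / (total["nTP"] + total["nFN"])


def weighted_average_sensitivity(per_dataset):
    """Average of per-data-set nSn, each weighted by its denominator nTP + nFN."""
    weights = [d["nTP"] + d["nFN"] for d in per_dataset]
    values = [d["nTP"] / (d["nTP"] + d["nFN"]) for d in per_dataset]
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)


# Invented counts for two data sets, used only to illustrate the identity.
collection = [
    {"nTP": 5, "nFN": 5, "nFP": 5, "nTN": 85},
    {"nTP": 30, "nFN": 10, "nFP": 20, "nTN": 140},
]
```

Here the per-data-set sensitivities are 0.5 and 0.75 with weights 10 and 40, so both the pooled value and the weighted average come to 35/50 = 0.7. The same identity does not hold for statistics such as nCC, which is why pooling the counts first and then computing M is a distinct choice from averaging M over data sets.
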