
Motif Discovery Assessment: Statistics


For each tool T and each data set D, the accuracy of T on D can be assessed both at the nucleotide level and at the site level. Specifically, at the nucleotide level let

nTP be the number of nucleotide positions in both known sites and predicted sites,

nFN be the number of nucleotide positions in known sites but not in predicted sites,

nFP be the number of nucleotide positions not in known sites but in predicted sites, and

nTN be the number of nucleotide positions in neither known sites nor predicted sites.
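The four nucleotide-level counts can be sketched in a few lines of Python. This is a minimal illustration, assuming known and predicted sites are represented as sets of nucleotide positions within a sequence of known length (the representation is an assumption, not part of the assessment's specification):

```python
def nucleotide_counts(known, predicted, seq_len):
    """Return (nTP, nFN, nFP, nTN) for one sequence.

    known, predicted: sets of 0-based nucleotide positions (assumed representation).
    seq_len: total number of nucleotide positions in the sequence.
    """
    nTP = len(known & predicted)        # positions in both known and predicted sites
    nFN = len(known - predicted)        # positions in known sites only
    nFP = len(predicted - known)        # positions in predicted sites only
    nTN = seq_len - nTP - nFN - nFP     # positions in neither
    return nTP, nFN, nFP, nTN
```

For example, with known positions {0,1,2,3} and predicted positions {2,3,4} in a sequence of length 10, this yields nTP = 2, nFN = 2, nFP = 1, nTN = 5.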
We will say that a predicted site overlaps a known site if they overlap by at least 1/4 the length of the known site. (Although this cutoff is somewhat arbitrary, the motivation is that, if an experimentalist were to remove the predicted site, enough of the known site would be deleted so that one might be able to see a difference in expression.) At the site level, then, let

sTP be the number of known sites overlapped by predicted sites,

sFN be the number of known sites not overlapped by predicted sites, and

sFP be the number of predicted sites not overlapped by known sites.
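The 1/4-overlap rule and the three site-level counts can likewise be sketched. This assumes sites are represented as half-open intervals (start, end) on the sequence; that representation is an illustrative choice, not part of the original definitions:

```python
def overlaps(known_site, predicted_site):
    """True if the predicted site overlaps the known site by at least
    1/4 of the known site's length. Sites are half-open (start, end) intervals."""
    ks, ke = known_site
    ps, pe = predicted_site
    inter = max(0, min(ke, pe) - max(ks, ps))   # length of the intersection
    return 4 * inter >= (ke - ks)               # inter >= (known length) / 4

def site_counts(known_sites, predicted_sites):
    """Return (sTP, sFN, sFP) for lists of known and predicted sites."""
    sTP = sum(any(overlaps(k, p) for p in predicted_sites) for k in known_sites)
    sFN = len(known_sites) - sTP
    sFP = sum(not any(overlaps(k, p) for k in known_sites) for p in predicted_sites)
    return sTP, sFN, sFP
```

For instance, a known site (0, 8) and a predicted site (6, 14) overlap by 2 nucleotides, exactly 1/4 of the known site's length of 8, so the prediction counts as overlapping; shifting the prediction to (7, 14) leaves only 1 nucleotide of overlap, which does not.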
At either the nucleotide (x=n) or site (x=s) level one can then define

Sensitivity: xSn = xTP / (xTP + xFN), and

Positive Predictive Value: xPPV = xTP / (xTP + xFP).
The sensitivity gives the fraction of known sites (or site nucleotides) that are predicted, and the positive predictive value gives the fraction of predicted sites (or site nucleotides) that are known.
At the nucleotide level one can also define

Specificity: nSp = nTN / (nTN + nFP).
Finally we consider various single statistics that in some sense average (some of) these quantities. Define the (nucleotide level) performance coefficient as

nPC = nTP / (nTP + nFN + nFP),
the (nucleotide level) correlation coefficient as

nCC = (nTP·nTN − nFN·nFP) / √((nTP+nFN)(nTN+nFP)(nTP+nFP)(nTN+nFN)),
and the (site level) average site performance as

sASP = (sSn + sPPV) / 2.

We need a way of summarizing the performance of a given tool over a collection of data sets, where that collection might be all the data sets, or all the yeast data sets, or all the generic data sets, etc. For each tool T, each statistic M, and each collection C of data sets of interest, we summarize T's performance on C as follows.
Add nTP, nFP, nFN, nTN, sTP, sFP, and sFN over the data sets in C, and compute the measure M as though C were one large data set. For measures such as Sn and PPV, this is exactly a weighted average, where each term is weighted by its denominator.
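The summary statistics can be sketched as a single function applied to the pooled counts. This assumes the counts have already been summed over the data sets in C as described above, and that the average site performance sASP is taken as the mean of sSn and sPPV (an assumption consistent with the site-level sensitivity and positive predictive value defined earlier):

```python
import math

def statistics(nTP, nFN, nFP, nTN, sTP, sFN, sFP):
    """Compute the assessment statistics from (pooled) confusion counts.

    Assumes all relevant denominators are nonzero; sASP is taken as the
    mean of sSn and sPPV (illustrative assumption).
    """
    nSn  = nTP / (nTP + nFN)                    # nucleotide-level sensitivity
    nPPV = nTP / (nTP + nFP)                    # nucleotide-level positive predictive value
    nSp  = nTN / (nTN + nFP)                    # nucleotide-level specificity
    nPC  = nTP / (nTP + nFN + nFP)              # performance coefficient
    denom = math.sqrt((nTP + nFN) * (nTN + nFP) * (nTP + nFP) * (nTN + nFN))
    nCC  = (nTP * nTN - nFN * nFP) / denom      # correlation coefficient
    sSn  = sTP / (sTP + sFN)                    # site-level sensitivity
    sPPV = sTP / (sTP + sFP)                    # site-level positive predictive value
    sASP = (sSn + sPPV) / 2                     # average site performance
    return {"nSn": nSn, "nPPV": nPPV, "nSp": nSp, "nPC": nPC,
            "nCC": nCC, "sSn": sSn, "sPPV": sPPV, "sASP": sASP}
```

Because each count is summed over C before any ratio is taken, a statistic such as nSn computed this way equals a weighted average of the per-data-set values, with each data set weighted by its denominator (nTP + nFN), exactly as described above.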


Computer Science & Engineering
University of Washington
Box 352350
Seattle, WA 98195-2350
(206) 543-1695 voice, (206) 543-2969 FAX
