This page contains supplementary information to the following paper:
Zasha Weinberg and Walter L. Ruzzo, "Sequence-based heuristics for faster annotation of
non-coding RNA families", Bioinformatics, to appear.
Software
The software is here
Supplementary paper
This supplementary paper contains details on algorithms and implementation, as well as the full set of ROC-like curves.
Raw ROC-like Curves
The raw points for the ROC-like curves are supplied as comma-separated
files. The first column is the sensitivity, the second is the filtering
fraction, and the third is the heuristic threshold score (in log_2 units)
that yields this sensitivity and filtering fraction.
The complete set is available as a gzipped tar archive: here (3 MB).
The file names are described below.
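As an illustration of the three-column layout described above, the following Python sketch reads the points from one curve file's contents. The names RocPoint and parse_roc_points are hypothetical, not part of the distributed software. The sketch assumes each row holds exactly the three numeric columns, and it skips any row that does not parse as numbers (for example a header line, should one exist).

```python
import csv
import io
from typing import List, NamedTuple


class RocPoint(NamedTuple):
    """One point of a ROC-like curve (hypothetical helper type)."""
    sensitivity: float
    filtering_fraction: float
    threshold_log2: float  # heuristic threshold score, in log_2 units


def parse_roc_points(text: str) -> List[RocPoint]:
    """Parse the contents of a .posvsfrac.csv file into a list of points.

    Assumption: columns are sensitivity, filtering fraction, threshold,
    in that order. Rows that are too short or non-numeric are skipped.
    """
    points = []
    for row in csv.reader(io.StringIO(text)):
        if len(row) < 3:
            continue
        try:
            values = [float(cell) for cell in row[:3]]
        except ValueError:
            continue  # not a data row (e.g. a header line, if present)
        points.append(RocPoint(*values))
    return points
```

To read a file from the unpacked archive, pass its contents to the function, for example parse_roc_points(open(path).read()).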
Rfam families ROC-like curves
All files end in ".posvsfrac.csv".
The first part of the file name is the Rfam ID: RF00001, RF00005, RF00010,
RF00029, RF00031, RF00059, RF00168 or RF00174.
The next part of the file name describes what filter was tested:
- "" (nothing) : ML-heuristic
- "_expandedMlPath" : ML-heuristic with expanded-type HMM (only for RF00005)
- "_expandedRigor" : expanded-type rigorous profile HMM
- "_fakeCmbuild" : ignore-SS
- "_full90id_mlPath" : ML-heuristic trained on the full Rfam members, filtered to
  <90% identity between pairs of sequences.
- "_rfamseed-5.0-BLAST-1sided": 1-sided BLAST on seed members of Rfam
- "_rfamseed-5.0-BLAST-2sided": 2-sided BLAST on seed members of Rfam
- "_rfamseed-5.0_evalue1-BLAST-1sided": 1-sided BLAST on seed members of Rfam,
  run with lower E-value of 1, to assess how accurate ROC-like curves are for
  BLAST. (Only for RF00005.)
- "-BLAST-1sided": 1-sided BLAST on full Rfam members, filtered to <90%
  identity between pairs of sequences.
- "-BLAST-2sided": 2-sided BLAST on full Rfam members, filtered to <90%
  identity between pairs of sequences.
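The naming scheme above can be decoded mechanically. The Python sketch below maps a file name to its Rfam ID and filter description. It assumes the Rfam ID, the filter tag and the ".posvsfrac.csv" ending are concatenated directly with no extra separators (for example "RF00005_expandedMlPath.posvsfrac.csv"); this should be checked against the actual archive contents. The function name describe_rfam_curve_file is hypothetical.

```python
SUFFIX = ".posvsfrac.csv"

RFAM_IDS = {
    "RF00001", "RF00005", "RF00010", "RF00029",
    "RF00031", "RF00059", "RF00168", "RF00174",
}

# Filter tags and descriptions, taken from the list above.
FILTERS = {
    "": "ML-heuristic",
    "_expandedMlPath": "ML-heuristic with expanded-type HMM",
    "_expandedRigor": "expanded-type rigorous profile HMM",
    "_fakeCmbuild": "ignore-SS",
    "_full90id_mlPath": "ML-heuristic trained on full Rfam members (<90% identity)",
    "_rfamseed-5.0-BLAST-1sided": "1-sided BLAST on seed members of Rfam",
    "_rfamseed-5.0-BLAST-2sided": "2-sided BLAST on seed members of Rfam",
    "_rfamseed-5.0_evalue1-BLAST-1sided":
        "1-sided BLAST on seed members of Rfam, E-value of 1",
    "-BLAST-1sided": "1-sided BLAST on full Rfam members (<90% identity)",
    "-BLAST-2sided": "2-sided BLAST on full Rfam members (<90% identity)",
}


def describe_rfam_curve_file(name: str) -> tuple:
    """Return (rfam_id, filter_description) for an Rfam curve file name.

    Assumption: name == rfam_id + filter_tag + SUFFIX, with no separators.
    """
    if not name.endswith(SUFFIX):
        raise ValueError(f"not a ROC-like curve file: {name!r}")
    stem = name[: -len(SUFFIX)]
    rfam_id, tag = stem[:7], stem[7:]  # Rfam IDs are 7 characters long
    if rfam_id not in RFAM_IDS:
        raise ValueError(f"unexpected Rfam ID in {name!r}")
    if tag not in FILTERS:
        raise ValueError(f"unknown filter tag {tag!r} in {name!r}")
    return rfam_id, FILTERS[tag]
```

A lookup table is used rather than a pattern because the tags do not share a single delimiter: some begin with "_" and others with "-".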
tRNAscan-SE scans
All files end in ".posvsfrac.csv". All files use compact-type
ML-heuristic.
For historical reasons, the prokaryote files begin with "TRNA2" while the
eukaryote files begin with "tRNAscan-TRNA2". Next, the organism or group of
organisms is specified:
- "arch" : archaeal
- "bact" : eubacterial
- "euk_Celegans": C. elegans (nuclear genome only)
- "euk_Drosophila": Drosophila (nuclear genome only)
- "euk_human": human (nuclear genome only)
Next, the window length for the ML-heuristic is given:
- "win100": window length W=100
- "win500": window length W=500
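The tRNAscan-SE file names can be decoded in a similar way. Because this page does not state which separator characters join the parts, the Python sketch below strips the known prefix and then searches the rest of the name for the organism and window tags instead of assuming a fixed layout. The function name describe_trnascan_curve_file and the example file names in the usage note are hypothetical.

```python
import re

SUFFIX = ".posvsfrac.csv"
EUK_PREFIX = "tRNAscan-TRNA2"   # eukaryote files
PROK_PREFIX = "TRNA2"           # prokaryote files

# Organism tags and descriptions, taken from the list above.
ORGANISMS = {
    "arch": "archaeal",
    "bact": "eubacterial",
    "euk_Celegans": "C. elegans (nuclear genome only)",
    "euk_Drosophila": "Drosophila (nuclear genome only)",
    "euk_human": "human (nuclear genome only)",
}

WINDOW_RE = re.compile(r"win(100|500)")


def describe_trnascan_curve_file(name: str) -> tuple:
    """Return (domain, organism, window_length) for a tRNAscan-SE curve file.

    Assumption: the organism and window tags appear somewhere after the
    prefix; the separators between the parts are not assumed.
    """
    if not name.endswith(SUFFIX):
        raise ValueError(f"not a ROC-like curve file: {name!r}")
    stem = name[: -len(SUFFIX)]
    if stem.startswith(EUK_PREFIX):
        domain, rest = "eukaryote", stem[len(EUK_PREFIX):]
    elif stem.startswith(PROK_PREFIX):
        domain, rest = "prokaryote", stem[len(PROK_PREFIX):]
    else:
        raise ValueError(f"unexpected prefix in {name!r}")
    organism = next(
        (label for tag, label in ORGANISMS.items() if tag in rest), None
    )
    window = WINDOW_RE.search(rest)
    if organism is None or window is None:
        raise ValueError(f"could not find organism or window tag in {name!r}")
    return domain, organism, int(window.group(1))
```

For instance, a name such as "TRNA2-arch-win100.posvsfrac.csv" (separators invented for illustration) would be reported as a prokaryote, archaeal scan with W=100.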
Other data on tRNAscan-SE scans of these same databases are available from our
previous paper; see this Web supplement.