Supplementary Material

This web site supplements my article, "Exploiting Conserved Structure for Faster Annotation of Non-coding RNAs Without Loss of Accuracy", with the raw results of the scans, which include the new ncRNA homologs discovered by the more sensitive technique.  If you use this data in your own research, please cite my article:

Z. Weinberg and W.L. Ruzzo (2004) "Exploiting Conserved Structure for Faster Annotation of Non-coding RNAs Without Loss of Accuracy", Bioinformatics, 20 (suppl. 1): i334-i340.  Presented at the 12th International Conference on Intelligent Systems for Molecular Biology (ISMB 2004).

Download preprint: in Adobe Acrobat (pdf), in Postscript

Technical supplement

This supplementary paper gives additional technical details on the implementation:

Download supplement: in Adobe Acrobat (pdf), in Postscript

(Last updated October 6, 2004.)

Software download

The software that implements the techniques in this paper is available for download here.

Raw results of scans

For each ncRNA family that was scanned for this paper, the following table links to the family's entry in the current version of the Rfam Database.  It then gives the raw results (.cmzasha) and a comma-separated file (.csv), which is more convenient to read and which notes which family members were already in Rfam 5.0 and which are new.  These results are for Rfam 5.0, which is no longer the current version.

Instructions:

In the following table, the first column is the Rfam accession number (see Rfam database Web site).  Next is a link to the given family in the current version of Rfam (which at some point in the future will be later than the Rfam 5.0 version that was used for my paper).  Next is a brief description of the family; the Rfam link has a paragraph on each family, and a couple of useful references.  The # known is the number of family members as reported in Rfam 5.0.  # new is the number of additional members found with the more sensitive technique (using the version of RFAMSEQ appropriate to Rfam 5.0).  Next is the .cmzasha file, which is the raw output of my program, and the .csv file, which is the result of processing the raw output slightly.

(Download all results files from the table below at once: tar & gzip archive)

 

Rfam Id  link to Rfam  name                                                     # known  # new  .cmzasha (raw scan)                 .csv (comma-separated file)
RF00001  link          5S rRNA                                                  5460     14     file                                file
RF00004  link          U2 snRNA                                                 466      1      file                                file
RF00005  link          tRNA                                                     58609    5158   file (80 MBytes)                    file (50 MBytes)
RF00009  link          RNase P (nuclear)                                        69       3      file                                file
RF00010  link          RNase P (bacterial, type A)                              413      1      file                                file
RF00017  link          Signal Recognition Particle RNA (eukaryotic & archaeal)  128      13     file                                file
RF00023  link          tmRNA                                                    226      21     file                                file
RF00029  link          Group II intron                                          5708     331    file                                file
RF00059  link          Thiamin pyrophosphate riboswitch                         276      6      file                                file
RF00168  link          Lysine riboswitch                                        60       11     file                                file
RF00174  link          Cobalamin riboswitch                                     170      7      file                                file
N/A                    tRNAscan-SE archaea                                      1016     15     file                                file
N/A                    tRNAscan-SE eubacteria                                   13624    87     file                                file
N/A                    tRNAscan-SE Drosophila nuclear                           296      1      file; file for selenocysteine tRNA  file
N/A                    tRNAscan-SE C. elegans nuclear                           822      16     file; file for selenocysteine tRNA  file
N/A                    tRNAscan-SE human nuclear                                608      121    file; file for selenocysteine tRNA  file

 

Filter series used in scans

The following describes, for each family scanned, the series of filters used in that scan.  The selection of a filter series is now a fully automated process, as described in the technical supplement.  However, the scheme described in the ISMB paper is partially manual, and for some of the easier families even more of the work was done by hand.  For some families, such as RF00004, a better filter series could almost certainly be found.

For each family, a numbered list is given showing each filter in the order it is applied.  Each filter begins with 'hmm', 'sub' (sub-CM), or 'store-pair'.

If the filter begins with 'hmm', a profile HMM is used, and its type (expanded or compact) is given.  (Note that the sub-CM and store-pair modifications are all applied on top of the expanded-type HMM.)

After 'sub', the node at which the sub-CM is rooted is given, followed by the sub-CM-specific window length.  For example, for RF00001, "sub,40,60" is used, which is a sub-CM rooted at node 40 using a window length of only 60 (even though the full ncRNA requires a window length of 180).  If multiple sub-CMs are used simultaneously in one filter, they are separated by forward slashes ('/').
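As an illustration of the 'sub' notation, here is a minimal Python sketch that parses a single entry such as "sub,40,60".  The names `SubCMFilter` and `parse_sub_filter` are hypothetical and not part of the software for this paper.  The sketch deliberately handles only one sub-CM, since the exact layout of several sub-CMs joined by '/' is not shown on this page.

```python
from typing import NamedTuple


class SubCMFilter(NamedTuple):
    """One sub-CM filter entry (hypothetical helper type, for illustration)."""
    root_node: int      # node at which the sub-CM is rooted
    window_length: int  # sub-CM-specific window length


def parse_sub_filter(spec: str) -> SubCMFilter:
    """Parse a single 'sub,<node>,<window>' entry such as 'sub,40,60'.

    Entries combining several sub-CMs with '/' are rejected, because their
    exact layout is an unknown here.
    """
    fields = spec.strip().split(",")
    if len(fields) != 3 or fields[0] != "sub":
        raise ValueError(f"not a single sub-CM filter: {spec!r}")
    # int() raises ValueError on non-numeric text, including anything with '/'.
    return SubCMFilter(root_node=int(fields[1]), window_length=int(fields[2]))
```

For the RF00001 example, `parse_sub_filter("sub,40,60")` yields a root node of 40 and a window length of 60.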

After 'store-pair', the list of modifications is given.  Each modification consists of a number followed by a string.  The number is the node that is to be modified.  The string specifies what information should be stored for the pair at that node.  The first letter says whether the left or right nucleotide is stored: 'l'=left, 'r'=right.  The remainder specifies a partition of the 5 symbols ACGU_ indicating what is stored.  The underscore ('_') represents the empty character (written as epsilon in the paper).  The sets of the partition are separated from each other, and from the leading letter, by dashes ('-').  For example, "store-pair,3,l-A_-CG-U,76,r-AG_-C-U" says that (1) node 3 is modified by remembering which of the following sets the left nucleotide fits into: {A,_} or {C,G} or {U}, and (2) node 76 is modified by remembering which of the following sets the right nucleotide fits into: {A,G,_} or {C} or {U}.
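The 'store-pair' notation can be decoded mechanically.  The following Python sketch is one possible reading of the format described above, not the parser used by the software; the names `PairModification` and `parse_store_pair` are hypothetical.  It assumes that every modification is exactly one node number followed by one string, and it checks that the sets really form a partition of the 5 symbols ACGU_.

```python
from typing import FrozenSet, List, NamedTuple

SYMBOLS = "ACGU_"  # '_' stands for the empty character (epsilon in the paper)


class PairModification(NamedTuple):
    """One store-pair modification (hypothetical helper type)."""
    node: int                        # node that is modified
    side: str                        # 'left' or 'right' nucleotide is stored
    partition: List[FrozenSet[str]]  # sets the stored nucleotide can fall into


def parse_store_pair(spec: str) -> List[PairModification]:
    """Parse e.g. 'store-pair,3,l-A_-CG-U,76,r-AG_-C-U'."""
    fields = spec.strip().split(",")
    # Expect the keyword followed by (node, string) pairs.
    if fields[0] != "store-pair" or len(fields) < 3 or len(fields) % 2 != 1:
        raise ValueError(f"not a store-pair filter: {spec!r}")
    modifications = []
    for node_text, code in zip(fields[1::2], fields[2::2]):
        side_letter, *blocks = code.split("-")
        if side_letter not in ("l", "r"):
            raise ValueError(f"expected 'l' or 'r' in {code!r}")
        # A partition has no empty sets and uses every symbol exactly once.
        used = sorted(symbol for block in blocks for symbol in block)
        if not all(blocks) or used != sorted(SYMBOLS):
            raise ValueError(f"{code!r} is not a partition of {SYMBOLS}")
        modifications.append(PairModification(
            node=int(node_text),
            side="left" if side_letter == "l" else "right",
            partition=[frozenset(block) for block in blocks],
        ))
    return modifications
```

Applied to the example string above, this gives two modifications: node 3 storing the left nucleotide with sets {A,_}, {C,G}, {U}, and node 76 storing the right nucleotide with sets {A,G,_}, {C}, {U}.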

About the .csv format

This file is a comma-separated file, which is intended to be viewed in Microsoft Excel or a similar program.  It is also plain text, so it can be opened in any text editor, although the files are very long and hard to read that way.  With a simple script (e.g. in Perl), you could convert it to other formats.
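As an example of such a conversion, here is a minimal sketch in Python (rather than Perl) that turns comma-separated text into tab-separated text using only the standard `csv` module.  The function name `csv_to_tsv` is hypothetical, and the sketch assumes nothing about the column layout of the results files; it simply copies every field across.

```python
import csv
import io


def csv_to_tsv(csv_text: str) -> str:
    """Convert comma-separated text to tab-separated text, row by row."""
    reader = csv.reader(io.StringIO(csv_text))
    output = io.StringIO()
    # The csv module handles quoted fields, so commas inside quotes survive.
    writer = csv.writer(output, delimiter="\t", lineterminator="\n")
    writer.writerows(reader)
    return output.getvalue()
```

For a large file, the same reader and writer can be attached to open file handles instead of strings, so that the file is processed one row at a time.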

The columns in the .csv file are:

About the .cmzasha format

The .cmzasha file is the raw output of our software.  The information in it is essentially redundant with the .csv file.  Additional information is: