image University of Washington Department of Computer Science & Engineering

CMfinder - A Covariance Model Based RNA Motif Finding Algorithm
Zizhen Yao, Zasha Weinberg, Walter L. Ruzzo
BioInformatics,2006 22: p. 445-452
[Supplementary Material*]


* This website supplements our paper, "CMfinder - A Covariance Model Based RNA Motif Finding Algorithm" with additional technical details and supplementary results.


Introduction

CMfinder is a RNA motif prediction tool. It is an expectation maximization algorithm using covariance models for motif description, carefully crafted heuristics for effective motif search, and a novel Bayesian framework for structure prediction combining folding energy and sequence covariation. This tool performs well on unaligned sequences with long extraneous flanking regions, and in cases when the motif is only present in a subset of sequences. CMfinder also integrates directly with genome-scale homology search, and can be used for automatic refinement and expansion of RNA families.

Technical details

The implementation described in the paper made use of the secondary structure folding and tree edit programs from ViennaRNA-1.5 .
In addition, covariance model based sequence scanning and Viterbi alignment used CMscan and CMalign, respectively, from Infernal 0.55.
CM construction is somewhat analogous to the EM-based CM training program Covet from Eddy & Durbin's COVE package, but is augmented with the structure inference features described in the paper. It also makes use of pre-released "Dirichlet mixture prior" code provided to us by Dr. Sean Eddy (now incorporated into Infernal 0.70). The following supplementary paper provides additional technical information on the implementation.
Download supplement: Adobe Acrobat (pdf) | Postscript

Software availability

Web server for this tool is available.

Download the software here.

Please contact yzizhen@cs.washington.edu if you have any questions.

Supplementary Results

  • Global alignment tests on Rfam family sequences without flanking regions.

We compared CMfinder with 4 methods: RNAalifold, Pfold,Coveb (a program in the COVE package to construct a CM from a multiple alignment), and Covet (a companion of coveb, but using an EM algorithm to refine the CM). For fairness, each method is tested on the same CLUSTAL alignment for every dataset. The major difference between CMfinder and covet in this test is the use of partition function-based priors in secondary structure prediction.
Both CMfinder and covet use an EM algorithm to refine the alignment and prediction, while the other three methods do not.

In the following table, each cell for a given family and a given method contain three numbers which specifiy the accuracy, sensitivity and specificity respectively, and it is linked to the corresponding alignment file.  For better visulization, all alignments are transformed into the the format described below. The best accuracy, sensitivity and specificity for each family are highlighted in bold font. 

Alignment format description:
The colored block alignment corresponds to the Rfam motif (RM): blocks marked by the same color indicate a helix. Non-canonical base pairs in expected helices are colored light gray. Below each sequence is its predicted secondary structure. Matching pairs of angle brackets indicate the base pairs. The regions that are predicted as part of the motif, but not in the corresponding RM are shown in gray color, and regions that are in the RM, but not predicted as part of the motif are shown as dashed lines. The last line shows the Rfam consensus secondary structure.

Family Name
RFAM_ID
#Seq
Avg Seq %id Length CMfinder Pfold RNAalifold Covet Coveb
Cobalamin RF00174 71 49 216 0.61, 0.59, 0.64 0.41, 0.29, 0.60 0.32, 0.12, 0.87 0.51, 0.60, 0.44 0.26, 0.23, 0.30
Entero_CRE RF00048 56 81 61 0.77, 1.00, 0.59 0.95, 1.00, 0.90 0.71, 0.50, 1.00 0.00, 0.00, 0.00 0.00, 0.00, 0.00
Entero_OriR RF00041 35 77 73 0.95, 0.97, 0.93 0.80, 0.63, 1.00 0.70, 0.63, 0.76 0.06, 0.05, 0.09 0.00, 0.00, 0.00
Histone3 RF00032 63 77 26 1.00, 1.00, 1.00 1.00, 1.00, 1.00 1.00, 1.00, 1.00 0.70, 0.67, 0.74 0.57, 0.33, 0.96
IRE RF00037 30 68 30 0.92, 0.95, 0.90 0.67, 0.55, 0.83 0.64, 0.59, 0.69 0.00, 0.00, 0.00 0.12, 0.04, 0.34
Intron_gpII RF00029 75 55 92 0.71, 0.79, 0.64 0.78, 0.62, 0.99 0.76, 0.58, 0.99 0.67, 0.69, 0.66 0.55, 0.52, 0.58
Lysine RF00168 48 48 183 0.77, 0.76, 0.78 0.59, 0.49, 0.70 0.22, 0.10, 0.48 0.51, 0.49, 0.53 0.41, 0.38, 0.44
Purine RF00167 29 55 103 0.90, 0.93, 0.87 0.76, 0.71, 0.81 0.00, 0.00, 0.00 0.73, 0.87, 0.61 0.67, 0.67, 0.68
RFN RF00050 47 66 139 0.73, 0.85, 0.62 0.72, 0.81, 0.64 0.77, 0.81, 0.74 0.62, 0.80, 0.47 0.58, 0.51, 0.66
Rhino_CRE RF00220 12 71 86 0.96, 1.00, 0.93 0.66, 0.48, 0.93 0.77, 0.75, 0.79 0.33, 0.28, 0.40 0.00, 0.00, 0.00
SECIS RF00031 43 43 68 0.68, 0.56, 0.82 0.00, 0.00, 0.00 0.00, 0.00, 0.00 0.03, 0.02, 0.03 0.00, 0.00, 0.00
S_box RF00162 64 66 112 0.81, 0.89, 0.75 0.77, 0.68, 0.88 0.72, 0.57, 0.92 0.55, 0.59, 0.52 0.41, 0.32, 0.53
Tymo_tRNA-like RF00233 22 72 86 0.77, 0.80, 0.75 0.62, 0.52, 0.74 0.73, 0.54, 0.99 0.58, 0.54, 0.63 0.45, 0.34, 0.58
ctRNA_pGA1 RF00236 17 74 83 0.89, 0.90, 0.88 0.86, 0.80, 0.92 0.92, 0.90, 0.94 0.16, 0.12, 0.21 0.00, 0.00, 0.00
glmS RF00234 14 58 188 0.74, 0.71, 0.77 0.62, 0.47, 0.81 0.49, 0.31, 0.78 0.30, 0.29, 0.32 0.27, 0.18, 0.42
let-7 RF00027 9 69 84 0.79, 0.75, 0.83 0.84, 0.71, 1.00 0.80, 0.68, 0.93 0.00, 0.00, 0.00 0.00, 0.00, 0.00
lin-4 RF00052 9 69 72 0.76, 0.78, 0.73 0.72, 0.57, 0.91 0.73, 0.56, 0.94 0.18, 0.14, 0.24 0.18, 0.08, 0.38
mir-10 RF00104 11 66 75 0.85, 0.94, 0.77 0.75, 0.61, 0.93 0.85, 0.84, 0.86 0.00, 0.00, 0.00 0.01, 0.01, 0.02
s2m RF00164 23 80 43 0.68, 0.65, 0.71 1.00, 1.00, 1.00 0.64, 0.59, 0.69 0.71, 0.65, 0.78 0.68, 0.47, 1.00
Average 0.80, 0.83, 0.78 0.71, 0.63, 0.82 0.62, 0.53, 0.85 0.35, 0.36, 0.35 0.27, 0.21, 0.38

  • Local alignment tests on Rfam family sequences with 200 bases offlanking regions.

We compared CMfinder with 5 methods: RNAalifold, Pfold, Carnac, Foldalign (v1.0 and v2.0), and ComRNA. For Pfold, RNAalifold, and Carnac, we used the default parameters; for the others, parameters were as follows:

CMfinder:
Output 5 single stem-loop, range (30~100bp), 5 double stem-loop, range (40~100bp).
For 6 families (Cobalamin, Lysine, S_box, RFN, glmS, Tymo_tRNA-like) with complicated structure, combine motifs using greedy heuristics.
Foldalign:
v 1.0 : tested on IRE, let-7, lin-4, mir-10, Rhino_CRE.Choose the best scoring motif. spec file parameters:
Align range: # 5 60
Add nt. to seq range: 0
Filter parameters: 2 2 <# input sequences> 15 1
Scoring file: score_v1.mat

v 2.0 : tested on 9 families with #seq < 40, length < 100bp.
foldalign -max_diff 20 -max_length 100 <input_file>


ComRNA:
comRNA -x 0 -ts 20 -tp 20<input_file>
-x 0: no pseudoknot allowed.-ts, -tp are used to control running time, and we set both to 20 times their default values.
Choose the motif with the best accuracy according to Rfam annotation.

The following table is in the same format as the one above. RNAalifold, Pfold and Carnac annotate the secondary structures of input sequences, but do not predict the range of the motif, thus the comparison includes whole sequences.

  • Empty cells indicate that the corresponding experiment were not conducted due to computation cost or other limitation.
  • "NA" cells indicate that the tool did not produce any structure for the given dataset. The resulting accuracy and sensitivity are 0, and specificity is NA. 
  • "X" indicate that the tool terminated abnormally.

Family Name RFAM_ID #Seq Avg Seq %id Length CMfinder Pfold RNAalifold Carnac Foldalign(v1.0) Foldalign(v2.0) ComRNA
Cobalamin RF00174 71 49 216 0.59, 0.45, 0.77 0.05, 0.01, 0.18 NA

X

    NA
Entero_CRE RF00048 56 81 61 0.89, 0.83, 0.94 0.74, 1.00, 0.55 0.22, 0.18, 0.27 NA     NA
Entero_OriR RF00041 35 77 73 0.94, 0.96, 0.92 0.75, 0.88, 0.65 0.76, 0.63, 0.90 0.80, 0.70, 0.93   0.52, 0.53, 0.51 0.52, 0.49, 0.78
Histone3 RF00032 63 77 26 1.00, 1.00, 1.00 0.00, 0.00, 0.00 NA  NA     NA
IRE RF00037 30 68 30 0.77, 0.93, 0.64 0.22, 0.17, 0.28 NA  NA 0.43, 0.35, 0.53 0.38, 0.41, 0.36 NA
Intron_gpII RF00029 75 55 92 0.80, 0.70, 0.92 0.30, 0.14, 0.68 NA  NA     NA
Lysine RF00168 48 48 183 0.77, 0.73, 0.80 0.24, 0.13, 0.43 NA

 X

    NA
Purine RF00167 29 55 103 0.91, 0.90, 0.93 0.07, 0.02, 0.22 NA  NA     0.27, 0.19, 0.76
RFN RF00050 47 66 139 0.39, 0.40, 0.39 0.68, 0.75, 0.62 0.26, 0.23, 0.30  NA     NA
Rhino_CRE RF00220 12 71 86 0.88, 0.90, 0.87 0.52, 0.53, 0.51 0.52, 0.53, 0.52 0.69, 0.52, 0.92 0.43, 0.38, 0.50 0.41, 0.36, 0.47 0.61, 0.58, 0.79
SECIS RF00031 43 43 68 0.73, 0.62, 0.87 NA NA  NA     NA
S_box RF00162 64 66 112 0.72, 0.67, 0.78 0.11, 0.07, 0.19 NA  NA     NA
Tymo_tRNA-like RF00233 22 72 86 0.81, 0.80, 0.82 0.33, 0.31, 0.35 0.36, 0.30, 0.42 0.30, 0.14, 1.00   0.80, 0.78, 0.81 0.48, 0.42, 0.64
ctRNA_pGA1 RF00236 17 74 83 0.91, 0.91, 0.92 0.70, 0.72, 0.69 0.72, 0.68, 0.77  NA   0.86, 0.85, 0.88 0.00, 0.00, 0.00
glmS RF00234 14 58 188 0.83, 0.83, 0.83 0.12, 0.07, 0.24 0.18, 0.07, 0.43  NA     0.13, 0.06, 0.56
let-7 RF00027 9 69 84 0.87, 0.80, 0.93 0.08, 0.06, 0.11 0.42, 0.35, 0.50  NA 0.29, 0.15, 0.57 0.71, 0.69, 0.74 0.78, 0.72, 0.85
lin-4 RF00052 9 69 72 0.78, 0.87, 0.69 0.51, 0.46, 0.56 0.75, 0.62, 0.92 0.41, 0.43, 0.39 0.67, 0.63, 0.70 0.65, 0.71, 0.62 0.24, 0.23, 0.55
mir-10 RF00104 11 66 75 0.66, 0.62, 0.70 0.59, 0.56, 0.62 0.60, 0.47, 0.76  NA 0.55, 0.42, 0.71 0.48, 0.54, 0.43 0.33, 0.28, 0.85
s2m RF00164 23 80 43 0.67, 0.65, 0.70 0.80, 0.91, 0.72 0.45, 0.40, 0.51 0.64, 0.46, 0.95   0.63, 0.64, 0.62 0.29, 0.23, 0.73
Average 0.79, 0.77, 0.81 0.36, 0.36, 0.42 0.28, 0.23, 0.57 0.17, 0.13, 0.84 0.39, 0.47, 0.60 0.60, 0.61, 0.60 0.19, 0.17, 0.65


  • Robustness tests

We tested CMfinder on more challenging datasets with larger flanking regions and only a subset of the sequences containing real RNA motifs. This result is based on an earlier version of CMfinder, which used COVE rather than Infernal for CM operations. To create each dataset, we randomly selected n family members, including a given length of flanking sequence, then permuted the motif regions in k of them.These k sequences serve as control sequences, the rest, with real RNA motifs, are referred to as test sequences. This test was performed on 3 families, Histone, IRE, SEICS. They are cis-regulatory elements with simple structure, thus avoiding the issue of combining motifs. For each family, we performed two sets of test. First, we varied the flanking regions from 50 to 400 bases on both the 5' and 3' sides, with 1/4 control sequences; then we varied the fraction of control sequences from 1/8 to 1/2, all with 100-base flanking regions. We categorized three types of predicted motif instances: true and false motif instances in the test sequences (Tt and Ft, respectively), and those in the control sequences (Fc). Tts and Fts are determined by whether their overlaps with corresponding Rfam motif instances are greater than 10 bases. The cove scores for all the instances in each dataset are presented in the figures below:





  • Results Based on Footprinter Results

Orthologous gene groups were found by BLAST search using Bacillus subtilis genes as queries. Each group contains 20 sequences. For each such dataset, up to 600 bases of upstream intergenic region were collected per species. Then we applied the DNA motif finding tool Footprinter to rank the datasets based on patterns of conservation. Finally, we applied CMfinder to analyze the top ranking groups. The covariance models of promising RNA motifs were used to scan the genome database, and the scanning results were used to refine the RNA motifs. We performed 2 iterations of scanning and refinement for 5 out of 7 chosen families (except S_box and RFN), and the first iteration was performed on the NCBI microbial genome database, and the second scan was performed on the RFAMSEQ database for comparison with RFAM. See the RaveNna documentation for a description of the scanning procedures and format of the scan result files.

Orthologous family Rfam family
Sequence File Initial Alignment Updated Alignment Scan Result
folC  T-box folC.fasta
folC.init.html
folC.all.html
.csv
metK  S_box metK.fasta
metK.init.html
metK.all.html
.csv
ribB  RFN ribB.fasta
ribB.init.html
ribB.all.html
.csv
thiA
 THI thiA.fasta
thiA.init.html
thiA.all.html
.csv
xpt
 Purine xpt.fasta
xpt.init.html
xpt.all.html
.csv
glmS
 glmS glmS.fasta
glmS.init.html
glmS.all.html
.csv
ykoY
 yybP-ykoY ykoY.fasta
ykoY.init.html
ykoY.all.html
.csv
  • Results Based on BLISS Candidates

We also experimented with interesting families identified by other data sources and computational tools. Breaker's lab identified candidate riboswitch families using alignments seeded by BLAST hits between intergenic regions from bacterial genomes. These sequences are then annotated by gene function, and possible terminators. The collection of interesting groups is available as the BLISS database. We chose several groups in BLISS that correspond to several recent riboswitches or riboswitch candidates and applied CMfinder on the raw sequence data. The resulting sequence alignments/structures are shown below.

Gene BLISS Entry CMfinder alignment
yuaA , ydaO yuaA , ydaO yuaA_ydaO.html
ykoK ykoK ykoK.html
ykkC ykkC, ykkD ykkC.html
yqhI (gcvT) yqhI yqhI.html

Last Updated: 09/12/2005