|
|
CMfinder - A Covariance Model Based RNA Motif Finding
Algorithm
Zizhen Yao, Zasha Weinberg, Walter L. Ruzzo
BioInformatics,2006 22: p. 445-452
[Supplementary Material*]
* This website supplements our paper,
"CMfinder - A Covariance Model Based RNA Motif Finding Algorithm"
with additional technical details and
supplementary results.
Introduction
CMfinder is a RNA motif prediction tool. It is an expectation
maximization algorithm using covariance models for motif description, carefully
crafted heuristics for effective motif search, and a novel Bayesian framework
for structure prediction combining folding energy and sequence covariation.
This tool performs well on unaligned sequences with long extraneous flanking
regions, and in cases when the motif is only present in a subset of sequences.
CMfinder also integrates directly with genome-scale homology search, and can be
used for automatic refinement and expansion of RNA families.
Technical details
The implementation described in the paper made use
of the secondary structure folding and tree edit programs from
ViennaRNA-1.5 .
In addition, covariance model based sequence scanning and Viterbi alignment used
CMscan and CMalign, respectively, from
Infernal 0.55.
CM construction is somewhat analogous to the EM-based CM training program
Covet from Eddy & Durbin's
COVE package,
but is augmented with the structure inference features described in the paper. It also makes
use of pre-released "Dirichlet mixture prior" code provided to us by Dr.
Sean Eddy (now incorporated into Infernal 0.70).
The following supplementary paper provides additional technical information on the implementation.
Download supplement:
Adobe Acrobat (pdf) |
Postscript
Software availability
Web server
for this tool is available.
Download the software here.
Please contact yzizhen@cs.washington.edu
if you have any questions.
Supplementary Results
-
Global alignment tests on Rfam family sequences without flanking
regions.
We compared CMfinder with 4 methods: RNAalifold, Pfold,Coveb
(a program in the COVE package to construct a CM from a multiple alignment),
and Covet (a companion of coveb, but using an EM algorithm to refine the
CM). For fairness, each method is tested on the same CLUSTAL alignment for
every dataset. The major difference between CMfinder and covet in this test is
the use of partition function-based priors in secondary structure prediction.
Both CMfinder and covet use an EM algorithm to refine the alignment and
prediction, while the other three methods do not.
In the following table, each cell for a given family and a given
method contain three numbers which specifiy the accuracy,
sensitivity and specificity respectively, and it is linked to
the corresponding alignment file. For better visulization, all
alignments are transformed into the the format described below. The best
accuracy, sensitivity and specificity for each family are highlighted in bold
font.
Alignment format description:
The colored block alignment corresponds to the Rfam motif (RM): blocks marked by
the same color indicate a helix. Non-canonical base pairs in expected helices
are colored light gray. Below each sequence is its predicted secondary structure.
Matching pairs of angle brackets indicate the base pairs. The regions that are predicted as
part of the motif, but not in the corresponding RM are shown in gray color, and
regions that are in the RM, but not predicted as part of the motif are shown as dashed
lines. The last line shows the Rfam consensus secondary structure.
Family Name
|
RFAM_ID
|
#Seq
|
Avg Seq %id
|
Length
|
CMfinder
|
Pfold
|
RNAalifold
|
Covet
|
Coveb
|
| Cobalamin
|
RF00174
|
71
|
49
|
216
|
0.61, 0.59, 0.64 |
0.41, 0.29, 0.60 |
0.32, 0.12, 0.87 |
0.51, 0.60, 0.44 |
0.26, 0.23, 0.30 |
| Entero_CRE
|
RF00048
|
56
|
81
|
61
|
0.77, 1.00, 0.59 |
0.95, 1.00, 0.90 |
0.71, 0.50, 1.00 |
0.00, 0.00, 0.00 |
0.00, 0.00, 0.00 |
| Entero_OriR
|
RF00041
|
35
|
77
|
73
|
0.95, 0.97, 0.93 |
0.80, 0.63, 1.00 |
0.70, 0.63, 0.76 |
0.06, 0.05, 0.09 |
0.00, 0.00, 0.00 |
| Histone3
|
RF00032
|
63
|
77
|
26
|
1.00, 1.00, 1.00 |
1.00, 1.00, 1.00 |
1.00, 1.00, 1.00 |
0.70, 0.67, 0.74 |
0.57, 0.33, 0.96 |
| IRE
|
RF00037
|
30
|
68
|
30
|
0.92, 0.95, 0.90 |
0.67, 0.55, 0.83 |
0.64, 0.59, 0.69 |
0.00, 0.00, 0.00 |
0.12, 0.04, 0.34 |
| Intron_gpII
|
RF00029
|
75
|
55
|
92
|
0.71, 0.79, 0.64 |
0.78, 0.62, 0.99 |
0.76, 0.58, 0.99 |
0.67, 0.69, 0.66 |
0.55, 0.52, 0.58 |
| Lysine
|
RF00168
|
48
|
48
|
183
|
0.77, 0.76, 0.78 |
0.59, 0.49, 0.70 |
0.22, 0.10, 0.48 |
0.51, 0.49, 0.53 |
0.41, 0.38, 0.44 |
| Purine
|
RF00167
|
29
|
55
|
103
|
0.90, 0.93, 0.87 |
0.76, 0.71, 0.81 |
0.00, 0.00, 0.00 |
0.73, 0.87, 0.61 |
0.67, 0.67, 0.68 |
| RFN
|
RF00050
|
47
|
66
|
139
|
0.73, 0.85, 0.62 |
0.72, 0.81, 0.64 |
0.77, 0.81, 0.74 |
0.62, 0.80, 0.47 |
0.58, 0.51, 0.66 |
| Rhino_CRE
|
RF00220
|
12
|
71
|
86
|
0.96, 1.00, 0.93 |
0.66, 0.48, 0.93 |
0.77, 0.75, 0.79 |
0.33, 0.28, 0.40 |
0.00, 0.00, 0.00 |
| SECIS
|
RF00031
|
43
|
43
|
68
|
0.68, 0.56, 0.82 |
0.00, 0.00, 0.00 |
0.00, 0.00, 0.00 |
0.03, 0.02, 0.03 |
0.00, 0.00, 0.00 |
| S_box
|
RF00162
|
64
|
66
|
112
|
0.81, 0.89, 0.75 |
0.77, 0.68, 0.88 |
0.72, 0.57, 0.92 |
0.55, 0.59, 0.52 |
0.41, 0.32, 0.53 |
| Tymo_tRNA-like
|
RF00233
|
22
|
72
|
86
|
0.77, 0.80, 0.75 |
0.62, 0.52, 0.74 |
0.73, 0.54, 0.99 |
0.58, 0.54, 0.63 |
0.45, 0.34, 0.58 |
| ctRNA_pGA1
|
RF00236
|
17
|
74
|
83
|
0.89, 0.90, 0.88 |
0.86, 0.80, 0.92 |
0.92, 0.90, 0.94 |
0.16, 0.12, 0.21 |
0.00, 0.00, 0.00 |
| glmS
|
RF00234
|
14
|
58
|
188
|
0.74, 0.71, 0.77 |
0.62, 0.47, 0.81 |
0.49, 0.31, 0.78 |
0.30, 0.29, 0.32 |
0.27, 0.18, 0.42 |
| let-7
|
RF00027
|
9
|
69
|
84
|
0.79, 0.75, 0.83 |
0.84, 0.71, 1.00 |
0.80, 0.68, 0.93 |
0.00, 0.00, 0.00 |
0.00, 0.00, 0.00 |
| lin-4
|
RF00052
|
9
|
69
|
72
|
0.76, 0.78, 0.73 |
0.72, 0.57, 0.91 |
0.73, 0.56, 0.94 |
0.18, 0.14, 0.24 |
0.18, 0.08, 0.38 |
| mir-10
|
RF00104
|
11
|
66
|
75
|
0.85, 0.94, 0.77 |
0.75, 0.61, 0.93 |
0.85, 0.84, 0.86 |
0.00, 0.00, 0.00 |
0.01, 0.01, 0.02 |
| s2m
|
RF00164
|
23
|
80
|
43
|
0.68, 0.65, 0.71 |
1.00, 1.00, 1.00 |
0.64, 0.59, 0.69 |
0.71, 0.65, 0.78 |
0.68, 0.47, 1.00 |
| Average |
0.80, 0.83, 0.78 |
0.71, 0.63, 0.82 |
0.62, 0.53, 0.85 |
0.35, 0.36, 0.35 |
0.27, 0.21, 0.38 |
-
Local alignment tests on Rfam family sequences with 200 bases offlanking
regions.
We compared CMfinder with 5 methods: RNAalifold, Pfold, Carnac,
Foldalign (v1.0 and v2.0), and ComRNA. For Pfold, RNAalifold, and
Carnac, we used the default parameters; for the others, parameters were as follows:
CMfinder:
Output 5 single stem-loop, range (30~100bp), 5 double stem-loop, range (40~100bp).
For 6 families (Cobalamin, Lysine, S_box, RFN, glmS, Tymo_tRNA-like) with
complicated structure, combine motifs using greedy heuristics.
Foldalign:
v 1.0 : tested on IRE, let-7, lin-4, mir-10, Rhino_CRE.Choose the best
scoring motif. spec file parameters:
Align range: # 5 60
Add nt. to seq range: 0
Filter parameters: 2 2 <# input sequences> 15 1
Scoring file: score_v1.mat
v 2.0 : tested on 9 families with #seq < 40, length < 100bp.
foldalign -max_diff 20 -max_length 100 <input_file>
ComRNA:
comRNA -x 0 -ts 20 -tp 20<input_file>
-x 0: no pseudoknot allowed.-ts, -tp are used to control running time,
and we set both to 20 times their default values.
Choose the motif with
the best accuracy according to Rfam annotation.
The following table is in the same format as the one above. RNAalifold, Pfold
and Carnac annotate the secondary structures of input sequences, but do not
predict the range of the motif, thus the comparison includes whole
sequences.
-
Empty cells indicate that the corresponding experiment were not conducted due
to computation cost or other limitation.
-
"NA" cells indicate that the tool did not produce any structure for the given
dataset. The resulting accuracy and sensitivity are 0, and specificity
is NA.
-
"X" indicate that the tool terminated abnormally.
| Family Name
|
RFAM_ID
|
#Seq
|
Avg Seq %id
|
Length
|
CMfinder
|
Pfold
|
RNAalifold
|
Carnac
|
Foldalign(v1.0) |
Foldalign(v2.0) |
ComRNA
|
| Cobalamin
|
RF00174
|
71
|
49
|
216
|
0.59, 0.45, 0.77 |
0.05, 0.01, 0.18 |
NA |
X
|
|
|
NA |
| Entero_CRE
|
RF00048
|
56
|
81
|
61
|
0.89, 0.83, 0.94 |
0.74, 1.00, 0.55 |
0.22, 0.18, 0.27 |
NA |
|
|
NA |
| Entero_OriR
|
RF00041
|
35
|
77
|
73
|
0.94, 0.96, 0.92 |
0.75, 0.88, 0.65 |
0.76, 0.63, 0.90 |
0.80, 0.70, 0.93 |
|
0.52, 0.53, 0.51 |
0.52, 0.49, 0.78 |
| Histone3
|
RF00032
|
63
|
77
|
26
|
1.00, 1.00, 1.00 |
0.00, 0.00, 0.00 |
NA |
NA |
|
|
NA |
| IRE
|
RF00037
|
30
|
68
|
30
|
0.77, 0.93, 0.64 |
0.22, 0.17, 0.28 |
NA |
NA |
0.43, 0.35, 0.53 |
0.38, 0.41, 0.36 |
NA |
| Intron_gpII
|
RF00029
|
75
|
55
|
92
|
0.80, 0.70, 0.92 |
0.30, 0.14, 0.68 |
NA |
NA |
|
|
NA |
| Lysine
|
RF00168
|
48
|
48
|
183
|
0.77, 0.73, 0.80 |
0.24, 0.13, 0.43 |
NA |
X
|
|
|
NA |
| Purine
|
RF00167
|
29
|
55
|
103
|
0.91, 0.90, 0.93 |
0.07, 0.02, 0.22 |
NA |
NA |
|
|
0.27, 0.19, 0.76 |
| RFN
|
RF00050
|
47
|
66
|
139
|
0.39, 0.40, 0.39 |
0.68, 0.75, 0.62 |
0.26, 0.23, 0.30 |
NA |
|
|
NA |
| Rhino_CRE
|
RF00220
|
12
|
71
|
86
|
0.88, 0.90, 0.87 |
0.52, 0.53, 0.51 |
0.52, 0.53, 0.52 |
0.69, 0.52, 0.92 |
0.43, 0.38, 0.50 |
0.41, 0.36, 0.47 |
0.61, 0.58, 0.79 |
| SECIS
|
RF00031
|
43
|
43
|
68
|
0.73, 0.62, 0.87 |
NA |
NA |
NA |
|
|
NA |
| S_box
|
RF00162
|
64
|
66
|
112
|
0.72, 0.67, 0.78 |
0.11, 0.07, 0.19 |
NA |
NA |
|
|
NA |
| Tymo_tRNA-like
|
RF00233
|
22
|
72
|
86
|
0.81, 0.80, 0.82 |
0.33, 0.31, 0.35 |
0.36, 0.30, 0.42 |
0.30, 0.14, 1.00 |
|
0.80, 0.78, 0.81 |
0.48, 0.42, 0.64 |
| ctRNA_pGA1
|
RF00236
|
17
|
74
|
83
|
0.91, 0.91, 0.92 |
0.70, 0.72, 0.69 |
0.72, 0.68, 0.77 |
NA |
|
0.86, 0.85, 0.88 |
0.00, 0.00, 0.00 |
| glmS
|
RF00234
|
14
|
58
|
188
|
0.83, 0.83, 0.83 |
0.12, 0.07, 0.24 |
0.18, 0.07, 0.43 |
NA |
|
|
0.13, 0.06, 0.56 |
| let-7
|
RF00027
|
9
|
69
|
84
|
0.87, 0.80, 0.93 |
0.08, 0.06, 0.11 |
0.42, 0.35, 0.50 |
NA |
0.29, 0.15, 0.57 |
0.71, 0.69, 0.74 |
0.78, 0.72, 0.85 |
| lin-4
|
RF00052
|
9
|
69
|
72
|
0.78, 0.87, 0.69 |
0.51, 0.46, 0.56 |
0.75, 0.62, 0.92 |
0.41, 0.43, 0.39 |
0.67, 0.63, 0.70 |
0.65, 0.71, 0.62 |
0.24, 0.23, 0.55 |
| mir-10
|
RF00104
|
11
|
66
|
75
|
0.66, 0.62, 0.70 |
0.59, 0.56, 0.62 |
0.60, 0.47, 0.76 |
NA |
0.55, 0.42, 0.71 |
0.48, 0.54, 0.43 |
0.33, 0.28, 0.85 |
| s2m
|
RF00164
|
23
|
80
|
43
|
0.67, 0.65, 0.70 |
0.80, 0.91, 0.72 |
0.45, 0.40, 0.51 |
0.64, 0.46, 0.95 |
|
0.63, 0.64, 0.62 |
0.29, 0.23, 0.73 |
| Average |
0.79, 0.77, 0.81 |
0.36, 0.36, 0.42 |
0.28, 0.23, 0.57 |
0.17, 0.13, 0.84 |
0.39, 0.47, 0.60 |
0.60, 0.61, 0.60 |
0.19, 0.17, 0.65 |
We tested CMfinder on more challenging datasets with larger flanking regions
and only a subset of the sequences containing real RNA motifs. This result is
based on an earlier version of CMfinder, which used COVE rather than Infernal
for CM operations. To create each
dataset, we randomly selected n family members, including a given length of
flanking sequence, then permuted the motif regions in k of them.These k
sequences serve as control sequences, the rest, with real RNA motifs,
are referred to as test sequences. This test was performed on 3 families,
Histone, IRE, SEICS. They are cis-regulatory elements with simple structure,
thus avoiding the issue of combining motifs. For each family, we
performed two sets of test. First, we varied the flanking regions from 50 to
400 bases on both the 5' and 3' sides, with 1/4 control sequences; then we
varied the fraction of control sequences from 1/8 to 1/2, all with 100-base
flanking regions. We categorized three types of predicted motif instances: true
and false motif instances in the test sequences (Tt and Ft, respectively),
and those in the control sequences (Fc). Tts and Fts are determined by
whether their overlaps with corresponding Rfam motif instances are greater than
10 bases. The cove scores for all the instances in each dataset are presented
in the figures below:
 


-
Results Based on Footprinter Results
Orthologous gene groups were found by BLAST search using Bacillus subtilis
genes as queries. Each group contains 20 sequences. For each such dataset,
up to 600 bases of upstream intergenic region were collected per species. Then we applied
the DNA motif finding tool
Footprinter to rank the datasets based on patterns of conservation.
Finally, we applied CMfinder to analyze the top ranking groups. The covariance
models of promising RNA motifs were used to scan the genome database, and the
scanning results were used to refine the RNA motifs. We performed 2 iterations
of scanning and refinement for 5 out of 7 chosen families (except S_box and
RFN), and the first iteration was performed on the
NCBI microbial genome database, and the second scan was performed on the
RFAMSEQ database for comparison with RFAM.
See the RaveNna documentation for a description
of the scanning procedures and format of the scan result files.
-
Results Based on BLISS Candidates
We also experimented with interesting families identified by other data sources
and computational tools. Breaker's lab identified candidate riboswitch families
using alignments seeded by BLAST hits between intergenic regions from bacterial
genomes. These sequences are then annotated by gene function, and possible
terminators. The collection of interesting groups is available as
the BLISS database. We chose several groups in BLISS that correspond to
several recent
riboswitches or riboswitch candidates and applied CMfinder on the raw
sequence data.
The resulting sequence alignments/structures are shown below.
|