Copyright (c) 2005, by Zizhen Yao, Zasha Weinberg and Larry Ruzzo.
All rights reserved. Redistribution is not permitted without the
express written permission of the authors.
1) Description of the program
CMfinder is an RNA motif prediction tool. It performs well on unaligned sequences with long extraneous flanking regions, and in cases where the motif is present in only a subset of the sequences. It is an expectation maximization algorithm that uses covariance models for motif description, carefully crafted heuristics for effective motif search, and a novel Bayesian framework for structure prediction that combines folding energy and sequence covariation. CMfinder also integrates directly with genome-scale homology search, and can be used for automatic refinement and expansion of RNA families.
You can access CMfinder via our web server. A Linux/Unix software package for local installation will be available at a later time.
When you run CMfinder on your own machine, the following command line should be used:
cmfinder.pl [options] <input_sequences>
A more detailed description of the input and parameters follows:
1. Input sequences
CMfinder takes as input a set of sequences in which you want to find RNA motifs. The input sequences must all be listed in the same file, in FASTA format. See an example. Extended DNA/RNA alphabet letters (anything other than a, c, g, or t/u) are replaced at random by a basic letter within the set. Each sequence must have a name that differs from the names of all other sequences.
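The input requirements above can be checked before a file is submitted. The following Python sketch is not part of CMfinder. The function names are hypothetical, and the replacement rule (a uniform random choice among the bases that each IUPAC ambiguity code stands for, always written with DNA letters) is our assumption about what "within the set" means.

```python
import random

# IUPAC ambiguity codes mapped to the basic letters they stand for.
IUPAC = {
    "r": "ag", "y": "ct", "s": "cg", "w": "at", "k": "gt", "m": "ac",
    "b": "cgt", "d": "agt", "h": "act", "v": "acg", "n": "acgt",
}

def read_fasta(text):
    """Parse FASTA text into a list of (name, sequence) pairs."""
    records = []
    name, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks)))
            fields = line[1:].split()
            # The name is taken to be the first word of the header line.
            name = fields[0] if fields else ""
            chunks = []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))
    return records

def duplicate_names(records):
    """Return the sorted list of names that occur more than once."""
    seen, dups = set(), set()
    for name, _ in records:
        if name in seen:
            dups.add(name)
        seen.add(name)
    return sorted(dups)

def normalize(seq, rng=random):
    """Lower-case a sequence and replace ambiguity codes at random (assumed rule)."""
    out = []
    for ch in seq.lower():
        if ch in "acgtu":
            out.append(ch)
        elif ch in IUPAC:
            out.append(rng.choice(IUPAC[ch]))
        else:
            raise ValueError("unexpected character: %r" % ch)
    return "".join(out)
```

A file passes the name check when duplicate_names(read_fasta(text)) returns an empty list.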
2. Number of stem-loops
Specifies the number of stem-loops in each candidate. In most cases this parameter determines the complexity of the motifs, although the number of stem-loops in the final output motifs may differ, because the structure can change during the EM refinement procedure. For most datasets, we set it to 1 or 2 to find single stem-loop or double stem-loop motifs. To discover more complicated motifs with more than 2 stem-loops, you can combine multiple motifs.
3. Number of motifs
Specifies the number of output motifs. To find a motif with simple structure (e.g. with a single stem-loop), it is sufficient to output 3 motifs, which (in our experience) usually include the true one. For complicated motifs such as riboswitches with more than 3 stem-loops, we suggest trying up to 5 single stem-loop motifs and 5 double stem-loop motifs to improve the coverage of different regions of a true RNA. Some of the output motifs may be highly similar, and you can choose to remove redundant motifs.
4. Minimum/Maximum length of a motif
Specifies the size of the motifs sought.
For single stem-loop families, the default range is 30 ~ 100bp.
For double stem-loop families, the default range is 40 ~ 100bp.
The lower bound of motif length is 15bp, and the upper bound is 250bp. The actual output motifs can fall slightly outside the specified range, because we do not enforce the range during the refinement iterations.
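The defaults and limits above can be summarized in a few lines of Python. This is only an illustrative sketch: the function name resolve_length_range and its arguments are hypothetical and do not correspond to any CMfinder option.

```python
# Limits and defaults quoted in the text above (all lengths in bp).
HARD_MIN, HARD_MAX = 15, 250
DEFAULT_RANGE = {1: (30, 100), 2: (40, 100)}  # keyed by number of stem-loops

def resolve_length_range(stem_loops, min_len=None, max_len=None):
    """Fill in the default range and check it against the hard limits."""
    default_min, default_max = DEFAULT_RANGE[stem_loops]
    lo = default_min if min_len is None else min_len
    hi = default_max if max_len is None else max_len
    if not (HARD_MIN <= lo <= hi <= HARD_MAX):
        raise ValueError(
            "length range %d..%d must lie within %d..%d"
            % (lo, hi, HARD_MIN, HARD_MAX)
        )
    return lo, hi
```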
5. Number of candidates:
Specifies the number of candidates in each sequence. The default value is 40, which in our experience is sufficient for sequences shorter than 500bp. As the sequence length increases, please increase the number of candidates proportionately.
6. Fraction of sequences containing the motif:
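As a rough illustration of "proportionately", the sketch below scales the default of 40 candidates linearly with sequence length beyond 500bp. The linear rule and the function name suggested_candidates are assumptions made for illustration, not a formula used by CMfinder.

```python
def suggested_candidates(seq_length, base=40, base_length=500):
    """Scale the candidate count linearly with sequence length (assumed rule)."""
    if seq_length <= base_length:
        return base
    # Integer ceiling of base * seq_length / base_length.
    return (base * seq_length + base_length - 1) // base_length
```

Under this assumed rule, a 1000bp sequence would be given 80 candidates.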
This parameter affects the choice between short motifs conserved in more sequences and longer motifs conserved in fewer sequences. Small variations are not critical. However, if a motif is contained in only 3~4 sequences in a set of 20 sequences, for example, this parameter affects the final output. Note that this parameter does not specify how many sequences actually contain motif instances in the final output; it simply serves as a preliminary guess. In our test set, values in the range 0.4 ~ 0.8 have seemed appropriate.
7. Combine multiple motifs:
Use this feature only if you are searching for RNA motifs with complicated structure and length > 100bp. The program tries to merge consecutive motifs progressively, and refines the merged motifs iteratively using the EM algorithm.
8. Remove redundant motifs:
CMfinder tries to find multiple motifs that are distinctive, but different motif seeds may converge to the same structure after the EM iteration. For ease of post-processing, we offer this feature to remove largely duplicated motifs. The removed motifs are collected in a single file in case recovery is needed.
The web server provides a gzipped tarball that includes the following files:
1) Input sequence file
The input sequences are kept in a file named seq.fasta
2) Motifs in Stockholm format
Motifs in Stockholm format are stored in files named seq.fasta.motif.*, where the suffix specifies the number of stem-loops in the input configuration and the motif index. We use the mark-up lines
#=GS <seqname> DE <start>..<end> <score>
#=GS <seqname> WT <weight>
to describe the start and end position of the motif within the sequence, the alignment score, and the weight. See an example of such a motif.
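The two mark-up lines can be read back with a short script. The Python sketch below is not part of CMfinder; it assumes that the fields are separated by whitespace exactly as shown above, that start and end are integers, and that score and weight are decimal numbers. The function name parse_gs_lines is hypothetical.

```python
import re

# Patterns for the two "#=GS" mark-up lines described above.
DE_RE = re.compile(r"^#=GS\s+(\S+)\s+DE\s+(\d+)\.\.(\d+)\s+(\S+)\s*$")
WT_RE = re.compile(r"^#=GS\s+(\S+)\s+WT\s+(\S+)\s*$")

def parse_gs_lines(lines):
    """Collect start, end, score and weight per sequence name."""
    info = {}
    for line in lines:
        m = DE_RE.match(line)
        if m:
            name, start, end, score = m.groups()
            info.setdefault(name, {}).update(
                start=int(start), end=int(end), score=float(score)
            )
            continue
        m = WT_RE.match(line)
        if m:
            name, weight = m.groups()
            info.setdefault(name, {})["weight"] = float(weight)
    return info
```

Passing the lines of a motif file to parse_gs_lines returns one dictionary per sequence name; all other lines of the Stockholm file are ignored.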
3) Covariance models for motifs
The motif covariance models are named seq.fasta.cm.*, where the suffix is the same as that of the corresponding motif in Stockholm format. The models are in Infernal 0.55 format.
4) CMfinder summary
The file "seq.fasta.summary" records the input parameters, the motifs that have been combined or removed, and the summary statistics of all output motifs, which include: the number, sum of weights, average length, average score, entropy reduction compared to a random alignment of the same length (one statistical measure of sequence conservation), mutual information of base-paired columns, average number of base pairs, average pairwise sequence identity, average folding energy and GC content. See an example.
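For post-processing, the files in the tarball can be sorted by the naming scheme described above. The following Python sketch is not part of CMfinder; the function names are hypothetical, and any file that does not match one of the four patterns is simply labeled "other".

```python
import posixpath
import tarfile

def classify_output(member_name, base="seq.fasta"):
    """Label an archive member by the naming scheme described above."""
    name = posixpath.basename(member_name)
    if name == base:
        return "input"
    if name == base + ".summary":
        return "summary"
    if name.startswith(base + ".motif."):
        return "motif"
    if name.startswith(base + ".cm."):
        return "cm"
    return "other"

def group_archive(path, base="seq.fasta"):
    """Group the member names of a gzipped tarball by file type."""
    groups = {}
    with tarfile.open(path, "r:gz") as archive:
        for member_name in archive.getnames():
            kind = classify_output(member_name, base)
            groups.setdefault(kind, []).append(member_name)
    return groups
```

Calling group_archive on the downloaded tarball returns a dictionary whose "motif" and "cm" entries list the Stockholm and covariance model files.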