A computational pipeline for high
throughput discovery of cis-regulatory noncoding RNAs in Bacteria
PLos Computational Biology 3(7):e126
This paper presents an efficient computational pipeline for discovering cis-regulatory ncRNA motifs de novo.
The pipeline is structure-oriented, does not require a multiple sequence alignment as input, and is capable of
detecting RNA motifs with low sequence conservation. We also seamlessly integrate
RNA motif prediction with RNA homolog search, which improves the quality of
the RNA motifs significantly. We have applied this pipeline to discover ncRNAs in the
Bacillus/Clostridium group of Firmicutes. Our top ranking motifs include most known Firmicute Rfam families.
Yao Z, Bariick J, Weinberge Z, Neph S, Breaker R, et al. (2007) A Computational Pipeline for High- Throughput Discovery of cis-Regulatory Noncoding RNA in Prokaryotes. PLoS Comput Biol 3(7): e126 link
Full rank list of motifs :
After RaveNna scan
Before RaveNna scan
CMfinder motifs ,
Rank file (before RaveNna scan)
Refined motifs ,
Rank file (after RaveNna scan)
RNA motifs upstream of ribosomal proteins
Technical details for the pipeline:
Here, we wish to provide sufficient technical details for anyone who is interested to
reconstruct our pipeine for simmilar applications. However, the provided scripts are in general
heavily dependent on the data, external tools and local system configuration, therefore,
not directly applicable without significant customization. Please only use these scripts
as reference if you are intested in the details, and use extreme discretion.
Identify CDD group members (Parallized)
|| CDD models (version 2.05) are collected from here |
Protein sequences in fasta format are extracted from the relevant GenBank files from Microbial database (RefSeq release 14)
|| rpsblast (version 2.2.11) |
| Command: || rpsblast -m 8 -F T -e 0.01 -d cdd_path |
| Postprocessing: || Parse the rpsblast output files. Disregard all domain hits that do not cover at
least 40% of the length of the conserved domain. |
Starting with the most significant hits to a query protein,
accept new domain matches that do not overlap previously accepted domain matches by more than 15 amino acids.
|| cdd_file (3092 CDD groups) |
2942 are selected (removed groups with < 4 sequences or > 70 sequences).
Prepare sequences for RNA motif search
Script that performs all sequence preprocessing: preproc.pl
Output: Sequence files in fasta format. Download the sequences here
- Retrieve upstream sequences
|| CDD groups identified from step 1. |
Local microbe genome database based on NCBI Microbial Genome Database
| Tool: || Microfootprinter . Microfootprinter is the front end of
Footprinter (version 2.0) . |
||microfootprinter -skipduplicatechecks 1 -grouped cdd_group -auxfilesonly 1 -subdir cdd_dir 600 |
|Output: || CLUSTALW protein sequence alignments, and corresponding guided trees (as approximation of phylogenetic trees) |
Upstream sequences of genes in each CDD groups
- Identify matches to Rfam.
|Input: || Upstream sequences collected from step 2.1 |
Rfam.fasta: Fasta sequences of Rfam members filtered to <90% identity (Rfam 7.0)
|Tool: || WU-BLAST |
| Command: || blastn Rfam.fasta infile -W 20 -noseqs > blast_result |
| Postprocessing: ||
Parse blast output, select hits with e_value < 0.001, identifiy > 90%. |
The length of selected hits should be greater than 85% of the matching RNA.
- Remove tRNAs and rRNAs from collected sequences.
- Remove highly similar sequences from each dataset.
Highly similar sequences biase CMfinder to predict motif that are only specific to these sequences.
Therefore, for each pair of highly similar sequences, we just keep one of them.
To do this, for each sequence dataset, we perform pairwise blast search:
xdformat -n seq_file
Starting from the first sequence in the file, if it has not been removed yet,
remove any matching sequence
blastn seq_file seq_file -notes -top -W 8 -noseqs
if this matching sequence has e_value < 0.001,
identifiy > 95% and matching length exceeds 80% of the target sequence length.
|Input: || Output from step 2.1 (upstream sequences of proteins, and protein sequences for constructing phylogeny tree).|
| Tool: || Microfootprinter . |
microfootprinter -clade clade -batch acc.pid -numblast 100 -interactivehtml 0 -maxregionsize 600 -highlight 1 -mask 1 -makehtmllinks 1 |
RNA motif search using CMfinder (Parallized)
| Input: || Preprocessed upstream sequences from step 2.|
| Tool: || CMfinder (version 0.2) |
| Command: ||
cmfinder.pl -n 5 -m 30 -M 100 -f 0.4 seq_file |
cmfinder.pl -n 5 -s 2 -m 30 -M 100 -f 0.4 seq_file
CombMotif.pl seq_file seq_file.motif.h
|Output: || 35975 motifs (in stockholm format) |
Input: Motifs produced by step 4.
Script that performs motif postprocessing: postproc.pl
- Filtering motif instances
Remove motif instances with CM score < 10 bits.
Command: filter.pl -s -lt 10 motif_file motif_file
Remove duplicate motif instances with identical sequences
Command: rm_dup_instances.pl in_motif_file out_motif_file
We remove duplicate motif instances to avoid the problem with repeat elements.
- Rank motifs
Compute motif composite scores (see details here ), and rank motifs accordingly.
Tool: R statistical package is used.
Command: rank_cmfinder.pl -w -rank motif_rank_file
- Select motif.
For each dataset, select up to 4 motifs starting from the highest ranking motif.
A lower ranking motif that overlaps significantly with a selected higher ranking motif will not be selected.
Comand: select_cmfinder.pl motif_dir selected_dir motif_rank_file > selected_rank_file
- Cluster motifs.
We cluster similar motifs according to the genomic coordinates of motif instances.
One motif is grouped with another if at least half of its members overlap, and the overlapped regions are longer
than half of the motif length.
Command: merget_all_motif.pl -rank selected_rank_file selected_dir
- Product HTML table for ranked motif clusters
Command: parse_rank_file.pl -cdd cdd_file selected_dir selected_rank_file > index.html
Scan motif homologs using RaveNna (Parallized)
|| Intergenic sequences of Fimicutes (extendy by 50 nucleotides to account for gene annotation errors) |
CMfinder motifs from step 5.
|Tool: || RaveNnA |
Add the following to data/ravenna.config.tab to configure scanning sequence database: (see Chapter 3 of RavNnA manual) :|
db miniscan-firmicute seqs/IGR.fasta.gz data/file-with-two-lines fullseq projection-of refseq14-microbial
(The above fields are separated by Tab)
This configures a database called 'miniscan-firmicute'. Explanation of major fields is described below:
seqs/IGR.gz : ||a FASTA sequence file containing all IGRs (extended by 50 nucleotides) in the Refseq17 genomes |
|data/file-with-two-lines:|| file with two line feeds ("\n"), to replace potential taxomomy file which we did not use. |
|fullseq:|| select the heuristic HMM threshold based on the actual sequences in seqs/IGR.fasta.gz |
|projection-of refseq17-microbial:|| used to extend the predicted hits by 50 nucleotides on either end,
in the hopes that CMfinder can find additional structure. Extension is performed within the full database to allow bleeding
into annotated ORFs. This process is dictated by the "-extendedSeqs 50,50" flag passed to ravenna.pl in the miniscans.
perl ravenna.pl -global -stoFileName motif_file -infernalCM -database miniscan-firmicute -workDir cmfinder/miniscans -windowLenFromStockholm 1.1,80 -extendedSeqs 50,50 -knownHitsFromGivenSto
|| Scanning results in .csv format. Download |
Refine motif using RaveNna scanning results (Parallized)
|Input: || CMfinder motifs from step 5 |
RaveNnA scanning results from step 6
|Preprocessing: ||Parse RaveNnA scanning results into fasta format of the hit sequences |
| Tool: || CMfinder (version 0.2) |
|Command: || cmfinder -a motif_file -o new_motif_file scan_sequence_file new_cm_file |
|Postprocessing: || Same as step 5 |