image University of Washington Department of Computer Science & Engineering

A computational pipeline for high throughput discovery of cis-regulatory noncoding RNAs in Bacteria
PLos Computational Biology 3(7):e126
[Supplementary Material*]

Introduction

This paper presents an efficient computational pipeline for discovering cis-regulatory ncRNA motifs de novo. The pipeline is structure-oriented, does not require a multiple sequence alignment as input, and is capable of detecting RNA motifs with low sequence conservation. We also seamlessly integrate RNA motif prediction with RNA homolog search, which improves the quality of the RNA motifs significantly. We have applied this pipeline to discover ncRNAs in the Bacillus/Clostridium group of Firmicutes. Our top ranking motifs include most known Firmicute Rfam families.

Citation

Yao Z, Bariick J, Weinberge Z, Neph S, Breaker R, et al. (2007) A Computational Pipeline for High- Throughput Discovery of cis-Regulatory Noncoding RNA in Prokaryotes. PLoS Comput Biol 3(7): e126
link

A list of Firmicutes species used in the pipeline search

Full rank list of motifs :

After RaveNna scan
Before RaveNna scan

Download:

Upstream sequences
CMfinder motifs , Rank file (before RaveNna scan)
Refined motifs , Rank file (after RaveNna scan)

RNA motifs upstream of ribosomal proteins

IF-3 r-protein leader: Corresponding to Rfam RF00558
L10 r-protein leader Corresponding to Rfam RF00557
L13 r-protein leader Corresponding to Rfam RF00555
L19 r-protein leader Corresponding to Rfam RF00556
L21 r-protein leader Corresponding to Rfam RF00559
S10 r-protein leader
S4 r-protein leader

Technical details for the pipeline:

Here, we wish to provide sufficient technical details for anyone who is interested to reconstruct our pipeine for simmilar applications. However, the provided scripts are in general heavily dependent on the data, external tools and local system configuration, therefore, not directly applicable without significant customization. Please only use these scripts as reference if you are intested in the details, and use extreme discretion.

  1. Identify CDD group members (Parallized)

    Input: CDD models (version 2.05) are collected from here
    Protein sequences in fasta format are extracted from the relevant GenBank files from Microbial database (RefSeq release 14)
    Tool: rpsblast (version 2.2.11)
    Command: rpsblast -m 8 -F T -e 0.01 -d cdd_path
    Postprocessing: Parse the rpsblast output files. Disregard all domain hits that do not cover at least 40% of the length of the conserved domain.
    Starting with the most significant hits to a query protein, accept new domain matches that do not overlap previously accepted domain matches by more than 15 amino acids.
    Output: cdd_file (3092 CDD groups)
    2942 are selected (removed groups with < 4 sequences or > 70 sequences).
  2. Prepare sequences for RNA motif search

    Script that performs all sequence preprocessing: preproc.pl
    Output: Sequence files in fasta format. Download the sequences here
    1. Retrieve upstream sequences
      Input: CDD groups identified from step 1.
      Local microbe genome database based on NCBI Microbial Genome Database
      Tool: Microfootprinter . Microfootprinter is the front end of Footprinter (version 2.0) .
      Command: microfootprinter -skipduplicatechecks 1 -grouped cdd_group -auxfilesonly 1 -subdir cdd_dir 600
      Output: CLUSTALW protein sequence alignments, and corresponding guided trees (as approximation of phylogenetic trees)
      Upstream sequences of genes in each CDD groups
    2. Identify matches to Rfam.
      Input: Upstream sequences collected from step 2.1
      Rfam.fasta: Fasta sequences of Rfam members filtered to <90% identity (Rfam 7.0)
      Tool: WU-BLAST
      Command: blastn Rfam.fasta infile -W 20 -noseqs > blast_result
      Postprocessing: Parse blast output, select hits with e_value < 0.001, identifiy > 90%.
      The length of selected hits should be greater than 85% of the matching RNA.
    3. Remove tRNAs and rRNAs from collected sequences.
    4. Remove highly similar sequences from each dataset.
      Highly similar sequences biase CMfinder to predict motif that are only specific to these sequences.
      Therefore, for each pair of highly similar sequences, we just keep one of them. To do this, for each sequence dataset, we perform pairwise blast search:

      xdformat -n seq_file
      blastn seq_file seq_file -notes -top -W 8 -noseqs

      Starting from the first sequence in the file, if it has not been removed yet, remove any matching sequence
      if this matching sequence has e_value < 0.001, identifiy > 95% and matching length exceeds 80% of the target sequence length.
  3. Footprinter Ranking

    Input: Output from step 2.1 (upstream sequences of proteins, and protein sequences for constructing phylogeny tree).
    Tool: Microfootprinter .
    Command: microfootprinter -clade clade -batch acc.pid -numblast 100 -interactivehtml 0 -maxregionsize 600 -highlight 1 -mask 1 -makehtmllinks 1
  4. RNA motif search using CMfinder (Parallized)

    Input: Preprocessed upstream sequences from step 2.
    Tool: CMfinder (version 0.2)
    Command: cmfinder.pl -n 5 -m 30 -M 100 -f 0.4 seq_file
    cmfinder.pl -n 5 -s 2 -m 30 -M 100 -f 0.4 seq_file
    CombMotif.pl seq_file seq_file.motif.h
    Output: 35975 motifs (in stockholm format)
  5. Motif postprocessing

    Input: Motifs produced by step 4.
    Script that performs motif postprocessing: postproc.pl
    1. Filtering motif instances
      Remove motif instances with CM score < 10 bits.
      Script: filter.pl
      Command: filter.pl -s -lt 10 motif_file motif_file

      Remove duplicate motif instances with identical sequences
      Script: rm_dup_instance.pl
      Command: rm_dup_instances.pl in_motif_file out_motif_file
      We remove duplicate motif instances to avoid the problem with repeat elements.

    2. Rank motifs
      Compute motif composite scores (see details here ), and rank motifs accordingly.
      Tool: R statistical package is used.
      Script: rank_cmfinder.pl
      Command: rank_cmfinder.pl -w -rank motif_rank_file

    3. Select motif.
      For each dataset, select up to 4 motifs starting from the highest ranking motif.
      A lower ranking motif that overlaps significantly with a selected higher ranking motif will not be selected.
      Script: select_cmfinder.pl
      Comand: select_cmfinder.pl motif_dir selected_dir motif_rank_file > selected_rank_file

    4. Cluster motifs.
      We cluster similar motifs according to the genomic coordinates of motif instances.
      One motif is grouped with another if at least half of its members overlap, and the overlapped regions are longer than half of the motif length.
      Script: merge_all_motif.pl
      Command: merget_all_motif.pl -rank selected_rank_file selected_dir

    5. Product HTML table for ranked motif clusters
      Script: parse_rank_file.pl
      Command: parse_rank_file.pl -cdd cdd_file selected_dir selected_rank_file > index.html

  6. Scan motif homologs using RaveNna (Parallized)

    Input: Intergenic sequences of Fimicutes (extendy by 50 nucleotides to account for gene annotation errors)
    CMfinder motifs from step 5.
    Tool: RaveNnA
    Configure RavNnA: Add the following to data/ravenna.config.tab to configure scanning sequence database: (see Chapter 3 of RavNnA manual) :

    db miniscan-firmicute seqs/IGR.fasta.gz data/file-with-two-lines fullseq projection-of refseq14-microbial

    (The above fields are separated by Tab)
    This configures a database called 'miniscan-firmicute'. Explanation of major fields is described below:
    seqs/IGR.gz : a FASTA sequence file containing all IGRs (extended by 50 nucleotides) in the Refseq17 genomes
    data/file-with-two-lines: file with two line feeds ("\n"), to replace potential taxomomy file which we did not use.
    fullseq: select the heuristic HMM threshold based on the actual sequences in seqs/IGR.fasta.gz
    projection-of refseq17-microbial: used to extend the predicted hits by 50 nucleotides on either end, in the hopes that CMfinder can find additional structure. Extension is performed within the full database to allow bleeding into annotated ORFs. This process is dictated by the "-extendedSeqs 50,50" flag passed to ravenna.pl in the miniscans.
    Command: perl ravenna.pl -global -stoFileName motif_file -infernalCM -database miniscan-firmicute -workDir cmfinder/miniscans -windowLenFromStockholm 1.1,80 -extendedSeqs 50,50 -knownHitsFromGivenSto
    Output: Scanning results in .csv format. Download
  7. Refine motif using RaveNna scanning results (Parallized)

    Input: CMfinder motifs from step 5
    RaveNnA scanning results from step 6
    Preprocessing: Parse RaveNnA scanning results into fasta format of the hit sequences
    Tool: CMfinder (version 0.2)
    Command: cmfinder -a motif_file -o new_motif_file scan_sequence_file new_cm_file
    Postprocessing: Same as step 5

Last Updated: 04/29/2007