Software

StatSigMA: Statistical Significance of Multiple Alignments
StatSigMA computes the statistical significance of multiple sequence alignments (of either nucleotide or amino acid sequences), much as BLAST's E-values provide statistical significance for pairwise alignments.

If you use this software for your publications, please read and cite:

  • A. Prakash, M. Tompa, "Assessing the discordance of multiple sequence alignments", IEEE/ACM Trans Comput Biol Bioinform, vol. 6 (2009) 542-51. Pubmed 19875854.
StatSigMA-w: Statistical Significance of Whole-Genome Multiple Alignments
Given any multiple sequence alignment and a phylogeny of the aligned sequences, StatSigMA-w assesses the accuracy of the alignment and identifies suspiciously aligned regions.

If you use this software for your publications, please read and cite:

  • X. Chen, M. Tompa, "Comparative assessment of methods for aligning multiple genome sequences", Nat. Biotechnol., vol. 28 (2010) 567-72. Pubmed 20495551.   Supplement.
  • A. Prakash, M. Tompa, "Measuring the accuracy of genome-size multiple alignments", Genome Biol., vol. 8 (2007) R124. Pubmed 17594489.   Supplement.
MSS: Finding all Maximal Scoring Subsequences
MSS is a practical, linear time algorithm to find, in a sequence of numeric scores, those nonoverlapping, contiguous subsequences having greatest total scores.
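
As an illustration of the problem MSS solves, here is a minimal Python sketch of a left-to-right scan over cumulative scores, in the spirit of the Ruzzo and Tompa paper cited below. It is not the MSS source code. The function name and the (start, end, total) output format are assumptions made for this example, and the plain right-to-left search used here omits the extra bookkeeping needed for a linear worst-case bound.

```python
def maximal_scoring_subsequences(scores):
    """Return a list of (start, end, total) tuples, with `end` exclusive.

    Hypothetical helper for illustration only; not part of the MSS package.
    """
    # Each candidate is [start, end, L, R], where L is the cumulative score
    # just before the candidate and R is the cumulative score at its end.
    candidates = []
    total = 0
    for i, score in enumerate(scores):
        before = total
        total += score
        if score <= 0:
            continue  # non-positive scores never start a candidate
        current = [i, i + 1, before, total]
        while True:
            # Find the rightmost earlier candidate j with L_j < L_current.
            j = len(candidates) - 1
            while j >= 0 and candidates[j][2] >= current[2]:
                j -= 1
            if j < 0 or candidates[j][3] >= current[3]:
                # No such j, or j already ends at least as high: keep both.
                candidates.append(current)
                break
            # Otherwise extend current leftward to absorb candidates j..end,
            # then re-examine the merged candidate.
            current = [candidates[j][0], current[1], candidates[j][2], current[3]]
            del candidates[j:]
    return [(start, end, r - l) for start, end, l, r in candidates]


if __name__ == "__main__":
    example = [4, -5, 3, -3, 1, 2, -2, 2, -2, 1, 5]
    for start, end, subtotal in maximal_scoring_subsequences(example):
        print(example[start:end], subtotal)
```

Each reported subsequence is given as a half-open index range into the input list, so `scores[start:end]` recovers its elements.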

If you use this software for your publications, please read and cite:

  • W. Ruzzo, M. Tompa, "A linear time algorithm for finding all maximal scoring subsequences", Proc Int Conf Intell Syst Mol Biol, (1999) 234-41. Pubmed 10786306.
Dapple: Image analysis software for DNA microarrays
Dapple is a program for quantitating spots on a two-color DNA microarray image. Given a pair of images from a comparative hybridization, Dapple finds the individual spots on the image, evaluates their qualities, and quantifies their total fluorescent intensities.

If you use this software for your publications, please read and cite:

  • J. Buhler, T. Ideker, D. Haynor, "Dapple: Improved Techniques for Finding Spots on DNA Microarrays", University of Washington Department of Computer Science & Engineering Technical Report UW-CSE-2000-08-05, (2000)   Supplement.
FootPrinter: A program for phylogenetic footprinting
Phylogenetic footprinting is a method that identifies putative regulatory elements in DNA sequences. It identifies regions of DNA that are unusually well conserved across a set of orthologous sequences.

If you use this software for your publications, please read and cite:

  • M. Blanchette, B. Schwikowski, M. Tompa, "Algorithms for phylogenetic footprinting", J. Comput. Biol., vol. 9 (2002) 211-23. Pubmed 12015878.
  • M. Blanchette, M. Tompa, "Discovery of regulatory elements by a computational method for phylogenetic footprinting", Genome Res., vol. 12 (2002) 739-48. Pubmed 11997340.   Supplement.
  • M. Blanchette, M. Tompa, "FootPrinter: A program designed for phylogenetic footprinting", Nucleic Acids Res., vol. 31 (2003) 3840-2. Pubmed 12824433.
MicroFootPrinter: A microbial front end for FootPrinter
MicroFootPrinter is a front end to the FootPrinter phylogenetic footprinting program, with a specific focus on prokaryotic genomes. You supply a prokaryotic species and gene of interest. MicroFootPrinter then finds related prokaryotes, each containing a homologous gene, and runs FootPrinter to identify motifs in the regulatory region of your chosen gene that are well conserved across these homologous genes.

If you use this software for your publications, please read and cite:

  • S. Neph, M. Tompa, "MicroFootPrinter: a tool for phylogenetic footprinting in prokaryotic genomes", Nucleic Acids Res., vol. 34 (2006) W366-8. Pubmed 16845027.   Supplement.
PhyME: Motif discovery in data sets that include both intraspecies overrepresentation and interspecies conservation
PhyME discovers motifs by integrating two important aspects of the motif's significance, overrepresentation and interspecies conservation, into one probabilistic score. The algorithm is based on multiple alignment and expectation-maximization.

If you use this software for your publications, please read and cite:

  • S. Sinha, "PhyME: a software tool for finding motifs in sets of orthologous sequences", Methods Mol. Biol., vol. 395 (2007) 309-18. Pubmed 17993682.
  • S. Sinha, M. Blanchette, M. Tompa, "PhyME: a probabilistic algorithm for finding motifs in sets of orthologous sequences", BMC Bioinformatics, vol. 5 (2004) 170. Pubmed 15511292.
Projection: A motif discovery program based on random projections

If you use this software for your publications, please read and cite:

  • J. Buhler, "Provably sensitive indexing strategies for biosequence similarity search", J. Comput. Biol., vol. 10 (2003) 399-417. Pubmed 13677335.   Supplement.
  • J. Buhler, M. Tompa, "Finding motifs using random projections", J. Comput. Biol., vol. 9 (2002) 225-42. Pubmed 12015879.   Supplement.
YMF and FindExplanators: An enumerative motif discovery program
YMF identifies motifs (made of IUPAC symbols) that occur unusually often in a given set of sequences. FindExplanators extracts from that set of motifs a smaller set of independent motifs.

If you use this software for your publications, please read and cite:

  • M. Blanchette, S. Sinha, "Separating real motifs from their artifacts", Bioinformatics, vol. 17 Suppl 1 (2001) S30-8. Pubmed 11472990.   Supplement.
  • S. Sinha, M. Tompa, "A statistical method for finding transcription factor binding sites", Proc Int Conf Intell Syst Mol Biol, vol. 8 (2000) 344-54. Pubmed 10977095.   Supplement.
  • S. Sinha, M. Tompa, "Discovery of novel transcription factor binding sites by statistical overrepresentation", Nucleic Acids Res., vol. 30 (2002) 5549-60. Pubmed 12490723.   Supplement.
  • S. Sinha, M. Tompa, "YMF: A program for discovery of novel transcription factor binding sites by statistical overrepresentation", Nucleic Acids Res., vol. 31 (2003) 3586-8. Pubmed 12824371.   Supplement.
Quip: Lossless compression of FASTQ, SAM and BAM files, with or without a reference genome
Quip is a lossless compression algorithm for next-generation sequencing data. It combines statistical compression with reference-based and assembly-based methods to compress datasets in FASTQ and SAM/BAM formats to less than 15% of their original size.

If you use this software for your publications, please read and cite:

  • D. Jones, W. Ruzzo, X. Peng, M. Katze, "Compression of next-generation sequencing reads aided by highly efficient de novo assembly", (Submitted)
SeqBias: Sequence-dependent bias correction for RNA-Seq experiments
SeqBias is an R/Bioconductor package that models and corrects for per-position sequencing bias in RNA-Seq experiments using a simple Bayesian network, increasing the accuracy of quantification. The method includes strong theoretical protection against false discovery of bias.

If you use this software for your publications, please read and cite:

  • D. Jones, W. Ruzzo, X. Peng, M. Katze, "A new approach to bias correction in RNA-Seq", Bioinformatics, vol. 28 (2012) 921-8. Pubmed 22285831.
CMfinder: A covariance model based RNA motif finding algorithm
CMfinder is a tool to predict RNA motifs in unaligned sequences. It is an expectation maximization algorithm using covariance models for motif description, featuring novel integration of multiple techniques for effective search of motif space, and a Bayesian framework that blends mutual information-based and folding energy-based approaches to predict structure in a principled way.

If you use this software for your publications, please read and cite:

  • Z. Yao, Z. Weinberg, W. Ruzzo, "CMfinder--a covariance model based RNA motif finding algorithm", Bioinformatics, vol. 22 (2006) 445-52. Pubmed 16357030.   Supplement.
Multiperm: Shuffling multiple sequence alignments while approximately preserving dinucleotide frequencies
Assessing the statistical significance of structured RNA predicted from multiple sequence alignments relies on the existence of a good null model. Multiperm is a random shuffling algorithm that preserves not only the gap and local conservation structure in alignments of arbitrarily many sequences, but also the mono- and approximate dinucleotide frequencies. The latter characteristics have important effects on the predicted thermodynamic stability of RNA structures.

If you use this software for your publications, please read and cite:

  • P. Anandam, E. Torarinsson, W. Ruzzo, "Multiperm: shuffling multiple sequence alignments while approximately preserving dinucleotide frequencies", Bioinformatics, vol. 25 (2009) 668-9. Pubmed 19136551.   Supplement.
RaveNnA: Faster Search for Non-coding RNA Families Without Loss of Accuracy
Non-coding RNAs (ncRNAs) are functional RNA molecules that do not code for proteins. Covariance Models (CMs) are a useful statistical tool to find new members of an ncRNA gene family in a large genome database, using both sequence and, importantly, RNA secondary structure information. Unfortunately, CM searches are slow. The RaveNnA software package makes CMs faster while provably sacrificing none of their accuracy (or faster still with little loss in sensitivity, depending on parameter settings).

If you use this software for your publications, please read and cite:

  • Z. Weinberg, W. Ruzzo, "Exploiting conserved structure for faster annotation of non-coding RNAs without loss of accuracy", Bioinformatics, vol. 20 Suppl 1 (2004) i334-41. Pubmed 15262817.   Supplement.
  • Z. Weinberg, W. Ruzzo, "Faster Genome Annotation of Non-coding RNA Families Without Loss of Accuracy", Eighth Annual International Conference on Research in Computational Molecular Biology (RECOMB 2004), (2004) pp 243-251.   Supplement.
  • Z. Weinberg, W. Ruzzo, "Sequence-based heuristics for faster annotation of non-coding RNA families", Bioinformatics, vol. 22 (2006) 35-9. Pubmed 16267089.   Supplement.