A Guide to Interpreting the Match Data

I provide the raw match data from my experiments in three forms: brief, unfiltered, and verbose.

Brief Form

The brief output provides a terse listing of the significant matches in the data. The list of matches has been filtered for significance as described in the paper, using ungapped Karlin-Altschul statistics.

A brief output file contains a series of lines, with one match listed per line. Each line contains a series of space-delimited fields of the form

<seq1> <strand1> <start1> <seq2> <strand2> <start2> <length> <nSubsts> <maxExact>
For example, a typical match might look like
1 f 603258 2 r 1055272 62 14 15
The fields have the following meanings:
seq1, seq2
numbers of the two sequences (1 and 2 if two sequences are compared, 1 and 1 if a sequence is compared to itself). In human-mouse comparisons, sequence 1 is always from human.
strand1, strand2
denote the strands on which the match occurs for sequences one and two respectively. Each field is "f" if the match is on the forward strand, or "r" if it is on the reverse-complement strand relative to the direction of the sequence in the file.
start1, start2
offsets of the match with respect to the beginnings of the two sequences. The sequences are indexed starting at 1. If a match occurs on the reverse-complement strand of a sequence, then the start is given with respect to that strand; that is, base 1 on the reverse-complement strand corresponds to the highest-numbered base on the forward strand.
length
length of the match in bases
nSubsts
number of substitutions occurring over the length of the match
maxExact
length of the longest unbroken string of matching nucleotides found in the match
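
As an illustration of the field layout above, here is a minimal Python sketch of a parser for one line of brief output. The names BriefMatch, parse_brief_line, and n_identities are hypothetical and not part of any distributed tool. The n_identities helper assumes that, because the matches are ungapped, every position is either a match or a substitution.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BriefMatch:
    """One line of brief (or unfiltered) output; names are illustrative."""
    seq1: int
    strand1: str   # "f" or "r"
    start1: int    # 1-based, relative to strand1
    seq2: int
    strand2: str   # "f" or "r"
    start2: int    # 1-based, relative to strand2
    length: int
    n_substs: int
    max_exact: int

    @property
    def n_identities(self) -> int:
        # Assumption: ungapped match, so identities = length - substitutions.
        return self.length - self.n_substs


def parse_brief_line(line: str) -> BriefMatch:
    """Split a space-delimited match line into its nine typed fields."""
    fields = line.split()
    if len(fields) != 9:
        raise ValueError(f"expected 9 fields, got {len(fields)}: {line!r}")
    seq1, strand1, start1, seq2, strand2, start2, length, n_substs, max_exact = fields
    for strand in (strand1, strand2):
        if strand not in ("f", "r"):
            raise ValueError(f"unknown strand {strand!r} in line: {line!r}")
    return BriefMatch(
        int(seq1), strand1, int(start1),
        int(seq2), strand2, int(start2),
        int(length), int(n_substs), int(max_exact),
    )


# The example line from this guide.
match = parse_brief_line("1 f 603258 2 r 1055272 62 14 15")
print(match.strand2, match.start2, match.n_identities)
```

Since the unfiltered form uses the same line layout, the same sketch should apply to it unchanged.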

Unfiltered Form

The unfiltered match list is identical to the brief list, except that matches of low significance by the Karlin-Altschul criterion have not been removed. Use this list if you want to take the matches as seeds for Smith-Waterman expansion.

Verbose Form

The verbose output contains the same matches as the brief output, presented in a human-readable form similar to the output of BLAST. A typical match looks like:

Match #4: 1 [75616..75689] 2 [146354..146427] 

     Identities    = 53/74 (72%)
     Strands shown = +/+

        75616 ttttcagggccagcttcacctcttggttccgcagagtgtagataaggggg    75665
              |:||||:|||   |||||: ||:| :||||  |::|||||||| | ||||
       146354 tcttcaaggcactcttcatgtcctcattcctaaaggtgtagattatgggg   146403

        75666 ttgaggaaaggagtgatggccgtg    75689
              || || |:|||:|| |||||||||
       146404 ttcagcagaggggttatggccgtg   146427

Matches in the file are numbered consecutively starting at 1. Each match begins by giving the two matching intervals in the input sequence(s). Sequence numbers are given as in the brief form. Intervals are always given with respect to the forward strand; if the match is on the reverse-complement strand of a sequence, the interval for that sequence will be reversed so that the larger index comes first.
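
To relate the strand-relative starts of the brief form to the forward-strand intervals shown here, the following Python sketch converts one into the other. The function name forward_interval is hypothetical, and the sequence length seq_length is an assumed input: it does not appear in the match files and must be obtained from the sequence itself. The conversion relies only on the rule that base 1 of the reverse-complement strand is the highest-numbered base of the forward strand.

```python
from typing import Tuple


def forward_interval(strand: str, start: int, length: int,
                     seq_length: int) -> Tuple[int, int]:
    """Return the (first, last) forward-strand indices of a match.

    strand     -- "f" or "r", as in the brief form
    start      -- 1-based start offset on the given strand
    length     -- match length in bases
    seq_length -- total sequence length (assumed known to the caller)
    """
    if strand == "f":
        return start, start + length - 1
    if strand == "r":
        # Position p on the reverse-complement strand is forward
        # position seq_length - p + 1, so the interval runs downward
        # and the larger index comes first.
        first = seq_length - start + 1
        return first, first - length + 1
    raise ValueError(f"unknown strand {strand!r}")


# A forward-strand match of length 74 starting at 75616; the sequence
# length used here is made up for the example.
print(forward_interval("f", 75616, 74, 200000))
# A reverse-complement match in a toy sequence of length 10.
print(forward_interval("r", 3, 4, 10))
```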

In the alignment display, a matching nucleotide pair is indicated by a vertical bar "|", a transition by a colon ":", and a transversion by a space.
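
The marker convention can be sketched in a few lines of Python. This is only an illustration, not the code that produced the files; the names marker and marker_line are hypothetical. It uses the standard definition of a transition (a substitution between the two purines, a and g, or between the two pyrimidines, c and t) and treats every other mismatch, including any ambiguity codes, as a transversion.

```python
PURINES = frozenset("ag")
PYRIMIDINES = frozenset("ct")


def marker(a: str, b: str) -> str:
    """Return the display character for one aligned nucleotide pair."""
    a, b = a.lower(), b.lower()
    if a == b:
        return "|"   # match
    pair = {a, b}
    if pair <= PURINES or pair <= PYRIMIDINES:
        return ":"   # transition: a<->g or c<->t
    return " "       # anything else is shown as a transversion


def marker_line(row1: str, row2: str) -> str:
    """Build the middle line of an alignment block from its two rows."""
    if len(row1) != len(row2):
        raise ValueError("aligned rows must have equal length (ungapped)")
    return "".join(marker(a, b) for a, b in zip(row1, row2))


# First ten columns of the example alignment above.
print(repr(marker_line("ttttcagggc", "tcttcaaggc")))
```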

Annotations

For the TCR alpha/delta alignment, I provide annotation data for the human and mouse sequences. This data is again provided as a series of lines, one per annotated feature. Each line is a series of fields of the form

<name> <type> <strand> <start> <end>
A typical line might look like
"Alpha Constant exon 1" "Gene" F 1054091 1054363
Fields on a line are separated by single spaces. Feature names with embedded spaces are enclosed in double quotes (""); as the example shows, the type field may be quoted as well. The fields have the following meanings:
name
a name for the feature
type
the type of the feature
strand
the strand on which the feature occurs: "F" for forward, "C" for reverse-complement.
start, end
the boundaries of the feature on the forward strand.
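
Because quoted fields may contain spaces, a plain split on whitespace is not enough for annotation lines. A minimal Python sketch using the standard-library shlex module follows. The names Annotation and parse_annotation_line are hypothetical, and the sketch assumes that feature names contain no backslashes or unquoted apostrophes, which shlex would treat specially.

```python
import shlex
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    """One annotated feature; names are illustrative."""
    name: str
    feature_type: str
    strand: str   # "F" for forward, "C" for reverse-complement
    start: int    # boundaries on the forward strand
    end: int


def parse_annotation_line(line: str) -> Annotation:
    """Parse one annotation line, honoring double-quoted fields."""
    fields = shlex.split(line)   # strips the quotes, keeps embedded spaces
    if len(fields) != 5:
        raise ValueError(f"expected 5 fields, got {len(fields)}: {line!r}")
    name, feature_type, strand, start, end = fields
    if strand not in ("F", "C"):
        raise ValueError(f"unknown strand {strand!r} in line: {line!r}")
    return Annotation(name, feature_type, strand, int(start), int(end))


# The example line from this guide.
feature = parse_annotation_line(
    '"Alpha Constant exon 1" "Gene" F 1054091 1054363'
)
print(feature.name, feature.feature_type, feature.end - feature.start + 1)
```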

The annotations have been preserved with minimal changes from the form supplied by my colleagues.


Jeremy Buhler (jbuhler@cs.washington.edu)
Last Update: 1/19/2001