I provide the raw match data from my experiments in three forms: brief, unfiltered, and verbose.
The brief output provides a terse listing of the significant matches in the data. The list of matches has been filtered for significance as described in the paper, using ungapped Karlin-Altschul statistics.
A brief output file contains a series of lines, with one match listed per line. Each line contains a series of space-delimited fields of the form
<seq1> <strand1> <start1> <seq2> <strand2> <start2> <length> <nSubsts> <maxExact>
For example, a typical match might look like
1 f 603258 2 r 1055272 62 14 15
The fields have the following meanings:
The unfiltered match list is identical to the brief list, except that matches of low significance by the Karlin-Altschul criterion have not been removed. Use this list if you want to take the matches as seeds for Smith-Waterman expansion.
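As a minimal sketch, a line from the brief or unfiltered list can be split into the nine named fields shown above. The names `BriefMatch` and `parse_brief_line` are hypothetical, and the sketch assumes that sequence identifiers are integers, as in the example line; the strand codes are kept as raw strings.

```python
from typing import NamedTuple


class BriefMatch(NamedTuple):
    """One line of the brief or unfiltered match list (hypothetical container)."""
    seq1: int
    strand1: str
    start1: int
    seq2: int
    strand2: str
    start2: int
    length: int
    n_substs: int
    max_exact: int


def parse_brief_line(line: str) -> BriefMatch:
    """Split a space-delimited match line into its nine fields."""
    fields = line.split()
    if len(fields) != 9:
        raise ValueError(f"expected 9 fields, got {len(fields)}: {line!r}")
    seq1, strand1, start1, seq2, strand2, start2, length, n_substs, max_exact = fields
    # Assumption: every field except the two strand codes is an integer.
    return BriefMatch(int(seq1), strand1, int(start1),
                      int(seq2), strand2, int(start2),
                      int(length), int(n_substs), int(max_exact))


match = parse_brief_line("1 f 603258 2 r 1055272 62 14 15")
print(match)
```

Because the unfiltered list uses the same layout, the same function should apply to both files.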
The verbose output contains the same matches as the brief output, presented in a human-readable form similar to the output of BLAST. A typical match looks like:
Match #4: 1 [75616..75689] 2 [146354..146427]
Identities = 53/74 (72%)
Strands shown = +/+
 75616 ttttcagggccagcttcacctcttggttccgcagagtgtagataaggggg 75665
       |:||||:|||   |||||: ||:| :||||  |::|||||||| | ||||
146354 tcttcaaggcactcttcatgtcctcattcctaaaggtgtagattatgggg 146403
 75666 ttgaggaaaggagtgatggccgtg 75689
       || || |:|||:|| |||||||||
146404 ttcagcagaggggttatggccgtg 146427
Matches in the file are numbered consecutively starting at 1. Each match begins by giving the two matching intervals in the input sequence(s). Sequence numbers are given as in the brief form. Intervals are always given with respect to the forward strand; if the match is on the reverse-complement strand of a sequence, the interval for that sequence will be reversed so that the larger index comes first.
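The header line of a verbose match can be read back into intervals with a short sketch like the following. The regular expression is inferred from the single example header above, and the names `Interval` and `parse_header` are hypothetical. A reversed interval (larger index first) is taken to mean the reverse-complement strand, as described; an interval of length one cannot be classified this way and is reported as forward.

```python
import re
from typing import NamedTuple, Tuple

# Pattern inferred from the example "Match #4: 1 [75616..75689] 2 [146354..146427]".
HEADER_RE = re.compile(
    r"Match #(\d+):\s+(\d+)\s+\[(\d+)\.\.(\d+)\]\s+(\d+)\s+\[(\d+)\.\.(\d+)\]"
)


class Interval(NamedTuple):
    seq: int        # sequence number, as in the brief form
    low: int        # smaller forward-strand coordinate
    high: int       # larger forward-strand coordinate
    reverse: bool   # True if the interval was written larger index first


def parse_header(line: str) -> Tuple[int, Interval, Interval]:
    """Return (match number, first interval, second interval)."""
    m = HEADER_RE.match(line.strip())
    if m is None:
        raise ValueError(f"not a match header: {line!r}")
    number, seq1, a1, b1, seq2, a2, b2 = map(int, m.groups())

    def interval(seq: int, a: int, b: int) -> Interval:
        return Interval(seq, min(a, b), max(a, b), reverse=a > b)

    return number, interval(seq1, a1, b1), interval(seq2, a2, b2)


header = parse_header("Match #4: 1 [75616..75689] 2 [146354..146427]")
print(header)
```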
In the alignment display, a matching nucleotide pair is indicated by a vertical bar "|", a transition by a colon ":", and a transversion by a space.
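The marker line can be recomputed from the two aligned sequences. The sketch below is an illustration of the rule just stated, not the program that produced the files; `match_line` is a hypothetical name, and the input is assumed to be ungapped and to contain only the letters a, c, g, and t.

```python
PURINES = frozenset("ag")
PYRIMIDINES = frozenset("ct")


def match_line(top: str, bottom: str) -> str:
    """Build the marker line between two ungapped, equal-length sequences."""
    if len(top) != len(bottom):
        raise ValueError("sequences must have the same length")
    marks = []
    for a, b in zip(top.lower(), bottom.lower()):
        if a == b:
            marks.append("|")      # identical nucleotides
        elif {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
            marks.append(":")      # transition: a<->g or c<->t
        else:
            marks.append(" ")      # transversion: purine <-> pyrimidine
    return "".join(marks)


# Second block of the example match shown above.
line = match_line("ttgaggaaaggagtgatggccgtg", "ttcagcagaggggttatggccgtg")
print(line)
```

Counting the "|" characters over both blocks of a match gives the numerator of its Identities line.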
For the TCR alpha/delta alignment, I provide annotation data for the human and mouse sequences. This data is again provided as a series of lines, one per annotated feature. Each line is a series of fields of the form
<name> <type> <strand> <start> <end>
A typical line might look like
"Alpha Constant exon 1" "Gene" F 1054091 1054363
Fields on a line are separated by single spaces. Feature names with embedded spaces are enclosed in double quotes (""). The fields have the following meanings:
The annotations have been preserved with minimal changes from the form supplied by my colleagues.
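Because feature names may be quoted and contain spaces, an annotation line cannot be split on whitespace alone. One possible approach, sketched here, uses Python's standard `shlex.split`, which honours double quotes. The names `Feature` and `parse_annotation_line` are hypothetical, and the sketch assumes that names contain no apostrophes or backslashes, which `shlex` would treat specially.

```python
import shlex
from typing import NamedTuple


class Feature(NamedTuple):
    """One annotated feature (hypothetical container)."""
    name: str
    kind: str     # the <type> field
    strand: str
    start: int
    end: int


def parse_annotation_line(line: str) -> Feature:
    """Split an annotation line, keeping quoted names as single fields."""
    fields = shlex.split(line)
    if len(fields) != 5:
        raise ValueError(f"expected 5 fields, got {len(fields)}: {line!r}")
    name, kind, strand, start, end = fields
    return Feature(name, kind, strand, int(start), int(end))


feature = parse_annotation_line('"Alpha Constant exon 1" "Gene" F 1054091 1054363')
print(feature)
```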