Supplementary Material

This web site supplements my article, "Efficient Large-Scale Sequence Comparison by Locality-Sensitive Hashing," published in the May 2001 issue of the journal Bioinformatics. Here you will find the assembled, masked sequences, the lists of matches, and (where possible) the annotations for the sequences I used in that article.

If you use the match data below in your own research, please cite my article as follows:

Buhler, Jeremy (2001). "Efficient Large-Scale Sequence Comparison by Locality-Sensitive Hashing." Bioinformatics 17:419-429.

If you are interested in using my implementation of the lsh-all-pairs algorithm, please send email to

About the Supplementary Data

For each of the three experiments described in the paper, I have provided the sequences used and the list of matches found by the lsh-all-pairs algorithm. For the TCR alpha/delta locus, I also provide feature annotations, including V-segments.

For each experiment, you can download the raw input sequences in FASTA format, the same sequences with interspersed repeats masked out (using A. Smit's RepeatMasker 04042000 with default settings), and the match data in several forms. Before using the match data, you should first read the guide to interpreting it.
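As a quick sanity check after downloading, you may want to see how much of each sequence was masked. The Python sketch below is not part of the lsh-all-pairs distribution; the function names read_fasta and masked_fraction are hypothetical. It assumes the masked files mark repeats with the character N, which is RepeatMasker's default behavior. Any N already present in the raw sequence (for example, an unknown base) is counted too, so treat the result as an upper bound.

```python
from typing import Dict, Iterable


def read_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Parse FASTA-formatted lines into {record name: sequence}.

    The record name is the first whitespace-delimited token after '>'.
    """
    records: Dict[str, str] = {}
    name = None
    chunks = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if name is not None:
                records[name] = "".join(chunks)
            parts = line[1:].split()
            name = parts[0] if parts else ""
            chunks = []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        records[name] = "".join(chunks)
    return records


def masked_fraction(seq: str, mask_chars: str = "Nn") -> float:
    """Return the fraction of positions in seq that are mask characters.

    Assumption: masked positions are written as N (or n).
    """
    if not seq:
        return 0.0
    return sum(seq.count(c) for c in mask_chars) / len(seq)
```

To use it on a downloaded file, pass an open file handle, e.g. read_fasta(open(path)), where path is wherever you saved the masked FASTA file.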

Beta-Globin Locus Control Region

The human and mouse beta-globin locus control regions can be found in GenBank. The human sequence is the 5'-most portion of sequence U01317. The mouse locus control region is sequence Z13985.

T-cell Receptor Alpha/Delta Locus

The sequences here were assembled from pieces found in GenBank; see the paper for the fragment accession numbers. Thanks to Jared Roach and Cecilie Boysen for their help in assembling and providing annotations for these sequences.

Although the paper describes only matches between the forward strands of the two sequences, these files also include matches between the forward strand of the human sequence and the reverse-complement strand of the mouse sequence.
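If you want to inspect a reverse-strand match by eye, you will need the reverse complement of the mouse sequence. The Python sketch below shows one way to compute it; the function name reverse_complement is hypothetical and is not taken from my implementation. It assumes the sequence contains only A, C, G, T, and N (in either case), and leaves any other character unchanged. It does not address how coordinates are reported in the match files; for that, consult the guide to interpreting the match data.

```python
# Translation table mapping each base to its complement.
# N (masked or unknown) is its own complement; case is preserved.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]
```

Applying the function twice returns the original sequence, which makes for an easy check that a file was read correctly.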

Human Chromosome 22

I used the chromosome 22 sequence published by the Sanger Centre. You can find the original, unmasked sequence and feature annotations on their chromosome 22 web site. The files below correspond to the Sanger sequence release of 5-19-2000.

Note that the masked version of chromosome 22 presented here is not the same as the one found on the Sanger site. I did the masking myself.

Jeremy Buhler
Last Update: 5/1/2001