A Program for Comparing a DNA sequence against a protein sequence database
(DPS, DNA Protein Search)

copyright (c) 1996 Xiaoqiu Huang
E-mail: huang@cs.mtu.edu

No part of this program may be redistributed or used without permission
of the author. Please contact the author at huang@cs.mtu.edu for any
requests for the package. The author does not offer any warranties or
representation, nor does the author accept any liabilities with respect
to the program.

    Xiaoqiu Huang
    Department of Computer Science
    Michigan Technological University
    Houghton, MI 49931

Proper attribution of the author as the source of the software would be
appreciated:

    Huang, X. (1996)
    Fast Comparison of a DNA Sequence with a Protein Sequence Database.
    Microbial & Comparative Genomics, 1(4): 281-291.

Acknowledgments

I thank the following people for discussions and suggestions:
Mark Adams, Tony Kerlavage, Brendan Loftus, Steve Rounsley,
Granger Sutton, and Jinghui Zhang.
The integration of DPS with NAP was performed at TIGR.

The DPS program compares a DNA sequence to a protein database. DPS
enhances the existing methods by addressing the problems of frameshifts
and introns. DPS computes high-scoring chains of segment pairs, where
segment pairs in a chain can be from different reading frames and there
can be an intervening DNA sequence between adjacent segment pairs in a
chain.

The sensitivity of the program depends on the word size parameter W.
Increased sensitivity comes at the expense of decreased speed. A hit is
an exact coding of the amino acid word in the DNA sequence. Each hit is
extended in both directions. The extension stops if the score drops by
more than the distance D.

For a segment pair s, let nstart(s) and nend(s) denote the starting and
ending positions of the DNA segment in the DNA sequence, let astart(s)
and aend(s) denote the starting and ending positions of the protein
segment in the protein sequence, and let score(s) denote the score of s.
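
The hit extension step described earlier can be illustrated with a
minimal sketch in C. It is not taken from the DPS source. It assumes
that the drop is measured from the best score seen so far during the
extension, and that the substitution scores of the aligned
codon/amino-acid pairs along one diagonal are already available in an
array; the function name extend and the array pair_score are
hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Extend from index `from` in direction `step` (+1 = right, -1 = left)
 * along one diagonal. pair_score[i] is assumed to hold the substitution
 * score of the i-th codon/amino-acid pair on that diagonal (a
 * hypothetical, precomputed array). The extension stops once the running
 * score falls more than D below the best score seen so far.
 * Returns the best score gain; *len receives the number of positions
 * that belong to the best-scoring extension.
 */
static int extend(const int *pair_score, int n, int from, int step,
                  int D, int *len)
{
    int run = 0, best = 0, best_len = 0, steps = 0;

    assert(pair_score != NULL && len != NULL);
    assert(D >= 0 && (step == 1 || step == -1));

    for (int i = from; i >= 0 && i < n; i += step) {
        run += pair_score[i];
        steps++;
        if (run > best) {
            best = run;          /* new best: remember how far we got */
            best_len = steps;
        } else if (best - run > D) {
            break;               /* dropped by more than D: give up */
        }
    }
    *len = best_len;
    return best;
}
```

Calling extend once with step = +1 from the position after the hit and
once with step = -1 from the position before it yields the two ends of a
segment pair. In this sketch a larger D lets the extension cross a
low-scoring stretch that a smaller D would stop at, which costs more
time per hit.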
The first antidiagonal of a segment pair s is defined to be

    antis(s) = nstart(s) + 3 * astart(s),

and the last antidiagonal of s is defined to be

    antid(s) = nend(s) + 3 * aend(s),

where the protein position is scaled up by a factor of 3 since an amino
acid corresponds to three nucleotides.

A chain of segment pairs is a list of segment pairs in increasing order
of their last antidiagonal such that each segment pair is not far from
its predecessor and adjacent segment pairs do not have a large overlap.
Specifically, any two adjacent segment pairs s and s' in the list
satisfy the requirements

    antis(s') - antid(s) < A,
    astart(s') - aend(s) > -B, and
    nstart(s') - nend(s) > -3 * B

for some nonnegative integers A and B. Here A is called the maximum
number of antidiagonals between s and s', and B is the AAOVER parameter
in the program. The value for A can be specified at the command line
with the -a option. A long intron requires the use of a large value
for A.

For a chain of segment pairs to be reported, each segment pair in the
chain must have a score no less than the initial segment score cutoff,
and the chain must have a score no less than the final chain score
cutoff. The program shows segment alignments for at most C chains. If
there are more chains to be reported, the program prints only the chain
headings for the extra chains, without showing their segment alignments.

The DPS program is written in C and runs under Unix systems on Sun
workstations and under DOS systems on PCs. We think that the program is
portable to many machines.
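
The antidiagonal definitions and the adjacency requirement for chains
given above can be written down directly in C. The following is a
minimal sketch and not the DPS source. The struct SegPair and the
function can_follow are hypothetical names, while antis, antid and the
field names follow the notation of the text. The caller is assumed to
keep the list in increasing order of last antidiagonal.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical record for one segment pair; fields follow the text. */
typedef struct {
    int nstart, nend;   /* start/end of the DNA segment */
    int astart, aend;   /* start/end of the protein segment */
    int score;          /* score(s) */
} SegPair;

/* First antidiagonal: antis(s) = nstart(s) + 3 * astart(s). */
static int antis(const SegPair *s) { return s->nstart + 3 * s->astart; }

/* Last antidiagonal: antid(s) = nend(s) + 3 * aend(s). */
static int antid(const SegPair *s) { return s->nend + 3 * s->aend; }

/*
 * Returns true if segment pair t may directly follow segment pair s in
 * a chain: t starts fewer than A antidiagonals after s ends, and the two
 * overlap by fewer than B amino acids and fewer than 3 * B nucleotides.
 */
static bool can_follow(const SegPair *s, const SegPair *t, int A, int B)
{
    assert(A >= 0 && B >= 0);
    return antis(t) - antid(s) < A
        && t->astart - s->aend > -B
        && t->nstart - s->nend > -3 * B;
}
```

In this sketch, two segment pairs separated by a long stretch of DNA
fail the first test unless A is large enough, which is why a long intron
calls for a large value of A.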

To compare a DNA sequence in file DNA_Seq and a protein database in file
Protein_Database, use a command of the form

    dps DNA_Seq Protein_Database BLOSUM62 [options] > result

where DNA_Seq is a file containing a query DNA sequence in FASTA format,
Protein_Database is a file containing a protein database in FASTA
format, BLOSUM62 is a file containing a specially formatted BLOSUM
matrix, and the options are:

    -a N    specify max number N of antidiagonals between segments,
    -c N    specify max number N of chain alignments reported,
    -d N    specify distance N for extension,
    -f N    specify final chain score cutoff N,
    -i N    specify initial segment score cutoff N,
    -w N    specify amino acid word size of N <= 5.

BLOSUM62 matrix:

ARNDCQEGHILKMFPSTWYVBZX
4
-1 5
-2 0 6
-2 -2 1 6
0 -3 -3 -3 9
-1 1 0 0 -3 5
-1 0 0 2 -4 2 5
0 -2 0 -1 -3 -2 -2 6
-2 0 1 -1 -3 0 0 -2 8
-1 -3 -3 -3 -1 -3 -3 -4 -3 4
-1 -2 -3 -4 -1 -2 -3 -4 -3 2 4
-1 2 0 -1 -3 1 1 -2 -1 -3 -2 5
-1 -1 -2 -3 -1 0 -2 -3 -2 1 2 -1 5
-2 -3 -3 -3 -2 -3 -3 -3 -1 0 0 -3 0 6
-1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4 7
1 -1 1 0 -1 0 0 0 -1 -2 -2 0 -1 -2 -1 4
0 -1 0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1 1 5
-3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1 1 -4 -3 -2 11
-2 -2 -2 -3 -2 -1 -2 -3 2 -1 -1 -2 -1 3 -3 -2 -2 2 7
0 -3 -3 -3 -1 -2 -2 -3 -3 3 1 -2 1 -1 -2 -2 0 -3 -1 4
-2 -1 3 4 -3 0 1 -1 0 -3 -4 0 -3 -3 -2 0 -1 -4 -3 -3 4
-1 0 0 1 -3 3 4 -2 0 -3 -3 1 -1 -3 -1 0 -1 -3 -2 -2 1 4
0 -1 -1 -1 -2 -1 -1 -1 -1 -1 -1 -1 -1 -1 -2 0 0 -2 -1 -1 -1 -1 -1

A sample DNA file:

>DNA sequence
CTCATTTTTTCTTGCCCGTGTTTTAAATGTTTTCATCCACAGCATTTGAT
GGGATGATTGGAAGTGAGACGTTCGAGAAAATCCATATTTTGAGTCAAGA
ATTCAGATAATATACTGAGATGATTAGGTATGGCTGGGTTCTACAAAAAC
ACAAATATCCGGCTAGCAATGATCACTGAGCAAATTAAAGCGTTAACTCA
CTCATTATTGTAGCTTATGCGTTTCTCCTCCTCTCTTTTTTTCCTCGAAC
CGGAGTGGAAGATCCAATAACGTAATATTACTGATGTTGTTATTAAAGCT
GGCAAAAATAACATGAGGCGTAAAACCGCACTGCGGTAAGATGAGGGTAT
AAGGTGGAGATCAGGCGAACAAGCTGTTCTAAATCATACATATGTACAAT
GAGAACGTGTAACGATCCAATGAGCGTTTCATGATGCCATTGTTTAATCA
GAGTGATGAAAAAGAAATATTTGCGACCTTTTTTCGTTACATTGATCGTG
AAATTTTAATCAAAGATAATATAAGGACGTGAGATATTTATCTTTTTACT
TGAAATTAACAATAGAATTGCGCTAAGCGGAATAAGAGCTTTCGTAAACC
TTTCTATTTGCACCATTGCGTCAACGTATAAAATGGTATGACCTTTACAC
AAACGCATGCTTATAATCTTATGTTTTTCATAGGGTGTAATTTGGTTGAT
GACGTAGTCTAAATTTGATGCTATCTGCAATTGAGGTACATATAAGAGGT
CAATTTCGGGACCAACCCTTTTAATCGAAAAAAACGTAATTCACTAGGGC
AAGGGAGAACTTAGCAGCTAATATCGTAAACCTTTCATACTAAAAAAATG
CACTTACCATCAACAAAAAACTCAGGACCAATTTCCAAGCTTTTCTAGGT
GATTGCCTATAACACAAAAAGATTCGCTCATACATGAGATTTTTACATGT
AATAGCAATTTGTTCCGATCAGTTGAAGGTCATCAACGCACGGCAGGTAC
ATCCACACCTATCACAAAGCCCTTCAATAATTCACCTACGTAAAGTTATA
CCGAAACATGCAAAATCCATGAAAAATTCTGTATGATAACGATCATATCC
TTTTGTATTGGTGGTACGATGCTCAAAGATAGTTATTGTTGCACCTGAGG
CAAAAGCGGAAATGAAAAATCCAGATGGGGCCAAAAGCAGAAGTATTGTG
TACAACAATTGCTTCAGCAGTTTACCAAACCGTTTCCCAGCAATCATCAA
AAGTTGCTTTAGCCACATTTCCGCAAGATATCTTTGTGGCTCAACGAAGA
GGGCTATTCCAAATGCAATACAATACTAGCCGCTAGTGATCCATGCTTAT
AGGCAATTTGATTGATAACTGGCCGTTCTATTAAGGAGTCAATGCTAACC
ACATATAATGCATATAGGATTTGGCCTCTGCTGACCGTAATACTAGACAA
GGAATATAAAACAACAACGTAACCCAGCATAAAAACGATATAAATAAAAA
AAGAAACCAGATCATAAAGTTTGAGGGCACATCCCTCATGGTTTCAAAAT
CTCGTACATTGACTCAAACCTCGAGATCTTGTTTAAACGAACATAAGAAC
AGCGGTACCAAGTGACACATTGCGTACTGTGTAGTATGCGCCGTATATAA
CTTTTTTTTTCTGAAGGTACTTTGAATTACAATCTATTTTTTACAGTTCC
TATGGCAGGGGTTGAAGATATTTGGGTCTGAACCATAGCAGGATTAGTGT
TATAGTAGGTATGTGAATAGAAGCTAACAAAATGAGATGAACCTCATACA
AAGTCGTAGAGAAAACTGCTAACAGAAGAGCTGCGCCTTGAAATCGTATC
TCTAAGCTTATAATAAATTGAAAGGAAAAAATACGTGGTAAATGCAAGCG
ACCAAAAGGCTACGGCCCAACGCTAACCCGCCGATAGGTGCATAATCTAA
TTTACCTCCACCAGCAGGAGCCCTTTTTCTAAGTAATAAGCAAACCAGAT
AACTTACATCTTGCTGTAGGAAACAAAAGCCGGAATAATGGTTCACTCAT
ATTCTTCGTGTGAAACACAGAAGAAATCCAATATTTGCTTCAGTATTTAT
CTCTAAAAATTGGTCCTACATTGGAAACCATAAACCAATTATAACCGGTG
TACGAATTGTAAGCTAGTTCTGGAAATGTCATGTTGCGCAGGTAAAAGTG
GAGCTGAATTGTATATCTGTTTTGATCATTATTATCCCTCTGGGTGAGTG
GAAATATCAATAAAATGCAATGGCACATTTAATATCCTTCTCTTAATTCC
GTGATTTATAACATCTTGATGCCAGAAACACCTTTCGGATCCGGCAATAA
AGCGGAGATTAGCACGCTTTTCGCCGGTCCTACGGATTTAGTGTTGGCTA
TTGTTGAGATTAGTAATACGCAGAGAATTTTTCTACCGGTGAAGCGACCA
TCTCAGATTATTAGGTCAAGCAATAA

A sample protein database file:

>P41260; HEMOGLOBIN I (HB I). GLB1_LUCPE
SLEAAQKSNVTSSWAKASAAWGTAGPEFFMALFDAHDDVFAKFSGLFSGAAKGTVKNTPE
MAAQAQSFKGLVSNWVDNLDNAGALEGQCKTFAANHKARGISAGQLEAAFKVLSGFMKSY
GGDEGAWTAVAGALMGEIEPDM
>P41261; HEMOGLOBIN II (HB II). GLB2_LUCPE
TTLTNPQKAAIRSSWSKFMDNGVSNGQGFYMDLFKAHPETLTPFKSLFGGLTLAQLQDNP
KMKAQSLVFCNGMSSFVDHLDDNMLVVLIQKMAKLHNNRGIRASDLRTAYDILIHYMEDH
NHMVGGAKDAWEVFVGFICKTLGDYMKELS
>P41262; HEMOGLOBIN III (HB III). GLB3_LUCPE
SSGLTGPQKAALKSSWSRFMDNAVTNGTNFYMDLFKAYPDTLTPFKSLFEDVSFNQMTDH
PTMKAQALVFCDGMSSFVDNLDDHEVLVVLLQKMAKLHFNRGIRIKELRDGYGVLLRYLE
DHCHVEGSTKNAWEDFIAYICRVQGDFMKERL
>P36032; HYPOTHETICAL 52.3 KD PROTEIN IN FRE2 5'REGION. YKW1_YEAST
MSEERHEDHHRDVENKLNLNGKDDINGNTSISIEVPDGGYGWFILLAFILYNFSTWGANS
GYAIYLAHYLENNTFAGGSKLDYASIGGLAFSCGLFFAPVITWLYHIFSIQFIIGLGILF
QGAALLLAAFSVTLWEIYLTQGVLIGFGLAFIFIPSVTLIPLWFRNKRSLASGIGTAGSG
LGGIVFNLGMQSILQKRGVKWALIAQCIICTSLSTIALMLTRTTHQGLRQHKRSYKFELL
DYDVLSNFAVWLLFGFVSFAMLGYVVLLYSLSDFTVSLGYTSKQGSYVSCMVSVGSLLGR
PIVGHIADKYGSLTVGMILHLVMAILCWAMWIPCKNLATAIRFGLLVGSIMGTIWPTIAS
IVTRIVGLQKLPGTFGSTWIFMAAFALVAPIIGLELRSTDTNGNDYYRTAIFVGFAYFGV
SLCQWLLRGFIIARDEIAVREAYSADQNELHLNVKLSHMSKCLFRYKQLPRRV