> Proposal 0: blast human chromosome 22 (query) against the pufferfish
> genome (database). Both sequences repeat-masked, E-value 10^4.

I think there are two things to remember when we're building this set of "standard" tests:

* BLAST is an approximation. A lot of folks are out there comparing their accelerated / MPI / super-sensitive / Java-implemented homology search algorithm against NCBI's BLAST. When there are differences, it's impossible to tell whether they represent added sensitivity, lost specificity, or simply noise. It's really shocking to me that peer-reviewed publications accept and publish what amounts to comparisons without controls. Instead, both approximations should be compared to some complete algorithm implementing a search in the same space. For BLAST, one such algorithm is the 1981 work of Smith & Waterman. For others, it's less well defined. Better yet would be a comparison to some well-annotated set of biologically "correct" homologies: laboratory-verified, biologist-approved homologs. This methodology was used in creating the original PAM and BLOSUM matrices, but it seems to have fallen by the wayside with the data explosion of the last decade.

* (More important) We computer folks need to address biologically interesting questions, not just computationally interesting ones. Sure, chromosomes are the biggest biological strings we've got, and BLAST is the hammer at the top of the toolbox. Does that really make it the appropriate tool for the task? Unless you're pretty clever about your BLAST parameters, all this test is going to show are some well-documented weaknesses of BLAST when you hit it with queries of ridiculous size. Those weaknesses are there because the algorithm (not the implementation) was never designed with chromosomes in mind. "Local alignments" were the goal, not large-scale genomic archaeology.
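For reference, the complete algorithm mentioned above can be sketched in a few lines. This is a minimal Smith & Waterman (1981) local alignment scorer; the scoring values (match/mismatch/gap) are illustrative choices, not anything mandated by the original paper, and a real baseline run would use affine gaps and a substitution matrix.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between sequences a and b.

    Exhaustive dynamic programming: O(len(a) * len(b)) time and space,
    which is exactly why heuristics like BLAST exist -- but it makes a
    clean "ground truth" control for accuracy comparisons.
    """
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            score = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(0,                       # local: never go negative
                          H[i - 1][j - 1] + score, # match / mismatch
                          H[i - 1][j] + gap,       # gap in b
                          H[i][j - 1] + gap)       # gap in a
            best = max(best, H[i][j])
    return best

print(smith_waterman("ACGT", "ACGT"))  # → 8 (four matches at +2 each)
```

With the simple linear scoring above, an exact substring match scores 2 per base, so a heuristic tool's hits can be checked against these scores on any test set small enough to make the quadratic cost tolerable.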
That said, here are some of the questions I ask when I get hold of a new, great, wonderful sequence-based homology tool. I run ALL of these and see where the new tool shines. Then I describe the tool to my users and watch to see if there's any interest. None of the ones I've tried are good at all of these cases. Some are good at none. :) I don't claim that it's a complete list, but it's a start.

0) Query: EST (500-800bp, single-pass sequencing, meaning they're positively riddled with errors)
   Target: EST-unigene set from a single organism. I like to use Medicago: about 140,000 EST reads, which collapse to between 30,000 and 40,000 contigs.
   Search: Forward strand vs. forward strand. Theoretically, we know the reading frame for mRNA-based clones.
   Performance Criteria: Response time for a single query; throughput for large batch queries.
   Results Criteria: Accuracy vs. Smith & Waterman results.

1) Query: Protein
   Target: NCBI NR
   Just like above. This one doesn't separate things out at all.

2) Query: EST reads
   Target: Whole chromosome (I generally use Arabidopsis, since we're a plant lab).
   Performance Criteria: Response time; batch throughput.
   Results Criteria: Hits found vs. a target set where I used a sliding window to chop up the chromosome into sequences of 10,000bp. You'd probably be surprised at the differences in the hits.

3) Query: Whole chromosome
   Target: Whole chromosome
   Performance Criteria: Response time.
   Results Criteria: Chop up both chromosomes. This one gets really hairy when you try to define the "right" answers. You start to encounter all those good "shadowing" and normalization questions that get really confusing.

Thanks for listening.

-Chris Dwan
University of Minnesota
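The sliding-window chopping from test 2 might look like the sketch below. The 10,000bp window size comes from the text; the non-overlapping step is my assumption (an overlapping step, e.g. half the window size, may be closer to what was actually used, and the `step` parameter allows either).

```python
def sliding_windows(seq, size=10_000, step=None):
    """Chop a chromosome-length string into (start, subsequence) windows.

    step defaults to size, i.e. non-overlapping chunks; pass a smaller
    step to get overlapping windows. The final window may be short if
    the sequence length isn't a multiple of step.
    """
    if step is None:
        step = size
    return [(start, seq[start:start + size])
            for start in range(0, len(seq), step)]

# Toy "chromosome" of 20,000 bp chopped into two 10,000 bp targets.
chunks = sliding_windows("ACGT" * 5_000, size=10_000)
print(len(chunks))  # → 2
```

Searching against a database built from these windows, instead of the intact chromosome, is what exposes the hit differences described in test 2.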