Joe Landman <landman@scientificappliance.com> wrote on Fri, 7 Jun 2002:

> I do not know precisely what Paracel's code does.

I don't know *precisely* what the code does either. I only know
*vaguely* that

- it can fragment the query sequence and also the database, and
- it recomputes the final P and E values based on the set of P and E
  values obtained for the query-sequence / database fragments,

and Paracel is proud of the fact that their final P and E values are
identical (plus/minus epsilon) to the P and E values that would have
been obtained by running NCBI Blast on the non-fragmented query
sequence and the non-fragmented database.

> pathological case (e.g. worst case) was something Ivo Grosse suggested
> with Chr21 vs pufferfish, where I was getting about 8x speedup on 16
> CPUs.

If I remember correctly, Paracel's Blast had almost exactly the same
speed, so a speed loss of 50% per node seemed normal for programs that
also fragment the database.

> work by segmenting the input query sequences, optionally segmenting the
> databases (this isnt always a performance win though),

Exactly. I guess that when only the query sequence is fragmented, but
not the database, the Blast throughput should scale almost linearly
with the number of nodes, until the file server can no longer handle
the output.

The only problem with not fragmenting the database is that

- the database may not fit into memory, or
- you may need to buy more memory for *each* of the compute nodes. If
  you instead spent that money on additional nodes, a Blast program
  that can split the database might run faster on the larger cluster
  than a Blast program that cannot split the database runs on the
  smaller cluster.

Ivo
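P.S. To make the recomputation of E and P values a bit more concrete, here is a minimal sketch of how per-fragment values could be merged. This is not Paracel's method, which I do not know. It only assumes the Karlin-Altschul relation E = K*m*n*exp(-lambda*S), in which E is linear in the database length n, and the relation P = 1 - exp(-E). The function names and the hit format (subject id, E value) are hypothetical, and the effective-length (edge) corrections that NCBI Blast applies are ignored, so the results would agree with a non-fragmented run only approximately.

```python
import math


def rescale_evalue(e_fragment, fragment_len, total_len):
    """Rescale an E value computed against one database fragment
    to the size of the whole database (E is linear in n)."""
    return e_fragment * (total_len / fragment_len)


def evalue_to_pvalue(e):
    """P = 1 - exp(-E); expm1 keeps precision for very small E."""
    return -math.expm1(-e)


def merge_hits(fragment_hits, fragment_lens):
    """Merge hits from all database fragments.

    fragment_hits: one list per fragment of (subject_id, e_value) pairs,
                   each E value computed against that fragment only.
    fragment_lens: length in residues of each fragment, same order.
    Returns (subject_id, E, P) tuples sorted by increasing E.
    """
    total_len = sum(fragment_lens)
    merged = []
    for hits, frag_len in zip(fragment_hits, fragment_lens):
        for subject_id, e_frag in hits:
            e_full = rescale_evalue(e_frag, frag_len, total_len)
            merged.append((subject_id, e_full, evalue_to_pvalue(e_full)))
    merged.sort(key=lambda hit: hit[1])
    return merged
```

The rescaling step is what makes hits from fragments of different sizes comparable: a hit found in a small fragment gets a proportionally larger correction than one found in a large fragment.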