[Bioclusters] Gridlet test of BLAST using datagrid directories.
Rick Westerman
bioclusters@bioinformatics.org
Mon, 02 Dec 2002 16:30:40 -0500
Chris Dwan writes, concerning Don Gilbert's gridlet that downloads
information to each node on an as-needed basis:
> The fact that the target needs to be re-formatted
> every time we gain or lose a compute node seems particularly iffy.
I had this concern as well: why go through the re-format (i.e.,
formatdb) each time you wish to run a job? I know that my current
weekly formatting of the databases takes a long time. However, in
trying out Don's gridlet I was pleasantly surprised to find that the
format took an insignificant amount of time compared to the BLAST
search itself. This was using datasets of 2,000 sequences and an input
of 50+ 1,000 bp sequences. Of course, reformatting a large dataset
just to run it against an input of only 1 or 2 sequences would be
inefficient.
Naturally there is a lot of other framework needed aside from the
gridlet. Chris mentioned a few pieces, including the existing "queuing
system of your choice is used to schedule jobs onto nodes, manage
transient and permanent failures, stage data, and all that other neat
stuff." There is no reason, in my mind, that such a queuing system
could not also handle jobs that split up the databases dynamically.
Such splitting may become more necessary as the data grows larger than
our computers' memory. I already have a PC cluster with very limited
memory (but it was "free" to me), which limits the datasets I can
submit to it.
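The dynamic splitting mentioned above could be sketched along these lines. This is a hypothetical illustration, not anything from Don's gridlet or an actual queuing system: the function names (`parse_fasta`, `split_fasta`) and the greedy length-balancing strategy are my own assumptions. The idea is simply that a scheduler could carve a FASTA database into record-aligned chunks sized to fit each node's memory, then run formatdb and the BLAST search on each chunk independently.

```python
# Hypothetical sketch: split a FASTA database into roughly equal
# chunks so a queuing system could format (formatdb) and search each
# chunk on a separate node. Chunk boundaries always fall between
# sequence records, never mid-sequence.

def parse_fasta(text):
    """Minimal FASTA parser: returns a list of (header, sequence)."""
    records, header, seq = [], None, []
    for line in text.splitlines():
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(seq)))
            header, seq = line[1:].strip(), []
        elif line.strip():
            seq.append(line.strip())
    if header is not None:
        records.append((header, "".join(seq)))
    return records

def split_fasta(records, n_chunks):
    """Distribute (header, sequence) records across n_chunks,
    balancing by total sequence length (greedy assignment)."""
    chunks = [[] for _ in range(n_chunks)]
    sizes = [0] * n_chunks
    # Place the largest sequences first so the greedy choice of the
    # currently lightest chunk keeps the chunks well balanced.
    for header, seq in sorted(records, key=lambda r: -len(r[1])):
        i = sizes.index(min(sizes))
        chunks[i].append((header, seq))
        sizes[i] += len(seq)
    return chunks
```

Each chunk would then be written back out as FASTA and handed to formatdb on its assigned node; the per-chunk search results would need to be merged afterward, which is one of the extra framework pieces a real queuing system would have to supply.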
In summary, I think that the gridlet might be a worthwhile tool.
-- Rick
Rick Westerman
westerman@purdue.edu
Phone: (765) 494-0505 FAX: (765) 496-7255
Department of Horticulture and Landscape Architecture
625 Agriculture Mall Drive
West Lafayette, IN 47907-2010
Physically located in room S049, WSLR building
Bioinformatics specialist at the Genomics Facility.
http://www.genomics.purdue.edu/~westerm