[Bioclusters] BLAST job time estimates

Micha Bayer bioclusters@bioinformatics.org
08 Jun 2004 12:12:03 +0100

> Even then, after a good fit, there are still multiple factors that would
> influence run time.  The biggest factor would be the index database size
> as compared to the available memory size.  If you overflow local ram,
> the mmap function will flush pages, and you will introduce disk I/O for
> your indices, which could be a significant performance inhibitor,
> depending upon how much I/O is needed.
> To alleviate this, use the "-v N" switch on formatdb, where N is a size
> in MB.  This fragments the database index into approximately N megabyte
> size segments.  This gives you a similar effect to the optimization
> technique named "blocking".
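
For the record, the invocation described above would look something like this (a sketch only - the basename nt and the 2000 figure are just examples, and note that some formatdb builds document -v in millions of letters rather than MB, so check the help output of your own version):

```shell
# Format nt into multiple volumes so each index segment stays well under RAM.
#   -i nt : input FASTA file
#   -p F  : nucleotide input
#   -o T  : parse SeqIds
#   -v    : volume size (MB according to the post above; some formatdb
#           versions document this in millions of letters -- check your build)
formatdb -i nt -p F -o T -v 2000
```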

This issue is still a source of great confusion for me. I started a
thread about this earlier on this list and have managed to confuse
myself even more about it since. 

The BLAST manual says that databases can be loaded into memory, but there
does not seem to be a way of forcing this: since BLAST mmap()s the
database files, it is apparently up to the OS to decide which pages
actually end up resident.
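
One workaround I have seen suggested is simply to read the formatted database files once before searching, so the Linux page cache already holds them (DB below is a placeholder for your database basename; this is a hint, not a guarantee - the kernel can still evict the pages under memory pressure):

```shell
# Pre-read the formatted database files so the Linux page cache holds them;
# subsequent blastall runs then fault the pages in from RAM rather than disk.
# DB is a placeholder -- point it at your database basename.
DB=ecoli.nt
for f in "$DB".*; do
    if [ -e "$f" ]; then      # skips the unexpanded pattern if no files match
        cat "$f" > /dev/null
    fi
done
```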

On my machine here (Linux RH9) it does not seem to load the database
into memory regardless of its size. I have recently tried the time
command with my BLAST runs, which conveniently also records page
faults, and I get the following output when I run a query against
ecoli.nt (which is pathetically small, a few MB tops, and should easily
fit into my 1 GB of memory):

>/usr/bin/time -v -- blastall -p blastn -d ecoli.nt -i test.txt -o test.out
        Command being timed: "blastall -p blastn -d ecoli.nt -i test.txt -o test.out"
        User time (seconds): 0.01
        System time (seconds): 0.02
        Percent of CPU this job got: 8%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.34
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 0
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 792
        Minor (reclaiming a frame) page faults: 621
        Voluntary context switches: 0
        Involuntary context switches: 0
        Swaps: 0
        File system inputs: 0
        File system outputs: 0
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0

To me the 792 major page faults suggest clearly that the db was not in
memory - every major fault means a read from disk. Does that mean I can
never get the db into memory, and that on Linux all BLAST searches will
take a huge performance hit because of this?
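
One experiment that might settle it (a sketch, using the same files as my run above): run the identical search twice back-to-back and compare the major fault counts. If the second run's count drops towards zero, the first run left the db in the page cache, and the penalty only applies to a cold cache:

```shell
# Run the same search twice; compare "Major (requiring I/O) page faults".
# If the second run reports far fewer, the first run left the db in cache.
/usr/bin/time -v -- blastall -p blastn -d ecoli.nt -i test.txt -o run1.out 2> run1.time
/usr/bin/time -v -- blastall -p blastn -d ecoli.nt -i test.txt -o run2.out 2> run2.time
grep 'Major' run1.time run2.time
```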

Where does that leave things like mpiBLAST, which gets its performance
increase from each worker's database fragment fitting into memory?

Maybe someone can shed some light on this...

> See above.  How large are your databases?

I plan to run the queries against the standard nr and nt databases and
perhaps whole-chromosome dbs as well. nt is currently about 2.6 GB, nr
about 600 MB.


Dr Micha M Bayer
Grid Developer, BRIDGES Project
National e-Science Centre, Glasgow Hub
246c Kelvin Building
University of Glasgow
Glasgow G12 8QQ
Scotland, UK
Email: michab@dcs.gla.ac.uk
Project home page: http://www.brc.dcs.gla.ac.uk/projects/bridges/
Personal Homepage: http://www.brc.dcs.gla.ac.uk/~michab/
Tel.: +44 (0)141 330 2958