(In the interests of full disclosure: I work for a company that sells clusters, software, etc. of the sort I am writing about here, for bioinformatics and other fields.)

Design is as much about compromise as it is about solving problems. There are several balancing acts you need to engage in to make things work well. Points of serialization (i.e. places where things are forced to go one item at a time) are bad for cluster design.

NCBI BLAST (and pretty much any library/database searching tool) has modes of operation that are I/O bound, and others that are bound by other factors (memory, CPU). BLAST itself tries to mmap as much of the database indices into memory as possible. If you do not have sufficient memory for the indices, you are going to be hitting your I/O channel hard.

The first design point is sufficient memory. Look at your particular calculations, and make sure you can keep most or all of the working set in RAM. The reason is that virtual memory (disk) access latency and bandwidth are on the order of 10 ms and 20 MB/s respectively, while RAM latency and bandwidth are on the order of 150 ns and 800 MB/s. You do not want to use swap if you can avoid it, nor stream pages from disk for an mmap. Use the memory size of your calculation as a lower bound on the RAM size; my rule of thumb is never less than 1 GB per CPU. Any "extra" memory can be used by the system as a disk cache, which can have a dramatic positive impact on performance. The converse is also true: if you have 128 MB of RAM per compute node and your calculation wants 768 MB, you are going to spend all your time getting pages off the disk and very little time calculating. Add to this that RAM is (still) cheap. This may not be the case (cost issues) for huge clusters, but hopefully those were designed correctly from the beginning. You can increase your run times by orders of magnitude by having insufficient RAM.

The next issue is I/O (non-swap).
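To put those RAM-versus-disk numbers in perspective, here is a back-of-the-envelope sketch in Python. The bandwidth figures are the rough ones quoted above; the model itself (a simple average over the two tiers, touching every byte once) is my own simplification, not a real VM simulation:

```python
# Rough sketch: how much slower a run gets when part of the working
# set must be paged from disk instead of served from RAM.
# Bandwidth numbers are the approximate figures from the text;
# the two-tier averaging model is an illustrative assumption.

DISK_MBPS = 20.0    # ~disk/VM streaming bandwidth, MB/s
RAM_MBPS = 800.0    # ~RAM bandwidth, MB/s

def effective_bandwidth(working_set_mb, ram_mb):
    """Average MB/s if the overflow beyond RAM must stream from disk."""
    in_ram = min(working_set_mb, ram_mb)
    on_disk = max(working_set_mb - ram_mb, 0.0)
    # time to touch every byte of the working set once
    t = in_ram / RAM_MBPS + on_disk / DISK_MBPS
    return working_set_mb / t

# The 768 MB calculation on a 128 MB node, as in the example above,
# versus the same calculation with 1 GB of RAM per CPU:
bw_starved = effective_bandwidth(768, 128)
bw_fits = effective_bandwidth(768, 1024)
print(round(bw_fits / bw_starved, 1))  # -> 33.5, i.e. ~30x slowdown
```

With these (assumed) numbers, undersizing RAM costs a factor of about 33 in memory throughput, which is the "orders of magnitude" effect described above.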
BLAST and friends do large block sequential reads, especially when streaming indices/databases from disk. This means that per-spindle speed will be the limiting factor. If instead you are mounting the database indices from a central point rather than distributing them, then you are sharing the network pipe's bandwidth. A side effect is that with N simultaneous requestors of data from the pipe, each requestor typically gets about 1/N of the bandwidth (this is the 1/N problem). As N (the number of nodes) increases, each node's share of the file server's bandwidth drops accordingly. If you are doing searches with a system configured like this, you run out of network bandwidth and you get a point of serialization. This means your cluster cannot scale well. The folks at Oak Ridge ran into this with their MPP BLAST on their SP; I cannot find the pages on the web anymore, though.

I normally recommend a striped local file system. IDE disks are fast and cheap; SCSI disks are more expensive and as fast or faster. There are some calculations you can do on the speed of the I/O channel, the width of the striping, and the amount of data you are pulling over.

File system choice is critical to compute node performance. Some file systems are poorly suited to the demands of computational loads; others are ideally suited. We have seen factors of 1.5 in wall clock time for various long-running calculations using one file system versus another on the same machine (i.e. format the disk with one file system, run the test, format it with the other, run the test again).

Since you are striping the local file system, it is generally a good idea to stripe the swap space as well. Linux VMs still require swap space at least equal to physical memory (a 1:1 ratio); the mid 2.4.x series required more. Swap is usually a point of serialization.

The factors indicated above can have a substantial impact on the performance of the system. They are not the only factors, but they can be quite significant.
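The 1/N problem is easy to see with a little arithmetic. The link speed, stripe speed, and database size below are made-up illustrative numbers, not measurements:

```python
# Sketch of the 1/N problem: time for every node to pull a database
# image over a shared file-server link versus reading it from a
# striped local disk. All three constants are illustrative assumptions.

SHARED_LINK_MBPS = 100.0   # assumed file-server network link, MB/s
LOCAL_STRIPE_MBPS = 60.0   # assumed local striped-disk read rate, MB/s
DB_SIZE_MB = 2000.0        # assumed size of the database indices

def stream_time_shared(n_nodes):
    """Each of N simultaneous readers gets ~1/N of the server link."""
    return DB_SIZE_MB / (SHARED_LINK_MBPS / n_nodes)

def stream_time_local():
    """Local striped disks: time is independent of cluster size."""
    return DB_SIZE_MB / LOCAL_STRIPE_MBPS

for n in (4, 16, 64):
    print(n, stream_time_shared(n), round(stream_time_local(), 1))
# shared time grows linearly with N (point of serialization);
# local time stays flat as the cluster grows
```

The shared-server time scales linearly with node count while the local-disk time is constant, which is exactly why the central-mount design stops scaling.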
As always, you want to run your own tests (i.e. your own code, on several different input cases: think small, medium, large, and wherever you really want to explore) on any system to gauge its performance. You will gradually find that, marketing claims notwithstanding, system performance is a complex beast, and requires careful attention to many details to understand what is going on and where the control/tuning knobs are.

Enjoy BIO-IT World, folks. I won't be attending (prior commitments). I think ISMB is more likely. Anyone interested in meeting then?

--
Joseph Landman, Ph.D.
Senior Scientist, MSC Software High Performance Computing
email : joe.landman@mscsoftware.com
office : +1 248 208 3312
fax : +1 714 784 3774