[Bioclusters] Request for discussions-How to build a biocluster Part 5 (BLAST/DB management)

Pam Culpepper bioclusters@bioinformatics.org
Fri, 03 May 2002 14:39:18 -0500





Imre Vastrik wrote:

>Sylvain Foisy wrote:
>
>>BLAST
>>
>>OK, which version of BLAST should we use: NCBI or WU? I have used both
>>and quite frankly, for most uses they are pretty much equal, although WU
>>seems to be faster. Any particular feature of either one that could
>>be helpful to specific users?
>>
>
>NCBI blast gives XML and tab-delimited output which can make your
>"parsing-life" slightly easier. Also, for whatever reasons I've never
>managed to make WU blast run faster than NCBI's one, but perhaps this
>has something to do with me ;). The main thing is to use "fresh"
>versions. Both blasts have undergone significant speed improvements over
>the past few years.
>
You can get NCBI Blast pre-compiled or as part of the toolkit. You can
do more with the toolkit, but you have to figure out how to compile what
you want.

WU-Blast2 does not come with source code, so get the latest version.

I prefer source code that I can compile on my native platform.
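
On the tab-delimited output mentioned above: a minimal Python sketch of
reading it, assuming the standard 12-column layout (what blastall writes
with -m 8, with -m 9 adding '#' comment lines). The function and field
names are my own for illustration, not part of BLAST.

```python
import csv

# The 12 standard columns of NCBI BLAST tabular output, in order.
FIELDS = ["query", "subject", "pct_identity", "length", "mismatches",
          "gap_opens", "q_start", "q_end", "s_start", "s_end",
          "evalue", "bit_score"]
INT_FIELDS = {"length", "mismatches", "gap_opens",
              "q_start", "q_end", "s_start", "s_end"}
FLOAT_FIELDS = {"pct_identity", "evalue", "bit_score"}


def parse_tabular(lines):
    """Yield one dict per hit from an iterable of tab-delimited lines."""
    for row in csv.reader(lines, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and -m 9 style comment lines
        if len(row) != len(FIELDS):
            raise ValueError("expected %d columns, got %d" % (len(FIELDS), len(row)))
        hit = dict(zip(FIELDS, row))
        for name in INT_FIELDS:
            hit[name] = int(hit[name])
        for name in FLOAT_FIELDS:
            hit[name] = float(hit[name])
        yield hit


def best_hits(hits):
    """Keep the highest bit-score hit for each query."""
    best = {}
    for hit in hits:
        current = best.get(hit["query"])
        if current is None or hit["bit_score"] > current["bit_score"]:
            best[hit["query"]] = hit
    return best
```

Usage would be something like best_hits(parse_tabular(open("out.tab"))),
where out.tab is a hypothetical output file.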

>
>
>>THE GENBANK DATABASE
>>
>>BLAST without the data, what for? OK, what should be downloaded: the
>>GenBank database in its own format or the FASTA transformed one that is
>>found in the BLAST folder at NCBI? In both cases it is a lot of data.
>>The idea would be for a user to get the whole GenBank record for a
>>particular sequence. However, I think that it could be done either way
>>with scripts.
>>
>
>Can't comment much on GenBank since I'm using EMBL (how else? ;)). Since
>I'm splitting the db by species and by sequence type (mRNA/cDNA,
>finished genomic, HTGS, etc., i.e. a "finer" split than is readily
>available) I'm parsing the EMBL flatfiles.
>
GenBank is in ASN.1 format and has everything. You can use the asntool
you compiled from the toolkit and the programs in the /demo subdirectory
to generate GenBank reports, Medline reports, etc., and vice versa.

If you have the space, go with GenBank. Either way, you will still have
to run formatdb to make them Blastable.

>
>>How should the local database be administered? Reading the archive, I
>>think that the consensus is that the DB has to be split into n pieces
>>(n = number of nodes), each piece sent to a particular node and processed
>>with formatdb. Or have I got everything wrong? I would be worried that
>>the nodes which get the human sequences or the EST sequences would be
>>working very hard while the ones with the vector sequences are idle. Is
>>it feasible to divide the DB to split the load over the nodes?
>>
>
>I don't have 1st hand experience with blastdbs on clusters (my stuff is
>running on a multiprocessor machine) but I would distribute all dbs on
>all nodes so that you'd avoid the issue of some nodes being more heavily
>loaded than others due to different "popularity" of dbs.
>
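
For the splitting question, here is a minimal Python sketch of one way to
cut a FASTA file into n pieces of roughly equal size, assuming total
residue count is a reasonable stand-in for search load (it may not be:
equal sizes do not guarantee equal run times). It uses a greedy rule,
longest sequences first, each one going to the currently lightest piece.
Function names are illustrative, and it holds all records in memory, so
it is only a sketch for a database of GenBank size.

```python
import heapq


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line.strip())
    if header is not None:
        yield header, "".join(chunks)


def split_balanced(records, n_parts):
    """Distribute records over n_parts lists, balancing total residues."""
    if n_parts < 1:
        raise ValueError("n_parts must be at least 1")
    parts = [[] for _ in range(n_parts)]
    # Heap of (total residues so far, part index); lightest part on top.
    heap = [(0, i) for i in range(n_parts)]
    heapq.heapify(heap)
    for header, seq in sorted(records, key=lambda r: len(r[1]), reverse=True):
        total, i = heapq.heappop(heap)
        parts[i].append((header, seq))
        heapq.heappush(heap, (total + len(seq), i))
    return parts
```

Each piece would then be written out and run through formatdb on its node.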
>>How should the daily updates be performed?
>>
>
>Again, as I'm familiar only with EMBL stuff I can talk only about that
>(although GB probably operates in a very similar manner). EMBL DNA DBs
>come in the following forms: release aka "embl" and everything since the
>last release aka "emblnew". For the latter EBI's ftp site offers:
>-"cumulative" data, i.e. everything since the last release except the
>records changed/deleted
>-weekly updates
>-daily (well, near daily) updates.
>For the latter 2 there are also transaction lists which you can use to
>create the cumulative version locally.
>
>For a given species/sequence type combination I create 3 blast databases:
>-release
>-new (everything since release, including the latest)
>-latest (the last daily/weekly update)
>
>Users who do their searches regularly (i.e. with each daily/weekly
>update) do it on the "latest". (Obviously it would be dead handy to have
>a way of launching these searches automatically whenever the db is
>updated...).
>"Occasional" users would search the union of "release" and "new". NCBI
>blast allows you to create alias files listing the "real" blastdbs to
>use which means that the user does not have to know anything about
>release and new etc and can just search "everything".
>
>Rgds.,
>
>imre
>
>P.S. you can see the web front to the blast server I've been talking
>about at:
>http://biomedicum.csc.fi:8010
>_______________________________________________
>Bioclusters maillist  -  Bioclusters@bioinformatics.org
>http://bioinformatics.org/mailman/listinfo/bioclusters
>
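
On the alias files Imre mentions: a minimal Python sketch that writes
one, assuming the usual NCBI layout of a TITLE line and a DBLIST line in
a file with a .nal extension (nucleotide; protein aliases use .pal). The
database names embl_release and embl_new are hypothetical placeholders
for whatever the formatted databases are actually called.

```python
def alias_text(title, db_names):
    """Return the contents of a BLAST alias file listing the real databases."""
    if not db_names:
        raise ValueError("an alias needs at least one database")
    return "TITLE %s\nDBLIST %s\n" % (title, " ".join(db_names))


def write_alias(base_path, title, db_names, protein=False):
    """Write base_path + '.nal' (or '.pal' for protein) and return its path."""
    path = base_path + (".pal" if protein else ".nal")
    with open(path, "w") as handle:
        handle.write(alias_text(title, db_names))
    return path


if __name__ == "__main__":
    # Users search "embl_all" and never see the release/new split.
    print(alias_text("EMBL release plus updates", ["embl_release", "embl_new"]))
```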
I have a script that runs every night (3 a.m.) that goes to the NCBI ftp
site and checks whether any of the Blast databases were modified (size
and date stamp). If they were, it downloads them and runs formatdb on
them. Some of the databases are modified more often than others.
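
This is not my actual script, but the check can be sketched in Python
roughly as follows, assuming the ftp server answers the SIZE and MDTM
commands (not every server does). The host, directory and file names are
passed in as parameters, and the Stamp type and function names are
illustrative only.

```python
from ftplib import FTP
from typing import NamedTuple, Optional


class Stamp(NamedTuple):
    size: Optional[int]  # bytes, as reported by SIZE
    mtime: str           # YYYYMMDDHHMMSS, as reported by MDTM


def changed_files(remote, local):
    """Names whose remote size or date stamp differs from the local record."""
    return sorted(name for name, stamp in remote.items()
                  if local.get(name) != stamp)


def remote_stamps(host, directory, names):
    """Ask the ftp server for the size and date stamp of each file."""
    stamps = {}
    with FTP(host) as ftp:
        ftp.login()               # anonymous login
        ftp.cwd(directory)
        ftp.voidcmd("TYPE I")     # binary mode, so SIZE reports bytes
        for name in names:
            size = ftp.size(name)
            reply = ftp.sendcmd("MDTM " + name)  # e.g. "213 20020503143918"
            stamps[name] = Stamp(size, reply.split()[1])
    return stamps

# A nightly job would call remote_stamps(), compare against the stamps
# saved from the previous run with changed_files(), then download and
# formatdb only the files returned.
```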

GenBank is rolled over about once a month. Daily updates are kept in a
daily-nc directory. You have to use fmerge from the demo subdirectory
of the toolkit.
 
Pam
http://bcf.bcm.tmc.edu
