[Bioclusters] blast db update

Mon, 18 Oct 2004 12:12:30 -0700 (PDT)

Good to know!

Thanks again for your input!
Peiran

>From: David Adelson <david.adelson@tamu.edu>
>Date: Mon, 18 Oct 2004 13:28:39 -0500
>
>Peiran
>
>We just download the whole thing.  nt is actually not so huge compared 
>to raw trace file data for canine, bovine and chicken :-).  We have 
>used perl scripts along with entrez queries to download portions of 
>htgs or gss for organism specific dbs and that avoids having to 
>download the whole thing if you just want one organism.  We do this 
>mainly to speed up the blast searches, so that people working on 
>sorghum or rice don't have to wait for blast to search the 85% of htgs 
>they are not interested in.
>
>Dave
>
>On Oct 18, 2004, at 12:54 PM, Peiran Song wrote:
>
>> Dave,
>>
>> Thank you for your reply and advice!
>>
>> What still bugs me is updating a huge database like nt: how do I avoid
>> downloading the whole thing every time? Do you do incremental updates?
>> What is your strategy there?
>>
>> thanks,
>> Peiran
>>
>>
>>
>>
>>> From: David Adelson <david.adelson@tamu.edu>
>>> Date: Mon, 18 Oct 2004 10:59:42 -0500
>>>
>>> Peiran,
>>>
>>> Sorry about the first reply; I need to read things before I reply to
>>> them.
>>>
>>> For the type of download you refer to, you can use an Entrez query with
>>> one of the SOAP tools.
>>>
>>> See
>>> http://eutils.ncbi.nlm.nih.gov/entrez/query/static/efetchseq_help.html#SequenceDatabases
>>> for some details.
>>>
>>> You should be able to write a Perl script, based on the example they
>>> provide and the organism's taxonomy ID, that retrieves just the
>>> sequences for the organism you want, from the db you want, in FASTA
>>> format.
>>>
>>> For example, ("txid4530"[Organism] AND biomol_genomic[PROP])  should
>>> return all rice genomic sequences.
>>>
>>>  Just have cron run it and then btformatdb it as usual.
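
The query-then-fetch step Dave describes can be sketched roughly as follows.
This is a minimal illustration, not his script: it uses Python and the plain
HTTP E-utilities interface rather than Perl or the SOAP tools mentioned above,
and it only builds the request URLs. The function names, the "nucleotide"
database choice and the batch size are assumptions made for the example. A
real script would still need to send the ESearch request, read WebEnv and
QueryKey from the XML reply, and then page through EFetch.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities HTTP interface.
EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"

# The Entrez query quoted in the message: rice (taxonomy ID 4530), genomic only.
RICE_GENOMIC = '("txid4530"[Organism] AND biomol_genomic[PROP])'


def esearch_url(term, db="nucleotide"):
    """Build an ESearch URL that stores the result set on the history server.

    retmax=0 asks for no ID list in the reply; only the count and the
    WebEnv/QueryKey handles are needed for the later fetch.
    """
    params = {"db": db, "term": term, "usehistory": "y", "retmax": 0}
    return f"{EUTILS}/esearch.fcgi?{urlencode(params)}"


def efetch_url(web_env, query_key, retstart=0, retmax=500, db="nucleotide"):
    """Build an EFetch URL for one batch of FASTA records from a stored search.

    web_env and query_key are the values returned by the ESearch call;
    retstart/retmax select the batch (500 is an arbitrary example size).
    """
    params = {
        "db": db,
        "WebEnv": web_env,
        "query_key": query_key,
        "rettype": "fasta",
        "retmode": "text",
        "retstart": retstart,
        "retmax": retmax,
    }
    return f"{EUTILS}/efetch.fcgi?{urlencode(params)}"
```

A cron job could call a script built around these two functions, append each
fetched batch to a FASTA file, and then run the usual formatting step on the
result.
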
>>>
>>> Hope this helps.
>>>
>>> Dave
>>>
>>> On Oct 14, 2004, at 4:37 PM, Peiran Song wrote:
>>>
>>>> Hi,
>>>>
>>>> This has been a topic before, but I still need suggestions on the job
>>>> I am trying to do. I need to build a local GenBank human, mouse and
>>>> zebrafish BLAST database that is updated fairly frequently, if not
>>>> nightly, and to be able to run btblastall from the iNquiry software to
>>>> parallelize BLAST jobs.
>>>>
>>>> I can think of two ways to get the database, but I am troubled by the
>>>> updates for both.
>>>>
>>>> One is to get the nt database and run BLAST with a gi list of the
>>>> species of interest. I would have to get the FASTA data from NCBI in
>>>> order to format it in a way that btblastall can parallelize. But I
>>>> don't think the NCBI site supports rsync, true? Then what are people's
>>>> solutions for frequent updates? Another problem with this strategy is
>>>> that the gi list also has to be updated, and I don't have a good idea
>>>> for that either...
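
One possible way to keep a gi list current is to regenerate the full list
periodically and compare it with the previous one, so that only the changes
need attention. The sketch below is an illustration under stated assumptions,
not something proposed in the thread: it assumes each list is plain text with
one numeric identifier per line, and the function names are hypothetical.

```python
def read_ids(lines):
    """Collect non-empty identifiers from an iterable of text lines."""
    return {line.strip() for line in lines if line.strip()}


def diff_id_lists(old_ids, new_ids):
    """Return (added, removed) identifiers, each sorted numerically."""
    added = sorted(new_ids - old_ids, key=int)
    removed = sorted(old_ids - new_ids, key=int)
    return added, removed
```

The "added" identifiers are the records that would need to be fetched, and
the "removed" ones are records that no longer match the species query.
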
>>>>
>>>> The other choice is to parse the GenBank release to get the initial
>>>> data, and use the daily files for updates. But as fmerge is no longer
>>>> supported, is there a good way to do the merge with the NCBI db format?
>>>> (The WU BLAST package has a utility to achieve that.)
>>>>
>>>> Help me out!
>>>>
>>>> Thanks,
>>>> Peiran Song
>>>>
>>>> Zebrafish Information Network
>>>>
>>>> _______________________________________________
>>>> Bioclusters maillist  -  Bioclusters@bioinformatics.org
>>>> https://bioinformatics.org/mailman/listinfo/bioclusters