[BiO BB] Efficient way to retrieve full length cDNA sequences from GenBank?

Mikhail Fursov mike.fursov at gmail.com
Fri Apr 3 03:06:50 EDT 2009


What is the size of the species genomes you use? Do you have them locally?
If the genome size is less than the RAM on your computer, a simple approach
could be:

1) Merge all your sequences into a single sequence with ~100 'N' chars
between them
2) Merge all genomes
3) Find repeats (common hits) between the 2 resulting sequences
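
A minimal Python sketch of the three steps above, under some assumptions of
mine: "find repeats" is approximated by exact shared k-mers on the forward
strand only, the seed length k=12 is an arbitrary choice, and the helper names
(merge, locate, common_hits) and the toy sequences are made up for
illustration. A real repeat finder would also handle mismatches and the
reverse strand.

```python
from bisect import bisect_right

SPACER = "N" * 100  # ~100 'N' chars so that no hit can span two records


def merge(records):
    """Concatenate (name, seq) records with SPACER between them.

    Returns the merged string, the start offset of each record, and the names.
    """
    parts, starts, names = [], [], []
    pos = 0
    for name, seq in records:
        starts.append(pos)
        names.append(name)
        parts.append(seq.upper())
        pos += len(seq) + len(SPACER)
    return SPACER.join(parts), starts, names


def locate(pos, starts, names):
    """Map a position in a merged string back to (record name, local offset)."""
    i = bisect_right(starts, pos) - 1
    return names[i], pos - starts[i]


def common_hits(a, b, k=12):
    """Yield (pos_in_a, pos_in_b) for every exact k-mer shared by a and b.

    k-mers containing 'N' are never indexed, so spacers cannot produce hits.
    """
    index = {}
    for i in range(len(a) - k + 1):
        kmer = a[i:i + k]
        if "N" not in kmer:
            index.setdefault(kmer, []).append(i)
    for j in range(len(b) - k + 1):
        for i in index.get(b[j:j + k], ()):
            yield i, j


if __name__ == "__main__":
    # Toy data, invented for this example.
    queries = [("cdna1", "ACGTACGTACGTTTGA"), ("cdna2", "GGGCCCAAATTTGGGC")]
    genomes = [("chrA", "TTTTACGTACGTACGTTTGATTTT")]

    q, q_starts, q_names = merge(queries)   # step 1
    g, g_starts, g_names = merge(genomes)   # step 2
    for i, j in common_hits(q, g):          # step 3
        print(locate(i, q_starts, q_names), locate(j, g_starts, g_names))
```

The k-mer index is built over the merged query string, so memory use grows
with the total query length; consecutive hits on the same diagonal can then be
chained into longer matches.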



On Thu, Apr 2, 2009 at 11:09 PM, Mike Marchywka <marchywka at hotmail.com> wrote:

>
> ----------------------------------------
> > Date: Thu, 2 Apr 2009 16:41:51 +0100
> > From: pfern at igc.gulbenkian.pt
> > To: bbb at bioinformatics.org
> > Subject: Re: [BiO BB] Efficient way to retrieve full length cDNA sequences from GenBank?
> >
> > Hi
> >
> > I would do it programmatically. You do not even need to know much Perl
> > to create your own simple scripts using the Ensembl APIs.
> >
>
> I was using bash scripts with various tools (sed/awk) to parse BLAST
> output on short probe queries, and then using wget or curl to request
> genome sequence near the hits (alternatively, you can just download
> the complete genomes locally and use your favorite random-access
> facility, Perl would work, to get the pieces you want).
> IIRC, I then used my own C++ code for various tests.
>
> For unrelated work on splicing, many arguable splicing cues could be
> formulated as regular expressions with reverse-complement matches.
> You can also set up your own local blast DB or get other patterns
> or rules against which to search. Not sure if there are canned
> tools but it isn't hard to do a lot of this locally once you
> get coarse hits for marginal candidates.
>
>
>
> >
> > Go to http://www.ensembl.org and look for the APIs in the Docs & FAQ's section.
> > It is full of instructions and examples.
> >
> > Good luck
> > Pedro
> >
> > --
> > Pedro Fernandes
> > Centro Português de Bioinformática
>
> > Quoting dale richardson :
> >
> >>
> >> So my question is this:
> >>
> >> What is the most efficient way to obtain a set of cDNA sequences that
> >> match to a set of genomic DNA sequences while excluding spurious
> >> hits, RefSeq sequences and "pseudo" full length cDNAs?
> >>
> >> As you can imagine, I am interested in looking for alternative splice
> >> variants for a number of genes.
>
>



-- 
Mikhail Fursov


