[BiO BB] Clustering EST sequences
Martin Gollery
mgollery at unr.edu
Tue Apr 1 17:58:43 EST 2003
This is very strange: spaces are allowed in FASTA, at least in the description
section. In the first part (the identifier) you may need to replace the spaces
with | symbols, as follows:
Change
>gi 29125973 emb AJ550374.1 USO550374 Uncultured soil bacterium partial nosZ
gene for putative nitrous oxide reductase, clone T8C23
GGCTGGGG...
to
>gi|29125973|emb|AJ550374.1|USO550374 Uncultured soil bacterium partial nosZ
gene for putative nitrous oxide reductase, clone T8C23
GGCTGGGG...
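
If you would rather not edit the headers by hand, here is a minimal sketch of both options in Python (the thread suggests Perl; Python is used here only for illustration). The function names and the default of five identifier fields are my own assumptions, taken from the example above and not from phred or TGICL; adjust n_fields to however many leading fields make up the identifier in your files. The second function simply keeps the first word after ">" and drops everything else, which is the plain ">name" layout asked about.

```python
def join_id_fields(header, n_fields=5, sep="|"):
    """Join the first n_fields whitespace-separated tokens of a FASTA
    header with sep and keep the remaining description as it is.
    n_fields=5 is an assumption matching the gi/emb example above."""
    body = header.lstrip(">").strip()
    parts = body.split(None, n_fields)  # at most n_fields splits
    ident = sep.join(parts[:n_fields])
    rest = parts[n_fields] if len(parts) > n_fields else ""
    return ">" + ident + (" " + rest if rest else "")


def keep_name_only(header):
    """Reduce a FASTA header to '>' plus its first token."""
    tokens = header.lstrip(">").split()
    return ">" + (tokens[0] if tokens else "")


def convert(lines, fix=keep_name_only):
    """Yield each line without its newline, rewriting header lines
    (those starting with '>') with fix; other lines pass through."""
    for line in lines:
        line = line.rstrip("\n")
        yield fix(line) if line.startswith(">") else line


# Small demonstration on the header from the example above.
example = (">gi 29125973 emb AJ550374.1 USO550374 "
           "Uncultured soil bacterium partial nosZ gene")
print(join_id_fields(example))
print(keep_name_only(example))
```

To process a whole file, pass an open file handle to convert() and write out each line it yields. Quality files normally use the same ">name" header lines, so the same pass should work for them as well, but check a few records before running it over the full data set.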
Quoting "Scott A. Halpine" <shalpine at ecomplexsystems.com>:
> I don't know of any conversion utilities but you can
> certainly write a quick conversion in Perl. I'm not familiar with the
> specific layouts but it sounds like you simply need to properly truncate each
> row of data. There shouldn't be a problem if your field partition is white
> space (or any other specific delimiter for that matter).
> If you don't get a better offer, send me a small data file of what you need
> converted, the field delimiter used, and an example of what it needs
> converted into. I should be able to write you a Perl routine and send it back
> to you.
> Scott A. Halpine
> Ecologic Complex Systems, LLC
> 4640 Forbes Blvd, Suite 200
> Lanham, MD 20706-4885
> Phone: 301-918-3283
> Fax: 301-429-8762
>
> ----- Original Message -----
> From: Bossers, A.
> To: bio_bulletin_board at bioinformatics.org
> Cc: biodevelopers at bioinformatics.org
> Sent: Tuesday, April 01, 2003 6:29 AM
> Subject: [BiO BB] Clustering EST sequences
>
>
> Dear All,
>
> I have a very basic problem and I wonder how others have solved it.
>
>
> I want to make a unigene collection of a large EST database. We have
> chromat files in ABI format and I use Linux on the Intel platform.
>
> I have phred and phrap running but since phrap was originally designed for
> genomic sequences we get lots of misassemblies on poly-A or poly-T
> stretches.
>
> Therefore I installed the TIGR TGICL package, which is designed for EST
> databases and also runs very well on multi-node machines.
>
> However, it uses multi fasta files (and corresponding (optional) quality
> files) as input.
> I wanted to use the phred package to generate the required fasta and qual
> files. This runs fine but the fasta file has in the >name line additional
> info separated with spaces. These files are not accepted by TGICL.
>
> Is there an easy Unix (Linux) utility to convert these multi fasta files
> and quality fasta files into simple >name {CRT} seq files so they can be used
> as input for TGICL? Or is a conversion utility available to convert/extract
> phred's phd files into fasta-seq and fasta-qual?
>
> Any help would be appreciated,
>
> Alex
>
>
>
Martin Gollery
Associate Director of Bioinformatics
University of Nevada at Reno
Dept. of Biochemistry / MS200
(775)784-6048
-------------------------------------------------
This mail sent through https://webmail.unr.edu