[BiO BB] Clustering EST sequences

Tue Apr 1 17:58:43 EST 2003

This is very strange: spaces are allowed in FASTA, at least in the description
section. You may need to replace the spaces in the first part of the header
line (the identifier fields) with | symbols, as follows:
Change
>gi 29125973 emb AJ550374.1 USO550374 Uncultured soil bacterium partial nosZ 
gene for putative nitrous oxide reductase, clone T8C23
GGCTGGGG...

to

>gi|29125973|emb|AJ550374.1|USO550374 Uncultured soil bacterium partial nosZ 
gene for putative nitrous oxide reductase, clone T8C23
GGCTGGGG...
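
If you have many sequences, the change above could be scripted. Here is a
minimal Python sketch, not a tested utility: the function name
pipe_join_header and the n_fields parameter are made up for illustration, and
it assumes the identifier is always the first five space-separated fields, as
in the example header. Adjust n_fields if your headers differ.

```python
def pipe_join_header(line, n_fields=5):
    """Join the first n_fields space-separated fields of a FASTA header with '|'.

    Lines that do not start with '>' (sequence data) are returned unchanged.
    n_fields=5 is an assumption taken from the example header above.
    """
    line = line.rstrip("\n")
    if not line.startswith(">"):
        return line
    # Split off at most n_fields identifier fields; the remainder (if any)
    # is the free-text description, where spaces are left alone.
    parts = line[1:].split(" ", n_fields)
    identifier = "|".join(parts[:n_fields])
    description = parts[n_fields:]
    return ">" + " ".join([identifier] + description)


header = (">gi 29125973 emb AJ550374.1 USO550374 Uncultured soil bacterium "
          "partial nosZ gene for putative nitrous oxide reductase, clone T8C23")
print(pipe_join_header(header))
```

To convert a whole file, apply the function to every line and write the
results out one per line.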

Quoting "Scott A. Halpine" <shalpine at ecomplexsystems.com>:

> I don't know of any conversion utilities but you can
> certainly write a quick conversion in Perl. I'm not familiar with the
> specific layouts but it sounds like you simply need to properly truncate each
> row of data. There shouldn't be a problem if your field partition is white
> space (or any other specific delimiter for that matter). 
> If you don't get a better offer, send me a small data file of what you need
> converted, the field delimiter used, and an example of what it needs
> converted into. I should be able to write you a Perl routine and send it back
> to you. 
> Scott A. Halpine
> Ecologic Complex Systems, LLC
> 4640 Forbes Blvd, Suite 200
> Lanham, MD 20706-4885
> Phone: 301-918-3283
> Fax: 301-429-8762
> 
>   ----- Original Message ----- 
>   From: Bossers, A. 
>   To: bio_bulletin_board at bioinformatics.org 
>   Cc: biodevelopers at bioinformatics.org 
>   Sent: Tuesday, April 01, 2003 6:29 AM
>   Subject: [BiO BB] Clustering EST sequences
> 
> 
>   Dear All, 
> 
>   I have a very basic problem and wonder how others have solved it.
> 
> 
>   I want to make a unigene collection of a large EST database. We have
> chromat files in ABI format and I use Linux on the intel platform.
> 
>   I have phred and phrap running but since phrap was originally designed for
> genomic sequences we get lots of misassemblies on poly-A or poly-T
> stretches.
> 
>   Therefore I installed the TIGR TGICL package which is designed for EST
> databases and also runs very well on multi node machines.
> 
>   However, it uses multi fasta files (and corresponding (optional) quality
> files) as input. 
>   I wanted to use the phred package to generate the required fasta and qual
> files. This runs fine but the fasta file has in the >name line additional
> info separated with spaces. These files are not accepted by TGICL.
> 
>   Is there an easy unix (linux) utility to convert these multi fasta files
> and quality fasta files into simple >name {CRT} seq files so they can be used
> as input for tgicl? Or is a conversion utility available to convert/extract
> phred's phd files into fasta-seq and fasta-qual?
> 
>   Any help would be appreciated, 
> 
>           Alex 
> 
> 
> 
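
The conversion asked for in the quoted message, reducing every header to a
bare >name line, could be sketched in Python as below. This is only a sketch
under assumptions: simplify_header and simplify_lines are hypothetical names,
and the sample header is invented to look like a name followed by extra
space-separated fields. Because quality-file headers also start with ">", the
same function should work on both the fasta and the qual file; the
space-separated quality values are left untouched.

```python
def simplify_header(line):
    """Reduce a '>name extra info' header to '>name'; pass other lines through."""
    line = line.rstrip("\n")
    if line.startswith(">"):
        tokens = line[1:].split()
        # Keep only the first whitespace-delimited token as the name.
        return ">" + (tokens[0] if tokens else "")
    return line


def simplify_lines(lines):
    """Apply simplify_header to every line of a multi-FASTA or quality file."""
    return [simplify_header(line) for line in lines]


# Invented example records (sequence file and quality file share the layout).
fasta = [">est0001.ab1 653 0 653 ABI\n", "GGCTGGGG\n"]
qual = [">est0001.ab1 653 0 653 ABI\n", "10 12 35 40 40 38 22 9\n"]
print("\n".join(simplify_lines(fasta)))
print("\n".join(simplify_lines(qual)))
```

Running both files through the same function keeps the names in the sequence
and quality files identical, which matters if the two are matched by name.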

Martin Gollery
Associate Director of Bioinformatics
University of Nevada at Reno
Dept. of Biochemistry / MS200
(775)784-6048
