[BiO BB] Clustering EST sequences

Tue Apr 1 06:29:37 EST 2003

Dear All,

I have a very basic problem and I wonder how others have solved it.

I want to make a unigene collection from a large EST database. We have
chromat files in ABI format and I use Linux on the Intel platform.
I have phred and phrap running, but since phrap was originally designed for
genomic sequences we get lots of misassemblies on poly-A or poly-T
stretches.

Therefore I installed the TIGR TGICL package, which is designed for EST
databases and also runs very well on multi-node machines.
However, it takes multi-FASTA files (and, optionally, the corresponding
quality files) as input.
I wanted to use the phred package to generate the required fasta and qual
files. This runs fine, but in the fasta file the >name line carries
additional info separated by spaces. These files are not accepted by TGICL.
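
To make the header problem concrete, here is a minimal Python sketch of the
kind of clean-up meant: it keeps only the first whitespace-delimited token on
each ">" line and passes every other line through unchanged. The function
names and the example header text are hypothetical, and the sketch assumes
that a bare ">name" line is all TGICL needs, which is not verified here.

```python
def simplify_header(line):
    """Reduce a FASTA or qual header line to '>name'; leave other lines alone."""
    if line.startswith(">"):
        fields = line[1:].split()
        name = fields[0] if fields else ""
        return ">" + name + "\n"
    return line


def simplify_fasta(lines):
    """Yield the input lines with every header reduced to its name."""
    for line in lines:
        yield simplify_header(line)


def simplify_file(in_path, out_path):
    """Copy in_path to out_path, simplifying the headers on the way."""
    with open(in_path) as src, open(out_path, "w") as dst:
        dst.writelines(simplify_fasta(src))
```

The same function works for the quality file, since its records also start
with a ">" header line. Calling simplify_file("ests.fasta", "ests.clean.fasta")
would write the cleaned copy; both file names are placeholders.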

Is there an easy unix (linux) utility to convert these multi-FASTA files and
quality FASTA files into simple >name {CRT} seq files so they can be used as
input for TGICL? Or is a conversion utility available to convert/extract
phred's phd files into fasta-seq and fasta-qual?
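
For the second route, a rough Python sketch of extracting sequence and
quality from a single phd file might look like the following. It assumes the
usual phd layout: a "BEGIN_SEQUENCE name" line, then one "base quality
position" line per base between BEGIN_DNA and END_DNA. The function names and
line widths are hypothetical choices, not part of phred or TGICL.

```python
def parse_phd(lines):
    """Return (name, sequence, qualities) from the lines of one phd file."""
    name = None
    bases = []
    quals = []
    in_dna = False
    for line in lines:
        line = line.strip()
        if line.startswith("BEGIN_SEQUENCE"):
            parts = line.split()
            name = parts[1] if len(parts) > 1 else None
        elif line == "BEGIN_DNA":
            in_dna = True
        elif line == "END_DNA":
            in_dna = False
        elif in_dna and line:
            # Assumed field order: base call, quality value, peak position.
            fields = line.split()
            bases.append(fields[0])
            quals.append(int(fields[1]))
    return name, "".join(bases), quals


def to_fasta(name, seq, width=60):
    """Format one sequence as a FASTA record with a bare '>name' header."""
    chunks = [seq[i:i + width] for i in range(0, len(seq), width)]
    return "".join([">" + name + "\n"] + [c + "\n" for c in chunks])


def to_qual(name, quals, width=20):
    """Format quality values as a qual record with a bare '>name' header."""
    rows = [quals[i:i + width] for i in range(0, len(quals), width)]
    body = [" ".join(str(q) for q in row) + "\n" for row in rows]
    return "".join([">" + name + "\n"] + body)
```

Looping over a directory of phd files and appending the two records to a
fasta and a qual file would then give input with simple headers.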

Any help would be appreciated,

	Alex
