[Bioclusters] Parallel Sequence Alignment tool

Paulo Nuin nuin at genedrift.org
Mon Aug 3 11:50:07 EDT 2009


Hi

Just my two cents. Aligning rRNA is not a straightforward process, and
it shouldn't be done fully automatically. Muscle, MAFFT and other fast
algorithms will generate very low quality alignments if run blindly.
Given the number of sequences you have, and their nature, you would be
OK wrapping a script around ClustalW or ClustalW-MPI.

A good protocol to align rRNA is as follows:

- align two sequences
- add a third sequence to it by using the first two as a profile
- add a fourth sequence using the first three as a profile
- add a fifth sequence ...
- at some point you will have a profile good enough that the aligned
sequences can serve as a model for the ones still to be added to the
alignment
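
To make the steps above concrete, here is a minimal sketch that only
plans the ClustalW calls, one per added sequence. The function name, the
file names and the "aln" output directory are made up for illustration.
The flags (-INFILE, -ALIGN, -PROFILE1, -PROFILE2, -SEQUENCES, -OUTFILE)
are the profile options as I remember them from ClustalW 2.x, so please
check them against "clustalw2 -help" on your install before relying on
this.

```python
def plan_incremental_alignment(seq_files, workdir="aln", clustalw="clustalw2"):
    """Return one ClustalW command (argv list) per step of the protocol.

    seq_files[0] is assumed to be a FASTA file holding the two seed
    sequences; every further file is assumed to hold one sequence to add.
    Nothing is executed here, the commands are only built.
    """
    if not seq_files:
        raise ValueError("need at least the seed FASTA file")

    commands = []

    # Step 0: ordinary alignment of the two seed sequences.
    profile = f"{workdir}/step_000.aln"
    commands.append(
        [clustalw, f"-INFILE={seq_files[0]}", "-ALIGN", f"-OUTFILE={profile}"]
    )

    # Steps 1..n: add one sequence at a time, always using the alignment
    # produced by the previous step as the profile.
    for i, new_seq in enumerate(seq_files[1:], start=1):
        out = f"{workdir}/step_{i:03d}.aln"
        commands.append(
            [
                clustalw,
                f"-PROFILE1={profile}",   # existing alignment (the profile)
                f"-PROFILE2={new_seq}",   # sequence to add
                "-SEQUENCES",             # add profile2 sequences to profile1
                f"-OUTFILE={out}",
            ]
        )
        profile = out  # the new alignment becomes the next profile

    return commands
```

Each command could then be run with subprocess.run(cmd, check=True).
Keeping one output file per step is deliberate: it lets you look at the
alignment after each addition instead of trusting the whole run blindly.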

The reason is that rRNA has a secondary (and tertiary) structure that
contains stems and loops. Stems are short segments that are somewhat
"duplicated" along the linear sequence and attach to each other when
forming the secondary structure. This pairing sometimes doesn't
follow the usual A-T(U), C-G rule. Because of the stems there is a
pattern in the primary structure that has to be followed to generate a
good (but not excellent) alignment.
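
As a rough illustration of the stem idea, the sketch below checks
whether two segments of the linear sequence could pair with each other,
read in opposite directions. The function name and the example segments
are made up. The only non-standard pairing it knows about is the G-U
wobble pair; real rRNA stems can contain other non-canonical pairs that
this toy check does not model.

```python
# Watson-Crick pairs plus the G-U wobble pair.
CANONICAL = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G")}
WOBBLE = {("G", "U"), ("U", "G")}


def can_form_stem(five_prime, three_prime, allow_wobble=True):
    """Return True if the two segments can pair base by base.

    The second segment is reversed before comparison because the two
    strands of a stem run antiparallel. DNA input (T) is treated as U.
    """
    a = five_prime.upper().replace("T", "U")
    b = three_prime.upper().replace("T", "U")[::-1]
    if len(a) != len(b):
        return False

    allowed = CANONICAL | WOBBLE if allow_wobble else CANONICAL
    return all((x, y) in allowed for x, y in zip(a, b))
```

For example, can_form_stem("GGCU", "AGCC") pairs cleanly, while
can_form_stem("GGCU", "GGCC") only works when the wobble pair is
allowed. An aligner that knows nothing about this constraint can easily
shift one half of a stem relative to the other.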

I guess dedicated rRNA alignment software would be too slow for your
requirements, but by using ClustalW-MPI with some sequences as a
profile you would probably get a fairly good alignment in maybe a
couple of days.


Hope that helps
Paulo



On 30-Jul-09, at 12:19 PM, Nick Holway wrote:

> Hello,
>
> Steve actually posted this on behalf of me, so to cut out the middle
> man I'll answer.
>
> I'm trying to assist a scientist with a bioinformatics project. He's
> trying to align 16S rDNA sequences to identify the bacterial species.
> I launched a Muscle job on his behalf which took ~5.5 days to run (on
> 3GHz "Harpertown" Xeons). The file the scientist gave me had ~5000
> sequences, most of which were 1000-1500 bases long.
>
> I'm trying to persuade the scientist to see if he can reduce the
> number of sequences that he needs to align, and also to see whether
> his data really needs Muscle to run to completion rather than just
> the first two iterations.
>
> My reason for wanting to know if there are any good parallel sequence
> alignment tools is that we've seen some excellent speed increases with
> our MD code. Knowing this scientist I imagine he'll need the entire
> data set to be aligned :)
>
> If you need me to find out any more information from the scientist
> please let me know.
>
> Thanks
>
> Nick
>
> 2009/7/22 Juan Carlos Perin <bic at genome.chop.edu>:
>> Are you looking to align short reads from ngs, or other data?
>>
>> ~ juan
>>
>> On Jul 17, 2009, at 10:41, <slitster at rcn.com> wrote:
>>
>>> Does anyone have recommendations for a parallel sequence alignment
>>> tool?
>>>
>>> User investigation so far has turned up clustalW-MPI, but it seems
>>> to be using an older version of clustalW.
>>>
>>> Any input much appreciated.
>>>
>>> Cheers
>>>
>>> Steve
>>>
>>> _______________________________________________
>>> Bioclusters maillist  -  Bioclusters at bioinformatics.org
>>> http://www.bioinformatics.org/mailman/listinfo/bioclusters
>>>



