[BiO BB] Random Sequence Generator

Joe Landman landman at scalableinformatics.com
Wed Oct 6 10:23:54 EDT 2004


Boris Steipe wrote:

> In this kind of simulation, you assume that all nucleotides are 
> independent, this does not conserve dinucleotide, trinucleotide 
> frequencies etc. If higher order correlations may play a role, it 
> would be more appropriate to randomly sample from the original, rather 
> than simulate a sequence.



It might be better (if you need multi-letter properties to match some 
sequence library set) to sample the distribution of the multi-letter 
words and pull randomly from that, rather than from single letters.  This 
way you can (to an extent) preserve correlations at the di-/tri-/... 
higher orders as required and still sample "randomly", though you will 
miss still higher order patterns (and isn't that what some of the HMM 
tools are for anyway?).

With all due respect, though, please don't use "rand" for random 
numbers.  The Mersenne Twister and other modern pseudo-random number 
generators (PRNGs) have far better statistical properties (period 
length, for one), and decades of work by folks doing Monte Carlo 
simulations in physics and chemistry have shown that the quality of the 
PRNG is quite important.


So what I am saying is that if you need to emit "random patterns" with 
similar di-nucleotide or tri-nucleotide frequencies, you should emit 
di-nucleotides and tri-nucleotides rather than single nucleotides.
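
A rough sketch of the idea, in Python rather than the Perl from the 
original thread.  The function names, the seed, and the toy source 
sequence are made up for illustration.  It counts the overlapping k-mers 
in a source sequence and then emits k-mers drawn in proportion to those 
counts.  Python's "random" module happens to use the Mersenne Twister, so 
no extra library is assumed.

```python
import random
from collections import Counter


def kmer_counts(seq, k):
    """Count the overlapping k-mers (k=2: di-nucleotides) in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def emit_random_sequence(source, k, length, rng):
    """Return `length` letters built from k-mers drawn at random,
    weighted by how often each k-mer occurs in `source`."""
    if k < 1 or len(source) < k:
        raise ValueError("source must contain at least one k-mer")
    counts = kmer_counts(source, k)
    kmers = list(counts)
    weights = [counts[m] for m in kmers]
    n_draws = -(-length // k)  # ceiling division: enough k-mers to cover length
    pieces = rng.choices(kmers, weights=weights, k=n_draws)
    return "".join(pieces)[:length]  # trim any overshoot from the last k-mer


# Hypothetical toy input; in practice this would be the real sequence set.
source = "ACGTACGTTTGACCAGTACGATCGATCGGCTA"
rng = random.Random(2004)  # Mersenne Twister, seeded so the run is repeatable
simulated = emit_random_sequence(source, 2, 40, rng)
print(simulated)
```

Note that the junctions between consecutive emitted k-mers are not 
constrained here, so the source frequencies are preserved only to an 
extent, as described above.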

Joe

[good/readable perl code removed:  ]

-- 
Joseph Landman, Ph.D
Scalable Informatics LLC,
email: landman at scalableinformatics.com
web  : http://scalableinformatics.com
phone: +1 734 612 4615



