[BiO BB] Random Sequence Generator

Boris Steipe boris.steipe at utoronto.ca
Wed Oct 6 10:56:39 EDT 2004


Sorry, I was misleading: "randomly sample" should be "randomly sample 
di-/tri-/... nucleotides from the original". This is to address the 
problem that your original sequence may not be sufficiently long to get 
meaningful frequencies. E.g. a hexamer randomly pulled from a 1 kb 
promoter sequence implicitly represents that sequence's underlying 
hexamer frequencies; but I could not compile reliable frequencies of 
_all_ 4096 hexamers from 1 kb, which contains fewer than 1000 
overlapping hexamers. Of course you can't use this to see whether such 
a hexamer would be overrepresented - you need an independent random 
model for that. But you can look at separations between patterns, 
clustering, correlations and the like.
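
To make the sampling idea concrete, here is a minimal sketch in Python. 
The function name sample_kmers, its parameters and the toy input 
sequence are my own assumptions for illustration, not code from this 
thread. It draws k-mers from random positions in the original and 
concatenates them.

```python
import random


def sample_kmers(sequence, k, n_kmers, rng=None):
    """Concatenate n_kmers k-mers drawn (with replacement) from
    uniformly random start positions in `sequence`."""
    if k < 1 or k > len(sequence):
        raise ValueError("k must be between 1 and len(sequence)")
    rng = rng or random.Random()
    last_start = len(sequence) - k  # last valid start position
    pieces = []
    for _ in range(n_kmers):
        start = rng.randint(0, last_start)  # randint is inclusive
        pieces.append(sequence[start:start + k])
    return "".join(pieces)


# Hypothetical toy input standing in for a real promoter sequence.
original = "ACGTTGCAATGCCGTA" * 4
sampled = sample_kmers(original, k=3, n_kmers=10, rng=random.Random(42))
```

Drawing from random positions is the same as drawing from the 
empirical k-mer distribution of the original, so no frequency table is 
needed. Note that only the composition within each k-mer is carried 
over; the junctions between consecutive k-mers are not constrained by 
the original.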
Be well,

Boris
(I concur on the PRNG issue; however, for applications where "really 
random" is important, why not use true random numbers obtained from a 
physical process? This is easy enough to do, e.g. see 
http://www.lavarnd.org   :-)
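
As a rough illustration of the choice of generator, here is a sketch 
using Python's standard library. random.Random is a Mersenne Twister; 
random.SystemRandom draws from the operating system's entropy source, 
which is not the same thing as a dedicated physical device such as the 
one linked above. The helper random_dna is a hypothetical name.

```python
import random


def random_dna(length, rng):
    """Return `length` independent, uniformly chosen nucleotides."""
    return "".join(rng.choice("ACGT") for _ in range(length))


# Mersenne Twister: seedable, so a run can be reproduced exactly.
a = random_dna(20, random.Random(2004))
b = random_dna(20, random.Random(2004))

# OS entropy source: cannot be seeded, so runs are not reproducible.
c = random_dna(20, random.SystemRandom())
```

A seeded generator is usually what you want for a simulation you need 
to rerun; the OS-backed one trades reproducibility for 
unpredictability.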




On Wednesday, Oct 6, 2004, at 10:23 Canada/Eastern, Joe Landman wrote:

>
>
> Boris Steipe wrote:
>
>> In this kind of simulation, you assume that all nucleotides are 
>> independent; this does not conserve dinucleotide, trinucleotide 
>> frequencies etc. If higher order correlations may play a role, it 
>> would be more appropriate to randomly sample from the original, 
>> rather than simulate a sequence.
>
>
>
> It might be better (if you need multi-letter properties to match some 
> sequence library set) to sample the distribution of the 
> multi-letters, and pull randomly from there rather than from single 
> letters.  This way you can (to an extent) preserve correlations at 
> the di-/tri-/... higher orders as required and still "randomly" 
> sample, though you will miss still higher order patterns (and isn't 
> that what some of the HMM tools are for anyway?).  Though with all 
> due respect, please don't use "rand" for random numbers.  The 
> Mersenne twister and other modern pseudo-random number generators 
> (PRNG) have superior properties, and decades of work on the part of 
> folks doing Monte Carlo work in physics and chemistry have indicated 
> that the quality of the PRNG is quite important.
>
>
> So what I am saying is that if you need to emit "random patterns" with 
> similar di-nucleotide or tri-nucleotide frequencies, you should emit 
> di-nucleotides and tri-nucleotides rather than single nucleotides.
> Joe
>
> [good/readable perl code removed:  ]
>
> -- 
> Joseph Landman, Ph.D
> Scalable Informatics LLC,
> email: landman at scalableinformatics.com
> web  : http://scalableinformatics.com
> phone: +1 734 612 4615
>



