[BiO BB] Random Sequence Generator
Boris Steipe
boris.steipe at utoronto.ca
Wed Oct 6 10:56:39 EDT 2004
Sorry, I was misleading: "randomly sample" should have read "randomly
sample di-/tri-/... nucleotides from the original". This addresses the
problem that your original sequence may not be long enough to yield
meaningful frequencies. E.g., a hexamer pulled at random from a 1 kb
promoter sequence implicitly represents that sequence's underlying
hexamer frequencies, but I could not compile frequencies of _all_ 4096
hexamers from 1 kb, which holds only 995 overlapping hexamer positions.
Of course you can't use this to see whether such a hexamer is
overrepresented; you need an independent random model for that. But you
can look at separations between patterns, clustering, correlations and
the like.
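
To make the sampling idea concrete, here is a minimal sketch in Python
(not the Perl code from earlier in the thread). The function names
sample_kmers and resample_sequence are made up for illustration. It
draws k-mers from uniformly random positions in the original sequence
and concatenates them, so within-k-mer composition follows the
original, while correlations across the junctions between sampled
k-mers are not preserved.

```python
import random

def sample_kmers(source, k, n, rng=None):
    """Return n k-mers, each cut from a uniformly random position in source."""
    if not 1 <= k <= len(source):
        raise ValueError("k must be between 1 and len(source)")
    if rng is None:
        rng = random.Random()
    last_start = len(source) - k  # last valid start index (inclusive)
    return [source[i:i + k]
            for i in (rng.randint(0, last_start) for _ in range(n))]

def resample_sequence(source, k, length, rng=None):
    """Build a sequence of the given length by concatenating sampled k-mers."""
    n = -(-length // k)  # ceiling division: enough k-mers to cover length
    return "".join(sample_kmers(source, k, n, rng))[:length]
```

Calling resample_sequence(promoter, 6, 1000) would give a 1 kb
sequence assembled from hexamers of a (hypothetical) string promoter.
Passing a seeded random.Random as rng makes a run reproducible.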
Be well,
Boris
(I concur on the PRNG issue; however, for applications where "really
random" is important, why not use true random numbers obtained from a
physical process? This is easy enough to do, e.g. see
http://www.lavarnd.org :-)
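
As a rough illustration of the choice of random source, the sketch
below uses Python, whose standard random.Random is a Mersenne Twister
and whose random.SystemRandom reads from the operating system's
entropy source. The latter is only a stand-in here: it is not the
lavarnd service and need not be a "true" physical source. The helper
random_nucleotides is a hypothetical name.

```python
import random

def random_nucleotides(n, rng):
    """Return n independent, uniformly drawn nucleotides using rng."""
    return "".join(rng.choice("ACGT") for _ in range(n))

# Seeded Mersenne Twister: reproducible, fine for most simulations.
prng = random.Random(2004)

# OS entropy source: not reproducible, seeding has no effect.
os_rng = random.SystemRandom()
```

Either generator can be passed wherever a source of randomness is
needed, e.g. random_nucleotides(1000, prng).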
On Wednesday, Oct 6, 2004, at 10:23 Canada/Eastern, Joe Landman wrote:
>
>
> Boris Steipe wrote:
>
>> In this kind of simulation, you assume that all nucleotides are
>> independent, this does not conserve dinucleotide, trinucleotide
>> frequencies etc. If higher order correlations may play a role, it
>> would be more appropriate to randomly sample from the original,
>> rather than simulate a sequence.
>
>
>
> Might be better (if you need multi-letter properties to match some
> sequence library set), to sample the distribution of the
> multi-letters, and pull randomly from there as compared to single
> letters. This way you can (to an extent) preserve correlations at
> the di-/tri-/... higher orders as required, though you will miss still
> higher order patterns (and isn't that what some of the HMM tools are
> for anyway?) and still "randomly" sample. Though with all due
> respect, please don't use "rand" for random numbers. The Mersenne
> twister and other modern pseudo-random number generators (PRNG) have
> superior properties, and decades of work on the part of folks doing
> Monte Carlo work in physics and chemistry have indicated that the
> quality of the PRNG is quite important.
>
>
> So what I am saying is that if you need to emit "random patterns" with
> similar di-nucleotide or tri-nucleotide frequencies, that you emit
> di-nucleotides and tri-nucleotides versus single nucleotides.
> Joe
>
> [good/readable perl code removed: ]
>
> --
> Joseph Landman, Ph.D
> Scalable Informatics LLC,
> email: landman at scalableinformatics.com
> web : http://scalableinformatics.com
> phone: +1 734 612 4615
>
> _______________________________________________
> BiO_Bulletin_Board maillist - BiO_Bulletin_Board at bioinformatics.org
> https://bioinformatics.org/mailman/listinfo/bio_bulletin_board