[BiO BB] Random Sequence Generator
Boris Steipe
boris.steipe at utoronto.ca
Wed Oct 6 09:45:53 EDT 2004
On Wednesday, Oct 6, 2004, at 08:18 Canada/Eastern, Fei Li wrote:
> I do not think we can find a general random sequence generator for all
> the case. At least, we need consider GC content, and maybe even
> content of "GC", "GG", "GT", "GA"......"TC"....."TA", etc. A "real"
> random sequnce is most likely useless in a specific case.
>
> Fei
>
Here is how this is done in principle: random characters with
arbitrary, predefined target frequencies. (The code is a teaching
example for non-Perl programmers, VERY explicit, one could of course do
this MUCH more concisely in "real" Perl, at the expense of legibility).
The target frequencies do not need to sum to 1.0, they can be raw
counts from an example sequence. Changing this to amino acids instead
of nucleotides should be straightforward.
In this kind of simulation, you assume that all nucleotides are
independent, this does not conserve dinucleotide, trinucleotide
frequencies etc. If higher order correlations may play a role, it would
be more appropriate to randomly sample from the original, rather than
simulate a sequence.
Boris
=======================================================================
#!/usr/bin/perl
use strict;
use warnings;
my $OutputLength = 100;
my @Character; # Stores the emitted characters
my @Frequency; # Stores the target frequency of each
character
my @Interval; # Stores the summed probability of each
character
initializeArrays();
for (my $i=0; $i < $OutputLength; $i++) {
print getCharacter(rand);
}
print "\n";
exit();
# ====== Subroutines =======================
sub initializeArrays {
my $Sum = 0;
# Initialize the Character array - we enter target frequencies
explicitly
# but this could also be the result of counting amino acids in a
database,
# or nucleotides in a genome etc.
# In this example we generate a skewed nucleotide composition
$Character[0] = 'A'; $Frequency[0] = 17;
$Character[1] = 'G'; $Frequency[1] = 27;
$Character[2] = 'C'; $Frequency[2] = 23;
$Character[3] = 'T'; $Frequency[3] = 19;
# Convert the values of the target frequencies array to the top
boundaries
# of intervals having relative widths proportional to their
relative frequency
# and ranging from 0 to 1.0
for (my $i=0; $i <= $#Frequency; $i++) {
$Sum += $Frequency[$i];
}
for (my $i=0; $i <= $#Frequency; $i++) {
$Interval[$i] = 0;
for (my $j=0; $j <= $i; $j++) {
$Interval[$i] += $Frequency[$j] / $Sum;
}
}
return();
}
# ==========================================
sub getCharacter {
my ($RandomNumber) = @_;
my $i=0;
# Which interval does the random number fall into ? This interval
has the
# index of the character we output.
for ($i=0; $i <= $#Interval; $i++) {
if ($Interval[$i] > $RandomNumber) { last; }
}
return($Character[$i]);
}
More information about the BBB
mailing list