[BiO BB] Random Sequence Generator

Wed Oct 6 09:45:53 EDT 2004

On Wednesday, Oct 6, 2004, at 08:18 Canada/Eastern, Fei Li wrote:

> I do not think we can find a general random sequence generator for all 
> the case.  At least, we need consider GC content, and maybe even 
> content of  "GC", "GG", "GT", "GA"......"TC"....."TA", etc.   A "real" 
> random sequnce is most likely useless in a specific case.
>  
> Fei
>

Here is how this is done in principle: random characters with 
arbitrary, predefined target frequencies. (The code is a teaching 
example for non-Perl programmers, VERY explicit, one could of course do 
this MUCH more concisely in "real" Perl, at the expense of legibility). 
The target frequencies do not need to sum to 1.0, they can be raw 
counts from an example sequence. Changing this to amino acids instead 
of nucleotides should be straightforward.

In this kind of simulation, you assume that all nucleotides are 
independent, this does not conserve dinucleotide, trinucleotide 
frequencies etc. If higher order correlations may play a role, it would 
be more appropriate to randomly sample from the original, rather than 
simulate a sequence.

Boris
=======================================================================
#!/usr/bin/perl
use strict;
use warnings;

my $OutputLength = 100;
my @Character;             # Stores the emitted characters
my @Frequency;             # Stores the target frequency of each 
character
my @Interval;              # Stores the summed probability of each 
character
initializeArrays();

for (my $i=0; $i < $OutputLength; $i++) {
     print getCharacter(rand);
}
print "\n";

exit();

# ====== Subroutines =======================
sub initializeArrays {
     my $Sum = 0;
     # Initialize the Character array - we enter target frequencies 
explicitly
     # but this could also be the result of counting amino acids in a 
database,
     # or nucleotides in a genome etc.

     # In this example we generate a skewed nucleotide composition
     $Character[0] = 'A'; $Frequency[0] = 17;
     $Character[1] = 'G'; $Frequency[1] = 27;
     $Character[2] = 'C'; $Frequency[2] = 23;
     $Character[3] = 'T'; $Frequency[3] = 19;

     # Convert the values of the target frequencies array to the top 
boundaries
     # of intervals having relative widths proportional to their 
relative frequency
     # and ranging from 0 to 1.0

     for (my $i=0; $i <= $#Frequency; $i++) {
         $Sum += $Frequency[$i];
     }

     for (my $i=0; $i <= $#Frequency; $i++) {
         $Interval[$i] = 0;
         for (my $j=0; $j <= $i; $j++) {
            $Interval[$i] += $Frequency[$j] / $Sum;
         }
     }
     return();
}

# ==========================================
sub getCharacter {
     my ($RandomNumber) = @_;
     my $i=0;

     # Which interval does the random number fall into ? This interval 
has the
     # index of the character we output.
     for ($i=0; $i <= $#Interval; $i++) {
         if ($Interval[$i] > $RandomNumber) { last; }
     }
     return($Character[$i]);
}