[BiO BB] generate random motifs

Thu Jul 14 11:38:52 EDT 2005

Maya,

You had asked me whether some code I had posted here in March can be  
adapted for your purpose
(http://bioinformatics.org/pipermail/bio_bulletin_board/2005-March/ 
002379.html).

I have attached the adapted version below. Just a little programming  
fun on a hot summer Thursday morning in Toronto. I hope this helps.  
Just edit the definitions of motif and alphabet for your purposes.

<disclaimer>This is what I consider "teaching code" i.e. with an  
emphasis to demonstrate _how_ something can be done, with the newcomer  
in mind; especially if you want to go in and change things for your own  
purpose. Perl has many constructs to write something like this much  
more concisely, but that's not the point here.</disclaimer>

Have fun,

Boris

On Wednesday, Jul 13, 2005, at 13:21 America/Montreal, Gao Zhang wrote:

> Hi all,
>  
> I need some help to generate random motifs. The goal is to
> generate a random motif of width w.  Each position will
> have a dominant letter with probability around 0.85.
>  
> I know there is python module TAMO  
> (http://jura.wi.mit.edu/fraenkel/TAMO/documentation/ 
> TAMO.MotifTools.html#Motif-emit )
> which can do this job.
>  
> Can anybody know how to do this using Perl/Bioperl?
>  
> Thank you very much and look forward to your reply!
>  
> Best Regards,
>    Maya
>

#!/usr/bin/perl
# randommotifs.pl
#
# This program generates randomized instances of pseudomotifs, i.e.
# sequences that are obtained by adding "noise", according to a  
specified
# model to a consensus sequence. To keep this reasonably general, we
# proceed through the following steps:
#
#    Define consensus motif and alphabet
#    For the required number of repetitions
#       For each position in motif
#          Define an appropriate probability distribution over the  
alphabet
#          Produce a character according to the distribution
#       ...
#    ...
#
# The probability distributions are generated to produce a consensus  
character
# with a particular "consensus probability" and all other characters
# with equal probabilites (the noise). This is defined in a single  
subroutine
# so it is straightforward to change this for a different model of noise
# (e.g. increase the probabilities for "similar" characters). It's easy
# to become creative here and incorporate (biological) knowledge into  
the
# noise-model.
#
# Probablities are then converted into cumulative intervals to chose a
# particular character:
#
# Eg: Probabilities  A: 0.4, C: 0.3, G: 0.2, T: 0.1
#     Intervals      A: 0.4  C: 0.7, G: 0.9, T: 1.0
#
# This example can be pictured as
#
#   0.0  0.1  0.2  0.3  0.4  0.5  0.6  0.7  0.8  0.9  1.0
#    +----+----+----+----+----+----+----+----+----+----+
#    |         A         |      C       |    G    | T  |
#    +----+----+----+----+----+----+----+----+----+----+
#
# .. now all we need to do is generate a random number between 0 and 1
# and see in which interval it falls. This then points to the character  
to be
# written to output.
#
# B. Steipe, July 2005. Public domain code.
# =====================================================================

use strict;
use warnings;

my $N_Sequences = 100;                # Number of sequences to be  
generated
my @Motif = split(//,'TTGACATATAAT'); # This example is simply a  
concatenation
                                       # of the -35 and -10 consensus  
elements
                                       # of prokaryotic promoters;  
replace with
                                       # whatever ...
my @Alphabet = split(//,'ACGT');      # Nucleotides; replace with  
whatever ...
my $P_Consensus = 0.85;               # Replace with whatever ...

# ====== Globals ==========================
my @Probabilities;              # Stores the probability of each  
character
# ====== Program ==========================

for (my $i=0; $i < $N_Sequences; $i++) {
     for (my $j=0; $j < scalar(@Motif); $j++) {

         loadConsensusCharacter($Motif[$j]);    # Load the character for  
this position
         addNoiseToDistribution();              # Modify probabilities  
according to noise model
         convertToIntervals();
         print(getRandomCharacter(rand(1.0)));  # Output character
     }
print "\n";
}

exit();

# ====== Subroutines =======================
#
sub loadConsensusCharacter {
     my ($char) = @_;
     my $Found = 'FALSE';

     for (my $i=0; $i < scalar(@Alphabet); $i++) {
         if ( $char eq $Alphabet[$i]) {	# found character in i_th  
position in Alphabet
             $Probabilities[$i] = 1.0;
             $Found = 'TRUE';
         } else {
             $Probabilities[$i] = 0.0;
         }
     }
     if ($Found eq 'FALSE') {
     die("Panic: Motif-Character\"$char\" was not found in Alphabet.  
Aborting.\n");
     }

return();
}

# ==========================================
sub addNoiseToDistribution {

     # This implements a very simple noise model:
     # We work on an array in which one element is 1.0 and
     # all others are 0.0. The element with 1.0 has the same
     # index as the consensus character in the Alphabet.
     #
     # We change that value to the consensus probability and
     # we distribute the remaining probability equally among
     # all other elements.
     #
     # It should be straightforward to implement more elaborate
     # noise models, or use a position specific scoring matrix
     # or something else here.

     my $P_NonConsensus = (1.0-$P_Consensus) / (scalar(@Alphabet) - 1);

     for (my $i=0; $i < scalar(@Probabilities); $i++) {
         if ( $Probabilities[$i] == 1.0 ) {	# ... this is the consensus  
character
             $Probabilities[$i] = $P_Consensus;
         } else {
             $Probabilities[$i] = $P_NonConsensus;
         }
     }

     return();
}

# ==========================================
sub convertToIntervals {

     my $Sum = 0;

     # Convert the values of the probabilities array to the top  
boundaries
     # of intervals having widths proportional to their relative  
frequency.
     # Numbers range from 0 to 1.0  ...

     for (my $i=1; $i < scalar(@Probabilities); $i++) {
         $Probabilities[$i] += $Probabilities[$i-1];
     }

     return();
}

# ==========================================
sub getRandomCharacter {

     my ($RandomNumber) = @_;
     my $i=0;

     # Test which interval contains the RandomNumber ...
     # The index of the interval points to the Event we should choose.

     for ($i=0; $i < scalar(@Probabilities); $i++) {
         if ($Probabilities[$i] > $RandomNumber) { last; }
     }

     return($Alphabet[$i]);
}