[BiO BB] Restriction sites frequencies in mouse genome

Harry Mangalam harry.mangalam at uci.edu
Wed Sep 6 15:43:08 EDT 2006


If by calculating frequencies you mean finding all the sites in a 
genome, tacg will do this.  It will find all the sites you give it 
(I've tested it on all human chromosome assemblies) and will also 
report the predicted frequency based on the base pair distribution.
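
For anyone who would rather script the counting, as the quoted message 
below considers, here is a minimal Python sketch of this kind of scan. 
It is not how tacg works internally. The enzyme list, the function 
names and the choice to count a non-palindromic site on both strands 
are assumptions made for illustration. It holds one FASTA record in 
memory at a time, so RAM use scales with the largest chromosome.

```python
import re
import sys
from collections import Counter

# IUPAC nucleotide codes mapped to regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "W": "[AT]", "S": "[CG]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}
COMPLEMENT = str.maketrans("ACGTRYWSKMBDHVN", "TGCAYRWSMKVHDBN")

# Illustrative subset of recognition sequences (hypothetical selection).
ENZYMES = {"EcoRI": "GAATTC", "BamHI": "GGATCC",
           "NotI": "GCGGCCGC", "FokI": "GGATG"}


def reverse_complement(site):
    """Return the reverse complement of an IUPAC recognition sequence."""
    return site.translate(COMPLEMENT)[::-1]


def compile_site(site):
    """Compile a site as a zero-width lookahead so overlapping hits count."""
    return re.compile("(?=" + "".join(IUPAC[b] for b in site) + ")")


def read_fasta(handle):
    """Yield (name, sequence) pairs, one record at a time, upper-cased."""
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks).upper()
            header = line[1:].split()
            name, chunks = (header[0] if header else ""), []
        elif line:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks).upper()


def count_sites(seq, enzymes):
    """Count hits per enzyme on the top strand, searching both orientations.

    A palindromic site equals its reverse complement, so the set below
    holds one pattern and the site is counted once.
    """
    counts = Counter()
    for enzyme, site in enzymes.items():
        for pattern in {site, reverse_complement(site)}:
            counts[enzyme] += sum(1 for _ in compile_site(pattern).finditer(seq))
    return counts


if __name__ == "__main__":
    # Usage: python count_sites.py genome.fa
    with open(sys.argv[1]) as handle:
        for name, seq in read_fasta(handle):
            for enzyme, hits in sorted(count_sites(seq, ENZYMES).items()):
                print(f"{name}\t{enzyme}\t{hits}")
```

Plain regular expressions are fine for a handful of enzymes, but each 
pattern rescans the whole sequence, so a full enzyme set will be much 
slower than a purpose-built tool.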

tacg can theoretically do the entire genome in one shot if you have 
enough RAM, but I've never tried it and the output would be pretty 
ferocious.
For example, for chromosome 21 (a paltry 33.6MB), the summary output 
is:

## Sequence: #1; from file: UNAVAILABLE
   Format: FASTA; ID: gi:89161201; Description: Homo sapiens chromosome 21, alternate assembly (based on Celera assembly), whole genome shotgun sequence.

== Sequence info:

    NB: sequence length > A+C+G+T due to -> 224404 <- IUPAC degeneracies.
    # of:  N:224404  Y:0  R:0  W:0  S:0  K:0  M:0  B:0  D:0  H:0  V:0

   #s below are for top strand; 'sites exp' values calculated on the basis of both strands.
   33216610 bases; 9772353 A(29.42 %)  6752472 C(20.33 %)  6753971 G(20.33 %)  9713410 T(29.24 %)

== Enzymes that DO NOT MAP to this sequence:

        There were NO NON-matches - ALL patterns matched at least ONCE.


== Total Number of Hits per Enzyme:
    AatII   1068      BsiEI   1803      EcoRV   4841       PsiI  20384
     AccI  12230    BsiHKAI  23981       FauI  18509      PspGI 112279
    AccII   9733      BsiWI    174     Fnu4HI  74994     PspOMI   6067
   Acc65I   3021       BslI  91011       FokI  59656       PstI  15561
     AciI  52859       BsmI  13955       FseI    235       PvuI    181
     AclI   2047      BsmAI  73662       FspI   1211      PvuII  12841
     AfeI   1406      BsmBI   7619      HaeII   7030       RsaI  56361
    AflII   7226      BsmFI  45828     HaeIII  99508      RsrII    126
   AflIII  18426   Bsp1286I  57995       HgaI   8115       SacI   6829
     AgeI    676      BspEI   1246       HhaI  21013      SacII    893
     AhdI   3149      BspHI  11844     HinP1I  21013       SalI    392
     AluI 143869      BspMI  16591     HincII  13046      SanDI   3409
     AlwI  37296       BsrI  63802    HindIII   9457       SapI   4316
    AlwNI  16140      BsrBI   2994      HinfI  96900     Sau96I  77627
     ApaI   6067      BsrDI  16179       HpaI   4478     Sau3AI  79640
    ApaLI   6042      BsrFI   4609      HpaII  29934       SbfI   1068
     ApoI  74171      BsrGI   9408       HphI  67904       ScaI   5880
     AscI     47     BssHII    890       KasI   2793      ScrFI 137189
     AseI  17631      BssKI 137189       KpnI   3021      SexAI   3472
     AvaI  12916      BssSI   5101      MaeII  28783      SfaNI  42093
    AvaII  31938     BstAPI   9253     MaeIII  83257       SfcI  39408
    AvrII   6112      BstBI   1256      MboII 100007       SfiI    599
     BaeI   2868     Bst4CI  87767       MfeI   6359       SfoI   2793
     BaeI   2868     BstDSI  14918       MluI    334       SgfI     13
    BamHI   4165     BstEII   4065       MlyI  44962      SgrAI    214
     BanI  18704     BstF5I  59661       MnlI 308118       SmaI   4948
    BanII  27893      BstNI 112279       MscI  14579       SmlI  29332
     BbeI   2793      BstUI   9733       MseI 226716      SnaBI   1598
     BbsI  16623      BstXI  19685       MslI  38862       SpeI   4362
     BbvI  63057      BstYI  24349     MspA1I  17762       SphI   6477
    BbvCI  14806    BstZ17I   4605       MwoI  73785       SrfI    302
     BcgI   3733     Bsu36I  10646       NaeI   1898       SspI  28450
     BcgI   3733       BtgI  14918       NarI   2793       StuI   8988
    BciVI   7495       BtrI   3836       NciI  24927       StyI  34781
     BclI   8350      Cac8I  66066       NcoI   8941       SwaI   2801
     BfaI  83296       ClaI   1121       NdeI  10096       TaiI  28783
     BglI   6550      Csp6I  56361     NgoMIV   1898       TaqI  17908
    BglII   8895      CviJI 507227       NheI   2770       TatI  30303
     BlpI   6131      CviRI 168208     NlaIII 161486       TfiI  51945
     BmrI  19063       DdeI 155096      NlaIV  87348       TliI   1496
     BplI  11478       DpnI  79640       NotI    127       TseI  63101
     BpmI  32957       DraI  41466       NruI    209     Tsp45I  47283
   Bpu10I  25858     DraIII   6989       NsiI  11383    Tsp509I 254887
     BsaI  18254       DrdI   3165       NspI  36783      TspRI  98632
    BsaAI   9382       EaeI  20232       PacI   1946    Tth111I   7783
    BsaBI   4988       EagI   1139       PciI  12666       XbaI   9158
    BsaHI   6162       EarI  25525      PflMI  11275       XcmI   9507
    BsaJI 121468       EciI   6774       PleI  44962       XhoI   1496
    BsaWI   3529   Ecl136II   6829       PmeI    539       XmaI   4948
   BseMII 104754     Eco57I  24123       PmlI   4081       XmnI  11146
    BseRI  23673      EcoNI   8774     Ppu10I  11383
    BseSI  25059   EcoO109I  28937      PpuMI  12989
     BsgI  24191      EcoRI   8938      PshAI   3251

To get the actual predicted number of sites, you have to generate the 
Sites info, which would be enormous but easily sed-able to extract 
what you need.
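
As a rough illustration of what a composition-based prediction looks 
like, the Python sketch below estimates expected site counts from the 
base counts in the chromosome 21 summary above. It assumes every base 
is drawn independently with those frequencies and ignores the N 
positions. tacg's own 'sites exp' calculation may well differ, so 
treat the function names and the model as assumptions, not as a 
description of tacg.

```python
# Top-strand base counts copied from the chromosome 21 summary above.
COUNTS = {"A": 9772353, "C": 6752472, "G": 6753971, "T": 9713410}

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(site):
    """Reverse complement of an unambiguous (A/C/G/T only) site."""
    return "".join(COMPLEMENT[base] for base in reversed(site))


def site_probability(site, freqs):
    """Chance that the site starts at a given position, assuming bases
    are independent with the given frequencies."""
    p = 1.0
    for base in site:
        p *= freqs[base]
    return p


def expected_sites(site, counts):
    """Expected number of sites counting both strands.

    A palindromic site reads the same on both strands, so it is counted
    once; otherwise the reverse complement adds its own expectation.
    """
    total = sum(counts.values())  # A+C+G+T only, N positions excluded
    freqs = {base: n / total for base, n in counts.items()}
    positions = total - len(site) + 1
    p = site_probability(site, freqs)
    rc = reverse_complement(site)
    if rc != site:
        p += site_probability(rc, freqs)
    return positions * p


if __name__ == "__main__":
    for name, site in [("EcoRI", "GAATTC"), ("NotI", "GCGGCCGC"),
                       ("FokI", "GGATG")]:
        print(f"{name:6s} {site:9s} {expected_sites(site, COUNTS):12.1f}")
```

Real genomes are not independent draws (CpG depletion and repeats skew 
things considerably), so estimates like this are only a baseline to 
compare the observed hit counts against.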

This took 9.5s on a 2GHz Opteron running 64-bit Linux.

If you want, I'll send you the source tarball in a separate email.

hjm


On Tuesday 29 August 2006 05:35, Benoit VARVENNE wrote:
> Hello everybody,
>
> Thanks to all for your ideas and suggestions. I think I'm going to
> consider Perl programming to calculate restriction site frequencies,
> as the software mentioned in your mails (plus software I found)
> doesn't seem to be usable at whole-genome scale. Programming was to
> be avoided for this study, but it seems to be the only solution. I'm
> really surprised not to be able to find an existing study of this kind.
>
> Thanks again,
> Regards,
>
> Benoît Varvenne,
> Bioinformatics person in charge,
> Genoway Lyon - France.
>
> On 28/08/06 11:34, « Benoit VARVENNE » <varvenne at genoway.com> wrote:
> > Dear Members,
> >
> > I am a new member of this mailing list and I don't know if such a
> > post will draw the attention of anyone here. So excuse me in
> > advance if my subject is not appropriate.
> > I am searching for a way to calculate restriction site frequencies
> > in the mouse genome (so sequences from 6 to 13bp). I have already
> > tried to do so using blast (or blast-like) tools and configuring
> > them as needed, but it gave no results, because of too many
> > hits, I think.
> >
> > I would be very grateful if someone could help me on this topic.
> >
> > Thanks a lot for your help,
> > Best regards,
> >
> > Benoît Varvenne,
> > Bioinformatics person in charge,
> > Genoway Lyon - France
> >
> > _______________________________________________
> > General Forum at Bioinformatics.Org -
> > BiO_Bulletin_Board at bioinformatics.org
> > https://bioinformatics.org/mailman/listinfo/bio_bulletin_board
>

-- 
Harry Mangalam - Research Computing at NACS, E2148, Engineering Gateway, 
UC Irvine 92697  949 824 0084(o), 949 285 4487(c) 
harry.mangalam at uci.edu


