[BiO BB] Restriction sites frequencies in mouse genome
    Harry Mangalam 
    harry.mangalam at uci.edu
       
    Thu Sep  7 12:23:39 EDT 2006
    
    
  
On Thursday 07 September 2006 00:39, Benoit VARVENNE wrote:
> Hello,
>
> Harry,
> Thanks for your answer. I'd be very interested in having this code.
>
> First I only had to calculate frequencies in the mouse genome, but
> now things have changed... I'm interested in having the positions
> of hits and in calculating distributions, fragment lengths, ...
tacg can do all of the above with no problems other than the size of 
the output (if you ask for all the hits for a 4-cutter in 200MB, 
you'll get lots of output). It can generate output for plotting these 
kinds of distributions directly with gnuplot, or in a few different 
table formats (see the -G option).
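
For readers who want to see what "positions of hits" and "fragment 
length" amount to, here is a minimal Python sketch. It is not tacg and 
says nothing about how tacg works internally. The function names and 
the toy sequence are made up for illustration, and it cuts at the 
start of each recognition site rather than at the enzyme's real cut 
offset, which is a simplifying assumption.

```python
import re
from collections import Counter


def site_positions(seq, site):
    """Return 0-based start positions of every match of `site` in `seq`.

    A zero-width lookahead is used so that overlapping matches are reported.
    """
    pattern = re.compile("(?=" + re.escape(site.upper()) + ")")
    return [m.start() for m in pattern.finditer(seq.upper())]


def fragment_lengths(positions, seq_len):
    """Lengths of the pieces left after cutting a linear sequence.

    Assumption: the cut is placed at the start of each recognition site.
    Zero-length pieces (a site at position 0) are dropped.
    """
    cuts = [0] + sorted(positions) + [seq_len]
    return [b - a for a, b in zip(cuts, cuts[1:]) if b > a]


# Toy sequence with three EcoRI sites (GAATTC); purely illustrative.
seq = "GAATTC" + "AAA" + "GAATTC" + "CCGG" + "GAATTC"
hits = site_positions(seq, "GAATTC")
frags = fragment_lengths(hits, len(seq))
histogram = Counter(frags)  # fragment length -> number of fragments

print("hit positions:", hits)
print("fragment lengths:", frags)
```

On a real chromosome the list of hits gets large quickly, so in 
practice you would probably stream the sequence and write positions 
to a file rather than hold everything in memory.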
> The next step will be to make the link between hits found and
> corresponding features available in Ensembl databases (site in an
> existing gene, centromere, repeat regions, ...).
> I think I'm going to use the Ensembl Perl API to do so.
Unfortunately, tacg will not do this directly for now. Your stated 
approach is probably best.  
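
Once you have feature coordinates in hand, linking hits to features 
is an interval lookup. The Python sketch below shows only that step. 
It does not call the Ensembl API; the feature list, the names in it 
and the function name are hypothetical, and coordinates are assumed 
to be 0-based with an exclusive end.

```python
from bisect import bisect_right


def annotate_hits(hits, features):
    """Map each hit position to the names of the features that contain it.

    `features` is a list of (start, end, name) tuples, 0-based, end-exclusive.
    Features may overlap or nest, so every feature that starts at or before
    the hit is checked against its end coordinate.
    """
    feats = sorted(features)
    starts = [f[0] for f in feats]
    result = {}
    for pos in hits:
        upto = bisect_right(starts, pos)  # features starting at or before pos
        result[pos] = [name for start, end, name in feats[:upto] if pos < end]
    return result


# Hypothetical features; real ones would come from an annotation source.
features = [
    (100, 500, "geneA"),
    (400, 450, "repeat1"),
    (1000, 1200, "geneB"),
]
annotation = annotate_hits([50, 420, 1100], features)
for pos, names in annotation.items():
    print(pos, names if names else "intergenic")
```

This is fine for modest feature counts. For a whole genome's worth of 
features an interval tree or a sorted sweep over both lists would 
scale better, since the list scan here grows with the number of 
features upstream of each hit.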
The src is on its way.
hjm
> If anyone has other ideas, I'd be very interested in them.
>
> If anyone's interested, I've got a general Perl script, optimized
> for memory use and performance, for counting the hits of a sequence
> (or a pattern) in very big sequences (like chromosomes or a
> genome). Let me know if you want it.
> For the moment it does not handle a list of program entries or
> store positions.
>
>
> Regards,
>
> Benoit Varvenne,
> Bioinformatics person in charge,
> Genoway Lyon - France.
>
> On 6/09/06 21:43, "Harry Mangalam" <harry.mangalam at uci.edu> wrote:
> > If by calculating frequencies, you want to find all the sites in
> > a genome, tacg will do this.  It will find all the sites you give
> > it (I've tested it on all human chromosome assemblies) as well as
> > the predicted frequency based on the base pair distribution.
> >
> > It can theoretically do the entire genome in one shot if you have
> > enough RAM, but I've never tried it and the output would be
> > pretty ferocious.
> > For example, for chromosome 21 (a paltry 33.6MB), the summary
> > output is:
> >
> > ## Sequence: #1; from file: UNAVAILABLE
> >  Format: FASTA; ID: gi:89161201; Description: Homo sapiens chromosome 21, alternate assembly (based on Celera assembly), whole genome shotgun sequence.
> >
> > == Sequence info:
> >
> >   NB: sequence length > A+C+G+T due to -> 224404 <- IUPAC degeneracies.
> >   # of:  N:224404  Y:0  R:0  W:0  S:0  K:0  M:0  B:0  D:0  H:0  V:0
> >
> >  #s below are for top strand; 'sites exp' values calculated on the basis of both strands.
> >  33216610 bases; 9772353 A(29.42 %)  6752472 C(20.33 %)  6753971 G(20.33 %)  9713410 T(29.24 %)
> >
> > == Enzymes that DO NOT MAP to this sequence:
> >
> >       There were NO NON-matches - ALL patterns matched at least ONCE.
> >
> >
> > == Total Number of Hits per Enzyme:
> >      AatII   1068      BsiEI   1803      EcoRV   4841       PsiI  20384
> >       AccI  12230    BsiHKAI  23981       FauI  18509      PspGI 112279
> >      AccII   9733      BsiWI    174     Fnu4HI  74994     PspOMI   6067
> >     Acc65I   3021       BslI  91011       FokI  59656       PstI  15561
> >       AciI  52859       BsmI  13955       FseI    235       PvuI    181
> >       AclI   2047      BsmAI  73662       FspI   1211      PvuII  12841
> >       AfeI   1406      BsmBI   7619      HaeII   7030       RsaI  56361
> >      AflII   7226      BsmFI  45828     HaeIII  99508      RsrII    126
> >     AflIII  18426   Bsp1286I  57995       HgaI   8115       SacI   6829
> >       AgeI    676      BspEI   1246       HhaI  21013      SacII    893
> >       AhdI   3149      BspHI  11844     HinP1I  21013       SalI    392
> >       AluI 143869      BspMI  16591     HincII  13046      SanDI   3409
> >       AlwI  37296       BsrI  63802    HindIII   9457       SapI   4316
> >      AlwNI  16140      BsrBI   2994      HinfI  96900     Sau96I  77627
> >       ApaI   6067      BsrDI  16179       HpaI   4478     Sau3AI  79640
> >      ApaLI   6042      BsrFI   4609      HpaII  29934       SbfI   1068
> >       ApoI  74171      BsrGI   9408       HphI  67904       ScaI   5880
> >       AscI     47     BssHII    890       KasI   2793      ScrFI 137189
> >       AseI  17631      BssKI 137189       KpnI   3021      SexAI   3472
> >       AvaI  12916      BssSI   5101      MaeII  28783      SfaNI  42093
> >      AvaII  31938     BstAPI   9253     MaeIII  83257       SfcI  39408
> >      AvrII   6112      BstBI   1256      MboII 100007       SfiI    599
> >       BaeI   2868     Bst4CI  87767       MfeI   6359       SfoI   2793
> >       BaeI   2868     BstDSI  14918       MluI    334       SgfI     13
> >      BamHI   4165     BstEII   4065       MlyI  44962      SgrAI    214
> >       BanI  18704     BstF5I  59661       MnlI 308118       SmaI   4948
> >      BanII  27893      BstNI 112279       MscI  14579       SmlI  29332
> >       BbeI   2793      BstUI   9733       MseI 226716      SnaBI   1598
> >       BbsI  16623      BstXI  19685       MslI  38862       SpeI   4362
> >       BbvI  63057      BstYI  24349     MspA1I  17762       SphI   6477
> >      BbvCI  14806    BstZ17I   4605       MwoI  73785       SrfI    302
> >       BcgI   3733     Bsu36I  10646       NaeI   1898       SspI  28450
> >       BcgI   3733       BtgI  14918       NarI   2793       StuI   8988
> >      BciVI   7495       BtrI   3836       NciI  24927       StyI  34781
> >       BclI   8350      Cac8I  66066       NcoI   8941       SwaI   2801
> >       BfaI  83296       ClaI   1121       NdeI  10096       TaiI  28783
> >       BglI   6550      Csp6I  56361     NgoMIV   1898       TaqI  17908
> >      BglII   8895      CviJI 507227       NheI   2770       TatI  30303
> >       BlpI   6131      CviRI 168208     NlaIII 161486       TfiI  51945
> >       BmrI  19063       DdeI 155096      NlaIV  87348       TliI   1496
> >       BplI  11478       DpnI  79640       NotI    127       TseI  63101
> >       BpmI  32957       DraI  41466       NruI    209     Tsp45I  47283
> >     Bpu10I  25858     DraIII   6989       NsiI  11383    Tsp509I 254887
> >       BsaI  18254       DrdI   3165       NspI  36783      TspRI  98632
> >      BsaAI   9382       EaeI  20232       PacI   1946    Tth111I   7783
> >      BsaBI   4988       EagI   1139       PciI  12666       XbaI   9158
> >      BsaHI   6162       EarI  25525      PflMI  11275       XcmI   9507
> >      BsaJI 121468       EciI   6774       PleI  44962       XhoI   1496
> >      BsaWI   3529   Ecl136II   6829       PmeI    539       XmaI   4948
> >     BseMII 104754     Eco57I  24123       PmlI   4081       XmnI  11146
> >      BseRI  23673      EcoNI   8774     Ppu10I  11383
> >      BseSI  25059   EcoO109I  28937      PpuMI  12989
> >       BsgI  24191      EcoRI   8938      PshAI   3251
> >
> > To get the actual predicted number of sites, you have to generate
> > the Sites info, which would be enormous but easy to filter with
> > sed to extract what you need.
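
As a rough cross-check on what a frequency "predicted from the base 
pair distribution" looks like, the Python sketch below uses the 
simplest possible model: every position is drawn independently from 
the observed base composition. This is an assumption made for 
illustration and is not necessarily how tacg computes its 'sites exp' 
values. The base counts are the chromosome 21 figures from the 
summary quoted in this thread; the function and variable names are 
made up.

```python
# Base counts from the chromosome 21 summary quoted in this thread.
BASE_COUNTS = {"A": 9772353, "C": 6752472, "G": 6753971, "T": 9713410}

# Standard IUPAC nucleotide codes, so degenerate sites can be handled too.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "W": "AT", "S": "CG", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def expected_sites(site, base_counts):
    """Expected number of matches of `site` on one strand.

    Assumes independent positions with the given base composition, and
    ignores N runs by using only the A+C+G+T total as the sequence length.
    """
    total = sum(base_counts.values())
    freq = {base: count / total for base, count in base_counts.items()}
    p = 1.0
    for symbol in site.upper():
        p *= sum(freq[base] for base in IUPAC[symbol])
    windows = total - len(site) + 1  # number of start positions
    return p * windows


ecori = expected_sites("GAATTC", BASE_COUNTS)
print(f"EcoRI (GAATTC): about {ecori:.0f} sites expected under this model")
```

A palindromic site such as GAATTC sits at the same positions on both 
strands, so the one-strand figure is the one to compare with the 
observed count. For a non-palindromic site you would add the 
expectation for its reverse complement. Real genomes are far from 
independent (CpG depletion, repeats), so expect large deviations for 
some enzymes.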
> >
> > This took 9.5s on a 2GHz Opteron running 64-bit Linux.
> >
> > If you want, I'll send you the source tarball in a separate
> > email.
> >
> > hjm
> >
> > On Tuesday 29 August 2006 05:35, Benoit VARVENNE wrote:
> >> Hello everybody,
> >>
> >> Thanks to all for your ideas and suggestions. I think I'm going
> >> to consider Perl programming to calculate restriction site
> >> frequencies, as the software mentioned in your mails (plus the
> >> software I found) doesn't seem to be usable at whole-genome
> >> scale. Programming was to be avoided for this study, but it seems
> >> to be the only solution. I'm really surprised not to be able to
> >> find a study like this already done.
> >>
> >> Thanks again,
> >> Regards,
> >>
> >> Benoît Varvenne,
> >> Bioinformatics person in charge,
> >> Genoway Lyon - France.
> >>
> >> On 28/08/06 11:34, "Benoit VARVENNE" <varvenne at genoway.com> wrote:
> >>> Dear Members,
> >>>
> >>> I am a new member of this mailing list and I don't know if such
> >>> a post will draw anyone's attention here, so excuse me in
> >>> advance if my subject is not appropriate.
> >>> I am searching for a way to calculate restriction site
> >>> frequencies in the mouse genome (so sequences from 6 to 13bp). I
> >>> have already tried to do so using blast (or blast-like) tools,
> >>> configuring them as needed, but it gave no results, because of
> >>> too many hits I think.
> >>>
> >>> I would be very grateful if someone could help me on this
> >>> topic.
> >>>
> >>> Thanks a lot for your help,
> >>> Best regards,
> >>>
> >>> Benoît Varvenne,
> >>> Bioinformatics person in charge,
> >>> Genoway Lyon - France
> >>>
> >>> _______________________________________________
> >>> General Forum at Bioinformatics.Org -
> >>> BiO_Bulletin_Board at bioinformatics.org
> >>> https://bioinformatics.org/mailman/listinfo/bio_bulletin_board
-- 
Harry Mangalam - Research Computing at NACS, E2148, Engineering Gateway, 
UC Irvine 92697  949 824 0084(o), 949 285 4487(c) 
harry.mangalam at uci.edu
    
    