Bioinformatics.org
  • CD-HIT: Sequence clustering software - Support tickets


    [ Ticket #453 ] failing when long gene description
    Date: 05/06/08 10:48
    Submitted by: unset
    Assigned to: unset
    Category: Clustering
    Priority: 5
    Ticket group: Critical
    Resolution: Unset
    Summary: failing when long gene description
    Original submission:
    I use cd-hit weekly to cluster nr.fa. Lately the program has been failing: it just hangs, reporting neither progress nor any error. Thinking that nr.fa was too big, I split it and submitted it in parts; all parts but one were clustered. I subdivided the failing part and resubmitted, and again all parts but one were clustered. I repeated the process several times until I got to the offending sequence.

    The annotation alone is over 300K, but that is still a problem.
    Do you think you can solve this?
    Thanks

    Raquel Norel
    rn98@columbia.edu
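
    The manual bisection described above can be shortened by scanning the headers directly. The following is a minimal sketch in core Perl, not part of CD-HIT: the sub name find_long_headers and the script name in the usage comment are hypothetical, and the default threshold of 300000 is an assumption taken from the default value quoted elsewhere in this ticket.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return [id, header_length] pairs for every FASTA record whose header
# line (the text after '>') is longer than $limit characters.
sub find_long_headers {
    my ($fh, $limit) = @_;
    my @hits;
    while (my $line = <$fh>) {
        next unless $line =~ /^>/;        # only header lines matter here
        $line =~ s/[\r\n]+\z//;           # drop the line ending
        my $header = substr($line, 1);
        if (length($header) > $limit) {
            my ($id) = $header =~ /^(\S*)/;   # ID = text before the first whitespace
            push @hits, [ $id, length($header) ];
        }
    }
    return @hits;
}

# Usage (hypothetical script name): perl find_long_headers.pl nr.fa [limit]
if (@ARGV) {
    my ($file, $limit) = @ARGV;
    $limit //= 300_000;   # assumed threshold, see the note above
    open my $fh, '<', $file or die "Cannot open $file: $!";
    printf "%s\t%d\n", @$_ for find_long_headers($fh, $limit);
    close $fh;
}
```

    Each output line gives a sequence ID and the length of its header, so the offending record can be found in a single pass over the file.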
    Followups
    Comment Date By
    I had the same problem and contacted Weizhong Li. He hinted that the long description is the problem. I wrote the following small Perl script, using BioPerl, to remove the descriptions from the FASTA sequences.

    CD-HIT works now.

    #!/usr/local/bin/perl
    use strict;
    use warnings;

    use Bio::Seq;
    use Bio::SeqIO;

    # Read nr.fasta and write a copy with every description removed.
    my $seqin  = Bio::SeqIO->new( -format => 'Fasta', -file => 'nr.fasta' );
    my $seqout = Bio::SeqIO->new( -format => 'Fasta', -file => '>nr_no_desc.fasta' );

    my $seq_count = 0;

    while ( my $next_seq = $seqin->next_seq() ) {
        $next_seq->desc("");              # blank the description, keep the ID
        $seqout->write_seq($next_seq);
        $seq_count++;
    }
    print "Finished shortening descriptions of $seq_count sequences!\n";
    08/12/08 09:24 unset
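
    Where BioPerl is not installed, the same cleanup can be sketched in core Perl. This is an illustrative variant, not the script posted above: the sub name strip_descriptions is hypothetical, and it assumes the sequence ID is the header text up to the first whitespace.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Copy FASTA records from $in to $out, keeping only the ID (the text up to
# the first whitespace) on each header line. Sequence lines are copied
# unchanged. Returns the number of records seen.
sub strip_descriptions {
    my ($in, $out) = @_;
    my $count = 0;
    while (my $line = <$in>) {
        if ($line =~ /^>(\S*)/) {
            print {$out} ">$1\n";     # header: ID only, description dropped
            $count++;
        }
        else {
            print {$out} $line;       # sequence data passes through as-is
        }
    }
    return $count;
}

# Usage: pass the input and output file names as the two arguments.
if (@ARGV == 2) {
    my ($infile, $outfile) = @ARGV;
    open my $in,  '<', $infile  or die "Cannot open $infile: $!";
    open my $out, '>', $outfile or die "Cannot write $outfile: $!";
    my $n = strip_descriptions($in, $out);
    close $out or die "Cannot close $outfile: $!";
    print "Removed descriptions from $n sequences\n";
}
```

    Perl reads each line into a dynamically sized string, so a header of several hundred thousand characters is not a problem for the script itself.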
    1 - In cd-hi.h, change the default value (300000).

    For example, with a new size of 600000:
    #define MAX_DES 600000
    #define MAX_LINE_SIZE 600000

    2 - Rebuild the cd-hit application.

    05/30/08 14:56 chcaron
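
    Before picking a new size, it may help to measure the input. The sketch below reports the longest line in a FASTA file. It assumes that MAX_DES and MAX_LINE_SIZE bound the length of a header or input line, which this ticket suggests but does not spell out; the sub name longest_line is hypothetical.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return the length of the longest line read from $fh, not counting the
# line ending. Header lines are measured including the leading '>'.
sub longest_line {
    my ($fh) = @_;
    my $max = 0;
    while (my $line = <$fh>) {
        $line =~ s/[\r\n]+\z//;       # ignore the line ending
        my $len = length $line;
        $max = $len if $len > $max;
    }
    return $max;
}

# Usage: pass the FASTA file name as the only argument.
if (@ARGV == 1) {
    open my $fh, '<', $ARGV[0] or die "Cannot open $ARGV[0]: $!";
    print "Longest line: ", longest_line($fh), " characters\n";
    close $fh;
}
```

    If the reported length exceeds the values defined in cd-hi.h, choosing a size comfortably above it before rebuilding seems a reasonable starting point.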
    Ticket change history
    Field Old value Date By
    status_id Pending 07/14/11 01:22 liwz
    close_date 12/31/69 19:00 07/14/11 01:22 liwz

     

    Copyright © 2025 Scilico, LLC · Privacy Policy