BIRCH/New Applications under consideration
From Bioinformatics.Org Wiki
- SeqKit - nice tools for manipulating FASTA/FASTQ files. Added in BIRCH 3.40.
- Genbeans - Includes manipulation of FASTA files in a GUI
- Ugene - Especially good for cloning tasks, and available for redistribution under GPL2.0. http://ugene.net/
- GenomeTools http://genometools.org/ - looks particularly good for tools.
- The Viral Bioinformatics Resource Center at UVic http://athena.bioc.uvic.ca/ has a bunch of neat Java applications that look quite promising. They include things like Jdotplotter, SequenceSearcher, NAP (DNA to protein aligner?), GraphDNA. There are also some good genomics tools.
- CLC Sequence Viewer - free; Linux, Windows, Mac. http://www.clcbio.com/products/clc-sequence-viewer/
- Snapgene Viewer - http://www.snapgene.com/products/snapgene_viewer/
- EPoS - a modular software framework for phylogenetic analysis and visualization. Includes blastviewer for viewing blast results. https://bio.informatik.uni-jena.de/epos/ However, it looks like the last release was in 2011.
- Reverse translation - There should be an automated way to identify the best degenerate primers from a protein sequence. One possibility would be to modify PROT2NUC to make a list of the best primers, and then to overline them on the output.
- Need to have a good 3D structure viewer. Cn3D has big problems in portability.
- Python Hydrophilicity Plot - http://www.omicsonline.org/python-based-hydrophilicity-plot-to-assess-the-exposed-and-buried-regions-of-a-protein-jpb.1000182.php?aid=1570%3Faid=1570
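The reverse translation idea above (finding the best degenerate primers from a protein sequence) could start from something like the following. This is only a minimal sketch, not PROT2NUC's logic: it assumes the standard genetic code, scores each peptide window by the product of the codon counts of its amino acids, and treats the lowest product as "best". The names rank_primer_windows and window_degeneracy are hypothetical, and real primer design would also consider Tm, 3' end stability and codon usage.

```python
from math import prod

# Number of codons per amino acid in the standard genetic code.
CODON_COUNTS = {
    "A": 4, "R": 6, "N": 2, "D": 2, "C": 2, "Q": 2, "E": 2, "G": 4,
    "H": 2, "I": 3, "L": 6, "K": 2, "M": 1, "F": 2, "P": 4, "S": 6,
    "T": 4, "W": 1, "Y": 2, "V": 4,
}


def window_degeneracy(peptide):
    """Number of distinct coding sequences for a peptide (product of codon counts)."""
    return prod(CODON_COUNTS[aa] for aa in peptide)


def rank_primer_windows(protein, window=6, top=3):
    """Return the `top` least degenerate windows as (degeneracy, start, peptide).

    `start` is a 0-based offset into the protein. Windows containing
    characters that are not one of the 20 standard amino acids are skipped.
    """
    protein = protein.upper()
    scored = []
    for start in range(len(protein) - window + 1):
        peptide = protein[start:start + window]
        if all(aa in CODON_COUNTS for aa in peptide):
            scored.append((window_degeneracy(peptide), start, peptide))
    scored.sort()
    return scored[:top]
```

A window of 6 amino acids corresponds to an 18-base primer. The ranked list could then be used to decide which regions to overline in the output.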
Quality control and assessment
- FastQC - GUI for evaluating raw or corrected read files. Can save QC information in a nice HTML report. http://www.bioinformatics.babraham.ac.uk/projects/
- Samstat - (v 1.5.1) command line program to generate QC reports on reads http://samstat.sourceforge.net/. No documentation. Generally, gives some of the same information as FastQC, but doesn't present overall numerical statistics, nor k-mer information. The graphs are less useful than what FastQC presents. There seems to be no reason to have Samstat when FastQC is available.
Web site with links to error correction tools - https://omictools.com/error-correction-category
- FASTX-Toolkit - Pre-processing tools for sequencing reads http://hannonlab.cshl.edu/fastx_toolkit/
- Pollux - claims to be able to do many platforms, including Illumina and Ion Torrent. http://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-014-0435-6
- Quake - corrects sequencing reads or throws out bad reads. Results in a substantial improvement in subsequent assembly steps. http://www.cbcb.umd.edu/software/quake/
- Racer (Illumina only) - http://www.csd.uwo.ca/~ilie/RACER/ Supersedes HiTek by the same authors.
- Fiona - Fiona: A parallel and automatic strategy for read error correction https://www.seqan.de/apps/fiona/
- Lighter - Lighter: fast and memory-efficient sequencing error correction without counting https://genomebiology.biomedcentral.com/articles/10.1186/s13059-014-0509-9
- Jabba - Jabba: hybrid error correction for long sequencing reads http://almob.biomedcentral.com/articles/10.1186/s13015-016-0075-7
Removal of non-paired reads from paired files
Sometimes one read of a pair is lost when trimming or quality correction is done. For example, if one of the two reads is too short after trimming, it might be deleted from one file while its mate is not deleted from the other. Some assembly programs fail if even a single unpaired read is found (e.g. rnaspades).
Since read files tend to have 4 lines per read, a crude way to detect the number of reads in a file is 'wc -l'. The number of reads is the number of lines divided by 4. There should be exactly the same number of reads in the left and right read files for a read pair.
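The same crude check can be scripted. The sketch below assumes uncompressed FASTQ files with exactly four lines per read (no wrapped sequence lines); the function names are made up for illustration.

```python
def count_fastq_reads(path):
    """Count reads in a FASTQ file as (number of lines) / 4."""
    with open(path) as handle:
        lines = sum(1 for _ in handle)
    if lines % 4:
        # A line count that is not a multiple of 4 means the file is
        # truncated or does not use the four-line layout assumed here.
        raise ValueError(f"{path}: {lines} lines is not a multiple of 4")
    return lines // 4


def pairs_match(left_path, right_path):
    """True if the left and right read files hold the same number of reads."""
    return count_fastq_reads(left_path) == count_fastq_reads(right_path)
```

Equal counts are necessary but not sufficient: two files can have the same number of reads and still contain different orphans, so this only catches the obvious cases.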
I have tried several programs for removing non-paired reads, so far without success:
- fastqCombinePairedEnd.py - For large files, crashes with "Segmentation fault (core dumped)"
- fastq-pair - divides reads between paired and singleton reads, but sometimes misses unpaired reads.
- remove_unpaired.pl - doesn't appear to work, and no documentation
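Given that none of the programs above worked reliably, a small in-house script may be an option. The following is a rough sketch, not a tested replacement: it assumes uncompressed four-line FASTQ records, read names that match between files once an optional trailing /1 or /2 is stripped, and enough memory to hold all read names from both files. All function names are hypothetical.

```python
def read_id(header):
    """Turn a header such as '@name/1 description' into 'name'."""
    name = header[1:].split()[0]
    if name.endswith("/1") or name.endswith("/2"):
        name = name[:-2]
    return name


def fastq_records(path):
    """Yield (read id, list of the record's four lines) from a FASTQ file."""
    with open(path) as handle:
        while True:
            record = [handle.readline() for _ in range(4)]
            if not record[0].strip():
                return  # end of file (or trailing blank line)
            yield read_id(record[0]), record


def write_paired(in_path, out_path, keep_ids):
    """Copy only the reads whose id is in keep_ids; return how many were kept."""
    kept = 0
    with open(out_path, "w") as out:
        for rid, record in fastq_records(in_path):
            if rid in keep_ids:
                out.writelines(record)
                kept += 1
    return kept


def remove_unpaired(left_in, right_in, left_out, right_out):
    """Write new left/right files containing only reads present in both inputs."""
    left_ids = {rid for rid, _ in fastq_records(left_in)}
    right_ids = {rid for rid, _ in fastq_records(right_in)}
    shared = left_ids & right_ids
    return (write_paired(left_in, left_out, shared),
            write_paired(right_in, right_out, shared))
```

Each output keeps the read order of its input, so the pairs stay synchronized as long as the surviving reads were in the same relative order in both input files.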
- MaSuRCA - https://academic.oup.com/bioinformatics/article-lookup/doi/10.1093/bioinformatics/btt476 The unique idea seems to be "super-reads". Use of k-mer tables makes it possible to extend paired-end reads into super-reads. Original reads are extended a base at a time using information from k-mer statistics. Worth a look.
- SGA - String Graph Assembler. Good eukaryotic assemblies. Paper: Efficient de novo assembly of large genomes using compressed data structures
- MIRA - https://sourceforge.net/p/mira-assembler/wiki/Home/
- ABySS - good results for Euk. data http://www.bcgsc.ca/platform/bioinfo/software/abyss
- SPAdes - Quick. Available for both Linux and OSX. http://bioinf.spbau.ru/spades
- Newbler - Proprietary! Must be downloaded from Roche site. Cannot be redistributed. http://swes.cals.arizona.edu/maier_lab/kartchner/documentation/index.php/home/docs/newbler
- IDBA-UD - good reviews on Euk. data http://i.cs.hku.hk/~alse/hkubrg/projects/idba_ud/index.html
- SOAPdenovo2 - Part of a collection of genome assembly tools. Appears to be able to handle large reads, but documentation is a bit equivocal. Reviews mention that a lot of parameter tweaking is needed, and you still may not get a good assembly. http://soap.genomics.org.cn/index.html
- Cerulean - An interesting new strategy, to first get long contigs, and go back and try to match reads to big contigs. arXiv:1307.7933v1
- Velvet - Requires a very big memory space.
- Ray - http://denovoassembler.sourceforge.net/index.html
- Mostafa M Abbas, Qutaibah M Malluhi and Ponnuraman Balakrishnan (2014) Assessment of de novo assemblers for draft genomes: a case study with fungal genomes. BMC Genomics 15(Suppl 9):S10 DOI: 10.1186/1471-2164-15-S9-S10
Assembly viewers and Quality Assessment
- Tablet (currently installed on Flamingo), from Hutton Institute
- Quast - Extensive tools for visualization and evaluation of assembly quality. Looks good. http://quast.sourceforge.net/quast.html
- BamView - An interactive Java application for visualising read-alignment data stored in BAM files. http://www.sanger.ac.uk/science/tools/bamview
- Rascaf - Rascaf (RnA-seq SCAFfolder) uses continuity and order information from paired-end RNA-seq reads to improve a draft assembly, particularly in the gene regions. By the author of Rcorrector.
- REAPR - http://www.sanger.ac.uk/resources/software/reapr - open source, free.
- SSPACE - http://www.baseclear.com/genomics/bioinformatics/basetools/SSPACE - free, but commercial product, so it can't be redistributed.
- GapFiller - GapFiller: a de novo assembly approach to fill the gap within paired reads. http://www.ncbi.nlm.nih.gov/pubmed/23095524
- GAM-NGS - http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3633056/ Merge assemblies into a better larger assembly.
- GARM - Genome Assembly, Reconciliation and Merging 
- GRASS - GRASS: a generic algorithm for scaffolding next-generation sequencing assemblies http://bioinformatics.oxfordjournals.org/content/28/11/1429.long
Genome annotation and visualization
Ekblom R, Wolf JBW (2014) A field guide to whole-genome sequencing, assembly and annotation http://onlinelibrary.wiley.com/doi/10.1111/eva.12178/full
- DIAMOND - DIAMOND is a sequence aligner for protein and translated DNA searches and functions as a drop-in replacement for the NCBI BLAST software tools. It is suitable for protein-protein search as well as DNA-protein search on short reads and longer sequences including contigs and assemblies, providing a speedup over BLAST of up to 20,000x.
Annotation formats and software
- Why are NCBI GFF3 files still broken? http://blastedbio.blogspot.ca/2011/08/why-are-ncbi-gff3-files-still-broken.html
- NCBI Eukaryotic Genome Annotation Pipeline http://www.ncbi.nlm.nih.gov/books/NBK169439/
- MAKER Pipeline http://www.yandell-lab.org/software/maker.html
- Anvi'o: an advanced analysis and visualization platform for 'omics data (2015) PeerJ 3:e1319; DOI 10.7717/peerj.1319.
- RNAmmer - Annotates rRNAs http://www.cbs.dtu.dk/services/RNAmmer/
- Findtrna - Annotates tRNAs http://www.bioinformatics.org/findtrna/
- miRNA - Genome-wide annotation of microRNA primary transcript structures reveals novel regulatory mechanisms http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4561498/
- MITOS: Mitochondrial Genome annotation - http://mitos.bioinf.uni-leipzig.de/
- IGB - Integrated Genome Browser (presented at PAG 2016)
- Pluses: great look and feel; designed to easily add Java plugins; has a "Just In Time" download architecture, so only the parts of the genome you are viewing, and relevant metadata, get downloaded as you need them. Also uses caching to speed things up.
- Minuses: has to use genomes that have been converted for IGB
- IBS: An Illustrator for the presentation and illustration of biological data - This looks really great, and is written in Java, so it's platform independent. Can work directly with Uniprot annotations, and they claim to be working on NCBI GenBank and other formats. Looking at the user manual, it seems to be more of a drawing program for making gene diagrams, rather than a more conventional genome viewer. At first look, the manual doesn't say anything about importing an annotated genome.
Bioinformatics (2015) 31 (20): 3359-3361. doi: 10.1093/bioinformatics/btv362
- Integrative Genomics Viewer - Broad Institute http://www.broadinstitute.org/igv
- Jbrowse Genome Viewer - http://jbrowse.org
- Apollo - a genome editing plugin for Jbrowse http://genomearchitect.org/
- SyMap - http://www.agcol.arizona.edu/software/symap/index.html A very sophisticated genome viewer
Written in Java for Mac and Linux. One caveat - it uses a MySQL database, which must be installed separately for anything other than the demos. Definitely worth trying.
- chromosome x chromosome dot plots
- circular synteny maps
- 3D comparison plots
- 2D chromosome comparisons
- Circlator - circularize genome assemblies http://www.sanger.ac.uk/science/tools/circlator
- SyMAP - whole genome dot plots, synteny visualization, 2D and 3D views, and written in Java. Could be used for Cytogenetics. http://www.agcol.arizona.edu/software/symap/v4.0/UserGuide.html
PathVisioRPC - An XMLRPC interface for PathVisio. In other words, an API for data visualization. Bindings for many languages, including Python, Java and R. http://www.biomedcentral.com/1471-2105/16/267?utm_campaign=BMC24047B&utm_medium=BMCemail&utm_source=Teradata
misFinder - identify mis-assemblies in an unbiased manner using reference and paired-end reads http://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-015-0818-3
- qod - An alternative approach to multiple genome comparison http://doi.org/10.1093/nar/gkr177
- BBSketch - very rapid comparison of genomes based on k-mers. Claims to be independent of genome size.
- lastal - add -P option to run in parallel
- How much of LAST can we automate through BioLegato? Do we need a BioLegato for comparative genomics?
- It might be easy to get Last to create genomic dot-plots showing ONLY repetitive sequences compared between chromosomes. Simply mask the input sequences for all sequences OTHER than what you want to look at, and then run Last (or DXHOM, for that matter). This could create maps of particular transposons or other repetitive elements scattered throughout the genome.
It might be very useful when studying genome evolution.
- Yass - YASS :: genomic similarity search tool http://bioinfo.lifl.fr/yass/index.php
- They show dotplots in their paper, but don't actually have a program for dotplots.
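The masking step suggested above for Last (hide everything OTHER than the repeats of interest before making the dot plot) is simple to prototype. This sketch assumes the repeat locations are already known as 0-based, half-open (start, end) intervals, for example from an annotation file, and that replacing bases with N is an acceptable form of masking for the downstream aligner; both are assumptions, and mask_outside is a made-up name.

```python
def mask_outside(sequence, keep_intervals, mask_char="N"):
    """Return a copy of `sequence` with every base outside keep_intervals masked.

    keep_intervals is an iterable of 0-based, half-open (start, end) pairs.
    Intervals may overlap; parts that fall outside the sequence are ignored.
    """
    masked = [mask_char] * len(sequence)
    for start, end in keep_intervals:
        start = max(start, 0)
        end = min(end, len(sequence))
        # Copy the original bases back in for the region we want to keep.
        masked[start:end] = sequence[start:end]
    return "".join(masked)
```

For example, mask_outside("ACGTACGTAC", [(2, 5)]) keeps only positions 2 to 4 and returns "NNGTANNNNN". Running each chromosome through this before the comparison would leave only the chosen elements available to align.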
- Last - LAST: Genome-Scale Sequence Comparison http://last.cbrc.jp/doc/last.html
- multiple chromosomes or genomes in a single plot
- no coordinates on dotplots
- Gepard: a rapid and sensitive tool for creating dotplots on genome scale. Java, appears to be quick. Nice feature to calculate plots by functional annotation. One big downside - no way to launch the Gepard GUI with sequences specified on the command line. You can run at the command line, but it will just generate a static bitmap as output. Update: Actually, there are a lot of problems.
- Only distributed on a not very user-friendly git repo
- documentation is not up to date
- doesn't provide launch scripts for command line usage. Despite what is discussed on the Git, I have never been able to get gepard to run as a command line application.
- The author really doesn't seem to have any interest in human beings being able to download and use this program.
- only does pairwise comparisons, not multiple comparisons
- MatrixPlot - http://www.cbs.dtu.dk/services/MatrixPlot/
- ACT (Sanger) 
- Mauve - http://darlinglab.org/mauve/mauve.html Mauve directly reads GenBank files. However, it is prone to crashes when running the progressive alignment. Crashes happen when Mauve reads in GenBank files, but not FASTA files. This is important, because Mauve is less useful if you can't visualize features. This should be reported to the Darling lab.
- Brig - BLAST Ring Image Generator (BRIG) http://brig.sourceforge.net
- Mercator - Orthology maps between multiple genomes https://www.biostat.wisc.edu/~cdewey/mercator/
- MAVID - multiple alignment for large genomic sequences http://bio.math.berkeley.edu/mavid/download/
Genome Re-sequencing and genotyping
Gene Expression/Transcriptome Analysis
- QuickRNASeq - Linux-HPC based with Web interface; Pulls together a lot of things into a neat pipeline. http://bmcgenomics.biomedcentral.com/articles/10.1186/s12864-015-2356-9
- Post-processing and Quality Control
- TransRate - A tool for reference-free quality assessment of de novo transcriptome assemblies http://hibberdlab.com/transrate/
- Transcriptome assembly and Quantitation
- Kallisto - http://pachterlab.github.io/kallisto/ Near-optimal probabilistic RNA-seq quantification; claims to be much faster than HISAT and others.
- STAR - fast aligner.
- HISAT - successor to TopHat, faster.
- RobiNA, a User Friendly Graphical Interface to Powerful Open Source Microarray and RNA-Seq* Processing http://mapman.gabipd.org/web/guest/robin
- http://rseqc.sourceforge.net - QC for RNA alignments in Python
- Picard - Java tools for BAM & SAM files; displays quality information
- QoRTs: a comprehensive toolset for quality control and data processing of RNA-Seq experiments http://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-015-0670-5. Best for post-assembly QC. Requires bam files and gtf files.
- Downstream analysis
- BioStars article on recent transcript count software https://www.biostars.org/p/255698/
- Best practises RNAseq - Sleuth vs. NA's in annotation
- Pathway studio
- Should there be a blrna?
- In principle, any DNA sequence could be used to launch RNA tasks, and most RNA programs probably automatically change T to U.
- Splitting out RNA programs into a separate blrna program might prevent people from running RNA tasks on things that aren't transcribed.
- We could have an Export to RNA function.
- In principle, FEATURES could even be programmed to automatically launch blrna if the feature key was any of the RNA keys.
- Strictly speaking, it would be incorrect to translate a DNA sequence, so translation tasks shouldn't be in bldna. But for convenience, it makes sense to keep them there.
- ApolloRNA - Online tools, but looks like they let you download. Uses ImageMagick and Ghostscript https://carlit.toulouse.inra.fr/ApolloRNA/download.html
- Freiburg RNA tools - http://rna.informatik.uni-freiburg.de/
- RNAstructure (Mathews Lab) - comprehensive package, all in 1 Java GUI. May be easy to install. http://rna.urmc.rochester.edu/RNAstructure.html
- Forna (force-directed RNA): Simple and effective online RNA secondary structure diagrams
Bioinformatics (2015) 31 (20): 3377-3379. doi: 10.1093/bioinformatics/btv372
- Parallel tiled Nussinov RNA folding loop nest generated using both dependence graph transitive closure and loop skewing
Multiple sequence alignment
- We need a program that can map GenBank features to a multiple sequence alignment.
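The core of such a program is a coordinate conversion from positions in the ungapped sequence (as used in GenBank feature locations) to columns of the alignment. A minimal sketch of that step follows; it assumes a simple start..end feature location that is 1-based and inclusive, and gap characters "-" or ".". Joins, complements and fuzzy locations are not handled, and the function names are hypothetical.

```python
def residue_columns(aligned_seq, gap_chars="-."):
    """List the 0-based alignment column of each non-gap residue, in order."""
    return [col for col, char in enumerate(aligned_seq) if char not in gap_chars]


def map_feature(aligned_seq, start, end, gap_chars="-."):
    """Map a 1-based inclusive feature location to 1-based alignment columns.

    `start` and `end` refer to the ungapped sequence; the returned pair
    refers to columns of the aligned (gapped) row.
    """
    columns = residue_columns(aligned_seq, gap_chars)
    if not 1 <= start <= end <= len(columns):
        raise ValueError(f"location {start}..{end} is outside the sequence")
    return columns[start - 1] + 1, columns[end - 1] + 1
```

For the aligned row "AC--GT-A", a feature at 2..4 in the ungapped sequence maps to alignment columns 2..6. Applying this to every feature of every sequence would give the positions needed to draw annotations over the alignment.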
Nice try, but no cigar:
- pfaat - No way to put in your own annotations, and automated annotation from Uniprot is extremely limited. No good for DNA annotation
- aline - This program seems initially promising, but has fatal flaws. Perl scripts using Tcl/Tk for GUI. Not supported since its release in 2008. Some ability to put in annotations, but it's the details that make this not worth using.
- There is sequence numbering for each component of the alignment
- no numbering with respect to the alignment
- The GUI is noticeably slow.
- No documentation at all, and the paper is too brief to be useful. In fact, when the program launches from the command line, you get a message saying "Documentation would be nice." Nuff said. This thing is hard enough to figure out as it is without documentation.
- The paradigm is 1. choose a tool 2. select a part of the alignment. One big problem is that you can't select parts of the alignment by dragging past the contents of the current window. It won't autoscroll as you move. Neither can you select parts of the alignment by SHIFT, begin-select, move, end-select. This means that there is no way to annotate features such as introns or exons that span beyond the current window.
- Some tools just don't seem to work at all.
- You'd like to be able to select existing objects and modify them, but this capability is frustratingly limited.
Pattern recognition and detection
- CRISPResso2 - Analysis of genome editing outcomes from deep sequencing data http://crispresso.pinellolab.partners.org/help
Basic Restriction Enzyme Tasks in BioLegato
- seq -- cut --> frags
- seq1, seq2 -- ligate --> frags
- could also do ligation in bldna, just by creating a new sequence out of 2 or more selected sequences
- frags -- map --> image
- frags + feature location --> possible REs for cloning the feature
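To make the cut and ligate steps above concrete, here is a toy sketch in plain Python. It is not BACHREST, DIGEST or the BioPython Restriction package: it assumes a linear sequence, considers only the top strand, ignores sticky-end compatibility during ligation, and uses a two-enzyme table (EcoRI G^AATTC, BamHI G^GATCC) purely for illustration. All names are hypothetical.

```python
# Recognition site and cut offset within the site (top strand only).
ENZYMES = {
    "EcoRI": ("GAATTC", 1),  # G^AATTC
    "BamHI": ("GGATCC", 1),  # G^GATCC
}


def cut(sequence, enzyme):
    """seq -- cut --> frags: digest a linear sequence with one enzyme."""
    site, offset = ENZYMES[enzyme]
    sequence = sequence.upper()
    fragments = []
    last = 0
    pos = sequence.find(site)
    while pos != -1:
        fragments.append(sequence[last:pos + offset])
        last = pos + offset
        pos = sequence.find(site, pos + 1)
    fragments.append(sequence[last:])
    return fragments


def ligate(*fragments):
    """seq1, seq2 -- ligate --> one sequence, joined in the order given."""
    return "".join(fragments)


def fragment_sizes(fragments):
    """Fragment lengths, largest first, as raw data for a map or gel image."""
    return sorted((len(f) for f in fragments), reverse=True)
```

For example, cut("AAGAATTCTTGAATTCGG", "EcoRI") gives ["AAG", "AATTCTTG", "AATTCGG"], and ligating those fragments in order restores the original sequence.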
Can we implement features that persist from step to step? Look at the various file formats eg. SFF. Some of these may be a way to preserve feature annotation without creating a GenBank flat file.
Examples of tasks:
- clone a PCR product
- move a cassette from one vector to another
- clone a synthetic dsDNA into 1 or 2 restriction sites
- delete a fragment.
BioPython contains a package called Restriction. This package appears to have classes for Restriction enzymes, which can work with Seq objects to do many of these tasks.
If we use the Restriction class, it might be useful to create new classes as extensions of existing classes. That way, the new classes could be contributed to BioPython.
It might be tempting to recognize that much of the above could be accomplished by running BACHREST and DIGEST from wrappers. However, we have to concede that while these are well-written programs, they are not worth the effort to support as Pascal code. The better way is to leverage the BioPython code for what it can do, and adapt the logic from BACHREST and DIGEST to handle the downstream fragment tasks.
Packages for cloning tasks
|Looks like a nice GUI. Not as thoroughly tested on Linux as Mac,Windows.
Discussions for previous versions suggest that critical cloning functions may not work on Linux. See
Also a problem is installation. SC is a bit fussy about where you launch it. The binary needs to be in the same directory as the rest of the package. Symbolic links don't work for launching it because it can't find its libraries. There is also no mechanism for specifying an input file on the command line.
|Mac, Windows, Linux|
|Uses a lot of existing software (eg. MUSCLE, BLAST, PRIMER3) with its own interface. Has some NGS stuff in it (eg. Velvet).||Mac,Windows,Linux|
|ApE - A Plasmid Editor
|Appears to be compiled, not a Java application. The last Linux version was released in 2009.||OSX,Windows. Adaptable to Linux?|
Genetic Mapping/Molecular Markers
- GMATo - A novel tool for the identification and analysis of microsatellites in large genomes. (Java GUI, Perl backend) http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3705631/
- PLINK (Harvard) http://pngu.mgh.harvard.edu/~purcell/plink/index.shtml -VERY full featured command line program, Linux preferred. Has lots and lots of analytical methods for SNPs, genetics, statistical tests, and probably the kitchen sink.
- need program to evaluate reproducibility of 2 or more replicates for a set of primers
Packages to look at:
- MadMapper - Python scripts from the Michelmore lab
- Quality control of genetic markers
- Group analysis
- linear order of markers on linkage groups
- QGene - Java program with GUI for QTL mapping. Runs under Windows but we'd expect it should be platform-independent.
- MapDisto - Genetic analysis. Runs on Windows and Mac. It looks like it's mainly an Excel plugin, so maybe it can be made to run with LibreOffice Calc.
- xQTL - Runs in Java, has a web interface.
- seamless data management for genotypes, molecular data and phenotypes
- analytical pipelines and tools
- high throughput cluster computing
- MSTMap - MSTMap is a software tool that is capable of constructing genetic linkage maps efficiently and accurately. It can handle various mapping populations including BC1, DH, Hap, and RIL, among others. Source code available.
The Laboratory of Statistical Genomics at Rockefeller University maintains what seems to be an up to date list of genetic analysis software.
Maybe it's time to phase out Phylip.
- v3.69 was released in 2009, and there is no evidence of further support
- short ID problem
- limited types of metadata that can be in files.
- Some packages give nice integrated graphical tools
- not really designed for comparative genomics
- IQ-TREE: a fast and effective stochastic algorithm for estimating maximum-likelihood phylogenies. Mol Biol Evol. 2015 Jan;32(1):268-74. doi: 10.1093/molbev/msu300. Epub 2014 Nov 3.
- PAML - Phylogenetic Analysis by Maximum Likelihood - A Unix style package that looks like it would be easily automatable under BioLegato. Also has its own GUI. http://abacus.gene.ucl.ac.uk/software/paml.html
- comparison and tests of trees
- estimation of parameters
- likelihood ratio tests of hypotheses
- estimation of divergence times
- reconstruction of ancestral sequences
- estimation of synonymous and non-synonymous substitution rates
- Mol. Evol., Phylogenetics and Epidemiology - http://tree.bio.ed.ac.uk/software/ - software for mol. evol., phylogenetics and epidemiology. Looks polished, many things in Java. Includes programs for Bayesian phylogeny, viewing and drawing trees.
- SNPhylo - Phylogenetic analysis using SNP data http://www.biomedcentral.com/1471-2164/15/162 -looks great, but could be complicated to integrate, especially since it requires some specific R packages.
- Is there a program out there that will do some sort of sliding window plot of sequence conservation in a multiple alignment? This would be particularly useful in phylogenetic analysis. Ideally, the sliding window would give a bit score for information content. WebLogo almost does this, in that you get a bit score at each position.
- You could break up an alignment into regions to look at reticulate evolution
- Discover which are the most informative and least informative regions
- TreeLink - data integration, clustering and visualization of phylogenetic trees; http://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-015-0860-1
- HybPhyloMaker - HybPhyloMaker: Target Enrichment Data Analysis From Raw Reads to Species Trees https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5768271/ - Incredibly, written in BASH. This pipeline automates steps from multiple alignment to genome trees. Probably worth a look, even if only for the references.
- MEGA - has a bad reputation of being a black box with little ability to adjust parameters. "Like a software corset"
- PAUP - kind of up in the air, but the stated goal is to make a Windows and Mac commercial version. Can't be redistributed. The current web site (May 17) is still equivocal about what will actually be available.
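On the sliding window question raised above (a bit score for information content across a multiple alignment), the calculation itself is short enough to prototype. The sketch below assumes the alignment is a list of equal-length strings, scores each column as log2(alphabet size) minus the Shannon entropy of its non-gap residues, and averages that over a window. It applies no small-sample correction and ignores gaps entirely, so it is a starting point rather than an established method; the function names are hypothetical.

```python
from collections import Counter
from math import log2


def column_information(column, alphabet_size=4, gap_chars="-."):
    """Information content (bits) of one alignment column, gaps ignored."""
    residues = [c.upper() for c in column if c not in gap_chars]
    if not residues:
        return 0.0  # all-gap column carries no information
    total = len(residues)
    entropy = -sum((n / total) * log2(n / total)
                   for n in Counter(residues).values())
    return log2(alphabet_size) - entropy


def sliding_information(alignment, window=10, alphabet_size=4):
    """Mean per-column information for each window position along the alignment."""
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all aligned sequences must have the same length")
    per_column = [
        column_information([row[i] for row in alignment], alphabet_size)
        for i in range(length)
    ]
    return [sum(per_column[i:i + window]) / window
            for i in range(length - window + 1)]
```

Use alphabet_size=20 for protein alignments. The returned list can be plotted directly against window start position; peaks mark conserved regions and troughs mark variable ones, which is what would be needed to pick regions for separate trees.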