BIRCHv4.00
From Bioinformatics.Org Wiki
Installation
Python3 compliance
Based on testing on macos-arm64 on which python2 is not installed, BIRCH should now be completely Python3 compliant. We will fix remaining noncompliant scripts, if any, as they show up.
New platforms
Need to add new platforms for BIRCH_PLATFORM.
Choices:
- linux-arm64
In order, these need to be changed in:
1. BioLegato
2. install-scripts
3. getbirch
4. birchdb
5. scripts
BIRCH
- make dropdown menus on home page scrollable
- ABySS requires make. Fedora 39 doesn't have make by default. Where do we document this? Can we do that maybe in one of the ABySS scripts?
- Python3 - deprecation of invalid escape sequences. Python3 gives a warning for unrecognized escape sequences in ordinary string literals, such as '\s' in a regular expression passed to re.match. This must be changed to '\\s', or the pattern written as a raw string, r'\s'. See [1]. This has been fixed in htmldoc.py, and the revised htmldoc.py appears to work. In future releases, we will have to go through all python scripts and change these. Acckkk!!!
- Check programs for compliance with 64-bit GI numbers in NCBI databases.
- Mail server - Need documentation on how to set up a mail server for email notification.
- The Phylip web site mentions "A new release of PHYLIP, version 3.698" which fixes a consensus tree bug. No date is given. We should check to see if we have this version, and if not, upgrade.
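For the escape-sequence item in the list above, here is a minimal sketch of the two usual fixes. The pattern and the sample line are made up for illustration and are not taken from htmldoc.py or any other BIRCH script.

```python
import re

# Hypothetical sample line; not taken from any BIRCH script.
line = "LOCUS   AB000001"

# Deprecated form: re.match('LOCUS\s+(\S+)', line)
# In an ordinary string literal, '\s' is an invalid escape sequence and
# triggers a warning in recent Python 3 releases.

# Fix 1: double the backslashes.
doubled = re.match('LOCUS\\s+(\\S+)', line)

# Fix 2: use a raw string, which leaves backslashes untouched.
raw = re.match(r'LOCUS\s+(\S+)', line)

# Both literals produce the same pattern, so both matches capture the same name.
print(doubled.group(1), raw.group(1))
```

Raw strings are probably the less error-prone choice when going through existing scripts, since the pattern itself does not have to be edited, only prefixed with r.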
BioLegato
- bltable - File --> Save Selection As: If you choose a file, it puts in the whole path in quotes, which for some reason doesn't save the file. If you remove the path and quotes, the file gets saved.
- Get latest BioLegato version into BioLegato tutorial
- chooseviewer.py should have a way to view Markdown (.md) files. Well... that turns out to be easier said than done. You'd think programs like Evince or LibreOffice Writer could do that, but it turns out not to be the case. Actually, it's very hard to make a legible PDF file from Markdown. We can look at this, but there is no simple answer.
- Get BioLegato to recompile with Java11
- Update documentation on adding local components to BioLegato
- Remote execution - It might be almost trivial to add to BioLegato the capability to run jobs on remote servers. We could run the command with sshcc, but set it in an environment variable, so the PCD would look something like
shell "$BL_REMOTE blastp ...."
where $BL_REMOTE would be something like sshcc, or whatever command on your system sends a job to a remote host. This would only work on a clustered system where all hosts share a common file system eg. NFS. If $BL_REMOTE is blank, the command just runs on the local host.
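In a PCD shell line, a blank $BL_REMOTE simply expands to nothing. For Python scripts that launch the programs themselves, the same idea might look like the sketch below. The helper name build_command is hypothetical, and the only assumption about $BL_REMOTE is the one stated above: it holds whatever command sends a job to a remote host.

```python
import os
import shlex

def build_command(command, env=None):
    """Prefix command with $BL_REMOTE if it is set and non-blank."""
    env = os.environ if env is None else env
    remote = env.get("BL_REMOTE", "").strip()
    # shlex.split lets BL_REMOTE carry its own options as well as a command name.
    prefix = shlex.split(remote) if remote else []
    return prefix + list(command)

# With BL_REMOTE blank, the command is unchanged and runs on the local host.
print(build_command(["blastp", "-version"], env={"BL_REMOTE": ""}))
# With BL_REMOTE set, the remote launcher is placed in front of the command.
print(build_command(["blastp", "-version"], env={"BL_REMOTE": "sshcc"}))
```

The resulting list can be passed directly to subprocess.call or Popen without going through a shell.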
bldna, blprotein
- Automated sequence renaming - Need to be able to rename sequences using some sort of regular expression substitution. SeqKit may be able to do this.
- How hard would it be to revise BioLegato to always use Accession numbers, rather than LOCUS names? Virtually no software uses LOCUS. This is moot except for very old sequences, since NCBI decided long ago to make LOCUS and Accession identical. However, if you do get an older sequence coming up, it would be good not to have to deal with this problem.
- [Bugzilla 1223 https://www.bioinformatics.org/support/index.php?func=detail_ticket&bug_id=1223&group_id=543] - Edit --> Change case - If you change case, you lose all annotation. After changing case, if you try to use File --> View file, a pseudo-GenBank file is shown that is missing almost all annotation. This seems to be a problem with BioLegato, since there is no "Change case" .blmenu file.
- It might be useful to be able to go from sequences to Neighbors or Links. Two possible ways:
- from bldna or blprotein, export sequences to blncbi based on accession numbers. May need a script something like GenBank2Entrez. Probably just runs Eutils.post.
- Have a script that directly sends output to blncbi, so that you're not only running Eutils.post, but also running elink.
- Time to revisit Genome Browsers. To consider:
- Genome Workbench - We could have an export function from bldna that extracts Accession numbers from entries and then loads them into Genome Workbench.
- UCSC Genome Browser
- VISTA
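For the automated sequence renaming item above, the sketch below shows what regular expression substitution on sequence names could look like for FASTA input. It is plain Python, not a description of how SeqKit works, and the function name rename_fasta and the sample records are hypothetical.

```python
import re

def rename_fasta(lines, pattern, replacement):
    """Apply a regex substitution to the name field of each FASTA header."""
    regex = re.compile(pattern)
    for line in lines:
        if line.startswith(">"):
            # Split the header into the name and the optional description.
            name, sep, description = line[1:].rstrip("\n").partition(" ")
            yield ">" + regex.sub(replacement, name) + sep + description + "\n"
        else:
            # Sequence lines pass through unchanged.
            yield line

# Hypothetical input: strip a version suffix such as ".1" from each name.
fasta = [">AB000001.1 example sequence\n", "ACGT\n", ">AB000002.2\n", "GGCC\n"]
renamed = list(rename_fasta(fasta, r"\.\d+$", ""))
print("".join(renamed), end="")
```

Only the name (the text up to the first space) is rewritten, so descriptions are left alone even if they happen to match the pattern.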
Phylogeny
- bltree -> ConfAdd - Add box to paste in bootstrap tree file
- How about email notification for long running jobs?
- Archaeopteryx - Update documentation database to point to current docs, apparently on Google docs as archaeopteryx.js. They seem to be focusing on the web version, and not so much on the standalone application.
- bltree - open trees in text editor for pasting into menus
- All phylogeny scripts (dnadist.py, dnapars.py, dnaml.py etc.) call the main program using Popen, but call later steps such as bootstrapping, consense, and uniqid using subprocess.call. On some systems, one of these steps appears to be called before the previous step has completed, resulting in a No such file or directory message and empty output. You can rerun the program with no changes and then get the expected output. Would changing all of these calls to Popen calls be more consistent? I have a feeling we've done this before, so tread carefully to avoid swapping one problem for another. The advantage of Popen is that we can do a p.wait() after every call.
- We need to somehow integrate taxfetch.py into blnalign and blpalign so that it can get taxonomic information from accession numbers. This is trickier than it might first appear. For example, BioLegato will export sequences using GenBank LOCUS names. It is not clear at what step in the process one would do this. The goal is to be able to generate a phyloxml file for Archaeopteryx to read. Also, it would be nice if we could get alphabetic taxonomy codes like those used by Pfam, as opposed to the numeric taxid numbers from NCBI.
- When doing bootstrapping, the treefile and outfile don't get decoded, so they have the names from uniqid.py, rather than the original names.
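For the Popen question above, here is a minimal sketch of running every step through Popen followed by wait(). The helper name run_step is hypothetical, and the steps are stand-in Python one-liners rather than the real PHYLIP programs. Note that subprocess.call already waits for its command to finish, so if there is a race, a Popen call that is never waited on seems the more likely place to look.

```python
import subprocess
import sys

def run_step(argv):
    """Run one pipeline step and block until it has finished."""
    p = subprocess.Popen(argv)
    returncode = p.wait()  # do not start the next step until this one exits
    if returncode != 0:
        raise RuntimeError("step failed with exit code %d: %r" % (returncode, argv))
    return returncode

# Stand-in steps: each runs a tiny Python command instead of a real program
# such as dnadist or consense.
steps = [
    [sys.executable, "-c", "print('main program')"],
    [sys.executable, "-c", "print('bootstrap')"],
    [sys.executable, "-c", "print('consense')"],
]
results = [run_step(step) for step in steps]
```

Checking the return code after each step also means a failure is reported at the step that caused it, rather than later as a missing file.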
Multiple alignment
- blnalign, blpalign - Add MaxAlign to tutorials.
BLAST+
- $doc/BIRCH/birchadmin/blastdb/BLASTDB-Considerations.html - Add some stats on search times for various databases on different platforms.
blncbi
- add NCBI datasets, dataformat - easy command line tools that should complement ncbiquery.py
- add einfo
- rename blncbi to blentrez?
- It looks like Related and Link only work for nucleotide sequences. This needs to work for proteins as well.
samtools
- Depending on the Linux release, some library dependencies may not be present on the system. The best solution is to change the name of samtools in the bin-xxx-xxx directories to samtools.bin, and then revise samtools.sh (which is linked from script/samtools) to include lib-xxx-xxx/samtools/lib in the LD_LIBRARY_PATH or DYLD_LIBRARY_PATH.
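The wrapper described above is a shell script, but the path handling can be sketched in Python to make the logic explicit. The function name library_env and the example directory are hypothetical; the real directory depends on the platform-specific lib-xxx-xxx name.

```python
import os
import sys

def library_env(lib_dir, platform=sys.platform, env=None):
    """Return a copy of env with lib_dir prepended to the loader search path."""
    new_env = dict(os.environ if env is None else env)
    # macOS uses DYLD_LIBRARY_PATH; Linux uses LD_LIBRARY_PATH.
    var = "DYLD_LIBRARY_PATH" if platform == "darwin" else "LD_LIBRARY_PATH"
    existing = new_env.get(var, "")
    # Prepend so the bundled libraries are found before any system copies.
    new_env[var] = lib_dir + (os.pathsep + existing if existing else "")
    return new_env

# Hypothetical library directory; the real path depends on BIRCH_PLATFORM.
env = library_env("/path/to/lib-xxx-xxx/samtools/lib", platform="linux", env={})
print(env["LD_LIBRARY_PATH"])
```

A launcher could then pass the returned dictionary as the env argument when starting samtools.bin with subprocess.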
blreads
- Update to latest Hisat2 release.
- in pkg, there is a hisat and hisat2 link. There should be just a single link, but need to fix some of the symbolic links that point to hisat.
- clean up the differential tutorial. There are some inconsistencies in it, and we need to update it for the latest Hisat2
- There was a post on BioStars that indicated that the latest release of rnaspades no longer does error correction on reads. We better look into this, because error checking programs like rcorrector also can eliminate unpaired reads (I'm pretty sure) and at the very least, the tutorials have to be changed to reflect the change in rnaspades.
- Transcriptome Assembly Tools - scripts for cleaning up reads eg. uncorrectable reads, overrepresented sequences etc.
- Update Spades to v 3.14.1 - critical for Python3 support
- Transrate is no longer supported by the developer, and has a number of known bugs/issues. Potential alternatives:
- BUSCO for assembly statistics https://busco.ezlab.org
- rnaQUAST (now supports Python3)
- Try fastp as an alternative for trimming reads.
blmarker
- revise menus as done for blpalign, blnalign
- add support for other file formats
LAST
- last-dotplot requires python-pil. It doesn't look like it will be easy to package that in the lib-xxx-xxx/python hierarchy, so it should be documented as a dependency which has to be installed by the user.
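Besides documenting such dependencies, a script could check for them at startup and print a clear message. The sketch below is one way to do that, with a hypothetical helper name missing_dependencies. It covers both Python modules (PIL for last-dotplot) and external programs (for example make, which ABySS needs).

```python
import importlib.util
import shutil

def missing_dependencies(modules=(), executables=()):
    """Return the names of Python modules and programs that cannot be found."""
    missing = [m for m in modules if importlib.util.find_spec(m) is None]
    missing += [e for e in executables if shutil.which(e) is None]
    return missing

# PIL is the import name of the module that python-pil provides.
missing = missing_dependencies(modules=["PIL"], executables=["make"])
if missing:
    print("Please install: " + ", ".join(missing))
```

The check only reports what is missing; how to install it still differs between distributions and would have to be covered in the documentation.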
Cytoscape
On CCL, Cytoscape gets to the splash screen but hangs at that point. The Java error messages complain about permissions.