BIRCHv4.01
From Bioinformatics.Org Wiki
New platforms
Need to add new platforms for BIRCH_PLATFORM.
Choices:
- linux-arm64
In order, these need to be changed in:
1. BioLegato
2. install-scripts
3. getbirch
4. birchdb
5. scripts
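As a rough illustration of how a platform string such as linux-arm64 could be derived, here is a minimal Python sketch. The function name birch_platform and the "system-arch" naming scheme are assumptions made for this example. The names actually used for existing BIRCH platforms may differ, and the real detection logic lives in the components listed above.

```python
import platform


def birch_platform(system=None, machine=None):
    """Return a BIRCH_PLATFORM-style string such as 'linux-arm64'.

    The 'system-arch' naming is an assumption for illustration only.
    """
    system = (system or platform.system()).lower()
    machine = (machine or platform.machine()).lower()
    # 64-bit ARM is reported as 'aarch64' on Linux and 'arm64' on macOS.
    arch = "arm64" if machine in ("aarch64", "arm64") else machine
    return f"{system}-{arch}"


if __name__ == "__main__":
    print(birch_platform())
```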
linux-arm64
For now, development will have to be done on a VM. According to the article "How to install Linux on a Mac with Apple silicon", UTM is probably the best choice of VM environment for macOS on ARM.
BIRCH
- make dropdown menus on home page scrollable
- Mail server - Need documentation on how to set up a mail server for email notification.
- ABySS requires make. Fedora 39 doesn't have make by default. Where do we document this? Can we do that maybe in one of the ABySS scripts?
- Check programs for compliance with 64-bit GI numbers in NCBI databases.
BioLegato
- bltable - File --> Save Selection As: If you choose a file, it puts in the whole path in quotes, which for some reason doesn't save the file. If you remove the path and quotes, the file gets saved.
- Get latest BioLegato version into BioLegato tutorial
- chooseviewer.py should have a way to view Markdown (.md) files. That turns out to be easier said than done. You'd think programs like Evince or LibreOffice Writer could do that, but they can't. It is also very hard to make a legible PDF file from Markdown. We can look at this, but there is no simple answer.
- Get BioLegato to recompile with Java11
- Update documentation on adding local components to BioLegato
- Remote execution - It might be almost trivial to add to BioLegato the capability to run jobs on remote servers. We could run the command with sshcc, but set it in an environment variable, so the PCD would look something like
shell "$BL_REMOTE blastp ...."
where $BL_REMOTE would be something like sshcc, or whatever command on your system sends a job to a remote host. This would only work on a clustered system where all hosts share a common file system, e.g. NFS. If $BL_REMOTE is blank, the command just runs on the local host.
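The $BL_REMOTE idea can be sketched in Python as follows. This is only an illustration of the prefixing logic, not how BioLegato itself runs PCD shell commands. The function name build_command and the file name in.fsa are hypothetical.

```python
import os
import shlex


def build_command(command, env=None):
    """Prefix `command` with $BL_REMOTE when it is set and non-blank."""
    env = os.environ if env is None else env
    prefix = env.get("BL_REMOTE", "").strip()
    # shlex.split lets BL_REMOTE hold a launcher plus options;
    # an empty prefix yields an empty list, so the command runs locally.
    return shlex.split(prefix) + shlex.split(command)


if __name__ == "__main__":
    # Prints the argument list that would be run on this system.
    print(build_command("blastp -query in.fsa"))
```

The resulting list can be handed to subprocess.run. With BL_REMOTE unset or blank, the command is unchanged and runs on the local host.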
- Remote BLAST/FASTA - Add a Run button to the local BLAST/FASTA menus to run BLAST/FASTA on a local server.
On CCL this would be the ccxx hosts, which is easy to do because they share $HOME directories with the login hosts. The command can be something like
bl_blast_server.sh <blast commands>
This approach has been implemented in a very short bash script that doesn't need any command line parameters other than the blast command, because it can run in the same directory on the remote host as it does on the login host.
However, to implement a more generic script, say for running on DAC clusters, we would need more parameters
bl_blast_server.sh <RHOST> <RDIR> <RUSERID> <infile> <outfile> <blast_command>
In both cases, the user has to have passwordless ssh set up. For the latter, the script would have to copy the infile to the remote host, run BLAST, and then copy back the .xml output from the remote host.
The underlying implementation for blastn would be
$BIRCH/dat/bldna/PCD/Database/BLASTNlocal.blmenu
@include "$BIRCH/local/admin/BLAST/server/BLASTNlocal.blinclude"
$BIRCH/local/script/bl_blast_server.sh
One issue is that we would have to have the include file in local-generic, probably with all lines commented out. Few systems would implement a local BLAST server. Those that do could uncomment the appropriate include files. They would also have to modify bl_blast_server.sh for their own site.
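For the more generic variant of bl_blast_server.sh described above (copy the infile to the remote host, run BLAST there, copy the output back), the three steps can be sketched in Python as below. This is an assumption-laden outline rather than the actual script: the function name remote_blast_steps is hypothetical, it relies on passwordless ssh being set up, and it assumes the BLAST command writes outfile into RDIR.

```python
import shlex


def remote_blast_steps(rhost, rdir, ruserid, infile, outfile, blast_command):
    """Return the copy-in, run, and copy-back commands as argument lists."""
    target = f"{ruserid}@{rhost}"
    # Run BLAST from inside the remote working directory.
    remote_cmd = f"cd {shlex.quote(rdir)} && {blast_command}"
    return [
        ["scp", infile, f"{target}:{rdir}/"],       # copy the query to the remote host
        ["ssh", target, remote_cmd],                # run BLAST remotely
        ["scp", f"{target}:{rdir}/{outfile}", "."],  # fetch the output file
    ]
```

Each list can be passed in order to subprocess.run(step, check=True), so that a failed copy or a failed BLAST run stops the sequence before the next step starts.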
bldna, blprotein
- xylem_shuffle - We were previously using the old version of shuffle. xylem_shuffle.py has been changed to call xylem_shuffle. The latter reads FASTA files, adds "-rand" to the output names, and no longer deletes the first two lines of output. These things used to be handled by xylem_shuffle.py, but now xylem_shuffle does that, so the script has been reduced to a very simple wrapper. xylem_shuffle.py also aborts if the input file doesn't exist, or is not larger than 1 character.
- Automated sequence renaming - Need to be able to rename sequences using some sort of regular expression substitution. SeqKit may be able to do this.
- How hard would it be to revise BioLegato to always use Accession numbers, rather than LOCUS names? Virtually no software uses LOCUS. This is moot except for very old sequences, since NCBI decided long ago to make LOCUS and Accession identical. However, if you do get an older sequence coming up, it would be good not to have to deal with this problem.
- [Bugzilla 1223 https://www.bioinformatics.org/support/index.php?func=detail_ticket&bug_id=1223&group_id=543] - Edit --> Change case - If you change case, you lose all annotation. After changing case, if you try to use File --> View file, a pseudo-GenBank file is shown that is missing almost all annotation. This seems to be a problem with BioLegato, since there is no "Change case" .blmenu file.
- It might be useful to be able to go from sequences to Neighbors or Links. Two possible ways:
- from bldna or blprotein, export sequences to blncbi based on accession numbers. May need a script something like GenBank2Entrez. Probably just runs Eutils.post.
- Have a script that directly sends output to blncbi, so that you're not only running Eutils.post, but also running elink.
- Time to revisit Genome Browsers. To consider:
- Genome Workbench - We could have an export function from bldna that extracts Accession numbers from entries and then loads them into Genome Workbench.
- UCSC Genome Browser
- VISTA
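For the automated sequence renaming item above, a minimal Python sketch of regular expression substitution on FASTA sequence names is shown here. It is independent of SeqKit and only illustrates the idea. The function name rename_fasta is hypothetical, and the sketch assumes the sequence name is the first space-delimited word of each header line.

```python
import re


def rename_fasta(lines, pattern, replacement):
    """Yield FASTA lines with the substitution applied to sequence names only."""
    regex = re.compile(pattern)
    for line in lines:
        if line.startswith(">"):
            header = line[1:].rstrip("\n")
            # The name is the first word; the rest of the header is left untouched.
            seq_id, sep, description = header.partition(" ")
            yield ">" + regex.sub(replacement, seq_id) + sep + description + "\n"
        else:
            yield line
```

For example, calling rename_fasta with the pattern r"\.\d+$" and an empty replacement strips a trailing version suffix from each name while leaving sequence lines unchanged.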
Phylogeny
- The Phylip web site mentions "A new release of PHYLIP, version 3.698" which fixes a consensus tree bug. No date is given. We should check to see if we have this version, and if not, upgrade.
- How about email notification for long running jobs?
- All phylogeny scripts (dnadist.py, dnapars.py, dnaml.py etc.) call the main program using Popen, but call later steps such as bootstrapping, consense, and uniqid, using subprocess.call. On some systems, one of these steps seems to be called before the previous step has completed, resulting in a No such file or directory message and empty output. You can rerun the program with no changes and then get expected output. Would changing all of these calls to Popen calls be more consistent? I have a feeling we've done this before, so tread carefully to avoid swapping one problem for another. The advantage of Popen is that we can do a p.wait() after every call.
- We need to somehow integrate taxfetch.py into blnalign and blpalign so that it can get taxonomic information from accession numbers. This is trickier than it might first appear. For example, BioLegato will export sequences using GenBank LOCUS names. It is not clear at what step in the process one would do this. The goal is to be able to generate a phyloxml file for Archaeopteryx to read. Also, it would be nice if we could get alphabetic taxonomy codes like those used by Pfam, as opposed to the numeric taxid numbers from NCBI.
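The sequencing concern in the phylogeny scripts above can be illustrated with a small Python sketch that runs every step through Popen and waits for each one before starting the next. Note that subprocess.call already blocks until its command exits, whereas Popen returns immediately, so a Popen call with no wait() is one possible (unconfirmed) source of the problem. The function name run_steps is hypothetical, and this is not the code used in dnadist.py or the other scripts.

```python
import subprocess


def run_steps(steps):
    """Run each command in order, waiting for it to exit before starting the next."""
    for argv in steps:
        p = subprocess.Popen(argv)
        returncode = p.wait()  # block until this step has finished
        if returncode != 0:
            # Stop here so later steps never read missing or partial output.
            raise RuntimeError(f"step failed with exit code {returncode}: {argv}")
```

Stopping at the first non-zero exit code also turns a silent empty output into an explicit error message.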
Multiple alignment
- blnalign, blpalign - Add MaxAlign to tutorials.
BLAST+
- $doc/BIRCH/birchadmin/blastdb/BLASTDB-Considerations.html - Add some stats on search times for various databases on different platforms.
blncbi
- add NCBI datasets, dataformat - easy command line tools that should complement ncbiquery.py
- add einfo
- rename blncbi to blentrez?
- It looks like Related and Link only work for nucleotide sequences. This needs to work for proteins as well.
samtools
- Depending on the Linux release, some library dependencies may not be present on the system. The best solution is to change the name of samtools in the bin-xxx-xxx directories to samtools.bin, and then revise samtools.sh (which is linked from script/samtools) to include lib-xxx-xxx/samtools/lib in the LD_LIBRARY_PATH or DYLD_LIBRARY_PATH.
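The library path logic for the samtools wrapper can be sketched as follows. The item above proposes doing this in samtools.sh, so this Python version is only an illustration of the environment handling. The function name library_env is hypothetical, and the library directory is passed in as an argument because the actual lib-xxx-xxx path depends on the platform.

```python
import os
import sys


def library_env(lib_dir, env=None, platform_name=None):
    """Return a copy of the environment with lib_dir prepended to the library path."""
    env = dict(os.environ if env is None else env)
    platform_name = sys.platform if platform_name is None else platform_name
    # macOS uses DYLD_LIBRARY_PATH; Linux uses LD_LIBRARY_PATH.
    var = "DYLD_LIBRARY_PATH" if platform_name == "darwin" else "LD_LIBRARY_PATH"
    existing = env.get(var, "")
    env[var] = lib_dir + (os.pathsep + existing if existing else "")
    return env
```

A wrapper would then launch the renamed binary with something like subprocess.run(["samtools.bin", *sys.argv[1:]], env=library_env(lib_dir)), where lib_dir points at the bundled samtools library directory.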
blreads
- Update to SPAdes 4.0. This package has moved to a GitHub site. Now has binaries for all platforms, including Darwin-arm64.
- Update to latest Hisat2 release.
- In pkg, there are both hisat and hisat2 links. There should be just a single link, but first we need to fix the symbolic links that point to hisat.
- clean up the differential tutorial. There are some inconsistencies in it, and we need to update it for the latest Hisat2
- There was a post on BioStars that indicated that the latest release of rnaspades no longer does error correction on reads. We had better look into this, because error checking programs like rcorrector also can eliminate unpaired reads (I'm pretty sure) and at the very least, the tutorials have to be changed to reflect the change in rnaspades.
- Transcriptome Assembly Tools - scripts for cleaning up reads, e.g. uncorrectable reads, overrepresented sequences etc.
- Transrate is no longer supported by the developer, and has a number of known bugs/issues. Potential alternatives:
- BUSCO for assembly statistics https://busco.ezlab.org
- rnaQUAST (now supports Python3)
- Try fastp as an alternative for trimming reads.
blmarker
- revise menus as done for blpalign, blnalign
- add support for other file formats
LAST
- last-dotplot requires python-pil. It doesn't look like it will be easy to package that in the lib-xxx-xxx/python hierarchy, so it should be documented as a dependency which has to be installed by the user.
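Until the dependency is documented, a wrapper could at least report a missing python-pil clearly. The sketch below checks whether a module can be imported without importing it. The function name have_module is hypothetical. PIL is the import name provided by python-pil.

```python
import importlib.util


def have_module(name):
    """Return True if the named top-level module can be found on this system."""
    return importlib.util.find_spec(name) is not None


if __name__ == "__main__":
    if not have_module("PIL"):
        print("last-dotplot needs python-pil, which must be installed by the user.")
```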
Cytoscape
On CCL, Cytoscape gets to the splash screen but hangs at that point. The Java error messages complain about permissions.