From Bioinformatics.Org Wiki
One feature that would make a strong addition to BioLegato is an extension to PCD for inserting or replacing code at run time.
For example, in a menu for running BLAST, it would be nice if the part of the PCD code that lists the databases available on the local system could be read from a file, rather than being hard-coded into the PCD.
The PCD code might look like this:
    var "dbase"
        type combobox
        label "Database"
        default 0
        choices
            include $BIRCH/local/dat/BLAST/nt-db.txt
            "User-created file (FASTA format)" "-subject %USERFILE%"
where the referenced file contains the lines that should be inserted at that point in the code, e.g.
            "nt - non-redundant nucleotide" "-db nt"
            "pdbnt - PDB nucleotide seqs." "-db pdbnt"
            "vector - Vector seqs." "-db vector"
The PCD would be parsed as if the original file had been
    var "dbase"
        type combobox
        label "Database"
        default 0
        choices
            "nt - non-redundant nucleotide" "-db nt"
            "pdbnt - PDB nucleotide seqs." "-db pdbnt"
            "vector - Vector seqs." "-db vector"
            "User-created file (FASTA format)" "-subject %USERFILE%"
(I am borrowing from the C #include function.)
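To make the substitution concrete, here is a minimal sketch in Java of the textual expansion described above. It is only an illustration: the class name IncludeExpansionSketch, the expand method, and the use of an in-memory map in place of real files are all assumptions made for the example, not part of BioLegato.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of include expansion; not BioLegato's actual code. */
final class IncludeExpansionSketch {

    private static final String KEYWORD = "include ";

    /**
     * Returns a copy of pcdLines in which every line of the form
     * "include NAME" is replaced by the lines stored under NAME in files.
     * The map stands in for the file system so the example is self-contained.
     */
    static List<String> expand(List<String> pcdLines, Map<String, List<String>> files) {
        List<String> out = new ArrayList<>();
        for (String line : pcdLines) {
            String trimmed = line.trim();
            if (trimmed.startsWith(KEYWORD)) {
                String name = trimmed.substring(KEYWORD.length()).trim();
                List<String> included = files.get(name);
                if (included == null) {
                    throw new IllegalArgumentException("Include file not found: " + name);
                }
                // Inserted verbatim at this point, in the spirit of C's #include.
                out.addAll(included);
            } else {
                out.add(line);
            }
        }
        return out;
    }
}
```

The parser would then see only the expanded lines, so nothing downstream needs to know that an include ever existed.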
This kind of mechanism would be a powerful addition to PCD. I have numerous uses for it. At present, I am simulating the process with Python scripts that process the PCD files when new database files are added or deleted, but there needs to be a cleaner way to do this.
The usefulness of this PCD instruction goes beyond simple lists. Just about any PCD could be included in the input file. This would make it easier to build some very complex menus from fairly simple pieces of code.
Should include be indentation agnostic?
Yes: assumes that the source file is not indented. When included, the current indentation level will be applied to the included lines.
No: assumes that the source file is indented properly for the context of the insertion point.
- File must be indented correctly for the scope of the insertion site
- Still need to decide whether or not include statement must be indented.
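The two indentation policies can be compared with a small Java sketch. The class and method names here are hypothetical, and the code only manipulates lists of strings. It shows what the indentation-agnostic option would have to do, next to the verbatim insertion that the decision above implies.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch comparing the two indentation policies; not BioLegato code. */
final class IncludeIndentSketch {

    /** Leading blanks and tabs of a line, i.e. the indentation at the insertion point. */
    static String indentOf(String line) {
        int i = 0;
        while (i < line.length() && (line.charAt(i) == ' ' || line.charAt(i) == '\t')) {
            i++;
        }
        return line.substring(0, i);
    }

    /** Indentation-agnostic policy: prefix each non-empty included line with the given indent. */
    static List<String> reindent(List<String> included, String indent) {
        List<String> out = new ArrayList<>();
        for (String line : included) {
            out.add(line.isEmpty() ? line : indent + line);
        }
        return out;
    }

    /** Policy chosen in the notes: the include file is already indented, so copy it unchanged. */
    static List<String> verbatim(List<String> included) {
        return new ArrayList<>(included);
    }
}
```

The verbatim policy is simpler to implement, at the cost that an include file can only be used at one indentation depth.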
Should include be recursive?
Should the included source also allow includes? If so, the PCD parser needs to be able to detect circular dependencies.
- Not now. However, if recursion is added at a later time, that won't break existing code or .blmenu files.
- If an include is detected in an included file, print an error saying "Recursion not supported in this version."
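A check for nested includes could be as small as the following Java sketch. The class and method names are invented for the example, and it throws an exception carrying the message rather than printing it, which is an assumption about how errors would be reported.

```java
import java.util.List;

/** Hypothetical sketch of the "no recursion" rule; not BioLegato code. */
final class NestedIncludeCheckSketch {

    static final String MESSAGE = "Recursion not supported in this version.";

    /** Throws if any line of an already-included file is itself an include line. */
    static void rejectNestedIncludes(String includeFileName, List<String> includedLines) {
        for (int i = 0; i < includedLines.size(); i++) {
            String trimmed = includedLines.get(i).trim();
            if (trimmed.equals("include") || trimmed.startsWith("include ")) {
                // Report the file and 1-based line number so the menu author can find it.
                throw new IllegalStateException(
                        includeFileName + ", line " + (i + 1) + ": " + MESSAGE);
            }
        }
    }
}
```

Because nested includes are rejected outright, no cycle detection is needed in this version.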
- Absolutely no fully-qualified file paths!!! That works against portability.
- Probably best if the file path is relative to the directory in which the parent PCD file is found.
- If we allow subdirectories in the file path, that could compromise the portability if we ever implement PCD on Windows, since file separators are '\' rather than '/'.
- Should the file path allow environment variables? There is actually a strong argument in favor of doing so, because for a chooser or combobox, we might want to take the choices from a file in a different directory. For example, a list of local BLAST databases might be found in some directory in $BIRCH/local, rather than the directory for the parent PCD file. This requires solving the file separator problem above, to allow Windows paths.
- Use an existing method, e.g. BLMain.envreplace("$BL_HOME"), to parse paths.
- In most cases, BioLegato uses File.pathSeparator and File.separator to build paths. The main difference in handling Linux vs. Windows is that an environment variable must still begin with $, rather than following the %var% convention used in Windows.
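The path rules in this list could be combined roughly as in the Java sketch below. Everything here is hypothetical: the class IncludePathSketch and its methods are made up, expandEnv is only a stand-in for BLMain.envreplace (whose exact signature I am not assuming), and the environment is passed in as a map so the example can be run without touching real variables.

```java
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch of include-path resolution; not BioLegato code. */
final class IncludePathSketch {

    // A variable reference is "$" followed by letters, digits or underscores.
    private static final Pattern ENV_VAR = Pattern.compile("\\$(\\w+)");

    /** Replaces each $NAME with its value from env; unknown names are left untouched. */
    static String expandEnv(String text, Map<String, String> env) {
        Matcher m = ENV_VAR.matcher(text);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            String value = env.get(m.group(1));
            m.appendReplacement(sb, Matcher.quoteReplacement(value != null ? value : m.group()));
        }
        m.appendTail(sb);
        return sb.toString();
    }

    /**
     * Rejects paths written as fully-qualified in the menu file, expands
     * environment variables, and resolves the result against the directory
     * of the parent PCD file. A path that becomes absolute only through a
     * variable such as $BIRCH is accepted and returned as-is.
     */
    static Path resolve(String rawPath, Path parentDir, Map<String, String> env) {
        if (Paths.get(rawPath).isAbsolute()) {
            throw new IllegalArgumentException(
                    "Fully-qualified include path not allowed: " + rawPath);
        }
        Path expanded = Paths.get(expandEnv(rawPath, env));
        return parentDir.resolve(expanded).normalize();
    }
}
```

Using java.nio.file.Path for the joins leaves the separator handling to the platform, which is one possible answer to the Windows concern raised above.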
Is include a part of the PCD language, or a preprocessing directive outside of PCD?
This affects both the underlying implementation of how BioLegato processes a PCD file, as well as the syntax of PCD.
If it makes pre-processing easier, we could have some sort of flag, like a hash mark, that could signal doing the inclusion before parsing the PCD. It would not be considered a formal part of the PCD language, but rather a pre-processing directive.
In a way, this might be good, because include wouldn't cause a change in PCD, per se, but rather a change in how PCD is implemented.
- Probably simplest if include was a pre-processing step not part of the PCD language.
- Maybe we use some symbol(s) other than # to flag the include
It's probably best to use a symbol other than '#' to indicate an include. An include is fundamentally different from a comment, so from an OO viewpoint, it should have a distinct definition. As well, it's easier to scan for if we use a non-#. The other thing is that we risk conflicting with comment-detection elsewhere. Granted, C uses #include, but it's possible that that was seen as a mistake, in retrospect. I don't know what the C community thinks about this.
For now, let's use '@' as the character for any pre-processing directive, and specifically, @include for an include line.
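As a rough illustration of why a separate flag character helps, the Java sketch below classifies a line before any PCD parsing happens. The class, the enum, and the choice to reject unknown @ directives are all assumptions made for the example. It also assumes that '#' starts a comment line, as the discussion above implies.

```java
/** Hypothetical sketch of line classification in a pre-processing pass; not BioLegato code. */
final class DirectiveSketch {

    enum Kind { COMMENT, INCLUDE, PCD }

    /** Decides how the pre-processor should treat one line of a menu file. */
    static Kind classify(String line) {
        String trimmed = line.trim();
        if (trimmed.startsWith("#")) {
            return Kind.COMMENT; // assumed comment syntax; never confused with a directive
        }
        if (trimmed.startsWith("@")) {
            // The directive name is the text between '@' and the first whitespace.
            String name = trimmed.substring(1).split("\\s+", 2)[0];
            if (name.equals("include")) {
                return Kind.INCLUDE;
            }
            throw new IllegalArgumentException("Unknown pre-processing directive: @" + name);
        }
        return Kind.PCD; // everything else is passed through to the PCD parser
    }
}
```

With '@' reserved for directives, further directives could be added later without touching the comment handling.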
Can we come up with a good example that demonstrates the functionality and usefulness of pcd.exec? We also need some precise documentation of how to use this feature.
File naming conventions
- There is no reason I can think of to require include to enforce any file-naming conventions on the include file.
- However, I propose that included files should probably have a .blinclude extension. This distinguishes them from .blmenu files. I am not sure whether the .pcd extension could be used instead, or whether that is reserved. pcd.jjt lists four file extensions (.pcd, .blmenu, .blitem and .biopcd) that are legal for PCD menus.
- In fact, maybe the one thing that might be enforced is that include files can NOT have the .blmenu extension!
- PCD.getCurrentPWD reports the directory from which the current menu is being read. If we stipulate that the include file has to be in the same directory as the .blmenu file, this could simplify things. We could potentially also require that the two files have the same basename, but different extensions, e.g. x.blmenu and x.pcd.
- Alternatively, do we want to implement environment variables as part of the path to the include file?
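If the naming ideas in this list were adopted, the checks might look like the Java sketch below. It is purely illustrative: the class and method names are invented, and it uses the proposed .blinclude extension for the same-basename case, which is only one of the options floated above.

```java
import java.io.File;

/** Hypothetical sketch of the proposed naming rules; not BioLegato code. */
final class IncludeNamingSketch {

    static final String MENU_EXT = ".blmenu";
    static final String INCLUDE_EXT = ".blinclude";

    /** The one rule that might be enforced: an include file may not end in .blmenu. */
    static boolean isAcceptableIncludeName(String fileName) {
        return !fileName.endsWith(MENU_EXT);
    }

    /** Same directory, same basename, different extension: x.blmenu maps to x.blinclude. */
    static File siblingInclude(File menuFile) {
        String name = menuFile.getName();
        if (!name.endsWith(MENU_EXT)) {
            throw new IllegalArgumentException("Not a " + MENU_EXT + " file: " + name);
        }
        String base = name.substring(0, name.length() - MENU_EXT.length());
        return new File(menuFile.getParentFile(), base + INCLUDE_EXT);
    }
}
```

Note that the same-basename rule would allow only one include file per menu, which may be too restrictive for menus such as those that reuse a shared list of choices.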
Syntax of the include line
- Three choices regarding quotation of file path:
- quoting NOT allowed eg. include path
- quoting required eg. include ["|']path["|']
- quoting optional
- Spacing between include and path
- 1 or more spaces?
- require 4 spaces as in Python?
Re: quotation - My guess is that we need to support quoting in case there are file path components containing (Ugh!) blanks. However, I wouldn't make quoting mandatory.
Re: Spacing - Since the include line is not really a PCD statement anyway, it makes no sense to enforce some sort of indentation rule. I'd say that the path is simply the remainder of the line, following include and 1 or more blank spaces.
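The quoting and spacing rules suggested in these two notes could be parsed along the lines of the Java sketch below. The class name, method name and regular expression are my own assumptions for illustration. The sketch treats the path as the remainder of the line after @include and one or more blanks, and strips one pair of matching single or double quotes if present.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch of include-line parsing; not BioLegato code. */
final class IncludeLineSketch {

    // Optional indentation, "@include", one or more blanks, then the rest of the
    // line up to its last non-blank character.
    private static final Pattern INCLUDE_LINE =
            Pattern.compile("^\\s*@include[ \\t]+(.*\\S)\\s*$");

    /** Returns the include path, or an empty Optional if the line is not an include line. */
    static Optional<String> parsePath(String line) {
        Matcher m = INCLUDE_LINE.matcher(line);
        if (!m.matches()) {
            return Optional.empty();
        }
        String path = m.group(1);
        if (path.length() >= 2) {
            char first = path.charAt(0);
            char last = path.charAt(path.length() - 1);
            // Quotes are optional, but if used they must match at both ends.
            if ((first == '"' || first == '\'') && first == last) {
                path = path.substring(1, path.length() - 1);
            }
        }
        return Optional.of(path);
    }
}
```

Because the path is everything after the keyword, a path containing blanks works even without quotes. Quoting mainly helps readability and protects leading or trailing blanks.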
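Implementation: Both indentation and optional quotes have been implemented. The syntax of the include line is therefore @include, followed by one or more spaces and the file path, which may optionally be enclosed in quotes.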
The code for parsing is found in bioLegato/src/BioPCD/parser/src/org/biopcd/parser. Where do we actually implement include? Have a look at loadPCD in BLMain.java.
It looks like PCD (which is generated from pcd.jjt) contains the code for actually reading each menu. At line 661 we see:
    // Open PCD menu file.
    FileReader infile = new FileReader(path);

    // Create a new PCD object to store the PCD data read in
    // from the PCD menu file, and parse the menu file into the
    // PCD menu object.
    PCDObject pcdo = loadPCDStream (infile, path.getParentFile(), canvas);
This changes to
    // Call includePCD to insert includes into the blmenu file
    File temp1 = PreProcess.includePCD(path);

    // Open PCD menu file.
    //FileReader infile = new FileReader(path);
    FileReader infile = new FileReader(temp1);
    // etc.
BIRCHDEV/local/script/bldna.include - calls bioLegato 1.0.6
In principle, it should be possible to create a .blmenu file with nothing but includes, which evaluate to a complete .blmenu file. I can't imagine why anyone would want to do that, but it might be a test that should be passed.
- Database --> BLASTNlocal
- Menu file: BIRCHDEV/local/dat/bldna/PCD/Database/testBLASTNlocal.blmenu - reads blastdb_n.txt as include file
- Include file: BIRCHDEV/local/dat/bldna/PCD/Database/blastdb_n.txt - include file for testBLASTNlocal.blmenu
- Test: BIRCHDEV/local/dat/bldna/PCD/Database/MUSCATAL.gen - GenBank of mouse catalase gene. It is best to test this sequence using the RefSeq Gene database. RefSeq Gene is a small database, so the search should return results almost instantaneously.
- Database --> FEATURES_KEY
- Menu: BIRCHDEV/local/dat/bldna/PCD/Database/testFEATURES_KEY.blmenu - runs the Features program on a GenBank file
- Include file: BIRCHDEV/local/dat/bldna/PCD/Database/feakey.txt - list of feature keys for choice menu
- Test: BIRCHDEV/local/dat/bldna/PCD/Database/mouse_catalase.gen - contains 8 mouse catalase sequences. Try extracting features with key words such as CDS, STS, 5'UTR, repeat_region.
DONE for NCBINUC and NCBIPROT.
- Database --> Nucleotide - Query NCBI Nucleotide Database
- Menu: BIRCHDEV/local/dat/blncbi/PCD/Database/testNCBINUC.blmenu - reads feakey.txt for 8 duplicate choice menus
- Include file: BIRCHDEV/local/dat/bldna/PCD/Database/feakey.txt - read by testNCBINUC.blmenu as include file
- Test: Primary organism: Pisum AND (TextWord: PR10 OR TextWord: drr206) - should return 13 entries to blncbi.
- Preferences --> BLHelper
- <strike>UpdateAddInstall --> BlastDB Report - include a locally-specified FTP site specified in $BIRCH/local. Same for other BlastDB menus</strike>
- UpdateAddInstall --> BlastDB Update/Add/Delete - Do we gain anything by using include in these to specify database choices?
BLHelper, and Update, Add and Delete menus may be more trouble than they're worth, when it comes to using @include.
One thing that is becoming apparent is that where only a single menu field needs replacing, @include makes things simpler. Where several fields must be substituted, using @include might actually make things harder to understand for the PCD programmer.
tbl2asn - use @include for choices in qXfield. This may also not be the best idea, since we have to hardwire default values for each of the choosers. As well, there is a documented bug in BioLegato with this menu that currently has a workaround in the .blmenu code itself.