################################################################################### # FA_LOLIPOG_P: A package to calculate First-Approximate LOg-LIkelihood for the Pattern Of Gaps, # as well as the first-approximate perfect score (described below), # of a given MSA (almost exclusively via Perl modules and scripts). # # Version 0.3: Copyright (C) 2013 Kiyoshi Ezawa # Version 0.5: Copyright (C) 2014 Kiyoshi Ezawa # Version 0.5.1: Copyright (C) 2014 Kiyoshi Ezawa # Version 0.6: Copyright (C) 2015 Kiyoshi Ezawa # Version 0.6.1: Copyright (C) 2015 Kiyoshi Ezawa # Version 0.6.1.2: Copyright (C) 2015 Kiyoshi Ezawa # Version 0.6.1.3: Copyright (C) 2016 Kiyoshi Ezawa # Version 0.6.1.5: Copyright (C) 2016 Kiyoshi Ezawa # Version 0.6.1.6: Copyright (C) 2016 Kiyoshi Ezawa # Version 0.6.1.7: Copyright (C) 2016 Kiyoshi Ezawa # # This package is free software: you can redistribute it and/or modify # it under the terms of the GNU General Public License as published by # the Free Software Foundation, either version 3 of the License, or # (at your option) any later version. # # This package is distributed in the hope that it will be useful, # but WITHOUT ANY WARRANTY; without even the implied warranty of # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the # GNU General Public License for more details. # # You should have received a copy of the GNU General Public License, "GNU_GPL.txt", # along with this package. If not, see . # # The author can be contacted by e-mailing to # (replace " dot " and " at " with "." and "@", respectively). # ################################################################################### * [ Major modification from version 0.6.1.6 to version 0.6.1.7 ] The reference names, "Ezawa 2016" and "Ezawa, unpublished", were changed into "Ezawa 2016a" and "Ezawa 2016b", respectively. And the reference information of the latter was updated. * [ Major modification from version 0.6.1.5 to version 0.6.1.6 ] A new sample script, "Sample_Scripts/fa_lolipog_hs.alpha2.pl", was created. The script can read in a Dawg control file with different parameter settings for insertions and deletions. Accordingly, a note was added to CAVEAT3. * [ Major modification from version 0.6.1.3 to version 0.6.1.5 ] A note was added to CAVEAT2. CAVEAT3 was added. * [ Major modification from version 0.6.1.2 to version 0.6.1.3 ] A bug was removed from the subroutine, 'correl_linregr_sglexpl_wgt', in the Perl Module, 'MyPerlModules/MyLinearRegression.pm'. The directory, "./INDEL_LIKELIHOOD/Proof_of_Principle3/," was added. The directory, "./INDEL_LIKELIHOOD/Perturbation_Analyses/PWA_probs/Iteration_Formulas/," was updated. The title of a reference (Ezawa, unpublished) changed. A reference (Ezawa 2016) was added. * [ Major modification from version 0.6.1.1 to version 0.6.1.2 ] The reference information was updated as follows: (OLD) "Ezawa, Graur and Landan 2015c" -> (NEW) "Ezawa, unpublished." * [ Major modifications from version 0.6.1 to version 0.6.1.1 ] The source information for the references, "Ezawa, Graur and Landan. 2015a,b," was updated. * [ Major modifications from version 0.6 to version 0.6.1 ] The major reference information, including the titles, sources and statuses, changed. (From "Ezawa, Landan, and Graur, unpublished" to "Ezawa, Graur and Landan. 2015a,b,c.") * [ Major modifications from version 0.5.1 to version 0.6 ] (i) The title of the major reference (Ezawa et al., unpublished) was changed. (ii) The package now contains some new Perl modules, named "MyPertProb_xxx.pm," that perform some perturbation calculations on the probabilities of local alignments. Some Perl scripts that use these modules were also contained in the package. (iii) The package now contains a new component that enables the analyses of the homology structures (see, e.g., Lunter et al. 2005) of the alignments. The component consists of two new Perl modules, "MyReadAlnProbIngredients.pm" and "MyTreeMap_indels_ML_hs.pm," as well as some scripts that use the modules. Accompanying this modification: (1) some Perl script were added under the directory, 'Sample_Scripts/'; and (2) some Perl scripts under the directory, 'INDEL_LIKELIHOOD/', were also replaced. * [ Major modifications from version 0.5 to version 0.5.1 ] (i) For the convenience of the users, the sample script, "br_infer_psm_indel_events_spt_odr_dawg.v2.pl," in "Sample_Scripts/" was renamed as "fa_lolipog.alpha.pl," and another sample script, "br_infer_psm_indel_events_spt_odr_dawg.pl," was removed. (ii) The file, "HISTORY_OF_TERMINOLOGY_CHANGES.txt," was added. * [ Major modifications from version 0.3 to version 0.5 ] (i) The latter version got focused on calculating the LOLIPOG (LOg-LIkelihood for the Pattern Of Gaps) of an input multiple sequence alignment (MSA), by temporarily freezing the ability of the former version to calculate the log-likelihood for the residue pattern in MSA. Thus, this version cannot calculate what I used to call the "first-approximate perfect (FAP) score" of a MSA in version 0.3. (The function will be resurrected in a future version.) (ii) The Perl modules to enumerate parsimonious indel scenarios now respects the spatial order of indel events that do not appear to interfere with each other at first glance, if a flanking block has the "presence" state at a 'downstream' node and prevents the 'deletion' event from moving across. (Single quotes indicate that the entity is considered in a (virtual) time direction in which the event is regarded as a 'deletion'.) -------------------------[CAVEAT1]------------------------------------------ The current version was designed mainly to analyze (almost selectively neutral) DNA MSAs, and performance tests were conducted only on simulated DNA MSAs generated by DAWG (Cartwright 2005). Although the package is in principle applicable to real-life MSAs of DNA or protein sequences, caution should be excercised when doing such analyses, because a real-life MSA in general is a product of more complex evolutionary processes (some examples are found in Discussion of Ezawa 2016a), and because NO MSAs reconstructed by sequence aligning programs are guaranteed to be correct, while this package requires a correct MSA as input. ------------------------------------------------------------------------------- -------------------------[CAVEAT2]------------------------------------------ Thus far, I confirmed that the package works on the termnal of my Mac OS X platform (10.6 on Intel Mac), and on a GNU/Linux platform (x86_64). And I suspect that it should work on other UNIX-type platforms as well (although I am not completely sure). [NOTE ADDED on June 14 (Tue), 2016: I confirmed that the package works also on Linux (Red Had Enterprise Linux 6.4).] On the other hand, I am almost sure that it won't work on Windows, because the modules and scripts heavily depend on Unix commands. So, for an exclusive Windows-user to use the package, he/she will have to learn how to use Unix (and maybe to install a Unix emulator). I hope that somebody with a goodwill will adapt the package to non-Unix platforms... ------------------------------------------------------------------------------- -------------------------[CAVEAT3]------------------------------------------ Currently, the main script, "Sample_Scripts/fa_lolipog_hs.alpha.pl," and some of the scripts in "INDEL_LIKELIHOOD/" that input the Dawg control file, can deal only with the situations where the insertion rate is equal to the deletion rate. This is exclusively because their input interface (consisting of the subroutine, "read_dawg_ctrl_file", and its surrounding) is primitive. The Perl modules and internal subroutines for the indel analyses are actually flexible enough to deal with more general situations. Although I hope to rectify this problem later, the users could modify the input interface on their own to analyse the cases with inequal rates, if they prefer. [NOTE ADDED on July 31 (Sun), 2016: Now, the package contains a new sample script, "Sample_Scripts/fa_lolipog_hs.alpha2.pl", which can properly read in a Dawg control file specifying different rates and length distributions for an insertion and a deletion. If the uses want to analyze the cases with different model parameter settings for insertions and deletion, they could do as follows: (1) Issue a command: "diff Sample_Scripts/fa_lolipog_hs.alpha.pl Sample_Scripts/fa_lolipog_hs.alpha2.pl" and see the differences between the two sample scripts. (2) Incorporate similar differences into the relevant scripts in "./INDEL_LIKELIHOOD/" that reads in Dawg control files. (3) Aside from these modifications, create auxiliary files via "Sample_Scripts/InData/calc_log_mfacs_pars_pwa_dawg.pl" (or equivalent scripts in the right locations) under the same indel model parameter setting (including the ratio between the insertion rate and the deletion rate) as described in the Dawg control file to be used. (4) Perform the analyses using the tools and ingredients thus prepared. ] ------------------------------------------------------------------------------- If you find any problems that is not solvable by yourself (and to which the above caveats do NOT apply), please e-mail the author at the address: (replace " dot " with ".", and " at " with "@"). [ Directory (i.e. folder) structure ] FA_LOLIPOG_P.ver0.6.1/ is the "root directory" of the package (ver 0.6.1). It contains the following sub-directories and files: * README.txt File you are now reading. * DISCLAIMER.txt Disclaimer of warranty and limitation of liability extracted from the GNU General Public License, version 3. * GNU_GPL.txt Copy of the GNU General Public License, version 3. * HISTORY_OF_TERMINOLOGY_CHANGES.txt Record of the history of changes in terminology related to this package. * Sample_Scripts/ Directory containing a sample Perl script that shows how to use the main Perl modules, as well as two auxiliary Perl scripts that will be needed for the sample script to run properly. Specifically, The file, "fa_lolipog_hs.alpha.pl," is the sample script, which enumerates parsimonous local indel histories that can result in the gap-configuration (actually, the homology structure) of each gapped segment in a MSA, calculates the logarithm of the probability (log-probability) of the homology structure of the segment, and calculates the relative probability that each local indel history contributes to the homology structure of the gapped segment among parsimonious histories. [A new version, "fa_lolipog_hs.alpha2.pl" was added on July 31 (Sun), 2016. It differs from the old version ("fa_lolipog_hs.alpha.pl") in that it can read in a Dawg control file specifying different rates and/or length distributions for insertion and deletion.] The file, "preprocess_msa_dawg.pl," is one of the auxiliary scripts, which pre-processes the input MSA so that local MSAs with the same homology structure will be displayed in a unique manner. The file, "Sample_Scripts/InData/calc_log_mfacs_pars_pwa_dawg.pl," is the other of the auxiliary scripts, which creates the tables of pre-computed multiplication factors of various homology structures of local PWAs. The tables constitutes one essential set of input files for "fa_lolipog_hs.alpha.pl." The file, "manual.fa_lolipog_hs.alpha.txt," describes how to use these scripts. * MyPerlModules/ Directory containing Perl modules that are necessary for this package but that are not available as a part of the DENSERM_P package (Ezawa 2013). Among the modules, "MyPerlModules/MyTreeMap_indels_spt_odr.pm" and "MyPerlModules/MyTreeMap_indels_ML_hs.pm" play key roles in this package. The former is essential for our sub-algorithm that attempts to enumerate all possible parsimonious insertion/deletion histories each of which result in the homology structure of a gapped segment in a given MSA. The latter is essential for calculating the logarithmic probability (i.e., log-likelihood) of the homology structure. The algorithms are described briefly in (Ezawa 2016b) and in details in (Ezawa, Graur and Landan, 2015b). Other modules, along with those in "MyPerlModules_DENSERM/," either logistically support these main modules or take care of the input/output of data. [NOTE: The module "MyPerlModules/MyTreeMap_indels.pm" is the predecessor of "MyPerlModules/MyTreeMap_indels_spt_odr.pm." A major difference is in how to handle two blocks with seemingly swappable gap patterns; the latter freely swaps the blocks if necessary, whereas the former keeps the blocks' spatial order if one block has a "presence" state in the 'downstream' of the indel event in the other block (in the virtual time direction that it is regarded as a 'deletion').] * MyPerlModules_DENSERM/ Directory containing Perl modules that are already available as a part of the DENSERM_P package (Ezawa 2013). Some of them are necessary for this FA_LOLIPOG_P package as well. * INDEL_LIKELIHOOD/ Directory containing major Perl scripts, a set of tables of pre-computed PWA multiplication factors, as well as contril files of Dawg (Cartright 2005), that were used for the analyses to demonstrate that our algorithms and mathematical formulas work adequately well. The algoriths, mathematical formulas and the results of the analyses are described briefly in (Ezawa 2016b) and in details in (Ezawa, Graur and Landan, 2015a,b). See "README.INDEL_LIKELIHOOD.txt" in the directory for more details. (On March 26, 2016, control files for additional three simulated MSA sets (3P, 3M and 3F) were imported from the package, 'ComplLiMment_P.ver0.6.1.3,' to this directory. These sets and the package are described in details in (Ezawa 2016a).) [ How to use the package ] * GENERAL NOTE: In this package, you should run a Perl script "xxx.pl" by issuing a command "perl xxx.pl." (Alternatively, you could change the permission mode of each script via, e.g., "chmod u+x {script name}", and issue a command "./{script name}". I would not recommend it much, though...) 0. If not yet, you might want to install the molecular evolution simulator, 'Dawg' (Cartwright 2005) into your platform, especially if you want to create simulated MSAs by yourself. 'Dawg,' its brief manual, etc. are available at the URL, http colon slash slash scit.us slash projects slash dawg (replace ' colon ' and ' slash ' with ':' and '/', respectively). 1. Extract the archive, "Suppl4_draftyyy.zip" or "FA_LOLIPOG_P.vxxx.tar.gz", via the command, "unzip Suppl4_draftyyy.zip" or "tar -xpzf FA_LOLIPOG_P.vxxx.tar.gz" (, which I suspect you've already done). Then, 'cd' to the top directory, "FA_LOLIPOG_P.vxxx/", which popped out of the archive. 2. Add the absolute paths of the sub-directories "MyPerlModules/" and "MyPerlModules_DENSERM/" to the environment variable "PERL5LIB." If you are using bash, add the following lines to an appropriate place in the ".bashrc" file in your home directory: ---------- lines to be added to "~/.bashrc" ----------------- if [ -n "$PERL5LIB" ]; then export PERL5LIB=${PERL5LIB}:{the absolute path of "MyPerlModules"}:{the absolute path of "MyPerlModules_DENSERM"} else export PERL5LIB={the absolute path of "MyPerlModules"}:{the absolute path of "MyPerlModules_DENSERM"} fi -------------- end --------------------------- If you are using tcsh (or csh), add the following lines to a proper place in '.tcshrc' (or '.cshrc') in your home directory: ---------- lines to be added to "~/.tcshrc" (or "~/.cshrc") ----------------- if (($?PERL5LIB) && ("$PERL5LIB" !~ "")) then setenv PERL5LIB ${PERL5LIB}:{the absolute path of "MyPerlModules"}:{the absolute path of "MyPerlModules_DENSERM"} else setenv PERL5LIB {the absolute path of "MyPerlModules"}:{the absolute path of "MyPerlModules_DENSERM"} endif -------------- end --------------------------- Alternatively, you could 'cp' the modules in "MyPerlModules/" and in "MyPerlModules_DENSERM/" to a directory that is already listed in "PERL5LIB." (You can confirm the directories listed in the environment variable via, e.g., "echo $PERL5LIB.") 3. Now you can run the Perl script, "fa_lolipog_hs.alpha.pl," in the subdirectory, "Sample_Scripts/," to find out parsimonious local indel histories that can explain the homology structure of each gapped segment of a given MSA, calculate the probabilities of their relative contributions, and calculate the approximate probability that the homology structure of each segment occurs. [Depending on the situation, you will also have to run one or both of the auxiliary scripts, "preprocess_msa_dawg.pl" and "Sample_Scripts/InData/calc_log_mfacs_pars_pwa_dawg.pl."] For instructions on how to do that, read the file, "manual.fa_lolipog_hs.alpha.txt," in the subdirectory. 4. Or, alternatively, you could roughly trace the path that we took to demonstrate that our algorithms and mathematical formulas work fine, as described in our manuscript (Ezawa, Graur and Landan, 2015b). This can be done by 'cd'ing to "INDEL_LIKELIHOOD/" and running the scripts in its sub-directories in a proper order. See "README.INDEL_LIKELIHOOD.txt" for more detailed instructions. (NOTE: To trace the entire web of paths we took, you will need lots of minor Perl scripts. If you want to do this, please contact the author of this package (K.Ezawa).) [ References ] * Cartwright RA. 2005. "DNA assembly with gap (Dawg): simulating sequence evolution." Bioinformatics 21:iii31-iii38. * Ezawa K. 2013. "DENSERM: DEtecting Negative SElection on Recurrent Mutations," in Bioinformatics.org [URL: "http colon slash slash www.bioinformatics.org slash ftp slash pub slash DENSERM" (replace ' colon ' and ' slash ' with ':' and '/', respectively)]. * Ezawa K. 2016a. "Characterization of multiple sequence alignment errors using complete-likelihood score and position-shift map." BMC Bioinformatics 17:133. (DOI 10.1186/s12859-016-0945-5). * Ezawa K. 2016b. "General continuous-time Markov model of sequence evolution via insertions/deletions: local alignment probability computation." BMC Bioinformatics 17:397. (DOI: 10.1186/s12859-016-1167-6). * Ezawa K, Graur D, Landan G. 2015a. "Perturbative formulation of general continuous-time Markov model of sequence evolution via insertions/deletions, Part II: Perturbation analyses." bioRxiv doi: http://dx.doi.org/10.1101/023606. * Ezawa K, Graur D, Landan G. 2015b. "Perturbative formulation of general continuous-time Markov model of sequence evolution via insertions/deletions, Part III: Algorithm for first approximation." bioRxiv doi: http://dx.doi.org/10.1101/023614. * Lunter G, Miklos I, Drummond A, Jensen JL, Hein J. 2005. "Bayesian coestimation of phylogeny and sequence alignment." BMC Bioinformatics 6:83. # First version of this file was created on May 25th (Sun), 2014 by K. Ezawa. # It was rewritten on Sep 2 (Tue), 2014, when the package was updated from ver. 0.5 to ver. 0.5.1, by K. Ezawa. # It was further rewritten on June 17 (Wed),in 2015, when the package was updated from ver. 0.5.1 to ver. 0.6, by K. Ezawa. # It was further rewritten on July 30 (Thu),in 2015, when the package was updated from ver. 0.6 to ver. 0.6.1, by K. Ezawa. # It was further rewritten on Aug 4 (Tue),in 2015, when the package was updated from ver. 0.6.1 to ver. 0.6.1.1, by K. Ezawa. # It was further rewritten on Aug 31 (Mon),in 2015, when the package was updated from ver. 0.6.1.1 to ver. 0.6.1.2, by K. Ezawa. # It was further rewritten on March 26 (Sat),in 2016, when the package was updated from ver. 0.6.1.2 to ver. 0.6.1.3, by K. Ezawa. # It was further rewritten on June 14 (Tue),in 2016, when the package was updated from ver. 0.6.1.3 to ver. 0.6.1.5, by K. Ezawa. # It was further rewritten on July 31 (Sun),in 2016, when the package was updated from ver. 0.6.1.5 to ver. 0.6.1.6, by K. Ezawa. # It was further rewritten on October 8 (Sat),in 2016, when the package was updated from ver. 0.6.1.6 to ver. 0.6.1.7, by K. Ezawa.