PLoS Computational Biology recently published an article about spliceosomal intron evolution by Nguyen, Yoshihama, and Kenmochi. The authors were unaware of some earlier independent results. Most importantly, the main point of the article, that of estimating the density of potential intron sites, is not novel. It was described more than three months earlier. The numerical results are virtually identical in the two publications, which is not surprising, since they apply the same model to the same data. A recent article points to the model's validity. Raible and coauthors report that introns in the protostome Platynereis dumerilii are almost as abundant as in humans, and many introns are in homologous positions between the two species. The shared positions indicate that at most one-third of human introns were gained in the vertebrate lineage, in agreement with the estimates of both publications. In contrast, parsimony estimates should change significantly when including P. dumerilii.
To estimate ancestral intron losses and gains, Nguyen and coauthors use an exponential-time procedure, which is practical only for a few species. In fact, the estimation can be done in linear time, as described briefly below.
We model intron presence and absence in homologous sites across organisms related by a known phylogeny. Presence and absence are encoded by 1 and 0, respectively. Introns evolve independently, by a Markov model for a binary character. On branch e, an intron is lost with probability $p_e(1 \to 0)$ and an intron is gained with probability $p_e(0 \to 1)$ at every site. Assuming a continuous-time Markov process,
$$p_e(0 \to 1) = \frac{\lambda}{\lambda+\mu}\left(1 - e^{-(\lambda+\mu)t}\right), \qquad p_e(1 \to 0) = \frac{\mu}{\lambda+\mu}\left(1 - e^{-(\lambda+\mu)t}\right),$$
where $\lambda, \mu > 0$ are branch-specific gain and loss rates, and $t > 0$ is branch length. Introns are observed at the terminal taxa. An all-absent intron site is never observed, and thus the number of potential intron sites must be estimated for correct likelihood optimization. The likelihood can be computed by a dynamic programming algorithm. The algorithm calculates the conditional likelihood $L_u(x)$ for every node u and
state $x \in \{0,1\}$: $L_u(x)$ is the probability of the observed states in descendants of u, conditioned on the state x at u. One can further define the upper conditional likelihood $U_u(x)$ for the observed states outside the subtree of u, which can be computed efficiently by dynamic programming even if the underlying process is irreversible. Felsenstein reviews relevant techniques for the reconstruction of ancestral molecular sequences, which are generally assumed to evolve by a reversible process. Now, the posterior probability of the intron state x at every node u can be computed from $L_u(x)$ and $U_u(x)$.
The posterior probability $q_{uv}(x \to y)$ for a state change $x \to y$ on an edge uv is computed similarly, from the lower and upper conditional likelihoods and the transition probabilities on that edge.
The expected numbers of gains or losses are obtained by summing the probabilities $q_{uv}(0 \to 1)$ and $q_{uv}(1 \to 0)$ over all intron sites, respectively. Nguyen and coauthors instead consider all $2^N$ state labelings of the N internal nodes to compute the expected numbers of gains and losses. A Java package implements the more efficient procedure and is publicly available.
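The procedure described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the code of the Java package: all function names, the dictionary-based tree encoding, and the root prior are hypothetical, and the upper likelihood is taken here to be the joint probability of the state x at u and the observations outside the subtree of u, which is one possible reading of its definition. The transition probabilities assume the standard two-state continuous-time Markov process.

```python
import math

def transition_matrix(gain_rate, loss_rate, t):
    """P[x][y] = p_e(x -> y) for a two-state continuous-time Markov process."""
    total = gain_rate + loss_rate
    decay = 1.0 - math.exp(-total * t)
    p01 = gain_rate / total * decay   # gain: 0 -> 1
    p10 = loss_rate / total * decay   # loss: 1 -> 0
    return [[1.0 - p01, p01], [p10, 1.0 - p10]]

def lower(tree, P, leaf_state, u, L):
    """Post-order pass: L[u][x] = Pr(observed leaves below u | state x at u)."""
    if u not in tree:  # terminal taxon: the state is observed
        L[u] = [1.0 if leaf_state[u] == x else 0.0 for x in (0, 1)]
        return
    L[u] = [1.0, 1.0]
    for v in tree[u]:
        lower(tree, P, leaf_state, v, L)
        for x in (0, 1):
            L[u][x] *= sum(P[v][x][y] * L[v][y] for y in (0, 1))

def posteriors(tree, P, root, root_prior, leaf_state):
    """Return (site likelihood, node posteriors, edge posteriors) for one site.

    tree maps each internal node to its children; P[v] is the transition
    matrix on the edge leading to v. Every edge is visited once, so for a
    tree of bounded degree the work is linear in the number of nodes.
    """
    L = {}
    lower(tree, P, leaf_state, root, L)
    # U[u][x] = Pr(state x at u, observed leaves outside the subtree of u)
    U = {root: list(root_prior)}
    edge = {}
    stack = [root]
    while stack:
        u = stack.pop()
        for v in tree.get(u, []):
            # away[x] = Pr(state x at u, everything outside the subtree of v)
            away = []
            for x in (0, 1):
                a = U[u][x]
                for w in tree[u]:
                    if w != v:
                        a *= sum(P[w][x][z] * L[w][z] for z in (0, 1))
                away.append(a)
            U[v] = [sum(away[x] * P[v][x][y] for x in (0, 1)) for y in (0, 1)]
            edge[v] = [[away[x] * P[v][x][y] * L[v][y] for y in (0, 1)]
                       for x in (0, 1)]
            stack.append(v)
    lik = sum(U[root][x] * L[root][x] for x in (0, 1))
    node_post = {u: [L[u][x] * U[u][x] / lik for x in (0, 1)] for u in L}
    edge_post = {v: [[q / lik for q in row] for row in m] for v, m in edge.items()}
    return lik, node_post, edge_post

def expected_changes(tree, P, root, root_prior, sites):
    """Sum the edge posteriors for 0 -> 1 and 1 -> 0 over all observed sites."""
    gains, losses = {}, {}
    for leaf_state in sites:
        _, _, edge_post = posteriors(tree, P, root, root_prior, leaf_state)
        for v, q in edge_post.items():
            gains[v] = gains.get(v, 0.0) + q[0][1]
            losses[v] = losses.get(v, 0.0) + q[1][0]
    return gains, losses
```

Here `edge_post[v][x][y]` is the posterior probability of the change from x to y on the edge leading to v. Running the same lower pass with every terminal state set to 0 gives the probability of an all-absent site, the quantity needed to account for sites that are never observed.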
Nguyen et al. reiterate well-known concerns of identifiability. Their Proposition 1 echoes the Pulley Principle for ambiguous root placement. Proposition 2 asserts that there are two possible parameter sets $p_e(x \to y)$ for every branch, which can be combined to get exponentially many choices that give the same likelihood function. The continuous-time process implies $p_e(0 \to 1) + p_e(1 \to 0) < 1$. Such a constraint leads to a unique parametrization (except for the root position), and is more natural than the one proposed by Nguyen and coauthors, which is based on the variance of intron gains and losses.
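The constraint on the branch probabilities can be verified directly. The short derivation below is a sketch that assumes the standard two-state transition probabilities, $p_e(0 \to 1) = \frac{\lambda}{\lambda+\mu}(1 - e^{-(\lambda+\mu)t})$ and $p_e(1 \to 0) = \frac{\mu}{\lambda+\mu}(1 - e^{-(\lambda+\mu)t})$, with gain rate $\lambda$, loss rate $\mu$, and branch length $t$.

```latex
\begin{align*}
p_e(0\to1) + p_e(1\to0)
  &= \frac{\lambda+\mu}{\lambda+\mu}\bigl(1 - e^{-(\lambda+\mu)t}\bigr)
   = 1 - e^{-(\lambda+\mu)t} < 1, \\
(\lambda+\mu)\,t &= -\ln\bigl(1 - p_e(0\to1) - p_e(1\to0)\bigr), \\
\lambda t &= \frac{p_e(0\to1)}{p_e(0\to1) + p_e(1\to0)}\,(\lambda+\mu)\,t, \qquad
\mu t = \frac{p_e(1\to0)}{p_e(0\to1) + p_e(1\to0)}\,(\lambda+\mu)\,t.
\end{align*}
```

Under this assumption, every pair of positive branch probabilities whose sum is below 1 corresponds to exactly one pair of scaled rates $(\lambda t, \mu t)$. Only the products $\lambda t$ and $\mu t$ are determined, not the rates and the branch length separately.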
Nguyen and coauthors discuss an important study by Qiu, Schisler, and Stoltzfus. Qiu and coauthors constructed multiple alignments of ten gene families. The families had 68 sequences and 49 intron sites on average. Using a Bayesian framework, Qiu and coauthors estimated two intron evolution paramete