---
title: "Allele frequency change and selection"
author: "Matheus Januario, Andressa Viol, and Daniel Rabosky"
date: "Jan 2024"
output: rmarkdown::html_vignette
vignette: >
 %\VignetteIndexEntry{deeptime_clocks}
 %\VignetteEngine{knitr::rmarkdown}
 %\VignetteEncoding{UTF-8}
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
options(rmarkdown.html_vignette.check_title = FALSE)
```

# Learning objectives

1. Exploring sequence data
2. A simple measure of genetic distance
3. The Poisson correction
4. Building a molecular clock
5. Molecular clocks and inferences about deep time
6. Jukes-Cantor correction
7. The uncertainty of your molecular clock

# Introduction

```{r}
#loading packages we will need today:
library(ape)
library(evolved)
```

We have encountered many processes that change allele frequency in a population, but today we will study the actual genetic differences among species. We will also learn how to use that genetic information to reconstruct the evolutionary past of lineages with the so-called "molecular clock". To do that, we will use _real data_ on molecules to estimate the divergence (in numbers of substitutions) between homologous proteins in different organisms.

Homologous sequences, together with information from the fossil record, will be used to "calibrate" the rate at which amino acid substitutions occur during protein evolution. More specifically, given an  estimate of the amount of divergence, plus an estimate of the rate at which divergence occurs, we can estimate how much time has passed since divergence.

\fbox{\begin{minipage}{48em}
Question 1:\\

Why is it important to make sure that the compared sequences are homologous?
\end{minipage}}

# Exploring sequence data

We have assembled a set of protein sequences from the cytochrome oxidase 1 gene. This gene, often known as CO1 ("see-oh-one"), is a mitochondrial gene that plays a key role in cellular respiration (e.g., the primary aerobic pathway to energy or ATP generation). CO1 contains approximately 513 amino acids and has been used in many studies to construct phylogenetic trees and estimate divergence times between taxa.

We manually downloaded full-length genes of a number of focal taxa from the public database [GenBank](https://www.ncbi.nlm.nih.gov/genbank/) (more general info  about GenBank can be found in Benson et al., 2012). We then *aligned* the sequences using [CLUSTAL](https://www.ebi.ac.uk/jdispatcher/msa/clustalo) (more info in Sievers et al., 2011), a widely-used sequence alignment tool.

Because this dataset is included in our R package, we can access it by simply typing: 

```{r}
data("cytOxidase")
```

Now we can pull out the sequence of a single organism by typing:

```{r, results = F}
cytOxidase["human"]
```


\fbox{\begin{minipage}{48em}
Question 2:\\

When you run the above in your own console, you will see a sequence of letters. Given what you know so far about CO1, answer in one sentence: What do these letters represent? What is their biological role?
\end{minipage}}

\begin{center}

-- 

\end{center}

Now, let's explore our sequences a little.

How many sequences do we have?

```{r}
length(cytOxidase)
```

What are the organisms in our dataset?

```{r}
names(cytOxidase)
```

How long is each sequence?

```{r}
unlist(lapply(cytOxidase, nchar))
```

You can access any particular sequence by typing:

```{r}
cytOxidase["platypus"]
```

You can compute the number of differences by typing:

```{r}
countSeqDiffs(cytOxidase, "human", "alligator")  
```

Finally, you can get the length of a specific sequence by typing:

```{r}
nchar(cytOxidase["platypus"])
```

***

# A simple measure of genetic distance

As mentioned in class, the simplest genetic distance between two sequences is just the proportion of differences (i.e., the p-distance).

\fbox{\begin{minipage}{48em}
Question 3:\\

(I) Calculate the p-distance between the human and the chimpanzee for the CO1 gene. 

(II) What is the p-distance between the human and the crustacean? 

(III) What is the proportion of sites that \textit{do not} vary for each of the above pairs of taxa?\\

Hint: Consider what exactly you should count, and how you would use that information to calculate the p-distance.
\end{minipage}}

Now, lets focus on these pairs of taxa:

```
human-bird
human-chimp
human-platypus
human-echinoderm
cnidaria-insect
insect-crustacean
annelid-insect
chimpanzee-insect
chimpanzee-fish
platypus-cnidaria
frog-lamprey
snake-shark
insect-mollusk
bird-alligator
```

\fbox{\begin{minipage}{48em}
Question 4:\\ 

Count the differences among all species pairs above. To do this in R, you probably want to organize and store your results with a matrix or a named vector.\\

If you don't want to use R to do this, you can make a spreadsheet in paper, too.
\end{minipage}}

If you are unfamiliar with creating matrices or named vectors in R, below is a tutorial to walk you through how to do this.

\begin{center}

-- 

\end{center}

To make a named vector, you will access the `names` attribute of a vector with the function `names()`. You will assign another vector that contains the names to the `names` attribute of your vector like this:

``` {r}
seq_diff_vec <- c(4, 8, 3, 6, 0) # note that these numbers are completely made up
names(seq_diff_vec) <- c("snake-chimp", "human-echinoderm", "alligator-insect", 
                         "chimp-echinoderm", "human-chimp")

seq_diff_vec
```

Note that you cannot repeat names, each name must be unique. 

As seen above, you now have a vector whose values are labeled with a unique identifier. As you have done many times in these labs, you can also access a particular value within your vector by its unique name using this format:

``` {r}
seq_diff_vec["snake-chimp"]
```

To make a matrix, use the handy function `matrix()`. Say I wanted to build a matrix that counts differences between 3 pairs: snake-human, insect-snake, and insect-human. First, I will build a skeleton matrix to hold this data:

``` {r}
m <- matrix(data = NA,  # placeholder value for each matrix's cell will be NA 
                   # (later will be replaced by real data)
       nrow = 3,  # one row for each taxon group
       ncol = 3,  # one col for each taxon group
       dimnames = list(c("snake", "human", "insect"), # these are the row names
                       c("snake", "human", "insect")) # these are the column names
)
```

Let's take a look at our skeleton matrix.

``` {r}
m
```

Looks good! Now, to fill in the cell values that currently say `NA` with data containing the sequence differences between each pair, I can directly subset the matrix at the row, column, or cell that I want using base subsetting in R. A very important thing to note is that this matrix is symmetric, meaning that both the upper and lower triangles are the same. It is symmetric because there are two cells for each pair of taxa, but only one value of sequence differences for that pair. Thus, you may choose to either leave one half of the matrix blank, or repeat the values.

``` {r}
m[1,] <- c(0, 5, 2) # replacing the entire first row with fake numbers
m[2,] <- c(5, 0, 80) # replacing the entire second row with fake numbers
m[3,] <- c(2, 80, 0) # repalcing the entire third row with fake numbers
m
```

Like this, we can manually alter the matrix. You can also do this with a nested loop, though this is more complicated.

\begin{center}

-- 

\end{center}

\fbox{\begin{minipage}{48em}
Question 5:\\

Now, use the differences you calculated in Q4 to calculate the p-distances among pairs of taxa.\\

We recommend doing this with vectorized calculations in R, but again, you can do this "by hand". Please note that if you choose to do this by hand, we recommend you still assign your results to a vector in R to make the next few exercises easier.
\end{minipage}}


# The Poisson correction

\fbox{\begin{minipage}{48em}
Question 6:\\

Why it is important to apply a correction to the p-distance?
\end{minipage}}

\fbox{\begin{minipage}{48em}
Question 7:\\

Using the vector of p-distances you computed in Q5, compute the corresponding distances under the Poisson correction.
\end{minipage}}

\fbox{\begin{minipage}{48em}
Question 8:\\

Use your answer for Q7 to make a scatterplot comparing the p-distances (calculated in Q5) and the Poisson-corrected distances (calculated in Q7) among pairs of taxa. Do you think the Poisson correction is necessary for these data? Why or why not?
\end{minipage}}

Here are some extra hints for Q8:

Hint (1): Make sure you have entered your data correctly! A good thing to always do is to make sure the lengths of your vectors are the same:

```
length(p_dist) # assuming your p-distance vector is named like this
length(poiss_dist) # same as above for the poisson-corrected distance
```

Hint (2): You can add a 1:1 line (perfect correspondence between axes) by typing in the following *after* the code that creates your scatterplot:

```
abline(a=0, b=1) # note that the arguments are the parameters of the line equation
```

## Some food for thought: 

What is your estimate of the average number of substitutions per site that have occurred since humans and chimpanzees shared a common ancestor? What assumption did you make to get this estimate?

***

# Building a molecular clock

Now, we are going to calibrate a molecular clock and apply it to the distances you have computed above. The _rate_ at which our metaphorical clock "ticks" will be used to compute the expected number of substitutions per million years. Just for reference, here is a phylogeny of all the species in your dataset: 

```{r, echo=FALSE}
animal_phy <- ape::read.tree(text=paste0("(cnidaria:1,((((bryozoa:1,(mollusk:1,annelid:1):1):1,(crustacea:1,insect:1):1):1,(echinoderm:1,(lamprey:1,(shark:1,(fish:1,(frog:1,((platypus:1,(human:1,chimpanzee:1):1):1,(snake:1,(alligator:1,bird:1):1):1):1):1):1):1):1):1):1):1);"))

plot.phylo(animal_phy)

# We can also add node labels to facilitate our discussion:
ape::nodelabels(cex=.65, bg = "lightgreen")
```

Note that we manually set all branches to have equal lengths in the figure above. What we want now is to make the branches proportional to time.

For our calibration, we are going to use a single fossil to calibrate the age of the most-recent ancestor of humans and birds. Fortunately, there is an extremely useful fossil taxon, *Hylonomus* known from Nova Scotia, that is thought to be extremely closely related to the common ancestor of birds and mammals. Researchers have used the age of this fossil (312 Ma) to calibrate molecular clocks.

\fbox{\begin{minipage}{48em}
Question 9:\\

At which node in the phylogenetic tree would \textit{Hylonomus} be, approximately? 
\end{minipage}}

Now, we will use your Poisson-corrected distance between birds and humans and the dating of \textit{Hylonomus} to estimate the rate of protein evolution (in substitutions per site per million years).

\fbox{\begin{minipage}{48em}
Question 10:\\

What is the equation that connects the following 3 quantities: (I) number of substitutions, (II) time, and (III) rate of substitutions?\\

Hint: Let's use an analogy here. If you are driving in a car and have driven a given distance for a given amount of time, what speed have you been driving at?
\end{minipage}}

\fbox{\begin{minipage}{48em}
Question 11:\\

Apply your equation from Q10, and estimate the substitution rate of the bird/human clade.\\

Hint: How much time did humans and \textit{Hylonomus} have to diverge? Is that the same amount of time that humans and birds have had to accumulate genetic differences?
\end{minipage}}

\fbox{\begin{minipage}{48em}
Question 12:\\ 

Are there any other pairs of taxa in the tree whose genetic distance could have been used to calibrate the protein evolution rate by using the age of \textit{Hylonomus} (other than the human-bird distance)?
\end{minipage}}

\fbox{\begin{minipage}{48em}
Question 13:\\

Compare the number of amino acid differences between the following three pairs of species:\\

(I) human-bird\\
(II) platypus-snake\\
(III) alligator-human\\

Do you expect the number of substitutions separating these pairs of taxa to be the same or different? Why?\\

Hint: Draw a phylogeny of just these 5 taxa.
\end{minipage}}

\fbox{\begin{minipage}{48em}
Question 14:\\

Use your formula from Q10 to get to a new formula describing how to get the divergence time between two taxa, given a rate of substitution and an observed number of substitutions.\\

First hint: Remember your rate is in units of expected substitutions per millions of years.

Second hint: Remember that substitutions accumulate on both lineages.
\end{minipage}}

\fbox{\begin{minipage}{48em}
Question 15:\\

Using your formula from Q14, compute the divergence time for each of the taxon splits listed in Q4. Then, on a blank piece of paper or in R, draw a time-calibrated phylogenetic tree for 6 (or more, if you prefer) species, using the divergence times as the dating of each node in your phylogeny. The species can be your choice, but pick at least 3 invertebrate taxa and at least 3 vertebrate taxa. Include this tree as the answer to this question.
\end{minipage}}

\fbox{\begin{minipage}{48em}
Question 16:\\

Do some quick research on the internet (we suggest Wikipedia), and label the following in your tree from Q15 (draw a vertical line):\\

(I) Approximate data of the Cambrian explosion\\
(II) Permo-Triassic extinction\\
(III) K-Pg (a.k.a. the "Cretaceous-Paleogene") extinction\\
(IV) First hominid in fossil record.\\

\end{minipage}}

## Molecular clocks and inferences about deep-time

\fbox{\begin{minipage}{48em}
Question 17:\\

What biological/geological assumptions did you make in order to get the mean rate estimate?

Hint: Think about which factors would make your estimate even less certain, or make your estimate completely wrong.
\end{minipage}}

\fbox{\begin{minipage}{48em}
Challenge (extra credit question) 1:\\

Many papers have used a "molecular clock" to estimate the timing of events that happened long ago during the history of life. For example, some papers have argued that the Cambrian explosion is not a real phenomenon, by purporting to show (using molecular clocks) that most major animal groups did not originate during the Cambrian. Other studies have used molecular clocks to estimate timing of events for which there is no or minimal fossil evidence, such as the common ancestor of all life on Earth, or the origin of eukaryotic cells. Briefly comment on some potential limitations of or caveats to these types of analyses that you can think of.
\end{minipage}}

\fbox{\begin{minipage}{48em}
Challenge (extra credit question) 2:\\

Test your molecular clock: Use your answers from Q15 to test how accurate your molecular clock estimates are when compared to fossil record estimates. To do this, try to replicate the scatterplot (Figure 1B) in Hedges and Kumar (2003), available on Canvas. This paper also contains estimates from the fossil record for important events.
\end{minipage}}

\fbox{\begin{minipage}{48em}
Challenge (extra credit question) 3:\\

Test your molecular clock again: Use taxa "triples" to test your estimates (use one triple per node in your phylogeny). Create a scatterplot with "sister taxon 1 to outgroup" genetic distance on the x-axis, and the "sister taxon 2 to outgroup" genetic distance on the y-axis. Does the observed pattern match what you would expect if the molecular clock is "ticking" at a more-or-less constant rate?
\end{minipage}}

# The Jukes-Cantor correction

\fbox{\begin{minipage}{48em}
Challenge (extra credit question) 4:\\

Usually the Poisson correction works quite well for amino acids, but for nucleotides this is not the case. What explains this?\\

Hint: There are 4 different nucleotides and 20 different amino acids.
\end{minipage}}

\fbox{\begin{minipage}{48em}
Challenge (extra credit question) 5:

Now we will investigate the theoretical shape of each correction for genetic distances. We will look at how the different corrected values compare to each other across a range of values for p-distance. To do this, make a scatterplot comparing the p-distance (on the x-axis) with:\\

(I) its own p-distance (this will be a 1-to-1 line, which will help your comparison)\\
(II) The Poisson-corrected number of changes for a given value of p-distance\\
(III) The Jukes-Cantor-corrected number of changes for a given value of p-distance\\

Hint: Limit your x-axis scale (i.e., the p-distance) to the $[0 - 0.74]$ interval.
\end{minipage}}

### Some food for thought:

What happens with the Jukes-Cantor distance if $p \geq 0.75$?  Why would that happen?

Hint: Think about what "multiple hits" means.

***

# The uncertainty of your molecular clock

Now, we are going to look at how confident we are about our original rate estimate, and what the implications of this is. More specifically, we are going to generate a profile plot of the "log-transformed probability" (in this case, this is the same as the _log likelihood_, a term you might have heard before) as a function of the substitution rate parameter. 

To get the probability of observing a specific number of mutations, you can use the `dpois()` function:

```{r}
# For instance, what is the probability of observing 10 substitutions in a 37-long 
# AA sequence after 10 million years of divergence
# if the substitution rate is equal to 0.1 substitutions
# per site per million years?

rate_A <- 0.1
AA_seq_length <- 37

obs_subs <- 10  # amount of substitutions that have occurred
obs_time <- 10  # amount of time that has passed

prob_given_rate_A <- dpois(
  x = obs_subs, lambda = obs_time * rate_A * AA_seq_length)

prob_given_rate_A # this is the likelihood

# finally, we take the (natural) log, to get
# the log-likelihood of observing the data
# given the model and the rate value:
log(prob_given_rate_A)

# the questions is, then: what is the likelihood of observing >the same< data 
# under different values of substitution rate?
```


\fbox{\begin{minipage}{48em}
Challenge (extra credit question) 6:

With the observed genetic distance that you counted for the human-bird split in Q4, plot the log-transformed probability of observing that number of substitutions, given many different values of substitution rates.\\

What is the rate value that maximizes the likelihood of observing your data?\\

Hint 1: On the x-axis, plot the rate parameter, and on the y-axis, plot the (natural) log-transformed probability of observing the amount of substitutions, given the rate.
\end{minipage}}

## Brief introduction to intervals derived from log-likelihood

When we are estimating only a single parameter, likelihood theory shows that any likelihood value of a parameter $\theta$ given data $x$ (i.e. $L(\theta | x)$) that satisfies the following inequality:

\begin{equation}
L(\theta | x) \geq ( MAX(L(\theta | x)) - 1.92)
\end{equation}

will provide an estimate of the 95% confidence interval for our parameter value.

\fbox{\begin{minipage}{48em}
Challenge (extra credit question) 7:\\

Use the inequality above to calculate the 95\% confidence interval for the oldest and newest split on the phylogeny you drew for Q15. Are you equally confident that those two nodes have the same level of uncertainty?
\end{minipage}}

\

***

# References

Benson, D. A., Cavanaugh, M., Clark, K., Karsch-Mizrachi, I., Lipman, D. J., Ostell, J., & Sayers, E. W. (2012). GenBank. Nucleic acids research, 41(D1), D36-D42.

Hedges, S. B., & Kumar, S. (2003). Genomic clocks and evolutionary timescales. TRENDS in Genetics, 19(4), 200-206.

Sievers, F., Wilm, A., Dineen, D., Gibson, T. J., Karplus, K., Li, W., ... & Higgins, D. G. (2011). Fast, scalable generation of high‐quality protein multiple sequence alignments using Clustal Omega. Molecular systems biology, 7(1), 539.