This assignment will be due next Thursday, February 17th. As always, point-form answers are fine. If we don't get through all of the exercises today, you can continue to work on them from any computer with web access. If you have any trouble, feel free to email me (daniel.gaston at dal.ca) or stop by (Tupper 8D2).
For this week's exercise, we'll be looking at the phylogeny of the
Asiatic and African elephants and the woolly mammoth. The complete
mammoth mitochondrial genome sequence was recently
published in Nature, and the authors included a phylogeny of this
group. They conclude that the mammoth and Asiatic elephant are more
closely related to each other than either is to the African elephant.
However, their conclusion is based on an assumption that the 3
species have evolved at equal rates since their divergence, and not
on an outgroup-rooted phylogeny (their attempts to root the tree with
an outgroup were unsuccessful). Even more recently the complete
mitochondrial genome sequence of the Mastodon was published and
proposed as a better outgroup candidate to the Asian/African
Elephants and the Woolly Mammoth than the Dugong and Hyrax are.
Dugong and Hyrax are the closest living relatives to the extant
Elephants and so have been the best choice to date as an outgroup.
Choosing a correct outgroup is very important in molecular
Since we're not convinced that the mammoth's phylogenetic position is resolved, we'll be analysing two mitochondrial genes from a number of mammals, including the elephants and the mammoth. Here's a list of the species we'll be looking at, along with some taxonomic classification and links to information and pictures. You should look over the taxonomy, but keep in mind that you won't necessarily see these relationships in your phylogenies. You should also follow the links for the species you've never heard of.
Edit a multiple alignment with BioEdit
Use PAUP parsimony program to make trees
Perform parsimony-based bootstrapping using DNAPARS online
Use PHYLIP distance methods to make trees
Use RaXML on the web to make bootstrapped ML trees
Last tutorial, you used ClustalW to create multiple alignments
using a variety of gap extension and gap opening parameters. While
some combinations of parameters may have produced better alignments
than others, you may have noticed that there are some regions of the
alignments you looked at that were always poorly aligned. It's not a
good idea to evaluate these "ambiguous" regions when you
perform a phylogenetic analysis, because the assumption made by all
analysis methods is that characters in a "column" of an
alignment are homologous.
We'll begin today by taking a raw alignment of mammalian mitochondrial small subunit ribosomal RNA (12S-rRNA) and removing the ambiguous regions. We will not be using this sequences for the later phylogenetic analysis, although you may chose to do so on your own.
Download the alignment: 12S-rRNA.fasta.
Open BioEdit, an alignment editor for Windows (BioEdit can be downloaded from http://www.mbio.ncsu.edu/BioEdit/page2.html. If you're interested in editing alignments, and you don't use Windows, there are other editors available. MacGDE, from http://www.msu.edu/~lintone/macgde is a good option for Mac users).
Go to the "File" menu, and choose "Open". Change "Files of type" to "All Files (*.*)", then browse to the right directory and open your file.
Jalview is also available for alignment editing and can be easily installed.
Connect to the Jalview website: http://www.jalview.org. Click on "Download", then click on "Start with Java Webstart." You should get a pop-up window asking you whether to save or open a file called jalview.jnlp. Open it (using the default program... if that doesn't work, you'll want to use Java Webstart, or javaws.exe). You'll then get an annoying series of pop-ups asking if you want to change file associations; just answer yes.
You should now have Jalview open. It will contain a sample alignment and a tree, which you can either close or ignore. Now, open the alignment you downloadedFile > Input Alignment > from File (you may need to change "Files of Type" to "All Files or Fasta"). Colour-code characters by going to Colour > ClustalX (feel free to play with other colouring schemes!) You can also make the extra information below your alignment go away by turning off View > Show Annotations
Your sequence alignment should now be open! If you scroll
back and forth, you should see regions where the sequences aren't
very well aligned (either you can tell for sure that they're
misaligned, or you just aren't sure they're properly aligned) .
Usually these regions are near gaps. To edit the alignment you can
click and drag along the numbers above the alignment to highlight colums,
delete using the delete key.
QUESTION 1. How many poorly aligned regions are there? (This is a subjective question!)
You want to remove the ambiguously aligned regions. First, you need to tell BioEdit to allow you to delete regions of the alignment. Go to the "Sequence" menu, then "Edit Mode", then "Edit Residues".
Click and drag over the top of the alignment to select
regions you want to delete. Look for well-aligned "anchor"
regions on either side of "gappy" regions to help you
decide what to select. If you're happy with what you've selected,
hit either delete or backspace. Delete all the regions in the
alignment that you think are poorly aligned.
QUESTION 2. How many sites are left in the alignment?
Preparation For Phylogenetic Analysis:
Now that you've edited a multiple alignment, you've seen how difficult and subjective it is! Now that you have an appreciation for editing sequence alignments you will use this edited sequence, and another protein coding sequence (Nad2.phy) to perform several different phylogenetic analyses. You can use your own edited version of the 12S rRNA alignment or download my edited copy (12S-rRNA-clean.phy).
In order to view the phylogenetic trees generated from many programs you will need to install tree- viewing software. There are many examples out there but for the purposes of this tutorial we will use FigTree or the Phylodendron website online. FigTree is cross-platform and can be easily installed on Windows, Mac OSX, or Linux computers and has avariety of options for easily viewing phylogenetic trees. FigTree is downloadable from:
Download and Install FigTree on the computer. Please ask if you are having any trouble with this.
Alternatively you can use the Phylodendron website. Simply paste any newick tree string into the text box (or upload a saved file), set the output options (setting the output to GIF will probably be easiest) and click submit. You'll receive back a nice image of the resulting phylogenetic tree. Many of the tools used here also draw trees using plain text characters as part of their output, which should be sufficient.
NOTES ON PARSIMONY ANALYSIS:
QUESTION 3: Open the alignment files with Notepad or Wordpad. Looking at only the first line, how many sites and how many taxa (sequences) are there in each alignment? (If you can't guess what the numbers in the first line mean read this description of PHYLIP format) Report the numbers for BOTH Alignments. Make clear which you are indicating.
We'll be building both distance and parsimony trees using programs in Joe Felsenstein's PHYLIP package, a fairly comprehensive phylogenetic analysis package. PHYLIP is available for download from Felsenstein's site: http://evolution.genetics.washington.edu/phylip.html. PHYLIP is available for online use at the Pasteur Institute's website (http://bioweb.pasteur.fr/seqanal/phylogeny/phylip-uk.html). However, the Pasteur is very popular, so analyses may time out, or the queue may be full. Instead, we'll use the Centre de Bioinformatique de Bordeaux's PHYLIP server, which has fewer programs, but shorter queues.
Go to the Centre de Bioinformatique de Bordeaux's PHYLIP website: http://cbi.labri.fr/outils/Pise/phylip.html.
Across from "dnapars", click the "go" link. Fill in your email address (this will be important for all analyses today), and upload your 12S rRNA alignment in the "the name of a file" field. Take a look at the other options.
QUESTION 4. What is transversion parsimony? (Hint: Look at the DNAPARS manual: http://cbi.labri.fr/outils/Pise/doc/dnapars.html)
Click "Run dnapars" to start the analysis with the default options.
The analysis should take less than a minute. Once it's done, right-click on "outfile". Save it with the extension .txt, then open it with a word processor or text editor. You'll be comparing trees from different analyses, so you should save all your trees, and make sure you know which is which... be very careful that you don't have line breaks in the middle of your trees, this will make them very hard to look at.
QUESTION 5. How many equally parsimonious trees did DNAPARS find? What is the relationship between extant elephants and the mammoth? Can you spot a difference between any two trees?
Return to the main PHYLIP page (http://cbi.labri.fr/outils/Pise/phylip.html), and click on "go" across from "protpars" (PHYLIP typically uses different programs for nucleotide and amino acid sequences). The PROTPARS page is virtually identical to the DNAPARS page, except that the "Transversion parsimony" option has been replaced by a pull-down menu specifying genetic code (you don't need to change this to "vertebrate mitochondrial", it's not relevant to our analysis). Fill in your email address and select the nad2-clean.phy.txt file, and click "Run protpars", then wait for your results.
Again, save "outfile" as a text file, and open it with a text editor or word processor, and label this tree "PROTPARS"
QUESTION 6. Is this tree the same as any of the trees you produced with DNAPARS? Note that neither this nor any of the DNAPARS trees is rooted, which can make them hard to look at. Please ask if you'd like some help! Name at least 4 species that form a monophyletic group in this tree and at least 1 DNAPARS tree.
Bootstrapping is used to estimate the certainty of an estimated phylogeny. It involves randomly choosing columns from your sequence alignment until you have a new alignment of exactly the same size. Some of the columns from the original alignment will be represented more than once, while others will be absent. Then a tree is inferred from the new alignment. This whole process is repeated many times (usually at least 100), and all the trees produced are summarised in such a way that the number of trees in which each group of taxa was present is reported.
We'll be using parsimony to bootstrap our NADH dehydrogenase alignment. You can use any phylogenetic analysis method to bootstrap, but since a phylogeny is inferred for each bootstrap "replicate", we'll use the fastest method.
Go back to the PROTPARS server at CBI: http://cbi.labri.fr/outils/Pise/protpars.html.
As before, enter your email address, then select the nad2-clean.phy.txt alignment.
Scroll down to Bootstrap Options. Check the box next to "Perform a bootstrap before analysis" and enter an odd number for the "random seed". The default 100 replicates is fine. Check the "Compute a consensus tree" box. Then scroll back to the top and click "Run protpars". This will take awhile! You'll receive a series of emails when your analysis is done. While you wait, go on to distance methods.
When you finally receive your bootstrap results, open the message with the subject "access all protpars results URL (available for 10 days)". Follow the link to your results, then click on "outfile.consense". Copy and paste the tree (at the bottom) into a Word document (or else use a text editor), and label "PROTPARS bootstrap". The numbers along branches are bootstrap support values for groups observed in the consensus tree. These values indicate the number of replicates in which that group was observed.
QUESTION 7. Is your bootstrap consensus tree identical to the tree you produced from nad2 in section B, above. What do you think this says about the certainty of your phylogeny?
Generally, a group is considered "well-supported" if its bootstrap support value is greater than 80%, and "unsupported" if its bootstrap support value is less than 50%.
QUESTION 8. How many groups are well-supported? How many are unsupported?
NOTES ON DISTANCE METHODS:
Once again, return to the main PHYLIP page (http://cbi.labri.fr/outils/Pise/phylip.html), but this time click "go" across from "protdist". PROTDIST calculates the maximum likelihood pairwise distances, but doesn't produce a tree! (Re-read the yellow box on distance methods if you don't understand why)
Select the Nad2.phy file, and fill in your email address on the first line.
Scroll down to distance options and the "distance model" drop down menu. This determines the probabilities of different types of substitutions. Choose either "Dayhoff PAM matrix" or "Jones-Taylor-Thornton matrix".
We won't be using any of the other options. Look through them if you like. The "Gamma distribution of rates among positions" pull-down menu should allow you to account for different rates of evolution at different positions in your alignment (normally you would want to do this), but it doesn't work properly.
Click "Run protdist", then wait for your results to appear (shouldn't be more than a couple minutes). Click on the link to "outfile" (this is your pairwise distance matrix). Distances are in a table, with species names in the first column. It should look something like this (individual lines are longer though, so there will be mid-row line breaks):
3 A 0.0 3.0 1.0 2.5 B 3.0 0.0 2.0 5.5 C 1.0 2.0 0.0 5.5 D 2.5 5.5 5.5 0.0
Note that this matrix is symmetrical across the diagonal: the distance from A to B (3.0) is the same as the distance from B to A.
QUESTION 9. Report the following distances:
Asian Elephant and Mammoth:
African Elephant and Mammoth:
Human and Mastodon:
If your web browser opened the distance matrix in the same window, click "Back." There should be a pull-down menu from which you can choose from among several programs to "run" on "outfile" (your distance matrix). These are all programs that build trees from pairwise distances. Choose "fitch" and click "Run the selected program on outfile".
Another window will come up with options. Scroll down to "Randomize" options and put in an odd random number ("random number seed") and tell the program to do 5 jumbles.
The "fitch" program uses the Fitch-Margoliash least-squares method to choose the best tree from among a large number of trees. As we did with parsimony, we will need to use a heuristic tree searching method. Part of this heuristic method involves randomly adding sequences to a growing tree. You have just told the program to do this procedure 5 times ("5 jumbles") in order to reduce the chance that the order of the species in your alignment might influence the tree.
Scroll up and click on "Run fitch". Once the analysis is finished you will have both a drawing of the best tree (in "outfile") and the tree in the more computer program-readable Newick format (in "outtree"). Save outfile and label it "PROTDIST-FITCH". Examine the tree, and answer the following questions.
QUESTION 10. What are the relationships between the Elephants and Mammoth? Is there anything different between the placement of the Mammoth sequence compared to it's placement in the PROTDIST Parsimony tree generated earlier?
Notes on Maximum Likelihood
We'll be analysing our mitochondrial DNA using maximum likelihood with the tool RAxML. Keep in mind that the method is somewhat different for nucleotide and amino acid data, with respect to choice of model. For amino acid sequences, models are typically derived from log-odds matrices (that is, substitution frequencies among different amino acids are estimated from large databases, not from the multiple alignment you're analyzing. For nucleic acids, models are estimated from the alignment itself , based on certain assumptions (e.g., are all substitutions equally likely, or might probabilities for transitions and transversions differ?)
The program we'll use is called RAxML, we'll be using it online at the CIPRES web server: www.phylo.org
Go to the CIPRES web portal:
Click the link for CIPRES Science Gateway on the left hand side of the page, then the "Use CIPRES Science Gateway" button.
You can chose to register (lets you save results online, or log in as a guest by clicking "Proceed without Registering". I would recommend registering but you may need to start your analysis, as registration may take time.
Expanding the Guest Folder on the left you'll see two sub-folders: Data and Tasks. Click the Data folder
Use the "Upload/Enter Data" button to upload both of your edited phylip format alignments (12S-rRNA and Nad2). Be sure to enter the appropriate information in the data upload form such as data type (nucleotide or amino acid), and give them appropriate labels (12S-rtRNA, Nad2).
You can now click the tasks folder and the button "Create New Task" to set up a RAxML analysis.
You can now give your task an appropriate label, set the input data, select the tool, and set parameters for the job. Start by selecting the 12S-rRNA alignment as the input data and for tool select RAxML-HPC-Blackbox.
Look over the parameter values and make sure the data type is set to nucleotide. We will use the default parameter values
QUESTION 11. What is the final optimized maximum likelihood (loglk) value of the final result tree?
Download/view the RAxML_bipartitions file and view in FigTree or using the PhyloDendron website. This file contains the final maximum likelihood consensus tree as well as bootstrap support values. If you view your results on the CIPRES portal you can copy and paste the output tree into Phylodendron for viewing,
Question 12. Does the relationship between Elephants and Mammoths change from the one observed using PROTPARS (with the Nad2.phy alignment)? If so how? If the relationship between Elephants and Mammoth remains the same has the bootstrap support value for the Mammoth and its closest relative changed? If so how?
Start a new task and perform a RAxML analysis using the Nad2.phy protein alignment. Make sure to select "protein sequences" under the new paramaters.
Select Use Empirical Base Frequencies under the new parameters, otherwise leave everything under the default values.
Submit and run the job.
QUESTION 13. Does the relationship between Elephants and Mammoths change at all between the two RAxML trees? If so how. What is the bootstrap support for Mammoth/African Elephant or Mammoth/Asian Elephant in these two trees (which ever one is observed. Please indicate what sister taxa are observed and the bootstrap support so that it is clear)
Question 14. Are there any other major differences between the two RAxML trees? If so, what are they? Describe as accurately as possible.
QUESTION 15. Look back over all your trees, and focus on the position of the unknown sequences with respect to the Elephants and Mammoth. Can you support the claims of the authors of the paper that Mammoths and Asian elephants are more closely related to one another than either is to the African elephant? Keep in mind bootstrap support values and the differences between Parsimony, Distance, and Likelihood methods and explain how you can support your opinion. I am most interested in your reasoning here, refer to your class notes as needed and be specific in your answer. I am looking for good, solid, well reasoned arguments here. Don't be too vague!
Jalview (alignment viewing/editing)