Phylogenetic Analysis

Biochemistry 4010/5010

10 February 2011

This assignment will be due next Thursday, February 17th. As always, point-form answers are fine. If we don't get through all of the exercises today, you can continue to work on them from any computer with web access. If you have any trouble, feel free to email me (daniel.gaston at or stop by (Tupper 8D2).


For this week's exercise, we'll be looking at the phylogeny of the Asiatic and African elephants and the woolly mammoth. The complete mammoth mitochondrial genome sequence was recently published in Nature, and the authors included a phylogeny of this group. They conclude that the mammoth and Asiatic elephant are more closely related to each other than either is to the African elephant. However, their conclusion is based on an assumption that the 3 species have evolved at equal rates since their divergence, and not on an outgroup-rooted phylogeny (their attempts to root the tree with an outgroup were unsuccessful). Even more recently the complete mitochondrial genome sequence of the Mastodon was published and proposed as a better outgroup candidate to the Asian/African Elephants and the Woolly Mammoth than the Dugong and Hyrax are. Dugong and Hyrax are the closest living relatives to the extant Elephants and so have been the best choice to date as an outgroup. Choosing a correct outgroup is very important in molecular phylogenetic analysis.

Since we're not convinced that the mammoth's phylogenetic position is resolved, we'll be analysing two mitochondrial genes from a number of mammals, including the elephants and the mammoth. Here's a list of the species we'll be looking at, along with some taxonomic classification and links to information and pictures. You should look over the taxonomy, but keep in mind that you won't necessarily see these relationships in your phylogenies. You should also follow the links for the species you've never heard of.

Taxonomy of relevant mammals:
Mole (Chrysochloris asiatica)
Dugong (Dugong dugon)
Shrews (Macroscelididae):
Short-eared elephant shrew (Macroscelides proboscideus)
Long-eared elephant shrew (Elephantulus sp. VB001)
Hedgehog (Echinops telfairi)
Hyrax (Procavia capensis)
Mastodons (Mammutidae):
American Mastodon (Mammut americanum)
Elephants (Elephantidae):
Woolly Mammoth (Mammuthus primigenius)
Asiatic elephant (Elephas maximus)
African elephant (Loxodonta africana)
Odd-toed ungulates (Perissodactyla):
Horse (Equus caballus)
White rhinoceros (Ceratotherium simum)
Carnivores (Carnivora):
Dog (Canis familiaris)
Polar Bear (Ursus maritimus)
Cow (Bos taurus)
Human (Homo sapiens)
Gorilla (Gorilla gorilla)
Chimpanzee (Pan troglodytes)


  1. Edit a multiple alignment with BioEdit

  2. Use PAUP parsimony program to make trees

  3. Perform parsimony-based bootstrapping using DNAPARS online

  4. Use PHYLIP distance methods to make trees

  5. Use RaXML on the web to make bootstrapped ML trees

A. Edit a Multiple Alignment with BioEdit

Last tutorial, you used ClustalW to create multiple alignments using a variety of gap extension and gap opening parameters. While some combinations of parameters may have produced better alignments than others, you may have noticed that there are some regions of the alignments you looked at that were always poorly aligned. It's not a good idea to evaluate these "ambiguous" regions when you perform a phylogenetic analysis, because the assumption made by all analysis methods is that characters in a "column" of an alignment are homologous.

We'll begin today by taking a raw alignment of mammalian mitochondrial small subunit ribosomal RNA (12S-rRNA) and removing the ambiguous regions. We will not be using this sequences for the later phylogenetic analysis, although you may chose to do so on your own.

  1. Download the alignment: 12S-rRNA.fasta.

  2. Open BioEdit, an alignment editor for Windows (BioEdit can be downloaded from If you're interested in editing alignments, and you don't use Windows, there are other editors available. MacGDE, from is a good option for Mac users).

  3. Go to the "File" menu, and choose "Open". Change "Files of type" to "All Files (*.*)", then browse to the right directory and open your file.

  4. Jalview is also available for alignment editing and can be easily installed.

  5. Connect to the Jalview website: Click on "Download", then click on "Start with Java Webstart." You should get a pop-up window asking you whether to save or open a file called jalview.jnlp. Open it (using the default program... if that doesn't work, you'll want to use Java Webstart, or javaws.exe). You'll then get an annoying series of pop-ups asking if you want to change file associations; just answer yes.

  6. You should now have Jalview open. It will contain a sample alignment and a tree, which you can either close or ignore. Now, open the alignment you downloadedFile > Input Alignment > from File (you may need to change "Files of Type" to "All Files or Fasta"). Colour-code characters by going to Colour > ClustalX (feel free to play with other colouring schemes!) You can also make the extra information below your alignment go away by turning off View > Show Annotations

  7. Your sequence alignment should now be open! If you scroll back and forth, you should see regions where the sequences aren't very well aligned (either you can tell for sure that they're misaligned, or you just aren't sure they're properly aligned) . Usually these regions are near gaps. To edit the alignment you can click and drag along the numbers above the alignment to highlight colums, delete using the delete key.

    QUESTION 1. How many poorly aligned regions are there? (This is a subjective question!)

  8. You want to remove the ambiguously aligned regions. First, you need to tell BioEdit to allow you to delete regions of the alignment. Go to the "Sequence" menu, then "Edit Mode", then "Edit Residues".

  9. Click and drag over the top of the alignment to select regions you want to delete. Look for well-aligned "anchor" regions on either side of "gappy" regions to help you decide what to select. If you're happy with what you've selected, hit either delete or backspace. Delete all the regions in the alignment that you think are poorly aligned.

    QUESTION 2. How many sites are left in the alignment?

Preparation For Phylogenetic Analysis:

Now that you've edited a multiple alignment, you've seen how difficult and subjective it is! Now that you have an appreciation for editing sequence alignments you will use this edited sequence, and another protein coding sequence (Nad2.phy) to perform several different phylogenetic analyses. You can use your own edited version of the 12S rRNA alignment or download my edited copy (12S-rRNA-clean.phy).

In order to view the phylogenetic trees generated from many programs you will need to install tree- viewing software. There are many examples out there but for the purposes of this tutorial we will use FigTree or the Phylodendron website online. FigTree is cross-platform and can be easily installed on Windows, Mac OSX, or Linux computers and has avariety of options for easily viewing phylogenetic trees. FigTree is downloadable from:


Download and Install FigTree on the computer. Please ask if you are having any trouble with this.

Alternatively you can use the Phylodendron website. Simply paste any newick tree string into the text box (or upload a saved file), set the output options (setting the output to GIF will probably be easiest) and click submit. You'll receive back a nice image of the resulting phylogenetic tree. Many of the tools used here also draw trees using plain text characters as part of their output, which should be sufficient.

B. Parsimony with PHYLIP


Maximum parsimony analysis finds the OPTIMAL tree that explains your dataset. The OPTIMALITY criterion it uses is minimizing the number of substitutions that have occurred in evolution. That is, the most parsimonious tree (optimal parsimony tre e) is the tree that requires the fewest amino acid or nucleotide substitutions between the sequences in the alignment to explain the data. This minimum number of substitutions is also called the tree "score" in "steps".

To find this tree, in principle the programs do "tree-searching" -- that is, they look through ALL possible trees relating the sequences and calculate the minimum number of changes for the given tree. Then they choose the tree that gives the SMALLEST N UMBER (the "shortest tree" or "optimal tree"). This type of strategy is sometimes called an exhaustive search (i.e., all possibilities are examined).

For more than 10-11 taxa there are TOO MANY POSSIBLE TREES to look at all of them (see the table below), so HEURISTIC searching strategies must be used. Although there are some programs that include exhaustive search options for smaller numbers of taxa, the programs we'll be using for parsimony don't: if you're interested in exhaustive searching, you should look into PAUP* or MEGA.

Number of taxa

Number of possible trees





















QUESTION 3: Open the alignment files with Notepad or Wordpad. Looking at only the first line, how many sites and how many taxa (sequences) are there in each alignment? (If you can't guess what the numbers in the first line mean read this description of PHYLIP format) Report the numbers for BOTH Alignments. Make clear which you are indicating.
We'll be building both distance and parsimony trees using programs in Joe Felsenstein's PHYLIP package, a fairly comprehensive phylogenetic analysis package. PHYLIP is available for download from Felsenstein's site: PHYLIP is available for online use at the Pasteur Institute's website ( However, the Pasteur is very popular, so analyses may time out, or the queue may be full. Instead, we'll use the Centre de Bioinformatique de Bordeaux's PHYLIP server, which has fewer programs, but shorter queues.
  1. Go to the Centre de Bioinformatique de Bordeaux's PHYLIP website:
  2. Across from "dnapars", click the "go" link. Fill in your email address (this will be important for all analyses today), and upload your 12S rRNA alignment in the "the name of a file" field. Take a look at the other options.

    QUESTION 4. What is transversion parsimony? (Hint: Look at the DNAPARS manual:

  3. Click "Run dnapars" to start the analysis with the default options.
  4. The analysis should take less than a minute. Once it's done, right-click on "outfile". Save it with the extension .txt, then open it with a word processor or text editor. You'll be comparing trees from different analyses, so you should save all your trees, and make sure you know which is which... be very careful that you don't have line breaks in the middle of your trees, this will make them very hard to look at.

    QUESTION 5. How many equally parsimonious trees did DNAPARS find? What is the relationship between extant elephants and the mammoth? Can you spot a difference between any two trees?

  5. Return to the main PHYLIP page (, and click on "go" across from "protpars" (PHYLIP typically uses different programs for nucleotide and amino acid sequences). The PROTPARS page is virtually identical to the DNAPARS page, except that the "Transversion parsimony" option has been replaced by a pull-down menu specifying genetic code (you don't need to change this to "vertebrate mitochondrial", it's not relevant to our analysis). Fill in your email address and select the nad2-clean.phy.txt file, and click "Run protpars", then wait for your results.
  6. Again, save "outfile" as a text file, and open it with a text editor or word processor, and label this tree "PROTPARS"

    QUESTION 6. Is this tree the same as any of the trees you produced with DNAPARS? Note that neither this nor any of the DNAPARS trees is rooted, which can make them hard to look at. Please ask if you'd like some help! Name at least 4 species that form a monophyletic group in this tree and at least 1 DNAPARS tree.

C. Parsimony-based bootstrapping with PROTPARS

Bootstrapping is used to estimate the certainty of an estimated phylogeny. It involves randomly choosing columns from your sequence alignment until you have a new alignment of exactly the same size. Some of the columns from the original alignment will be represented more than once, while others will be absent. Then a tree is inferred from the new alignment. This whole process is repeated many times (usually at least 100), and all the trees produced are summarised in such a way that the number of trees in which each group of taxa was present is reported.

We'll be using parsimony to bootstrap our NADH dehydrogenase alignment. You can use any phylogenetic analysis method to bootstrap, but since a phylogeny is inferred for each bootstrap "replicate", we'll use the fastest method.
  1. Go back to the PROTPARS server at CBI:
  2. As before, enter your email address, then select the nad2-clean.phy.txt alignment.
  3. Scroll down to Bootstrap Options. Check the box next to "Perform a bootstrap before analysis" and enter an odd number for the "random seed". The default 100 replicates is fine. Check the "Compute a consensus tree" box. Then scroll back to the top and click "Run protpars". This will take awhile! You'll receive a series of emails when your analysis is done. While you wait, go on to distance methods.
  4. When you finally receive your bootstrap results, open the message with the subject "access all protpars results URL (available for 10 days)". Follow the link to your results, then click on "outfile.consense". Copy and paste the tree (at the bottom) into a Word document (or else use a text editor), and label "PROTPARS bootstrap". The numbers along branches are bootstrap support values for groups observed in the consensus tree. These values indicate the number of replicates in which that group was observed.

    QUESTION 7. Is your bootstrap consensus tree identical to the tree you produced from nad2 in section B, above. What do you think this says about the certainty of your phylogeny?

    Generally, a group is considered "well-supported" if its bootstrap support value is greater than 80%, and "unsupported" if its bootstrap support value is less than 50%.

    QUESTION 8. How many groups are well-supported? How many are unsupported?

D. Distance methods with PHYLIP


Distance phylogenetic methods actually consist of two sequential procedures: (1) the calculation of pairwise distances between sequences and arrangement of these distances in a table or "distance matrix", and (2) building a tree that be st reflects these distances between species (a tree that "fits" the distance matrix the best).

  1. Calculation of pairwise distances requires comparing the two aligned sequences and using a method to calculate how many changes have occurred between the two sequences. This value will always be equal to or larger than the observed number of subst itutions because while you can only observe a single difference at a given site (either the two sequences have the same amino acid, or they don't), more than one substitution may have occurred (the so-called "multiple-hits" phenomenon). The method that PR OTDIST uses is called "maximum likelihood distance" (don't confuse this with a full maximum likelihood method, below...) and employs an amino acid substitution matrix such as the Dayhoff PAM or JTT matrix to help estimate the number of changes that have t aken place. Since JTT or PAM matrix give you probability values for substitutions between different amino acids, they can be used to estimate that number of changes that were most "probable" between two sequences by the method of maximum likelihood (see b elow).

  2. Estimating a tree from a distance matrix can be done in a number of ways. Generally there are (a) "clustering" methods that simply produce a good tree from a set of distances (but don't evaluate different topologies) and (b) methods that optimise the fit between a given tree and the distance matrix and therefore examine many trees by a "tree-searching" procedure to find which one has the best fit to the distance matrix. A clustering method is the neighbour-joining algorithm (the NEI GHBOR program is available to do this kind of analysis on this website). An example of the method (b) above is the Fitch-Margoliash "least-squares" procedure. You will use (b) below.

  1. Once again, return to the main PHYLIP page (, but this time click "go" across from "protdist". PROTDIST calculates the maximum likelihood pairwise distances, but doesn't produce a tree! (Re-read the yellow box on distance methods if you don't understand why)
  2. Select the Nad2.phy file, and fill in your email address on the first line.
  3. Scroll down to distance options and the "distance model" drop down menu. This determines the probabilities of different types of substitutions. Choose either "Dayhoff PAM matrix" or "Jones-Taylor-Thornton matrix".

    We won't be using any of the other options. Look through them if you like. The "Gamma distribution of rates among positions" pull-down menu should allow you to account for different rates of evolution at different positions in your alignment (normally you would want to do this), but it doesn't work properly.
  4. Click "Run protdist", then wait for your results to appear (shouldn't be more than a couple minutes). Click on the link to "outfile" (this is your pairwise distance matrix). Distances are in a table, with species names in the first column. It should look something like this (individual lines are longer though, so there will be mid-row line breaks):
          A    0.0    3.0   1.0   2.5
          B    3.0    0.0   2.0   5.5
          C    1.0    2.0   0.0   5.5
          D    2.5    5.5   5.5   0.0
    Note that this matrix is symmetrical across the diagonal: the distance from A to B (3.0) is the same as the distance from B to A.
      QUESTION 9. Report the following distances:
    Asian Elephant and Mammoth:
    African Elephant and Mammoth:
    Human and Mastodon:

  5. If your web browser opened the distance matrix in the same window, click "Back." There should be a pull-down menu from which you can choose from among several programs to "run" on "outfile" (your distance matrix). These are all programs that build trees from pairwise distances. Choose "fitch" and click "Run the selected program on outfile".
  6. Another window will come up with options. Scroll down to "Randomize" options and put in an odd random number ("random number seed") and tell the program to do 5 jumbles.

    The "fitch" program uses the Fitch-Margoliash least-squares method to choose the best tree from among a large number of trees. As we did with parsimony, we will need to use a heuristic tree searching method. Part of this heuristic method involves randomly adding sequences to a growing tree. You have just told the program to do this procedure 5 times ("5 jumbles") in order to reduce the chance that the order of the species in your alignment might influence the tree.

    Scroll up and click on "Run fitch". Once the analysis is finished you will have both a drawing of the best tree (in "outfile") and the tree in the more computer program-readable Newick format (in "outtree"). Save outfile and label it "PROTDIST-FITCH". Examine the tree, and answer the following questions.

      QUESTION 10. What are the relationships between the Elephants and Mammoth? Is there anything different between the placement of the Mammoth sequence compared to it's placement in the PROTDIST Parsimony tree generated earlier?

E. Maximum Likelihood with bootstrapped RAxML

Notes on Maximum Likelihood Methods:

Maximum likelihood methods (and their conceptual sisters, the Bayesian methods) are amongst the most computationally intensive of phylogenetic methods. Theoretically, if the evolutionary model used is realistic this method should give you the best ch ance of estimating the "true" tree.

Briefly, the method chooses (amongst many alternative trees) the tree that maximises the probability of observing the data, given the evolutionary model. That is, if the model is true, out of all possible trees, the maximum likelihood tree is the tree that makes the evolution of your sequence alignment the most probable. These probabilities are expressed usually as log-probabilities (log-likelihoods). You'll learn more about this in class....

As with parsimony and Fitch-Margoliash distance methods, maximum likelihood analysis involves calculating a score (in this case, a likelihood) for many different trees. We already said that, for parsimony, there's no way to evaluate all possible tree s, unless you're looking at very few species. This is even more true for maximum likelihood, because it takes much longer to evaluate each tree (there are more calculations involved). Consequently, there are many clever heuristic methods available. Th e program we'll be using starts with a Neighbor-Joining distance tree, and rearranges species until the likelihood stops improving. This program is unusually fast for a maximum likelihood program, but its heuristic is less likely to find the maximum like lihood tree than those used by some other programs.

We'll be analysing our mitochondrial DNA using maximum likelihood with the tool RAxML. Keep in mind that the method is somewhat different for nucleotide and amino acid data, with respect to choice of model. For amino acid sequences, models are typically derived from log-odds matrices (that is, substitution frequencies among different amino acids are estimated from large databases, not from the multiple alignment you're analyzing. For nucleic acids, models are estimated from the alignment itself , based on certain assumptions (e.g., are all substitutions equally likely, or might probabilities for transitions and transversions differ?)

The program we'll use is called RAxML, we'll be using it online at the CIPRES web server:
  1. Go to the CIPRES web portal:
  2. Click the link for CIPRES Science Gateway on the left hand side of the page, then the "Use CIPRES Science Gateway" button.
  3. You can chose to register (lets you save results online, or log in as a guest by clicking "Proceed without Registering". I would recommend registering but you may need to start your analysis, as registration may take time.
  4. Expanding the Guest Folder on the left you'll see two sub-folders: Data and Tasks. Click the Data folder
  5. Use the "Upload/Enter Data" button to upload both of your edited phylip format alignments (12S-rRNA and Nad2). Be sure to enter the appropriate information in the data upload form such as data type (nucleotide or amino acid), and give them appropriate labels (12S-rtRNA, Nad2).
  6. You can now click the tasks folder and the button "Create New Task" to set up a RAxML analysis.
  7. You can now give your task an appropriate label, set the input data, select the tool, and set parameters for the job. Start by selecting the 12S-rRNA alignment as the input data and for tool select RAxML-HPC-Blackbox.
  8. Look over the parameter values and make sure the data type is set to nucleotide. We will use the default parameter values

  9. QUESTION 11. What is the final optimized maximum likelihood (loglk) value of the final result tree?

  10. Download/view the RAxML_bipartitions file and view in FigTree or using the PhyloDendron website. This file contains the final maximum likelihood consensus tree as well as bootstrap support values. If you view your results on the CIPRES portal you can copy and paste the output tree into Phylodendron for viewing,

Question 12. Does the relationship between Elephants and Mammoths change from the one observed using PROTPARS (with the Nad2.phy alignment)? If so how? If the relationship between Elephants and Mammoth remains the same has the bootstrap support value for the Mammoth and its closest relative changed? If so how?

  1. Start a new task and perform a RAxML analysis using the Nad2.phy protein alignment. Make sure to select "protein sequences" under the new paramaters.
  2. Select Use Empirical Base Frequencies under the new parameters, otherwise leave everything under the default values.
  3. Submit and run the job.

    QUESTION 13. Does the relationship between Elephants and Mammoths change at all between the two RAxML trees? If so how. What is the bootstrap support for Mammoth/African Elephant or Mammoth/Asian Elephant in these two trees (which ever one is observed. Please indicate what sister taxa are observed and the bootstrap support so that it is clear)
    Question 14. Are there any other major differences between the two RAxML trees? If so, what are they? Describe as accurately as possible.

    QUESTION 15. Look back over all your trees, and focus on the position of the unknown sequences with respect to the Elephants and Mammoth. Can you support the claims of the authors of the paper that Mammoths and Asian elephants are more closely related to one another than either is to the African elephant? Keep in mind bootstrap support values and the differences between Parsimony, Distance, and Likelihood methods and explain how you can support your opinion. I am most interested in your reasoning here, refer to your class notes as needed and be specific in your answer. I am looking for good, solid, well reasoned arguments here. Don't be too vague!


Websites needed for this exercise:

PHYLIP servers

PHYML Online


Jalview (alignment viewing/editing)