Multiple Sequence Alignments

Protein domains and multiple alignments

Biochemistry 4010/5010

24 January 2013

This assignment will be due 1 week from today (Jan 31st). Again, point-form answers are fine. If we don't get through all of the exercises today, you can continue to work on them from any computer with web access. If you have any trouble, feel free to email me (jshleap at dal.ca), or stop by Mona Campbell Rm 4233.

Goals:

ClustalW multiple sequence alignments online
Protein domains at NCBI
PSI-BLAST

A. ClustalW multiple sequence alignments

This exercise uses ClustalW, a popular multiple sequence alignment program. We'll be using ClustalW via a web interface. Several such servers are available, but they are sometimes overloaded or down for maintenance. For this reason, I'm providing links to two ClustalW servers: the European Bioinformatics Institute (EBI, http://www.ebi.ac.uk/clustalw/) and the Kyoto University Bioinformatics Center (http://clustalw.genome.jp). I recommend using the first one (EBI). However, if you use the second one, PLEASE WRITE THAT DOWN on your lab assignment.

Download the amino acid sequences for accession numbers AAA50993, P02992, and P32481 from the NCBI protein database (http://www.ncbi.nlm.nih.gov). These are the E. coli tufA, yeast tufA, and another yeast sequence from our previous lab.
Connect to one of the ClustalW servers mentioned above.
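
If you would rather script the download than copy each record by hand, the following is a minimal sketch. It assumes NCBI's E-utilities `efetch` endpoint with `db=protein` and `rettype=fasta`, and that all three accessions resolve in the protein database; the function names `build_efetch_url` and `fetch_fasta` are made up for this illustration. Using the website as described above is all the assignment requires.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# NCBI E-utilities efetch endpoint (assumed reachable from your machine).
EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def build_efetch_url(accessions):
    """Return an efetch URL asking for the given protein accessions as FASTA text."""
    params = {
        "db": "protein",
        "id": ",".join(accessions),
        "rettype": "fasta",
        "retmode": "text",
    }
    return EFETCH_URL + "?" + urlencode(params)


def fetch_fasta(accessions):
    """Download the FASTA records. Needs network access, so it is not called below."""
    with urlopen(build_efetch_url(accessions)) as response:
        return response.read().decode("utf-8")


url = build_efetch_url(["AAA50993", "P02992", "P32481"])
print(url)
```

Calling `fetch_fasta(["AAA50993", "P02992", "P32481"])` should return the three records as one block of text that can be pasted into the ClustalW form.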

Which webserver did you use?

Paste all 3 sequences in FASTA format into the sequences area so that they look like:
```
    >ecoli_tu
    MSKEKFERTKPHVNVGT...
    >yeast1
    MSALLPRLLTRTAFKAS...
    >yeast2
    MSDLQDQEPSIIINGNL...
```
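
Before submitting, you can check that your pasted text really is three FASTA records. The sketch below is one simple way to do that; `parse_fasta` is a hypothetical helper written for this handout, and the sequences are only the short prefixes shown above, not the full proteins.

```python
def parse_fasta(text):
    """Parse FASTA text into a dict mapping record name to sequence string."""
    records = {}
    name = None
    for raw_line in text.splitlines():
        line = raw_line.strip()  # tolerate indentation and trailing spaces
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            header = line[1:].split()
            if not header:
                raise ValueError("empty FASTA header line")
            name = header[0]  # first word of the header is used as the name
            records[name] = []
        elif name is None:
            raise ValueError("sequence data found before the first '>' header")
        else:
            records[name].append(line)
    return {n: "".join(parts) for n, parts in records.items()}


# Truncated example sequences (prefixes only, for illustration).
example = """
>ecoli_tu
MSKEKFERTKPHVNVGT
>yeast1
MSALLPRLLTRTAFKAS
>yeast2
MSDLQDQEPSIIINGNL
"""

records = parse_fasta(example)
for name, seq in records.items():
    print(name, len(seq))
```
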
QUESTION 1. What are the default gap opening and gap extension penalties for the multiple sequence alignment (not pairwise)? (Use the Slow/Accurate settings if appropriate.) What is the default scoring matrix used? (Hint: on the EBI server, click on the words above the pull-down menus for an explanation.)

Gap opening:

Gap extension:

Scoring matrix:

Fill in your email address (if asked to do so), then click "Run". In a short while, a results page will appear. All of these sites let you view a plain-text version of your alignment. You may choose to simply look at this to get an idea of the placement of the gaps and their lengths. If you opt to view the alignments this way, copy your alignment to a plain text file in Notepad and save it so you can compare it to other results.

If you chose the EBI website:
- Click on "show colors" for a clearer view of your alignment, with the amino acids colour-coded by property. Scroll back and forth and take note of the placement of the gaps (in general). Leave the results windows open! You'll be comparing this alignment to others.
Now, examine the effects of changing gap opening and extension parameters:
- Go to your favourite ClustalW server again (in a NEW window, if you used the EBI server), and redo the alignment, but this time CHANGE the alignment parameters above to radically increase the Gap Open value (e.g. to 100). Remember to change the gap opening penalty (GOP) in the multiple alignment settings, not the pairwise alignment settings.
- Now redo the alignment again, this time increasing the Gap Extension value to 10 (and putting the Gap Open value back to its default setting).
- Now do BOTH changes at once and redo the alignment.
QUESTION 2. Does the alignment change when you change the Gap Extension and Gap Opening penalties? If so, how and why does it change? Explain in general terms.
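
To experiment with the arithmetic behind these two parameters, here is a toy sketch of an affine gap penalty. It assumes the common textbook form (one opening cost per gap plus an extension cost per gapped position); ClustalW adjusts its penalties further depending on position and sequence, so this is not its exact scoring. All numbers are arbitrary illustrations, not server defaults.

```python
def gap_penalty(length, gap_open, gap_extend):
    """Affine penalty for a single gap (assumed form: open + extend * length)."""
    if length <= 0:
        return 0.0
    return gap_open + gap_extend * length


def total_gap_penalty(gap_lengths, gap_open, gap_extend):
    """Sum the affine penalties of every gap in an alignment."""
    return sum(gap_penalty(n, gap_open, gap_extend) for n in gap_lengths)


# Two hypothetical ways of placing six gapped positions.
scenarios = {"one gap of 6": [6], "three gaps of 2": [2, 2, 2]}

# Arbitrary (gap_open, gap_extend) settings to compare.
settings = {
    "baseline": (5.0, 1.0),
    "high open": (100.0, 1.0),
    "high extend": (5.0, 10.0),
}

for label, (g_open, g_extend) in settings.items():
    for name, gaps in scenarios.items():
        cost = total_gap_penalty(gaps, g_open, g_extend)
        print(f"{label:12s} {name:16s} penalty = {cost}")
```

Try other values and gap layouts, then compare what you see with the alignments the server returned.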

B. Protein motifs/domains at NCBI, SMART and PFAM

Recently the BLAST server at NCBI has added the capacity to identify "conserved domains" using a modification of the PSI-BLAST searching procedure called RPS-BLAST. Here the BLAST algorithm is used with a query sequence to search a database (CDD, the Conserved Domain Database) of position-specific scoring matrices (PSSMs) of well-known protein motifs (also called domains) that tend to occur in many different protein families. There are many such "motif" or "domain" databases. In addition to NCBI's own curated domain alignments (cd, LOAD and COG), the two external ones currently included in CDD are the SMART database and the PFAM database. These databases are all curated collections of aligned protein motifs.
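
To make the idea of searching with a PSSM concrete, here is a toy sketch that slides a three-column matrix along a sequence and reports the best-scoring window. The matrix values, the three-letter alphabet, and the names `toy_pssm`, `score_window` and `best_window` are all invented for illustration; RPS-BLAST itself uses BLAST heuristics and allows gaps, which this ungapped scan does not attempt.

```python
# A made-up PSSM: one dict of residue scores per motif position.
toy_pssm = [
    {"C": 5, "A": -1, "G": -2},
    {"A": 2, "G": 3, "C": -3},
    {"C": 6, "A": -2, "G": -3},
]

# Score given to any residue a column does not list (arbitrary choice).
DEFAULT_SCORE = -4


def score_window(window, pssm):
    """Add up the position-specific score of each residue in the window."""
    return sum(col.get(res, DEFAULT_SCORE) for res, col in zip(window, pssm))


def best_window(sequence, pssm):
    """Return (start, score) of the highest-scoring ungapped window."""
    width = len(pssm)
    if len(sequence) < width:
        raise ValueError("sequence is shorter than the PSSM")
    starts = range(len(sequence) - width + 1)
    best = max(starts, key=lambda i: score_window(sequence[i:i + width], pssm))
    return best, score_window(sequence[best:best + width], pssm)


start, score = best_window("AACGCAG", toy_pssm)
print(start, score)  # → 2 14
```

The same residue can score differently depending on the column it falls in, which is the difference between a PSSM and a single substitution matrix.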

Go to the NCBI website (http://www.ncbi.nlm.nih.gov) and retrieve a "mystery" amino acid sequence with the accession number CAC38754.

Copy this sequence and go to the NCBI webserver and click on BLAST.

Scroll down and look under Specialized BLAST. Click on "find conserved domains in your sequence (cds)" (the 3rd entry in the list).

QUESTION 3. What databases are available to search and how big are they (in number of PSSMs)?
Choose the database with the largest number of motifs, paste your sequence into the query window, and click on "Submit".

You should see a gray line with numbers over it representing your sequence and below it some coloured boxes.

Below the line there are coloured regions with abbreviations in the middle. These are the locations and names of the conserved motifs/domains identified. There are 3 different domains identified in this protein. If you click on the "View Full Result" button you can see the different abbreviations given to the "same" domains by different databases. Below this is a list of the conserved domains hit and their corresponding E-values.
Go back to the top and click on the top red domain box (CCP) and scroll down to the bottom of that window to see the multiple alignment of the conserved domain.

QUESTION 4: In general what do you think the coloring of the alignment corresponds to? (If you have problems with this question, try playing with the "Color Bits" pull-down menu - change the value and hit 'Reformat'.) What do you think the grey numbers in brackets indicate?
Look in the Links box on the left-hand corner, next to "Source".

QUESTION 5. What database does this come from?

If this is NOT the SMART database, then click on the link next to "Related CD", and click on the entry that begins with "smart". (If this instruction doesn't make sense, ask for help; the formatting may differ depending on which domain you chose.) Take note of the name of this domain.

Domain name:
Go to the SMART database website: http://smart.embl-heidelberg.de

Click on the blue box to select Normal SMART mode. At the bottom of the next page (under "Domains detected by SMART"), type the NAME of your domain into the "Search domain and protein annotation" box. Click "Search". If you have trouble here, please ask for help, the SMART website can be a bit tricky to figure out.

QUESTION 6. Give a brief description of the domain in the following categories:

Name:

SMART accession number:

Description:

QUESTION 7: How many of these domains are found in the SMART non-redundant database (abbreviated nrdb)?

How many proteins in the "nrdb" have these domains?

Why are these two numbers (above) different?

QUESTION 8. List two other kinds of information about this protein that can be retrieved from the SMART database (i.e., look on the page and click on links to see what info you can get).
Go back in your browser to the first window from the results of NCBI conserved domains search that showed the conserved domains as coloured rectangles. Click on the button just below the graphical display that says "Search for similar domain architectures".

QUESTION 9. What does this new page show? Do all proteins on this page have the same numbers and locations of the conserved domains? What does this tell you about protein evolution? (NOTE: We want you to think here; there is no right or wrong answer to this question. This last question is worth the most points.)

C. Scouring the database for distant homologues using PSI-BLAST

PSI-BLAST uncovers many protein relationships missed by single-pass database-search methods and has identified relationships that were previously detectable only from information about the three-dimensional structure of the proteins.

Here, you will learn how to operate PSI-BLAST by using a comparison of proteins from thermophilic archaea and bacteria as an example.

Get the uncharacterized protein MJ0414 from Methanococcus jannaschii (accession# Q57857) in FASTA format.
Go to the NCBI web page (http://www.ncbi.nlm.nih.gov), go to BLAST, then click on Protein BLAST and under Algorithm select the PSI-BLAST radio button.
Now paste your protein sequence into the text area. Click on Algorithm Parameters and, near the bottom of the page, change the PSI-BLAST threshold from 0.005 to 0.01. The Expect Threshold should be 10. If you are running Internet Explorer, Safari, or Chrome, this value may be different; if so, please change it back to 10. Also check that under Scoring Parameters the Gap existence and Gap extension penalties are 11 and 1, respectively. Now run the BLAST search.

Examine the results of the program's initial gapped BLAST search.

QUESTION 10. How many significant hits did you get? (significant = E < 0.01)

At this point all of the "checked" sequences will be multiply aligned by PSI-BLAST to build a position-specific scoring matrix (PSSM) and this will be used in the next iteration of searching:
Now scroll back up to the top of the Descriptions section and "Run PSI-Blast iteration 2".
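
The PSSM-building step described above can be pictured with the following toy sketch, which turns a small multiple alignment into per-column log-odds scores. It assumes a made-up four-letter alphabet with a uniform background and a flat pseudocount; PSI-BLAST's real procedure also weights sequences and uses more sophisticated pseudocounts, so treat `build_pssm` as an illustration only.

```python
import math
from collections import Counter


def build_pssm(aligned, background, pseudocount=1.0):
    """Build a log-odds PSSM (list of dicts, one per column) from aligned sequences."""
    if not aligned:
        raise ValueError("need at least one aligned sequence")
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("aligned sequences must all have the same length")
    alphabet = sorted(background)
    pssm = []
    for pos in range(length):
        # Count residues in this column; gaps and unknown symbols are skipped.
        counts = Counter(seq[pos] for seq in aligned if seq[pos] in background)
        total = sum(counts.values()) + pseudocount * len(alphabet)
        column = {}
        for res in alphabet:
            freq = (counts[res] + pseudocount) / total
            # Log-odds: observed frequency relative to background expectation.
            column[res] = math.log2(freq / background[res])
        pssm.append(column)
    return pssm


# Toy data: hypothetical alphabet, uniform background, four aligned "hits".
background = {"A": 0.25, "C": 0.25, "D": 0.25, "E": 0.25}
alignment = ["ACD", "ACE", "ADD", "AC-"]

pssm = build_pssm(alignment, background)
for pos, column in enumerate(pssm):
    print(pos, {res: round(score, 2) for res, score in column.items()})
```

Columns where the aligned hits agree give one residue a strongly positive score, while variable columns give flatter scores.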

QUESTION 11. The sequences that were picked up in this iteration are indicated in yellow shading. By looking at all the hits, what is the most common name for the proteins that were identified as hits?

QUESTION 12. Based on these annotations, can you putatively assign a function to the "unknown protein" you originally used to do the search (describe the FUNCTION, not just the name)?

The following two questions will require you to read the handouts I've given you in class or the lecture 5 material, and/or to read NCBI's PSI-BLAST tutorial.

QUESTION 13. Why do new database hits in the second iteration have E-values << 0.01, yet did not appear at all in the first iteration?

QUESTION 14. If you kept running more PSI-BLAST iterations, they may converge. What does this mean?

Websites needed for this exercise:

ClustalW servers
http://www.ebi.ac.uk/clustalw/
http://clustalw.genome.jp
http://www.ch.embnet.org/software/ClustalW.html

Jalview (alignment viewing/editing)
http://www.jalview.org

NCBI (rpsblast, PSI-BLAST, sequence databases)
http://www.ncbi.nlm.nih.gov

SMART database:
http://smart.embl-heidelberg.de