This assignment will be due 1 week from today (Jan 31st). Again, point-form answers are fine. If we don't get through all of the exercises today, you can continue to work on them from any computer with web access. If you have any trouble, feel free to email me (jshleap at dal.ca), or stop by Mona Campbell Rm 4233.
This exercise uses ClustalW, a popular multiple sequence alignment program, through a web interface. Several such servers are available, but they are sometimes overloaded or down for maintenance. For this reason, I'm providing links to two ClustalW servers: the European Bioinformatics Institute (EBI, http://www.ebi.ac.uk/clustalw/) and the Kyoto University Bioinformatics Center (http://clustalw.genome.jp). I recommend using the first one (EBI). However, if you use the second one, PLEASE WRITE THAT DOWN on your lab assignment.
Download the amino acid sequences for accession numbers AAA50993, P02992, and P32481 from the NCBI protein database (http://www.ncbi.nlm.nih.gov). These are the E. coli tufA, yeast tufA, and the other yeast sequence from our previous lab.
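If you would rather script the download than click through the website, here is a minimal sketch. It assumes NCBI's E-utilities efetch service, which the lab itself does not use, and the helper names (efetch_url, fetch_fasta) are made up for illustration. Only the URLs are built when the code runs; the actual download needs network access and has to be called by hand.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: the public E-utilities efetch endpoint (not part of the lab).
EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"
ACCESSIONS = ["AAA50993", "P02992", "P32481"]  # accession numbers from this lab


def efetch_url(accession: str) -> str:
    """Build a URL that asks for one protein record as plain FASTA text."""
    params = {"db": "protein", "id": accession, "rettype": "fasta", "retmode": "text"}
    return f"{EFETCH}?{urlencode(params)}"


def fetch_fasta(accession: str) -> str:
    """Download one record. Needs network access; nothing below calls it."""
    with urlopen(efetch_url(accession)) as response:
        return response.read().decode("utf-8")


urls = [efetch_url(accession) for accession in ACCESSIONS]
for url in urls:
    print(url)
```

Calling fetch_fasta("AAA50993") should return the same FASTA text you would get from the website, but check it against the record page before relying on it.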
Connect to one of the ClustalW servers mentioned above.
Which webserver did you use?
Paste all 3 sequences in FASTA format into the sequences area so that they look like:
>ecoli_tu
MSKEKFERTKPHVNVGT...
>yeast1
MSALLPRLLTRTAFKAS...
>yeast2
MSDLQDQEPSIIINGNL...
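If you want to check your pasted text before submitting it, the following is a rough sketch of a FASTA reader. The function name parse_fasta is made up for this example, and the sequences in it are short toy fragments, not the full lab sequences. It only checks that every sequence has a unique name on a ">" line of its own.

```python
def parse_fasta(text: str) -> dict[str, str]:
    """Return {name: sequence}; the name is the first word after '>'."""
    records: dict[str, str] = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            parts = line[1:].split()
            if not parts:
                raise ValueError("header line has no name")
            name = parts[0]
            if name in records:
                raise ValueError(f"duplicate name: {name}")
            records[name] = ""
        elif name is None:
            raise ValueError("sequence data before the first '>' header")
        else:
            records[name] += line  # sequences may span several lines
    return records


# Toy fragments for illustration only, NOT the real lab sequences.
example = """>ecoli_tu
MSKEKF
ERTK
>yeast1
MSALLP
>yeast2
MSDLQD
"""

records = parse_fasta(example)
for name, sequence in records.items():
    print(name, len(sequence))
```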
QUESTION 1. What are the default gap opening and gap extension penalties for the multiple sequence alignment (not pairwise)? (Use the Slow/Accurate settings if appropriate.) What is the default scoring matrix used? (Hint: on the EBI server, click on the words above the pull-down menus for an explanation.)
Gap opening:
Gap extension:
Scoring matrix:
Fill in your email address (if asked to do so), then click "Run". In a short while, a results page will appear. All of these sites let you view a plain text output of your alignment. You may choose to simply look at this to get an idea of the placement of gaps and their lengths. If you opt to view the alignments this way, copy your alignment to a plain text file in Notepad and save it so you can compare it to other results.
If you chose the EBI website:
You can just click on "show colors" to see your alignment more clearly, with amino acid property colour coding. Scroll back and forth and take note of the placement of the gaps (in general). Leave the results windows open! You'll be comparing this alignment to others.
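If you saved the plain text alignments, you could also list the gaps with a few lines of code instead of scrolling. This is only a sketch: the function name gap_runs is made up, the two aligned rows are hypothetical, and it assumes you have already joined each sequence's alignment blocks into one row.

```python
import re


def gap_runs(aligned_row: str) -> list[tuple[int, int]]:
    """Return (start_column, length) for each run of '-' (columns count from 1)."""
    return [
        (match.start() + 1, match.end() - match.start())
        for match in re.finditer(r"-+", aligned_row)
    ]


# Hypothetical aligned rows, NOT taken from a real ClustalW result.
alignment = {
    "seq_a": "MSKE--KFERT",
    "seq_b": "MS-ELLKF---",
}

summary = {name: gap_runs(row) for name, row in alignment.items()}
for name, runs in summary.items():
    print(name, runs)
```

Running the same summary on each saved alignment gives you gap positions and lengths that are easy to compare side by side.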
Now, examine the effects of changing gap opening and extension parameters:
Go to your favourite ClustalW server again (in a NEW window, if you used the EBI server) and redo the alignment, but this time CHANGE the alignment parameters to radically increase the Gap Open value (e.g. to 100). Remember to change the gap opening penalty (GOP) in the multiple alignment settings, not the pairwise alignment settings.
Now do it again, but increase the Gap Extension value to 10 (and put the Gap Open back to the default setting), then redo the alignment.
Now do BOTH changes at once and redo the alignment.
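To see how the two settings you just changed enter the arithmetic, here is a small sketch of a generic affine gap penalty. It assumes the simple form "opening charge plus extension charge times gap length"; programs differ in the exact convention, and ClustalW also adjusts its penalties by position, so this is not ClustalW's actual formula. The baseline values are placeholders, NOT the server defaults (you still need to look those up for Question 1).

```python
def affine_gap_cost(length: int, gap_open: float, gap_extend: float) -> float:
    """Cost of one gap: a single opening charge plus a charge per gap column."""
    if length <= 0:
        return 0.0
    return float(gap_open + gap_extend * length)


# (gap_open, gap_extend); the baseline is a placeholder, not a server default.
settings = {
    "placeholder baseline": (5.0, 0.5),
    "high gap open": (100.0, 0.5),
    "high gap extension": (5.0, 10.0),
    "both high": (100.0, 10.0),
}

costs = {}
for label, (gap_open, gap_extend) in settings.items():
    one_long = affine_gap_cost(10, gap_open, gap_extend)       # one gap of 10 columns
    five_short = 5 * affine_gap_cost(2, gap_open, gap_extend)  # five gaps of 2 columns
    costs[label] = (one_long, five_short)
    print(f"{label}: one long gap = {one_long}, five short gaps = {five_short}")
```

Both layouts remove the same 10 columns, so comparing the two numbers under each setting shows which layout that setting makes more expensive.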
QUESTION 2. Does the alignment change when you reset the Gap Extension and Gap Opening penalties? If so, how and why does it change? Explain in general terms.
Recently the BLAST server at NCBI has added the capacity to identify "conserved domains" using a modification of the PSI-BLAST searching procedure called RPS-BLAST. Here the BLAST algorithm is used with a query sequence to search a database (CDD, the Conserved Domain Database) of position-specific scoring matrices (PSSMs) of well-known protein motifs (also called domains) that tend to occur in many different protein families. There are many such "motif" or "domain" databases. In addition to NCBI's own "curated domain alignments" (cd, LOAD and COG), the two currently used in CDD are the SMART database and the PFAM database. These databases are all curated collections of aligned protein motifs.
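To make the idea of a position-specific scoring matrix more concrete, here is a toy sketch that builds one from a tiny ungapped alignment and scores a segment against it. Everything in it is an assumption made for illustration: the 4-letter alphabet, the uniform background frequencies, the flat pseudocount, and the function names build_pssm and score_segment. Real PSSMs in CDD and PSI-BLAST use all 20 amino acids, sequence weighting, and more careful pseudocounts.

```python
import math
from collections import Counter


def build_pssm(alignment, background, pseudocount=1.0):
    """Return one {residue: log2-odds score} dict per alignment column."""
    alphabet = sorted(background)
    pssm = []
    for column in zip(*alignment):  # walk the alignment column by column
        counts = Counter(column)
        total = len(column) + pseudocount * len(alphabet)
        scores = {}
        for residue in alphabet:
            frequency = (counts[residue] + pseudocount) / total
            scores[residue] = math.log2(frequency / background[residue])
        pssm.append(scores)
    return pssm


def score_segment(pssm, segment):
    """Sum the column scores for a segment as long as the matrix."""
    if len(segment) != len(pssm):
        raise ValueError("segment and PSSM must have the same length")
    return sum(column[residue] for column, residue in zip(pssm, segment))


# Toy data: a 4-letter alphabet and a hypothetical ungapped motif alignment.
background = {"A": 0.25, "C": 0.25, "G": 0.25, "K": 0.25}
motif_alignment = ["CAKC", "CGKC", "CAKC", "CAKA"]

pssm = build_pssm(motif_alignment, background)
print(score_segment(pssm, "CAKC"))  # resembles the motif: positive score
print(score_segment(pssm, "GCAG"))  # unlike the motif: negative score
```

Note that the same residue gets a different score at each position, which is what separates a PSSM from a single substitution matrix such as the one you looked up in Question 1.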
Go to the NCBI website (http://www.ncbi.nlm.nih.gov) and retrieve a "mystery" amino acid sequence with the accession number CAC38754.
Copy this sequence, go to the NCBI webserver and click on BLAST.
Scroll down and look under Specialized BLAST. Click on "find conserved domains in your sequence (cds)" (3rd entry in list).
QUESTION 3. What databases are available to search and how big are they (in number of PSSMs)?
Choose the database with the largest number of motifs, paste your sequence into the query window and click on "Submit".
You should see a gray line with numbers over it representing your sequence. Below the line there are coloured boxes with abbreviations in the middle. These are the locations and names of the conserved motifs/domains identified. There are 3 different domains identified in this protein. If you click on the "View Full Result" button you can see the different abbreviations given to the "same" domains by different databases. Below this is a list of the conserved domains hit and the corresponding E-values.
Go back to the top and click on the top red domain box (CCP) and scroll down to the bottom of that window to see the multiple alignment of the conserved domain.
QUESTION 4. In general, what do you think the coloring of the alignment corresponds to? (If you have problems with this question, try playing with the "Color Bits" pull-down menu: change the value and hit 'Reformat'.) What do you think the grey numbers in brackets indicate?
Look in the Links box on the left-hand corner, next to "Source".
QUESTION 5. What database does this come from?
If this is NOT the SMART database, then click on the link next to "Related CD", and click on the entry that begins with "smart". (If this instruction doesn't make sense, ask for help; the formatting may differ depending on which domain you chose.) Take note of the name of this domain.
Domain name:
Go to the SMART database website: http://smart.embl-heidelberg.de
Click on the blue box to select Normal SMART mode. At the bottom of the next page (under "Domains detected by SMART"), type the NAME of your domain into the "Search domain and protein annotation" box. Click "Search". If you have trouble here, please ask for help; the SMART website can be a bit tricky to figure out.
QUESTION 6. Give a brief description of the domain in the following categories:
Name:
SMART accession number:
Description:
QUESTION 7. How many of these domains are found in the SMART non-redundant database (abbreviated nrdb)?
How many proteins in the "nrdb" have these domains?
Why are these two numbers (above) different?
QUESTION 8. List two other kinds of information about this protein that can be retrieved from the SMART database (i.e., look on the page and click on links to see what info you can get).
Go back in your browser to the first window from the results of the NCBI conserved domains search, the one that showed the conserved domains as coloured rectangles. Click on the button just below the graphical display that says "Search for similar domain architectures".
QUESTION 9. What does this new page show? Do all proteins on this page have the same numbers and locations of the conserved domains? What does this tell you about protein evolution? (NOTE: We want you to think here... there is no right or wrong answer to this question. This last question is worth the most points.)
PSI-BLAST uncovers many protein relationships missed by single-pass database-search methods and has identified relationships that were previously detectable only from information about the three-dimensional structure of the proteins.
Here, you will learn how to operate PSI-BLAST by using a comparison of proteins from thermophilic archaea and bacteria as an example.
Get the uncharacterized protein MJ0414 from Methanococcus jannaschii (accession# Q57857) in FASTA format.
Go to the NCBI web page (http://www.ncbi.nlm.nih.gov), go to BLAST, then click on Protein BLAST and under Algorithm select the PSI-BLAST radio button.
Now paste your protein sequence into the text area. Click on Algorithm Parameters and, at the bottom of the page, change the PSI-BLAST threshold from 0.005 to 0.01. The Expect threshold should be 10. If you are running Internet Explorer, Safari, or Chrome, this value may be different; please change it back to 10 if this is the case. Also check that under Scoring Parameters the Gap existence and Gap extension penalties are 11 and 1 respectively. Now run the BLAST search.
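For reference, the same settings can be written down as options for the command-line psiblast program from NCBI's BLAST+ package. This is only a sketch and not needed for the lab: it assumes a local BLAST+ installation, the helper name psiblast_args is made up, and the query file and database names in the example are hypothetical. The code only builds the argument list; it does not run anything.

```python
def psiblast_args(query_path: str, database: str, iterations: int = 2) -> list[str]:
    """Build a psiblast command line that mirrors the web form settings above."""
    return [
        "psiblast",
        "-query", query_path,           # FASTA file with the query protein
        "-db", database,                # protein database to search
        "-evalue", "10",                # Expect threshold
        "-inclusion_ethresh", "0.01",   # PSI-BLAST threshold for building the PSSM
        "-gapopen", "11",               # gap existence penalty
        "-gapextend", "1",              # gap extension penalty
        "-num_iterations", str(iterations),
    ]


# Hypothetical file and database names, for illustration only.
args = psiblast_args("Q57857.fasta", "nr")
print(" ".join(args))
```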
Examine the results of the program's initial gapped BLAST search.
QUESTION 10. How many significant hits did you get? (significant = E < 0.01)
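If you copy the hit table into a script, the significance rule above (E < 0.01) comes down to a one-line filter. The hit names and E-values below are made up for illustration and are not real search results; the function name split_hits is also just for this example.

```python
SIGNIFICANCE_THRESHOLD = 0.01  # same cutoff as in Question 10


def split_hits(hits, threshold=SIGNIFICANCE_THRESHOLD):
    """Separate (name, e_value) pairs into significant and non-significant hits."""
    significant = [hit for hit in hits if hit[1] < threshold]
    others = [hit for hit in hits if hit[1] >= threshold]
    return significant, others


# Made-up hits, NOT real PSI-BLAST output.
hits = [("hit_1", 3e-40), ("hit_2", 0.004), ("hit_3", 0.01), ("hit_4", 2.5)]

significant, others = split_hits(hits)
print(len(significant), "significant;", len(others), "not significant")
```

Note that a hit with E exactly equal to 0.01 does not count as significant under a strict "less than" rule.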
At this point all of the "checked" sequences will be multiply aligned by PSI-BLAST to build a position-specific scoring matrix (PSSM), and this will be used in the next iteration of searching.
Now scroll back up to the top of the Descriptions section and click "Run PSI-Blast iteration 2".
QUESTION 11. The sequences that were picked up in this iteration are indicated in yellow shading. By looking at all the hits, what is the most common name for the proteins that were identified as hits?
QUESTION 12. Based on these annotations, can you putatively assign a function to the "unknown protein" you originally used to do the search (describe the FUNCTION, not just the name)?
The following two questions will require you to read the handouts I've given you in class or the lecture 5 material, and/or read NCBI's PSI-BLAST tutorial.
QUESTION 13. Why do new database hits in the second iteration have E-values of << 0.01, yet did not appear at all in the first iteration?
QUESTION 14. If you kept running more PSI-BLAST iterations, they may converge. What does this mean?
ClustalW servers:
http://www.ebi.ac.uk/clustalw/
http://clustalw.genome.jp
http://www.ch.embnet.org/software/ClustalW.html
Jalview (alignment viewing/editing):
http://www.jalview.org
NCBI (rpsblast, PSI-BLAST, sequence databases):
http://www.ncbi.nlm.nih.gov
SMART database:
http://smart.embl-heidelberg.de