Is there a PSI-BLAST for nucleotide sequences?

I understand that one can translate a nucleotide sequence and run PSI-BLAST on the protein (proteins if you take the 6 reading frames), but I'm looking for distant homology for bacterial small RNAs (typically 50-200 nucleotides long and noncoding).

If there is no such resource, what are the main obstacles to this implementation?


First check if your RNA sequences are described by existing covariance models (CMs) available in Rfam. You can do this using the Infernal package to search the Rfam database of CMs. For those RNA sequences which match an Rfam CM, you can then use that CM to search the sequence databases for additional matches.

For those that do not match an Rfam CM, you will want to build your own models. To do this, you need to identify homologues of each sequence and use them to produce an alignment from which a model can be built. Use a search method that is RNA aware and rigorous, for example one of the programs in the FASTA suite, which has an RNA mode that adjusts the scoring accordingly:

  • Smith and Waterman for local/local alignment (e.g. SSEARCH)
  • Needleman-Wunsch for global/global alignment (e.g. GGSEARCH)
  • Hybrid alignment for global/local alignment (e.g. GLSEARCH)

Your coverage requirements and the nature of the database being searched will determine the most suitable method for the sequence similarity search. Combining the best search method with an appropriate choice of database will improve the sensitivity of your search. For example, the European Nucleotide Archive (ENA) provides a set of non-protein-coding sequences (ftp://ftp.ebi.ac.uk/pub/databases/ena/non-coding/), derived from the annotations in EMBL-Bank, that could be a good starting point.

Given the set of homologous sequences you need to produce a multiple sequence alignment (MSA) to generate a model from. To do this you will want to use an RNA aware MSA tool, for example R-COFFEE or Clustal Omega in order to produce an alignment which attempts to take into account the folding of the RNA molecules.

Given the alignment you can create a CM using Infernal or an HMM using HMMER, and use this to search the sequence database (cmsearch or hmmsearch) to find additional homologues in the database.
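
As a rough sketch of the model-building and search steps described above, the snippet below only assembles the Infernal and HMMER command lines as Python lists; it does not run anything. The file names are hypothetical placeholders, and the alignment is assumed to be in Stockholm format (cmbuild normally also expects a consensus secondary-structure annotation in it). For nucleotide profile HMMs I have assumed nhmmer, the DNA/RNA search program shipped with HMMER 3.1 and later.

```python
import shlex


def infernal_commands(alignment, cm_file, target_db, hits_table):
    """Return the cmbuild / cmcalibrate / cmsearch steps as argument lists."""
    return [
        ["cmbuild", cm_file, alignment],      # build the covariance model
        ["cmcalibrate", cm_file],             # fit E-value parameters (slow)
        ["cmsearch", "--tblout", hits_table, cm_file, target_db],
    ]


def hmmer_commands(alignment, hmm_file, target_db, hits_table):
    """Return the hmmbuild / nhmmer steps as argument lists."""
    return [
        ["hmmbuild", hmm_file, alignment],    # build the profile HMM
        ["nhmmer", "--tblout", hits_table, hmm_file, target_db],
    ]


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    steps = infernal_commands("srna.sto", "srna.cm", "targets.fa", "cm_hits.tbl")
    steps += hmmer_commands("srna.sto", "srna.hmm", "targets.fa", "hmm_hits.tbl")
    for cmd in steps:
        print(shlex.join(cmd))
```

Once the tools are installed, each list could be passed to `subprocess.run(cmd, check=True)`. Newly found homologues can then be added to the alignment and the model rebuilt, which gives a manual version of the iteration that PSI-BLAST automates.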


If you have a non-coding gene sequence (e.g. regulatory sequence) this answer should hold your solution:

Background theory

  • Firstly, you must realize that PSI-BLAST is built for detecting "remote homologues" (i.e. those that have a very "distant evolutionary relationship" to your query) in a database of sequences. It is therefore known as a "sensitive" analysis, which can recruit distantly related matches but has a small chance of recruiting some false matches ("rogue homologues").

  • Secondly, PSI-BLAST is known as a "profile method"; that is, it uses multiple sequences that are cumulatively recruited with each "psi-blast iteration" to build an empirical profile of amino acid residues along the positions of your query. This is in the same family of analyses as "Hidden Markov models" (HMMs), in that HMMs also use multiple sequences to build an empirical profile that is able to recruit distant homologues, except that the HMM "profile" also includes probabilistic pathways to all the recruited sequences.
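
To make the idea of a "profile" concrete, here is a minimal sketch that turns a small RNA alignment into position-specific log-odds scores. It is a toy under stated assumptions: the alignment is made up, the background frequency is a uniform 0.25, and a simple pseudocount of 1 is used. Real PSI-BLAST works on proteins and applies sequence weighting and more careful pseudocounts.

```python
import math

ALPHABET = "ACGU"


def build_pssm(alignment, pseudocount=1.0, background=0.25):
    """Return one {base: log2 odds score} dict per alignment column."""
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    pssm = []
    for col in range(length):
        # Gap characters and ambiguity codes are simply skipped.
        column = [seq[col] for seq in alignment if seq[col] in ALPHABET]
        total = len(column) + pseudocount * len(ALPHABET)
        scores = {}
        for base in ALPHABET:
            freq = (column.count(base) + pseudocount) / total
            scores[base] = math.log2(freq / background)
        pssm.append(scores)
    return pssm


def score(pssm, seq):
    """Sum the column scores for an ungapped sequence of the same length."""
    return sum(col[base] for col, base in zip(pssm, seq))


if __name__ == "__main__":
    # Hypothetical aligned homologues, for illustration only.
    alignment = ["GGAUC", "GGAUC", "GGCUC", "GAAUC"]
    pssm = build_pssm(alignment)
    for candidate in ("GGAUC", "GGCUC", "CCUAG"):
        print(candidate, round(score(pssm, candidate), 2))
```

Positions that are conserved in the alignment contribute large positive scores for the conserved base and negative scores for everything else, which is what lets a profile find relatives that a single query sequence would miss.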

My Answer

I suggest you use a software package called HMMER. This method shares critical theoretical similarity with PSI-BLAST and offers the functionality you need (searching a database for remote nucleotide matches to a nucleotide query). It also does not assume your sequence is protein-coding. Here is the wiki description:

HMMER is a free and commonly used software package for sequence analysis written by Sean Eddy. Its general usage is to identify homologous protein or nucleotide sequences. It does this by comparing a profile-HMM to either a single sequence or a database of sequences.

Other possible answers

If you are afraid of using HMMER, then here is a list of all alignment software tools represented in a table that allows you to focus on only those that use nucleotide sequence as input:

http://en.wikipedia.org/wiki/List_of_sequence_alignment_software


The following assumes you want to use PSI-BLAST to recruit protein-coding nucleotide sequences that are homologous to your nucleotide query.

Here's a work-around using PSI-BLAST itself:

  1. Translate your nucleotide sequence into amino acid sequence
  2. Run psi-blast to recruit matching homologous protein sequences
  3. Store the names or database IDs (e.g. genbank accession numbers) of the best matching proteins
  4. Acquire nucleotide sequences of your matches by searching the IDs against a nucleotide database

Extra details:

  • This type of alignment is called a "codon alignment" (as opposed to DNA alignment or protein alignment)
  • This assumes your DNA codes for a protein whose functionality is constrained by evolution
  • You must remove all introns from your sequence prior to aligning
  • Your first codon must be a start codon (ATG)
  • Your last codon must be a stop codon.
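
Step 1 of the work-around (translating the nucleotide query) can be sketched as follows. This is a minimal illustration that uses the standard genetic code and translates all six reading frames, so it does not depend on knowing the correct frame in advance; the example sequence is made up.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, listed in TCAG order for each codon position.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of an uppercase DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate codon by codon from position 0; unknown codons become 'X'."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translations(dna):
    """Return a dict mapping frame labels (+1..+3, -1..-3) to protein strings."""
    dna = dna.upper().replace("U", "T")
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


if __name__ == "__main__":
    for frame, protein in six_frame_translations("ATGGCCTAA").items():
        print(frame, protein)
```

Each of the six translations could then be used as a PSI-BLAST query; for a genuine coding sequence, usually only one frame gives a long translation without internal stop codons (`*`).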

How BLAST works – Concepts, Types, & Methods Explained

BLAST stands for Basic Local Alignment Search Tool. It is a local alignment tool that compares a query sequence against a target sequence or a database of sequences to find regions of similarity, for example between sequences from different species. In this article, we will explain the different kinds of BLAST tools and how the BLAST algorithm works.

BLAST is a heuristic method: instead of running a full dynamic programming alignment against every database sequence, it uses shortcuts that make it much faster, at the cost of being somewhat less sensitive.

For BLAST(ing) any sequence, there is a query sequence and a target sequence/database. The query sequence is the sequence for which we want to find similar sequences, and the target is the sequence/database against which the query is aligned. BLAST returns the output as a hit table, sorted from the most to the least significant match, that lists the accession number and title of each hit along with its query coverage, sequence identity, score, and e-value in separate columns. The reliability of a match is assessed by its e-value, the number of hits with at least that score expected by chance in a database of that size; smaller values indicate more significant matches.
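
The dependence of the e-value on database size can be illustrated with the relationship E = m × n × 2^(−S'), where S' is the bit score, m the query length and n the total database length. The sketch below is a simplification: it ignores the effective-length corrections that BLAST applies, and the lengths and score are made-up numbers.

```python
def expect_value(bit_score, query_length, database_length):
    """E = m * n * 2**(-S'), ignoring effective-length corrections."""
    return query_length * database_length * 2.0 ** (-bit_score)


if __name__ == "__main__":
    # Hypothetical numbers: the same alignment (bit score 40, query of
    # 100 residues) scored against a large and a small database.
    large = expect_value(40, 100, 1_000_000_000)
    small = expect_value(40, 100, 1_000_000)
    print(f"large database: E = {large:.3g}")
    print(f"small database: E = {small:.3g}")
```

The same alignment with the same score is therefore less significant in a larger database, which is why restricting a search to one organism changes the e-values but not the scores.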

BLAST has different programs to align sequences of nucleotides, proteins, etc. There are many BLAST programs, but the basic kinds are as follows:

Blastn

It is a type of blast where both the query and the target are nucleotide sequences, i.e., it is nucleotide against nucleotide.

Blastp

Blastp is a protein to protein blast where the query sequence is a protein and the target sequence is also a protein.

Blastx

In this type of blast, the query sequence is a nucleotide sequence and the target is a protein sequence/database. First, the nucleotide query is translated in all six reading frames (three on each strand), then the translations are searched against the protein database.

Tblastn

In tblastn, the query is a protein and the target is a nucleotide sequence/database. Here, the protein query is searched against a nucleotide database whose sequences are translated into their corresponding proteins in all six reading frames (three on each strand).

Tblastx

It is a type of blast in which a nucleotide query is searched against a nucleotide database, but at the protein level. In other words, the nucleotide query sequence and the target sequences are both translated into their corresponding protein sequences and then aligned together. Both the query and the target are translated in all 6 reading frames.

Special kinds of BLASTs:

Megablast

It is very similar to blastn but is optimized for aligning long, highly similar sequences. It uses a larger word size than blastn (28 by default, versus 11) and a greedy algorithm for the gapped alignment, and multiple query sequences can be concatenated into one large query so that the database is scanned only once. Because of these features megablast is faster than blastn but less sensitive; it is very useful when a large number of similar sequences are to be aligned in one go.

Discontiguous Megablast

It is a variant of megablast intended for more dissimilar sequences, such as homologous sequences from different species. Instead of requiring a contiguous exact word match, it seeds alignments with a discontiguous word template in which some positions (for example the third codon position) are allowed to mismatch. This makes it more sensitive than megablast for divergent sequences. Like every BLAST program, it still reports the most similar sequences first; it does not return the sequences that are least similar to the query.

PSI Blast

Position-Specific Iterated (PSI) Blast is very sensitive and is usually used for protein similarity searches. The query sequence is first subjected to blastp, and the most significant hits are combined into a multiple sequence alignment (MSA). From this MSA a position-specific scoring matrix (PSSM) is derived that captures the conserved pattern shared by the query and its homologs, and the PSSM is then used in place of the query to search the database again. This cycle of building an MSA, deriving a refined PSSM, and searching the database again is PSI Blast.

PHI Blast

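Pattern Hit Initiated (PHI) Blast is related to PSI Blast. In addition to a protein query, the user supplies a pattern (a short motif) that occurs in the query, and only database sequences that contain the pattern and are similar to the query around it are reported. The results can then be used to start PSI Blast iterations.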

RPS Blast

Reverse Position Specific (RPS) Blast is also related to PSI Blast, but works in the opposite direction: instead of searching a profile against a sequence database, it searches the query sequence (protein, or DNA after translation) against an existing collection of profiles, such as the position-specific scoring matrices built from pre-computed multiple sequence alignments of conserved domains.

How does Blast work?

BLAST is a heuristic algorithm that was developed by Altschul et al. [1]. It is similar to FASTA but faster. Just as FASTA uses a ktup parameter, BLAST uses a word size, with different defaults for proteins and nucleotides. Both assume that good alignments contain short stretches of exact (or, for proteins in BLAST, high-scoring) matches. BLAST is an improvement over FASTA in the sense that it is faster, comes with well-founded statistics for judging significance, and is easy to use. There is a threshold in BLAST known as the 'minimal score', denoted 'S': a match between the query and a database sequence is reported only if its score is equal to or greater than S.
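
The seed-and-extend idea can be sketched in a few lines for nucleotide sequences. This is a toy, not the actual BLAST implementation: it finds exact word matches with a hash index and extends each one without gaps until the running score drops too far below the best score seen. The scoring values (`match`, `mismatch`, `x_drop`) and the sequences are illustrative assumptions.

```python
from collections import defaultdict


def find_seeds(query, subject, word_size=11):
    """Return (query_pos, subject_pos) pairs where a whole word matches exactly."""
    index = defaultdict(list)
    for i in range(len(query) - word_size + 1):
        index[query[i:i + word_size]].append(i)
    seeds = []
    for j in range(len(subject) - word_size + 1):
        for i in index.get(subject[j:j + word_size], []):
            seeds.append((i, j))
    return seeds


def extend_ungapped(query, subject, q_pos, s_pos, word_size,
                    match=1, mismatch=-3, x_drop=5):
    """Extend a seed in both directions; return (score, q_start, q_end)."""

    def walk(q, s, step):
        best = current = 0
        best_len = length = 0
        while 0 <= q < len(query) and 0 <= s < len(subject):
            current += match if query[q] == subject[s] else mismatch
            length += 1
            if current > best:
                best, best_len = current, length
            elif best - current > x_drop:
                break  # score fell too far below the best: stop extending
            q += step
            s += step
        return best, best_len

    right_score, right_len = walk(q_pos + word_size, s_pos + word_size, 1)
    left_score, left_len = walk(q_pos - 1, s_pos - 1, -1)
    score = word_size * match + left_score + right_score
    return score, q_pos - left_len, q_pos + word_size + right_len


if __name__ == "__main__":
    query = "ACGTACGTAGCTAGG"
    subject = "GGG" + query + "CCC"
    minimal_score = 12  # plays the role of the threshold S
    for q_pos, s_pos in find_seeds(query, subject):
        score, q_start, q_end = extend_ungapped(query, subject, q_pos, s_pos, 11)
        if score >= minimal_score:
            print(q_pos, s_pos, score, q_start, q_end)
```

Only sequences that share at least one full word with the query are ever extended, which is where both the speed and the reduced sensitivity of the heuristic come from.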


Go back to the original SWISS-PROT entry at NCBI. Now use the BLink link to retrieve related proteins. Click the Best Hits button and find the related protein from the fish Fundulus heteroclitus. Follow the PubMed link from this record to read about the biology of this protein. What is the physiological role of this CFTR homologue in this animal?

CFTR contains conserved domains that are homologous to bacterial transporters. These bacterial homologues do not appear in the BLink output because only the top 200 proteins are shown. You can use the "Related sequences" link on the CFTR_HUMAN record to find these. Go back to the CFTR_HUMAN record and follow the "Related sequences" link. How many related proteins are there? To identify the ones from bacteria click on the History tab. Follow the instructions on that page for constructing a query combining the protein neighbors with an organism field search for bacteria. Your query will be something similar to the following

Find the genomic scaffold AE003584 from Drosophila melanogaster using Entrez Nucleotide. Display protein links to see the predicted proteins for this scaffold. (You will need to increase the number of records displayed to see all of the proteins on one page. Then use the browser's "Find in page" function to find the protein that you want.) Identify conserved domains present in predicted protein CG10879 (AAF51293) by clicking on the BLink link and then clicking the CDD button. These conserved domains suggest a potential function for this hypothetical protein. Now perform a search against the Prosite patterns using the ScanProsite tool at ExPASy. Did you find the same protein family signature? To verify the Pfam results, try the search against the ProSite profiles. Do your results agree now? This points out the problems with representing a profile as a pattern.

The Entrez nucleotides [Properties] field stores information about the kind of sequence and its source. You can use the index feature on the Preview/Index tab to display the terms that are indexed for this field. The Properties field terms are somewhat cryptic, but they are very useful for searching. Three useful types are the biomol, gbdiv and srcdb sets. The biomol terms classify records based on the type and origin of the molecule, for example biomol mrna or biomol genomic. The gbdiv sets of terms index records by the GenBank division code, gbdiv est, gbdiv pri, gbdiv htg and so on. The srcdb terms classify records based upon their database origin. For nucleotide records these could be GenBank, EMBL, DDBJ, RefSeq or PDB (srcdb genbank, srcdb embl, srcdb ddbj, srcdb refseq). Perform an organism search for mouse, then use Preview/Index tab and the Properties field terms to count the number of mouse genomic records. How many of these are draft sequences (gbdiv htg)? How many are finished records (gbdiv rod)? How many are genome survey sequences? How many of these genomic records are RefSeqs? What kind of RefSeqs are they? Now retrieve all mouse mRNA records. How many of these are in the rodent division? How many are in the EST division? Using these properties field terms, design a query and retrieve all the mouse known mRNA RefSeqs (NM_).

Use Entrez Nucleotide to find the full-length cDNA (mRNA) sequence for Plasmodium falciparum glyceraldehyde 3-phosphate dehydrogenase (GAPD). This time start by typing Plasmodium in the search box without limiting to any field. How many records do you retrieve? Browse through your results to find some records that are not from Plasmodium. Display a few of these to see why you retrieved them; you should find "Plasmodium" somewhere on the record. Now use the Limits tab to restrict to Plasmodium in the Organism field [Organism]. How many nucleotide records in Entrez are from Plasmodium? Now find GAPD records by using the Preview/Index tab to add glyceraldehyde 3-phosphate dehydrogenase as a [Title] Word. How many records did you retrieve?

Search for population and phylogenetic studies on bears in Entrez PopSet. Find the study on brown bears and polar bears and display the alignment. What gene or molecular regions were used in this study? Use the tool bar link to display variations in the alignment. Are there fixed differences in the sequences from the brown bear, Ursus arctos, and the polar bear sequences in the alignment? What if the Ursus arctos sequence from the "ABC" islands (Sequence 7) is removed? Link to the article to read more about these remarkable results.

Substantial data are available for two species of filarial nematodes that are human parasites. Use the Taxonomy Browser to examine the number of nucleotide sequences for the superfamily Filarioidea and determine which two species these are. How many nucleotide and protein sequences are there for each of these two species? Display nucleotide records for each of these. What kinds of sequences are most of these?

There are a number of sequences for extinct organisms in the NCBI databases. Visit the list of extinct taxa in the Taxonomy pages.

Inositol polyphosphate phosphatases contain conserved acidic residues involved in binding metal ions. Retrieve the human INPP1 protein (INPP_HUMAN) from Entrez proteins. Follow the "Domains" link to display pre-computed Conserved Domain Database (CDD) search results. Click on the "Details" button to display the complete results. Follow the link to the pfam inositol_P domain and display the domain in Cn3D by clicking on the "View 3D Structure" button. Identify the conserved residues surrounding the magnesium ions by double clicking on them in the structure. The corresponding residues will be highlighted in the sequence alignment. You can annotate the side chains on these if you like. First change the setting on the CDD page from "Virtual Bonds" to "All Atoms" then display the structure. You can then use the Style->Edit Global Style menu to turn off side chains and the Style->Annotate menu to selectively turn on the side chains for amino acids that coordinate the magnesium ions.

Michael Crichton's fantasy about cloning dinosaurs, Jurassic Park, contains a putative dinosaur DNA sequence. Use nucleotide-nucleotide BLAST against the default nucleotide database, nr, to identify the real source of the following sequence. Select, copy and paste it into the BLAST form window.

This is probably the most common use of nucleotide-nucleotide BLAST: sequence identification, establishing whether an exact match for a sequence is already present in the database.

NCBI scientist Mark Boguski noticed this obvious "contaminant" and supplied Crichton with a better sequence, shown below, for the sequel, The Lost World. Identify the most likely source of this sequence using nucleotide-nucleotide BLAST. Mark imbedded his name in the sequence he provided. To see Mark's name use the translating BLAST (blastx) page with the sequence below. (Look for MARK WAS HERE NIH).

The proper use of the translating BLAST services is to look for similar proteins (identify potential homologues) in other species.

Higher eukaryotic genomes contain large amounts of repetitive DNA. The most abundant interspersed repeat in the human genome is the Alu element. Alus tend to occur near genes, within the introns of genes, or in the regions between genes. In some cases, their presence and absence can fairly accurately show the intron-exon structure of a gene. Demonstrate this by performing a nucleotide-nucleotide BLAST search against the Alu database with the genomic sequence of the human Von Hippel Lindau syndrome gene (Accession AF010238). Note that the exons appear in the BLAST graphic as places where the Alu elements do not align.

The Caenorhabditis elegans gene SMA-4 is a member of the dwarfins gene family, also called the MAD family, which plays a role in transforming growth factor beta-mediated signal transduction. In this example we will attempt to find homologs for the SMA-4 protein (SMA4_CAEEL, Accession P45897) in vertebrate species using protein-protein BLAST.

Of course, this protein already is in the Entrez Protein and BLAST databases. Remember that if the goal is to find a homolog in another species for a protein that is already present in the Entrez system, it is not necessary to perform a BLAST search; the precalculated similarities are already available through BLink. Verify this by following the BLink link from P45897 in Entrez Protein. Click the best hits button and find the best protein hit to chicken (Gallus gallus). The alignment between SMA-4 and the best chicken match is available by clicking on the linked BLAST score.

To simulate performing a BLAST search with a novel protein, we will use an Entrez query to remove all Caenorhabditis proteins from the BLAST database.

Link to the protein-protein blast page and enter the SMA-4 accession number (P45897) in the Search text area. We will search against the default, nr, database. In order to remove the Caenorhabditis proteins from the nr database, enter the following Entrez search in the "Limit by Entrez query" box under the "Options" section of the form: Because there are a large number of related proteins in the BLAST database, we also need to increase the number of descriptions or BLAST hits that will be shown. Do this by increasing the number of descriptions to 500 in the "Format" section of the BLAST form. Run the search by clicking the BLAST button.

On the formatting page, you can see that the CD-search has identified conserved domains in this protein. You can click on the graphic to see what these domains are and what their function is.

Click the format button to retrieve your BLAST results. Look at your BLAST graphical output and verify that the Entrez query eliminated the protein from the database; you should see no full-length matches. Now look at your descriptions and their e-values. Among the non-significant e-values (> 1) there are two proteins from sheep (Ovis aries) labeled as MAD proteins (Smad4 and Smad7). These protein fragments are homologs of SMA-4, but we did not demonstrate that with this particular search. In the following exercise we will show using PSI-BLAST that these sheep proteins are significant matches to SMA-4. Be sure to retain your formatting page for these results or copy your request ID so you can format them for PSI-BLAST for the next exercise.

Look at the BLAST output and find all chicken (Gallus gallus) proteins that are similar to SMA-4. (Use the Tax Blast link at the upper left of the graphic to help in finding the chicken proteins.) These should be the same proteins found by BLink previously.

Open a new browser window so you don't lose your results against the nr and run the same search again. Restrict the search to chicken proteins using the Entrez query option as you did before. This time use the query Are the same proteins found? Compare the expectation values of these hits to the same hits found against nr with no organism restriction. Why are the e-values different for the same scores and alignments?

The Sma-4 protein that we used previously belongs to a large family of proteins. (Jump back to the protein-protein search). Some members of this family are not readily identified in an ordinary blastp search; however, additional Sma-4 homologs can be found by using the more sensitive position-specific iterated BLAST (PSI-BLAST). Any protein-protein BLAST search on the NCBI web pages can be extended to a PSI-BLAST search simply by re-formatting the results. Check the "Format for PSI-BLAST" box on the formatting page for the first search that you saved from the exercise above and click format.

The results are the same except that they are formatted differently. There is a line across the descriptions section of the results corresponding to the PSI-BLAST inclusion threshold of 0.005. Position-specific information from a multiple sequence alignment of the sequences above this line is used to generate a position-specific score matrix (PSSM) in the next iteration. Notice that one of the first proteins below this line is the Smad4 from the sheep (Ovis aries). What is the e-value of this hit?

Now click the "Run PSI-BLAST iteration 2" button. Note that the Formatting page is refreshed in its separate window, generating a new Request ID number. Click the "Format" button and the results of iteration 2 will load. Click on the "Skip to the first new sequence" link on the Iteration 2 results page. What is this sequence? What is its new expect value? Notice that there are now several new sequences above threshold. Some of them are not annotated as Sma/Mad homologs but are clearly significant hits. These new sequences will be used to construct a new PSSM for iteration 3 and so on. After a few more iterations no more sequences will be found; at this point the search is said to have converged.
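
The iterate-until-convergence logic can be illustrated with a deliberately simplified toy. The assumptions are strong: sequences are short, ungapped, equal-length RNA strings, and a raw profile score threshold stands in for the e-value inclusion threshold. PSI-BLAST itself uses gapped local alignments of proteins; the sequences and the threshold below are made up.

```python
import math

ALPHABET = "ACGU"


def build_profile(members, pseudocount=1.0):
    """Per-column log2 odds scores from equal-length, ungapped sequences."""
    profile = []
    for column in zip(*members):
        total = len(column) + pseudocount * len(ALPHABET)
        profile.append({
            base: math.log2(((column.count(base) + pseudocount) / total) / 0.25)
            for base in ALPHABET
        })
    return profile


def profile_score(profile, seq):
    """Sum of the column scores for one candidate sequence."""
    return sum(col[base] for col, base in zip(profile, seq))


def iterate_search(query, database, threshold, max_iterations=10):
    """Rebuild the profile from all included hits until nothing new is found."""
    included = {query}
    for iteration in range(1, max_iterations + 1):
        profile = build_profile(sorted(included))
        hits = {seq for seq in database
                if profile_score(profile, seq) >= threshold}
        if hits <= included:
            return included, iteration  # converged: no new sequences
        included |= hits
    return included, max_iterations


if __name__ == "__main__":
    database = ["GGAUU", "GGCUU", "ACGAG"]  # hypothetical sequences
    members, iterations = iterate_search("GGAUC", database, threshold=2.0)
    print(sorted(members), "after", iterations, "iterations")
```

In this toy, one database sequence scores below the threshold against the query alone and is only recruited after a closer relative has been added to the profile, which mirrors the way a hit can move from below to above the inclusion line between iterations.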

The prion protein is found in high concentrations in the brains of humans and other mammals. In certain degenerative neurological diseases, prion proteins aggregate into polymers. Several of these prion diseases seem to be transmissible. Perhaps the most remarkable aspect of these is that the infectious agent appears to be an aberrant form of the prion protein itself. Bovine spongiform encephalopathy (BSE) is one of the transmissible prion diseases that has received much recent notoriety. There are a number of polymorphisms that have been identified in the prion proteins for several mammals, notably human, mouse, and sheep. Some of these are associated with inherited prion diseases and some with susceptibility to transmissible forms. Retrieve the SWISS-PROT record for the human prion protein (PRIO_HUMAN) and look at the FEATURE table to see the various polymorphisms. Use this protein to perform a translated blast search (PROTEIN query - TRANSLATED database) against human ESTs and look at your results to see if any of these polymorphisms are present in the EST data. This is easier to see if you change the formatting options on the BLAST form to display one of the query-anchored alignment options. Try the "flat query-anchored with identities". (See the problem on prion SNPs in the Genomes section.)

The human fragile histidine triad protein (FHIT, Accession P49789) is structurally related to galactose-1-phosphate uridylyltransferases. However, this relationship is not apparent in an ordinary BLAST search. Perform a protein-protein BLAST search against the SWISS-PROT database with P49789 and search your results for galactose-1-phosphate uridylyltransferases. Now use PSI-BLAST to verify the relationship between these two protein families.

A frequent use of nucleotide-nucleotide BLAST is to check oligonucleotides for hybridization or PCR. The goal most people have when doing this is to make sure that the primer will give a unique product from the target genome or cDNA population. Because BLAST is local and searches both strands, one can simply concatenate a pair of +/- strand primers and use them in a single search. Combine the following pair of candidate PCR primers in a nucleotide-nucleotide search against the default nucleotide database and identify the gene amplified.

Now try these modified primers. There is one mismatch in each near the middle.

Notice that the previous hits are completely missing. Now adjust the Word Size from 11 to 7 under the BLAST Advanced Options and try the search again. Do you find the original hits again? Are they still among the best hits? Can you devise a modification in the search strategy that will make them the best hits again?
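
Why a single central mismatch makes the hits vanish at word size 11 but not at word size 7 can be checked with a small sketch. The primer below is a made-up 20-mer, not one of the primers from the exercise; the point is only that BLAST needs at least one exact match as long as the word size to start an alignment.

```python
def longest_exact_run(primer, target_site):
    """Length of the longest stretch of consecutive identical positions."""
    best = run = 0
    for a, b in zip(primer, target_site):
        run = run + 1 if a == b else 0
        best = max(best, run)
    return best


def can_seed(primer, target_site, word_size):
    """True if an exact word match of the given size exists."""
    return longest_exact_run(primer, target_site) >= word_size


if __name__ == "__main__":
    site = "ACGTTGCAAGCTTGACCTGA"              # hypothetical 20-base target site
    mutated = site[:10] + "A" + site[11:]      # one mismatch near the middle
    for word_size in (11, 7):
        print(word_size, can_seed(mutated, site, word_size))
```

With the mismatch in the middle of a 20-mer, the longest exact stretches are 10 and 9 bases, so no 11-base word can seed the alignment while a 7-base word can.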

As the database grows, so does the number of chance occurrences of amino acid motifs that spell out words or people's names in single-letter amino acid codes. One such name motif is ELVIS. Find the number of occurrences of ELVIS in the protein nr. To get any hits at all, you will have to adjust several of the advanced BLAST parameters including the Expect value, Word size, and Score Matrix. Adjust some of these in the "Other advanced options" box. Options are entered in a command-line style. For example, typing

sets the Expect value cut-off to 10000. Visit the BLAST "Frequently Asked Questions" by following the link on the left side bar of the BLAST page for more information. See especially the entry on "How do I perform a similarity search with a short peptide/nucleotide sequence?". We now have a page with presets optimized to find short nearly exact matches. You can run the search on this page to see the correct parameters to use.

UniGene is the best NCBI resource to use to find out to what gene (or suspected gene) a particular database sequence belongs. This is especially true for ESTs where there may be no annotations on the sequence, but may also be important for other sequences where the annotation may be incomplete or obsolete. Database identifiers for UniGene searches may come from BLAST output or from microarray (hybridization) data. For example, mRNA that hybridized to the EST sequence with accession number BG618105 was highly expressed in a human liver tumor sample.

Retrieve the record from the nucleotide database using the accession number in the search box on the NCBI homepage. Display the record. Is there any annotation indicating what gene this is?

Now link to UniGene from the "Links" menu in the upper right. What is the name of this gene? Link to LocusLink from the UniGene cluster. What is the function of this protein?

Go back to UniGene. Look at the ESTs in this cluster. How many are there? Identify a pair of ESTs (a 5' and 3' read) that come from the same clone ID. You'll need to display all ESTs and scroll down to see these. Use BLAST 2 sequences to align these to the full length RefSeq mRNA from the LocusLink entry. Note the mismatches that are most likely due to sequencing errors in the ESTs.

Expression information is implied by the sources of the cDNA libraries in a particular cluster. NCBI also has linked tag counts from quantitative SAGE libraries to the UniGene clusters. Follow the "Gene to Tag" mapping link to see a "virtual Northern" display of the counts of reliable tags from this cluster in SAGE libraries. What library shows the highest relative expression of this gene?

On the LocusLink page use the main map viewer link (mv) in the "Map Information" section to display this gene in the MapViewer. What chromosomal region is this? What maps are displayed? You can click on the map name at the top to learn more about the information displayed for each map. Uncheck the "Compress Maps" option on the left-hand-side to see the full marker labels. The UniGene map shows the density of EST hits on the genome. Generally the peaks in this histogram highlight the exons of expressed genes. Notice that there are some hits that don't correspond to the exons shown in the gene model on the Genes map. What could these represent? To see another view of the alignment based gene model follow the "ev" link to display this in the evidence viewer.

Use the zoom graphic on the left hand side of the map viewer to zoom out and display two other members of this small gene family, AFP and AFM. Are these in the same orientation? There is also a fourth member of this small family, called GC, somewhat removed from these but also on chromosome 4. Display the entire region between GC and AFM by typing these symbols in the "Region Shown" boxes on the left-hand-side and pressing the "Go" button.

From the LocusLink entry, click on the mouse gene symbol entry under mapping information to display the corresponding mouse LocusLink record. Follow the mouse map viewer link to display the corresponding region in the mouse map viewer. What chromosome is this? Adjust the view to see if the same gene family is present with the same structure in the mouse genome. Link to the contig record for this region of mouse genome from the map viewer. How large is this contig? Examine the bottom of the record and notice that it is assembled from both BAC clone (draft and finished) and whole genome shotgun sequence. Retrieve one of the whole genome shotgun pieces (e.g. CAAA01153721). Link from this record to the master record for the mouse whole genome shotgun project (CAAA01000000). How many records are in this set?

The gene causing the juvenile form of nephronophthisis was recently identified on human chromosome 1. We will use related protein and nucleotide records to identify this gene in other species. Retrieve the human NPHP4 entry from LocusLink. This protein apparently has a homolog in C. elegans. Demonstrate this by following the BLink link (BL) next to the provisional RefSeq protein in this entry. Clicking the "Best hits" button will make it easier to identify. Notice that there is also a homolog in mouse. Retrieve the mouse protein by linking through the Accession number. Display the linked nucleotide sequence. Use this Accession number (AY118229) in rat genome BLAST to find the gene in the rat genome. Search against the genome assembly. What supercontig did you hit? On what rat chromosome is this gene? Display your results in the Map Viewer by clicking on the Genome View button that appears on the BLAST results page and link to the contig map element. Use the "Maps and Options" link and add the "Genes" map to the display. Is this gene annotated on the rat Map Viewer?

Your BLAST hits imply an exon-intron structure for this gene. How many exons do your BLAST hits imply? How large is this gene? You can make a more precise alignment-based model for this gene using the Spidey tool. To do this you will need to adjust the base pair range displayed on the Map Viewer to the smallest interval that contains all of the BLAST hits. Then get this sequence using the "Download/View Sequence/Evidence" link. Display the genomic region in the browser and save it to disk. Use this genomic sequence on the Spidey page. Use the mouse cDNA (AY118229) you used before for the mRNA sequence.

The following amplified DNA sequence is associated with a human disease gene polymorphism: Use this sequence in the human genome BLAST service to identify this gene. Follow the linked identifier on the BLAST results to display your results in the Map Viewer. On what chromosome is this gene? What gene is it? Examine the BLAST alignment to identify the position and nature of the polymorphism. In what exon is this?

We can now see if this polymorphism has been mapped to the genome from the SNP database. Use the "Maps and Options" link to add the Variation map to the display. To zoom in to the region, place the mouse pointer over the map and click to display the pop-up zoom menu. Choose an appropriate level to see the polymorphisms in the region of interest. Find the coding region SNP that maps to the same place as your polymorphism identified by BLAST. Link from the Map Viewer to the RefSNP record. Does this SNP imply a change in the amino acid sequence? What is it? (You will notice that there are multiple splice variants for this gene, but the amino acid change is consistent in all of those that contain this coding exon.)

This is a well known polymorphism in the HFE gene that causes hemochromatosis when homozygous. From the RefSNP record you can link to OMIM to learn more about this. You can also follow the links to 3D structure mappings to display the position of this polymorphism in the structure (1A6Z) of the HFE protein. Based on this, why does this amino acid change have a detrimental effect on the function of this protein?

Use LocusLink to find the entry for the human glyceraldehyde 3-phosphate dehydrogenase gene. Click on the Map Viewer link ( mv ) to find the map location and the contig containing the GAPD gene. Zoom in to see the exon-intron structure of the gene. How many exons are there? Now use human genome BLAST to verify the location and structure of this gene. Use the GAPD RefSeq (NM_002046) to perform this search. Set both the alignments and descriptions to 250. How many contigs do you hit in the human genome? Click on the Genome View button to see the distribution of these hits on the genome. Look at some of the high-scoring single hits to see what is unusual about them. How can you account for these results?


TOOL SECTIONS

The search section contains popular search tools, such as NucleotideBLAST, ProteinBLAST ( 11 ), PSI-BLAST ( 12 ), and HMMER ( 13 ), as well as our in-house developments such as HHpred, HHsenser and PatternSearch. In comparison with the NCBI server, our BLAST tools offer greater flexibility and functionality: searches can be run against uploaded personal databases or selectable sets of genomes (updated weekly from NCBI and ENSEMBL), databases can be switched between PSI-BLAST runs, alignments can be extracted, viewed online or forwarded to other tools, and two graphs show matched regions and E-value distributions. The fastHMMER tool performs HMMER searches of all standard sequence databases in ∼10% of the time by reducing the database with one iteration of PSI-BLAST at a cut-off E-value of 10 000. PatternSearch identifies sequences containing a user-defined Prosite pattern or regular expression. HHpred is a new server for protein structure and function prediction ( 5 ). It takes a query sequence as input and searches user-selected databases for homologs with a new and very sensitive method based on pairwise comparison of hidden Markov models (HMMs). Available databases, among others, are InterPro, CDD and an alignment database that we build from Protein Data Bank (PDB) sequences and that can be used for 3D structure prediction. HHsenser is a transitive search method based on HMM-HMM comparison ( 7 ). This method takes a sequence as input and builds an alignment with as many near or remote homologs as possible, often covering the whole protein superfamily.
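To illustrate the kind of matching PatternSearch performs, here is a minimal Python sketch that translates a PROSITE-style pattern into a regular expression. It is not the server's own code: the function name `prosite_to_regex` is hypothetical, and only the most common PROSITE syntax elements (`x`, `[..]`, `{..}`, repeat counts, and the `<` / `>` anchors) are handled.

```python
import re

def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE-style pattern into a Python regular expression.

    Hypothetical helper covering a subset of the PROSITE syntax only.
    """
    parts = []
    for element in pattern.rstrip(".").split("-"):
        anchor_start = element.startswith("<")   # '<' pins the N-terminus
        anchor_end = element.endswith(">")       # '>' pins the C-terminus
        element = element.strip("<>")
        # Split off an optional repeat count such as (3) or (2,4).
        m = re.fullmatch(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element)
        core, low, high = m.group(1), m.group(2), m.group(3)
        if core == "x":                          # any residue
            regex = "."
        elif core.startswith("{"):               # any residue except these
            regex = "[^" + core[1:-1] + "]"
        else:                                    # one residue or a [..] set
            regex = core
        if low:
            regex += "{" + low + ("," + high if high else "") + "}"
        parts.append(("^" if anchor_start else "") + regex + ("$" if anchor_end else ""))
    return "".join(parts)

# Example: the N-glycosylation site pattern N-{P}-[ST]-{P}
site = re.compile(prosite_to_regex("N-{P}-[ST]-{P}"))
print(bool(site.search("AANGSAA")))  # a sequence containing N, non-P, S, non-P
```

The resulting expression can be applied to each database sequence with `re.search` or `re.finditer`.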

The alignment section includes the well-known, popular multiple alignment program ClustalW ( 14 ), together with the more recently developed multiple alignment methods ProbCons ( 15 ), MUSCLE ( 16 ) and MAFFT ( 17 ). Also in this section is Blammer ( 10 ), which converts BLAST or PSI-BLAST output to a multiple alignment by realigning gapped regions using ClustalW and removing local inconsistencies through comparison with an HMM. HHalign aligns two alignments with each other by pairwise comparison of HMMs and displays similarities in a profile–profile dotplot.

In the sequence analysis section, we have grouped tools for repeat identification and analysis of periodic regions in proteins. HHrep is a server for de novo repeat detection that is very sensitive in finding proteins with strongly diverged repeats, such as TIM barrels and β-propellers ( 6 ). REPPER ( 8 ) analyzes regions with short gapless repeats in protein sequences. It finds periodicities by Fourier transform and internal sequence similarity. The output is complemented by coiled-coil prediction and secondary structure prediction using PSIPRED ( 18 ). Aln2Plot shows a graphical overview of average hydrophobicity and side chain volume in a multiple alignment.

In the secondary structure section, Quick2D integrates the results of various secondary structure prediction programs, such as PSIPRED ( 18 ), JNET ( 19 ) and PROFKing ( 20 ), the transmembrane prediction of MEMSAT2 ( 21 ) and HMMTOP ( 22 ) and the disorder prediction of DISOPRED ( 23 ) into a single colored view. The AlignmentViewer clusters sequences by a sequence identity criterion, annotates groups of sequences using PSIPRED and MEMSAT2 predictions of a multiple alignment and graphically displays the results in an interactive Java applet.

The tertiary structure section contains Modeller ( 24 ) and HHpred ( 5 ). Modeller is a very popular program for comparative modeling. It generates a 3D structural model from a sequence alignment of a protein sequence with one or more structural templates. In contrast to the standalone version of Modeller, the input format does not need to be PIR but can also be FASTA or most other standard multiple alignment formats. Modeller is tightly integrated with HHpred, allowing selected hits of HHpred results to be used as templates for subsequent comparative modeling. On the results page, models can be evaluated by using a browser-embedded 3D-viewer and charts with output from several model quality assessment programs are provided. This allows fast interactive refinement cycles of the underlying multiple sequence alignment. The page also provides a link to the iMolTalk server, which offers several additional tools for the detailed analysis of structures and models ( 25 , 26 ).

In the classification section, we offer modules of the widely used phylogenetic analysis suite PHYLIP ( 27 ), the ANCESCON package ( 28 ) for distance-based phylogenetic analysis and CLANS ( 9 ). CLANS clusters user-provided sequences based on BLAST pairwise similarities ( 29 ). The results can be analysed with a CLANS Java applet or can be exported to CLANS format.

Finally, in the utilities section there is a collection of tools which help to perform simple tasks that the user will often be confronted with. It includes a sequence reformatting utility, a six-frame translation tool for nucleotide sequences, Extract_gis for the extraction of gi-numbers from BLAST files, the RetrieveSeq tool for identifier-based sequence retrieval from the non-redundant protein or nucleotide databases at NCBI, gi2Promotor for the extraction of nucleotide sequences upstream of genes identified by the gi-numbers of their encoded proteins and a backtranslation tool.
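As a rough illustration of what a six-frame translation utility does, the following Python sketch translates a nucleotide sequence in all three forward and all three reverse-complement frames. It assumes the standard genetic code and plain A/C/G/T input; the function names are hypothetical and this is not the toolkit's implementation.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def translate(seq: str) -> str:
    """Translate complete codons only; unknown codons become 'X'."""
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X")
                   for i in range(0, len(seq) - 2, 3))

def six_frame_translation(seq: str) -> dict:
    """Return the translations of frames +1..+3 and -1..-3."""
    seq = seq.upper()
    reverse = seq.translate(COMPLEMENT)[::-1]  # reverse complement
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(seq[offset:])
        frames[f"-{offset + 1}"] = translate(reverse[offset:])
    return frames

for name, protein in sorted(six_frame_translation("ATGGCCTAA").items()):
    print(name, protein)
```

Stop codons are shown as `*`, so open reading frames appear as stretches between asterisks.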


BLAST

NCBI BLAST is the most commonly used sequence similarity search tool. It uses heuristics to perform fast local alignment searches.

PSI-BLAST allows users to construct and perform a BLAST search with a custom, position-specific, scoring matrix which can help find distant evolutionary relationships. PHI-BLAST functionality is also available to restrict results using patterns.



How to Interpret BLAST Results

So you have acquired raw sequence data that you want to connect to a larger body of research. Likely, the first database you will reference is the National Center for Biotechnology Information (NCBI) BLAST (basic local alignment search tool). You might also reference other meaningful databases like Swiss-Prot, the Human Genome Browser and Pfam, depending on the questions that you’re trying to answer and the nature of your samples. The results come with quality measures that call for a bit of interpretation.

There are multiple versions of BLAST but for this summary we will stick to nucleotide-nucleotide alignments for simplicity. When you query a database your sequences get compared to every other sequence until top hits are found and reported in the results with quality metrics.

Some hits may report the same scores and so differentiating the varying levels of confidence that each parameter describes is necessary to choose sequences for the next phase of your analysis. The results are defined as:

  • Maximum Score is the highest alignment score (bit score) between the query sequence and any aligned segment of a database sequence. For a given search, the E-value falls exponentially as the bit score rises, so a larger bit score is less likely to be obtained by chance than a smaller one.
  • Total Score is the sum of the alignment scores of all aligned segments from the same database sequence
  • Percent Query Coverage is the percent of the query length that is included in the aligned segments
  • E-value is the number of alignments with a score at least this good that you would expect to find by chance in a database of this size; the lower the E-value, the less likely it is that the similarity is due to random chance
  • Percent Identity is the percentage of aligned positions at which the query and the database sequence are identical

It is not really possible to make an informed decision about the scores, or the validity of the alignments, without delving into a detailed explanation of the scoring system that is used. For both nucleotide and protein, sequences are placed into a matrix then a heuristic algorithm is applied to get a raw score.

Here is an example of a scoring matrix used for local alignments, where the red numbers represent the path that was taken. This matrix is from an amino acid alignment, but the basic model remains the same for nucleotides as well. Each base in the matrix gets scored in relation to its pair on the other axis. In the case of BLASTn, a match gets +1 points, a mismatch gets -3, and the cost of skipping to the next letter (a gap) defaults to Linear mode, but you can change that. Increasing gap costs will cause alignments with fewer gaps to show up. The model chooses the path that yields the highest score, and the sum along this path is the raw score. This score is normalized to get the Maximum and Total scores.
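The matrix-filling idea can be sketched in a few lines of Python. Note the assumptions: this is the exhaustive Smith-Waterman dynamic program, not BLAST's heuristic, and the linear gap cost of -2 is chosen for illustration only (the match and mismatch values follow the text). The function name is hypothetical.

```python
def local_alignment_score(a: str, b: str, match: int = 1,
                          mismatch: int = -3, gap: int = -2) -> int:
    """Best local alignment raw score with a linear gap cost (Smith-Waterman)."""
    rows, cols = len(a) + 1, len(b) + 1
    # H[i][j] = best score of a local alignment ending at a[i-1], b[j-1].
    H = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(
                0,                    # start a new alignment here
                H[i - 1][j - 1] + s,  # extend with a match or mismatch
                H[i - 1][j] + gap,    # gap in b
                H[i][j - 1] + gap,    # gap in a
            )
            best = max(best, H[i][j])
    return best

print(local_alignment_score("ACGTACGT", "ACGTTACGT"))
```

Raising the magnitude of `gap` makes the gapped path less attractive than two separate ungapped segments, which is the effect described above.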

The e-value, or expect value, is the number of alignments with a similar or better score that you expect to see by chance in a database of a specified size. Typically, a low e-value indicates significant similarity between sequences, from which you can infer that the sequences are homologous, although BLAST does not measure homology directly. The e-value is calculated using the bit score, the length of the query, and the size of the database. Since a particular bit score is more easily obtained by chance with a longer query than with a shorter query, longer queries correspond to larger e-values. Likewise, a larger database makes a particular bit score more easily obtained by chance, so a larger database results in a larger e-value for the same bit score.
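The dependence on bit score, query length and database size can be made concrete with the standard Karlin-Altschul relations, S' = (λS − ln K) / ln 2 and E = m · n · 2^(−S'). The Python sketch below is a simplification: it ignores the effective-length corrections that real BLAST applies, and the numbers in the example are made up for illustration.

```python
import math

def bit_score(raw_score: float, lam: float, k: float) -> float:
    """Normalize a raw score S to a bit score S' = (lambda*S - ln K) / ln 2."""
    return (lam * raw_score - math.log(k)) / math.log(2)

def expect_value(bits: float, query_len: float, db_len: float) -> float:
    """E = m * n * 2**(-S'), with m the query length and n the database length."""
    return query_len * db_len * 2.0 ** (-bits)

# Same bit score, same query, database twice as large: E doubles.
small_db = expect_value(40, 100, 1_000_000)
large_db = expect_value(40, 100, 2_000_000)
print(small_db, large_db)
```

Each additional bit of score halves the e-value, which is why high-scoring hits quickly reach values that are reported as zero.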

In this blast output, you can expect to see the first four hits on this search 0 x 10⁰ times by random chance, meaning these hits are not random. They also have the same percent similarity. With this, the question of how to interpret e-values arises. What is a good e-value to support a claim that two sequences are biologically related? And which hits can we ignore… There is no universal answer but we can narrow the options with a few guidelines.

  • Checking the length of the sequence as a percentage of the query can give some reference to the length of each hit in relation to your query
  • The type of query determines the best e-value to use
  • The conclusions you draw from the data will be influenced by the e-value

To find extremely similar sequences, a high-scoring sequence with an e-value much smaller than one (that is, close to zero) is likely a good choice.


BLAST (Basic Local Alignment Search Tool)

BLAST was developed by Altschul et al. and published in the Journal of Molecular Biology (J. Mol. Biol. 215:403-410 (1990)).

BLAST (Basic Local Alignment Search Tool) is a suite of tools for analyzing the similarity between query sequences and nucleotide or protein databases.

BLAST programs compare sequences against open-access databases to find similarities.

The results of a BLAST search are essentially statistical.

BLAST uses a local alignment algorithm to detect similarities between two sequences, for example in a pairwise alignment.

Capabilities and Usage

BLAST compares one or more input sequences with each entry stored in nucleotide or protein databases.

BLAST looks for similarities that suggest homology between the input sequences and the database entries.

In a single BLAST run, the database searched must be of a single type, for instance proteins.

The query type is flexible: a nucleotide input sequence can be compared with a protein database, or a protein input sequence can be compared with a protein database.

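The suites prepared by GCG and EMBOSS include five kinds of BLAST programs:

  • BLASTN: nucleotide query against a nucleotide database
  • BLASTP: protein query against a protein database
  • BLASTX: translated nucleotide query against a protein database
  • TBLASTN: protein query against a translated nucleotide database
  • TBLASTX: translated nucleotide query against a translated nucleotide database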

Choose the most suitable BLAST program for the type of input sequence. For example, when both sequences being compared are nucleotide sequences, you could choose BLASTN or TBLASTX. BLASTN is usually the default setting; the two sequences can certainly also be compared with TBLASTN (gaps not considered).

BLAST can also be run locally from a terminal. This requires downloading the public databases and keeping them maintained and up to date.

There are websites that provide BLAST searches for free, but if the sequence is quite important, it would be better to run the search locally.


If you read the wiki carefully, you can see that the PSSM is calculated in 3 steps. First, the frequency is calculated (how many times the amino acid or nucleotide occurred at that position in the motif); from that you can calculate the probability (in the wiki example there were 10 sequences, so each frequency is divided by 10).

Then the log likelihood is calculated, which are the PSSM values. These are in your first matrix (rounded down). The second matrix shows how much the values are relative from your pseudocounts (I assume it was set to default = 0).
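The three steps (counts, probabilities, log scores) can be sketched in Python as follows. Several details are assumptions rather than anything stated above: a uniform background distribution, base-2 logarithms, and a pseudocount that defaults to 0 (so unseen symbols score negative infinity). The function name `build_pssm` is hypothetical.

```python
import math
from collections import Counter

def build_pssm(sequences, alphabet="ACGT", pseudocount=0.0, background=None):
    """Return one {symbol: log-odds score} dict per alignment column."""
    # Assumption: uniform background unless one is supplied.
    background = background or {s: 1.0 / len(alphabet) for s in alphabet}
    n = len(sequences)
    pssm = []
    for column in zip(*sequences):
        counts = Counter(column)                      # step 1: frequencies
        row = {}
        for s in alphabet:
            p = (counts[s] + pseudocount) / (n + pseudocount * len(alphabet))  # step 2
            # step 3: log-odds against the background
            row[s] = math.log2(p / background[s]) if p > 0 else float("-inf")
        pssm.append(row)
    return pssm

motif = ["ACGT", "ACGA", "ACCT", "TCGT"]
for position, row in enumerate(build_pssm(motif), start=1):
    print(position, {s: round(v, 2) for s, v in row.items()})
```

Setting `pseudocount` to a small positive value replaces the infinite penalties with finite ones, which is usually what is wanted when the motif is built from only a few sequences.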

Lambda and kappa are estimated in order to calculate the normalized score (S') for an HSP. If you have never heard of this before, I suggest you read the original PSI-BLAST paper first, which you can find here.


Basic local alignment search tool (BLAST) is a sequence similarity search program. The National Center for Biotechnology Information (NCBI) maintains a BLAST server with a home page at http://www.ncbi.nlm.nih.gov/BLAST/ . We report here on recent enhancements to the results produced by the BLAST server at the NCBI. These include features to highlight mismatches between similar sequences, show where the query was masked for low-complexity sequence, and integrate information about the database sequences from the NCBI Entrez system into the BLAST display. Changes to how the database sequences are fetched have also improved the speed of the report generator.

Basic local alignment search tool (BLAST) is a sequence similarity search program that can be used via a web interface or as a stand-alone tool to compare a user's query to a database of sequences ( 1 , 2 ). Several variants of BLAST compare all combinations of nucleotide or protein queries with nucleotide or protein databases. BLAST is a heuristic that finds short matches between two sequences and attempts to start alignments from these ‘hot spots’. In addition to performing alignments, BLAST provides statistical information about an alignment; this is the ‘expect’ value, or false-positive rate.

The National Center for Biotechnology Information (NCBI) maintains a BLAST server with a homepage at http://www.ncbi.nlm.nih.gov/BLAST/ . On the homepage the different BLAST searches are listed by type: nucleotide, protein, translated and genomes. The ‘Program Selection Guide’ ( http://www.ncbi.nlm.nih.gov/blast/producttable.shtml ) provides an introduction to the various programs and database options ( 3 ). When a query is submitted to the NCBI server, either as a sequence in FASTA format or as a sequence identifier, e.g. GenBank accession number, the search is sent to the BLAST server and a ‘Request Identifier’ (RID) is returned. The query and results are stored in a structured format for up to 24 h after an RID is issued. The RID identifies the search and allows the results to be viewed in several formats, which include the familiar BLAST report, a simplified ‘hit table’, XML and ASN.1 [( 4 ) and http://www.ncbi.nlm.nih.gov/books/bv.fcgi?rid=handbook.chapter.610 ]. The number of outstanding jobs from one IP address is taken into account when queuing requests, as described at http://www.ncbi.nlm.nih.gov/BLAST/blast_FAQs.shtml#Queuetime , so that one user does not monopolize the entire service. Searches sent to the server are handled by a sophisticated queuing system that may spread the search over 10 to 20 machines, making the search much faster than if it were run on one machine. Queries and results are stored in an SQL database. More details are available at ftp://ftp.ncbi.nlm.nih.gov/blast/documents/blast-sc2004.pdf

We report here on new display features that we have implemented. These include highlighting mismatches between similar sequences, showing where the query was masked for low-complexity sequence and integrating information from the NCBI Entrez system ( 5 ) into the BLAST display. Additionally the new report generator has been optimized for databases with large sequences.

Custom definition lines

During the past five years many genomes have become searchable and the sequences in those databases are typically long contigs or chromosomes. Additionally many long nucleotide sequences have been added to the BLAST databases as a result of high-throughput genomic projects. Traditionally sequences in the BLAST database have been associated with only one descriptive phrase that is normally the same as the ‘definition’ in the GenBank flat file. This means that only very generic information is provided for matches to long database sequences, even though such a sequence might have annotations for many genes, coding regions (CDS) and other features. The top line of Figure 1 shows a database sequence definition and merely states that the sequence is part of human chromosome 6 and is about 48 million bases long. This reveals little about the region of the database sequence containing the match. To address this issue, we now provide feature information for BLAST alignments involving long database sequences (currently defined as larger than 200 kb).

Two types of sequence features (CDS and rRNA) are currently supported but this could be expanded to other features. An example is shown in Figure 1 where a custom definition line is displayed for each of the two alignments. According to the custom definition lines the query matches a region inside the human major histocompatibility complex (MHC) A gene, as well as a region that is about 54 kb upstream of the MHC A gene and about 58 kb downstream of the MHC G gene, allowing one to quickly draw the conclusion that the query sequence is highly related to the human MHC. This feature is always enabled for reports at the NCBI BLAST web site.

New format options for easier sequence analysis

Frequently alignments are between very similar sequences and it is difficult to identify a few mismatches in the pairwise alignment. To address this issue we recently introduced a new format called ‘Pairwise with identities’, shown in Figure 2 on an alignment with 98% identity between the query and database sequence. A dot indicates identity between the database sequence and query at that position; mismatches are shown as the database sequence letter in place of the dot and colored red. In addition the word ‘Sbjct’ (on the left of the figure) is also colored red if there is a mismatch on the line. Enable this option with the ‘Alignment View’ pull-down menu shown in Figure 3 .
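The dot notation is easy to reproduce for two already-aligned strings. The Python sketch below is only an illustration of the display rule (identity becomes a dot, a mismatch keeps the database letter); the function name is hypothetical and the red coloring used by the web report is omitted.

```python
def identities_view(query: str, subject: str) -> str:
    """Render an aligned subject line with '.' wherever it equals the query."""
    if len(query) != len(subject):
        raise ValueError("aligned sequences must have equal length")
    return "".join("." if q == s else s for q, s in zip(query, subject))

query = "ACGTACGTTAGC"
subject = "ACGAACGTTGGC"
print("Query ", query)
print("Sbjct ", identities_view(query, subject))
```

With highly similar sequences, the few remaining letters on the subject line mark exactly the mismatched positions.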

The majority of BLAST searches at the NCBI web site are nucleotide queries against nucleotide databases (e.g. BLASTN). Many of these queries are mRNAs or match to sequences with annotated coding regions. The standard BLAST report does not show the amino acid sequence translated from the query or annotated on the database sequence, even though that may be of great interest to the user; furthermore, figuring out the positions of the encoded amino acids on the corresponding nucleotide sequence can be challenging, especially if the coding region is long or involves multiple exons. We have introduced a new ‘CDS Feature’ to display such coding regions. With this option any pre-annotated CDS protein products on the query (if the query is an accession) or the database sequence are fetched from Entrez and shown with the residues aligned to the second base of a codon ( Figure 2 ). For a user-submitted query in FASTA format a putative protein product is calculated using the coding frame of the database sequence as a guide. Mismatched amino acids for the database sequence can also be shown in color. Combined with the ‘Pairwise with identities’ option discussed above this format makes certain tasks easier, such as analysis for silent and replacement mutations. Owing to the overhead of fetching the CDS feature from Entrez this option is currently not the default. Enable this option by checking the ‘CDS feature’ box on the BLAST format page as shown in Figure 3 .

Low-complexity sequences are compositionally biased regions of amino acid or nucleotide sequence, which often result in artificially high scores in sequence similarity searches. Low-complexity filters, such as SEG ( 6 ) or DUST, mask these regions and prevent them from overly biasing the results. Traditionally BLAST has replaced the masked regions by Xs or Ns in the BLAST report. The BLAST formatter now can represent these regions by lower-case letters, making them distinct from the (upper-case) non-filtered regions ( Figure 2 ). In addition the user may select from three colors (black, gray, red) to vary the emphasis on these regions ( Figure 3 ). This new display option is now the default, showing the masked regions in gray lower-case.
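The difference between the two display styles can be shown with a short Python sketch. It only formats regions that a filter has already identified; it does not implement SEG or DUST. The function names and the 1-based inclusive interval convention are assumptions made for this example.

```python
def soft_mask(seq: str, intervals) -> str:
    """Show masked regions in lower case, everything else in upper case."""
    chars = list(seq.upper())
    for start, end in intervals:          # 1-based, inclusive coordinates
        for i in range(start - 1, end):
            chars[i] = chars[i].lower()
    return "".join(chars)

def hard_mask(seq: str, intervals, mask_char: str = "N") -> str:
    """Traditional style: replace masked regions by N (or X for proteins)."""
    chars = list(seq.upper())
    for start, end in intervals:
        for i in range(start - 1, end):
            chars[i] = mask_char
    return "".join(chars)

sequence = "ACGTAAAAAAACGT"
print(soft_mask(sequence, [(5, 11)]))
print(hard_mask(sequence, [(5, 11)]))
```

Soft masking keeps the underlying letters readable, which is why the lower-case form is the more informative display.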

General improvements to the BLAST web site

The BLAST graphical overview is a schematic representation of alignments matching the query sequence. It is useful for quickly localizing regions of interest in the query based on its similarity to other sequences in the database. To reduce the complexity in generating this graphic overview we have now implemented it as HTML tables that use a few small static images (gifs). This design is more robust and also lends itself to future development of a graphical viewer for stand-alone and command-line client BLAST.

The new report generator has improved functionality to fetch part of a database sequence. This can be essential if the database sequence is long, such as a chromosome, and the alignment to be presented only involves a small fraction of the sequence. Previously the entire database sequence was fetched and much of that sequence was not used. This improved functionality has led to a dramatic decrease in formatting time for searches against genomes.

BLAST provides several different modes for viewing BLAST results. The Query-anchored view gives a stacked view of database sequences aligned to the query with indication of insertions and mismatches ( 3 ). This provides an easy method to scan alignments and locate things like SNPs and amino acid substitutions among a group of related sequences. Previously the query-anchored views were not fully supported for BLASTX and TBLASTX searches that involved translated sequences. The formatter now supports this format for all these programs. Use the ‘Alignment View’ pull-down to enable this option ( Figure 3 ).

From the BLAST results it is now possible to select some or all of the database sequences and perform an Entrez query to fetch them. Checking the boxes in the alignment section selects the sequences to download and clicking the ‘Get Selected sequences’ button takes the user to Entrez, where the sequences can be displayed in various formats, (such as GenBank or FASTA) and saved to a file. The saved file can then be used as input to another program.

Future directions

We are currently redesigning the BLAST web pages to make them more effective tools. Some of the changes will be better organized HTML that makes options apparent to the user, such as making it easier to limit a search or results to a particular organism or subset of the data available. Results will also be made more user-friendly by better organizing the output. Nearing completion is a utility to calculate distances between sequences in the BLAST results and present those as a tree. Finally we are also working on making it possible to save search or formatting strategies for future use.

Figure 1. Excerpt from a BLAST result showing custom-definition lines. The query was bases 241 through 480 of a human MHC A gene nucleotide sequence (NM_002116) in a search against the human genome. The top line of the figure is the traditional sequence definition. Custom definition lines are provided for both of the alignments shown and are relevant to the region matched (first alignment) or nearby regions (second alignment).

Figure 2. Demonstration of new format options. FASTA sequence for the human cystic fibrosis trans-membrane conductance regulator sequence (NM_000492) was used as query for a BLASTN search against the nr database using default parameters. Three new display options are shown in this figure. The first is the ‘Pairwise with identities’ option. Nucleotide matches in the database sequence are shown as dots (‘.’), nucleotide mismatches in the database sequence (as well as the database sequence identification) are colored red. The second new option is the presentation of the CDS features, which is shown for both the query and database sequences above and below the BLAST alignment, respectively. The CDS feature annotated on the database sequence was retrieved from Entrez; the putative CDS feature on the query was produced automatically using the CDS of the database sequence as a guide. Mismatches for the amino acid sequence derived from the database sequence are colored pink. Finally the new masking option is shown (see text). Bases 175–181 of the query were masked for low-complexity during the search and are shown in lower-case gray letters.

Figure 3. Enabling new features on the BLAST format page. The red arrows point to new report features that may be enabled or modified from this page. The check-box highlighted by arrow 1 enables the CDS feature on a BLASTN or megaBLAST search. The two menus highlighted by arrow 2 change the default behavior for display of masked regions. The menu highlighted by arrow 3 changes how the alignments are displayed in the BLAST report.

The authors would like to acknowledge Richa Agarwala, Stephen Altschul, Kevin Bealer, Christiam Camacho, Peter Cooper, George Coulouris, Susan Dombrowski, Mike Gertz, David Lipman, Wayne Matten, Yuri Merezhuk, Alexander Morgulis, Jim Ostell, Jason Papadopoulos, Yan Raytselis, Eric Sayers, Alejandro Schaffer, Tao Tao, David Wheeler and Irena Zaretskaya, as well as members of the C++ toolkit group at the NCBI, for their work that has made this Web site possible. This research was supported by the Intramural Research Program of the NIH, National Library of Medicine. Funding to pay the Open Access publication charges for this article was provided by the National Institutes of Health.


Nucleic acids consist of a chain of linked units called nucleotides. Each nucleotide consists of three subunits: a phosphate group and a sugar (ribose in the case of RNA, deoxyribose in DNA) make up the backbone of the nucleic acid strand, and attached to the sugar is one of a set of nucleobases. The nucleobases are important in base pairing of strands to form higher-level secondary and tertiary structure such as the famed double helix.

The possible letters are A, C, G, and T, representing the four nucleotide bases of a DNA strand – adenine, cytosine, guanine, thymine – covalently linked to a phosphodiester backbone. In the typical case, the sequences are printed abutting one another without gaps, as in the sequence AAAGTCTGAC, read left to right in the 5' to 3' direction. With regards to transcription, a sequence is on the coding strand if it has the same order as the transcribed RNA.

One sequence can be complementary to another sequence, meaning that the base at each position is the complement of the corresponding base in the other sequence (i.e. A pairs with T, C pairs with G) and the two sequences are read in reverse order. For example, the complementary sequence to TTAC is GTAA. If one strand of the double-stranded DNA is considered the sense strand, then the other strand, considered the antisense strand, will have the complementary sequence to the sense strand.
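The TTAC to GTAA example can be checked with a short Python sketch. It handles only the four unambiguous DNA letters, and the function name is chosen for illustration.

```python
# Map each base to its pairing partner (upper and lower case).
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def reverse_complement(seq: str) -> str:
    """Complement every base, then reverse so the result reads 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]

print(reverse_complement("TTAC"))  # GTAA
```

Applying the function twice returns the original sequence, which is a convenient sanity check.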

Notation

Comparing and determining % difference between two nucleotide sequences.

  • AATCCGCTAG
  • AAACCCTTAG

Given the two 10-nucleotide sequences, line them up and count the positions at which they differ. The percent similarity is the number of identical bases divided by the total number of nucleotides; the percent difference is the number of differing bases divided by the same total. In the above case, the sequences differ at three of the 10 positions (positions 3, 6 and 7) and agree at the other seven. The similarity is therefore 7/10 = 70%, and the difference is 3/10 = 30%, which is the same as 100% minus the similarity.
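The same calculation can be written as a short Python sketch. It assumes the two sequences are already aligned and of equal length, with no gaps; `percent_identity` is an illustrative name, not a function from an existing package.

```python
def percent_identity(a: str, b: str) -> float:
    """Percentage of positions at which two equal-length sequences agree."""
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be non-empty and of equal length")
    # Count the positions where both sequences carry the same base.
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)


identity = percent_identity("AATCCGCTAG", "AAACCCTTAG")
print(identity, 100.0 - identity)  # 70.0 30.0
```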

While A, T, C, and G represent a particular nucleotide at a position, there are also letters that represent ambiguity which are used when more than one kind of nucleotide could occur at that position. The rules of the International Union of Pure and Applied Chemistry (IUPAC) are as follows: [1]

Symbol [2]  Description                       Bases represented  No. of bases  Complement
A           Adenine                           A                  1             T
C           Cytosine                          C                  1             G
G           Guanine                           G                  1             C
T           Thymine                           T                  1             A
U           Uracil                            U                  1             A
W           Weak                              A, T               2             W
S           Strong                            C, G               2             S
M           aMino                             A, C               2             K
K           Keto                              G, T               2             M
R           puRine                            A, G               2             Y
Y           pYrimidine                        C, T               2             R
B           not A (B comes after A)           C, G, T            3             V
D           not C (D comes after C)           A, G, T            3             H
H           not G (H comes after G)           A, C, T            3             D
V           not T (V comes after T and U)     A, C, G            3             B
N           any Nucleotide (not a gap)        A, C, G, T         4             N
Z           Zero                              (none)             0             Z

These symbols are also valid for RNA, except with U (uracil) replacing T (thymine). [1]
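As a rough illustration of how the ambiguity codes are used in practice, the sketch below stores the table as a Python dictionary and checks whether a concrete sequence is compatible with a pattern written in IUPAC symbols. The names `IUPAC_BASES` and `matches_iupac` are hypothetical, and treating U as equivalent to T is a simplifying assumption made here so that DNA and RNA input can be compared.

```python
# Each IUPAC symbol maps to the set of concrete bases it can stand for.
IUPAC_BASES = {
    "A": "A", "C": "C", "G": "G", "T": "T", "U": "T",
    "W": "AT", "S": "CG", "M": "AC", "K": "GT", "R": "AG", "Y": "CT",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def matches_iupac(pattern: str, seq: str) -> bool:
    """True if every base of seq is allowed by the symbol at the same position."""
    if len(pattern) != len(seq):
        return False
    # Assumption: U in the sequence is treated as T for the comparison.
    seq = seq.upper().replace("U", "T")
    return all(base in IUPAC_BASES[code]
               for code, base in zip(pattern.upper(), seq))


print(matches_iupac("RYN", "GCA"))  # True: G is a purine, C is a pyrimidine
print(matches_iupac("RYN", "CCA"))  # False: C is not a purine
```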

Apart from adenine (A), cytosine (C), guanine (G), thymine (T) and uracil (U), DNA and RNA also contain bases that have been modified after the nucleic acid chain has been formed. In DNA, the most common modified base is 5-methylcytidine (m5C). In RNA, there are many modified bases, including pseudouridine (Ψ), dihydrouridine (D), inosine (I), ribothymidine (rT) and 7-methylguanosine (m7G). [3] [4] Hypoxanthine and xanthine are two of the many bases created through mutagen presence, both of them through deamination (replacement of the amine-group with a carbonyl-group). Hypoxanthine is produced from adenine, and xanthine is produced from guanine. [5] Similarly, deamination of cytosine results in uracil.

In biological systems, nucleic acids contain information which is used by a living cell to construct specific proteins. The sequence of nucleobases on a nucleic acid strand is translated by cell machinery into a sequence of amino acids making up a protein strand. Each group of three bases, called a codon, corresponds to a single amino acid, and there is a specific genetic code by which each possible combination of three bases corresponds to a specific amino acid.
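The codon-by-codon reading described above can be sketched as follows. The example assumes the standard genetic code, a sequence that starts in frame, and only unambiguous bases; `translate` and `CODON_TABLE` are names chosen for this illustration. Translation stops at the first stop codon.

```python
from itertools import product

# Standard genetic code, listed with the bases in the order T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def translate(seq: str) -> str:
    """Translate an in-frame DNA or RNA sequence up to the first stop codon."""
    seq = seq.upper().replace("U", "T")  # accept RNA input as well
    protein = []
    for i in range(0, len(seq) - 2, 3):  # step through complete codons only
        amino_acid = CODON_TABLE[seq[i:i + 3]]
        if amino_acid == "*":  # "*" marks a stop codon
            break
        protein.append(amino_acid)
    return "".join(protein)


print(translate("ATGGCCAAATAA"))  # MAK
```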

The central dogma of molecular biology outlines the mechanism by which proteins are constructed using information contained in nucleic acids. DNA is transcribed into mRNA molecules, which travel to the ribosome, where the mRNA is used as a template for the construction of the protein strand. Since nucleic acids can bind to molecules with complementary sequences, there is a distinction between "sense" sequences which code for proteins, and the complementary "antisense" sequence which is by itself nonfunctional, but can bind to the sense strand.

DNA sequencing is the process of determining the nucleotide sequence of a given DNA fragment. The sequence of the DNA of a living thing encodes the necessary information for that living thing to survive and reproduce. Therefore, determining the sequence is useful in fundamental research into why and how organisms live, as well as in applied subjects. Because of the importance of DNA to living things, knowledge of a DNA sequence may be useful in practically any biological research. For example, in medicine it can be used to identify, diagnose and potentially develop treatments for genetic diseases. Similarly, research into pathogens may lead to treatments for contagious diseases. Biotechnology is a burgeoning discipline, with the potential for many useful products and services.

RNA is generally not sequenced directly. Instead, it is copied into DNA by reverse transcriptase, and this DNA is then sequenced.

Current sequencing methods rely on the discriminatory ability of DNA polymerases, and therefore can only distinguish four bases. An inosine (created from adenosine during RNA editing) is read as a G, and 5-methyl-cytosine (created from cytosine by DNA methylation) is read as a C. With current technology, it is difficult to sequence small amounts of DNA, as the signal is too weak to measure. This is overcome by polymerase chain reaction (PCR) amplification.

Digital representation

Once a nucleic acid sequence has been obtained from an organism, it is stored in silico in digital format. Digital genetic sequences may be stored in sequence databases, be analyzed (see Sequence analysis below), be digitally altered and be used as templates for creating new actual DNA using artificial gene synthesis.

Digital genetic sequences may be analyzed using the tools of bioinformatics to attempt to determine their function.

Genetic testing

The DNA in an organism's genome can be analyzed to diagnose vulnerabilities to inherited diseases, and can also be used to determine a child's paternity (genetic father) or a person's ancestry. Normally, every person carries two variations of every gene, one inherited from their mother, the other inherited from their father. The human genome is believed to contain around 20,000–25,000 genes. In addition to studying chromosomes to the level of individual genes, genetic testing in a broader sense includes biochemical tests for the possible presence of genetic diseases, or mutant forms of genes associated with increased risk of developing genetic disorders.

Genetic testing identifies changes in chromosomes, genes, or proteins. [6] Usually, testing is used to find changes that are associated with inherited disorders. The results of a genetic test can confirm or rule out a suspected genetic condition or help determine a person's chance of developing or passing on a genetic disorder. Several hundred genetic tests are currently in use, and more are being developed. [7] [8]

Sequence alignment

In bioinformatics, a sequence alignment is a way of arranging the sequences of DNA, RNA, or protein to identify regions of similarity that may be due to functional, structural, or evolutionary relationships between the sequences. [9] If two sequences in an alignment share a common ancestor, mismatches can be interpreted as point mutations and gaps as insertion or deletion mutations (indels) introduced in one or both lineages in the time since they diverged from one another. In sequence alignments of proteins, the degree of similarity between amino acids occupying a particular position in the sequence can be interpreted as a rough measure of how conserved a particular region or sequence motif is among lineages. The absence of substitutions, or the presence of only very conservative substitutions (that is, the substitution of amino acids whose side chains have similar biochemical properties) in a particular region of the sequence, suggests [10] that this region has structural or functional importance. Although DNA and RNA nucleotide bases are more similar to each other than are amino acids, the conservation of base pairs can indicate a similar functional or structural role. [11]
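To make the idea of mismatches and gaps concrete, here is a minimal sketch of global pairwise alignment by dynamic programming (the Needleman-Wunsch approach). The scoring values (match +1, mismatch -1, linear gap penalty -2) are arbitrary assumptions for illustration, and production tools use more refined scoring and affine gap penalties.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Globally align a and b; return (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)
    # score[i][j] holds the best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,  # (mis)match
                              score[i - 1][j] + gap,      # gap in b
                              score[i][j - 1] + gap)      # gap in a
    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


best, top, bottom = needleman_wunsch("ACGT", "AGT")
print(best)
print(top)
print(bottom)
```

In the example, the single gap in the second row corresponds to an insertion or deletion of one base in one of the two lineages.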

Computational phylogenetics makes extensive use of sequence alignments in the construction and interpretation of phylogenetic trees, which are used to classify the evolutionary relationships between homologous genes represented in the genomes of divergent species. The degree to which sequences in a query set differ is qualitatively related to the sequences' evolutionary distance from one another. Roughly speaking, high sequence identity suggests that the sequences in question have a comparatively young most recent common ancestor, while low identity suggests that the divergence is more ancient. This approximation, which reflects the "molecular clock" hypothesis that a roughly constant rate of evolutionary change can be used to extrapolate the elapsed time since two genes first diverged (that is, the coalescence time), assumes that the effects of mutation and selection are constant across sequence lineages. Therefore, it does not account for possible differences among organisms or species in the rates of DNA repair or the possible functional conservation of specific regions in a sequence. (In the case of nucleotide sequences, the molecular clock hypothesis in its most basic form also discounts the difference in acceptance rates between silent mutations that do not alter the meaning of a given codon and other mutations that result in a different amino acid being incorporated into the protein.) More statistically accurate methods allow the evolutionary rate on each branch of the phylogenetic tree to vary, thus producing better estimates of coalescence times for genes.
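One simple way to turn observed differences into an evolutionary distance is the Jukes-Cantor correction, which is not mentioned in the text above and is used here only as an illustration. It assumes that all substitutions are equally likely and that rates are equal across sites, so it should be read as a rough sketch rather than a recommended model. The function names are chosen for this example.

```python
import math


def p_distance(a: str, b: str) -> float:
    """Fraction of aligned, equal-length positions at which a and b differ."""
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be non-empty and of equal length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def jukes_cantor(p: float) -> float:
    """Estimated substitutions per site, d = -3/4 * ln(1 - 4p/3)."""
    if not 0.0 <= p < 0.75:
        raise ValueError("the correction is only defined for 0 <= p < 0.75")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


p = p_distance("AATCCGCTAG", "AAACCCTTAG")
print(p, round(jukes_cantor(p), 4))
```

The corrected distance is always at least as large as the raw fraction of differences, because some positions may have changed more than once.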

Sequence motifs

Frequently the primary structure encodes motifs that are of functional importance. Some examples of sequence motifs are: the C/D [12] and H/ACA boxes [13] of snoRNAs, the Sm binding site found in spliceosomal RNAs such as U1, U2, U4, U5, U6, U12 and U3, the Shine-Dalgarno sequence, [14] the Kozak consensus sequence [15] and the RNA polymerase III terminator. [16]
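A motif with a fixed consensus can be located with a simple exact-match scan, as in the sketch below. It uses AGGAGG, the commonly cited Shine-Dalgarno consensus, as the example motif; real binding sites often deviate from the consensus, so an exact search like this is only a first approximation. The function name `find_motif` and the test sequence are made up for illustration.

```python
import re


def find_motif(seq: str, motif: str) -> list:
    """Return the 0-based start positions of every (overlapping) exact match."""
    # A zero-width lookahead lets overlapping occurrences be reported too.
    pattern = re.compile("(?=" + re.escape(motif.upper()) + ")")
    return [m.start() for m in pattern.finditer(seq.upper())]


print(find_motif("TTAGGAGGTAACATATG", "AGGAGG"))  # [2]
```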

Long range correlations

Peng et al. [17] [18] found the existence of long-range correlations in the non-coding base pair sequences of DNA. In contrast, such correlations seem not to appear in coding DNA sequences. This finding has been explained by Grosberg et al. [19] by the global spatial structure of the DNA.

Sequence entropy

In bioinformatics, sequence entropy, also known as sequence complexity or information profile, [20] is a numerical sequence providing a quantitative measure of the local complexity of a DNA sequence, independently of the direction of processing. Manipulating these information profiles enables the analysis of sequences using alignment-free techniques, for example in motif and rearrangement detection. [20] [21] [22]
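One simple way to obtain such a profile is to compute the Shannon entropy of the base composition in a sliding window. This is an assumption made for illustration; the measures defined in the cited works may differ. The window length and function names below are arbitrary choices.

```python
import math
from collections import Counter


def shannon_entropy(window: str) -> float:
    """Shannon entropy, in bits, of the base composition of a window."""
    total = len(window)
    # Sum p * log2(1/p) over the bases that actually occur in the window.
    return sum((count / total) * math.log2(total / count)
               for count in Counter(window).values())


def entropy_profile(seq: str, window: int = 8) -> list:
    """Entropy of every full-length window, moving one base at a time."""
    return [shannon_entropy(seq[i:i + window])
            for i in range(len(seq) - window + 1)]


print(shannon_entropy("ACGT"))  # 2.0 bits: all four bases equally frequent
print(entropy_profile("AAAAAAAAACGTACGT", window=8))
```

Because the entropy depends only on the composition of each window, reading the window in either direction gives the same value. Low values flag repetitive, low-complexity stretches.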

