
D5. Multiple Conformations from The Same Sequence - Biology


1. This would be especially true if two possible structures were close in free energy but separated by a significant activation energy barrier, precluding simple conformational rearrangement of one conformation into the other.

2. Metamorphic Proteins: In addition to prion proteins, it appears that many proteins can adopt more than one conformation under the same set of conditions. In contrast to prion proteins, however, in which the formation of the beta-structure variant is irreversible since the conformational change is associated with aggregation, many proteins can change conformations reversibly. Often, these changes do not appear to be associated only with binding interactions that trigger the change. Murzin has described proteins that change conformations on change of pH (viral glycoproteins), redox state (chloride channel), disulfide isomerization (lysozyme), and bound ligand (RNA polymerase as it initiates and then elongates the growing RNA polymer). He cites two proteins that appear to change state without external signals: Mad2, in which the two conformers share extensive similarity, and Ltn10 (lymphotactin), in which they don't. One form of lymphotactin (Ltn10) binds to similar lymphokine receptors, while the other (Ltn40) binds to heparin. Folding kinetics may play a part in these examples as well, as proteins capable of folding to two conformers independently and quickly might avoid the misfolding and aggregation that could occur if they had to unfold completely before a conformational transition. Both Mad2 and Ltn10 alter conformation through transient formation of dimers, which facilitates conformational changes without widespread unfolding. Mutations in Ltn10 can cause the protein to adopt the Ltn40 conformation. Hence primordial "metamorphic" proteins could, by simple mutation, produce new protein functionalities.

3. Intrinsically Disordered Proteins (IDPs): Many examples of proteins that are partially or completely disordered but still retain biological function have been found. At first glance this might appear unexpected, since how could such a protein bind its natural ligand with specificity and selectivity to express its function? Of course, one could postulate that ligand binding induces the conformational changes necessary for function (such as catalysis), in an extreme example of an induced fit of a ligand compared to a "lock-and-key" fit. Decades ago, Linus Pauling predicted that antibodies, proteins that recognize foreign molecules (antigens), would bind loosely to the antigen, followed by a conformational change to form a more complementary and tighter fit. This was the easiest way to allow a finite number of possible protein antibodies to bind a seemingly endless number of possible foreign molecules. This is indeed one method by which antibodies can recognize foreign antigens. Antibodies that bind to antigen with high affinity and hence high specificity more likely bind through a lock-and-key fit. (Pauling, however, didn't know that the genes that encode the protein chains in antibodies are differentially spliced and subjected to enhanced mutational rates, which allow the generation of incredible antibody diversity from a limited set of genes.)

It's been estimated that over half of all native proteins have regions (greater than 30 amino acids) that are disordered, and upwards of 20% of proteins are completely disordered. Regions of disorder are enriched in polar and charged side chains. This follows since such sequences might be expected to assume many available conformations in aqueous solution, compared to sequences enriched in hydrophobic side chains, which would probably collapse into a compact core stabilized by the hydrophobic effect. Mutations in the disordered regions tend to preserve the disordered region, suggesting that the disordered region is advantageous for "future" function. In addition, mutations that cause a noncoding sequence to produce a coding one invariably produce disordered protein sequences. Disordered proteins tend to have regulatory properties and bind multiple ligands, in comparison to ordered ones, which are involved in the highly specific ligand binding necessary for catalysis and transport. The intracellular concentration of disordered proteins has also been shown to be lower than that of ordered proteins, possibly to prevent inappropriate binding interactions mediated, for example, through hydrophobic interactions. Processes that accomplish this include more rapid mRNA and protein degradation and slower translation of mRNA for disordered proteins. For a similar reason, misfolded proteins are targeted for degradation as well. Figure A below shows the mean net charge vs the mean hydrophobicity for 275 folded and 91 natively unfolded proteins. Figure B shows the relative amino acid composition of globular (ordered) proteins compared to regions of disorder greater than 10 amino acids in disordered proteins. The two different grey bars were obtained with two different versions of the software used to analyze the proteins. Again the graph shows an enrichment of hydrophilic amino acids in disordered proteins.

Figure: Characteristics of Intrinsically Disordered Proteins

from open access journal: Dunker, A. et al. BMC Genomics 2008, 9(Suppl 2):S1 doi:10.1186/1471-2164-9-S2-S1
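The two axes of Figure A can be computed directly from a sequence. The sketch below is illustrative only: it assumes the Kyte-Doolittle hydropathy scale rescaled to the range 0 to 1 and counts only Lys/Arg as positive and Asp/Glu as negative (His treated as neutral). The text does not state which scale or charge convention was used for the figure, and the function names are hypothetical.

```python
# Kyte-Doolittle hydropathy values (range -4.5 to +4.5).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def mean_hydrophobicity(seq):
    """Mean hydropathy per residue, rescaled so that 0 <= value <= 1."""
    scaled = [(KYTE_DOOLITTLE[aa] + 4.5) / 9.0 for aa in seq]
    return sum(scaled) / len(scaled)


def mean_net_charge(seq):
    """Absolute net charge per residue (assumption: K, R = +1; D, E = -1)."""
    positive = sum(seq.count(aa) for aa in "KR")
    negative = sum(seq.count(aa) for aa in "DE")
    return abs(positive - negative) / len(seq)


if __name__ == "__main__":
    # Hypothetical sequences, chosen only to contrast the two regimes.
    for name, seq in [("charged", "PEVKPEEKKPEVK"), ("hydrophobic", "LLVVAIILFAGV")]:
        print(name, round(mean_net_charge(seq), 3), round(mean_hydrophobicity(seq), 3))
```

A sequence rich in charged and polar residues lands at high net charge and low hydrophobicity, the region the text associates with natively unfolded proteins.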

Many experimental methods can be used to detect disordered regions in proteins. Such regions are not well resolved in X-ray crystal structures (they have high B factors). NMR solution structures would show multiple, differing conformations. CD spectroscopy likewise would show ill-defined secondary structure. In addition, solution measurements of size (light scattering, centrifugation) would show larger size distributions for a given protein.

What types of proteins contain disorder? The above experimental methods and new computational methods have been used to classify proteins as to their degree of disorder. There appear to be more IDPs in eukaryotes than in archaea and prokaryotes. Many IDPs are involved in cell signaling processes (when external molecules signal cells to respond by proliferating, differentiating, dying, etc.). Most appear to reside in the nucleus. The largest percentage of known IDPs bind to other proteins and also to DNA. These results suggest that IDPs are essential to protein function and probably confer significant advantages to eukaryotic cells, as multiple functions can be elicited from the interaction of a single IDP (derived from a single gene) with different protein binding partners. This would greatly extend the effective genome size in humans, for example, from around 25,000 genes with specified function to many more. This doesn't even take into account the increased functionalities derived from post-translational chemical modifications.

We will discuss intrinsically disordered proteins further in Chapter 5. What is clear from recent findings is that protein structure is fluid and complex, and that our simple notions and words to denote proteins as either native or denatured are misguided and constrain our ideas about how protein structure elicits biological function. For example, what does the word "native" mean, if proteins exist in multiple states in vivo and in vitro simultaneously? Dunker et al. (2001) have coined the term "Protein Trinity" to move past the notion that a single protein folds to a single state which elicits a single function. Rather, the three states in the "trinity", the ordered, collapsed (or molten globule), and extended (random coil) states, coexist in the cell. Hence all can be considered "native" and all contribute to the function of the cell. A single IDP could bind to many different protein partners, each producing different final structures and functions. IDPs would also be more accessible and hence susceptible to proteolysis, which would provide a simple mechanism to control their concentrations, an important way to regulate their biological activity. Their propensity for post-translational chemical modification would likewise lead to new types of biological regulation.

Figure: The Protein Trinity: Ordered, Collapsed and Extended States

These ideas have profound ramifications for our understanding of the expression of cellular phenotype. In addition, a whole new world of drug targets becomes available by finding drugs that modulate the transitions between ordered, collapsed and extended protein states. Likewise, side effects of drugs might be understood by investigating drug effects on these transitions in IDPs not initially targeted.

4. Catalysis by Molten Globule: A recent report (Bemporad) that a bacterial acylphosphatase has catalytic activity as a molten globule further challenges our notions of structure and enzyme activity. In this example, substrate interaction did not induce global conformational changes in the protein. Molecular dynamics simulations showed that many partially disordered conformations of the protein are present, and that the disorder involved the active site. However, parts of the protein are more ordered and form a "scaffold" which keeps the catalytic and substrate-binding amino acids near enough that binding could engender conformational rearrangements at the active site and subsequent catalytic activity.


Structural alignment

Structural alignment attempts to establish homology between two or more polymer structures based on their shape and three-dimensional conformation. This process is usually applied to protein tertiary structures but can also be used for large RNA molecules. In contrast to simple structural superposition, where at least some equivalent residues of the two structures are known, structural alignment requires no a priori knowledge of equivalent positions. Structural alignment is a valuable tool for the comparison of proteins with low sequence similarity, where evolutionary relationships between proteins cannot be easily detected by standard sequence alignment techniques. Structural alignment can therefore be used to imply evolutionary relationships between proteins that share very little common sequence. However, caution should be used in using the results as evidence for shared evolutionary ancestry because of the possible confounding effects of convergent evolution by which multiple unrelated amino acid sequences converge on a common tertiary structure.

Structural alignments can compare two sequences or multiple sequences. Because these alignments rely on information about all the query sequences' three-dimensional conformations, the method can only be used on sequences where these structures are known. These are usually found by X-ray crystallography or NMR spectroscopy. It is possible to perform a structural alignment on structures produced by structure prediction methods. Indeed, evaluating such predictions often requires a structural alignment between the model and the true known structure to assess the model's quality. [1] Structural alignments are especially useful in analyzing data from structural genomics and proteomics efforts, and they can be used as comparison points to evaluate alignments produced by purely sequence-based bioinformatics methods. [2] [3] [4]

The outputs of a structural alignment are a superposition of the atomic coordinate sets and a minimal root mean square deviation (RMSD) between the structures. The RMSD of two aligned structures indicates their divergence from one another. Structural alignment can be complicated by the existence of multiple protein domains within one or more of the input structures, because changes in relative orientation of the domains between two structures to be aligned can artificially inflate the RMSD.
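To make the RMSD output concrete, here is a minimal sketch that computes it for two coordinate sets whose atoms are already paired and already superposed. It deliberately skips the hard parts of structural alignment (finding the residue equivalences and the optimal rotation and translation); the function name and the toy coordinates are illustrative assumptions.

```python
import math


def rmsd(coords_a, coords_b):
    """Root mean square deviation between two equally sized lists of (x, y, z).

    Assumes the i-th atom of coords_a is equivalent to the i-th atom of
    coords_b and that the structures have already been superposed.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate sets must be non-empty and of equal length")
    total = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        total += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(total / len(coords_a))


if __name__ == "__main__":
    # Toy coordinates in angstroms (hypothetical, not from a real structure).
    model = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0)]
    target = [(0.0, 0.5, 0.0), (3.8, 0.0, 0.5), (7.6, -0.5, 0.0)]
    print(round(rmsd(model, target), 3))
```

Because every atom pair contributes equally, a rigid shift of one domain relative to another raises the value even when each domain on its own matches well, which is the inflation effect described above.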


Results and Discussion

Fig. 1A (Top and Middle) shows rotary-shadowed electron micrographs of the (I27-PEVK)₃ polyprotein. The micrographs show groups of three globular domains that appear to be tethered by an invisible string. We interpret these results as evidence of three clearly marked I27 modules separated by two PEVK segments. The PEVK region is 186 aa long (www.embl-heidelberg.de/ExternalInfo/Titin/annotation.html). Hence this region has a predicted contour length of ≈70 nm (0.38 nm/aa × 186 aa). A histogram of the separation between the I27 domains shows a wide distribution with values from 9 to 24 nm, with a suggestion of distinct peaks at 11 and 17 nm. The observed end-to-end length of PEVK is significantly smaller than its contour length, suggesting that the PEVK region in the relaxed state is coiled. Moreover, the PEVK region is invisible in the EM micrographs, indicating that PEVK forms a much less compact structure than the folded Ig domain. Bustamante and colleagues (18, 19) investigated the end-to-end distance of DNA molecules adsorbed and equilibrated onto a two-dimensional plane and determined that this distance, R, could be calculated as ⟨R²⟩ = 4pL, where p is the persistence length and L is the contour length of a WLC polymer. Assuming that the EM measurements are a good approximation of the expected value of the end-to-end length distribution, we estimate a broad range of persistence lengths for the PEVK segments (0.2–2.3 nm).
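The arithmetic in this paragraph can be sketched as follows. The code applies the contour-length estimate (0.38 nm per residue × 186 residues) and inverts ⟨R²⟩ = 4pL for p. Treating a single measured distance as if it were the root-mean-square value is the same approximation the text makes; the function names are hypothetical, and the printed values are not expected to reproduce the quoted range exactly.

```python
RESIDUE_LENGTH_NM = 0.38   # contour length per amino acid, from the text
PEVK_RESIDUES = 186        # length of the PEVK region, from the text


def contour_length_nm(n_residues, per_residue_nm=RESIDUE_LENGTH_NM):
    """Fully extended length of an unfolded polypeptide."""
    return n_residues * per_residue_nm


def persistence_length_2d_nm(end_to_end_nm, contour_nm):
    """Invert <R^2> = 4 p L for a chain equilibrated on a 2D surface."""
    return end_to_end_nm ** 2 / (4.0 * contour_nm)


if __name__ == "__main__":
    L = contour_length_nm(PEVK_RESIDUES)
    print(f"contour length: {L:.1f} nm")
    # End-to-end distances quoted in the text: range limits and the two peaks.
    for r in (9.0, 11.0, 17.0, 24.0):
        print(f"R = {r:4.1f} nm -> p = {persistence_length_2d_nm(r, L):.2f} nm")
```

Because p scales with R², the modest spread in measured end-to-end distances translates into a much wider spread in estimated persistence length.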

EM of individual (I27-PEVK)₃ polyproteins. (A) Representative (I27-PEVK)₃ (Top and Middle) and I27₁₂ (Bottom) molecules as seen by rotary-shadowing EM. The polyprotein is visible as three small globular particles (I27 modules), apparently connected by an invisible thread (PEVK). In contrast, rotary-shadowed images of I27₁₂ show solid rods with an average length of 58 nm, predicting a folded length of ≈4.8 nm/module. The bar corresponds to 50 nm. (B) A histogram of the average distance between I27 modules, corresponding to the end-to-end distance of PEVK, shows a broad distribution from 9 to 24 nm, with apparent peaks at 11 and 17 nm.

In contrast to the (I27-PEVK)₃ polyprotein, rotary-shadowed electron micrographs of an I27₁₂ polyprotein show continuous rod-like proteins (Fig. 1A Bottom) with an average length of 58 nm (not shown), giving an average length of 4.8 nm for the folded I27 module, which is slightly larger than the NMR estimate of 4.4 nm (20). The contrast between the appearance of the I27 polyproteins and that of the I27-PEVK polyprotein (Fig. 1A) supports the view that the PEVK region is largely an unfolded and coiled polypeptide. However, Fig. 1B shows a broad range in the end-to-end distance of the relaxed PEVK segment, implying that PEVK may have multiple conformations with different persistence lengths.

The unfolded conformation of the PEVK segments also was indicated by sedimentation. (I27-PEVK)₃ and I27₁₂ sedimented at 2.9 S and 4.2 S, respectively, giving an Smax/S of 2.6 and 2.1. Smax is the S for a compact, unhydrated sphere containing the equivalent mass of protein. Smax/S is equivalent to f/fmin, where f and fmin are the frictional coefficients of the protein and equivalent sphere (21, 22). A value over 2 indicates a highly extended molecule. The Smax/S = 2.1 is consistent with the elongated structure (58 × 2.5 nm) of I27₁₂. The Smax/S = 2.6 of (I27-PEVK)₃ indicates an even greater extension. However, the PEVK molecule is coiled and the overall length of the (I27-PEVK)₃ seen in the EM pictures of Fig. 1 is shorter than that of I27₁₂. Hence, it is likely that the larger Smax/S ratio results from an increased exposure of the protein to the solvent. The sedimentation data are consistent with the PEVK segments being largely unfolded polypeptide chains, creating a substantial hydrodynamic drag.

Fig. 2A shows force-extension curves obtained from the I27₁₂ polyprotein. Because the AFM tip picks the protein at random locations, the number of peaks observed serves as a count of the number of modules contained in the segment that was picked up. The number of modules picked up can vary from one to 12. The folded length of a polyprotein can be determined by fitting the WLC model of polymer elasticity (5, 6) to the initial part of the force-extension curve, before any unfolding is observed (Fig. 2A, L0, solid line). The folded length of the polyprotein depends on the number of modules picked up by the AFM tip (n) and the folded length of a single I27 module (4.4 nm) (20). In close agreement with this prediction, a plot of L0 versus n shows a slope of 4.3 nm/module (Fig. 2B, solid line). Hence, both the electron micrographs and the AFM spectrographs give values that agree closely with those determined from the NMR structure of the I27 module.

Characteristic fingerprint of I27 domain unfolding by a stretching force. (A) Stretching of an I27₁₂ polyprotein produces a force-extension curve showing the characteristic sawtooth pattern of unfolding. The force-extension curves show different numbers of sawteeth, depending on the number of modules picked up by the AFM tip. The red squares represent the modules being picked up in this particular experiment. The solid lines are fits of the data to the WLC model of polymer elasticity. L0 is the contour length of the fully folded polyprotein; upon module unfolding at about 200 pN, an additional 28.1 nm is added to the contour length of the protein. (B) Relationship between L0, determined from the fit to the first sawtooth, and the number of modules picked up by the AFM tip, determined by the number of sawteeth. The solid line is a linear regression of the data with a slope of 4.3 nm/module.

We then used the same techniques to investigate the mechanical nature of the PEVK protein. Fig. 3A shows several force-extension curves for the (I27-PEVK)₃ polyprotein. The characteristic fingerprint of the I27 module is that it unfolds at ≈200 pN, extending the contour length of the protein by 28.1 nm (12). In Fig. 3A we show three types of recordings with one (top two traces), two (three middle traces), and three (bottom two traces) I27 unfolding events, excluding the last peak, which corresponds to the detachment of the molecule from the AFM tip. In contrast to the I27 polyproteins, the force-extension relationships of the I27-PEVK polyproteins are characterized by a long initial length L0, ranging from 50 to 230 nm, which corresponds to the stretching of one, two, or three PEVK segments. The three traces in the middle of Fig. 3A have two unfolding peaks, but they have very different values of L0, e.g. 68 nm for the first one, 133 nm for the second one, and 211 nm for the third one. Because the polyprotein is constructed in an I27/PEVK alternating pattern, if three I27 unfolding peaks are observed, at least two PEVK segments must have been stretched (see Insets in Fig. 3A). If two I27 unfolding peaks are observed, one, two, or three PEVK segments must have been stretched. If only one I27 module is observed to unfold, there can be either one or two PEVK segments that are stretched but never three, in agreement with our observations. Using this approach, we can be sure that the PEVK segments are stretched and that the mechanical properties of PEVK are represented by the initial part of the force-extension relationships, before any of the I27 modules unfold.
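The counting argument above can be checked by brute force. This sketch assumes the construct is ordered I27-PEVK-I27-PEVK-I27-PEVK and that the AFM tip and the surface pick up a contiguous run of whole elements; partially stretched PEVK segments are ignored. Both are simplifying assumptions, and the names are hypothetical.

```python
from collections import defaultdict

# Assumed element order of the (I27-PEVK)3 construct.
CHAIN = ["I27", "PEVK"] * 3


def pevk_counts_by_unfolding_peaks(chain=CHAIN):
    """Map the number of I27 modules in a picked-up run to the possible
    numbers of whole PEVK segments in that same run."""
    counts = defaultdict(set)
    for start in range(len(chain)):
        for end in range(start + 1, len(chain) + 1):
            segment = chain[start:end]      # one contiguous run of elements
            n_i27 = segment.count("I27")
            n_pevk = segment.count("PEVK")
            if n_i27 > 0:                   # runs without I27 leave no fingerprint
                counts[n_i27].add(n_pevk)
    return {n: sorted(values) for n, values in sorted(counts.items())}


if __name__ == "__main__":
    for n_i27, n_pevk in pevk_counts_by_unfolding_peaks().items():
        print(f"{n_i27} I27 peak(s): {n_pevk} PEVK segment(s) possible")
```

Under these assumptions the enumeration agrees with the text: three peaks imply two or three PEVK segments, two peaks allow one to three, and a single peak never pairs with three. The single-peak case also admits a run with no whole PEVK segment, which would not produce the long initial length L0 discussed here.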

Identification and measurement of the elasticity of a PEVK segment. (A) Stretching an (I27-PEVK)₃ polyprotein produces a sawtooth pattern only after a long initial spacer, L0. The sawtooth peaks are typical for I27 domain unfolding because they occur at 200 pN and extend the protein by ≈28.1 nm. Recordings with only one I27 unfolding event show two discrete values of L0 (top two traces). When two or three I27 domains unfold we can measure an even longer value for L0. The discrete values of L0 result from stretching one, two, or three PEVK segments before any I27 module unfolding occurs. The diagrams of the polyprotein accompanying each record show the various combinations of modules picked up by the AFM tip, where red squares and red circles represent I27 modules and PEVK segments being picked up, respectively. (B) Frequency histogram for the initial length L0. The distribution shows three clearly separated peaks (n = 142 recordings). Gaussian fits give distributions that peak at 82, 135, and 190 nm.

Fig. 3B shows a histogram of the initial length L0. The histogram shows three distinct peaks centered at about 82, 135, and 190 nm. The contour length of human cardiac PEVK is ≈70 nm. Thus, this distribution can be explained by assuming that the initial length L0 occurs as approximately integer multiples of about 70 nm. These results strongly suggest that the elastic properties of the initial region of the force-extension curves shown in Fig. 3A are due to the PEVK segments of the (I27-PEVK)₃ polyprotein.

Having determined the contour length of the cardiac PEVK segments, we next measured their persistence length. Stretching a PEVK segment results in a nonlinear monotonic increase in force that can be described by the WLC model of polymer elasticity. The Inset in Fig. 4A shows two examples of force-extension traces where the dotted lines represent nonlinear Levenberg–Marquardt fits to the WLC equation. Within a force resolution of ≈5 pN, the WLC model fits the data well and no stable, mechanically resistive structure can be detected, suggesting that the PEVK behaves as a purely entropic spring that has a random coil structure. However, the two force-extension traces do not superimpose, meaning that they have different persistence lengths. The histogram of measured persistence lengths (Fig. 4) shows a broad distribution (p = 0.3–2.3 nm), in reasonable agreement with the EM data (red line). This result provides direct evidence that cardiac PEVK displays multiple conformations with different mechanical flexibility. The small difference in the distributions obtained from the EM and AFM data could be caused by the approximations used to calculate the persistence length of PEVK from its end-to-end distance (19). However, it is remarkable that they agree so closely, validating the single-molecule techniques demonstrated here.
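As an illustration of how a persistence length can be extracted from a force-extension trace, the sketch below uses the widely used Marko-Siggia interpolation formula for the WLC model and a plain grid search over candidate persistence lengths. Both are stand-ins: the text does not give the exact WLC expression used, and the grid search replaces the Levenberg–Marquardt fit named above. The thermal energy of 4.1 pN·nm (room temperature), the fixed contour length, and all names are assumptions.

```python
KBT_PN_NM = 4.1  # thermal energy near room temperature (assumed), in pN*nm


def wlc_force_pn(extension_nm, persistence_nm, contour_nm):
    """Marko-Siggia interpolation of the worm-like chain force, in pN."""
    rel = extension_nm / contour_nm
    if not 0.0 <= rel < 1.0:
        raise ValueError("extension must satisfy 0 <= x < contour length")
    return (KBT_PN_NM / persistence_nm) * (0.25 / (1.0 - rel) ** 2 - 0.25 + rel)


def fit_persistence_nm(extensions, forces, contour_nm, candidates):
    """Return the candidate persistence length with the smallest squared error.

    A coarse grid search, used here instead of Levenberg-Marquardt to keep
    the sketch dependency-free; the contour length is held fixed.
    """
    def squared_error(p):
        return sum((wlc_force_pn(x, p, contour_nm) - f) ** 2
                   for x, f in zip(extensions, forces))
    return min(candidates, key=squared_error)


if __name__ == "__main__":
    # Synthetic trace generated from the model itself (not experimental data).
    contour = 140.0
    xs = [10.0 * i for i in range(1, 13)]            # 10 ... 120 nm
    fs = [wlc_force_pn(x, 1.0, contour) for x in xs]
    grid = [round(0.1 * i, 1) for i in range(1, 26)]  # 0.1 ... 2.5 nm
    print(fit_persistence_nm(xs, fs, contour, grid))
```

In this model the force at a given relative extension scales as 1/p, so two traces with different persistence lengths cannot be superimposed by rescaling the extension axis alone, which is the behavior the paragraph describes.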

The PEVK segment of cardiac titin shows multiple mechanical conformations. (A) Frequency histogram of the measured persistence length of the PEVK segment (gray bars). The force-extension relationships are accurately described by the WLC model of polymer elasticity (as shown in the Inset). The close agreement between the WLC model and the data (within ≈5 pN) demonstrates that extension of the PEVK segment does not involve the rupture of hydrogen-bonded structures. The PEVK segment shows a wide range of persistence length values, in agreement with the persistence lengths calculated from the end-to-end distributions observed with EM (red line). (Inset) Two representative PEVK recordings with different persistence lengths. In these two traces only the initial length of a stretched (I27-PEVK)₃ polyprotein (before I27 domain unfolding) is shown (solid lines are experimental recordings, open symbols are Levenberg–Marquardt nonlinear fits of the WLC model to the individual recordings). The red recording has a persistence length of 0.40 nm, the black recording has a persistence length of 1.08 nm. For comparison, the red recording (with a contour length of 207 nm) was normalized to have the same contour length as the black trace (140 nm). (B) Frequency histogram of the measured persistence length of the PEVK segments of a single (I27-PEVK)₃ molecule during repeated stretch and relaxation cycles. The persistence length is narrowly distributed around 1.1 nm and is the same for the stretch (gray bars) and the relaxation (red bars) traces, showing that there is no detectable change in persistence length during these cycles. The scatter is due to the error margin of the fits to the data. (Inset) Two consecutive stretch (black) and relaxation (red) recordings. No hysteresis between stretching and relaxation was observed, indicating that this process is fully reversible.

We also probed the dissipative properties of the (I27-PEVK)₃ polyprotein. The Inset in Fig. 4B shows a single stretch-relaxation cycle of the PEVK segments. Under these conditions, the force-extension curve was fully reversible because the forward (in black) and backward (in red) traces could be superimposed, in contrast to the hysteresis observed during Ig domain unfolding (8–10, 12). The persistence length measured from the same molecule during many consecutive stretch-relaxation cycles surprisingly shows a much narrower distribution (from 0.8 to 1.7 nm), in contrast to the wide distribution of persistence lengths measured on different molecules (Fig. 4A). This result indicates that individual PEVK molecules retain their distinctive elastic conformations through many stretch-relaxation cycles, and these distinctive conformations cannot be interconverted by force.

Previous studies of PEVK have provided insights into the mechanical properties of this segment (3, 7, 8, 23–25). However, the ensemble-averaged elasticity in intact myofibrils (3, 7) and uncertainty about the number of molecules in single-molecule experiments (8) may have prevented the detection of PEVK's multiple conformations. The broad range of mechanical conformations of cardiac PEVK observed in our studies demonstrates the utility of a “fingerprint” in single-molecule experiments and provides a general approach to study the mechanical properties and conformations of other proteins.

It is well known that including L-prolyl residues in a polypeptide chain has significant structural effects. Polyproline chains have two distinct secondary structure conformations named polyproline type I and polyproline type II (PPII) helices. Cooperative transitions between these two conformations are caused by proline isomerization, where a change from trans to cis leads to a transition from type II to type I helix (26). Isomerization from trans to cis results in significantly different end-to-end lengths, suggesting different persistence lengths (27, 28). Recent CD and two-dimensional NMR studies have revealed the existence of short PPII helix conformations in a 28-aa-long PEVK module from human fetal titin (29, 30). Because the energy difference between trans- and cis-prolyl residues is only 1–2 kcal/mol (31), if a fraction of PEVK PPII helix changes conformation because of the trans-cis transition, this could readily explain our data. Because the cis conformation is shorter than the trans, one might expect that stretching the PEVK molecule would force all of the cis L-prolyl residues into the longer trans form. However, the activation energy barrier from cis to trans is large, 21 kcal/mol (32), and the gain in length is small (1.0 Å). We estimated the most probable force required for a mechanically driven transition from cis to trans according to (33, 34),

where v is the loading rate (200 pN/s), Δxu is the distance between cis and the transition state, and αo is the isomerization rate from cis to trans in the absence of force. Using a value for αo of 2 × 10⁻³ s⁻¹ (35), we calculate that the force required for the cis to trans transition is too high for the transition to be likely under our experimental conditions. This may explain why we observe a broad range of persistence lengths in different PEVK molecules (different amounts of cis and trans L-prolyl residues) that cannot be interconverted by force.
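For orientation, the dependence of the most probable transition force on loading rate is commonly written in the Bell–Evans form shown here. This is an assumption about the shape of the relation, built from the symbols defined in the text (v, Δxu, αo) together with the thermal energy k_B T, which the text does not define; it is not claimed to be the authors' exact expression.

```latex
% Most probable force F for a force-driven transition (standard Bell-Evans form).
% v          : loading rate
% \Delta x_u : distance from the cis state to the transition state
% \alpha_o   : cis-to-trans isomerization rate at zero force
% k_B T      : thermal energy (symbol introduced here, not in the text)
\[
  F \;=\; \frac{k_B T}{\Delta x_u}\,
          \ln\!\left( \frac{v\,\Delta x_u}{\alpha_o\, k_B T} \right)
\]
```

In this form a small Δxu raises the prefactor k_B T/Δxu, so a transition that gains little length along the pulling direction requires a correspondingly large force, consistent with the argument in the text.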

In addition to the cis and trans forms of the L-prolyl residue, Schimmel and Flory (36) noted that the two allowed conformations of the trans prolyl residue can significantly alter the compactness of a polypeptide chain. For example, a polyalanine chain containing 10% of isolated L-prolyl residues was shown to change its end-to-end length by a factor of ≈1.6 when all of the trans prolyl residues changed between these rotamers. Furthermore, because there is no change in contour length of the polypeptide between these two conformers, they cannot be interconverted by an applied force. Hence, PEVK chains with different proportions of the two conformers of the trans prolyl residue also could explain our results.

What forces during protein assembly could cause such a variation in proline isomers from one molecule to another, resulting in PEVK molecules of very different flexibility? Our engineered proteins were made in bacteria and are typically used several days after purification. One might expect that the isomerization tendency will be the same for each proline and therefore average out to about the same value in each molecule. However, one must look at the protein as a whole. If a PPII helix begins to form, it is likely that it will favor a trans-proline rotamer, in a cooperative folding reaction. Hence, different PEVK molecules could acquire stable, but complex, combinations of PPII and coiled conformations.

We propose that cardiac PEVK behaves as an entropic spring that has a coiled structure overall, with PPII helix and unordered structure coexisting inside the coil. Because the PPII helical structures are predominantly dictated by steric effects, the elasticity of PEVK remains entropic. The conformation and mechanical properties of PEVK depend on the available conformers of the L-prolyl residue in the PEVK sequence. Because these conformers can be switched enzymatically, it is possible that the elasticity of PEVK, in vivo, can be regulated by yet unknown signaling mechanisms.


The reference genome is not a baseline

The current reference genome is a type specimen

Although the reference genome is meant to be a standard, what that means in a practical sense is not clearly defined. For example, the allelic diversity within the reference genome is not an average of the global population (or any population), but rather contains long stretches that are highly specific to one individual. Of the 20 donors the reference was meant to sample from, 70% of the sequence was obtained from a single sample, ‘RPC-11’, from an individual who had a high risk for diabetes [27]. The remaining 30% is split 23% from 10 samples and 7% from over 50 sources [28]. After the sequencing of the first personal genomes in 2007 [29, 30], the emerging differences between genomes suggested that the reference could not easily serve as a universal or ‘gold-standard’ genome (see Box 1 for definitions). This observation is easily extended to other populations [31,32,33,34], where higher diversity can be observed. The HapMap project [35, 36] and the subsequent 1000 Genomes Project [37] were a partial consequence of the need to sample broader population variability [38]. Although the first major efforts to improve the reference focused on the need to fill in the gaps, work is now shifting towards incorporating diversity, through the addition of alternative loci scaffolds and haplotype sequences [39]. But just how similar to a personal genome is the current reference? We performed a short series of analyses to answer this question (Fig. 1), using the 1000 Genomes Project samples. Looking first at the allele frequencies (AF) of known variants, we found that around two million reference alleles have population frequencies of less than 0.5, indicating that they are the minor allele (dark blue line in Fig. 1a). This might seem high for a reference. In fact, the allelic distribution of the current reference is almost identical to the allelic distributions of personal genomes sampled from the 1000 Genomes Project (light blue lines in Fig. 1a). 
In practice, the current reference can be considered a well-defined (and well-assembled) haploid personal genome. As such, it is a good type specimen, exemplifying the properties of the individual genomes. This means, however, that the reference genome does not represent a default genome any more than any other arbitrarily chosen personal genome would.

Fig. 1. The reference genome is a type specimen. (a) Cumulative distributions of variants in the reference genome and those in personal/individual genomes. If we collapse the diploid whole genomes genotyped in the 1000 Genomes Project into haploid genomes, we can observe just how similar the reference is to an individual genome. First, taking population allele frequencies from a random sample of 100 individual genomes, we generated new haploid ‘reference’ sequences. We replaced the alleles of the reference genome with the personal homozygous variant, and a randomly chosen heterozygous allele. For simplicity, all calculations were performed against the autosomal chromosomes of the GRCh37 assembly and include only single nucleotide bi-allelic variants (i.e., only two alleles per single nucleotide polymorphism (SNP)). (b) Cumulative distributions of allele frequencies for variants called in 100 randomly chosen personal genomes, computed against the reference genome. Here, the presence of a variant with respect to the reference is quite likely to mean that the reference itself has the ‘variant’ with respect to any default expectation, particularly if the variant is homozygous.
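
The haploid collapse described in the caption can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' code: the dictionary-based data layout, the function name `collapse_to_haploid`, and the use of Python's `random.Random` for picking a heterozygous allele are all hypothetical choices.

```python
import random
from typing import Dict, Optional, Tuple

Genotype = Tuple[str, str]  # two alleles at one bi-allelic SNP


def collapse_to_haploid(
    reference: Dict[int, str],
    genotypes: Dict[int, Genotype],
    rng: Optional[random.Random] = None,
) -> Dict[int, str]:
    """Build a haploid 'personal reference' from a diploid genotype set.

    Assumptions: `reference` maps position -> reference allele, and
    `genotypes` maps position -> the individual's two alleles at sites
    where a variant was called. Homozygous sites take the shared allele;
    heterozygous sites take one of the two alleles at random.
    """
    rng = rng or random.Random()
    haploid = dict(reference)  # start from the reference alleles
    for pos, (allele_1, allele_2) in genotypes.items():
        if allele_1 == allele_2:
            haploid[pos] = allele_1
        else:
            haploid[pos] = rng.choice((allele_1, allele_2))
    return haploid


# Toy example: three sites, one homozygous variant, one heterozygous site.
toy_reference = {1: "A", 2: "C", 3: "G"}
toy_genotypes = {1: ("T", "T"), 2: ("C", "G")}
toy_haploid = collapse_to_haploid(toy_reference, toy_genotypes, random.Random(0))
```

Sites without a called variant keep the reference allele, which is why position 3 in the toy example is unchanged.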

Reference bias

Because the reference genome is close to being a type specimen, it can distort results where its sequence is not very typical. In alignment, reference bias refers to the tendency for some reads or sequences to map more readily to the reference alleles, whereas reads with non-reference alleles may fail to map or may map at lower rates. In RNA-seq-based alignment and quantification, reference bias has a major impact when differential mapping matters (such as in allele-specific expression), but can be overcome by the use of personal genomes or through the filtering of biased sites [40,41,42]. In variant calling, reference bias can be more important. Alignment to the reference to infer variation related to disease is still a step in most analyses, and is crucial in clinical assignments of variant significance and interpretation [43, 44]. In these cases, reference bias will induce a particular error. Variant callers might call more ‘variants’ when the reference alleles are rare or could fail to call variants that are rare but also shared by the reference [45,46,47,48]. Owing to the presence of rare alleles in the reference genome, some known pathogenic variants are easily ignored as benign [25]. A variant called with respect to the reference genome will be biased, reflecting the properties of the reference genome rather than properties that are broadly shared in the population. Indeed, continuing with our analysis (Fig. 1b), if we compare the variant calls within personal genomes against the reference, we find that close to two-thirds of the homozygous variants (blue lines) and one-third of the heterozygous variants (green lines) actually have allele frequencies above 0.5. Variation with respect to the reference is quite likely to indicate the presence of a ‘variant’ in the reference genome with respect to any default expectation, particularly if that ‘variant’ is homozygous.


Conclusions

It is clear that there is much to learn about the nature of protein structure dynamics that is not addressed in the static information contained in the PDB. The intermediate structures representing a protein as it moves from one conformation to another may yield much information about how a protein functions. Experimental techniques are inadequate for this task due to practical and technological limitations. For this reason, structural biology is in great need of algorithms that can accurately predict the intermediate structures as a protein undergoes a conformational change.

While other morphing algorithms require computationally expensive energy and elastic network modeling calculations, our morphing algorithm is based on a few simple observations of protein structure, and therefore produces multiple intermediate conformations very quickly. Our intermediate structures represent possible protein structures, and demonstrate the motion of a protein as it changes between conformations. In the case of morphing between homologs, the intermediate structures give us clues to how protein structures have evolved.

The morphed structures also show promise in the area of virtual screening. Most techniques limit protein flexibility to the side chain atoms, and may allow limited flexibility of the substrate. Our morph produces intermediate structures which are hypotheses for possible backbone movements. For this reason, some ligands bound more favorably to our intermediate structures than to the solved structures. These results strongly suggest the potential of morphs in guiding drug development.

Like all other approaches, our algorithm has limitations. Linear interpolation, with only small corrections, prevents our method from correctly producing a morph for proteins with very large or complex movements. Many of these morphs could be solved by allowing a larger movement from the first approximation (a larger lattice), or by allowing higher granularity of possible Cα positions (more points in each lattice), but the time cost would be significant. Clearly, in protein morphing there is a trade-off between speed and accuracy.
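
The linear interpolation that serves as the first approximation can be sketched as below. This is only a minimal illustration: it assumes the two conformations are given as equal-length lists of Cα coordinates that are already superimposed and matched residue by residue, and it does not reproduce the lattice-based correction step. The function name `linear_morph` is hypothetical.

```python
from typing import List, Tuple

Coord = Tuple[float, float, float]  # x, y, z of one C-alpha atom


def linear_morph(
    start: List[Coord], end: List[Coord], n_intermediates: int
) -> List[List[Coord]]:
    """Return evenly spaced intermediate C-alpha traces between two structures.

    Assumption: start[i] and end[i] are the same residue in the two
    conformations. Each intermediate is start + t * (end - start) for
    t strictly between 0 and 1; no geometric correction is applied.
    """
    if len(start) != len(end):
        raise ValueError("structures must have the same number of residues")
    frames: List[List[Coord]] = []
    for k in range(1, n_intermediates + 1):
        t = k / (n_intermediates + 1)
        frame = [
            (p[0] + t * (q[0] - p[0]),
             p[1] + t * (q[1] - p[1]),
             p[2] + t * (q[2] - p[2]))
            for p, q in zip(start, end)
        ]
        frames.append(frame)
    return frames


# Toy example: one atom moving 4 units along x, with three intermediates.
toy_frames = linear_morph([(0.0, 0.0, 0.0)], [(4.0, 0.0, 0.0)], 3)
```

Pure interpolation of this kind can distort bond lengths when the movement is large or rotational, which is consistent with the limitation noted above for very large or complex movements.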


Just released: Addgene's Plasmids 101 eBook

And now for some exciting news. You can now download Addgene's new eBook! Our goal was to create a one-stop reference guide for plasmids. We've combined our Plasmids 101 blog posts and additional resources into one downloadable PDF that you can save on your desktop. The Plasmids 101 eBook is designed to educate all levels of scientists and plasmid lovers and serves as an introduction to plasmids, allowing you to spend less time researching plasmid basic features and spend more time developing the clever experiments and innovative solutions necessary for advancing the field.

1. Internal initiation of translation of eukaryotic mRNA directed by a sequence derived from poliovirus RNA. Pelletier et al (Nature. 1988 Jul 28334(6180):320-5.) PubMed.

2. A segment of the 5' nontranslated region of encephalomyocarditis virus RNA directs internal entry of ribosomes during in vitro translation. Jang et al. (J Virol. 1988 Aug62(8):2636-43.) PubMed.

3. Highly Efficient Multicistronic Lentiviral Vectorswith Peptide 2A Sequences. Ibrahimi et al. ( Hum Gene Ther. 2009 Aug20(8):845-60. doi: 10.1089/hum.2008.188. ) PubMed .

4. High cleavage efficiency of a 2A peptide derived from porcine teschovirus-1 in human cell lines, zebrafish and mice. Kim et al ( PLoS One. 20116(4):e18556. doi: 10.1371/journal.pone.0018556. Epub 2011 Apr 29. ) PubMed .

5. Scalable signaling mediated by T cell antigen receptor-CD3 ITAMs ensures effective negative selection and prevents autoimmunity. Holst et al (Nat Immunol. 2008 Jun9(6):658-66. doi: 10.1038/ni.1611. Epub 2008 May 11.) PubMed .

Read more Plasmids 101 posts or browse Addgene's Plasmid Guide for more molecular biology and cloning information.