Which Pair Of Animals Is Probably Most Closely Related Evolutionarily

2.1. Similarity, Homology, Divergence and Convergence

2.1.1. The critical definitions

In times past, gathering information on a potential partner in marriage or business routinely started with the simplest question: "What family does he or she come from?" Affiliation with a certain family immediately provided a starting point for further inquiries, a general idea of what might be expected from a certain individual. Of course, families are never uniform, and classic literature from Homer to Shakespeare to Tolstoy provides ample illustrations that any expectation based solely on family history should be taken with a grain of salt. Nevertheless, in the absence of other clues to the character of the subject in question, an educated guess could be made based on the family structure and the individual's position within that structure.

Essentially the same approach is used in predicting potential functions for a newly sequenced gene and its protein product. Since it is technically impossible to experimentally test the activity of the product of every single open reading frame in every organism, understanding their cellular roles routinely relies on family history.

So how can one determine what family a given protein belongs to? Sequence analysis aims at finding important sequence similarities that would allow one to infer homology. The latter term is extensively used in scientific literature, often without a clear understanding of its meaning, which is simply common origin. Since the mid-19th century, zoologists and botanists have learned to make a distinction between homologous organs (e.g. a bat's wing and a human's hand) and similar (analogous) organs (e.g. a bat's wing and a butterfly's wing). Homologous organs are not necessarily similar (at least the similarity may not be obvious); similar organs are not necessarily homologous. For some reason, this simple concept tends to get extremely muddled when applied to protein and DNA sequences [695]. Phrases like "sequence (structural) homology", "high homology", "significant homology", or even "35% homology" are as common, even in top scientific journals, as they are absurd, considering the above definition. "Sequence homology" is particularly pervasive, having found its way even into the NLM's Medical Subject Heading (MeSH) system. It has been assigned as a keyword to more than 80,000 papers in MEDLINE, including, to the embarrassment of the authors, most of their own. In all of the above cases, the term "homology" is used basically as a glorified substitute for "sequence (or structural) similarity".

All this misuse of "homology" could, in principle, be dismissed as an inconsequential semantic problem. One could even suggest that, since in the molecular biology literature "homology" has often been used to designate quantifiable similarity between sequences (or, less often, structures), the term should be redefined, legitimizing this usage. We believe, however, that the notion of homology is of major fundamental and practical importance and that, on this occasion, semantics matters. In our opinion, misuse of the term "homology" has the potential of washing out the meaning of the very concept of common evolutionary descent [695].

A conclusion that two (or more) genes or proteins are homologous is a conjecture, not an experimental fact. We would be able to know for a fact that genes are homologous only if we could directly explore their common ancestor and all intermediate forms. Since there is no fossil record of these extinct forms, a decision on homology between genes has to be made on the basis of the similarity between them, the only observable variable that can be expressed numerically and correlated with probability. The higher the similarity between two sequences, the lower the probability that they originated independently of each other and became similar just by chance (see 4.2). Indeed, if we take two sequences of 100 amino acid residues each that have, say, 80% identical residues, we can calculate the probability of this occurring by chance, find that it is so low that such an event is extremely unlikely to have happened in the last 5 billion years, and conclude that the sequences in question must be homologous (share a common ancestry). Even for proteins that share a much lesser degree of identity, alignment of counterparts from all walks of life is often straightforward, and there seems to be no reasonable doubt of homology. For example, although sequences of the ribosomal protein L36 from different species (Figure 2.1) exhibit considerable diversity and only a single amino acid residue is conserved in all the sequences, they align unequivocally and are indisputable homologs.
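The chance calculation mentioned above can be illustrated with a deliberately crude model. The sketch below assumes a fixed, ungapped alignment, independent positions, and 20 equally frequent amino acids (so a match probability of 1/20 per position); none of these assumptions comes from the text, and the statistics actually used for database searches are those referred to in 4.2. The function name and the default match probability are illustrative choices.

```python
from math import comb, log10


def log10_tail_probability(length: int, matches: int, p_match: float = 0.05) -> float:
    """Return log10 of P(at least `matches` identities in `length` positions).

    Simplified binomial model (an assumption of this sketch): positions are
    independent and each one matches by chance with probability `p_match`.
    """
    tail = sum(
        comb(length, k) * p_match**k * (1.0 - p_match) ** (length - k)
        for k in range(matches, length + 1)
    )
    return log10(tail)


if __name__ == "__main__":
    # Two 100-residue sequences with 80 identical positions.
    print(f"log10(P) = {log10_tail_probability(100, 80):.1f}")
```

Under these assumptions the probability comes out on the order of 10^-84, which is why chance can be dismissed for sequences this similar. The model says nothing about low-similarity pairs, where the choice of alignment and the composition of the sequences matter far more.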

Figure 2.1

Multiple alignment of the ribosomal protein L36 sequences. Conserved amino acid residues are shown in bold and/or yellow. The following proteins are listed: A. aeolicus, aq_075; B. subtilis, RpmJ; C. jejuni, Cj1591; C. trachomatis, CT786; E. coli, RpmJ.

A real problem arises only when the similarity between two given sequences is much lower, so that it is not immediately clear how to align them properly and how to calculate their degree of similarity. Even when one comes up with a figure (say, two protein sequences have 10% identical residues and an additional 8% similar amino acid residues, a total of 18% similarity), does this imply homology or not? The only reasonable answer is: it depends. This and lower levels of similarity might be indicative of homology provided that one or more of the following applies: (i) the similarity extends over a long stretch of sequence and is statistically significant by criteria known to be reliable (such as those applied in the BLAST algorithm and its derivatives); (ii) although the sequence similarity is low, the same pattern of identical and similar amino acid residues is seen in multiple sequences; or (iii) the pattern of sequence similarity reflects the similarity between experimentally determined structures of the corresponding proteins or at least corresponds to the known key elements of one such structure.
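To make the identity and similarity figures concrete, here is a minimal sketch that computes both from an existing pairwise alignment. The groups of "similar" residues are a hypothetical, simplified classification chosen for illustration; in practice similarity is usually defined through a substitution matrix. The denominator (columns without gaps) is also only one of several conventions in use.

```python
# Hypothetical groups of "similar" amino acids, for illustration only.
SIMILAR_GROUPS = ["ILVM", "FYW", "KRH", "DE", "ST", "NQ", "AG"]


def identity_and_similarity(aligned_a: str, aligned_b: str, gap: str = "-"):
    """Return (% identity, % similarity) over the ungapped alignment columns.

    Similarity counts identical residues plus residues in the same group,
    mirroring the "10% identical + 8% similar = 18% similarity" bookkeeping.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    group_of = {res: i for i, grp in enumerate(SIMILAR_GROUPS) for res in grp}
    identical = similar = compared = 0
    for a, b in zip(aligned_a.upper(), aligned_b.upper()):
        if a == gap or b == gap:
            continue  # columns with a gap are not counted
        compared += 1
        if a == b:
            identical += 1
        elif a in group_of and group_of.get(a) == group_of.get(b):
            similar += 1
    if compared == 0:
        return 0.0, 0.0
    return 100.0 * identical / compared, 100.0 * (identical + similar) / compared


if __name__ == "__main__":
    print(identity_and_similarity("MKV-LDE", "MRVALEQ"))
```

The numbers alone do not settle the question of homology; as the criteria above indicate, they have to be weighed together with statistical significance, consistency across multiple sequences, and structural evidence.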

In the rest of this chapter and in the subsequent chapters as well, we will have multiple opportunities to examine each type of evidence. Right here and now, however, it is pertinent to ponder the question: why is sequence and structural similarity considered to be evidence of homology (common origin) in the first place? Once we are confident that a particular similarity is not spurious but rather, according to the above criteria, represents a certain biological reality, is common ancestry the only explanation? The answer is: no, a logically consistent alternative does exist and involves convergence from unrelated sequences.

The functional convergence hypothesis would posit that sequence and structural similarities between proteins are observed because the shared features are strictly required for these proteins to perform their identical or similar functions. Functional convergence per se is an undeniable reality. In the broadest sense, convergence is observed, for example, between all proteins that contain disulfide bonds stabilizing their structure or between all enzymes that have the same catalytic residues (e.g. a constellation of histidines and aspartates). Even more prominent motifs associated with catalytic residues are found within different structural contexts and, in all likelihood, have evolved convergently [722,724]. In the case of disulfide-bonded domains, convergence can even fool sequence comparison programs, translating into statistically significant (albeit not overwhelming) sequence similarity. A rather dramatic manifestation of convergence is the recent description of a "homologous" disulfide-bonded domain in Wnt proteins and phospholipase A2 [699], which was later recognized as "mistaken identity" on the grounds of structural implausibility [77]. The classic work of Alan Wilson and colleagues comparing lysozymes from ruminants, langur monkeys, and leaf-eating birds is a textbook case that reveals the nature and extent of convergence in enzymes [471,806,816]. These studies have shown beyond doubt that several amino acid residues required for function in the stomach have evolved independently (convergently) in different lineages of lysozymes. Importantly, however, this set of convergent positions consists of only seven amino acid residues, a small subset of the residues that comprise the lysozyme molecule.

A pan-adaptationist view of evolution would hold that functional convergence is the sole (or at least the principal) factor responsible for similarity between proteins. Formally disproving this paradigm might not be possible, but there seem to be at least two compelling arguments against it. The first one stems from the notion of a continuous gradient of similarity between proteins. The convergence explanation is implausible for closely related sequences, such as those of the same proteins (or, more precisely, orthologs; see below) from different mammalian species, which are normally 70–80% identical. For such sequences, the convergence hypothesis is equivalent to the statement that most, if not all, amino acid residues in a protein are fixed through positive selection. This runs against the neutral theory of molecular evolution, which has shown that, given the known parameters of animal populations, positive selection could not be responsible for the bulk of amino acid substitutions, which are therefore effectively neutral [440]. Convergence could only be a realistic possibility for deep relationships between proteins, which involve limited similarities; indeed, the neutral theory does not preclude positive selection acting, say, on 10% of the positions in a protein. Then, the observed spectrum of similarities between proteins would have two distinct explanations: (i) divergence from common ancestors for tight families with high levels of sequence similarity, and (ii) convergence from independent ancestors for larger groups of related proteins (superfamilies), in which only limited similarity is observed. While not theoretically impossible, such an opposition of two vastly different modes of evolution, with a mysterious bottleneck separating the two phases, appears extremely unlikely. This view of evolution is clearly inferior to the alternative, whereby all significant similarities observed within a class of proteins are interpreted within a single theoretical framework of divergence from an ultimate common ancestor.

The second, probably most convincing, argument against convergence as the principal explanation for the observed similarities between proteins has to do with the nature of structural constraints associated with a particular function. A fundamental observation is that a single function, such as catalysis of a specific enzymatic reaction, is often performed by two or more proteins that have unrelated structures [187,271]. In 2.2.5, we discuss this phenomenon in some detail and present several specific examples. These observations indicate that the same function does not necessarily require significantly similar structures, which means that, as a rule, there is no basis for convergent evolution of extensive sequence and structural similarity between proteins. This is not to say that unrelated enzymes that catalyze the same reaction bear no structural resemblance whatsoever. Indeed, subtle similarities in the spatial configuration of amino acid residues in the active centers are likely to exist, and these are precisely the kind of similarity that is expected to emerge due to functional convergence. These similarities, however, do not translate into structural and sequence similarity detectable by existing methods for comparison of proteins (at least in the overwhelming majority of cases). By inference, we are justified in concluding that whenever statistically significant sequence or structural similarity between proteins or protein domains is observed, this is an indication of their divergent evolution from a common ancestor or, in other words, evidence of homology. We will revisit the issue of convergence versus divergence when discussing the deepest structural connections between proteins.

Now that we have established the connection between similarity and homology, it should be emphasized that demonstration of homology is key to the interpretation of similarities between proteins. The feasibility of this conclusion, which sometimes is reached on the basis of limited similarity, is what makes sequence and structure comparison the major staples of computational biology and inspires the development of increasingly sensitive methods for such comparisons. Indeed, under the notion of homology, a sequence or structural alignment becomes a powerful tool for evolutionary and functional inferences.

Once sequences are correctly aligned, homology implies that the corresponding residues in homologous proteins are also homologous, i.e. derived from the same ancestral residue and, typically, inheriting its function. If the residue in question is the same in a set of homologous sequences, we say that it is (evolutionarily) conserved. Thus, homology lends legitimacy to the transfer of functional information from experimentally characterized proteins (or nucleic acids) to uncharacterized homologs, the single most common and practically important application of computational methods in molecular biology. Conversely, an alignment of non-homologous sequences is inherently meaningless and potentially misleading. Even if such an alignment attains a relatively high percentage of identity or similarity, no conclusions at all can be drawn from the (spurious, in this case) correspondence between aligned residues. This is why phrases like "significant homology" or "percent homology" are so ludicrous. Homology is a qualitative notion of common ancestry. As long as homology is established, 10% identical residues between two protein sequences could be highly meaningful and amenable to functional interpretation. In contrast, even 30% identity between two sequences that are not in fact homologous could be totally misleading.

2.1.2. Conservation of protein sequence and structure in evolution

Protein structure is conserved during evolution much better than protein sequence. There are numerous examples of proteins that show little sequence similarity but still adopt similar structures, contain identical or related amino acid residues in their active sites, and have similar catalytic mechanisms. These shared features support the notion that, despite low sequence similarity, such proteins are homologous.

Consider, for example, the structure of lysozyme, the enzyme that hydrolyzes bacterial cell walls (formal name: 1,4-beta-N-acetylmuramidase, EC 3.2.1.17). Different lysozymes are found in many organisms, from bacteriophages to mammals, and in general, they show little sequence similarity to each other. PDB, the database of protein structures (see 3.3), includes the lysozyme from goose (PDB code 153L), which consists of 185 amino acid residues (Figure 2.2). The sequence neighbors of this protein in the protein database (see 3.1.2) are lysozymes from black swan (same length, 96% identity), ostrich (same length, 83% identity), chicken (same length, 80% identity), as well as unannotated proteins from human (44% identity), mouse (43% identity), and B. subtilis bacteriophage SPBc2 (25% identity in a 176-aa overlap). The vertebrate proteins in this list, including the uncharacterized ones, are obvious homologs of the goose lysozyme. The phage protein is more different and, in this case, the issue of homology is worth some investigation. However, the sequence similarity between lysozymes and this phage protein is statistically significant (as can be shown, for example, using PSI-BLAST, see 4.3.3), and their multiple alignment shows a consistent pattern of shared residues, thus establishing homology (Figure 2.2).

Figure 2.2

Multiple sequence alignment of goose lysozyme and its closest homologs. Absolutely conserved amino acid residues are shown in bold; conserved hydrophobic residues are yellow.

In contrast, the list of closest structural neighbors of goose lysozyme, according to the MMDB database (http://www.ncbi.nlm.nih.gov/Structure, see 3.3), includes the classic chicken egg white lysozyme (e.g. PDB code 3LZT, 11% identity) and lysozymes from E. coli bacteriophages λ (PDB code 1AM7, 13% identity) and T4 (PDB code 149L, 11% identity). Nevertheless, a superposition of the three-dimensional structures of these three proteins clearly reveals the conserved structural core and many shared features (Figure 2.3).

Figure 2.3

Structural alignment of goose lysozyme (PDB code 153L), chicken egg white lysozyme (3LZT), and lysozymes from E. coli bacteriophages λ (1AM7) and T4 (1L92). Structures of the four different types of lysozyme were aligned using VAST (http://www.ncbi.nlm.nih.gov/Structure/VAST/vast.shtml).

A different method of structural comparison, DALI, used in the FSSP database (see 3.3), also identifies them as the nearest structural neighbors. Importantly, structural and sequence comparisons are a two-way street: the structural alignment shown in Figure 2.3 can be transformed into a multiple sequence alignment (Figure 2.4) in which conserved positions, including the catalytic glutamate, can be readily identified [217].

Figure 2.4

Structure-based sequence alignment of goose lysozyme (153L), chicken egg white lysozyme (3LZT), and lysozymes from E. coli bacteriophages λ (1AM7) and T4 (1L92). The multiple alignment, generated by the DALI program [354], was extracted from the FSSP database.

This straightforward analysis leads us to conclude that all lysozymes are homologous, which, in this case, is easy to accept given their similar, if not identical, functions. Furthermore, this analysis can be extended to a broad group of other transglycosylases, which all turn out to share a conserved catalytic domain with lysozyme and comprise a superfamily of homologous proteins [594,863].

Does structural similarity always reflect homology? For reasons discussed in the previous section, structural similarity that spans at least one complete domain most likely does. It is this type of similarity that is sought by structure comparison methods, such as VAST and DALI (see 3.3). Thus, the general rule of structure-homology correspondence seems to be straightforward: protein domains that have the same fold according to structure classification systems, such as SCOP or CATH, are homologs.

In principle, however, it is hard to rule out that some common folds are so advantageous thermodynamically that they have evolved several times independently (convergently). This possibility has been considered, for example, for the triose phosphate isomerase (TIM) barrel fold, given its high stability and symmetrical, quasi-periodical organization [157].

How far does the notion of divergent evolution go? The overreaching idea that all proteins evolved from a single primordial protein does not seem plausible. Indeed, there is no reason to believe that proteins of different structural classes, e.g. all-α (consisting exclusively of α-helices) and all-β (consisting exclusively of β-strands), have a common origin. However, certain topological changes in protein folds seem to occur during evolution [317], and the possibility of primordial common ancestry might become realistic if different folds within the same structural class are considered.

Interestingly, apparent relationships between certain proteins that, according to SCOP, have different folds are detectable even through PSI-BLAST searches. For instance, statistically significant similarities between NAD-dependent oxidoreductases and S-adenosylmethionine-dependent methyltransferases are regularly detected in iterative database searches, and the alignments produced are usually consistent with structural superpositions (N.V. Grishin and E.V.K., unpublished). Consequently, there is little doubt that these proteins, which formally have distinct folds, do share a common ancestry. At least in principle, such comparisons could be extended to all the numerous proteins whose structural core consists of parallel β-sheets, leading to the more or less radical proposal that they have all evolved from the same primordial "Rossmann-type" domain, which perhaps possessed nucleotide-binding properties [37]. The notion of divergence can be similarly extended to unite other types of structurally similar domains (e.g. different all-α-helical folds) into broad monophyletic classes. We find such generalizations attractive and plausible, but caution is due, and further elaboration of the methods for structure comparison, perhaps combined with theoretical analysis of evolutionary models, is required before more certainty is achieved on these potential distant evolutionary relationships. We will return to the discussion of the possible nature of primordial proteins when considering the early stages of biological evolution from a comparative-genomic perspective (see 6.4).

Coming back to earth, it is important to note that approximately the same level of sequence similarity that is seen between distantly related proteins whose homology is established via a combination of iterative sequence searches and structural comparisons (roughly, 8–15% identity with gaps) can be expected to exist between two randomly chosen protein sequences. We already listed above some criteria that allow one to distinguish between true evidence of homology and spurious similarities. More generally, it cannot be overemphasized that, when this level of similarity between proteins is involved, there is no substitute (at least as of this writing) for a careful analysis of each particular relationship. Such an analysis usually pays off, allowing one to avoid false "fundamental discoveries" and sometimes opening up new avenues of investigation.

2.1.3. Homologs: orthologs and paralogs

As discussed above, one of the main objectives of DNA and protein sequence analysis is to identify homologous sequences and to use sequence and structure conservation to predict common biochemical activities and biological functions of proteins and non-coding sequences. The second major goal of sequence analysis is evolutionary reconstruction per se. To address each of these goals, it is critical to distinguish between two principal types of homologous relationships, which differ in their evolutionary history and functional implications. The two categories of homologs are orthologs, defined as evolutionary counterparts derived from a single ancestral gene in the last common ancestor of the given two species, and paralogs, which are homologous genes evolved through duplication within the same (perhaps ancestral) genome. These definitions were first introduced by Walter Fitch in 1970 [228,229] and remained virtually unknown to molecular biologists until the advent of genomics, at which time it became clear that the distinction between the two types of homologs was crucial for understanding evolutionary relationships between genomes and gene functions. In evolutionary terms, robust identification of orthologs is essential because otherwise any evolutionary scenarios, for instance, attempts to reconstruct the gene repertoire and gene order in ancestral genomes (see discussion below), are bound to be meaningless. With respect to functional analysis, orthologs typically retain the same, ancestral function, which makes transfer of functional information within a set of orthologs generally reliable. The evolutionary basis of such conservation of function among orthologs appears fairly obvious. Indeed, consider a gene (or, rather, its product) in an ancestral species that was responsible for carrying out some essential biological function. As long as the progeny of this ancestor carries a single copy of the gene in question and does not evolve or acquire an unrelated gene capable of providing the same function, it has to rely on the original gene to continue carrying out that function. This puts orthologs under strict evolutionary constraints and makes them perform the same function as long as this function remains essential for survival or at least confers a substantial selective advantage on its bearers.

In contrast, paralogs tend to evolve new functions, and the study of paralogous families may provide a means for understanding adaptation. As first detailed by Susumu Ohno in his classic 1970 book Evolution by Gene Duplication [627], once paralogs emerge as a result of a gene duplication, the pressure of purifying selection decreases for either one (in Ohno's original model) or, under new, more elaborate models [448,534,877], both paralogs, which eventually enables the development of new functions. In each sequenced genome, a substantial fraction (from 25 to 80% [374,408,484,506]) of genes belongs to families of paralogs, each of which reflects functional diversification via duplications that occurred at different stages of evolution. Classic examples include animal olfactory receptors or nuclear hormone receptors, vast families in which an amazing repertoire of specificities evolved as the result of multiple duplications.

The interplay of speciation events, leading to the divergence of orthologs, and duplications, giving rise to paralogous families, results in complex evolutionary scenarios, which may be difficult to resolve (Figure 2.5). When duplication precedes speciation, each of the paralogs gives rise to a distinct line of orthologous descent. Conversely, when duplication occurs after a particular speciation event in one lineage or in both lineages independently (this can be referred to as a lineage-specific duplication or lineage-specific expansion of a paralogous family), a situation ensues whereby a one-to-one orthologous relationship cannot be delineated in principle (Figure 2.5). Instead, all one can say is that the family AB in lineage 1 is orthologous to family A'B'C' in lineage 2 or, in other words, that A and B are co-orthologs (a new term recently introduced to describe such relationships more accurately [700]) of A', B', and C' (Figure 2.5). Clearly, in such a case, the functional correspondence between the two orthologous families of paralogs is less straightforward than it is between regular, one-to-one orthologs. The relationships between homologs can become particularly tricky if some genes in certain lineages have been lost during evolution (a phenomenon referred to as lineage-specific gene loss, see 2.2.3). In such cases, genes that, at face value, appear to be orthologous may actually be paralogs, whereas the genuine orthologs might have been lost. Once again, functional inferences made on the basis of this type of homologous relationship require particular caution.

Figure 2.5

Orthologous and paralogous genes in three lineages descending from a common ancestor. Gene sets I, II, and III should be considered co-orthologous.
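The definitions of orthologs and paralogs can be restated operationally: in a gene tree whose internal nodes are labeled as speciation or duplication events, two genes are orthologs if their last common ancestor node is a speciation and paralogs if it is a duplication. The sketch below assumes such a labeled tree is already available (obtaining it is the hard part) and uses a hypothetical nested-tuple representation; the gene names and both example trees are invented for illustration and are not the trees of Figure 2.5.

```python
# A tree is either a leaf (gene name) or a tuple (event, left, right),
# where event is "speciation" or "duplication". Representation is hypothetical.

def leaves(node):
    """Return the set of gene names under a node."""
    if isinstance(node, str):
        return {node}
    _, left, right = node
    return leaves(left) | leaves(right)


def relationship(tree, gene_x, gene_y):
    """Classify two distinct genes by the event at their last common ancestor."""
    if isinstance(tree, str):
        raise ValueError("two distinct genes from the tree are required")
    event, left, right = tree
    for child in (left, right):
        under_child = leaves(child)
        if gene_x in under_child and gene_y in under_child:
            return relationship(child, gene_x, gene_y)  # LCA lies deeper
    all_genes = leaves(tree)
    if gene_x not in all_genes or gene_y not in all_genes:
        raise ValueError("gene not found in tree")
    return "orthologs" if event == "speciation" else "paralogs"


# Duplication before speciation: two distinct lines of orthologous descent.
DUP_FIRST = ("duplication", ("speciation", "A1", "A2"), ("speciation", "B1", "B2"))

# Lineage-specific duplications after speciation: A, B versus A', B', C'.
LINEAGE_SPECIFIC = (
    "speciation",
    ("duplication", "A", "B"),
    ("duplication", "A'", ("duplication", "B'", "C'")),
)

if __name__ == "__main__":
    print(relationship(DUP_FIRST, "A1", "A2"))          # across species, same line
    print(relationship(DUP_FIRST, "A1", "B2"))          # separated by the duplication
    print(relationship(LINEAGE_SPECIFIC, "A", "C'"))    # co-orthologs
    print(relationship(LINEAGE_SPECIFIC, "B'", "C'"))   # within-lineage paralogs
```

In the second tree every gene of lineage 1 is classified as an ortholog of every gene of lineage 2, which is exactly the many-to-many co-orthology described above. The sketch does not model gene loss: if a branch is missing from the tree, the labels at the remaining nodes still have to be inferred correctly for the classification to be meaningful.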

Reliable identification of orthologs is only possible when complete sets of genes from two or more genomes are compared. Indeed, if one of the compared genomes is incomplete, a possibility always remains that the true ortholog of the given gene is "hiding" in the unsequenced part. Even with complete genomes, identification of orthologous gene sets is not a simple task because of the complex evolutionary scenarios, which involve multiple duplications, speciations, and most importantly, lineage-specific gene loss events. In principle, complete phylogenetic analysis of all groups of homologous genes is required to decipher true orthologous relationships. This is an extremely labor-intensive task; moreover, it is well known that not all phylogenetic trees provide the required resolution. "Shortcut" approaches have been developed to circumvent the need for comprehensive phylogenetic analyses, and some of these are discussed in subsequent chapters.
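This section does not say which shortcuts are meant. One widely used heuristic, shown here purely as an illustration and not as the method of the text, is the reciprocal (bidirectional) best hit: gene a from genome A and gene b from genome B are treated as candidate orthologs if b is the highest-scoring hit of a in genome B and a is the highest-scoring hit of b in genome A. The score tables below are hypothetical toy data; in practice they would come from an all-against-all sequence comparison of two complete gene sets.

```python
def best_hits(scores):
    """Map each query gene to its highest-scoring subject gene.

    `scores` is a dict: query -> {subject: similarity score}.
    """
    return {query: max(hits, key=hits.get) for query, hits in scores.items() if hits}


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Return sorted (gene_a, gene_b) pairs that are each other's best hit."""
    best_ab = best_hits(a_vs_b)
    best_ba = best_hits(b_vs_a)
    return sorted((a, b) for a, b in best_ab.items() if best_ba.get(b) == a)


if __name__ == "__main__":
    # Hypothetical scores for two tiny "genomes".
    a_vs_b = {"a1": {"b1": 250, "b2": 90}, "a2": {"b1": 120, "b2": 80}}
    b_vs_a = {"b1": {"a1": 240, "a2": 110}, "b2": {"a1": 95, "a2": 70}}
    print(reciprocal_best_hits(a_vs_b, b_vs_a))
```

The heuristic inherits the problems described above: with an incomplete genome or after lineage-specific gene loss, the reciprocal best hit may be a paralog, and lineage-specific duplications produce co-orthologs that a strict one-to-one criterion cannot represent.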

2.2. Patterns and Mechanisms in Genome Evolution

Although still a young discipline, comparative genomics has matured enough to allow delineation of the most common and important types of events that occur during genome evolution. These include different forms of genome rearrangement, gene duplication and, more specifically, lineage-specific expansion of gene families, lineage-specific gene loss, horizontal gene transfer, and non-orthologous gene displacement.

2.2.1. Evolution of gene order

Comparison of the first completely sequenced genomes promptly showed that gene order is much less conserved than protein sequences. Genomes of the closely related bacteria Mycoplasma genitalium and M. pneumoniae, for example, consist of six large segments with similar organization of genes, but the segments themselves are shifted relative to each other and partially scrambled in the two genomes [348]. Much greater differences were found between Haemophilus influenzae and E. coli, or even between E. coli K-12 and its pathogenic relative E. coli O157:H7 [669,829]. The gradient of gene order conservation is illustrated in Figure 2.6 (see color plates). In the chlamydial genomes, a genome-scale alignment is readily traceable along the main diagonal, although gaps in the alignment and two major inversions are equally obvious (Figure 2.6A). In contrast, the comparison of E. coli and P. aeruginosa looks completely disordered on the genome scale (Figure 2.6B).

Figure 2.6

Gene order comparison plots. A, Chlamydia trachomatis (X axis) vs Chlamydophila pneumoniae (Y axis). B, Escherichia coli (X axis) vs Pseudomonas aeruginosa (Y axis).
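A gene order comparison plot of this kind can be sketched as follows: each gene shared by two genomes contributes one point whose coordinates are its positions in the two gene lists, so that conserved order appears as a diagonal and an inversion as an anti-diagonal. The code below is a minimal illustration under stated assumptions (one-to-one orthologs identified by a shared label, linear rather than circular gene lists); the gene labels are invented, and the adjacency score is an ad hoc summary, not a measure taken from the text.

```python
def dot_plot_points(order_x, order_y):
    """Return (position in X, position in Y) for every gene present in both lists."""
    position_y = {gene: j for j, gene in enumerate(order_y)}
    return [(i, position_y[g]) for i, g in enumerate(order_x) if g in position_y]


def conserved_adjacency_fraction(order_x, order_y):
    """Fraction of consecutive shared genes in X that are also neighbors in Y.

    Neighbors in either orientation count, so a clean inversion only
    breaks adjacency at its two boundaries.
    """
    points = dot_plot_points(order_x, order_y)
    if len(points) < 2:
        return 0.0
    conserved = sum(
        1 for (_, j1), (_, j2) in zip(points, points[1:]) if abs(j2 - j1) == 1
    )
    return conserved / (len(points) - 1)


if __name__ == "__main__":
    genome_x = ["g1", "g2", "g3", "g4", "g5", "g6"]
    inverted = ["g1", "g2", "g3", "g6", "g5", "g4"]   # one inverted segment
    scrambled = ["g3", "g6", "g1", "g5", "g2", "g4"]  # no conserved neighbors
    print(dot_plot_points(genome_x, inverted))
    print(conserved_adjacency_fraction(genome_x, inverted))
    print(conserved_adjacency_fraction(genome_x, scrambled))
```

Plotting the points from `dot_plot_points` gives a diagonal with one inverted block in the first comparison and a scatter in the second, a toy analogue of the contrast between panels A and B.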

In fact, any such comparison between more or less distantly related prokaryotic genomes, e.g. bacteria or archaea from different genera, would look disordered at a scale where only conservation of about a dozen genes in a row is noticeable. On a smaller scale, however, there is significant conservation of gene order within operons, the units of prokaryotic gene coregulation. Extensive genome comparisons showed that, in each genome, 5% to 25% of the genes belong to conserved (predicted) operons, i.e. strings of genes that are shared with at least one relatively distant genome [916]. As should be expected, this fraction gradually increases as new genomes are sequenced. A few operons that are conserved in distantly related prokaryotes consist of genes for ribosomal proteins and other components of the translation machinery. Other conserved operons include those encoding subunits of the H+-ATPase and ABC-type transporter complexes [169,385,461,595].

2.2.2. Lineage-specific gene loss

A quick look at the genome sizes of the organisms with completely sequenced genomes (Table 1.4) shows that many pairs of closely related organisms have vastly different numbers of genes. Thus, E. coli K-12 has 7 times more genes than the aphid symbiont Buchnera sp., which is located right next to E. coli in the 16S rRNA-based phylogenetic tree. Two more representatives of gamma-proteobacteria, H. influenzae and P. multocida, have 2.5 times fewer genes than E. coli. Substantial differences in the gene number can be found even within the same genus. The gene set of Mycoplasma pneumoniae, for example, includes all the 480 genes of M. genitalium, as well as 197 additional genes. Mycobacterium leprae is closely related to M. tuberculosis but has at least 1,200 fewer genes [153].

The same phenomenon is seen throughout eukaryotes. Baker's yeast S. cerevisiae, for instance, has about 6,000 genes, which is at least 2,000 genes fewer than in its relatives, multicellular ascomycetes such as Aspergillus. Furthermore, a eukaryotic intracellular parasite, the microsporidian Encephalitozoon cuniculi, which has been identified as a derived fungus in several consistent phylogenetic studies, has only ~2,000 genes [425], which points to a truly dramatic scale of gene loss. About 300 genes were apparently lost by S. cerevisiae after its radiation from the common ancestor with fission yeast S. pombe, although the latter has even fewer genes than S. cerevisiae [55]. All these observations show that certain phylogenetic lineages experienced significant gene loss, often linked to adaptation to a parasitic lifestyle (H. influenzae, P. multocida, M. pneumoniae, M. genitalium, M. leprae), or intracellular symbiosis (Buchnera sp.), or just adaptation to a constant (narrow) range of environmental conditions. Indeed, parasites might not need a complicated network of metabolic pathways for the biosynthesis of amino acids, nucleotides, and cofactors as long as they can fetch those nutrients from their host.

In the same vein, the well-known absence of the biosynthetic pathways for 12 amino acids in humans and other vertebrates was probably made possible by the abundance of these amino acids in the food consumed by their common ancestor at the time of their divergence.

An analysis of gene loss in bacterial parasites showed that, in many cases, it led to the elimination of entire pathways, such as amino acid, nucleotide, and cofactor biosynthetic pathways (Chapter 7). For example, a number of parasitic bacteria lack pyrimidine biosynthesis genes that are present in their free-living relatives (Figure 2.7). This has, of course, a simple evolutionary explanation: if the necessary nutrient is available in the medium, the genes responsible for its synthesis become redundant and can be eliminated. Moreover, once at least one of these genes is lost, expression of the others would lead to the accumulation of metabolic intermediates that can be harmful for the cell. This would result in an evolutionary pressure toward coordinated loss of all the genes in a pathway [270]. A similar trend toward coelimination of functionally connected groups of proteins, such as the signalosome and the spliceosome components, has been detected in yeast [55].

Figure 2.7

Pyrimidine biosynthesis genes in organisms with completely sequenced genomes. Each rectangle signifies an enzyme of the pyrimidine biosynthesis pathway, indicated by its gene name and COG number. Alternative enzymes catalyzing the same reaction are shown ...

In a remarkable exception to the principle of coordinated gene loss, there are cases when only a certain (typically, upstream) part of the pathway is eliminated. Figure 2.7 shows that the complete pyrimidine biosynthesis pathway is missing in M. genitalium and M. pneumoniae, whereas H. influenzae lacks genes for the first three reactions of this pathway but has the complete set of genes for all the enzymes that catalyze the conversion of dihydroorotate into CTP. Thus, while H. influenzae is evidently incapable of de novo pyrimidine biosynthesis, it has preserved a certain metabolic plasticity to accommodate any pyrimidine it can get from its host. The same trend is seen in the even smaller genomes of B. burgdorferi and C. trachomatis, which have lost most of the pyrimidine biosynthesis genes but still contain genes coding for the downstream steps of this pathway.
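The distinction between coordinated loss of a whole pathway and loss of only its upstream part can be expressed as a simple check on presence/absence data. The sketch below is illustrative only: the step names and the three toy genomes are made-up placeholders, not the data of Figure 2.7, and `classify_pathway` is a hypothetical helper that assumes a strictly linear pathway.

```python
def classify_pathway(steps, genes_present):
    """Classify a linear pathway in one genome.

    steps: ordered list of pathway steps (upstream first).
    genes_present: set of steps for which the genome encodes an enzyme.
    """
    present = [s in genes_present for s in steps]
    if all(present):
        return "complete"
    if not any(present):
        return "absent"
    # Upstream-only loss: missing steps first, then an unbroken run of present steps.
    first_present = present.index(True)
    if all(present[first_present:]):
        return "upstream part lost"
    return "patchy"


# Placeholder pathway of six ordered steps and three toy genomes.
steps = ["step1", "step2", "step3", "step4", "step5", "step6"]
genomes = {
    "free_living": set(steps),
    "parasite_1": set(),
    "parasite_2": {"step4", "step5", "step6"},
}
for name, genes in genomes.items():
    print(name, "->", classify_pathway(steps, genes))
```

A real analysis would also have to handle alternative enzymes for the same reaction and branched pathways, which this sketch deliberately leaves out.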

2.2.3. Lineage-specific expansion of gene families

We have already mentioned the evolutionary importance of gene duplication leading to the emergence of paralogs, which may assume new functions, sometimes substantially different from those of the ancestral gene. Genome comparisons suggest that lineage-specific expansion of paralogous gene families, which in some cases account for a sizable fraction of a genome, is one of the major mechanisms of adaptation [408,506]. Analysis of lineage-specific gene expansions can provide useful clues to the evolution of each particular lineage. Table 2.1 shows that, indeed, in the pathogens M. tuberculosis and H. pylori, the most conspicuous expansions are those of genes encoding factors involved in interactions with and survival within the host organisms. In contrast, in the free-living autotrophs Synechocystis sp. and A. fulgidus, the largest expansion involves signal transduction proteins, sensor histidine kinases, and related ATPases.

Table 2.1

Lineage-specific expansions of paralogous families in prokaryotic genomes ^a.

In eukaryotes, lineage-specific expansion of certain protein families is even more evident than in prokaryotes. A comparison of the genome counts of signaling domains in the nematode C. elegans against the corresponding numbers in the yeast S. cerevisiae and some free-living bacteria and archaea (Table 2.2) shows that certain domains are dramatically expanded in C. elegans, even when the greater number of genes in the worm is taken into account (see also the counts of ankyrin repeats in C. elegans in 3.2.2).

Table 2.2

Expansion of signaling domains in C. elegans ^a.

2.2.4. Horizontal (lateral) gene transfer

Horizontal (lateral) gene transfer, as opposed to the standard (vertical) transfer from ancestors to progeny, refers to acquisition of genes from organisms that belong to other species, genera, or even higher taxa. Some mechanisms of lateral gene transfer between different strains of the same species, or between closely related species, are well established and include conjugation, acquisition of plasmids, and viral (phage) infection [134]. These events are common and do not stir much controversy. After all, it was the experiment on pneumococcal transformation by heterologous DNA by Avery, MacLeod, and McCarthy that proved the role of DNA in heredity. However, in the pre-genomic era, long-range lateral gene transfer across taxa was considered to be extremely rare and more or less unimportant in the general scheme of evolution [782]. The only case where the fact and impact of horizontal gene transfer were clearly recognized was the apparent massive flow of genes from the genomes of endosymbiotic organelles, mitochondria in all eukaryotes and particularly chloroplasts in plants, to the eukaryotic nuclear genome [311,312].

As soon as the first comparisons of multiple, complete genome sequences representing various taxa had been performed, it became apparent that lateral gene transfer was too common to be dismissed as inconsequential [194]. First, horizontal gene flow between closely related species turned out to be much more pervasive than ever suspected before. Lawrence and Ochman estimate, for example, that as much as 25% of the E. coli genome consists of recently acquired "foreign" genes [497,625]. The actual rate of influx and loss of new genes is even faster: it appears that, in the ~100 million years since the split between the Escherichia and Salmonella lineages, E. coli has picked up and lost as much DNA as it has now [496,497].

In addition, genome comparisons helped to uncover numerous cases of (predicted) horizontal gene transfer between organisms belonging to distinct phylogenetic lineages. Archaeal genomes presented a particularly striking picture, with some genes having close homologs only among eukaryotes and others being much more similar to their bacterial homologs than to those from eukaryotes, if eukaryotic homologs were detectable at all [466]. With some exceptions, the "bacterial" and "eukaryotic" proteins in archaea were divided along functional lines, with those involved in information processing (translation, transcription, and replication) showing the eukaryotic affinity, and metabolic enzymes, structural components, and a variety of uncharacterized proteins appearing "bacterial" [466,540]. Because the informational components generally appear to be less prone to horizontal gene transfer [703] and in accord with the "standard model" of early evolution whereby eukaryotes share a common ancestor with archaea [906], these observations could be explained by massive gene exchange between archaea and bacteria [466]. This hypothesis was further supported by the results of genome analysis of two hyperthermophilic bacteria, A. aeolicus and T. maritima. Each of these genomes contained a significantly greater proportion of "archaeal" genes than any of the other bacterial genomes, in an obvious correlation between the similarity in the lifestyles of evolutionarily very distant organisms (bacterial and archaeal hyperthermophiles) and the apparent rate of horizontal gene exchange between them [52,610]. Further analyses led to the discovery of genes of clear bacterial origin in the hyperthermophilic archaeon P. furiosus, which proved lateral gene transfer from bacteria to archaea [184].

We believe that the demonstration of the evolutionary prominence of lateral gene transfer can be considered the single greatest change in perspective in biology brought about by comparative genomics. A new round of controversy has been sparked by the discovery of genes of possible bacterial origin in the human genome [488]. In Chapter 6, we revisit this issue and discuss implications of large-scale lateral gene transfer for the "tree of life".

2.2.5. Non-orthologous gene displacement and the minimal gene set concept

Proteins responsible for the same function in different organisms typically show significant sequence and structural conservation and can be inferred to be orthologs. However, there are exceptions to this rule. Examples of apparently unrelated enzymes with the same specificity were noted as early as 1943, when Warburg and Christian described two distinct forms of fructose-1,6-bisphosphate aldolase in yeast and rabbit muscle, respectively. These two enzymes, referred to as class I and class II aldolases, were later shown to be associated with different phylogenetic lineages and to have different catalytic mechanisms and little structural similarity [95,549]. Unrelated enzymes that catalyze the same reaction have been referred to as analogous, as opposed to homologous, enzymes [228,271].

Comparative analysis of complete genomes shows that cases like this are common. Strikingly, only about 65 orthologous protein sets are universally represented in all sequenced genomes. While, in large part, this is due to lineage-specific gene loss, this number is much lower than the number of essential functions, indicating that other such functions are performed by unrelated (or at least non-orthologous) proteins in different life forms. This major evolutionary phenomenon, which came to light already in the first comparisons of sequenced genomes, was dubbed non-orthologous gene displacement [465]. The full range of mechanisms leading to non-orthologous gene displacement is not known. However, in cases when essential functions are involved, the main sequence of events appears to be clear. Since an organism cannot survive without a protein that performs an essential function, transient functional redundancy, when an organism has both forms of the corresponding protein, appears to be a prerequisite of non-orthologous gene displacement [464]. Such redundancy might evolve via horizontal gene transfer or via recruitment of a protein whose original function was different from the given one (recruitment is likely to occur after gene duplication). The redundancy phase is followed by lineage-specific gene loss, resulting in non-orthologous gene displacement (Figure 2.8). In the case of non-essential functions, the redundancy phase might be bypassed, with non-orthologous gene displacement evolving directly via horizontal gene transfer or recruitment.

Figure 2.8

A scenario for the evolution of non-orthologous gene displacement via an ancestral redundancy stage and lineage-specific gene loss.

Enzyme recruitment is a common evolutionary phenomenon leading to non-orthologous gene displacement. Typically, one of the two non-orthologous enzymes with the same catalytic activity belongs to a diverse family of enzymes and could have evolved by shifting the substrate specificity of a related but distinct enzyme [271]. A good example is the two unrelated forms of gluconate kinase. Gluconate kinases from E. coli, yeast, and S. pombe form a narrow conserved group. In contrast, the gluconate kinase of B. subtilis belongs to the so-called FGGY family of carbohydrate kinases, which also includes glycerol kinase (GlpK), D-xylulose kinase (XylB), L-fuculose kinase, and L-xylulose kinase (LyxK). The scenario of enzyme recruitment in this case seems straightforward: a duplication of the glpK or xylB gene in the Bacillus lineage produced a new paralog, which accumulated several mutations resulting in a shift of substrate specificity from glycerol (or xylulose) to gluconate.

Enzyme recruitment seems to be particularly common in organisms that have adapted to novel ecological niches by developing unusual, idiosyncratic metabolic pathways. For example, most of the enzymes that are responsible for the biosynthesis of polyketide antibiotics in actinomycetes appear to be recent recruits from the enzymes of fatty acid biosynthesis. Similarly, enzymes that hydrolyze man-made halogenated hydrocarbons have close relatives among regular metabolic enzymes and, in all likelihood, have been recruited from this source. Perhaps the most remarkable example is the evolution of apyrases (ATP-diphosphohydrolases, EC 3.6.1.5), the enzymes secreted by blood-sucking insects into the blood of human or other mammalian victims in order to prevent or slow down blood clotting [862]. Because ADP in the blood can serve as a trigger of blood clotting, any enzyme capable of hydrolyzing it would give the hematophagous insect a substantial evolutionary advantage. As a result of this evolutionary pressure toward increasing salivary apyrase activity, insect apyrases are found in at least three different forms, which are homologous, respectively, to ATPases, 5'-nucleotidases, and inositoltriphosphate phosphatases [271,862].

It is worth noting that enzyme recruitment can be legitimately described as independent, convergent evolution of the same enzymatic activity. In Chapter 7, we look at the comparative genomics of central metabolic pathways and encounter numerous cases of non-orthologous gene displacement and, specifically, enzyme recruitment.

The idea of non-orthologous gene displacement was originally developed in conjunction with the concept of a minimal gene set for a living cell [596]. This was construed as the minimal set of genes that are essential for the functioning of a modern-type cell even under the most favorable environmental conditions, including abundance of nutrients and absence of competition. An attempt to explicitly derive a version of such a minimal gene set was undertaken by comparing the first two sequenced bacterial genomes, those of the parasites H. influenzae and M. genitalium. The straightforward logic of this reconstruction was that these two bacteria, which belong to distant phylogenetic lineages, have been independently losing genes during their adaptation to the parasitic lifestyle, and whichever common genes remain in both genomes were likely to belong to the minimal set of essential genes. It was noticed, however, that for certain essential functions (e.g. glycyl-tRNA synthetase), there was no orthologous pair of genes in the two bacteria, hence non-orthologous gene displacement had to be invoked.
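The logic of this reconstruction can be sketched in a few lines of Python: take the ortholog families shared by the two genomes, then flag essential functions that both genomes perform but with families that are not shared. This is only a schematic illustration under stated assumptions: the family identifiers (FAM1, FAM2, ...), the function labels, and the helper `minimal_set_sketch` are hypothetical, and the original study relied on sequence comparison rather than on a ready-made family-to-function table.

```python
def minimal_set_sketch(families_a, families_b, essential_functions):
    """families_a / families_b: dict mapping ortholog-family ID -> function
    for each genome. essential_functions: functions assumed indispensable.

    Returns (shared families, functions needing non-orthologous displacement).
    """
    shared = set(families_a) & set(families_b)
    covered = {families_a[f] for f in shared}
    functions_a = set(families_a.values())
    functions_b = set(families_b.values())
    # Essential functions present in both genomes but carried out by
    # families that are not shared: candidate displacement cases.
    displaced = {
        fn for fn in essential_functions
        if fn in functions_a and fn in functions_b and fn not in covered
    }
    return shared, displaced


# Hypothetical family IDs and function labels, for illustration only.
genome_a = {"FAM1": "ribosomal protein", "FAM2": "glycyl-tRNA synthetase",
            "FAM3": "amino acid biosynthesis"}
genome_b = {"FAM1": "ribosomal protein", "FAM9": "glycyl-tRNA synthetase"}
essential = {"ribosomal protein", "glycyl-tRNA synthetase"}

shared, displaced = minimal_set_sketch(genome_a, genome_b, essential)
print(sorted(shared))      # families retained in both genomes
print(sorted(displaced))   # functions served by non-orthologous genes
```

The minimal set in this scheme is the shared families plus one representative gene for each displaced function.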

The original version of the minimal gene set included 256 genes, with 16 inferred non-orthologous gene displacement cases. (The magic of these numbers must not be lost on the reader: 16 is 2² to the power of 2; 256 = 16², and accordingly, 256 is 2² to the power of 2 to the power of 2. Thus, 256 is the only number that can be represented as such a succession of powers of 2 and, at the same time, can be a reasonable approximation of a minimal gene set: 16 is obviously too few and 256² = 65,536 is, in all likelihood, much greater than the number of genes in the human genome.)
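The arithmetic in this aside is easy to verify. The short Python check below, added purely for illustration, starts from 2² = 4 and squares repeatedly.

```python
# Repeated squaring starting from 2**2; each term is the previous one squared.
terms = [2 ** 2]
for _ in range(3):
    terms.append(terms[-1] ** 2)

print(terms)  # [4, 16, 256, 65536]
```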

A subsequent large-scale experimental study has shown that most of the genes included in this theoretical minimal gene set were, indeed, essential in M. genitalium, although a few, surprisingly, were not [364]. Nevertheless, sequencing of additional genomes and the corresponding genome comparisons have clearly shown that this early reconstruction vastly underestimated the extent of non-orthologous gene displacement [452,591,674]. Indeed, as indicated above, only about 65 genes seem to be truly ubiquitous in cellular life forms, comprising perhaps 25% of the minimal set of essential functions. Therefore, it probably makes more sense to consider not so much a minimal gene set but rather a minimal set of functional requirements for cell survival. Comparative genomics shows that, for some of these requirements, a unique solution has evolved, but for the majority, evolution has come up with two or more unrelated or distantly related solutions. As discussed in 6.4, non-orthologous gene displacement is prominent even in the DNA replication machinery, the central functional system of all cells.

Figure 2.9

Distribution of different phylogenetic lineages in the COG database. The plot shows the number of protein families (COGs) in a release of the COG database (see 3.4), which included proteins from the given number of phylogenetic lineages of the total of ...

2.2.6. Phyletic patterns (profiles)

As a result of numerous lineage-specific gene losses, horizontal gene transfers and non-orthologous gene displacements, most protein families show a "patchy" distribution among the sequenced genomes. The data from the database of Clusters of Orthologous Groups of proteins (COGs, see 3.4) show that the majority of COGs are represented in only three or four phylogenetic lineages; universal or nearly universal COGs are much less common.

This distribution can be conveniently presented in the form of phyletic patterns (profiles), which show the presence or absence of a COG in each analyzed species. This approach, initially introduced as a feature of the COGs [828] and subsequently adopted, with various modifications, by several research groups [547,665,689], provides a convenient way to compare genomes and investigate the evolutionary history of individual cellular functions. For example, a quick examination of the phyletic patterns of the two distinct forms of phosphoglycerate mutase (the cofactor-dependent form GpmA and the cofactor-independent form GpmI [393]) immediately shows several interesting trends (the species symbols are the same as in Figure 2.7):

Image ch2e1.jpg

Firstly, the two forms have largely complementary phyletic patterns, a clear sign of non-orthologous gene displacement. Only E. coli encodes both forms of the enzyme (and hence shows apparent functional redundancy), whereas other organisms encode either one or the other. Secondly, several organisms do not encode either of the two forms of this enzyme. Assuming that glycolysis is an essential metabolic pathway, glycolytic enzymes should be encoded in every genome (we are aware of one exception, Rickettsia, which does not encode any glycolytic enzymes; see 7.1.1). Therefore, one might suggest that there should be an additional, third form of phosphoglycerate mutase, which is encoded in archaeal genomes and also in T. maritima, A. aeolicus, and D. radiodurans. Indeed, sequence analysis of those genomes shows that they all encode an uncharacterized enzyme, distantly related to alkaline phosphatase and cofactor-independent phosphoglycerate mutase. Based on the conservation of active site residues, this archaeal enzyme has been predicted to have a phosphoglycerate mutase activity [258,261]; this prediction has now been experimentally confirmed in two independent studies [308,866]. Remarkably, the phyletic pattern of the corresponding COG complements the union of the patterns for the two forms of phosphoglycerate mutase, which ensures the presence of at least one type of phosphoglycerate mutase in every species, except for Rickettsia:

Figure 2.10

Phyletic patterns of the three forms of phosphoglycerate mutase. The species symbols are as in Figure 2.7.
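The reasoning applied to these patterns (complementarity, overlap, and the union covering every species) maps directly onto set operations. The following Python sketch uses made-up one-letter species symbols and made-up patterns; it does not reproduce the data of Figure 2.10, and `pattern_string` is a hypothetical helper.

```python
def pattern_string(present, species):
    """Render a phyletic pattern: the species symbol if present, '-' if absent."""
    return "".join(s if s in present else "-" for s in species)


# Made-up species symbols and patterns (NOT the actual data of Figure 2.10).
species = list("abcdefgh")
form_1 = set("abc")
form_2 = set("cde")
form_3 = set("fg")

for name, pat in [("form 1", form_1), ("form 2", form_2), ("form 3", form_3)]:
    print(f"{name:7s}", pattern_string(pat, species))

both = form_1 & form_2                    # apparent functional redundancy
union = form_1 | form_2 | form_3          # species with at least one form
missing = set(species) - union            # species lacking every known form
print("both forms 1 and 2:", sorted(both))
print("no known form:     ", sorted(missing))
```

A species left in `missing` is either a genuine exception or a hint that yet another, undetected form of the enzyme exists, which is the inference that led to the prediction of the third form described above.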

This summation also shows that there is no need for yet another form of phosphoglycerate mutase, which has been designated GpmB in E. coli (see 3.2.1.3), but has never been experimentally demonstrated to have this activity:

Image ch2e2.jpg

Indeed, recent data show that this protein does not have phosphoglycerate mutase activity, at least in B. subtilis. Instead, it appears to function as a non-specific sugar phosphatase [702]. This case shows the impressive power of the comparative-genomic approach for prediction of gene functions. This methodology is discussed in greater detail later in this book (see 5.2).

2.3. Conclusions and Outlook

In this chapter, we discussed some general principles of molecular evolution that are key to the comparative-genomic approaches and major evolutionary phenomena that became apparent as the result of genome comparison. The above discussion is obviously quite sketchy. However, it should be sufficient for understanding the principles underlying methods of computational genomics and the organization of various databases, which we discuss in the next two chapters. In Chapters 5 through 8, we return to problems of genome evolution at a new level and explore some of the concepts outlined here in greater depth.

Table 2.3

Examples of non-orthologous gene displacement between M. genitalium and H. influenzae.

2.4. Further Reading

1. Darwin C. 1859. The Origin of Species. Murray, London.
2. Kimura M. 1983. The Neutral Theory of Molecular Evolution. Cambridge University Press, Cambridge, UK.
3. Ohno S. 1970. Evolution by Gene Duplication. Springer, New York.
4. Graur D, Li W-H. 2000. Fundamentals of Molecular Evolution. Sinauer Associates, Sunderland, MA.
5. Doolittle WF. Uprooting the tree of life. Scientific American. 2000;282:90–95. [PubMed: 10710791]
6. Koonin EV, Aravind L, Kondrashov AS. The impact of comparative genomics on our understanding of evolution. Cell. 2000;101:573–576. [PubMed: 10892642]

Source: https://www.ncbi.nlm.nih.gov/books/NBK20255/
