Are there any animals whose genomes haven't been sequenced yet?

I referred to a few websites, and it seems that almost every animal's genome has been sequenced. Are there any animals that haven't been sequenced yet? If so, can you list them here or link to a website that provides that information?

Wikipedia maintains an (incomplete) list of sequenced animal genomes.

There are a few million living species of animals. We're not that close to sequencing them all, and listing them all doesn't make much sense.

I'd recommend starting from the animals that have been sequenced, and in particular, animals from particular families or smaller taxa, if you'd like more detailed information.

Why domesticated foxes are genetically fascinating (and terrible pets)

Cultures across the globe consider foxes to be incorrigibly wild. In both ancient fables and big-budget movies, these fluffy mammals are depicted as being clever, intelligent and untamable. Untamable, that is, until an unparalleled biology experiment started in Siberia almost 60 years ago.

The tale begins with Dmitry Belyaev, who was studying genetics during a very dangerous time in the Soviet Union. State officials campaigned actively against genetic research with a tactic known as Lysenkoism, under which hundreds of biologists were either thrown in prison or executed. After Joseph Stalin’s death, the government’s grasp on genetic research loosened, and though it was still controversial, Belyaev was finally able to test a hypothesis he had been secretly pursuing.

Dmitry Belyaev, the brains behind the breeding. Photo by Institute of Cytology and Genetics

As director of the newly-minted Institute of Cytology and Genetics, Belyaev was curious as to how dogs first became domesticated. He decided that to fully understand the process, he must attempt to replicate the early days of domestication. He picked foxes for the experiment because of their close family ties with dogs (both are canids). His research team visited fur farms across the Soviet Union and purchased the tamest foxes on hand. They figured using the most docile of the wild foxes for their breeding program would hasten the pace of domestication, relative to the thousands of years it took to breed dogs.

To prove the foxes’ friendly demeanor was the result of genetic selection, Belyaev’s team began to breed foxes that showed opposite traits of the tame pups. Instead of being outgoing and excited by encountering people, these foxes were defensive and aggressive. This result showed certain aspects of the fox’s behavior could be tied to genetics and spotted during breeding.

What does the (tame) fox say?

Unfortunately, Belyaev died before seeing the final results. But today, 58 years after the start of the program, there is now a large, sustainable population of domesticated foxes. These animals have no fear of humans, and actively seek out human companionship. The most friendly are known as “elite” foxes.

“By the tenth generation, 18 percent of fox pups were elite; by the 20th, the figure had reached 35 percent,” Lyudmila Trut, one of the lead researchers at the Institute of Cytology and Genetics, wrote in a paper describing the experiment in 1999. “Today elite foxes make up 70 to 80 percent of our experimentally selected population.”

University of Illinois biologist Anna Kukekova has been studying these domesticated foxes since the late 1990s. Her lab digs into the genes behind the desirable traits in the animals.

Two domesticated foxes, produced as part of a long-term breeding program in Russia, begging for pets. Photo by Judith A. Bassett Canid Education and Conservation Center

One of the lab’s most interesting findings is that the friendly foxes exhibit physical traits not seen in the wild, such as spots in their fur and curled tails. Their ears show weird traits, too.

Like puppies, young foxes have floppy ears. But the ears of domesticated foxes stay floppier for a longer time after birth, said Jennifer Johnson, a biologist who has worked with Kukekova since the early 2000s.

As the researchers peered into the reasons behind the behavioral traits, they found there isn’t just one gene responsible for the friendly and outgoing behavior.

“The tameness (the nice versus mean) is actually separate from the bold animals versus the shy animals, and the active animals versus quiet animals,” Johnson said. “When these [tame and aggressive] animals are bred, we see a lot of interesting new behaviors.”

Johnson said it has been difficult to decipher these genetic secrets, because unlike for humans and dogs, no one has sequenced the genome of foxes … yet. Kukekova’s lab expects to publish a fox genome sometime soon.

Fly foxes, fly!

After the collapse of the Soviet Union, the domesticated fox experiment fell on hard times as public funding for the project evaporated. The researchers realized quickly that keeping more than 300 foxes is an expensive enterprise. In the 1990s, the lab switched to selling some of the foxes as fur pelts to sustain the breeding program.

“The current situation is not catastrophic, but not stable at the same time,” Institute of Cytology and Genetics research assistant Anastasiya Kharlamova told BBC Earth last year. Now, the lab’s primary source of revenue is selling the foxes to people and organizations across the globe.

One customer is the Judith A. Bassett Canid Education and Conservation Center, located near San Diego. The center keeps six foxes — five of which are domesticated — as ambassadors for their species, so that people can get an up-close-and-personal view of the animals.

“We have a fox whose name is Boris, and as soon as someone walks in, he’ll run up to them like a dog will,” said David Bassett, president of the Conservation Center. “He wants to be scratched and if you don’t scratch him he’ll make you.”

Boris the domesticated fox. Photo by Judith A. Bassett Canid Education and Conservation Center

Want a domesticated fox of your own? Remember these rules. First, bringing one into the United States costs almost $9,000. Several states outright ban people from keeping foxes as pets, including California, New York, Texas and Oregon. And of course, while domesticated foxes are friendlier than those in the wild, they can still be unpredictable.

“[You can be] sitting there drinking your cup of coffee and turning your head for a second, and then taking a swig and realizing, ‘Yeah, Boris came up here and peed in my coffee cup,’” said Amy Bassett, the Canid Conservation Center’s founder. “You can easily train and manage behavioral problems in dogs, but there are a lot of behaviors in foxes, regardless of if they’re Russian or U.S., that you will never be able to manage.”

A domesticated fox, produced as part of a long-term breeding program in Russia, being cuddled. Photo by Judith A. Bassett Canid Education and Conservation Center

Drosophila Genome Sequence Completed

Researchers unveil the complete genetic sequence of one of the workhorses of modern biology.

The common fruit fly, Drosophila melanogaster, has been the workhorse of biology and genetics laboratories for the past 90 years. Now the entire Drosophila genome has been sequenced through the collaborative effort of researchers from the Drosophila Genome Project Group, led by Howard Hughes Medical Institute (HHMI) vice president Gerald Rubin at the University of California, Berkeley, and researchers led by J. Craig Venter at the Celera Genomics Corporation.

The Drosophila genome sequence was published in the March 24, 2000, issue of Science. The researchers report that they have sequenced 97 to 98 percent of the genome and perhaps 99 percent of the estimated 13,600 genes. The sequence data will be accessible to scientists worldwide through GenBank, the National Institutes of Health genetic sequence database.

In an accompanying editorial in Science, Thomas Kornberg at the University of California, San Francisco, and HHMI investigator Mark Krasnow at Stanford University report that the Drosophila sequence will be a "critical resource" for research in genetics, biology and medicine.

Over the years, Drosophila has been one of the most influential model systems for geneticists. "The conservation of biological processes from flies to mammals extends the influence of Drosophila to human health," write Kornberg and Krasnow. "When a Drosophila homologue of an important but poorly understood mammalian gene is isolated, the arsenal of genetic techniques in the Drosophila system can be applied to its characterization."

The Drosophila sequencing project was launched in 1991 when Rubin and HHMI investigator Allan Spradling at the Carnegie Institution decided, says Rubin, that the time was right to begin a fly genome project. In May 1998, the Berkeley Drosophila Genome Project was one year into a three-year NIH grant and had finished 20 percent of the sequencing, when Rubin was approached by Venter with what Rubin calls "an offer that was too good to turn down."

Venter proposed that his newly-formed company, Celera, would sequence the Drosophila genome free-of-charge using a controversial technique known as whole genome shotgunning. The technique requires shearing the Drosophila DNA into three million random clones with overlapping ends. These clones are then sequenced by automated DNA sequencing machines—at Celera, some 300 sequencers, each costing $300,000—and then massive computing power is put to work to assemble the complete genome sequence in a process similar to reconstructing a jigsaw puzzle.
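The assembly step described above can be illustrated with a toy sketch. This is not Celera's assembler: it is a minimal greedy overlap merge, written here only to show the jigsaw-puzzle idea. The function names, the `min_len` threshold, and the example reads are all hypothetical, and the sketch assumes error-free reads, no repeats longer than `min_len`, and no read wholly contained in another.

```python
def overlap(a, b, min_len):
    """Length of the longest suffix of a that equals a prefix of b.

    Returns 0 if no overlap of at least min_len characters exists.
    """
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of reads with the largest overlap.

    Toy illustration only (hypothetical helper): assumes error-free reads
    and no read fully contained inside another read.
    """
    reads = list(reads)
    while len(reads) > 1:
        best = (0, None, None)  # (overlap length, index of left read, index of right read)
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best[0]:
                        best = (k, i, j)
        k, i, j = best
        if k == 0:
            break  # no remaining overlaps: the leftover pieces stay separate
        merged = reads[i] + reads[j][k:]
        reads = [r for idx, r in enumerate(reads) if idx not in (i, j)] + [merged]
    return reads


# Three overlapping "clones" covering a made-up 15-base sequence.
reads = ["ATGGCGTA", "GCGTACGT", "TACGTTAGC"]
print(greedy_assemble(reads))  # ['ATGGCGTACGTTAGC']
```

Real genomes contain long repeats and sequencing errors, which is why the assembly is the hard part of the whole genome shotgun approach; a greedy merge like this one would misjoin repeated regions.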

Venter formed Celera with backing from PE Corporation (formerly known as Perkin-Elmer Corporation), which makes the DNA sequencing machines, as a commercial venture to sequence the human genome by 2001, several years before the date projected for completion by the international Human Genome Project. While promising the data would be made available to researchers, Venter was also betting that Celera could make money by licensing early looks at the sequencing data to the pharmaceutical industry.

The Drosophila genome, says Mark Adams, Celera's vice president for genome programs, would be "a proof-of-principle" for the whole genome shotgun strategy. "It seemed like a good idea to do a medium-sized organism in which there was extensive scientific interest," he says, "and in which there was already a lot of good information available in terms of map and sequence data that we could use to validate the strategy."

While Rubin says he had some concern about working with Celera, he was delighted by the offer nonetheless. "Anyone who would help me get the Drosophila sequence done and out of the way was my friend," says Rubin. "They were offering to do all this work in a collaborative way and not expecting any money for it."

Celera started the sequencing last April and finished collecting the raw data in early September. "Since then," says Rubin, "we've been putting all the pieces together, which is not trivial. It's the big challenge of the whole genome shotgun approach."

The finished genome already seems to be remarkably revealing. Of the 289 genetic flaws known to cause disease in humans, says Rubin, they have found Drosophila homologues for 60 percent and for 70 percent of the genes involved in human cancers. Among the genes that have already been identified are Drosophila homologues of genes involved in Parkinson's disease, and the long-sought Drosophila homologue of the p53 tumor suppressor gene, which is implicated in a host of human cancers.

The biggest surprise to come out of the Drosophila sequencing project, says Rubin, is that flies have only twice as many genes as yeast. "Yeast is a simple, single-cell fungus," says Rubin, "and yet flies only need twice as many genes to make an animal that can fly around without crashing into walls, has tissues, nerves, muscles, memories and other kinds of complicated behaviors like circadian rhythms. The take-home message is that the higher complexity in animals like flies and humans comes without needing a lot of new parts. You can build them with the same parts list, with more of the same parts organized together, in much the same way a supercomputer can be built from a bunch of desktop PCs hooked together in parallel."

Rubin sees the genome drastically changing the pace of his research. With fewer than 15,000 genes in Drosophila, and some 5,000 researchers worldwide working on the organism, he says, "that's one human being for every three genes. If you give those people very efficient tools for figuring out the functions of genes, you can do it in a massively parallel way." Moreover, the full Drosophila sequence allows researchers to look at multiple genes simultaneously to understand the complex signal transduction pathways that regulate cellular processes. "That is where the genome project really comes into play," he says. "It enables us to know all the genes so we can look at all of them at once and see what they're doing."

At the Princess Margaret Hospital in Toronto, researcher Tak Mak says he has been working to understand the signal transduction pathways involved in cancer formation. "The easiest way to understand that would be some kind of a genetic screen." As a result he has recently dedicated one-third of his laboratory to Drosophila genetics in anticipation of the publication of the sequence. "It will make Drosophila genetics relatively easy," he says.

Whether the whole genome shotgun technique will work as impressively for the human genome is now the next question. Celera's Adams says the Drosophila work is obviously encouraging, and that Celera's human sequencing work has already begun and should "start to look like a genome" toward the end of the year. Rubin says, "It worked better in Drosophila than most people expected it would. I think it will work for humans. But the problems are more complex for humans, so we'll have to wait and see."

Lecture 25: Genomics

Topics covered: Genomics

Instructors: Prof. Eric Lander

Good morning. Welcome back. So, the Red Sox won, it's pretty convincing, yeah, very good. Yay Red Sox.

So, as you can also tell, I have something of a cold, so I'll see if I, if my voice makes it through, but what I wanted to do today, if the voice allows, was to talk about genomics.

Now, this is a little bit different than what we normally do in the class because, I work on genomics, it's something I'm extremely interested in.

And so, what I wanted to do today, and I'll do it one more time before the end of the term, is to talk about research that's going on in genomics, give you a sense of what's really going on. I can assure you that what I say is not going to be in the text book, or any other text book. And, I'm not entirely sure how this might appear on an exam, so don't ask, because I'm really just going to talk about research that's going on today.

And part of the purpose in doing that is to a, show you that it's possible for you to understand the kind of research that's going on in this field, and b, to excite you about what's going on in this field. So each year I pick different things to talk about, and I've picked a few things, and we'll see. So feel free to interrupt and to ask questions, and all of that, but this is very much more, sort of the edge of genomics, including stuff that's going on, you know, right now as we speak. So, we'll fire away.

So a little introductory stuff. I call this, we can actually keep the lights up, I think people, can people read that? Yeah, it's fine, good, so we'll leave the lights up and I can see people.

So, I think the thing that sets apart this revolution of biology that we're living through right now, is the transformation of biology, not just from being the study of living organisms, to the study of chemicals and enzymes, to the study of molecules, but to the study of biology as information. That is what's distinctive about this decade, is the idea that the information sciences have begun to merge with biology, or biology merged with information sciences, and that it's having a profound effect on driving biomedicine. In both of the two talks I'll give, this one and near the end of the term, that will be the common theme, because I think that's the most important thing that's going on right now. Now, just to remind you, of course, the idea that biology is about information is an old one, it goes back to my hero, Gregor Mendel, with the recognition that information was passed from parent to offspring, according to rules.

And, as you know, the history of biology in the 20th century can be read as the development of biology's information.

The first quarter of the 20th century was the development of the idea that the information lives in chromosomes. The next quarter of the 20th century, the idea that the information of the chromosomes resides in the DNA double-helix, and that information was contained in this molecule, and somehow in its sequence, and you know all of this. And the next quarter of the 20th century, basically from 1950 to 1975, understanding how it is that the cell reads out that information, from DNA to RNA to protein, how it uses a genetic code to translate RNAs into proteins, and the development of the tools of recombinant DNA that made it possible for us to read out the information that the cell reads out.

So that brought us ¾ of the way through the 20th century, with the ability to read out genetic information, at least in little ways, but they were little ways. You could write a PhD thesis, around that time, for sequencing 200 letters of DNA.

That would be, you know, considered an amazingly exciting PhD thesis. The next quarter of the 20th century, the last quarter of the 20th century, was characterized by a voracious appetite to read as much of this information as possible.

It started, first, with trying to read out the sequence of individual genes, then sets of genes, then genomes of small organisms, bacteria, medium-sized organisms. And then, you know, in a wonderful closure to the 20th century, the reading out of the nearly complete genetic information of the human being in the closing weeks of the 20th century. When you remember that Mendel was rediscovered in January of 1900, that's when the papers rediscovering Mendel came out, and you figure you've got perfect bookends from the rediscovery of Mendel in January of 1900 to the sequencing of the human genome in around 2000, you realize what a century can do.

It's not bad, as centuries go, you know, to accomplish all that. And, you know, as students, you get a point estimate in time of what science knows, but you guys aren't old enough yet and haven't lived long enough yet to measure the derivative and see how rapidly it's changing.

But just look at what happened over the course of that century, and then just project forward to what that can mean for the next century. So what that's done is it's brought us to the next picture. I have a picture in my head, of biology as a vast library of information, a library of information in which evolution has been taking patient notes.

Evolution is a very good experimentalist, and it's a very patient note taker. Its notes, of course, are written in the genomes, and every day evolution wakes up, changes a few nucleotides, sees how the organism works. If it was an improvement, evolution keeps the notes; if it was disadvantageous, evolution discards the notes.

That, by the way, for those of you working in labs, is no longer considered appropriate laboratory practice.

You're obliged to keep your laboratory notes from failed experiments, as well, but evolution got into this before those rules were codified, and so it discards the notes from unsuccessful experiments, and keeps the notes from the successful experiments. But nonetheless, we have all the notes from the successful experiments, and we can learn a tremendous amount from it. There's a volume on the shelf corresponding to each species on the planet. There's a volume on the shelf corresponding to each individual within each species, to each tissue within each individual within each species, and there's information there about the DNA sequence, about the RNA readouts, about the protein expression levels, and in principle, even if not yet in practice, we can pull down any volume we want, and interrogate it, and compare it for related species, for individuals within a species, some of whom might have a disease, some of whom might not, for different kinds of tissues treated in different ways.

That is, I think, going to be a tremendous theme of biology going forward, and that's why it's a particular pleasure to teach biology at MIT, where you guys understand what that could mean, that fusion could mean. Now, this idea of extracting genomic information in large-scale, is a relatively new one. In the mid-1980's, the scientific community began debating what was a pretty radical idea, sequencing the human genome.

This was floated in a couple of places, in 1984 at one meeting, somebody raised the idea, you've got to realize that sequencing itself, that sequencing DNA, only came from the late 70's, so within six, seven years of being able to sequence anything, people were now saying, let's sequence everything.

That was a reasonably audacious thing to do, and it was controversial. There were many people who felt that the human genome project was a terrible idea, and with good reason, because the initial version of the human genome project was, kind of, a blunderbuss approach.

It was, let's immediately mount a massive factory and start sequencing the human genome with the just horrible technologies of the mid-80's, with radioactive sequencing gels, and you know, lots and lots of people doing stuff.

And so, you know, many people in science were concerned that an entire generation of students would need to be chained to the bench, sequencing DNA. Sydney Brenner, a great molecular biologist, proposed the whole thing be done at penal institutions [LAUGHTER], because you know, people could be sentenced to 20 million bases, with time off for accuracy, or things like that [LAUGHTER]. And so what happened was, the scientific community came together well, in its best form.

A group was put together by the National Academy of Sciences, who said, well look, this is a really good idea, but we also need a carefully thought-through program to do it.

We need intermediate goals that will get us things that will advance the science along the way, we need to improve the technologies, and they laid out a plan. The goals of that plan: to develop a genetic map, a map showing the locations of DNA polymorphisms, sites of variation, genetic markers, just like Sturtevant did with fruit flies, but to do it with humans, and with DNA sequence differences, to be used to trace inheritance.

That genetic map could be used to map human diseases, and if all you accomplished was to get a genetic map of the human being, that would be a good thing. Then you could get a physical map of the human being, all the pieces of DNA overlapping each other, so that you would know if you had a genetic marker linked to cystic fibrosis, you would be able to get the piece of DNA that contains the gene. Then, if we managed to pull that off, we could get a sequence of the human genome, all three billion nucleotides, on the web, so that you could go to just any place on the genome, double-click, and up would pop the sequence. Now, you guys of course, don't laugh at that, but about eight years ago, when I would give talks about this, I would speak about, oh you'll be able to go double-click and up will pop the sequence, and of course, everybody thought that was really funny, and that was something people laughed at. But of course, you can just do that today, if anybody has a wireless you can just double-click, and up will pop the sequence. And then, of course, a complete inventory of all the genes within that sequence. And very importantly, and from the very beginning, the notion that all this information should be completely, freely available to anybody, regardless of where they were, whether in academia, or industry, in first world, third world countries, that everybody should have free and unrestricted access to that information.

So a plan was laid out, I won't go into the details here, but the plan was laid out that involved work constructing genetic maps, physical maps, sequence maps, in the human, the mouse, and some model organisms, including bacteria, yeast, fruit flies, worms. And, quite remarkably, it largely went according to plan, over the course of about 15 years.

A lot of people in the scientific community came together and took up different tasks. I should say, with some pride, that MIT was by far, one of the leading contributors to this effort, having been involved in essentially every stage of this, the genetic mapping of human and mouse, the physical mapping of human and mouse, and the sequencing of human and mouse, and having been the leading contributor to the latter, and it's not an accident because MIT's a marvelous environment in which to undertake this kind of research.

It involved changing the way we do biology. Back in the mid-80's, when we sequenced DNA, we did it with radioactivity, remember I taught you how to sequence using radioactive label on a gel, and all that. That's how we did it, stood behind this plastic shield, and you loaded the gels. Of course, now it's done in a highly automated fashion. This is the production floor at the Broad Institute, which is here at MIT, where robots prepare all the DNA samples, so E. coli's grown up, and then you have to crack open the cells, purify the DNA, purify the plasmid, do a sequencing reaction, etc., etc. It's all done robotically there, and this is capable of processing, and does process, about 200,000 samples per day. They then go, and this is all equipment designed by people here at MIT, and then commercially built for us. They then go to the back room where, actually, these are the previous generation of DNA sequencers, commercial detectors, those capillary detectors that have little lasers on them, there's a whole farm of them that sit there, and are able to get data out.

In the course of a single day, we can now generate about 40 billion bases, I'm sorry, in the course of a single year we can generate about 40 billion bases of DNA sequence.
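To put that throughput in perspective, here is a small back-of-the-envelope sketch. The function names are hypothetical, and the 8x coverage value is an assumption made only for illustration (the lecture does not state a coverage figure); the 3 billion base genome size and the 40 billion bases per year come from the lecture itself.

```python
def genome_equivalents(bases_per_year, genome_size):
    """How many genome-lengths of raw sequence one year of output represents."""
    return bases_per_year / genome_size


def years_for_coverage(genome_size, coverage, bases_per_year):
    """Years of raw output needed to read every base `coverage` times on average."""
    return genome_size * coverage / bases_per_year


HUMAN_GENOME = 3_000_000_000      # about three billion nucleotides (from the lecture)
YEARLY_OUTPUT = 40_000_000_000    # about 40 billion bases per year (from the lecture)
ASSUMED_COVERAGE = 8              # hypothetical redundancy, chosen only for illustration

print(genome_equivalents(YEARLY_OUTPUT, HUMAN_GENOME))
print(years_for_coverage(HUMAN_GENOME, ASSUMED_COVERAGE, YEARLY_OUTPUT))
```

Raw output is not the same as finished sequence: shotgun-style reads overlap heavily, so a genome has to be read several times over before it can be assembled.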

The genome project itself, was a collaboration involving 20 different groups around the world, groups in the United States, United Kingdom, France, Germany, and Japan, and China. They were of different sizes, they used different approaches, but everybody was committed to one common cause of producing this information, and making it freely available, and everybody worked together. And for the rest of my life, when it comes to Friday, at 11 o'clock, I will always think genome project, because we had a weekly conference call of all the groups in the world working on this Fridays, at eleven, and it was a fascinating experience, there were many, many years of that. So a draft sequence, a rough draft sequence of the human genome, was published in the year, in February of 2001, it was announced with some fanfare in June of 2000, but the real scientific paper came out in February of 2001.

This was not a perfect sequence of the human genome, by any means. We discovered about 90% of the sequence of the human genome. It still had about 150,000 gaps in it, it had errors. But, it still did have 90% of the sequence of the human genome.

For the next three years, people worked very hard, and, as of last April, a finished sequence of the human genome was produced, and was published a couple weeks ago, and it contains, our best guess, about 99% of the human genome, and it still has about 343 gaps. We know what they are, we know where they are, but they're not sequenceable with current technology.

That's the “finished human genome”. What is it like? Well, this is a picture of the genome, do we have a pointer, yes, I see here we do have a pointer. This is your genome here, this is chromosome number 11, and I'll call attention to some interesting bits. So these colored lines here, represent genes, or gene-predictions, based on both, sequencing of the DNA, and mapping them back to the genome, as well as computer programs that analyze the genome.

And, right here, you have a big pileup of lots of genes, very few genes over here. Lots of genes, few genes. Notice the places where there are lots of genes match up with these light-grey bands, which are the light bands you see under the microscope on chromosomes. The places with very few genes match up with the dark bands in the chromosome.

Do you know why that is, that the gene-rich regions are these light bands, and the gene-poor regions are the chromosome dark bands? Me neither. Nobody has a clue. It's really just one of these things. We had no reason to expect that we'd see these striking patterns, and other genomes, E. coli, don't have this dense, urban cluster, and these big, rural plains that are gene-poor. This is very weird, and it's distinctive to mammals. You'll also notice that the gene-rich regions, here, are rich in G's and C's, they have different distributions of some repeat elements, it's all sorts of weirdness that comes from just looking at the genome. The biggest weirdness was the number of genes. The count of genes is, our best guess, about 22,000 genes, if I had to pick a number today, it would be our count of genes, and of course, that's down from the 100,000 that was in some textbooks, and it's down from even 30 to 40,000 that was in the genome paper of February, 2001.

Our best guess is that it's really just about that range.

Genes, themselves, are very interesting.

When you look at, you know, if we only have 22,000 genes we know of, how do we manage to run a human being with so few genes?

It is, by the way, probably fewer genes than the mustard weed, or Arabidopsis thaliana. So, what do we do? Well, humans, one thing we may take comfort in, is that we, although we only have about 22,000 genes, there's a lot of alternative splicing, on average the typical gene, on average, has about two alternative splice products.

Some have many, some have few, but probably, when you're all done, those 22,000 genes may encode 70-80,000 different proteins, and it could be more than that because we don't know all the alternative splice products, and what they do. But, if you ask, do humans get credit for being really inventive or creative, for having lots of new genes that make us human, the answer is, no.

Not only are humans not different in their gene complement from other mammals, mammals, as a group, really haven't invented that much, when you get down to it. Most of the recognizable sub-domains of proteins, proteins are built up of sub-domains, recognizable sequences that have certain motifs that fold up in certain ways, or carry out certain enzymatic functions.

And it looks like our genomes, our genes, are mixed-and-matched combinations of many domains that were invented a long time ago, in invertebrates and before, and that most of evolutionary innovation in the more complex, multi-cellular animals, has simply been mixing-and-matching these domains in new ways, to get slightly different functions.

You don't get a lot of points for creativity, but it does seem to work.

By far, the most derivative of all, and what characterizes our genome tremendously is, when a gene works, make extra copies of it, and let it diverge slightly, and take up new functions. Really, your genome is just characterized by large expansions of families, immunoglobulin-like genes, intermediate filament proteins holding together the cytoskeleton.

There are 111 different keratin-like genes in your genome.

They're all different, they do different things, but they all came from one gene that was copied, copied, copied, randomly duplicated, and then diverged to take up new functions. Growth factors: flies and worms managed to get by just fine, thank you, with two growth factors of the TGF-beta class, whatever that is. You have 42 growth factors of this TGF-beta class, all of which help cells communicate in different ways.

And then, of course, all the olfactory receptors.

In your genome, you have about 1,000 genes for olfactory receptors, for smell receptors. This is what Richard Axel and Linda Buck won a Nobel Prize for this year: their work on the olfactory receptors. Sad to say though, out of all your olfactory receptor genes, most of them are broken. They're mostly pseudogenes.

It's not true in dogs and mice, who keep their olfactory receptor genes in pretty fine-working order, but it's very clear that in primates with color vision, our olfactory receptor genes have been going to seed. They've been piling up mutations, and there's no selective pressure to keep many of them.

And, in fact, we've now shown, in a paper that will come out soon, that this process is accelerating dramatically in the last 7 million years since we diverged from chimps. And so, humans have almost completely lost interest in smell, that's not totally true, some of these olfactory receptors surely matter for various processes, but most of them are probably irrelevant right now.

And so, anyway, that's the nature of the genes there.

Anyway, another interesting fact that's worth mentioning about your genome is that half of your genome consists of transposable elements, elements that simply duplicate themselves, and hop around the genome. Elements that are like viruses: they make a copy, sometimes in RNA, and the RNA is copied back into DNA and slammed elsewhere in your genome. Of these elements, there are four classes.

Alu elements, LINE elements, and retrovirus-like elements all go through RNA intermediates, and use reverse transcription.

And then there are certain DNA transposons, which go through a DNA intermediate. As for the number of copies of the Alu element that have hopped around your genome, you have about a million; you have a million fossils of this element. You say, why is it there, and the answer is, because it's there. Because anything that knows how to make a copy of itself, and insert itself into the genome, you can't get rid of. You can consider it, if you wish, an infection, but half of your genome consists of an infection with these kinds of transposable elements.

Well, it's very interesting, what's the effect? Well, they do, some of them are transcribed and, it's very interesting.

Sometimes it's bad: one of them will hop into a gene and mutate it, and that's bad, that person will have a lethal mutation. But the genome has probably begun to use them, and count on their being there. So, when a transposable element goes in and creates a spacing, the genome may come to depend on it. If, for example, an engineering committee came in and cleaned up the genome by getting rid of all the transposable elements, it would surely not work.

Because we have evolutionarily come to count on the spacing there.

It's sort of like, if in some very, some very messy attic, you put a cup of coffee down on top of a stack of papers, those papers may be utterly irrelevant, but now they're holding up that cup of coffee that you put down on it. And if you were to just, poof, magically get rid of them, the cup of coffee would come crashing to the ground.

So, you know it, they're just there, taking up space. Now sometimes, even more than that, a few of them have actually been co-opted into being human genes.

We know that a few of these transposable elements have mutated into being our genes that do something for us.

And others of them may do things in affecting the general neighborhood with regard to transcription, and so, instead of it being a parasite, think of them as a symbiont, that's a genomic symbiont, which takes some advantage of us, and we may, you know, have worked out a compromise to take some advantage of it.

Every time a copy is made of these, and it hops in the genome, some mutations may happen in the master element, but when it lands in the new place, we have a record of that hop. And if you reconstruct the sequence of the million Alu elements, you can see which ones are very close relatives of each other, and had to have hopped recently, and which ones are somewhat more distant relatives.

And you can build an evolutionary tree connecting all of the repeat elements that have hopped around your genome, and thereby attaching a date to each of them, as to when they hopped.

So it really is a fossil record, and you can figure out how many of them have been hopping at different times over history.
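The dating idea can be sketched in a few lines of Python. This is only an illustration under loud assumptions: each copy is already aligned to a consensus of its family, divergence is a simple mismatch fraction with no correction for multiple hits, and the substitution rate passed in is a hypothetical placeholder, not a value from the lecture.

```python
def divergence(copy_seq, consensus):
    """Fraction of aligned positions where a repeat copy differs from the consensus."""
    if len(copy_seq) != len(consensus):
        raise ValueError("sequences must be aligned to the same length")
    mismatches = sum(a != b for a, b in zip(copy_seq, consensus))
    return mismatches / len(consensus)


def approximate_age_years(div, rate_per_site_per_year):
    """Rough age of an insertion: divergence divided by an assumed neutral rate."""
    return div / rate_per_site_per_year


# Hypothetical example: one mismatch in ten positions, assumed rate of 2e-9 per site per year.
d = divergence("ACGTACGTAC", "ACGTACGTAA")
age = approximate_age_years(d, 2e-9)
```

Copies with low divergence would be read as recent hops and copies with high divergence as old ones, which is what lets you bin insertions by age.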

And we can even make a plot of that. This is long ago, and sometime here, some 30 million years ago, there was a huge explosion in transposition in our genome.

We don't know why that happened, but it's very interesting, it does correspond to very interesting periods of primate evolution.

And then, interestingly, there's been a huge crash, and transposition has dropped dramatically. We have no clue why this is, but we have a whole fossil record here of the rate of transposition of different kinds of repeat elements around our genome, and people are now starting to try to figure out what in the world this means. So all this is sort of there, inherent in the sequence, and if you want the sequence, as I say, you can go to the web and pull all this stuff now. So how do we understand the sequence? Well, I've told you a little bit about it, from the simple things that we've done, but there's a lot more that needs to be learned about the sequence, so what I really want to turn to, is how we're extracting information out of this sequence.

So, DNA sequence is long and boring. It's only marginally more interesting than reading your hard disk, because it has four letters instead of ones and zeros, but, you know, it's really pretty boring if you take a look at it. How do you attach meaning to all this stuff? One of the most powerful ways is by comparison with other genomes. And so, comparing the human genome to the mouse genome is very informative in many ways.

So, as soon as the human genome was far along, a portion of the international consortium, set to work getting a sequence of the mouse genome. And that was published in December of 2002. We have a nice map of the mouse genome, with all these things, it, too, shows these gene-rich regions, gene-poor regions, all sorts of funny things. And if we look closely at a portion of the human genome over here, I've picked about a million bases of the human genome, and we take any little spot in that million bases of the human genome, let's say over here.

And we take the DNA sequence corresponding to this spot, and we run it in the computer against the mouse genome, and ask where in the mouse genome we get the best match for this. The best match to this is here. Now let's do it for this piece, here. The best match anywhere in the mouse genome lands in the same million bases of the mouse genome. In fact, for every single sequence that we pull out from this million bases in the human genome, the best match is in this million bases of the mouse genome. That's very interesting. Why is that? Sorry? No, people do know.

It was a good try, though. [LAUGHTER]. This million bases in the mouse genome, and this million bases in the human genome, represent the evolutionary descendants of a common million bases that occurred in our common ancestor 75 million years ago.

This is clear evidence of evolution here, because we can see that this is a segment of DNA from our common ancestor that really hasn't undergone much rearrangement, and we can just line up the sequences and see.

In fact, we can build a whole map across the mouse genome like this.

For any bit of the mouse genome, I don't know, here's a bit on mouse chromosome 17, this whole stretch corresponds to a portion of human chromosome number eight. This stretch here, I don't know, this green color here on chromosome number six, corresponds to chromosome four in the human. And so, we can build a look-up table that says, for any portion of the human genome, what's the corresponding portion of the mouse genome that came from the same ancestor and has basically the same complement of genes in it. And there are only about 330 such regions that we need to cut-and-paste to get from the human genome order to the mouse genome order, roughly speaking. There are a lot of little local rearrangements, but that's the picture at this gross level. So now, if we go back and look more closely at this region, we now know these two regions descend from a common ancestor. If we do a careful evolutionary analysis, lining up all the sequences, and see how well-preserved the sequences are, some are much better preserved than others. Evolution has been much more lovingly conserving some sequences than others, and so let's now zoom in on a gene. This is a gene that goes by the name PPAR-gamma; I'm fond of this gene, but it doesn't matter. I've indicated all the regions here in which there's a heightened degree of conservation. The sequence is well-conserved here, here, here, here, here, here, here, and here, here, here, here, here, here. These correspond to the exons of the PPAR-gamma gene, they encode the protein of the gene, and the splicing goes like this, OK? These things here do not correspond to the exons. People have no idea what they are; in fact, this is not supposed to be here. The official textbook picture says the vast majority of what matters for a gene, what evolution should preserve, is the exons plus the promoter.

Here's the promoter. But in fact, what we found is that an awful lot more is being preserved. In fact, across the genome, our best estimate is there are about 500,000 conserved elements across the genome, and only 1/3 of them are protein-coding exons.

That means 2/3 of the stuff evolution has been interested in, is not protein-coding exons, and the truth is, we do not know what it is, this was a very radical finding, when this mouse paper came out, about a year and a half, about two years ago now.

What it must be, I think, but we're guessing, are regulatory signals, the structural elements in chromosomes, RNA genes, but there's an awful lot more of it than we had imagined.

And we've, now we're in this fascinating situation, where computational analysis has told us what's on evolution's mind, and now we have to go to the lab and figure out what in the world it does.

But there's no doubt that it must do something, because evolution has preserved it quite well. Now, I oversimplified greatly in this discussion, let me first say, and I'll come back to that. We do know, if we take some of those elements, here's one, there's a 481 base-pair element that's 84% identical between human and mouse. You could write yourself a little statistical model to say that's way unusual to have something that's so well preserved. When Eddy Rubin and his colleagues from Berkeley made a knockout mouse that deleted that segment, this knockout mouse lost regulation of three different genes in the neighborhood, saying that this must be a regulatory sequence that affects multiple genes in the neighborhood. That's one, with about 300,000 such elements to go, in order to attach meaning to them. So doing this entirely by knocking out mice will be a slow process, and one's going to need other ways to be able to attach meaning, but there's no doubt. Now, there are some other interesting papers where people have knocked some of these things out, and they've seen no effect on the mouse. They get a totally viable mouse. Can you conclude from that, that they have no function? Why not? The knockout mouse is viable.

It could be redundant, yes, and you might not be able to knock out both of two things. But it could even not be redundant. Suppose knocking it out affected the mouse's fitness by one part in ten to the third, so that it was only 99.9% as fertile. Would you be able to see that in the laboratory? No. Would that matter to evolution?

It would be lethal, in an evolutionary sense.

Such mutation could never propagate through a population.
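To make the scale of that effect concrete, here is a minimal sketch of a deterministic, haploid, infinite-population model (an assumption chosen for simplicity, not the lecturer's calculation). A variant with relative fitness 1 - s changes frequency each generation as p' = p(1 - s) / (1 - s p), so its odds p/(1 - p) shrink by a factor of (1 - s) per generation.

```python
def next_frequency(p, s):
    """Frequency after one generation of a variant with relative fitness 1 - s."""
    return p * (1 - s) / (1 - s * p)


def generations_until(p0, s, threshold):
    """Count generations until the variant's frequency drops to the threshold or below."""
    p, generations = p0, 0
    while p > threshold:
        p = next_frequency(p, s)
        generations += 1
    return generations


# Even starting (hypothetically) at 50%, a 0.1% disadvantage is steadily purged.
n = generations_until(0.5, 0.001, 0.01)
```

Under these assumptions the variant falls from 50% to 1% in a few thousand generations, which is very short on evolutionary timescales even though a 0.1% fertility difference is far below what a mouse colony could measure.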

One part in ten to the third is massive selection against, from an evolutionary point of view, but almost undetectable in a laboratory batch. Evolution has a far more sensitive assay than we do. Now, I won't go into detail, but for the mathematically inclined here, showing that there really was about 5% of the human genome under evolutionary selection was a complicated affair, because we had only two genomes. Here is what we really had to do, and if this doesn't make sense, ignore it.

We looked at the background distribution of conservation of the genome in unimportant elements, in those repeat elements that we knew to be functionally broken. We looked at the overall conservation of the genome, and found that the overall genome has this rightward tail, by subtracting the distributions we were able to see how much excess conservation there was.
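A toy version of that subtraction, as a sketch only: it assumes you already have histograms of conservation scores (the same bins for both) for windows across the whole genome and for windows in presumed-neutral repeats, and it sums the probability mass by which the genome-wide distribution exceeds the neutral one. The published analysis was more careful than this, and the bin counts below are invented for illustration.

```python
def excess_conservation(genome_counts, neutral_counts):
    """Probability mass by which the genome-wide histogram exceeds the neutral one."""
    g_total = sum(genome_counts)
    n_total = sum(neutral_counts)
    return sum(
        max(0.0, g / g_total - n / n_total)
        for g, n in zip(genome_counts, neutral_counts)
    )


# Hypothetical bins ordered from low to high conservation.
genome = [50, 30, 20]
neutral = [60, 30, 10]
excess = excess_conservation(genome, neutral)  # mass in the rightward tail
```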

That's because we only had two genomes, we had to draw inferences.

If we had more genomes, like the mouse and the rat, and the dog and the-this-and-the-that, we would be able to extract signal from noise.

We would be able to see right away, which bits were well-conserved, and we wouldn't have to do this as a sensitive statistical analysis.

So, in fact, we need more mammalian genomes, so, so right now there's been a sequence of the rat genome in the past year or so, there's a sequence of the dog genome, we're writing up that paper now, but it's on the web already. There's a sequence of the chimpanzee genome we're writing up a paper on that, in collaboration with our friends in the genome-sequencing community.

We're currently sequencing a variety of other organisms, as well. And if you had enough organisms, you ought to be able to just line it up and say, what has evolution preserved, and figure out exactly which nucleotides matter, and which nucleotides don't, are allowed to drift freely, at the background rate. How far could you go with this?

Well, we decided to try an interesting experiment.

We said, since mammalian genomes are very big, and we're going to need a lot of genome sequences, how about we try a small organism, like yeast? What if we were to try to do this kind of evolutionary, genomic analysis on something like the yeast genome? And so, this is work that I'll describe, that was between a bunch of people here at MIT who do genome-sequencing, and a student in computer science, Manolis Kellis, who was a PhD student in computer science and has now just joined the faculty here at MIT in computer science. But it was a really great example of how biology and computer science could come together.

So, the genome-sequencing folks sequenced three species related to our friend, the baker's yeast, Saccharomyces cerevisiae, the workhorse of geneticists. These three different species are separated by different evolutionary distances from Saccharomyces cerevisiae. When you line up their genomes, just like with human and mouse, you find the genes occur largely in the same order, and it's not hard to pick out, oh, there's this gene there, there, it's all lined up, you've got these evolutionary segments, and very few rearrangements have occurred across these species, despite the fact that they're about 20 million years apart in history.

But here's an interesting thing. When the yeast genome, Saccharomyces cerevisiae, was first published in 1995, the paper describing it reported 6,200 genes. Now, how did they know there were 6,200 genes? They ran a computer program looking for open reading frames. Any open reading frame, a run of consecutive codons without a stop that was sufficiently long, was called a gene.

But statistically, you could, by chance, just have a long stretch of codons without a stop codon.

And so, if they saw 100 codons in a row, without a stop, they called it a gene, but it might just be chance.

And they knew that, of course, they wrote that in the paper, but for many years, people then had 6,200 open reading frames, which were the yeast's genes.
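The "might just be chance" point can be put into numbers with a back-of-the-envelope sketch. It assumes random sequence in which all 64 codons are equally likely, which real genomes are not, so treat the result as an order-of-magnitude illustration.

```python
def prob_no_stop(n_codons, stop_codons=3, total_codons=64):
    """Chance that n random codons in a row contain no stop codon (uniform codon model)."""
    return (1 - stop_codons / total_codons) ** n_codons


p_100 = prob_no_stop(100)  # (61/64) ** 100, a bit under 1%
```

A chance of a bit under 1% per candidate window sounds small, but a genome offers a very large number of starting positions and six reading frames, so some spurious long open reading frames are expected.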

Could evolution now tell us which one of them were real and which weren't? Well, it turns out that evolution was tremendously powerful in doing that.

If you take something that's a well-known gene that has been extensively studied by yeast geneticists, you line it up across all four species, you almost never see deletions.

And when you do see the deletions, here in grey, they're always a multiple of three. Why are they a multiple of three?

They preserve the reading frame. By contrast, if I take some clear intergenic DNA, that's not protein-coding, and I compare it across these four species, I see lots and lots of frame-shifting deletions. Evolution tolerates frame-shifting deletions there, and if I just write down the rates, frame-shifting deletions are 75x more common in intergenic DNA than in genic DNA. This provides a very powerful test.
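A minimal sketch of such a test, assuming the four species have already been aligned and gaps are written as `-`. The function names and the toy sequences are illustrative; this is not the published method, only the core idea of asking how many gaps break the reading frame.

```python
import re


def gap_lengths(aligned_seq):
    """Lengths of each run of gap characters in one aligned sequence."""
    return [len(m.group()) for m in re.finditer(r"-+", aligned_seq)]


def frameshift_fraction(aligned_seqs):
    """Fraction of gaps whose length is not a multiple of three."""
    lengths = [n for seq in aligned_seqs for n in gap_lengths(seq)]
    if not lengths:
        return 0.0
    return sum(n % 3 != 0 for n in lengths) / len(lengths)


# Toy alignments: the first looks gene-like, the second looks intergenic.
gene_like = ["ATGGCAGCATTT", "ATG---GCATTT", "ATGGCA------"]
intergenic_like = ["AT-GC--A---T"]
```

Here `frameshift_fraction(gene_like)` is 0.0 because every gap has length 3 or 6, while the intergenic-like toy example has gaps of length 1, 2 and 3, so two thirds of them shift the frame.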

Run this test across the genome, looking for the density of frame shifting deletions, any place that doesn't tolerate frame shifting deletions is probably a real gene, anything that does tolerate it is probably not. When you sorted through all this, it turned out that 528 of the official yeast genes were clearly not real, not real genes. They were just chock-a-block full of these frame shifting deletions. And, and a bunch of others could be confirmed. So the yeast gene count, and I won't tell you all the experimental and other that shows this is right, but the yeast genome has now been revised downward to 5, 00 genes, and we have great confidence that almost all of those are real genes, there are 20 whose origins that we're not sure of, and new genes could be found in this way. Here's a really audacious thing.

This graduate student in computer science said, I think, based on these other species, there was a mistake made in the sequencing of the first yeast, and that the reason these things are called two separate genes, is that somebody made a sequencing error that got a stop codon here, but I think these are really part of one gene. And so, somebody went back and re-sequenced some of these, and sure enough, he had correctly predicted that there had been a mistake made at that letter, and that these were in fact, a single gene.

The computational analysis was incredibly powerful in this regard, and it could go further than this. You could ask, could I also figure out the way genes are regulated in this fashion, could I work out the regulatory signals in the intergenic promoter regions? Remember that the lac repressor binds to a certain operator site; well, all of these regulatory proteins bind to different sequences. Could we figure out what the sequences were, computationally? Well, if we look closely at an intergenic region, here's one where there are two genes being transcribed in opposite directions, gal-1 and gal-10, both involved in galactose metabolism, and there's a particular protein, a transcription factor, called Gal-4, that binds in this region, and it has a particular sequence that it likes: CGG, 11 bases, CCG. And that Gal-4 site, we see, is very well preserved across all of the species.

So, a known regulatory sequence is well-preserved. Now let's look at that closely. This Gal-4 binding site is a measly, crummy six nucleotides of information. At random, it's going to occur in many places in the yeast genome, but not be a real, important Gal-4 site, right? Some of them matter, some of them don't. How do we figure out which of these occurrences are real Gal-4 sites? Well, if we look across all four species, what we find is that those occurrences that occur in promoter regions are much more likely to be conserved by evolution than those that don't. So there's a special property here: conservation of the motif in the promoter regions.

In fact, this particular sequence is four times more likely to be preserved when it occurs in a promoter region than when it occurs in a coding region. And for a typical control sequence, the opposite is true. Since coding sequences are better preserved in general, a randomly chosen sequence, I don't know, ATGGCAT, is more likely to be preserved in coding regions than in non-coding regions.

So this Gal-4 motif has a very funky property that, on average, it's 12x more likely than background, to be preserved when it occurs in a promoter. Now, that's a test you apply to another motif, and another motif.
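The test can be sketched as follows. The assumptions are stated loudly: the Gal-4 motif is written here as CGG-N11-CCG, each alignment is a tuple of equal-length, gap-free strings with the reference species first, and an occurrence counts as conserved only if every other species also matches the motif at the same columns. Real alignments have gaps and need more care.

```python
import re

GAL4 = re.compile(r"CGG[ACGT]{11}CCG")  # assumed form of the motif: CGG, 11 bases, CCG


def conservation_rate(alignments, motif=GAL4):
    """Fraction of motif occurrences in the reference that all other species preserve."""
    total = conserved = 0
    for reference, *others in alignments:
        for match in motif.finditer(reference):
            total += 1
            start, end = match.span()
            if all(motif.fullmatch(other[start:end]) for other in others):
                conserved += 1
    return conserved / total if total else 0.0


# Toy data: one site kept in the second species, one site broken by a single change.
site = "CGG" + "A" * 11 + "CCG"
broken = "CGG" + "A" * 11 + "CTG"
promoters = [("TT" + site + "TT", "TT" + site + "TT"),
             ("TT" + site + "TT", "TT" + broken + "TT")]
rate = conservation_rate(promoters)
```

Running the same function separately on promoter alignments and on coding alignments, and taking the ratio of the two rates, gives the kind of promoter-versus-coding comparison described above; repeating it for other candidate patterns is the "another motif, and another motif" step.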

In fact, you could, by computer, test all possible motifs, and ask which ones have that property. Make a scatter plot: most motifs are better conserved when they occur in coding regions than when they occur in promoter regions; some, however, are better preserved in promoter regions than in coding regions.

Our friend, Gal-4, is up there, but there are a lot more things like it, that are better preserved by evolution in promoters. You can make a list of them. You can get about 72 well-conserved regulatory motifs, and it turns out that 20 years of yeast work produced knowledge about things like the Gal-4 site, and other sites. Almost all the known regulatory sites that had been discovered over the course of 20 years of experimental work appear on this list that falls out of the computer analysis of evolutionary comparison of genomes.

You can actually go a step further, I'll hesitate to tell you, but I'll try anyway. If you wanted to find out, without knowing in advance, what these motifs were doing, what their biological function was, you can do that informationally, too. Suppose I take my motif, Gal-4, and I ask, which genes does it occur in front of? Well, across Saccharomyces cerevisiae, you find this crummy little motif in many, many places because, as I said, most of it's just noise. But if I ask which genes have this motif in all four species, for these genes there's a huge overlap with a class of genes involved in carbohydrate metabolism.

So, if I didn't know in advance that the Gal-4 motif was involved in regulating genes in carbohydrate metabolism, I could tell, just from the fact that the genes that'd conserved it, are genes involved in carbohydrate metabolism.
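One standard way to ask whether such an overlap is "huge" is a hypergeometric tail probability. This is a sketch of that general technique, not necessarily the statistic used in the work described, and the gene counts in the example are made up.

```python
from math import comb


def hypergeom_tail(n_genes, n_in_category, n_with_motif, overlap):
    """P(overlap or more category genes) if motif-bearing genes were chosen at random."""
    upper = min(n_in_category, n_with_motif)
    favorable = sum(
        comb(n_in_category, i) * comb(n_genes - n_in_category, n_with_motif - i)
        for i in range(overlap, upper + 1)
    )
    return favorable / comb(n_genes, n_with_motif)


# Hypothetical numbers: 6,000 genes, 100 in the category, 40 with a conserved motif,
# 15 of which fall in the category.
p_value = hypergeom_tail(6000, 100, 40, 15)
```

A very small value says the overlap is far larger than random gene sets would give, which is what licenses the guess that the motif regulates that class of genes.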

You can do that using all sorts of tricks, expression of genes, protein mass spec, blah, blah, blah, and the short answer is, for almost all of those motifs that you can find in the computer, by consulting public data bases of sets of genes that are co-expressed, or have similar properties and all that, the computer can also offer you a pretty good hypothesis about what that motif is associated with.

You can even go a step further than that. You can begin to look at pairs of motifs. You can say, if I have a certain regulatory sequence, number one, and a second regulatory sequence, number two, do they tend to be preserved in front of the same genes as each other? Is their conservation correlated? And you can build a map of which guys go together: when this guy's conserved, this guy tends to be conserved. And you can say, oh, those proteins must be talking to each other, and you can read that off from the patterns of evolution, as well. There are two regulators, one called Sterile 12, one called Tec1. This computational analysis shows that they tend to co-occur in a conserved fashion, far more often than you'd expect by chance. And when you do the analysis, you find that those genes that just have a conserved Sterile 12 site tend to be involved in mating. Genes that just have a conserved instance of Tec1 tend to be involved in the budding of the yeast, and those genes that have conserved occurrences of both tend to be involved in filamentation. Now all that can be read out, which is way cool; this is not the way we used to do biology.

Now don't get me wrong, there's a ton of experiments that underlay creating these databases, and there's a ton of experiments that have to be done to check any of these things. But what we have is one of the most powerful hypothesis generators that's ever been seen here. Evolution, by telling us what to focus on, is giving us, on a silver platter, hundreds of hypotheses about who's interacting with whom, and sending us back to the lab, then, to test these hypotheses. Now, what are the implications of all of this for the human genome?

Could we do this for the human genome? Well, these species, Saccharomyces cerevisiae, S. paradoxus, S. mikatae and S. bayanus, are they a good model for mammals? Well, it turns out that their evolutionary distance from each other is the same as the distance of human to lemur, to dog, to mouse.

So they were chosen with a purpose. Those are actually fairly good models for the human. So could we do exactly the same analysis for the human, for the entire human genome?

Instead of human, lemur, dog, and mouse, what we have is basically four species: human, mouse, rat, and dog.

Well, there's one little fly in the ointment. The human genome is 20x bigger than the yeast genome. If I want to analyze the whole human genome, I have a problem of signal-to-noise.

The genome is 20x bigger, so I've got 20x as much noise to get rid of. I won't walk you through it, but I need more evolutionary information to get rid of all that noise. And you can do a simple calculation that says my evolutionary tree needs to be bigger: its branch length needs to be bigger by about the natural log of 20, to get rid of 20-fold more noise.

And that would mean I'd need more species, I'd need about 16 species, or something like that to be able to do that. But if I built an evolutionary tree that had a branch length of four, that is, four substitutions per base across this evolutionary tree, as indicated by these colored lines here, I should have enough power to analyze the entire human genome, the way we just did the yeast genome.
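One way to see where the natural log comes from, offered as a sketch under an assumed simple model rather than the lecturer's exact derivation: if substitutions at a neutral base arrive as a Poisson process, the chance that the base is unchanged across a tree of total branch length T (in substitutions per site) is e^(-T). Cutting that chance 20-fold then costs an additional branch length of ln 20, roughly 3.

```python
import math


def prob_unchanged_by_chance(branch_length):
    """Chance a neutral base shows no substitution across the tree (Poisson model)."""
    return math.exp(-branch_length)


def extra_branch_length(fold_noise_reduction):
    """Additional branch length needed to cut chance conservation by the given factor."""
    return math.log(fold_noise_reduction)


extra = extra_branch_length(20)  # about 3.0 substitutions per site
```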

So we currently have human, chimp, mouse, rat, dog. As of this fall, in fact, right at the beginning of this term, the National Institutes of Health signed off on the sequencing of these additional eight mammals. These mammals are now in process, and in fact, the elephant is done, and the armadillo is in process, and the tree shrew, I think, is being caught at the moment.

[LAUGHTER]. The ten-, don't talk about the tree shrews. The tenrec is actually being tested right now, etc., and all this is going on right now, as we speak, and I think that by next summer we should have much of it, and certainly by a year from now we should have all this information to do such an analysis. That said, we're, of course, very impatient people. You could just take the human, the mouse, the rat, and the dog. And I said that's not enough if you wanted to analyze the whole genome, but suppose you just wanted to analyze a portion of the genome, maybe about a yeast-size piece of the genome. Well, let's see, at 20,000 genes, I don't know, suppose I take two kilobases around each of the 20,000 genes, well, that's, you know, 40 megabases of DNA, which is only a couple-fold more than yeast. Maybe, if I just focus on a limited region around each promoter, I could start reading out these regulatory signals with just four species.

So in fact, a postdoctoral fellow has been working on this problem through the spring and summer, together with Manolis Kellis, who's now in the computer science department. And I think we have a preliminary list for the human genome that's fallen out over the course of the past couple of months, and we're in the process, right now, of finishing up a paper that we're hoping to get submitted by Friday, with a preliminary list of regulatory signals in the human genome, read out from evolution of human, mouse, rat, and dog.

It won't be everything, we don't have full power to pick up all possible signals, but we're picking up a lot of the signals, we're picking up a very large fraction of previously discovered signals, and lots more new signals, as well, are falling out of that analysis. So anyway, I can assure you that that's not in the textbooks because, actually, it hasn't been submitted yet. This other stuff I've described about the yeast analysis, this you do want to look up: there's a paper in Nature from about a year and change ago, Kellis et al., that describes this yeast work. This is what's going on.

This is what's fun about teaching at MIT: I can tell you this stuff, and you guys have a sense for the convergence that's going on in our field. In trying to make the biology clear, I've talked about how the different directions, genetics and biochemistry, have converged together. What we're really seeing now is information sciences converging with that as well, and I've got to say, it's a tremendous amount of fun. See you on Monday, good luck on the quiz.

Within-species variability in gene content

For every acquired gene for which a role in a radical species-creating LGT event might be inferred, there will be dozens or hundreds more whose contributions - if any - to evolutionary novelty remain unknown. And even within species as traditionally defined there can be enormous strain-to-strain variation in gene content. In a survey of 33 clusters of strains (with 2-11 genomes per cluster) that would be considered species by the greater than 94% ANI criterion, we find anywhere from 1 to 4,404 genes per cluster that are present in some strains but absent from others (O. Zhaxybayeva, C.L. Nesbø and W.F.D, unpublished work). From a similar study, Konstantinidis and Tiedje [7] observe that strains of the same species by this criterion "can vary up to 30% in gene content", and raise the possibility of resetting the 'species' to something like a 99% ANI cut-off.

Five years ago, when only the tip of the iceberg of variability in gene content was visible, Lan and Reeves [8] suggested that we look at 'species genomes' as comprising a core set (all genes present in at least 95% of strains) and an auxiliary set (present in 1-95% of strains). Something like this notion is embraced in the more recently articulated 'pangenome' concept, this term denoting the total number of genes found in at least one of the strains of a species [21]. In some species (such as Bacillus anthracis) the depth of the pangenome may have been plumbed after only a few genomes have been sequenced. For others, such as the ecologically versatile Streptococcus agalactiae, Tettelin et al. [22] suggest that "unique genes will continue to be identified even after sequencing hundreds of genomes."
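As a concrete illustration of these definitions, the sketch below partitions genes into a core set and an auxiliary set from per-strain gene lists, using the 95% threshold quoted above. The function name, strain names and gene names are hypothetical, and real analyses must first decide which genes in different strains count as "the same gene".

```python
def partition_pangenome(strain_genes, core_threshold=0.95):
    """Split the pangenome into core genes (in >= threshold of strains) and the rest."""
    n_strains = len(strain_genes)
    pangenome = set().union(*strain_genes.values())
    counts = {
        gene: sum(gene in genes for genes in strain_genes.values())
        for gene in pangenome
    }
    core = {gene for gene, c in counts.items() if c / n_strains >= core_threshold}
    auxiliary = pangenome - core
    return core, auxiliary


# Hypothetical three-strain example.
strains = {
    "strain1": {"geneA", "geneB"},
    "strain2": {"geneA", "geneC"},
    "strain3": {"geneA", "geneB"},
}
core, auxiliary = partition_pangenome(strains)
```

Here `geneA` is core because it is present in every strain, `geneB` and `geneC` are auxiliary, and the pangenome is the union of the two sets. Watching how the union grows as strains are added is one way to ask whether the pangenome has been "plumbed".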

This variability, we would argue, makes highly problematic one of the more appealing 'magic bullets' proposed for recognizing species as coherent natural units in the environment, namely as tight clusters of strains with very similar sequences for certain marker genes (sometimes 16S rRNA, sometimes more rapidly evolving genomic regions). Such 'microdiverse' clusters (Figure 1) are often observed in environmental surveys in which marker genes are amplified by PCR from environmental DNA samples, and have been interpreted in terms of Cohan's 'ecotype' model for bacterial species [5, 11, 23, 24]. This model imagines that genomic coherence within ecotypes is maintained by periodic selection, as discussed above, while barriers between ecological niches (spatial, temporal or nutritional) prevent genomes that sweep to fixation in one niche from invading another (Figure 2). The minor variations in marker gene sequences within a microdiverse cluster of isolates from a given site would then just be neutral substitutions accumulated since the last diversity-purging genomic sweep of the ecotype.

Figure 1. Microdiversity and diversity in gene content. Environmental surveys, using PCR amplification and sequencing of marker genes such as 16S rRNA or more rapidly evolving protein-coding genes and intergenic spacers, often reveal microdiverse clusters of strains with closely related sequences. The diagram shows a hypothetical phylogenetic tree compiled from such sequences, with each cluster indicated by a set of circles of the same color. Such a pattern of clustering by sequence might be expected if there were processes other than random divergence and extinction of lineages at play (see Figure 2), and has been attributed [11,23,24] to an ecotype speciation process (see text). In this context, a microdiverse cluster might generally be a species. Comparisons of sequenced genomes for multiple strains of many designated species, and of genome sizes from isolates of others, show, however, that gene content can vary by up to 30% among different lineages of strains, even when the 'species' marker genes are identical in sequence [25]. The different sizes of the circles represent on an exaggerated scale the diversity in genome size in closely related strains found by such studies.

Figure 2. Models of processes that promote genomic coherence. (a) The ecotype species concept and (b) the biological species concept both entail processes that lead to genomic coherence within populations and divergence (horizontal dimension) between populations. Black arrowheads indicate organisms or isolates. The crosses in (a) indicate the clones eliminated in the process, while the red arrows in (b) indicate recombination between genomes. Blue lines indicate speciation. (c) If only random lineage splitting and lineage extinction occurred, coherence would not be expected, and the designation of speciation events (dashed blue lines) would be arbitrary. In the ecotype (periodic selection) model in (a), which is applicable to organisms without significant genetic recombination, favorable mutations sweep to fixation, carrying the genome in which they first occurred along, so that diversity is reduced to zero at all loci. Accumulation of neutral mutations, prior to the next sweep, generates the sort of microdiversity illustrated in Figure 1. Gray bars are niche boundaries. In the biological species model, it is individual favorable mutations that are fixed, because recombination (indicated by red arrows) separates them from alleles at other loci in the genome in which they first occurred. Still, recombination at all loci will in time promote genomic coherence within populations and divergence between populations, because with time all alleles at all loci will be traceable to mutations that occurred within the population. The gray block indicates a barrier to recombination.

The problem here (as we might have predicted from the comparisons of sequenced 'conspecific' genomes discussed above) is that these same strains may be enormously more diverse in gene content than they are in gene sequence (see Figure 1). In a survey of genome sizes of Vibrio splendidus isolates by pulsed-field gel electrophoresis, in which all the isolates were greater than 99% identical at the 16S level and all taken from a single site (albeit at multiple times) on the coast of Massachusetts, Thompson et al. [25] concluded that "this group consists of at least a thousand distinct genotypes, each occurring at extremely low environmental concentrations (on average less than one cell per milliliter)." Genome sizes varied by as much as 1 Mb among them. The authors' suggestion that much of the observed genome size (and hence gene content) variation may be selectively neutral is attractive. What clearly cannot be supported, however, is the notion that species qua ecotypes are genomically coherent.

Unlocking giraffeness

When the team probed the genome further, they identified almost 500 genes that are either unique to giraffes or contain variants found only in giraffes.

A functional analysis of these genes showed that they are most often associated with growth and development, nervous and visual systems, circadian rhythms, and blood pressure regulation, all areas in which the giraffe differs from other ruminants. As a consequence of their tall stature, for example, giraffes must maintain a blood pressure that is roughly 2.5 times higher than that of humans in order to pump blood up to their brain. In addition, giraffes have sharp eyesight for scanning the horizon, and because their strange bodies make it difficult for them to stand quickly, they sleep lightly, often standing up and for only minutes at a time, likely a result of changes during evolution to genes that regulate circadian rhythms.

Within those hundreds of genes, FGFRL1 stood out. In addition to being the giraffe’s most divergent gene from other ruminants’, its seven amino acid substitutions are unique to giraffes. In humans, this gene appears to be involved in cardiovascular development and bone growth, leading the researchers to hypothesize that it might also play a role in the giraffe’s unique adaptations to a highly vertical life.

To test this idea, Heller and his team used CRISPR to create mice with the giraffe-type FGFRL1 gene. Inserting the giraffe-specific gene didn’t cause any drastic changes to how the mice looked—they didn’t, as the team initially hoped, sprout the giraffe’s iconic long neck—but there were what Heller calls “more subtle changes.”

The bones of prenatal mice with the giraffe genotype grew more slowly compared to unaltered mice. Once born, however, the CRISPR mice quickly grew to a comparable size. When the researchers looked more closely at the bones’ structure, they saw that the mice with the giraffe variant had a slightly higher bone mineral density, a compensatory mechanism that keeps fast-growing bones from becoming structurally weak. “What we tentatively hypothesize is that . . . this gene is doing something to help the giraffe grow strong bones despite having the fastest growth rate of bones of any known animal,” Heller says.

Douglas Cavener, a molecular biologist at Penn State who was part of the team that sequenced the first giraffe genome, tells The Scientist that, despite the lack of an obvious morphological change, he agrees with the team’s hypothesis. “I suspect FGFRL1 of being critically involved in the giraffe-specific differences in the skeleton, but there are other genes that are necessary as well” that haven’t been built into the CRISPR mice, Cavener says. “FGFRL1 . . . may be necessary, but it’s not sufficient.”

To assess whether FGFRL1 helps giraffes cope with the hypertension necessary to push blood throughout their long bodies, Heller’s team next injected five mutant mice and five normal mice with a drug called angiotensin-II that induces high blood pressure. They also included five mutant mice that did not receive the drug as a control. After 28 days, the normal mice had developed hypertension and were beginning to suffer from heart and kidney damage. The giraffe-type mice, meanwhile, were largely unaffected, a finding that strongly suggests FGFRL1 is protective against lifelong high blood pressure in giraffes.

“What really makes this paper significant is the experiments that they did with the infusion of angiotensin,” says Julian Lui, a staff scientist at the National Institute of Child Health and Human Development who was not involved in the study. These results, he tells The Scientist, give “insight into one part of the giraffe story because the giraffe has such unique evolutionary adaptations for dealing with hypertension.”

In addition to cultivating a more complete understanding of giraffe genetics—knowledge that may be useful in protecting them, as the species is listed as vulnerable to extinction by the International Union for Conservation of Nature—insight into FGFRL1 could help efforts to develop treatments for high blood pressure in humans.

Heller adds that while there’s no evidence yet that FGFRL1 is associated with heart disease in people, it’s a promising place to start looking. “When we find these genes that are linked to phenotypes that we are interested in as humans, it’s natural to at least ask the questions,” Heller tells The Scientist. “What we have done here is identify a new variant of a gene that may have a dramatic impact on controlling hypertension in some settings. That makes it an interesting gene for further study.”

The Octopus Genome: Not “Alien” but Still a Big Problem for Darwinism

These days, new genomes of different types of organisms are being sequenced and published on a regular basis. When some new genome is sequenced, evolutionary biologists expect that it will be highly similar to the genomes of other organisms that are assumed to be closely related.

As ENV already noted, the latest organism to have its genome sequenced has confounded that expectation: the octopus, whose genome was recently reported in Nature. It turns out to be so unlike other mollusks and other invertebrates that it’s being called “alien” by the scientists who worked on that project.

Not to send you into a meltdown or anything but octopuses are basically ‘aliens’ — according to scientists.

Researchers have found a new map of the octopus genetic code that is so strange that it could be actually be an “alien”.


“The octopus appears to be utterly different from all other animals, even other molluscs, with its eight prehensile arms, its large brain and its clever problem-solving abilities,” said US researcher Dr Clifton Ragsdale, from the University of Chicago.


Analysis of 12 different tissues revealed hundreds of octopus-specific genes found in no other animal, many of them highly active in structures such as the brain, skin and suckers.

Obviously no one thinks the octopus is an “alien” from another planet. (Nature News quotes one co-author of the paper on the genome noting that the alien quip is a “joke.”) But it certainly is alien to standard evolutionary expectations that genomes of related species ought to be highly similar. Thus, Nature points out the large number of unique genes found in the octopus genome:

Surprisingly, the octopus genome turned out to be almost as large as a human’s and to contain a greater number of protein-coding genes — some 33,000, compared with fewer than 25,000 in Homo sapiens.

This excess results mostly from the expansion of a few specific gene families, Ragsdale says. One of the most remarkable gene groups is the protocadherins, which regulate the development of neurons and the short-range interactions between them. The octopus has 168 of these genes — more than twice as many as mammals. This resonates with the creature’s unusually large brain and the organ’s even-stranger anatomy. …

A gene family that is involved in development, the zinc-finger transcription factors, is also highly expanded in octopuses. At around 1,800 genes, it is the second-largest gene family to be discovered in an animal, after the elephant’s 2,000 olfactory-receptor genes.

The analysis also turned up hundreds of other genes that are specific to the octopus and highly expressed in particular tissues. The suckers, for example, express a curious set of genes that are similar to those that encode receptors for the neurotransmitter acetylcholine. The genes seem to enable the octopus’s remarkable ability to taste with its suckers.

Scientists identified six genes for proteins called reflectins, which are expressed in an octopus’s skin. These alter the way light reflects from the octopus, giving the appearance of a different colour — one of several ways that an octopus can disguise itself, along with changing its texture, pattern or brightness.

The technical paper explains that the octopus genome reveals “massive expansions in two gene families previously thought to be uniquely enlarged in vertebrates: the protocadherins, which regulate neuronal development, and the C2H2 superfamily of zinc-finger transcription factors.” Moreover:

We identified hundreds of cephalopod-specific genes, many of which showed elevated expression levels in such specialized structures as the skin, the suckers and the nervous system.

They conclude: “Our analysis suggests that substantial expansion of a handful of gene families, along with extensive remodelling of genome linkage and repetitive content, played a critical role in the evolution of cephalopod morphological innovations, including their large and complex nervous systems.” In other words, the cephalopod genome is unusual in many major respects, unlike other organisms we’ve sequenced.

Actually, that’s not completely correct. There are some peculiar similarities between the cephalopod genome and something else they’ve seen, but they aren’t the kind of similarities that were predicted by common descent. The technical paper notes that the cephalopod genome bears an unexpected resemblance in certain respects to vertebrate genomes, and since these similarities aren’t predicted by common descent, the authors predictably attribute them to convergent evolution:

the independent expansions and nervous system enrichment of protocadherins in coleoid cephalopods and vertebrates offers a striking example of convergent evolution between these clades at the molecular level.

Indeed, even within cephalopods they found evidence of convergent evolution (i.e., genetic similarity that didn’t fit the expectations of common descent): “Surprisingly, our phylogenetic analyses suggest that the squid and octopus protocadherin arrays arose independently. Unlinked octopus protocadherins appear to have expanded ~135 Mya, after octopuses diverged from squid.”

But the big story here is the large number of unique genes found in the octopus genome. The technical paper elaborates on one of these major gene groups:

The octopus genome encodes 168 multi-exonic protocadherin genes, nearly three-quarters of which are found in tandem clusters on the genome (Fig. 2b), a striking expansion relative to the 17-25 genes found in Lottia [a limpet], Crassostrea gigas (oyster) and Capitella [a polychaete worm, an annelid] genomes.

The paper doesn’t even try to speculate about how these unique cephalopod genes might have arisen. The standard view — that new genes originate via gene duplication — is hardly mentioned. But invoking gene duplication requires one to find another gene elsewhere that’s similar. Given that cephalopods apparently have many unique genes not similar to genes found in other organisms, gene duplication might not be a candidate explanation in many of these cases. One wonders if future investigators will resort to “de novo” gene origin.

What’s that? Stephen Meyer explains in Darwin’s Doubt:

Remember: ORFans, by definition, have no homologs. These genes are unique — one of a kind — a fact tacitly acknowledged by the increasing number of evolutionary biologists who attempt to “explain” the origin of such genes through de novo (“out of nowhere”) origination.


Many other papers invoke de novo origination of genes. Long mentions, for example, a study seeking to explain the origin of an antifreeze protein in an Antarctic fish that cites “de novo amplification of a short DNA sequence to spawn a novel protein with a new function.” Likewise, Long cites an article in Science to explain the origin of two human genes involved in neurodevelopment that appealed to “de novo generation of building blocks — single genes or gene segments coding for protein domains,” where an exon spontaneously “originated from a unique noncoding sequence.” Other papers make similar appeals. A paper in 2009 reported “the de novo origin of at least three human protein-coding genes since the divergence with chimp[s],” where each of them “has no protein-coding homologs in any other genome.” An even more recent paper in PLoS Genetics reported “60 new protein-coding genes that originated de novo on the human lineage since divergence from the chimpanzee,” a finding that was called “a lot higher than a previous, admittedly conservative, estimate.”

Another 2009 paper in the journal Genome Research was appropriately titled “Darwinian Alchemy: Human Genes from Noncoding RNA.” It investigated the de novo origin of genes and acknowledged, “The emergence of complete, functional genes — with promoters, open reading frames (ORFs), and functional proteins — from ‘junk’ DNA would seem highly improbable, almost like the elusive transmutation of lead into gold that was sought by medieval alchemists.” Nonetheless, the article asserted without saying how that: “evolution by natural selection can forge completely new functional elements from apparently nonfunctional DNA — the process by which molecular evolution turns lead into gold.”

The presence of unique gene sequences forces researchers to invoke de novo origin of genes more often than they would like. After one study of fruit flies reported that “as many as ~12% of newly emerged genes in the Drosophila melanogaster subgroup may have arisen de novo from noncoding DNA,” the author went on to acknowledge that invoking this “mechanism” poses a severe problem for evolutionary theory, since it doesn’t really explain the origin of any of its “nontrivial requirements for functionality.” The author proposes that “preadaptation” might have played some role. But that adds nothing by way of explanation, since it only specifies when (before selection played a role) and where (in noncoding DNA), not how the genes in question first arose. Details about how the gene became “preadapted” for some future function is never explained. Indeed, evolutionary biologists typically use the term “de novo origination” to describe unexplained increases in genetic information; it does not refer to any known mutational process. (Darwin’s Doubt, pp. 216, 220-221.)

In other words, de novo isn’t an explanation at all. It’s more like a magic wand to be invoked when evolutionary biologists encounter some unique gene and they have no way to explain its origin via duplication from a similar pre-existing gene. (As an evolutionary mechanism, gene duplication has its own issues.)

Nonetheless, a recent article in Quanta Magazine points out just how many recent scientific studies have resorted to calling upon de novo origin of genes:

For most of the last 40 years, scientists thought that this was the primary way new genes were born — they simply arose from copies of existing genes. The old version went on doing its job, and the new copy became free to evolve novel functions.

Certain genes, however, seem to defy that origin story. They have no known relatives, and they bear no resemblance to any other gene. They’re the molecular equivalent of a mysterious beast discovered in the depths of a remote rainforest, a biological enigma seemingly unrelated to anything else on earth.

The mystery of where these orphan genes came from has puzzled scientists for decades. But in the past few years, a once-heretical explanation has quickly gained momentum — that many of these orphans arose out of so-called junk DNA, or non-coding DNA, the mysterious stretches of DNA between genes. “Genetic function somehow springs into existence,” said David Begun, a biologist at the University of California, Davis.

If the idea that “Genetic function somehow springs into existence” doesn’t sound compelling to you, join the club. But that’s about as much detail as you’re likely to get from proponents of de novo gene origination. One proponent of this idea in the article is even quoted saying: “It’s hard to see how to get a new protein out of random sequence when you expect random sequences to cause so much trouble.” Unfortunately for evolutionists, this problem seems to be common among animals, as the Quanta article continues:

This metamorphosis was once considered to be impossible, but a growing number of examples in organisms ranging from yeast and flies to mice and humans has convinced most of the field that these de novo genes exist. Some scientists say they may even be common. Just last month, research presented at the Society for Molecular Biology and Evolution in Vienna identified 600 potentially new human genes. “The existence of de novo genes was supposed to be a rare thing,” said Mar Albà, an evolutionary biologist at the Hospital del Mar Research Institute in Barcelona, who presented the research. “But people have started seeing it more and more.”

Whenever you see “de novo” origin of a gene invoked, you know that evolutionary biologists lack any explanation for how that gene arose. Scientists haven’t had much time yet to analyze the cephalopod genome, but given early reports of many unique genes, it will be interesting to learn to what extent they are forced to invoke these mysterious processes — what amounts to evolution ex nihilo — to explain how this “alien” genome arose.

Image: Minoan clay vase, c. 1500 BCE, by Wolfgang Sauber (Own work) [GFDL or CC BY-SA 3.0], via Wikimedia Commons.

Planned, In-progress, and Private genome sequencing efforts (a partial list)

  • The genome of the dwarf birch tree Betula nana is currently being assembled by Richard Buggs at the University of London.
  • The sunflower genome project was announced in early 2010. While it's far too early to predict when this genome will be released, it is still worth mentioning, because species within the sunflower genus (Helianthus) have genome sizes around 3000 megabases (sometimes substantially more), making this genome a candidate to steal from maize/corn the position of largest sequenced plant genome. More information here (warning: this is a PDF-formatted press release)
  • Bayer CropScience announced they have a complete genome sequence for canola (Brassica napus) as well as varieties of Brassica rapa and Brassica oleracea. These aren't being released publicly, but from what I've heard they are open to collaborating with individual researchers who want access to the data.
  • Several different companies have announced that they have sequenced the genome of the oil palm, but to the best of our knowledge none of these sequences are publicly available. News reports of the sequencing:
  • Seed plant genomes listed by JGI (the Joint Genome Institute, part of the US Department of Energy) in approximate order of progress (as best I can tell I've listed those closest to completion at the top):
    • Boechera stricta Arabidopsis relative
    • Seagrass (Zostera marina) a monocot.
    • Loblolly Pine (Pinus taeda)
    • Arabidopsis halleri Arabidopsis species
    • Boechera holboellii Arabidopsis relative
    • Miscanthus giganteus a biofuel crop not unlike switchgrass
    • Panic grass (Panicum hallii) a switchgrass relative
    • Arabidopsis arenosa an Arabidopsis species
    • Boechera divaricarpa an Arabidopsis relative
    • Switchgrass (Panicum virgatum)
    • Purple willow (Salix purpurea)
    • Pear (Pyrus communis (?) ) (Prunus avium)
    • Other Genomes in progress culled from abstracts from the 2011 Plant and Animal Genome Conference

    New human gene tally reignites debate

    One of the earliest attempts to estimate the number of genes in the human genome involved tipsy geneticists, a bar in Cold Spring Harbor, New York, and pure guesswork.

    That was in 2000, when a draft human genome sequence was still in the works; geneticists were running a sweepstake on how many genes humans have, and wagers ranged from tens of thousands to hundreds of thousands. Almost two decades later, scientists armed with real data still can’t agree on the number, a knowledge gap that they say hampers efforts to spot disease-related mutations.

    The latest attempt to plug that gap uses data from hundreds of human tissue samples and was posted on the BioRxiv preprint server on 29 May [1]. It includes almost 5,000 genes that haven’t previously been spotted, among them nearly 1,200 that carry instructions for making proteins. And the overall tally of more than 21,000 protein-coding genes is a substantial jump from previous estimates, which put the figure at around 20,000.

    But many geneticists aren’t yet convinced that all the newly proposed genes will stand up to close scrutiny. Their criticisms underscore just how difficult it is to identify new genes, or even define what a gene is.

    “People have been working hard at this for 20 years, and we still don’t have the answer,” says Steven Salzberg, a computational biologist at Johns Hopkins University in Baltimore, Maryland, whose team produced the latest count.

    In 2000, with the genomics community abuzz over the question of how many human genes would be found, Ewan Birney launched the GeneSweep contest. Birney, now co-director of the European Bioinformatics Institute (EBI) in Hinxton, UK, took the first bets at a bar during an annual genetics meeting, and the contest eventually attracted more than 1,000 entries and a US$3,000 jackpot. Bets on the number of genes ranged from more than 312,000 to just under 26,000, with an average of around 40,000. These days, the span of estimates has shrunk — with most now between 19,000 and 22,000 — but there is still disagreement (See 'Gene Tally').

    Source: M. Pertea & S. L. Salzberg

    The gene count can vary depending on the data being analysed, the tools used and the criteria for weeding out false positives. The latest count used a larger data set and different computational methods from previous efforts, as well as broader criteria for defining a gene.

    Salzberg’s team used data from the Genotype-Tissue Expression (GTEx) project, which sequenced RNA from more than 30 different tissues taken from several hundred cadavers. RNA is the intermediary between DNA and proteins. The researchers wanted to identify genes that encode a protein and those that don’t but still serve an important role in cells. So they assembled GTEx’s 900 billion tiny RNA snippets and aligned them with the human genome.

    Just because a stretch of DNA is expressed as RNA, however, does not necessarily mean it’s a gene. So the team attempted to filter out noise using a variety of criteria. For example, they compared their results with genomes from other species, reasoning that sequences shared by distantly related creatures have probably been preserved by evolution because they serve a useful purpose, and so are likely to be genes.

    The team was left with 21,306 protein-coding genes and 21,856 non-coding genes — many more than are included in the two most widely used human-gene databases. The GENCODE gene set, maintained by the EBI, includes 19,901 protein-coding genes and 15,779 non-coding genes. RefSeq, a database run by the US National Center for Biotechnology Information (NCBI), lists 20,203 protein-coding genes and 17,871 non-coding genes.
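The size of the gap between the new tally and the two databases can be checked with a few lines of arithmetic. The sketch below uses only the counts quoted in the paragraph above. The dictionary layout and the `percent_increase` helper are illustrative choices, not part of any of the projects mentioned.

```python
def percent_increase(new: int, old: int) -> float:
    """Relative increase of `new` over `old`, in percent."""
    return 100.0 * (new - old) / old


# Counts as quoted in the text.
databases = {
    "GENCODE": {"protein_coding": 19_901, "non_coding": 15_779},
    "RefSeq": {"protein_coding": 20_203, "non_coding": 17_871},
}
new_tally = {"protein_coding": 21_306, "non_coding": 21_856}

increases = {
    name: {kind: percent_increase(new_tally[kind], count)
           for kind, count in counts.items()}
    for name, counts in databases.items()
}

for name, by_kind in increases.items():
    for kind, pct in by_kind.items():
        print(f"{name:8s} {kind:15s} +{pct:.1f}%")
```

For protein-coding genes the increase comes out at roughly 5% relative to RefSeq and roughly 7% relative to GENCODE. The non-coding tallies differ by a much larger margin.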

    Kim Pruitt, a genome researcher at the NCBI in Bethesda, Maryland, and a former head of RefSeq, says the difference is probably due in part to the volume of data that Salzberg’s team analysed. And there’s another major difference. Both GENCODE and RefSeq rely on manual curation — a person reviews the evidence for each gene and makes a final determination. Salzberg’s group relied solely on computer programmes to sift the data.

    “If people like our gene list, then maybe a couple years from now we’ll be the arbiter of human genes,” says Salzberg.

    But many scientists say they need more evidence to be convinced that the list is accurate. Adam Frankish, a computational biologist at the EBI who coordinates the manual annotation of GENCODE, says that he and his group have scanned about 100 of the protein-coding genes identified by Salzberg’s team. By their assessment, only one of those seems to be a true protein-coding gene.

    And Pruitt’s team looked at about a dozen of the Salzberg group’s new protein-coding genes, but didn’t find any that would meet RefSeq’s criteria. Some overlapped with regions of the genome that seem to belong to retroviruses that invaded our ancestors’ genomes; others belong to other repetitive stretches, which are rarely translated into proteins.

    But Salzberg says that some repetitive sequences can be considered genes. One example is ERV3-1, which appears in RefSeq and encodes a protein that is overexpressed in colorectal cancer. Salzberg also acknowledges that the new genes on his team’s list will require validation by his team and others.

    Further confounding counting efforts is the imprecise and changing definition of a gene. Biologists used to see genes as sequences that code for proteins, but then it became clear that some non-coding RNA molecules have important roles in cells. Judging which are important — and should be deemed genes — is controversial, and could explain some of the discrepancies between Salzberg’s count and others.

    Still, it’s likely that at least some of the genes identified by Salzberg’s group will turn out to be valid, says Emmanouil Dermitzakis, a geneticist at the University of Geneva in Switzerland, who co-chairs the GTEx project. He isn’t surprised that the team’s count for protein-coding genes is a 5% increase on previous tallies, given the gargantuan size of the GTEx data set.

    Having an accurate tally of all human genes is important for efforts to uncover links between genes and disease. Uncounted genes are often ignored, even if they contain a disease-causing mutation, Salzberg says. But hastily adding genes to the master list can pose risks, too, says Frankish. A gene that turns out to be incorrect can divert geneticists’ attention away from the real problem.

    Still, the inconsistencies in the number of genes from database to database are problematic for researchers, Pruitt says. “People want one answer,” she adds, “but biology is complex.”

    The DNA of three aurochs found next to the Elba shepherdess opens up a new enigma for paleontology

    Artistic reconstruction of the Elba shepherdess, accompanied by the three aurochs found at the site, whose mitochondrial DNA has been analyzed. Credit: José Antonio Peñas (SINC)

    Research involving scientists from the University of A Coruña has succeeded in sequencing the oldest mitochondrial genome of the immediate ancestor of modern cows that has been analyzed to date. The remains, some 9,000 years old, were found next to a woman. Why were they with her if cattle had not yet been domesticated? Do they belong to ancestors of today's Iberian cows?

    Humans have maintained a very close relationship with aurochs (Bos primigenius) since their beginnings, first by hunting them and then by breeding and selecting them. This extinct species of mammal is little known in the Peninsula because its skeletal remains are difficult to distinguish from bison. In fact, there have been references to the presence of "large bovids" in many sites because they cannot be differentiated. At a European level, there is also a lack of genetic data.

    An international team of scientists has managed to extract mitochondrial DNA from ruminants from different periods in Galicia. They have analyzed the remains of B. primigenius from the Chan do Lindeiro cave (Lugo). These remains were found in a chasm together with the human fossils of the shepherdess of O Courel, "Elba", dated at around 9,000 years old. The aurochs analyzed are not the oldest ones discovered, but they are the oldest ones whose mitochondrial DNA has been sequenced so far. Interestingly, although they were found together, they are genetically very different.

    "Their discovery in the chasm together with a human is a great enigma. Given all the evidence, such as their similar chronology and the fact that the bones are intermingled at the base of a slump caused by the sinking of the ground -at a depth of 15 to 20 meters-, we think that the woman and the aurochs were found together. This interpretation is controversial because domestication is not regarded as having existed at the time," as Aurora Grandal, a researcher at the University of A Coruña and the co-author of the study published in the PLoS ONE journal, has explained to SINC.

    The analysis of their mitochondrial DNA has not allowed these three aurochs to be related to the modern cows of the Peninsula. To investigate this possible relationship, the next step for the research team is to analyze the nuclear DNA.

    Until now, different varieties of aurochs have been described on the basis of morphology alone. The three analyzed in this study belong to haplogroup P, which is characteristic of the species. However, they differ from each other in a large number of base pairs [pieces that make up the genetic sequences], which is striking considering they are coeval. "This may indicate that they came from different origins, in a scenario in which the Elba woman played an active role, or it may simply reflect very high genetic variability among the aurochs," says the researcher.
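
    To make "differing in a large number of base pairs" concrete, here is a minimal Python sketch of how pairwise differences between aligned sequences can be counted. It is not the method used in the study: the function name `count_differences`, the sample labels, and the ten-letter toy fragments are all hypothetical, and real mitochondrial genomes are roughly 16,000 base pairs long and would need to be aligned first.

```python
from itertools import combinations


def count_differences(seq_a: str, seq_b: str) -> int:
    """Count positions where two aligned sequences carry different bases.

    Positions where either sequence has a gap ("-") or an unknown base ("N")
    are skipped, since ancient DNA often has missing data.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    skip = {"-", "N"}
    return sum(
        1
        for a, b in zip(seq_a.upper(), seq_b.upper())
        if a != b and a not in skip and b not in skip
    )


# Hypothetical toy fragments for illustration only, NOT real aurochs data.
samples = {
    "aurochs_1": "ACGTACGTAC",
    "aurochs_2": "ACGTTCGTAC",
    "aurochs_3": "ACCTACGNAT",
}

# Number of differing positions for every pair of samples.
pairwise = {
    (a, b): count_differences(samples[a], samples[b])
    for a, b in combinations(samples, 2)
}

for (a, b), n in pairwise.items():
    print(f"{a} vs {b}: {n} differing positions")
```

    In a comparison like this, closely related females from one herd would be expected to show few or no differences, so large counts between coeval animals are what makes the Galician result notable.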

    The origin of cattle domestication in the North

    Domestic cattle were introduced into Spain by the first settlers and agricultural societies. Due to the absence of Neolithic sites in Galicia, very little is known about the process in this region.

    To extract information about the introduction of this livestock in Galicia, researchers sampled 18 cattle fossils of different ages from different Galician mountain caves, of which eleven were subjected to mitochondrial genome sequencing and phylogenetic analysis.

    Fossils of the three aurochs found in Galicia and analyzed in this study. Credit: UDC

    The study of the three aurochs revealed their kinship with aurochs from other parts of Europe. "By studying their mitochondrial DNA, which is transmitted almost intact from mother to offspring, we can determine in which geographical areas the different lineages predominated and what their movements were due to changes in climatic conditions or even to humans following the onset of livestock farming," the paleontologist and veterinarian Amalia Vidal, co-author of the study at the same university, tells SINC.

    Thanks to the DNA, it is possible to know whether the native aurochs contributed to local livestock or whether, on the contrary, the livestock were imported animals, "with all the information that this provides about the movement of bovine and human populations," Vidal continues.

    The data show a close relationship between the first domesticated cattle in Galicia and modern cow breeds and provide an overview of cattle phylogeny. The results of the study indicate that settlers migrated to this region of Spain from elsewhere in Europe and introduced the European cow breeds now common in Galicia.

    Aurochs related to the British

    "Specifically, these aurochs are more closely related to the aurochs of the British Isles than to the Central European specimens. British aurochs are more recent than those from Galicia. This may be related to the role of the Peninsula as a glacial refuge and the origin of the later recolonisation of the islands," Grandal points out.

    These three coeval animals are small and have relatively short horns compared to those of northern Europe, and their morphology is different.

    The researchers are now endeavoring to analyze the nuclear DNA of the three aurochs, which will allow them to learn about the possible contributions of these individuals to later domestic livestock. "For example, fragments of nuclear DNA from the British aurochs can be recognized in some breeds of northern European cows. This shows that there was a genetic contribution from aurochs to the already domestic cattle. We are going to look for possible contributions from our aurochs to Iberian cows, whether present-day or fossil," Grandal stresses.

    In recent years, the scientific community has shown growing interest in the origins of domestic animals, and a large number of projects aim to reconstruct their ancestors. One reason is that these ancestral species are considered hardier and better able to adapt to harsh environmental conditions.

    "Early projects sought to generate phenotypes similar to the species they were trying to recreate (as was done with Heck cattle), but more modern projects also use DNA as a source of information," Vidal concludes.

    Reconstruction of the woman Elba in the Museo Xeolóxico de Quiroga. Credit: MUXEQ

    Ancestors of bulls and cows

    The social organization of aurochs herds is assumed to have been similar to that of their domesticated bovine descendants: a single male with his group of females, who is replaced by another male as he weakens.

    The new males, when they reach adulthood, do not remain in the group, whereas the females do. In this way, it is normal for females of the same group to be related, which means that their mitochondrial lineages are similar.

    The domestic cow comes from the domestication of the aurochs, albeit not in the Iberian Peninsula but in Asia, specifically in the Middle East, and from a small number of aurochs. From there, the domestic cow spread along with humans to occupy the whole of Europe.

    In Italy, some researchers claim that the already domesticated cows had genetic contributions from local aurochs. The same holds for the British Isles. The contribution of local aurochs to cows is best observed in the nuclear DNA and was detected in some cases in northern European breeds.

    In the north of the peninsula, the oldest domestic cows are about 6,000 to 7,000 years old.


