Deciphering the information hidden in genomes
Updated: 18 Oct 2009, 10:05 PM IST
Last week, a company called Complete Genomics Inc. announced 10 new customers for its genome-sequencing service. The price was not specified, but the firm said its goal is to offer the service for $5,000 within a year.
What struck me was not the announcement itself, but the name of the chief executive: Cliff Reid. When I knew him in the late 1980s, he was chief of a text-search company called Verity Inc., and yes, it is the same Cliff Reid. The connection hit me almost immediately. Genes are, in a sense, the instruction language for building humans (or any other living thing). And language is symbols that interact to build meaning.
What I find interesting are the implications. Right now, a genome is akin to a novel written in an unknown language. There is a huge amount of information in there, but we can’t understand it.
Imagine getting a copy of Tolstoy’s War and Peace in Russian and (assuming you can’t read Russian) trying to figure out the story. Impossible. That was pretty much the state of natural language understanding at the time Reid joined Verity.
On the other hand, we have started recognizing some words (specific genetic variants) that seem to correspond to certain incidents in history. In genetics, those incidents are diseases and conditions. And just as it usually takes several individuals to cause an incident, so it often takes several genetic variations working together, plus ambient factors such as a person’s diet or behaviour, to cause a disease.
There are two key challenges in genomics. One is simply detecting the genes, alone or in combination, that seem to lead to certain diseases. That alone can be useful. With enough data, we can then figure out that what looks like a single disease is, in fact, a variety of different disorders, some susceptible to particular known treatments and some susceptible to others or simply incurable.
For this, mere correlation is sufficient. People with HER2-positive breast cancer benefit from treatment with Herceptin, whereas those with other kinds of breast cancer do not. Even without a full account of the mechanism, the correlation alone is clear enough to guide treatment.
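To make the idea of a usable correlation concrete, here is a minimal sketch in Python. It compares how often carriers and non-carriers of some genetic marker respond to a treatment, and summarises the gap as an odds ratio. The function names and the patient counts are invented purely for illustration; they are not drawn from any study or from Complete Genomics.

```python
def response_rate(responders: int, non_responders: int) -> float:
    """Fraction of a group that responded to the treatment."""
    return responders / (responders + non_responders)


def odds_ratio(carrier_resp: int, carrier_nonresp: int,
               noncarrier_resp: int, noncarrier_nonresp: int) -> float:
    """Odds of responding among marker carriers divided by the odds among non-carriers."""
    return (carrier_resp * noncarrier_nonresp) / (carrier_nonresp * noncarrier_resp)


# Hypothetical counts, made up for this example only.
carriers = (40, 10)       # (responded, did not respond)
non_carriers = (15, 35)   # (responded, did not respond)

print(f"carrier response rate:     {response_rate(*carriers):.2f}")
print(f"non-carrier response rate: {response_rate(*non_carriers):.2f}")
print(f"odds ratio:                {odds_ratio(*carriers, *non_carriers):.2f}")
```

An odds ratio well above 1 says the marker and the response travel together. It says nothing about why, which is exactly the limit of this first challenge.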
The second challenge is to understand how the genes interact among themselves or with other factors to produce the condition, which should enable the development of new preventive measures or treatments based on the details of how the condition begins and how it progresses.
That, of course, is much more interesting—and harder to do. In a sense, it’s the difference between matching words and understanding a piece of text.
So, it is no surprise that Reid has found a role in this new marketplace. Complete Genomics and its competitors are about to create huge amounts of data. Complete Genomics’ edge is not just sequencing the genomes cheaply but also refining the data into lists of variations. In other words, for most research, the questions revolve not around an entire genome, but around the relevant differences of any individual’s genome from the norm.
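As a rough illustration of what "refining the data into lists of variations" means, the Python sketch below compares one individual's sequence with a reference and reports only the positions where they differ. It is a toy under strong assumptions: the two sequences are already aligned and of equal length, and only single-letter substitutions are considered. Real pipelines must also handle alignment, insertions, deletions and sequencing errors. The function name and the sequences are hypothetical.

```python
def list_variants(reference: str, sample: str) -> list[tuple[int, str, str]]:
    """Return (position, reference_base, sample_base) for every mismatch.

    Positions are 1-based. Assumes the sequences are pre-aligned and equal in length.
    """
    if len(reference) != len(sample):
        raise ValueError("this sketch only handles equal-length, pre-aligned sequences")
    return [
        (pos, ref_base, sample_base)
        for pos, (ref_base, sample_base) in enumerate(zip(reference, sample), start=1)
        if ref_base != sample_base
    ]


# Made-up ten-letter sequences, for illustration only.
reference = "ACGTACGTAC"
sample = "ACGTTCGTAA"

print(list_variants(reference, sample))
# [(5, 'A', 'T'), (10, 'C', 'A')]
```

The output is far smaller than the input, which is the point: researchers work with the short list of differences rather than the whole genome.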
There are common differences, such as the differences between blue eyes and brown eyes, or even between people likely to have Crohn’s disease and those who are unlikely to have it. Then there are differences that result simply from a broken gene, which is not a variant but simply a mistake. Most of these are harmless; the really harmful ones don’t survive long enough to show up anywhere.
The researchers’ task is to find meaning from all this data. We’re just at the beginning of this process, which will take many years. While some researchers are looking for statistical correlations, others are studying how the individual genes interact.
For all of them, access to genome sequences is important. But the genomes mean little without the corresponding medical records, just as the Russian novel—in any language—means little without a corresponding knowledge of Russian history.
Obtaining that history requires consent from the individuals whose genomes are sequenced. It also requires a lot of data processing to make the records usable. Much of the information is simply not recorded. And much is still on paper, or in scanned images, insurance records and pharmacy transactions. There is a standard language for representing diseases, but in many cases the records containing this language might as well be hidden under mattresses.
The current movement in many developed countries towards electronic medical records will improve healthcare directly, but it will also lead to much improved information liquidity to help genetic and other medical research.
We now have the ability to sequence genomes at ever lower cost, and we are slowly making the corresponding health information computer-readable. Several firms are developing software that can process the data.
There is, of course, still a huge amount of data to collect and process, and huge amounts of research and discovery to happen. But it is hard not to be optimistic about our increasing medical knowledge. The challenge five years from now will be to turn that knowledge into practice through better preventive measures, better drugs, and better care.
Esther Dyson, chairman of EDventure Holdings, is an active investor in a variety of start-ups around the world. Her interests include information technology, healthcare, private aviation and space travel.