Making Personalized Medicine Practical


Personalized medicine will bring with it the problem of storing and processing the vast
amounts of genetic information needed to tailor medical care to individual needs. Stanford
electrical engineers have an answer.
By Andrew Myers
Research News
Genome compression improved
Stanford engineers learn to store and process the vast genetic information needed to tailor
medical care to individuals.
In 2003, the Human Genome Project culminated in the successful sequencing of the more than 3
billion base pairs making up a single human genome, costing an international consortium of
researchers 13 years and $3 billion to complete.
Today, similar sequencing can happen in weeks for about $4,000. But soon, science will realize
the hallowed $1,000 genome, the symbolic marker of entry into the era of personalized
medicine, in which people have their DNA sequenced to help tailor medical care to their specific
needs.
Such great progress comes at a cost, however: the sheer amount of space required to store
and process all that information. The amount of raw data required to obtain a single human
genome can consume many terabytes of storage. There is now so much data that medical
science is having a hard time retaining it, much less processing it.
But a team led by Stanford electrical engineers has compressed a completely sequenced human
genome to just 2.5 megabytes, small enough to attach to an email. The engineers used what is
known as reference-based compression, relying on a human genome sequence that is already
known and available. Their compression has improved on the previous record by 37 percent. The
genome the team compressed was that of James Watson, who co-discovered the structure of
DNA more than 60 years ago.
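The article does not spell out the team's method, but the general idea of reference-based compression can be sketched in a few lines: rather than storing a genome outright, store only the positions where it differs from an already-known reference sequence. The toy Python sketch below is an illustration under that assumption; the sequences and the variant encoding are made up and greatly simplified (real genomes also contain insertions and deletions), and this is not the Stanford group's actual algorithm.

```python
# Toy illustration of reference-based genome compression (not the Stanford
# team's actual algorithm): store only the positions where a newly sequenced
# genome differs from a publicly known reference, then rebuild it on demand.

def compress_against_reference(reference: str, genome: str) -> list[tuple[int, str]]:
    """Return (position, base) pairs where `genome` differs from `reference`.

    Assumes both sequences are aligned and equally long -- a simplification;
    real genomes also contain insertions and deletions.
    """
    return [(i, b) for i, (r, b) in enumerate(zip(reference, genome)) if r != b]

def decompress(reference: str, variants: list[tuple[int, str]]) -> str:
    """Rebuild the genome by applying the stored differences to the reference."""
    seq = list(reference)
    for pos, base in variants:
        seq[pos] = base
    return "".join(seq)

if __name__ == "__main__":
    reference = "ACGTACGTACGTACGT"   # stand-in for a known reference genome
    genome    = "ACGTACGAACGTACGC"   # stand-in for a newly sequenced genome
    variants = compress_against_reference(reference, genome)
    print(variants)                  # [(7, 'A'), (15, 'C')]
    assert decompress(reference, variants) == genome
```

Because any two human genomes agree at the overwhelming majority of positions, the list of differences is tiny relative to the 3 billion bases, which is why a fully sequenced genome can shrink to a few megabytes.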
From left, post-doctoral scholars Zhiying Wang, Dmitri Pavlichin and Mikel Hernaez, graduate
student Idoia Ochoa and Associate Professor Tsachy Weissman have developed a way to store
and process the vast amounts of genetic information needed to tailor medical care to individual
needs. (Photo: Rod Searcey)
"On the surface, this might not seem like a problem for electrical engineers," said Tsachy
Weissman, an associate professor of Electrical Engineering. "But our work in information theory
is guiding the development of new and improved ways to model and compress the incredibly
voluminous genomic data the world is amassing." In addition to Weissman, the team included
Golan Yona, a senior research engineer in Electrical Engineering, and Dmitri Pavlichin, a
post-doctoral scholar in Applied Physics and Electrical Engineering.
Genomic data compression is necessary for efficient storage, of course, but also for swiftly
transferring and communicating data for various post-sequencing applications and analysis that
will divine from the genetic information what diseases a person might be suffering from, is
susceptible to or is in the process of developing. The analysis also helps determine what
therapies and medications might best be suited to a particular person at a particular juncture in
time. These are the promises of personalized medicine that effective genomic data compression
would enable.
The need to retain as much detail as possible of the raw measurements is particularly acute with
genomic data. Imagine discarding an important mutation or, worse, introducing a non-existent
one into a patient's DNA, both of which might adversely influence any number of crucial medical
decisions. This would seem to suggest that the only acceptable compression mode is lossless,
in which all of the data is perfectly retained in the decompression, in contrast to lossy
compression, in which some of the data is lost or distorted in the decompression.
In music and video compression, for instance, lossy compression systems are able to achieve
considerable data size reductions by discarding parts of the signals to which the human ear
and/or eye are not sensitive. Such loss of information is more than offset by the convenience of being
able to carry or stream entire libraries of music to virtually any location in seconds.
The graphic above shows the rapidly decreasing cost of sequencing a human genome. (Graphic:
National Institutes of Health, National Human Genome Research Institute)
Traditionally, the tradeoff in data compression has been between high-quality-but-large files and
smaller-but-distorted files. Weissman and his group have recently been working toward
disrupting this tradeoff: shrinking file sizes while maintaining or even boosting the data integrity.
"With genomic data, accuracy really matters. But, then again, so does storage and processing
time, so we are searching for a solution," Weissman said.
To achieve even more dramatic levels of compression, Weissman, Idoia Ochoa, a graduate
student in the department of Electrical Engineering, and Mikel Hernaez, a postdoctoral fellow in
the same department, are pursuing collaborations with researchers from the medical school
focused on compressing what are known as the "quality scores" that accompany DNA
sequences. Today's high-speed genetic sequencers provide readouts consisting of segments of
a genome's 3 billion base pairs and include corresponding sequences assessing the reliability of
these reads. A quality score is assigned to every base pair in every read, conveying the
likelihood that it is correct.
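The article does not show what a quality score looks like in practice. A common convention, which the sketch below assumes, is the Phred scale used in FASTQ files, where each base call carries a score encoding the probability that the call is wrong; the read and quality string here are invented examples.

```python
# Sketch of how per-base quality scores are commonly represented (Phred scale,
# as used in FASTQ files). The read and quality string below are made-up examples.

def phred_to_error_prob(q: int) -> float:
    """A Phred score Q corresponds to an error probability of 10^(-Q/10)."""
    return 10 ** (-q / 10)

def ascii_to_phred(symbol: str) -> int:
    """FASTQ encodes each score as a printable character offset by 33."""
    return ord(symbol) - 33

read    = "GATTTGGGGTTCAAAGCAGT"   # a short hypothetical sequencer read
quality = "!''*((((***+))%%%++)"   # one quality character per base

for base, q_char in zip(read, quality):
    q = ascii_to_phred(q_char)
    print(f"{base}: Q={q:2d}, P(error) = {phred_to_error_prob(q):.3f}")
```

On this scale a score of 30 corresponds to roughly a 1-in-1,000 chance that the base call is wrong; storing one such score for every base in every read is what makes the quality data so bulky.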
Quality scores are useful for boosting the reliability of the genome assembly process and are
particularly important for genotyping, the process of determining differences in the genetic
make-up of an individual relative to that of a reference sequence. But storing quality scores
significantly increases the file size, often taking the majority of the pre-compression disk space of
the raw genome data. Compression of the quality scores can significantly reduce file size as well
as speed the transmission, processing and analysis of the data.
In recording quality scores, DNA sequencers introduce all sorts of imperfections that are
collectively considered noise. Different sequencers have different noise characteristics.
Weissman and his team are developing theory and algorithms for processing the quality scores
in a way that reduces the noise and at the same time results in significant compression.
Counterintuitive as it might sound at first, they are using lossy compression as a mechanism not
only for considerable reduction in storage requirements, but also for enhancing the integrity of
the data.
"But, in fact, it is quite intuitive," Weissman said. "Lossy compression, when done right, forces the
compressor to discard the part of the signal which is hardest to compress, namely, the noise."
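The group's actual algorithms are not described here, but one simple way to see how lossy compression of quality scores can both shrink the data and suppress noise is coarse binning: many nearby score values are collapsed onto a single representative, so small, noise-like fluctuations disappear and the remaining symbols compress far better. The bin boundaries in this Python sketch are arbitrary assumptions for illustration only, not the Stanford group's method.

```python
# Toy illustration of lossy quality-score compression by binning: nearby Phred
# values collapse to one representative, which reduces the number of distinct
# symbols (so entropy coders compress them far better) and smooths small
# sequencer fluctuations that behave like noise. Bin edges are assumed here.

# (lower bound, upper bound, representative value) -- assumed bins
BINS = [(0, 9, 6), (10, 19, 15), (20, 29, 25), (30, 41, 37)]

def quantize(q: int) -> int:
    """Map a Phred quality score to its bin's representative value."""
    for lo, hi, rep in BINS:
        if lo <= q <= hi:
            return rep
    return BINS[-1][2]   # clamp anything above the top bin

raw_scores = [32, 33, 31, 34, 12, 30, 35, 33]   # made-up noisy scores from one read
quantized  = [quantize(q) for q in raw_scores]
print(quantized)   # [37, 37, 37, 37, 15, 37, 37, 37] -- far fewer distinct values
```

With only a handful of distinct values left, a standard entropy coder can squeeze the scores dramatically, and the spurious wobble between, say, 31 and 34 is gone, which is the sense in which lossy compression can enhance rather than degrade the data.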
There is still much more to be done, and there is potential for significant improvement in lossless
compression of genomic data. DNA of individuals in a population, when considered collectively,
is in some ways similar to a video file. Each individual's genome is akin to a frame of a movie. A
new frame can be represented very succinctly using previous frames as references.
Fortunately for the human race, there are growing databases of known genomes. One of these,
known as the 1,000 Genomes Project, has led to a database containing the full genomes of more
than 1,000 people. By using the database as a reference, Weissman and collaborator Tom
Courtade, an assistant professor of Electrical Engineering at the University of California,
Berkeley, are working with students from both institutions to achieve significant further lossless
compression. They've already reduced the size of the file needed to represent a new human
genome that is not in the database by another order of magnitude, to about 200 kilobytes.
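The article does not explain how the database is used, but the movie analogy suggests one possible scheme: predict each block of a new genome from whichever database genome matches it best, and store only that pointer plus the residual differences. The sketch below, reusing the toy diff encoding from the earlier example, is a rough illustration under that assumption and not the actual Stanford/Berkeley method.

```python
# Toy sketch of compressing a new genome against a *database* of known genomes,
# in the spirit of video coding: each block of the new sequence is predicted
# from whichever database genome matches it best, and only the leftover
# differences are stored. Purely illustrative assumptions throughout.

def diff(block: str, prediction: str) -> list[tuple[int, str]]:
    """Positions and bases where `block` departs from `prediction`."""
    return [(i, b) for i, (p, b) in enumerate(zip(prediction, block)) if p != b]

def compress_with_database(genome: str, database: list[str], block: int = 8):
    """For each fixed-size block, record (best reference index, residual diffs)."""
    encoded = []
    for start in range(0, len(genome), block):
        target = genome[start:start + block]
        candidates = [ref[start:start + block] for ref in database]
        best = min(range(len(database)), key=lambda k: len(diff(target, candidates[k])))
        encoded.append((best, diff(target, candidates[best])))
    return encoded

database = ["ACGTACGTACGTACGT", "ACGTTCGTACGAACGT"]   # stand-in reference panel
genome   = "ACGTTCGTACGTACGA"                         # new genome to encode
print(compress_with_database(genome, database))
# [(1, []), (0, [(7, 'A')])] -- each block points at its closest reference
```

The better the database covers human variation, the smaller the residuals become, which is why a richer reference panel pushes the compressed size down toward the 200-kilobyte figure and, ultimately, toward the fundamental limit Weissman describes below.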
"At this point we feel like we might be approaching the fundamental limits on compression of an
individual's genome given a reference database consisting of genomes of many other
individuals," Weissman said. "Whether or not we are truly approaching this limit is a question we
are tackling. The answer will be exciting either way."
Monday, November 10, 2014

Visit engineering.stanford.edu for the latest news and events from Stanford Engineering.

