
ORGANIC DATA MEMORY

Using the DNA Approach

BY PAK CHUNG WONG, KWONG-KWOK WONG, AND HARLAN FOOTE

COMMUNICATIONS OF THE ACM January 2003/Vol. 46, No. 1 95

For very long-term storage and retrieval, encode information as artificial DNA strands and insert into living hosts. As vectors, bacteria, even some bugs and weeds, might be good for hundreds of millions of years.

A data preservation problem looms over today’s information superhighway. Ancient humans preserved their knowledge by engraving bones and stone. About two millennia ago people invented paper to publish their thoughts. Today, we use magnetic media and silicon chips to store our data. But bones and stone erode, paper disintegrates, and electronic memory degrades. All these storage media require constant attention to maintain their information content, and all are easily destroyed by people and natural disasters, whether intentionally or by accident. In light of the vast amount of information being generated every day, it’s time to find a new medium.

Searching for an inexpensive, long-lasting medium for information storage, scientists at the Pacific Northwest National Laboratory (PNNL) are investigating deoxyribonucleic acid, commonly known as DNA, to develop a data memory technology with a life expectancy much greater than any existing counterpart. Our initial DNA memory prototype consists of four main steps: encoding meaningful information as artificial DNA sequences; transferring the sequences into living organisms; allowing the organisms to grow and multiply; and eventually extracting the information back from the organisms. Here we describe the objective of this investigation, which began in 1998, and experiments we’ve conducted to determine the feasibility of our approach, as well as several potential applications.

Nature magazine reported a study [1] resembling the first part of our effort: encoding meaningful information as DNA sequences. It described an experiment in which a group of scientists at Mount Sinai School of Medicine in New York created an encoded DNA strand and hid it behind a period (a dot) in a printed document. The document was then sealed and mailed to its owners through the U.S. Postal Service. Eventually, the embedded message was successfully recovered in a laboratory environment.

The article reported that the embedded information survived its rough handling in the mail, proving that a DNA strand can be as dependable as a piece of paper in terms of information storage. It is, however, still far from being able to outlast existing data-memory devices. In fact, a naked DNA molecule is easily destroyed in any open environment inhabited by people or potential enemies of nature. The so-called “double-strand break” of DNA, which is usually fatal, can be caused by common unfavorable environmental conditions, including excessive temperature and desiccation/rehydration. Even nucleases (a kind of DNA-degrading enzyme) in the environment can corrupt DNA molecules over time. Therefore, a key to our success is finding a super-dependable storage medium to ensure adequate protection for the encoded DNA strands. Our solution is to provide a living host for the DNA that tolerates the addition of artificial gene sequences and survives extreme environmental conditions. Perhaps more important, the host with the embedded information must be able to grow and multiply.

Challenges

Recent advances in genetic engineering have allowed the introduction of foreign DNA molecules into the living cells of bacteria, humans, and other organisms. Typically, a short, one-of-a-kind, well-researched DNA strand is applied to a living host for some particular biological study, with little or no intention of retrieving the embedded DNA afterward. This process is somewhat contrary to our basic DNA memory requirements that new and artificial DNA be generated frequently and that we be able to retrieve the embedded information afterward.

These requirements pose serious challenges to our DNA memory design due to the size of a whole genome, which ranges from a few million DNA units in a bacterium to more than three billion in a mouse or human. It is practically impossible to retrieve an embedded message from a whole genome in a wet laboratory without knowing the content or whereabouts of the encoded DNA. The unpredictable nature of genomic mutation represents yet another obstacle, further reducing the odds of locating the message within a whole genome.

Experimental Design

The customized computational and wet-laboratory approach we developed leaves a trail of the embedded message for later retrieval while allowing us to preserve the integrity of the message. Our experiments were carried out in four primary stages:

DNA host identification. In the process of identifying candidates to carry the embedded DNA molecules, we considered microorganisms (such as bacteria) and other agents (such as plants, including Arabidopsis) as message hosts. We eventually selected two well-understood bacteria, Escherichia coli (E. coli) and Deinococcus radiodurans (Deinococcus), because microorganisms generally grow quickly and embedded information can be inherited quickly and continuously. We also considered the physical endurance of the DNA host candidates. Deinococcus, we learned, survives extreme conditions, including ultraviolet light, desiccation, partial vacuum, and ionizing radiation up to 1.6 million rad, or radiation absorbed dose (about 0.1% of this dose is fatal to humans); some strains of Deinococcus also tolerate high temperatures.

Information encoding. The four basic building units in DNA are bases called deoxyribonucleosides. In the biology literature, they are usually labeled A (Adenine), C (Cytosine), G (Guanine), and T (Thymine). Each base always bonds with a specific partner (A with T and C with G). A single chain of these bases is called an oligonucleotide, or oligo. It pairs with another complementary oligo (such as GATCG with CTAGC) to form a double-stranded DNA. Each of these AT or CG pairs in a double-stranded DNA is called a base pair. Because a DNA sequence is digital, we can use the bases to construct any English text, just as we use binary numbers 0 and 1 to encode ASCII characters. Table 1 shows the exact encoding scheme for our initial experiment, which uses a set of triplets (DNA sequences made of any three of the four bases). Note that several triplets at the end of the table are intentionally left open for later expansion.

Table 1. DNA encoding.

AAA - 0   AAC - 1   AAG - 2   AAT - 3   ACA - 4   ACC - 5   ACG - 6   ACT - 7
AGA - 8   AGC - 9   AGG - A   AGT - B   ATA - C   ATC - D   ATG - E   ATT - F
CAA - G   CAC - H   CAG - I   CAT - J   CCA - K   CCC - L   CCG - M   CCT - N
CGA - O   CGC - P   CGG - Q   CGT - R   CTA - S   CTC - T   CTG - U   CTT - V
GAA - W   GAC - X   GAG - Y   GAT - Z   GCA - SP  GCC - :   GCG - ,   GCT - -
GGA - .   GGC - !   GGG - (   GGT - )   GTA - `   GTC - ‘   GTG - “   GTT - "
TAA - ?   TAC - ;   TAG - /   TAT - [   TCA - ]   TCC -     TCG -     TCT -
TGA -     TGC -     TGG -     TGT -     TTA -     TTC -     TTG -     TTT -

Unique DNA searching. The whole genomes of E. coli and Deinococcus have been completely sequenced and are available from The Institute for Genomic Research (www.tigr.org). Our task is to identify a set of fixed-size sequences (20 base pairs long in our experiments) that do not exist in the candidate bacteria yet satisfy all the genomic constraints and restrictions. This process is critical to our experiment, as we do not want to cause unnecessary mutation or damage to the bacteria. The resultant sequences also serve as sentinels to tag the beginning and end of the embedded messages, similar to file headers and footers in magnetic tape, for later identification and retrieval.

Of the 10 billion potential candidates in the bacterium Deinococcus, we found through intensive computation only 25 qualified sequences that would be acceptable for our experiments. These sequences (see Table 2) serve as blueprints for chemically synthesizing oligos for subsequent steps in our experiments. The multiple triplets (such as TAA, TGA, and TAG) seen in many of the sequences are called stop codons and tell the bacterium repeatedly that it has reached the end of the native DNA sequence and should stop translating its contents. Without the protection of stop codons, the bacterium could misinterpret the encoded information and produce artificial proteins that could destroy the integrity of the embedded message or even kill the bacterium.

Table 2. 25 20-base-pair sequences for our experiment.

AAGGTAGGTAGGTTAGTTAG   AGAGTAGTGAGGATAGTTAG
AGGTTTGGTGGTATAGTTAG   ATAAGTAGTGGGGTAGTTAG
ATAGGAGTGTGTGTAGTTAG   ATAGGGGTATGGATAGTTAG
ATATTAGAGGGGGTAGTTAG   ATGGGTGGATTGATAGTTAG
GGAGTAGTGTGTATAGTTAG   GGGAATAGAGTGTTAGTTAG
GGGAGTATGTAGTTAGTTAG   GGGATGATTGGTTTAGTTAG
GGTTAGATGAGTGTAGTTAG   GTATGGGAATGGTTAGTTAG
TAAGGGATGTGTGTAGTTAG   TAGAGAGAGTGTGTAGTTAG
TAGAGGAGGGATATAGTTAG   TAGAGTGGTGTGTTAGTTAG
TAGATGGGAGGTATAGTTAG   TAGATTGGATGGGTAGTTAG
TAGGAGAGATGTGTAGTTAG   TAGGGTTGGTAGTTAGTTAG
TATAGGGAGGGTATAGTTAG   TATAGGGTAGGGTTAGTTAG
TGTGGGATAGTGATAGTTAG

Wet laboratory procedures. We conducted four main procedures:

Create complementary oligos. We started by creating two complementary oligos, each with 46 bases and consisting of two different segments of 20-base-long oligos connected by a six-base-long restriction enzyme site. The two 20-base-long oligos were based on two different sequences listed in Table 2. Enzymes that recognize a specific sequence of double-stranded DNA and that cut the DNA at that location are known as restriction enzymes. We created a restriction enzyme site for later insertion of encoded DNA fragments. These two 46-base-long complementary oligos form a double-stranded, 46-base-pair DNA fragment. The DNA fragment was then cloned into a recombinant plasmid, a union of foreign DNA fragments into a circular DNA molecule (see Figure 1). Because the two 20-base-long oligos do not exist in the genome of the host, they serve as identification markers for later message retrieval. The stop codons in these two oligos also help protect the message, as well as the host, from potential damage.

Figure 1. A recombinant plasmid with two DNA fragments as sentinels protecting the encoded message in between.

Insert DNA. The embedded DNA was then inserted into cloning vectors, circular DNA molecules that can self-replicate within a bacterial host. The resultant vectors were then transferred into E. coli by electroporation (high-voltage shocks), allowing the vector with the encoded DNA fragment to multiply for later applications.

Incorporate into genome. The vector and the encoded DNA were then incorporated into the genome of Deinococcus for permanent information storage and retrieval. Deinococcus granted perfect protection for the embedded message, as it tolerates extreme desiccation, high doses of radiation, high levels of organic solvent, and vacuum-pressure environments, as shown in our experiments.

Extract the message. Finally, whenever embedded information was needed, we extracted the message part of the DNA strand from the bacterium through a laboratory procedure called polymerase chain reaction. Using prior knowledge of the sequences at both borders of the segments, it proceeds through a series of heating and cooling cycles to amplify the DNA segment. The whole process took about two hours. Figure 2 shows a machine readout of our DNA analysis and its English interpretation at the top.

Figure 2. DNA analysis result and its interpretation; the English text is part of the children’s song “It’s a Small World” [2].

Enormous Potential Capacity

We successfully stored and retrieved seven chemically synthesized DNA fragments with 57–99 base pairs of non-native information in seven different individual bacteria. Even without further technology improvement, the capacity of bacterial-based DNA memory can be expanded dramatically by storing different pieces of information in a population of bacteria; for example, the seven bacteria in our experiment carry different parts of the children’s song “It’s a Small World” [2] in their genomes. Considering that a milliliter of liquid can contain up to 10^9 bacteria, the potential capacity of bacterial-based DNA memory is enormous, assuming we have a well-designed data index scheme.

A potential challenge is the mutation of the organisms affecting the integrity of the embedded messages. Although a bacterium can be selected with a low mutation rate, random changes still occur. Nature has had to deal with this problem since the beginning of biological evolution and developed mechanisms for detecting and correcting errors. With the extremely efficient DNA repair mechanisms associated with Deinococcus, we did not detect any mutations in our experiment in which we retrieved the DNA after the bacteria that carried the message were allowed to propagate for about a hundred generations. However, the mutation rate may depend on a specific sequence and the bacterium’s genetic background.

DNA Memory Applications

Most of the potential applications of this organic data memory technology relate to the core missions of the U.S. Department of Energy (DOE). Other security-related applications include information hiding and data steganography for commercial products, as well as those related to national security.

As one of nine DOE national laboratories, a major PNNL concern is protecting information in case of nuclear catastrophe. Suppose the U.S. experienced a devastating nuclear disaster and the national information infrastructure was paralyzed or deactivated by radiation and fire. Suppose we had planted critical relief information in certain bacteria (such as Deinococcus) that could live and multiply independently without human intervention. Suppose these data hosts could survive high doses of radiation and other extreme conditions. All critical information would therefore be available upon the arrival of a disaster relief team.

The research into and development of sterile seeds (seeds that yield one crop, then terminate) has prompted recent controversy, especially in the farming community throughout the U.S. The competition between the proprietary rights of seed companies to protect their investments and the overwhelming need of poor farmers in third-world countries who cannot afford new seed every year will probably continue until a practical solution emerges. Suppose the seed companies were able to put unique DNA watermarks based on our technology in all their seeds. They could effectively track their sales and protect their proprietary products against illegal planting by greedy farmers without affecting the needy farmers.

Remediating environmental pollution in the U.S. has been a PNNL core mission since the 1980s. PNNL scientists periodically drill sampling wells and collect soil samples to monitor the migration of pollutants that might contaminate the U.S.’s natural resources, including water. Suppose we were able to put enough information in a bacteria population in the water and update it continuously and progressively according to the bacteria’s spatial and temporal distribution. The bacteria would eventually provide both a chronological overview of the migration and a complete local database useful to scientists. The same technological approach could also be used to study endangered species; for example, a DNA watermark in the subject’s genome could replace other artificial identification means (such as microchip implants).

For the computer science and information technology communities, suppose people could safely and permanently store their personal information (such as family history and medical data) in the cells in their own bodies. Suppose we could replace computer disks with our bodies as a primary memory storage medium.

Such options no longer represent speculative science fiction. All are potentially accomplished through organic data memory based on DNA. The DNA memory described here is neither impossible nor impractical, only challenging.

Conclusion

With a careful coding scheme and arrangement, important information can be encoded as an artificial DNA strand and safely and permanently stored in a living host. In the short run, this technology can be used to identify origins and protect R&D investment in, say, agricultural products and endangered species. It can also be used in environmental research to track generations of organisms and observe the ecological effect of pollutants. The microorganisms that survive heavy radiation exposure, high temperatures, and other extreme conditions are among the perfect protectors for the otherwise fragile DNA strands that preserve encoded information. Finally, living organisms, including weeds and cockroaches, that have lived on Earth for hundreds of millions of years represent excellent candidates for protecting critical information for future generations.

References

1. Clelland, C.T., Risca, V., and Bancroft, C. Hiding messages in DNA microdots. Nature 399 (June 10, 1999), 533–534.
2. Sherman, R.M. and Sherman, R.B. It’s a Small World. Walt Disney Enterprises, Inc., 1963.

Pak Chung Wong (pak.wong@pnl.gov) is a chief scientist in the Energy Science and Technology Division at the Pacific Northwest National Laboratory, Richland, WA.

Kwong-Kwok Wong (kkwong@txccc.org) is an assistant professor at the Baylor School of Medicine and the Director of Microarray Laboratory at Texas Children’s Cancer Center, Houston, TX. This research was conducted while he was a senior research scientist at the Pacific Northwest National Laboratory, Richland, WA.

Harlan Foote (harlan.foote@pnl.gov) is a senior research scientist in the Energy Science and Technology Division at the Pacific Northwest National Laboratory, Richland, WA.

The Pacific Northwest National Laboratory is operated for the U.S. Department of Energy by Battelle Memorial Institute under contract DE-AC06-76RL0.

© 2003 ACM 0002-0782/03/0100 $5.00
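
As an illustration of the information-encoding stage, the triplet scheme of Table 1 can be sketched in a few lines of Python. This is not the authors' software. It assumes that symbols are assigned to triplets in alphabetical base order (A, C, G, T), which agrees with every table entry from AAA (0) through GGT (closing parenthesis). The quotation-mark entries and everything after them are left out, lowercase input is folded to uppercase because the table has no lowercase letters, and the names `encode` and `decode` are invented for this example.

```python
from itertools import product

BASES = "ACGT"

# Symbols for the first 44 triplets of Table 1 (AAA through GGT), in table order.
# Assumption: quote marks and later entries of the table are omitted here.
SYMBOLS = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ :,-.!()"

# All 64 triplets in alphabetical base order: AAA, AAC, AAG, AAT, ACA, ...
TRIPLETS = ["".join(t) for t in product(BASES, repeat=3)]

# zip() stops at the shorter input, so only the first 44 triplets are used.
ENCODE = dict(zip(SYMBOLS, TRIPLETS))
DECODE = {triplet: symbol for symbol, triplet in ENCODE.items()}


def encode(text):
    """Translate text into a DNA string, three bases per character."""
    try:
        return "".join(ENCODE[ch] for ch in text.upper())
    except KeyError as err:
        raise ValueError(f"character {err.args[0]!r} is not in the table") from None


def decode(dna):
    """Translate a DNA string back into text, reading one triplet at a time."""
    if len(dna) % 3 != 0:
        raise ValueError("DNA length must be a multiple of three")
    return "".join(DECODE[dna[i:i + 3]] for i in range(0, len(dna), 3))


if __name__ == "__main__":
    strand = encode("DNA")
    print(strand)          # ATCCCTAGG
    print(decode(strand))  # DNA
```

With 64 triplets available, each character costs three bases, so a message of n characters needs 3n bases before any sentinels are added.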

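
The unique DNA searching stage can also be illustrated with a toy example. The article does not spell out its "genomic constraints and restrictions", so the sketch below makes two assumptions of its own: a candidate must be absent from both strands of the host genome, and it must contain at least one stop codon (TAA, TAG, or TGA). The function names are hypothetical, and the brute-force enumeration is practical only for small k, not for the 20-base sequences used in the experiments.

```python
from itertools import product

STOP_CODONS = ("TAA", "TAG", "TGA")
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the opposite strand, read in its own 5'-to-3' direction."""
    return seq.translate(_COMPLEMENT)[::-1]


def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def absent_kmers(genome, k):
    """List k-mers that occur on neither strand and contain a stop codon.

    Assumption: these two filters stand in for the article's unspecified
    constraints; they are not the authors' actual selection criteria.
    """
    present = kmers(genome, k) | kmers(reverse_complement(genome), k)
    found = []
    for bases in product("ACGT", repeat=k):  # 4**k candidates
        candidate = "".join(bases)
        if candidate in present:
            continue
        if any(stop in candidate for stop in STOP_CODONS):
            found.append(candidate)
    return found


if __name__ == "__main__":
    toy_genome = "ACGTACGTTAGC"  # made-up sequence, not real genomic data
    candidates = absent_kmers(toy_genome, 4)
    print(len(candidates), candidates[:5])
```

A realistic search over a multi-megabase genome would index the genome's k-mers once and test candidates against that index, rather than enumerate every possible sequence.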

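
The wet-laboratory steps have a simple in-silico analogue, sketched below under stated assumptions. Two sequences from Table 2 act as sentinels. The six-base site GAATTC (the EcoRI recognition sequence) is only a stand-in, since the article does not say which restriction site was used. The host sequence is made up, the function names are hypothetical, and the model ignores any restriction-site bases that a real cut-and-ligate step would leave next to the insert. Extraction here is a plain string search, which mimics only the role the sentinels play in polymerase chain reaction, not the chemistry.

```python
# Two sentinel sequences copied from Table 2.
LEFT_SENTINEL = "AAGGTAGGTAGGTTAGTTAG"
RIGHT_SENTINEL = "AGAGTAGTGAGGATAGTTAG"

# Assumption: a stand-in six-base restriction site (EcoRI recognition sequence).
RESTRICTION_SITE = "GAATTC"

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def complement(seq):
    """Base-by-base complement, as in the article's GATCG / CTAGC example."""
    return seq.translate(_COMPLEMENT)


def extract_message(dna, left=LEFT_SENTINEL, right=RIGHT_SENTINEL):
    """Return the bases found between the left and right sentinels."""
    start = dna.find(left)
    if start == -1:
        raise ValueError("left sentinel not found")
    start += len(left)
    end = dna.find(right, start)
    if end == -1:
        raise ValueError("right sentinel not found")
    return dna[start:end]


# The 46-base oligo: sentinel + restriction site + sentinel, and its partner.
frame = LEFT_SENTINEL + RESTRICTION_SITE + RIGHT_SENTINEL
partner = complement(frame)

if __name__ == "__main__":
    message = "ATCCCTAGG"  # "DNA" under the Table 1 encoding
    # Made-up flanking bases stand in for the host genome.
    host = "ACGTACGT" + LEFT_SENTINEL + message + RIGHT_SENTINEL + "TTGCAATG"
    print(len(frame), len(partner))
    print(extract_message(host))
```

Because the sentinels were chosen to be absent from the host genome, the first match of each one marks the message boundary without ambiguity, which is the property the search stage is designed to guarantee.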