and retrieval, encode information as artificial DNA strands and insert into living hosts. As vectors, bacteria, even some bugs and weeds, might be good for hundreds of millions of years.

BY PAK CHUNG WONG, KWONG-KWOK WONG, AND HARLAN FOOTE
COMMUNICATIONS OF THE ACM January 2003/Vol. 46, No. 1 95

A data preservation problem looms over today’s information superhighway. Ancient humans preserved their knowledge by engraving bones and stone. About two millennia ago people invented paper to publish their thoughts. Today, we use magnetic media and silicon chips to store our data. But bones and stone erode, paper disintegrates, and electronic memory degrades. All these storage media require constant attention to maintain their information content, and all are easily destroyed by people and natural disasters, whether intentionally or by accident. In light of the vast amount of information being generated every day, it’s time to find a new medium.

Searching for an inexpensive, long-lasting medium for information storage, scientists at the Pacific Northwest National Laboratory (PNNL) are investigating deoxyribonucleic acid, commonly known as DNA, to develop a data memory technology with a life expectancy much greater than any existing counterpart. Our initial DNA memory prototype consists of four main steps: encoding meaningful information as artificial DNA sequences; transforming the sequences to living organisms; allowing the organisms to grow and multiply; and eventually extracting the information back from the organisms. Here we describe the objective of this investigation, which began in 1998, and experiments we’ve conducted to determine the feasibility of our approach, as well as several potential applications.

Nature magazine reported a study [1] resembling the first part of our effort: encoding meaningful information as DNA sequences. It described an experiment in which a group of scientists at Mount Sinai School of Medicine in New York created an encoded DNA strand and hid it behind a period (a dot) in a printed document. The document was then sealed and mailed to its owners through the U.S. Postal Service. Eventually, the embedded message was successfully recovered in a laboratory environment.

The article reported that the embedded information survived its rough handling in the mail, proving that a DNA strand can be as dependable as a piece of paper in terms of information storage. It is, however, still far from being able to outlast existing data-memory devices. In fact, a naked DNA molecule is easily destroyed in any open environment inhabited by people or potential enemies of nature. The so-called “double-strand break” of DNA, which is usually fatal, can be caused by common unfavorable environmental conditions, including excessive temperature and desiccation/rehydration. Even nucleases (a kind of DNA-degrading enzyme) in the environment can corrupt DNA molecules over time. Therefore, a key to our success is finding a super-dependable storage medium to ensure adequate protection for the encoded DNA strands. Our solution is to provide a living host for the DNA that tolerates the addition of artificial gene sequences and survives extreme environmental conditions. Perhaps more important, the host with the embedded information must be able to grow and multiply.

Challenges

Recent advances in genetic engineering have allowed the introduction of foreign DNA molecules into the living cells of bacteria, humans, and other organisms. Typically, a short, one-of-a-kind, well-researched DNA strand is applied to a living host for some particular biological study, with little or no intention of retrieving the embedded DNA afterward. This process is somewhat contrary to our basic DNA memory requirements that new and artificial DNA be generated frequently and that we be able to retrieve the embedded information afterward.

These requirements pose serious challenges to our DNA memory design due to the size of a whole genome, which ranges from a few million DNA units in a bacterium to more than three billion in a mouse or human. It is practically impossible to retrieve an embedded message from a whole genome in a wet laboratory without knowing the content or whereabouts of the encoded DNA. The unpredictable nature of genomic mutation represents yet another obstacle, further reducing the odds of locating the message within a whole genome.

Experimental Design

The customized computational and wet-laboratory approach we developed leaves a trail of the embedded message for later retrieval while allowing us to preserve the integrity of the message. Our experiments were carried out in four primary stages:

DNA host identification. In the process of identifying candidates to carry the embedded DNA molecules, we considered microorganisms (such as bacteria) and other agents (such as plants, including Arabidopsis) as message hosts. We eventually selected two well-understood bacteria, Escherichia coli (E. coli) and Deinococcus radiodurans (Deinococcus), because microorganisms generally grow quickly and embedded information can be inherited quickly and continuously. We also considered the physical endurance of the DNA host candidates. Deinococcus, we learned, survives extreme conditions, including ultraviolet, desiccation, partial vacuum, and ionizing radiation up to 1.6 million rad, or radiation absorbed dose (about 0.1% of this dose is fatal to humans); some strains of Deinococcus also tolerate high temperatures.

Information encoding. The four basic building units in DNA are bases called deoxyribonucleosides. In the biology literature, they are usually labeled A (Adenine), C (Cytosine), G (Guanine), and T (Thymine). Each base always bonds with another base (such as A with T and C with G). A single chain of these bases is called an oligonucleotide, or oligo. It pairs with another complementary oligo (such as GATCG with CTAGC) to form a double-stranded DNA. Each of these AT or CG pairs in a double-stranded DNA is called a base pair. Because a DNA sequence is digital, we can use the bases to construct any English text, just as we use binary numbers 0 and 1 to encode ASCII characters. Table 1 outlines the encoding table of our experiments, the exact encoding scheme for our initial experiment, which uses a set of triplets (DNA sequences of any three of the four bases). Note several triplets listed at the end of the table are open (intentionally) for later expansion.

Table 1. DNA encoding.

AAA - 0   AAC - 1   AAG - 2   AAT - 3   ACA - 4   ACC - 5   ACG - 6   ACT - 7
AGA - 8   AGC - 9   AGG - A   AGT - B   ATA - C   ATC - D   ATG - E   ATT - F
CAA - G   CAC - H   CAG - I   CAT - J   CCA - K   CCC - L   CCG - M   CCT - N
CGA - O   CGC - P   CGG - Q   CGT - R   CTA - S   CTC - T   CTG - U   CTT - V
GAA - W   GAC - X   GAG - Y   GAT - Z   GCA - SP  GCC - :   GCG - ,   GCT - -
GGA - .   GGC - !   GGG - (   GGT - )   GTA - `   GTC - ‘   GTG - “   GTT - "
TAA - ?   TAC - ;   TAG - /   TAT - [   TCA - ]   TCC -     TCG -     TCT -
TGA -     TGC -     TGG -     TGT -     TTA -     TTC -     TTG -     TTT -

Unique DNA searching. The whole genomes of E. coli and Deinococcus have been completely sequenced and are available from The Institute for Genomic Research (www.tigr.org). Our task is to identify a set of fixed-size sequences (20-base-pair long in our experiments) that do not exist in the candidate bacteria yet satisfy all the genomic constraints and restrictions. This process is critical to our experiment, as we do not want to cause unnecessary mutation or damage to the bacteria. The resultant sequences also serve as sentinels to tag the beginning and end of the embedded messages (similar to file headers and footers in magnetic tape) for later identification and retrieval.

Table 2. 25 20-base-pair sequences for our experiment.

AAGGTAGGTAGGTTAGTTAG AGAGTAGTGAGGATAGTTAG AGGTTTGGTGGTATAGTTAG ATAAGTAGTGGGGTAGTTAG ATAGGAGTGTGTGTAGTTAG
ATAGGGGTATGGATAGTTAG ATATTAGAGGGGGTAGTTAG ATGGGTGGATTGATAGTTAG GGAGTAGTGTGTATAGTTAG GGGAATAGAGTGTTAGTTAG
GGGAGTATGTAGTTAGTTAG GGGATGATTGGTTTAGTTAG GGTTAGATGAGTGTAGTTAG GTATGGGAATGGTTAGTTAG TAAGGGATGTGTGTAGTTAG
TAGAGAGAGTGTGTAGTTAG TAGAGGAGGGATATAGTTAG TAGAGTGGTGTGTTAGTTAG TAGATGGGAGGTATAGTTAG TAGATTGGATGGGTAGTTAG
TAGGAGAGATGTGTAGTTAG TAGGGTTGGTAGTTAGTTAG TATAGGGAGGGTATAGTTAG TATAGGGTAGGGTTAGTTAG TGTGGGATAGTGATAGTTAG

Of the 10 billion potential candidates in the bacterium Deinococcus, we found through intensive computation only 25 qualified sequences that would be acceptable for our experiments. These sequences (see Table 2) serve as blueprints for chemically synthesizing oligos for subsequent steps in our experiments. The multiple triplets (such as TAA, TGA, and TAG) seen in many of the sequences are called stop codons and tell the bacterium repeatedly it has reached the end of the native DNA sequence and should stop translating its contents.
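The triplet scheme of Table 1 can also be written as a short Python sketch. It relies on an observation about the table rather than on anything the authors state: the symbols are assigned to triplets in alphabetical order (AAA, AAC, AAG, ...), so the mapping can be generated instead of typed out. SP is taken to mean a space, input is converted to capitals because the table has no lowercase letters, and the names `encode` and `decode` are hypothetical.

```python
from itertools import product

# Symbols in the order Table 1 assigns them, starting at triplet AAA.
SYMBOLS = list("0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ") + [
    " ", ":", ",", "-", ".", "!", "(", ")",
    "`", "‘", "“", '"', "?", ";", "/", "[", "]",
]

# All 64 triplets in alphabetical order: AAA, AAC, AAG, AAT, ACA, ...
TRIPLETS = ["".join(t) for t in product("ACGT", repeat=3)]

# zip stops after the 53 assigned symbols; the last 11 triplets stay open.
ENCODE = dict(zip(SYMBOLS, TRIPLETS))
DECODE = {triplet: symbol for symbol, triplet in ENCODE.items()}


def encode(text):
    """Translate text into a DNA string, three bases per character."""
    return "".join(ENCODE[ch] for ch in text.upper())


def decode(dna):
    """Translate a DNA string back into text, reading three bases at a time."""
    if len(dna) % 3 != 0:
        raise ValueError("DNA length must be a multiple of three")
    return "".join(DECODE[dna[i:i + 3]] for i in range(0, len(dna), 3))


print(encode("DNA"))        # ATCCCTAGG
print(decode("ATCCCTAGG"))  # DNA
```

Characters outside the table (an apostrophe, for instance) raise a `KeyError` in this sketch; how the original experiments handled such characters is not stated.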
Without the protection of stop codons, the bacterium could misinterpret the encoded information and produce artificial proteins that could destroy the integrity of the embedded message or even kill the bacterium.

Wet laboratory procedures. We conducted four main procedures:

Create complementary oligos. We started by creating two complementary oligos, each with 46 bases and consisting of two different segments of 20-base-long oligos connected by a six-base-long restriction enzyme site. The two 20-base-long oligos were based on two different sequences listed in Table 2. Enzymes that recognize a specific sequence of double-stranded DNA and that cut the DNA at that location are known as restriction enzymes. We created a restriction enzyme site for later insertion of encoded DNA fragments. These two 46-base-long complementary oligos form a double-stranded, 46-base-pair DNA fragment. The DNA fragment was then cloned into a recombinant plasmid, a union of foreign DNA fragments into a circular DNA molecule (see Figure 1). Because the two 20-base-long oligos do not exist in the genome of the host, they serve as identification markers for later message retrieval. The stop codons in these two oligos also help protect the message, as well as the host, from potential damage.

Figure 1. A recombinant plasmid with two DNA fragments as sentinels protecting the encoded message in between.

Insert DNA. The embedded DNA was then inserted into cloning vectors, circular DNA molecules that can self-replicate within a bacterial host. The resultant vectors were then transferred into E. coli by electroporation (high-voltage shocks), allowing the vector with the encoded DNA fragment to multiply for later applications.

Incorporate into genome. The vector and the encoded DNA were then incorporated into the genome of Deinococcus for permanent information storage and retrieval. Deinococcus granted perfect protection for the embedded message, as it tolerates extreme desiccation, high doses of radiation, high levels of organic solvent, and vacuum-pressure environments, as shown in our experiments.

Extract the message. Finally, whenever embedded information was needed, we extracted the message part of the DNA strand from the bacterium through a laboratory procedure called polymerase chain reaction. Using prior knowledge of the sequences at both borders of the segments, it proceeds through a series of heating and cooling cycles to amplify the DNA segment. The whole process took about two hours. Figure 2 shows a machine readout of our DNA analysis and its English interpretation at the top.

Figure 2. DNA analysis result and its interpretation; the English text is part of the children’s song “It’s a Small World” [2].

Enormous Potential Capacity

We successfully stored and retrieved seven chemically synthesized DNA fragments with 57–99 base pairs of non-native information in seven different individual bacteria. Even without further technology improvement, the capacity of bacterial-based DNA memory can be expanded dramatically by storing different pieces of information in a population of bacteria; for example, the seven bacteria in our experiment carry different parts of the children’s song “It’s a Small World” [2] in their genomes. Considering that a milliliter of liquid can contain up to 10^9 bacteria, the potential capacity of bacterial-based DNA memory is enormous, assuming we have a well-designed data index scheme.

A potential challenge is the mutation of the organisms affecting the integrity of the embedded messages. Although a bacterium can be selected with a low mutation rate, random changes still occur. Nature has had to deal with this problem since the beginning of bio-