
Bioinformatics and Role of Software Engineers in It

By-

Makarand Arjun Kokane BE-2 Roll number 222

Year 2002-2003

Pune Institute of Computer Technology Pune 411043


CERTIFICATE

This is to certify that Mr. Makarand Arjun Kokane has successfully completed his seminar on the topic "Bioinformatics and Role of Software Engineers in It" under the guidance of Prof. R. B. Ingle towards the partial completion of the Bachelor's degree in Computer Engineering at Pune Institute of Computer Technology during the academic year 2002-2003.

Date:

Seminar Incharge

HOD
(Computer Engineering)

Principal
PICT

Acknowledgement
I take this opportunity to thank respected Prof. R. B. Ingle Sir (my seminar guide) for his generous assistance. I am immensely grateful to our HOD Prof. Dr. C.V.K Rao for his encouragement and guidance. I extend my sincere thanks to our college library staff and all the staff members for their valuable assistance. I am also thankful to my fellow colleagues for their help and suggestions.

Makarand Kokane (B.E. 2)

Abstract
Bioinformatics is the application of computers in biological sciences. It is concerned with capturing, storing, graphically displaying, modeling and ultimately distributing biological information. It is becoming an essential tool in molecular biology as genome projects generate vast quantities of data. The Human Genome Project has created the need for new kinds of scientific specialists who can be creative at the interface of biology and other disciplines, such as computer science, engineering, mathematics, physics, chemistry, and the social sciences. As the popularity of genomic research increases, the demand for these specialists greatly exceeds the supply. In the past, the genome project has benefited immensely from the talents of non-biological scientists, and their participation in the future is likely to be even more crucial. Through this report I have tried to analyze the future requirements for the development of advanced technologies in this field and the role that we, as software engineers, can play in developing them.

Contents
1 Introduction to Bioinformatics
1.1 What is Bioinformatics?
1.2 Computers and Biology
1.3 Limitations in the use of computers
1.4 Current Stage of Research
1.5 Microbial, Plant and Animal Genomes
1.6 History (Stages of development)

2 Basics of Molecular Biology
2.1 Nucleotide
2.2 Amino acid
2.3 Properties of Genetic Code
2.4 DNA (Deoxy-ribonucleic Acid)
2.5 Chromosomes
2.6 Gene
2.7 Protein
2.8 Sequencing
2.9 Genome
2.10 Clone
2.11 Model Organism

3 Role of software Engineers and Technology in Biotechnology
3.1 Need for software automation
3.2 Genetic Algorithms
3.2.1 Database Searching
3.2.2 Comparing Two Sequences
3.2.3 Multiple Sequence Alignment
3.3 Genome Projects
3.4 Goals for Advancements in Sequencing Technology
3.5 Developing Technology to handle Sequence Variations
3.6 Need of Technology in Functional Genomics
3.7 Bioinformatics and Computational Biology
3.8 Job Opportunities and Job Requirements
3.9 Training Goals included in the Human Genome Project Plan

4 Human Genome Project
4.1 Introduction
4.2 Details of the Human Genome Project
4.3 U.S. Human Genome Project 5-Year Goals 1998-2003
4.3.1 Human DNA Sequencing
4.3.2 Sequencing Technology
4.3.3 Sequence Variation
4.3.4 Functional Genomics
4.3.5 Comparative Genomics
4.3.6 Ethical, Legal, and Social Implications (ELSI)
4.3.7 Bioinformatics and Computational Biology
4.3.8 Training

5 Biological Databases
5.1 The Biological sequence/structure deficit
5.2 Biological Databases
5.3 Primary Sequence Databases
5.3.1 Nucleic acid Sequence Databases
5.3.2 Protein Sequence Databases
5.4 Composite Protein Sequences Databases
5.5 Secondary Databases
5.6 Tertiary Databases

6 Applications of Bioinformatics
6.1 Application to the Ailments of Diseases
6.2 Application of Bioinformatics to Agriculture
6.2.1 Improvements in Crop Yield and Quality
6.3 Applications of Microbial Genomics
6.4 Risk Assessment
6.5 Evolution and Human Migration
6.6 DNA Forensics (Identification)

Bibliography

Chapter 1

Introduction to Bioinformatics

1.1 What is Bioinformatics?
Bioinformatics is the application of computers in biological sciences, and especially to the analysis of biological sequence data. It is concerned with capturing, storing, graphically displaying, modeling and ultimately distributing biological information. It is becoming an essential tool in molecular biology as genome projects generate vast quantities of data. With new sequences being added to DNA databases, on average, once every minute, there is a pressing need to convert this information into biochemical and biophysical knowledge by deciphering the structural, functional and evolutionary clues encoded in the language of biological sequences.

What Bioinformatics therefore offers to the researcher, the entrepreneur, or the venture capitalist is an enormous and exciting array of opportunities to discover how living systems metabolise, grow, combat disease, reproduce and regenerate. The current knowledge represents only the tip of the iceberg. Exciting and startling discoveries are being made every day through Bioinformatics, which is building up an extensive encyclopedia from which life's mysteries will be unraveled. The importance of computational science in collating this information, and its simultaneous interpretation by biologists, is the underlying ethos of Bioinformatics.

An interest in biology and a strong inclination towards genetics are certainly useful. But from our point of view, the most important thing is that biocomputing requires a large number of software professionals, and there is more for them to do than for the experts in biology.

1.2 Computers and Biology
Bioinformatics is the symbiotic relationship between computational and biological sciences. The ability to sort and extricate genetic codes from a human genomic database of 3 billion base pairs of DNA in a meaningful way is perhaps the simplest form of Bioinformatics. Moving on to another level, Bioinformatics is useful in mapping different people's genomes and deriving differences in their genetic make-up through pattern recognition software. But that is the easiest part. What is more complex is to decipher the genetic code itself, to see what the differences in genetic make-up between different people translate into in terms of physiological traits.

And there is yet another level, which is even more intricate, and that is the genetic code itself. The genetic code actually codes for amino acids and thereby proteins, and the specific role played by each of these proteins controls the state of our health. The role or function of each of our genes in coding for a specific protein, which in turn regulates a particular metabolic pathway, is described as functional genomics. The true benefit of Bioinformatics therefore lies in harnessing information pertaining to these genetic functions in order to understand how human beings and other living systems operate.

Computational simulation of experimental biology is an important application of Bioinformatics, which is referred to as in silico testing. This is perhaps an area that will expand in a prolific way, given the need to obtain a greater degree of predictability in animal and human clinical trials. Added to this is the interesting scope that in silico testing provides to deal with the growing hostility towards animal testing. The growth of this sector will largely depend on the acceptance of in silico testing by the regulatory authorities. However, irrespective of this, research strategies will certainly find computational modeling to be a vital tool in speeding up research, with enormous cost benefits.

1.3 Limitations in the use of computers
The last decade has witnessed the dawn of a new era of silicon-based biology, opening the door, for the first time, to the possible investigation and comparative analysis of complete genomes. Genome analysis means elucidating and characterizing the genes and gene products of an organism. It depends on a number of pivotal concepts concerning the processes of evolution (divergence and convergence), the mechanism of protein folding, and the manifestation of protein function. Today, our use of computers to model such processes is limited by, and must be placed in the context of, the current limits of our understanding of these central themes.

At the outset, it is important to recognize that we do not yet fully understand the rules of protein folding; we cannot invariably say that a particular sequence or a fold has arisen by divergent or convergent evolution; and we cannot necessarily diagnose a protein function given knowledge only of its sequence or of its structure, in isolation. Accepting what we cannot do with computers plays an essential role in forming an appreciation of what, in fact, we can do. Without this kind of understanding, it is easy to be misled, as spurious arguments are often used to promote rather overenthusiastic points of view about what particular programs and software packages can achieve.

Nature has its own complex rules, which we only poorly understand and which we cannot easily encapsulate within computer programs. No current algorithm can do biology. Programs provide mathematical, and therefore infallible, models of biological systems. To interpret correctly whether sequences or structures are meaningfully similar, whether they have arisen by the processes of divergence or convergence, and whether similar sequences or similar folds have the same or different functions: these are the most challenging problems. There are no simple solutions, and computers do not give us the answers; rather, given a sea of data, they help to narrow the options down so that the users can begin to draw informed, biologically reasonable conclusions.

1.4 Current Stage of Research
In the field of Bioinformatics, the current research drive is to be able to understand evolutionary relationships in terms of the expression of protein function. Two computational approaches have been brought to bear on the problem, tackling the identification of protein function from the perspectives of sequence analysis and of structure analysis respectively. From the point of view of sequence analysis, we are concerned with the detection of relationships between newly determined sequences and those of known function (usually within a database). This may mean pinpointing functional sites shared by disparate proteins (probably the result of convergent evolution), or identifying related functions in similar proteins (most commonly the result of divergent evolution).

The identification of protein function from sequence sounds straightforward, and indeed, sequence analysis is usually a fruitful technique. But function cannot be inferred from sequence for about one-third of proteins in any of the sequenced genomes, largely because biological characterization cannot keep pace with the volume of data issuing from the genome projects (a large number of database sequences thus either carry no annotation beyond the parent gene name, or are simply designated as hypothetical proteins). Another important point is that, in some instances, closely related sequences, which may be assumed to share a common structure, may not share the same function. What this means is that, though sequence or structure analysis can be used for deducing gene functions, neither technique can be applied infallibly without reference to the underlying biology.

1.5 Microbial, Plant and Animal Genomes
Although the human genome appears to be the focal point of interest, microbial, plant and animal genomes are equally exciting to explore through Bioinformatics. Mining plant genomics has an important impact on opening up new vistas for research in agriculture. Microbial genomics offers a dual opportunity of developing new fermentation-based products and technologies as well as defining new ways of combating microbial infections. Exploring animal genomics opens up unlimited scope to pursue research in veterinary science and transgenic models.

1.6 History (Stages of development)
The science of sequencing began slowly. The earliest techniques were based on methods for separation of proteins and peptides, coupled with methods for identification and quantification of amino acids. Prior to 1945, there was not a single quantitative analysis available for any one protein. However, significant progress with chromatographic and labeling techniques over the next decade eventually led to the elucidation of the first complete sequence, that of the peptide hormone insulin (1955). Yet it was another five years before the sequence of the first enzyme (ribonuclease) was complete (1960). By 1965, around 20 proteins with more than 100 residues had been sequenced, and by 1980 the number was estimated to be of the order of 1500. Today, there are more than 300,000 sequences available.

Initially, the majority of protein sequences were obtained by the manual process of sequential Edman degradation-dansylation. A key step towards the rapid increase in the number of sequenced proteins was the development of automated sequencers, which, by 1980, offered a 10^4-fold increase in sensitivity relative to the automated procedure implemented by Edman and Begg in 1967.

In the 1960s, scientists struggled to develop methods to sequence nucleic acids, but the first techniques to emerge were really only applicable to tRNAs, because they are short (74 to 95 nucleotides in length) and it is possible to purify individual molecules. In contrast to tRNA, DNA is very long: human chromosomal molecules may contain between 55 × 10^6 and 250 × 10^6 base pairs. Assembling the complete nucleotide sequence of a complete DNA molecule is a huge task. Even if the sequence can be broken down into smaller fragments, purification remains a problem. The advent of gene cloning provided a solution to how the fragments can be separated. By 1977, two sequencing methods had emerged, using chain termination and chemical degradation approaches. With only minor changes, the techniques propagated to laboratories throughout the world, and laid the foundation for the sequence revolution of the next two decades and the subsequent birth of Bioinformatics.

During the last decade, molecular biology has witnessed an information revolution as a result both of the development of rapid DNA sequencing techniques and of the corresponding progress in computer-based technologies, which are allowing us to cope with this information deluge in increasingly efficient ways. The broad term that was coined in the mid-1980s to encompass computer applications in biological sciences is Bioinformatics. The term Bioinformatics has been commandeered by several different disciplines to mean rather different things. In its broadest sense, the term can be considered to mean information technology applied to the management and analysis of biological data; this has implications in diverse areas, ranging from artificial intelligence and robotics to genome analysis. In the context of genome initiatives, the term was originally applied to the computational manipulation and analysis of biological sequence data. However, in view of the recent rapid accumulation of available protein structures, the term now tends also to be used to embrace the manipulation and analysis of 3D structure data.


Chapter 2

Basics of Molecular Biology


This chapter briefly explains some of the common biological terms that are essential for a clear understanding of what Bioinformatics is all about. I have avoided getting into the intricacies of genetics because the basic aim of this report is to survey the latest developments in the field of Bioinformatics, to visualize where it is heading, to understand what it has to offer to the community, and to exploit the opportunities available in this field.

2.1 Nucleotide
A nucleotide is a molecule made up of three sub-units: a pentose sugar, a nitrogenous base and a phosphate. Nucleic acids are polymers of nucleotides. The pentose sugar is either ribose or deoxyribose (this decides whether the nucleic acid formed is RNA or DNA). Nitrogenous bases are of two types: Purines (Adenine (A), Guanine (G)) and Pyrimidines (Cytosine (C), Thymine (T) and Uracil (U)).

2.2 Amino acid
An amino acid is the fundamental building block of proteins. There are 20 naturally occurring amino acids in animals and around 100 more found only in plants. A sequence of three nucleotides codes for one amino acid. The logic behind this is as follows. There are four types of nucleotides, depending on the nitrogenous base: (A, G, C, T) in DNA and (A, G, C, U) in RNA. 20 different amino acids are to be coded using arrangements of these 4 types of nucleotides. One nucleotide gives only 4 code words and two give 4^2 = 16, both fewer than 20, whereas three give 4^3 = 64, which is enough. So 3 nucleotides are required to signify one amino acid: fewer than 3 would be insufficient and more than 3 would cause unnecessary redundancy. The sequence of three nucleotides specifying an amino acid is called a triplet code or codon (coding unit). All 64 codons specify something or the other. Most of them specify amino acids, but a few are instructions for starting and stopping the synthesis.
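The counting argument in this section can be checked directly. The following is a minimal sketch in Python; the function names `codewords` and `shortest_sufficient_length` are illustrative choices, not part of any standard library. It enumerates every word of a given length over the four-letter nucleotide alphabet and finds the shortest length that yields at least 20 distinct words.

```python
from itertools import product

NUCLEOTIDES = "ACGU"   # RNA alphabet; DNA uses T in place of U
AMINO_ACIDS = 20       # number of amino acids that must be encoded


def codewords(length: int) -> list[str]:
    """Return every possible nucleotide word of the given length."""
    return ["".join(p) for p in product(NUCLEOTIDES, repeat=length)]


def shortest_sufficient_length(needed: int = AMINO_ACIDS) -> int:
    """Smallest word length whose number of words is at least `needed`."""
    length = 1
    while len(codewords(length)) < needed:
        length += 1
    return length


if __name__ == "__main__":
    for n in (1, 2, 3):
        print(n, len(codewords(n)))       # 4, 16 and 64 words
    print(shortest_sufficient_length())   # 3
```

Because 64 code words are available for 20 amino acids plus start and stop signals, several codons can map to the same amino acid.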

2.3 Properties of Genetic Code
1. Three nucleotides in a DNA molecule code for one amino acid in the corresponding protein. Such a triplet is called a codon.
2. The code is read from a fixed starting point.
3. Codes for starting and stopping are present, but not for a pause in the middle, or comma.
4. The nucleotides are read three at a time in a non-overlapping manner.
5. Most of the 64 possible nucleotide triplets stand for one amino acid or the other.
6. A few triplets stand for starting and stopping the synthesis.
7. For most amino acids there are two or more different codons. Because of this, the genetic code is said to be degenerate.
8. The code has polarity because it can be read only in one direction.
9. The code is universal. Practically all the organisms use the same code.
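Several of these properties (a fixed starting point, non-overlapping triplets, stop signals and degeneracy) can be illustrated with a short sketch. The code below is only an illustration under stated assumptions: `CODON_TABLE` is a deliberately partial excerpt of the standard genetic code, and the function name `translate` and the example mRNA string are made up for this report.

```python
# Partial excerpt of the standard genetic code (mRNA codons).
# Assumption: only a handful of the 64 codons are listed, for illustration.
CODON_TABLE = {
    "AUG": "Met",                      # also the usual start signal
    "UUU": "Phe", "UUC": "Phe",        # two codons, one amino acid (degeneracy)
    "GGU": "Gly", "GGC": "Gly", "GGA": "Gly", "GGG": "Gly",
    "AAA": "Lys", "AAG": "Lys",
    "UGG": "Trp",
    "UAA": "STOP", "UAG": "STOP", "UGA": "STOP",
}


def translate(mrna: str) -> list[str]:
    """Translate an mRNA string into a list of amino-acid names."""
    start = mrna.find("AUG")           # property 2: fixed starting point
    if start == -1:
        return []
    protein = []
    # Property 4: read three nucleotides at a time, without overlap.
    for i in range(start, len(mrna) - 2, 3):
        residue = CODON_TABLE.get(mrna[i:i + 3], "???")  # "???" = not in this excerpt
        if residue == "STOP":          # property 6: a stop codon ends synthesis
            break
        protein.append(residue)
    return protein


if __name__ == "__main__":
    print(translate("GGAUGUUUGGCAAAUGGUAAGG"))
```

For the example string, reading starts at the first AUG and stops at UAA, giving Met, Phe, Gly, Lys, Trp.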

2.4 DNA (Deoxy-ribonucleic Acid)
The long, thread-like DNA molecule consists of two strands that are joined to one another all along their length. Each strand is a polymer made up of repeated sub-units (nucleotides). Hence each strand is also called a polynucleotide. DNA is the basic genetic material of living organisms on this earth. The two essential capabilities possessed by DNA are (1) transmission of hereditary characters and (2) self-duplication.

In the DNA molecule, two long polynucleotide chains are spirally twisted around each other. This is also called helical coiling, and DNA is often referred to as a double helix. A polynucleotide chain has polarity, and the two strands of a DNA molecule run in opposite directions; hence they are said to be antiparallel. The two chains are joined together by hydrogen bonds between the nitrogenous bases on the inside. Adenine (A) forms a bond only with Thymine (T), and Guanine (G) forms a bond only with Cytosine (C). Because of this base pairing restriction, the two strands are always complementary to each other. The sequence of bases along one polynucleotide is not restricted in any way. An infinite variety of combinations is possible. It is the precise sequence of bases that determines the genetic information. There is no theoretical limit to the length of a DNA molecule.
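Because of the pairing rule, the sequence of one strand fully determines the other. A minimal sketch of this idea follows; the function names are illustrative, and the input is assumed to contain only the letters A, T, G and C.

```python
# Pairing rules from the text: A pairs with T, G pairs with C.
PAIRS = {"A": "T", "T": "A", "G": "C", "C": "G"}


def complement(strand: str) -> str:
    """Base-by-base partner of a strand, written in the same left-to-right order."""
    return "".join(PAIRS[base] for base in strand)


def reverse_complement(strand: str) -> str:
    """Partner strand written in its own reading direction (the strands are antiparallel)."""
    return complement(strand)[::-1]


if __name__ == "__main__":
    s = "ATGCGT"
    print(complement(s))           # TACGCA
    print(reverse_complement(s))   # ACGCAT
```

Applying `reverse_complement` twice returns the original strand, which reflects the fact that each strand can serve as the template for the other.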

2.5 Chromosomes
Chromosomes are the paired, self-replicating genetic structures of cells that contain the cellular DNA; the nucleotide sequence of the DNA encodes the linear array of genes.

2.6 Gene
A gene is the fundamental physical and functional unit of heredity. A gene is an ordered sequence of nucleotides located in a particular position on a particular chromosome that encodes a specific functional product (i.e. a protein or RNA molecule).

2.7 Protein
A protein is a molecule composed of one or more chains of amino acids in a specific order. The order is determined by the base sequence of nucleotides in the gene coding for the protein. Proteins are required for the structure, function and regulation of cells, tissues and organs, each protein having a specific role (e.g., hormones, enzymes and antibodies). DNA carries the hereditary information, and its role is to direct the synthesis of proteins; thereafter, all the hereditary characteristics are reflected in the activities of the body cells through these proteins.

2.8 Sequencing
Sequencing means the determination of the order of nucleotides (base sequences) in a DNA or RNA molecule, or the order of amino acids in a protein.

2.9 Genome
The genome of an organism means all the genetic material in its chromosomes. Its size is generally given as its total number of base pairs. Genomes of different organisms can be compared to identify similarities and disparities in the strategies for the "Logic of Life".

2.10 Clone
A clone is an exact copy made of biological material such as a DNA segment, a whole cell or a complete organism. The process of creating a clone is called cloning.

2.11 Model Organism


Saccharomyces cerevisiae, commonly known as baker's yeast, has emerged as the model organism. It has demonstrated the fundamental conservation of the basic informational pathways found in almost all organisms. From the detailed study of the genomes of such organisms (which is possible today), we can gain an insight into their functioning. All this data will lead to fundamental insights into human biology. The vast amount of genetic data available on this species provides important clues helpful for the ongoing research on human genetics.

Saccharomyces cerevisiae has become the workhorse of many biotechnology labs. It can exist in either a haploid or a diploid state and divides by the vegetative process of budding. Yeast cultures can be easily propagated in labs. It has become the model organism partly because of the ease with which genetic manipulations can be carried out. Random mutations can be induced in the genome by treating live cells with chemicals such as ethyl methanesulphonate or by exposure to ultra-violet rays. Targeted gene inactivations can also be carried out; this property is very important during experiments for the unambiguous assignment of gene functions.

Saccharomyces cerevisiae has a compact genome of about 12 million base pairs of DNA present on 16 chromosomes. This presented a reasonable goal for complete sequencing and analysis of its genome. The Saccharomyces Genome Database (SGD) was established at Stanford University in 1995. Knowing the complete sequence of a genome is only the first step in understanding how the huge amount of information contained in genes is translated into functional proteins.

Chapter 3

Role of software Engineers and Technology in Biotechnology


The tools of computer science, statistics and mathematics are critical for studying biology as an informational science. Curiously, biology is the only science that, at its very heart, employs a digital language. The grand challenge in biology is to determine how the digital language of the chromosomes is converted into the 3-D and 4-D (time varying) languages of living organisms.

3.1 Need for software automation
DNA encodes the information necessary for building and maintaining life. DNA is a non-branching, double-stranded macromolecule in which the nucleotide building blocks (A, C, G, T) are linked. Bases are arranged in A-T and C-G pairs. Small viral genomes of the order of several thousand bases were the first to be sequenced, in the 1970s. A few years later, genomes of the order of 40 kilo base pairs represented the limit of what could reasonably be sequenced. At this stage, the need for automation was recognized and methods were applied to the degree possible. By the year 1997, the yeast genome consisting of 12 Mega base pairs was completed, and in 1998, the conclusion of the 100 Mega base pairs nematode genome project was announced. Most recently, the 180 Mega base pairs fruit-fly genome was also completed. All of these projects relied on substantially higher levels of software automation. We are now in the midst of the most ambitious project so far: sequencing of the 3 Giga base pairs Human Genome. For this effort, and those yet to come, software automation lies at the very core of the planning and execution of the project.

The need for automation is driven largely by the trend of handling ever larger sizes of DNA and the corresponding increase in the amount of raw data this entails. Mathematical analysis indicates that the size of a project is roughly proportional to the size of the genome. This is due to the fact that the amount of information obtained from an individual sequencing experiment is relatively constant and is independent of the genome size. It is estimated that, for the human genome, of the order of 10^8 individual experiments are required to cover the genome. To meet the projected goals, modern large scale sequencing centers have developed throughput capacities of the order of several million experiments per month, with data processing handled on a continuous basis. Managing such large projects without a high degree of automation would clearly be impossible in terms of cost and time requirements.

So, DNA is the basic genetic material. It transmits hereditary characters from one generation to the next. During synthesis of proteins, mRNA molecules, which act as the messengers of information (the exact genetic code), are built from DNA. Proteins are synthesized using mRNA molecules. Protein interactions give rise to information pathways and networks which help in building cells that are identical to their parent cells. A cluster of many cells in a predefined arrangement composes a tissue. An organ is a combination of tissues, and an organism is nothing but an organization of organs. Refer to figure 3.1. The challenge for computer professionals is to create tools that can capture and integrate these different levels of biological information.
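The proportionality between project size and genome size described in this section can be illustrated with rough arithmetic. The sketch below rests on two assumptions that are not taken from this report: that each sequencing experiment yields about 500 bases (`READ_LENGTH`), and that each base is sequenced about 10 times on average (`COVERAGE`). The genome sizes are the ones quoted in the text.

```python
# Illustrative assumptions (not figures from the report):
READ_LENGTH = 500   # bases obtained per sequencing experiment
COVERAGE = 10       # average number of times each base is sequenced

GENOMES = {         # genome sizes in base pairs, as quoted in the text
    "yeast": 12_000_000,
    "nematode": 100_000_000,
    "fruit fly": 180_000_000,
    "human": 3_000_000_000,
}


def experiments_needed(genome_size: int) -> int:
    """Experiments required when each one yields a fixed number of bases."""
    return genome_size * COVERAGE // READ_LENGTH


if __name__ == "__main__":
    for name, size in GENOMES.items():
        print(f"{name:10s} {experiments_needed(size):>12,d}")
```

Under these assumptions the human genome needs 60,000,000 experiments, which is consistent with the order of 10^8 quoted above, and doubling the genome size doubles the number of experiments.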


3.2 Genetic Algorithms
All that computers can do is implement algorithms. Hence, when we talk of using computers for processing biological information, we have to define precise mathematical algorithms. The following are a few absolutely basic algorithms in Bioinformatics.

3.2.1 Database Searching


Database interrogation can take the form of text queries (e.g. "Display all the human adrenergic receptors") or sequence similarity searches (e.g. "Given the sequence of a human adrenergic receptor, display all the similar sequences in the database"). Sequence similarity searches are straightforward because the data in the databases is mostly in the form of sequences.

3.2.2 Comparing Two Sequences
Let us take the case of comparing two protein sequences. The alphabet complexity is 20, since a protein is nothing but a sequence of amino acids and there are 20 possible amino acids. The naive approach is to line up the sequences against each other and insert additional characters (gaps) to bring the two strings into vertical alignment. The more matches there are, the closer the two sequences. The quality of an alignment can be measured in terms of the number of gaps introduced and the number of mismatches remaining in the alignment. A metric relating such parameters represents the distance between two sequences.
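One common way to turn the idea of counting gaps and mismatches into a number is dynamic programming over all possible alignments. The sketch below is not necessarily the metric the section has in mind: it assumes unit penalties for every gap character and every mismatch (which makes it the classical edit distance), and the function name and example sequences are illustrative.

```python
def alignment_distance(a: str, b: str, gap: int = 1, mismatch: int = 1) -> int:
    """Minimum total penalty over all global alignments of a and b."""
    # prev[j] is the best cost of aligning the processed prefix of a with b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]                   # a[:i] aligned against gaps only
        for j in range(1, len(b) + 1):
            substitute = prev[j - 1] + (0 if a[i - 1] == b[j - 1] else mismatch)
            gap_in_b = prev[j] + gap       # a[i-1] aligned against a gap
            gap_in_a = curr[j - 1] + gap   # b[j-1] aligned against a gap
            curr.append(min(substitute, gap_in_b, gap_in_a))
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    print(alignment_distance("MKTAYIAK", "MKTAYIAK"))   # 0: identical sequences
    print(alignment_distance("MKTAYIAK", "MKTAIAK"))    # 1: a single gap suffices
```

Real protein comparison usually replaces the unit mismatch penalty with a substitution matrix and uses separate gap-opening and gap-extension penalties, but the table-filling structure stays the same.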

3.2.3 Multiple Sequence Alignment


In the previous subsection, we saw pairwise sequence alignment, which is fundamental to sequence analysis. However, analysis of groups of sequences that form gene families requires the ability to make connections between more than two members of the group, in order to reveal subtle conserved family characteristics. The goal of multiple sequence alignment is to generate a concise, information-rich summary of sequence data in order to inform decision-making on the relatedness of sequences to a gene family. A multiple sequence alignment is a 2D table, in which the rows represent individual sequences and the columns the residue positions. The sequences are laid onto this grid in such a manner that (a) the relative positioning of residues within any one sequence is preserved, and (b) similar residues in all the sequences are brought into vertical register.
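The 2D table view of a multiple alignment can be made concrete with a small sketch. The code does not compute an alignment; it assumes a hand-made, already-aligned toy family (`ALIGNMENT`, invented for illustration, with '-' marking a gap) and shows how columns are read off to find conserved positions and a simple consensus.

```python
from collections import Counter

# A hand-made, already-aligned toy family; '-' marks a gap.
ALIGNMENT = [
    "MKT-AYIAK",
    "MKTQAYLAK",
    "MRT-AYIAK",
]


def columns(rows: list[str]) -> list[str]:
    """Turn the row-wise table into a list of column strings."""
    if len({len(r) for r in rows}) != 1:
        raise ValueError("all rows of an alignment must have the same length")
    return ["".join(col) for col in zip(*rows)]


def conserved_positions(rows: list[str]) -> list[int]:
    """Indices of columns in which every sequence has the same residue."""
    return [i for i, col in enumerate(columns(rows))
            if "-" not in col and len(set(col)) == 1]


def consensus(rows: list[str]) -> str:
    """Most common character in each column."""
    return "".join(Counter(col).most_common(1)[0][0] for col in columns(rows))


if __name__ == "__main__":
    print(conserved_positions(ALIGNMENT))   # [0, 2, 4, 5, 7, 8]
    print(consensus(ALIGNMENT))             # MKT-AYIAK
```

Removing the gap characters from any row gives back the original sequence, which corresponds to condition (a) above; the conserved columns correspond to condition (b).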

3.3 Genome Projects
In the mid-1980s, the United States Department of Energy initiated a number of projects to construct detailed genetic and physical maps of the human genome, to determine its complete nucleotide sequence, and to localize its estimated 100,000 genes. Work on this scale required the development of new computational methods for analysing genetic map and DNA sequence data, and demanded the design of new techniques and instrumentation for detecting and analysing DNA. To benefit the public most effectively, the projects also necessitated the use of advanced means of information dissemination in order to make the results available as rapidly as possible to scientists and physicians. The international effort arising from this vast initiative became known as the Human Genome Project. Similar research efforts were also launched to map and sequence the genomes of a variety of organisms used extensively in research labs as model systems. In April 1998, although the sequencing projects of only a small number of relatively small genomes had been completed, and the human genome was not expected to be complete until after the year 2003, the results of such projects were already beginning to pour into the public sequence databases in overwhelming numbers.

3.4

Goals for Advancements in Sequencing Technology


DNA sequencing technology has improved dramatically since the genome projects began.

The amount of sequence produced each year is increasing steadily; individual centers are now producing tens of millions of base pairs of sequence annually. In the future, de novo sequencing of additional genomes, comparative sequencing of closely related genomes, and sequencing to assess variation within genomes will become increasingly indispensable tools for biological and medical research. Much more efficient sequencing technology will be needed than is currently available. The incremental improvements made to date have not yet resulted in any fundamental paradigm shifts. Nevertheless, the current state-of-the-art technology can still be significantly improved, and resources should be invested to accomplish this. Beyond that, research must be supported on new technologies that will make even higher throughput DNA sequencing efficient, accurate, and cost-effective, thus providing the foundation for other advanced genomic analysis tools. Progress must be achieved in three areas:

a) Continue to increase the throughput and reduce the cost of current sequencing technology. Increased automation, miniaturization, and integration of the approaches currently in use, together with incremental, evolutionary improvements in all steps of the sequencing process, are needed to yield further increases in throughput (to at least 500 Mb of finished sequence per year by 2003) and reductions in cost. At least a twofold cost reduction from current levels (which average $0.50 per base for finished sequence in large-scale centers) should be achieved in the next 5 years. Production of the working draft of the human sequence will cost considerably less per base pair.

b) Support research on novel technologies that can lead to significant improvements in sequencing technology. New conceptual approaches to DNA sequencing must be supported to attain substantial improvements over the current sequencing paradigm. For example, microelectromechanical systems (MEMS) may allow significant reduction of reagent use, increase in assay speed, and true integration of sequencing functions. Rapid mass spectrometric analysis methods are achieving impressive results in DNA fragment identification and offer the potential for very rapid DNA sequencing. Other more revolutionary approaches, such as single-molecule sequencing methods, must be explored as well. Significant investment in interdisciplinary research in instrumentation, combining chemistry, physics, biology, computer science, and engineering, will be required to meet this goal. Funding of far-sighted projects that may require 5 to 10 years to reach fruition will be essential. Ultimately, technologies that could, for example, sequence one vertebrate genome per year at affordable cost are highly desirable.

c) Develop effective methods for the advanced development and introduction of new sequencing technologies into the sequencing process. As the scale of sequencing increases, the introduction of improvements into the production stream becomes more challenging and costly. New technology must therefore be robust and be carefully evaluated and validated in a high-throughput environment before its implementation in a production setting. A strong commitment from both the technology developers and the technology users is essential in this process. It must be recognized that the advanced development process will often require significantly more funds than proof-of-principle studies. Targeted funding allocations and dedicated review mechanisms are needed for advanced technology development.
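The throughput and cost targets quoted in goal (a) can be turned into a small worked calculation. The sketch below is only illustrative: the constants come from the figures in the text (500 Mb per year, $0.50 per base, a twofold reduction, and a human genome of roughly 3 billion base pairs), while the function and variable names are assumptions introduced here.

```python
# Figures taken from the text; names are illustrative assumptions.
COST_PER_BASE_USD = 0.50              # current average cost of finished sequence
TARGET_REDUCTION = 2                  # "at least a twofold cost reduction"
TARGET_THROUGHPUT_BP = 500_000_000    # 500 Mb of finished sequence per year
HUMAN_GENOME_BP = 3_000_000_000       # about 3 billion base pairs


def annual_cost(throughput_bp, cost_per_base):
    """Cost in dollars of producing throughput_bp finished bases in a year."""
    return throughput_bp * cost_per_base


current = annual_cost(TARGET_THROUGHPUT_BP, COST_PER_BASE_USD)
reduced = annual_cost(TARGET_THROUGHPUT_BP, COST_PER_BASE_USD / TARGET_REDUCTION)
years = HUMAN_GENOME_BP / TARGET_THROUGHPUT_BP

print(f"Cost at current price : ${current:,.0f} per year")
print(f"Cost after 2x cut     : ${reduced:,.0f} per year")
print(f"Years for one genome at 500 Mb/year: {years:.0f}")
```

At 500 Mb per year a single human-sized genome would still take about six years, which suggests why the text treats a rate of one vertebrate genome per year as a goal for novel technologies rather than for incremental improvement.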

3.5 Developing Technology to Handle Sequence Variations


Natural sequence variation is a fundamental property of all genomes. Any two haploid human genomes show multiple sites and types of polymorphism. Some of these have functional implications, whereas many probably do not. The most common polymorphisms in the human genome are single base-pair differences, also called single-nucleotide polymorphisms (SNPs). When two haploid genomes are compared, SNPs occur every kilobase, on average. Other kinds of sequence variation, such as copy number changes, insertions, deletions, duplications, and rearrangements also exist, but at low frequency and their distribution is poorly understood. Basic information about the types, frequencies, and distribution of polymorphisms in the human genome and in human populations is critical for progress in human genetics. Better high-throughput methods for using such information in the study of human disease are also needed.

SNPs are abundant, stable, widely distributed across the genome, and lend themselves to automated analysis on a very large scale, for example, with DNA array technologies. Because of these properties, SNPs will be a boon for mapping complex traits such as cancer, diabetes, and mental illness. Dense maps of SNPs will make possible genome-wide association studies, which are a powerful method for identifying genes that make a small contribution to disease risk. In some instances, such maps will also permit prediction of individual differences in drug response. Publicly available maps of large numbers of SNPs distributed across the whole genome, together with technology for rapid, large-scale identification and scoring of SNPs, must be developed to facilitate this research.

a) Develop technologies for rapid, large-scale identification of SNPs and other DNA sequence variants. The study of sequence variation requires efficient technologies that can be used on a large scale and that can accomplish one or more of the following tasks: rapid identification of many thousands of new SNPs in large numbers of samples. Although the immediate emphasis is on SNPs, ultimately technologies that can be applied to polymorphisms of any type must be developed. Technologies are also needed that can rapidly compare, by large-scale identification of similarities and differences, the DNA of a species that is closely related to one whose DNA has already been sequenced. The technologies that are developed should be cost-effective and broadly accessible.

b) Identify common variants in the coding regions of the majority of identified genes. Initially, association studies involving complex diseases will likely test a large series of candidate genes; eventually, sequences in all genes may be systematically tested. SNPs in coding sequences (also known as cSNPs) and the associated regulatory regions will be immediately useful as specific markers for disease. An effort should be made to identify such SNPs as soon as possible. Ultimately, a catalog of all common variants in all genes will be desirable. This should be cross-referenced with cDNA sequence data.

c) Create an SNP map of at least 100,000 markers. A publicly available SNP map of sufficient density and informativeness to allow effective mapping in any population is the ultimate goal. A map of 100,000 SNPs (one SNP per 30,000 nucleotides) is likely to be sufficient for studies in some relatively homogeneous populations, while denser maps may be required for studies in large, heterogeneous populations. Thus, during this 5-year period, the HGP authorities have planned to create a map of at least 100,000 SNPs. If technological advances permit, a map of greater density is desirable. Research should be initiated to estimate the number of SNPs needed in different populations.

d) Develop the intellectual foundations for studies of sequence variation. The methods and concepts developed for the study of single-gene disorders are not sufficient for the study of complex, multigene traits. The study of the relationship between human DNA sequence variation, phenotypic variation, and complex diseases depends critically on better methods. Effective research design and analysis of linkage, linkage disequilibrium, and association data are areas that need new insights. Questions such as which study designs are appropriate to which specific populations, and with which population genetics characteristics, must be answered. Appropriate statistical and computational tools and rigorous criteria for establishing and confirming associations must also be developed.

e) Create public resources of DNA samples and cell lines. To facilitate SNP discovery it is critical that common public resources of DNA samples and cell lines be made available as rapidly as possible. To maximize discovery of common variants in all human populations, a resource is needed that includes individuals whose ancestors derive from diverse geographic areas. It should encompass as much of the diversity found in the human population as possible. Samples in this initial public repository should be totally anonymous to avoid concerns that arise with linked or identifiable samples. DNA samples linked to phenotypic data and identified as to their geographic and other origins will be needed to allow studies of the frequency and distribution of DNA polymorphisms in specific populations and their relevance to disease. However, such collections raise many ethical, legal, and social concerns that must be addressed. Credible scientific strategies must be developed before creating these resources.
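To make the idea of an SNP concrete, the following minimal sketch compares two short, already aligned haploid sequences position by position and reports the single-base differences. It also reproduces the map-density arithmetic quoted in goal (c). The sequences are invented for illustration, and the function names `find_snps` and `marker_spacing` are assumptions, not part of any existing tool. Real SNP discovery also has to deal with alignment, sequencing errors, and insertions or deletions, none of which is handled here.

```python
def find_snps(seq_a, seq_b):
    """Return (position, base_a, base_b) for every single-base difference
    between two equal-length sequences that are already aligned."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return [(i, a, b) for i, (a, b) in enumerate(zip(seq_a, seq_b)) if a != b]


def marker_spacing(genome_size_bp, n_markers):
    """Average distance in base pairs between evenly spread markers."""
    return genome_size_bp / n_markers


# Two made-up haploid sequences differing at two positions.
hap1 = "ACGTTGCAAGCT"
hap2 = "ACGTCGCAAGTT"
snps = find_snps(hap1, hap2)
print(snps)

# 100,000 markers over a genome of about 3 billion base pairs.
spacing = marker_spacing(3_000_000_000, 100_000)
print(f"one SNP marker per {spacing:,.0f} nucleotides")
```

The second calculation confirms the figure in the text: 100,000 markers spread over roughly 3 billion base pairs gives one marker per 30,000 nucleotides.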

3.6 Need of Technology in Functional Genomics


Functional genomics is the interpretation of the function of DNA sequence on a genomic scale. Already, the availability of the sequence of entire organisms has demonstrated that many genes and other functional elements of the genome are discovered only when the full DNA sequence is known. Such discoveries will accelerate as sequence data accumulate. However, knowing the structure of a gene or other element is only part of the answer. The next step is to elucidate function, which results from the interaction of genomes with their environment. Current methods for studying DNA function on a genomic scale include comparison and analysis of sequence patterns directly to infer function, large-scale analysis of the messenger RNA and protein products of genes, and various approaches to gene disruption. In the future, a host of novel strategies will be needed for elucidating genomic function. This will be a challenge for all of biology. The HGP will be contributing to this area by emphasizing the development of technology that can be used on a large scale, is efficient, and is capable of generating complete data for the genome as a whole. To the extent that available resources allow, expansion of current approaches as well as innovative technology ideas should be supported in the areas described below.

a) Develop cDNA resources. Complete sets of full-length cDNA clones and sequences for both humans and model organisms would be enormously useful for biologists and are urgently needed. Such resources would help in both gene discovery and functional analysis. High priority should be placed on developing technology for obtaining full-length cDNAs. Complete and validated inventories of full-length cDNA clones and corresponding sequences should be generated and made available to the community once such technology is at hand.

b) Improved technologies are needed for global approaches to the study of non-protein-coding sequences, including production of relevant libraries, comparative sequencing, and computational analysis.

c) Develop technology for comprehensive analysis of gene expression. Information about the spatial and temporal patterns of gene expression in both humans and model organisms offers one key to understanding gene expression. Efficient and cost-effective technology needs to be developed to measure various parameters of gene expression reliably and reproducibly. Complementary DNA sequences and validated sets of clones with unique identifiers will be needed for array technologies, large-scale in situ hybridization, and other strategies for measuring gene expression. Improved methods for quantifying, representing, analyzing, and archiving expression data should also be developed.

d) Improve methods for genome-wide mutagenesis. Creating mutations that cause loss or alteration of function is another prime approach to studying gene function. Technologies, both gene- and phenotype-based, which can be used on a large scale in vivo or in vitro, are needed for generating or finding such mutations in all genes. Such technologies should be piloted in appropriate model systems, including both cell culture and whole organisms.

e) Develop technology for global protein analysis. A full understanding of genome function requires an understanding of protein function on a genome-wide basis. Development of experimental and computational methods to study global spatial and temporal patterns of protein expression, protein-ligand interactions, and protein modification needs to be supported.

3.7 Bioinformatics and Computational Biology


Bioinformatics support is essential to the implementation of genome projects and for public access to their output. Bioinformatics needs for the genome project fall into two broad areas: (i) databases and (ii) development of analytical tools. Collection, analysis, annotation, and storage of the ever-increasing amounts of mapping, sequencing, and expression data in publicly accessible, user-friendly databases is critical to the project's success. In addition, the community needs computational methods that will allow scientists to extract, view, annotate, and analyze genomic information efficiently. Thus, the genome project must continue to invest substantially in these areas. Conservation of resources through development of portable software should be encouraged.

a) Improve content and utility of databases. Databases are the ultimate repository of genome project data. As new kinds of data are generated and new biological relationships discovered, databases must provide for continuous and rapid expansion and adaptation to the evolving needs of the scientific community. To encourage broad use, databases should be responsive to a diverse range of users with respect to data display, data deposition, data access, and data analysis. Databases should be structured to allow the queries of greatest interest to the community to be answered in a seamless way. Communication among databases must be improved. Achieving this will require standardization of nomenclature. A database of human genomic information, analogous to the model organism databases and including links to many types of phenotypic information, is needed.

b) Develop better tools for data generation, capture, and annotation. Large-scale, high-throughput genomics centers need readily available, transportable informatics tools for commonly performed tasks such as sample tracking, process management, map generation, sequence finishing, and primary annotation of data. Smaller users urgently need reliable tools to meet their sequencing and sequence analysis needs. Readily accessible information about the availability and utility of various tools should be provided, as well as training in the use of tools.

c) Develop and improve tools and databases for comprehensive functional studies. Massive amounts of data on gene expression and function will be generated in the near future. Databases that can organize and display these data in useful ways need to be developed. New statistical and mathematical methods are needed for analysis and comparison of expression and function data, in a variety of cells and tissues, at various times and under different conditions. Also needed are tools for modeling complex networks and interactions.

d) Develop and improve tools for representing and analyzing sequence similarity and variation. The study of sequence similarity and variation within and among species will become an increasingly important approach to biological problems. There will be many forms of sequence variation, of which SNPs will be only one type. Tools need to be created for capturing, displaying, and analyzing information about sequence variation.

e) Create mechanisms to support effective approaches for producing robust, exportable software that can be widely shared. Many useful software products are being developed in both academia and industry that could be of great benefit to the community. However, these tools generally are not robust enough to make them easily exportable to another laboratory. Mechanisms are needed for supporting the validation and development of such tools into products that can be readily shared and for providing training in the use of these products. Participation by the private sector is strongly encouraged.

3.8 Job Opportunities and Job Requirements


The Human Genome Project has created the need for new kinds of scientific specialists who can be creative at the interface of biology and other disciplines, such as computer science, engineering, mathematics, physics, chemistry, and the social sciences. As the popularity of genomic research increases, the demand for these specialists greatly exceeds the supply. In the past, the genome project has benefited immensely from the talents of non-biological scientists, and their participation in the future is likely to be even more crucial. There is an urgent need to train more scientists in interdisciplinary areas that can contribute to genomics. Programs must be developed that will encourage training of both biological and non-biological scientists for careers in genomics. Especially critical is the shortage of individuals trained in Bioinformatics. Also needed are scientists trained in the management skills required to lead large data-production efforts. Another urgent need is for scholars who are trained to undertake studies on the societal impact of genetic discoveries. Such scholars should be knowledgeable in both genome-related sciences and in the social sciences. Ultimately, a stable academic environment for genomic science must be created so that innovative research can be nurtured and training of new individuals can be assured. The latter is the responsibility of the academic sector, but funding agencies can encourage it through their grants programs.

3.9 Training Goals included in the Human Genome Project Plan

a) Nurture the training of scientists skilled in genomics research. A number of approaches to training for genomics research should be explored. These include providing fellowship and career awards and encouraging the development of institutional training programs and curricula. Training that will facilitate collaboration among scientists from different disciplines, as well as courses that introduce scientists to new technologies or approaches, should also be included.

b) Encourage the establishment of academic career paths for genomic scientists. Ultimately, a strong academic presence for genomic science is needed to generate the training environment that will encourage individuals to enter the field. Currently, the high demand for genome scientists in industry threatens the retention of genome scientists in academia. Attractive incentives must be developed to maintain the critical mass essential for sponsoring the training of the next generation of genome scientists.

c) Increase the number of scholars who are knowledgeable in both genomic and genetic sciences and in ethics, law, or the social sciences. As the pace of genetic discoveries increases, the need for individuals who have the necessary training to study the social impact of these discoveries also increases. The ELSI program should expand its efforts to provide postdoctoral and senior fellowship opportunities for cross-training. Such opportunities should be provided both to scientists and health professionals who wish to obtain training in the social sciences and humanities and to scholars trained in law, the social sciences, or the humanities who wish to obtain training in genomic or genetic sciences.

Chapter 4

Human Genome Project


4.1 Introduction

Begun in 1990, the U.S. Human Genome Project is a 13-year effort coordinated by the Department of Energy and the National Institutes of Health. The project was originally planned to last 15 years, but effective use of resources and technological advances have brought the expected completion date forward to 2003. The project goals are to identify all of the approximately 30,000 genes in human DNA, determine the sequences of the 3 billion chemical base pairs that make up human DNA, store this information in databases, improve tools for data analysis, transfer related technologies to the private sector, and address the ethical, legal, and social issues that may arise from the project.

4.2 Details of the Human Genome Project


The Human Genome Project (HGP) is fulfilling its promise as the single most important project in biology and the biomedical sciences, one that will permanently change biology and medicine. With the recent completion of the genome sequences of several microorganisms, including Escherichia coli and Saccharomyces cerevisiae, and the imminent completion of the sequence of the metazoan Caenorhabditis elegans, the door has opened wide on the era of whole genome science. The ability to analyze entire genomes is accelerating gene discovery and revolutionizing the breadth and depth of biological questions that can be addressed in model organisms. These exciting successes confirm the view that acquisition of a comprehensive, high-quality human genome sequence will have unprecedented impact and long-lasting value for basic biology, biomedical research, biotechnology, and health care. The transition to sequence-based biology will spur continued progress in understanding gene-environment interactions and in development of highly accurate DNA-based medical diagnostics and therapeutics.

Human DNA sequencing, the flagship endeavor of the HGP, is entering its decisive phase. It will be the project's central focus during the next 5 years. While partial subsets of the DNA sequence, such as expressed sequence tags (ESTs), have proven enormously valuable, experience with simpler organisms confirms that there can be no substitute for the complete genome sequence. In order to move vigorously toward this goal, the crucial task ahead is building sustainable capacity for producing publicly available DNA sequence. The full and incisive use of the human sequence, including comparisons to other vertebrate genomes, will require further increases in sustainable capacity at high accuracy and lower costs. Thus, a high-priority commitment to develop and deploy new and improved sequencing technologies must also be made.

Availability of the human genome sequence presents unique scientific opportunities, chief among them the study of natural genetic variation in humans. Genetic or DNA sequence variation is the fundamental raw material for evolution. Importantly, it is also the basis for variations in risk among individuals for numerous medically important, genetically complex human diseases. An understanding of the relationship between genetic variation and disease risk promises to change significantly the future prevention and treatment of illness. The new focus on genetic variation, as well as other applications of the human genome sequence, raises additional ethical, legal, and social issues that need to be anticipated, considered, and resolved.

The HGP has made genome research a central underpinning of biomedical research. It is essential that it continue to play a lead role in catalyzing large-scale studies of the structure and function of genes, particularly in functional analysis of the genome as a whole. However, full implementation of such methods is a much broader challenge and will ultimately be the responsibility of the entire biomedical research and funding communities. Success of the HGP critically depends on Bioinformatics and computational biology as well as training of scientists to be skilled in the genome sciences. The project must continue a strong commitment to support of these areas.

As intended, the HGP has become a truly international effort to understand the structure and function of the human genome. Many countries are participating according to their specific interests and capabilities. Coordination is informal and generally effected at the scientist-to-scientist level. The U.S. component of the project is sponsored by the National Human Genome Research Institute at the National Institutes of Health (NIH) and the Office of Biological and Environmental Research at the Department of Energy (DOE). The HGP has benefited greatly from the contributions of its international partners. The private sector has also provided critical assistance. These collaborations will continue, and many will expand. Both NIH and DOE welcome participation of all interested parties in the accomplishment of the HGP's ultimate purpose, which is to develop and make publicly available to the international community the genomic resources that will expedite research to improve the lives of all people.

4.3 U.S. Human Genome Project 5-Year Goals 1998-2003

4.3.1 Human DNA Sequencing


Providing a complete, high-quality sequence of human genomic DNA to the research community as a publicly available resource continues to be the HGP's highest priority goal. The enormous value of the human genome sequence to scientists, and the considerable savings in research costs its widespread availability will allow, are compelling arguments for advancing the timetable for completion. Recent technological developments and experience with large-scale sequencing provide increasing confidence that it will be possible to complete an accurate, high-quality sequence of the human genome by the end of 2003, 2 years sooner than previously predicted. NIH and DOE expect to contribute 60 to 70% of this sequence, with the remainder coming from the effort at the Sanger Center and other international partners. This is a highly ambitious goal, given that only about 6% of the human genome sequence has been completed thus far. Sequence completion by the end of 2003 is a major challenge, but within reach and well worth the risks and effort. Realizing the goal will require an intense and dedicated effort and a continuation and expansion of the collaborative spirit of the international sequencing community. Only sequence of high accuracy and long-range contiguity will allow a full interpretation of all the information encoded in the human genome.

Availability of the human sequence will not end the need for large-scale sequencing. Full interpretation of that sequence will require much more sequence information from many other organisms, as well as information about sequence variation in humans. Thus, the development of sustainable, long-term sequencing capacity is a critical objective of the HGP. Achieving the goals below will require a capacity of at least 500 megabases (Mb) of finished sequence per year by the end of 2003.

a) Finish the complete human genome sequence by the end of 2003. To best meet the needs of the scientific community, the finished human DNA sequence must be a faithful representation of the genome, with high base-pair accuracy and long-range contiguity. Specific quality standards that balance cost and utility have already been established. These quality standards should be reexamined periodically; as experience in using sequence data is gained, the appropriate standards for sequence quality may change. One of the most important uses for the human sequence will be comparison with other human and nonhuman sequences. The sequence differences identified in such comparisons should, in nearly all cases, reflect real biological differences rather than errors or incomplete sequence. Consequently, the current standard for accuracy (an error rate of no more than 1 base in 10,000) remains appropriate. The current public sequencing strategy is based on mapped clones and occurs in two phases. The first, or "shotgun" phase, involves random determination of most of the sequence from a mapped clone of interest. Methods for doing this are now highly automated and efficient. Mapped shotgun data are assembled into a product ("working draft" sequence) that covers most of the region of interest but may still contain gaps and ambiguities. In the second, finishing phase, the gaps are filled and discrepancies resolved. At present, the finishing phase is more labor intensive than the shotgun phase. Already, partially finished, working-draft sequence is accumulating in public databases at about twice the rate of finished sequence.

b) Make the sequence totally and freely accessible. The HGP was initiated because its proponents believed the human sequence is such a precious scientific resource that it must be made totally and publicly available to all who want to use it. Only the wide availability of this unique resource will maximally stimulate the research that will eventually improve human health.
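The two-phase strategy described under goal (a) can be illustrated with a toy example: random fragments ("reads") are merged wherever the end of one overlaps the start of another, and any reads that cannot be joined are left as separate pieces, which corresponds to the gaps that the finishing phase must close. The sketch below uses a simple greedy overlap-merge chosen here for illustration; it is not claimed to be the method used by the HGP sequencing centers. The reads are invented, and the names `overlap` and `greedy_assemble` are assumptions.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that is also a prefix of b
    (0 if the overlap is shorter than min_len)."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of reads with the longest overlap.
    Returns the list of contigs left when no overlaps remain."""
    reads = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while len(reads) > 1:
        best_n, best_i, best_j = 0, None, None
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_n:
                        best_n, best_i, best_j = n, i, j
        if best_n == 0:
            break  # nothing overlaps any more: the remaining pieces are gaps
        merged = reads[best_i] + reads[best_j][best_n:]
        reads = [r for k, r in enumerate(reads) if k not in (best_i, best_j)]
        reads.append(merged)
    return reads


# Three overlapping reads taken from one made-up region.
contigs = greedy_assemble(["ATGCGTAC", "GTACGTTA", "GTTAGC"])
print(contigs)

# Two reads with no overlap stay as two contigs, i.e. a gap remains.
draft = greedy_assemble(["AAAC", "GGGT"])
print(len(draft), "contigs")
```

A greedy merge of this kind can be misled by repeated sequence, which is one reason a working draft may contain the ambiguities mentioned in the text.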

4.3.2 Sequencing Technology
Create a long-term, sustainable sequencing capacity by improving current technology and developing highly efficient novel technologies. Achieving this HGP goal will require current sequencing capacity to be expanded 2-3 times, demanding further incremental advances in standard sequencing technologies and improvements in efficiency and cost. For future sequencing applications, planners emphasize the importance of supporting novel technologies that may be 5-10 years in development.

4.3.3 Sequence Variation

Develop technologies for rapid identification of DNA sequence variants. A new priority for the HGP is examining regions of natural variation that occur among genomes (except those of identical twins). Goals specify development of methods to detect different types of variation, particularly the most common type called single nucleotide polymorphisms (SNPs) that occur about once every 1000 bases. Scientists believe SNP maps will help them identify genes associated with complex diseases such as cancer, diabetes, vascular disease, and some forms of mental illness. These associations are difficult to make using conventional gene hunting methods because any individual gene may make only a small contribution to disease risk. DNA sequence variations also underlie many individual differences in responses to the environment and treatments.

4.3.4 Functional Genomics


Expand support for current approaches and innovative technologies. Efficient interpretation of the functions of human genes and other DNA sequences requires developing the resources and strategies to enable large-scale investigations across whole genomes. A technically challenging first priority is to generate complete sets of full-length cDNA clones and sequences for human and model organism genes. Other functional genomics goals include studies into gene expression and control, creation of mutations that cause loss or alteration of function in nonhuman organisms, and development of experimental and computational methods for protein analyses.

4.3.5 Comparative Genomics


Obtain complete genomic sequences for C. elegans (1998), Drosophila (2002), and mouse (2008). A first clue toward identifying and understanding the functions of human genes or other DNA regions is often obtained by studying their parallels in nonhuman genomes. To enable efficient comparisons, complete genomic sequences already have been obtained for the bacterium E. coli and the yeast S. cerevisiae, and work continues on sequencing the genomes of the roundworm, fruit fly, and mouse. Planners note that other genomes will need to be sequenced to realize the full promise of comparative genomics, stressing the need to build a sustainable sequencing capacity.

4.3.6 Ethical, Legal, and Social Implications (ELSI)


Analyze and address implications of identifying DNA sequence information for individuals, families, and communities. Facilitate safe and effective integration of genetic technologies. Facilitate education about genomics in nonclinical and research settings.

Rapid advances in genetics and applications present new and complex ethical and policy issues for individuals and society. ELSI programs that identify and address these implications have been an integral part of the US HGP since its inception. These programs have resulted in a body of work that promotes education and helps guide the conduct of genetic research and the development of related health professional and public policies. Continuing and new challenges include safeguarding the privacy of individuals and groups who contribute samples for large-scale sequence variation studies; anticipating how resulting data may affect concepts of race and ethnicity; identifying how genetic data could potentially be used in workplaces, schools, and courts; commercial uses; and the impact of genetic advances on concepts of humanity and personal responsibility.

4.3.7 Bioinformatics and Computational Biology


Improve current databases and develop new databases and better tools for data generation and capture and comprehensive functional studies. Continued investment in current and new databases and analytical tools is critical to the success of the Human Genome Project and to the future usefulness of the data. Databases must be structured to adapt to the evolving needs of the scientific community and allow queries to be answered easily. Planners suggest developing a human genome database analogous to model organism databases with links to phenotypic information. Also needed are databases and analytical tools for the expanding body of gene expression and function data, for modeling complex biological networks and interactions, and for collecting and analyzing sequence variation data.

4.3.8 Training
Nurture the training of genomic scientists and establish career paths. Increase the number of scholars knowledgeable in genomics and ethics, law, or the social sciences. Planners note that future genomics scientists will require training in interdisciplinary areas that include biology, computer science, engineering, mathematics, physics, and chemistry. Additionally, scientists with management skills will be needed for leading large data-production efforts.

Chapter 5

Biological Databases
5.1 The Biological sequence/structure deficit
At the beginning of 1998, more than 300,000 protein sequences had been deposited in publicly available, non-redundant databases, and the number of partial sequences in public and proprietary expressed sequence tag (EST) databases was estimated to run into millions. By contrast, the number of unique 3D structures in the Protein Data Bank (PDB) was less than 1500. Although structural information is far more complex to derive, store and manipulate than sequence data, these figures nevertheless highlight an enormous information deficit. This situation is likely to get worse as the genome projects around the world begin to bear fruit. Of course, the acquisition of structural data is also accelerating, and a future large-scale structure determination enterprise could conceivably furnish 2000 3D structures annually. But this is a small yield by comparison with that of sequence databases, which are doubling in size every year, with a new sequence being added, on average, once a minute.
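The widening gap can be seen by projecting the two growth patterns side by side. The sketch below is a rough illustration only: it starts from the round figures quoted above (300,000 sequences and 1500 structures, the latter being an upper bound), doubles the sequence count each year, and adds the hypothetical 2000 structures per year. The function name `project` and its parameters are assumptions introduced here.

```python
def project(years, sequences=300_000, structures=1_500,
            structures_per_year=2_000):
    """Yearly (year, sequences, structures, ratio) under the assumptions
    that sequences double annually and structures grow linearly."""
    rows = []
    for year in range(years + 1):
        rows.append((year, sequences, structures, sequences / structures))
        sequences *= 2
        structures += structures_per_year
    return rows


rows = project(5)
for year, seqs, structs, ratio in rows:
    print(f"year {year}: {seqs:>10,} sequences, {structs:>6,} structures, "
          f"about {ratio:,.0f} sequences per structure")
```

Under these assumptions the ratio is about 200 sequences per structure at the start and about 320 after three years, and it keeps growing because doubling eventually outpaces any fixed yearly addition.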

5.2 Biological Databases
If we are to derive the maximum benefit from the deluge of sequence information, we must deal with it in a concerted way; this means establishing, maintaining and disseminating databases; providing easy-to-use software to access the information they contain; and designing state-of-the-art analysis tools to visualize and interpret the structural and functional clues latent in the data. The first step, then, in analysing sequence information is to assemble it into central, shareable resources, i.e. databases. Databases are effectively electronic filing cabinets, a convenient and efficient method of storing vast amounts of information. There are many different database types, depending both on the nature of the information being stored and on the manner of data storage (e.g. whether in flat files, tables in a relational database, or objects in an object-oriented database).
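The difference between flat-file and relational storage can be illustrated with a small sketch. The record layout below is a hypothetical, simplified format (two-letter line codes and a "//" terminator, loosely in the spirit of sequence flat files), the entries DEMO1 and DEMO2 are invented, and the table layout is my own assumption rather than the schema of any real database.

```python
import sqlite3

# Hypothetical flat file: one field per line, records terminated by "//".
FLAT_FILE = """\
ID   DEMO1
DE   Hypothetical demo protein one
SQ   MKTAYIAKQR
//
ID   DEMO2
DE   Hypothetical demo protein two
SQ   GSHMLEDPVA
//
"""


def parse_flat_file(text):
    """Yield one dict per record, keyed by the two-letter line code."""
    record = {}
    for line in text.splitlines():
        if line.strip() == "//":
            if record:
                yield record
            record = {}
        elif line.strip():
            code, _, value = line.partition(" ")
            record[code] = value.strip()


def load_into_table(records):
    """Store the parsed records as rows of an in-memory relational table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE protein (id TEXT PRIMARY KEY, description TEXT, sequence TEXT)"
    )
    conn.executemany(
        "INSERT INTO protein VALUES (?, ?, ?)",
        [(r["ID"], r["DE"], r["SQ"]) for r in records],
    )
    return conn


conn = load_into_table(parse_flat_file(FLAT_FILE))
row = conn.execute(
    "SELECT sequence FROM protein WHERE id = ?", ("DEMO2",)
).fetchone()
print(row[0])
```

The same information is present in both forms; the relational table simply makes it possible to answer queries without re-reading the whole file.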

In the context of protein sequence analysis, we will encounter primary, composite and secondary databases. Such resources store different levels of information in totally different formats. In the past, this has led to a variety of communication problems, but emerging computer technologies are beginning to provide solutions, allowing seamless, transparent access to disparate, distributed data structures over the internet. Primary and secondary databases are used to address different aspects of sequence analysis, because they store different levels of protein sequence information. The primary structure of a protein is its amino acid sequence; these are stored in primary databases as linear alphabets that denote the constituent residues. The secondary structure of a protein corresponds to regions of local regularity, which, in sequence alignments, are often apparent as well conserved motifs; these are stored in secondary databases as patterns. The tertiary structure of a protein arises from the packing of its secondary structure elements which may form discrete domains within a fold, or may give rise to autonomous folding units or modules; complete folds, domains and modules are stored in structure databases as sets of atomic co-ordinates.

5.3 Primary Sequence Databases
In the early 1980s, sequence information started to become more abundant in the scientific literature. Realising this, several laboratories saw that there might be advantages to harvesting and storing these sequences in central repositories. Thus, several primary database projects began to evolve in different parts of the world.

5.3.1 Nucleic acid Sequence Databases


The principal DNA sequence databases are GenBank (USA), EMBL (Europe) and DDBJ (Japan), which exchange data on a daily basis to ensure comprehensive coverage at each of the sites. EMBL is the nucleotide sequence database from the European Bioinformatics Institute. The rate of growth of DNA databases has been following an exponential trend, with a doubling time of less than a year. EMBL data predominantly (more than 50%) consist of sequences from model organisms. The DNA Data Bank of Japan (DDBJ) is produced, distributed and maintained by the National Institute of Genetics.

GenBank, the DNA database from the National Center for Biotechnology Information, exchanges data with both EMBL and DDBJ to help ensure comprehensive coverage. The database is split into 17 smaller discrete divisions.

5.3.2 Protein Sequence Databases


PIR, MIPS, SWISS-PROT, and TrEMBL are the major protein sequence databases. PIR was developed for investigating evolutionary relations between proteins. In its current form, the database is split into four distinct sections, PIR1-PIR4, which differ in terms of the quality of data and the level of annotation provided. MIPS collects and processes sequence data for the tripartite PIR-International Protein Sequence Database project. SWISS-PROT is a protein sequence database which endeavors to provide high-level annotations, including descriptions of the function of the protein, the structure of its domains, its post-translational modifications and so on. TrEMBL was created as a supplement to SWISS-PROT. It was designed to address the need for a well-structured, SWISS-PROT-like resource that would allow very rapid access to sequence data from the genome projects, without having to compromise the quality of SWISS-PROT itself by incorporating sequences with insufficient analysis and annotation.

5.4 Composite Protein Sequence Databases
One solution to the problem of proliferating primary databases is to compile a composite, i.e. a database that amalgamates a variety of different primary sources. Composite databases render sequence searching much more efficient, because they obviate the need to interrogate multiple resources. The interrogation process is streamlined still further if the composite has been designed to be non-redundant, as this means that the same sequence need not be searched more than once. The choices of different sources and the application of different redundancy criteria have led to the emergence of different composites. The major composite databases are NRDB (the Non-Redundant Database), OWL, MIPSX, and SWISS-PROT+TrEMBL.
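The idea of a non-redundant composite can be sketched in a few lines. The example below is a minimal illustration under stated assumptions: the redundancy criterion is exact sequence identity (real composites apply their own, differing criteria), the sources are listed in priority order, and the source names and entries are hypothetical.

```python
def build_composite(sources):
    """Merge sources in priority order, keeping the first entry seen
    for each distinct sequence (exact-identity redundancy criterion)."""
    composite = {}
    for source_name, entries in sources:
        for entry_id, sequence in entries:
            key = sequence.upper()          # treat case differences as the same sequence
            if key not in composite:
                composite[key] = (source_name, entry_id)
    return [(src, eid, seq) for seq, (src, eid) in composite.items()]


# Hypothetical primary sources; B1 duplicates A1 and is therefore dropped.
sources = [
    ("SOURCE_A", [("A1", "MKTAYIAKQR"), ("A2", "GSHMLEDPVA")]),
    ("SOURCE_B", [("B1", "mktayiakqr"), ("B2", "TTPLVHVASV")]),
]
composite = build_composite(sources)
for source_name, entry_id, sequence in composite:
    print(source_name, entry_id, sequence)
```

A search against this composite touches three sequences instead of four, which is the saving the non-redundant design is meant to provide.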

5.5 Secondary Databases
Secondary databases contain the fruits of analyses of the sequences in the primary resources. Because there are several different primary databases and a variety of ways of analysing protein sequences, the information housed in each of the secondary resources is different. Designing software tools that can search the different types of data, interpret the range of outputs, and assess the biological significance of the results is not a trivial task. SWISS-PROT has emerged as the most popular primary source and many secondary databases now use it as their basis. Some of the main secondary resources are as follows:

Secondary database   Primary source    Stored information
PROSITE              SWISS-PROT        Regular expressions
Profiles             SWISS-PROT        Weighted matrices
PRINTS               OWL               Aligned motifs
Pfam                 SWISS-PROT        Hidden Markov Models
BLOCKS               PROSITE/PRINTS    Aligned motifs (blocks)
IDENTIFY             BLOCKS/PRINTS     Fuzzy regular expressions
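As an illustration of how a pattern stored as a regular expression can be used, the sketch below converts a pattern written in PROSITE-style notation into a Python regular expression and scans a sequence with it. It is a simplified sketch, not the PROSITE software: the function name is my own, it handles only the basic elements (x, [..], {..}, repetition counts and the < and > anchors), and the test sequences are invented. The pattern used is the commonly cited N-glycosylation site motif.

```python
import re


def prosite_to_regex(pattern):
    """Translate a basic PROSITE-style pattern into a Python regex string."""
    pattern = pattern.rstrip(".")               # patterns conventionally end with a period
    parts = []
    for element in pattern.split("-"):
        prefix = suffix = ""
        if element.startswith("<"):             # anchored at the N-terminus
            prefix, element = "^", element[1:]
        if element.endswith(">"):               # anchored at the C-terminus
            suffix, element = "$", element[:-1]
        # Split off an optional repetition such as (3) or (2,4).
        match = re.fullmatch(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element)
        residue, low, high = match.groups()
        if residue == "x":                      # any residue
            core = "."
        elif residue.startswith("{"):           # any residue except those listed
            core = "[^" + residue[1:-1] + "]"
        else:                                   # single letter or [ABC] class
            core = residue
        if low and high:
            core += "{%s,%s}" % (low, high)
        elif low:
            core += "{%s}" % low
        parts.append(prefix + core + suffix)
    return "".join(parts)


regex = prosite_to_regex("N-{P}-[ST]-{P}.")
print(regex)                                    # N[^P][ST][^P]
hit = re.search(regex, "MKNGSAA")               # invented test sequence
print(hit.start(), hit.group())                 # 2 NGSA
```

Short patterns of this kind match many sequences by chance, which is one reason the other secondary databases in the table store richer descriptors such as weighted matrices or Hidden Markov Models.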

5.6 Tertiary Databases
Tertiary databases are the databases derived from information housed in secondary (pattern) databases (e.g. the BLOCKS and eMOTIF databases, which draw on the data stored within PROSITE and PRINTS). The value of such resources is in providing a different scoring perspective on the same underlying data, allowing the possibility to diagnose relationships that might be missed using the original implementation.

Chapter 6

Applications of Bioinformatics
Substantial investment is being made in the field of biotechnology. In this chapter, I have attempted to review the results obtained so far and what is expected in the future.

6.1 Application to the Study and Treatment of Diseases
The miraculous substance that contains all of our genetic instructions, DNA, is rapidly becoming a key to modern medicine. By focusing on the diaphanous and extraordinarily long filaments of DNA that we inherit from our parents, scientists are finding the root causes of dozens of previously mysterious diseases: abnormal genes. These discoveries are allowing researchers to make precise diagnoses and predictions, to design more effective drugs, and to prevent many painful disorders. The new findings also pave the way for the development of the ultimate therapy - substituting a normal gene for a malfunctioning one so as to correct a patient's genetic defect permanently.

Recently, scientists have made spectacular progress against two fatal genetic diseases of children, cystic fibrosis and Duchenne muscular dystrophy. In addition, they have identified the genetic flaws that predispose people to more widespread, though still poorly understood, ailments - various forms of heart disease, breast and colon cancer, diabetes, arthritis - which are not usually thought of as genetic in origin.

While many of the researchers who are exploring our genetic wilderness want to find the sources of the nearly 4,000 disorders caused by defects in single genes, others have an even broader goal: they hope to locate and map all of the 50,000 to 100,000 genes on our chromosomes. This map of our complete biological inheritance - "the marvelous message, evolved for 3 billion years or more, which gives rise to each one of us," as Robert Sinsheimer of the University of California, Santa Barbara, calls it - will guide biological research for years to come.

And it will radically simplify the search for the genetic flaws that cause disease. Once scientists have identified such a flaw, they need to understand just how it produces a particular illness. They must determine the normal gene's function in human cells: What kind of protein does it instruct the cells to make, in what quantities, at what times, and in what specific places? Then the researchers can ask whether the genetic flaw results in too little protein, the wrong kind of protein, or no protein at all - and how best to counteract the effects of this failure.

For most genetic disorders, researchers are still at the very beginning of the trail. They have no clues to the DNA error that causes a disease, and they are still trying to find large families whose DNA patterns can help them track it down. By contrast, scientists who work on cystic fibrosis and a few other diseases have covered much of the trail. They have already succeeded in correcting the gene defect inside living human cells by inserting healthy genes into these cells in a laboratory dish - an achievement that may lead to gene therapy.

The farther scientists go along the trail, the broader the implications of their findings. For example, the discovery of the gene defect that causes Duchenne muscular dystrophy, a muscle-wasting disease, led scientists to identify a previously unknown protein that plays an important role in all muscle function. This gives them a clearer view of how muscle cells work and allows them to diagnose other muscle disorders with exceptional precision, as well as devise new approaches to treatment.

Any new treatment will need to be tested on animals. In fact, the next explosion of information in medical genetics is expected to come from the study of animals - particularly those with defects that mimic human disorders. The techniques for producing animal models of disease are improving rapidly. Even today, "designer mice" are playing an increasingly important role in research.
The growth of powerful computerized databases is bringing further insights. Only a month after the discovery of the genetic error involved in neurofibromatosis, a disfiguring and sometimes disabling hereditary disease, a computer search revealed a match between the protein made by normal copies of the newly uncovered gene and a protein that acts to suppress the development of cancers of the lung, liver, and brain - a key finding for cancer researchers. Such revelations are becoming increasingly frequent. "If a new sequence has no match in the databases as they are, a week later a still newer sequence will match it," observes Walter Gilbert of Harvard University. Brain disorders such as schizophrenia or Alzheimer's disease may be next to yield to the genetic approach. "We won't know what went wrong in most cases of mental disease until we can find the gene that sets it off," says James Watson, co-discoverer of the structure of DNA.

6.2 Application of Bioinformatics to Agriculture
Techniques aimed at crop improvement have been utilized for centuries. Today, applied plant science has three overall goals: increased crop yield, improved crop quality, and reduced production costs. Biotechnology is proving its value in meeting these goals. Progress has, however, been slower than in medical and other areas of research. Because plants are genetically and physiologically more complex than single-cell organisms such as bacteria and yeasts, the necessary technologies are developing more slowly.

6.2.1 Improvements in Crop Yield and Quality


In one active area of plant research, scientists are exploring ways to use genetic modification to confer desirable characteristics on food crops. Similarly, agronomists are looking for ways to harden plants against adverse environmental conditions such as soil salinity, drought, alkaline earth metals, and anaerobic (lacking air) soil conditions. Genetic engineering methods to improve fruit and vegetable crop characteristics - such as taste, texture, size, color, acidity or sweetness, and ripening process - are being explored as a potentially superior strategy to the traditional method of cross-breeding.

Research in this area of agricultural biotechnology is complicated by the fact that many of a crop's traits are encoded not by one gene but by many genes working together. Therefore, one must first identify all of the genes that function as a set to express a particular property. This knowledge can then be applied to altering the germlines of commercially important food crops. For example, it will be possible to transfer the genes regulating nutrient content from one variety of tomatoes into a variety that naturally grows to a larger size. Similarly, by modifying the genes that control ripening, agronomists can provide supplies of seasonal fruits and vegetables for extended periods of time.

Biotechnological methods for improving field crops, such as wheat, corn and soybeans, are also being sought, since seeds serve both as a source of nutrition for people and animals and as the material for producing the next plant generation. By increasing the quality and quantity of protein or varying the types in these crops, we can improve their nutritional value.

6.3 Applications of Microbial Genomics
- new energy sources (biofuels)
- environmental monitoring to detect pollutants
- protection from biological and chemical warfare
- safe, efficient toxic waste cleanup
- understanding disease vulnerabilities and revealing drug targets

In 1994, taking advantage of new capabilities developed by the genome project, DOE initiated the Microbial Genome Program to sequence the genomes of bacteria useful in energy production, environmental remediation, toxic waste reduction, and industrial processing. Despite our reliance on the inhabitants of the microbial world, we know little of their number or their nature: estimates are that less than 0.01% of all microbes have been cultivated and characterized. Programs like the DOE Microbial Genome Program help lay a foundation for knowledge that will ultimately benefit human health and the environment. The economy will benefit from further industrial applications of microbial capabilities. Information gleaned from the characterization of complete genomes in MGP will lead to insights into the development of such new energy-related biotechnologies as photosynthetic systems, microbial systems that function in extreme environments, and organisms that can metabolize readily available renewable resources and waste material with equal facility.

Expected benefits also include development of diverse new products, processes, and test methods that will open the door to a cleaner environment. Biomanufacturing will use nontoxic chemicals and enzymes to reduce the cost and improve the efficiency of industrial processes. Already, microbial enzymes are being used to bleach paper pulp, stone wash denim, remove lipstick from glassware, break down starch in brewing, and coagulate milk protein for cheese production. In the health arena, microbial sequences may help researchers find new human genes and shed light on the disease-producing properties of pathogens. Microbial genomics will also help pharmaceutical researchers gain a better understanding of how pathogenic microbes cause disease. Sequencing these microbes will help reveal vulnerabilities and identify new drug targets. Gaining a deeper understanding of the microbial world also will provide insights into the strategies and limits of life on this planet. Data generated in this young program already have helped scientists identify the minimum number of genes necessary for life and confirm the existence of a third major kingdom of life. Additionally, the new genetic techniques now allow us to establish more precisely the diversity of microorganisms and identify those critical to maintaining or restoring the function and integrity of large and small ecosystems; this knowledge also can be useful in monitoring and predicting environmental change. Finally, studies on microbial communities provide models for understanding biological interactions and evolutionary history.

6.4 Risk Assessment
- assess health damage and risks caused by radiation exposure, including low-dose exposures
- assess health damage and risks caused by exposure to mutagenic chemicals and cancer-causing toxins
- reduce the likelihood of heritable mutations

Understanding the human genome will have an enormous impact on the ability to assess risks posed to individuals by exposure to toxic agents. Scientists know that genetic differences make some people more susceptible and others more resistant to such agents. Far more work must be done to determine the genetic basis of such variability. This knowledge will directly address DOE's long-term mission to understand the effects of low-level exposures to radiation and other energy-related agents, especially in terms of cancer risk.

6.5 Bioarchaeology, Anthropology, Evolution, and Human Migration
- study evolution through germline mutations in lineages
- study migration of different population groups based on female genetic inheritance
- study mutations on the Y chromosome to trace lineage and migration of males
- compare breakpoints in the evolution of mutations with ages of populations and historical events

Understanding genomics will help us understand human evolution and the common biology we share with all of life. Comparative genomics between humans and other organisms such as mice has already led to the identification of similar genes associated with diseases and traits. Further comparative studies will help determine the yet-unknown function of thousands of other genes. Comparing the DNA sequences of entire genomes of different microbes will provide new insights about relationships among the three kingdoms of life: archaebacteria, eukaryotes, and prokaryotes.

6.6 DNA Forensics (Identification)
- identify potential suspects whose DNA may match evidence left at crime scenes
- exonerate persons wrongly accused of crimes
- identify crime and catastrophe victims
- establish paternity and other family relationships
- identify endangered and protected species as an aid to wildlife officials (could be used for prosecuting poachers)
- detect bacteria and other organisms that may pollute air, water, soil, and food
- match organ donors with recipients in transplant programs
- determine pedigree for seed or livestock breeds
- authenticate consumables such as caviar and wine

Any type of organism can be identified by examination of DNA sequences unique to that species. Identifying individuals is less precise at this time, although when DNA sequencing technologies progress further, direct characterization of very large DNA segments, and possibly even whole genomes, will become feasible and practical and will allow precise individual identification. To identify individuals, forensic scientists scan about 10 DNA regions that vary from person to person and use the data to create a DNA profile of that individual (sometimes called a DNA fingerprint). There is an extremely small chance that another person has the same DNA profile for a particular set of regions.
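Why the chance of a coincidental match is so small can be seen from a simple calculation. The sketch below assumes, purely for illustration, that the regions are statistically independent, so the per-region frequencies can be multiplied; the frequency of 0.1 for each of 10 regions is a hypothetical value and not forensic data.

```python
from math import prod


def random_match_probability(region_frequencies):
    """Probability that an unrelated person matches at every region,
    assuming the regions are independent of one another."""
    return prod(region_frequencies)


# Hypothetical: each observed variant is shared by 1 person in 10.
frequencies = [0.1] * 10
p = random_match_probability(frequencies)
print(p)
```

With these assumed values the combined probability is about one in ten billion, which shows how a modest number of variable regions can distinguish individuals even though each region alone is shared by many people.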

Bibliography
1. IEEE Magazine: Engineering in Medicine and Biology, Volume 20, Number 4, July/August 2002.

2. T. K. Attwood and D. J. Parry-Smith, Introduction to Bioinformatics, First Edition, Pearson Education Ltd.

3. Web Sites
Human Genome Project: http://www.ornl.gov/TechResources/Human_Genome/
Beyond Discovery: http://www4.nas.edu/beyond/beyonddiscovery.nsf/
Bioinformatics in India: http://bioinformatics-india.com
Other sites: http://bioinform.com, http://bioinformatics.org
