
VISVESVARAYA TECHNOLOGICAL UNIVERSITY Belgaum, Karnataka 590 014 A Project Report On

Optimized DNA sequencing on Hadoop


Submitted in partial fulfillment of the curriculum prescribed for the award of the degree of Bachelor of Engineering in Computer Science and Engineering, 2011-2012

Submitted by
Rahul Singhal           1BJ08CS038
Tapadia Ankita Anup     1BJ08CS052
Yash Shah               1BJ08CS056

Under the Guidance of Mr. Chidananda Murthy P Assistant Professor, CSE

Department of Computer Science and Engineering Sri Bhagawan Mahaveer Jain College of Engineering
Jakkasandra, Kanakapura (T), Ramangara District.-562 112

June - 2012

Sri Bhagawan Mahaveer Jain College of Engineering


Jakkasandra, Kanakapura (T), Ramangara District.-562 112 Department of Computer Science and Engineering

CERTIFICATE
Students of 8th Semester, Computer Science and Engineering, in the partial fulfillment for the award of Bachelors degree in Computer Science and Engineering of Visvesvaraya Technological University, Belgaum During the year 2011-12 It is certified that all corrections / suggestions indicated for Internal Assessment have been incorporated in the report. This project report has been approved as it satisfies the academic requirements in respect of Project Work prescribed for the Bachelor of Engineering Degree.

Signature of Guide Mr. Chidananda Murthy P


Asst. Professor, Dept. of CSE

Signature of HOD Ms. Pushpa H. G.


Head of Department, Dept. of CSE

Signature of Principal Dr. Y Vijay Kumar


Principal

Rahul Singhal Tapadia Ankita Anup Yash Shah

1BJ08CS038 1BJ08CS052 1BJ08CS056

External Viva:
Name of the Examiners 1. Signature with date

2.

ACKNOWLEDGEMENT
We owe great gratitude to our Professors, who helped us stay well grounded in the real world during our engineering program at Sri Bhagawan Mahaveer Jain College of Engineering and helped us attain profound technical skills in the field of Computer Science & Engineering, thereby fulfilling the most cherished goal of our lives: to become Computer Science Engineers.

We convey our sincere gratitude to our guide, Asst. Prof. Mr. Chidananda Murthy P, Department of Computer Science & Engineering, for his support, continuing co-operation, valuable suggestions and encouragement during the development of the project.

We are thankful to Asst. Prof. Ms. Pushpa H. G., Head of Department of Computer Science and Engineering, SBMJCE, for her encouragement, inspiration and help throughout the course.

We express our immense gratitude to Dr. Y Vijay Kumar, Principal, SBMJCE, for providing us with this opportunity and inspiration during the tenure of the course.

We would also like to extend our gratitude to all the teachers and staff members of our department and college, who directly or indirectly supported us in our endeavors.

Last but not least, we are very thankful to all our family members and friends for their valuable support during this period of our study.

Thank you...this project would have never reached this point without all of you.

Rahul Singhal Tapadia Ankita Anup Yash Shah

ABSTRACT
The generation of an enormous amount of sequence data by next-generation Deoxyribonucleic acid (DNA) sequencing machines has placed unprecedented demands on traditional single-processor read mapping algorithms. To optimize the mapping of next-generation sequence data to the human genome and other reference genomes, for use in a variety of biological analyses including single-nucleotide polymorphism discovery, genotyping and personal genomics, a short read mapping program is modeled to run on a distributed cluster, i.e., the Hadoop architecture. An algorithmic technique called seed-and-extend is used to accelerate the mapping process. This reduces the running time from hours to mere minutes for typical jobs involving mapping of millions of short reads to the human genome.

Our objectives are:
1. Design, simulate and analyze the feasibility of implementing the algorithm on a multi-node Hadoop cluster.
2. Implement the algorithm on the MapReduce platform, thereby increasing performance by reducing the processing time.

With the results from our objectives we can obtain an optimized DNA sequencing on a multi node cluster on Hadoop.


Table of Contents

ACKNOWLEDGEMENT
ABSTRACT
Table of Contents
List of Figures
List of Tables
Acronym and Abbreviations
Glossary

Chapter 1 INTRODUCTION
1 INTRODUCTION TO DNA SEQUENCING ON HADOOP
1.1 Importance of DNA Sequencing
1.1.1 DNA and its Sequencing
1.1.2 High Throughput Sequencing or Next-generation sequencing
1.2 Introduction to the Hadoop
1.2.1 Hadoop Framework
1.2.2 Hadoop Architecture
1.2.3 Hadoop File System
1.2.4 Hadoop MapReduce
1.3 Statement of the Problem
1.4 Objective
1.5 Scope
1.6 Literature Survey
1.7 Organization of the Report

Chapter 2 REQUIREMENT ANALYSIS
2 REQUIREMENTS ANALYSIS OF OPTIMIZING DNA SEQUENCING ON HADOOP
2.1 System Requirements
2.2 Input/output Requirements
2.2.1 Input Requirements
2.2.2 Output Requirements
2.3 Functional Requirements
2.3.1 Extension convertor
2.3.2 DNA Sequencer
2.3.3 Print Sequences
2.4 Non-Functional Requirements
2.4.1 Performance
2.4.2 Availability
2.4.3 Reliability
2.4.4 Robust
2.4.5 Scalable

Chapter 3 DESIGN
3 DESIGN OF OPTIMIZING DNA SEQUENCING ON HADOOP
3.1 Design considerations
3.2 Assumptions and dependencies
3.3 General Constraints
3.4 Parameters
3.4.1 Read mapping
3.4.2 MapReduce
3.4.3 Alignment filtration
3.5 Seed-and-extend Algorithm
3.6 Landau-Vishkin k-difference alignment algorithm
3.7 Data flow
3.8 Sequence Diagram
3.9 System Overview
3.10 Use Case Diagram
3.11 System Architecture
3.11.1 Hadoop Distributed File System

Chapter 4 IMPLEMENTATION
4 IMPLEMENTATION OF OPTIMIZING DNA SEQUENCING ON HADOOP
4.1 Execution Flow
4.2 Convert Format
4.3 DNA Sequencing
4.4 Print Sequences

Chapter 5 TESTING
5 TESTING OF OPTIMIZING DNA SEQUENCING ON HADOOP
5.1 Test Setup
5.1.1 Applications used for Testing
5.1.2 System Setup
5.2 TESTING SCENARIOS
5.2.1 Test Case 1
5.2.2 Test Case 2
5.2.3 Test Case 3
5.2.4 Test Case 4

Chapter 6 RESULTS
6 RESULTS OF OPTIMIZING DNA SEQUENCING ON HADOOP
CASE 1: Running the application on a One Node Hadoop Cluster
CASE 2: Running the application on a Two Node Hadoop Cluster
CASE 3: Running the application on a Four Node Hadoop Cluster

CONCLUSION
FUTURE ENHANCEMENTS
APPENDIX A : DNA and its Sequencing
APPENDIX B : MapReduce and RMAP
APPENDIX C : Single System Installation
APPENDIX D : Multi-Node Installation
BIBLIOGRAPHY

List of Figures
Figure 1-1 Multi-node Hadoop cluster
Figure 2-1 Overview of system
Figure 3-1 System Flow
Figure 3-2 Hash Table view for Landau-Vishkin Algorithm
Figure 3-3 Shuffle and matching of sequences
Figure 3-4 Data Flow Overview
Figure 3-5 Data Flow of DNA Sequencing
Figure 3-6 Sequence Diagram
Figure 3-7 DNA Sequencing Overview
Figure 3-8 Use Case Diagram
Figure 3-9 HDFS Architecture
Figure 3-10 Overview of the HDFS
Figure 4-1 Flow chart for Sequence of Execution
Figure 4-2 Flow chart for Format Convertor
Figure 4-3 Flow chart for Convert File
Figure 4-4 Flow chart for Save Sequence
Figure 4-5 Flow chart for DNA Sequencing
Figure 4-6 Flow chart for Map class
Figure 4-7 Flow chart for Reduce class
Figure 4-8 Flow chart for Aligning the sequences
Figure 4-9 Flow chart for Extending the sequences
Figure 4-10 Flow chart for Landau-Vishkin algorithm
Figure 4-11 Flow chart for k-difference alignment
Figure 4-12 Flow chart for k-mismatch alignment
Figure 4-13 Flow chart for Filter Alignment
Figure 4-14 Flow chart for Filer Reduce Class
Figure 4-15 Flow chart for Print Sequences
Figure 4-16 Flow chart for Printing output
Figure 5-1 Snapshot: starting ssh
Figure 5-2 Snapshot: starting datanode over the cluster
Figure 5-3 Snapshot: starting namenode over the cluster
Figure 5-4 Snapshot: Extension conversion of fasta file
Figure 5-5 Snapshot: start of sequencing process
Figure 5-6 Snapshot: Printing results
Figure 6-2 Snapshot: start of sequencer on a single node cluster
Figure 6-3 Snapshot: completion of job with displayed total running time
Figure 6-5 Snapshot: start of sequencer on a two node cluster
Figure 6-6 Snapshot: completion of job with displayed total running time
Figure 6-8 Snapshot: start of sequencer on a four node cluster
Figure 6-9 Snapshot: completion of job with displayed total running time
Figure 0-1 Graphical comparison between total processing time against n active system

List of Tables
Table 3-1 MapReduce Description
Table 5-1 Test setup scenario: Initialization of cluster nodes
Table 5-2 Test setup scenario: Initialization of datanode
Table 5-3 Test setup scenario: Initialization of namenode
Table 5-4 Test case1: Extension converter module
Table 5-5 Test case2: DNA sequencing module
Table 5-6 Test case3: Print Alignment module
Table 5-7 Test case4: Tracking job over web interface
Table 5-8 Snapshot: Running and Non-Running Tasks
Table 5-9 Snapshot: Running Jobs
Table 6-1 Snapshot: cluster summary- single active node over the cluster
Table 6-2 Snapshot: cluster summary- two active node over the cluster
Table 6-3 Snapshot: cluster summary- four active node over the cluster


Acronym and Abbreviations


This section lists the acronyms and abbreviations used throughout this project.

DNA      Deoxyribonucleic Acid
SNP      Single-Nucleotide Polymorphism
BED      Browser Extensible Data
ssh      Secure Shell
RPC      Remote Procedure Call
BLAST    Basic Local Alignment Search Tool
bp       Base pair
GFS      Google File System
HDFS     Hadoop Distributed File System
NGS      Next-generation sequencing
RC       Reverse Complement
SOAP     Simple Object Access Protocol


Glossary
This section contains definitions of terms used throughout this project.

Base spacing: The number of points from one peak (end) to the next in the matched seed of the sequence.
FASTA format: A standard text-based file format for storing one or more sequences, in which nucleotides are represented using single-letter codes.
Genes: A gene is a molecular unit of the DNA of a living organism that performs one function.
Genome: The total DNA contained in each cell of an organism. There are somewhere in the order of a hundred thousand genes. It includes both the genes and the non-coding sequences of the DNA.
Genotyping: The process of determining differences in the genetic make-up (genotype) of an individual by examining the individual's DNA sequence and comparing it to another individual's sequence or a reference sequence.
Indels: Insertion/deletion errors in sequences, found while the seeds are formed.
Mutation: Mismatch-only errors found during seed formation.
Reads: The millions of short DNA sequences that are taken into account for further sequencing.
Seeds: The substrings found in the reference file and matched with the reads.
Sequence: A linear series of nucleotide base characters that represents a DNA sequence, displayed in rows from left to right.
SNP: Single-nucleotide polymorphisms are the most common form of genetic variation in humans and a resource for mapping complex genetic traits, as they can alter DNA, RNA and protein sequences at different levels. They vary from one individual to another.
Capillary electrophoresis (CE): A technique used to separate ionic species by their charge, frictional forces and radius under an applied voltage. It is carried out in the interior of a small capillary filled with an electrolyte.


Optimization of DNA Sequencing on Hadoop

Chapter 1 INTRODUCTION

Department of CSE, SBMJCE

2011-2012

Page 1


1 INTRODUCTION TO DNA SEQUENCING ON HADOOP


1.1 Importance of DNA Sequencing
1.1.1 DNA and its Sequencing
DNA is a long, chain-like molecule containing coded instructions for the cells in its links (nucleotides/bases). There are four different types of nucleotides in DNA, 'A', 'G', 'C' and 'T', which are all necessary to write the code that describes our entire body plan. Everything the cells do is coded or reflected in DNA [1].

DNA sequencing is the process of determining the order of the nucleotide bases along a DNA strand, that is, the primary structure of DNA. Knowledge of it has become indispensable for basic biological research, for other research branches utilizing DNA sequencing, and in numerous applied fields such as diagnostics, biotechnology, forensic biology and biological systematics. DNA sequences of thousands of organisms have been decoded and stored in databases. This sequence information is analyzed to determine genes. A comparison of genes within a species or between different species can show similarities between protein functions, or relations between species.

With the growing amount of data, it long ago became impractical to analyze DNA sequences manually. Today, computer programs such as BLAST are used daily to search and analyze DNA. These programs can compensate for mutations (exchanged, deleted or inserted bases) in the DNA sequence to identify sequences that are related but not identical. The human genome consists of 3 billion bp, arranged on 23 pairs of chromosomes.

Another aspect of bioinformatics in sequence analysis is annotation. This involves computational gene finding to search for protein-coding genes, RNA genes, and other functional sequences within a genome.
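To make the idea of sequences that are "related but not identical" concrete, the following is a minimal sketch of edit distance, which counts the exchanged, deleted and inserted bases separating two sequences. It is plain dynamic programming written for illustration; it is not how BLAST works internally, and the function names `edit_distance` and `related` and the threshold `k` are assumptions introduced here.

```python
def edit_distance(a: str, b: str) -> int:
    """Minimum number of substitutions, insertions and deletions turning a into b."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, base_a in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty string
        for j, base_b in enumerate(b, start=1):
            cost = 0 if base_a == base_b else 1
            curr.append(min(
                prev[j] + 1,         # delete base_a
                curr[j - 1] + 1,     # insert base_b
                prev[j - 1] + cost,  # match or substitute
            ))
        prev = curr
    return prev[-1]


def related(a: str, b: str, k: int) -> bool:
    """Hypothetical helper: treat two sequences as related if they differ by at most k edits."""
    return edit_distance(a, b) <= k
```

For example, `related("GATTACA", "GATACA", 1)` holds because a single deleted base separates the two sequences.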


The DNA replication reaction is run in a test tube, in the presence of trace amounts of all nucleotides. Electrophoresis is used to separate the resulting fragments by size, and that is how the sequences are read. In a large-scale sequencing lab, automated DNA sequencers are used, in which the fragments are piped through a tiny glass-fiber capillary during the electrophoresis step. The fragments come out the far end in size order, and their different colors are monitored on the screen as they come out. The generated fragment on the screen consists of four colors (red, green, blue and yellow), each representing one of the four nucleotides. The actual gel image would be perhaps 3 or 4 meters long and 30 or 40 cm wide. The computer automatically generates the sequence, called a read, from the gel after the fragment is generated.

A few of the major uses of DNA sequencing are:
Diagnosing Diseases - Comparing normal sequences to sequences in people with genetic illnesses to determine what parts of the genome are involved in the disease.
Forensic Genetics - Comparing DNA left at a crime scene to the DNA sequence of suspects or victims.
Paternity Tests - Matching the DNA of parents and children to determine how they are related.
Comparisons With Other Genomes - Comparing genomes of different species to help in scientific research and conservation efforts.

1.1.2 High Throughput Sequencing or Next-generation sequencing


Next-generation DNA sequencing machines generate an enormous amount of sequence data, consisting of millions of short sequences of DNA (25-250 bp) called reads. The high demand for low-cost sequencing has driven the development of technologies that parallelize the sequencing process, producing thousands or millions of sequences at once. This enables generation of genomic data in less time and at lower cost than traditional sequencing methods. However, sample preparation for NGS requires a considerable amount of time and effort. Hence, the Hadoop architecture comes into the picture: it reduces processing time by executing the sequencing on multiple nodes coordinated by a single master in a cluster.


1.2 Introduction to the Hadoop


1.2.1 Hadoop Framework
Hadoop [2] is an open source software framework that supports data-intensive distributed applications. It enables applications to work with thousands of computationally independent computers and petabytes of data. The Hadoop framework provides the basic services for building a cloud computing environment with commodity hardware, and the APIs for developing software that will run on that cloud. The two fundamental pieces of Hadoop Core are:
MapReduce framework - the cloud computing environment.
Hadoop Distributed File System (HDFS).

Distributed computing is a wide and varied field, but the key distinctions of Hadoop are that it is:
Accessible - Hadoop runs on large clusters of commodity machines or on cloud computing services and hence is easily accessible.
Robust - Because it is intended to run on commodity hardware, Hadoop is architected with the assumption of frequent hardware malfunctions. It can gracefully handle most such failures.
Scalable - Hadoop scales linearly to handle larger data by adding more nodes to the cluster.
Simple - Hadoop allows users to quickly write efficient parallel code.

1.2.2 Hadoop Architecture


Hadoop consists of the Hadoop Common package, which provides access to the file systems supported by Hadoop. The Hadoop Common package contains the necessary JAR files and scripts needed to start Hadoop. Because Hadoop applications can run work on or near the node that holds the data, backbone traffic is reduced.

Figure 1-1 Multi-node Hadoop cluster

A small Hadoop cluster[ ] will include a single master and multiple worker nodes. The master node consists of a Job Tracker, Task Tracker, Name Node, and Data Node. A slave or worker node acts as both a Data Node and Task Tracker, though it is possible to have data-only worker nodes, and compute-only worker nodes. The standard startup and shutdown scripts require secure shell to be set up between nodes in the cluster.


1.2.3 Hadoop File System


HDFS is a distributed, scalable, and portable file system written in Java for the Hadoop framework. Each node in a Hadoop instance typically has a single data node; a cluster of data nodes forms the HDFS cluster. Each data node serves up blocks of data over the network using a block protocol specific to HDFS. The file system uses the TCP/IP layer for communication; clients use RPC to communicate with each other. HDFS stores large files (an ideal file size is a multiple of 64 MB) across multiple machines. File access can be achieved through the native Java API.

With Hadoop, a data set is divided into smaller (typically 64 MB) blocks that are spread among many machines in the cluster via HDFS. With a modest degree of replication, the cluster machines can read the data set in parallel and provide much higher throughput. Such a cluster of commodity machines also turns out to be cheaper than one high-end server.

Hadoop uses key/value pairs as its basic data unit, which is flexible enough to work with less-structured data types. In Hadoop, data can originate in any form, but it is eventually transformed into key/value pairs for the processing functions to work on.
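The block splitting and replication described above can be sketched in a few lines. This is an illustrative model only: it does not use any Hadoop API, the function names `split_into_blocks` and `place_replicas` are hypothetical, and the round-robin placement is a simplifying assumption rather than the placement policy HDFS actually applies.

```python
BLOCK_SIZE = 64 * 1024 * 1024  # 64 MB, the typical block size quoted in the text


def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return an (offset, length) pair for each block of a file of file_size bytes."""
    blocks = []
    offset = 0
    while offset < file_size:
        length = min(block_size, file_size - offset)  # the last block may be shorter
        blocks.append((offset, length))
        offset += length
    return blocks


def place_replicas(num_blocks, nodes, replication=2):
    """Assign every block to `replication` distinct nodes (assumed round-robin policy)."""
    if replication > len(nodes):
        raise ValueError("replication factor exceeds the number of nodes")
    return {
        block: [nodes[(block + r) % len(nodes)] for r in range(replication)]
        for block in range(num_blocks)
    }
```

A 200 MB file, for instance, yields three full 64 MB blocks plus one 8 MB block, and with four nodes and two replicas each node ends up holding two of those blocks, so several machines can read different parts of the file at the same time.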

1.2.4 Hadoop MapReduce


MapReduce is a framework (data processing model) which processes highly distributable problems across huge datasets using a large number of computers (nodes), collectively referred to as a cluster or a grid. Computational processing occurs on data stored either in a file system or a database. Its greatest advantage is the easy scaling of data processing over multiple computing nodes. Under the MapReduce model, the data processing primitives are called mappers and reducers. Decomposing a data processing application into mappers and reducers is sometimes nontrivial. But, once the application is written in the MapReduce form, scaling the application to run over hundreds, thousands, or even tens of thousands of machines in a cluster is merely a configuration change.
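As a rough illustration of how a task decomposes into mappers and reducers, the sketch below counts k-mers (substrings of length k) across a set of reads using an in-memory imitation of the map, shuffle and reduce phases. It runs on a single machine in plain Python and does not use the Hadoop API; the names `mapper`, `shuffle`, `reducer` and `run_job`, and the choice of k-mer counting as the task, are assumptions made for this example.

```python
from collections import defaultdict


def mapper(read, k=3):
    """Map phase: emit a (k-mer, 1) pair for every length-k substring of a read."""
    for i in range(len(read) - k + 1):
        yield read[i:i + k], 1


def shuffle(pairs):
    """Shuffle phase: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reducer(key, values):
    """Reduce phase: combine all values that share a key."""
    return key, sum(values)


def run_job(reads, k=3):
    """Run the three phases in sequence and return the k-mer counts."""
    pairs = (pair for read in reads for pair in mapper(read, k))
    return dict(reducer(key, values) for key, values in shuffle(pairs).items())
```

Because each call to `mapper` looks at one read only and each call to `reducer` looks at one key only, a framework is free to spread those calls over many machines, which is why scaling out becomes a matter of configuration rather than of rewriting the application.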

1.3 Statement of the Problem


The process of assembling, annotating and mapping a complete genome sequence to the human genome or other reference genomes using next-generation DNA sequencing takes several hours of processing[]. This process can be run over a distributed system (using the Hadoop architecture), which reduces the time required and thereby yields optimized DNA sequencing with the desired level of sensitivity and a gain in system performance. This distributed sequencing is modeled to parallelize execution using multiple compute nodes.

1.4 Objective
DNA sequencing on a single system is a time-consuming and expensive effort with inefficient output. Hence, the objective of this project is to optimize the DNA sequencing process using MapReduce over the Hadoop framework in a distributed system environment, to achieve a high performance gain and a reduction in execution time.

1.5 Scope
The scope of this project is an implementation in which the running time reduces from hours to mere minutes for typical jobs involving the mapping of millions of short reads to the human genome. Running the sequencing on a Hadoop cluster accelerates it, providing much greater performance.

1.6 Literature Survey


Next-generation DNA sequencing is the technique used to sequence DNA. The high demand for low-cost sequencing has driven its development. It parallelizes the sequencing process and produces thousands or millions of sequences at once. It enables the generation of genomic data in less time and at lower cost than traditional sequencing methods. Scientists use DNA sequencing for a variety of projects, ranging from basic research on disease to forensic and paternity applications. The steps involved in sequencing a genome are complicated, but the process has become an invaluable tool in modern biological research. http://bioinformatics.oxfordjournals.org/content [1] is a journal that presents the basics of DNA and its importance in the biological field. http://seqcore.brcf.med.umich.edu [2] has information related to DNA sequencing. http://cmg.health.ufl.edu/ [3] describes the methods, importance and services of a DNA sequencer. http://genomics.ucsd.edu/Publications [5] provides information about research related to DNA sequencing.

Hadoop is a good choice for building batch processing systems that process huge amounts of unstructured data. To use Hadoop effectively, the system should process data in parallel. A definite advantage of Hadoop is that it has low hardware requirements and that the cluster can easily be scaled horizontally. http://hadoop.apache.org/ [6] covers Hadoop in general. http://developer.yahoo.com/hadoop/ [7] is a developer's guide to Hadoop.

1.7 Organization of the Report


Chapter 2 discusses the requirements: software and hardware requirements and functional requirements. Chapters 3 and 4 discuss the design and implementation of the project, including the design of the main module and of the individual modules involved. Chapter 5 discusses the different testing methods, and Chapter 6 discusses the test results. These chapters are followed by the conclusion and future enhancements. More information about DNA sequencing is found in Appendix A, MapReduce in Appendix B, setting up a single-node cluster in Appendix C and setting up a multi-node cluster in Appendix D.

Chapter 2 REQUIREMENT ANALYSIS

2 REQUIREMENTS ANALYSIS OF OPTIMIZING DNA SEQUENCING ON HADOOP


2.1 System Requirements
The tools and resources required to develop, test and demonstrate the project are listed below.
Hadoop File System: This DFS is used for the distribution of the data over the cluster (from the master node to all slave nodes) for partial independent execution.
Browser: A browser such as Internet Explorer, Netscape or Google Chrome is required for displaying and monitoring the task being executed and its current and end status.
Java API (1.6 or above) with NetBeans 6.9 IDE: Used for the development and testing of the individual stand-alone modules. Along with NetBeans, a plug-in called KarmaSphere is used to enable NetBeans to debug and run the modules in line with the Hadoop architecture.
Platform: Any OS that allows networking can be used to configure the cluster. The implemented system uses the Linux-based Ubuntu operating system.
Network: A network of two or more computer systems for the formation of a multi-node Hadoop cluster.

2.2 Input/output Requirements


This section details the input and output requirements of the system, based on the user specifications:

2.2.1 Input Requirements


This system requires two data files: the sample DNA file and a reference sequence file. Preferably these files should be in the standard sequencer format, FASTA (.fa). Users are free to provide the DNA sample files in .fa format or in any other text-based format.

2.2.2 Output Requirements


Output generation is a two-step process depending on the user requirements. The first step is the compilation of the raw data in a tabular form, which is used for cursory analyses of the sample DNA strand. The second is the generation of an output DNA sequence based on the mapped base pairs, which is used for higher-level microbiological comparisons and for a wide variety of biological analyses including SNP discovery, genotyping, gene expression, comparative genomics and personal genomics.

Figure 2-1 Overview of system

2.3 Functional Requirements


2.3.1 Extension convertor
This module converts the input DNA sequence sample and reference files from FASTA (.fa) or any other text-based format to the binary format used in the sequencing process. The resultant output file has the .br (Binary Script) extension, wherein the data is stored in a compact, compressed manner to be used by the next module.

2.3.2 DNA Sequencer


The DNA Sequencer is the functional module that forms the computational backbone of the system. It takes two sequence files (a sample file and a reference file) as input and maps the sample sequences to the reference sequence over the Hadoop distributed system. The partial results generated by the slave nodes are then stored together in a common file and fed to the next module for the compilation of the final output.

2.3.3 Print Sequences


This module generates the output file in a readable text format. It reads the raw result files produced by the previous module, analyzes them using the alignment information present in the Alignment Record sub-module, and creates an output text file enumerating the processed data. As a secondary output, the DNA sequence generated from the matched reference sequences is written to the terminal. Based on the user requirements, this sequence can be redirected into a text file to be used for higher-level biological analyses.

2.4 Non-Functional Requirements


Non-functional requirements are requirements that impose constraints on the design or implementation of a system (such as performance requirements, quality standards, or design constraints). The non-functional requirements that constrain this system are as follows:

2.4.1 Performance
The main objective of this project is to decrease the total time required for processing a DNA sequence. By decreasing the processing time, the performance of the system is increased. The proposed system shall demonstrate in real time an effective reduction in the execution time of the sequencer and show comparable results on single and distributed systems.

2.4.2 Availability
The system shall be available to the user at all times. The Hadoop architecture has an inherent mechanism whereby a malfunctioning node is flagged for maintenance and the work assigned to it is sent to another node for completion. Thus the system is no longer completely dependent on the availability of each and every node present within the network.

2.4.3 Reliability
This is the most important feature of the system, since the data obtained from the system can potentially be used in a variety of biological analyses including SNP discovery, genotyping, and personal genomics (finding differences in one person's genome relative to a reference human genome, or comparing the genomes of closely related species). Even a single base-pair difference can have a significant biological impact, so researchers require highly sensitive mapping algorithms to analyze the reads.

2.4.4 Robust
Since the system is intended to run on commodity hardware, Hadoop is an architecture created with the assumption of frequent hardware malfunctions. It can gracefully handle such failures without affecting the performance and reliability of the system.

2.4.5 Scalable
Researchers are generating sequence data at an incredible rate and need a highly scalable system to analyze their data. The Hadoop framework is used because of its scalability. Over time, any number of systems can be added to the network to handle more task-intensive jobs.

Chapter 3 DESIGN

3 DESIGN OF OPTIMIZING DNA SEQUENCING ON HADOOP


3.1 Design considerations
There are many aspects to be considered when designing a piece of software. Some issues in this project needed to be resolved before attempting to design a complete solution:
1. To optimize the processing of sequencing the DNA patterns. This particular project works for DNA sequences.
2. The platform on which the project has to run. Here the project is run on Hadoop.

3.2 Assumptions and dependencies


For the completion of every project, certain assumptions about users and technology are made. These assumptions arise from the fact that no application can work under all conditions, irrespective of its sophistication. Some of the assumptions and dependencies are as follows: External users are the people who use this optimized sequencing process to map the query DNA to the reference DNA to find the similarities or dissimilarities between any two DNA sequences. The DNA sequences obtained from the biologists should not be modified.

3.3 General Constraints


During the development of the project, some constraints inevitably have to be faced. Constraints such as time and resources are common to all software development teams. The constraint encountered during development is that sequencing is expensive: DNA sequencing is a time-consuming and expensive operation that generates a large amount of sequencing data.

3.4 Parameters
3.4.1 Read mapping
Each read is aligned (mapped) to the reference genome to find the locations where the read occurs in the reference sequence, allowing a small number of differences. Read mapping allows 1-10% of the read length to differ from the reference.

3.4.2 MapReduce
MapReduce is the software framework that supports parallel execution in data-intensive applications. The framework automatically provides common services for parallel computing, such as partitioning the input data, scheduling, monitoring, and the inter-machine communication necessary for remote execution.
Table 3-1 MapReduce Description

FUNCTION | INPUT PARAMETER | RETURNS | DESCRIPTION
Map | Binary input file | <seed, merInfo> | Each Map function takes a series of key/value pairs, processes each, and generates zero or more output key/value pairs.
Reduce | <id, sequence info> | Alignments in the form of <seed, list | The Reduce function is called once for each unique key in sorted order; it iterates through the values associated with that key and produces zero or more outputs.

The map function processes each DNA file and emits a sequence of <seed, merInfo> pairs, where seed is the key and merInfo is a tuple (id, position, isRef, isRC, left_flank, right_flank). Pairs with the same seed are merged in the sort-and-shuffle phase before being passed to the reduce function as input. The reduce function accepts all pairs for a given seed, sorts the corresponding merInfo values and emits a <seed, list (merInfo)> pair. The set of all output pairs is then printed. This makes it easy to keep track of the seeds that are sequenced and extended at the found positions.
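The emission and grouping of <seed, merInfo> pairs can be sketched in Python as follows. This is an in-memory illustration rather than the project's Hadoop code. The function names are hypothetical, the flanks are stored in full for simplicity, and the choice to emit a reference seed at every position but only non-overlapping seeds for reads is an assumption of this sketch.

```python
from collections import defaultdict, namedtuple

# Field names follow the merInfo tuple described in the text.
MerInfo = namedtuple("MerInfo", "id position is_ref is_rc left_flank right_flank")


def map_seeds(seq_id, sequence, is_ref, seed_len):
    """Emit <seed, merInfo> pairs for one sequence.

    Assumption: reference sequences emit a seed at every position,
    reads emit only non-overlapping seeds.
    """
    step = 1 if is_ref else seed_len
    for pos in range(0, len(sequence) - seed_len + 1, step):
        seed = sequence[pos:pos + seed_len]
        yield seed, MerInfo(seq_id, pos, is_ref, False,
                            sequence[:pos], sequence[pos + seed_len:])


def shuffle_and_reduce(pairs):
    """Group the pairs by seed and return {seed: sorted list of merInfo}."""
    groups = defaultdict(list)
    for seed, info in pairs:
        groups[seed].append(info)
    # Reference entries are sorted first, then by id and position.
    return {seed: sorted(infos, key=lambda m: (not m.is_ref, m.id, m.position))
            for seed, infos in groups.items()}


pairs = list(map_seeds("ref", "ACGTAC", True, 3))
pairs += list(map_seeds("read1", "GTA", False, 3))
for seed, infos in sorted(shuffle_and_reduce(pairs).items()):
    print(seed, [(m.id, m.position) for m in infos])
```

A seed that collects both a reference entry and a read entry marks a candidate location where the read can be extended.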

3.4.3 Alignment filtration


Alignment filtration reports only the unambiguous best alignment for each read, rather than the full catalog of all alignments. The best alignment is the one with the fewest mismatches or differences. A second MapReduce run reports the top alignments for each read by comparing the differences of the current alignment with those of the best alignment.

3.4.3.1 System Flow
The flow of execution of the program is shown in figure 3-1. First, the user starts the execution by specifying the parameters for running the job on the master node of the cluster. The input file is processed, and the corresponding key/value pairs are generated and distributed to the slave nodes of the cluster. The job is executed in parallel on all the active slaves. After execution completes, the result is stored in HDFS.

Figure 3-1 System Flow

The job executed by the cluster shows a significant reduction in execution time, and the performance varies depending on the number of nodes present in the cluster.

3.5 Seed-and-extend Algorithm


The seed-and-extend algorithm used is a MapReduce-based read-mapping algorithm that runs in parallel on multiple machines with Hadoop. It optimizes the mapping of many short reads for next-generation sequencing.
Step 1: Preprocess query: compile the file and read all reads; seed size s = m/(k+1), where
m: minimum length of a read
k: maximum number of differences allowed
Step 2: Emit k-mers of length s and construct a hash table of key/value pairs.

Figure 3-2 Hash Table view for Landau-Vishkin Algorithm

Step 3: The shuffle phase groups all values by key. Identify all exact matches with the reference sequences.

Figure 3-3 Shuffle and matching of sequences

Step 4: Search for the optimal alignment: for each match, extend un-gapped alignments.
Step 5: Evaluate the alignment statistically: stop the extension when the k-value exceeds the threshold value (10%).
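The seed size in Step 1 follows from a pigeonhole argument: if a read of length m is cut into k+1 non-overlapping pieces, at most k differences cannot touch every piece, so at least one piece must match the reference exactly. The Python sketch below illustrates this under the assumption that the seed size is the integer quotient m/(k+1); the function names are hypothetical.

```python
def seed_length(min_read_len: int, max_diffs: int) -> int:
    """Seed size s = m // (k + 1) for read length m and k allowed differences."""
    s = min_read_len // (max_diffs + 1)
    if s < 1:
        raise ValueError("too many differences for this read length")
    return s


def split_into_seeds(read: str, max_diffs: int):
    """Cut a read into k + 1 non-overlapping seeds, returned as (offset, seed)."""
    s = seed_length(len(read), max_diffs)
    return [(i * s, read[i * s:(i + 1) * s]) for i in range(max_diffs + 1)]


# A 12-base read with 2 mismatches (positions 1 and 10) against the reference.
reference_window = "ACGTACGTACGT"
read = "AGGTACGTACTT"
for offset, seed in split_into_seeds(read, 2):
    exact = reference_window[offset:offset + len(seed)] == seed
    print(offset, seed, "exact match" if exact else "differs")
```

In this example the middle seed still matches exactly, which is what allows the exact-match lookup in Step 3 to find the read despite its two mismatches.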

3.6 Landau-Vishkin k-difference alignment algorithm


The Landau-Vishkin algorithm is used to determine whether two strings align with at most k differences. The algorithm considers only the most similar alignments, up to a fixed number of differences, by computing how many characters of each string can be aligned with i = 0 to k differences.

Step 1: Get the length of the text and pattern.
Step 2: loop from position=0 to position=last length of pattern/text
            if(text[position]==pattern[position]) then
                match+=2;
            else
                mismatch++;
            if(mismatch > k) then
                return bad Alignment;
Step 3: loop from i=0 to i<mismatch
            what[i]= 0;
        distance[mismatch]=match;
        what[mismatch]=2; //indicates end of matching
Step 4: Set Values and return good Alignment;

The Landau-Vishkin k-difference alignment algorithm is used in the reduce function, where the reference and query sequences are aligned with this string-matching algorithm to obtain good matching alignment records.
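The mismatch-counting loop of Steps 1 and 2 can be sketched in Python as follows. This is a simplified illustration, not the project's implementation: it counts mismatches only, stops as soon as the budget k is exceeded, and leaves out the what/distance bookkeeping of Steps 3 and 4. The function name is hypothetical.

```python
from typing import Optional


def k_mismatch(text: str, pattern: str, k: int) -> Optional[int]:
    """Compare text and pattern position by position.

    Returns the number of mismatches if it is at most k (a good alignment),
    or None as soon as the count exceeds k (a bad alignment).
    """
    last = min(len(text), len(pattern))  # compare only the overlapping part
    mismatches = 0
    for position in range(last):
        if text[position] != pattern[position]:
            mismatches += 1
            if mismatches > k:
                return None
    return mismatches


print(k_mismatch("ACGTACGT", "ACGAACGT", 1))  # one mismatch, within budget
print(k_mismatch("ACGTACGT", "TCGAACGA", 2))  # three mismatches, rejected
```

Returning early keeps the cost of rejecting a poor candidate proportional to the position of its (k+1)-th mismatch rather than to the full read length.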

3.7 Data flow


The complete output data is generated during DNA sequencing, and the pre-required data is passed from the Convert Format module. Each module takes the input data with the required parameters, processes it and sends it to the next module. The overall data flow during execution is shown in Figure 3-4 below.

Figure 3-4 Data Flow Overview

The data to be processed passes through different modules, starting from the Convert Format module, where the content of the file is converted to byte format. The byte data is processed by the DNA Sequence module, which produces data in Output Collector format. This data is then fed to the Print Sequence module, which produces the tabular result.

The complete data flow diagram of the DNA Sequence module is shown in Figure 3-5. The data flows step by step to different parts of the module.

Figure 3-5 Data Flow of DNA Sequencing

The data moved from one module to another is spread widely across the system during processing. Hence, an optimized DNA sequencing is obtained from the overall execution of the program.

3.8 Sequence Diagram

Figure 3-6 Sequence Diagram

Figure 3-6 above shows the sequence of execution that takes place in the system.

3.9 System Overview


Figure 3-7 shows an overview of the entire system, which is implemented over the Hadoop framework. The sequence seed is the input provided to the system; after the sequences are processed by aligning them, the result is sent to HDFS for storage. The stored result is then retrieved, printed and displayed in a file or a browser.

Figure 3-7 DNA Sequencing Overview

3.10 Use Case Diagram

Figure 3-8 Use Case Diagram

The use case diagram in figure 3-8 shows the interactions between the modules and the external user.

3.11 System Architecture


This section provides a high-level overview of how the functionality and the responsibilities of the system were partitioned and then assigned to subsystems or the components or the modules appropriately. The main purpose here is to gain a general understanding of how the system is decomposed, and how the individual parts work together to provide the desired functionality.

3.11.1 Hadoop Distributed File System


HDFS is a file system designed for storing very large files with streaming data access patterns, running on clusters of commodity hardware. It has a large block size (64 MB by default) to keep seek time small relative to data transfer time, so it is ideal for very large files and streaming data access. It follows a write-once, read-many-times architecture; since files are large, the time to read the whole data set is more significant than the time to seek to the first record. HDFS is designed to run on commodity hardware, which may fail, and it is capable of handling such failures.

Figure 3-9 HDFS Architecture

Figure 3-9 gives a run-time view of the architecture showing three types of address spaces: the application, the Name Node and the Data Node. An essential aspect of HDFS is that there are multiple instances of the Data Node. The application incorporates the HDFS client library into its address space. The client library manages all communication from the application to the Name Node and the Data Node. An HDFS cluster consists of a single Name Node, a master server that manages the file system namespace and regulates access to files by clients. In addition, there are a number of Data Nodes, usually one per computer node in the cluster, which manage the storage attached to the nodes that they run on. The Name Node and Data Node are pieces of software designed to run on commodity machines. These machines typically run a Linux operating system (OS). HDFS is built using the Java language; any machine that supports Java can run the Name Node or the Data Node software.

Figure 3-10 Overview of the HDFS

Usage of the Java language means that HDFS can be deployed on a wide range of machines. A typical deployment has a dedicated machine that runs only the Name Node software. Each of the other machines in the cluster runs one instance of the Data Node software. The architecture does not preclude running multiple Data Nodes on the same machine but in a real deployment that is rarely the case.

Chapter 4 IMPLEMENTATION

4 IMPLEMENTATION OF OPTIMIZING DNA SEQUENCING ON HADOOP


4.1 Execution Flow
The execution flow is depicted in Figure 4-1, where the reference and query files are provided as input to the system after being converted to binary format. In this module, two MapReduce programs are run to optimize the DNA sequencing implemented on Hadoop.

Figure 4-1 Flow chart for Sequence of Execution

The first MapReduce program is alignall, which is started after a timer is started in the main function. The alignall module is divided into other modules, which are explained in the following sections. After the MapReduce processing, if the filter alignments parameter is set to 1, the timer is started and the second MapReduce program is executed. The filter module has several sub-modules, which are explained in the following sections. After the complete execution of the program, the output is printed and displayed in tabular form.

4.2 Convert Format


The Convert Format module converts the FASTA format to a binary format: the sequences are read from the file, converted, and saved in a binary file.

Figure 4-2 Flow chart for Format Convertor

Figure 4-2 depicts the flow of the Format Converter. Program execution begins with specifying the input files as parameters for the file format conversion. At the end of the processing, the file is converted to a binary format.

The convert module reads the file until the end of file. The first skip character is removed from the file, and the sequences are converted to uppercase, appended and saved until the next skip character is encountered.

Figure 4-3 Flow chart for Convert File

Figure 4-3 depicts the flow of converting the file format. At the end of processing, the sequences that were read are sent to the save sequence module, which writes the converted format to a file.
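The reading step of the convert module can be sketched in Python as follows. The sketch assumes that the skip character is the '>' marker that starts a FASTA header line, and it returns the records in memory instead of writing the project's binary format; the function name is hypothetical.

```python
def parse_fasta(lines):
    """Collect FASTA records as {id: sequence}.

    A line starting with '>' (assumed to be the skip character) closes the
    previous record and starts a new one; all other lines are uppercased and
    appended to the current sequence.
    """
    records = {}
    current_id = None
    chunks = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if current_id is not None:
                records[current_id] = "".join(chunks)
            current_id = line[1:].strip()
            chunks = []
        elif current_id is not None:
            chunks.append(line.upper())
    if current_id is not None:
        records[current_id] = "".join(chunks)  # save the last record
    return records


sample = [">read1", "acgt", "ttga", ">read2", "GGCC"]
print(parse_fasta(sample))
```

The same function accepts an open file object in place of the list, since it only iterates over lines.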

The save sequence module is depicted in figure 4-4. It reads to the end of the appended sequences, writes them to the binary file in <id, sequence> format and saves the binary file. Once the reading is complete, the offset value is advanced and the save sequence module is called again with the appropriate offset until the full length of the file has been covered.

Figure 4-4 Flow chart for Save Sequence

4.3 DNA Sequencing


As shown in figure 4-5, the alignall module sets the job configuration for the parameters or variables present in the program. After the job configuration is processed, the map class module is run, followed by the reduce class module, and the result is stored in HDFS.

Figure 4-5 Flow chart for DNA Sequencing

The map class module is shown in figure 4-6; the map function is executed in parallel. The module gets the specified sequences and checks whether each sequence is of reference type or query type.

Figure 4-6 Flow chart for Map class

If it is a reference sequence, the sequence is read to the end and the initial left and right flank details are set. The input is checked for repeat sequences; if one is found, the corresponding seed that matches the redundant value is obtained. Otherwise the normal seed sequences are obtained and stored in the intermediate pair for further processing. If the sequence is of query type, the flanks are checked for reverse complement and the corresponding sequence is reverse complemented in place. The sequences are then processed to the end and checked for repeat seeds; if any are present, the corresponding seed is obtained and stored. Otherwise the seed sequence is obtained and stored in the intermediate pair for further processing.

In the Reduce class module, shown in figure 4-7, the Reduce function is executed in parallel and the values are read from the intermediate pair as long as further values remain. The results obtained are stored in a variable merIn. The merIn variable is then checked and put into the respective tuples, and the align batch module is called for further processing.
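The reverse-complement step applied to query sequences in the map class can be sketched in Python as follows. The function name is hypothetical, and the handling of the ambiguity code N (mapped to itself) is an assumption of this example.

```python
# Translation table for the four bases; N is assumed to stay N.
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(sequence: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return sequence.upper().translate(_COMPLEMENT)[::-1]


print(reverse_complement("AACGT"))
```

Applying the function twice returns the original sequence, which is a convenient sanity check when reads from both strands are mapped.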

Figure 4-7 Flow chart for Reduce class

The Align batch module is shown in figure 4-8. It first gets the current query tuple, extends it against the reference tuple and returns the fully extended good alignment once the reads are matched within the allowed differences.

Figure 4-8 Flow chart for Aligning the sequences

The obtained alignments are then checked: if filter alignment is specified, the unambiguous best alignments are selected based on the number of differences allowed for the sequencing. This processing is performed for all the matched extended tuples to get the best and second-best alignments if filter alignments is specified.

The Extend module, shown in figure 4-9, obtains the left query tuple, gets the real flank length and performs the Landau-Vishkin extend function. The returned alignment information is checked: if the aligned length is -1, no alignment is returned. Otherwise the reference start and the differences encountered are recorded, and the right query tuple is processed in the same way.

Figure 4-9 Flow chart for Extending the sequences

At the end of processing, the full alignment is returned after setting the corresponding reference end position and the number of differences encountered.
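The left-then-right extension around a seed can be sketched in Python as follows. This is a simplified illustration that counts mismatches only (no gaps) and shares one budget of k differences between the two flanks. The function names, the argument layout and the use of None for "no alignment" are assumptions of the sketch, not the project's interface.

```python
from typing import Optional, Tuple


def _mismatches(a: str, b: str) -> int:
    """Count the positions at which two equally long strings differ."""
    return sum(1 for x, y in zip(a, b) if x != y)


def extend_seed(reference: str, read: str, ref_pos: int, read_pos: int,
                seed_len: int, k: int) -> Optional[Tuple[int, int, int]]:
    """Extend an exact seed match to the whole read.

    Returns (ref_start, ref_end, differences), or None if the read does not
    fit on the reference or needs more than k differences.
    """
    ref_start = ref_pos - read_pos
    ref_end = ref_start + len(read)
    if ref_start < 0 or ref_end > len(reference):
        return None
    # Left flank first; failure here corresponds to the "no alignment" case.
    left = _mismatches(reference[ref_start:ref_pos], read[:read_pos])
    if left > k:
        return None
    # Right flank uses whatever budget the left flank left over.
    right = _mismatches(reference[ref_pos + seed_len:ref_end],
                        read[read_pos + seed_len:])
    if left + right > k:
        return None
    return ref_start, ref_end, left + right


# The seed "GATC" sits at reference position 5 and read position 2.
print(extend_seed("TTACGGATCCAGT", "AGGATCCA", 5, 2, 4, 1))
```

The returned triple carries the same information the text describes for a full alignment: reference start, reference end and the number of differences encountered.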

The Landau-Vishkin Extend module, shown in figure 4-10, checks the allowed differences; if they are assigned, the reference and query tuples are obtained and processed for k-difference alignment.

Figure 4-10 Flow chart for Landau-Vishkin algorithm

The k-difference alignment module, shown in figure 4-11, computes the dynamic programming model for each text and pattern specified for processing. When the dynamic programming model is ready, the module returns the alignment derived from it on the fly. The good alignment returned is then processed and stored so that the alignments are written to the output file.

Figure 4-11 Flow chart for k-difference alignment
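A k-difference check based on dynamic programming can be sketched in Python as follows. The sketch uses a plain banded edit-distance table, not the diagonal-extension formulation of Landau-Vishkin and not the project's own code. It assumes that the whole pattern must be aligned while the text may have an unused suffix, and the function name is hypothetical.

```python
from typing import Optional


def k_difference(text: str, pattern: str, k: int) -> Optional[int]:
    """Return the smallest number of differences (substitutions, insertions,
    deletions) needed to align the whole pattern with some prefix of text,
    or None if more than k differences are required."""
    n = len(text)
    inf = k + 1  # any value above k means "too many differences"
    # prev[j]: differences between the empty pattern and text[:j]
    prev = [j if j <= k else inf for j in range(n + 1)]
    for i in range(1, len(pattern) + 1):
        cur = [inf] * (n + 1)
        if i <= k:
            cur[0] = i
        # Only cells with |i - j| <= k can hold a value of at most k.
        for j in range(max(1, i - k), min(n, i + k) + 1):
            cost = 0 if pattern[i - 1] == text[j - 1] else 1
            cur[j] = min(prev[j - 1] + cost,  # match or substitution
                         prev[j] + 1,         # pattern character unmatched
                         cur[j - 1] + 1,      # text character unmatched
                         inf)
        if min(cur) >= inf:
            return None  # every cell in the band already exceeds k
        prev = cur
    best = min(prev)
    return best if best <= k else None


print(k_difference("ACGTACGT", "ACGACGT", 1))   # one missing base
print(k_difference("ACGTACGT", "TTTTTTTT", 2))  # too different
```

Restricting the table to a band of width 2k+1 keeps the work proportional to k times the pattern length, which is the property that makes small difference budgets cheap to check.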

The k-mismatch alignment module, shown in figure 4-12, is the k-mismatch string matching algorithm specified by Landau-Vishkin. The module aligns the query sequences against the reference sequences, performing string matching with a specified number of allowed differences, and returns a good alignment if the number of mismatches is less than or equal to the specified limit; otherwise it returns a bad alignment.

Figure 4-12 Flow chart for k-mismatch alignment

The Filter alignment module is shown in figure 4-13. The second MapReduce program is executed if filter alignment is specified in the parameters. First, the job configuration is set for processing; then the filter map class, filter combined class and filter reduce class are set and executed.

Figure 4-13 Flow chart for Filter Alignment

The module processes the previously stored results and obtains the corresponding unambiguous best alignments along with the second-best alignments. At the end of processing, the results are generated and stored in HDFS for further processing as necessary.

The Filter Reduce module, shown in figure 4-14, obtains the processed results and their values. It compares the current alignment's differences with the best alignment's differences and sets the best alignment out of the two. The process is repeated to get the second-best alignment, and the final output is written to HDFS.

Figure 4-14 Flow chart for Filter Reduce Class
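The best and second-best bookkeeping of the filter step can be sketched in Python as follows. The Alignment record, the function name and the rule that a read is reported only when its best alignment has strictly fewer differences than the second best are assumptions made for this illustration.

```python
from typing import Dict, Iterable, NamedTuple


class Alignment(NamedTuple):
    read_id: str
    ref_start: int
    differences: int


def filter_best(alignments: Iterable[Alignment]) -> Dict[str, Alignment]:
    """Keep, for each read, the best alignment if it is unambiguous."""
    best: Dict[str, Alignment] = {}
    second: Dict[str, Alignment] = {}
    for aln in alignments:
        current = best.get(aln.read_id)
        if current is None or aln.differences < current.differences:
            if current is not None:
                second[aln.read_id] = current  # old best becomes second best
            best[aln.read_id] = aln
        else:
            runner_up = second.get(aln.read_id)
            if runner_up is None or aln.differences < runner_up.differences:
                second[aln.read_id] = aln
    # A tie between best and second best makes the read ambiguous.
    return {rid: b for rid, b in best.items()
            if rid not in second or second[rid].differences > b.differences}


hits = [
    Alignment("r1", 10, 2), Alignment("r1", 50, 0), Alignment("r1", 90, 1),
    Alignment("r2", 5, 1), Alignment("r2", 70, 1),
    Alignment("r3", 3, 2),
]
print(filter_best(hits))
```

Here r1 and r3 are reported, while r2 is dropped because its two alignments are equally good.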

4.4 Print Sequences

The Print Sequences module converts the output from raw byte format to a textual column form after processing the data obtained from HDFS. As shown in figure 4-15, the path passed as a parameter is checked; if it is a directory, all the files present in that directory are listed and each file is then printed.

Figure 4-15 Flow chart for Print Sequences

The Print file module, shown in figure 4-16, reads the content as a sequence of <key,value> pairs and prints the records in tabular form until the end of the values.

Figure 4-16 Flow chart for Printing output

Chapter 5 TESTING

5 TESTING OF OPTIMIZING DNA SEQUENCING ON HADOOP


5.1 Test Setup

Before a test case can be executed, the system in which the test scenarios are performed needs to be set up so that the test runs can be performed proficiently.

5.1.1 Applications used for Testing


The modules developed are tested using one or more of the following tools.
Java Tool: A Java-based testing and debugging tool used for testing and debugging the three separate modules and their sub-modules individually. Its main advantage is that it can be used for testing and debugging on any platform.
Java NetBeans: NetBeans is a Java-based IDE used to test and debug the modules and to track the flow of each module using its built-in functions, ensuring that the module is working as expected.

5.1.2 System Setup


The test cases can be fully realized only when the complete system is set up and the modules execute their jobs in coordination with the system. Before the test cases can be executed, the system needs to be online. The steps involved in the system setup are as follows.


1. Creation of a dedicated connection between the cluster nodes using SSH.

Table 5-1 Test setup scenario: Initialization of cluster nodes

Test Setup Scenario    Observation Scenario
Name of test           Check for initialization of cluster nodes
Description            Proper initialization of Hadoop cluster
Expected output        Connection to nodes established
Actual output          Connection to nodes established
Remarks                Setup successful

Figure 5-1 Snapshot: starting ssh


2. Starting the Hadoop-based Data Node over the cluster via the Master Node.

Table 5-2 Test setup scenario: Initialization of datanode

Test Setup Scenario    Observation Scenario
Name of test           Check for initialization of Data Node on all cluster nodes
Description            Proper initialization of Data Nodes
Expected output        Data Nodes initialized
Actual output          Data Nodes initialized
Remarks                Setup successful

Figure 5-2 Snapshot: starting datanode over the cluster


3. Starting the Hadoop-based Task Tracker over the cluster via the Master Node.

Table 5-3 Test setup scenario: Initialization of Task Tracker

Test Setup Scenario    Observation Scenario
Name of test           Check for initialization of Task Tracker on all cluster nodes
Description            Proper initialization of Task Tracker
Expected output        Task Tracker initialized
Actual output          Task Tracker initialized
Remarks                Setup successful

Figure 5-3 Snapshot: starting namenode over the cluster

Once the system setup is completed as depicted in the test setup cases above, the test cases can be executed and the results analyzed.


5.2 TESTING SCENARIOS


5.2.1 Test Case 1
Table 5-4 Test case 1: Extension converter module

Test Case Scenarios      Observation Scenario
Feature being tested     Testing Extension Convertor Module
Feature being tested     Algorithm
Description              Proper working of algorithm
Expected output          No errors
Actual output            No errors
Remarks                  Test successful

Figure 5-4 Snapshot: Extension conversion of fasta file


5.2.2 Test Case 2


Table 5-5 Test case 2: DNA sequencing module

Test Case Scenarios      Observation Scenario
Feature being tested     Testing DNA Sequencer Module
Feature being tested     Algorithm
Description              Proper working of algorithm
Expected output          No errors
Actual output            No errors
Remarks                  Test successful

Figure 5-5 Snapshot: start of sequencing process


5.2.3 Test Case 3


Table 5-6 Test case 3: Print Alignment module

Test Case Scenarios      Observation Scenario
Feature being tested     Testing Print Alignment Module
Feature being tested     Algorithm
Description              Proper working of algorithm
Expected output          No errors
Actual output            No errors
Remarks                  Test successful

Figure 5-6 Snapshot: Printing results


5.2.4 Test Case 4


Table 5-7 Test case 4: Tracking job over web interface

Test Case Scenarios      Observation Scenario
Feature being tested     Tracking of Job over the Web Interface
Feature being tested     Feature
Description              Proper working of algorithm
Expected output          No errors
Actual output            No errors
Remarks                  Test successful

Table 5-8 Snapshot: Running and Non-Running Tasks

Table 5-9 Snapshot: Running Jobs


Chapter 6 RESULTS


6 RESULTS OF OPTIMIZING DNA SEQUENCING ON HADOOP


To show the improvement in performance through reduced processing time, three separate cases are considered. These cases show the total processing time required for execution on:
1. One node Hadoop Cluster
2. Two node Hadoop Cluster
3. Four node Hadoop Cluster

CASE 1: Running the application on a One Node Hadoop Cluster


Table 6-1 Snapshot: cluster summary- single active node over the cluster

Figure 6-1 Snapshot: start of sequencer on a single node cluster


Total Time Taken For Completion of Task: 787 seconds or 13 min 7 seconds

Figure 6-2 Snapshot: completion of job with displayed total running time


CASE 2: Running the application on a Two Node Hadoop Cluster


Table 6-2 Snapshot: cluster summary - two active nodes over the cluster

Figure 6-3 Snapshot: start of sequencer on a two node cluster

Total Time Taken For Completion of Task: 403 seconds or 6 min 43 seconds

Figure 6-4 Snapshot: completion of job with displayed total running time


CASE 3: Running the application on a Four Node Hadoop Cluster


Table 6-3 Snapshot: cluster summary - four active nodes over the cluster

Figure 6-5 Snapshot: start of sequencer on a four node cluster

Total Time Taken For Completion of Task: 221 seconds or 3 min 41 seconds

Figure 6-6 Snapshot: completion of job with displayed total running time


CONCLUSION
The runtime of the DNA sequencing algorithm on Hadoop was measured for single-node and multi-node clusters. Three run-time scenarios were executed, and the change in run-time with the number of nodes was compared. The runtime results are as follows:
1. Runtime in a single-node Hadoop cluster: 787 seconds or 13 min 7 seconds
2. Runtime in a multi-node Hadoop cluster with 2 systems: 403 seconds or 6 min 43 seconds
3. Runtime in a multi-node Hadoop cluster with 4 systems: 221 seconds or 3 min 41 seconds

Figure 0-1 Graphical comparison of total processing time against the number of active systems

When the run-times of the test cases are compared against the number of nodes, it is evident that the run-time is approximately inversely proportional to the number of active systems in the Hadoop cluster: going from one node to two cut the run-time by a factor of about 1.95, and going from one node to four cut it by a factor of about 3.56.
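The relationship between node count and run-time can be checked directly from the three measurements reported above. The short Python sketch below computes the speedup relative to the single-node run and the parallel efficiency (speedup divided by node count); the function name `speedup_table` is an illustrative choice, and only the three measured times come from this report.

```python
# Measured total run-times in seconds, keyed by number of active nodes.
RUNTIMES = {1: 787, 2: 403, 4: 221}


def speedup_table(runtimes):
    """Return a list of (nodes, speedup, efficiency) relative to one node."""
    baseline = runtimes[1]
    rows = []
    for nodes in sorted(runtimes):
        speedup = baseline / runtimes[nodes]
        efficiency = speedup / nodes  # 1.0 would be perfectly linear scaling
        rows.append((nodes, speedup, efficiency))
    return rows


for nodes, speedup, efficiency in speedup_table(RUNTIMES):
    print(f"{nodes} node(s): speedup {speedup:.2f}x, efficiency {efficiency:.0%}")
```

An efficiency below 100% on four nodes is expected, since job startup, the shuffle phase, and network transfer do not shrink as nodes are added.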


Thus we can conclude that implementing the algorithm on Hadoop using the Map/Reduce programming technique gives a near-linear reduction in the processing time of the algorithm for large datasets as nodes are added to the cluster.

FUTURE ENHANCEMENTS
- Building a complete interactive web interface for the system, i.e., a complete cloud interfacing system.
- Automating the process of setting up individual systems to be included in the Hadoop cluster.
- Creating a self-monitoring process that runs in the background and monitors the state of each node, so that in case of node failure it either resolves the issue automatically or flags the system for immediate maintenance.


APPENDIX A : DNA and its Sequencing


ALIGNMENTS (what is a match and what is a mismatch)
How can we get the best alignment? There are several possibilities:

1. Reduce the number of mismatches:
TCAG-ACG-ATTG
|| | | |  | |      0 mismatches, 7 matches, 6 gaps
TC-GGA-GC-T-G

2. Reduce the number of gaps:
TCAGACGATTG
|| ||              5 mismatches, 4 matches, 2 gaps
TCGGAGCTG--

3. Reduce neither the number of gaps nor the number of mismatches:
TCAG-ACGATTG
|| | | | |         2 mismatches, 6 matches, 4 gaps
TC-GGA-GCTG-

4. Same as 3, but one base (or gap) moved:
TCAG-ACGATTG
|| | | | | |       1 mismatch, 7 matches, 4 gaps
TC-GGA-GCT-G
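The counts quoted for the four alignments above can be reproduced mechanically. The Python sketch below walks two aligned strings column by column and classifies each column as a match, a mismatch, or a gap. The function name `alignment_stats` is illustrative, and the trailing gap characters in alignments 2 and 3 are an assumption inferred from the stated counts, since the aligned strings must have equal length.

```python
def alignment_stats(top, bottom):
    """Count (matches, mismatches, gaps) for two aligned sequences.

    A column with '-' on either side is a gap; otherwise identical
    characters are a match and different characters a mismatch.
    """
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have the same length")
    matches = mismatches = gaps = 0
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            gaps += 1
        elif a == b:
            matches += 1
        else:
            mismatches += 1
    return matches, mismatches, gaps


alignments = [
    ("TCAG-ACG-ATTG", "TC-GGA-GC-T-G"),
    ("TCAGACGATTG", "TCGGAGCTG--"),    # trailing gaps assumed
    ("TCAG-ACGATTG", "TC-GGA-GCTG-"),  # trailing gap assumed
    ("TCAG-ACGATTG", "TC-GGA-GCT-G"),
]
for top, bottom in alignments:
    m, mm, g = alignment_stats(top, bottom)
    print(f"{top} / {bottom}: {m} matches, {mm} mismatches, {g} gaps")
```

Which alignment counts as "best" then depends on how matches, mismatches, and gaps are weighted against each other.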

Methods of DNA Sequencing
There are four basic methods of sequencing:

Maxam-Gilbert method (or) Chemical Sequencing
It was one of the first DNA sequencing methods developed, and it is based on chemical modification of DNA. It used purified DNA directly, which required radioactive labeling followed by chemical treatment. The fragments are electrophoresed side by side in gels for size separation. To visualize the fragments, the gel is exposed to X-ray film for autoradiography, yielding a series of dark bands, each corresponding to a radiolabeled DNA fragment, from which the sequence may be inferred. The extensive use of hazardous chemicals and the complexity of the technical procedure made the method difficult to scale up.


Sanger Di-Deoxy method (or) Chain Termination
Sanger used the principles of DNA replication in developing this process. It required that each read start be cloned for production of single-stranded DNA. It requires a single-stranded DNA template, the other reagents of a replication reaction, and modified nucleotides that terminate DNA strand elongation, resulting in DNA fragments of varying length. The newly synthesized and labeled DNA fragments are then heated and separated by size by gel electrophoresis. The DNA bands are then visualized by autoradiography or UV light, and the DNA sequence can be read directly off the X-ray film or gel image. The relative positions of the different bands among the four lanes are used to read the DNA sequence. Because it is more efficient and uses fewer toxic chemicals and lower amounts of radioactivity than the Maxam-Gilbert method, it rapidly became the method of choice. These methods have greatly simplified DNA sequencing.

Shotgun Sequencing
It is a method for determining the sequence of a very large piece of DNA. The basic DNA sequencing reaction can only get the sequence of a few hundred nucleotides. The large fragment is shotgun cloned, and then each of the resulting smaller clones (sub clones) is sequenced. By finding out where the sub clones overlap, the sequence of the larger piece becomes apparent. It does not require prior information about the sequence, and it can be used for DNA molecules as large as entire chromosomes. The ends of these fragments overlap and, when aligned properly by a genome assembly program, can be used to reconstruct the complete genome. It yields the sequence data quickly, but the task of assembling the fragments can be quite complicated for larger genomes. For a genome as large as the human genome, it may take many days of CPU time on large-memory, multiprocessor computers to assemble the fragments, and the resulting assembly will usually contain numerous gaps that have to be filled in later.

Primer walking
An alternative to shotgun sequencing is primer walking. Following the initial sequence determination, primed from a region of known sequence, subsequent primers are designed. These primers then serve as sequencing start points, each of which establishes an additional stretch of more than 500 bp of sequence data. New primers are synthesized for the newly established sequence in the template DNA, and the process continues. The advantage is that extensive sub-cloning is not required. The amount of overlap or coverage required is also decreased because the direction and location of the new sequence is known, substantially decreasing the effort needed to assemble the final sequence. The drawback is the amount of time required for each step in the primer walk and the need to design a robust primer for every step. Hence, this method is mainly used to fill gaps in a sequence that has been determined by shotgun cloning.

How Are the Sequences Generated
An automated sequencing gel: the DNA replication reaction is run in a test tube, but in the presence of trace amounts of all four of the dideoxy terminator nucleotides. Electrophoresis is used to separate the resulting fragments by size, and that is how we 'read' the sequence from it, as the colors march past in order.

In a large-scale sequencing lab, we use a machine to run the electrophoresis step and to monitor the different colors as they come out. Since about 2001, these machines - not surprisingly called automated DNA sequencers - have used 'capillary electrophoresis', where the fragments are piped through a tiny glass-fiber capillary during the electrophoresis step, and they come out the far end in size-order. At left is a screen shot of a real fragment of sequencing gel (this one from an older model of sequencer, but the concepts are identical). The four colors red, green, blue and yellow each represent one of the four nucleotides. The actual gel image would be perhaps 3 or 4 meters long and 30 or 40 cm wide. We don't even have to 'read' the sequence from the gel - the computer does that.

Eg: This is a plot of the colors detected in one 'lane' of a gel (one sample), scanned from smallest fragments to largest. The computer interprets the colors by printing the nucleotide sequence across the top of the plot. This is just a fragment of the entire file, which would span around 900 or so nucleotides of accurate sequence. The sequencer also gives the operator a text file containing just the nucleotide sequence, without the color traces.

BLAST: Basic Local Alignment Search Tool
It is an algorithm for comparing primary biological sequence information, such as the amino-acid sequences of different proteins or the nucleotides of DNA sequences. It enables a researcher to compare a query sequence with a library or database of sequences, and identify library sequences that resemble the query sequence above a certain threshold. Different types of BLAST are available according to the query sequences. For example, following the discovery of a previously unknown gene in the mouse, a scientist will typically perform a BLAST search of the human genome to see if humans carry a similar gene; BLAST will identify sequences in the human genome that resemble the mouse gene based on similarity of sequence. It is one of the most widely used bioinformatics programs, because it addresses a fundamental problem and the algorithm emphasizes speed over sensitivity. It is faster, but it cannot "guarantee the optimal alignments of the query and database sequences", which Smith-Waterman does. It searches only for the more significant patterns in the sequences, but with comparable sensitivity. It is also often used as part of other algorithms that require approximate sequence matching.


It takes input sequences in FASTA format or GenBank format, while the output may be formatted as HTML, plain text, or XML. Sometimes the results are given in a graphical format showing the hits found, a table showing sequence identifiers for the hits with scoring-related data, as well as alignments for the sequence of interest and the hits received, with corresponding BLAST scores for these. The easiest to read and most informative of these is probably the table.

DNA Sequencing Applications and Approaches
DNA sequencing can be used for a variety of applications, including:
- De novo sequencing of genomes
- Detection of variants (SNPs) and mutations
- Biological identification
- Confirmation of clone constructs
- Detection of methylation events
- Gene expression studies
- Detection of copy number variation

It is often reported that the goal of sequencing a genome is to obtain information about the complete set of genes in that particular genome sequence. The proportion of a genome that encodes genes may be very small. However, it is not always possible (or desirable) to sequence only the coding regions separately. Also, as scientists understand more about the role of this noncoding DNA (often referred to as junk DNA), it will become more important to have a complete genome sequence as a background to understanding the genetics and biology of any given organism. A few of the major uses of DNA sequencing are:
- Diagnosing Diseases - Comparing normal sequences to sequences in people with genetic illnesses to determine what parts of the genome are involved in the disease.
- Forensic Genetics - Comparing DNA left at a crime scene to the DNA sequence of suspects or victims.
- Paternity Tests - Matching the DNA of parents and children to determine how they are related.
- Comparisons With Other Genomes - Comparing genomes of different species to help in scientific research and conservation efforts.


APPENDIX B : MapReduce and RMAP


MAPREDUCE
It supports distributed computing on large data sets on clusters of computers. Its inputs and outputs are usually stored in a distributed file system, while intermediate data is usually stored on local disk and fetched remotely by the reducers. It consists of one Job Tracker, to which client applications submit MapReduce jobs. The Job Tracker pushes work out to available Task Tracker nodes in the cluster. If a Task Tracker fails or times out, that part of the job is rescheduled. A heartbeat is sent from the Task Tracker to the Job Tracker every few minutes to check its status. The Job Tracker and Task Tracker status and information is exposed by Jetty and can be viewed from a web browser. When a Job Tracker starts up, it looks for any leftover work and its data, so that it can restart work from where it left off.

Known limitations of this approach are:
- The allocation of work to Task Trackers is very simple. Every Task Tracker has a number of available slots (such as "4 slots"). Every active map or reduce task takes up one slot. The Job Tracker allocates work to the tracker nearest to the data with an available slot. There is no consideration of the current system load of the allocated machine, and hence its actual availability.
- If one Task Tracker is very slow, it can delay the entire MapReduce job, especially towards the end of a job, where everything can end up waiting for the slowest task. With speculative execution enabled, however, a single task can be executed on multiple slave nodes.

"Map" step: The master node takes the input, divides it into smaller sub-problems, and distributes them to worker nodes. A worker node may do this again in turn, leading to a multi-level tree structure. The worker node processes the smaller problem and passes the answer back to its master node.

"Reduce" step: The master node then collects the answers to all the sub-problems and combines them in some way to form the output, the answer to the problem it was originally trying to solve.


The Map and Reduce functions of MapReduce are both defined with respect to data structured in (key, value) pairs. Map takes one pair of data with a type in one data domain, and returns a list of pairs in a different domain:

Map(k1, v1) → list(k2, v2)

The Map function is applied in parallel to every pair in the input dataset. This produces a list of pairs for each call. After that, the MapReduce framework collects all pairs with the same key from all lists and groups them together, thus creating one group for each one of the different generated keys. The Reduce function is then applied in parallel to each group, which in turn produces a collection of values in the same domain:

Reduce(k2, list(v2)) → list(v3)

Each Reduce call typically produces either one value v3 or an empty return, though one call is allowed to return more than one value. The returns of all calls are collected as the desired result list. Thus the MapReduce framework transforms a list of (key, value) pairs into a list of values.

MapReduce allows for distributed processing of the map and reduction operations. The parallelism also offers some possibility of recovering from partial failure of servers or storage during the operation: if one mapper or reducer fails, the work can be rescheduled, assuming the input data is still available. MapReduce achieves reliability by parceling out a number of operations on the set of data to each node in the network. Each node is expected to report back periodically with completed work and status updates. If a node falls silent for longer than that interval, the master node records the node as dead and sends out the node's assigned work to other nodes.

The application defines the following functions:
- an input reader
- a Map function
- a partition function
- a compare function
- a Reduce function
- an output writer
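The Map, group-by-key, and Reduce stages described above can be imitated on a single machine in a few lines of Python. This is a teaching sketch and not Hadoop's API: the names `run_mapreduce`, `map_fn`, and `reduce_fn` are hypothetical, and the example job (counting bases across reads) is chosen only because it fits the (key, value) model.

```python
from collections import defaultdict


def map_fn(read_id, sequence):
    """Map(k1, v1) -> list(k2, v2): emit (base, 1) for every base in a read."""
    return [(base, 1) for base in sequence]


def reduce_fn(base, counts):
    """Reduce(k2, list(v2)) -> list(v3): sum the counts for one base."""
    return [sum(counts)]


def run_mapreduce(records, map_fn, reduce_fn):
    """Apply map_fn to every record, group by key, then apply reduce_fn."""
    groups = defaultdict(list)
    for k1, v1 in records:
        for k2, v2 in map_fn(k1, v1):
            groups[k2].append(v2)  # stands in for the shuffle phase
    # Keys are processed in sorted order, as the framework would do.
    return {k2: reduce_fn(k2, values) for k2, values in sorted(groups.items())}


reads = [("r1", "GATTA"), ("r2", "TAG")]
print(run_mapreduce(reads, map_fn, reduce_fn))
```

In a real cluster the loop over records runs on many nodes at once and the grouping step moves data across the network, but the contract between the two functions is the same.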


INPUT READER
The input reader divides the input into appropriately sized 'splits' (in practice typically 16 MB to 128 MB) and the framework assigns one split to each Map function. It reads data from stable storage and generates key/value pairs.

Map function
It takes a series of key/value pairs, processes each, and generates zero or more output key/value pairs. The input and output types of the map can be (and often are) different from each other.

Partition function
Each Map function output is allocated to a particular reducer by the application's partition function. The partition function is given the key and the number of reducers and returns the index of the desired reducer. A typical default is to hash the key and take it modulo the number of reducers. Between the map and reduce stages, the data is shuffled (parallel-sorted / exchanged between nodes) in order to move the data from the map node that produced it to the node where it will be reduced. The shuffle can sometimes take longer than the computation time, depending on network bandwidth, CPU speeds, data produced, and time taken by map and reduce computations.

Comparison function
The input for each Reduce is pulled from the machine where the Map ran and sorted using the application's comparison function.

Reduce function
The framework calls the application's Reduce function once for each unique key in the sorted order. The Reduce can iterate through the values that are associated with that key and produce zero or more outputs.

Output writer
The Output Writer writes the output of the Reduce to stable storage, usually a distributed file system.
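The "hash the key, then take it modulo the number of reducers" default mentioned for the partition function can be sketched as follows. The function name `partition` is illustrative, and CRC32 from Python's standard library is used here only as a stand-in for whatever hash a real framework applies; it was chosen because it gives the same value on every run and every machine.

```python
import zlib


def partition(key, num_reducers):
    """Return the index of the reducer that should receive this key."""
    if num_reducers <= 0:
        raise ValueError("num_reducers must be positive")
    # A deterministic hash guarantees that every mapper sends a given
    # key to the same reducer.
    return zlib.crc32(key.encode("utf-8")) % num_reducers


# The read identifiers below are taken from the FASTA example in this appendix.
for read_id in ["1_168_0365_0364", "1_168_0021_0625"]:
    print(read_id, "-> reducer", partition(read_id, 4))
```

Because the result depends only on the key and the reducer count, all values for one key meet at a single reducer, which is what the Reduce function relies on.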


RMAP
RMAP aims to map reads from next-generation sequencing technology accurately. It can map reads with or without error probability information (quality scores) and supports paired-end read mapping. There are no limitations on read widths or number of mismatches. It can map more than 8 million reads in an hour at full sensitivity to 2 mismatches. It can map sequencing reads to their genomic location. The length of a read must be at most 64 bp and should not be shorter than 20 bp. The user must specify the maximum number of mismatches permitted between a read and the genomic location to which it maps. For example, when mapping reads of length 36 bp with rmap, it might be desirable to allow up to 3 mismatches in the mapping to account for sequencing errors or single nucleotide polymorphisms (SNPs). For 50 bp reads, it might be desirable to allow, e.g., 5 mismatches.

Specifying the reads
Reads must be specified in either of two ways:

FASTA format sequence file, e.g.:
>1_168_0365_0364
GTTAAAAGTATGTGTGTCCTATGTCCTCAAGA
>1_168_0021_0625
TTTTATACACTTCAAAAAAAAAAAACCCTAGA
......

Solexa sequencing probability score files, in which four numbers are listed per base to represent the negative log-transform of the probabilities of the four nucleotides (A, C, G, T) being sequenced at that base position.


For example:
-40 -40 40 -40 40 -40 -40 -40 40 -40 -40 -40 40 -40 -40 -40 -40 -40 -40 40 -40 -40 40 -40 -40 40 -40 -40 -40 -40 -40 40 -40 40 -40 -40 -40 40 -40 -40 -40 -40 40 -40
-40 -40 -40 40 40 -40 -40 -40 -40 -40 40 -40 -40 -40 -40 40 -40 -40 40 -40 -40 -40 -40 40 -40 -40 -40 40 -40 -40 40 -40 -40 40 -40 -40 35 -35 -40 -40 40 -40 -40 -40 -40 -40 -40 40 40 -40 -40 -40 -40 -40 -40 40 -40 -40 40 -40 -40 -40 -40 40 -40 40 -40 -40 40 -40 -40 -40 -40 -40 -40 40 -40 -40 -40 40 40 -40 -40 -40
......

Each read should have at least a minimum length that is specified by the user as a command line argument. Reads with length exceeding the specified limit will be truncated at the 3'-end. Any characters other than {A,C,G,T,a,c,g,t} in the reads will be automatically transformed into an 'N'. When counting mismatches between a read and some genome location, any 'N' counts as a mismatch against whatever character it aligns with.

Specifying the genome
The genome can be specified in two ways: as a single chromosome file, or as a directory containing multiple files, one for each chromosome. The chromosome files must be in FASTA format, but must only have a single FASTA sequence (i.e. only the first line of the file can start with the '>' character). This is the format of the files downloadable from the UCSC Genome Browser. When specifying a directory containing multiple chromosomes, a filename suffix is also required to indicate those files in the directory that are to be searched.
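The read-handling rules stated in this section (truncation at the 3'-end, replacement of unexpected characters with 'N', and 'N' always counting as a mismatch) can be illustrated with a short Python sketch. It is not RMAP's code: the function names `normalize_read` and `count_mismatches` are hypothetical, and the case-insensitive comparison is an assumption made for the example.

```python
VALID_BASES = set("ACGTacgt")


def normalize_read(read, max_length):
    """Replace characters outside {A,C,G,T,a,c,g,t} with 'N' and
    truncate the read at its 3'-end (the right-hand side)."""
    cleaned = "".join(c if c in VALID_BASES else "N" for c in read)
    return cleaned[:max_length]


def count_mismatches(read, genome_window):
    """Count mismatching positions; an 'N' mismatches whatever it aligns with.

    Assumption: upper- and lower-case bases are treated as equal.
    """
    return sum(
        1
        for r, g in zip(read, genome_window)
        if r == "N" or g == "N" or r.upper() != g.upper()
    )


print(normalize_read("ACGTXRYacgt", 8))
print(count_mismatches("ACNT", "ACGT"))
```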


How matches are scored
Scoring is simply a count of mismatches, and fewer mismatches are better. For rmap, the mismatches are counted in all bases used in mapping. If, for a given read, there is no location in the genome found to match that read with fewer than the specified number of mismatches, then nothing is reported for that read. If there is a "good" match, then the best match is reported, along with the number of mismatches. These matches are reported in BED format:

chromosome  start      end        name        score  strand

For example:

chr1        153728548  153728583  s_2_0100_1  1      +

If more than one genomic sequence matches the read with the fewest number of mismatches, then the read is considered ambiguous, and is not reported. There is an option to have the names of all ambiguous reads reported in a separate file.
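The reporting rules above (report the unique best location, report nothing when no location is within the mismatch limit, and treat ties for the best score as ambiguous) can be sketched with a brute-force scan in Python. RMAP itself uses a far more efficient search; the function name `map_read`, the return format, and the choice to accept locations with at most `max_mismatches` mismatches are all assumptions made for this illustration.

```python
def map_read(read, genome, max_mismatches):
    """Scan every genome position and classify the read.

    Returns ("unique", start, end, mismatches), ("ambiguous",) or
    ("unmapped",). Coordinates are 0-based with an exclusive end.
    """
    best_score = None
    best_starts = []
    for start in range(len(genome) - len(read) + 1):
        window = genome[start:start + len(read)]
        # An 'N' in the read mismatches whatever it aligns with.
        score = sum(1 for r, g in zip(read, window) if r == "N" or r != g)
        if score > max_mismatches:
            continue
        if best_score is None or score < best_score:
            best_score, best_starts = score, [start]
        elif score == best_score:
            best_starts.append(start)
    if best_score is None:
        return ("unmapped",)
    if len(best_starts) > 1:
        return ("ambiguous",)
    start = best_starts[0]
    return ("unique", start, start + len(read), best_score)


genome = "ACGTACGTTTGA"  # toy sequence, not real data
for read in ["TTGA", "ACGT", "GGGG"]:
    print(read, map_read(read, genome, 1))
```

The three reads exercise the three outcomes: one exact unique hit, one read that matches equally well in two places, and one read with no acceptable location.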


APPENDIX C : Single System Installation


This installation process has been tested with the following software versions with the system name as user@ubuntu:

- Ubuntu Linux 10.04 LTS
- Hadoop 0.20.2
- Java 1.6

Install Sun Java 6 JDK

$ sudo apt-get install sun-java6-jdk
$ sudo update-java-alternatives -s java-6-sun

The full JDK will be placed in /usr/lib/jvm/java-6-sun (this directory is actually a symlink on Ubuntu). After installation, check whether Sun's JDK is correctly set up:

user@ubuntu:~# java -version
java version "1.6.0_20"
Java(TM) SE Runtime Environment (build 1.6.0_20-b02)
Java HotSpot(TM) Client VM (build 16.3-b01, mixed mode, sharing)

Configuring SSH
Hadoop requires SSH access to manage its nodes, i.e. remote machines plus your local machine if you want to use Hadoop on it. For our single-node setup of Hadoop, we need to configure SSH access to localhost. First, we need to install ssh on our system; then we have to generate an SSH key for the user.

user@ubuntu:~$ sudo apt-get install ssh
user@ubuntu:~$ su - user
user@ubuntu:~$ ssh-keygen -t dsa -P ""
Generating public/private dsa key pair.
Enter file in which to save the key (/home/user/.ssh/id_dsa):
Created directory '/home/user/.ssh'.
Your identification has been saved in /home/user/.ssh/id_dsa.
Your public key has been saved in /home/user/.ssh/id_dsa.pub.
The key fingerprint is:
9b:82:ea:58:b4:e0:35:d7:ff:19:66:a6:ef:ae:0e:d2 user@ubuntu
The key's randomart image is:
user@ubuntu:~$

The ssh-keygen command creates a DSA key pair with an empty passphrase. Generally, using an empty passphrase is not recommended, but in this case it is needed to unlock the key without your interaction (you don't want to enter the passphrase every time Hadoop interacts with its nodes).

Second, you have to enable SSH access to your local machine with this newly created key.

user@ubuntu:~$ cat $HOME/.ssh/id_dsa.pub >> $HOME/.ssh/authorized_keys

The final step is to test the SSH setup by connecting to your local machine with the user account. This step is also needed to save your local machine's host key fingerprint to the user's known_hosts file.

user@ubuntu:~$ ssh localhost
The authenticity of host 'localhost (::1)' can't be established.
RSA key fingerprint is d7:87:25:47:ae:02:00:eb:1d:75:4f:bb:44:f9:36:26.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'localhost' (DSA) to the list of known hosts.
Linux ubuntu 2.6.32-22-generic #33-Ubuntu SMP Wed Apr 28 13:27:30 UTC 2010 i686 GNU/Linux
Ubuntu 10.04 LTS
user@ubuntu:~$

You have to reboot your machine for the changes to take effect.


Hadoop Installation
You have to download Hadoop from the Apache Download Mirrors and extract the contents of the Hadoop package to a location of your choice. We picked /usr/local/hadoop.

$ cd /usr/local
$ sudo tar xzf hadoop-0.20.2.tar.gz
$ sudo mv hadoop-0.20.2 hadoop

Update $HOME/.bashrc
Add the following lines to the end of the $HOME/.bashrc file of user. If you use a shell other than bash, you should of course update its appropriate configuration files instead of .bashrc.

# Set Hadoop-related environment variables
export HADOOP_HOME=/usr/local/hadoop

# Set JAVA_HOME (we will also configure JAVA_HOME directly for Hadoop later on)
export JAVA_HOME=/usr/lib/jvm/java-6-sun

# Add Hadoop bin/ directory to PATH
export PATH=$PATH:$HADOOP_HOME/bin

You can repeat this exercise for other users who want to use Hadoop.

Configuration
hadoop-env.sh
The only required environment variable we have to configure for Hadoop is JAVA_HOME. Open conf/hadoop-env.sh in the editor of your choice (if you used the installation path in this tutorial, the full path is /usr/local/hadoop/conf/hadoop-env.sh) and set the JAVA_HOME environment variable to the Sun JDK/JRE 6 directory. Change

# The java implementation to use. Required.
# export JAVA_HOME=/usr/lib/j2sdk1.5-sun

to

# The java implementation to use. Required.
export JAVA_HOME=/usr/lib/jvm/java-6-sun

conf/*-site.xml
In this section, we will configure the directory where Hadoop will store its data files, the network ports it listens to, etc. Our setup will use Hadoop's Distributed File System, HDFS. Now we create the directory /app/hadoop/tmp, which is used for hadoop.tmp.dir below, and set the required ownerships and permissions:

$ sudo mkdir -p /app/hadoop/tmp
$ sudo chown hduser:hadoop /app/hadoop/tmp
# ...and if you want to tighten up security, chmod from 755 to 750...
$ sudo chmod 777 /app/hadoop/tmp

If you forget to set the required ownerships and permissions, you will see a java.io.IOException when you try to format the name node. Add the following snippets between the <configuration> ... </configuration> tags in the respective configuration XML file.

In file conf/core-site.xml:

<!-- In: conf/core-site.xml -->
<property>
  <name>hadoop.tmp.dir</name>
  <value>/app/hadoop/tmp</value>
  <description>A base for other temporary directories.</description>
</property>

<property>
  <name>fs.default.name</name>
  <value>hdfs://localhost:54310</value>
  <description>The name of the default file system. A URI whose scheme and authority determine the FileSystem implementation. The uri's scheme determines the config property (fs.SCHEME.impl) naming the FileSystem implementation class. The uri's authority is used to determine the host, port, etc. for a filesystem.</description>
</property>

In file conf/mapred-site.xml:

<!-- In: conf/mapred-site.xml -->
<property>
  <name>mapred.job.tracker</name>
  <value>localhost:54311</value>
  <description>The host and port that the MapReduce job tracker runs
  at. If "local", then jobs are run in-process as a single map
  and reduce task.</description>
</property>

In file conf/hdfs-site.xml:

<!-- In: conf/hdfs-site.xml -->
<property>
  <name>dfs.replication</name>
  <value>1</value>
  <description>Default block replication. The actual number of
  replications can be specified when the file is created. The default
  is used if replication is not specified in create time.</description>
</property>
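After editing the three files, it can be useful to read the configured values back from the shell. The following is only a minimal sketch and not part of the original setup: get_hadoop_property is a hypothetical helper name, and the function assumes that each <value> element fits on a single line, as in the snippets above. It is a line-oriented filter, not a real XML parser.

```shell
#!/bin/sh
# get_hadoop_property FILE NAME
# Hypothetical helper: prints the <value> that belongs to <name>NAME</name>
# in a Hadoop *-site.xml file. Assumes each <value>...</value> sits on one
# line; this is a line-oriented sketch, not an XML parser.
get_hadoop_property() {
    file=$1
    name=$2
    # Restrict sed to the lines from the matching <name> up to the closing
    # </property>, and print the text between <value> and </value>.
    sed -n "/<name>$name<\/name>/,/<\/property>/ s:.*<value>\(.*\)</value>.*:\1:p" "$file" | head -n 1
}
```

With the installation path used in this tutorial, a call such as get_hadoop_property /usr/local/hadoop/conf/core-site.xml fs.default.name should print the configured file system URI, and it prints nothing if the property is missing.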


Starting your single-node cluster

Run the command:

user@ubuntu:~$ /usr/local/hadoop/bin/start-all.sh

This will start up a Name Node, Data Node, Job Tracker and Task Tracker on your machine.

Stopping your single-node cluster

Run the command

hduser@ubuntu:~$ /usr/local/hadoop/bin/stop-all.sh

to stop all the daemons running on your machine.
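To confirm that start-all.sh really brought up the four daemons named above, one option is jps, the JVM process lister that ships with the Sun JDK (it is not included in a plain JRE). The sketch below is an illustration only: missing_daemons is a hypothetical helper, and the class names it looks for (NameNode, DataNode, JobTracker, TaskTracker) are the ones these daemons use in the Hadoop 0.20/1.x line assumed by this tutorial.

```shell
#!/bin/sh
# missing_daemons: reads jps-style output ("<pid> <ClassName>" per line)
# on stdin and prints every expected daemon that is NOT listed.
# Hypothetical helper; prints nothing when all four daemons are running.
missing_daemons() {
    awk '
        { seen[$2] = 1 }    # remember every class name jps reported
        END {
            n = split("NameNode DataNode JobTracker TaskTracker", want, " ")
            for (i = 1; i <= n; i++)
                if (!(want[i] in seen)) print want[i]
        }'
}
```

Typical use would be jps | missing_daemons after starting the cluster. Matching on the whole second field means that a running SecondaryNameNode is not mistaken for the NameNode.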


APPENDIX D : Multi-Node Installation


With two single-node clusters up and running, we will modify the Hadoop configuration to make one Ubuntu box the master (which will also act as a slave) and the other Ubuntu box a slave. From now on we will call the designated master machine simply the master and the slave-only machine the slave. We will also give the two machines these hostnames in their networking setup, most notably in /etc/hosts. If the hostnames of your machines are different (e.g. node01), you must adapt the settings in this tutorial accordingly.

Networking

We will assume that the master machine has the IP address 192.168.0.1 and the slave machine has 192.168.0.2. Change these addresses to match your own systems.

Update /etc/hosts on both machines with the following lines:

# /etc/hosts (for master AND slave)
192.168.0.1    master    # <master system IP>  <master system name>
192.168.0.2    slave     # <slave system IP>   <slave system name>

SSH access

The user on the master (user@master) must be able to connect

a) to its own user account on the master, i.e. ssh master in this context and not necessarily ssh localhost, and
b) to the user account on the slave (user@slave) via a password-less SSH login.

We just have to add user@master's public SSH key (which should be in $HOME/.ssh/id_dsa.pub) to the authorized_keys file of user@slave (in this user's $HOME/.ssh/authorized_keys).

You can do this manually or use the following SSH command:

user@master:~$ ssh-copy-id -i $HOME/.ssh/id_dsa.pub user@slave


This command will prompt you for the login password of user on slave, then copy the public SSH key for you, creating the correct directory and fixing the permissions as necessary.

The final step is to test the SSH setup by connecting as user from the master to the user account on the slave. This step is also needed to save the slave's host key fingerprint to user@master's known_hosts file.

So, connecting from master to master:

user@master:~$ ssh master
The authenticity of host 'master (192.168.0.1)' can't be established.
RSA key fingerprint is 3b:21:b3:c0:21:5c:7c:54:2f:1e:2d:96:79:eb:7f:95.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'master' (DSA) to the list of known hosts.
Linux master 2.6.20-16-386 #2 Thu Jun 7 20:16:13 UTC 2007 i686
...
user@master:~$

and from master to slave:

user@master:~$ ssh slave
The authenticity of host 'slave (192.168.0.2)' can't be established.
DSA key fingerprint is 74:d7:61:86:db:86:8f:31:90:9c:68:b0:13:88:52:72.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'slave' (DSA) to the list of known hosts.
Ubuntu 10.04
...
user@slave:~$
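Once both host keys have been accepted interactively as shown above, the same check can be repeated without any prompts, which is how the Hadoop start scripts will later use SSH. The following is a sketch under stated assumptions: check_passwordless_ssh and the SSH_BIN override are hypothetical names introduced here for illustration. The ssh options BatchMode and ConnectTimeout are standard OpenSSH client options; with BatchMode=yes, ssh fails instead of asking for a password or for host key confirmation.

```shell
#!/bin/sh
# check_passwordless_ssh HOST...
# Hypothetical helper: reports, per host, whether a login works without
# any prompt. BatchMode=yes makes ssh fail rather than ask for a password.
# SSH_BIN is an assumed override so that the check can be dry-run.
check_passwordless_ssh() {
    status=0
    for host in "$@"; do
        if "${SSH_BIN:-ssh}" -o BatchMode=yes -o ConnectTimeout=5 "$host" true </dev/null; then
            echo "$host: ok"
        else
            echo "$host: FAILED"
            status=1
        fi
    done
    return $status
}
```

On the master, running check_passwordless_ssh master slave as the Hadoop user should report both hosts as ok before the cluster is started. A FAILED line points back to the ssh-copy-id step or to a host key that has not been accepted yet.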


Configuration

conf/masters (master only)

The conf/masters file defines on which machines Hadoop will start secondary Name Nodes in our multi-node cluster. In our case, this is just the master machine. The primary Name Node and the Job Tracker will always run on the machines on which you execute the bin/start-dfs.sh and bin/start-mapred.sh scripts, respectively.

On the master system, update conf/masters so that it contains the name of the master system:

master

conf/slaves (master only)

The conf/slaves file lists the hosts, one per line, where the Hadoop slave daemons (Data Nodes and Task Trackers) will be run.

On the master system, update conf/slaves so that it contains the name of the master system followed by the name of each slave system on its own line:

master
slave

conf/*-site.xml (all machines)

Assuming you configured each machine as described in the single-node cluster installation, you will only have to change a few variables.

First, we have to change the fs.default.name variable (in conf/core-site.xml), which specifies the Name Node (the HDFS master) host and port. In our case, this is the master machine.

<!-- In: conf/core-site.xml -->
<property>
  <name>fs.default.name</name>
  <value>hdfs://master:54310</value>
  <description>The name of the default file system. A URI whose
  scheme and authority determine the FileSystem implementation. The
  uri's scheme determines the config property (fs.SCHEME.impl) naming
  the FileSystem implementation class. The uri's authority is used to
  determine the host, port, etc. for a filesystem.</description>
</property>

Second, we have to change the mapred.job.tracker variable (in conf/mapred-site.xml), which specifies the Job Tracker (MapReduce master) host and port. Again, this is the master in our case.

<!-- In: conf/mapred-site.xml -->
<property>
  <name>mapred.job.tracker</name>
  <value>master:54311</value>
  <description>The host and port that the MapReduce job tracker runs
  at. If "local", then jobs are run in-process as a single map
  and reduce task.</description>
</property>

Third, we change the dfs.replication variable (in conf/hdfs-site.xml), which specifies the default block replication. It defines how many machines a single file should be replicated to before it becomes available. The default value of dfs.replication is 3. However, this value should never exceed the number of Data Nodes in the cluster. Here that number is 2, because the master also acts as a slave.

<!-- In: conf/hdfs-site.xml -->
<property>
  <name>dfs.replication</name>
  <value>2</value>
  <description>Default block replication. The actual number of
  replications can be specified when the file is created. The default
  is used if replication is not specified in create time.</description>
</property>

Starting the multi-node cluster


Starting the cluster is done in two steps. First, the HDFS daemons are started: the Name Node daemon is started on master, and Data Node daemons are started on all slaves. Second, the MapReduce daemons are started: the Job Tracker is started on master, and Task Tracker daemons are started on all slaves.

HDFS daemons

Run the command bin/start-dfs.sh on the master system:

user@master:/usr/local/hadoop$ bin/start-dfs.sh

MapReduce daemons

Run the command bin/start-mapred.sh on the master system:

user@master:/usr/local/hadoop$ bin/start-mapred.sh

Stopping the multi-node cluster

Like starting the cluster, stopping it is done in two steps. The workflow is the opposite of starting, however: first the MapReduce daemons are stopped, then the HDFS daemons.

MapReduce daemons

Run the command bin/stop-mapred.sh on the master system:

user@master:/usr/local/hadoop$ bin/stop-mapred.sh

HDFS daemons

Run the command bin/stop-dfs.sh on the master system:

user@master:/usr/local/hadoop$ bin/stop-dfs.sh
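Because the order of the four scripts matters, it can be convenient to wrap them so that the order cannot be mixed up. The following is a minimal sketch, not something the Hadoop distribution provides: start_cluster and stop_cluster are hypothetical function names, and HADOOP_HOME is assumed to default to the installation path used in this tutorial.

```shell
#!/bin/sh
# Sketch of a wrapper that enforces the start/stop order described above.
# HADOOP_HOME falls back to the installation path used in this tutorial;
# the function names are hypothetical.
HADOOP_HOME=${HADOOP_HOME:-/usr/local/hadoop}

start_cluster() {
    # HDFS first; MapReduce is only started if HDFS came up.
    "$HADOOP_HOME/bin/start-dfs.sh" && "$HADOOP_HOME/bin/start-mapred.sh"
}

stop_cluster() {
    # Reverse order: MapReduce first, then HDFS. Both scripts are run
    # even if the first one reports an error.
    "$HADOOP_HOME/bin/stop-mapred.sh"
    "$HADOOP_HOME/bin/stop-dfs.sh"
}
```

The functions are meant to be run on the master system by the same user that owns the Hadoop installation, for example by sourcing the file and then calling start_cluster.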


