
Big Data Analytics in Genomics

by
Ramraj S M.E.,
Assistant Professor
Department of Software Engineering
Guide
Dr. S. S. Sridhar, Professor
Department of Computer Science and Engineering

Presentation
12-Nov-2016

Agenda

Genetics Vs Genomics

Terms in Genomics

De Novo Assembly

Challenges Involved

Distributed Computing

Hadoop

Genetics Vs Genomics

Genetics is the study of heredity: how the characteristics of living organisms are transmitted from one generation to the next via DNA, the substance that makes up genes, the basic units of heredity.

Genomics, in contrast, is the study of the entirety of an organism's genes, called the genome.

Terms in Genomics

Deoxyribonucleic acid (DNA) determines cellular function.

The complete set of an organism's DNA is called its genome.

Each gene is composed of a string of nucleotide bases labeled A, C, G and T.

Human DNA has approximately 3 billion nucleotide bases, and their precise order is known as the DNA sequence.

The output of a DNA sequencer is a set of sequences, also called reads. A long read is one whose sequence length exceeds 50 base pairs.

A pair of A and T, or a pair of G and C, is known as a base pair.
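
The pairing rule and the read-length threshold above can be illustrated with a small sketch. The function names (`complement_strand`, `is_long_read`) are hypothetical and chosen only for this example; the threshold of 50 base pairs is the one stated on this slide.

```python
# Base-pairing rule from the slide: A pairs with T, G pairs with C.
BASE_PAIRS = {"A": "T", "T": "A", "G": "C", "C": "G"}


def complement_strand(sequence: str) -> str:
    """Return the base that pairs with each base of `sequence`, in order."""
    return "".join(BASE_PAIRS[base] for base in sequence)


def is_long_read(read: str, threshold: int = 50) -> bool:
    """A read counts as long when it has more than `threshold` bases."""
    return len(read) > threshold


if __name__ == "__main__":
    print(complement_strand("ACGT"))
    print(is_long_read("A" * 36))
```

A read of 36 bases, the length used later in the deck, falls below this threshold and would be treated as a short read.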

Assembly Process

De Novo Assembly

De novo assembly refers to sequencing or assembling a novel genome where there is no reference sequence available for alignment.

Sequence reads are assembled as contigs, and the coverage quality of de novo sequence data depends on the size and continuity of the contigs (i.e., the number of gaps in the data).

Assemblers such as Velvet, Euler-USR, and SOAPdenovo have successfully assembled small genomes from short reads.

Assemblers for larger, mammalian-sized genomes require high memory and compute resources.
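
To make the idea of merging reads into contigs concrete, here is a toy greedy overlap merger. It is only an illustration under simplifying assumptions (error-free reads, a single strand, no repeats); it is not how Velvet, Euler-USR, or SOAPdenovo work internally, and the names `overlap` and `greedy_assemble` are hypothetical.

```python
from itertools import permutations


def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` that is also a prefix of `b`.

    Overlaps shorter than `min_len` are ignored and reported as 0.
    """
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads, min_len: int = 3):
    """Repeatedly merge the pair of sequences with the largest overlap.

    Returns the list of contigs left when no pair overlaps any more.
    """
    contigs = list(dict.fromkeys(reads))  # drop duplicate reads, keep order
    while True:
        best_len, best_a, best_b = 0, None, None
        for a, b in permutations(contigs, 2):
            n = overlap(a, b, min_len)
            if n > best_len:
                best_len, best_a, best_b = n, a, b
        if best_len == 0:
            return contigs
        contigs.remove(best_a)
        contigs.remove(best_b)
        contigs.append(best_a + best_b[best_len:])


if __name__ == "__main__":
    print(greedy_assemble(["ATGGCGT", "GCGTACC", "TACCTGA"]))
```

When every read overlaps a neighbour, the result is a single contig; reads that share no overlap stay separate, which corresponds to a gap in the assembly. The pairwise comparison is quadratic in the number of reads, which hints at why large genomes need far more memory and compute.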

Challenges in De Novo Assembly

For a 100 GB NGS file with read length 36 and k-mer size 25, the total size of intermediate data is 1.2 terabytes, i.e., each read is replicated 12 times. Thus, new strategies to store and process large quantities of data efficiently are required.

An E. coli genome can be assembled in as little as 15 minutes using a 32-bit Windows desktop computer with 32 GB of RAM.

De novo assembly of a human genome from short reads, stored as 100+ GB of compressed sequence data, is not feasible on a single machine with a short-read assembler, except on very expensive servers.
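
The factor of 12 in the first point matches the number of k-mers in one read: 36 - 25 + 1 = 12. The sketch below reproduces that arithmetic. It assumes, for illustration only, that each k-mer of a read produces one intermediate record of roughly read size; the function names are hypothetical.

```python
def kmers_per_read(read_length: int, k: int) -> int:
    """Number of overlapping k-mers in a read of the given length."""
    return max(read_length - k + 1, 0)


def extract_kmers(read: str, k: int):
    """List every k-mer of `read`, sliding one base at a time."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def intermediate_size_gb(input_gb: float, read_length: int, k: int) -> float:
    """Estimated intermediate data size, assuming one read-sized record per k-mer."""
    return input_gb * kmers_per_read(read_length, k)


if __name__ == "__main__":
    print(kmers_per_read(36, 25))
    print(intermediate_size_gb(100, 36, 25), "GB")
```

With the slide's numbers this gives 12 k-mers per read and 1200 GB, i.e. 1.2 terabytes of intermediate data.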

Distributed Computing

Distributed computation helps by breaking up a problem that is too big for one server into pieces that several smaller servers can handle.
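
A minimal sketch of this split-and-merge idea: the reads are divided into chunks, each chunk is summarised independently (here by counting bases), and the partial results are combined. The chunks are processed in a plain loop on one machine; in a real cluster each call to `count_bases` could run on a different server. All names are hypothetical.

```python
from collections import Counter


def split_into_chunks(reads, n_chunks: int):
    """Deal the reads round-robin into `n_chunks` lists."""
    return [reads[i::n_chunks] for i in range(n_chunks)]


def count_bases(reads) -> Counter:
    """Work done by one 'server': count A, C, G, T in its own chunk."""
    counts = Counter()
    for read in reads:
        counts.update(read)
    return counts


def merge(partials) -> Counter:
    """Combine the partial counts returned by every 'server'."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


if __name__ == "__main__":
    reads = ["ACGT", "AACC", "GGTT", "ACGT"]
    partials = [count_bases(chunk) for chunk in split_into_chunks(reads, 2)]
    print(merge(partials))
```

The merged result equals what a single machine would compute over all reads, but no single step needs to hold the whole data set.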

Hadoop

Hadoop is a framework for storing, analyzing and processing big data.

It allows distributed storage and distributed processing of large data sets across clusters of commodity computers using a simple programming model.

It is an Apache open-source framework.
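
Assuming the "simple programming model" refers to MapReduce, the sketch below imitates it for k-mer counting in the style of Hadoop Streaming, where a mapper and a reducer exchange tab-separated key/value lines and the reducer receives its input sorted by key. The code runs locally on plain Python iterables and calls no Hadoop API; `mapper`, `reducer`, `run_locally`, and the choice of k = 3 are illustrative assumptions.

```python
from itertools import groupby


def mapper(lines, k: int = 3):
    """Map step: emit one 'kmer<TAB>1' line for every k-mer of every read."""
    for line in lines:
        read = line.strip()
        for i in range(len(read) - k + 1):
            yield f"{read[i:i + k]}\t1"


def reducer(sorted_lines):
    """Reduce step: sum the counts of consecutive lines that share a key.

    Input must already be sorted by key, as the shuffle phase would provide.
    """
    pairs = (line.split("\t") for line in sorted_lines)
    for key, group in groupby(pairs, key=lambda pair: pair[0]):
        yield f"{key}\t{sum(int(value) for _, value in group)}"


def run_locally(lines, k: int = 3):
    """Stand-in for the framework: map, sort by key (shuffle), then reduce."""
    return list(reducer(sorted(mapper(lines, k))))


if __name__ == "__main__":
    for record in run_locally(["ACGT\n", "CGTA\n"]):
        print(record)
```

On a cluster, the framework would run many mapper and reducer instances in parallel and handle the sorting and data movement between them; the local `sorted` call only stands in for that step.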

Conclusion & Future Work

De novo assembly is required to analyze many new genomes.

De novo assembly is computationally complex.

The Hadoop framework can be applied to manage this complexity and to achieve a high-performance de novo assembly process.


Thank you
