Scaling Up Genomic Analysis With ADAM

Frank Austin Nothaft, UC Berkeley AMPLab
fnothaft@berkeley.edu, @fnothaft
12/8/2014
Data Intensive Genomics
Scale of genomic analyses is growing rapidly:
New experiments sequence 10-100k samples
Use high coverage, WGS for variant analyses
100k samples @ 60x WGS will generate ~20PB of read data and ~300TB of genotype data
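To make these totals concrete, the short sketch below divides them by the sample count. It assumes decimal units (1 TB = 10^12 bytes), which the slide does not specify, and uses only the figures quoted above.

```python
# Back-of-the-envelope check of the per-sample volumes implied by the slide.
# Assumption: decimal units (1 PB = 1e15 bytes); the slide does not say.
SAMPLES = 100_000
READ_DATA_BYTES = 20e15        # ~20 PB of read data
GENOTYPE_DATA_BYTES = 300e12   # ~300 TB of genotype data

read_gb_per_sample = READ_DATA_BYTES / SAMPLES / 1e9
genotype_gb_per_sample = GENOTYPE_DATA_BYTES / SAMPLES / 1e9

print(f"reads per sample:     ~{read_gb_per_sample:.0f} GB")
print(f"genotypes per sample: ~{genotype_gb_per_sample:.0f} GB")
```

Under that assumption, each 60x genome contributes roughly 200 GB of reads and 3 GB of genotype data.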
Petabytes Cause Problems
1. Analysis systems must be horizontally scalable without substantial programmer overhead
2. Data storage format must compress well while providing good read performance
3. Need to efficiently slice and dice dataset: not all users want the same views or subsets of data
Analysis Characteristics
Current genomics pipelines are limited by I/O
Most genomics algorithms can be formulated as a data or graph parallel computation
Analysis algorithms use iteration and pipelining
Reference genome/experiment metadata access must be cheap! → impacts analysis performance
What is ADAM?
An open source, high performance, distributed
platform for genomic analysis
ADAM defines:
1. A data schema and layout on disk*
2. A Scala API
3. A command line interface

* Via Avro and Parquet
Principles for Scalable Design in ADAM
Reuse commodity horizontally scalable systems
Parallel FS and data representation (HDFS + Parquet) combined with in-memory computing eliminates disk bandwidth bottleneck
Spark provides horizontally scalable iterative/pipelined Map-Reduce
Minimize data movement: send code to data, efficiently encode metadata
Spark
An in-memory data parallel computing framework
Optimized for iterative jobs → unlike Hadoop
Data maintained in memory unless inter-node movement needed (e.g., on repartitioning)
Presents a functional programming API, along with support for iterative programming via REPL
Set Daytona GraySort record (100TB in 23 min, 206 nodes)

Data Format
Avro schema encoded by Parquet
Schema can be updated without breaking backwards compatibility
Normalize metadata fields into schema for O(1) metadata access
Genotype schema is strictly biallelic, a "cell in the matrix"

record AlignmentRecord {
  union { null, Contig } contig = null;
  union { null, long } start = null;
  union { null, long } end = null;
  union { null, int } mapq = null;
  union { null, string } readName = null;
  union { null, string } sequence = null;
  union { null, string } mateReference = null;
  union { null, long } mateAlignmentStart = null;
  union { null, string } cigar = null;
  union { null, string } qual = null;
  union { null, string } recordGroupName = null;
  union { int, null } basesTrimmedFromStart = 0;
  union { int, null } basesTrimmedFromEnd = 0;
  union { boolean, null } readPaired = false;
  union { boolean, null } properPair = false;
  union { boolean, null } readMapped = false;
  union { boolean, null } mateMapped = false;
  union { boolean, null } firstOfPair = false;
  union { boolean, null } secondOfPair = false;
  union { boolean, null } failedVendorQualityChecks = false;
  union { boolean, null } duplicateRead = false;
  union { boolean, null } readNegativeStrand = false;
  union { boolean, null } mateNegativeStrand = false;
  union { boolean, null } primaryAlignment = false;
  union { boolean, null } secondaryAlignment = false;
  union { boolean, null } supplementaryAlignment = false;
  union { null, string } mismatchingPositions = null;
  union { null, string } origQual = null;
  union { null, string } attributes = null;
  union { null, string } recordGroupSequencingCenter = null;
  union { null, string } recordGroupDescription = null;
  union { null, long } recordGroupRunDateEpoch = null;
  union { null, string } recordGroupFlowOrder = null;
  union { null, string } recordGroupKeySequence = null;
  union { null, string } recordGroupLibrary = null;
  union { null, int } recordGroupPredictedMedianInsertSize = null;
  union { null, string } recordGroupPlatform = null;
  union { null, string } recordGroupPlatformUnit = null;
  union { null, string } recordGroupSample = null;
  union { null, Contig } mateContig = null;
}
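Because every field in the schema is a nullable union with a default, a reader using a newer schema can still interpret records written before a field existed. The sketch below imitates that idea with plain dictionaries. It is a toy model, not the Avro library's API, and the choice of which field counts as "added later" is purely hypothetical.

```python
# Toy model of Avro-style schema resolution: fields missing from an old
# record are filled in from the reader schema's defaults.
# This imitates the idea only; it does not use the Avro library.
READER_DEFAULTS = {
    "start": None,
    "end": None,
    "mapq": None,
    "readMapped": False,
    "duplicateRead": False,
    "supplementaryAlignment": False,  # hypothetical: pretend this was added later
}

def resolve(old_record: dict) -> dict:
    """Return the record as seen through the newer reader schema."""
    resolved = dict(READER_DEFAULTS)
    # Keep only the fields the reader schema knows about.
    resolved.update({k: v for k, v in old_record.items() if k in READER_DEFAULTS})
    return resolved

# A record written before "supplementaryAlignment" existed.
old = {"start": 100, "end": 200, "mapq": 60, "readMapped": True}
new_view = resolve(old)
print(new_view)
```

The old record stays readable, and the newer field simply takes its default value.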
Parquet
ASF Incubator project, based on Google Dremel
http://www.parquet.io
High performance columnar store with support for projections and push-down predicates
3 layers of parallelism:
File/row group
Column chunk
Page
Image from Parquet format definition: https://github.com/Parquet/parquet-format

Big Data in Parquet
ADAM in Parquet provides a 25% improvement over compressed BAM
Enables efficient slice-and-dice:
Can select column projections → reduce I/O
Support pushdown predicates for efficient filtering
Have Parquet/S3 integration to push computing down into remote block stores for cold data
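The two slice-and-dice mechanisms above can be illustrated with a toy columnar store. The sketch below is not the Parquet API: `RowGroup` and `scan` are hypothetical names, and the per-column min/max statistics only loosely mimic how a columnar format can skip whole row groups (predicate pushdown) and read only the requested columns (projection).

```python
from dataclasses import dataclass, field

@dataclass
class RowGroup:
    """Hypothetical row group: columns stored separately, plus min/max stats."""
    columns: dict                      # column name -> list of values
    stats: dict = field(init=False)    # numeric column name -> (min, max)

    def __post_init__(self):
        self.stats = {
            name: (min(vals), max(vals))
            for name, vals in self.columns.items()
            if vals and all(isinstance(v, (int, float)) for v in vals)
        }

def scan(row_groups, projection, column, lo, hi):
    """Return projected rows where lo <= row[column] <= hi.

    Row groups whose [min, max] range for `column` cannot overlap [lo, hi]
    are skipped without touching their data, and only the projected columns
    plus the filter column are ever read.
    """
    rows, groups_read = [], 0
    for rg in row_groups:
        cmin, cmax = rg.stats[column]
        if cmax < lo or cmin > hi:
            continue  # pushdown: the whole row group is skipped
        groups_read += 1
        for i, value in enumerate(rg.columns[column]):
            if lo <= value <= hi:
                rows.append({name: rg.columns[name][i] for name in projection})
    return rows, groups_read

groups = [
    RowGroup({"start": [10, 50, 90], "mapq": [60, 0, 30],
              "sequence": ["ACGT", "TTGA", "CCAT"]}),
    RowGroup({"start": [1000, 1500], "mapq": [60, 60],
              "sequence": ["GGCA", "ATAT"]}),
]
rows, groups_read = scan(groups, projection=["start", "mapq"],
                         column="start", lo=40, hi=100)
print(rows, groups_read)
```

In this example the second row group is never read, and the bulky `sequence` column is never touched, which is where the I/O saving comes from.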
Scalability
Evaluated on 1000G WGS NA12878, 234GB dataset
Used 32-128 m2.4xlarge, 1 cr1.8xlarge from AWS
Achieve linear scalability out to 128 nodes for most tasks
2-4x improvement vs {GATK, samtools/Picard} on single machine for most tasks
Long-read assembly with PacMin
The State of Analysis
Conventional short-read alignment based pipelines are really good at calling SNPs
Need improvement at calling INDELs and SVs
And are slow: 2 weeks to sequence, 1 week to analyze. Not fast enough.
If we move away from short reads, do we have other options?
Opportunities
New read technologies are available
Provide much longer reads (>10kbp vs. 250bp)
Different error model (15% INDEL errors, vs. 2% SNP errors)
Generally, lower sequence-specific bias

Left: PacBio homepage, Right: Wired, http://www.wired.com/2012/03/oxford-nanopore-sequencing-usb/
If long reads are available
We can use conventional methods:
Carneiro et al, Genome Biology 2012

But!
Why not make raw assemblies out of the reads?
1. Find overlapping reads: for all pairs of reads (i, j), does i overlap j?
2. Find consensus sequence, e.g. ACACTGCGACTCATCGACTC
Problems:
1. Overlapping is O(n^2) and a single evaluation is expensive anyway
2. Typical algorithms find a single consensus sequence; what if we've got polymorphisms?
Fast Overlapping with MinHashing
Wonderful realization by Berlin et al. [1]: overlapping is similar to the document similarity problem
Use MinHashing to approximate similarity:
Per document/read, compute signature:
1. Cut into shingles
2. Apply random hashes to shingles
3. Take min over all random hashes
Hash into buckets:
Signatures of length l can be hashed into b buckets, so we expect to compare all elements with similarity ≥ (1/b)^(b/l)
Compare:
For two documents with signatures of length l, Jaccard similarity is estimated by (# equal hashes) / l
Easy to implement in Spark: map, groupBy, map, filter

[1] Berlin et al, bioRxiv 2014
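The MinHash steps on this slide (shingle, hash, take minima, bucket, compare) can be sketched on a single machine as follows. This is a minimal illustration, not the implementation from the talk or from Berlin et al. The salted BLAKE2 hashes stand in for "random hashes", the helper names and the parameters `k`, `l`, and `b` are illustrative choices, and reads are assumed to be at least `k` bases long.

```python
import hashlib
from collections import defaultdict
from itertools import combinations

def shingles(read: str, k: int = 4) -> set:
    """Cut a read into overlapping k-mers (shingles)."""
    return {read[i:i + k] for i in range(len(read) - k + 1)}

def _hash(seed: int, shingle: str) -> int:
    """One member of a family of hash functions, selected by `seed`."""
    digest = hashlib.blake2b(shingle.encode(), digest_size=8,
                             salt=seed.to_bytes(8, "little")).digest()
    return int.from_bytes(digest, "little")

def signature(read: str, l: int = 32, k: int = 4) -> list:
    """MinHash signature: the minimum hash over all shingles, per hash function."""
    sh = shingles(read, k)
    return [min(_hash(seed, s) for s in sh) for seed in range(l)]

def estimate_jaccard(sig_a: list, sig_b: list) -> float:
    """Estimate Jaccard similarity as (# equal hashes) / l."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

def candidate_pairs(signatures: dict, b: int) -> set:
    """Split each signature into b bands; reads sharing any band become candidates."""
    l = len(next(iter(signatures.values())))
    rows = l // b
    buckets = defaultdict(set)
    for name, sig in signatures.items():
        for band in range(b):
            buckets[(band, tuple(sig[band * rows:(band + 1) * rows]))].add(name)
    pairs = set()
    for members in buckets.values():
        pairs.update(combinations(sorted(members), 2))
    return pairs

reads = {
    "r1": "ACACTGCGACTCATCGACTC",
    "r2": "CTGCGACTCATCGACTCGGA",   # overlaps r1
    "r3": "GGGGGGGGTTTTTTTTTTTT",   # shares no 4-mer with r1
}
sigs = {name: signature(seq) for name, seq in reads.items()}
print(estimate_jaccard(sigs["r1"], sigs["r2"]))
print(candidate_pairs(sigs, b=8))
```

Only reads that land in the same bucket are compared in detail, which is what avoids evaluating all O(n^2) pairs. In Spark terms, the signature step is a map, the bucketing is a groupBy, and the comparison is a map followed by a filter.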
Overlaps to Assemblies
Finding pairwise overlaps gives us a directed
graph between reads (lots of edges!)
Transitive Reduction
We can find a consensus between clique members
Or, we can reduce down:
Via two iterations of Pregel!
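Transitive reduction removes every overlap edge that is already implied by a longer path, leaving a much sparser graph. The sketch below is a simple sequential version for a small acyclic overlap graph. It is not the Pregel formulation mentioned on the slide, and the read names are made up.

```python
from collections import defaultdict

def transitive_reduction(edges: set) -> set:
    """Drop each edge (u, w) for which a longer path u -> ... -> w exists.

    Assumes the overlap graph is a DAG.
    """
    succ = defaultdict(set)
    for u, v in edges:
        succ[u].add(v)

    def reachable_from(start):
        """All nodes reachable from `start` via one or more edges."""
        seen, stack = set(), [start]
        while stack:
            node = stack.pop()
            for nxt in succ.get(node, ()):
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        return seen

    reduced = set(edges)
    for u, w in edges:
        # (u, w) is redundant if w is reachable through another successor of u.
        if any(v != w and w in reachable_from(v) for v in succ[u]):
            reduced.discard((u, w))
    return reduced

# Four reads that all overlap one another in left-to-right order.
overlaps = {("A", "B"), ("B", "C"), ("A", "C"),
            ("C", "D"), ("B", "D"), ("A", "D")}
layout = transitive_reduction(overlaps)
print(sorted(layout))
```

Here six overlap edges collapse to the three-edge chain A, B, C, D.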

Monoallelic Sequence Model
Traditional probabilistic models assume independence at each site and a good reference model
This discards information about local sequence context
Can consider a different formulation of the problem:
Per reduced segment, build a graph of the alleles
Find the allelic copy numbers that maximize segment probability
Allele Graphs
[Figure: allele graph. ACACTCG, then C or A, then TCTCA, then G or C, then TCCACACT]
Edges of graph define conditional probabilities
Can efficiently marginalize probabilities over graph using Eliminate algorithm [1], exactly solve for argmax
Notes:
X = copy number of this allele
Y = copy number of preceding allele
k = number of reads observed
j = number of reads supporting Y → X transition
Pi = probability that read i supports Y → X transition
[1] Jordan, Probabilistic Graphical Models.
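The idea of eliminating variables to solve exactly for the most probable copy-number assignment can be sketched on the simplest case, a chain of alleles. This is a generic max-product elimination, not the model from the talk. The prior and the "sticky" conditional table are invented purely for illustration, and the slide's actual probability expression is not reproduced here.

```python
def argmax_chain(prior: dict, transitions: list):
    """Most probable assignment along a chain of variables.

    prior:       state -> P(X0 = state)
    transitions: one table per edge, mapping (prev, cur) -> P(cur | prev)
    Returns (best assignment as a list of states, its probability).
    """
    states = list(prior)
    best = dict(prior)   # best[s]: max probability of any prefix ending in s
    back = []            # back-pointers, one dict per eliminated variable
    for table in transitions:
        new_best, pointers = {}, {}
        for cur in states:
            prev = max(states, key=lambda p: best[p] * table[(p, cur)])
            new_best[cur] = best[prev] * table[(prev, cur)]
            pointers[cur] = prev
        best = new_best
        back.append(pointers)
    last = max(states, key=best.get)
    path = [last]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, best[last]

# Hypothetical numbers: copy numbers 0, 1, 2 across three consecutive alleles.
copy_numbers = [0, 1, 2]
prior = {0: 0.1, 1: 0.6, 2: 0.3}
sticky = {(p, c): (0.8 if p == c else 0.1)
          for p in copy_numbers for c in copy_numbers}
path, prob = argmax_chain(prior, [sticky, sticky])
print(path, prob)
```

With these made-up tables the best assignment keeps copy number 1 at every allele. Each variable is eliminated once, so the cost grows linearly with the number of alleles rather than exponentially.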
Output
Current assemblers emit FASTA contigs
We'll emit "multigs", which we'll map back to a reference graph
Multig = multi-allelic (polymorphic) contig
Will include a confidence score per base
Working with UCSC, who've done some really neat work [1] deriving formalisms & building software for mapping between sequence graphs, and GA4GH ref. variation team
[1] Paten et al, Mapping to a Reference Genome Structure, arXiv 2014.

Acknowledgements
UC Berkeley: Matt Massie, André Schumacher, Jey Kottalam, Christos Kozanitis, Adam Bloniarz
Mt. Sinai: Arun Ahuja, Neal Sidhwaney, Michael Linderman, Jeff Hammerbacher
GenomeBridge: Timothy Danford, Carl Yeksigian
Cloudera: Uri Laserson
Microsoft Research: Jeremy Elson, Ravi Pandya
And many other open source contributors: 26 contributors to ADAM/BDG from >11 institutions
