
Towards block-based compression of genomic data
with random access functionality

Tom Paridaens, Yves Van Stappen, Wesley De Neve, Peter Lambert and Rik Van de Walle
Multimedia Lab, Ghent University - iMinds, Ghent, Belgium
Image and Video Systems Lab, KAIST, Daejeon, South Korea
Email: {tom.paridaens, yves.vanstappen, wesley.deneve, peter.lambert, rik.vandewalle}@ugent.be

Abstract - Current algorithms for compressing genomic data mostly focus on achieving high levels of effectiveness and reasonable levels of efficiency, ignoring the need for features such as random access and stream processing. Therefore, in this paper, we introduce a novel framework for compressing genomic data, with the aim of allowing for a better trade-off between effectiveness, efficiency and functionality. To that end, we draw upon concepts taken from the area of media data processing. In particular, we propose to compress genomic data as small blocks of data, using encoding tools that predict the nucleotides and that correct the prediction made by storing a residue. We also propose two techniques that facilitate random access. Our experimental results demonstrate that the compression effectiveness of the proposed approach is up to 1.91 bpn (bits per nucleotide), which is significantly better than binary encoding (3 bpn for an alphabet of A, C, G, T and N) and Huffman coding (2.21 bpn).

Index Terms - DNA Sequence Compression, Genomic Data Storage, Random Access

I. INTRODUCTION
Due to the advent of next-generation sequencing technology,
the amount of available genomic data has started to increase
exponentially, with doubling rates currently reaching down
to four months [1]. To keep the storage and transmission
of genomic data sustainable in the near future, compression
algorithms are needed. However, current algorithms for compressing genomic data mostly focus on achieving high levels
of effectiveness (compression ratio) and reasonable levels of
efficiency (processing speed), ignoring the need for features
such as random access and stream processing [2].
In this paper, we introduce a novel framework for compressing genomic data, with the aim of allowing for a better trade-off between effectiveness, efficiency and functionality. To that
end, we draw upon concepts taken from the area of media data
processing. Our specific contributions are as follows:
- we make use of the pipes and filters design pattern to implement a framework for genomic data compression that is modular and easily extensible;
- we propose an algorithm for block-based compression of genomic data, leveraging multiple encoding tools;
- we present two techniques that make it possible to provide random access to compressed genomic data; and
- we discuss the way a number of tools for media data processing could be re-used within the context of genomic data processing.

This paper is organized as follows. In Section II, we review related work in the area of both genomic data compression and media data processing. In Section III, we discuss the architecture of the proposed compression framework and the encoding tools implemented, paying particular attention to the support for random access. In Section IV, we evaluate the effectiveness of our compression framework. Finally, in Section V, we present our conclusions and a number of directions for future work.
II. RELATED WORK
A. Genomic Data Compression
Genomic data compression encompasses the processing of
different types of data, including separate reads and full-genome sequences. In this context, four different types of
compression can be used [3]:
1) Bit encoding - Nucleotides are stored as a sequence of two bits (or three bits when the N nucleotide is also used), instead of the eight-bit ASCII format (a small packing sketch is given at the end of this subsection).
2) Dictionary-based encoding - Groups of nucleotides are
stored as a reference to entries in a dictionary [4].
3) Statistical encoding - Nucleotides are predicted based
on a probabilistic model [5].
4) Reference-based encoding - Groups of nucleotides are
stored as a link to part of a reference genome [6].
In this paper, we integrate several of the techniques mentioned above into a modular and extensible framework for
compressing full-genome sequences.
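As a small illustration of the first technique, the Python sketch below packs a nucleotide string into two bits per symbol, or three bits per symbol when N occurs; the function name and the fixed alphabet ordering are our own assumptions and do not correspond to any particular tool from the literature.

```python
# Hypothetical illustration of bit encoding: pack a nucleotide string into
# 2 bits per symbol (A, C, G, T), or 3 bits per symbol when N also occurs.

def pack_nucleotides(sequence: str) -> tuple[bytes, int]:
    """Return the packed bytes and the number of bits used per nucleotide."""
    alphabet = "ACGTN" if "N" in sequence else "ACGT"
    bits_per_symbol = 3 if "N" in sequence else 2
    codes = {symbol: value for value, symbol in enumerate(alphabet)}

    buffer, bit_count, packed = 0, 0, bytearray()
    for nucleotide in sequence:
        buffer = (buffer << bits_per_symbol) | codes[nucleotide]
        bit_count += bits_per_symbol
        while bit_count >= 8:                  # flush complete bytes
            bit_count -= 8
            packed.append((buffer >> bit_count) & 0xFF)
    if bit_count:                              # pad the final partial byte
        packed.append((buffer << (8 - bit_count)) & 0xFF)
    return bytes(packed), bits_per_symbol

packed, bpn = pack_nucleotides("ACGTNACGTACGTACGTACGT")
print(len(packed), "bytes at", bpn, "bits per nucleotide")
```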
B. Media Data Processing
Advances in consumer technology (e.g., smartphones with
high-definition cameras) and services (e.g., Netflix) have the
same catalytic effect on media data as next-generation sequencing has on genomic data: a cheaper way to create
and consume data, causing an exponential growth in data.
As a result, a strong need exists for effective storage and
transmission of media data, and thus for compression [7].
Thanks to the international standardization of technology
for media data processing, researchers can easily build on
top of already existing results produced by both academia
and industry, making it possible to address a broad range of
advanced research topics in a highly effective manner. These
research topics include, amongst others:
- random access functionality;
- compressed-domain processing [8];
- stream processing [9];
- adaptive streaming [10]; and
- encryption [11].
Interestingly, these research topics can also be studied
within the context of genomic data processing. In this paper,
we set out to do this for the topic of random access functionality, while leaving the study of the other research topics within
the context of genomic data processing as future work.

III. PROPOSED COMPRESSION FRAMEWORK


A. Architectural Design
The architectural design of the proposed framework for
compressing genomic data reflects our intention to balance
effectiveness and efficiency with functionality [2]. Indeed, in
order to allow for such a trade-off, we decided to make use of
the pipes and filters design pattern during the construction of
our compression framework. Using this design pattern, filters are connected to each other by pipes that pass the data from one filter to the next, while the filters themselves are responsible for manipulating the data. That way, we can easily adapt the
functionality of our compression framework by modifying or
adding filters. Note that media tools such as GStreamer [12]
make use of a similar approach.
Fig. 1. The genomic data compression track, using the pipes and filters design pattern (Input filter, Encoding filter and Output filter connected by pipes, with an attached Statistics module).

In Fig. 1, we show the genomic data compression track of the proposed framework. This track consists of three filters:
- The Input filter imports genomic data from files.
- The Output filter exports the compressed genomic data and associated metadata to files.
- The Encoding filter splits the input data into blocks and selects the most effective encoding tool.
The track in question also contains a Statistics module that is used to study the effectiveness, efficiency and usage of the different encoding tools.
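To make the pattern concrete, the sketch below wires a minimal track together using plain Python generators as pipes; all names (input_filter, encoding_filter, output_filter, encode_block) and the placeholder encoder are illustrative assumptions rather than the actual implementation of our framework.

```python
# Minimal pipes-and-filters sketch: Python generators act as pipes, and each
# filter consumes items from the previous stage and yields items to the next.

BLOCK_SIZE = 21  # nucleotides per block (a multiple of three, cf. Section IV)

def input_filter(path):
    """Read a FASTA file and yield fixed-size blocks of nucleotides."""
    with open(path) as handle:
        sequence = "".join(line.strip() for line in handle
                           if not line.startswith(">"))
    for start in range(0, len(sequence), BLOCK_SIZE):
        yield sequence[start:start + BLOCK_SIZE]

def encoding_filter(blocks):
    """Encode every block with a (placeholder) toolbox and yield the payloads."""
    for block in blocks:
        yield encode_block(block)

def output_filter(encoded_blocks, path):
    """Write the encoded blocks, each preceded by its length, to a file."""
    with open(path, "wb") as handle:
        for payload in encoded_blocks:
            handle.write(len(payload).to_bytes(4, "big"))
            handle.write(payload)

def encode_block(block):
    """Stand-in for the real encoding toolbox (see Section III-B)."""
    return block.encode("ascii")

# Assembling the track (Input filter -> Encoding filter -> Output filter):
# output_filter(encoding_filter(input_filter("chrY.fa")), "chrY.enc")
```

Adding or replacing a filter in this chain does not affect the other stages, which is precisely the property we rely on to extend the framework.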
B. Encoding Tools
The tools used by the encoding filter can be divided
into two groups: reference and non-reference encoding tools.
Reference tools use previously encoded data for prediction
purposes, whereas non-reference tools encode the data without
references. The implemented tools are as follows:
1) Non-reference encoding tools:
- NUCLEOTIDE REPETITION - The prediction is either a repetition of a single nucleotide (e.g., AAAAA...) or a repetition of two different nucleotides (e.g., ATATAT...).
- HUFFMAN coding - The input block is coded per codon using Huffman coding.
- BINARY coding - The input block is coded using a binary representation of its nucleotides. This type of coding is limited to blocks that do not contain N nucleotides.
2) Reference encoding tools:
- SEARCH coding - The prediction is the previously encoded block that contains the fewest differences when compared to the input block. To find this block, the input block is compared to all previously encoded blocks within a specified range, the search window (a sketch of this search procedure is given after this list).
- SEARCH COMPLEMENT coding - SEARCH coding using the complement of the input block.
- SEARCH INVERSE COMPLEMENT coding - SEARCH coding using the inverse complement of the input block.
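For illustration, the following sketch outlines SEARCH coding under simplifying assumptions (equal-sized blocks and a plain position-wise comparison); the function names are hypothetical, and the real encoding filter additionally signals the selected tool and serializes the residue.

```python
# Hypothetical sketch of SEARCH coding: compare the input block against all
# previously encoded blocks inside the search window and keep the candidate
# with the fewest mismatching nucleotides; the mismatches form the residue.

def search_code(block, previous_blocks, window_size):
    """Return (reference block index, residue) assuming equal-sized blocks."""
    window = previous_blocks[-window_size:]
    first_index = len(previous_blocks) - len(window)
    best_index, best_residue = None, None
    for offset, candidate in enumerate(window):
        residue = [(position, symbol)
                   for position, (symbol, predicted) in enumerate(zip(block, candidate))
                   if symbol != predicted]
        if best_residue is None or len(residue) < len(best_residue):
            best_index, best_residue = first_index + offset, residue
    return best_index, best_residue

# The SEARCH COMPLEMENT and SEARCH INVERSE COMPLEMENT variants apply the same
# search to a transformed copy of the input block, for example:
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def inverse_complement(block):
    return block.translate(COMPLEMENT)[::-1]
```

Decoding a block then amounts to copying the referenced block and patching the positions listed in the residue.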
The proposed compression framework can be easily extended with additional encoding tools by incorporating these
tools into the encoding filter.

C. Media Functionality for Genomes


In order to support features such as random access and
stream processing, we propose to split the genomic data
into fixed-sized blocks of nucleotides, as is done by most
algorithms for media data compression (e.g., algorithms for
video compression typically split video frames into slices and
macroblocks [13]). Each of these blocks can then be encoded
and further processed separately, thus making it possible to
implement the aforementioned features.
Furthermore, we propose to make use of a so-called compression toolbox, as is also done by most algorithms for media
data compression. Indeed, during compression, we process each block of genomic data with a collection of tools. For
each block, we then select the tool that offers the highest
compression ratio. That way, our compression algorithm is
able to automatically adapt itself to the characteristics of a
particular block.
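A minimal sketch of this selection step is given below; it assumes that every tool is a callable returning the encoded payload as bytes, or None when the tool is not applicable (e.g., binary coding for a block containing N). This interface is our own simplification, not the actual API of the framework.

```python
# Hypothetical toolbox selection: every registered tool encodes the block, and
# the shortest payload (which, in a real implementation, already includes the
# signalling overhead for the tool and its parameters) is kept.

def select_tool(block, toolbox):
    """Return (tool name, payload) of the most effective applicable tool."""
    candidates = []
    for name, tool in toolbox.items():
        payload = tool(block)        # each tool returns bytes, or None when it
        if payload is not None:      # does not apply (e.g., binary coding and N)
            candidates.append((len(payload), name, payload))
    size, name, payload = min(candidates)   # assumes at least one tool applies
    return name, payload

# A profile is simply a restricted toolbox, e.g. a low-complexity profile that
# only keeps the non-reference tools:
#   low_complexity = {"binary": binary_tool, "huffman": huffman_tool}
```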
Note that the tools in our compression toolbox can be
combined into profiles that target different applications, as
is also frequently done within the context of media data
processing. These profiles differ in the number and the type
of tools that are available, thus making it possible to trade off efficiency, effectiveness and functionality. For example,
whereas a profile that contains all tools will facilitate higher
compression ratios than a profile that contains a single tool,
the latter will have a lower time and memory complexity.
D. Random Access
Random access is an important functionality in the area
of media data processing. Indeed, random access makes it
possible to access and consume parts of a media file without
having to download and decode the complete media file. As
such, random access typically improves bandwidth usage at the
cost of a small loss in compression effectiveness. To support
random access, we store an index file that links a block number
to its corresponding byte address in the encoded bitstream. We implemented two different techniques for providing random access to genomic data:
- Soft-reset random access (Soft-reset RA) - Files are split into blocks. If a requested block is encoded using one of the reference coding tools, all blocks that are needed for decoding will also be transmitted (a minimal sketch of this resolution step is given after this list).
- Hard-reset random access (Hard-reset RA) - After a fixed or variable number of blocks, hard-reset blocks are added. None of the blocks located before the reset can be used as a reference for encoding. The hard-reset allows limiting the number of reference blocks, thus making it possible to achieve low-complexity random access, as a server can transmit a block between two hard-resets without the need for preprocessing.
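The sketch below illustrates both ingredients under hypothetical naming: a block index that maps block numbers to byte offsets, and the soft-reset resolution step that follows reference links back until a non-reference block is reached; hard-reset RA simply guarantees that such a chain never crosses a reset point.

```python
# Hypothetical illustration of random access support. `references` maps a block
# number to the block it was predicted from (None for non-reference blocks);
# `index` maps a block number to its byte offset in the compressed bitstream.

def blocks_to_transmit(requested_block, references):
    """Soft-reset RA: collect the requested block plus all its dependencies."""
    needed, current = [], requested_block
    while current is not None:
        needed.append(current)
        current = references[current]
    return sorted(needed)

def read_block(stream, block_number, index):
    """Jump straight to one encoded block using its byte address."""
    start = index[block_number]
    end = index.get(block_number + 1)
    stream.seek(start)
    return stream.read(None if end is None else end - start)

# Example: Block 3 references Block 2, which in turn references Block 1, so all
# three blocks have to be transmitted (cf. Fig. 2 below).
references = {1: None, 2: 1, 3: 2, 4: None}
print(blocks_to_transmit(3, references))   # -> [1, 2, 3]
```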
In Fig. 2, we show the use of soft-reset RA. Block 1 is
encoded without a reference, Block 2 is encoded based on
Block 1 and Block 3 is encoded based on Block 2. As a result,
the transmission of Block 3 will also result in the transmission
of Block 1 and Block 2.
Fig. 2. Encoding of Block 3 with soft-reset RA (Block 1: non-ref, Block 2: ref Block 1, Block 3: ref Block 2, Block 4: non-ref; transmission of Block 3 comprises Blocks 1-3).

In Fig. 3, we show the use of hard-reset RA. Block 1 is encoded without a reference, and a hard-reset is inserted after Block 1. Block 2 is encoded without a reference as it is the first block after the hard-reset. Block 3 is encoded based on Block 2. As a result, the transmission of Block 3 will also result in the transmission of Block 2, but not in the transmission of Block 1.

Fig. 3. Encoding of Block 3 with hard-reset RA (Block 1: non-ref, hard-reset, Block 2: non-ref, Block 3: ref Block 2, Block 4: non-ref; transmission of Block 3 comprises Blocks 2 and 3).

Thanks to the use of block-wise processing and the modular nature of our compression framework, we are able to support random access to genomic data in a generic way, without requiring any adaptation of the compression tools used. Finally, in order to facilitate byte-wise addressing of encoded blocks in a compressed bitstream, we add padding bits to every block, an operation that (slightly) lowers the compression effectiveness.
IV. EXPERIMENTS
To measure the compression effectiveness and encoding
tool usage, we processed the full Homo sapiens Y chromosome [14] with our compression framework. This file, in
FASTA format, contains five different nucleotides: A, C, G, T
and N.

A. Experimental Results
The main focus of the research effort presented in this paper
is on flexible integration of functionality into a framework for
compressing genomic data. With the encoding tools currently
implemented, our compression effectiveness is in the range
of 2.54 to 1.91 bpn (bits per nucleotide), depending on the
compression settings used. Note that binary encoding needs 3
bpn, whereas Huffman coding needs 2.21 bpn.
1) Compression Results: Within the current compression
framework, there are two parameters that can be configured
and that have an influence on the compression ratio (and the
random access functionality): block size and window size.
Block size is the number of nucleotides in each separate block. A small block size improves the prediction effectiveness of the reference encoding tools and results in a finer granularity for random access, but the per-nucleotide overhead needed to signal the encoding tool and the parameters used is larger. To align with codon borders, we use multiples of three for this value.
Window size is the number of previously encoded blocks
used for reference encoding. A larger window size will improve the effectiveness of prediction of the reference encoding
tools but will also require more bits to signal the reference
block.
In Table I, we show the effect of the aforementioned two
parameters. In general, we can observe that a larger window
size improves the effectiveness of compression, especially for
smaller block sizes.
TABLE I
RESULTING BITS PER NUCLEOTIDE

block\window      2      8    128    512   2048   8192  32768
21             2.54   2.54   2.53   2.52   2.51   2.49   2.45
30             2.29   2.29   2.28   2.27   2.26   2.25   2.21
63             2.09   2.09   2.09   2.08   2.08   2.06   2.04
126            1.97   1.97   1.97   1.97   1.97   1.96   1.95
513            1.91   1.91   1.91   1.91   1.91   1.91   1.91

2) Encoding Tool Usage: In what follows, we discuss the usage of the different encoding tools for different block and window sizes. To investigate the usage of the repetition tools, we separate the Huffman and binary coding from the non-reference encoding tools.
In Table II, we show the encoding tool usage for a fixed
search window size of 32768 blocks. We can observe that, as the block size increases, the use of Huffman coding rises because it becomes more difficult for the algorithm to find high-quality reference predictions. The non-reference coding tools are also less effective due to a lack of longer repetitions.
TABLE II
ENCODING TOOL USAGE AS A FUNCTION OF THE BLOCK SIZE

block size          21       30       63      126      512
Binary+Huffman   82.32%   83.41%   86.69%   90.64%   95.81%
Reference        11.53%   10.64%    7.81%    4.31%    0.78%
Non-Reference     6.15%    5.95%    5.50%    5.04%    3.40%

In Table III, we show the effect of the search window on the encoding tool usage for a fixed block size of 21 nucleotides.
We can observe that the usage of the reference encoding
tools rises with larger search windows as the latter increase
the probability of finding an effective prediction. While the
reference encoding tools improve encoding effectiveness, they
also affect the soft-reset RA effectiveness. A higher usage
of reference encoding tools will increase the need to select
additional blocks that are not within the requested range.
In case of hard-reset RA, the effectiveness depends on the
refresh strategy and the size of the refresh window. Due to
the resetting of the context at each refresh, hard-reset RA will
offer lower effectiveness than soft-reset RA, with equal search
window and block sizes. Therefore, hard-reset RA is only to be
used in cases where random access functionality is important.
TABLE III
ENCODING TOOL USAGE AS A FUNCTION OF THE SEARCH WINDOW SIZE

window size          8       64      512     4096    32768
Binary+Huffman   93.38%   92.78%   91.04%   89.46%   82.32%
Reference         0.37%    0.99%    2.74%    4.32%   11.53%
Non-Reference     6.25%    6.24%    6.23%    6.22%    6.15%

V. CONCLUSIONS AND FUTURE WORK


In this paper, we introduced a novel framework for compressing genomic data, drawing upon a number of concepts
taken from the field of media data processing: usage of the
pipes and filters design pattern, block-based compression and
usage of a compression toolbox. In particular, we proposed
to compress genomic data as small blocks of data, using
encoding tools that predict the nucleotides and that correct the
prediction made by storing a residue. In addition, we proposed
two techniques that facilitate random access to compressed
genomic data. Our experimental results demonstrate that the
compression effectiveness of the proposed approach is up to
1.91 bpn (bits per nucleotide), which is significantly better
than binary encoding (3 bpn) and Huffman coding (2.21
bpn). We expect the effectiveness to improve by implementing
additional encoding tools.
In future work, we will integrate more advanced media
functionality in our framework for genomic data compression:
Network Abstraction Layer - We plan to introduce an
abstraction layer, mapping compressed genomic data blocks
onto so-called Network Abstraction Layer (NAL) units, an
approach also used by present-day video compression standards [13]. That way, we can reuse already existing (and
highly optimized) solutions for streaming media data within a
genomic data context.
Gene Parameter Set - The characteristics of genomic
data can change along the genome (e.g., the frequency of
nucleotides). By supporting the concept of a so-called Gene
Parameter Set, it is possible, for instance, to adapt parameters
or to change the list of available tools at a local level. This
approach is in line with the usage of different parameter sets
by present-day video compression standards (cf. the usage of
a Sequence Parameter Set by H.264/AVC [13]).

Linking to genes - When making use of hard-reset RA, the reset points are currently inserted at fixed positions. To improve the effectiveness of transmission, we can link hard-resets to the start codons of genes. That way, we do not need to transmit previous reference blocks.
Compressed-domain processing - In previous research, we
have shown that the MPEG-21 Bitstream Syntax Description
Language (BSDL) can be used to efficiently adapt already
existing media files to meet the characteristics of diverse usage environments. These adaptations include, amongst others,
selecting parts of media files and adapting the accuracy of the
data [10]. We expect similar concepts can be reused within the
context of compressed-domain processing of genomic data.
Encryption - Human genomic data often contains sensitive
information, thus requiring encryption. Encrypting a complete
genomic file is a safe solution but results in the loss of support
for random access. It is therefore necessary to support the
encryption of a single block or a small group of blocks.
Furthermore, it would even be better to apply encryption to
the residual data only. That way, the header information is still
available for compressed-domain processing.
ACKNOWLEDGEMENT
The research activities described in this paper were funded
by Ghent University, iMinds, the Institute for the Promotion
of Innovation by Science and Technology in Flanders (IWT),
the Fund for Scientific Research-Flanders (FWO-Flanders),
and the European Union.
REFERENCES
[1] G. Cochrane, C. E. Cook, E. Birney, The future of DNA sequence
archiving, Gigascience, 2013.
[2] T. Paridaens, W. De Neve, P. Lambert, R. Van de Walle, Genome
sequences as media files, Proceedings of DCBIOSTEC, 2014.
[3] S. Wandelt, M. Bux, U. Leser, Trends in genome compression, Journal
of Current Bioinformatics, 2013.
[4] S. Kuruppu, S. J. Puglisi, J. Zobel, Optimized relative Lempel-Ziv compression of genomes, 34th Australasian Computer Science Conference,
2011.
[5] K. K. Kaipa, K. Lee, T. Ahn, R. Narayanan, DNA coding using finite-context models and arithmetic coding, ICASSP, 2009.
[6] M. H.-Y. Fritz, R. Leinonen, G. Cochrane, E. Birney, Efficient storage
of high throughput DNA sequencing data using reference-based compression, Cold Spring Harbor Laboratory Press, 2011.
[7] Sandvine, https://www.sandvine.com/downloads/general/global-internet-phenomena/2013/2h-2013-global-internet-phenomena-report.pdf, 2013.
[8] T. Paridaens, D. De Schrijver, W. De Neve, R. Van de Walle, XML-driven bitrate adaptation of SVC bitstreams, WIAMIS, 2007.
[9] B. Pieters, D. Van Rijsselbergen, W. De Neve, R. Van de Walle, Performance evaluation of H.264/AVC decoding and visualization using the
GPU, SPIE Optics and Photonics, 2007.
[10] D. Van Deursen, W. Van Lancker, W. De Neve, T. Paridaens, E.
Mannens, R. Van de Walle, NinSuna : a fully integrated platform for
format-independent multimedia content adaptation and delivery using
Semantic Web technologies, Multimedia Tools and Applications, Vol.
46(2-3):371-398, 2010.
[11] G. Van Wallendael, A. Boho, J. De Cock, A. Munteanu, R. Van de
Walle, Encryption for high efficiency video coding with video adaptation
capabilities, IEEE ICCE, 2013.
[12] GStreamer, http://gstreamer.freedesktop.org/, 1999-2014.
[13] T. Wiegand, G. J. Sullivan, G. Bjontegaard, A. Luthra, Overview of the
H.264/AVC video coding standard, IEEE Transactions on Circuits and
Systems for Video Technology, Vol. 13(7), 2003.
[14] Human Y Chromosome, ftp://ftp.ncbi.nih.gov/genomes/H_sapiens/CHR_Y/hs_alt_HuRef_chrY.fa.gz, 2014.
