I. INTRODUCTION
With the advent of next-generation sequencing technology, the amount of available genomic data has started to increase exponentially, with doubling times currently as short as four months [1]. To keep the storage and transmission of genomic data sustainable in the near future, compression algorithms are needed. However, current algorithms for compressing genomic data mostly focus on achieving high levels of effectiveness (compression ratio) and reasonable levels of efficiency (processing speed), ignoring the need for features such as random access and stream processing [2].
In this paper, we introduce a novel framework for compressing genomic data, with the aim of allowing for a better trade-off between effectiveness, efficiency, and functionality. To that end, we draw upon concepts taken from the area of media data processing. Our specific contributions are as follows:
- we make use of the pipes and filters design pattern to implement a framework for genomic data compression that is modular and easily extensible;
- we propose an algorithm for block-based compression of genomic data, leveraging multiple encoding tools;
- we present two techniques that make it possible to provide random access to compressed genomic data; and
- we discuss the way a number of tools for media data processing could be re-used within the context of genomic data processing.
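As an illustration of the first contribution, the pipes-and-filters track of Fig. 1 can be sketched as a chain of generators. The filter names and behaviour below are illustrative assumptions, not the framework's actual interface.

```python
# Hypothetical sketch of the pipes-and-filters track of Fig. 1: each filter
# consumes a stream and yields a transformed stream; the "pipes" are plain
# generator chaining. Filter names and behaviour are illustrative only.

def input_filter(lines):
    """Yield nucleotide chunks, skipping FASTA header lines."""
    for line in lines:
        line = line.strip()
        if line and not line.startswith(">"):
            yield line

def encoding_filter(chunks, block_size):
    """Regroup the nucleotide stream into fixed-size blocks (a stand-in
    for the real block-based encoding tools)."""
    buf = ""
    for chunk in chunks:
        buf += chunk
        while len(buf) >= block_size:
            yield buf[:block_size]
            buf = buf[block_size:]
    if buf:
        yield buf  # trailing partial block

def output_filter(blocks):
    """Serialize blocks; a real output filter would emit a bitstream."""
    for i, block in enumerate(blocks):
        yield f"block {i}: {len(block)} nt"

lines = [">toy sequence", "ACGTACGTACGT", "ACGTACG"]
out = list(output_filter(encoding_filter(input_filter(lines), block_size=9)))
print(out)
```

Swapping or adding a filter (e.g. an encryption filter) only requires inserting another generator into the chain, which is what makes the design modular and easily extensible.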
Fig. 1. The genomic data compression track, using the pipes and filters pattern.
Fig. 3. Encoding of Block 3 with hard-reset RA.
A. Experimental Results
The main focus of the research effort presented in this paper is on the flexible integration of functionality into a framework for compressing genomic data. With the encoding tools currently implemented, our compression effectiveness ranges from 1.91 to 2.54 bpn (bits per nucleotide), depending on the compression settings used. Note that plain binary encoding needs 3 bpn, whereas Huffman coding needs 2.21 bpn.
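Figures like these can be sanity-checked in a few lines: fixed-length binary coding of an up-to-eight-symbol nucleotide alphabet (A, C, G, T plus ambiguity codes) costs ceil(log2 8) = 3 bits per symbol, while a Huffman code adapts to skewed symbol frequencies. The frequency table below is an illustrative assumption, not the statistics of the test sequence.

```python
# Sanity check on the quoted bpn figures: fixed-length binary coding of an
# up-to-eight-symbol alphabet costs ceil(log2(8)) = 3 bpn, while a Huffman
# code exploits skewed symbol frequencies. The frequency table below is an
# illustrative assumption, not measured on the test data.
import heapq
from math import ceil, log2

def huffman_code_lengths(freqs):
    """Return {symbol: code length} for a Huffman code over freqs."""
    heap = [(w, i, {s: 0}) for i, (s, w) in enumerate(freqs.items())]
    heapq.heapify(heap)
    tie = len(heap)  # tie-breaker so dicts are never compared
    while len(heap) > 1:
        w1, _, d1 = heapq.heappop(heap)
        w2, _, d2 = heapq.heappop(heap)
        # Merging two subtrees pushes every contained symbol one level deeper.
        merged = {s: depth + 1 for s, depth in {**d1, **d2}.items()}
        heapq.heappush(heap, (w1 + w2, tie, merged))
        tie += 1
    return heap[0][2]

freqs = {"A": 0.30, "C": 0.20, "G": 0.20, "T": 0.29, "N": 0.01}
lengths = huffman_code_lengths(freqs)
avg_bpn = sum(freqs[s] * lengths[s] for s in freqs)
print(f"binary: {ceil(log2(8))} bpn, Huffman: {avg_bpn:.2f} bpn")
```

The Huffman average naturally depends on the assumed frequency table; with a near-uniform distribution over four symbols it approaches 2 bpn.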
1) Compression Results: Within the current compression framework, two parameters can be configured that influence the compression ratio (and the random access functionality): block size and window size.
Block size is the number of nucleotides in each separate block. A small block size improves the effectiveness of prediction of the reference encoding tools and lowers the granularity for random access, but the overhead needed to signal the encoding tool and parameters used is larger per nucleotide. To fit codon borders, we use multiples of three for this value.
Window size is the number of previously encoded blocks used for reference encoding. A larger window size improves the effectiveness of prediction of the reference encoding tools, but also requires more bits to signal the reference block.
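The interaction of the two parameters can be sketched with a toy model; this is a minimal stand-in for the paper's actual encoding tools, assuming mismatch count as the prediction criterion.

```python
# Toy model (not the paper's actual codec) of the block/window trade-off:
# a block may be coded relative to the best-matching block among the
# previous `window_size` blocks, and signalling the chosen reference costs
# roughly ceil(log2(window_size)) bits of side information per block.
from math import ceil, log2

def split_blocks(seq, block_size):
    # Block sizes are kept multiples of three to fit codon borders.
    assert block_size % 3 == 0
    return [seq[i:i + block_size] for i in range(0, len(seq), block_size)]

def choose_reference(blocks, idx, window_size):
    """Index of the in-window previous block with the fewest mismatches,
    or None if the window is empty (block must be coded non-ref)."""
    candidates = range(max(0, idx - window_size), idx)
    if not candidates:
        return None
    return min(candidates,
               key=lambda j: sum(a != b for a, b in zip(blocks[j], blocks[idx])))

def ref_index_bits(window_size):
    """Overhead of signalling which block in the window is the reference."""
    return ceil(log2(window_size)) if window_size > 1 else 0

blocks = split_blocks("ACGACGACGACGTTTGGGACGACC", 6)
best = choose_reference(blocks, 3, window_size=2)  # searches blocks 1 and 2
print(best, ref_index_bits(32768))
```

A larger window admits more candidate references (better prediction) at the price of a larger reference index, which is exactly the trade-off Table I quantifies.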
In Table I, we show the effect of the aforementioned two
parameters. In general, we can observe that a larger window
size improves the effectiveness of compression, especially for
smaller block sizes.
TABLE I
RESULTING BITS PER NUCLEOTIDE

window \ block |   21 |   30 |   63 |  126 |  513
             2 | 2.54 | 2.29 | 2.09 | 1.97 | 1.91
             8 | 2.54 | 2.29 | 2.09 | 1.97 | 1.91
           128 | 2.53 | 2.28 | 2.09 | 1.97 | 1.91
           512 | 2.52 | 2.27 | 2.08 | 1.97 | 1.91
          2048 | 2.51 | 2.26 | 2.08 | 1.97 | 1.91
          8192 | 2.49 | 2.25 | 2.06 | 1.96 | 1.91
         32768 | 2.45 | 2.21 | 2.04 | 1.95 | 1.91

(Window sizes in rows, block sizes in columns.)
[Table II: three-column percentage breakdown for block sizes 21, 30, 63, 126, 512 and window sizes 8, 64, 512, 4096, 32768; each row sums to 100%, but the column headings could not be recovered.]
Linking to genes - When making use of hard-reset RA, the reset points are currently inserted at fixed positions. To improve the effectiveness of transmission, we can link hard-resets to the start codons of genes. That way, we do not need to transmit previous reference blocks.
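A minimal sketch of this idea, assuming hard-resets at ATG start codons in frame 0 of the forward strand; real gene annotation also involves other reading frames, the reverse strand, and further biological context.

```python
# One way to realise "linking hard-resets to genes": scan for ATG start
# codons on codon boundaries and use those positions as hard-reset points
# instead of fixed intervals. Simplification: only frame 0 of the forward
# strand is checked, which is far from real gene finding.

def hard_reset_points(seq, frame=0):
    """Return codon-aligned positions of ATG start codons."""
    return [i for i in range(frame, len(seq) - 2, 3) if seq[i:i + 3] == "ATG"]

seq = "CCCATGAAATTTATGCCCGGG"
print(hard_reset_points(seq))  # positions 3 and 12 in this toy sequence
```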
Compressed-domain processing - In previous research, we
have shown that the MPEG-21 Bitstream Syntax Description
Language (BSDL) can be used to efficiently adapt already
existing media files to meet the characteristics of diverse usage environments. These adaptations include, amongst others,
selecting parts of media files and adapting the accuracy of the
data [10]. We expect similar concepts can be reused within the
context of compressed-domain processing of genomic data.
Encryption - Human genomic data often contains sensitive
information, thus requiring encryption. Encrypting a complete
genomic file is a safe solution but results in the loss of support
for random access. It is therefore necessary to support the
encryption of a single block or a small group of blocks.
Furthermore, it would be even better to apply encryption to the residual data only. That way, the header information is still available for compressed-domain processing.
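This block layout can be sketched as follows. The header fields and the XOR "cipher" are toy placeholders (a real deployment would use a proper authenticated cipher such as AES-GCM); the point is only to show headers staying in the clear while residuals are encrypted.

```python
# Sketch of block-level encryption that leaves headers in the clear, so
# compressed-domain processing can still parse block boundaries. The XOR
# "cipher" is a toy placeholder for a real cipher; the block layout
# (header + residual) is an assumption for illustration.
import os

def toy_encrypt(data: bytes, key: bytes) -> bytes:
    # NOT secure -- stands in for a proper authenticated cipher.
    return bytes(b ^ key[i % len(key)] for i, b in enumerate(data))

def encrypt_block(header: bytes, residual: bytes, key: bytes) -> bytes:
    # Header stays plaintext; only the residual payload is encrypted.
    return header + toy_encrypt(residual, key)

key = os.urandom(16)
header = b"\x01\x00\x3f"          # e.g. tool id + reference block index
residual = b"mismatch positions"  # placeholder residual payload
blob = encrypt_block(header, residual, key)
assert blob[:3] == header                      # header remains readable
assert toy_encrypt(blob[3:], key) == residual  # XOR is its own inverse
```

Because only the payload changes, a compressed-domain processor can still select, reorder, or drop blocks without holding the decryption key.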
ACKNOWLEDGEMENT
The research activities described in this paper were funded
by Ghent University, iMinds, the Institute for the Promotion
of Innovation by Science and Technology in Flanders (IWT),
the Fund for Scientific Research-Flanders (FWO-Flanders),
and the European Union.
REFERENCES
[1] G. Cochrane, C. E. Cook, E. Birney, The future of DNA sequence
archiving, Gigascience, 2013.
[2] T. Paridaens, W. De Neve, P. Lambert, R. Van de Walle, Genome
sequences as media files, Proceedings of DCBIOSTEC, 2014.
[3] S. Wandelt, M. Bux, U. Leser, Trends in genome compression, Journal
of Current Bioinformatics, 2013.
[4] S. Kuruppu, S. J. Puglisi, J. Zobel, Optimized relative Lempel-Ziv compression of genomes, 34th Australasian Computer Science Conference,
2011.
[5] K. K. Kaipa, K. Lee, T. Ahn, R. Narayanan, DNA coding using finite-context models and arithmetic coding, ICASSP, 2009.
[6] M. H.-Y. Fritz, R. Leinonen, G. Cochrane, E. Birney, Efficient storage
of high throughput DNA sequencing data using reference-based compression, Cold Spring Harbor Laboratory Press, 2011.
[7] Sandvine, https://www.sandvine.com/downloads/general/global-internet-phenomena/2013/2h-2013-global-internet-phenomena-report.pdf, 2013.
[8] T. Paridaens, D. De Schrijver, W. De Neve, R. Van de Walle, XML-driven bitrate adaptation of SVC bitstreams, WIAMIS, 2007.
[9] B. Pieters, D. Van Rijsselbergen, W. De Neve, R. Van de Walle, Performance evaluation of H.264/AVC decoding and visualization using the
GPU, SPIE Optics and Photonics, 2007.
[10] D. Van Deursen, W. Van Lancker, W. De Neve, T. Paridaens, E.
Mannens, R. Van de Walle, NinSuna: a fully integrated platform for
format-independent multimedia content adaptation and delivery using
Semantic Web technologies, Multimedia Tools and Applications, Vol.
46(2-3):371-398, 2010.
[11] G. Van Wallendael, A. Boho, J. De Cock, A. Munteanu, R. Van de
Walle, Encryption for high efficiency video coding with video adaptation
capabilities, IEEE ICCE, 2013.
[12] GStreamer, http://gstreamer.freedesktop.org/, 1999-2014.
[13] T. Wiegand, G. J. Sullivan, G. Bjontegaard, A. Luthra, Overview of the
H.264/AVC video coding standard, IEEE Transactions on Circuits and
Systems for Video Technology, Vol. 13(7), 2003.
[14] Human Y Chromosome, ftp://ftp.ncbi.nih.gov/genomes/H_sapiens/CHR_Y/hs_alt_HuRef_chrY.fa.gz, 2014.