Abstract—In recent years, the amount of time-series data generated in different domains has grown consistently. Analyzing large time-series datasets coming from sensor networks, power grids, stock exchanges, social networks and cloud monitoring logs at a massive scale is one of the biggest challenges that data scientists are facing. Big data storage and processing frameworks provide an environment to handle the volume, velocity and frequency attributes associated with time-series data. We propose an efficient and distributed computing framework, R2Time, for processing such data in the Hadoop environment. It integrates R with a distributed time-series database (OpenTSDB) using a MapReduce programming framework (RHIPE). R2Time allows analysts to work on huge datasets from within a popular, well-supported, and powerful analysis environment.

Keywords—R; OpenTSDB; Hadoop; HBase; time-series

…lacks operability over NoSQL-based solutions. opentsdbr [7] and StatsD [8] use HTTP/JSON APIs to query data from OpenTSDB and import them into R, but their centralized functioning creates a performance bottleneck due to large memory, processing and network-throughput requirements. rbase [9] and rhbase [1] provide distributed functionality for non-time-series data in HBase. RHIPE [10] makes it possible to analyse simple time-series data stored in HDFS, but has limitations in directly reading data from HBase. RHIPE can read data from HBase with the help of third-party libraries [10], but has shortcomings in reading composite keys [12] as used by OpenTSDB. RHadoop [13] provides similar functionality to RHIPE, but has significantly poorer performance [18] due to the distribution of required R libraries over the network to every node in the cluster participating in processing the data.
is created and appended together. An example of a regular expression for filtering the specified tags: let the binary representations of two tags be [0 0 1 0 0 2] and [0 0 2 0 0 4] respectively. A regular expression for filtering them would be: \Q\000\000\001\000\000\002\E and \Q\000\000\002\000\000\004\E. Data stored in HBase are sorted on row keys, which makes filtering with regular expressions efficient. For a specified query and date-time range, data from HBase is read using tag-filter regular expressions.
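To illustrate how such a filter could be assembled, here is a minimal Python sketch. Note that \Q...\E literal quoting is Java regex syntax (as consumed by HBase's RegexStringComparator); the sketch uses Python's re.escape for the same purpose, and the 7-byte metric-id/timestamp prefix length and key layout are assumptions for illustration only.

```python
import re

def tag_filter_regex(tag_pairs, prefix_len=7):
    """Build a row-key regex from 6-byte (tagk id + tagv id) pairs.
    prefix_len: assumed metric id (3 bytes) + base timestamp (4 bytes)."""
    parts = ["(?s)^.{%d}" % prefix_len]              # skip the key prefix
    for pair in tag_pairs:
        raw = bytes(pair).decode("latin-1")          # raw bytes as a 1:1 string
        parts.append(r"(?:.{6})*" + re.escape(raw))  # allow other tag pairs between
    parts.append(r"(?:.{6})*$")
    return re.compile("".join(parts))

# the two tags from the text: [0 0 1 0 0 2] and [0 0 2 0 0 4]
pattern = tag_filter_regex([[0, 0, 1, 0, 0, 2], [0, 0, 2, 0, 0, 4]])
key = bytes([1, 2, 3] + [0, 0, 0, 0]                 # metric id + base timestamp
            + [0, 0, 1, 0, 0, 2]                     # first tag pair
            + [0, 0, 2, 0, 0, 4]).decode("latin-1")  # second tag pair
print(bool(pattern.match(key)))  # → True
```

Because the keys are sorted, this regex only needs to run over the narrow block range selected by the start/end row keys, which is what keeps the filtering cheap.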
Figure 3: R2Time architecture. Step 1: the user supplies input parameters. Step 2: R2Time converts the input parameters into start/end row keys and filters. Step 3: R2Time defines its RecordReader and getSplit() to obtain the number of splits for the map tasks. Step 4: R2Time's RecordReader reads <key,value> pairs. Step 5: ready to run the map and reduce tasks.
Once the row keys and filters are created, R2Time uses the
scan object of HBase to point at the start and end data blocks.
HBase provides several APIs for filtering data [19]. Using RowFilter, CompareFilter and RegexStringComparator, the data between the start and end blocks are filtered with the generated regular expressions.
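The following self-contained Python sketch imitates that scan-plus-filter behaviour on an in-memory sorted key list; the pipe-delimited string keys are purely illustrative (real OpenTSDB keys are the binary row keys described above), and the scan range is taken as [start, end).

```python
import bisect
import re

def scan_with_filter(sorted_keys, start_key, end_key, pattern):
    """Mimic an HBase scan: jump to the [start_key, end_key) slice of the
    sorted key space, then apply the regex row filter (the role played
    server-side by RowFilter + RegexStringComparator)."""
    lo = bisect.bisect_left(sorted_keys, start_key)
    hi = bisect.bisect_left(sorted_keys, end_key)
    return [k for k in sorted_keys[lo:hi] if pattern.search(k)]

# illustrative keys: metric | base-timestamp | tag
keys = sorted(["m1|0100|hostA", "m1|0200|hostB",
               "m1|0300|hostA", "m2|0100|hostA"])
hits = scan_with_filter(keys, "m1|0100", "m1|0300", re.compile(r"hostA"))
print(hits)  # → ['m1|0100|hostA']
```

The point of the sketch is the two-stage narrowing: the sorted order makes the start/end seek cheap, and the regex only ever sees keys inside that range.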
B. Data retrieval
Any distributed operations for data stored in HBase are
performed using MapReduce tasks. A MapReduce job is
defined through R2Time, which creates the required row keys
from the specified start/end timestamps, metric and tags. It also
provides an extended RecordReader class called
RHHBaseReader to read data from the tsdb table with its row
keys as "keys" and delta columns as "values". It, in turn, uses RHIPE to distribute the user-defined map() and reduce() tasks through the underlying MapReduce framework.
The overview of this framework is shown in Figure 3, and pseudo-code for a simple MapReduce program to count the number of data points is provided in Example 1.
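As a rough illustration of such a counting job (Example 1 itself is not reproduced here), this Python sketch mimics the map/shuffle/reduce flow over (row key, delta columns) pairs; the record layout and all names are hypothetical stand-ins for what RHHBaseReader would deliver, not R2Time's actual API.

```python
from collections import defaultdict

def map_count(row_key, columns):
    """map(): every delta column of a row is one data point."""
    yield "points", len(columns)

def reduce_count(key, values):
    """reduce(): sum the partial counts emitted by the map tasks."""
    return key, sum(values)

def run_job(records, mapper, reducer):
    """Tiny sequential stand-in for the MapReduce shuffle/combine cycle."""
    groups = defaultdict(list)
    for row_key, columns in records:
        for k, v in mapper(row_key, columns):
            groups[k].append(v)
    return dict(reducer(k, vs) for k, vs in groups.items())

# two rows, with delta-encoded column qualifiers mapping to values
records = [
    (b"row1", {b"\x00\x10": 1.5, b"\x00\x20": 2.0}),
    (b"row2", {b"\x00\x10": 3.0}),
]
print(run_job(records, map_count, reduce_count))  # → {'points': 3}
```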
IV. RESULT AND ANALYSIS
Setup: Our cluster comprises 11 nodes running the CentOS Linux distribution, with one node each for the Namenode, Secondary Namenode, HBase Master, Job Tracker, and Zookeeper. The remaining 6 nodes act as Data Nodes, Region Servers and Task Trackers. All nodes have an AMD Opteron(TM) 4180 six-core 2.6 GHz processor, 16 GB of ECC DDR-2 RAM and 3x3 TB of secondary storage, connected through an HP ProCurve 2650 switch. Experiments were conducted using the OpenTSDB-1.0, Hadoop-0.20 and HBase-0.90 Apache releases. Our default HDFS configuration had a block size of 64 MB and a replication factor of 3.

A. Performance of statistical functions
The performance of different statistical functions was evaluated using the R2Time framework. Figure 3 shows the computation time for count, mean, linear regression and k-means for different data sizes. These statistics are computed on each map function and later combined to obtain the final result in the reduce function. K-means is the most expensive of the four operations, due to the extra computation cost of calculating as well as storing/reading the initial centroids. Moreover, it also carries computational overhead in terms of the calculation of Euclidean distances [24-25].

The aforementioned functions were chosen because they constitute the core of advanced statistics [28][30]. Based on these functions, further analysis can be performed in terms of trend analysis [27], seasonal variation [27] and clustering [26].

Due to the distributed nature of MapReduce, loads are equally distributed between different nodes. The methods used for distributed handling of composite keys allow optimal performance to be achieved, limited only by general algorithmic complexities and regular Hadoop overheads. Even for reasonably sized datasets (125 million data points), the gains from distributing the calculations outweighed all the overheads. For larger datasets (>1 billion), the performance is expected to follow the intrinsic complexity of each algorithm with minor Hadoop overheads.
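The per-split aggregation pattern described above — partial results computed in each map() and combined in reduce() — can be sketched for the mean as follows; the function names and the (sum, count) partial representation are illustrative, not R2Time's actual API.

```python
def map_partial_mean(split_values):
    """map(): emit a (sum, count) partial for one input split."""
    return sum(split_values), len(split_values)

def reduce_mean(partials):
    """reduce(): combine the partials into the global mean."""
    total = sum(s for s, _ in partials)
    count = sum(n for _, n in partials)
    return total / count

splits = [[1.0, 2.0, 3.0], [4.0, 5.0]]  # two map tasks' worth of data points
print(reduce_mean([map_partial_mean(s) for s in splits]))  # → 3.0
```

Count and mean combine cheaply this way; k-means is costlier because each iteration must also read and rewrite the centroids and compute Euclidean distances for every point.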
[Figure: Computation time of the statistical functions (count, mean, regression and k-means) for different data sizes.]

As observed from the evaluation, the functions appear to be supralinear in nature. In particular, this is due to the algorithms being scaled using the MapReduce paradigm. As mentioned earlier, the framework supports composite keys in an optimal manner and does not introduce added complexities to any of the evaluated functions.

[Figure: Reading performance of HBase, R2Time and OpenTSDB.]
R2Time outperforms OpenTSDB by almost 50% for a cluster with 2 nodes. With a larger number of nodes in the cluster, the performance tends to improve further. The test was performed on 75 million data points read from HBase. R2Time is built using the Hadoop and HBase Java APIs. It does not compromise performance while allowing R users to enrich their statistical functionality.
REFERENCES
[1] Revolution Analytics, rhbase, https://github.com/RevolutionAnalytics/RHadoop/wiki/rhbase.
[2] B. Sigoure, OpenTSDB. Available from: http://opentsdb.net/.
[3] N. Dimiduk and A. Khurana, HBase in Action, 1st ed. Shelter Island, NY: Manning Publications, 2012, ISBN 9781617290527.
[4] R. Kabacoff, R in Action, 1st ed. Shelter Island, NY: Manning Publications, 2011, ISBN 9781935182399.
[5] F. Chang et al., Bigtable: a distributed storage system for structured data, in Proceedings of the 7th USENIX Symposium on Operating Systems Design and Implementation - Volume 7, USENIX Association: Seattle, WA, 2006, p. 15-15.
[6] G. N. Boshnakov, Time Series Analysis With Applications in R, Springer Texts in Statistics, 2nd ed. Journal of Time Series Analysis, 2009, 30(6), pp. 708-709.
[7] D. Holstius, opentsdbr, https://github.com/holstius/opentsdbr.
[8] E. Murphy, StatsD OpenTSDB publisher backend, 2012. Available from: https://github.com/emurphy/statsd-opentsdb-backend.
[9] S. Guha, rbase, https://github.com/saptarshiguha/rbase.
[10] S. Guha, Computing environment for the statistical analysis of large and complex data. Purdue University, 2010, p. 134.
[11] J. Dean and S. Ghemawat, MapReduce: simplified data processing on large clusters. Commun. ACM, 51(1), 2008, pp. 107-113.
[12] H. Dan and E. Stroulia, A three-dimensional data model in HBase for large time-series dataset analysis, in Maintenance and Evolution of Service-Oriented and Cloud-Based Systems (MESOCA), 2012 IEEE 6th International Workshop on, 2012.
[13] Revolution Analytics, RHadoop, https://github.com/RevolutionAnalytics/RHadoop/
[14] T. White, Hadoop: The Definitive Guide. O'Reilly Media, Inc., 2009, ISBN 9780596521974.
[15] K. Shvachko et al., The Hadoop Distributed File System, in Mass Storage Systems and Technologies (MSST), 2010 IEEE 26th Symposium on, 2010.
[16] S. Guha, R. Hafen, J. Rounds, J. Xia, J. Li, B. Xi, and W. S. Cleveland, Large complex data: divide and recombine (D&R) with RHIPE, Stat, vol. 1, no. 1, pp. 53-67, Oct. 2012.
[17] L. Deri, S. Mainardi, and F. Fusco, tsdb: a compressed database for time series, in Proceedings of the 4th International Conference on Traffic Monitoring and Analysis, Springer-Verlag: Vienna, Austria, 2012, pp. 143-156.
[18] R. M. Esteves, R. Pais, and C. Rong, K-means Clustering in the Cloud – A Mahout Test, in Proceedings of the 2011 IEEE Workshops of International Conference on Advanced Information Networking and Applications, Washington, DC, USA, 2011, pp. 514-519.
[19] Yahoo! Developer Network, Hadoop at Yahoo!, 2014, available at http://developer.yahoo.com/hadoop/, last accessed 25th March, 2014.
[20] L. George, HBase: The Definitive Guide, 1st ed. Sebastopol, CA: O'Reilly Media, 2011, ISBN 9781449396107.
[21] B. F. Cooper et al., Benchmarking cloud serving systems with YCSB, in Proceedings of the 1st ACM Symposium on Cloud Computing, ACM: Indianapolis, Indiana, USA, 2010, pp. 143-154.
[22] J. Han, M. Song, and J. Song, A Novel Solution of Distributed Memory NoSQL Database for Cloud Computing, in Proceedings of the 2011 10th IEEE/ACIS International Conference on Computer and Information Science, IEEE Computer Society, 2011, pp. 351-355.
[23] J. D. Finn, A General Model for Multivariate Analysis. Holt, Rinehart and Winston, ISBN 003083239X.
[24] R. Loohach and K. Garg, Effect of Distance Functions on Simple K-means Clustering Algorithm, Int. J. Comput. Appl., vol. 49, no. 6, pp. 7-9, Jul. 2012.
[25] P.-N. Tan, M. Steinbach, and V. Kumar, Introduction to Data Mining, 1st ed. Boston: Addison-Wesley, 2005, ISBN 0321321367.
[26] T. Hill and P. Lewicki, STATISTICS: Methods and Applications. StatSoft, Tulsa, OK, 2007.
[27] T. W. Anderson, The Statistical Analysis of Time Series, 1st ed. New York: Wiley-Interscience, 1994, ISBN 9781118150399.
[28] S. Wasserman, Social Network Analysis: Methods and Applications. Cambridge University Press, 1994, ISBN 9780521387071.
[29] W. Tan, S. Tata, Y. Tang, and L. Fong, Diff-Index: Differentiated Index in Distributed Log-Structured Data Stores, to appear in EDBT'14.
[30] G. E. P. Box, G. M. Jenkins, and G. C. Reinsel, Time Series Analysis, Forecasting and Control, 3rd ed. Prentice Hall, Englewood Cliffs, NJ, 1994, ISBN 9781118619063.
[31] A. Holmes, Hadoop in Practice. Greenwich, CT, USA: Manning Publications Co., 2012, ISBN 9781617290237.