Abstract—Data is increasingly affecting the automotive industry, from vehicle development, to manufacturing and service processes, to online services centered around the connected vehicle. Connected, mobile and Internet of Things devices and machines generate immense amounts of sensor data. The ability to process and analyze this data to extract insights and knowledge that enable intelligent services, new ways to understand business problems, and improvements of processes and decisions is a critical capability. Hadoop is a scalable platform for compute and storage and emerged as the de-facto standard for Big Data processing at Internet companies and in the scientific community. However, there is a lack of understanding of how and for what use cases these new Hadoop capabilities can be efficiently used to augment automotive applications and systems. This paper surveys use cases and applications for deploying Hadoop in the automotive industry. Over the years a rich ecosystem emerged around Hadoop comprising tools for parallel, in-memory and stream processing (most notably MapReduce and Spark), SQL and NOSQL engines (Hive, HBase), and machine learning (Mahout, MLlib). It is critical to develop an understanding of automotive applications and their characteristics and requirements for data discovery, integration, exploration and analytics. We then map these requirements to a confined technical architecture consisting of core Hadoop services and libraries for data ingest, processing and analytics. The objective of this paper is to address questions such as: What applications and datasets are suitable for Hadoop? How can a diverse set of frameworks and tools be managed on a multi-tenant Hadoop cluster? How do these tools integrate with existing relational data management systems? How can enterprise security requirements be addressed? What are the performance characteristics of these tools for real-world automotive applications? To address the last question, we utilize a standard benchmark (TPCx-HS), and two application benchmarks (SQL and machine learning) that operate on a dataset of multiple Terabytes and billions of rows.

I. INTRODUCTION

The increasing digitalization of the automotive industry, driven by mobile and connected devices that carry an increasing number of sensors, is creating a significant increase in demand for data storage, processing and analytics. Enterprises are overwhelmed with large data volumes and analytics requirements as machine and sensor data generated by the Internet of Things (IoT) (soon the Internet of Everything) is collected in finer granularities and higher frequencies. Examples of such applications in the automotive industry are: the connected vehicle, autonomous driving, smart manufacturing, the Industrial Internet, and mobility services. The requirements of Big Data applications are often described using the 5 Vs of volume, variety, velocity, veracity, and value (see [1] for the original 3 Vs): Volume: Industry sources estimate that on average about 480 TB of data were collected by every automotive manufacturer in 2013 (IHS) [2]. It is expected that this size will increase to 11.1 PB per year in 2020. Variety: Heterogeneous data sets generated by different sources stored in different formats impose significant challenges during the ingest and integration process; their fusion enables sophisticated analytic approaches ranging from self-learning algorithms for pattern detection to dimension reduction approaches for complex predictions. Velocity: Data ingest rates and processing requirements vary widely from batch processing to real-time event processing of online data feeds, inducing high requirements on the data infrastructure. Veracity: describes the uncertainty in the data, in the data provenance and in the analytical modeling. Value: describes the ability to extract meaningful, actionable business insights from data. Further important dimensions include data quality, provenance and lifecycle management.

The term data lake refers to the ability to retain large volumes of data in its original, raw form to enable agile analytics on it. The phrase is commonly used to describe Hadoop deployments. Hadoop [3] was developed as an open source implementation of the MapReduce [4] abstraction introduced by Google. Hadoop is based on a parallel and distributed architecture and commodity hardware providing economical storage and processing capabilities. A vibrant ecosystem of tools for various data-related tasks emerged: for data ingest (Flume [5], Kafka [6]), data processing (e.g. MapReduce, Spark [7], Flink [8]) and advanced data analytics (e.g. MLlib, Mahout). Hadoop is increasingly competing with enterprise data warehouse systems (Oracle, SQL Server, Teradata, HANA, etc.).

While there is vast research on the usage of Hadoop in Internet companies [9] and in the sciences [10], an understanding of the trade-offs of automotive Big Data applications and their suitability for a Hadoop data lake is missing. In this paper, we investigate the characteristics of automotive data applications to understand the variety of automotive data sources and their volume and velocity requirements, and derive an understanding of how these requirements can be addressed using Hadoop and Hadoop-based tools. We propose and validate a Hadoop-based architecture for an automotive data lake: First, we must understand the automotive use cases and requirements for the infrastructure. Second, the tools and middleware stack
is rapidly increasing. New opportunities are being created to monitor, analyze, and optimize the manufacturing process and the supply chain. Currently, the majority of this data is

[Figure 1: Higher-level applications of the data lake: Business Intelligence (Reporting, OLAP, Data Discovery), Advanced Analytics (ETL, Exploration, Prediction, Clustering, Search), and Data Applications (Recommendations, Predictions).]
special applications. In the processing layer we distinguish between data processing (batch and interactive) and streaming frameworks (realtime). The data/compute resource layer encapsulates the core Hadoop services YARN and HDFS. While we currently envision that the majority of interactions is directed by users, in the future the degree of automation will improve and the number of automated machine connections to the cluster will increase, conducting data loads, discovery, integration and analytics tasks without user interactions.

In the following we survey tools for the different layers: In section III-A we analyze the processing and execution management layer for supporting batch and stream computing; we continue with the analytics layer, investigating SQL and advanced analytics/machine learning capabilities in sections III-B and III-C. In section III-D we analyze the current state of security and governance in the Hadoop ecosystem.

A. Processing and Execution Management

Hadoop's original MapReduce uses a disk-oriented approach for storage and processing of large volumes of data. It will remain important for archival storage and computing, but can have slow access speeds for interactive or real-time analytics requiring queries and for iterative processing for machine learning. To address these issues, various processing and execution frameworks evolved, e.g. Spark [7], Flink [8], Tez [15], HARP [25]. Typically, a processing framework comprises a programming model and an execution engine that is responsible for lower-level resource management tasks, such as scheduling and optimizations.

In particular, Spark rapidly gained popularity. It utilizes in-memory computing, which makes it particularly suitable for iterative processing. The programming model is based on an abstraction referred to as Resilient Distributed Datasets (RDD). RDDs are immutable, read-only data structures on which the application can apply transformations. The runtime handles the loading and distribution of the data into the memory of the cluster nodes. The immutable nature of RDDs enables efficient recoveries after a failure by simply re-applying transformations. A constraint of Spark is that its in-memory capabilities are limited to single application jobs. There are multiple ongoing efforts to extract in-memory capabilities into a separate runtime layer that is usable by multiple frameworks and not restricted to caching data within a single framework and/or job. Tachyon provides an HDFS-compatible distributed in-memory filesystem. HDFS supports an in-memory cache since Hadoop 2.3.

An emerging class of data-driven applications, e.g. traffic prediction and autonomous driving, requires realtime data processing and analytics. The Lambda architecture [26] aims to support data-driven applications that have high data volumes and ingest rates and require both realtime and batch analytics. For this purpose, the Lambda architecture defines a batch layer, a serving layer and a speed layer. Hadoop, while initially designed for MapReduce-style batch computing, supports more stream processing via different frameworks and tools: The core of a realtime pipeline is a message broker, such as Kafka [6], which decouples message delivery and processing, enabling the separation of the realtime and batch processing pipelines. A message broker enables applications to support multiple data consumers, e.g. a batch consumer for large historical analytics and realtime consumers for incremental model updates. While in the past several processing engines specialized for streaming emerged (e.g. Storm [27]), programming models between batch and streaming are harmonizing. Spark Streaming and Flink allow users to utilize the same API on both batch and streaming data sources. Spark's MLlib supports the update of models inside the Spark Streaming environment.

B. SQL Frameworks

While Hadoop MapReduce and Spark provide a set of well-defined, scalable abstractions for efficient data processing, higher-level abstractions for structured data - either based on SQL or on document/object-store capabilities - enable a higher productivity.

Many automotive use cases rely on SQL as common grammar for data extraction. SQL is particularly useful for querying columnar data that has been derived from less structured data sources. In general, two architectures emerged: (i) the integration of Hadoop with existing relational data warehouse systems and (ii) the implementation of SQL engines on top of core Hadoop services, i.e. HDFS and YARN. Architecture (i) is often used to integrate data from Hadoop into an existing data warehouse. The scalability of this architecture is limited, as queried data always needs to be processed in the database (which is typically a magnitude smaller than the Hadoop cluster). In the following we focus on Hadoop SQL engines.

Inspired by Google's Dremel [28], various SQL query engines running over Hadoop have emerged: Hive [9], HAWQ [29], Impala [30] and Spark-SQL [31]. Hive was the first and is one of the most used SQL engines for Hadoop. Early versions of Hive compiled SQL into MapReduce jobs, which often yielded non-optimal performance. Thus, Hive was extended to multiple runtimes, e.g. Tez [15] and Spark, in addition to MapReduce. Tez generalizes the MapReduce model to a generic dataflow-oriented processing model while providing an improved runtime system that supports in-memory caching and container re-use. Spark-SQL [7] is a SQL engine that is part of the Spark distribution. A particular advantage of Spark-SQL is the tight integration with other analytic tools provided in the Spark ecosystem - results of SQL queries can be loaded into a dataframe and then further processed with advanced analytics tools, such as MLlib (see section III-C).

In contrast to traditional databases that often store data in proprietary formats tightly coupled to their processing engine, open data storage formats are essential in the Hadoop environment in order to maintain interoperability between the various tools and achieve a maximum level of flexibility. Very commonly there are multiple access paths and tools that access the same data. Thus, various file formats optimized for different use cases emerged: ORC [32] and Parquet [33] are particularly designed for SQL workloads using a columnar data structure.
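The practical effect of a columnar layout such as ORC or Parquet - an analytical query touches only the columns it needs - can be sketched in a few lines of plain Python. This is an illustration of the storage principle only, not of the actual ORC/Parquet on-disk formats, and the table and column names are hypothetical:

```python
# Sketch: why columnar layouts (ORC, Parquet) suit analytical scans.
# A row-oriented store must touch every field of every record, while a
# column store reads only the queried column. Hypothetical sensor table.

records = [
    {"vehicle_id": i, "speed_kmh": 100 + i % 20, "rpm": 2000 + i, "temp_c": 85}
    for i in range(1000)
]

# Row layout: an average over one column still touches every cell.
row_cells_touched = sum(len(rec) for rec in records)
avg_speed_rows = sum(rec["speed_kmh"] for rec in records) / len(records)

# Columnar layout: the same table stored as one array per column.
columns = {key: [rec[key] for rec in records] for key in records[0]}

# The same aggregate now touches only the speed column.
col_cells_touched = len(columns["speed_kmh"])
avg_speed_cols = sum(columns["speed_kmh"]) / len(columns["speed_kmh"])

assert avg_speed_rows == avg_speed_cols
print(row_cells_touched, col_cells_touched)  # 4000 1000
```

In a real deployment this is what lets engines such as Hive or Spark-SQL skip entire column chunks (aided by per-chunk statistics) when scanning ORC or Parquet files.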
In summary, SQL processing on Hadoop provides more flexibility compared to a monolithic RDBMS, but requires a deeper understanding of the trade-offs: An application can optimize the storage format (columnar storage for analytics, random access), the processing engine (MapReduce, Tez, Spark), the SQL frontend (Hive, Spark-SQL), and the optimizer to meet the requirements of the particular workload. The optimizer plays a critical role in achieving an optimal performance. Project Calcite, used e.g. by Hive, provides a data-format and processing-engine agnostic optimizer that can be used to implement federated SQL engines on top of heterogeneous data sources (e.g. Hive, HBase, MongoDB).

[Figure 2: Three tool categories plotted by degree of parallelism (none, data, data & compute): workstation-based analytics (open source R, Revolution R, pbdR, Python/Scikit, Python/Dato, other languages such as Java, Scala, SQL), in-database analytics (PivotalR, Madlib, Oracle Enterprise R), and Hadoop-cluster analytics (Mahout, Spark/MLlib, H2O).]

Fig. 2. Scalable Data Analytics: We identify three categories of tools: workstation-based tools typically provide the most comprehensive set of features and best ease-of-use. In-database analytics provide a more scalable environment than workstation tools, but are typically less scalable and flexible than Hadoop-based tools.

C. Advanced Analytics

The term analytics is broadly used to describe systems for extract-transform-load (ETL), business intelligence for data exploration and reporting (e.g. QlikView, Tableau) and advanced analytics tasks, such as data mining and machine learning. In the following, we focus on programmatic advanced analytics tools. The best-known abstraction for data analytics is the dataframe, originally introduced in R (respectively its predecessor S); it exposes data in a tabular format to the user and supports the efficient expression of data transformations and analytics [34]. A dataframe provides a two-dimensional table abstraction that can store different types of data (e.g. numeric, text and categorical data) in its columns. Depending on the specific implementation, the dataframe abstraction typically supports various functions for data transformations, e.g. for subsetting, merging, filtering and aggregating data. In addition to R, several other dataframe implementations exist, e.g. Pandas [35], NumPy [36], and Dato SFrames [37] for Python. Traditionally, implementations of the dataframe abstraction lacked the ability to scale out. In the following we investigate different analytic environments with respect to their expressiveness, ease-of-use, scalability and advanced analytics/machine learning support.

Figure 2 shows the landscape of analytics tools. The R language [34] is a popular programming environment for data analytics. While R provides an ecosystem comprising a manifold set of packages for statistics and machine learning, it has severe limitations with respect to scalability [38]. Various approaches for supporting better scalability via out-of-core and distributed processing emerged. Often, R or Python serves as an interface to these scalable analytics environments (e.g. Spark, H2O, or in-database analytics). In the following we focus in particular on the R and Python ecosystems, as these are the most prevalent programming languages for data science.

Scaling machine learning and analytics techniques requires careful consideration of two forms of parallelism: data parallelism can typically be exploited easily using programming models such as MapReduce or Spark. Linear algebra solvers often used for machine learning typically demand a more fine-grained parallelism, which is more difficult to scale. However, various parallel libraries for these tasks, e.g. ScaLAPACK [39] or ARPACK [40], already exist.

Several R packages for out-of-core and parallel processing emerged, either via explicit implementation (e.g. pbdR [41]) or implicitly within a higher-level package. pbdR re-uses ScaLAPACK for linear algebra, while the open source R relies on the non-parallel LAPACK and LINPACK. Further, various higher-level packages for R exist that provide a dataframe-compatible abstraction, but delegate the implementation of data transformations and analytics to a distributed environment, e.g. Hadoop or Spark. For example, Spark-SQL and MLlib [16] provide a dataframe abstraction for Python, Scala and R; the dataframe is implemented using the well-known RDD abstraction of Spark. MLlib relies on ARPACK, encapsulated by netlib-java, for linear algebra operations [42]. Revolution R provides dataframe-level implementations of analytics functions that are implemented by utilizing lower-level data parallelism for filtering, transformations etc. (e.g. rhadoop) and data/compute parallelism for advanced analytics (e.g. for generalized linear models, decision trees and KMeans clustering). H2O [43] utilizes a similar architecture. A challenge is the need to copy the data between the local R environment and the cluster. Further, the exposed dataframe APIs mimic R idioms, but cannot be used interchangeably with the standard R dataframe.

Another approach is the usage of in-database analytics. The level of integration between R and databases can vary. In the simplest case the data parallelism can be exploited using PL/R code. Other systems integrate Revolution R (e.g. Teradata) or custom analytics implementations (e.g. Oracle). Another example is Madlib [44], a library that was designed for in-database use. It runs embedded in databases, such as Postgres, HAWQ and Impala, enabling the usage of analytics functions via SQL.

In general, two kinds of integrations of analytic tools with Hadoop exist: (i) tools that can load/store data from Hadoop and (ii) tools that are tightly integrated with Hadoop and process data in-situ. Nearly all analytical tools support access to data stored in Hadoop, either via HDFS or via an ODBC connection to columnar data stored in Hive. Data discovery tools (such as QlikView, Tableau), BI and analytical tools (e.g., R, Scikit-Learn, SPSS and SAS) fall into this category.
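Category (i) can be as small-scale as exporting a query result from the cluster and analyzing it on a workstation. A minimal, stdlib-only Python sketch of such a workstation-side analysis; the extract and its field names are hypothetical:

```python
# Category (i) in miniature: a small extract (e.g. a Hive query result)
# is pulled out of the cluster and analyzed locally on a workstation.
# The CSV content stands in for a downloaded extract; stdlib only.
import csv
import io
import statistics

extract = io.StringIO(
    "vin,engine_temp_c,speed_kmh\n"
    "V1,88.0,120\n"
    "V2,91.5,80\n"
    "V3,86.0,105\n"
)

rows = list(csv.DictReader(extract))
temps = [float(r["engine_temp_c"]) for r in rows]

print(len(rows), statistics.mean(temps))  # 3 88.5
```

Such local copies are convenient, but they duplicate data outside the cluster's access controls and only work while the extract stays small.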
While this approach enables the quick processing of small to medium-sized data, it also has some disadvantages: data is duplicated, which increases the management and security overhead, and the scalability is often limited. The number and maturity of higher-level tools that support scenario (ii), bridge the various layers, and natively run analytics inside the cluster is still limited.

D. Data Governance and Security

The data lake as central platform for storing diverse data sets supporting ad-hoc data integration, querying and analytics is a challenging environment with respect to data governance and security. Often, the precise nature and privacy implications of the data at hand are unknown and require analysis before the data can be efficiently governed. Furthermore, the process of linking data sources can potentially lead to unanticipated privacy issues. Thus, it is critical to define security measures and governance processes for data access, retention, linkage and quality. The default security level of Hadoop is very low. As Hadoop evolved, security mechanisms have been retrofitted into different frameworks; however, it still lacks a coherent security architecture. The Kerberos feature is critical to prevent the easy impersonation of other user accounts. However, the maturity of Kerberos support is very low in many tools: some frameworks, e.g. Kafka, have no Kerberos support at all; others, e.g. Spark, have difficulty interacting with certain secured services.

It is best practice to deploy Hadoop behind a firewall and only expose a small, selected set of services, either directly or via a gateway service, such as Apache Knox. Accordingly, the number of users with direct access to HDFS and YARN can be kept small. Knox allows the integration with single-sign-on environments (e.g. Siteminder) and can forward the authenticated user or map it to specific service users. In non-Kerberos clusters, these gateway services are valid approaches for minimizing the number of users with direct access to HDFS/YARN, even though only a small number of use cases can be sufficiently supported. The usage of Kerberos authentication is essential for large-scale, multi-tenant data lakes and sensitive data.

Another important concern is authorization. Many frameworks chose to implement their own authorization infrastructure: Hive's SQL grants/revokes or cell-level authorization for HBase and Accumulo are examples of this approach. These tools rely on a shared service user for all HDFS interaction. While these silo-ed security schemes simplify the security in their domain, they increase the management complexity and the difficulty of implementing consistent policies. Two approaches for solving this issue have been proposed: (i) Sentry adds a common authorization handler for adding role-based authorizations to different frameworks (currently Hive and Cloudera Impala). (ii) The usage of a shared directory for user group and role management (e.g. based on LDAP) and a security management tool, such as Argus, to ensure that privileges are correctly translated to the different frameworks.

The support for encryption of both network traffic and HDFS data is required for sensitive data. While network traffic encryption is supported by many Hadoop services, no built-in encryption for HDFS data is provided at the moment. The setup of network encryption is complex: encryption must be enabled for all daemons; thus, management support for configuration and key management is important. HDFS encryption is supported since Hadoop 2.6; several products attempt to simplify and enhance this feature, e.g. Vormetric, Voltage or Cloudera's Navigator Encrypt and Key Trustee.

Data lifecycle management describes the ability to track data provenance and lineage across the different phases: data ingest, storage and processing. Diverse data sources and heterogeneous data processing infrastructures contribute to the complexity of this task. Often, it is required to move data between different data stores with different, potentially inconsistent, access policies. An important part of the data lifecycle is the assessment of the data criticality and the assignment of access policies. Hadoop's data lifecycle solutions are still in their infancy, with products such as Cloudera's Navigator and Apache Falcon. As part of the data lifecycle it is often necessary that data is masked (i.e. parts of the data need to be removed or pseudonymized) to meet the security level of the platform [17].

E. Discussion

Automotive Big Data applications in the different domains (manufacturing, quality, connected vehicle and intelligent transport systems) share common characteristics: they bring together a variety of data sources, from well-structured data from transactional systems to less structured data generated by machines, mobile devices, social networks etc. Loading and integrating these datasets and making them explorable via higher-level tools, such as SQL, is critical. Understanding the trade-offs of the different SQL-on-Hadoop frameworks in conjunction with the different data formats is therefore crucial. Machine learning techniques get deployed in the various phases of data science projects, e.g. for data cleaning, data discovery, and different forms of analytics (text analytics, pattern mining, predictive modeling, etc.). Specialized tools, e.g. for deep learning, often require a specialized infrastructure (using e.g. GPUs) and often do not integrate with the Hadoop scheduler YARN.

YARN decouples cluster resource management and applications and allows deploying most frameworks on user-level without administrative involvement. However, for frameworks that require a tighter integration, e.g. Spark-SQL/Hive, some pitfalls exist, in particular with secure clusters. Also, as the implementation of YARN applications is a complex endeavor, it is typically not a first-order consideration; YARN lacks the flexibility and ease-of-use of other schedulers such as Mesos [45], as it was primarily designed for Hadoop applications. There are some attempts to integrate these systems, e.g. Myriad provides the ability to dynamically
spawn YARN clusters inside Mesos [46]. Another concern is security: basic security mechanisms, such as Kerberos, are available; however, in particular emerging frameworks, such
[Figure residue: SQL query times for Select, Aggregate and Join, in seconds (log scale), over data size in GB (32 nodes, 1-16 GB) and number of nodes (8-32 nodes, 8 GB data); and feature extraction and classification times for Logistic Regression (SGD), SVM (SGD) and Logistic Regression (LBFGS).]

Fig. 4. Hadoop SQL Performance: The figure shows the scale-out performance of our automotive SQL benchmark based on queries of a data warehouse system.

For the join query, Hive is on average 50 times faster than Spark-SQL. We had to optimize the join query for Spark because of memory issues, in particular for the four-node scenario. Also, it must be noted that window queries are not supported by Spark at the moment (they are part of our benchmark, but not depicted in the graph). The advantage of the data lake architecture
these machine learning algorithms inside of Spark. For this queries. Furthermore, we showed the potential of Hadoop-
purpose, we utilize a synthetic dataset of up to 110 million based machine learning frameworks, such as Sparks MLlib;
records (16 GB scenario). using this tool, we were able to process a signicantly larger
Figure 6 shows the results of this benchmark. A scalable dataset of high-dimensional text data consisting of 110 million
machine learning implementation is the key for dealing with records than with our previous R solution. However, several
high-dimensional data and large volumes (larger number of challenges remain: the maturity of the platform needs to
rows and columns). The application uses TF-IDF for feature increase to meet the requirements of the increasing number of
extraction and a SVM and logistic regression for classication. applications and users. Many available tools are low-level and
As shown in the gure, TF-IDF scales linearly, while the often immature and requiring highly advanced skills to operate
machine learning algorithms exhibit an increased overhead them. Finding and developing these skills and establishing
with larger data sizes due to the higher number of columns a culture and analytical mindset for supporting Big Data
in the TF-IDF matrix and the need for synchronization after services remains challenging. Higher-level tools are emerging
each iteration. Both learning algorithms SVM and logistic and hopefully can address this issue.
regression perform similarly when using stochastic gradient With the uptake of the data lake infrastructure, high stan-
descent (SGD) as solver. A signicant speedup can be ob- dards for data governance and the security are required. Fur-
served by replacing the SGD with the Limited-memory Broy- ther, sophisticated workload management systems for resource
denFletcherGoldfarbShanno (LBFGS) solver provided by and data provisioning are needed to support the exible and
MLlib for Logistic Regressing - both solvers showed a com- agile infrastructure usage required by data scientists. The
parable accuracy in the results. The results demonstrate the complexity of analytical workloads will further increase by
scalability of Spark MLlib compared to other machine learning including, e. g. more unstructured data (images, videos) and
frameworks our initial version of this SVM classier was spatial data critical for location-based services. Many IoT
built in R and was constrained to about 100,000 records. applications require close to realtime processing of incoming
data feeds. Emerging machine learning algorithms, such as
V. C ONCLUSION AND F UTURE W ORK deep learning, are increasingly complex and demand even
more memory and compute resources. An area of research
Hadoop enables enterprises to handle the different Vs is the use of hybrid query engines and the support for
of Big Data - in particular the data volume and velocity analytics across data residing on different platforms. While
requirement are manageable with a Hadoop cluster. With this the current state of the art in analytics is the usage of hand-
capability, Hadoop lls a gap in many existing enterprise crafted features and supervised learning, we believe that more
landscapes - most commercial database vendors have realized advanced analytical methods are required to handle the data
that und started to support Hadoop. However, we believe that deluge, e. g. topic modeling and deep learning.
the data lake has the potential to accommodate many more
workloads as the Hadoop ecosystem evolves. This is most ACKNOWLEDGMENTS
visible in the different SQL on Hadoop frameworks, which We acknowledge our colleagues at BMW and Clemson for
improved signicantly over the past years as seen in the providing valuable input and discussion to the paper: Matthew
presented benchmarks. Due to these developments, analytics Cook, Sandeep Jeereeddy, Linh Ngo, and Jason Anderson.
on large datasets is becoming the norm. A challenge remains This work was supported in part by NSF Grant #1228312.
the productivity and the ability to generate insight from data.
Scalable machine learning approaches are essential to extract R EFERENCES
knowledge from data. Intelligent, data-driven services are the [1] D. Laney, 3D data management: Controlling data volume, velocity, and
foundation of future innovations. A single vertical vendor and variety, META Group, Tech. Rep., February 2001.
closed stack cannot provide the capabilities needed to fulll [2] J. Dorsey, Big data in the drivers seat of con-
nected car technological advances, http://press.
all requirements. Hadoop provides efcient and economical ihs.com/press-release/country-industry-forecasting/
storage in conjunction with a vast set of processing and ana- big-data-drivers-seat-connected-car-technological-advance, 2013.
lytics frameworks enabling a wide range of novel automotive [3] Apache hadoop, http://hadoop.apache.org/, 2014.
[4] J. Dean and S. Ghemawat, MapReduce: Simplied Data Processing
applications. In comparison to traditional data warehouses, it on Large Clusters, in OSDI04: Proceedings of the 6th conference on
provides greater agility supporting relational data, unstructured Symposium on Opearting Systems Design & Implementation. Berkeley,
data and schema-on-read data organization. The vibrant open CA, USA: USENIX Association, 2004, pp. 137150.
[5] Apache Flume, https://ume.apache.org/.
source ecosystem functions as a supplier for capabilities that [6] Jay Kreps and Neha Narkhede and Jun Rao, Kafka: a distributed
are beyond the scope of a single vertically integrated system. messaging system for log processing, in 6th International Workshop
A Hadoop data lake is suitable for supporting typical on Networking Meets Databases, Athens, Greece, 2011.
[7] M. Zaharia, M. Chowdhury, T. Das, A. Dave, J. Ma, M. McCauley,
automotive application workloads, such as SQL and machine M. J. Franklin, S. Shenker, and I. Stoica, Resilient distributed
learning applications. In our benchmarks we showed interac- datasets: A fault-tolerant abstraction for in-memory cluster computing,
tive response time for SQL-based data warehouse workloads in Proceedings of the 9th USENIX Conference on Networked
Systems Design and Implementation, ser. NSDI12. Berkeley, CA,
on 120 billion records, while also illustrating the weaknesses USA: USENIX Association, 2012, pp. 22. [Online]. Available:
of Hadoop SQL frameworks, such as the processing of join http://dl.acm.org/citation.cfm?id=2228298.2228301
[8] A. Alexandrov, R. Bergmann, S. Ewen, J.-C. Freytag, F. Hueske, A. Heise, O. Kao, M. Leich, U. Leser, V. Markl, F. Naumann, M. Peters, A. Rheinländer, M. J. Sax, S. Schelter, M. Höger, K. Tzoumas, and D. Warneke, "The Stratosphere platform for big data analytics," The VLDB Journal, vol. 23, no. 6, pp. 939–964, Dec. 2014. [Online]. Available: http://dx.doi.org/10.1007/s00778-014-0357-y
[9] A. Thusoo, J. S. Sarma, N. Jain, Z. Shao, P. Chakka, S. Anthony, H. Liu, P. Wyckoff, and R. Murthy, "Hive: A warehousing solution over a map-reduce framework," Proc. VLDB Endow., vol. 2, no. 2, pp. 1626–1629, Aug. 2009. [Online]. Available: http://dx.doi.org/10.14778/1687553.1687609
[10] S. Jha, J. Qiu, A. Luckow, P. K. Mantha, and G. C. Fox, "A tale of two data-intensive paradigms: Applications, abstractions, and architectures," Proceedings of the 3rd IEEE International Congress on Big Data, vol. abs/1403.1528, 2014.
[11] "HPC-ABDS kaleidoscope of 350 Apache Big Data Stack and HPC technologies," http://hpc-abds.org/kaleidoscope/, May 2015.
[12] C. Baru, M. Bhandarkar, C. Curino, M. Danisch, M. Frank, B. Gowda, H.-A. Jacobsen, H. Jie, D. Kumar, R. Nambiar, M. Poess, F. Raab, T. Rabl, N. Ravi, K. Sachs, S. Sen, L. Yi, and C. Youn, "Discussion of BigBench: A proposed industry standard performance benchmark for big data," in Sixth TPC Technology Conference on Performance Evaluation and Benchmarking, ser. LNCS. Springer Berlin Heidelberg, 2014.
[13] G. C. Fox, S. Jha, J. Qiu, S. Ekanazake, and A. Luckow, "Towards a comprehensive set of big data benchmarks," 2015.
[14] R. Nambiar, M. Poess, A. Dey, P. Cao, T. Magdon-Ismail, D. Qi Ren, and A. Bond, "Introducing TPCx-HS: The first industry standard for benchmarking big data systems," in Performance Characterization and Benchmarking. Traditional to Big Data, ser. Lecture Notes in Computer Science, R. Nambiar and M. Poess, Eds. Springer International Publishing, 2015, vol. 8904, pp. 1–12. [Online]. Available: http://dx.doi.org/10.1007/978-3-319-15350-6_1
[15] "Apache Tez," http://hortonworks.com/hadoop/tez/, 2013.
[16] X. Meng, J. K. Bradley, B. Yavuz, E. R. Sparks, S. Venkataraman, D. Liu, J. Freeman, D. B. Tsai, M. Amde, S. Owen, D. Xin, R. Xin, M. J. Franklin, R. Zadeh, M. Zaharia, and A. Talwalkar, "MLlib: Machine learning in Apache Spark," CoRR, vol. abs/1505.06807, 2015. [Online]. Available: http://arxiv.org/abs/1505.06807
[17] J. W. Anderson, K. E. Kennedy, L. B. Ngo, A. Luckow, and A. W. Apon, "Synthetic data generation at scale with Hadoop," in IEEE International Conference on Big Data, 2014.
[18] T. Koslowski, "How technology is ending the automotive industry's century-old business model," Gartner Research, 2012.
[19] K. Fehrenbacher, "Cloudera gets proactive with Hadoop management," GigaOM, http://bit.ly/bmw-10m-connected-cars, 2013.
[20] S. A. Fayazi, A. Vahidi, G. Mahler, and A. Winckler, "Traffic signal phase and timing estimation from low-frequency transit bus data," under review, 2014.
[21] A. D. Angelica, "Google's self-driving car gathers nearly 1 GB/sec," http://www.kurzweilai.net/googles-self-driving-car-gathers-nearly-1-gbsec, 2013.
[22] J. Bruner, "The industrial internet: The machines are talking," http://radar.oreilly.com/2013/03/industrial-internet-report.html, 2013.
[23] Y. Bengio, I. J. Goodfellow, and A. Courville, "Deep learning," 2015, book in preparation for MIT Press. [Online]. Available: http://www.iro.umontreal.ca/~bengioy/dlbook
[24] V. K. Vavilapalli, "Apache Hadoop YARN: Yet Another Resource Negotiator," in Proc. SOCC, 2013.
[25] B. Zhang, Y. Ruan, and J. Qiu, "Harp: Collective communication on Hadoop," Technical Report, Indiana University, 2014.
[26] N. Marz, Big Data: Principles and Best Practices of Scalable Realtime Data Systems. Manning Publications, 2014.
[27] Twitter, "Storm: Distributed and fault-tolerant realtime computation," http://storm-project.net/.
[28] S. Melnik, A. Gubarev, J. J. Long, G. Romer, S. Shivakumar, M. Tolton, and T. Vassilakis, "Dremel: Interactive analysis of web-scale datasets," in Proc. of the 36th Int'l Conf. on Very Large Data Bases, 2010, pp. 330–339. [Online]. Available: http://www.vldb2010.org/accept.htm
[29] M. A. Soliman, L. Antova, V. Raghavan, A. El-Helw, Z. Gu, E. Shen, G. C. Caragea, C. Garcia-Alvarado, F. Rahman, M. Petropoulos, F. Waas, S. Narayanan, K. Krikellas, and R. Baldwin, "Orca: A modular query optimizer architecture for big data," in Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data, ser. SIGMOD '14. New York, NY, USA: ACM, 2014, pp. 337–348. [Online]. Available: http://doi.acm.org/10.1145/2588555.2595637
[30] M. Kornacker, A. Behm, V. Bittorf, T. Bobrovytsky, C. Ching, A. Choi, J. Erickson, M. Grund, D. Hecht, M. Jacobs, I. Joshi, L. Kuff, D. Kumar, A. Leblang, N. Li, I. Pandis, H. Robinson, D. Rorke, S. Rus, J. Russell, D. Tsirogiannis, S. Wanderman-Milne, and M. Yoder, "Impala: A modern, open-source SQL engine for Hadoop," in CIDR. www.cidrdb.org, 2015. [Online]. Available: http://dblp.uni-trier.de/db/conf/cidr/cidr2015.html#KornackerBBBCCE15
[31] M. Armbrust, R. S. Xin, C. Lian, Y. Huai, D. Liu, J. K. Bradley, X. Meng, T. Kaftan, M. J. Franklin, A. Ghodsi, and M. Zaharia, "Spark SQL: Relational data processing in Spark," in Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, Melbourne, Victoria, Australia, May 31 – June 4, 2015, T. Sellis, S. B. Davidson, and Z. G. Ives, Eds. ACM, 2015, pp. 1383–1394. [Online]. Available: http://doi.acm.org/10.1145/2723372.2742797
[32] "Optimized Row Columnar (ORC) Format," http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.0.0.2/ds_Hive/orcfile.html, 2014.
[33] "Parquet Columnar Storage Format," http://parquet.io/, 2014.
[34] R Core Team, R: A Language and Environment for Statistical Computing, R Foundation for Statistical Computing, Vienna, Austria, 2013, ISBN 3-900051-07-0. [Online]. Available: http://www.R-project.org/
[35] "pandas: Python Data Analysis Library," 2015. [Online]. Available: http://pandas.pydata.org/
[36] S. van der Walt, S. Colbert, and G. Varoquaux, "The NumPy array: A structure for efficient numerical computation," Computing in Science & Engineering, vol. 13, no. 2, pp. 22–30, March 2011.
[37] "GraphLab Create User Guide," 2015. [Online]. Available: https://dato.com/learn/userguide/
[38] S. Sridharan and J. M. Patel, "Profiling R on a contemporary processor," Proc. VLDB Endow., vol. 8, no. 2, pp. 173–184, Oct. 2014. [Online]. Available: http://dl.acm.org/citation.cfm?id=2735471.2735478
[39] "ScaLAPACK: Scalable Linear Algebra PACKage," http://www.netlib.org/scalapack/, 2014.
[40] "ARPACK," http://www.caam.rice.edu/software/ARPACK/, 2014.
[41] G. Ostrouchov, W.-C. Chen, D. Schmidt, and P. Patel, "Programming with big data in R," 2012, http://r-pbd.org/.
[42] "Distributing the singular value decomposition with Spark," https://databricks.com/blog/2014/07/21/distributing-the-singular-value-decomposition-with-spark.html, 2014.
[43] "H2O Scalable Machine Learning," 2015. [Online]. Available: http://h2o.ai/
[44] J. M. Hellerstein, C. Ré, F. Schoppmann, D. Z. Wang, E. Fratkin, A. Gorajek, K. S. Ng, C. Welton, X. Feng, K. Li, and A. Kumar, "The MADlib analytics library: Or MAD skills, the SQL," Proc. VLDB Endow., vol. 5, no. 12, pp. 1700–1711, Aug. 2012. [Online]. Available: http://dl.acm.org/citation.cfm?id=2367502.2367510
[45] B. Hindman, A. Konwinski, M. Zaharia, A. Ghodsi, A. D. Joseph, R. Katz, S. Shenker, and I. Stoica, "Mesos: A platform for fine-grained resource sharing in the data center," in Proceedings of the 8th USENIX Conference on Networked Systems Design and Implementation, ser. NSDI '11. Berkeley, CA, USA: USENIX Association, 2011, pp. 22–22. [Online]. Available: http://dl.acm.org/citation.cfm?id=1972457.1972488
[46] M. Soni and R. DelValle, "Myriad: Running YARN alongside Mesos," 2014, https://speakerdeck.com/mohit/running-yarn-alongside-mesos-mesoscon-2014.
[47] "TPC Benchmark DS (TPC-DS): The New Decision Support Benchmark Standard," http://www.tpc.org/tpcds/, 2015.
[48] D. Y. Min, "Spark TeraSort," https://github.com/DrakeMin/spark-terasort, 2015.
[49] R. Xin, P. Deyhim, A. Ghodsi, X. Meng, and M. Zaharia, "GraySort on Apache Spark by Databricks," http://sortbenchmark.org/ApacheSpark2014.pdf, 2014.