
2015 IEEE International Conference on Big Data (Big Data)

Automotive Big Data: Applications, Workloads and Infrastructures

Andre Luckow, Ken Kennedy, Fabian Manhardt, Emil Djerekarov, Bennie Vorster, Amy Apon

Innovation Lab, BMW Group IT Research Center, Information Management Americas,
Greenville, South Carolina, USA

Clemson University, Clemson, South Carolina, USA

Abstract—Data is increasingly affecting the automotive industry, from vehicle development, to manufacturing and service processes, to online services centered around the connected vehicle. Connected, mobile and Internet of Things devices and machines generate immense amounts of sensor data. The ability to process and analyze this data to extract insights and knowledge that enable intelligent services, new ways to understand business problems, and improvements of processes and decisions is a critical capability. Hadoop is a scalable platform for compute and storage that has emerged as the de-facto standard for Big Data processing at Internet companies and in the scientific community. However, there is a lack of understanding of how and for which use cases these new Hadoop capabilities can be efficiently used to augment automotive applications and systems. This paper surveys use cases and applications for deploying Hadoop in the automotive industry.

Over the years a rich ecosystem has emerged around Hadoop, comprising tools for parallel, in-memory and stream processing (most notably MapReduce and Spark), SQL and NoSQL engines (Hive, HBase), and machine learning (Mahout, MLlib). It is critical to develop an understanding of automotive applications and their characteristics and requirements for data discovery, integration, exploration and analytics. We then map these requirements to a confined technical architecture consisting of core Hadoop services and libraries for data ingest, processing and analytics. The objective of this paper is to address questions such as: What applications and datasets are suitable for Hadoop? How can a diverse set of frameworks and tools be managed on a multi-tenant Hadoop cluster? How do these tools integrate with existing relational data management systems? How can enterprise security requirements be addressed? What are the performance characteristics of these tools for real-world automotive applications? To address the last question, we utilize a standard benchmark (TPCx-HS) and two application benchmarks (SQL and machine learning) that operate on a dataset of multiple terabytes and billions of rows.

I. INTRODUCTION

The increasing digitalization of the automotive industry, driven by mobile and connected devices that carry an increasing number of sensors, is creating a significant increase in demand for data storage, processing and analytics. Enterprises are overwhelmed with large data volumes and analytics requirements as machine and sensor data generated by the Internet of Things (IoT) (soon the Internet of Everything) is collected at finer granularities and higher frequencies. Examples of such applications in the automotive industry are: the connected vehicle, autonomous driving, smart manufacturing, the Industrial Internet, and mobility services. The requirements of Big Data applications are often described using the 5 Vs of volume, variety, velocity, veracity, and value (see [1] for the original 3 Vs). Volume: Industry sources estimate that on average about 480 TB of data were collected by every automotive manufacturer in 2013 (IHS) [2]. It is expected that this size will increase to 11.1 PB per year in 2020. Variety: Heterogeneous data sets generated by different sources and stored in different formats impose significant challenges during the ingest and integration process; their fusion enables sophisticated analytic approaches ranging from self-learning algorithms for pattern detection to dimension reduction approaches for complex predictions. Velocity: Data ingest rates and processing requirements vary widely from batch processing to real-time event processing of online data feeds, placing high requirements on the data infrastructure. Veracity describes the uncertainty in the data, in the data provenance and in the analytical modeling. Value describes the ability to extract meaningful, actionable business insights from data. Further important dimensions include data quality, provenance and lifecycle management.

The term data lake refers to the ability to retain large volumes of data in their original, raw form to enable agile analytics. The phrase is commonly used to describe Hadoop deployments. Hadoop [3] was developed as an open source implementation of the MapReduce [4] abstraction introduced by Google. Hadoop is based on a parallel and distributed architecture and commodity hardware, providing economical storage and processing capabilities. A vibrant ecosystem of tools for various data-related tasks emerged: data ingest (Flume [5], Kafka [6]), data processing (e.g. MapReduce, Spark [7], Flink [8]) and advanced data analytics (e.g. MLlib, Mahout). Hadoop is increasingly competing with enterprise data warehouse systems (Oracle, SQL Server, Teradata, HANA, etc.).

While there is vast research on the usage of Hadoop at Internet companies [9] and in the sciences [10], an understanding of the trade-offs of automotive Big Data applications and their suitability for a Hadoop data lake is missing. In this paper, we investigate the characteristics of automotive data applications to understand the variety of automotive data sources and their volume and velocity requirements, and derive an understanding of how these requirements can be addressed using Hadoop and Hadoop-based tools. We propose and validate a Hadoop-based architecture for an automotive data lake: First, we must understand the automotive use cases and requirements for the infrastructure. Second, the tools and middleware stack representing the foundation of the platform must be selected. This is particularly challenging as the variety of tools is very large — Fox [11] lists more than 350 tools in the Big Data ecosystem. Based on the resulting technical platform, services for managing and curating data need to be established. Finally, business requirements must be carefully balanced with data governance and compliance considerations.



Further, there is a lack of understanding of the performance of Hadoop and other Hadoop ecosystem tools for workloads such as data extraction, transformation, SQL, and machine learning tasks. While various benchmark suites have emerged (e.g. TPCx-HS, TPC-DS, BigDataBench), they often lack applicability to real-life workloads and are often misused for marketing purposes, making it difficult to generalize the results to real-world applications (see Baru [12] and Fox [13] for discussions of Big Data benchmarks). We propose a hybrid approach consisting of two automotive application benchmarks and a standard benchmark (TPCx-HS [14]). This enables us to understand real-life performance trade-offs, which is relevant to the architecture and sizing of the automotive data lake. In particular, we investigate: (i) the performance/capability trade-off of processing frameworks such as MapReduce, Tez [15], Spark [7] and Flink [8] using Terasort/TPCx-HS, (ii) the SQL performance of Hive [9] and Spark-SQL on an automotive dataset with 120 billion records, comparing the performance to a state-of-the-art in-memory relational data management system (RDBMS), and (iii) the performance and scalability of Hadoop-based machine learning using Spark MLlib [16] in comparison to tools such as R. For (ii) and (iii), we utilize two real-world automotive datasets to derive benchmarks with different scale factors using the synthetic data generator developed by Kennedy [17].

This paper is structured as follows: In section II, we provide an overview of automotive use cases and analyze their characteristics and requirements with respect to a Big Data infrastructure. We propose a data lake architecture comprising multiple logical layers and frameworks/tools in section III. We particularly investigate processing and execution management frameworks and higher-level frameworks for SQL and advanced analytics. Further, we evaluate the security requirements and mechanisms available for Hadoop. We conclude this section with a discussion of how the proposed architecture addresses automotive application requirements. In section IV we present our experiences with Hadoop using selected automotive workloads.

II. AUTOMOTIVE APPLICATIONS

Scientific disciplines, such as physics and biology, along with Internet companies, have been processing large volumes of data for some time. Now, traditional industries, such as the automotive industry, are facing similar challenges as the digitalization of their business processes and models increases. Developing and characterizing use cases is critical for designing an automotive Big Data platform.

A. Connected Vehicle and Intelligent Transport Systems

Gartner estimates that by 2020 most vehicles in mature automotive markets will have data connectivity [18]. An increasing number of vehicles are already online and utilize telematic services, such as OnStar, Ford Sync, and BMW's ConnectedDrive. In addition to 3G/LTE based services, it can be expected that vehicle-to-vehicle (V2V) and vehicle-to-infrastructure (V2I) based services will transition from research to production in the near future.

The number of sensors per vehicle and the data produced by them is rapidly increasing. Today, the electronic control units (ECUs) of a vehicle can generate multiple GBs of data per second. The majority of this data is transient; only a fraction is typically retained and used by telematic services, e.g. for traffic prediction, safety warnings, vehicle diagnostics, location-based services (e.g. local search), entertainment services, and autonomous driving. By 2018 it is expected that for BMW alone more than 10 million vehicles will have connectivity and will each transmit more than 1 TB of data per day [19].

Traffic prediction is an example of a service that relies on crowdsourced sensor data collected from vehicle fleets, smartphones, and traffic infrastructures. In addition, this data is valuable for improving the energy efficiency of vehicles. The prediction of key traffic indicators, such as traffic flows or traffic signal states, is a challenge. Often, highly dynamic sources (e.g. map data, current and historic traffic flows, and weather) need to be fused to enable accurate prediction. For example, we investigate the utilization of GPS probe data for predicting traffic light phases and for improving the energy efficiency of the vehicle [20]. While optimizing the travel of an individual vehicle yields benefits, the data can also be used to understand and improve traffic infrastructure.

Autonomous driving requires complex sensor data fusion (e.g. infrared, Lidar and camera-based sensors), which provides the basis for the generation of an environmental model and for making real-time decisions. The Google self-driving car generates about 750 MB of sensor data per second [21]. Autonomous driving requires sophisticated machine learning techniques, e.g. deep learning methods for image classification and understanding and for learning traffic situations.

While connected vehicle applications and services are often centered around a single vendor, even greater benefits occur when data are combined from public and private sources, including automotive, smartphone, parking, energy, and public transportation data. Intelligent transport system services enabled by this data include intermodal routing, advanced parking services, car sharing, electric vehicle related services, and the optimization of transportation infrastructure usage to increase throughput and to reduce emissions and accidents.

B. Manufacturing & Quality

The Industrial Internet [22] describes the increase of connected machines in industrial environments and the resulting ability to optimize global operations using data generated during the production process. The amount of data collected during the production of a vehicle by OEMs and suppliers is rapidly increasing.

New opportunities are being created to monitor, analyze, and optimize the manufacturing process and the supply chain. Currently, the majority of this data is semi-structured, produced by IT systems and humans, but an increasing amount is machine-generated (e.g., from PLCs or SCADA systems). The fusion of this data with other data sources, such as diagnostic data from the field and engineering data, enables predictive analytics to understand and prevent quality issues in the plant. By combining this data with diagnostic data from the field, a broader perspective on quality can be obtained that will inform the design of future vehicle models.

C. Deep Learning for Image/Video Analytics

A particular challenge is the large amount of unstructured data (videos, images, text), e.g. from camera-based sensors on the vehicle or machines in the manufacturing process. To effectively utilize this kind of data, new machine learning methods like deep learning are essential. Deep learning [23] refers to a set of machine learning algorithms that utilize neural networks for feature extraction/learning, classification and prediction. Deep learning powers Internet services, e.g. the voice recognition and dialog systems of Siri, Google Now and Microsoft Cortana, and will have many applications within the automotive industry, such as computer vision for autonomous driving and robotics, optimizations in the manufacturing process (e.g. monitoring for quality issues), and connected vehicle and infotainment services (e.g. voice recognition systems).

D. Discussion

The automotive industry is transforming: digital services are the center of innovation. In addition to the described use cases, there are many more automotive applications as the digital footprint of customers increases. More digital touch points (e.g., web, mobile devices, social media, online services) generate more, and often complex, unstructured data that requires machine learning techniques to extract knowledge and insight. Existing data warehouse solutions fail to scale to meet the immense data volume, velocity and analytics requirements of the described applications. Initial explorations of these use cases typically focused on augmenting relational data management systems (RDBMS) with the capabilities of a scalable, distributed storage and compute platform, such as Hadoop. This allows us to combine medium-size data sources (e.g. transactional data from system-of-record systems) with large-volume data sources, e.g. sensor and probe data. Supporting ad-hoc queries on a dynamic set of datasets in an agile way using higher-level tools, such as SQL and SQL-based business intelligence tools, is a critical requirement. Further, machine learning techniques for extracting knowledge from data are critical for different tasks: for assisting data discovery by automatic identification of patterns, for understanding complex interactions and correlations in data, and for predictive modeling. Scalable machine learning techniques are required to deal with large volumes of data and also with the inherent high dimensionality of unstructured data, such as text, images and video. Such applications have a very different workload characteristic than traditional Big Data workloads, which were primarily I/O bound; neural networks used for deep learning, for example, are compute-intensive. A Big Data platform must account for these complex requirements.

Fig. 1. Data Lake Architecture (diagram omitted): The architecture comprises five layers. Applications — business intelligence (reporting, OLAP, data discovery), advanced analytics (ETL, exploration, prediction, clustering, search) and data applications (recommendations, predictions) — sit on top of higher-level APIs (SQL, data frames, custom APIs). Below, the processing and execution management layer offers Hadoop processing (HBase, MapReduce, Spark, Flink, custom jobs) and streaming processing (Spark Streaming, Storm, Flink) on top of the data/compute resource layer (Hadoop Filesystem/YARN). Data ingest, loading and integration (API access, Flume, Sqoop, Kafka) connects the data sources (transactional/ERP data, sensor data, machine-generated data, log data); relational databases (Oracle, SQL Server, PostgreSQL) are integrated alongside, and security & management spans all layers. The majority of applications will access the data lake via SQL on the API layer; data processing is done using Hadoop frameworks on top of HDFS/YARN.

III. APACHE HADOOP FOR AUTOMOTIVE APPLICATIONS

Hadoop was conceived as a scalable infrastructure for data-intensive applications, providing an open source implementation of the MapReduce programming model [4]. With the introduction of the YARN resource manager [24], Hadoop has evolved into a general-purpose cluster computing framework for heterogeneous tasks, supporting interactive and batch tasks as well as data- and compute-intensive tasks. We refer to the Hadoop Filesystem (HDFS) and YARN as core Hadoop and to higher-level frameworks and tools as the Hadoop ecosystem or environment.

Figure 1 shows the data lake architecture. The main objective of the architecture is the storage and processing of large data volumes supporting ETL, analytics and machine learning workloads. The Hadoop ecosystem differs significantly from traditional enterprise environments, which are typically characterized by long release cycles, well-defined product dependencies and monolithic solutions. While the Hadoop ecosystem is evolving toward higher-level abstractions, it requires deep expertise in distributed computing, data storage and processing. The main objective is to understand this landscape of Hadoop tools and their different trade-offs, and to evaluate how these tools can be successfully deployed, operated, and integrated in an enterprise environment. An important concern is interoperability with enterprise data warehouse environments, which are based mainly on relational technologies. As the majority of structured data resides in existing relational environments, a stable integration with these environments is essential.

The architecture comprises five layers: In the application layer the focus is to support advanced analytics and custom data applications. The majority of applications will access the data lake via SQL on the API layer. For advanced analytics the dataframe abstraction is provided (see section III-C). Direct access via specific framework APIs will be available to some special applications.

In the processing layer we distinguish between data processing (batch and interactive) and streaming frameworks (realtime). The data/compute resource layer encapsulates the core Hadoop services YARN and HDFS. While we currently envision that the majority of interactions are directed by users, in the future the degree of automation will improve and the number of automated machine connections to the cluster will increase, conducting data loads, discovery, integration and analytics tasks without user interaction.

In the following we survey tools for the different layers: In section III-A we analyze the processing and execution management layer for supporting batch and stream computing; we continue with the analytics layer, investigating SQL and advanced analytics/machine learning capabilities in sections III-B and III-C. In section III-D we analyze the current state of security and governance in the Hadoop ecosystem.

A. Processing and Execution Management

Hadoop's original MapReduce uses a disk-oriented approach for the storage and processing of large volumes of data. It will remain important for archival storage and computing, but can have slow access speeds for interactive or real-time analytics requiring queries and for the iterative processing used in machine learning. To address these issues, various processing and execution frameworks evolved, e.g. Spark [7], Flink [8], Tez [15], and HARP [25]. Typically, a processing framework comprises a programming model and an execution engine that is responsible for lower-level resource management tasks, such as scheduling and optimizations.

In particular, Spark rapidly gained popularity. It utilizes in-memory computing, which makes it particularly suitable for iterative processing. The programming model is based on an abstraction referred to as Resilient Distributed Datasets (RDD). RDDs are immutable, read-only data structures on which the application can apply transformations. The runtime handles the loading and distribution of the data into the memory of the cluster nodes. The immutable nature of RDDs enables efficient recovery after a failure by simply re-applying transformations. A constraint of Spark is that its in-memory capabilities are limited to single application jobs. There are multiple ongoing efforts to extract in-memory capabilities into a separate runtime layer that is usable by multiple frameworks and not restricted to caching data within a single framework and/or job. Tachyon provides an HDFS-compatible distributed in-memory filesystem. HDFS supports an in-memory cache since Hadoop 2.3.
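To make the RDD model concrete, consider the following minimal PySpark sketch; the HDFS path, record layout and column semantics are our own illustrative assumptions, not taken from the deployment described in this paper. It builds a lazy lineage of transformations over probe data and caches the intermediate RDD so that iterative computations reuse the in-memory copy:

```python
from pyspark import SparkContext

sc = SparkContext(appName="rdd-sketch")

# Load raw probe data from HDFS; path and record layout are hypothetical.
lines = sc.textFile("hdfs:///data/probe/*.csv")

# Transformations are lazy and only build a lineage graph.
speeds = (lines.map(lambda line: line.split(","))
               .filter(lambda rec: len(rec) > 2)
               .map(lambda rec: (rec[0], float(rec[2]))))  # (segment_id, speed)

# Cache the RDD so iterative computations reuse the in-memory copy.
speeds.cache()

# Actions trigger execution; lost partitions are recomputed from the lineage.
count = speeds.count()
avg_by_segment = (speeds.mapValues(lambda v: (v, 1))
                        .reduceByKey(lambda a, b: (a[0] + b[0], a[1] + b[1]))
                        .mapValues(lambda s: s[0] / s[1])
                        .collect())
```

Because the lineage is recorded, a lost partition of the cached RDD can be recomputed from the source data without restarting the job.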
An emerging class of data-driven applications, e.g. traffic prediction and autonomous driving, requires realtime data processing and analytics. The Lambda architecture [26] aims to support data-driven applications that have high data volumes and ingest rates, and that require both realtime and batch analytics. For this purpose, the Lambda architecture defines a batch layer, a serving layer and a speed layer. Hadoop — while initially designed for MapReduce-style batch computing — increasingly supports stream processing via different frameworks and tools: The core of a realtime pipeline is a message broker, such as Kafka [6], which decouples message delivery and processing, enabling the separation of the realtime and batch processing pipelines. A message broker enables applications to support multiple data consumers, e.g. a batch consumer for large historical analytics and realtime consumers for incremental model updates. While in the past several processing engines specialized for streaming emerged (e.g. Storm [27]), the programming models for batch and streaming are harmonizing. Spark Streaming and Flink allow users to utilize the same API on both batch and streaming data sources. Spark's MLlib supports the update of models inside the Spark Streaming environment.
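The following sketch illustrates this broker-based pattern with the receiver-based Kafka API of Spark Streaming 1.x; the ZooKeeper address, consumer group and topic name are placeholders, and the message format is assumed for illustration:

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

sc = SparkContext(appName="kafka-stream-sketch")
ssc = StreamingContext(sc, 10)  # 10-second micro-batches

# Consume a hypothetical "vehicle-events" topic via the ZooKeeper-based
# receiver API of Spark 1.x (zkQuorum, group and topic are assumptions).
stream = KafkaUtils.createStream(ssc, "zk-host:2181", "consumer-group",
                                 {"vehicle-events": 1})

# Each message arrives as a (key, value) pair; count events per micro-batch.
counts = stream.map(lambda kv: kv[1]).count()
counts.pprint()

ssc.start()
ssc.awaitTermination()
```

The same topic could be consumed in parallel by a batch job that appends the raw events to HDFS, realizing the two branches of the Lambda architecture.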
B. SQL Frameworks

While Hadoop MapReduce and Spark provide a set of well-defined, scalable abstractions for efficient data processing, higher-level abstractions for structured data — based either on SQL or on document/object-store capabilities — enable higher productivity.

Many automotive use cases rely on SQL as a common grammar for data extraction. SQL is particularly useful for querying columnar data that has been derived from less structured data sources. In general, two architectures emerged: (i) the integration of Hadoop with existing relational data warehouse systems and (ii) the implementation of SQL engines on top of core Hadoop services, i.e. HDFS and YARN. Architecture (i) is often used to integrate data from Hadoop into an existing data warehouse. The scalability of this architecture is limited, as queried data always needs to be processed in the database (which is typically a magnitude smaller than the Hadoop cluster). In the following we focus on Hadoop SQL engines.

Inspired by Google's Dremel [28], various SQL query engines running over Hadoop have emerged: Hive [9], HAWQ [29], Impala [30] and Spark-SQL [31]. Hive was the first and is one of the most used SQL engines for Hadoop. Early versions of Hive compiled SQL into MapReduce jobs, which often yielded non-optimal performance. Thus, Hive was extended to multiple runtimes, e.g. Tez [15] and Spark in addition to MapReduce. Tez generalizes the MapReduce model to a generic dataflow-oriented processing model while providing an improved runtime system that supports in-memory caching and container re-use. Spark-SQL [7] is a SQL engine that is part of the Spark distribution. A particular advantage of Spark-SQL is its tight integration with the other analytic tools provided in the Spark ecosystem — results of SQL queries can be loaded into a dataframe and then further processed with advanced analytics tools, such as MLlib (see section III-C).
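A hedged sketch of this integration, assuming the Spark 1.3-era HiveContext; the table and column names are invented for illustration. A query is issued against the Hive metastore and the resulting DataFrame is processed further programmatically:

```python
from pyspark import SparkContext
from pyspark.sql import HiveContext

sc = SparkContext(appName="sparksql-sketch")
sqlContext = HiveContext(sc)  # reuses the Hive metastore and its tables

# Query a hypothetical Hive table of vehicle diagnostic events; the result
# is a DataFrame, not just a result set.
df = sqlContext.sql("""
    SELECT vin, error_code, COUNT(*) AS occurrences
    FROM diagnostic_events
    GROUP BY vin, error_code
""")

# The same data can be processed further with the DataFrame API ...
top = df.filter(df.occurrences > 10).orderBy(df.occurrences.desc())

# ... or handed to MLlib after conversion to an RDD of feature values.
features = top.rdd.map(lambda row: (row.vin, float(row.occurrences)))
```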

In contrast to traditional databases, which often store data in proprietary formats tightly coupled to their processing engine, open data storage formats are essential in the Hadoop environment in order to maintain interoperability between the various tools and to achieve a maximum level of flexibility. Very commonly there are multiple access paths and tools that access the same data. Thus, various file formats optimized for different use cases emerged: ORC [32] and Parquet [33] are particularly designed for SQL workloads using a columnar data structure.
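As an illustration of working with these formats from Spark (the table name and path are hypothetical; the calls shown are the Spark 1.3-era reader/writer API):

```python
from pyspark import SparkContext
from pyspark.sql import HiveContext

sc = SparkContext(appName="formats-sketch")
sqlContext = HiveContext(sc)

# Read a hypothetical text-backed Hive table and persist it as Parquet;
# downstream Spark-SQL queries then benefit from columnar pruning.
df = sqlContext.sql("SELECT * FROM raw_probe_data")
df.saveAsParquetFile("hdfs:///warehouse/probe_parquet")

# Parquet files are self-describing and can be read back directly.
parquet_df = sqlContext.parquetFile("hdfs:///warehouse/probe_parquet")
parquet_df.registerTempTable("probe_parquet")
result = sqlContext.sql("SELECT COUNT(*) FROM probe_parquet").collect()
```

An equivalent conversion to ORC is typically performed in Hive itself, e.g. via a CREATE TABLE ... STORED AS ORC AS SELECT ... statement.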

In summary, SQL processing on Hadoop provides more flexibility than a monolithic RDBMS, but requires a deeper understanding of the trade-offs: An application can optimize the storage format (columnar storage for analytics vs. random access), the processing engine (MapReduce, Tez, Spark), the SQL frontend (Hive, Spark-SQL), and the optimizer to meet the requirements of the particular workload. The optimizer plays a critical role in achieving optimal performance. Project Calcite, used e.g. by Hive, provides a data-format and processing-engine agnostic optimizer that can be used to implement federated SQL engines on top of heterogeneous data sources (e.g. Hive, HBase, MongoDB).

Fig. 2. Scalable Data Analytics (diagram omitted): The figure arranges tools by scale and degree of data/compute parallelism, identifying three categories: workstation-based analytics (open source R and Python/scikit-learn, Python/Dato, pbdR, Revolution R, and other languages such as Java, Scala and SQL), in-database analytics (PivotalR, MADlib, Oracle Enterprise R), and Hadoop/cluster analytics (Mahout, Spark/MLlib, H2O). Workstation-based tools typically provide the most comprehensive set of features and the best ease-of-use. In-database analytics provide a more scalable environment, but are typically less scalable and flexible than Hadoop-based tools.
C. Advanced Analytics

The term analytics is broadly used to describe systems for extract-transform-load (ETL), business intelligence for data exploration and reporting (e.g. QlikView, Tableau), and advanced analytics tasks, such as data mining and machine learning. In the following, we focus on programmatic advanced analytics tools. The best-known abstraction for data analytics is the dataframe, originally introduced in R (respectively its predecessor S); it exposes data in a tabulated format to the user and supports the efficient expression of data transformations and analytics [34]. A dataframe provides a two-dimensional table abstraction that can store different types of data (e.g. numeric, text and categorical data) in its columns. Depending on the specific implementation, the dataframe abstraction typically supports various functions for data transformations, e.g. for subsetting, merging, filtering and aggregating data. In addition to R, several other dataframe implementations exist, e.g. Pandas [35], NumPy [36], and Dato SFrames [37] for Python. Traditionally, implementations of the dataframe abstraction lacked the ability to scale out. In the following we investigate different analytic environments with respect to their expressiveness, ease-of-use, scalability and advanced analytics/machine learning support.
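As a small single-node illustration of the dataframe operations listed above, the following pandas snippet (all values invented) subsets, aggregates and merges a mixed-type table:

```python
import pandas as pd

# A small in-memory dataframe with mixed column types (values invented).
df = pd.DataFrame({
    "vin":     ["V1", "V2", "V1", "V3"],
    "region":  pd.Categorical(["EU", "US", "EU", "EU"]),
    "mileage": [12000.0, 43000.0, 12500.0, 8000.0],
})

# Subsetting, filtering and aggregation as described above.
eu = df[df["region"] == "EU"]  # filter rows by a categorical column
avg = (eu.groupby("vin", as_index=False)["mileage"]
         .mean()
         .rename(columns={"mileage": "avg_mileage"}))  # aggregate per key
merged = df.merge(avg, on="vin", how="left")           # join back
```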
Figure 2 shows the landscape of analytics tools. The R language [34] is a popular programming environment for data analytics. While R provides an ecosystem comprising a manifold set of packages for statistics and machine learning, it has severe limitations with respect to scalability [38]. Various approaches for supporting better scalability via out-of-core and distributed processing emerged. Often, R or Python serves as an interface to these scalable analytics environments (e.g. Spark, H2O, or in-database analytics). In the following we focus in particular on the R and Python ecosystems, as these are the most prevalent programming languages for data science.

Scaling machine learning and analytics techniques requires careful consideration of two forms of parallelism: data parallelism can typically be exploited easily using programming models such as MapReduce or Spark. The linear algebra solvers often used for machine learning typically demand a more fine-grained parallelism, which is more difficult to scale. However, various parallel libraries for these tasks, e.g. ScaLAPACK [39] or ARPACK [40], already exist.

Several R packages for out-of-core and parallel processing emerged, either via explicit implementation (e.g. pbdR [41]) or implicitly within a higher-level package. pbdR re-uses ScaLAPACK for linear algebra, while open source R relies on the non-parallel LAPACK and LINPACK. Further, various higher-level packages for R exist that provide a dataframe-compatible abstraction, but delegate the implementation of data transformations and analytics to a distributed environment, e.g. Hadoop or Spark. For example, Spark-SQL and MLlib [16] provide a dataframe abstraction for Python, Scala and R; the dataframe is implemented using the well-known RDD abstraction of Spark. MLlib relies on ARPACK, encapsulated by netlib-java, for linear algebra operations [42]. Revolution R provides dataframe-level implementations of analytics functions that are realized by utilizing lower-level data parallelism for filtering, transformations etc. (e.g. rhadoop) and data/compute parallelism for advanced analytics (e.g. for generalized linear models, decision trees and KMeans clustering). H2O [43] utilizes a similar architecture. A challenge is the need to copy the data between the local R environment and the cluster. Further, the exposed dataframe APIs mimic R idioms, but cannot be used interchangeably with the standard R dataframe.

Another approach is the usage of in-database analytics. The level of integration between R and databases can vary. In the simplest case, data parallelism can be exploited using PL/R code. Other systems integrate Revolution R (e.g. Teradata) or custom analytics implementations (e.g. Oracle). Another example is MADlib [44], a library that was designed for in-database use. It runs embedded in databases, such as Postgres, HAWQ and Impala, enabling the usage of analytics functions via SQL.

In general, two kinds of integration of analytic tools with Hadoop exist: (i) tools that can load/store data from Hadoop and (ii) tools that are tightly integrated with Hadoop and process data in-situ.

Nearly all analytical tools support access to data stored in Hadoop, either via HDFS or via an ODBC connection to columnar data stored in Hive. Data discovery tools (such as QlikView, Tableau) and BI and analytical tools (e.g., R, Scikit-Learn, SPSS and SAS) fall into this category. While this approach enables the quick processing of small to medium-size data, it also has some disadvantages: data is duplicated, which increases the management and security overhead, and the scalability is often limited. The number and maturity of higher-level tools that support scenario (ii), bridge the various layers, and natively run analytics inside the cluster is still limited.

D. Data Governance and Security

The data lake, as a central platform for storing diverse data sets supporting ad-hoc data integration, querying and analytics, is a challenging environment with respect to data governance and security. Often, the precise nature and privacy implications of the data at hand are unknown and require analysis before the data can be efficiently governed. Furthermore, the process of linking data sources can potentially lead to unanticipated privacy issues. Thus, it is critical to define security measures and governance processes for data access, retention, linkage and quality. The default security level of Hadoop is very low. As Hadoop evolved, security mechanisms have been retrofitted into the different frameworks; however, Hadoop still lacks a coherent security architecture. The Kerberos feature is critical to prevent the easy impersonation of other user accounts. However, the maturity of Kerberos support is very low in many tools — some frameworks, e.g. Kafka, have no Kerberos support at all; others, e.g. Spark, have difficulty interacting with certain secured services.

It is best practice to deploy Hadoop behind a firewall and only expose a small, selected set of services, either directly or via a gateway service, such as Apache Knox. Accordingly, the number of users with direct access to HDFS and YARN can be kept small. Knox allows the integration with single-sign-on environments (e.g. SiteMinder) and can forward the authenticated user or map it to specific service users. In non-Kerberos clusters, these gateway services are a valid approach for minimizing the number of users with direct access to HDFS/YARN, even though only a small number of use cases can be sufficiently supported. The usage of Kerberos authentication is essential for large-scale, multi-tenant data lakes and sensitive data.
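To illustrate the gateway pattern, the following sketch lists an HDFS directory through Knox's WebHDFS proxy; the host, topology name, credentials and certificate path are placeholders, and the exact URL layout depends on the Knox topology configuration of a given deployment:

```python
import requests

# Hypothetical Knox endpoint; host, port, topology and credentials are
# placeholders, and certificate verification is site-specific.
url = "https://knox.example.com:8443/gateway/default/webhdfs/v1/data"
resp = requests.get(url,
                    params={"op": "LISTSTATUS"},
                    auth=("alice", "secret"),
                    verify="/etc/ssl/certs/gateway.pem")
resp.raise_for_status()

# Knox proxies the WebHDFS REST response unchanged.
for entry in resp.json()["FileStatuses"]["FileStatus"]:
    print(entry["pathSuffix"], entry["type"])
```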
Another important concern is authorization. Many frameworks chose to implement their own authorization infrastructure: Hive's SQL grants/revokes or the cell-level authorization of HBase and Accumulo are examples of this approach. These tools rely on a shared service user for all HDFS interaction. While these silo-ed security schemes simplify the security in their domain, they increase the management complexity and the difficulty of implementing consistent policies. Two approaches for solving this issue have been proposed: (i) Sentry adds a common authorization handler for adding role-based authorizations to different frameworks (currently Hive and Cloudera Impala). (ii) A shared directory for user, group and role management (e.g. based on LDAP) is used together with a security management tool, such as Argus, to ensure that privileges are correctly translated to the different frameworks.

The support for encryption of both network traffic and HDFS data is required for sensitive data. While network traffic encryption is supported by many Hadoop services, built-in encryption for HDFS data has only recently become available. The setup of network encryption is complex: encryption must be enabled for all daemons; thus, management support for configuration and key management is important. HDFS encryption is supported since Hadoop 2.6; several products attempt to simplify and enhance this feature, e.g. Vormetric, Voltage or Cloudera's Navigator Encrypt and Key Trustee.

Data lifecycle management describes the ability to track data provenance and lineage across the different phases: data ingest, storage and processing. The variety of data sources and heterogeneous data processing infrastructures contributes to the complexity of this task. Often, it is required to move data between different data stores with different, potentially inconsistent, access policies. An important part of the data lifecycle is the assessment of the data criticality and the assignment of access policies. Hadoop's data lifecycle solutions are still in their infancy, with products such as Cloudera's Navigator and Apache Falcon. As part of the data lifecycle it is often necessary to mask data (i.e. parts of the data need to be removed or pseudonymized) to meet the security level of the platform [17].

E. Discussion

Automotive Big Data applications in the different domains — manufacturing, quality, connected vehicle and intelligent transport systems — share common characteristics: they bring together a variety of data sources, from well-structured data from transactional systems to less structured data generated by machines, mobile devices, social networks etc. Loading and integrating these datasets and making them explorable via higher-level tools, such as SQL, is critical. Understanding the trade-offs of the different SQL-on-Hadoop frameworks in conjunction with the different data formats is therefore crucial. Machine learning techniques are increasingly deployed in the various phases of data science projects, e.g. for data cleaning, data discovery, and different forms of analytics (text analytics, pattern mining, predictive modeling, etc.). Specialized tools for deep learning, however, often require a specialized infrastructure (using e.g. GPUs) and often do not integrate with the Hadoop scheduler YARN.

YARN decouples cluster resource management and applications and allows most frameworks to be deployed at user level without administrative involvement. However, in particular for frameworks that require a tighter integration, e.g. Spark-SQL/Hive, some pitfalls exist, especially with secure clusters. Also, as the implementation of YARN applications is a complex endeavor, it is typically not a first-order consideration; YARN lacks the flexibility and ease-of-use of other schedulers such as Mesos [45], as it was primarily designed for Hadoop applications. There are some attempts to integrate these systems, e.g. Myriad provides the ability to dynamically spawn YARN clusters inside Mesos [46].
Another concern is security — basic security mechanisms, such as Kerberos, are available; however, emerging frameworks in particular, such as Spark, have some issues interfacing with secure Hadoop clusters. Further, data management and governance remains a manual process. Often, it is very difficult to assign access rights to raw data, as the real content of the data remains opaque.

Another important consideration is the selection of a Hadoop distribution, trading off capabilities, performance and vendor lock-in. Hadoop distributions bundle core Hadoop with components to simplify deployment, monitoring and upgrades. In the optimal case, the Hadoop distribution should provide tools that cover all layers of the architecture. The most well-known distributions are Cloudera, Hortonworks and MapR. Each distribution follows its own release philosophy: Cloudera typically adopts selected upstream features into its stable version, while Hortonworks follows the open source release cycle closely. Also, many vendors substitute their own components at different levels of the Hadoop ecosystem. MapR replaces HDFS with its own distributed filesystem, promising better performance and reliability. The participation of many vendors in the Hadoop ecosystem has led to maturity, but also to fragmentation. While the lower layers of Hadoop (i.e. HDFS and YARN) have reached stability, vendors have moved to higher layers for differentiation, such as SQL, machine learning and security.

IV. EXPERIMENTAL EVALUATION

While a set of benchmarks for Big Data workloads is emerging (e.g. TPC-DS [47], TPCx-HS [14]), the available results generated by different vendors are often difficult to compare. These results are typically obtained using highly optimized setups and carefully selected numbers. For the design and sizing of our data lake it is critical to understand the trade-offs and the optimization parameters and efforts for the different frameworks. This is not possible by relying on published benchmark results. Thus, we created a custom benchmark suite based on two real-world datasets of automotive applications. We augment these results with the TPCx-HS standard benchmark Terasort to obtain a baseline performance of our cluster. The application workloads we study are SQL and a logistic regression and SVM for text mining. For these experiments, we use our 48-node Hadoop data lake with 10 Gigabit Ethernet connectivity, Hortonworks HDP 2.2 with Hadoop 2.6, Hive 0.14, Spark-SQL 1.3, Flink 0.9 and Tez 0.5.2.

A. Terasort

Figure 3 investigates the Terasort performance of different Hadoop processing engines. We sort 1 TB of data using 32 nodes. All jobs are run via YARN, which introduces some challenges: due to the memory footprint of YARN, not the full memory is available for Spark; further, heterogeneous nodes with different amounts of cores/memory are not optimally supported. We use the Terasort implementation from the Hadoop examples for MapReduce and Tez, the Flink example implementation, and the un-optimized Spark implementation provided by Min [48].

Fig. 3. Terasort Benchmark for Flink, MapReduce, Tez and Spark (figure omitted), running on YARN out of the box: Flink shows the worst performance and resource usage; Spark uses more memory, MapReduce more CPUs; Tez achieves the best performance.

The results demonstrate the inherent trade-off between programming model capabilities, ease-of-use and performance. Both Flink and Spark provide a higher-level and more expressive programming model than MapReduce — while Spark outperforms MapReduce, Flink shows the worst performance of all frameworks. Furthermore, it must be noted that the performance improved significantly (>30%) from Spark 1.3 to 1.4. Nevertheless, the Tez implementation shows the best performance. This is particularly interesting since Spark showed the best performance in the Graysort benchmark in 2014 [49] — there are several potential explanations: to ensure comparability and real-world applicability, we ran all scenarios through YARN, which is associated with some overhead. The Spark/YARN integration is still evolving; in particular, the startup time for Spark on YARN with a high number of containers is significant. Tez is much better optimized for YARN. Finally, the environment for the benchmark was different — the Graysort was run on EC2 instances with SSDs, while we utilize disk-based storage.

B. SQL on Hadoop

SQL is one of the most important workloads that users currently run on data stored in Hadoop, covering ETL, data preparation and simple analytics tasks. Our benchmark consists of several large tables with up to 120 billion rows and an uncompressed data size of 3.2 TB. The benchmark comprises five queries. In the following, we evaluate the suitability of Hive (with Tez) and Spark-SQL using different storage formats: text, Parquet, and ORC. In both cases, we run the queries via the Hive respectively the Spark-SQL shell and YARN. Data is loaded from HDFS in both cases.

Figure 4 shows the results of the benchmark. We tested a total of five queries — in the following we particularly investigate three of these: select, aggregate with group-by, and join, using a dataset consisting of 20 million vehicles and 120 billion rows.

Fig. 4. Hadoop SQL Performance (figure omitted): The figure shows the scale-out performance of our automotive SQL benchmark — select, aggregate and join queries as a function of the number of nodes (4, 8, 16, 32) for Hive/Text, Hive/ORC, Hive/Parquet, Spark/ORC and Spark/Parquet — based on queries of a data warehouse system.

For 20 million vehicles the columnar data formats ORC and Parquet compress the data significantly, reducing the size from 3.2 TB for text to 1.3 TB for Parquet and to 193 GB for ORC. For Hive, ORC exhibits a better performance than the other formats; for Spark-SQL, Parquet shows the best performance. It must be noted that the data conversion to Parquet and ORC is associated with some overhead (e.g. it took 40 min to create an ORC table based on the text file). As the results indicate, this effort typically pays off.

We compare three queries: a select, an aggregate and a join query. The select query does not require a reduce phase, whereas both the aggregate and the join query do. The shuffle traffic is the highest for the join query.
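The concrete benchmark queries are not reproduced here; the following schematic stand-ins (with an invented schema) illustrate the three query shapes and their differing shuffle requirements. Each string can be submitted through sqlContext.sql() or the Hive shell:

```python
# Schematic stand-ins for the three query shapes (schema invented).
select_query = """
    SELECT vin, ts, speed FROM probe_data WHERE speed > 120
"""  # map-only: no shuffle/reduce phase

aggregate_query = """
    SELECT vin, AVG(speed) AS avg_speed
    FROM probe_data GROUP BY vin
"""  # one shuffle to group rows by key

join_query = """
    SELECT v.model, AVG(p.speed) AS avg_speed
    FROM probe_data p JOIN vehicles v ON p.vin = v.vin
    GROUP BY v.model
"""  # shuffle-heavy: both tables are repartitioned on the join key
```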
Both Hive and Spark (comparing Hive/ORC and Spark/Parquet) deliver a comparable performance for the select and aggregate queries; nevertheless, Hive has a small advantage in most cases. For the join query Hive is on average 50 times faster than Spark-SQL. We had to optimize the join query for Spark because of memory issues, in particular in the four-node scenario. Also, it must be noted that window queries are not supported by Spark at the moment (they are part of our benchmark, but not depicted in the graph). The advantage of the data lake architecture is that different technologies can be combined, trading off capabilities and performance: while lacking in SQL support and performance, Spark-SQL has advantages with respect to integration with higher-level analytics and machine learning frameworks, such as dataframes and MLlib.

In the next scenario we compare the performance of Hive and Spark to a commercial analytical database system (RDBMS). Figure 5 shows the time for each system broken down by query. We utilize a similar number of cores and amount of memory for all three systems: Hive/Spark are run on 4 nodes with 12 cores and 128 GB memory each, and the RDBMS is run on a 40-core/512 GB system. We use a reduced dataset consisting of 60 billion rows and Hive/ORC and Spark/Parquet (50% of the size of the previous scenarios). As the RDBMS has the advantage of running in non-distributed mode, it outperforms both Hive and Spark-SQL. Spark-SQL's memory usage, and thus the necessity to page out data from the RDD during the shuffle phase, causes the performance deterioration on the join query. Nevertheless, both frameworks deliver an acceptable performance considering that they are available as open source tools and run on commodity hardware.

Fig. 5. Hive/Spark versus RDBMS (figure omitted): On a comparable infrastructure — 4 nodes with 128 GB memory each vs. 1 node with 512 GB — we compare Hive and Spark to a commercial RDBMS on the select, aggregate and join queries. Generally, the RDBMS can outperform Hive and Spark; however, both deliver a solid performance at a lower cost.

C. Advanced Analytics: Text Classification

Analytical challenges arise in particular from unstructured data, such as text data. Machine learning approaches are commonly used to address this challenge. Scalable approaches for machine learning are necessary to support large-volume and high-dimensional data. Text classification is an example of a machine learning problem and describes the process of grouping documents into categories. In this project we utilize Support Vector Machines (SVM) and Logistic Regression (LR) to classify verbatim data from customer surveys. Our analysis particularly focuses on the MLlib implementation of these machine learning algorithms inside of Spark.
For this purpose, we utilize a synthetic dataset of up to 110 million records (16 GB scenario).

Figure 6 shows the results of this benchmark. A scalable machine learning implementation is the key to dealing with high-dimensional data and large volumes (larger numbers of rows and columns). The application uses TF-IDF for feature extraction and an SVM and logistic regression for classification. As shown in the figure, TF-IDF scales linearly, while the machine learning algorithms exhibit an increased overhead with larger data sizes due to the higher number of columns in the TF-IDF matrix and the need for synchronization after each iteration. Both learning algorithms, SVM and logistic regression, perform similarly when using stochastic gradient descent (SGD) as the solver. A significant speedup can be observed by replacing SGD with the limited-memory Broyden-Fletcher-Goldfarb-Shanno (L-BFGS) solver provided by MLlib for logistic regression — both solvers showed a comparable accuracy in the results. The results demonstrate the scalability of Spark MLlib compared to other machine learning frameworks: our initial version of this SVM classifier was built in R and was constrained to about 100,000 records.

Fig. 6. Scalability of Different MLlib Algorithms (figure omitted): The figure shows feature extraction and classification times as a function of the data size (1-16 GB on 32 nodes) and of the number of nodes (8, 16, 32 nodes on 8 GB of data) for logistic regression (SGD), SVM (SGD) and logistic regression (L-BFGS). Both the feature extraction and the classification scale linearly. The performance of the training phase depends on the algorithm and optimizer used: both stochastic gradient descent based approaches (logistic regression and SVM) show a worse performance than the L-BFGS optimizer that is available in Spark for logistic regression.
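A condensed sketch of this pipeline using the MLlib RDD API; the input path and label encoding are our own assumptions (the actual survey data is not shown). It performs TF-IDF feature extraction and then trains a logistic regression model with the L-BFGS solver:

```python
from pyspark import SparkContext
from pyspark.mllib.feature import HashingTF, IDF
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.classification import LogisticRegressionWithLBFGS

sc = SparkContext(appName="text-classification-sketch")

# Hypothetical input: one labeled survey verbatim per line, "label<TAB>text".
raw = sc.textFile("hdfs:///data/surveys.tsv").map(lambda l: l.split("\t"))

# TF-IDF feature extraction as described above.
tf = HashingTF().transform(raw.map(lambda r: r[1].split(" ")))
tf.cache()  # IDF makes two passes over the term frequencies
idf = IDF().fit(tf)
tfidf = idf.transform(tf)

# Re-attach labels and train with the L-BFGS optimizer, which converged
# faster than SGD in the setting described above.
labels = raw.map(lambda r: float(r[0]))
data = labels.zip(tfidf).map(lambda lv: LabeledPoint(lv[0], lv[1]))
model = LogisticRegressionWithLBFGS.train(data)
```

Swapping in SVMWithSGD.train(data) from the same package yields the SGD-based SVM variant used in the comparison.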
V. CONCLUSION AND FUTURE WORK

Hadoop enables enterprises to handle the different Vs of Big Data — in particular, the data volume and velocity requirements are manageable with a Hadoop cluster. With this capability, Hadoop fills a gap in many existing enterprise landscapes — most commercial database vendors have realized this and have started to support Hadoop. However, we believe that the data lake has the potential to accommodate many more workloads as the Hadoop ecosystem evolves. This is most visible in the different SQL-on-Hadoop frameworks, which have improved significantly over the past years, as seen in the presented benchmarks. Due to these developments, analytics on large datasets is becoming the norm. A challenge remains the productivity and the ability to generate insight from data. Scalable machine learning approaches are essential to extract knowledge from data. Intelligent, data-driven services are the foundation of future innovations. A single vertical vendor and closed stack cannot provide the capabilities needed to fulfill all requirements. Hadoop provides efficient and economical storage in conjunction with a vast set of processing and analytics frameworks, enabling a wide range of novel automotive applications. In comparison to traditional data warehouses, it provides greater agility, supporting relational data, unstructured data and schema-on-read data organization. The vibrant open source ecosystem functions as a supplier of capabilities that are beyond the scope of a single vertically integrated system.

A Hadoop data lake is suitable for supporting typical automotive application workloads, such as SQL and machine learning applications. In our benchmarks we showed interactive response times for SQL-based data warehouse workloads on 120 billion records, while also illustrating the weaknesses of Hadoop SQL frameworks, such as the processing of join queries. Furthermore, we showed the potential of Hadoop-based machine learning frameworks, such as Spark's MLlib; using this tool, we were able to process a significantly larger dataset of high-dimensional text data, consisting of 110 million records, than with our previous R solution. However, several challenges remain: the maturity of the platform needs to increase to meet the requirements of the increasing number of applications and users. Many available tools are low-level, often immature, and require highly advanced skills to operate. Finding and developing these skills and establishing a culture and analytical mindset for supporting Big Data services remains challenging. Higher-level tools are emerging and hopefully can address this issue.

With the uptake of the data lake infrastructure, high standards for data governance and security are required. Further, sophisticated workload management systems for resource and data provisioning are needed to support the flexible and agile infrastructure usage required by data scientists. The complexity of analytical workloads will further increase by including, e.g., more unstructured data (images, videos) and spatial data critical for location-based services. Many IoT applications require close-to-realtime processing of incoming data feeds. Emerging machine learning algorithms, such as deep learning, are increasingly complex and demand even more memory and compute resources. An area of research is the use of hybrid query engines and the support for analytics across data residing on different platforms. While the current state of the art in analytics is the usage of hand-crafted features and supervised learning, we believe that more advanced analytical methods are required to handle the data deluge, e.g. topic modeling and deep learning.

ACKNOWLEDGMENTS

We acknowledge our colleagues at BMW and Clemson for providing valuable input and discussion on the paper: Matthew Cook, Sandeep Jeereeddy, Linh Ngo, and Jason Anderson. This work was supported in part by NSF Grant #1228312.

REFERENCES

[1] D. Laney, "3D data management: Controlling data volume, velocity, and variety," META Group, Tech. Rep., February 2001.
[2] J. Dorsey, "Big data in the driver's seat of connected car technological advances," http://press.ihs.com/press-release/country-industry-forecasting/big-data-drivers-seat-connected-car-technological-advance, 2013.
[3] "Apache Hadoop," http://hadoop.apache.org/, 2014.
[4] J. Dean and S. Ghemawat, "MapReduce: Simplified data processing on large clusters," in OSDI'04: Proceedings of the 6th Conference on Symposium on Operating Systems Design & Implementation. Berkeley, CA, USA: USENIX Association, 2004, pp. 137-150.
[5] "Apache Flume," https://flume.apache.org/.
[6] J. Kreps, N. Narkhede, and J. Rao, "Kafka: A distributed messaging system for log processing," in 6th International Workshop on Networking Meets Databases, Athens, Greece, 2011.
[7] M. Zaharia, M. Chowdhury, T. Das, A. Dave, J. Ma, M. McCauley, M. J. Franklin, S. Shenker, and I. Stoica, "Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing," in Proceedings of the 9th USENIX Conference on Networked Systems Design and Implementation, ser. NSDI'12. Berkeley, CA, USA: USENIX Association, 2012. [Online]. Available: http://dl.acm.org/citation.cfm?id=2228298.2228301

[8] A. Alexandrov, R. Bergmann, S. Ewen, J.-C. Freytag, F. Hueske, A. Heise, O. Kao, M. Leich, U. Leser, V. Markl, F. Naumann, M. Peters, A. Rheinländer, M. J. Sax, S. Schelter, M. Höger, K. Tzoumas, and D. Warneke, "The Stratosphere platform for big data analytics," The VLDB Journal, vol. 23, no. 6, pp. 939-964, Dec. 2014. [Online]. Available: http://dx.doi.org/10.1007/s00778-014-0357-y
[9] A. Thusoo, J. S. Sarma, N. Jain, Z. Shao, P. Chakka, S. Anthony, H. Liu, P. Wyckoff, and R. Murthy, "Hive: A warehousing solution over a map-reduce framework," Proc. VLDB Endow., vol. 2, no. 2, pp. 1626-1629, Aug. 2009. [Online]. Available: http://dx.doi.org/10.14778/1687553.1687609
[10] S. Jha, J. Qiu, A. Luckow, P. K. Mantha, and G. C. Fox, "A tale of two data-intensive paradigms: Applications, abstractions, and architectures," Proceedings of the 3rd IEEE International Congress on Big Data, vol. abs/1403.1528, 2014.
[11] "HPC-ABDS kaleidoscope of 350 Apache Big Data Stack and HPC technologies," http://hpc-abds.org/kaleidoscope/, May 2015.
[12] C. Baru, M. Bhandarkar, C. Curino, M. Danisch, M. Frank, B. Gowda, H.-A. Jacobsen, H. Jie, D. Kumar, R. Nambiar, M. Poess, F. Raab, T. Rabl, N. Ravi, K. Sachs, S. Sen, L. Yi, and C. Youn, "Discussion of BigBench: A proposed industry standard performance benchmark for big data," in Sixth TPC Technology Conference on Performance Evaluation & Benchmarking, ser. LNCS. Springer Berlin Heidelberg, 2014.
[13] G. C. Fox, S. Jha, J. Qiu, S. Ekanazake, and A. Luckow, "Towards a comprehensive set of big data benchmarks," 2015.
[14] R. Nambiar, M. Poess, A. Dey, P. Cao, T. Magdon-Ismail, D. Qi Ren, and A. Bond, "Introducing TPCx-HS: The first industry standard for benchmarking big data systems," in Performance Characterization and Benchmarking. Traditional to Big Data, ser. Lecture Notes in Computer Science, R. Nambiar and M. Poess, Eds. Springer International Publishing, 2015, vol. 8904, pp. 1-12. [Online]. Available: http://dx.doi.org/10.1007/978-3-319-15350-6_1
[15] "Apache Tez," http://hortonworks.com/hadoop/tez/, 2013.
[16] X. Meng, J. K. Bradley, B. Yavuz, E. R. Sparks, S. Venkataraman, D. Liu, J. Freeman, D. B. Tsai, M. Amde, S. Owen, D. Xin, R. Xin, M. J. Franklin, R. Zadeh, M. Zaharia, and A. Talwalkar, "MLlib: Machine learning in Apache Spark," CoRR, vol. abs/1505.06807, 2015. [Online]. Available: http://arxiv.org/abs/1505.06807
[17] J. W. Anderson, K. E. Kennedy, L. B. Ngo, A. Luckow, and A. W. Apon, "Synthetic data generation at scale with Hadoop," in IEEE International Conference on Big Data, 2014.
[18] T. Koslowski, "How technology is ending the automotive industry's century-old business model," Gartner Research, 2012.
[19] K. Fehrenbacher, "Cloudera gets proactive with Hadoop management," GigaOM, http://bit.ly/bmw-10m-connected-cars, 2013.
[20] S. A. Fayazi, A. Vahidi, G. Mahler, and A. Winckler, "Traffic signal phase and timing estimation from low-frequency transit bus data," under review, 2014.
[21] A. D. Angelica, "Google's self-driving car gathers nearly 1 GB/sec," http://www.kurzweilai.net/googles-self-driving-car-gathers-nearly-1-gbsec, 2013.
[22] J. Bruner, "The industrial internet: The machines are talking," http://radar.oreilly.com/2013/03/industrial-internet-report.html, 2013.
[23] Y. Bengio, I. J. Goodfellow, and A. Courville, "Deep learning," 2015, book in preparation for MIT Press. [Online]. Available: http://www.iro.umontreal.ca/~bengioy/dlbook
[24] V. K. Vavilapalli, "Apache Hadoop YARN: Yet Another Resource Negotiator," in Proc. SOCC, 2013.
[25] B. Zhang, Y. Ruan, and J. Qiu, "Harp: Collective communication on Hadoop," Technical Report, Indiana University, 2014.
[26] N. Marz, Big Data: Principles and Best Practices of Scalable Realtime Data Systems. Manning Publishing, 2014.
[27] Twitter, "Storm: Distributed and fault-tolerant realtime computation," http://storm-project.net/.
[28] S. Melnik, A. Gubarev, J. J. Long, G. Romer, S. Shivakumar, M. Tolton, and T. Vassilakis, "Dremel: Interactive analysis of web-scale datasets," in Proc. of the 36th Int'l Conf. on Very Large Data Bases, 2010, pp. 330-339. [Online]. Available: http://www.vldb2010.org/accept.htm
[29] M. A. Soliman, L. Antova, V. Raghavan, A. El-Helw, Z. Gu, E. Shen, G. C. Caragea, C. Garcia-Alvarado, F. Rahman, M. Petropoulos, F. Waas, S. Narayanan, K. Krikellas, and R. Baldwin, "Orca: A modular query optimizer architecture for big data," in Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data, ser. SIGMOD '14. New York, NY, USA: ACM, 2014, pp. 337-348. [Online]. Available: http://doi.acm.org/10.1145/2588555.2595637
[30] M. Kornacker, A. Behm, V. Bittorf, T. Bobrovytsky, C. Ching, A. Choi, J. Erickson, M. Grund, D. Hecht, M. Jacobs, I. Joshi, L. Kuff, D. Kumar, A. Leblang, N. Li, I. Pandis, H. Robinson, D. Rorke, S. Rus, J. Russell, D. Tsirogiannis, S. Wanderman-Milne, and M. Yoder, "Impala: A modern, open-source SQL engine for Hadoop," in CIDR. www.cidrdb.org, 2015. [Online]. Available: http://dblp.uni-trier.de/db/conf/cidr/cidr2015.html#KornackerBBBCCE15
[31] M. Armbrust, R. S. Xin, C. Lian, Y. Huai, D. Liu, J. K. Bradley, X. Meng, T. Kaftan, M. J. Franklin, A. Ghodsi, and M. Zaharia, "Spark SQL: Relational data processing in Spark," in Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, Melbourne, Victoria, Australia, May 31 - June 4, 2015, T. Sellis, S. B. Davidson, and Z. G. Ives, Eds. ACM, 2015, pp. 1383-1394. [Online]. Available: http://doi.acm.org/10.1145/2723372.2742797
[32] "Optimized Row Columnar (ORC) format," http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.0.0.2/ds_Hive/orcfile.html, 2014.
[33] "Parquet columnar storage format," http://parquet.io/, 2014.
[34] R Core Team, R: A Language and Environment for Statistical Computing, R Foundation for Statistical Computing, Vienna, Austria, 2013, ISBN 3-900051-07-0. [Online]. Available: http://www.R-project.org/
[35] "Pandas: Python data analysis library," 2015. [Online]. Available: http://pandas.pydata.org/
[36] S. van der Walt, S. Colbert, and G. Varoquaux, "The NumPy array: A structure for efficient numerical computation," Computing in Science & Engineering, vol. 13, no. 2, pp. 22-30, March 2011.
[37] "GraphLab Create user guide," 2015. [Online]. Available: https://dato.com/learn/userguide/
[38] S. Sridharan and J. M. Patel, "Profiling R on a contemporary processor," Proc. VLDB Endow., vol. 8, no. 2, pp. 173-184, Oct. 2014. [Online]. Available: http://dl.acm.org/citation.cfm?id=2735471.2735478
[39] "ScaLAPACK — scalable linear algebra package," http://www.netlib.org/scalapack/, 2014.
[40] "ARPACK," http://www.caam.rice.edu/software/ARPACK/, 2014.
[41] G. Ostrouchov, W.-C. Chen, D. Schmidt, and P. Patel, "Programming with big data in R," 2012, http://r-pbd.org/.
[42] "Distributing the singular value decomposition with Spark," https://databricks.com/blog/2014/07/21/distributing-the-singular-value-decomposition-with-spark.html, 2014.
[43] "H2O scalable machine learning," 2015. [Online]. Available: http://h2o.ai/
[44] J. M. Hellerstein, C. Ré, F. Schoppmann, D. Z. Wang, E. Fratkin, A. Gorajek, K. S. Ng, C. Welton, X. Feng, K. Li, and A. Kumar, "The MADlib analytics library: or MAD skills, the SQL," Proc. VLDB Endow., vol. 5, no. 12, pp. 1700-1711, Aug. 2012. [Online]. Available: http://dl.acm.org/citation.cfm?id=2367502.2367510
[45] B. Hindman, A. Konwinski, M. Zaharia, A. Ghodsi, A. D. Joseph, R. Katz, S. Shenker, and I. Stoica, "Mesos: A platform for fine-grained resource sharing in the data center," in Proceedings of the 8th USENIX Conference on Networked Systems Design and Implementation, ser. NSDI'11. Berkeley, CA, USA: USENIX Association, 2011. [Online]. Available: http://dl.acm.org/citation.cfm?id=1972457.1972488
[46] M. Soni and R. DelValle, "Myriad: Running YARN alongside Mesos," 2014, https://speakerdeck.com/mohit/running-yarn-alongside-mesos-mesoscon-2014.
[47] TPC-DS, "TPC Benchmark (TPC-DS): The new decision support benchmark standard," http://www.tpc.org/tpcds/, 2015.
[48] D. Y. Min, "Spark Terasort," https://github.com/DrakeMin/spark-terasort, 2015.
[49] R. Xin, P. Deyhim, A. Ghodsi, X. Meng, and M. Zaharia, "GraySort on Apache Spark by Databricks," http://sortbenchmark.org/ApacheSpark2014.pdf, 2014.

