
THE EVOLUTION OF BIG DATA:
NOSQL, HADOOP, SPARK & BEYOND

Best Practices Series

Cloudera (page 16): Data Engineering with Apache Hadoop
DataStax (page 17): NoSQL Is a No-Brainer
Cask (page 18): Future-Proofing Your Big Data Solutions


NOSQL, HADOOP, AND SPARK
ENRICH BIG DATA INITIATIVES

Best Practices Series


As the big data ecosystem continues to expand, new technologies are addressing the requirements for managing, processing, analyzing, and storing data to help companies gain the most benefit from the rich sources of information flowing into their organizations. From NoSQL databases to open source projects such as Spark, Hive, Drill, Kafka, Arrow, and Storm to commercial products offered on-premises and in the cloud, the future of big data is being driven by innovative new approaches across the data management lifecycle. The most pressing areas include real-time data processing, interactive analysis, data integration, data governance, and security.

The big data revolution is diminishing the sharp delineations between data types, handling all data types equally. This provides a unique opportunity to tap into all of this data to provide information to decision makers.

NEW PROBLEMS CALL FOR NEW SOLUTIONS
Addressing the need to store and manage data that does not fit neatly in rows and columns, NoSQL technologies are at the forefront, representing a broader world that connects to the internet at large, the Internet of Things, and clouds.

NoSQL databases can run on commodity hardware, support the unstructured, non-relational data flowing into organizations from the proliferation of new sources, and are available in a variety of structures that open up new types of data sources, providing ways to tap into the institutional knowledge locked in PCs and departmental silos. For example, the emerging blockchain technology is designed to store data, general-ledger style, in a highly distributed approach across the wider internet.

The four key database types that fall under the NoSQL category are key-value stores, which allow the storage of schema-less data, with a key and actual data; column family databases, which store data within columns; graph databases, which employ structures with nodes, edges, and properties to store data; and document databases, which enable simple storage of document aggregates.

And, despite what the name might imply, NoSQL database vendors are increasingly addressing their customers' need to use SQL as a primary language for querying data. While the NoSQL landscape is still relatively new, it is evolving quickly with new features for greater accessibility, interoperability, security, and governance, and showing signs of its future potential for the enterprise.


THE GROWING HADOOP ECOSYSTEM
Central to the big data technology landscape, Hadoop, which this year marked its 10-year anniversary, has expanded well beyond a platform for storage and batch processing of large quantities of data on commodity servers. The Apache Hadoop framework, consisting of Hadoop Common, the Hadoop Distributed File System (HDFS), Hadoop YARN, and Hadoop MapReduce, is a core component of most big data projects and of the creation of data lakes. In addition, there are now well over 100 open source projects in the greater Hadoop stack, augmenting the security, flexibility, and accessibility of Hadoop with additional enterprise features.

Prior to Hadoop, capturing and analyzing data of any kind meant using proprietary tools, which was an expensive and resource-intensive undertaking. Hadoop, as well as the open source ecosystem that surrounds it, puts a more cost-effective option for big data analytics within the reach of more users. However, the challenge for data managers will be acquiring the skills needed to build out these open source environments.

THE EMERGENCE OF SPARK
Spark, a data analytics framework rooted in the Apache Hadoop world, is one of the newest technologies in the Hadoop ecosystem. Some vendors even refer to Spark as an "analytics operating system," suggesting that it can form the foundation of a broad array of data analytics functions and applications that can be built on top of it.

Hadoop and Spark are part of the open source wave that continues to offer useful capabilities to enterprises of all sizes, providing a cost-effective and highly scalable means to package and deploy large and varied datasets.

Some of Spark's proponents point out that the open source framework picks up where Hadoop leaves off. For starters, Spark inherently supports real-time requirements and offers faster processing and a more robust analytics engine powered by in-memory parallel-processing capabilities. Among other benefits, the framework includes resident libraries that enable faster development of applications that target and process structured, semi-structured, and unstructured datasets. The Spark framework helps manage a variety of jobs, from traditional ETL to data lakes to the latest real-time streaming applications.

The major components of Spark include Spark SQL, its query access language, which reaches data sources via SQL queries; Spark MLlib, which provides predictive analytics capabilities; SparkR, which provides R access to Spark data; and GraphX, an API for graph computations with a built-in library of common algorithms.

COMPLEMENTING HADOOP
Spark is not intended to replace Hadoop but to complement it. Spark works well with the Hadoop Distributed File System and fits naturally within environments that already have Hadoop skillsets or tools.

The main distinction is that Spark is purely an analytics engine, while Hadoop is a data management and storage system. Spark does not require Hadoop for data management and storage, however. The benefit is that data managers can keep their current configurations, including Hadoop or other data environments, in place, without the need to undertake migrations.

Organizations often find themselves with silos of data and data frameworks; even Hadoop projects tend to end up in their own silos. The Spark framework is extremely versatile and can be deployed in many ways for many analytic functions. Spark capabilities can also be integrated into existing applications, such as those built on Java. Another key feature is Spark's support for Resilient Distributed Datasets (RDDs), which enable datasets to be managed and stored anywhere across the infrastructure, whether on disk or in memory. This also helps ensure high availability.

Data analytics has long been confined to specialized teams of analysts who tended to work remotely from other business teams. Spark may represent the biggest step yet toward the holy grail of analytics: ubiquitous enterprise access by all levels of employees. Spark is accessible to many players from across the enterprise, especially those who are concerned with managing, analyzing, and bringing data to the fore of a business strategy. Spark is also an optimum framework for scientists and analysts, as many complex capabilities are already built into its analytics engine. The framework supports well-known programming languages, and applications can be quickly built on top of Spark's foundational capabilities, which include machine learning, real-time processing, and graph processing.

Spark is still a maturing technology platform, meaning there are aspects of the framework that still need to be learned by data managers and finessed by the open source community and supporting vendors. For example, some users report that aspects of the solution are not user-friendly and require some noodling to get around. This is a natural phenomenon with all emerging technologies, and as Spark's base of developers and vendors continues to enhance its functionality, it will take a more central place in the enterprise.

THE PRESSURE TO COMPETE ON ANALYTICS
This is an era in which organizations are under pressure to compete on analytics in a hyper-competitive global economy. Many are building big data stores but lack effective ways to convert that data into actionable insights, when and where they are needed.

The combination of Hadoop big data processing and the larger Hadoop ecosystem, including Spark-accelerated analytics and the wealth of NoSQL databases, provides capabilities targeted to business problems and opportunities.

Joe McKendrick
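A minimal sketch of the Spark SQL and in-memory caching capabilities described above, assuming PySpark and a hypothetical events.json file containing a page field; it is illustrative rather than a production pattern.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("spark-sql-sketch").getOrCreate()

# Load a semi-structured dataset and keep it in memory for repeated queries.
# The file name and "page" field are invented for this example.
events = spark.read.json("events.json").cache()

# Query the same data through Spark SQL.
events.createOrReplaceTempView("events")
top_pages = spark.sql("""
    SELECT page, COUNT(*) AS hits
    FROM events
    GROUP BY page
    ORDER BY hits DESC
    LIMIT 10
""")
top_pages.show()

spark.stop()
```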



sponsored content

Data Engineering with Apache Hadoop

WHAT IS DATA ENGINEERING?
Data engineering is the process of building analytic data infrastructure or internal data products to support the collection, cleansing, storage, and processing (in batch or real time) of data, for answering business questions, usually by a data scientist, a statistician, or someone in a related role.

Examples of data engineering include, but are not limited to:
• Building data pipelines that aggregate data from multiple sources
• The productionization, at scale, of machine-learning models
• The creation of pre-built tools that assist in the query process (e.g., UDFs)

Data engineers rely on the Apache Hadoop ecosystem, including components such as Apache Spark, Apache Kafka, and Apache Flume, as the foundation for this infrastructure. Regardless of the use cases and components involved, this infrastructure should be compliance-ready with respect to security, data lineage, and metadata management.

This Data Engineering eBook walks through technical concepts pertaining to building and maintaining analytic data infrastructure on a Hadoop-powered enterprise data hub. We introduce some of these concepts at a high level here, but dive deeper in the eBook.

ARCHITECTURAL PATTERNS FOR NEAR-REAL-TIME DATA PROCESSING
Evaluating which streaming architectural pattern is the best match to your use case is a precondition for a successful production deployment. In this eBook, we discuss four major streaming patterns, and how to implement those patterns architecturally.

FRAUD DETECTION
To design an effective fraud-detection architecture, look no further than the human brain (with some help from Spark Streaming and Kafka). At its core, fraud detection circles around the detection of anomalies and reactions to those anomalies.

Effective fraud-detection architecture requires that three subsystems work cohesively to detect anomalies in streams of events: operationalizing for real time, stream-processing systems, and offline-processing systems.

NEAR-REAL-TIME SESSIONIZATION WITH SPARK AND HADOOP
In this section, we demonstrate and walk through common and advanced Spark Streaming functionality via the use case of near-real-time sessionization of website events, then loading stats about that activity into Apache HBase, and finally populating graphs in your preferred BI tool for analysis.

APACHE KAFKA FOR BEGINNERS
Apache Kafka is creating a lot of buzz these days. While LinkedIn, where Kafka was founded, is the most well-known user, there are actually many companies successfully deploying the technology. Now that the word's out, everyone wants to know: What does it do? Why does everyone want to use it? How is it better than existing solutions? Do the benefits justify replacing existing systems and infrastructure? In this section, we answer those questions and more.

TRANSLATE MAPREDUCE TO SPARK
Hadoop was originally designed for large-scale log processing and batch-oriented ETL operations. With broadening Hadoop usage today, alternative architectures like Impala and Spark have been created to accommodate new operations. Spark has grown so much that it is poised to succeed MapReduce as Hadoop's general-purpose computation paradigm. Fortunately, this section of the eBook explains how it's entirely possible to re-implement MapReduce-like computations in Spark.

TUNE APACHE SPARK JOBS
When writing Apache Spark code and paging through the public APIs, one often comes across words such as transformation, action, and RDD. Similarly, if things start to fail, or the application takes an inordinate amount of time, a new vocabulary of words like job, stage, and task gets thrown around. To reiterate, understanding Spark at this level is essential to executing good Spark programs, and by good, we mean fast! In this section, we cover the basics of how Spark programs are actually executed on a cluster, followed by practical recommendations on what Spark's execution model means for writing efficient programs.

SUMMARY
If modern strategies in data engineering are interesting to you, download this eBook and feel free to reach out to Cloudera with any questions you may have.

ABOUT CLOUDERA
Cloudera delivers the world's fastest, easiest, and most secure platform for data management and analytics, built on Apache Hadoop and the latest open source technologies.

CLOUDERA
www.cloudera.com
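A minimal sketch of the kind of MapReduce-style computation the "Translate MapReduce to Spark" section refers to, re-expressed with Spark's RDD API; it assumes PySpark and a hypothetical HDFS input path, and it also shows the transformation/action distinction mentioned under "Tune Apache Spark Jobs."

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("mapreduce-to-spark-sketch").getOrCreate()
sc = spark.sparkContext

lines = sc.textFile("hdfs:///data/logs/sample.txt")    # transformation: lazy read
counts = (
    lines.flatMap(lambda line: line.split())           # "map" phase
         .map(lambda word: (word, 1))
         .reduceByKey(lambda a, b: a + b)               # "reduce" phase
)
for word, count in counts.take(10):                     # action: triggers the job
    print(word, count)

spark.stop()
```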

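For readers new to Kafka, a bare-bones produce-and-consume round trip looks roughly like the following; the sketch assumes the kafka-python client, a broker reachable at localhost:9092, and an invented web-events topic.

```python
from kafka import KafkaProducer, KafkaConsumer

# Publish one event to a (hypothetical) topic.
producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("web-events", b'{"user": "u123", "page": "/pricing"}')
producer.flush()

# Read events back from the beginning of the topic.
consumer = KafkaConsumer(
    "web-events",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    consumer_timeout_ms=5000,   # stop iterating after 5s with no new messages
)
for message in consumer:
    print(message.offset, message.value)
```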


sponsored content

NoSQL is a No-Brainer
For any business that wants to successfully compete in today's digital economy, it is not a question of if but rather how they need to evolve their business to survive. Today's customers won't give you a second chance when it comes to digital customer experiences. If you fail to deliver on their expectations, they will leave and not come back.

Providing an amazing customer experience is a must. To do this for your cloud applications that provide real-time value at epic scale to your customers, you need a database platform that is distributed, highly responsive, and intelligent.

WHY NOSQL OVER RDBMS?
According to The Forrester Wave: Big Data NoSQL, Q3 2016 report, a survey of more than 2,000 data and analytics technology decision makers found that more than 60% of enterprises already have implemented, plan on implementing, or are expanding/upgrading their implementation of NoSQL solutions within the next 12 months. "NoSQL is not an option; it has become a necessity to support next-generation applications," wrote Noel Yuhanna, Principal Analyst at Forrester Research.

WHAT DO CLOUD APPLICATIONS NEED?
A cloud application is defined as an application with many endpoints, including browsers, mobile devices, and/or machines, that are geographically distributed, intensely transactional, always available, and instantaneously responsive no matter the number of users or machines using the application. Customers expect personalized information at their fingertips and the ability to take action when and where they want.

While each cloud application is unique, these are the foundational set of database requirements that must be met for the business to successfully compete in the market:
• Distributed to ensure 100% uptime
• Responsive to minimize latency and linearly scale up or down as the business demands
• Intelligent to accommodate different types of data models and workloads, while managing issues as they arise

And, all of this has to be in a single, secure, enterprise-ready platform that can be intelligently managed and monitored with ease.

RDBMS technologies fail to deliver on these expectations, and businesses are turning to NoSQL databases to handle these requirements.

HARNESS THE TRUE POWER OF NOSQL WITH DATASTAX ENTERPRISE (DSE)
Only DataStax Enterprise lets you harness the true power of a multi-model NoSQL database to deliver an always-on customer experience, faster performance, and powerful contextual recommendations. All this while supporting agile DevOps with management tools that make it easy for the Ops team to monitor and operate your production environments.

While high availability may be achieved with legacy RDBMS failover procedures, true continuous availability for a cloud application requires an architecture that was designed to never allow single points of failure. DataStax Enterprise builds on the core architecture of Apache Cassandra, which sports a masterless design where every node in a database cluster operates independently with respect to database operations. This results in 100% uptime. Not 99.99% uptime, 100% uptime.

[Figure: Digital Customer Experience Challenges]

"DataStax keeps us in business," says Christos Kalantzis, Cloud Database Engineering Manager at Netflix. He recounted a horror story about how Netflix went down for more than 48 hours when its Oracle database failed. "We couldn't risk that happening again. Oracle wasn't built for the cloud and it doesn't work in the cloud at the level we need it to."

Digital natives and 100-year-old enterprises alike are already starting to deliver value derived from connected, not collected, data. This ability to understand a customer's interactions across business silos and digital channels, and to personalize and offer highly relevant and contextual experiences, is fast becoming the norm.

Customers won't give your company a second chance to deliver a great digital customer experience, and with DataStax, you won't need one. Build and manage cloud applications that exceed your customers' expectations for speed, access, and relevant experiences, by harnessing the true power of a multi-model NoSQL database to deliver real-time value at epic scale.

DataStax Enterprise: Solve the Toughest ...

DATASTAX
www.datastax.com
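To make the masterless, replicated design described above concrete, here is a minimal sketch using the open source DataStax Python driver (cassandra-driver); the contact points, keyspace, and table are invented for illustration. Any of the listed nodes can coordinate a request, and the replication factor controls how many copies of each row the cluster keeps.

```python
from uuid import uuid4
from cassandra.cluster import Cluster

# Any node can serve as coordinator; there is no master to fail over from.
cluster = Cluster(["10.0.0.1", "10.0.0.2", "10.0.0.3"])
session = cluster.connect()

# Keep three replicas of every row (keyspace and table names are hypothetical).
session.execute("""
    CREATE KEYSPACE IF NOT EXISTS shop
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3}
""")
session.execute("""
    CREATE TABLE IF NOT EXISTS shop.orders (
        order_id uuid PRIMARY KEY,
        customer text,
        total double
    )
""")

session.execute(
    "INSERT INTO shop.orders (order_id, customer, total) VALUES (%s, %s, %s)",
    (uuid4(), "acme", 250.00),
)
for row in session.execute("SELECT customer, total FROM shop.orders LIMIT 5"):
    print(row.customer, row.total)

cluster.shutdown()
```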
sponsored content

Future-Proofing Your Big Data Solutions

Big Data turned 10 years old this year, and much has changed since the invention of Apache Hadoop. Early on, we saw new projects in the ecosystem develop quickly that filled in major functional areas necessary to effectively process the volume, velocity, and variety of data never before seen. In the years following, additional projects have been developed to round out much of the major features required to enable varied use cases and developer access, most recently with the rise of Apache Spark.

DEVELOPERS AND ORGANIZATIONS DON'T WANT TO BUILD INTEGRATIONS; THEY WANT TO BUILD SOLUTIONS
As Hadoop adoption started to take off, vendors began delivering specialized integration tools, for ETL, data integration, security, governance, etc., to simplify the development of big data solutions. However, in most cases today, creating a big data application requires integrating between a large and growing number of these individual tools, forcing big data developers to shift their attention from application and data logic, where most of the IP and value resides, to figuring out integration logic. As a result, developers and organizations end up spending valuable time on infrastructure and integration tasks while dealing with a multitude of vendors and the intricacies of their tools, rather than focusing their energy on applications and insights.

IT'S CHALLENGING TO MOVE FROM PROTOTYPE TO PRODUCTION
Just like traditional 3-tier applications, distributed data applications don't just magically appear; they are built, tested, optimized, and staged on their way from a prototype to a production-grade environment. But because of the significant differences between developing code on a laptop or workstation and the highly distributed, multi-node production environment, a significant amount of recoding and configuration changes are often required to accommodate these variances. This dramatically slows time to market. Some of the challenges might not even lie within the skillset of the original developers, especially as production-specific requirements (packaging, versioning, etc.) come into the picture. As big data solutions evolve, organizations need to clear this hurdle, efficiently operationalize data apps from prototype to production, and enable a self-service environment for business users before they can start to derive value from their data.

[Figure: The Cask Data Application Platform (CDAP), an integration layer on top of Hadoop]

THE SOLUTION IS UNIFIED INTEGRATION FOR BIG DATA WITH CDAP
The Cask Data Application Platform (CDAP) has been designed to help solve the issues that arise when moving from prototype to production, whether on-premise or in the cloud, on a laptop or on a 1,000-node cluster. CDAP is a truly unified integration platform that combines application management and data integration capabilities with a code-free self-service environment and enterprise governance. CDAP ensures data and process consistency between applications and underlying infrastructure technologies across multiple environments and between different parts of the IT organization.

To future-proof the Hadoop applications built on CDAP, the 100% open source and extensible platform acts as an abstraction layer, separating integration logic from application and data logic. Future changes to the data or the business logic of the application become much easier, streamlining ongoing IT operations for big data solutions.

CDAP also includes Cask Hydrator, a drag-and-drop extension to CDAP that enables users to quickly and easily create code-free data pipelines to ingest and transform data in Hadoop, accelerating the process of building enterprise data lakes. A second CDAP extension, called Cask Tracker, provides the ability to discover data, audit data access, and trace data lineage, all necessary for enterprise governance.

CDAP enables rapid development and deployment of big data applications, including the ability to easily move them into production, while meeting stringent governance requirements. CDAP is the only platform you need not only to ensure your Hadoop environment is on an efficient and safe path to production, but also to easily evolve along with the ecosystem while future-proofing the big data solutions you are building today. For more information about these products, please go to the Cask website, and to stay updated on product and company news, follow us @caskdata.

CASK
www.cask.com

