Вы находитесь на странице: 1из 26

Big Data’s Velocity and

Variety Challenges

Ross Mason, John D’Emic, Ken Yagen, Dan Diephouse, Alex Li

1
Table of Contents
1 Introduction
2 Emerging Architecture for High-Velocity Big Data
Trend towards Real-Time
Real-Time Data Capture at the Edge of the Network
Anypoint Platform & Real-time Data Ingestion
Application Analytics – time-sensitive and low-latency data
Capture Exhaust Data – low-value data
Anypoint Platform & Real-Time Processing
Apache Kafka, Storm and Spark
Anypoint Platform and Big Data Streaming
3 Emerging Architecture for High-Variety Big Data
Storing Unstructured Data and NoSQL Databases
Select Anypoint Platform Use Case with NoSQL - Document Storage
Anypoint Platform’s strengths for Integrating NoSQL Databases
APIs for High-Variety Big Data
API Use Case with NoSQL
4 Conclusion and MuleSoft’s Big Data Integration Solution

2
1 Introduction
In 2001, Doug Laney described big data as cost-effective and innovative processing of
high-volume, high-velocity and high–variety information for insights and decisions. More
than a decade later, Hadoop and NoSQL ecosystems have flourished and been popularly
adopted by an increasing number of enterprises. According to a report commissioned by
Dell, 96% of midmarket companies are embracing the rise of big data. Kafka, Spark, Storm
and many new innovations, in addition to Hadoop and NoSQL databases, are providing us
with more pieces to solve the puzzle. However, organizations are still wrestling with two of
the three (in)famous Vs.

Velocity: Real-time business intelligence (BI) and real-time machine learning


applications are essential for analytics-driven organizations. Meaningful insights and
decisions at the right time and right place require both low-latency data ingestion and
high-speed stream processing. Traditional data integration / ETL tools and hadoop’s
inherent batch-processing model are intrinsically incompatible with real-time big data
applications.

Variety: Mixing and matching unstructured data from disparate sources and connecting
multiple NoSQL and relational databases could be extremely complex. In a recent
Gartner report, 1/3 of companies surveyed mentioned “obtaining skills and capabilities
needed” and “integrating multiple data sources” as two of their top four challenges.
Learning new query interfaces and hand-coding point-to-point data integration could
mean waste of precious IT resource and slow development.

At the same time, companies are embracing integration frameworks and APIs to modernize
their legacy IT systems, addressing two similar issues –process automation by switching to
event-based flows and integration of various applications and data sources on premise and
in the cloud. As big data evolves to become an integral part of the enterprise infrastructure,
a unified architecture and development solution is emerging. MuleSoft’s Anypoint platform
plays a crucial role in helping CIOs, architects and developers solve the velocity and variety
challenges around big data.

3
2 Emerging Architecture for High-Velocity Big Data
2.1 Trend towards Real-Time

Two trends are driving the increasing adoption of real-time architecture.

1. Centralized data processing drives real-time ingestion and shift from ETL to
ELT

We are increasingly shifting the heavy-lifting of data transformation and cleansing work
to Hadoop or other clusters, because we are generating and collecting data at a higher
speed. So, the traditional extract-transform-load looks more like extract and load in real-
time, followed by massively-parallel processing of data. Particularly for big data, while
some use cases may leverage batch capabilities, high-latency handling (outside big data
infrastructure) significantly reduces the value and increases costs in many emerging
high-impact use cases. Realizing Data Hub or Data Lakes would require real-time
capabilities in many cases.

2. Time value of data and reactive programming drive real-time processing and
presentation of data

For end users of data, real-time is the ability to react to something as soon as it happens.
In recent years, reactive applications have been an emerging trend gathering great
fanfare, and being “event-driven” is one important pillar of the emerging paradigm.
Typically, when we want to achieve real or near-real time, we design architectures that
can respond to data as it arrives, without necessarily persisting it to a database. Being
real-time is not only a necessary and inevitable step to make machines behave, think
and decide more like humans. It is also driven by general users’ psychological need and
real-world economic implications. People hate waiting, and a query that takes minutes or
hours to respond is unbearable. Delays make information less accurate and decisions
less relevant.

Below is a summary of rationale behind adopting real-time big data architectures throughout
data cycle of capture - ingestion - processing - presentation.

• Capture: Low-latency, high-throughput data require real-time data capture and


processing. For example, IoT are generating large amounts of machine data that
require continuous handling at the edge.

4
• Ingestion: High-latency data ingestion into Hadoop and other NoSQL databases incurs
unnecessary costs. First, staging and conducting ETL on a large batch of data occupy
more expensive memory and processing inside traditional databases (compared to
Hadoop). Secondly, managing hundreds of batch data ETL and data quality processes
across various data structures and sources could be costly within your IT organization.
This is also what drives companies to shift ETL to ELT with transformations inside
Hadoop

• Machine Learning / CEP: The value and impact of machine learning applications rely
on frequent refreshes of rules / models and low-latency ingestion of data. Real-time
machine learning applications are about detecting frauds when credit cards are swiped,
inferring operational anomalies when or before they happen, and rearranging ads when
a user clicks into the website.

• BI / Analytics: A modern global distributed workforce requires 24/7 real-time query and
access to BI / Analytics based on up-to-date data.

More and more, big data demands low latency architectural solutions. The choice between
two paradigms for data processing, batch and stream, determines latency of applications.
Processing a batch of terabytes of data would certainly require time and is fundamentally
high-latency. Stream processing conducts computation on fragments of data as they arrive.
The Anypoint Platform supports real-time capture, ingestion and processing of big data in
this emerging category of architectures.

2.2 Real-Time Data Capture at the Edge of the Network

One of the emerging sources of big data is smart devices. Projections from CISCO and
many other organizations depict an exploding growth for Internet of Things (“IoT”), which
implies both great opportunities and challenges to big data integration and processing. Each
smart device would continuously generate a large amount of data. To realize value from
IoTs, the key would be to form the “network” that captures, transforms and routes data
generated from the embedded devices.

MuleSoft’s Anypoint Platform is already running on-premises and in the cloud and can now
run virtually everywhere thanks to Anypoint Edge. This initiative not only focuses on allowing
Anypoint Platform to run on lightweight embedded architectures, like the Raspberry Pi, but

5
also to support protocols and devices relevant to the embedded world, like MQTT and
Zigbee.
For example, embedding Anypoint Edge in routers can help capture requests and routing
data in real-time, and achieve monitoring and big data analytics.

   
Anypoint Edge brings lightweight and crucial capabilities to embedded devices, such as
routers, machines, and sensors.

• Edge supports and sends real-time data stream through light-weight wire protocols,
such as MQTT

• Edge’s ability of mapping data across various structures and MuleSoft’s APIkit enable
you build an API layer for embedded devices, making integration into broader
information flow easier and more flexible

• Edge can publish derivative events to other systems in the cloud or on premises and
receive action events from these systems

6
• Edge can be used to monitor time series events and take immediate actions based on
rules

• Like MuleSoft’s other offerings, AnyPoint Edge is easy to use develop for and easy to
deploy

With the capacity to run on limited hardware platforms and interact with the physical world
through smart devices, Anypoint Edge, the first embedded integration platform, has all it
takes to deliver in the domains of distribution, decoupling and smart routing, to enable
capturing and light transformation of real-time machine data for big data applications.

2.3 Anypoint Platform & Real-time Data Ingestion

An ETL-oriented data ingestion model would mean accumulating data in a staging area,
conducting transformations and cleansing, and then loading the data into the final
warehouse in one batch. This model is not ideal for many big data use cases. Before you
decide on whether to adopt real-time data ingestion, the following questions are worth
considering:

• Is your big data time sensitive, or low latency and high volume?
• Are you using expensive database to stage low-value data before NoSQL / Hadoop?
• Are you using the data collected for real-time applications?

If your answer to any of them is yes, you should consider implementing an architecture that
enables real-time data ingestion. Increasingly, those big data use cases fall into those three
scenarios. For example, social media data are usually time-sensitive. Web-log data is low-
latency and high-volume. Machine data is usually relatively low-value data, which would not
validate any expensive processing. Today, many big data applications are real-time, e.g.
fraud-detection, customized ads, recommendation engines, predictive diagnostics, etc. As a
result, organizations are increasingly collecting new forms of data in real-time and
depositing them directly into NoSQL databases, such as HDFS, Cassandra and MongoDB,
following an ELT model.

MuleSoft’s Anypoint Platform, with both event-driven and batch processing capabilities, is
best fit to help enable the switch from batch ETL to real-time data ingestion. For example,
for e-commerce and other retail companies, there are usually constant updates to their
massive online product catalogs with detailed information (SKUs, inventory, specs, reviews,

7
etc.). A modern architecture can leverage Anypoint Platform to facilitate data flow from
various sources into Cassandra, which could expose the most up-to-date product info to
web / mobile apps and other applications through APIs, designed and managed by
MuleSoft’s API Platform. Following are two common use patterns of MuleSoft and NoSQL
databases for real-time data ingestion – collecting analytics data and exhaust data.

2.3.1 Application Analytics – time-sensitive and low-latency data


 
Analytic data for modern web, mobile and enterprise applications can be vast and
unstructured, making it awkward to capture and operate with a traditional relational
database. Analytics data, once captured, can pilot the operation of your applications in real-
time. Insight into traffic patterns, for instance, can control the dynamic scaling of virtual
machines hosting an application. Site appearance, including the choice and placement of
ads, can be tailored for users based on how their behavior clusters them to other users of an
application.

NoSQL solutions such as MongoDB, Cassandra and Riak provide more natural storage
mechanisms for such data, but getting the data into them can be tricky. Some capabilities of
the Mule runtime along with connectors for all major NoSQL stores, including MongoDB,
Redis, Cassandra and Neo4J can gather this analytics data:
 
• Application metrics must be typically produced and gathered asynchronously. This is
often accomplished using a messaging fabric like AMQP, WebSockets or JMS. Mule’s
event driven architecture and support for almost every messaging protocol make it a
natural choice to persist analytics data published on such a layer.

• MuleSoft’s asynchronous processing scope, notification system, Custom Business


Event and wiretap features make it easy to capture analytics data for MuleSoft
applications that can be inserted into Redis or Cassandra.

8
2.3.2 Capture Exhaust Data – low-value data
 
The plummeting cost of disk storage is making it feasible, and often desirable, to adopt a
“store everything” mentality. Many companies are capturing all data since information that
seems useless today might be priceless in the future. Organizations can’t afford to throw
away data that could be mined later on, but they can afford to store it. MuleSoft’s position in
many enterprise architectures, which is typically as a mediation or proxy layer, makes it a
natural place to capture this data. It can be challenging, however, to store this data as
message formats and protocols offer differ widely between endpoints and flows.

Schema-less NoSQL solutions like MongoDB and Cassandra are ideal candidates to store
unstructured data such as messages being proxied by Anypoint Platform. This is simplified
by:
 
• MuleSoft’s notification system, asynchronous messaging scope and wiretapping
features facilitate asynchronous message insert into MongoDB and Cassandra using
the Cassandra and MongoDB connectors.

9
Overall, if you are storing and plan to leverage time-sensitive, low-latency or low-value data,
a real-time architecture enabled by MuleSoft should be considered, to enable high return
and short time to market on your big data projects. Penn Mutual, one of MulesSoft’s
customers, adopted MuleSoft and Cassandra. Switching from legacy batch-mode to low-
latency data ingestion enabled Penn Mutual to supply timely data to external partners.

2.4 Anypoint Platform & Real-Time Processing


 
Real-time stream processing is the inevitable next stage of big data evolution towards value
realization through real-time decisions and actions. Instead of processing server log files
hourly, companies want to know what just went wrong NOW. Instead of daily customer
preference reports from website clickstreams, companies want to deploy customized ads
and deals to the current customers while they are visiting the site NOW. Some other use
cases include trading stocks using large-volume real-time data streams, network monitoring,
fraud and security breach detection, smart grid and energy conservations, etc. However,
even though real-time stream processing has been put in great use at Google, Linkedin and
Twitter, implementation of an easy stream processing solution at other companies given skill
and resource constraints is still an on-going debate.

ETL and Hadoop are intrinsically batch-oriented, and are unfortunately not the solution for
real-time big data processing. The strength of Hadoop lies in its parallel storage and
processing, which leverages MapReduce and replication to simplify analyses on very large
data sets. However, each MapReduce batch job requires slow disk I/O. Replication and
serialization also contribute to latency in processing. As a result, ad-hoc queries are slow

10
and stream processing is hard. ETL tools are in general not designed for “always on”. For
each batch, ETL tools set up connections, parallel processes, load the data into memory,
and conduct data quality and transformations. There is usually a start and an end to each
ETL job.

Real-time processing is therefore the immediate need in many practical big data
applications. In recent years, we have seen an emerging universe of new solutions, like
Amazon Kinesis, Storm, Cloudera’s Impala, Spark and Tez, rising to address the issue.
Kafka, Storm and Spark are three open source projects that could play significant roles in
real-time analytics and machine learning applications.

2.4.1 Apache Kafka, Storm and Spark

Recently, we have seen an emerging real-time big data processing architecture using Kafka
and Storm / Spark. Kafka and other message brokers can serve as real-time data
aggregator and log service provider, offering replay and broadcast capabilities. Storm and
Spark offer processing capabilities for large-volume stream data. The missing piece is how
to integrate data flow into the messaging system and incorporating results of the stream
processing into broader information flow that powers applications and analytics. ESB and
API are architectural tools that fill the gap. MuleSoft’s event-based Anypoint Platform, used
for data ingestion and API publishing, together with appropriate architecture (e.g. Mule-
Kafka-Storm) could make real-time streaming applications easier to implement and future-
proof.

11
Apache Kafka
 
Kafka is an open-source distributed, partitioned, replicated commit log service. It is used as
a high-performing messaging broker system for real-time data applications. Many
(“producers”) can publish messages to Kafka in different categories (“topics”). Messages
flow into partitions (for scalability of Kafka clusters), and append to ordered, immutable
sequences – commit logs. All published messages are maintained, consumed or not, for a
configurable period of time. Processes (“consumers”) can subscribe to topics and process
the feed of published messages.
   

   
Traditionally, messaging systems fall into two models – queuing and publish-subscribe. In a
queue, each message goes to one consumer, vs. pub-sub would broadcast messages to all
consumers. Kafka can support both. “Consumer groups” can be defined on Kafka. And each
message would only be delivered to one process in any consumer group, achieving queuing
(if all processes are in the same group).

In the traditional messaging use cases, Kafka is compared to ActiveMQ or RabbitMQ. In the
emerging big data use cases, Kafka is gaining popularity in helping to address real-time
data aggregation and pub-sub feeding. For example, website activities such as views,
searches and other interactions are tracked and published in real-time to Kafka, with one
topic for each activity. The feeds then can be subscribed and applied in real-time monitoring,
processing, and loading into Hadoop or NoSQL stores for offline analyses. This is a typical
high-volume, low-latency streaming application. Operational metrics monitoring and
processing is another example. Linkedin developed Kafka initially, and has used it
extensively for activity stream data and optional data processing, powering products like

12
LInkedin Newsfeed and offline analytics. Airbnb uses Kafka for event pipeline and
exception tracking. Other users include Square (metrics, logs, events), Box (production
analytics pipeline and monitoring), Tumblr, Spotify, Foursquare, Coursera, Hotel.com, and
many others.

Apache Storm

Storm is an open-source distributed real-time computation system for processing high-


velocity, large stream of data. It is designed to integrate with existing queuing and bandwidth
systems and processes unbounded sequences of tuples (“streams”).

A Storm topology is a graph of computation, with each node containing processing logic and
links defining data flow. The sources of streams are called “spouts” and processing nodes
called “bolts”. A storm cluster is structured very similar to a Hadoop cluster, although
adapted for real-time processing. While MapReduce jobs run on Hadoop are batch-oriented,
and would eventually finish, Storm topologies process messages forever until they are killed.
Frameworks are available for integration with JMS and RabbitMQ / AMQP, which are well-
supported by Anypoint Platform.

Storm has been adopted by Twitter, the Weather Channel, WebMD, Spotify, Groupon,
RocketFuel and many others for various real-time applications. For example, at Yahoo!,
Storm empowers stream / micro-batch processing of user events, content feeds and
application logs. At Spotify, Storm powers music recommendations, monitoring, analytics
and targeting.

13
Apache Spark

Spark is an open-source cluster-computing environment similar to Hadoop, but with


differences that make it superior for certain use cases.

Spark’s in-memory cluster computing enables caching of data to reduce latency of access.
Instead of following a strict i/o – map – reduce – i/o chaining cycle in Hadoop, you can
perform map or reduce in any order without the costly i/o in between. This significantly
speeds up iterative processing jobs and interactive queries. Spark applications are easier to
write, with java, scala and python as three programming language options. Spark also
powers a stack of high-level tools for machine learning (MLib), GraphX (graph data), SQL
(Spark SQL) and streaming (Spark Streaming). Spark is increasingly becoming a more
preferred processing engine in the Hadoop ecosystem. It can be run on YARN, and read
data from HDFS, HBase and Cassandra. The in-memory design of Spark and its integration
with Hadoop are making it the potential engine that unifies batch and real-time processing.
Although it is still in its early-stage of adoption and may lack some enterprise-level support
and polish, it is quickly gaining momentum. Groupon, Autodesk, Alibaba Taobao, Baidu and
many others are all users of Spark.

   

2.4.2 Anypoint Platform and Big Data Streaming

A few characteristics of Anypoint Platform make it a perfect solution for lightweight real-time
data transformation and integration tool, to facilitate data inflow and outflow around Kafka-
Storm / Spark architecture.

14
Best-in-class performance: It can process up to tens of thousands of messages per
second (depending on data structure, transformation, hardware specs, etc.), capable of
feeding streams of data into message brokers for aggregation. IT also offers a stateless
architecture so can scale horizontally on commodity hardware.

Broad transformation and connectivity functions: It provides easy connectivity and


mapping among virtually all business data sources and services, datastores (traditional and
NoSQL), messaging systems and data structures. For a list of connector offerings, please
refer to below.

All-in-one integration architecture-level solution: Anypoint Platform, which includes the


API platform and Anypoint Edge, support and cover real-time data capture, batch
processing, transformation and integration throughout all data flows across your
organization, on premise or in the cloud or in edge devices. You can use MuleSoft to bring
in all data streams into Kafka-Storm or other real-time data application processing system.

Real-time monitoring of operational metrics is one example of how Anypoint Platform can
help integrate data to power real-time big data application. In many organizations, it is
already facilitating data flow among various operational data sources and storage, such as
CRM, ERP, HR, marketing, finance and web apps for thousands of organizations. These
data flows could be extended and published to Kafka in real-time, as well as deposited
directly into traditional and NoSQL databases for mining and future usage. Kafka can feed
the aggregated stream data into Storm or Spark for processing, such as combining,
comparing, slide-window analyses, or applying machine-learning models. The real-time
results would then flow through Anypoint Platfrom to monitoring tools or automate decisions.
   

15
With AnyPoint Edge facilitating capture of real-time machine data, Mule-Kafka-Storm
architecture can present a solution to real-time machine-data processing.

In summary, as organizations start to evolve to modernize data architecture towards real-


time applications, crucial changes must be adopted to capture, integrate, store and process
low-latency and high-volume data. MuleSoft provides the platform and products to support
your adoption of new big data technologies, and seamlessly modernize your IT system for
both small-data and big-data real-time integration throughout data lifecycle.

• Real-time capture: Anypoint Edge and API Platform + embedded devices


• Real-time ingestion: Anypoint Platform (on-prem or cloud) + NoSQL databases
• Real-time processing: Anypoint Platform + messaging queue + stream processing
engines

3 Emerging Architecture for High-Variety Big Data


A few years ago, we needed to make the decision of what data to store first before
designing the system to store, cleanse, and manipulate the data into the right form. With the
rise of big data, many companies have begun to store data of all types and structures, with
the hope that the data will be of use in the future. The number of data sources exploded –
social media, log, machine, images, videos, texts, and documents. The various ways of
representing the data also have increased, XML, JSON, binary, graph, etc. Various storage
solutions to accommodate these different type and structures, naturally, were introduced.

16
The high-variety of NoSQL storage solutions therefore added great complexity to handling of
high-variety big data. Integrating NoSQL databases present companies with three great
challenges:

• Selection of the right NoSQL database: what is appropriate for storing documents may
be inefficient for storing machine data or log data

• Skill and knowledge gap: internal IT team may not have the knowledge and expertise to
build connection and queries for various new types of NoSQL databases. Finding the
right talents or training existing developers may both be costly and time-consuming,
increasing the uncertainty and cost associated with the big data project

• Chaos: The compound permutation of various data sources, data stores and data
applications would mean exponential increase in number of connections and paths to
facilitate data flow. Point-to-point integration would mean inevitable chaos in the future,
and will eventually cause issues. Point-to-point hand-coding data store connectivity also
would make monitoring of data flow and usage difficult, putting the company at a
disadvantage in the environment where big data is enabling big services with increasing
number of data consumers internally and externally.
 
MuleSoft provides such a solution to standardize data integration connections among
NoSQL stores and sources. The connectors to the major NoSQL databases in all categories
can help your developers build integration flow, both batch and real-time with drags and
drops.

MuleSoft NoSQL Database connectors:

• Document: MongoDB
• Graph: neo4j
• Column: Cassandra, Riak, HBase
• Key-Value: Redis, HDFS
• Cloud: Amazon Redshift (JDBC, general database connector)

3.1 Storing Unstructured Data and NoSQL Databases

What storage solution to choose for a Big Data use case is an important decision.
Fundamental to this decision is an understanding of the CAP theorem. The CAP theorem
states that a distributed system can simultaneously provide only two of the following:

17
• Consistency: All nodes in the system have visibility to all data at the same time.
• Availability: A client to the system will receive a response to a request
• Partitioning: The system functions normally despite a failure in another part of the
system

Your choice of a NoSQL solution will be largely guided by which two of the three your usage
pattern dictates. The following is a list of NoSQL connectors provided by MuleSoft and how
they could help address certain types and structures of data.

MongoDB

MongoDB is an open source, document driven database with the following features

• Horizontal scaling via sharding


• Strongly consistent

18
• Single documents can be atomically updated
• Multi-datacenter replication supported
• High-availability via automatic failover with short downtime
• Data is stored in BSON format, a binary form of JSON
• Indexes are supported
• Ad-hoc query language allows data retrieval

MongoDB offers Consistency and Partition tolerance. MongoDB is best suited for the
Document Storage and Data Exhaust usage patterns.

Cassandra

Apache Cassandra is an open source, column driven database, featuring the following:

• Automatically horizontally scalable


• Semi-flexible structured key-value store
• Column and column-family based structuration
• Tunable consistency
• Multi-datacenter replication supported
• Master/master reads and writes
• Map/reduce is supported
• Ad-hoc query language allows data retrieval

Cassandra offers Availability and Partition tolerance. While Cassandra is standardizing


towards a CQL based Java driver, the plethora of current drivers and the column-based
access and query model make it a good candidate for the API Exposure usage pattern.
Cassandra’s excellent multi-datacenter replication features make it well suited for the Multi-
Data Center State Synchronization usage pattern.

Hadoop Distributed Filesystem (HDFS)

The Hadoop Distributed File System, or HDFS, is a distributed file system primarily used to
share data in a Hadoop cluster, featuring the following:

• Distributed file system


• Automatically horizontally scalable

19
• High-availability via manual and automatic failover.

HDFS offers Consistency. HDFS is typically used in conjunction with Hadoop for distributed
map reduce operations, making it a good choice when using the Machine Learning usage
patterns discussed above.

Riak

Riak is an open-source distributed implementation of the principles put forth in Amazon’s


Dynamo paper. Riak features

• Distributed file system


• Strongly consistent
• Automatically horizontally scalable
• High-availability via manual failover
• Map-reduce is supported

Riak offers Availability and Partition tolerance. These characteristics make it an option for
Multi Data-Center State Synchronization.

Redis

Redis is an open source key-value based “data structure server.” It provides mechanisms to
store strings, hashes, lists, sets and sorted sets. Prominent features include:

• Key-value store with rich data structures as values


• Transactions simulated by executing multiple commands atomically
• Uses master/slave replication
• High-availability cluster in beta
• Data operations done via ad hoc commands
• Supports a basic pub/sub mechanism

Redis offers Consistency and Partition tolerance. Redis is a good option for Application
Analytics.

Neo4j

Neo4j is an open source, graph database with the following characteristics:

20
• Graph oriented database
• Node and relationships support key/value properties
• Fully ACID
• Can easily be embedded
• High-availability clustering is available
• HA uses master/slave replication, with writable slaves
• Online backup is available
• Indexes are supported
• Ad-hoc query language allows data retrieval
• Graph traversal is supported
• Algorithm-based traversal is supported.

Neo4j offers Consistency and Availability. Neo4J is a good choice for Application Analytics
that can be modeled as graphs.
 
3.2 Select Anypoint Platform Use Case with NoSQL - Document Storage
 
Document driven databases like MongoDB are emerging as the preferred data backend for
web and mobile applications. The rigid schemas associated with relational databases often
makes them ill-suited to store semi-structured content such as product data, user generated
content like comments or metadata attached to media files.

Scaling relational databases horizontally has also proved to be challenging, pushing


developers to new solutions that easily support scaling across commodity hardware and
data centers.

Anypoint Platform support for document-oriented NoSQL databases like MongoDB, along
with its light-weight transformation and data streaming features simplify the implementation
of a NoSQL datastore. Some common usage patterns with Anypoint Platform and document
stores include the following:
 
• Batch ingest and transformation of aggregate data accumulated periodically from
external sources. A retailing use case is to use the MongoDB connector to persist

21
product, stock and tax data into MongoDB collections by periodically polling web
services API and data on FTP sites. Anypoint DataMapper can be used to transform
the disparate formats of these sources into a canonical JSON format, which can be
directly persisted into MongoDB.

• Normalized databases, for good or ill, often long outlive the applications built on top of
them. Document stores can be used to build a denormalized layer “on top” of these
databases, providing a scalable fabric to facilitate ancillary BigData use cases. Platform
support for JDBC makes it easy to maintain denormalized storage view in parallel to a
normalized relational database.

   

3.3 Anypoint Platform’s strengths for Integrating NoSQL Databases

While other tools may also help solve connections into NoSQL databases, Anypoint
Platform has a few areas of strengths that differentiate it. In addition to easy-to-use
graphical interface, robust enterprise support, cloud / on-premise deployment options, and
top-of-class performance, the platform’s ability to integrate all sources of data and its batch /
real-time dual processing capabilities are worth mentioning.

22
Anypoint Platform is already used by thousands of enterprises as the backbone to integrate
a wide variety of data sources, on premise and on the cloud. The table below lists a subset
of connectors:

• Applications: Twitter, Facebook, Concor, Salesforce, Zuora, Box, Microsoft Dynamics


CRM, Marketo, NetSuite, Google Contacts, Google Calendar, Google Tasks, Google
Spreadsheet, Oracle Siebel, Quickbooks, Jira, SAP, Zendesk, Drupal, Service Source,
Fresh Books, HubSpot, Yammer, Google Prediction, Documentum, Sugar CRM,
Metanga, Jobvite, Paypal

• Databases: Amazon S3, Amazon DynamoDB, HL7, HDFS, Oracle DB, DB2,
PostgreSQL, MySQL, JDBC, OData, Riak, Microsoft SQL Server, MongoDB,
Cassandra, neo4j, Redis

• Others: AMQP, RabbitMQ, excelCSV, SSL, SSH, SMTP, ActiveMQ, File, POP3,
WebLogic, AJAX, TCP, Jetty, SwiftMQ, JBoss, WebsphereMQ, FTP, HTTP/HTTPS,
SFTP, Apple Push Notification Services, Amazon Notification Services

The Anypoint Platform offers both real-time and batch capabilities. While Anypoint Platform
has always been great for handling event-driven, real-time data flows, it also has launched
“batch processing”, seeing the need for enterprises to do both. While real-time ingestion
and processing are increasingly adopted, some data and processes are still better fit for
batch mode. For example, master data may still reside in traditional databases and would be
moved in batch in big data use cases. Synchronization among different applications and
databases may still be a batch-oriented operation. Anypoint Platfrom possesses the ability
to process messages in batches. Within an application, you can initiate a batch job which is
a block of code that splits messages into individual records, performs actions upon each
record, then reports on the results and potentially pushes the processed output to other
systems or queues.
 
3.4 APIs for High-Variety Big Data
 
While some NoSQL solutions, notably MongoDB, offer straightforward query and data
representations using standard formats like JSON, others have more obtuse access
methods or data representations. Interaction with Cassandra, until recently, required some
insight into the Thrift protocol, of which Cassandra is built on top. This is further complicated
by competing drivers for the datastores even within a single platform. Redis, for instance,

23
has 6 Java clients to chose from. Querying Neo4J graph databases requires knowledge of
the Cypher query language.

An API layer in front of the datastore can ease access and adoption when working with
these databases.

Anypoint Platform’s support for API constructions can simplify adding an API layer in front of
your data layer:

• Anypoint APIKit offers a declarative model to facilitate REST API design and
development.
• Anypoint DataMapper can assist in the transformation from the persisted format of the
data to the format being exposed by the API.

The fast-growing consumer electronics retailer, Dick Smith, replaced their legacy systems
without business disruption, with the help from Anypoint Platform’s integration and API
capabilities. The retailer used Anypoint Platform to manage staged data integration from
AS/400, a 20+ years old technology, to Amazon Redshift. APIs are being used to create
interfaces for POS devices’ communication. For more, please refer to this article.

24
3.4.1 API Use Case with NoSQL
 
A REST API sitting in front of Neo4J that accepts and returns JSON, for instance, would
simplify access for developers writing an application on top of it, who aren’t necessarily
familiar with graph database concepts or querying with Cypher.

Visibility between application interaction is often needed as more and more applications are
built on top of APIs. Neo4J, a graph database, can be used to store a graph of
dependencies between API’s. Coupled with usage and reliability data captured Anypoint
Service Data, this provides the ability to predict the impact of maintenance windows and
upgrades. Such data would allow an organization to specifically time a maintenance
window, for instance, based on statistically when the last amount of users will be affected. A
REST API sitting in front of Neo4J that accepts and returns JSON, for instance, would
simplify access for developers writing an application on top of it, who aren’t necessarily
familiar with graph database concepts or querying with Cypher.

4 Conclusion and MuleSoft’s Big Data Integration Solution


 
The next phase of big data evolution will be unifying big and small data, and realizing value
from faster data and more sources. Addressing velocity and variety challenges require not
only having the data and the infrastructure, but more importantly, integrating them with the
resource and skill constraints. Smart enterprises adapt their architectures and leverage

25
integration tools, like MuleSoft, to shorten deployment cycle, save costs, avoid chaos and
achieve ROI.

MuleSoft is uniquely positioned to offer the one-stop solution for your future-proof big data
project implementation. The chart below summarizes MuleSoft’s product solution
supporting the two big data challenges.

Tell us about your big data projects and discuss how MuleSoft could help you, please email
us at: bigdata@Mulesoft.com.

26