
High-Velocity Data


The Data Fire Hose


What is High-Velocity Data?
Computer systems are creating ever more data at increasing speeds, and there is a growing number of consumers of that data, both operational and analytical. Hadoop-style batch processing has awakened engineers to the value of big data, but they increasingly demand access to the data earlier. In essence, people not only want all of the data, they want it as soon as possible; this is driving the trend toward high-velocity data. High-velocity, or fast, data can mean millions of rows of data per second; we are talking about massive volume. One of the use cases for high-velocity data is real-time analytics.

What is Driving the Explosion in High-Velocity Data?


Data generated by humans has been growing exponentially for quite some time, fueling the growth of companies like EMC and NetApp. In fact, 90% of the world's data was created in the last 2 years [http://www.sciencedaily.com/releases/2013/05/130522085217.htm]. This really demonstrates how the world has embraced big data. However, the data generated by both devices and the actions of humans, such as log files, website click-stream data, and Twitter feeds, wasn't tracked or collected until recently, because the state-of-the-art technology couldn't handle that data velocity.
Big Data, driven largely by Hadoop, provided a mechanism for running analytics across massive volumes of data using a batch process. This gave people a reason to store these huge amounts of data. As people began deriving value from big data, they started wanting more. They began to ask why they couldn't process these large volumes of data in real time. This extreme level of data velocity requires new high-velocity data technologies.

What are the Sources of High-Velocity Data?


This is a list of some of the popular sources of high-velocity data today:
1. Log Files: Devices, websites, databases; any number of technologies log events. Log mining applications like Splunk and Loggly opened people's eyes to the value in these log files. This resulted in an increase in logging and in the richness of the data collected in these log files.
2. IT Devices: Networking devices (routers, switches, etc.), firewalls, printers; every device these days generates valuable data, assuming you can collect it and process it at scale.
3. User Devices: One of the largest sources of high-velocity data is the use of smartphones. Everything you do on your smartphone is logged, providing valuable data.
4. Social Media: Whether it is Twitter tweets, Facebook posts, Foursquare check-ins or any number of other social data streams, these create massive amounts of real-time data that degrades in value quickly.
5. Online Gaming: Another source of real-time data based on user interactions, not just with the game but also with other users. This group includes Massive Multiplayer Online Games (MMOGs) like World of Warcraft as well as 1:1 games, many played on mobile phones, like Words with Friends.
6. SaaS Applications: SaaS applications typically start with a limited set of functionality. As they mature, the functionality grows, and user relationships and interactions grow as well, creating a massive flow of real-time data. LinkedIn is a perfect example of this trend. This high-velocity stream of events led LinkedIn to create Kafka, a Complex Event Processor (CEP) that handles the routing and delivery of high-velocity event data.
There are many more sources of high-velocity data, including vertical sources like the flood of GIS data found in oil and gas companies. As technologies come online to extract value from this high-velocity data, it is transforming many industries.

Managing the Flow of High-Velocity Data


The flood of high-velocity data can quickly overwhelm systems, especially during peak loads.
Furthermore, most applications need certain quality guarantees (guaranteed delivery, exactly-once delivery, etc.). To coordinate the flow of high-velocity data, some companies use Complex Event Processing (CEP) solutions based on a publish-and-subscribe approach. Examples of these include Java Message Service (JMS) and Apache Kafka, which came out of LinkedIn. If you only need to manage the flow of data, CEP can help coordinate the flood of data.
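
To make the publish-and-subscribe pattern concrete, here is a minimal sketch using the open-source kafka-python client. The broker address, topic name, and event fields are illustrative assumptions, not details from any particular deployment.

    import json
    from kafka import KafkaProducer, KafkaConsumer

    TOPIC = "clickstream"  # hypothetical topic carrying high-velocity events

    # Producer side: applications publish events as fast as they are generated.
    producer = KafkaProducer(
        bootstrap_servers="localhost:9092",
        value_serializer=lambda event: json.dumps(event).encode("utf-8"),
    )
    producer.send(TOPIC, {"user_id": 42, "action": "page_view", "ts": 1400000000})
    producer.flush()

    # Consumer side: downstream systems (stream processors, databases) subscribe
    # independently and read at their own pace.
    consumer = KafkaConsumer(
        TOPIC,
        bootstrap_servers="localhost:9092",
        auto_offset_reset="earliest",
        value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
    )
    for message in consumer:
        print(message.value)  # e.g. hand off to a stream processor or persist it
        break  # read a single message in this sketch

The same topic can feed any number of consumers, which is what makes publish-and-subscribe a good fit for coordinating a flood of data with many downstream readers.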

Processing High-Velocity Data


The desire to extract real-time insight from high-velocity data led to the creation of Stream Processing Engines. These engines include Twitter's Storm, Yahoo's S4 and LinkedIn's Samza (built on top of the Kafka CEP above). These engines can route, transform and analyze a stream of data at high velocity. However, they do not persist the data; instead they provide a brief sliding window on the data. For example, they might maintain a 2-minute or 10-minute view of the data, but the amount, or time window, is limited by the velocity of the data and the size of their memory. These engines can persist the data to a database, giving you a comprehensive view of the historical data. This assumes that your chosen database can handle the data velocity.
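
The sliding-window behavior described above can be sketched in a few lines of Python. The two-minute window and the event shape are illustrative assumptions; real engines such as Storm or Samza distribute this work across a cluster.

    import time
    from collections import deque

    WINDOW_SECONDS = 120  # a 2-minute view of the stream, as in the example above

    window = deque()  # (timestamp, event) pairs currently inside the window

    def ingest(event, now=None):
        """Add an event and evict anything older than the window."""
        now = time.time() if now is None else now
        window.append((now, event))
        while window and window[0][0] < now - WINDOW_SECONDS:
            window.popleft()  # evicted events are gone unless persisted elsewhere

    def count_in_window():
        """A simple real-time metric computed over the current window only."""
        return len(window)

Anything that falls out of the window is lost to the engine, which is why persisting the stream to a database, covered next, matters for historical questions.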

Persisting High-Velocity Data: the Database

Traditional Database Management Systems (DBMS) simply cannot handle the high-velocity data coming from modern applications. This is a data ingestion problem; think of a human sipping from a fire hose and you'll get the idea. Hadoop provides batch processing of high-volume data, but when dealing with high-velocity data you need real-time processing. This has led to a few innovations.
Add a SQL Interface to Hadoop
The demand for persisting and querying high-velocity data in real time has led a number of companies to add limited SQL interfaces to Hadoop. Examples of this approach include Apache Tez (Hortonworks), Impala (Cloudera), Hadapt (Hadapt) and Apache HBase. Hadoop and HDFS weren't designed for database requirements (in fact, their storage is based on large files, not small blocks), but corporate demand for a solution to the high-velocity data ingestion problem is certainly strong. Hadoop is really optimized for data volume, not data velocity.
NoSQL
NoSQL is one solution to the high-velocity data ingest problem. The challenge NoSQL faces is the same challenge faced by Hadoop: corporations have standardized upon, and built expertise and tools around, SQL, which doesn't work with NoSQL databases.
In-Memory DBMS
In-memory databases eliminate the slowest piece of the traditional database, the disk, enabling them to ingest data at a much higher rate than traditional databases. The two big contenders in the in-memory database world are HANA (SAP) and TimesTen (Oracle). However, in-memory databases are ill-suited to high-velocity data because their data size is limited to memory; they simply cannot handle the volume of data created by a high-velocity data source.
Extending MySQL to Handle High-Velocity Data: ScaleDB
Traditional databases, like MySQL, do not deliver sufficiently high data ingest rates to persist high-velocity data. ScaleDB changes all of that. ScaleDB extends MySQL without changing a single line of MySQL code, so the entire ecosystem (tools, applications, etc.) works with ScaleDB. ScaleDB's new Streaming Table technology enables a small cluster of MySQL databases to ingest millions of rows of data per second. This data is then available for real-time manipulation using the rich tools that are already part of the MySQL ecosystem, such as Tableau Software [http://www.tableausoftware.com/], QlikView [http://www.qlikview.com/] and Logi Analytics [http://www.logianalytics.com/].
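
Because ScaleDB sits behind the standard MySQL interface, ordinary MySQL clients and batched inserts keep working unchanged. The sketch below is generic mysql-connector-python code, not ScaleDB-specific API; the table, columns, and credentials are made up for illustration.

    import mysql.connector

    conn = mysql.connector.connect(
        host="127.0.0.1", user="app", password="secret", database="events_db"
    )
    cursor = conn.cursor()

    # A batch of sample events; in practice these arrive from the stream.
    rows = [("red", 1400000000 + i) for i in range(10000)]

    # Batched inserts amortize round trips, one common way to raise ingest rates.
    cursor.executemany("INSERT INTO ball_events (color, ts) VALUES (%s, %s)", rows)
    conn.commit()
    cursor.close()
    conn.close()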
In addition to running leading analytics tools, persisting the data in a database gives you the ability to query the data in an ad hoc fashion. If we use the example of a flow of colored balls, a stream processor can count green balls, or it can transform all data about red balls into orange balls. However, if you want to ask questions of the data, across a time series, you need database functionality. For example, using a database you can ask how many red balls were preceded by green balls, or how many orange balls we processed in the last hour, or any number of questions at any level of detail you need, all in an interactive fashion.
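
Continuing the colored-balls example, an ad hoc, time-windowed question becomes a single SQL query once the stream is persisted. The ball_events table below is the same hypothetical table used in the ingestion sketch above.

    import mysql.connector

    conn = mysql.connector.connect(
        host="127.0.0.1", user="app", password="secret", database="events_db"
    )
    cursor = conn.cursor()

    # "How many orange balls did we process in the last hour?"
    cursor.execute(
        "SELECT COUNT(*) FROM ball_events "
        "WHERE color = 'orange' AND ts >= UNIX_TIMESTAMP() - 3600"
    )
    (orange_last_hour,) = cursor.fetchone()
    print(orange_last_hour)

    cursor.close()
    conn.close()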

Selecting the Right High-Velocity Data Tool for Your Needs

Conclusion
High-Velocity Data, over time, accumulates to create Big Data. Think of high-velocity data as the fire hose, pumping out water that forms a pond representing big data. Hadoop has gained popularity for providing batch-oriented processing of big data. But batch processing is deficient in that it does not provide real-time processing or ad hoc queries.
Several classes of applications are generating high-velocity data for which Hadoop-style batch processing is insufficient. For example, a Massive Multiplayer Online Game (MMOG) might require a high-velocity data solution that serves multiple use cases: (1) maintaining player state during and between sessions; (2) generating real-time analytics as a mechanism for modifying game play or informing operations; (3) supporting ad hoc queries from customer support; (4) providing real-time action-based billing; and more. In this case a brief moving window of time, as provided by stream processing engines, is insufficient; it requires high-velocity streaming persistence with an ad hoc, ideally SQL-based, interface.
Hadoop opened up whole new possibilities for extracting value from big data, or high-volume data. This led more and more companies to start collecting massive amounts of data, because they could extract value from it. The new wave of high-velocity data tools enables companies to extract real-time value from high-velocity data, instead of waiting for it to pile up and then running a batch process on it. Look for more companies to recognize this opportunity to drink upstream from their competition, using high-velocity data to make them more agile, responsive and ultimately more competitive.

Copyright 2014 ScaleDB
