Академический Документы
Профессиональный Документы
Культура Документы
1 of 6
Home
Data
Shop
Gov 2.0
Programming
Answers
Publishing
Web 2.0
http://radar.oreilly.com/2011/08/building-data-startups.html
Conferences
Training
School of Technology
Community
At the same time, these technological forces are not symmetric: CPU
and storage costs have fallen faster than that of network and disk IO.
Thus data is heavy; it gravitates toward centers of storage and
compute power in proportion to its mass. Migration to the cloud is
the manifest destiny for big data, and the cloud is the launching pad
for data startups.
8/8/2012 12:44 AM
2 of 6
http://radar.oreilly.com/2011/08/building-data-startups.html
algorithms.
Finally, at the top of the stack are services and applications. This
is the level at which consumers experience a data product, whether it
be a music recommendation or a traffic route prediction.
Lets take each of layers and discuss the competitive axes at each.
The competitive axes and representative technologies on the Big Data stack are illustrated
here. At the bottom tier of data, free tools are shown in red (MySQL, Postgres, Hadoop), and
we see how their commercial adaptations (InfoBright, Greenplum, MapR) compete principally
along the axis of speed; offering faster processing and query times. Several of these players
are pushing up towards the second tier of the data stack, analytics. At this layer, the primary
competitive axis is scale: few offerings can address terabyte-scale data sets, and those that
do are typically proprietary. Finally, at the top layer of the big data stack lies the services that
touch consumers and businesses. Here, focus within a specific sector, combined with depth
that reaches downward into the analytics tier, is the defining competitive advantage.
Fast data
At the base of the big data stack where data is stored,
processed, and queried the dominant axis of competition was once
scale. But as cheaper commodity disks and Hadoop have effectively
addressed scalable persistence and processing, the focus of
competition has shifted toward speed. The demand for faster disks
has led to an explosion in interest in solid-state disk firms, such as
Fusion-IO, which went public recently. And several startups,
most notably MapR, are promising faster versions of Hadoop.
FusionIO and MapR represent another trend at the data layer:
commercial technologies that challenge open source or commodity
offerings on an efficiency basis, namely watts or CPU cycles consumed.
With energy costs driving between one-third and one-half of data
center operating costs, these efficiencies have a direct financial
impact.
Finally, just as many large-scale, NoSQL data stores are moving from
disk to SSD, others have observed that many traditional, relational
databases will soon be entirely in memory. This is particularly
true for applications that require repeated, fast access to a full
set of data, such as building models from customer-product matrices.
This brings us to the second tier of the big data stack, analytics.
Big analytics
At the second tier of the big data stack, analytics is the brains to
cloud computings brawn. Here, however, the speed is less of a
challenge; given an addressable data set in memory, most statistical
algorithms can yield results in seconds. The challenge is scaling
these out to address large datasets, and rewriting algorithms to
8/8/2012 12:44 AM
3 of 6
http://radar.oreilly.com/2011/08/building-data-startups.html
Focused services
The top of the big data stack is where data products and services
directly touch consumers and businesses. For data startups, these
offerings more frequently take the form of a service, offered as an
API rather than a bundle of bits.
BillGuard is a great example of a startup offering a focused data
service. It monitors customers credit card statements for dubious
charges, and even leverages the collective behavior of users to
improve its fraud predictions.
Several startups are working on
algorithms that can crack the content relevance nut, including
Flipboard and News.me. Klout delivers a pure data service that uses
social media activity to measure online influence. My company,
Metamarkets, crunches server logs to provide pricing analytics for
publishers.
For data startups, data processes and algorithms define their
competitive advantage. Poor predictions whether of fraud,
relevance, influence, or price will sink a data startup, no matter
how well-designed their web UI or mobile application.
Focused data services arent limited to startups: LinkedIns People
You May Know and FourSquares Explore feature enhance engagement of
their companies core products, but only when they correctly suggest
people and places.
8/8/2012 12:44 AM
4 of 6
http://radar.oreilly.com/2011/08/building-data-startups.html
players, such as SAS and SAP, are expanding their storage footprints
and challenging the need for alternative data platforms as staging
areas. Finally, data startups and many established firms are
creating services whose success hinges directly on proprietary
analytics algorithms.
The emergence of data startups highlights the democratizing
consequences of a maturing big data stack. For the first time,
companies can successfully build offerings without deep infrastructure
know-how and focus at a higher level, developing analytics and
services. By all indications, this is a democratic force that
promises to unleash a wave of innovation in the coming decade.
Related:
What is data science?
The SMAQ stack for big data
Data science democratized
tags: data companies, data product, data stack, startups
Share
Share
102
Login
Add New Comment
Showing 8 comments
Michael, nice article. Would be nice to hear your take on where 1010data fits
into the mix.
On the subject of democratizing big data, something that troubles me with the
term big data is that it reminds me of long string in the how long is a piece of
string? sense. How big is big? The edge cases are very clear. Google,
facebook, this is big data. A retailer with 3 stores and no web site is not big data.
But importantly the vast majority of companies fall between these extremes.
Many smaller businesses now capture machine-generated data in additional to
old-school transactions and volumes are exploding as a result. While this data
might not be big from a computer science perspective, it feels pretty big to the
people who are trying to understand what it means, and in particular the
association between the transactional and machine-generated data.
From a technology perspective, this neglected majority doesnt need the typical
tools of big data like Hadoop, cant afford Netezza and wouldnt remember the
math to use SPSS or SAS. Fortunately, there is an additional category of
analytics startups emerging to address these customer needs with simple,
elegant analytics solutions based on recent technologies combining columnar
storage, in-memory, and compression to blast their way through hundreds of
8/8/2012 12:44 AM
5 of 6
http://radar.oreilly.com/2011/08/building-data-startups.html
millions of records almost instantly using hardware you can pick up in Best Buy.
One example is sisense.
Nice article, and great depiction of the layers. But if the cost of data access is
increasing exponentially, how would that define a move to the cloud? Moving
data across WANs is now relatively very expensive, not just for mobile users, but
for corporations. So "dumping" your big and fast data into a cloud, where you
are continually paying premiums for access even if the storage is cheap, could
be prohibitively expensive. @robertoames
Great recap, Michael. I've seen earlier iterations of this breakdown in other
blogposts that you've written.
Without disagreeing with you on anything, I still feel like there's a bigger picture
to putting the data entrepreneurship opportunity, and that's around the ongoing
effort to drag data kicking and screaming from being an ephemeral thing to a
first class media type. When data is an object in the social graph, and gains the
general appreciation that music, video and documents enjoy a lot of
fundamental assumptions must be rechecked. Suddenly data isn't the exclusive
realm of quants and analysts, it's been upgraded to something that even
non-technical people can understand and potentially even interact with.
A great example of this is Wikipedia and Google. It used to be that if you wanted
to look up an encyclopedia article, you'd go to the Brittanica site and call up an
article. It certainly wasn't editable and nobody could easily improve upon it.
Yet the perfect storm of Wikipedia giving each article a meaningful URL and
Google facilitating search meant that suddenly Wikipedia articles are the top
result of most searches on Wikipedia.
BuzzData.com is our attempt to replicate the trajectory of Wikipedia with data.
We give dataset publishers the opportunity to curate a community of people that
are interested in and often working with their data. In this regards, we are trying
to be the glue that ties people working on data, regardless of where in the
diagram they are standing.
Try it out... I'd love to know what you think so far!
I see those exponential trends only continuing throughout the next several
decades. I think connectivity will continue to proliferate, and the cost of data
storage and bandwidth will continue to decrease. With these trends comes also
an increase of available data, which must be sorted through in a logical and
robust way, which is where these data start-ups find their niche.
What is oft important is not the size of the data, that's not an engineer's
approach. Rather how much of the big data (sampling) is truly required for
8/8/2012 12:44 AM
6 of 6
http://radar.oreilly.com/2011/08/building-data-startups.html
useful inferences. That can easily bring down the complexity by nearly two
orders of magnitude!
Shouldn't you also be addressing the exponential growth of data and how it
compares with the others? Is it rising slower or faster than these others, thus
implying that technology is closing or getting left behind with the sheer volumes
produced?
Or is it a case of data filling teh available storage and processing capability?
8/8/2012 12:44 AM