
Just as Newtonian physics becomes useless at very high speeds, so techniques for processing data become non-performant with very large data sets.

Sorting is O(n log n), meaning that sorting 10 million records takes roughly 12 times as long as sorting 1 million records.
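
For concreteness, a quick back-of-the-envelope check of that ratio in Python (an idealized cost model; the constant factors of a real sort are ignored):

import math

def nlogn_cost(n):
    # Idealized comparison-sort work: n * log2(n).
    return n * math.log2(n)

ratio = nlogn_cost(10_000_000) / nlogn_cost(1_000_000)
print(f"10M vs 1M records: about {ratio:.1f}x the work")  # roughly 11.7x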

Mapping, that is, taking a data object and transforming it into another data object, is O(n), so ten times more records take ten times as much time.

Folding, or reducing, which takes a list of objects and reduces them to a single value, is also O(n).
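
Both operations in miniature, in Python (the record shape here is a made-up example, only to show the single linear pass):

from functools import reduce

records = [{"name": "a", "amount": 10}, {"name": "b", "amount": 32}]

# Map: one transformed object per input object -> O(n).
amounts = [r["amount"] for r in records]

# Fold/reduce: accumulate the whole list into one value -> also O(n).
total = reduce(lambda acc, x: acc + x, amounts, 0)
print(total)  # 42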

Moore's law was sufficient to allow single-processor/storage systems to scale with the data. I suspect one would see an inflection point and a steeper slope in data per user starting with the internet in the mid-90s.

So very large datasets have to be split up, which imposes latency costs: time spent on network control and communication, plus the overhead of dividing the data and recombining the results. With very large distributed processor pools, the probability of a failure during a job increases, so the controller has to be able to requery a failed subset, and the retries add time to the job.
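
A toy sketch of that split/process/recombine pattern with retries, in Python (the chunk size, failure rate and retry limit are arbitrary assumptions; real frameworks add scheduling, shuffling and far more machinery):

import random

def process_chunk(chunk):
    # Stand-in for work shipped to a remote worker; fails now and then.
    if random.random() < 0.1:
        raise RuntimeError("worker failed")
    return sum(chunk)

def run_job(data, chunk_size=1000, max_retries=3):
    # Divide the data, process each piece, recombine the partial results.
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    partials = []
    for chunk in chunks:
        for attempt in range(max_retries):
            try:
                partials.append(process_chunk(chunk))
                break
            except RuntimeError:
                continue  # each retry adds wall-clock time to the job
        else:
            raise RuntimeError("chunk failed after all retries")
    return sum(partials)

print(run_job(list(range(10_000))))

Every retry and every recombination step is pure overhead that a single-machine run would never pay.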

Further speed-ups are possible via caching, but caching also adds overhead: storage to hold the caches, time to check whether a value is in the cache, and some form of management to be sure the cache is not stale when staleness matters.
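
As a rough illustration, a minimal time-to-live cache might look like this in Python (the 60-second TTL is an arbitrary assumption, and a real cache also needs eviction and explicit invalidation):

import time

class TTLCache:
    def __init__(self, ttl_seconds=60):
        self.ttl = ttl_seconds
        self.store = {}  # key -> (value, time stored): the storage overhead

    def get(self, key):
        entry = self.store.get(key)  # the time spent checking the cache
        if entry is None:
            return None
        value, stored_at = entry
        if time.time() - stored_at > self.ttl:  # stale: treat as a miss
            del self.store[key]
            return None
        return value

    def put(self, key, value):
        self.store[key] = (value, time.time())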

All of this added infrastructure is not necessary in the domains for which I wrangle data. Access would be powerful enough, because the data sets are in the hundreds of records, not the billions. I actually use PostgreSQL, which is more power than I need, but it provides more scalability than my clients will need and can be deployed on all the server platforms.

I could move the dataset management onto these large-scale NoSQL / map-reduce frameworks - and I have been looking into them - but I'd be spending more time writing for moving targets: SQL and RDBMSs are well understood, while the new tools are still figuring out where the sweet spots of interfaces, cost/benefit and CAP (http://en.wikipedia.org/wiki/CAP_theorem) lie. I see this as adding hours to my development time, and the result would be a 0.33-second response time instead of 1.2 seconds for internal applications. I work with small businesses; they don't need and won't pay for the jet-powered Pregnant Guppy (http://en.wikipedia.org/wiki/Aero_Spacelines_Pregnant_Guppy) for their data transport.

I understand where we agree. Back in the day, the slow-and-steady mainframe, which took up half the second floor, did the payroll overnight every other week. People did plan their processes around the limits of the 100%-capacity jobs, and as those limits increased, the size of the routine jobs increased with them. Even with the new hardware, programmers quickly found themselves at the edge of the flat world, applying all their creativity to keep the ship from going over into the realm of dragons. Think about weather forecasting, which gets better with more and more comprehensive station reports. Seven-day forecasts? Back when I was a young grown-up, unimaginable. And I imagine that today meteorologists have a firm grasp on which modeling jobs are out of reach.

But my point is that something like the CAP theorem wouldn't even have been thought of, except that the largest datasets are now orders of magnitude larger than pre-internet datasets and in constant flux. However, the typical (plus or minus one standard deviation) datasets are still manageable without adding the complexities of distributed processing.

If vendors and customers wish to label the suite of tools for handling tera- and peta-record datasets as BigData, to distinguish them from the stuff I need (for which the tools are now given away), then I see why. These are not my grandfather's databases (which would have been filing cabinets).

VOLUME, VARIETY AND VERACITY OF DATA
