Big Data:
The term Big Data refers to collections of large and complex data sets that are difficult to
process using traditional database management tools.
Big Data is not just about large quantities of data; it is a concept that provides an
opportunity to find new insights in existing data, as well as guidelines for capturing and
analyzing future data. It makes a business more agile and robust, so that it can adapt to and
overcome business challenges.
Big Data is defined by three Vs: Variety, Velocity and Volume.
Variety:
Data may come in many forms: text, images, audio, video, logs, and so on. Organizations
need to arrange this data and make it meaningful. That would be easy if all data arrived in
the same format, but most of the time it does not. The real world produces data in many
different formats, and that is the challenge to overcome with Big Data. This variety of
data represents Big Data.
Velocity:
Data growth and the social media explosion have changed how we look at data. There was
a time when we believed that yesterday's data was recent. News channels and radio
changed how fast we receive news. Today, people rely on social media to keep them
updated with the latest happenings. On social media, messages that are even a few seconds
old (a tweet, a status update, etc.) may no longer interest users; they often discard old
messages and pay attention to recent updates. Data movement is now almost real time,
and the update gap has shrunk to fractions of a second. This high velocity of data
represents Big Data.
Volume:
We currently see exponential growth in data storage, since data is now much more than
text. Videos, audio and images are common on our social media channels. It is now
common for enterprises to maintain terabytes, even petabytes, of storage. As the database
grows, the applications built to support the data need to be re-evaluated quite often. This
large volume indeed represents Big Data.
Facebook
Twitter
Google
Skype
Logs generated by Servers
Working with Big Data raises several challenges:
How to understand and use big data when it comes in an unstructured format, such
as text or video.
How to capture the most important data and deliver that to the right people in
real-time.
How to store the data and how to analyze and understand it given its size and our
computational capacity.
And there are numerous other challenges, from privacy and security to access and
deployment.
A number of tools and technologies have been developed to address these challenges:
1. Hadoop:
Apache Hadoop is open source software that enables the distributed processing
of large data sets across clusters of commodity servers, with a very high degree of fault
tolerance. Rather than relying on high-end hardware, the resiliency of these clusters comes
from the software's ability to detect and handle failures at the application layer. Hadoop is
the most popular open source implementation of MapReduce, a programming model for
processing Big Data. It has several different applications, but one of the top use cases is
handling large volumes of constantly changing data, such as location-based data from
weather or traffic sensors, web-based or social media data, or machine-to-machine
transactional data.
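The MapReduce model that Hadoop implements can be illustrated with word counting, its canonical example. The following is a minimal single-process Python sketch of the three phases (map, shuffle, reduce); the function names are illustrative, not the Hadoop API itself.

```python
from collections import defaultdict

def map_phase(documents):
    """Map step: emit a (word, 1) pair for every word in every document."""
    for doc in documents:
        for word in doc.split():
            yield (word.lower(), 1)

def shuffle_phase(pairs):
    """Shuffle step: group all emitted values by key, as the framework would."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Reduce step: sum the counts for each word."""
    return {word: sum(counts) for word, counts in grouped.items()}

docs = ["big data is big", "data moves fast"]
counts = reduce_phase(shuffle_phase(map_phase(docs)))
# counts: {"big": 2, "data": 2, "is": 1, "moves": 1, "fast": 1}
```

In a real Hadoop job the map and reduce steps run in parallel on different cluster nodes, and the shuffle is handled by the framework itself.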
2. HPCC:
High Performance Computing Cluster stores and processes large quantities of data,
processing billions of records per second using massive parallel processing technology. Large
amounts of data across disparate data sources can be accessed, analyzed and manipulated
in fractions of seconds. HPCC functions as both a processing and a distributed data storage
environment, capable of analyzing terabytes of information.
4. Hive
Hive is a "SQL-like" bridge that allows BI applications to run queries against a
Hadoop cluster. It was originally developed by Facebook and has since been open-sourced;
it allows anyone to run queries against data stored in a Hadoop cluster just as if they were
working with a conventional data store. It extends the reach of Hadoop, making it more
familiar to BI users.
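The idea behind Hive, taking a declarative SQL-like query and evaluating it over raw records, can be sketched in a few lines of plain Python. The table and column names below are invented for the illustration; real Hive compiles HiveQL into Hadoop jobs rather than running in-process like this.

```python
# Records as they might sit in a Hadoop cluster (names are made up).
rows = [
    {"country": "US", "clicks": 10},
    {"country": "IN", "clicks": 7},
    {"country": "US", "clicks": 5},
]

def run_query(rows):
    """Conceptually: SELECT country, SUM(clicks) FROM visits GROUP BY country.

    Hive would plan this as a scan ("map") followed by a grouped
    aggregation ("reduce"); here both happen in one loop.
    """
    totals = {}
    for row in rows:
        totals[row["country"]] = totals.get(row["country"], 0) + row["clicks"]
    return totals

result = run_query(rows)
# result: {"US": 15, "IN": 7}
```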
5. Pig
Pig is a high-level platform for creating programs that run on Hadoop. Its language,
Pig Latin, lets developers write data transformations as simple scripts instead of
hand-coding MapReduce jobs.
6. WibiData
WibiData is a combination of web analytics and Hadoop, built on top of
HBase, which is itself a database layer on top of Hadoop. It allows web sites to better
explore and work with their user data, enabling real-time responses to user behaviour, such
as serving personalized content, recommendations and decisions.
7. Platfora
The greatest limitation of Hadoop is that it is a very low-level implementation of
MapReduce, requiring extensive developer knowledge to operate. Between preparing,
testing and running jobs, a full cycle can take hours, eliminating the interactivity users
enjoyed with conventional databases. Platfora is a platform that turns users' queries into
Hadoop jobs automatically, creating an abstraction layer that anyone can use to simplify
and organize the datasets stored in Hadoop.
8. SkyTree
SkyTree is a high-performance machine learning and data analytics platform
focused specifically on handling Big Data. Machine learning, in turn, is an essential part of
Big Data, since the massive data volumes make manual exploration, or even conventional
automated exploration methods, unfeasible or too expensive.
9. Ingestion & Streaming Technologies
a. Storm - Storm is a distributed real-time computation system that makes it easy to
reliably process unbounded streams of data, doing for real-time processing what
Hadoop did for batch processing. Storm can be used with any programming language, is
scalable and fault-tolerant, and guarantees that data will be processed. It is easy to set
up and operate.
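The stream-processing idea behind Storm can be sketched as a running word count that updates its state with every incoming tuple. This is a toy single-process Python stand-in; in Storm the source would be a spout and the processing steps bolts, distributed across a cluster.

```python
from collections import Counter

def stream_word_count(stream):
    """Process a stream tuple-by-tuple, emitting updated counts each time.

    The generator never needs the whole stream up front, which is the
    key difference from a batch job over a fixed data set.
    """
    counts = Counter()
    for sentence in stream:
        for word in sentence.split():
            counts[word] += 1
        yield dict(counts)  # emit the current state after each tuple

updates = list(stream_word_count(["storm is fast", "storm scales"]))
# updates[-1]: {"storm": 2, "is": 1, "fast": 1, "scales": 1}
```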
b. Flume - Apache Flume is a distributed, reliable, and available service for efficiently
collecting, aggregating, and moving large amounts of log data. It has a simple and
flexible architecture based on streaming data flows. It is robust and fault tolerant with
tunable reliability mechanisms and many failover and recovery mechanisms. It uses a
simple extensible data model that allows for online analytic application.
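Flume's architecture, a source producing events, a channel buffering them, and a sink delivering them, can be sketched as follows. The class names are illustrative stand-ins, not Flume APIs; a real Flume agent is configured declaratively rather than coded.

```python
from queue import Queue

class LogSource:
    """Source: produces log events (here, from an in-memory list)."""
    def __init__(self, lines):
        self.lines = lines
    def events(self):
        for line in self.lines:
            yield line.strip()

class MemoryChannel:
    """Channel: buffers events between the source and the sink."""
    def __init__(self):
        self.queue = Queue()
    def put(self, event):
        self.queue.put(event)
    def take(self):
        return self.queue.get()
    def empty(self):
        return self.queue.empty()

class CollectingSink:
    """Sink: delivers buffered events to their destination (here, a list)."""
    def __init__(self):
        self.delivered = []
    def drain(self, channel):
        while not channel.empty():
            self.delivered.append(channel.take())

source = LogSource(["GET /index.html\n", "GET /about.html\n"])
channel = MemoryChannel()
sink = CollectingSink()
for event in source.events():
    channel.put(event)
sink.drain(channel)
# sink.delivered: ["GET /index.html", "GET /about.html"]
```

The channel decouples the two ends, which is what gives Flume its tunable reliability: a durable channel can survive a sink outage without losing events.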
c. Sqoop - Apache Sqoop is a tool designed for efficiently transferring bulk data between
Apache Hadoop and structured data stores such as relational databases. Sqoop can be
used to import data from a relational database management system (RDBMS) such as