Day 1: Big Data Concepts

Big Data:
The term Big Data refers to collections of data sets so large and complex that they are
difficult to process using traditional database management tools.
Big Data is not just about lots of data; it is a concept that provides an opportunity to
find new insights in existing data, as well as guidelines for capturing and analyzing future
data. It makes a business more agile and robust, so it can adapt to and overcome business
challenges.
Big Data is commonly defined by three Vs: Variety, Velocity and Volume.

Variety:
Data may arrive in many forms: text, images, audio, video, logs and so on. Organizations
need to arrange this data and make it meaningful. That would be easy if all data came in
the same format, but most of the time it does not. The real world produces data in many
different formats, and that is the challenge Big Data has to overcome. This variety of
data represents Big Data.

Velocity:
Data growth and the social media explosion have changed how we look at data. There was a
time when we considered yesterday's data recent. News channels and radio changed how fast
we receive news; today, people rely on social media to keep them updated on the latest
happenings. On social media, a message that is even a few seconds old (a tweet, a status
update) may no longer interest users, who often discard old messages and pay attention
only to recent updates. Data movement is now almost real time, and the update gap has
shrunk to fractions of a second. This high velocity of data represents Big Data.

Volume:
We are currently seeing exponential growth in data storage, since data is now much more
than text: we find videos, audio and images on our social media channels. It is common for
enterprises to run storage systems of terabytes, even petabytes. As the data grows, the
applications built to support it need to be re-evaluated quite often. This big volume
indeed represents Big Data.

Sources of Big Data:

Facebook
Twitter
Google
Skype
Logs generated by servers

Big Data Challenges:

Big data presents a number of challenges relating to its complexity:

How to understand and use big data when it comes in an unstructured format, such as text or video.
How to capture the most important data and deliver it to the right people in real time.
How to store the data, and how to analyze and understand it, given its size and our computational capacity.

And there are numerous other challenges, from privacy and security to access and
deployment.

Big Data Technologies:
Any technology that provides an economically feasible solution to these Big Data
challenges is a Big Data technology.

1. Hadoop:
Apache Hadoop is an open-source software framework that enables the distributed
processing of large data sets across clusters of commodity servers. It has a very high
degree of fault tolerance: rather than relying on high-end hardware, the resiliency of
these clusters comes from the software's ability to detect and handle failures at the
application layer. Hadoop is the most popular implementation of MapReduce and an
open-source platform for handling Big Data. It has several different applications, but one
of the top use cases is processing large volumes of constantly changing data, such as
location-based data from weather or traffic sensors, web-based or social media data, or
machine-to-machine transactional data.
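
As a concrete illustration, below is a minimal sketch of the canonical WordCount job
written against Hadoop's Java MapReduce API (the standard introductory example; the input
and output HDFS paths are passed as command-line arguments):

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

      // Mapper: emits (word, 1) for every token in its input split.
      public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
          StringTokenizer itr = new StringTokenizer(value.toString());
          while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, ONE);
          }
        }
      }

      // Reducer: sums the counts for each word across all mappers.
      public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable val : values) {
            sum += val.get();
          }
          result.set(sum);
          context.write(key, result);
        }
      }

      public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);  // local pre-aggregation
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // input dir in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output dir in HDFS
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }

The mapper and reducer run in parallel across the cluster, and the combiner performs
local pre-aggregation so less data crosses the network between the two phases.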

2. HPCC:
High Performance Computing Cluster (HPCC) stores and processes large quantities of data,
handling billions of records per second using massively parallel processing technology.
Large amounts of data across disparate data sources can be accessed, analyzed and
manipulated in fractions of a second. HPCC functions as both a processing and a
distributed data storage environment, capable of analyzing terabytes of information.

3. Schema-less databases, or NoSQL databases
There are several database types that fit into this category, such as key-value stores
and document stores, which focus on the storage and retrieval of large volumes of
unstructured, semi-structured, or even structured data. They achieve performance by doing
away with some of the restrictions traditionally associated with conventional databases,
such as strict read-write consistency, in exchange for scalability and distributed
processing.
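
To make the schema-less idea concrete, here is a toy, purely illustrative Java sketch
(not a real NoSQL client; the class and key names are invented for this example). Each
value is an opaque JSON string, so records need not share the same fields:

    import java.util.HashMap;
    import java.util.Map;

    // A toy in-memory key-value "document store": values are schema-less JSON
    // strings, so two records need not have the same structure.
    public class ToyDocumentStore {
      private final Map<String, String> store = new HashMap<>();

      public void put(String key, String jsonDocument) {
        store.put(key, jsonDocument);
      }

      public String get(String key) {
        return store.get(key);
      }

      public static void main(String[] args) {
        ToyDocumentStore db = new ToyDocumentStore();
        // No fixed schema: each document carries its own fields.
        db.put("user:1", "{\"name\": \"Asha\", \"followers\": 120}");
        db.put("user:2", "{\"name\": \"Ravi\", \"city\": \"Pune\"}");
        System.out.println(db.get("user:1"));
      }
    }

Real systems such as Redis, MongoDB or Cassandra build distribution, persistence and
tunable consistency on top of this basic key-value idea.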

4. Hive
Hive is a "SQL-like" bridge that allows BI applications to run queries against a
Hadoop cluster. It was originally developed by Facebook, but has since been made open
source, and it allows anyone to query data stored in a Hadoop cluster just as if they
were working with a conventional data store. It amplifies the reach of Hadoop, making it
more familiar to BI users.
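
As an illustration, here is a minimal Java sketch that runs a HiveQL query over JDBC
through HiveServer2; the host name, credentials and the web_logs table are assumptions
for this example:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class HiveQueryExample {
      public static void main(String[] args) throws Exception {
        // Register the Hive JDBC driver; host, port and credentials are placeholders.
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:hive2://hive-server:10000/default", "user", "");
             Statement stmt = conn.createStatement();
             // Hive compiles this "SQL-like" query into jobs that run over
             // the data stored in the Hadoop cluster.
             ResultSet rs = stmt.executeQuery(
                 "SELECT page, COUNT(*) AS hits FROM web_logs GROUP BY page")) {
          while (rs.next()) {
            System.out.println(rs.getString("page") + "\t" + rs.getLong("hits"));
          }
        }
      }
    }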

5. Pig
Pig is another bridge that tries to bring Hadoop closer to developers and business users,
similar to Hive. Unlike Hive, however, Pig uses a dataflow scripting language of its own,
Pig Latin, rather than a "SQL-like" language, to express query execution over data stored
on a Hadoop cluster. Pig was developed by Yahoo! and, just like Hive, has been made fully
open source.
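
For illustration, here is a minimal Java sketch that submits Pig Latin statements through
Pig's PigServer API; the HDFS paths, field names and tab delimiter are assumptions for
this example:

    import org.apache.pig.ExecType;
    import org.apache.pig.PigServer;

    public class PigJobExample {
      public static void main(String[] args) throws Exception {
        // Runs Pig Latin statements against the cluster; paths are placeholders.
        PigServer pig = new PigServer(ExecType.MAPREDUCE);
        pig.registerQuery("logs = LOAD '/data/web_logs' USING PigStorage('\\t') "
            + "AS (user:chararray, page:chararray);");
        pig.registerQuery("by_page = GROUP logs BY page;");
        pig.registerQuery("hits = FOREACH by_page GENERATE group AS page, COUNT(logs) AS n;");
        pig.store("hits", "/data/page_hits");  // store() triggers execution, writes to HDFS
      }
    }

Note that Pig Latin statements are a dataflow: each line names an intermediate relation,
and nothing runs until a result is actually stored or dumped.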

6. WibiData
WibiData is a combination of web analytics and Hadoop, built on top of HBase, which is
itself a database layer on top of Hadoop. It allows websites to better explore and work
with their user data, enabling real-time responses to user behaviour, such as serving
personalized content, recommendations and decisions.
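
WibiData's own APIs are not shown here, but since it sits on HBase, a minimal HBase client
sketch gives a feel for the layer underneath; the users table, the info column family and
the row key are assumptions for this example, and the table is assumed to already exist:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.*;
    import org.apache.hadoop.hbase.util.Bytes;

    public class HBaseUserProfile {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table users = conn.getTable(TableName.valueOf("users"))) {
          // Record a page view against a user's row.
          Put put = new Put(Bytes.toBytes("user:42"));
          put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("last_page"),
                        Bytes.toBytes("/products/123"));
          users.put(put);

          // Read it back, e.g. to drive a personalized response.
          Result row = users.get(new Get(Bytes.toBytes("user:42")));
          System.out.println(Bytes.toString(
              row.getValue(Bytes.toBytes("info"), Bytes.toBytes("last_page"))));
        }
      }
    }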

7. Platfora
The greatest limitation of Hadoop is that it is a very low-level implementation of
MapReduce, requiring extensive developer knowledge to operate. Between preparing, testing
and running jobs, a full cycle can take hours, eliminating the interactivity that users
enjoyed with conventional databases. Platfora is a platform that turns users' queries into
Hadoop jobs automatically, creating an abstraction layer that anyone can exploit to
simplify and organize the datasets stored in Hadoop.

8. SkyTree
SkyTree is a high-performance machine learning and data analytics platform focused
specifically on handling Big Data. Machine learning, in turn, is an essential part of
Big Data, since massive data volumes make manual exploration, or even conventional
automated exploration methods, infeasible or too expensive.

9. Ingestion & Streaming Technologies
a. Storm - Storm is a distributed real-time computation system that makes it easy to
reliably process unbounded streams of data, doing for real-time processing what
Hadoop did for batch processing. Storm can be used with any programming language, is
scalable and fault-tolerant, and guarantees that data will be processed. It is easy to
set up and operate (see the topology sketch after this list).
b. Flume - Apache Flume is a distributed, reliable, and available service for efficiently
collecting, aggregating, and moving large amounts of log data. It has a simple and
flexible architecture based on streaming data flows. It is robust and fault-tolerant,
with tunable reliability and many failover and recovery mechanisms. It uses a simple,
extensible data model that allows for online analytic applications.
c. Sqoop - Apache Sqoop is a tool designed for efficiently transferring bulk data between
Apache Hadoop and structured data stores such as relational databases. Sqoop can be
used to import data from a relational database management system (RDBMS) such as
MySQL or Oracle into the Hadoop Distributed File System (HDFS), transform the data with
Hadoop MapReduce, and then export it back into an RDBMS.
d. Kafka - Apache Kafka is a high-throughput distributed messaging system that can handle
terabytes of messages without performance impact and can be elastically and
transparently expanded without downtime. It has a modern cluster-centric design that
offers strong durability and fault-tolerance guarantees (see the producer sketch below).
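
As referenced in the Storm item above, here is a minimal topology sketch in Java against
Storm's core API: a spout emits synthetic click events and a bolt consumes them. The
component names and event contents are invented for this example, and a LocalCluster is
used purely for local demonstration (production topologies are submitted with
StormSubmitter instead):

    import java.util.Map;

    import org.apache.storm.Config;
    import org.apache.storm.LocalCluster;
    import org.apache.storm.spout.SpoutOutputCollector;
    import org.apache.storm.task.TopologyContext;
    import org.apache.storm.topology.BasicOutputCollector;
    import org.apache.storm.topology.OutputFieldsDeclarer;
    import org.apache.storm.topology.TopologyBuilder;
    import org.apache.storm.topology.base.BaseBasicBolt;
    import org.apache.storm.topology.base.BaseRichSpout;
    import org.apache.storm.tuple.Fields;
    import org.apache.storm.tuple.Tuple;
    import org.apache.storm.tuple.Values;
    import org.apache.storm.utils.Utils;

    public class ClickStreamTopology {

      // Spout: the source of the stream; here it emits a synthetic click
      // event every 100 ms. A real spout would read from a queue or log.
      public static class ClickSpout extends BaseRichSpout {
        private SpoutOutputCollector collector;

        public void open(Map conf, TopologyContext ctx, SpoutOutputCollector collector) {
          this.collector = collector;
        }

        public void nextTuple() {
          Utils.sleep(100);
          collector.emit(new Values("/products/123"));
        }

        public void declareOutputFields(OutputFieldsDeclarer declarer) {
          declarer.declare(new Fields("page"));
        }
      }

      // Bolt: a per-tuple processing step; here it just prints each event.
      public static class LogBolt extends BaseBasicBolt {
        public void execute(Tuple tuple, BasicOutputCollector collector) {
          System.out.println("click on " + tuple.getStringByField("page"));
        }

        public void declareOutputFields(OutputFieldsDeclarer declarer) {
        }
      }

      public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("clicks", new ClickSpout(), 1);
        builder.setBolt("logger", new LogBolt(), 2).shuffleGrouping("clicks");

        // In-process cluster for demonstration only.
        LocalCluster cluster = new LocalCluster();
        cluster.submitTopology("click-stream", new Config(), builder.createTopology());
        Utils.sleep(10000);
        cluster.shutdown();
      }
    }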
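
Similarly, here is a minimal Kafka producer sketch in Java using Kafka's client API; the
broker address and the web-logs topic name are assumptions for this example:

    import java.util.Properties;

    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.Producer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class KafkaLogProducer {
      public static void main(String[] args) {
        // Broker address and topic name are placeholders for this sketch.
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1:9092");
        props.put("key.serializer",
                  "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                  "org.apache.kafka.common.serialization.StringSerializer");

        try (Producer<String, String> producer = new KafkaProducer<>(props)) {
          // Each record is appended to a partitioned, replicated log on the
          // cluster; consumers read it independently at their own pace.
          producer.send(new ProducerRecord<>("web-logs", "host-1", "GET /index.html 200"));
        }
      }
    }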
