
HADOOP/BIG DATA

About Big Data


Big data is a general term used to describe the voluminous amount of unstructured and semi-structured data a company creates -- data that would take too much time and cost too much money to load into a relational database for analysis. The term is often used when speaking about petabytes and exabytes of data.

When dealing with larger datasets, organizations face difficulties in creating, manipulating, and managing Big Data. Big data is a particular problem in business analytics because standard tools and procedures are not designed to search and analyze massive datasets.

A primary goal for looking at big data is to discover repeatable business patterns. Unstructured data, most of it located in text files, accounts for at least 80% of an organization's data. If left unmanaged, the sheer volume of unstructured data that is generated each year within an enterprise can be costly in terms of storage. Unmanaged data can also pose a liability if information cannot be located in the event of a compliance audit or lawsuit.

Big data spans three dimensions:

Volume: Big data comes in one size: large. Enterprises are awash with data, easily amassing terabytes and even petabytes of information.

Variety: Big data extends beyond structured data to include unstructured data of all varieties: text, audio, video, click streams, log files and more.

Velocity: Often time-sensitive, big data must be used as it is streaming into the enterprise in order to maximize its value to the business.

Customer challenges for securing Big Data


Awareness & Understanding are lacking
Customers are not actively talking about security concerns, and they need help understanding the threats in a Big Data environment.

Companies' policies & laws add complexity

Main considerations: synchronizing retention and disposition policies across jurisdictions, and moving data across countries. Customers need help navigating frameworks and changes.

Storage Efficiency challenges for Big Data


De-duplication
Challenge: In most instances, data is random and inconsistent, not duplicated. Opportunity: There is a need for more intelligent identification of data.

Compression

Challenge: Compression normally happens instead of de-duplication, yet it will compress duplicated data regardless. Opportunity: There is a need for an automated way of doing both: de-duplicating first, and then compressing.
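To illustrate the "de-duplicate, then compress" idea, here is a minimal, hypothetical Java sketch (not part of any Hadoop or storage-product API): it identifies duplicate chunks by their SHA-256 content hash and gzip-compresses only the unique copies. The chunking scheme and the in-memory store are simplifying assumptions.

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.zip.GZIPOutputStream;

public class DedupeThenCompress {

    // Returns the stored (unique, compressed) chunks keyed by their SHA-256 hash.
    public static Map<String, byte[]> store(List<byte[]> chunks)
            throws IOException, NoSuchAlgorithmException {
        Map<String, byte[]> store = new HashMap<>();
        MessageDigest sha = MessageDigest.getInstance("SHA-256");
        for (byte[] chunk : chunks) {
            String key = bytesToHex(sha.digest(chunk));   // identify duplicates by content hash
            if (!store.containsKey(key)) {                // keep only one copy of each chunk
                store.put(key, gzip(chunk));              // then compress that unique copy
            }
        }
        return store;
    }

    private static byte[] gzip(byte[] data) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
            gz.write(data);
        }
        return out.toByteArray();
    }

    private static String bytesToHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) sb.append(String.format("%02x", b));
        return sb.toString();
    }
}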

About Hadoop
Hadoop is open-source software that enables reliable, scalable, distributed computing on clusters of inexpensive servers. As a solution for Big Data, it deals with the complexities of high volume, velocity and variety of data, and it enables applications to work with thousands of nodes and petabytes of data. It is:

Reliable: The software is fault tolerant; it expects and handles hardware and software failures.

Scalable: Designed for massive scale of processors, memory, and locally attached storage.

Distributed: Handles replication and offers a massively parallel programming model, MapReduce (a minimal example follows below).
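As a concrete example of the MapReduce programming model, below is the classic word-count job written against the Hadoop Java API (org.apache.hadoop.mapreduce). It is a minimal sketch: input and output paths are taken from the command line, and error handling is omitted.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map phase: runs in parallel on every block of the input, emits (word, 1).
    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokens = new StringTokenizer(line.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce phase: receives all counts for one word and sums them.
    public static class SumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable c : counts) sum += c.get();
            context.write(word, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(SumReducer.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. an HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

The mapper runs in parallel over the input blocks and the reducer sums the partial counts; this split is what lets the same code scale from a single machine to thousands of nodes.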



About Apache Hadoop Software Library

The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using a simple programming model.

It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high availability, the library itself is designed to detect and handle failures at the application layer, delivering a highly available service on top of a cluster of computers, each of which may be prone to failure.

Market Drivers for Apache Hadoop


Business Drivers: High-value projects that require the use of more data, and the belief that there is great ROI in mastering big data.

Financial Drivers: The growing cost of data systems as a percentage of IT spend, and the cost advantage of commodity hardware plus open-source software, which enables departmental-level big data strategies.

Trend

The old way:

Operational systems keep only current records and a short history. Analytics systems keep only conformed/cleaned/digested data. Unstructured data is locked away in operational silos. Archives are offline: inflexible, and new questions require system redesigns.

The new trend:

Keep raw data in Hadoop for a long time, able to produce a new analytics view on demand. Keep a new copy of data that was previously held only in silos. New reports and experiments can be run directly at low incremental cost. New products/services can be added very quickly.

The agile outcome justifies the new infrastructure.

Hadoop is a part of a larger framework of related technologies


HDFS: Hadoop Distributed File System (a minimal client API sketch follows this list).

HBase: Column-oriented, non-relational, schema-less, distributed database modeled after Google's BigTable. Promises random, real-time read/write access to Big Data.

Hive: Data warehouse system that provides a SQL interface. Data structure can be projected ad hoc onto unstructured underlying data.

Pig: A platform for manipulating and analyzing large data sets, with a high-level language for analysts.

ZooKeeper: A centralized service for maintaining configuration information, naming, and providing distributed synchronization and group services.
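As referenced in the HDFS item above, here is a minimal sketch of the HDFS client API (org.apache.hadoop.fs.FileSystem) writing and then reading a small file. The NameNode address and the file path are placeholder assumptions, not values from this document.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:8020");   // hypothetical NameNode address

        FileSystem fs = FileSystem.get(conf);

        // Write a file; HDFS splits it into blocks and replicates them across DataNodes.
        Path file = new Path("/user/demo/hello.txt");
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.write("hello hdfs\n".getBytes(StandardCharsets.UTF_8));
        }

        // Read it back; the client asks the NameNode for block locations,
        // then streams the data directly from the DataNodes.
        try (FSDataInputStream in = fs.open(file);
             BufferedReader reader = new BufferedReader(
                     new InputStreamReader(in, StandardCharsets.UTF_8))) {
            System.out.println(reader.readLine());
        }

        fs.close();
    }
}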

Organizations using Hadoop

Hadoop Developer: Core contributor since Hadoop's infancy; Project Lead for the Hadoop Distributed File System.

Facebook (Hadoop, Hive, Scribe); Yahoo! (Hadoop in Yahoo Search); Veritas (San Point Direct, Veritas File System); IBM Transarc (Andrew File System); UW Computer Science Alumni (Condor Project).

Why is Hadoop needed?


Need to process multi-petabyte datasets.
It is expensive to build reliability into each application. Nodes fail every day;
failure is expected rather than exceptional, and the number of nodes in a cluster is not constant.

Need common infrastructure: efficient, reliable, open source (Apache License).

The above goals are the same as Condor's, but:


Workloads are I/O bound and not CPU bound.

Hadoop is particularly useful when:

Complex information processing is needed.

Unstructured data needs to be turned into structured data.

Queries can't be reasonably expressed using SQL.

Heavily recursive algorithms are involved.

Complex but parallelizable algorithms are needed, such as geo-spatial analysis or genome sequencing.

Machine learning is required.

Data sets are too large to fit into database RAM or discs, or require too many cores (tens of TB up to PB).

Data value does not justify the expense of constant real-time availability, as with archives or special-interest information, which can be moved to Hadoop and remain available at lower cost.

Results are not needed in real time.

Fault tolerance is critical.

Significant custom coding would otherwise be required to handle job scheduling.

Hadoop is being used as a:

Staging layer: The most common use of Hadoop in enterprise environments is for ETL: preprocessing, filtering, and transforming vast quantities of semi-structured and unstructured data for loading into a data warehouse (a hypothetical filtering step is sketched after this list).

Event analytics layer: large-scale log processing of event data: call records, behavioral analysis, social network analysis, clickstream data, etc.

Content analytics layer: next-best action, customer experience optimization, social media analytics. MapReduce provides the abstraction layer for integrating content analytics with more traditional forms of advanced analysis.
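As mentioned under the staging layer, a typical Hadoop ETL step is a map-only job that filters and projects raw records before they are loaded into a warehouse. The sketch below is hypothetical: the tab-separated clickstream format and the field positions are assumptions, not a standard.

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Hypothetical map-only ETL step: keep only well-formed clickstream records
// and project the three columns the warehouse load expects.
public class ClickstreamFilterMapper
        extends Mapper<LongWritable, Text, NullWritable, Text> {

    private final Text cleaned = new Text();

    @Override
    protected void map(LongWritable offset, Text rawLine, Context context)
            throws IOException, InterruptedException {
        // Assumed input format: timestamp \t userId \t url \t referrer \t userAgent
        String[] fields = rawLine.toString().split("\t");
        if (fields.length < 3 || fields[1].isEmpty()) {
            return;                       // drop malformed or anonymous events
        }
        cleaned.set(fields[0] + "\t" + fields[1] + "\t" + fields[2]);
        context.write(NullWritable.get(), cleaned);  // filtered rows go to the warehouse loader
    }
}

Run with zero reducers (job.setNumReduceTasks(0)), the mapper's output is written straight back to HDFS, ready for a bulk load into the warehouse.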

Karmasphere released the results of a survey of 102 Hadoop developers regarding adoption, use and future plans.

What Data Projects is Hadoop Driving?

Are Companies Adopting Hadoop?


More than one-half (54%) of organizations surveyed are using or considering Hadoop for large-scale data processing needs.

More than twice as many Hadoop users report being able to create new products and services and enjoy cost savings beyond those using other platforms; over 82% benefit from faster analyses and better utilization of computing resources.

87% of Hadoop users are performing or planning new types of analyses with large-scale data.

94% of Hadoop users perform analytics on large volumes of data not possible before; 88% analyze data in greater detail; 82% can now retain more of their data.

Organizations use Hadoop in particular to work with unstructured data such as logs and event data (63%).

More than two-thirds of Hadoop users perform advanced analysis: data mining or algorithm development and testing.

Hadoop at LinkedIn

LinkedIn leverages Hadoop to transform raw data into rich features using knowledge aggregated from LinkedIn's 125 million member base. LinkedIn then uses Lucene to do real-time recommendations, and also Lucene on Hadoop to bridge offline analysis with user-facing services. The streams of user-generated information, referred to as social media feeds, may contain valuable, real-time information on LinkedIn members' opinions, activities, and mood states.

Hadoop at Foursquare

Foursquare was having problems handling the huge amount of data it collects. Its business development managers, venue specialists, and upper management needed access to the data in order to inform some important decisions.

To enable easy access to data, Foursquare engineering decided to use Apache Hadoop and Apache Hive in combination with a custom data server (built in Ruby), all running in Amazon EC2. The data server is built using Rails, MongoDB, Redis, and Resque and communicates with Hive using the Ruby Thrift client.

Hadoop @ Orbitz

Orbitz needed an infrastructure that provides: long-term storage of large data sets; open access for developers and business analysts; ad-hoc querying of data; and rapid deployment of reporting applications.


They moved to Hadoop and Hive to provide reliable and scalable storage and processing of data on inexpensive commodity hardware.

HDFS Architecture
[Diagram: HDFS architecture. A NameNode holds the file system metadata (file names and replica counts, e.g. /home/foo/data, 6); clients send metadata operations to the NameNode and perform block reads and writes directly against DataNodes, which replicate blocks across racks (Rack 1, Rack 2).]
