
Facts about big data
What is big data?
Big data examples
Three dimensions of big data
General steps in big data analysis
Challenges in big data analysis
Apache Hadoop

The volume of business data worldwide, across all companies, doubles every 1.2 years.

Facebook handles 40 billion photos from its user base.


The four main detectors at the Large Hadron Collider (LHC) produce about 13 petabytes (13,000 terabytes) of data annually.

Walmart handles more than 1 million customer transactions every hour, which are imported into databases estimated to contain more than 2.5 petabytes of data.

Big data usually includes data sets with sizes beyond the ability of commonly used software tools to capture, manage, and process the data within a tolerable elapsed time.
There is no clear consensus on the quantity of data required to qualify as big data. As technology advances, the size of data sets that qualify as big data will also increase.

Volume: The benefit gained from the ability to process large amounts of information is the main attraction of big data analytics. It calls for scalable storage and a distributed approach to querying.

Velocity: Sometimes two minutes is too late. For time-sensitive processes such as catching fraud, big data must be analyzed as it streams in, in order to maximize its value. Organizations need to put useful information to work quickly.

Variety: Data is rarely in a structured or ordered form. It could be text from social networks, image data, or a raw feed directly from a sensor source. Even on the web, different browsers send different data.

Veracity: In traditional data warehouses there was always the assumption that the data is certain, clean, and precise. Veracity deals with uncertain or imprecise data. Two of the three Vs of big data actually work against the veracity of the data: both variety and velocity hinder the ability to cleanse the data before analyzing it and making decisions.

Data acquisition and recording: Data is recorded from some data-generating source and often needs to be filtered and compressed by orders of magnitude.
One challenge is to define these filters in such a way that they do not discard useful information.
A second challenge is to automatically generate the right metadata to describe what data is recorded and how it is recorded and measured.

Information extraction and cleaning: The information collected will not be in a format ready for analysis. An information extraction process pulls out the required information from the underlying sources and expresses it in a structured form suitable for analysis.
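As a purely illustrative sketch (the log format, the regular expression, and the class name LogExtractor below are assumptions, not part of the source), information extraction can be as simple as pulling structured fields out of semi-structured text:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only: pull structured (timestamp, level, message)
// fields out of a semi-structured log line so it can be analyzed later.
public class LogExtractor {

    // Assumed input format: "2023-01-15 10:32:01 ERROR disk quota exceeded"
    private static final Pattern LOG_LINE = Pattern.compile(
            "^(\\d{4}-\\d{2}-\\d{2} \\d{2}:\\d{2}:\\d{2}) (\\w+) (.*)$");

    public static void main(String[] args) {
        String line = "2023-01-15 10:32:01 ERROR disk quota exceeded";
        Matcher m = LOG_LINE.matcher(line);
        if (m.matches()) {
            // The three capture groups become the columns of a structured record.
            System.out.printf("timestamp=%s | level=%s | message=%s%n",
                    m.group(1), m.group(2), m.group(3));
        }
    }
}
```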

Data integration, aggregation, and representation: Given the heterogeneity of the flood of data, it is not enough merely to record it and throw it into a repository. Differences in data structure and semantics must be expressed in forms that are computer-understandable and can then be resolved automatically. There is a strong body of work in data integration that can provide some of the answers.

Query processing and data mining: Methods for querying and mining big data are fundamentally different from traditional statistical analysis of small samples. Even noisy data can be more valuable than tiny samples. Mining requires integrated, cleaned, trustworthy, and efficiently accessible data; declarative query and mining interfaces; scalable mining algorithms; and big-data computing environments.

Interpretation: Having the ability to analyse big data is of limited value if users cannot understand the analysis. A decision-maker, provided with the results of an analysis, has to interpret them. Interpretation involves examining all the assumptions made and retracing the analysis.

The main challenges in big data analysis are heterogeneity and incompleteness, scale, timeliness, privacy, and human collaboration.

When humans consume information, a great deal of heterogeneity is comfortably tolerated. Machine analysis algorithms, however, expect homogeneous data and cannot understand subtle differences in meaning.

Even after data cleaning and error correction, some incompleteness and some errors in data are likely to remain. This incompleteness and these errors must be managed during data analysis. Doing this correctly is a challenge.

Managing large and rapidly increasing volumes of data has been a challenge for many decades. The implications of changing storage technologies potentially touch every aspect of data processing, including query processing algorithms, query scheduling, database design, and recovery methods.

A system designed to deal effectively with scale is likely also to process a given data set faster. Designing such techniques becomes particularly challenging when the data volume is growing rapidly and the queries have tight response-time limits.

The privacy of data is another huge concern, and one that increases in the context of Big Data. Managing privacy is effectively both a technical and a sociological problem, which must be addressed jointly from both perspectives to realize the promise of big data.

In spite of the tremendous advances made in computational analysis, there remain many patterns that humans can easily detect but computer algorithms have a hard time finding.

A Big Data analysis system must support input from multiple human experts, and shared exploration of results.

Big data requires exceptional technologies to efficiently process large quantities of data within tolerable elapsed times. Some of these techniques are:
1. Cluster analysis
2. Data fusion and integration
3. Genetic algorithms
4. Natural language processing
5. Machine learning
6. Pattern recognition
7. Data mining

Apache Hadoop is an open source software framework that supports data-intensive distributed applications. It enables applications to work with thousands of computationally independent computers and petabytes of data. Hadoop was derived from Google's MapReduce and Google File System (GFS) papers.

MapReduce is a programming model within Hadoop designed for processing large volumes of data in parallel by dividing the work into a set of independent tasks. A job consists of a "Map" step followed by a "Reduce" step.

A list of data elements is provided, one at a time, to a function called the Mapper, which transforms each element individually into an output data element.
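As a minimal, hypothetical sketch (the word-count example and the class name WordCountMapper are illustrative, not from the source, and a reasonably recent Hadoop release with the org.apache.hadoop.mapreduce API is assumed), a Mapper might look like this. Note that, as described below, Hadoop actually hands the Mapper (key, value) pairs rather than bare elements:

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Mapper for the classic word-count example: for each input line it
// emits a (word, 1) pair for every token found in that line.
public class WordCountMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(value.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);   // emit (word, 1)
        }
    }
}
```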

A Reducer function receives an iterator over the input values. It combines these values together, returning a single output value.
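A matching word-count Reducer, again only an illustrative sketch, sums the values it receives for each key:

```java
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Reducer for the word-count example: sums the counts emitted by the
// Mapper for each word and writes a single (word, total) pair.
public class WordCountReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {

    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable count : counts) {
            sum += count.get();
        }
        context.write(word, new IntWritable(sum));  // emit (word, total)
    }
}
```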

In MapReduce, no value stands on its own. Every value has a key associated with it. Keys identify related values. The mapping and reducing functions receive not just values, but (key, value) pairs.

1. Input files are initially stored on HDFS (the Hadoop Distributed File System).
2. An InputFormat selects the files and other objects from HDFS to be used as input.
3. InputSplits break the input files into tasks (e.g., 64 MB chunks in Hadoop).
4. A RecordReader loads the split data and converts it into (key, value) pairs suitable for reading by the Mapper.
5. Mapper: given the (key, value) pairs and the map() function, the Mapper does user-defined work to produce output that is forwarded to the Reducers.
6. The shuffle process moves the data from the Mappers to the Reducers.
7. Reducer: given the (key, value) pairs and the user-defined reduce() function, it produces output that is stored back on HDFS.
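These stages can be tied together by a small driver class. The sketch below is illustrative only: it assumes the WordCountMapper and WordCountReducer sketches above, a recent Hadoop release, and HDFS input and output paths passed on the command line.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Driver: configures the job, points it at HDFS input/output paths,
// and wires in the Mapper and Reducer classes sketched above.
public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");

        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Such a driver would typically be packaged into a jar and submitted with something like: hadoop jar wordcount.jar WordCountDriver /input /output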

Hadoop provides a high degree of fault tolerance: even when running jobs on a large cluster where individual nodes or network components may experience high rates of failure, Hadoop can guide jobs toward successful completion.

The primary way that Hadoop achieves fault tolerance is through restarting tasks. Individual task nodes (TaskTrackers) are in constant communication with the head node of the system, called the JobTracker.

If a TaskTracker fails to communicate with the JobTracker for a period of time (by default, 1 minute), the JobTracker will assume that the TaskTracker in question has crashed.

The JobTracker knows which map and reduce tasks were assigned to each TaskTracker.

If the job is still in the mapping phase or the reducing phase, then other TaskTrackers will be asked to re-execute all map tasks or reduce tasks previously run by the failed TaskTracker.

The output of reduce tasks, once completed, is written back to HDFS, so completed reduce tasks do not need to be re-run.

Because tasks are divided across many nodes, it is possible for a few slow nodes to rate-limit the rest of the program. For example, one node may have a slow disk controller and read its input at only 10% of the speed of the other nodes. When 99 map tasks are already complete, the system is still waiting for the final map task to check in, and that task takes much longer than all the others.

