
What is Big Data?

With all the devices available today to collect data, such as RFID readers, microphones, cameras,
sensors, social media, and so on, we are seeing an explosion in data being collected worldwide. Big
Data is a term used to describe large collections of data (also known as datasets) that may be
unstructured and that grow so large and so quickly that they are difficult to manage with regular
database or statistics tools.
What is distributed computing, and where does Hadoop fit in?
In distributed computing, multiple independent systems appear as one. They interact via a
message-passing interface, and there is no single point of failure.
Challenges of distributed computing:
1. Resource sharing. Access any data and utilize CPU resources across the system.
2. Openness. Extensions, interoperability, portability.
3. Concurrency. Allow concurrent access to, and update of, shared resources.
4. Scalability. Handle extra load, such as an increase in users.
5. Fault tolerance. Provide for redundancy and recovery.
6. Heterogeneity. Different operating systems and different hardware; a middleware layer makes this possible.
7. Transparency. The system should appear as a whole rather than as a collection of computers.
8. The biggest challenge is to hide the details and complexity of meeting all of the above from
the user, and to offer a single, unified interface for interacting with the system. This is where Hadoop comes in.
Clustered storage is the use of two or more storage servers working together to increase
performance, capacity, or reliability. Clustering distributes workloads to each server, manages the
transfer of workloads between servers, and provides access to all files from any server regardless of
the physical location of the file.
What makes Hadoop unique is its simplified programming model, which allows the user to quickly
write and test distributed systems, and its efficient, automatic distribution of data and work
across machines, which in turn exploits the underlying parallelism of the CPU cores.
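To give a feel for this programming model, below is a minimal sketch of the canonical word-count job written against the Hadoop MapReduce Java API (this assumes the Hadoop 2.x org.apache.hadoop.mapreduce API is available on the classpath; the class name and paths are illustrative only):

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: for each word in an input line, emit the pair (word, 1).
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private final Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reducer: sum the counts collected for each word.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {

    private final IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory in HDFS
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory in HDFS
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

The user only writes the map and reduce functions; Hadoop takes care of splitting the input, scheduling the map and reduce tasks across the cluster, and moving the intermediate data between them.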
How Does Hadoop Solve the Big Data Problem?

Scenarios:
Increase in Volume
Imagine you have 1 GB of data that you need to process.
The data is stored in a relational database on your desktop computer, which has no problem
handling this load. Then your company starts growing very quickly, and that data grows to
10 GB, and then 100 GB, and you start to reach the limits of your desktop computer. So you
scale up by investing in a larger computer, and you are then OK for a few more months. Eventually
your data grows to 10 TB, and then 100 TB, and you are fast approaching the limits of that computer as well.
Moreover, you are now asked to feed your application with unstructured data coming from sources
like Facebook, Twitter, RFID readers, sensors, and so on.
Your management wants to derive information from both the relational data and the unstructured
data, and wants this information as soon as possible.
(Courtesy: Big Data University)
Data Read and Transfer Speed
While the storage capacities of hard drives have increased massively over the years, access speeds
(the rate at which data can be read from drives) have not kept up. One typical drive from 1990 could
store 1,370 MB of data and had a transfer speed of 4.4 MB/s, so you could read all the data from a
full drive in around five minutes. Almost 20 years later, one-terabyte drives are the norm, but the
transfer speed is around 100 MB/s, so it takes more than two and a half hours to read all the data off
the disk.

This is a long time to read all the data on a single drive, and writing is even slower. The obvious way to
reduce the time is to read from multiple disks at once. Imagine if we had 100 drives, each holding one
hundredth of the data. Working in parallel, we could read the data in under two minutes. Using only
one hundredth of each disk may seem wasteful, but we can store one hundred datasets, each of which is
one terabyte, and provide shared access to them.
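A quick sketch of the arithmetic, using the illustrative drive figures quoted above (round numbers for the sake of the example, not measurements):

public class ReadTimeEstimate {
    // Time in seconds to read sizeMB megabytes at speedMBps megabytes per second.
    static double readTimeSeconds(double sizeMB, double speedMBps) {
        return sizeMB / speedMBps;
    }

    public static void main(String[] args) {
        // 1990-era drive: 1370 MB at 4.4 MB/s -> roughly five minutes.
        System.out.printf("1990 drive: %.1f minutes%n",
                readTimeSeconds(1370, 4.4) / 60);

        // Modern 1 TB drive at 100 MB/s -> roughly 2.8 hours.
        System.out.printf("1 TB drive: %.1f hours%n",
                readTimeSeconds(1_000_000, 100) / 3600);

        // The same terabyte split across 100 drives read in parallel -> under two minutes.
        System.out.printf("100 drives in parallel: %.1f minutes%n",
                readTimeSeconds(1_000_000 / 100.0, 100) / 60);
    }
}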

Most analysis tasks need to be able to combine the data in some way; data read from one disk
may need to be combined with data from any of the other 99 disks. Various distributed systems
allow data to be combined from multiple sources, but doing this correctly is notoriously challenging.
Hardware Failure (Fault Tolerance)
The first problem to solve is hardware failure: as soon as you start using many pieces of hardware,
the chance that one will fail is fairly high. A common way of avoiding data loss is through replication:
redundant copies of the data are kept by the system so that in the event of failure, there is another
copy available. This is how RAID works, for instance, although Hadoop's filesystem, the Hadoop
Distributed File System (HDFS), takes a slightly different approach, as we will see later when we look at HDFS.

(Courtesy: Hadoop: The Definitive Guide)
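As a concrete illustration of the replication idea, HDFS controls the number of redundant copies through its dfs.replication setting (the usual default is three copies per block). Below is a minimal sketch of requesting a replication factor from Java; it assumes a configured Hadoop client on the classpath, and the file path is purely illustrative:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Ask HDFS to keep three redundant copies of each block of newly written files.
        conf.setInt("dfs.replication", 3);

        FileSystem fs = FileSystem.get(conf);

        // Existing files can have their replication factor changed explicitly.
        fs.setReplication(new Path("/data/example.txt"), (short) 3);
    }
}

If a node holding one copy of a block fails, the system notices the shortfall and creates new copies on healthy nodes, so the data remains available.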
What should you do?
Hadoop may be the answer!
What is Hadoop?
Hadoop provides a reliable shared storage and analysis system. The storage is provided by HDFS,
and the analysis by MapReduce. There are other parts to Hadoop, but these capabilities are its kernel.
Distributed storage + computational capabilities:
HDFS for storage
MapReduce as the computational framework for parallel processing
Hadoop is designed to efficiently process large volumes of information by connecting many
commodity computers together to work in parallel.
Hadoop uses Google's MapReduce and Google File System technologies as its foundation. It is
optimized to handle massive quantities of data, which could be structured, unstructured, or semi-
structured, using commodity hardware, that is, relatively inexpensive computers. This massively
parallel processing is done with great performance. However, it is a batch operation handling
massive quantities of data, so the response time is not immediate.
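To make the storage half of this concrete, here is a minimal sketch of writing a file to HDFS and reading it back through the FileSystem Java API. It assumes a client whose configuration (core-site.xml) points fs.defaultFS at the cluster's NameNode; the path used is purely illustrative:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadWrite {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/user/demo/hello.txt");

        // Write: the client sees one file; HDFS splits it into blocks and
        // spreads replicated copies of those blocks across the DataNodes.
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.write("Hello, HDFS".getBytes(StandardCharsets.UTF_8));
        }

        // Read it back: any client in the cluster sees the same shared file.
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(fs.open(file), StandardCharsets.UTF_8))) {
            System.out.println(in.readLine());
        }
    }
}

MapReduce jobs, such as the word-count sketch shown earlier, read their input from and write their output to this same shared filesystem.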


Appendix
Seek time - the time taken for a hard disk controller to locate a specific piece of stored
data. Other delays include transfer time (data rate) and rotational delay (latency).
Latency - the time required to perform some action or to produce some result. Latency is
measured in units of time: hours, minutes, seconds, nanoseconds, or clock periods.
Throughput - the amount of data that actually passes through a given medium per unit of time
(bandwidth is the "diameter" of the medium).
Fault tolerance - the ability of a system to respond gracefully and continue operating during
unexpected failures such as power loss, hardware faults, data corruption, and so on.
RAID - a collection of disks storing the same data (mirroring) in different places. Storing the
data redundantly increases fault tolerance.
Commodity hardware - ordinary PCs that are affordable and easy to obtain. Typically, a relatively
low-performance, IBM PC-compatible system capable of running Microsoft Windows, Linux,
or MS-DOS without requiring any special devices or equipment.
