
Big Data Emerging Technologies: A Case Study on

Analyzing HealthCare Data using Apache Hive and Pig

Abstract-These are the days of growth and innovation for a better future. Nowadays, companies realize the need for Big Data to make decisions over complex problems. Big Data is a term that refers to collections of large datasets containing massive amounts of data, whose size is in the range of petabytes or zettabytes, or which grow at a high rate and with a complexity that make them difficult to process and analyze using conventional database technologies. Big Data is generated from various sources, such as social networking, and the generated data can be in structured, semi-structured, or unstructured format. To extract valuable information from this huge amount of data, organizations need new tools and techniques to derive business benefits and gain competitive advantage in the market. In this paper, a comprehensive study of major emerging Big Data technologies is presented, highlighting their important features and how they work, along with a comparative study between them. The paper also presents a performance analysis of an Apache Hive query executed over patient records, measuring the MapReduce CPU time spent and the total time taken to finish the job.

INTRODUCTION

Big data is a popular term used to describe the exponential growth and availability of data, both structured and unstructured. Big data analytics helps organizations harness their data and use it to identify new opportunities. That, in turn, leads to smarter business moves, more efficient operations, higher profits, and happier customers. Big data is enabling the future of healthcare: by collecting patient data, clinicians can use predictive analytics to prevent potentially deadly conditions. In this paper we analyze patient records to predict how many patients suffer from particular diseases and how those diseases can be prevented. This paper is organized as follows. In Section II an overview of Big Data and Hadoop is presented. In Sections III and IV, the Hadoop ecosystem and the analysis of healthcare data using Apache Hive configured on an Apache Hadoop cluster are discussed. Finally, the paper is concluded in Section V.
OVERVIEW OF BIG DATA AND HADOOP

Hadoop runs applications using the MapReduce algorithm, where the data is processed in parallel on different CPU nodes. In short, the Hadoop framework makes it possible to develop applications that run on clusters of computers and perform complete statistical analysis of huge amounts of data. Hadoop is an Apache open-source framework written in Java that allows distributed processing of large datasets across clusters of computers using simple programming models. A Hadoop-framework application works in an environment that provides distributed storage and computation across clusters of computers. Hadoop is designed to scale up from a single server to thousands of machines, each offering local computation and storage.
MapReduce
MapReduce is a processing technique and a programming model for distributed computing based on Java. The MapReduce algorithm contains two important tasks, namely Map and Reduce. Map takes a set of data and converts it into another set of data, where individual elements are broken down into tuples (key/value pairs). The reduce task then takes the output from a map as its input and combines those data tuples into a smaller set of tuples. As the name MapReduce implies, the reduce task is always performed after the map job.
The major advantage of MapReduce is that it is easy to scale data processing over multiple
computing nodes. Under the MapReduce model, the data processing primitives are called
mappers and reducers. Decomposing a data processing application into mappers and reducers
is sometimes nontrivial. But, once we write an application in the MapReduce form, scaling
the application to run over hundreds, thousands, or even tens of thousands of machines in a
cluster is merely a configuration change. This simple scalability is what has attracted many
programmers to use the MapReduce model.
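The word-count example commonly used to illustrate this model can be sketched outside Hadoop. The snippet below simulates the map and reduce phases locally in Python; it is an illustration of the two-phase idea, not Hadoop code:

```python
from collections import defaultdict

def map_phase(split):
    # Map: emit a (word, 1) key/value pair for every word in the input split.
    return [(word, 1) for word in split.split()]

def reduce_phase(pairs):
    # The shuffle/sort step groups pairs by key; reduce sums the values per key.
    counts = defaultdict(int)
    for word, value in pairs:
        counts[word] += value
    return dict(counts)

splits = ["big data big", "data analytics"]
mapped = [pair for split in splits for pair in map_phase(split)]
print(reduce_phase(mapped))  # {'big': 2, 'data': 2, 'analytics': 1}
```

In a real cluster, each call to `map_phase` would run on a separate node against its own data split, and the framework, not the programmer, would perform the grouping between the two phases.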

HADOOP ECOSYSTEM
HIVE
Hive is a data warehouse infrastructure tool to process structured data in Hadoop. It resides
on top of Hadoop to summarize Big Data, and makes querying and analyzing easy.
Initially Hive was developed by Facebook, later the Apache Software Foundation took it up
and developed it further as an open source under the name Apache Hive. It is used by
different companies. For example, Amazon uses it in Amazon Elastic MapReduce.

It stores schema in a database and processes data into HDFS.

It is designed for OLAP.

It provides an SQL-type language for querying called HiveQL or HQL.

It is familiar, fast, scalable, and extensible.
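For a patient-record analysis like the one described in this paper, a typical HiveQL query counts records per disease, for example `SELECT disease, COUNT(*) FROM patients GROUP BY disease;` (the table and column names here are hypothetical, not taken from the paper's dataset). A minimal Python sketch of what such a grouped aggregation computes:

```python
from collections import Counter

# Hypothetical patient records; in Hive these would live in a table on HDFS.
patients = [
    {"id": 1, "disease": "diabetes"},
    {"id": 2, "disease": "asthma"},
    {"id": 3, "disease": "diabetes"},
]

# Equivalent of: SELECT disease, COUNT(*) FROM patients GROUP BY disease;
per_disease = Counter(p["disease"] for p in patients)
print(per_disease)  # Counter({'diabetes': 2, 'asthma': 1})
```

On a real cluster, Hive compiles such a query into MapReduce jobs, which is where the CPU time and total job time measured later in the paper come from.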

PIG
Apache Pig is a platform for analyzing large data sets. Pig's language, Pig Latin, is a simple
query algebra that lets you express data transformations such as merging data sets, filtering
them, and applying functions to records or groups of records. Users can create their own
functions to do special-purpose processing.
Pig Latin queries execute in a distributed fashion on a cluster. The current implementation compiles Pig Latin programs into MapReduce jobs and executes them on a Hadoop cluster.
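A typical Pig Latin script chains FILTER, GROUP, and FOREACH operations over tuples. The sketch below mirrors such a pipeline in Python over hypothetical patient tuples; the Pig Latin shown in the comments is illustrative, not taken from the paper's experiments:

```python
from collections import defaultdict

# Hypothetical records: (patient_id, age, disease)
records = [(1, 64, "diabetes"), (2, 12, "asthma"), (3, 70, "diabetes")]

# Pig Latin equivalent (sketch):
#   adults  = FILTER records BY age >= 18;
#   grouped = GROUP adults BY disease;
#   counts  = FOREACH grouped GENERATE group, COUNT(adults);
adults = [r for r in records if r[1] >= 18]
grouped = defaultdict(list)
for r in adults:
    grouped[r[2]].append(r)
counts = {disease: len(rows) for disease, rows in grouped.items()}
print(counts)  # {'diabetes': 2}
```

Each Pig Latin statement names an intermediate relation, which Pig's compiler turns into one or more MapReduce stages rather than materializing it eagerly as this local sketch does.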
HBASE
It is a column-oriented database scaling to millions of rows. Data is stored in the rows of a table, and data within a row is grouped by column family. HBase is a schemaless database in the sense that neither the columns nor the types of data stored in them need to be defined before using them. The open-source code scales linearly to handle petabytes of data on thousands of nodes. It can rely on data redundancy, batch processing, and other features that are provided by distributed applications in the Hadoop ecosystem.
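The row-key/column-family model can be pictured as nested maps. The sketch below uses hypothetical table, row-key, and column names to show the key point made above: new columns need no prior schema definition:

```python
# Minimal sketch of HBase's data model: each row key maps to column
# families, and each family holds its own dynamic set of column qualifiers.
table = {
    "patient#001": {
        "info":   {"name": "A. Kumar", "age": "64"},  # column family "info"
        "vitals": {"bp": "140/90", "pulse": "82"},    # column family "vitals"
    }
}

# Columns need not be declared up front: adding a new qualifier is just a write.
table["patient#001"]["vitals"]["glucose"] = "180"

print(sorted(table["patient#001"]["vitals"]))  # ['bp', 'glucose', 'pulse']
```

Real HBase additionally versions every cell with a timestamp and stores values as raw bytes; this sketch omits both for brevity.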
JQL
JQL is an extension for Java that provides support for querying collections of
objects. These queries can be run over objects in collections in your program, or
even used to check expressions on all instances of specific types at runtime.
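In spirit, a JQL-style query over object collections behaves like a filtered comprehension. The sketch below is a Python analogue with illustrative names; the JQL form quoted in the comment is paraphrased, not exact syntax:

```python
# A JQL-style query selects objects from a collection by a predicate.
class Patient:
    def __init__(self, name, age):
        self.name, self.age = name, age

patients = [Patient("A", 70), Patient("B", 15), Patient("C", 45)]

# Roughly the effect of a JQL query like: selectAll(Patient p : p.age > 40)
seniors = [p.name for p in patients if p.age > 40]
print(seniors)  # ['A', 'C']
```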

(Figures: Hive session, prediction using R, Pig session, starting the Hadoop cluster, starting JAQL.)

