
Indian Journal of Science Perspective

The International Journal for Science

ISSN 2319 – 7730 EISSN 2319 – 7749
© 2016 Discovery Publication. All Rights Reserved

Big Data Framework for Healthcare using Hadoop

Publication History
Received: 25 December 2015
Accepted: 12 January 2016
Published: 1 February 2016

Bhuvaneswari Ragothaman, Elsa Jose, Surya Prabha M, Sarojini Ilango B. Big Data Framework for Healthcare using Hadoop. Indian
Journal of Science, 2016, 23(78), 121-126

Big Data Framework for Healthcare using Hadoop
Bhuvaneswari Ragothaman1, Elsa Jose2, Surya Prabha M3, Dr. B. Sarojini Ilango4
MPhil Scholar, Avinashilingam University, Coimbatore
Ph.D Scholar, Avinashilingam University, Coimbatore
Assistant Professor, Avinashilingam University, Coimbatore

The technological advancements prevailing today make it possible to sense, collect, transmit, store and
analyze voluminous, heterogeneous medical data. This data can be mined for valuable and interesting
patterns. However, extracting knowledge from big data poses a number of challenges that need to be
addressed. In this paper we propose a big data architecture for processing medical data. To handle this
large volume of data, the Hadoop tool can be used. The Apache Hadoop framework for processing a
distributed healthcare database is also discussed.

Keywords: Big Data, knowledge extraction, Hadoop Framework.

Introduction

Thanks to advancements in technology, hospitals are producing voluminous, heterogeneous medical data. Healthcare data is predicted to grow from a current estimate of 500 petabytes to 25,000 petabytes by 2020. This data can be computationally analyzed to provide useful knowledge and information. Imagine a scenario where it is possible to track an individual's health by monitoring vital parameters such as ECG, blood pressure and pulse rate, which could be stored and analyzed in real time. Every one of us would then have an up-to-date health document that contains all health-related information and is updated on a real-time basis. This voluminous (volume), heterogeneous (variety) data that changes from time to time (velocity) is 'Big Data'. The analysis of such large amounts of data reveals useful patterns, trends and associations between features that can be used in the identification of disease, for diagnosis and for treatment. Big Data analytics in healthcare improves clinical decision making and helps predict future outcomes from current data. Research shows that the analysis of data improves the quality of healthcare by providing richer information [1]. Healthcare data analytics plays a vital role in the early identification of diseases; the early detection of fatal diseases reduces healthcare cost and improves the quality of treatment. The analysis of medical data is always a challenging process considering its high volume, complexity and heterogeneity, and the analysis of real-time data poses the additional challenge of processing speed. These problems can be addressed using the tools available for Big Data and Big Data analytics.

Characteristics of Big Data

Big Data is characterized by the 4V's: Volume, Variety, Velocity and Veracity [3].

1. Volume refers to the large amount of data obtained from various sources, differing in size and structure: flat files, tables, graph structures and image data. A large amount of data from a source helps overcome the problems of hidden data and missing values.
2. Variety refers to the different types of data obtained from sources such as smart cards, cell phones, files and the internet. Cell phone and file data can be images, tables and flat files, which are semi-structured and structured; data received from the internet is unstructured.
3. Velocity refers to the frequency at which incoming data needs to be processed. Machine-generated mails or messages sent to mailboxes or mobile phones during credit card usage are examples of high-velocity data.
4. Veracity denotes the trustworthiness of data. Data must remain consolidated, cleaned, consistent and updated, which helps in making the right decisions [4].

Fig 1: 4V's of Big Data

Role of Big Data in Healthcare

In healthcare, a huge amount of data is generated from various sources such as research data, clinical data, medical insurance claims, drug details and electronic medical records (EMR). This enormous amount of data is said to be Big Data, and Big Data techniques help in overcoming the associated problems of storage, processing and management. There have been various advancements in healthcare due to Big Data analytics. With the help of analysis, diseases can be identified easily, and drugs and treatment can be given efficiently; the side effects of treatment can also be avoided. This likewise minimizes the cost of treatment. The quality of the treatment and the quality of the hospital could be improved with examination by the Indian Medical Council.

Fig 2: Healthcare Data Set

Literature Survey

Big Data and Hadoop are used in the enhancement of healthcare. A few techniques using Hadoop for healthcare are described below.

Paper [1] uses Big Data analytics and Hadoop's MapReduce to detect Diabetes Mellitus from a patient's blood serum and to give effective treatment at low cost. The HDFS framework is used for pattern matching to identify a patient's chances of developing diabetes.

Paper [6] proposes a smart health system assisted by cloud and Big Data, along with a layered architecture for collecting data, managing data and providing application services over that data. An Application Program Interface (API) is developed for developers and for user friendliness.

Big Data Architecture for Healthcare

The Big Data architecture for healthcare consists of three layers: the Data Collection Layer, the Data Management Layer and the Application Service Layer [6].

1. Data Collection Layer: This layer performs data collection and preprocessing. Preprocessing is done by an adapter encapsulated in this layer. Data is obtained from various sources such as hospitals, the internet and user-generated data, and includes research data, clinical data, medical expense data and emotional data. The adapter is used as the middleware that acts as the raw-data preprocessor.

2. Data Management Layer: This layer deals with data management and consists of a Distributed File Storage (DFS) module and a Distributed Parallel Computing (DPC) module. DFS deals with the massive storage of data along with data description, data entities and security checks; DPC deals with real-time and offline analysis of the data.

3. Application Service Layer: This layer consists of the Application Program Interface (API), the user interface and data access. The API serves developers, while the user interface lets end users access the data efficiently.

Figure: Big Data Architecture for Healthcare
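The three layers above can be sketched as a minimal Python pipeline. The class and method names here are illustrative assumptions, not part of the cited architecture [6]; the point is only to show how records flow from collection through management to the application service.

```python
class DataCollectionLayer:
    """Collects raw records and preprocesses them via an adapter."""
    def collect(self, raw_records):
        # The adapter acts as the raw-data preprocessor: drop empty records
        return [r for r in raw_records if r]

class DataManagementLayer:
    """Stores records (DFS role) and runs batch analysis (DPC role)."""
    def __init__(self):
        self.store = []  # local stand-in for distributed file storage
    def ingest(self, records):
        self.store.extend(records)
    def analyze(self):
        # Offline analysis: count records per data category
        counts = {}
        for r in self.store:
            counts[r["type"]] = counts.get(r["type"], 0) + 1
        return counts

class ApplicationServiceLayer:
    """Exposes analysis results through an API for developers and users."""
    def __init__(self, management):
        self.management = management
    def api_summary(self):
        return self.management.analyze()

collection = DataCollectionLayer()
management = DataManagementLayer()
service = ApplicationServiceLayer(management)
management.ingest(collection.collect([
    {"type": "clinical"}, {"type": "research"}, {}, {"type": "clinical"},
]))
print(service.api_summary())  # {'clinical': 2, 'research': 1}
```

In the real architecture, the storage would be HDFS and the analysis would run as distributed jobs; the layering itself is what this sketch preserves.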

Apache Hadoop Framework

Hadoop is an open source framework used to store and process Big Data on distributed clusters of computers using simple programming models. Hadoop grew out of Google's published MapReduce and Google File System designs, was developed at Yahoo, and was then taken up by Apache; it is now used to handle large amounts of data in various fields. The Hadoop framework consists of four modules:

1. Hadoop Common consists of common utilities that support the other Hadoop modules.
2. Hadoop Distributed File System (HDFS) is a distributed file system that provides high-throughput access to application data.
3. Hadoop YARN is a framework for job scheduling and cluster resource management.
4. Hadoop MapReduce is a YARN-based system for parallel processing of large data sets.

There are more than 150 ecosystem components available around Hadoop, such as Hive, Pig, HCatalog, MapReduce, HDFS and ZooKeeper. Most of these are used to store, analyze and process large amounts of data; MapReduce and HDFS are the two main components.

Hive

Hive provides the data warehouse for Hadoop. Using an SQL-like language, data sets can be summarized and queried for analysis. Hive uses table structures to manipulate the data.

Pig

Pig is a language used for analyzing and processing large data sets. It runs in two modes, local mode and Hadoop mode. To run a script in local mode, no HDFS or Hadoop installation is required; the script runs against the local file system. To run a script in Hadoop mode, a Hadoop installation is required and the code runs on the Hadoop cluster.

Hadoop Distributed File System (HDFS)

HDFS is a distributed file system designed to hold very large amounts of data and is highly fault tolerant: in the case of any system failure, the loss of data is minimized. HDFS replicates data blocks across the computers in the cluster and manages the data transfer.

HDFS Architecture

HDFS follows a master-slave architecture and consists of a Name Node, Data Nodes and Blocks.

Name Node: This acts as the master server; it regulates clients' access to files and manages the file system. It also executes file system operations such as renaming, closing and opening files and directories.

Figure: HDFS Architecture

Data Node: This node manages the data storage and performs read-write operations on the file system as per client requests. Following instructions from the Name Node, it also performs operations such as block creation, deletion and replication.

Blocks: Data stored in HDFS is divided into segments, known as blocks, which are stored on individual Data Nodes. A block is the minimum amount of data that HDFS can read or write.
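The splitting and replication just described can be illustrated with a small local simulation. The block size and replication factor here are chosen for the example (real HDFS defaults are 128 MB blocks and a replication factor of 3), and the node names are hypothetical:

```python
def split_into_blocks(data: bytes, block_size: int):
    """Split a byte stream into fixed-size blocks, as HDFS does with files."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def replicate(blocks, nodes, replication=3):
    """Assign each block to `replication` distinct data nodes, round-robin."""
    placement = {}
    for b in range(len(blocks)):
        placement[b] = [nodes[(b + r) % len(nodes)] for r in range(replication)]
    return placement

# A 128-byte "file" split into 32-byte blocks across 4 data nodes
blocks = split_into_blocks(b"patient-records-" * 8, block_size=32)
placement = replicate(blocks, nodes=["node1", "node2", "node3", "node4"])
# Each block lives on 3 of the 4 data nodes, so the failure of any
# single node loses no data.
print(len(blocks), placement[0])  # 4 ['node1', 'node2', 'node3']
```

Real HDFS placement is rack-aware rather than round-robin, but the fault-tolerance argument is the same: every block survives the loss of any one node.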
Map Reduce

MapReduce works as the job scheduler and resource manager of a Hadoop job. It divides the data set into small blocks of tasks and uses distributed algorithms to group the computers in a cluster to process them. It consists of two functions, Map() and Reduce().

The Map() function acts at the master node: it divides the larger task into smaller subtasks, distributes them to worker nodes, and processes the result of each subtask.

The Reduce() function collects the results of all the worker nodes' tasks and produces a final result for the large task, as the answer to the big query.

HCatalog

HCatalog presents a relational view of data. Data is stored in tables, and these tables can be placed into databases.

Conclusion

The processing of accumulated healthcare data will improve the quality of healthcare. Identifying the necessary patterns for diagnosis helps in improving future prediction and decision making. The Hadoop framework is used as the tool for analyzing the data; the framework and its ecosystem help in processing large amounts of data, or a big query, and producing results effectively. The application of this technology in healthcare helps in the early identification of diseases and helps to achieve treatment at low cost by storing and processing the large volume of healthcare data.
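As a closing illustration, the Map()/Reduce() flow described in the framework section can be simulated locally in plain Python. This counts patient records per diagnosis; it is a hypothetical example (not the diabetes pipeline of paper [1]) and a local stand-in rather than an actual Hadoop job, which would run via the MapReduce API or Hadoop Streaming:

```python
from itertools import groupby
from operator import itemgetter

def map_phase(record):
    """Map(): emit a (key, 1) pair for each record's diagnosis."""
    return (record["diagnosis"], 1)

def reduce_phase(key, values):
    """Reduce(): combine all counts emitted for one key."""
    return (key, sum(values))

records = [
    {"patient": "p1", "diagnosis": "diabetes"},
    {"patient": "p2", "diagnosis": "hypertension"},
    {"patient": "p3", "diagnosis": "diabetes"},
]

# Shuffle/sort step: sort and group the mapped pairs by key, as Hadoop
# does between the map and reduce phases.
mapped = sorted(map(map_phase, records), key=itemgetter(0))
result = dict(
    reduce_phase(key, (v for _, v in pairs))
    for key, pairs in groupby(mapped, key=itemgetter(0))
)
print(result)  # {'diabetes': 2, 'hypertension': 1}
```

On a cluster, the map calls run in parallel on the worker nodes holding the data blocks, and the framework performs the shuffle; the program structure, however, is exactly this pair of functions.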

References

1. Saravana Kumar N M, Eswari T, Sampath P, Lavanya S. Predictive Methodology for Diabetic Data Analysis in Big Data. Elsevier, 2nd International Symposium on Big Data and Cloud Computing (ISBCC'15), 2015.
2. J. Archenaa, E. A. Mary Anita. A Survey of Big Data Analytics in Healthcare and Government. Elsevier, 2nd International Symposium on Big Data and Cloud Computing (ISBCC'15), 2015.
3. Aisling O'Driscoll, Jurate Daugelaite, Roy D. Sleator. 'Big data', Hadoop and cloud computing in genomics. Elsevier, Journal of Biomedical Informatics, 2013.
4. Marco Viceconti, Peter Hunter, Rod Hose. Big Data, Big Knowledge: Big Data for Personalized Healthcare. IEEE Journal of Biomedical and Health Informatics, Vol. 19, No. 4, July 2015.
5. Ibrahim Abaker Targio Hashem, Ibrar Yaqoob, Nor Badrul Anuar, Salimah Mokhtar, Abdullah Gani, Samee Ullah Khan. The rise of "big data" on cloud computing: Review and open research issues. Elsevier, Information Systems 47 (2015) 98-115, 2015.
6. Yin Zhang, Meikang Qiu, Chun-Wei Tsai, Mohammad Mehedi Hassan, Atif Alamri. Health-CPS: Healthcare Cyber-Physical System Assisted by Cloud and Big Data. IEEE Systems Journal, 2015.