
BIG DATA: A STUDY OF SECURITY AND

PRIVACY CHALLENGES
Vijaya Kumar KS
Tavant Technologies
Bangalore, India
vijaykalanji@gmail.com

Abstract: Big data is the term used to describe large sets of data that are difficult to process using conventional database and data mining technologies. There has been a huge increase in the volume of data that is generated today, and being able to identify important insights from these data is one of the most important tasks in today's world. In this paper I explain what Big data is and what the challenges in handling it are; further, I discuss security and privacy related to Big data in greater depth.
Keywords: big data, big data security, big data privacy, big data characteristics

I. INTRODUCTION, CHALLENGES
In today's world we deal with diverse sets of data. Users generate content by accessing social networks, and huge servers log information about their activity into log files. These data come from heterogeneous sources.
II. BIG DATA
A. What is Big Data
Big data refers to data sets that are too huge and complex
to be processed by using on-hand data management tools or
traditional data processing applications.
B. Challenges
Handling big data requires different breed of technologies
altogether. There are four characteristics that define Big data.
These are Volume, Velocity, Variety and Veracity[5] as
shown in figure 1.

In last few years the data rate has grown exponentially.


According to Fortune magazine's survey[1], five exabytes
(10006bytes) of digital data were created in recorded time until
2003. In 2011 same amount of data was created in two days.
This example illustrates the rate at which the data is growing.
With such high rate of data generation, traditional databases
have been pushed to their limit. It is very obvious then that
traditional databases have a fixed limit and are not suited to
store such huge data. Hence the need for new technologies that
can handle big data. These technologies are more complex
than traditional databases and scale to vastly large set of data.
Google, Amazon, and open source communities are the main contributors of Big data technologies. For instance, Google pioneered the MapReduce[2] framework, a distributed file system, and distributed locking systems. Amazon created a distributed key-value store called Dynamo[3]. The open source community contributed Hadoop, HBase, MongoDB, Cassandra, RabbitMQ, and a lot of other projects[4].
Until the last decade, a large chunk of data, such as the logs generated by different servers, was not processed thoroughly. As a result, these data were either discarded periodically or stored on disks. With the advent of big data technology, these sources of data have suddenly gained a lot of importance. The process of researching massive amounts of data to reveal hidden patterns and secret correlations is known as Big data analytics. Big data analytics provides companies with richer and deeper insights into their business by making use of large sets of data. This helps companies get an edge over their competitors, and the field is emerging as one of the most sought-after fields these days.

Fig. 1. Big Data Characteristics

Volume refers to the size of the data. With the advent of the latest technologies and more storage space, applications have started capturing more data than they used to. Also, the source of data is no longer restricted to text; mobile devices and social media produce a lot of data. A decade ago, organizations were content with terabytes of storage for their analytics infrastructure, but now they have graduated to applications requiring storage in petabytes. As the volume of data is huge, the storage and the technologies required to process this data need to be different from traditional ones.

Variety refers to the different sources from which data comes and the different structures of data. In today's world, data comes from different sources such as social media, sensors, etc. As these data come from different sources, their structure also differs, which makes the task of processing the data more difficult. Hence, traditional databases, which rely on fixed data types and schemas to store data, are not of much relevance here.
Velocity refers to the speed with which data is generated and the speed at which it moves around. The pervasiveness of mobiles, tablets, and sensors has contributed greatly to the Velocity aspect of big data. It is important to understand that businesses gain an upper hand if they are in a position to identify a trend, problem, or opportunity before somebody else does. Besides, the shelf life of data has gone down considerably. As a consequence, the ability to analyze data in real time, with the least delay, is the need of the hour. For instance, in traditional processing, queries are run mostly on static data and the results are also mostly static. If one wants to identify all the people located in a city, then he/she would query a database to fetch this data. However, this data may not represent the actual situation, as a lot of people would have moved out and a lot of other people would have moved in (which can be tracked using GPS data). This scenario illustrates the importance of querying and processing the volume and variety of data while it is still in motion, not just after it is at rest.
Veracity mainly refers to the credibility of the data that comes in. Unlike internal data, a major chunk of Big data input comes from sources that are out of our control. Therefore, the data suffers from significant correctness or accuracy problems, and it is very important to verify its authenticity. For instance, suppose a company launches a new product in the market. If we see a large number of positive responses on social media, then what is the guarantee that these are genuine responses? How do we ensure that these responses were not placed by a third party?
III. PRIVACY AND SECURITY ISSUES IN BIG DATA
Security and privacy issues are very important aspects that we need to consider when dealing with Big data. These issues take on a new dimension altogether in the case of Big data (compared to traditional relational database applications). Since the amount of data is very large, and since it comes from a variety of sources with semi- or no structure, traditional security mechanisms won't work with Big data.
In the remaining part of the paper, I discuss ten major security and privacy challenges pertaining to Big data[6]. I first analyze each problem, then look at where the research is headed and at proposed solutions, if any. I will often consider the example of Apache Hadoop (a very popular implementation of the MapReduce framework), because Hadoop is widely known and is used by many organizations.
A. Major Security and Privacy Challenges
1) Secure computations in distributed programming frameworks.
2) Security best practices for non-relational data stores.
3) Secure data storage and transaction logs.
4) End-point input validation/filtering.
5) Real-time security monitoring.
6) Scalable and composable privacy-preserving data mining and analytics.
7) Cryptographically enforced data-centric security.
8) Granular access control.
9) Granular audits.
10) Data provenance.

Figure 2 depicts a typical big data ecosystem. The numbers in parentheses indicate the security challenges that exist at each component of the Big data ecosystem. For instance, when we gather data from different sources, end-point input validation/filtering and real-time security monitoring are the main issues (as indicated by the numbers 4 and 10 in Figure 2).

Fig. 2. Big Data Ecosystem

1) Secure computations in distributed programming frameworks: The frameworks that operate in a distributed environment utilize parallel computation and storage to process huge amounts of data. For instance, in the case of MapReduce, a given problem is solved in two steps. The framework splits an input file into many chunks. In the first phase of MapReduce, a mapper reads the data from a given chunk, processes it, and outputs a list of key-value pairs. In the second phase, a reducer aggregates the values for each distinct key and produces the result.
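To make the two phases concrete, here is a minimal sketch of the map/shuffle/reduce flow in Python, using word counting as the example problem. It illustrates the concept only; it does not use Hadoop's actual API.

```python
from collections import defaultdict

# Map phase: each mapper reads one chunk and emits (key, value) pairs.
def map_chunk(chunk):
    for word in chunk.split():
        yield (word.lower(), 1)

# Shuffle: group the values emitted by all mappers by key.
def shuffle(mapped_pairs):
    groups = defaultdict(list)
    for key, value in mapped_pairs:
        groups[key].append(value)
    return groups

# Reduce phase: each reducer aggregates the values for one key.
def reduce_key(key, values):
    return (key, sum(values))

chunks = ["big data is big", "data is everywhere"]
mapped = [pair for chunk in chunks for pair in map_chunk(chunk)]
result = dict(reduce_key(k, v) for k, v in shuffle(mapped).items())
print(result)  # {'big': 2, 'data': 2, 'is': 2, 'everywhere': 1}
```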
In this kind of environment, a machine that runs a mapper might malfunction and, as a result, might return incorrect data. Such a machine can also be manipulated to divulge private user data. Further, a rogue data node can be added to a cluster; this rogue node can receive replicated data or deliver altered MapReduce code. To address these issues, two techniques have been proposed.
a) Trust establishment: In this technique, at the outset of the computation, the authenticity of each worker is verified by the master. Following this initial verification, the workers are then probed periodically for conformance with their security properties.
b) Mandatory Access Control (MAC): MAC ensures that access to files is authorized by a predefined security policy. MAC ensures the integrity of inputs to the mappers, but does not prevent data leakage from the mapper outputs. In the case of MapReduce, Airavat[7] is an implementation of MAC. Airavat modifies the MapReduce framework, the distributed file system, and the Java virtual machine, with SELinux as the underlying operating system. This setup ensures that untrusted code does not leak information.
2) Security best practices for non-relational data stores: NoSQL (Not Only SQL) databases[8] are generally used in solving big data problems. These systems scale well and support disparate data types. It must be noted that the security aspects of these systems are still at a nascent stage. As a result, when compared with traditional RDBMS systems, NoSQL databases have a thin security layer. Each NoSQL database was implemented to address a different aspect of the Big data problem, and security was not given the highest priority during the design phase[9]. NoSQL databases do not provide any support for explicitly enforcing security in the database. Since these databases are used in a clustered environment, additional threats are posed.
At the moment, a security model is enforced on NoSQL databases by using external enforcement mechanisms. For instance, NoSQL often works in conjunction with Hadoop: if a user wants to access data in the NoSQL database, then he/she should go through the Hadoop security layer. Some organizations overlay the core components and provide their own version of the framework, with a strong authentication mechanism implemented within it. An example of this is Cloudera's distribution of Hadoop. This prevents user impersonation in NoSQL systems.
Another way to enforce security is to introduce a middle layer on top of NoSQL to handle the security aspects of the system. An example of such a middleware layer is one built in Java: Java's built-in support for authentication is called the Java Authentication and Authorization Service (JAAS). This kind of configuration ensures that any changes to the data or schema are validated.
Finally, cryptography-based solutions serve well to protect the data. The advantages of using cryptography are that the solutions are independent of the underlying platform, operating system, or storage type, and that the solutions are cost effective. An example of this is Hadoop, which uses file-layer encryption to provide data protection.
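As an illustration of a cryptography-based solution that is independent of the underlying store, the following sketch encrypts values before they ever reach a key/value store. It assumes the third-party Python `cryptography` package and is a conceptual example, not Hadoop's actual file-layer encryption.

```python
# pip install cryptography  (third-party package, assumed available)
from cryptography.fernet import Fernet

key = Fernet.generate_key()      # in practice, managed by a key-management service
cipher = Fernet(key)

def put_encrypted(store, k, value):
    # Encrypt the value before it ever reaches the (untrusted) data store.
    store[k] = cipher.encrypt(value.encode("utf-8"))

def get_decrypted(store, k):
    return cipher.decrypt(store[k]).decode("utf-8")

store = {}                       # stand-in for a NoSQL key/value store
put_encrypted(store, "user:42", "sensitive record")
print(get_decrypted(store, "user:42"))   # sensitive record
```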
3) Secure data storage and transaction logs: Data and transaction logs are stored in multi-tiered storage media, and it is often the case that one needs to move data from one tier to another. There are two ways one can move the data among the tiers.
a) Manual: In this case, a person manually moves the data from one tier to another. The problem with this approach is that it exposes the data to the person who performs the movement.

b) Auto-tiering: As the size of the data has increased manifold, auto-tiering has become the need of the hour. In auto-tiering, software components take care of inter-tier movement. The limitation of this approach is that it does not track where the data is stored. As a result, it poses new challenges to secure data storage.
Security strategies are diverse as a result of the heterogeneity of technologies, varied security policies, and cost constraints. Data confidentiality, integrity, and availability are general issues, and these can be solved. There are three special issues that require more attention.
a) Dynamic data operations: The data set in an auto-tiered storage system is dynamic. It supports many operations such as modification, duplication, deletion, and insertion. An extended version of the Provable Data Possession (PDP) scheme[10] can be used to secure the data, as it relies only on symmetric-key cryptography.
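The sketch below conveys the flavor of symmetric-key integrity checking with a simple HMAC-based challenge-response. It is a heavily simplified stand-in for the actual PDP scheme[10], which avoids storing per-block tags and transferring whole blocks.

```python
import hmac, hashlib, os, random

SECRET = os.urandom(32)   # kept by the data owner, never given to the storage provider

def tag_block(index, block):
    # The owner precomputes a keyed tag for each data block before outsourcing it.
    return hmac.new(SECRET, b"%d:" % index + block, hashlib.sha256).digest()

blocks = [b"block-0 contents", b"block-1 contents", b"block-2 contents"]
tags = [tag_block(i, b) for i, b in enumerate(blocks)]   # retained by the owner

def verify_possession(server_blocks):
    # Challenge a randomly chosen block; the server returns its contents,
    # and the owner verifies them against the stored tag.
    i = random.randrange(len(tags))
    proof = hmac.new(SECRET, b"%d:" % i + server_blocks[i], hashlib.sha256).digest()
    return hmac.compare_digest(proof, tags[i])

print(verify_possession(blocks))        # True
print(verify_possession([b"x"] * 3))    # False (tampering detected)
```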
b) Privacy preservation: In a distributed cloud environment, the data is stored remotely from the user. On the one hand, this saves the user the burden of local data storage and maintenance; on the other hand, it raises the question of data integrity protection. In this type of environment, a third-party auditor (TPA) is required to enable public auditability of the data. Consequently, the TPA's access to the data should be secured. A privacy-preserving public auditing scheme can be utilized for cloud storage (Wang et al.[11]). Based on a homomorphic linear authenticator integrated with random masking, the proposed scheme preserves data privacy when a TPA audits the data sets stored in servers at different tiers.
c) Secure manipulations on encrypted data: It has been shown by Craig Gentry[12] that many operations can be conducted on ciphertext without decrypting it. A fully homomorphic encryption scheme makes these operations possible because more complex functions are supported. This development helps make cloud computing compatible with privacy.
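The following sketch illustrates computing on ciphertext using the additively homomorphic Paillier cryptosystem via the third-party `phe` Python library (an assumption; any partially homomorphic scheme would serve). It is not Gentry's FHE scheme, but it shows the basic idea of operating on encrypted values.

```python
# pip install phe  (third-party Paillier library, assumed available)
from phe import paillier

public_key, private_key = paillier.generate_paillier_keypair()

# The client encrypts its values before sending them to the cloud.
enc_a = public_key.encrypt(17)
enc_b = public_key.encrypt(25)

# The cloud can add ciphertexts and multiply them by plaintext constants
# without ever seeing the underlying values.
enc_sum = enc_a + enc_b
enc_scaled = enc_a * 3

# Only the key holder can decrypt the results.
print(private_key.decrypt(enc_sum))     # 42
print(private_key.decrypt(enc_scaled))  # 51
```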
4) End-point input validation/filtering: As input data can come from many sources (Variety), the major challenge is input data validation. The problem is further compounded by the fact that some of the sources may be malicious. For instance, consider a big data application that processes sensor data. A motivated adversary may create a rogue virtual sensor to send spurious data. Since the volume of the data collected is so large, it is very difficult to identify the malicious data. The issue of input validation can be tackled in two steps:
a) Prevent the adversary from sending malicious data at the source itself.
b) Identify the malicious data at the central system and take necessary action.
Currently, there is no foolproof solution to this issue; a hybrid approach is most effective. Firstly, Big data collection system designers should develop a secure platform for data collection. Secondly, designers should identify all possible spoofing attacks and develop cost-effective solutions to tackle them. Thirdly, designers should be aware that there is a high chance that an adversary may send malicious data, and design algorithms to filter out such data.
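A minimal sketch of such a hybrid design follows: each (hypothetical) sensor signs its readings with a pre-shared key, and the central collector both verifies the signature and filters out implausible values. The device names, keys, and value range are illustrative only.

```python
import hmac, hashlib, json

DEVICE_KEYS = {"sensor-01": b"pre-shared-key-01"}   # provisioned at enrollment
TEMP_RANGE = (-40.0, 85.0)                          # plausible range for this sensor type

def sign_reading(device_id, reading):
    # Step (a): the trusted endpoint authenticates its own data at the source.
    payload = json.dumps(reading, sort_keys=True).encode()
    sig = hmac.new(DEVICE_KEYS[device_id], payload, hashlib.sha256).hexdigest()
    return {"device": device_id, "payload": payload.decode(), "sig": sig}

def accept_reading(message):
    # Step (b): the central system rejects spoofed or implausible data.
    key = DEVICE_KEYS.get(message["device"])
    if key is None:
        return False                                 # unknown (possibly rogue) endpoint
    expected = hmac.new(key, message["payload"].encode(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(expected, message["sig"]):
        return False                                 # tampered or spoofed message
    value = json.loads(message["payload"])["temp"]
    return TEMP_RANGE[0] <= value <= TEMP_RANGE[1]   # filter implausible values

msg = sign_reading("sensor-01", {"temp": 21.5})
print(accept_reading(msg))                           # True
```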

5) Real-time security monitoring: This is one of the most challenging Big data analytics problems. There are two main aspects to this problem.
a) Monitoring the big data infrastructure. An example of this is monitoring the health of all nodes in the big data infrastructure.

b) Using the big data infrastructure for data analytics. An example of this is insurance companies using monitoring tools to identify fraudulent claims.
An important aspect of real-time security monitoring is the minimization of false-positive alerts. The system may trigger a large number of such alerts, and given limited human capacity, many of them cannot be analyzed. The implementation of security best practices depends on the scenario. Currently, there are no built-in security monitoring and analysis tools in Hadoop. Some vendors, such as Splunk[13], are developing tools on top of Hadoop. Another approach is to monitor Hadoop requests at the front end; work is currently in progress in this regard.
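As a toy illustration of aspect (a), the sketch below flags cluster nodes whose heartbeat has gone stale. Real deployments would rely on dedicated monitoring tools rather than this simplified check; the node names and timeout are illustrative.

```python
import time

HEARTBEAT_TIMEOUT = 30.0   # seconds without a heartbeat before a node is flagged

# Last heartbeat timestamp reported by each node in the cluster.
last_seen = {"node-1": time.time(), "node-2": time.time() - 120}

def unhealthy_nodes(now=None):
    now = now or time.time()
    return [node for node, ts in last_seen.items()
            if now - ts > HEARTBEAT_TIMEOUT]

print(unhealthy_nodes())   # ['node-2']
```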
6) Scalable and composable privacy-preserving data mining and analytics: Big data can potentially invade privacy if not handled properly[14]. Companies mainly use big data analytics to study the behavior of customers and then target them with relevant advertisements. It is not sufficient to hide only the identity of the customers: given a data set of anonymized customer data, one can still re-identify individual customers. The AOL anonymized search logs are a good example of this[15]. The huge volume of data gathered by a company is accessed by many parties, and it is possible that a party with ill intent can extract personal information from the data and misuse it. Continuous monitoring is one option for keeping the data out of the wrong hands. There are multiple ways in which data privacy can be achieved; privacy-preserving analytics and differential privacy are two examples, and both fields are at a nascent stage. Currently, the best way to prevent attacks from malicious users is to encrypt the data sets and to use access controls and authorization mechanisms. One should also keep the software infrastructure patched with up-to-date security fixes. This minimizes the security vulnerability of the application.
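One concrete building block of differential privacy is the Laplace mechanism, sketched below for a simple count query. The epsilon value and data are illustrative only, and this is a conceptual toy rather than a production mechanism.

```python
import numpy as np

def dp_count(values, predicate, epsilon=0.5):
    """Release a count with Laplace noise calibrated to sensitivity 1."""
    true_count = sum(1 for v in values if predicate(v))
    noise = np.random.laplace(loc=0.0, scale=1.0 / epsilon)
    return true_count + noise

ages = [23, 31, 45, 52, 29, 61, 38]
# How many customers are over 40? The noisy answer hides any single individual.
print(dp_count(ages, lambda a: a > 40, epsilon=0.5))
```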
7) Cryptographically enforced data-centric security: Data storage is no longer confined to mainframes or relational databases; the data is on the move, meaning it is transferred from one place to another. In such a scenario, the security of the data is of major importance. There are two ways in which the visibility of data can be controlled for different entities (individuals, organizations, etc.). In the first approach, access to the underlying system, such as the operating system, is granted based on the identity of the user. This is an easy-to-implement approach. In the second approach, the data itself is encrypted before it is presented to the user. The latter approach is more difficult to break.
Elliptic curve groups that support bilinear pairing maps are used by many identity- and attribute-based encryption schemes. In 2009, Gentry[16] constructed the first fully homomorphic encryption (FHE) scheme, using ideal lattices over a polynomial ring. Although lattice constructions are not terribly inefficient, the computational overhead of FHE is still far from practical. Research is in progress to find simpler constructions.
8) Granular access control: In any application, it is very important that we grant access only to those people who should have it. This leads us to the concept of granular access. It is very important to understand the level of granularity that is required in an application. Granular access can be enforced at the row level, the column level, or both in an application. In this context, a row represents a single record and a column represents a specific field across all records. Some applications require even finer granularity, in which case cell-level access is required. It is also important to note that scalability is a key factor when devising a granular access solution. When implementing granular security access, we need to consider all the elements of the big data ecosystem. For instance, in the case of Hadoop, protocols that enforce access-level restrictions should be ingrained in both HDFS and the NoSQL databases.
Apache Accumulo[17] is one example of a NoSQL database that supports mature, cell-level access control. Accumulo is built on top of the Hadoop ecosystem. In Accumulo, every atomic key/value pair is tagged with an expression that describes the roles required to read that entry, and every query includes a role check. As a result, Accumulo is used as a key/value store for applications that require access to sensitive data sets.
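The sketch below mimics the spirit of cell-level visibility checks in a few lines of Python. It is not Accumulo's actual API, and it simplifies visibility expressions to a single OR-set of roles per cell; Accumulo's real expressions also support AND/OR combinations.

```python
# Each cell carries a set of roles; a reader must hold at least one of them.
table = {
    ("user:42", "name"):   {"value": "Alice",  "visibility": {"admin", "support"}},
    ("user:42", "salary"): {"value": "120000", "visibility": {"admin", "payroll"}},
}

def query(user_roles, row, column):
    cell = table.get((row, column))
    if cell is None or not (cell["visibility"] & user_roles):
        return None                      # role check fails: the cell stays invisible
    return cell["value"]

print(query({"support"}, "user:42", "name"))     # Alice
print(query({"support"}, "user:42", "salary"))   # None
```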
9) Granular audits: Audit information is needed for a variety of reasons. For instance, we can identify a missed attack by using audit information, or we can figure out what went wrong and when. Given the importance of the audit data, a few considerations need to be taken into account while designing an auditing solution: completeness of the required audit information, timely access to audit information, integrity of the information, and authorized access to the audit information.
Implementation of audit features starts at the individual component level. Examples include enabling syslog on routers, application logging, and enabling logging at the operating system level. A tool then collects all of this data. A good design is to have this tool implemented outside of the big data ecosystem so that the two are well separated. Another approach is to create an audit layer that abstracts the required audit information from the auditor. In this scenario, the audit layer takes the requests of the auditors and returns the required information to them.
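As an illustration of how such a layer can preserve the integrity of audit information, the following sketch hash-chains audit events collected outside the cluster so that any tampering breaks verification. The event fields and actor names are illustrative only.

```python
import hashlib, json, time

audit_log = []   # kept outside the big data cluster, e.g. on a separate log server

def record_event(actor, action, resource):
    # Each entry embeds the hash of the previous one, forming a tamper-evident chain.
    prev_hash = audit_log[-1]["hash"] if audit_log else "0" * 64
    entry = {"ts": time.time(), "actor": actor, "action": action,
             "resource": resource, "prev": prev_hash}
    entry["hash"] = hashlib.sha256(
        json.dumps(entry, sort_keys=True).encode()).hexdigest()
    audit_log.append(entry)

def verify_log():
    prev = "0" * 64
    for entry in audit_log:
        body = {k: v for k, v in entry.items() if k != "hash"}
        expected = hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()
        if entry["prev"] != prev or entry["hash"] != expected:
            return False
        prev = entry["hash"]
    return True

record_event("analyst-7", "read", "hdfs:///sales/2014.csv")
record_event("analyst-7", "export", "hdfs:///sales/2014.csv")
print(verify_log())   # True
```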
Besides these, many organizations are developing auditing tools on top of Hadoop. For instance, Cloudera has come up with its own data visualization and auditing tool, and IBM offers IBM InfoSphere Guardium[18] on top of Hadoop.
10) Data provenance: The term data provenance refers to the origin and creation of data. Digital data is often copied and transferred to other systems before it reaches its intended destination, so it is very difficult to identify the original source of the data. Data provenance is very important, as it enables us to evaluate the quality of and trust in data, among other things. In the context of Big data, provenance is called Big Provenance[19]. It is a new field within Big data that is yet to be explored further.
One way to identify the provenance of data is to use a rule engine to find provenance-related data in the logs, but this approach has the limitation that we need to sift through a large volume of data. Work is in progress to build a modified version of Hadoop suited to provenance needs. This tool is called HadoopProv[20]; it captures data provenance at the record level and has been designed to keep the temporal overhead of capturing provenance information in MapReduce jobs to a minimum. HadoopProv has an overhead below 10% on typical job runtimes. Additionally, it has been demonstrated that provenance queries are serviceable in O(k log n), where n is the number of records per Map task and k is the number of Map tasks in which the key appears.
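The following sketch shows record-level provenance in the spirit of HadoopProv: each record carries an annotation of the source and processing steps it passed through. It is a conceptual illustration, not HadoopProv's implementation, and the source and step names are hypothetical.

```python
# Attach a provenance annotation to each record as it flows through processing
# steps, so the origin of any output value can be traced back.
def read_source(source_name, rows):
    return [{"value": row, "provenance": [source_name]} for row in rows]

def transform(records, step_name, fn):
    return [{"value": fn(r["value"]),
             "provenance": r["provenance"] + [step_name]} for r in records]

records = read_source("weblog-server-3", ["GET /index 200", "GET /admin 403"])
records = transform(records, "extract-status", lambda line: line.split()[-1])

for r in records:
    print(r["value"], "<-", " -> ".join(r["provenance"]))
# 200 <- weblog-server-3 -> extract-status
# 403 <- weblog-server-3 -> extract-status
```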

IV. CONCLUSION
In this paper I have made an effort to explain what big data is and its importance in general, and the security and privacy issues related to big data in particular. All the major areas related to privacy and security are covered, and the current work in the respective areas has been mentioned. The different components, tools, and techniques that are used to address these issues are also mentioned. Though the treatment of the subject is not in depth, an attempt has been made to provide a meaningful discussion of the topic. I hope that this will serve as a good starting point for other researchers in this area.

REFERENCES

[1] "What data says about us," Fortune, September 24, 2012.
[2] J. Dean and S. Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters." http://static.googleusercontent.com/media/research.google.com/en//archive/mapreduce-osdi04.pdf
[3] G. DeCandia, D. Hastorun, et al., "Dynamo: Amazon's Highly Available Key-value Store." http://www.allthingsdistributed.com/files/amazon-dynamo-sosp2007.pdf
[4] Apache Hadoop official site. http://hadoop.apache.org/
[5] "Big data spans four dimensions: Volume, Velocity, Variety, and Veracity." http://bigdatafoundation.com/blog/big-data-spans-four-dimensionsvolume-velocity-variety-and-veracity/
[6] "CSA releases the expanded top ten big data security & privacy challenges." https://cloudsecurityalliance.org/media/news/csa-releases-the-expandedtop-ten-big-data-security-privacy-challenges/
[7] I. Roy, S. T. V. Setty, A. Kilzer, V. Shmatikov, and E. Witchel, "Airavat: security and privacy for MapReduce," in USENIX Conference on Networked Systems Design and Implementation. http://dl.acm.org/citation.cfm?id=1855731
[8] NoSQL official site. http://nosql-database.org/
[9] B. Sullivan, "NoSQL, But Even Less Security," 2011. http://blogs.adobe.com/asset/files/2011/04/NoSQL-But-Even-LessSecurity.pdf
[10] G. Ateniese, R. Di Pietro, L. V. Mancini, and G. Tsudik. https://eprint.iacr.org/2008/114.pdf
[11] C. Wang, Q. Wang, K. Ren, and W. Lou, "Privacy-preserving public auditing for data storage security in cloud computing," in INFOCOM 2010 Proceedings, IEEE, pp. 1-9.
[12] C. Gentry, "Computing arbitrary functions of encrypted data," Communications of the ACM, Vol. 53, No. 3, 2010. https://crypto.stanford.edu/craig/easy-fhe.pdf
[13] Splunk official site. http://www.splunk.com/
[14] D. Boyd and K. Crawford, "Critical Questions for Big Data," Information, Communication & Society, 15:5, pp. 662-675, May 10, 2012. http://www.tandfonline.com/doi/pdf/10.1080/1369118X.2012.678878
[15] M. Barbaro and T. Zeller, "A Face is Exposed for AOL Searcher No. 4417749," The New York Times, August 9, 2006. http://www.nytimes.com/2006/08/09/technology/09aol.html?pagewanted=all&_r=0
[16] C. Gentry, "Fully Homomorphic Encryption Using Ideal Lattices." https://www.cs.cmu.edu/~odonnell/hits09/gentry-homomorphicencryption.pdf
[17] Apache Accumulo official site. https://accumulo.apache.org/
[18] InfoSphere Guardium Data Security. http://www-01.ibm.com/software/in/data/guardium/
[19] B. Glavic, "Big Data Provenance: Challenges and Implications for Benchmarking." http://cs.iit.edu/~dbgroup/pdfpubls/G13.pdf
[20] S. Akoush, R. Sohan, and A. Hopper, "HadoopProv: Towards Provenance As A First Class Citizen In MapReduce." https://www.cl.cam.ac.uk/~sa497/paper-tapp.pdf
[21] "Big data," Wikipedia. http://en.wikipedia.org/wiki/Big_data
[22] A. Sathi, Big Data Analytics: Disruptive Technologies for Changing the Game.
[23] S. Sagiroglu and D. Sinanc, "Big data: a review," in Proceedings of the International Conference on Collaboration Technologies and Systems (CTS '13), pp. 42-47, IEEE, San Diego, Calif, USA, May 2013.
[24] A. Cardenas, P. K. Manadhata, and S. P. Rajan, "Big Data Analytics for Security."
[25] C. Schmitt, "Security and Privacy in the Era of Big Data."
[26] McKinsey Global Institute, "Big data: the next frontier for innovation, competition and productivity," accessed January 2014. www.mckinsey.com/Insights/MGI/Research/Technolog_and_Innovation/Big_data_The_next_frontier_for_innovation
