Abstract – Big data is a term for data sets that are so large or complex that
traditional data processing application software is inadequate to deal with
them. Challenges include capture, storage, analysis, data curation, search,
sharing, transfer, visualization, querying, updating and information privacy.
Analysis of data sets can find new correlations to "spot business trends,
prevent diseases, combat crime and so on". Scientists, business executives,
practitioners of medicine, advertising and governments alike regularly meet
difficulties with large data sets in areas including Internet search,
finance, urban informatics, and business informatics. Scientists encounter
limitations in e-Science work, including meteorology, genomics,
connectomics, complex physics simulations, biology and environmental
research.
Relational database management systems and desktop statistics and
visualization packages often have difficulty handling big data. The work may
require "massively parallel software running on tens, hundreds, or even
thousands of servers". What counts as "big data" varies depending on the
capabilities of the users and their tools, and expanding capabilities make big
data a moving target. "For some organizations, facing hundreds of gigabytes
of data for the first time may trigger a need to reconsider data management
options. For others, it may take tens or hundreds of terabytes before data size
becomes a significant consideration."
1. Introduction :
Big data is an evolving term that describes any voluminous amount of structured,
semi-structured and unstructured data that has the potential to be mined for
information. The term has been in use since the 1990s, with some giving credit
to John Mashey for coining it, or at least popularizing it. Big data usually includes
data sets with sizes beyond the ability of commonly used software tools
to capture, curate, manage, and process data within a tolerable elapsed time. Big
data "size" is a constantly moving target, as of 2012 ranging from a few dozen
terabytes to many petabytes of data. Big data requires a set of techniques and
technologies with new forms of integration to reveal insights from datasets that are
diverse, complex, and of a massive scale.
In a 2001 research report and related lectures, META Group (now Gartner) defined
data growth challenges and opportunities as being three-dimensional, i.e. increasing
volume (amount of data), velocity (speed of data in and out), and variety (range of
data types and sources). Gartner, and now much of the industry, continue to use this
"3Vs" model for describing big data. In 2012, Gartner updated its definition as
follows: "Big data is high volume, high velocity, and/or high variety information assets
that require new forms of processing to enable enhanced decision making, insight
discovery and process optimization." Gartner's definition of the 3Vs is still widely
used, and in agreement with a consensual definition that states that "Big Data
represents the Information assets characterized by such a High Volume, Velocity
and Variety to require specific Technology and Analytical Methods for its
transformation into Value". Additionally, some organizations add a fourth V,
"veracity", to describe it, an extension challenged by some industry authorities.
The 3Vs have been expanded with other complementary characteristics of big data:
Volume: big data doesn't sample; it observes and tracks everything that happens.
Velocity: big data is often available in real time.
Variety: big data draws from text, images, audio and video, and completes
missing pieces through data fusion.
MapReduce
This is a programming paradigm that allows for massive job execution scalability
against thousands of servers or clusters of servers. Any MapReduce implementation
consists of two tasks:
The "Map" task, where an input dataset is converted into a different set of
key/value pairs, or tuples;
The "Reduce" task, where several of the outputs of the "Map" task are combined
to form a reduced set of tuples (hence the name).
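The two tasks above can be sketched as a minimal, single-machine word count; `map_task` and `reduce_task` are hypothetical stand-ins for the distributed phases a real framework would schedule across servers:

```python
from collections import defaultdict

def map_task(chunk):
    # "Map": convert an input chunk into key/value pairs (word, 1).
    return [(word, 1) for word in chunk.split()]

def reduce_task(pairs):
    # "Reduce": combine the values belonging to each unique key.
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)

chunks = ["big data big", "data moves fast"]
mapped = [pair for chunk in chunks for pair in map_task(chunk)]
counts = reduce_task(mapped)
print(counts)  # {'big': 2, 'data': 2, 'moves': 1, 'fast': 1}
```

In a real MapReduce system the chunks would live on different machines and a shuffle phase would route all pairs with the same key to the same reducer; the logic per task is the same.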
Hadoop
Hadoop is by far the most popular implementation of MapReduce, being an entirely
open source platform for handling Big Data. It is flexible enough to be able to work
with multiple data sources, either aggregating multiple sources of data in order to do
large scale processing, or even reading data from a database in order to run
processor-intensive machine learning jobs. It has several different applications, but
one of the top use cases is for large volumes of constantly changing data, such as
location-based data from weather or traffic sensors, web-based or social media data,
or machine-to-machine transactional data.
Hive
Hive is a "SQL-like" bridge that allows conventional BI applications to run queries
against a Hadoop cluster. It was developed originally by Facebook, but has been
made open source for some time now, and it's a higher-level abstraction of the
Hadoop framework that allows anyone to make queries against data stored in a
Hadoop cluster just as if they were manipulating a conventional data store. It
amplifies the reach of Hadoop, making it more familiar for BI users.
PIG
PIG is another bridge that tries to bring Hadoop closer to the realities of developers
and business users, similar to Hive. Unlike Hive, however, PIG consists of a "Perl-
like" language that allows for query execution over data stored on a Hadoop cluster,
instead of a "SQL-like" language. PIG was developed by Yahoo!, and, just like Hive,
has also been made fully open source.
WibiData
WibiData is a combination of web analytics with Hadoop, being built on top of
HBase, which is itself a database layer on top of Hadoop. It allows web sites to
better explore and work with their user data, enabling real-time responses to user
behavior, such as serving personalized content, recommendations and decisions.
PLATFORA
Perhaps the greatest limitation of Hadoop is that it is a very low-level implementation
of MapReduce, requiring extensive developer knowledge to operate. Between
preparing, testing and running jobs, a full cycle can take hours, eliminating the
interactivity that users enjoyed with conventional databases. PLATFORA is a
platform that turns users' queries into Hadoop jobs automatically, thus creating an
abstraction layer that anyone can exploit to simplify and organize datasets stored in
Hadoop.
Storage Technologies
As the data volumes grow, so does the need for efficient and effective storage
techniques. The main evolutions in this space are related to data compression and
storage virtualization.
SkyTree
SkyTree is a high-performance machine learning and data analytics platform focused
specifically on handling Big Data. Machine learning, in turn, is an essential part of
Big Data, since the massive data volumes make manual exploration, and even
conventional automated exploration methods, unfeasible or too expensive.
Even companies that are fully committed to big data, that have defined the business
case and are ready to mature beyond the “science project” phase, face a daunting
question: how do we make big data work?
The massive hype, and the perplexing range of big data technology options and
vendors, makes finding the right answer harder than it needs to be. The goal must
be to design and build an underlying big data environment that is low cost and low
complexity; that is stable and highly integrated; and that is scalable enough to
move the entire organization toward true data-and-analytics centricity.
Data-and-analytics centricity is a state of being where the power of big data and big
data analytics is available to all the parts of the organization that need it, with
the underlying infrastructure, data streams and user toolsets required to discover
valuable insights, make better decisions and solve actual business problems. That's
how big data should work.
Seamlessly Use Data Sets: Much of the payoff comes through the mixing, combining
and contrasting of data sets, so there is no analytics-enabled innovation without
integration.
Flexible, Low Cost: The target here is low complexity and low cost, with sufficient
flexibility to scale for future needs, which will be both larger in scale and more
targeted at specific user groups.
Stable: Stability is critical because the data volumes are massive and users need to
easily access and interact with data. In this sense, infrastructure performance holds
a key to boosting business performance through big data.
4. PRIVACY AND SECURITY ISSUES AND CHALLENGES WITH BIG
DATA :
Secure Computations in Distributed Programming Frameworks
Distributed programming frameworks use parallel computing and data storage
for massive amounts of data. An example of this is the MapReduce framework.
As mentioned earlier, the MapReduce framework divides an input file into many
chunks; a mapper for each chunk then reads the data, performs computations and
provides output in the form of key/value pairs. A reducer then combines the
values belonging to each unique key and outputs the results. The main concerns
here are securing the mappers and securing the data from a malicious mapper.
Mappers returning incorrect results are difficult to detect and eventually
produce incorrect aggregate outputs. With very large data sets, malicious
mappers are also too hard to detect, and they eventually damage essential data.
Mappers leaking private records, intentionally or unintentionally, are another
concern. MapReduce computations are often subjected to replay,
man-in-the-middle and denial-of-service attacks. Rogue data nodes can be added
to a cluster and in turn receive replicated data or deliver altered MapReduce
code. Creating snapshots of legitimate nodes and reintroducing altered copies
is an easy attack in cloud and virtual environments and is difficult to detect.
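One common mitigation for untrusted mappers is redundant execution: run the same chunk through an independent reference mapper and compare the outputs, flagging any mismatch as suspect. A minimal sketch, where `trusted_map` and `malicious_map` are hypothetical illustrations rather than any framework's API:

```python
def trusted_map(chunk):
    # Reference mapper: emit (word, 1) pairs, sorted for stable comparison.
    return sorted((w, 1) for w in chunk.split())

def run_with_redundancy(chunk, mapper, reference=trusted_map):
    # Execute the untrusted mapper and an independent reference copy;
    # a mismatch marks the mapper's output as suspect (returned as None).
    result = sorted(mapper(chunk))
    return result if result == reference(chunk) else None

def malicious_map(chunk):
    # A tampered mapper that drops and fabricates records.
    return [("data", 1)]

print(run_with_redundancy("big data", trusted_map))    # [('big', 1), ('data', 1)]
print(run_with_redundancy("big data", malicious_map))  # None
```

Redundancy trades extra computation for integrity; real systems typically replicate only a random sample of chunks to keep the overhead bounded.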
Security Best Practices for Non-Relational Data Stores
Non-relational databases used to store big data, mainly NoSQL databases,
handle many challenges of big data analytics while paying little attention to
security issues. NoSQL databases rely on security embedded in the middleware;
no explicit security enforcement is provided. Transactional integrity
maintenance is very lax in NoSQL databases, and complex integrity constraints
cannot be enforced, as doing so would hamper the performance and scalability
these databases are built to provide. NoSQL databases have weak authentication
techniques and weak password storage mechanisms. They use HTTP Basic- or
Digest-based authentication and are subject to man-in-the-middle attacks.
REST (Representational State Transfer) based on HTTP is prone to cross-site
scripting, cross-site request forgery and injection attacks such as JSON
injection, array injection, view injection, REST injection, GQL (Generalized
Query Language) injection, schema injection and others. NoSQL databases also
offer little support for blocking such attacks with the help of third-party
tools. Authorization techniques in NoSQL provide authorization at higher
layers only: access is granted at the per-database level rather than at the
level where the data are collected. Because of these lenient security
mechanisms, NoSQL databases are subject to insider attacks as well, which may
go unnoticed due to poor logging and log analysis methods, along with other
weak fundamental security mechanisms.
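As a contrast to the weak password storage described above, the standard mitigation is to store only a random salt and a slow key-derivation digest, never the raw password. A minimal sketch using Python's standard library (the function names are illustrative, not any NoSQL database's API):

```python
import hashlib
import hmac
import secrets

def hash_password(password: str):
    # Store a per-user random salt plus a deliberately slow PBKDF2 digest.
    salt = secrets.token_bytes(16)
    digest = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 100_000)
    return salt, digest

def verify_password(password: str, salt: bytes, digest: bytes) -> bool:
    # Re-derive the digest and compare in constant time to avoid timing leaks.
    candidate = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 100_000)
    return hmac.compare_digest(candidate, digest)

salt, digest = hash_password("s3cret")
print(verify_password("s3cret", salt, digest))  # True
print(verify_password("guess", salt, digest))   # False
```

The salt defeats precomputed rainbow tables, and the iteration count makes brute-forcing a stolen digest expensive.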
Secure Data Storage And Transaction Logs
Data and transaction logs used to be kept in multi-tiered storage media. As
data sizes grew, scalability and accessibility became an issue, and auto-tiering
for big data storage came to the fore. Unlike previous multi-tiered storage
media, where IT managers knew which data resided where and when, auto-tiering
does not keep track of where the data are stored. This gives rise to many new
challenges for secure data storage. Untrustworthy storage service providers
often search for clues that help them correlate user activities and data sets
and learn certain properties of the data, which can well prove valuable to
them; they are, however, unable to break through the encryption and read the
data itself. When the data owner stores ciphertext in an auto-tiered storage
system and distributes keys to individual users, he grants each user the right
to access certain portions of the data, while the storage provider itself
remains unauthorized to access it. The provider may, however, conspire with
users by exchanging keys and data, and thereby obtain data it is not authorized
to access. In a multi-user environment, the service provider can also mount a
rollback attack on users, serving outdated versions of data even though updated
versions have already been uploaded to the database. Data tampering and data
loss caused by malicious users often result in disputes between users and the
storage provider, or among the users themselves.
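One way a client can detect such a rollback is to have the data owner bind each version number to its content with a keyed tag, so that a client rejects forged records and any version older than the newest one it has already seen. A minimal sketch, with a hypothetical shared key and record format:

```python
import hashlib
import hmac

KEY = b"owner-shared-key"  # hypothetical key shared by the owner and clients

def publish(content: bytes, version: int):
    # Owner tags each upload with an HMAC binding content to its version number.
    tag = hmac.new(KEY, content + version.to_bytes(8, "big"), hashlib.sha256).digest()
    return content, version, tag

def accept(record, last_seen_version: int) -> bool:
    # Client rejects tampered tags and any version older than one already seen,
    # which is exactly what a rollback attack would serve.
    content, version, tag = record
    expected = hmac.new(KEY, content + version.to_bytes(8, "big"), hashlib.sha256).digest()
    return hmac.compare_digest(tag, expected) and version > last_seen_version

v1 = publish(b"balance=100", 1)
v2 = publish(b"balance=80", 2)
print(accept(v2, last_seen_version=1))  # True: newer, authentic version
print(accept(v1, last_seen_version=2))  # False: provider is replaying stale data
```

The tag stops the provider from forging versions, and the monotonic version check stops it from silently serving old ones; clients only need to remember the highest version seen.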
End Point Input Validation/ Filtering
Organizations collect data from a variety of sources, including hardware
devices, software applications and endpoint devices. When collecting these
data, validating both the data and the source is a challenge. Malicious users
often tamper with the device from which the data are collected, or with the
data-collecting application installed on the device, so that malicious data
are fed into the central data collection system. Malicious users may also
create fake IDs and use them to provide malicious input. ID-cloning attacks
such as Sybil attacks are predominant in a Bring Your Own Device (BYOD)
scenario, where a malicious user brings his own device, faked as a trusted
device, and provides malicious input from there into the central data
collection system. Input sources of sensory data can be manipulated as well,
for example by artificially changing the temperature at a temperature sensor
and thereby injecting malicious readings into the temperature collection
process. GPS signals can be manipulated in much the same way. A malicious user
may also alter data while it is in transmission from a genuine source to the
central data collection system, in effect a man-in-the-middle attack.
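A first line of defense is to filter readings at ingestion against a device allow-list and a physical plausibility range; this does not defeat a well-executed Sybil attack with cloned credentials, but it screens out unregistered devices and crudely tampered sensors. A minimal sketch with hypothetical device IDs and limits:

```python
TRUSTED_DEVICES = {"sensor-01", "sensor-02"}  # hypothetical device allow-list

def validate_reading(device_id: str, temperature_c: float) -> bool:
    # Reject readings from unknown (possibly fake or cloned) device IDs.
    if device_id not in TRUSTED_DEVICES:
        return False
    # Reject physically implausible values from tampered sensors.
    return -50.0 <= temperature_c <= 60.0

print(validate_reading("sensor-01", 21.5))   # True
print(validate_reading("sensor-99", 21.5))   # False: unregistered device
print(validate_reading("sensor-02", 400.0))  # False: implausible value
```

Real deployments would add cryptographic device attestation and transport encryption, since an allow-list alone cannot distinguish a genuine device from a clone presenting the same ID.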
Real-Time Security Monitoring
Real-time security monitoring has been an ongoing challenge in the big data
analysis scenario, mainly due to the number of alerts generated by security
devices. These alerts, whether correlated or not, lead to many false
positives; because humans cannot successfully deal with such a huge volume of
alerts at such speed, the alerts are often clicked away or ignored [9].
Security monitoring requires that the Big Data infrastructure or platform be
inherently secure. Threats to a Big Data infrastructure include rogue admin
access to applications or nodes, (web) application threats, and eavesdropping
on the line. Since the infrastructure is mostly an ecosystem of different
components, the security of each component, and the secure integration of the
components, must be considered. For a Hadoop cluster run in a public cloud,
the security of the public cloud itself, being an ecosystem of computing,
storage and network components, needs to be considered, as do the security of
the Hadoop cluster, the security of the nodes, the interconnections among the
nodes and the security of the data stored in a node. The security of the
monitoring application, including applicable correlation rules, which should
follow secure coding principles, must be considered as well, together with the
security of the input source from which the data come.
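One simple way to damp the flood of false positives is to correlate raw alerts by source and type and escalate only groups that recur. A minimal sketch, with a hypothetical alert format and threshold; production systems use far richer correlation rules:

```python
from collections import Counter

def correlate(alerts, threshold=3):
    # Group raw alerts by (source, type) and surface only groups that recur
    # at least `threshold` times, damping one-off false positives.
    counts = Counter((a["source"], a["type"]) for a in alerts)
    return [key for key, n in counts.items() if n >= threshold]

alerts = (
    [{"source": "node-7", "type": "auth_failure"}] * 4
    + [{"source": "node-2", "type": "port_scan"}]
)
print(correlate(alerts))  # [('node-7', 'auth_failure')]
```

The single port-scan alert is suppressed while the repeated authentication failures on one node are escalated; tuning the threshold trades missed detections against analyst fatigue.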
6. CONCLUSION :
To handle big data, work with it and obtain benefits from it, a branch of
science called Data Science has emerged and is evolving. Data Science is the
branch of science that deals with discovering knowledge from huge sets of data,
mostly unstructured and semi-structured, by virtue of data inference and
exploration. It is a revolution that is changing the world and finds
application across various industries such as finance, retail, healthcare,
manufacturing, sports and communication. Search engine and digital marketing
companies like Google, Yahoo and Bing, social networking companies like
Facebook and Twitter, and finance and e-commerce companies like Amazon and
eBay require, and will continue to require, large numbers of data scientists.
As far as security is concerned, the existing technologies promise to evolve
as newer vulnerabilities to big data arise and the need for securing it
increases.
7. REFERENCES :
1. www.techrepublic.com/blog/big-data
2. www.coursera.org, Introduction to Big Data.
3. https://en.wikipedia.org/wiki/Big_data
4. A.R. Guess, "The Most Common Big Data Management Issues (And Their
Solutions)", http://www.dataversity.net/common-big-data-management-issues-solutions/,
July 15, 2014.