
Abstract

This report describes what Big Data is, its advantages and disadvantages, some things that can be accomplished with it, how it is utilized, and a conclusion. The utilization section explains where the data comes from, what organizations can do with it, and how it benefits them. The conclusion discusses what the future may look like with Big Data, what people will be doing when everything generates data, and finally what I want to do with Big Data.

Big Data
Big Data refers to technologies and initiatives that tackle diverse, massive data that traditional technologies, skills, and infrastructure cannot handle efficiently. The volume, velocity, and variety of the data are extremely high. Big Data is not a
single technology or initiative; it spans several domains of business and technology. Recently developed
technologies make it possible to extract value from Big Data. For instance, governments and even Google can
track the emergence of disease outbreaks through social media signals. Big Data refers to large and complex data sets
that are impractical to manage with traditional software tools. The size of Big Data might be measured in
petabytes (1024 terabytes) or exabytes (1024 petabytes), consisting of trillions of records about millions of people
collected from various sources such as the web, social media, mobile devices, and customer contact centers. The data is typically
loosely structured, i.e. often incomplete and not readily accessible.
Operational technology and analytical technology are the two technologies that dominate the Big Data domain. The
former class offers operational capabilities for real-time data manipulation, where the data is primarily
captured and stored. The latter class offers analytical capabilities for complex analysis over all of the
data. These technologies are complementary and are frequently deployed together, even though they
have opposing requirements and very different demands. Operational systems such as NoSQL
databases serve many concurrent requests while ensuring low response latency. Analytical systems
focus on high throughput, even when queries are complex and must touch most of the data in the system. Hadoop,
with its MapReduce programming model, is an example of an analytical system.
Big Data has attracted a lot of attention driven by market trends, equipment-based performance data, and other industry factors.
Big Data analytical tools and technologies greatly assist IT decision making. Even large organizations find it
difficult to manipulate and manage these larger datasets. Big Data deals with
two classes of data sets, namely structured and unstructured. Records obtained from inventories, orders, and
customer information contribute to the structured datasets. Unstructured data comes from the web,
social media, and intelligent devices.
Data Mining Tools and Techniques for Big Data

Most objects and data in the real world are of multiple types and interconnected, forming complex, heterogeneous but often
semi-structured information networks. We view interconnected, multityped data, including typical relational
database data, as heterogeneous information networks, study how to leverage the rich semantic meaning of the structural
types of objects and links in the networks, and develop a structural analysis approach for mining semi-structured, multi-typed
heterogeneous information networks. Here, we summarize a set of methodologies that can effectively and
efficiently mine useful knowledge from such information networks, and point out some promising research directions.

Here is the list of important areas where data mining is widely used:

Future health care
Market Basket Analysis
Manufacturing Engineering
Fraud detection
CRM
Criminal Investigation
Corporate surveillance

Software and Tools for Massive Big Data Processing

Here are the tools used to store and analyse Big Data. We can categorise them into
two groups: storage and querying/analysis.

1. Apache Hadoop - Apache Hadoop is a Java-based, free software framework that
can effectively store large amounts of data in a cluster. The framework runs in
parallel on a cluster and allows data to be processed across all nodes.
The Hadoop Distributed File System (HDFS) is the storage system of Hadoop; it
splits big data and distributes it across many nodes in a cluster. It also replicates
data within the cluster, thus providing high availability.
2. Microsoft HDInsight - This is Microsoft's Big Data solution, powered by
Apache Hadoop and available as a service in the cloud. HDInsight uses
Windows Azure Blob storage as the default file system and provides high
availability at low cost.

3. NoSQL - While traditional SQL databases can effectively handle large
amounts of structured data, we need NoSQL (Not Only SQL) to handle unstructured
data. NoSQL databases store unstructured data with no particular schema; each
row can have its own set of column values. NoSQL gives better performance when
storing massive amounts of data. There are many open-source NoSQL databases
available for analysing Big Data.

4. Hive - This is a distributed data warehouse system for Hadoop. It supports a SQL-like
query language, HiveQL (HQL), to access big data and is primarily used
for data mining. It runs on top of Hadoop; a short Python sketch that queries Hive appears after this list.

5. Sqoop - This is a tool that connects Hadoop with various relational databases to
transfer data. It can be used effectively to transfer structured data into Hadoop or
Hive.

6. PolyBase - This works on top of SQL Server 2012 Parallel Data Warehouse
(PDW) and is used to access data stored in PDW. PDW is a data warehousing
appliance built for processing any volume of relational data; it provides
integration with Hadoop, allowing us to access non-relational data as well.
7. Big Data in Excel - As many people are comfortable doing analysis in
Excel, a popular tool from Microsoft, you can also connect to data stored in
Hadoop using Excel 2013. Hortonworks, which primarily focuses on
providing enterprise Apache Hadoop, offers an option to access big data stored
in its Hadoop platform using Excel 2013.

8. Presto - Facebook has developed and recently open-sourced its SQL-on-Hadoop
query engine, Presto, which is built to handle petabytes of data. Unlike
Hive, Presto does not depend on the MapReduce technique and can retrieve data
quickly.
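As a concrete illustration of item 4, here is a minimal, hedged sketch of querying Hive from Python with the PyHive client. The host, database, table, and column names are hypothetical and only meant to show the SQL-like flavour of HiveQL; an actual cluster would need its own connection details.

```python
# Hedged sketch: querying Hive from Python with PyHive (pip install pyhive[hive]).
# Host, database, table, and column names are made up for illustration.
from pyhive import hive

conn = hive.Connection(host="hadoop-edge-node", port=10000, database="sales")
cursor = conn.cursor()

# HiveQL looks like SQL but is compiled into jobs that run on the Hadoop cluster.
cursor.execute(
    "SELECT region, COUNT(*) AS orders "
    "FROM web_orders "
    "WHERE order_date >= '2023-01-01' "
    "GROUP BY region"
)
for region, orders in cursor.fetchall():
    print(region, orders)

cursor.close()
conn.close()
```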

OPEN SOURCE TOOLS

Cassandra: Apache Cassandra is a distributed NoSQL database for managing copious
amounts of structured data across many commodity servers. It manages nodes efficiently,
leaving no single point of failure. With capabilities like continuous availability, linear-scale
performance, operational simplicity, and easy data distribution, Apache Cassandra provides a
solution that relational databases cannot match. Apache Cassandra offers a masterless
“ring” design that is intuitive and user-friendly.
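A minimal sketch of talking to Cassandra from Python with the DataStax driver follows. The keyspace, table, and data are assumptions for illustration, and the contact point assumes a node running locally.

```python
# Hedged sketch using the DataStax Python driver (pip install cassandra-driver).
# Keyspace, table, and contact point are assumptions for illustration only.
from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])          # one or more contact points in the ring
session = cluster.connect()

session.execute("""
    CREATE KEYSPACE IF NOT EXISTS demo
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}
""")
session.set_keyspace("demo")
session.execute("""
    CREATE TABLE IF NOT EXISTS sensor_readings (
        sensor_id text, ts timestamp, value double,
        PRIMARY KEY (sensor_id, ts)
    )
""")

# Writes go to whichever replicas own the partition key; there is no master node.
session.execute(
    "INSERT INTO sensor_readings (sensor_id, ts, value) VALUES (%s, toTimestamp(now()), %s)",
    ("sensor-42", 21.5),
)
rows = session.execute(
    "SELECT ts, value FROM sensor_readings WHERE sensor_id = %s", ("sensor-42",)
)
for row in rows:
    print(row.ts, row.value)

cluster.shutdown()
```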

Apache SAMOA: SAMOA stands for Scalable Advanced Massive Online Analysis. It
is an open source platform built for mining big data streams, with a special emphasis on
machine learning. SAMOA follows a Write-Once-Run-Anywhere (WORA)
architecture, which allows seamless integration of multiple Distributed Stream Processing
Engines (DSPEs) into the framework, and it supports the development of new ML
algorithms.

Elasticsearch: Elasticsearch is a dependable and secure open source platform where you can
take data from any source, in any format, and search, analyze, and visualize it in real time.
Elasticsearch is designed for horizontal scalability, reliability, and ease of management. It is based on
Lucene, an information retrieval software library originally written in Java.
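As a hedged illustration, the sketch below indexes and searches one document with the official Elasticsearch Python client. The index name and fields are made up; the call style matches the 8.x client (older 7.x clients use body= instead of document=/query=), and a local node is assumed.

```python
# Hedged sketch using the official Elasticsearch Python client (pip install elasticsearch).
# Index name and document fields are assumptions; a local 8.x node is assumed.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Index a document; Elasticsearch infers a mapping if none is defined.
es.index(index="app-logs", document={
    "service": "checkout",
    "level": "ERROR",
    "message": "payment gateway timeout",
})

es.indices.refresh(index="app-logs")   # make the document searchable immediately

# Full-text search over the message field.
resp = es.search(index="app-logs", query={"match": {"message": "timeout"}})
for hit in resp["hits"]["hits"]:
    print(hit["_source"]["service"], hit["_source"]["message"])
```
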
Big data storage
Big data storage is a compute-and-storage architecture that collects and manages large
data sets and enables real-time data analytics. Companies apply big data analytics to
get greater intelligence from metadata. In most cases, big data storage uses low-cost
hard disk drives as the foundation of the storage systems; these systems
can be all-flash or hybrids mixing disk and flash storage.

Most of the data in a big data environment is unstructured, which means it is held mostly in file-based and object
storage.

Although a specific volume size or capacity is not formally defined, big data storage
usually refers to volumes that grow exponentially to terabyte or petabyte scale.

The components of big data storage infrastructure

A big data storage system clusters a large number of commodity servers attached to
high-capacity disk to support analytic software written to crunch vast quantities of
data. The system relies on massively parallel processing databases to analyze data
ingested from a variety of sources.

Big data often lacks structure and comes from various sources, making it a poor fit for
processing with a relational database. The Apache Hadoop Distributed File System
(HDFS) is the most prevalent storage layer for big data analytics, and it is typically combined
with some flavor of NoSQL database.

Hadoop is open source software written in the Java programming language. HDFS
spreads the data across hundreds or even thousands of server nodes without
a performance hit, and through its MapReduce component Hadoop distributes the processing
in the same way, which also serves as a safeguard against catastrophic failure. The multiple nodes serve as a
platform for data analysis at a network's edge. When a query arrives, MapReduce
executes processing directly on the storage node on which the data resides. Once analysis
is completed, MapReduce gathers the results from each server and
“reduces” them to present a single cohesive response.
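To make the map → shuffle → reduce flow described above concrete, here is a minimal, self-contained Python sketch that simulates a word count locally. On a real cluster the map and reduce functions would run in parallel on the nodes that hold the data (for example via Hadoop Streaming), not in a single process like this.

```python
# Local simulation of the MapReduce word-count pattern described above.
# On Hadoop, map() and reduce() would run in parallel on the storage nodes;
# here everything runs in one process purely to illustrate the data flow.
from collections import defaultdict

def map_phase(line):
    """Emit (word, 1) pairs for each word in an input line."""
    for word in line.lower().split():
        yield word, 1

def reduce_phase(word, counts):
    """Combine all counts for one word into a single result."""
    return word, sum(counts)

documents = [
    "big data needs big storage",
    "mapreduce moves computation to the data",
]

# Shuffle: group intermediate (key, value) pairs by key.
grouped = defaultdict(list)
for line in documents:
    for word, count in map_phase(line):
        grouped[word].append(count)

# Reduce: one call per key, results gathered into a single cohesive answer.
results = dict(reduce_phase(w, c) for w, c in grouped.items())
print(results)   # e.g. {'big': 2, 'data': 2, ...}
```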

How big data storage compares to traditional enterprise storage

Big data can bring an organization a competitive advantage through large-scale statistical
analysis of the data or its metadata. In a big data environment, the analytics mostly
operate on a circumscribed set of data, using a series of data-mining-based predictive
modeling forecasts to gauge customer behaviors or the likelihood of future events.

Statistical big data analysis and modeling is gaining adoption across a cross-section of
industries, including aerospace, environmental science, energy exploration, financial
markets, genomics, healthcare and retailing. A big data platform is built for much
greater scale, speed and performance than traditional enterprise storage. Also, in most
cases, big data storage targets a much more limited set of workloads.

Security and Privacy Issues of Big Data


This chapter reviews the most important aspects of how computing infrastructures should be configured
and intelligently managed to fulfill the most notable security requirements of Big Data applications.
One of them is privacy. It is a pertinent aspect to address because users share more and more
personal data and content through their devices and computers with social networks and public clouds. In
addition, the traditional mechanisms to support security, such as firewalls and demilitarized zones, are not
suitable for computing systems that support Big Data. Software-defined networking (SDN) is an emerging management
solution that could become a convenient mechanism for implementing security in Big Data systems.

According to the Cloud Security Alliance (CSA), the main security challenges fall into the following areas:

Infrastructure Security

1. Secure Distributed Processing of Data

2. Security Best Practices for Non-Relational Databases

Data Privacy

3. Data Analysis through Data Mining that Preserves Data Privacy

4. Cryptographic Solutions for Data Security

5. Granular Access Control

Data Management and Integrity

6. Secure Data Storage and Transaction Logs

7. Granular Audits

8. Data Provenance

Reactive Security

The new Big Data security solutions should extend the secure perimeter from the enterprise to the public
cloud. In this way, a trustworthy data provenance mechanism should also be created across domains. In
addition, similar mechanisms can be used to mitigate distributed denial-of-service (DDoS) attacks
launched against Big Data infrastructures. Big Data security and privacy mechanisms are also necessary to ensure
data trustworthiness throughout the entire data lifecycle, from data collection to usage. A recent work
describes proposed privacy extensions to UML to help software engineers quickly visualize privacy
requirements and design them into Big Data applications (Jutla, Bodorik, & Ali, 2013).

Homomorphic encryption is a form of encryption which allows specific types of computations (e.g. with the RSA
public-key encryption algorithm) to be carried out on ciphertext and generate an encrypted result which,
when decrypted, matches the result of the same operations performed on the plaintext (Gentry, 2010). This allows
encrypted queries on databases, which keeps private user information secret where that data is normally
stored (somewhere in the cloud; in the limit, a user can store data on any untrusted server, but in
encrypted form, without worrying about its secrecy) (Ra Popa & Redfield, 2011). More broadly,
fully homomorphic encryption improves the efficiency of secure multiparty computation.
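As a toy illustration of the RSA example mentioned above, the pure-Python sketch below uses deliberately tiny, insecure parameters to show that multiplying two RSA ciphertexts and then decrypting yields the product of the plaintexts. Real homomorphic schemes use far larger keys and, in the fully homomorphic case, support much richer operations than multiplication.

```python
# Toy demonstration of RSA's multiplicative homomorphism with tiny, insecure numbers.
# This only illustrates the idea of computing on ciphertexts; never use for real security.

# Textbook RSA key with small primes p=61, q=53.
n = 61 * 53            # modulus = 3233
e = 17                 # public exponent
d = 2753               # private exponent (inverse of e modulo phi(n) = 3120)

def encrypt(m):
    return pow(m, e, n)

def decrypt(c):
    return pow(c, d, n)

m1, m2 = 7, 12
c1, c2 = encrypt(m1), encrypt(m2)

# Multiply the ciphertexts without ever decrypting them...
c_product = (c1 * c2) % n

# ...and the decryption equals the product of the original plaintexts (mod n).
assert decrypt(c_product) == (m1 * m2) % n
print(decrypt(c_product))   # 84
```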

WHY BIG DATA SECURITY ISSUES ARE SURFACING

The rapid growth and spread of network services, mobile devices, and online users on the
Internet has led to a remarkable increase in the amount of data. Almost every
industry is trying to cope with this huge volume of data, and the Big Data phenomenon has begun to gain importance.
However, it is not only very difficult to store big data and analyse it with traditional applications;
it also raises challenging privacy and security problems.

What is Big Data Visualization?


Big Data visualization involves the presentation of data of almost any type in a graphical format that makes it easy
to understand and interpret. But it goes far beyond typical corporate graphs, histograms and pie charts to more
complex representations like heat maps and fever charts, enabling decision makers to explore data sets to identify
correlations or unexpected patterns.
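Heat maps like the ones mentioned above can be produced with almost any plotting stack. As a hedged illustration (the article does not prescribe a library, and the tools discussed later are different products), the short Python sketch below renders a heat map from a synthetic matrix with matplotlib.

```python
# Minimal heat-map sketch using matplotlib (an assumption: the article does not
# prescribe a specific library). The values here are synthetic, for illustration only.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(seed=42)
activity = rng.random((7, 24))          # e.g. activity per weekday (rows) and hour (columns)

fig, ax = plt.subplots(figsize=(8, 3))
im = ax.imshow(activity, aspect="auto", cmap="hot")
ax.set_xlabel("Hour of day")
ax.set_ylabel("Day of week")
ax.set_title("Synthetic activity heat map")
fig.colorbar(im, ax=ax, label="Relative activity")
plt.show()
```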

A defining feature of Big Data visualization is scale. Today's enterprises collect and store vast amounts of data that
would take years for a human to read, let alone understand. But researchers have determined that the human retina
can transmit data to the brain at a rate of about 10 megabits per second. Big Data visualization relies on powerful
computer systems to ingest raw corporate data and process it to generate graphical representations that allow
humans to take in and understand vast amounts of data in seconds.
Importance of Big Data Visualization
The amount of data created by corporations around the world is growing every year, and thanks to innovations such
as the Internet of Things this growth shows no sign of abating. The problem for businesses is that this data is only
useful if valuable insights can be extracted from it and acted upon.

To do that, decision makers need to be able to access, evaluate, comprehend and act on data in near real time, and
Big Data visualization promises a way to do just that. Big Data visualization is not the only way for
decision makers to analyze data, but Big Data visualization techniques offer a fast and effective way to:

Review large amounts of data

Spot trends

Identify correlations and unexpected relationships

Present the data to others

TOOLS FOR BIG DATA VISUALIZATION


Google Charts : Google is an obvious benchmark, well known for the user-friendliness
of its products, and Google Charts is no exception. It is one of the easiest tools for
visualizing huge data sets. Moreover, the most important part of designing a chart is
customization, and with Google Charts it is fairly simple; you can always ask for
technical help if you want to dig deeper. It renders charts in HTML5/SVG format, is cross-browser
compatible, and chart data can easily be exported to PNG format. Google Charts is also
quite efficient at handling real-time data.

Tableau : Tableau Desktop is an amazing data visualisation tool for manipulating big
data, and it is available to everyone. It has two other variants, “Tableau Server” and the cloud-based
SaaS offering “Tableau Online”, which are designed specifically for big-data-oriented organizations. You don't
have to be a coder to use this tool; it is very handy and provides lightning-fast speed.

D3 : D3, or Data-Driven Documents, is a JavaScript library for visualising big data in virtually any
way you want. Unlike the others, this is not a ready-made tool; the user needs a good grasp of JavaScript to
give the collected data a shape. The manipulated data are rendered through HTML, SVG and CSS,
so there is no place for old browsers (IE 7 or 8), as they don't support SVG (Scalable Vector
Graphics).

FusionCharts
Canvas

Big Data Models and Algorithms

The more data an organization has, the more difficult it is to process,
store, and analyze; but conversely, the more data the organization has, the
more accurate its predictions can be. Big data also comes with big
responsibility: it requires military-grade encryption keys to keep
information safe and confidential.

This is where data science comes in. Many organizations, faced with the
problem of being able to measure, filter, and analyze data, are turning to
data science for solutions – hiring data scientists, people who are
specialists in making sense of huge amounts of data. Generally, this
means making use of statistical models to create algorithms to sort,
classify, and process data.

What is Data Science?


Data science has been a term in the computing field since around 1960, when it
was first floated as a substitute for the term “computer science”. Over the following
decades it gradually came to mean the blend of statistics and
methodology that is now a fundamental requirement for any
organization working out how to analyze massive amounts of data.

Data science is interdisciplinary, incorporating elements of statistics, data
mining, and predictive analysis, and focusing on processes and systems that
extract knowledge and insights from data. It is also known as “analytics
transformation” because the goal is to “transform” raw data into usable insights.
It has also been called “industrial analytics” because the context is industrial
rather than scientific – to analyze data for competitive or quality improvements
that can be gained by having a better understanding of one’s customers,
potential customers, service model, and almost any aspect of the organization
that can be represented in bytes.

Because the cost of computing and analyzing organizational data is declining
(as the cost of technology tends to do), we can measure and analyze huge data
sets with a level of precision not previously available. Methodology and
algorithmic analysis provide the tools for this precision.

Methods of Data Science


Data scientists need to be able to combine flexibility and agility with rigorous
analysis and the scientific method. Walking this line is very difficult; doing so
requires experimentation and exploration, as well as a commitment to long-standing
principles of scientific objectivity:

- Report the facts as they are, not as you were hoping they would be;
- Conclusions cannot always be legitimately drawn from a given data
set;
- A lack of evidence for a theory does not prove that the opposite is
true;
- Ensure your initial data is reliable;
- Base your conclusions on the full set of data – don’t choose data to
support a conclusion.

Some algorithms were developed to address business problems. Others were
developed to augment algorithms in use for other purposes, or to have them
perform somewhat differently, tuning them to a business environment. These
algorithms can be used, for instance, to remind customers of an event, or to
target likely credit card applicants. Although one algorithm might be clearly
better for a certain purpose than another, it is sometimes very useful to try
more than one. Doing this can provide comparisons and often turns up some
unexpected results that can tell you more than you expected about your
product or your customers.

Ten of the most commonly used algorithms are listed below; a short Python
sketch of two of them follows the list:
1. K-Means Clustering Algorithm : A simple, unsupervised
learning algorithm that is often used with big data sets, frequently as a way
of pre-clustering or classifying data into larger categories that other
algorithms can further refine. It has some inherent limitations
that make it best suited to large-scale, high-level clustering.

2. Association Rule Mining Algorithm : Sometimes referred to
as “Market Basket Analysis”, since that was the original application
of this algorithm, the association rule algorithm is a learning
algorithm that looks for associations that co-occur with a high degree
of frequency. It can identify associations that you might not expect in
a random sampling – a famous example was when Walmart found
that, when a hurricane was coming, sales of strawberry Pop-Tarts spiked
along with bottled water and flashlight batteries.

3. Linear Regression Algorithms : One of the most widely used
methods of statistical analysis, linear regression is applicable to many
problems, particularly when the expected output is a score rather
than a category. It is good for predicting trends and forecasting the
effects of a new policy or other change.

4. Logistic Regression Algorithms : Logistic regression is used
to find the likelihood of success or failure of a given event. It is a
classification algorithm.

5. C4.5 : A supervised learning algorithm used to create decision
trees from the already-classified input. Decision trees can be used as
diagnostic tools in medicine, as well as in the business sector.
6. Support Vector Machine (SVM) : This algorithm learns to
define a hyperplane that separates data into two classes. A hyperplane is
the boundary that divides a group based on a property or attribute
rather than location. This algorithm can help figure out an underlying
separation mechanism between people who will buy a product and
those who won’t.

7. Apriori : The Apriori algorithm mines frequent itemsets and the
association rules among them. It is commonly used in transactional databases with a
large number of transactions, but it does run with a high degree of
computational overhead.

8. EM (Expectation-Maximization) : A clustering algorithm
used for knowledge discovery. It uses clustering to predict data
models that can be used in other statistical analysis methods.

9. AdaBoost : An algorithm which constructs a classifier and then
boosts it – it repeatedly trains weak learners, gives extra weight to the
examples the current ensemble gets wrong, and combines the weak
learners into a single, more accurate classifier. In this way it optimizes
the learning ability of the participating models.

10. Naïve Bayesian : Named for Thomas Bayes, an English
statistician who also gave his name to Bayes’ Theorem, Naïve Bayes is not
one algorithm, but a family of classification algorithms. It is called “naïve”
because, as it learns, it assumes all attributes of an item are independent
of each other. The algorithm learns to predict an attribute based on other,
known features.
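As promised above, here is a minimal Python sketch that runs two of the listed algorithms, K-means clustering (item 1) and logistic regression (item 4), on small synthetic data. The use of scikit-learn and the generated data are assumptions for illustration; the text names algorithms, not libraries.

```python
# Hedged sketch: two of the algorithms above (K-means, logistic regression) via
# scikit-learn on synthetic data. Library choice and data are illustrative assumptions.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs, make_classification
from sklearn.linear_model import LogisticRegression

# K-means: unsupervised pre-clustering of points into k groups.
X_blobs, _ = make_blobs(n_samples=300, centers=3, random_state=0)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_blobs)
print("cluster sizes:", np.bincount(kmeans.labels_))

# Logistic regression: classify success/failure-style outcomes.
X, y = make_classification(n_samples=300, n_features=5, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X, y)
print("P(class 1) for first row:", clf.predict_proba(X[:1])[0, 1])
```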

BIG DATA SEMANTICS


We are addressing the data preparation problem by allowing a developer to interactively define a
correct data preparation plan on a small sample of data and then execute that plan in a parallel,
streaming execution environment. To achieve this goal, we are building a comprehensive set of
data transformation operations that can handle a wide range of real-world data, including
structured as well as semi-structured data, and developing an easy-to-use approach
that allows a developer to quickly define correct data reshaping plans that not only transform
data, but can easily restructure the output of one tool into the input of another tool. In the original
Karma, the system performs the task interactively on small to moderate size datasets. In this
effort, we are addressing three challenges: (1) how to provide the rich set of capabilities required
to prepare datasets for both analysis and visualization tools, (2) how to make it easy to define
these complex data preparation plans, and (3) how to perform these tasks at scale on massive
datasets. Karma provides tools to semi-automatically build a semantic description (or model) of a
data source. This model makes it possible to rapidly map a set of sources (represented in XML,
KML, JSON, structured text files, or databases) into a shared domain ontology, which supports
the integrated reasoning and processing of data across sources. Once we have modeled the data
sources, they are converted into a variety of formats, including RDF, and published so that
various analysis processes can reason over the integrated datasets. In order to apply Karma to the
problem of big data, we plan to start with the same core capabilities for quickly modeling
sources, which allows us to automate many of the required transformations, and then develop
new data restructuring capabilities and execute that restructuring on big datasets. The fact
that the system can rapidly build a model of a source means that Karma would enable a user to
quickly define a restructuring plan.
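Karma itself is not shown here; as a generic illustration of the final step the text describes (converting modeled sources into RDF), the sketch below maps one small JSON record into RDF triples with the rdflib Python library. The example namespace and property names are made up.

```python
# Generic illustration (not Karma): turning one JSON record into RDF triples with rdflib
# (pip install rdflib). The namespace and property names are made up for this example.
import json
from rdflib import Graph, Literal, Namespace, RDF

record = json.loads('{"id": "b123", "name": "Alan Turing", "birthYear": 1912}')

EX = Namespace("http://example.org/people/")
g = Graph()
g.bind("ex", EX)

subject = EX[record["id"]]
g.add((subject, RDF.type, EX.Person))
g.add((subject, EX.name, Literal(record["name"])))
g.add((subject, EX.birthYear, Literal(record["birthYear"])))

# Publish/serialize the integrated data, e.g. as Turtle.
print(g.serialize(format="turtle"))
```
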
Different approaches have been proposed to solve different problems in various areas.
They aim to be accurate, actionable, and agile in order to feed smarter decision-making.
Smart data harnesses the 3V challenges and applies semantics and neuroscience to the
data to extract its value; it is the meeting point of big data and semantics.

Conclusion

With Big Data what would be the future like?


As larger and more complex data sets emerge, it becomes increasingly difficult to process
Big Data using on-hand database management tools or traditional data processing
applications. To maximize the significant investments in these datacenter resources,
companies must tackle Big Data with “Big Workflow,” a term we’ve coined at
Adaptive Computing to describe a comprehensive approach that maximizes
datacenter resources and streamlines the simulation and data analysis process.
What could you do with Big Data that you couldn’t do before?
With Big Data, one of the major things that we can do is predict the future. In
today's world we are surrounded by predictions. For instance, during political
elections the main focus of the media and the public is not on the differences
between the candidates' positions, but rather on the "horse race" aspect of the
competition. Issues at stake are secondary compared to the main question: who is
going to win? So with these data trends that we receive, we can predict the future.
