This report describes what Big Data is, the advantages and disadvantages of Big Data, some things that can be accomplished with Big Data, the utilization of Big Data, and a conclusion. The utilization section covers where the data comes from, what organizations can do with it, and how it benefits them. The conclusion discusses what the future might look like with big data, what people will be doing when everything generates data, and finally what I want to do with big data.
Big Data
Big data refers to technologies and initiatives that tackle data too diverse, fast-changing, and massive for traditional technologies, skills, and infrastructure to address efficiently; the volume, velocity, and variety of the data are all extremely high. Big Data is not a single technology or initiative but spans several domains of business and technology. Recently developed technologies make it possible to extract value from Big Data. For instance, governments and even Google can track the emergence of disease outbreaks through social media signals. Big Data also refers to large and complex data sets that are impractical to manage with traditional software tools. The size of Big Data may be measured in petabytes (1,024 terabytes) or exabytes (1,024 petabytes), comprising trillions of records on millions of people collected from sources such as the web, social media, mobile devices, and customer contact centers. The data is typically loosely structured, and often incomplete and inaccessible.
Operational technology and analytical technology are the two classes of technology that dominate the Big Data domain. The former offers operational capabilities for real-time workloads where data is primarily captured, stored, and manipulated. The latter offers analytical capabilities for complex analysis that may touch all of the data. The two classes are complementary and frequently deployed together, yet they have opposing requirements and unique demands. Operational systems such as NoSQL databases serve many concurrent requests while keeping response latency low. Analytical systems focus on high throughput, even when queries are complex and must reference all data in the system; Hadoop, with its MapReduce model, is a typical analytical system.
Big Data has attracted a lot of attention owing to market trends, equipment performance, and other industry factors. Big data analytical tools and technologies greatly assist IT decision making. Even large organizations find it difficult to manipulate and manage larger datasets. Big Data deals with two classes of data sets, namely structured and unstructured. Records obtained from inventories, orders, and customer information contribute to structured datasets, while unstructured data comes from the web, social media, and intelligent devices.
Data Mining Tools and Techniques for Big Data
Most objects and data in the real world are of multiple types and are interconnected, forming complex, heterogeneous, but often semi-structured information networks. We view interconnected, multi-typed data, including typical relational database data, as heterogeneous information networks, study how to leverage the rich semantic meaning of the structural types of objects and links in these networks, and develop a structural analysis approach to mining semi-structured, multi-typed heterogeneous information networks. Here we summarize a set of methodologies that can effectively and efficiently mine useful knowledge from such information networks, and point out some promising research directions.
Here are the tools used to store and analyse Big Data. We can categorise them into two groups: storage and querying/analysis.
1. Apache Hadoop - Apache Hadoop is a Java-based free software framework that can effectively store large amounts of data in a cluster. The framework runs in parallel on the cluster and can process data across all nodes. The Hadoop Distributed File System (HDFS) is Hadoop's storage system; it splits big data and distributes it across many nodes in a cluster, and also replicates the data within the cluster, thus providing high availability (a short HDFS client sketch follows this list).
2. Microsoft HDInsight - A Big Data solution from Microsoft, powered by Apache Hadoop and available as a service in the cloud. HDInsight uses Windows Azure Blob storage as the default file system and likewise provides high availability at low cost.
3. NoSQL - While traditional SQL can effectively handle large amounts of structured data, we need NoSQL (Not Only SQL) to handle unstructured data. NoSQL databases store unstructured data with no particular schema; each row can have its own set of column values. NoSQL gives better performance when storing massive amounts of data, and many open-source NoSQL databases are available for analysing big data.
4. Hive - A distributed data management layer for Hadoop. It supports an SQL-like query language, HiveQL (HQL), for accessing big data, and is primarily used for data mining. It runs on top of Hadoop.
5. Sqoop - A tool that connects Hadoop with various relational databases to transfer data. It can be used effectively to transfer structured data into Hadoop or Hive.
6. PolyBase - Works on top of SQL Server 2012 Parallel Data Warehouse (PDW) and is used to access data stored in PDW. PDW is a data warehousing appliance built for processing any volume of relational data; it integrates with Hadoop, allowing us to access non-relational data as well.
7. Big data in Excel - As many people are comfortable doing analysis in Excel, a popular tool from Microsoft, you can also connect to data stored in Hadoop using Excel 2013. Hortonworks, whose primary business is providing enterprise Apache Hadoop, offers an option to access big data stored in its Hadoop platform using Excel 2013.
8. Presto - Facebook has developed and recently open-sourced its SQL-on-Hadoop query engine, Presto, which is built to handle petabytes of data. Unlike Hive, Presto does not depend on the MapReduce technique and can retrieve data quickly.
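To make the Hadoop entry above concrete, here is a minimal sketch of storing a local file in HDFS via the standard org.apache.hadoop.fs client API. The NameNode address and the file paths are hypothetical values chosen for illustration, not details from this report.

    // HdfsPutExample.java - a minimal sketch, assuming Hadoop client libraries on the classpath
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsPutExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Hypothetical NameNode address; normally read from core-site.xml.
            conf.set("fs.defaultFS", "hdfs://namenode:9000");
            try (FileSystem fs = FileSystem.get(conf)) {
                // HDFS splits the file into blocks, distributes them across
                // nodes, and replicates each block for high availability.
                fs.copyFromLocalFile(new Path("/tmp/data.csv"),
                                     new Path("/user/demo/data.csv"));
            }
        }
    }

Once the file is in HDFS, frameworks such as Hive or MapReduce jobs can process it in parallel across all nodes holding its blocks.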
Apache SAMOA: SAMOA stands for Scalable Advanced Massive Online Analysis. It is an open-source platform built for mining big data streams, with a special emphasis on machine learning. SAMOA follows a Write-Once-Run-Anywhere (WORA) architecture, which allows seamless integration of multiple Distributed Stream Processing Engines (DSPEs) into the framework, and it allows the development of new ML algorithms.
Elasticsearch: Elasticsearch is a dependable and secure open-source platform with which you can take data from any source, in any format, and search, analyze, and visualize it in real time. Elasticsearch is designed for horizontal scalability, reliability, and ease of management. It is based on Lucene, an information-retrieval software library originally written in Java.
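As a small illustration of querying Elasticsearch in real time, the sketch below POSTs a standard match query to the _search REST endpoint using only the JDK's built-in HTTP client (Java 11+). The host, the index name (logs), and the message field are assumptions for this example, not values from the report.

    // EsSearchExample.java - a minimal sketch; host, index, and field are hypothetical
    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    public class EsSearchExample {
        public static void main(String[] args) throws Exception {
            // A standard Elasticsearch full-text "match" query as JSON.
            String query = "{\"query\":{\"match\":{\"message\":\"disease outbreak\"}}}";
            HttpRequest request = HttpRequest.newBuilder()
                    .uri(URI.create("http://localhost:9200/logs/_search"))
                    .header("Content-Type", "application/json")
                    .POST(HttpRequest.BodyPublishers.ofString(query))
                    .build();
            HttpResponse<String> response = HttpClient.newHttpClient()
                    .send(request, HttpResponse.BodyHandlers.ofString());
            // The response body is JSON; matching documents appear under hits.hits.
            System.out.println(response.body());
        }
    }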
Big data storage
Big data storage is a compute-and-storage architecture that collects and manages large data sets and enables real-time data analytics. Companies apply big data analytics to get greater intelligence from metadata. In most cases, big data storage uses low-cost hard disk drives as its foundation, although such systems can also be all-flash or hybrids mixing disk and flash storage. The data itself is mostly unstructured, which makes file-based and object storage a natural fit.
Although a specific volume size or capacity is not formally defined, big data storage
usually refers to volumes that grow exponentially to terabyte or petabyte scale.
A big data storage system clusters a large number of commodity servers attached to high-capacity disks to support analytics software written to crunch vast quantities of data. The system relies on massively parallel processing databases to analyze data ingested from a variety of sources.
Big data often lacks structure and comes from various sources, making it a poor fit for processing with a relational database. The Apache Hadoop platform, with its Hadoop Distributed File System (HDFS), is the most prevalent analytics engine for big data and is typically combined with some flavor of NoSQL database.
Hadoop is open-source software written in the Java programming language. HDFS spreads data across hundreds or even thousands of server nodes without a performance hit, and through its MapReduce component Hadoop distributes processing across those nodes as a safeguard against catastrophic failure. The multiple nodes serve as a platform for data analysis at the network's edge: when a query arrives, MapReduce executes processing directly on the storage node where the data resides. Once analysis is complete, MapReduce gathers the results from each server and “reduces” them to present a single cohesive response.
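A minimal sketch of this map-then-reduce flow is the canonical word-count job below, written against the standard org.apache.hadoop.mapreduce API. The class names and whitespace tokenization are illustrative, and the driver/job-submission wiring is omitted for brevity.

    // WordCount.java - a minimal sketch of Hadoop's map/reduce flow (canonical word count)
    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class WordCount {
        // Map phase: runs on the node where each data block is stored,
        // emitting (word, 1) for every word in its local split.
        public static class TokenMapper
                extends Mapper<LongWritable, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();
            @Override
            protected void map(LongWritable key, Text value, Context ctx)
                    throws IOException, InterruptedException {
                for (String token : value.toString().split("\\s+")) {
                    if (token.isEmpty()) continue;
                    word.set(token);
                    ctx.write(word, ONE);
                }
            }
        }
        // Reduce phase: gathers the per-node counts for each word and
        // "reduces" them to a single total - one cohesive response.
        public static class SumReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            protected void reduce(Text key, Iterable<IntWritable> values, Context ctx)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable v : values) sum += v.get();
                ctx.write(key, new IntWritable(sum));
            }
        }
    }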
Big data can bring an organization a competitive advantage through large-scale statistical analysis of the data or its metadata. In a big data environment, the analytics mostly operate on a circumscribed set of data, using a series of data-mining-based predictive modeling forecasts to gauge customer behaviors or the likelihood of future events.
Infrastructure Security
The new Big Data security solutions should extend the secure perimeter from the enterprise to the public cloud, and a trustworthy data-provenance mechanism should also be created across domains. In addition, similar mechanisms can be used to mitigate distributed denial-of-service (DDoS) attacks launched against Big Data infrastructures. Big Data security and privacy must also ensure data trustworthiness throughout the entire data lifecycle, from collection to usage. A recent work describes proposed privacy extensions to UML to help software engineers quickly visualize privacy requirements and design them into Big Data applications (Jutla, Bodorik, & Ali, 2013).
Homomorphic encryption is a form of encryption that allows specific types of computations (e.g. with the RSA public-key encryption algorithm) to be carried out on ciphertext, generating an encrypted result which, when decrypted, matches the result of the same operations performed on the plaintext (Gentry, 2010). This allows encrypted queries on databases, keeping private user information secret where that data is normally stored (somewhere in the cloud; in the limit, a user can store data on any untrusted server, in encrypted form, without worrying about data secrecy) (Ra Popa & Redfield, 2011). More broadly, fully homomorphic encryption improves the efficiency of secure multiparty computation.
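The RSA case mentioned above can be made concrete in a few lines: textbook RSA is multiplicatively homomorphic, so multiplying two ciphertexts and then decrypting yields the product of the two plaintexts. Below is a toy sketch using java.math.BigInteger with deliberately small, insecure parameters chosen for illustration.

    // RsaHomomorphismDemo.java - a toy sketch of RSA's multiplicative homomorphism
    // (textbook RSA with tiny, insecure parameters; for illustration only)
    import java.math.BigInteger;

    public class RsaHomomorphismDemo {
        public static void main(String[] args) {
            BigInteger p = BigInteger.valueOf(61);
            BigInteger q = BigInteger.valueOf(53);
            BigInteger n = p.multiply(q);                          // modulus: 3233
            BigInteger phi = p.subtract(BigInteger.ONE)
                              .multiply(q.subtract(BigInteger.ONE));
            BigInteger e = BigInteger.valueOf(17);                 // public exponent
            BigInteger d = e.modInverse(phi);                      // private exponent

            BigInteger m1 = BigInteger.valueOf(7);
            BigInteger m2 = BigInteger.valueOf(11);
            BigInteger c1 = m1.modPow(e, n);                       // Enc(m1)
            BigInteger c2 = m2.modPow(e, n);                       // Enc(m2)

            // Multiply the ciphertexts WITHOUT decrypting them first...
            BigInteger cProduct = c1.multiply(c2).mod(n);
            // ...then decrypting the product yields m1 * m2.
            BigInteger decrypted = cProduct.modPow(d, n);
            System.out.println(decrypted.equals(m1.multiply(m2))); // prints: true
        }
    }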
Owing to the rapid growth and spread of network services, mobile devices, and online users on the Internet, the amount of data has increased remarkably, and almost every industry is trying to cope with this huge volume of data. The big data phenomenon has thus begun to gain importance. However, not only is it very difficult to store big data and analyse it with traditional applications, it also poses challenging privacy and security problems.
A defining feature of Big Data visualization is scale. Today's enterprises collect and store vast amounts of data that would take years for a human to read, let alone understand. Researchers have estimated that the human retina can transmit data to the brain at a rate of only about 10 megabits per second. Big Data visualization therefore relies on powerful computer systems to ingest raw corporate data and process it into graphical representations that allow humans to take in and understand vast amounts of data in seconds.
Importance of Big Data Visualization
The amount of data created by corporations around the world is growing every year, and thanks to innovations such
as the Internet of Things this growth shows no sign of abating. The problem for businesses is that this data is only
useful if valuable insights can be extracted from it and acted upon.
To do that, decision makers need to be able to access, evaluate, comprehend, and act on data in near real time, and Big Data visualization promises a way to do just that. Big Data visualization is not the only way for decision makers to analyze data, but Big Data visualization techniques offer a fast and effective way to:
- Spot trends
Tableau: Tableau Desktop is an amazing data visualisation tool (SaaS) for manipulating big data, and it is available to everyone. It has two other variants, “Tableau Server” and the cloud-based “Tableau Online”, which are designed specifically for big-data-oriented organizations. You don't have to be a coder to use this tool; it is very handy and lightning fast.
D3: D3, or Data-Driven Documents, is a JavaScript library for visualising big data in virtually any way you want. It is not a point-and-click tool like the others; the user needs a good grasp of JavaScript to give the collected data a shape. The manipulated data are rendered through HTML, SVG, and CSS, so there is no place for old browsers (IE 7 or 8), as they don't support SVG (Scalable Vector Graphics).
FusionCharts
Canvas
- Report the facts as they are, not as you were hoping they would be;
- Remember that conclusions cannot always be legitimately drawn from a given data set;
- Remember that a lack of evidence for a theory does not prove that the opposite is true;
- Ensure your initial data is reliable;
- Base your conclusions on the full set of data; don't choose data to support a conclusion.
- Some algorithms were developed to address business problems. Some were developed to augment algorithms in use for other purposes, or to make them perform somewhat differently, tuning them to a business environment. These algorithms can be used, for instance, to remind customers of an event, or to target likely credit card applicants. Although one algorithm might be clearly better for a certain purpose than another, it is sometimes very useful to try more than one: doing so provides comparisons and often turns up unexpected results that tell you more than you expected about your product or your customers.
- Ten of the most commonly used algorithms are:
1. K-Means Clustering Algorithm: a simple, unsupervised learning algorithm that is often used with big data sets, frequently as a way of pre-clustering, classifying data into larger categories that other algorithms can further refine (a minimal sketch follows below). It has some inherent limitations that make it best suited to large-scale, high-level clustering.
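To illustrate the assignment/update loop at the heart of K-Means, here is a minimal sketch on one-dimensional data. The sample points, k = 3, the naive initialization, and the fixed iteration count are all illustrative assumptions; production systems would use a library on multi-dimensional data.

    // KMeans1D.java - a minimal K-Means sketch on 1-D data (illustrative values)
    import java.util.Arrays;

    public class KMeans1D {
        public static void main(String[] args) {
            double[] points = {1.0, 1.5, 2.1, 8.0, 8.5, 9.2, 25.0, 26.5};
            double[] centroids = {points[0], points[3], points[6]}; // naive init, k = 3
            int[] assign = new int[points.length];

            for (int iter = 0; iter < 10; iter++) {
                // Assignment step: each point joins its nearest centroid.
                for (int i = 0; i < points.length; i++) {
                    int best = 0;
                    for (int c = 1; c < centroids.length; c++)
                        if (Math.abs(points[i] - centroids[c])
                                < Math.abs(points[i] - centroids[best])) best = c;
                    assign[i] = best;
                }
                // Update step: each centroid moves to the mean of its cluster.
                for (int c = 0; c < centroids.length; c++) {
                    double sum = 0; int count = 0;
                    for (int i = 0; i < points.length; i++)
                        if (assign[i] == c) { sum += points[i]; count++; }
                    if (count > 0) centroids[c] = sum / count;
                }
            }
            System.out.println("Centroids:   " + Arrays.toString(centroids));
            System.out.println("Assignments: " + Arrays.toString(assign));
        }
    }

The coarse clusters this produces can then be handed to finer-grained algorithms for refinement, which is the pre-clustering role described above.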
Conclusion