Chapter 14
Big Data Analytics and NoSQL
Discussion Focus
Start by explaining that Big Data is a nebulous term. Its definition, and the set of techniques and technologies covered under the umbrella term, are constantly changing and being redefined. There is no standardizing body for Big Data or NoSQL, so no one is in a position to state definitively what qualifies as Big Data. The situation is complicated further by the fact that most Big Data and NoSQL technologies are open source, so even the developers working in the arena often form a loose community without hierarchy or formal structure.

As a generic definition, Big Data is data of such volume, velocity, and/or variety that it is difficult
for traditional relational database technologies to store and process it. Students need to understand
that the definition of Big Data is relative, not absolute. We cannot look at a collection of data and
state categorically that it is Big Data now and for all time. We may categorize a set of data or a
data storage and processing requirement as a Big Data problem today. In three years, or even in
one year, relational database technologies may have advanced to the point where that same
problem is no longer a Big Data problem.

NoSQL has the same problem in terms of its definition. Since Big Data and NoSQL are both
defined in terms of a negative statement that says what they are not instead of a positive statement
that says what they are, they both suffer from being ill-defined and overly broad.

Discuss the many V’s of Big Data. The basic V’s (volume, velocity, and variety) are key to Big Data. Again, because there is no authority to define what Big Data is, other V’s are added by writers and thinkers who like to extend the alliteration of the 3 V’s. Beyond the 3 V’s, the additional V’s proposed by various sources are often not really unique to Big Data. For example, all data have volume, but Big Data problems involve volume that is too large for relational database technologies to support, so volume genuinely distinguishes Big Data. Veracity, in contrast, is the trustworthiness of the data. All data needs to be trustworthy, and Big Data problems do not require support for a higher level of trustworthiness than relational database technologies can provide. Therefore, the argument can be made that veracity is a characteristic of all data, not just Big Data. Students should understand that critical thinking about Big Data is necessary when assessing claims and technologies in this fast-changing arena.

Discuss that Hadoop has been the beneficiary of great marketing and widespread buy-in from pundits. Hadoop has become synonymous with Big Data in the minds of many people who are only passingly familiar with data management. However, Hadoop is a very specialized technology aimed at very specific tasks: storing and processing very large data sets in ways that do little to integrate the data. This is why the Hadoop ecosystem is so important; the ecosystem expands the basic HDFS and MapReduce capabilities to support a wider range of needs and to allow greater integration of the data.

Stress to students that the NoSQL landscape is constantly changing. There are roughly 100 products competing in the NoSQL environment at any point in time, with new entrants emerging almost daily and other products disappearing at about the same rate. The text follows the standard categories of NoSQL databases that appear in the literature, as shown below, but many products do not fit neatly into only one category:
• Key-value
• Document
• Column family
• Graph
Each category attempts to deal with non-relational data in a different way.

Data analysis focuses on generating knowledge to expand and inform the organization’s decision-making processes. These topics were covered at length in Chapter 13, in the context of analyzing data from transactional databases integrated into data warehouses. In this chapter, explanatory and predictive analytics are applied to non-relational databases.

Answers to Review Questions


1. What is Big Data? Give a brief definition.
Big Data is data of such volume, velocity, and/or variety that it is difficult for traditional
relational database technologies to store and process it.

2. What are the traditional 3 Vs of Big Data? Briefly, define each.


Volume, velocity, and variety are the traditional 3 Vs of Big Data. Volume refers to the
quantity of the data that must be stored. Velocity refers to the speed with which new data is
being generated and entering the system. Variety refers to the variations in the structure, or
the lack of structure, in the data being captured.

3. Explain why companies like Google and Amazon were among the first to address the Big
Data problem.
In the 1990s, the use of the Internet exploded, and commercial websites helped attract millions of new consumers to online transactions. When the dot-com bubble burst at the end of the 1990s, the millions of new consumers remained, but the number of companies providing them services dropped dramatically. As a result, the surviving companies, such as Google and Amazon, experienced exponential growth in a very short time. This led to these companies being among the first to experience the volume, velocity, and variety of data that is associated with Big Data.

4. Explain the difference between scaling up and scaling out.


Scaling up involves improving storage and processing capabilities by upgrading to better hardware, software, and techniques without increasing the number of servers. Scaling out involves improving storage and processing capabilities by adding more servers.

5. What is stream processing, and why is it sometimes necessary?


Stream processing is the processing of data inputs in order to make decisions about which data should be stored and which data should be discarded. In some situations, large volumes of data can enter the system at such a rapid pace that it is not feasible to store all of the data. The data must be processed and filtered as it enters the system to determine which data to keep and which data to discard.
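
A minimal sketch of the idea in Python, using a hypothetical stream of sensor readings and an arbitrary filtering rule (both the data source and the rule are illustrative assumptions, not part of any particular product):

```python
# Stream-processing sketch: examine readings as they arrive and keep only
# the ones worth storing; everything else is discarded immediately.
import random

def sensor_stream(n=1000):
    """Simulate a rapid stream of temperature readings."""
    for i in range(n):
        yield {"sensor_id": i % 10, "temp": random.gauss(70, 15)}

def keep(reading):
    """In-stream decision rule: store only unusual readings."""
    return reading["temp"] > 90 or reading["temp"] < 40

stored = [r for r in sensor_stream() if keep(r)]
print(f"Stored {len(stored)} of 1000 readings; the rest were discarded.")
```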

6. How is stream processing different from feedback loop processing?


Stream processing focuses on inputs, while feedback loop processing focuses on outputs.
Stream processing is performed on the data as it enters the system to decide which data
should be stored and which should be discarded. Feedback loop processing uses data after it
has been stored to conduct analysis for the purpose of making the data actionable by decision
makers.

7. Explain why veracity, value, and visualization can also be said to apply to relational
databases as well as Big Data.
Veracity of data is an issue with even the smallest of data stores, which is why data
management is so important in relational databases. Value of data also applies to traditional,
structured data in a relational database. One of the keys to data modeling is that only the data
that is of interest to the users should be included in the data model. Data that is not of value
should not be recorded in any data store – Big Data or not. Visualization was discussed and
illustrated at length in Chapter 13 as an important tool in working with data warehouses, which
are often maintained as structured data stores in relational DBMS products.

8. What is polyglot persistence, and why is it considered a new approach?


Polyglot persistence is the idea that an organization’s data storage solutions will consist of a
range of data storage technologies. This is a new approach because the relational database has
previously dominated the data management landscape to the point that the use of a relational
DBMS for data storage was taken for granted in most cases. With Big Data problems, the
reliance on only relational databases is no longer valid.
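
A rough sketch of the idea, with simple in-memory Python objects standing in for the different storage technologies (the store names and routing rules are hypothetical, chosen only to illustrate that each kind of data goes to the store best suited to it):

```python
# Polyglot persistence sketch: different kinds of data are routed to
# different storage technologies. Plain dicts and lists stand in for a
# key-value store, a document store, and a relational table.
session_store = {}   # stand-in for a key-value store (web session data)
catalog_store = []   # stand-in for a document store (semi-structured products)
order_rows = []      # stand-in for a relational table (structured orders)

def save_session(token, data):
    session_store[token] = data                # opaque value, fast lookup by key

def save_product(doc):
    catalog_store.append(doc)                  # structure varies by product

def save_order(customer_id, total):
    order_rows.append((customer_id, total))    # fixed structure, transactional

save_session("abc123", {"cart": [42, 7]})
save_product({"sku": "X1", "name": "Widget", "specs": {"color": "red"}})
save_order(1001, 19.99)
```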

9. What are the key assumptions made by the Hadoop Distributed File System approach?
HDFS is designed around the following assumptions:
• High volume
• Write-once, read-many
• Streaming access
• Fault tolerance
HDFS assumes that massive volumes of data will need to be stored and retrieved. HDFS assumes that data will be written once; that is, there will very rarely be a need to update the data once it has been written to disk, but the data will need to be retrieved many times. HDFS assumes that when a file is retrieved, the entire contents of the file will need to be streamed in a sequential fashion; HDFS does not work well when only small parts of a file are needed. Finally, HDFS assumes that server failures will be frequent. As the number of servers increases, the probability that some server has failed increases significantly, so the data must be stored redundantly to avoid loss of data when servers fail.

10. What is the difference between a name node and a data node in HDFS?
The name node stores the metadata that tracks where all of the actual data blocks reside in the
system. The name node is responsible for coordinating tasks across multiple data nodes to
ensure sufficient redundancy of the data. The name node does not store any of the actual user
data. The data nodes store the actual user data. A data node does not store metadata about the
contents of any data node other than itself.
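
A toy Python sketch of this division of labor; the server names, block IDs, and the replication factor of 3 are illustrative assumptions:

```python
# Toy model of HDFS bookkeeping: the name node holds only metadata (which
# blocks make up a file and which data nodes hold each block); the data
# nodes hold the actual block contents.
import random

REPLICATION = 3
data_nodes = {f"dn{i}": {} for i in range(1, 6)}    # node -> {block_id: bytes}
name_node = {"big_log.txt": []}                     # file -> [(block_id, replicas)]

def store_block(filename, block_id, payload):
    replicas = random.sample(sorted(data_nodes), REPLICATION)
    for node in replicas:
        data_nodes[node][block_id] = payload          # actual data on data nodes
    name_node[filename].append((block_id, replicas))  # metadata on name node only

store_block("big_log.txt", "blk_0001", b"...64 MB of data...")
print(name_node["big_log.txt"])   # the name node knows where the data is, not what it is
```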

11. Explain the basic steps of MapReduce processing.


• A client node submits a job to the Job Tracker.
• The Job Tracker determines where the data to be processed resides.
• The Job Tracker contacts the Task Trackers on the nodes as close as possible to the data.
• Each Task Tracker creates mappers and reducers as needed to complete the processing of each block of data and to consolidate that data into a result.
• The Task Trackers report results back to the Job Tracker when the mappers and reducers are finished.
• The Job Tracker updates the status of the job to indicate when it is complete.

12. Briefly explain how HDFS and MapReduce are complementary to each other.
Both HDFS and MapReduce rely on the concept of dividing work and data into massive numbers of relatively independent, distributed pieces. HDFS decomposes data into large, independent chunks that are distributed across a number of independent servers. MapReduce decomposes processing into independent tasks that are distributed across a number of independent servers. The distribution of data in HDFS is coordinated by a name node, which collects data from each server about the state of the data that it holds. The distribution of processing in MapReduce is coordinated by a job tracker, which collects data from each server about the state of the processing it is performing.
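
As a minimal illustration of the mapper and reducer roles described in questions 11 and 12, the following single-process Python word-count sketch shows only the logic; in a real Hadoop job the blocks would live in HDFS and the mappers and reducers would run on many Task Tracker nodes:

```python
# Word-count sketch of MapReduce logic: each mapper emits (word, 1) pairs
# for its block of text; the reducer sums the counts for each word.
from collections import defaultdict

def mapper(block_of_text):
    for word in block_of_text.split():
        yield (word.lower(), 1)

def reducer(word, counts):
    return (word, sum(counts))

blocks = ["big data big ideas", "data about big data"]   # stand-ins for HDFS blocks

grouped = defaultdict(list)          # shuffle/sort: group mapper output by key
for block in blocks:                 # each block would be processed on a different node
    for word, count in mapper(block):
        grouped[word].append(count)

results = [reducer(word, counts) for word, counts in grouped.items()]
print(sorted(results))   # [('about', 1), ('big', 3), ('data', 3), ('ideas', 1)]
```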

13. What are the four basic categories of NoSQL databases?


Key-value databases, document databases, column family databases, and graph databases.

14. How are the value components of a key-value database and a document database
different?
In a key-value database, the value component is unintelligible to the DBMS. In other words, the DBMS is unaware of the meaning of any of the data in the value component; it is treated as an indecipherable mass of data, and all processing of the data in the value component must be performed by the application logic. In a document database, the value component is partially interpretable by the DBMS: the DBMS can identify and search for specific tags, or subdivisions, within the value component.
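
A small Python sketch of the contrast; the keys, field names, and values are invented for illustration:

```python
# In a key-value store, the value is an opaque blob: the DBMS can only fetch
# it by key, and the application must interpret the bytes itself.
import json

key_value_store = {
    "customer:1001": b'{"name": "Ann", "city": "Tampa"}',   # just bytes to the DBMS
}
ann = json.loads(key_value_store["customer:1001"])          # application logic decodes it

# In a document store, the value has tagged structure that the DBMS can see,
# so it can search or index on fields (tags) inside the document.
document_store = [
    {"_id": "customer:1001", "name": "Ann", "city": "Tampa"},
    {"_id": "customer:1002", "name": "Raj", "city": "Reno"},
]
tampa_customers = [d for d in document_store if d.get("city") == "Tampa"]   # simulated tag search
```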

15. Briefly explain the difference between row-centric and column-centric data storage.
Row-centric storage treats a row as the smallest data storage unit. All of the column values
associated with a particular row of data are stored together in physical storage. This is the
optimal storage approach for operations that manipulate and retrieve all columns in a row, but
only a small number of rows in a table. Column-centric storage treats a row as a divisible
collection of values that are stored separately with the values of a single column across many
rows being physically stored together. This is optimal when operations manipulate and retrieve
a small number of columns in a row for all rows in the table.
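
A toy Python illustration of the two physical layouts for the same small table (the table and values are invented):

```python
# The same three-row table laid out row-centrically and column-centrically.
rows = [
    (1, "Ann", 500.00),
    (2, "Raj", 125.50),
    (3, "Li",  980.25),
]

# Row-centric: all column values for a row are stored contiguously, so
# fetching one complete row touches one contiguous unit.
row_store = list(rows)

# Column-centric: the values of each column across all rows are stored
# together, so aggregating one column touches only that column.
column_store = {
    "cust_id": [r[0] for r in rows],
    "name":    [r[1] for r in rows],
    "balance": [r[2] for r in rows],
}

one_customer = row_store[1]                       # efficient in a row store
total_balance = sum(column_store["balance"])      # efficient in a column store
```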

16. What is the difference between a column and a super column in a column family
database?
Columns in a column family database are relatively independent of each other. A super column
is a group of columns that are logically related. This relationship can be based on the nature of
the data in the columns, such as a group of columns that comprise an address, or it can be based
on application processing requirements.
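
A brief sketch of the idea using nested Python dictionaries; the column and super column names are invented:

```python
# One row in a column family database, sketched as nested dictionaries.
# "address" acts as a super column: a named group of logically related columns.
customer_row = {
    "row_key": "customer:1001",
    "name": "Ann Alvarez",          # ordinary column
    "phone": "555-0142",            # ordinary column
    "address": {                    # super column grouping related columns
        "street": "123 Main St",
        "city": "Tampa",
        "state": "FL",
        "zip": "33601",
    },
}

city = customer_row["address"]["city"]   # columns are addressed through the super column
```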

17. Explain why graph databases tend to struggle with scaling out?
Graph databases are designed to address problems involving highly related data. The data in a graph database are tightly integrated, and queries that traverse a graph focus on the relationships among the data. Scaling out requires moving data to a number of different servers. As a general rule, scaling out is recommended when the data on each server is relatively independent of the data on other servers. Because of the dependencies among the data on different servers in a graph database, the inter-server communication overhead is very high. This has a significant negative impact on the performance of graph databases in a scaled-out environment.

18. What is data analytics? Briefly define explanatory and predictive analytics. Give some
examples.
Data analytics is a subset of BI functionality that encompasses a wide range of mathematical,
statistical, and modeling techniques with the purpose of extracting knowledge from data. Data
analytics is used at all levels within the BI framework, including queries and reporting,
monitoring and alerting, and data visualization. Hence, data analytics is a “shared” service that
is crucial to what BI adds to an organization. Data analytics represents what business managers
really want from BI: the ability to extract actionable business insight from current events and
foresee future problems or opportunities.

Data analytics discovers characteristics, relationships, dependencies, or trends in the organization’s data, and then explains the discoveries and predicts future events based on the discoveries. Data analytics tools can be grouped into two separate (but closely related and often overlapping) areas, contrasted in the brief sketch that follows the list:
• Explanatory analytics focuses on discovering and explaining data characteristics and relationships based on existing data. Explanatory analytics uses statistical tools to formulate hypotheses, test them, and answer the how and why of such relationships—for example, how do past sales relate to previous customer promotions?
• Predictive analytics focuses on predicting future data outcomes with a high degree of accuracy. Predictive analytics uses sophisticated statistical tools to help the end user create advanced models that answer questions about future data occurrences—for example, what would next month’s sales be based on a given customer promotion?
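
A compact Python sketch of the two kinds of questions asked of the same data; the figures are invented purely for illustration:

```python
# Explanatory vs. predictive analytics on the same toy data set.
import numpy as np

promo_spend   = np.array([1.0, 2.0, 3.0, 4.0, 5.0])       # past promotion spend
monthly_sales = np.array([11.0, 13.5, 16.0, 18.2, 21.1])  # past sales (invented)

# Explanatory: how did past sales relate to past promotions?
correlation = np.corrcoef(promo_spend, monthly_sales)[0, 1]

# Predictive: what would sales be for a planned promotion level of 6.0?
slope, intercept = np.polyfit(promo_spend, monthly_sales, 1)
forecast = slope * 6.0 + intercept

print(f"correlation = {correlation:.2f}, forecast = {forecast:.1f}")
```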

19. Describe and contrast the focus of data mining and predictive analytics. Give some
examples.
In practice, data analytics is better understood as a continuous spectrum of knowledge
acquisition that goes from discovery to explanation to prediction. The outcomes of data
analytics then become part of the information framework on which decisions are built. You
can think of data mining (explanatory analytics) as explaining the past and present, while
predictive analytics forecasts the future. However, you need to understand that both sciences
work together; predictive analytics uses explanatory analytics
as a stepping stone to create predictive models.

Data mining refers to analyzing massive amounts of data to uncover hidden trends, patterns,
and relationships; to form computer models to simulate and explain the findings; and then to
use such models to support business decision making. In other words, data mining focuses on
the discovery and explanation stages of knowledge acquisition. However, data mining can also
be used as the basis to create advanced predictive data models. For example, a predictive model
could be used to predict future customer behavior, such as a customer response to a target
marketing campaign.

So, what is the difference between data mining and predictive analytics? In fact, data mining
and predictive analytics use similar and overlapping sets of tools, but with a slightly different
focus. Data mining focuses on answering the “how” and “what” of past data, while predictive
analytics focuses on creating actionable models to predict future behaviors and events. In some
ways, you can think of predictive analytics as the next logical step after data mining; once you
understand your data, you can use the data to predict future behaviors. In fact, most BI vendors
are dropping the term data mining and replacing it with the more alluring term predictive
analytics.

Predictive analytics can be traced back to the banking and credit card industries. The need to
profile customers and predict customer buying patterns in these industries was a critical driving
force for the evolution of many modeling methodologies used in BI data analytics today. For
example, based on your demographic information and purchasing history, a credit card
company can use data-mining models to determine what credit limit to offer, what offers you
are more likely to accept, and when to send those offers. As another example, a data mining tool could be used to analyze customer purchase history data. The data mining tool will find many interesting purchasing patterns and correlations involving customer demographics, the timing of purchases, and the types of items customers buy together. A predictive analytics tool can then use those findings to build a model that predicts, with a high degree of accuracy, when a certain type of customer will purchase certain items and which items are likely to be purchased on certain nights and at certain times.

20. How does data mining work? Discuss the different phases in the data mining process.
Data mining proceeds in four phases:
• In the data preparation phase, the main data sets to be used by the data mining operation are identified and cleansed of any data impurities. Because the data in the data warehouse are already integrated and filtered, the data warehouse usually is the target data set for data mining operations.
• The objective of the data analysis and classification phase is to study the data to identify common data characteristics or patterns. During this phase, the data mining tool applies specific algorithms to find:
  - data groupings, classifications, clusters, or sequences.
  - data dependencies, links, or relationships.
  - data patterns, trends, and deviations.
• The knowledge acquisition phase uses the results of the data analysis and classification phase. During this phase, the data mining tool (with possible intervention by the end user) selects the appropriate modeling or knowledge acquisition algorithms. The most typical algorithms used in data mining are based on neural networks, decision trees, rules induction, genetic algorithms, classification and regression trees, memory-based reasoning, or nearest neighbor and data visualization. A data mining tool may use many of these algorithms in any combination to generate a computer model that reflects the behavior of the target data set.
• Although some data mining tools stop at the knowledge acquisition phase, others continue to the prognosis phase. In this phase, the data mining findings are used to predict future behavior and forecast business outcomes. Examples of data mining findings include:

65% of customers who did not use the credit card in six months are 88% likely to cancel their account.

82% of customers who bought a new TV 27" or bigger are 90% likely to buy an entertainment center within the next 4 weeks.

If age < 30 and income <= 25,000 and credit rating < 3 and credit amount > 25,000, the minimum term is 10 years.

The complete set of findings can be represented in a decision tree, a neural net, a forecasting model, or a visual presentation interface, which is then used to project future events or results. For example, the prognosis phase may project the likely outcome of a new product rollout or a new marketing promotion.
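
Rule-style findings like those above could come from a classification tree built during the knowledge acquisition phase. A minimal sketch using scikit-learn on made-up customer data (the features, thresholds, and labels are invented; a real effort would work from the prepared warehouse data):

```python
# Knowledge acquisition and prognosis sketch: fit a small decision tree on
# invented customer data, print its induced rules, and apply them.
from sklearn.tree import DecisionTreeClassifier, export_text

# Features: [age, income, months_since_last_card_use]; label: 1 = cancelled.
X = [[25, 24000, 7], [42, 60000, 1], [31, 35000, 8],
     [55, 80000, 0], [23, 22000, 6], [47, 52000, 2]]
y = [1, 0, 1, 0, 1, 0]

tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
print(export_text(tree, feature_names=["age", "income", "months_idle"]))

# Prognosis: apply the induced rules to a new customer.
print(tree.predict([[29, 26000, 7]]))   # likely classified as a cancellation risk
```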

21. Describe the characteristics of predictive analytics. What is the impact of Big Data in
predictive analytics?
Predictive analytics employs mathematical and statistical algorithms, neural networks,
artificial intelligence, and other advanced modeling tools to create actionable predictive
models based on available data. The algorithms used to build the predictive model are specific
to certain types of problems and work with certain types of data. Therefore, it is important that the end user, who typically is trained in statistics and understands the business, applies the proper algorithms to the problem at hand. However, thanks to constant technology advances, modern
BI tools automatically apply multiple algorithms to find the optimum model. Most predictive
analytics models are used in areas such as customer relationships, customer service, customer
retention, fraud detection, targeted marketing, and optimized pricing. Predictive analytics can
add value to an organization in many different ways; for example, it can help optimize existing
processes, identify hidden problems, and anticipate future problems or opportunities. However,
predictive analytics is not the “secret sauce” to fix all business problems. Managers should
carefully monitor and evaluate the value of predictive analytics models to determine their
return on investment.
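
A minimal sketch of an actionable predictive model of this kind, fit on invented customer data with scikit-learn; the features, labels, and 0.5 targeting threshold are assumptions made for illustration:

```python
# Predictive analytics sketch: estimate which customers are likely to respond
# to a targeted offer, then act only on the high-probability prospects.
from sklearn.linear_model import LogisticRegression

# Features: [age, past_purchases, emails_opened]; label: 1 = responded before.
X = [[25, 2, 1], [40, 10, 8], [33, 5, 4], [51, 12, 9], [29, 1, 0], [45, 8, 7]]
y = [0, 1, 0, 1, 0, 1]

model = LogisticRegression().fit(X, y)

prospects = [[38, 9, 6], [27, 1, 1]]
scores = model.predict_proba(prospects)[:, 1]    # probability of responding

# Make the prediction actionable: target only the likely responders.
targets = [p for p, score in zip(prospects, scores) if score >= 0.5]
print(scores, targets)
```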

Predictive analytics received a big stimulus with the advent of social media. Companies turned
to data mining and predictive analytics as a way to harvest the mountains of data stored on
social media sites. Google was one of the first companies that offered targeted ads as a way to
increase and personalize search experiences. Similar initiatives were used by all types of
organizations to increase customer loyalty and drive up sales. Take the example of the airline
and credit card industries and their frequent flyer and affinity card programs. Nowadays, many
organizations use predictive analytics to profile customers in an attempt to get and keep the
right ones, which in turn will increase loyalty and sales.
