
Unit 1: Introduction to Big Data

Types of data and their characteristics

The most basic forms of data are database data, data warehouse data and transactional data.

A brief overview of these basic forms of data follows:

Database Data

A database system, also called a database management system (DBMS), consists of a collection of
interrelated data, known as a database, and a set of software programs to manage and access the data. The
software programs provide mechanisms for defining database structures and data storage; for specifying
and managing concurrent, shared, or distributed data access; and for ensuring consistency and security of
the information stored despite system crashes or attempts at unauthorized access.

Data warehouse Data

A data warehouse is a repository of information collected from multiple sources, stored under a unified schema, and usually
residing at a single site. Data warehouses are constructed via a process of data cleaning, data integration,
data transformation, data loading, and periodic data refreshing. To facilitate decision making, the data in a
data warehouse are organized around major subjects (e.g. customer, item, supplier, and activity). The data
are stored to provide information from a historical perspective, such as in the past 6 to 12 months, and are
typically summarized. For example, rather than storing the details of each sales transaction, the data
warehouse may store a summary of the transactions per item type for each store or, summarized to a higher
level, for each sales region.
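The summarization described above can be sketched in a few lines. This is an illustrative example, not the text's own system: the store names, item types, and amounts are invented, and a real warehouse load would run as part of an ETL pipeline rather than in-process.

```python
from collections import defaultdict

# Hypothetical raw sales transactions as (store, item_type, amount);
# all names and values are illustrative.
transactions = [
    ("store_A", "electronics", 120.0),
    ("store_A", "electronics", 80.0),
    ("store_A", "grocery", 15.5),
    ("store_B", "electronics", 200.0),
    ("store_B", "grocery", 9.5),
]

def summarize(rows):
    """Roll individual transactions up to (store, item_type) totals --
    the kind of summary a warehouse stores instead of raw detail."""
    totals = defaultdict(float)
    for store, item_type, amount in rows:
        totals[(store, item_type)] += amount
    return dict(totals)

summary = summarize(transactions)
# summary[("store_A", "electronics")] == 200.0
```

Summarizing to a higher level (e.g., per sales region) would simply group by a coarser key.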

Transactional Data

In a transactional database each record in the database captures a transaction, such as a customer’s
purchase, a flight booking, or a user’s clicks on a web page. A transaction typically includes a unique
transaction identity number (trans ID) and a list of the items making up the transaction, such as the items
purchased in the transaction. A transactional database may have additional tables, which contain other
information related to the transactions, such as item description, information about the salesperson or the
branch, and so on.
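A transactional record of this shape can be sketched as follows. The field names, items, and the lookup table are illustrative assumptions, standing in for the trans ID, item list, and additional tables described above.

```python
# A minimal sketch of a transactional record: a unique trans ID plus
# the list of items making up the transaction. Names are illustrative.
transaction = {
    "trans_id": "T100",
    "items": ["milk", "bread", "eggs"],
}

# A related table holding other information about the transaction's
# items, analogous to the additional tables mentioned in the text.
item_descriptions = {
    "milk": "1L whole milk",
    "bread": "wholegrain loaf",
    "eggs": "dozen free-range eggs",
}

def describe(trans, catalog):
    """Join a transaction's item list against the description table."""
    return [(item, catalog.get(item, "unknown")) for item in trans["items"]]
```

For example, `describe(transaction, item_descriptions)` pairs each purchased item with its description.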

Other Kinds of Data

Besides relational database data, data warehouse data, and transaction data, other kinds of data can be
seen in many applications:
1. time-related or sequence data (e.g., historical records, stock exchange data),
2. time-series and biological sequence data,
3. data streams (e.g., video surveillance and sensor data, which are continuously transmitted),
4. spatial data (e.g., maps),
5. engineering design data (e.g., the design of buildings, system components, or integrated circuits),
6. hypertext and multi-media data (including text, image, video, and audio data),
7. graph and networked data (e.g., social and information networks),
8. the Web (a huge, widely distributed information repository made available by the Internet).

Various kinds of knowledge can be mined from these kinds of data. Banking data can be mined for changing
trends, which may aid in the scheduling of bank tellers according to the volume of customer traffic. Stock
exchange data can be mined to uncover trends that could help you plan investment strategies (e.g., the best
time to purchase stock). We could mine computer network data streams to detect intrusions based on the
anomaly of message flows, which may be discovered by clustering, dynamic construction of stream models
or by comparing the current frequent patterns with those at a previous time. With spatial data, we may look
for patterns that describe changes in metropolitan poverty rates based on city distances from major
highways. The relationships among a set of spatial objects can be examined in order to discover which
subsets of objects are spatially autocorrelated or associated. By mining text data, such as literature on data
mining from the past ten years, we can identify the evolution of hot topics in the field. By mining user
comments on products (which are often submitted as short text messages), we can assess customer
sentiments and understand how well a product is embraced by a market. From multimedia data, we can
mine images to identify objects and classify them by assigning semantic labels or tags. By mining video data
of a hockey game, we can detect video sequences corresponding to goals. Web mining can help us learn
about the distribution of information on the WWW in general, characterize and classify web pages, and
uncover web dynamics and the association and other relationships among different web pages, users,
communities, and web-based activities.

Digital data

Digital data is data that represents other forms of data using specific machine language systems that can be
interpreted by various technologies. The most fundamental of these systems is a binary system, which
simply stores complex audio, video or text information in a series of binary characters, traditionally ones and
zeros, or "on" and "off" values. Digital data are discrete, discontinuous representations of information or
works. It is in contrast to continuous, or analog signals which behave in a continuous manner, or represent
information using a continuous function.
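The binary encoding idea can be shown concretely: any text can be reduced to a series of ones and zeros and recovered again. This is a minimal sketch using Python's standard byte encoding; 8 bits per byte is one common convention, not the only possible one.

```python
def to_bits(text):
    """Encode text as a string of ones and zeros, 8 bits per byte,
    illustrating how digital systems reduce content to binary values."""
    return "".join(f"{byte:08b}" for byte in text.encode("utf-8"))

def from_bits(bits):
    """Invert to_bits: regroup into bytes and decode back to text."""
    data = bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))
    return data.decode("utf-8")

# to_bits("Hi") == "0100100001101001"
```

Audio and video digitization follow the same principle, only with far more bits per second of content.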

One of the biggest strengths of digital data is that all sorts of very complex analog input can be represented
with the binary system. Along with smaller microprocessors and larger data storage centers, this model of
information capture has helped parties like businesses and government agencies to explore new frontiers of
data collection and to represent more impressive simulations through a digital interface.

An example is the conversion of a physical scene to a digital image. Digital data records visual information
as a bitmap, which stores a particular color value for each position on a precise grid. Using this
straightforward system, the digital image is created.
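A bitmap of the kind just described can be sketched as a grid of color values. A 2x2 black-and-white checkerboard with RGB tuples is enough to show the idea; real formats add compression and metadata on top of the same grid.

```python
# A minimal bitmap sketch: a grid where each cell stores a color value
# (here an RGB tuple). Colors and dimensions are illustrative.
WHITE, BLACK = (255, 255, 255), (0, 0, 0)

bitmap = [
    [BLACK, WHITE],
    [WHITE, BLACK],
]

def pixel(img, row, col):
    """Read the color value stored at one grid position."""
    return img[row][col]
```

Scaling the grid up and widening the color range is all that separates this toy from a photographic image.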

Sources of Data

Big data is drawn primarily from nine kinds of sources; regardless of source, the data itself can be classified by its degree of structure, as follows.

Structured, Semi-Structured, Quasi-Structured and Unstructured Data with Examples

Structured data is data stored in a relational database, organized in rows and columns within named
tables.

Semi-structured data does not have a formal data model but has an apparent, self-describing pattern and
structure that enable its analysis. Examples of semi-structured data include spreadsheets that have a row
and column structure, and XML files that are defined by an XML schema.

Quasi-structured data consists of textual data with erratic data formats, and can be formatted with effort,
software tools, and time. An example of quasi-structured data is the data about which webpages a user
visited and in what order.

Unstructured data does not have a data model and is not organized in any particular format. Some
examples of unstructured data include text documents, PDF files, e-mails, presentations, images, and
videos. More than 90 percent of the data generated in the digital universe today is non-structured data
(semi-, quasi-, and unstructured).
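The same record can be shown in each form to make the distinction concrete. This is an illustrative sketch: the customer record, its fields, and the XML/JSON snippets are invented examples, not data from the text.

```python
import json
import xml.etree.ElementTree as ET

# Structured: a fixed row in a named table with known columns.
structured_row = ("C42", "Alice", "alice@example.com")  # (id, name, email)

# Semi-structured: XML carries a self-describing pattern -- the tags
# themselves tell you what each value means.
semi_structured = "<customer id='C42'><name>Alice</name></customer>"
root = ET.fromstring(semi_structured)

# JSON is another common semi-structured format.
as_json = json.dumps({"id": "C42", "name": "Alice"})

# Unstructured: free text with no data model; extracting the same
# fields would require text processing.
unstructured = "Alice (customer C42) emailed us about her order."
```

Here `root.find("name").text` recovers the name directly from the XML structure, whereas the unstructured sentence offers no such handle.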

Big Data

Big data is an evolving term that describes any voluminous amount of structured, semi-structured, and
unstructured data that has the potential to be mined for information. Big data can be analyzed for insights
that lead to better decisions and strategic business moves. These data sets are so voluminous and complex
that traditional data processing software is inadequate to deal with them. Big data challenges include
capturing data, data storage, data analysis, search, sharing, transfer, visualization, querying, updating, and
information privacy.

Evolution of Big data


Characteristics of Big data

Five Vs

a. Velocity
b. Volume
c. Value
d. Variety
e. Veracity

A brief overview of the Five Vs follows:

a. Velocity
Velocity refers to the speed at which new data is generated, collected, and analyzed at any given
time. The number of emails, social media posts, video clips, and other new content added per day
runs into the billions of entries. This rate continues to increase rapidly as tablets and mobile devices
make it ever easier to add content online. As new data is added, it is important that it be analyzed in
real time, and big data technology today makes it possible to analyze data as it is generated.

b. Volume
Volume refers to the amount of data produced every second across all online channels, including
social media platforms, mobile devices, and online transactions. With data growing by leaps and
bounds every minute of every day, we can no longer store and analyze data using traditional
database technology. Instead, organizations have shifted to distributed systems, in which data is
partitioned across machines and brought back together by software for analysis. As an example,
Facebook alone can generate over five new profiles a second, 136,000 new picture uploads a
minute, and 510 comments per minute.
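The distributed pattern described above can be sketched in miniature: split the data into partitions, process each partition independently, then merge the partial results. This is a toy illustration of the idea, not a real distributed system; the user events are invented.

```python
from collections import Counter

def partition(records, n):
    """Split records into n roughly equal chunks, as a distributed
    system would spread data across machines."""
    return [records[i::n] for i in range(n)]

def count_partition(chunk):
    """Independent work on one partition: count events per user."""
    return Counter(user for user, _event in chunk)

def merge(partials):
    """Bring the per-partition counts back together into one result."""
    total = Counter()
    for p in partials:
        total += p
    return total

# Hypothetical activity events as (user, event_type) pairs.
events = [("u1", "post"), ("u2", "like"), ("u1", "like"), ("u3", "post")]
result = merge(count_partition(c) for c in partition(events, 2))
# result["u1"] == 2
```

Frameworks such as Hadoop MapReduce and Spark apply the same split/process/merge shape at the scale of clusters and petabytes.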

c. Value
Value refers to the worth of the data being extracted. Not all data is created equal. In fact, having
endless amounts of data does not always translate into having high value data. When trying to
decipher big data, it is critical to fully understand the costs and benefits of collecting and analyzing
the data. Analyzing data can give businesses a glimpse into their market and enable them to make
informed business decisions. For example, aggregating social media data can help an organization
locate social influencers to drive market awareness. Data can also be used to conduct cluster
analysis and data mining to enhance marketing, sales, and business growth strategies. The key is to
ensure that the data you are collecting can be turned into value for your organization, as quickly and
as cost-effectively as possible.

d. Variety
Variety refers to the different types of data in use. As the internet and technology evolve, data can
quickly become obsolete. Much of today’s data is not easily categorized into tables or labels. As
social media usage increases, more and more data falls into the broad category of social sharing
content, which can include blog posts, social media updates, social media profiles, pictures, videos,
audio files, and so on.

e. Veracity
Veracity refers to the quality or trustworthiness of the data you collect. When it comes to big data,
quality is always preferred over quantity. To focus on quality, it is important to set criteria for what
types of data you collect and from which sources. How often you will need fresh data is also worth
considering when determining which data sources to pursue.
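A veracity check of the kind described can be sketched as a validation filter applied before data is trusted. The field names, the allowed sources, and the age bounds are illustrative assumptions, not rules from the text.

```python
# A minimal sketch of a data-quality filter: accept a record only if
# required fields are present and values are sane. All rules here are
# illustrative assumptions.
def is_trustworthy(record):
    has_id = bool(record.get("id"))
    has_source = record.get("source") in {"crm", "web", "survey"}
    age_ok = isinstance(record.get("age"), int) and 0 <= record["age"] <= 120
    return has_id and has_source and age_ok

records = [
    {"id": "r1", "source": "crm", "age": 34},
    {"id": "", "source": "web", "age": 29},        # rejected: missing id
    {"id": "r3", "source": "scraped", "age": 41},  # rejected: untrusted source
]
clean = [r for r in records if is_trustworthy(r)]
# len(clean) == 1
```

Filtering at ingestion like this is one simple way to keep quality ahead of quantity.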

Challenges of Big data

Advantages and applications of big data analytics are being realized in various sectors. The development of
distributed file systems (e.g., HDFS), cloud computing (e.g., Amazon EC2), in-memory cluster computing
(e.g., Spark), parallel computing frameworks (e.g., Pig), the emergence of NoSQL frameworks, and
advances in machine learning algorithms (e.g., support vector machines, deep learning, auto-encoders,
random forests) have made big data processing a reality. Despite the growth of these technologies and
algorithms, a few limitations remain, as follows:

1. Scalability and Storage Issues: Data is growing much faster than existing processing systems can
handle, and current storage systems cannot keep pace. There is a need to develop processing systems that
cater not only to today's needs but also to future ones.

2. Timeliness of Analysis: The value of data decreases over time. Most applications, such as fraud detection
in telecom, insurance, and banking, require real-time or near-real-time analysis of transactional data.

3. Representation of Heterogeneous Data: Data obtained from various sources is heterogeneous in nature.
Unstructured data such as images, videos, and social media content cannot be stored and processed using
traditional tools like SQL. Smartphones now record and share images, audio, and video at a rapidly
increasing rate, yet efficient techniques for storing and processing these media are still lacking.

4. Data Analytics Systems: Traditional RDBMSs are suitable only for structured data, and they lack
scalability and expandability. Non-relational databases are used for processing unstructured data, but they
have performance problems of their own. There is a need to design systems that combine the benefits of
both relational and non-relational databases to ensure flexibility.

5. Lack of Talent Pool: With the increase in the amount of (structured and unstructured) data generated,
there is a growing need for talent. Demand for people with strong analytical skills in big data is increasing;
research suggested that by 2018 as many as 140,000 to 190,000 additional big data specialists might be
required.

6. Privacy and Security: New devices and technologies such as cloud computing provide a gateway for
accessing and storing information for analysis. This integration of IT architectures poses greater risks to
data security and intellectual property, and access to personal information such as buying preferences and
call detail records raises privacy concerns.

7. Not Always Better Data: Social media mining has attracted many researchers, and Twitter has become a
popular data source. However, Twitter users do not represent the global population, and because tweets
containing references to pornography and spam are filtered out, topical frequencies derived from the
remaining data can be inaccurate.

8. Out of Context: Data reduction is one of the common ways to fit data into a mathematical model, but
retaining context during data abstraction is critical; data taken out of context loses meaning and value.

9. Digital Divide: Gaining access to big data is one of the most important limitations. Data companies and
social media companies hold large social data sets, and a small number of companies decide who can
access that data and to what extent; some sell access rights for high fees, while others offer portions of
their data sets to researchers.

10. Data Errors: With the growth of information technology, huge amounts of data are generated, and cloud
computing makes it easy to store and retrieve them. However, large data sets drawn from internet sources
are prone to errors and losses, and are therefore unreliable.

Unit 1: Big Data Analytics

Business Intelligence

Business intelligence (BI) is a technology-driven process for analyzing data and presenting actionable
information to help executives, managers and other corporate end users make informed business decisions.
It combines a broad set of data analysis applications, including ad hoc analytics and querying, enterprise
reporting, online analytical processing (OLAP), mobile BI, real-time BI, operational BI, cloud and software-
as-a-service BI, open source BI, collaborative BI, and location intelligence.

Benefits of business intelligence:

BI systems accelerate and improve decision-making, optimize internal business processes, increase
operational efficiency, drive new revenues, and provide a competitive advantage over business rivals. They
can also help companies identify market trends and spot business problems that need to be addressed.

Data Science

Data science is the study of where information comes from, what it represents and how it can be turned into
a valuable resource in the creation of business and IT strategies. Mining large amounts of structured and
unstructured data to identify patterns can help an organization rein in costs, increase efficiencies, recognize
new market opportunities and increase the organization's competitive advantage.
The data science field employs mathematics, statistics and computer science disciplines, and incorporates
techniques like machine learning, cluster analysis, data mining and visualization.

Applications of Data Science:


Data science is used in marketing, finance, human resources, health care, government policy, and every
other industry where data is generated. Using data science, marketing departments decide which products
are best suited for up-selling and cross-selling, based on behavioral data from customers. In addition,
questions such as predicting a customer's wallet share, which customers are likely to churn, and which
customers should be pitched a high-value product can be answered with data science. Data-based
algorithms are also used at Netflix to create personalized recommendations based on a user's viewing
history. Shipment companies like DHL, FedEx, and UPS use data science to find the best delivery routes
and times, as well as the best modes of transport for their shipments. In finance (credit risk, fraud) and
human resources (identifying which employees are most likely to leave, evaluating employee performance,
deciding employee bonuses), similar tasks are accomplished using data science.

Data Analytics

Data analytics (DA) is the process of examining data sets in order to draw conclusions about the information
they contain. Data analytics technologies and techniques are widely used in commercial industries to enable
organizations to make more-informed business decisions and by scientists and researchers to verify or
disprove scientific models, theories and hypotheses.

Data analytics can also be separated into quantitative data analysis and qualitative data analysis. The former
involves analysis of numerical data with quantifiable variables that can be compared or measured
statistically. The qualitative approach is more interpretive: it focuses on understanding the content of
non-numerical data such as text, images, audio, and video, including common phrases, themes, and points
of view.
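The two approaches can be contrasted with a small sketch using only the standard library. The sales figures and customer comments are invented examples; phrase counting is a deliberately simple stand-in for real qualitative analysis.

```python
import statistics
from collections import Counter

# Quantitative analysis: numerical variables compared statistically.
daily_sales = [120, 135, 128, 150, 142]  # hypothetical daily totals
mean_sales = statistics.mean(daily_sales)
stdev_sales = statistics.stdev(daily_sales)

# Qualitative analysis, sketched as counting common words in
# non-numerical data (here, free-text customer comments).
comments = ["great battery life", "battery drains fast", "great screen"]
word_counts = Counter(w for c in comments for w in c.split())
# word_counts["battery"] == 2
```

The numeric branch yields comparable statistics (mean, standard deviation); the text branch surfaces recurring themes ("battery", "great") rather than measurements.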

Business use of Data Analytics:


Banks and credit card companies analyze withdrawal and spending patterns to prevent fraud and identity
theft. E-commerce companies and marketing services providers do clickstream analysis to identify website
visitors who are more likely to buy a particular product or service based on navigation and page-viewing
patterns. Mobile network operators examine customer data to forecast churn so they can take steps to
prevent defections to business rivals; to boost customer relationship management efforts, they and other
companies also engage in CRM analytics to segment customers for marketing campaigns and equip call
center workers with up-to-date information about callers. Healthcare organizations mine patient data to
evaluate the effectiveness of treatments for cancer and other diseases.
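The clickstream analysis mentioned above can be sketched as computing, for each entry page, how often a session reaches a purchase step. The sessions and page names are invented, and "reaching checkout" is an illustrative proxy for a purchase.

```python
from collections import Counter

# Hypothetical sessions: each is the ordered list of pages one visitor
# viewed. Page names are illustrative.
sessions = [
    ["home", "product", "checkout"],
    ["home", "blog"],
    ["search", "product", "checkout"],
    ["home", "product"],
]

def conversion_by_entry(streams):
    """For each entry page, compute the fraction of sessions that
    reached the checkout page -- a simple navigation-pattern signal."""
    visits, converts = Counter(), Counter()
    for s in streams:
        visits[s[0]] += 1
        if "checkout" in s:
            converts[s[0]] += 1
    return {page: converts[page] / visits[page] for page in visits}

rates = conversion_by_entry(sessions)
# rates["search"] == 1.0; rates["home"] == 1/3
```

Signals like these are what let marketers target visitors who are more likely to buy, as described in the paragraph above.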

Big Data Analytics

Big data analytics is the process of examining large and varied data sets -- i.e., big data -- to uncover hidden
patterns, unknown correlations, market trends, customer preferences, and other useful information that can
help organizations make more-informed business decisions.
Why is big data analytics important?

Big data analytics helps organizations harness their data and use it to identify new opportunities. That, in
turn, leads to smarter business moves, more efficient operations, higher profits, and happier customers. Big
data delivers value in the following ways:

1. Cost reduction - Big data technologies such as Hadoop and cloud-based analytics bring significant cost
advantages when it comes to storing large amounts of data – plus they can identify more efficient ways of
doing business.
2. Faster, better decision making - With the speed of Hadoop and in-memory analytics, combined with the
ability to analyze new sources of data, businesses are able to analyze information immediately – and make
decisions based on what they’ve learned.
3. New products and services - With the ability to gauge customer needs and satisfaction through analytics
comes the power to give customers what they want. With big data analytics, more companies are creating
new products to meet customers’ needs.

Need For Big Data Analytics

Storage and retrieval of vast amounts of structured as well as unstructured data at an acceptable time lag is
a challenge. The limitations of traditional storage techniques in handling and processing such volumes led
to the emergence of the term Big Data. Though big data has gained attention due to the emergence of the
Internet, the two should not be equated: big data goes beyond the Internet, even though the Web makes it
easier to collect and share knowledge as well as data in raw form. Big Data is about how these data can be
stored, processed, and comprehended so that they can be used for predicting future courses of action with
great precision and an acceptable time delay.

Marketers focus on target marketing, insurance providers focus on personalized insurance for their
customers, and healthcare providers focus on quality, low-cost treatment for patients. Despite
advancements in data storage, collection, analysis, and algorithms for predicting human behavior, it is
important to understand the underlying driving and regulating factors (market, law, social norms, and
architecture), which can help in developing robust models that handle big data and still yield high prediction
accuracy.

The current and emerging focus of big data analytics is to apply traditional techniques such as rule-based
systems, pattern mining, decision trees, and other data mining methods to develop business rules
efficiently, even on large data sets. This can be achieved either by developing algorithms that use
distributed data storage and in-memory computation, or by using cluster computing for parallel
computation. Earlier, these processes were carried out using grid computing, which has since been
overtaken by cloud computing.
