1. Introduction
Big data, a term coined by Roger
Magoulas of O'Reilly Media in 2005 [1],
refers to massive data sets with a large,
varied and complex structure that are
difficult to store, analyze and visualize
in order to extract meaningful results.
Big data analytics is the process of examining
massive amounts of data to reveal
hidden patterns and correlations.
Big data is generated by a wide range of
fields and sources, such as astronomy,
atmospheric science, genomics, biogeochemistry,
biological science and research, life sciences,
medical records, scientific research, government,
natural disaster and resource management,
the private sector, military surveillance,
financial services, retail, social
networks, web logs, text, documents,
photography, audio, video, click streams,
search indexing, call detail records, POS
information, RFID, mobile phones, sensor
networks and telecommunications [2].
2. Overview of Big Data
2.1 Benefits
The benefits of big data in various
fields include: better-targeted
marketing, more direct business insights,
client-based segmentation, recognition of
sales and market opportunities, automated
decision making, definitions of customer
behaviors, greater return on investments,
quantification of risks and market trending,
comprehension of business change, better
planning and forecasting, identification of
Manufacturing:
(1) Research and development and
product design
(2) Product lifecycle management.
(3) Design to value.
(4) Open innovation
(5) Supply chain
(6) Production: Digital factory,
Sensor-driven operation.
Personal location data: Smart
routing, geo-targeted advertising or
Social network analysis:
Understanding user intelligence for more
targeted advertising, marketing campaigns
and capacity planning, customer behavior
and buying patterns, sentiment analytics.
(5) Energy management: With the
increase of data volume and analytical
demands, the processing, storage, and
transmission of big data will inevitably
consume more and more electric energy.
(6) Expandability and scalability:
The analytical system of big data must
support present and future datasets. The
analytical algorithm must be able to process
increasingly expanding and more complex
datasets.
Apart from these, other challenges
faced in implementing big data
analytics are [8]: security concerns,
capital and operational expenses, increased
network bottlenecks, a shortage of skilled data
science professionals, unmanageable data
rates, data replication capabilities, a lack of
compression capabilities, greater network
latency and insufficient CPU power, a lack of
up-to-date database software for analytics and
of fast processing times, and the inability to
make big data usable for end users.
2.4 Components of Big Data
It has basically three main components [6]:
Variety of Sources:
Technique: A/B testing [3]
Method: A control group is compared with a variety of test groups in order to determine what changes will improve a given objective variable.
Example: Determining what copy text, layouts, images, or colors will improve conversion rates on an e-commerce web site.

Technique: Association rule learning [5]
Technique: Data fusion and data integration [6]
Technique: Data mining [7]
Technique: Natural language processing [8]
Technique: Predictive modeling [9]
Technique: Spatial analysis [10]
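
As a rough illustration of the A/B testing technique listed above, the following sketch compares the conversion rate of a control group with that of a single test group using a two-proportion z-test. The choice of test, the function name and the visitor counts are all assumptions made for this example; the source does not prescribe a particular statistic.

```python
import math


def two_proportion_z_test(conv_control, n_control, conv_test, n_test):
    """Compare conversion rates of a control group and one test group.

    Returns (z, p_value) for a two-sided two-proportion z-test.
    """
    rate_control = conv_control / n_control
    rate_test = conv_test / n_test
    # Pooled rate under the null hypothesis that both groups convert equally.
    pooled = (conv_control + conv_test) / (n_control + n_test)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_control + 1 / n_test))
    z = (rate_test - rate_control) / std_err
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical counts: 1000 visitors saw each page layout.
z, p_value = two_proportion_z_test(100, 1000, 150, 1000)
print(f"z = {z:.2f}, p = {p_value:.4f}")
```

A small p-value suggests that the difference in conversion rate is unlikely to be due to chance alone; with several test groups, each would be compared against the control in the same way.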
Technology: 1. Bigtable [11]
Overview: Bigtable is a distributed storage system for managing structured data at Google. It can reliably scale to petabytes of data and thousands of machines.
Application: More than 60 Google products, such as Google Earth, Google Finance, Google web indexing, Orkut, and Google Analytics.

Technology: 2. Cassandra [12]
Application: Accenture, eBay, Netflix, GoDaddy, Instagram, Reddit, Yahoo! Japan, NASA

Technology: 3. Hadoop [13]
Application: Amazon, AOL, Facebook, IBM, New York Times, Yahoo!, Microsoft, Google
4. Association rule
An association is a rule of the
form LHS → RHS. The goal of association
rule discovery is to find associations among
items from a set of transactions, each of
which contains a set of items. [5] Generally
the algorithm finds a subset of association
rules that satisfy certain constraints.
(1) Minimum support: - The support
of a rule is defined as the support of the
Most association rule algorithms
generate association rules in two steps:
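The support and confidence of a rule A → B are defined as:

\[ \mathrm{Support}(A \rightarrow B) = \frac{\text{Number of tuples containing both } A \text{ and } B}{\text{Total number of tuples}} \]

\[ \mathrm{Confidence}(A \rightarrow B) = \frac{\text{Number of tuples containing both } A \text{ and } B}{\text{Number of tuples with } A} \]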
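
To make the two measures concrete, here is a minimal sketch that computes the support of an item-set (the fraction of all tuples that contain it) and the confidence of a rule A → B (the fraction of tuples with A that also contain B). The function names and the four sample baskets are hypothetical and are not taken from the paper.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    matching = sum(1 for t in transactions if itemset <= set(t))
    return matching / len(transactions)


def confidence(transactions, lhs, rhs):
    """Support of (lhs and rhs together) divided by the support of lhs.

    Assumes lhs occurs in at least one transaction.
    """
    both = support(transactions, set(lhs) | set(rhs))
    return both / support(transactions, lhs)


# Hypothetical baskets used only for illustration.
baskets = [{"A", "B"}, {"A"}, {"B"}, {"A", "B", "C"}]
rule_support = support(baskets, {"A", "B"})
rule_confidence = confidence(baskets, {"A"}, {"B"})
print(rule_support, rule_confidence)
```

A rule is usually kept only if both values reach user-chosen minimum thresholds.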
4.2.2 FP Algorithm
It generates all frequent item-sets
satisfying a given minimum support by
growing a frequent pattern tree structure that
stores compressed information about the
frequent patterns. In this way, FP-growth
can avoid repeated database scans and also
avoid the generation of a large number of
candidate item-sets. FP-growth takes
transactional data in the form of one row for
each single complete transaction.
Implementations of FP-growth only generate
the frequent item-sets, and not the
association rules [16]. The mining task as
well as the database are decomposed using a
divide-and-conquer approach, and finally it
uses a pattern fragment growth method to avoid the
costly process of candidate generation and
testing used by the Apriori algorithm [15].
A frequent pattern tree is a structure
consisting of [17]
(1) One root labeled as null,
(2) A set of item-prefix subtrees as
the children of the root,
(3) A frequent-item-header table.
Item-prefix subtrees: Each node in
the item-prefix subtree consists of three
fields: item-name, count, and node-link.
(1) Item-name: Registers which
item this node represents.
(2) Count: Registers the number
of transactions represented by the
portion of the path reaching this node.
(3) Node-link: Links to the next
node in the FP-tree carrying the same
item-name, or null if there is none.
Frequent-item-header table: Each
entry consists of two fields:
(1) Item-name.
(2) Head of node-link (a pointer
pointing to the first node in the FP-tree
carrying the item-name).
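
The structure described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation evaluated in the paper: the names `FPNode`, `build_fp_tree` and `fp_growth` are hypothetical, ties in item frequency are broken alphabetically, and the ten grocery transactions are those of the paper's worked Apriori example with a minimum support count of 3.

```python
from collections import defaultdict


class FPNode:
    def __init__(self, item, parent):
        self.item = item        # item-name (None for the root)
        self.count = 0          # transactions sharing the path to this node
        self.node_link = None   # next node carrying the same item-name
        self.parent = parent
        self.children = {}


def build_fp_tree(weighted_transactions, min_count):
    """Return (root, header table, counts of the frequent items)."""
    totals = defaultdict(int)
    for items, weight in weighted_transactions:
        for item in set(items):
            totals[item] += weight
    frequent = {i: c for i, c in totals.items() if c >= min_count}
    root, header, tails = FPNode(None, None), {}, {}
    for items, weight in weighted_transactions:
        # Keep frequent items only, most frequent first (ties: alphabetical).
        ordered = sorted((i for i in set(items) if i in frequent),
                         key=lambda i: (-frequent[i], i))
        node = root
        for item in ordered:
            child = node.children.get(item)
            if child is None:
                child = FPNode(item, node)
                node.children[item] = child
                if item in header:
                    tails[item].node_link = child  # extend node-link chain
                else:
                    header[item] = child           # head of node-link
                tails[item] = child
            child.count += weight
            node = child
    return root, header, frequent


def fp_growth(weighted_transactions, min_count, suffix=frozenset()):
    """Return {frequent item-set: support count} without candidate generation."""
    _, header, frequent = build_fp_tree(weighted_transactions, min_count)
    found = {}
    for item, node in header.items():
        itemset = suffix | {item}
        found[itemset] = frequent[item]
        # Conditional pattern base: prefix path of every node for this item.
        base = []
        while node is not None:
            path, ancestor = [], node.parent
            while ancestor.item is not None:
                path.append(ancestor.item)
                ancestor = ancestor.parent
            if path:
                base.append((path, node.count))
            node = node.node_link
        found.update(fp_growth(base, min_count, itemset))
    return found


transactions = [
    ["Bread", "Butter", "Milk"], ["Ice-cream", "Bread", "Butter"],
    ["Bread", "Butter", "Noodles"], ["Bread", "Noodles", "Ice-cream"],
    ["Butter", "Milk", "Bread"], ["Bread", "Noodles", "Ice-cream"],
    ["Milk", "Butter", "Bread"], ["Ice-cream", "Milk", "Bread"],
    ["Butter", "Milk", "Noodles"], ["Noodles", "Butter", "Ice-cream"],
]
patterns = fp_growth([(t, 1) for t in transactions], min_count=3)
for itemset, count in sorted(patterns.items(),
                             key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), count / len(transactions))
```

Each recursive call builds a smaller conditional tree from the prefix paths reached through the node-links, which is how the divide-and-conquer decomposition avoids generating candidate item-sets.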
Problem statement / Researchers / Method applied / Outcome

Government sector.
Fraud at Consignia, the UK's Post Office group.
Researchers of King's College London [19]

[20]
Issues concerning accessibility of an urban area.
Method applied: Spatial association rule mining applied to geo-referenced U.K. census data of 1991.
Outcome: Helped in transportation planning in the area near the local Stepping Hill Hospital.

Anomaly detection and classification in breast cancer.
Method applied: In training, the Apriori algorithm was applied and association rules were extracted. The support was set to 10% and the confidence to 0%.
Outcome: The success rate of the classifier was 69.11%.

In a time duration of 1.5 hours, about 2.6% of the discovered rules were accepted and the rest were rejected.

Retail sector. [22]

Telecom sector.

Time required for training was much less than for a neural network.

Thus the total rules were reduced to about 14 rules per household from 537 rules per household.
[23]
or triples or quadruples customers are currently calling
Method applied: Treating the top-k country item set as a market basket for each account. Exploiting the temporal nature of the data by using traffic from the last month as a baseline for the current month.
Outcome: high rate of fraud calls, trends associated with adult entertainment services, that move from country to country through time.

Manufacturing sector.
VAM Drilling industries, France [24]
Setting up a system which provides results identical to human observation related to performance and dysfunctions during forging.
Outcome: Found the main dysfunction responsible for delay. Found that the generator is the cause for exceeding the maximum time in the starting phase. The third major problem was the lack of effectiveness of the metal strippers.
The master is responsible for
assigning tablets to tablet servers, detecting
the addition and expiration of tablet servers,
balancing tablet-server load, and garbage
collecting files. In addition, it handles
schema changes such as table and column
family creations and deletions. Each tablet
server manages a set of tablets. The tablet
server handles read and write requests to the
tablets that it has loaded, and also splits
tablets that have grown too large. A Bigtable
cluster stores a number of tables. Each table
consists of a set of tablets, and each tablet
contains all of the data associated with a row
range.
5.3.1 Tablet Location
Tablet locations are stored in a three-level hierarchy. The
first level is a file stored in Chubby that
contains the location of the root tablet. The
root tablet contains the locations of all of the
tablets of a special METADATA table. Each
METADATA tablet contains the location of
a set of user tablets. Secondary information
like a log of all events pertaining to each
tablet (such as when a server begins serving
it) is also stored in METADATA. This
information is helpful for debugging and
performance analysis.
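
The three-level lookup can be pictured with the following simplified sketch. It is an illustration under assumptions, not Bigtable's actual API: `TabletIndex`, `locate`, the tablet names and the server names are all hypothetical, a single user table is assumed, and each index entry is keyed only by the last row key its tablet covers.

```python
import bisect


class TabletIndex:
    """Sorted map from a tablet's last row key to a location (hypothetical)."""

    def __init__(self, entries):
        # entries: (end_row_key, location) pairs sorted by end_row_key.
        self.end_keys = [key for key, _ in entries]
        self.locations = [location for _, location in entries]

    def lookup(self, row_key):
        # The first tablet whose end key is >= row_key covers that row.
        index = bisect.bisect_left(self.end_keys, row_key)
        if index == len(self.end_keys):
            raise KeyError(row_key)
        return self.locations[index]


# Level 1: a Chubby file holding the location of the root tablet.
chubby_file = "root-tablet"

# Level 2: the root tablet lists the METADATA tablets.
# Level 3: each METADATA tablet lists user tablets (here, the serving server).
tablets = {
    "root-tablet": TabletIndex([("m", "meta-1"), ("\uffff", "meta-2")]),
    "meta-1": TabletIndex([("f", "server-a"), ("m", "server-b")]),
    "meta-2": TabletIndex([("t", "server-c"), ("\uffff", "server-d")]),
}


def locate(row_key):
    """Walk Chubby file -> root tablet -> METADATA tablet -> user tablet."""
    root = tablets[chubby_file]
    metadata_tablet = tablets[root.lookup(row_key)]
    return metadata_tablet.lookup(row_key)


for key in ["apple", "kiwi", "zebra"]:
    print(key, "->", locate(key))
```

Because every level is a sorted index over row ranges, each step is a binary search, and the number of levels stays fixed regardless of how many user tablets exist.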
Item 1       Item 2       Item 3
Bread        Butter       Milk
Ice-cream    Bread        Butter
Bread        Butter       Noodles
Bread        Noodles      Ice-cream
Butter       Milk         Bread
Bread        Noodles      Ice-cream
Milk         Butter       Bread
Ice-cream    Milk         Bread
Butter       Milk         Noodles
Noodles      Butter       Ice-cream
In the given dataset, every item occurs at least
three times and the total number of
transactions is ten, so
Minimum Support = 3/10 = 0.3
Item-set     Support
Bread        0.8
Butter       0.7
Noodles      0.5
Ice-cream    0.5
Milk         0.5
Item-sets                Support
{Bread, Butter}          0.5
{Bread, Milk}            0.4
{Bread, Noodles}         0.3
{Bread, Ice-cream}       0.4
{Butter, Milk}           0.4
{Butter, Noodles}        0.3
{Butter, Ice-cream}      0.2
{Noodles, Milk}          0.1
{Noodles, Ice-cream}     0.3
{Milk, Ice-cream}        0.1
Item-sets                        Support
{Bread, Butter, Milk}            0.3
{Bread, Ice-cream, Noodles}      0.2
{Bread, Butter, Noodles}         0.1
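
The level-wise computation behind these tables can be reproduced with a short Apriori sketch. It is a minimal illustration rather than the authors' code: the function name `apriori` is an assumption, and the minimum support of 0.3 over ten transactions is expressed as a minimum count of 3 to avoid floating-point comparisons.

```python
from itertools import combinations

transactions = [
    ["Bread", "Butter", "Milk"], ["Ice-cream", "Bread", "Butter"],
    ["Bread", "Butter", "Noodles"], ["Bread", "Noodles", "Ice-cream"],
    ["Butter", "Milk", "Bread"], ["Bread", "Noodles", "Ice-cream"],
    ["Milk", "Butter", "Bread"], ["Ice-cream", "Milk", "Bread"],
    ["Butter", "Milk", "Noodles"], ["Noodles", "Butter", "Ice-cream"],
]


def apriori(transactions, min_count):
    """Return {frequent item-set: support count}, found level by level."""
    baskets = [frozenset(t) for t in transactions]
    candidates = {frozenset([item]) for basket in baskets for item in basket}
    frequent = {}
    size = 1
    while candidates:
        # Count each candidate and keep those meeting the minimum support.
        counts = {c: sum(1 for b in baskets if c <= b) for c in candidates}
        level = {c: n for c, n in counts.items() if n >= min_count}
        frequent.update(level)
        size += 1
        # Join step: merge frequent item-sets into candidates of the next size.
        joined = {a | b for a in level for b in level if len(a | b) == size}
        # Prune step: every (size - 1)-subset of a candidate must be frequent.
        candidates = {c for c in joined
                      if all(frozenset(s) in level
                             for s in combinations(c, size - 1))}
    return frequent


frequent_itemsets = apriori(transactions, min_count=3)
for itemset, count in sorted(frequent_itemsets.items(),
                             key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), count / len(transactions))
```

The prune step is what limits the three-item candidates to the three item-sets listed above, since every other triple contains a pair whose support is below 0.3.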
Item-sets    Support
{A,AU}       0.0440
{D,S}        0.0434
{D,AJ}       0.0430
{E,J}        0.0431
{F,W}        0.0439
{Q,AG}       0.0435
{S,AH}       0.0531
{AB,AC}      0.0509
{AH,AQ}      0.0431

Item-sets    Support
{D,S,AJ}     0.0411
S. No.: 1, 2, 3, 4, 5
Item 1: A, B, A, A, A
Item 2: B, C, C, D, B
Item 3: D, D, E, C
Item 4:
S. No.
A
1
FREQ. 2
1
3
Frequent Pattern    Support Count
2-Item set
E,A                 2
E,D                 2
B,A                 2
C,A                 2
C,B                 2
D,A                 2
D,C                 2
3-Item set
E,A,D               2
Similarly, the FP-growth algorithm
was implemented on the other two databases,
and results identical to those given by the
Apriori algorithm were obtained.
REFERENCES:
1. G. Halevi, H. Moed, The evolution of big data as a
research and scientific topic: Overview of the
literature, Res. Trends(2012) 36.
2. http://en.wikipedia.org/wiki/Big_data
3. J. Manyika, M. Chui, B. Brown, J. Bughin, R. Dobbs,
C. Roxburgh and A.H. Byers, "Big data: The next
frontier for innovation, competition, and
productivity", McKinsey Global Institute, 2011.
4. Chen, Min, Shiwen Mao, and Yunhao Liu. "Big
data: A survey." Mobile Networks and Applications
19.2 (2014): 171-209
5. Agrawal, Rakesh, Tomasz Imieliński, and Arun
Swami. "Mining association rules between sets of
items in large databases." In ACM SIGMOD Record,
vol. 22, no. 2, pp. 207-216. ACM, 1993.
6. Lohr, Steve. "The age of big data." New York Times
11 (2012).
7. Rygielski, Chris, Jyun-Cheng Wang, and David C.
Yen. "Data mining techniques for customer
relationship management." Technology in society 24,
no. 4 (2002): 483-502.
8. Hennig-Thurau, Thorsten, Edward C. Malthouse,
Christian Friege, Sonja Gensler, Lara Lobschat, Arvind
Rangaswamy, and Bernd Skiera. "The impact of new
media on customer relationships." Journal of service
research 13, no. 3 (2010): 311-330.
9. Kamakura, Wagner A., Michel Wedel, Fernando
De Rosa, and Jose Afonso Mazzon. "Cross-selling
through database marketing: A mixed data factor
analyzer for data augmentation and prediction."