Clickstream Data

Commercial Analytics of Clickstream Data using Hadoop
June 2014
Submitted by:
Kartik Gupta
201100048
M.C.A
Thapar University
Submitted to:
School of Mathematics and
Computer Application
Department,
Thapar University,
Patiala.
Outline
Overview
Big Data
Hadoop
Major Steps
Results and Analysis
Conclusion and Future Scope
Overview
This project produces an analytic report on the behavior and location of website visitors using Hadoop.
MapReduce is implemented to refine and sort the raw data.
Searching is done by country, IP address, postal code, and category.
Hadoop accepts unstructured, structured, and semi-structured data and processes it as key-value pairs; MapReduce reduces the values for each key to a single value, and the pairs are serialized in a binary format.
The MapReduce framework is used for parallel implementation.
Big Data
Big Data is a term used to describe large collections of data that may be unstructured and that grow so large and so quickly that they are difficult to manage with regular database or statistical tools.
The 3 Vs of Big Data: volume, velocity, and variety
Hadoop
Open source project started by Doug Cutting

A platform to manage Big Data
Helps in Distributed computing
Runs on Commodity Hardware
Data storage (HDFS)
Runs on commodity hardware (usually Linux)

Horizontally scalable
Processing (MapReduce)
Parallelized (scalable) processing

Fault Tolerant
CORE PARTS OF HADOOP
Hadoop Distributed File System(HDFS)

Hadoop Distributed File System (HDFS) is a
Java-based file system that provides scalable
and reliable data storage that is designed to
span large clusters of commodity servers.
Some specific features ensure that Hadoop clusters are highly functional:
Rack Awareness
Minimal Data Motion
Utilities
Rollback
Highly Operable
How HDFS works
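As a rough illustration of the storage idea, the sketch below splits a file into fixed-size blocks and assigns each block to several nodes. It is only a simplified model: the function names, the node names, the 64 MB block size, and the replication factor of 3 are values assumed for the example (both settings are configurable in HDFS), and the round-robin placement is a stand-in for the real rack-aware policy.

```python
def split_into_blocks(file_size_mb, block_size_mb=64):
    """Return the sizes (in MB) of the blocks a file would be split into."""
    full, rest = divmod(file_size_mb, block_size_mb)
    return [block_size_mb] * full + ([rest] if rest else [])

def place_replicas(num_blocks, nodes, replication=3):
    """Assign each block to `replication` distinct nodes, round-robin.

    Assumes replication <= len(nodes); real HDFS also considers racks.
    """
    placement = {}
    for block in range(num_blocks):
        placement[block] = [nodes[(block + r) % len(nodes)]
                            for r in range(replication)]
    return placement

# Hypothetical 200 MB log file on a hypothetical four-node cluster.
blocks = split_into_blocks(200)
placement = place_replicas(len(blocks), ["node1", "node2", "node3", "node4"])
for block, holders in placement.items():
    print(block, blocks[block], holders)
```

Because every block lives on several nodes, losing one node does not lose data, which is what lets the cluster run on commodity hardware.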
MapReduce
MapReduce is a programming model and an
associated implementation for processing large
data sets.
MapReduce usually splits the input data set into independent chunks, which are processed in a completely parallel manner.
This allows programmers without any experience with parallel and distributed systems to easily utilize the resources of a large distributed system.
The run-time system takes care of scheduling tasks, monitoring them, and re-executing failed tasks.
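The programming model can be sketched in a few lines of Python. This is a single-process illustration, not Hadoop's API: the names `map_fn`, `reduce_fn`, and `run_mapreduce` and the sample records are assumptions made for the example.

```python
from collections import defaultdict

def map_fn(record):
    """Map phase: emit a (country, 1) pair for one (country, url) record."""
    country, _url = record
    yield country, 1

def reduce_fn(key, values):
    """Reduce phase: collapse all values for one key into a single value."""
    return key, sum(values)

def run_mapreduce(chunks):
    """Run map over independent chunks, group by key, then reduce."""
    grouped = defaultdict(list)
    for chunk in chunks:                 # on a cluster, chunks run in parallel
        for record in chunk:
            for key, value in map_fn(record):
                grouped[key].append(value)   # shuffle: group values by key
    return dict(reduce_fn(k, v) for k, v in grouped.items())

# Hypothetical input, already split into two independent chunks.
chunks = [
    [("usa", "/shoes"), ("ind", "/home")],
    [("usa", "/cart"), ("usa", "/shoes")],
]
counts = run_mapreduce(chunks)
print(counts)
```

The programmer writes only `map_fn` and `reduce_fn`; splitting, grouping, scheduling, and retries are the framework's job.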
Execution flow in MapReduce
1. The MapReduce program that has been written tells the JobClient to run a MapReduce job.
2. The JobClient sends a message to the JobTracker, which produces a unique ID for the job.
3. The JobClient copies the job resources, such as the JAR file, to the distributed file system.
4. Once the resources are in the distributed file system, the JobClient can tell the JobTracker to start the job.
5. The JobTracker does its own initialization for the job. It retrieves the input splits from the distributed file system.
6. Now that the JobTracker has work for the TaskTrackers, it returns a map task or reduce task in response to each heartbeat.
7. The TaskTrackers need to obtain the code to execute, so they get it from the shared file system.
8. The TaskTracker now runs the task.
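The hand-off between the JobClient, the JobTracker, and the TaskTrackers can be mimicked with a toy simulation. The class and method names below are hypothetical and do not correspond to Hadoop's real classes; the sketch only assumes one map task per input split and shows tasks being handed out in response to heartbeats.

```python
from collections import deque
from itertools import count

class JobTracker:
    """Toy stand-in for the JobTracker: hands out job IDs and tasks."""

    def __init__(self):
        self._ids = count(1)
        self.pending = deque()

    def new_job_id(self):
        return f"job_{next(self._ids):04d}"

    def submit(self, job_id, input_splits):
        # One map task per input split, queued until a TaskTracker asks.
        for i, split in enumerate(input_splits):
            self.pending.append((job_id, f"map_{i}", split))

    def heartbeat(self, tracker_name):
        # The heartbeat response carries a task if one is waiting, else None.
        return self.pending.popleft() if self.pending else None

tracker = JobTracker()
job_id = tracker.new_job_id()                    # step 2: unique job ID
tracker.submit(job_id, ["split-a", "split-b"])   # steps 3-5: resources, splits

completed = []
for name in ["tt1", "tt2", "tt1"]:               # steps 6-8: heartbeats
    task = tracker.heartbeat(name)
    if task is not None:
        completed.append((name, task[1]))        # the TaskTracker runs the task
print(job_id, completed)
```

The third heartbeat receives no task because both map tasks have already been assigned.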
OTHER TECHNOLOGICAL TERMS
Clickstream Data
Clickstream data is an information trail a user leaves
behind while visiting a website. It is typically captured in
semi-structured website log files.
Potential Uses of Clickstream Data
What is the most efficient path for a site visitor to research
a product, and then buy it?
What products do visitors tend to buy together, and what
are they most likely to buy in the future?
Where should I spend resources on fixing or enhancing the
user experience on my website?
We focus on the path optimization use case. Specifically: how can we improve our website to reduce bounce rates and improve conversion?
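To make the two metrics concrete, here is a minimal sketch. It assumes a bounce is a session with exactly one page view and a conversion is a session that reaches a goal page; the session data and the `/checkout` goal URL are hypothetical.

```python
def bounce_rate(sessions):
    """Share of sessions that viewed exactly one page."""
    single = sum(1 for pages in sessions.values() if len(pages) == 1)
    return single / len(sessions)

def conversion_rate(sessions, goal="/checkout"):
    """Share of sessions that reached the goal page."""
    converted = sum(1 for pages in sessions.values() if goal in pages)
    return converted / len(sessions)

# Hypothetical sessions: session ID -> ordered list of pages visited.
sessions = {
    "s1": ["/home"],
    "s2": ["/home", "/shoes", "/checkout"],
    "s3": ["/shoes"],
    "s4": ["/home", "/shoes"],
}
bounce = bounce_rate(sessions)
conversion = conversion_rate(sessions)
print(bounce, conversion)
```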
STEP I
Upload the Acme website log dataset, which contains about 4 million rows of data representing five days of clickstream data.
STEP II
Represent the dataset in its unstructured format, i.e. timestamp, registered user SWID, IP address, geocoded IP address, URL.
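A raw log line with these five fields could be parsed as sketched below. The tab delimiter, the field order, the timestamp format, and the sample line are all assumptions made for illustration; the actual Acme log layout may differ.

```python
from datetime import datetime
from typing import NamedTuple

class Click(NamedTuple):
    timestamp: datetime
    swid: str   # registered user ID
    ip: str
    geo: str    # geocoded IP address
    url: str

def parse_line(line):
    """Parse one tab-delimited log line into a Click (assumed layout)."""
    ts, swid, ip, geo, url = line.rstrip("\n").split("\t")
    return Click(datetime.strptime(ts, "%Y-%m-%d %H:%M:%S"), swid, ip, geo, url)

# Hypothetical log line, not taken from the real dataset.
sample = "2012-03-15 10:30:00\tuser-001\t192.0.2.10\tusa,tx\thttp://www.example.com/shoes\n"
click = parse_line(sample)
print(click)
```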
STEP III
Represent the users data from the unstructured loaded dataset.
STEP IV
Represent the products category-wise from the dataset.
STEP V
Show the refined dataset of the Acme log files.
STEP VI
Combine all the tables, i.e. Acme log, products, users.
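Combining the three tables amounts to a join. The sketch below assumes that log rows link to users through the SWID and to products through the URL; these join keys, the column names, and the sample rows are hypothetical. Log rows without a matching user or product are kept, as in a left outer join.

```python
def join_tables(logs, users, products):
    """Left-join each log row with its user and product records."""
    joined = []
    for log in logs:
        user = users.get(log["swid"], {})        # assumed key: SWID
        product = products.get(log["url"], {})   # assumed key: URL
        joined.append({**log, **user, **product})
    return joined

# Hypothetical sample tables.
logs = [
    {"swid": "u1", "ip": "192.0.2.10", "url": "/shoes/1"},
    {"swid": "u2", "ip": "192.0.2.11", "url": "/home"},
]
users = {"u1": {"gender": "F", "age": 30}}
products = {"/shoes/1": {"category": "shoes"}}

rows = join_tables(logs, users, products)
for row in rows:
    print(row)
```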
Configuration of Hadoop
Results and Analysis
Count of visitors from any country
IP addresses retrieved, with the state of each visitor
Number of IPs accessing this category at a time
Initial stage of mapping and reduction
Categories accessed, by total number of IPs
Shoes category by state, by total number of IPs
Details of IPs accessed by visitors, by gender
Number of females who accessed this page
Total number of IP addresses that accessed a particular webpage
Sum of the ages of all the visitors
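The kinds of aggregates listed above can be reproduced on a small scale as follows. This is a sketch over hypothetical joined rows, not the project's actual queries, and it assumes that one IP address stands for one visitor.

```python
from collections import Counter

# Hypothetical joined rows (log + user + product columns).
rows = [
    {"ip": "192.0.2.10", "country": "usa", "category": "shoes", "gender": "F", "age": 30},
    {"ip": "192.0.2.11", "country": "usa", "category": "shoes", "gender": "M", "age": 41},
    {"ip": "192.0.2.12", "country": "ind", "category": "clothing", "gender": "F", "age": 25},
    {"ip": "192.0.2.10", "country": "usa", "category": "clothing", "gender": "F", "age": 30},
]

# Visitors per country: count distinct IPs rather than raw rows.
distinct_pairs = {(r["country"], r["ip"]) for r in rows}
visitors_by_country = Counter(country for country, _ip in distinct_pairs)

# Number of distinct IPs that accessed each category.
ips_by_category = {}
for r in rows:
    ips_by_category.setdefault(r["category"], set()).add(r["ip"])
ip_counts = {category: len(ips) for category, ips in ips_by_category.items()}

# Female visitors and sum of ages, each visitor counted once.
visitors = {r["ip"]: r for r in rows}
female_count = sum(1 for v in visitors.values() if v["gender"] == "F")
age_sum = sum(v["age"] for v in visitors.values())

print(visitors_by_country, ip_counts, female_count, age_sum)
```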
Conclusion
The amount of clickstream data is growing rapidly, and with it the demand for accessing information over the web has increased significantly.
It is therefore important to analyze the behavior and location of visitors.
It is inefficient to process large data sets using traditional sequential methods.
Therefore, MapReduce is used for processing large data sets.
Future Scope
Clickstream information plays an important role in a wide variety of applications, such as decision support systems and profile-based marketing.
Location search is used by various industries, such as telecom and e-commerce, and in event detection.
The nearest-location method can be fused with other methods to support better decision making.
A trade-off would then be made between distance and the other fused factors.
Thank you !!!
