
Commercial Analytics of Clickstream Data using Hadoop

June 2014

Submitted by:
Kartik Gupta
201100048
M.C.A
Thapar University

Submitted to:
School of Mathematics and Computer Application Department,
Thapar University,
Patiala.

Outline
Overview
Big Data
Hadoop
Major Steps
Results and Analysis
Conclusion and Future Scope

Overview
This project produces an analytic report on the behavior and location of website visitors using Hadoop.
MapReduce is implemented to refine and sort the raw data.
Searching is done by country, IP address, postal code, and category.
Hadoop is a tool that converts unstructured, structured, and semi-structured data into key-value pairs, which are then combined into a single value represented in binary format.
The MapReduce framework is used for parallel implementation.

Big Data
Big Data is a term used to describe large collections of data that may be unstructured and that grow so large and so quickly that they are difficult to manage with regular database or statistical tools.
The 3 Vs of Big Data: volume, velocity, and variety

Hadoop

Open source project started by Doug Cutting
A platform to manage Big Data
Helps in distributed computing
Runs on commodity hardware

Data storage (HDFS)
Runs on commodity hardware (usually Linux)
Horizontally scalable

Processing (MapReduce)
Parallelized (scalable) processing
Fault tolerant

CORE PARTS OF HADOOP

Hadoop Distributed File System(HDFS)


Hadoop Distributed File System (HDFS) is a Java-based file system that provides scalable and reliable data storage and is designed to span large clusters of commodity servers.
Some specific features ensure that Hadoop clusters are highly functional:
Rack awareness
Minimal data motion
Utilities
Rollback
Highly operable

How HDFS works

MapReduce
MapReduce is a programming model and an associated implementation for processing large data sets.
MapReduce usually splits the input data set into independent chunks, which are processed in a completely parallel manner.
This allows programmers without any experience of parallel and distributed systems to easily utilize the resources of a large distributed system.
The run-time system takes care of scheduling tasks, monitoring them, and re-executing failed tasks.
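
The map, shuffle, and reduce stages described above can be illustrated with a small in-memory sketch. It is not Hadoop code: the function names, the `country` field, and the sample records are assumptions made for illustration, and everything runs in a single process rather than across a cluster.

```python
from collections import defaultdict
from itertools import chain


def map_phase(record):
    """Emit one (key, 1) pair per record; the key here is the country field."""
    yield (record["country"], 1)


def shuffle(pairs):
    """Group intermediate values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Collapse all values for one key into a single total."""
    return (key, sum(values))


def run_job(chunks):
    """Run map over every record of every chunk, then shuffle and reduce."""
    mapped = chain.from_iterable(
        map_phase(record) for chunk in chunks for record in chunk
    )
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


# Two independent chunks of made-up input records.
chunks = [
    [{"country": "IN"}, {"country": "US"}],
    [{"country": "IN"}],
]
counts = run_job(chunks)
```

In a real cluster each chunk would be mapped on a different node; here the chunks are simply iterated in turn to show the data flow.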

Execution flow in MapReduce

1. The MapReduce program that has been written tells the JobClient to run a MapReduce job.
2. The JobClient sends a message to the JobTracker, which produces a unique ID for the job.
3. The JobClient copies the job resources, such as the JAR file, to the distributed filesystem.
4. Once the resources are in the distributed filesystem, the JobClient can tell the JobTracker to start the job.
5. The JobTracker does its own initialization for the job. It retrieves the input splits from the distributed filesystem.
6. Now that the JobTracker has work for the TaskTrackers, it returns a map task or reduce task in response to each heartbeat.
7. The TaskTrackers need to obtain the code to execute, so they get it from the shared filesystem.
8. The TaskTracker then runs its assigned task.
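
The heartbeat-driven hand-off in steps 2, 5, and 6 can be sketched as a toy simulation. `ToyJobTracker` and its methods are hypothetical names invented for this illustration and are not the Hadoop API; the sketch only assumes one map task per input split.

```python
from collections import deque


class ToyJobTracker:
    """Hypothetical stand-in for the JobTracker; not the real Hadoop class."""

    def __init__(self):
        self._next_id = 0
        self._pending = deque()

    def new_job_id(self):
        """Step 2: produce a unique ID for a job."""
        self._next_id += 1
        return f"job_{self._next_id:04d}"

    def submit(self, job_id, input_splits):
        """Step 5: initialize the job by queuing one map task per input split."""
        for index, split in enumerate(input_splits):
            self._pending.append((job_id, f"map_{index}", split))

    def heartbeat(self, tracker_name):
        """Step 6: answer a TaskTracker heartbeat with a task, or None if idle."""
        if self._pending:
            return self._pending.popleft()
        return None


job_tracker = ToyJobTracker()
job_id = job_tracker.new_job_id()
job_tracker.submit(job_id, ["split-a", "split-b"])

# Three task trackers send heartbeats; only two tasks are available.
assignments = {name: job_tracker.heartbeat(name) for name in ["tt1", "tt2", "tt3"]}
```

The third tracker receives `None` because the queue is already empty, which mirrors a heartbeat that comes back with no work attached.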

OTHER TECHNOLOGICAL TERMS
Clickstream Data
Clickstream data is the information trail a user leaves behind while visiting a website. It is typically captured in semi-structured website log files.
Potential Uses of Clickstream Data
What is the most efficient path for a site visitor to research a product, and then buy it?
What products do visitors tend to buy together, and what are they most likely to buy in the future?
Where should I spend resources on fixing or enhancing the user experience on my website?
We focus on the path optimization use case. Specifically: how can we improve our website to reduce bounce rates and improve conversion?
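
As a point of reference for the bounce-rate question, the usual definition is the share of sessions that view only a single page. The sketch below assumes page views have already been grouped into sessions; the session IDs and URLs are made up for illustration.

```python
def bounce_rate(sessions):
    """Return the fraction of sessions that contain exactly one page view."""
    if not sessions:
        return 0.0
    bounces = sum(1 for pages in sessions.values() if len(pages) == 1)
    return bounces / len(sessions)


# Hypothetical sessions: session ID -> ordered list of pages viewed.
sessions = {
    "s1": ["/home"],
    "s2": ["/home", "/shoes", "/checkout"],
    "s3": ["/shoes"],
    "s4": ["/home", "/shoes"],
}
rate = bounce_rate(sessions)
```

Two of the four sample sessions stop after one page, so `rate` is 0.5.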

STEP I

Upload the Acme website log dataset, which contains about 4 million rows of data representing five days of clickstream data.

STEP II

Represent the dataset in its unstructured format, i.e. timestamp, registered user SWID, IP address, geocoded IP address, and URL.
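
A minimal sketch of turning one raw log line into the five fields listed above. The tab delimiter, the field order, and the sample line are assumptions; the actual layout of the Acme log files is not given here.

```python
from typing import NamedTuple, Optional


class Click(NamedTuple):
    timestamp: str
    swid: str   # registered user SWID
    ip: str
    geo: str    # geocoded IP address
    url: str


def parse_line(line: str, sep: str = "\t") -> Optional[Click]:
    """Split one log line into its five fields; return None if it is malformed."""
    parts = line.rstrip("\n").split(sep)
    if len(parts) != 5:
        return None
    return Click(*parts)


# Made-up sample line, assuming tab-separated fields in the order listed.
sample = "2012-03-15 10:00:00\t{ABC-123}\t10.0.0.1\tusa,ca\thttp://www.acme.com/shoes\n"
click = parse_line(sample)
```

Returning `None` for malformed lines lets a later stage skip or count bad records instead of failing on them.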

STEP III

Represent the users data from the unstructured loaded dataset.

STEP IV

Represent the products category-wise from the dataset.

STEP V

Show the refined dataset of the Acme log files.

STEP VI

Combine all the tables, i.e. Acme log, products, and users.
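
Combining the three tables amounts to a join on the user ID and on the URL. The sketch below uses plain dictionaries; the column names (`swid`, `url`, `gender`, `age`, `category`) and all sample rows are assumptions for illustration, not the project's actual schema.

```python
# Hypothetical sample rows for the three tables.
logs = [
    {"swid": "u1", "url": "/p/1"},
    {"swid": "u2", "url": "/p/2"},
    {"swid": "u3", "url": "/p/1"},  # u3 has no matching user record
]
users = {
    "u1": {"gender": "F", "age": 30},
    "u2": {"gender": "M", "age": 41},
}
products = {"/p/1": "shoes", "/p/2": "clothing"}


def join_tables(logs, users, products):
    """Left-join each log row with its user (by swid) and product category (by url)."""
    joined = []
    for row in logs:
        user = users.get(row["swid"], {})
        joined.append({
            **row,
            "gender": user.get("gender"),
            "age": user.get("age"),
            "category": products.get(row["url"]),
        })
    return joined


combined = join_tables(logs, users, products)

# Example aggregates over the combined table.
female_shoe_views = sum(
    1 for r in combined if r["gender"] == "F" and r["category"] == "shoes"
)
total_age = sum(r["age"] for r in combined if r["age"] is not None)
```

A left join keeps log rows whose user is unknown, so counts by IP address or category are not affected by missing user records.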

Results and Analysis

Configuration of Hadoop
Count of visitors from a given country
Retrieving IP addresses and displaying the state of each visitor
Number of IP addresses accessing a category at a given time
Initial stage of mapping and reduction
Total number of IP addresses accessing each category
Total number of IP addresses accessing the shoes category, by state
Details of IP addresses used by visitors, broken down by gender
Number of female visitors who accessed this page
Total number of IP addresses that accessed a particular webpage
Sum of the ages of all visitors

Conclusion
The amount of clickstream data is growing rapidly, and with it the demand for accessing information over the web has increased significantly.
It is therefore important to analyze the behavior and location of visitors.
It is inefficient to process large data using traditional sequential methods.
MapReduce is therefore used for processing large datasets.

Future Scope
Clickstream information plays an important role in a wide variety of applications, such as decision support systems and profile-based marketing.
Location search is used by various industries, such as telecom and e-commerce, and in event detection.
The nearest-location method can be fused with any other method to support better decision making.
A trade-off would then be made between distance and the other fused factor.

Thank you !!!
