
Big Data Analytics

Lab Manual

Final year
Computer Engineering

Prof. Vishal R. Gotarane


EXPERIMENT NO. 01

Aim: To study Hadoop Ecosystem.

Practical Objectives:

After completing this experiment students will be able to:


1. Understand the Hadoop ecosystem.
2. Understand the basics of Hadoop.

Theory:

The Apache Hadoop software library is a framework that allows for the distributed
processing of large data sets across clusters of computers using simple programming
models. It is designed to scale up from single servers to thousands of machines, each
offering local computation and storage. Rather than rely on hardware to deliver high-
availability, the library itself is designed to detect and handle failures at the application
layer, so delivering a highly available service on top of a cluster of computers, each of
which may be prone to failures.

Hadoop Ecosystem:
Hadoop has gained popularity because of its ability to store, analyze and access large
amounts of data quickly and cost-effectively across clusters of commodity hardware. It
would not be wrong to say that Apache Hadoop is actually a collection of several components
rather than a single product.
Around this core, the Hadoop ecosystem includes several commercial as well as open source
products that make Hadoop more accessible and usable for non-specialists.
The following sections provide additional information on the individual components:

MapReduce:
Hadoop MapReduce is a software framework for easily writing applications that process
large amounts of data in parallel on large clusters of commodity hardware in a reliable,
fault-tolerant manner. In programming terms, a MapReduce job is built around two functions:
• The Map Task: the master node takes the input, divides it into smaller parts and
distributes them to worker nodes. Each worker node processes its own small part and
returns its answer to the master node.
• The Reduce Task: the master node combines the answers coming from all worker nodes
into a single output, which is the answer to the original distributed problem.
Generally both the input and the output are stored in a file system. The framework is
responsible for scheduling tasks, monitoring them and re-executing failed tasks.
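To see the two roles in miniature, the sketch below is a plain Java program (not part of this manual's experiments) that "maps" each line of a small in-memory dataset to words and then "reduces" them to per-word counts. Hadoop does the same thing, except that the map work is spread across worker nodes and the reduce work across reducers; a full Hadoop version of word count is shown in Experiment No. 08.

import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class MiniMapReduce {
    public static void main(String[] args) {
        List<String> lines = Arrays.asList("big data", "big clusters", "data nodes");

        // "Map" step: split every line into words (in Hadoop, each worker node
        // would do this on its own split of the input).
        // "Reduce" step: group equal words and combine their counts
        // (in Hadoop, each reducer would receive all the values for one key).
        Map<String, Long> counts = lines.stream()
                .flatMap(line -> Arrays.stream(line.split("\\s+")))
                .collect(Collectors.groupingBy(word -> word, Collectors.counting()));

        counts.forEach((word, count) -> System.out.println(word + " = " + count));
    }
}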

Hadoop Distributed File System (HDFS):


HDFS is a distributed file system that provides high-throughput access to data. When data
is pushed to HDFS, it is automatically split into multiple blocks, which are stored and
replicated across nodes, thus ensuring high availability and fault tolerance.
Note: A file consists of many blocks (large blocks of 64 MB and above).
Here are the main components of HDFS:
• NameNode: It acts as the master of the system. It maintains the namespace, i.e. the
directories and files, and manages the blocks which are present on the DataNodes.
• DataNodes: They are the slaves which are deployed on each machine and provide
the actual storage. They are responsible for serving read and write requests for the
clients.
• Secondary NameNode: It is responsible for performing periodic checkpoints. In the
event of NameNode failure, you can restart the NameNode using the checkpoint.

Hive:
Hive is part of the Hadoop ecosystem and provides an SQL-like interface to Hadoop. It is
a data warehouse system for Hadoop that facilitates easy data summarization, ad-hoc
queries, and the analysis of large datasets stored in Hadoop-compatible file systems.
It provides a mechanism to project structure onto this data and query the data using a SQL-
like language called HiveQL. Hive also allows traditional map/reduce programmers to plug
in their custom mappers and reducers when it is inconvenient or inefficient to express this
logic in HiveQL.
The main building blocks of Hive are:
1. Metastore – To store metadata about columns, partitions and the system catalogue.
2. Driver – To manage the lifecycle of a HiveQL statement.
3. Query Compiler – To compile HiveQL into a directed acyclic graph of tasks.
4. Execution Engine – To execute the tasks produced by the compiler in the proper order.
5. HiveServer – To provide a Thrift interface and a JDBC / ODBC server (a JDBC connection
sketch is shown below).
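As a small illustration of the JDBC server mentioned in point 5, the sketch below submits a HiveQL query from Java through HiveServer2. The connection URL, the credentials and the table name 'docs' are assumptions for illustration only, and the Hive JDBC driver (org.apache.hive.jdbc.HiveDriver) must be on the classpath.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQueryDemo {
    public static void main(String[] args) throws Exception {
        // Register the Hive JDBC driver (harmless if the driver auto-registers).
        Class.forName("org.apache.hive.jdbc.HiveDriver");

        // Assumed HiveServer2 endpoint and credentials; adjust for your cluster.
        Connection con = DriverManager.getConnection(
                "jdbc:hive2://localhost:10000/default", "hive", "");
        Statement stmt = con.createStatement();

        // HiveQL looks like SQL, but Hive compiles it into MapReduce jobs.
        // 'docs' is a hypothetical table with a 'word' column.
        ResultSet rs = stmt.executeQuery(
                "SELECT word, COUNT(*) AS freq FROM docs GROUP BY word");
        while (rs.next()) {
            System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
        }

        rs.close();
        stmt.close();
        con.close();
    }
}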

HBase (Hadoop DataBase):


HBase is a distributed, column-oriented database that uses HDFS for its underlying
storage. As noted earlier, HDFS follows a write-once, read-many-times pattern, but that
is not always what applications need. When real-time random read/write access to a huge
dataset is required, HBase comes into the picture: it is a column-oriented database built
and distributed on top of HDFS. A short client sketch follows the component list below.
Here are the main components of HBase:
• HBase Master: It is responsible for negotiating load balancing across all
RegionServers and maintains the state of the cluster. It is not part of the actual data
storage or retrieval path.
• RegionServer: It is deployed on each machine and hosts data and processes I/O
requests.
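To illustrate the random read/write access described above, here is a minimal sketch using the HBase Java client API. The table name 'users', the column family 'info' and the row key are assumptions for illustration; the table is assumed to already exist with that column family, and hbase-site.xml is assumed to be on the classpath.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseDemo {
    public static void main(String[] args) throws Exception {
        // Reads hbase-site.xml from the classpath to locate ZooKeeper and the cluster.
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("users"))) {
            // Random write: one row, one column in the assumed family 'info'.
            Put put = new Put(Bytes.toBytes("row1"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Alice"));
            table.put(put);

            // Random read of the same row.
            Result result = table.get(new Get(Bytes.toBytes("row1")));
            System.out.println(Bytes.toString(
                    result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"))));
        }
    }
}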

Zookeeper:
ZooKeeper is a centralized service for maintaining configuration information, naming,
providing distributed synchronization and providing group services which are very useful
for a variety of distributed systems. HBase is not operational without ZooKeeper.
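As a small illustration of the kind of coordination data ZooKeeper holds, the sketch below uses the ZooKeeper Java client to store and read back a piece of configuration under a znode. The connection string, the znode path and the stored value are assumptions for illustration.

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZkConfigDemo {
    public static void main(String[] args) throws Exception {
        // Assumed ZooKeeper ensemble address; 3000 ms session timeout, no-op watcher.
        ZooKeeper zk = new ZooKeeper("localhost:2181", 3000, event -> { });

        // Store a small piece of configuration under a znode (path is a placeholder).
        if (zk.exists("/demo-config", false) == null) {
            zk.create("/demo-config", "replication=3".getBytes(),
                      ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        }

        // Read it back, as any other node in the cluster could.
        byte[] data = zk.getData("/demo-config", false, null);
        System.out.println(new String(data));

        zk.close();
    }
}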

Mahout:
Mahout is a scalable machine learning library that implements various approaches to
machine learning. At present Mahout contains four main groups of algorithms:
• Recommendations, also known as collaborative filtering
• Classifications, also known as categorization
• Clustering
• Frequent itemset mining, also known as parallel frequent pattern mining
Algorithms in the Mahout library belong to the subset that can be executed in a distributed
fashion and have been written to run as MapReduce jobs. Mahout scales to reasonably large
data sets by leveraging the properties of the algorithms or by providing implementations
based on Apache Hadoop.
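As a small illustration of the "recommendations" group, the sketch below uses the collaborative-filtering ("Taste") API that shipped with older Mahout releases; this single-machine API is separate from the MapReduce implementations mentioned above. The file name ratings.csv and its layout (userID,itemID,preference lines) are assumptions for illustration.

import java.io.File;
import java.util.List;

import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class RecommenderDemo {
    public static void main(String[] args) throws Exception {
        // ratings.csv is a hypothetical file of "userID,itemID,preference" lines.
        DataModel model = new FileDataModel(new File("ratings.csv"));
        UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
        UserNeighborhood neighborhood = new NearestNUserNeighborhood(10, similarity, model);
        Recommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);

        // Recommend three items for user 1 based on similar users' preferences.
        List<RecommendedItem> items = recommender.recommend(1, 3);
        for (RecommendedItem item : items) {
            System.out.println(item.getItemID() + " : " + item.getValue());
        }
    }
}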

Apache Spark:
Apache Spark is a general compute engine that offers fast data analysis on a large scale.
Spark can run on top of HDFS but bypasses MapReduce and instead uses its own data
processing engine. Common use cases for Apache Spark include real-time queries, event
stream processing, iterative algorithms, complex operations and machine learning.
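As an illustration (not part of this manual's experiments), a minimal Spark job written against the Java API might look like the sketch below. The HDFS path and the log format are placeholders, and the job would be packaged and submitted with spark-submit on a cluster where Spark is installed.

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class LogLevelCount {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("LogLevelCount");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Load a log file from HDFS (placeholder path), keep it in memory,
        // and answer two queries over it without re-reading from disk.
        JavaRDD<String> lines = sc.textFile("hdfs:///logs/app.log").cache();
        long errors = lines.filter(line -> line.contains("ERROR")).count();
        long warnings = lines.filter(line -> line.contains("WARN")).count();

        System.out.println("errors=" + errors + " warnings=" + warnings);
        sc.stop();
    }
}
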
Pig:
Pig is a platform for analyzing and querying huge data sets that consist of a high-level
language for expressing data analysis programs, coupled with infrastructure for evaluating
these programs. Pig’s built-in operations can make sense of semi-structured data, such as
log files, and the language is extensible using Java to add support for custom data types
and transformations.
Pig has three key properties:
• Extensibility
• Optimization opportunities
• Ease of programming
The salient property of Pig programs is that their structure is amenable to substantial
parallelization, which in turn enables them to handle very large data sets. At present,
Pig's infrastructure layer consists of a compiler that produces sequences of MapReduce
programs.

Oozie:
Apache Oozie is a workflow/coordination system to manage Hadoop jobs.
Flume:
Flume is a framework for harvesting, aggregating and moving huge amounts of log data or
text files into and out of Hadoop. Agents are deployed throughout one's IT infrastructure,
inside web servers, application servers and mobile devices. Flume itself has a query
processing engine, so it is easy to transform each new batch of data before it is shuttled
to the intended sink.

Ambari:
Ambari was created to help manage Hadoop. It offers support for many of the tools in the
Hadoop ecosystem including Hive, HBase, Pig, Sqoop and Zookeeper. The tool features a
management dashboard that keeps track of cluster health and can help diagnose
performance issues.
Conclusion:
Hadoop is powerful because it is extensible and easy to integrate with other components.
Its popularity is due in part to its ability to store, analyze and access large amounts of
data quickly and cost-effectively across clusters of commodity hardware. Apache Hadoop is
not actually a single product but a collection of several components; when all these
components are put together, they make Hadoop very user-friendly.

Experiment No. 02

Aim: To study installation of hadoop.

Practical Objectives:

After completing this experiment Students will be able to

1. Understand the installation of Hadoop and be able to work with basic Hadoop commands.

Resources: Computer, VMware installed, IBM Infosphere VM.

Theory:

Hadoop

Hadoop is a framework that allows distributed processing of large datasets across
clusters of commodity computers using simple programming models. It is designed to
scale up from single servers to thousands of machines, each providing computation and
storage.

In short, Hadoop is an open source software framework for storing and processing big
data in a distributed way on large clusters of commodity hardware. Basically it
accomplishes the following two tasks:

1. Massive data storage

2. Faster processing

The main goals that Hadoop follows are:

1. Scalability
2. Fault tolerance
3. Economy (commodity hardware)
4. Handling hardware failure

To set up the Hadoop cluster used in this lab, you need to:
a) Install Java on the computer
b) Install VMware
c) Download the VM file
d) Load it into VMware and start it

The steps to be followed for installing Hadoop using IBM InfoSphere BigInsights are:

Step 1: Check that VT-x (virtualization) mode is enabled and that the machine meets the
required configuration: at least an i3 processor, 8 GB RAM for better performance and
8 GB of free disk space.

Step 2: Open the VM file in VMware and log in as the user biadmin. This starts the guest
OS (Red Hat Linux), which contains Python, Java, IBM InfoSphere BigInsights and the
Eclipse IDE.

Step 3: Start Hadoop:

cd /opt/ibm/biginsights/bin
./start-all.sh

All the Hadoop components get started and Hadoop starts successfully.

Output:
Conclusion:

Hence we have installed hadoop successfully.

Post Lab Assignment: Instead of IBM InfoSphere, run Hadoop using the Cloudera or
Hortonworks sandbox.

Experiment No. 03

Aim: To study and run File Operations in Hadoop.


Practical Objectives:

After completing this experiment students will be able to

1. Work with Hadoop Distributed File System and its operations.

Resources: Computer with Hadoop (IBM BigInsights software).

Theory:
Hadoop File System was developed using distributed file system design. It is run on
commodity hardware. Unlike other distributed systems, HDFS is highly fault tolerant and
designed using low-cost hardware.

HDFS holds very large amounts of data and provides easy access to it. To store such huge
data, the files are spread across multiple machines. These files are stored in a redundant
fashion to protect the system from possible data loss in case of failure. HDFS also makes
applications available for parallel processing.

Features of HDFS

• It is suitable for the distributed storage and processing.

• Hadoop provides a command interface to interact with HDFS.

• The built-in servers of namenode and datanode help users to easily check the
status of cluster.

• Streaming access to file system data.

• HDFS provides file permissions and authentication.

HDFS Architecture
HDFS follows the master-slave architecture; its main elements (NameNode, DataNodes and
the Secondary NameNode) are described in Experiment No. 01. (The architecture diagram is
omitted here.)

Basic Commands of Hadoop:

• hadoop fs -ls                                    : list files in HDFS
• hadoop fs -mkdir                                 : create a directory in HDFS
• hadoop fs -rmdir                                 : remove an empty HDFS directory
• hadoop fs -help                                  : show help for the file system commands
• hadoop fs -put <localsrc> ... <HDFS_dest_Path>   : copy local files into HDFS
• hadoop fs -get <hdfs_src> <localdst>             : copy a file from HDFS to the local disk
• hadoop fs -cat <path[filename]>                  : print the contents of an HDFS file
• hadoop fs -cp <source> <dest>                    : copy a file within HDFS
• hadoop fs -mv <src> <dest>                       : move/rename a file within HDFS

A programmatic sketch of the same operations using the HDFS Java API is given below.
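For reference, here is a minimal sketch of the programmatic equivalents of some of these commands, using the HDFS FileSystem Java API. The paths (including the biadmin home directory) are placeholders, and the code assumes the cluster configuration (core-site.xml) is on the classpath.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsOps {
    public static void main(String[] args) throws Exception {
        // Picks up fs.defaultFS from core-site.xml on the classpath.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        fs.mkdirs(new Path("/user/biadmin/demo"));                       // hadoop fs -mkdir
        fs.copyFromLocalFile(new Path("/tmp/sample.txt"),                // hadoop fs -put
                             new Path("/user/biadmin/demo/sample.txt"));

        for (FileStatus status : fs.listStatus(new Path("/user/biadmin/demo"))) {  // hadoop fs -ls
            System.out.println(status.getPath() + "\t" + status.getLen());
        }

        fs.copyToLocalFile(new Path("/user/biadmin/demo/sample.txt"),    // hadoop fs -get
                           new Path("/tmp/sample_copy.txt"));
        fs.close();
    }
}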

Steps:
• Open the web client (the BigInsights web console).
• From here you can also upload files into Hadoop.
• Have a look at all the components of the console:
• Dashboard, Cluster Status, Files, Applications, Application Status, BigSheets.

Output:

Step 1: Basic commands of Hadoop:

hadoop fs -ls, hadoop fs -mkdir, etc.

Step 2: Open the web client

Step 3: From here also, you can upload files into Hadoop
Conclusion:
HDFS is a distributed file system intended for very large data sets. Here, we have
studied HDFS and executed basic commands and file operations in Hadoop.

Post Lab Assignment: Run the basic commands in the Hortonworks sandbox or Cloudera.

Experiment No. 04

Aim: To study and implement a NoSQL program.

Objective:

After completing this experiment students will be able to

1. Acquire knowledge of NoSQL queries.

2. Design NoSQL queries.

Resources: Computer, Neo4j.

Theory:

NoSQL databases have grown in popularity with the rise of Big Data applications. In
comparison to relational databases, NoSQL databases are much cheaper to scale, capable
of handling unstructured data, and better suited to current agile development approaches.

The advantages of NoSQL technology are compelling but the thought of replacing a
legacy relational system can be daunting. To explore the possibilities of NoSQL in your
enterprise, consider a small-scale trial of a NoSQL database like MongoDB. NoSQL
databases are typically open source so you can download the software and try it out for
free. From this trial, you can assess the technology without great risk or cost to your
organization.

Commands of Neo4j (Cypher):

1. Create a node

CREATE (emp:Employee)

2. Insert data (a node with properties)

CREATE (d:Dept {deptno: 10, dname: 'Accounting', location: 'Hyderabad'})

3. Display data

MATCH (d:Dept) RETURN d.deptno, d.dname

4. Return whole nodes

MATCH (d:Dept)
RETURN d

5. Movie graph: Neo4j Browser ships with a built-in movie graph example that can be
loaded to explore a larger sample graph.
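The same queries can also be issued from a program. Below is a minimal sketch using the official Neo4j Java driver (4.x package names); the Bolt URL, user name and password are assumptions for a local installation, and the driver JAR must be on the classpath.

import org.neo4j.driver.AuthTokens;
import org.neo4j.driver.Driver;
import org.neo4j.driver.GraphDatabase;
import org.neo4j.driver.Record;
import org.neo4j.driver.Result;
import org.neo4j.driver.Session;

public class Neo4jDemo {
    public static void main(String[] args) {
        // Bolt URL and credentials are placeholders for a local Neo4j instance.
        try (Driver driver = GraphDatabase.driver("bolt://localhost:7687",
                     AuthTokens.basic("neo4j", "password"));
             Session session = driver.session()) {

            // Same Cypher as above, sent programmatically.
            session.run("CREATE (d:Dept {deptno: 10, dname: 'Accounting', location: 'Hyderabad'})");

            Result result = session.run("MATCH (d:Dept) RETURN d.deptno AS deptno, d.dname AS dname");
            while (result.hasNext()) {
                Record row = result.next();
                System.out.println(row.get("deptno").asInt() + "\t" + row.get("dname").asString());
            }
        }
    }
}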

Output:
Conclusion:

NoSQL databases are distributed, replicated data stores that are handled with NoSQL
queries; here we have created and queried data in Neo4j using Cypher.

Post Lab Assignment: Design and generate a dependency graph for an existing project in Neo4j.

Experiment No. 05

Aim: To implement a "hello world" program in MapReduce using Pig.

Objective:

After completing this experiment students will be able to:

1. Implement a simple MapReduce program using Pig.

Resources: Computer, VMware installed, IBM Infosphere VM,Pig.

Theory:

MapReduce

MapReduce is a Java-based system, based on a model introduced by Google, in which the
actual data from the HDFS store gets processed efficiently. MapReduce breaks a big data
processing job down into smaller tasks and is responsible for analyzing large datasets in
parallel before reducing them to find the results.

Pig

Pig is one of the data access components in the Hadoop ecosystem. It is a convenient tool,
originally developed by Yahoo, for analyzing huge data sets efficiently. Pig Latin is a
high-level data flow language that is optimized, extensible and easy to use.

Basic commands in Pig are as follows:

cat       : display the contents of a file
LOAD      : load a file into a relation
DUMP      : display a relation on the screen
LIMIT     : limit the number of tuples, e.g.  abc = LIMIT abcd 2;
DESCRIBE  : show the schema definition of a relation
GROUP     : group a relation by a field, e.g.  grouped = GROUP abcd BY id;
EXPLAIN   : describe the MapReduce plan of a relation
FOREACH   : iterate over a relation, FOREACH <bag> GENERATE <data>,
            e.g.  counts = FOREACH abcd GENERATE id;
TOKENIZE  : split a string into tokens
FLATTEN   : un-nest the bag produced by TOKENIZE into simple tuples,
            e.g.  tokens = FOREACH line GENERATE FLATTEN(TOKENIZE(text));

Hello World in Pig

1. Create a text file containing "hello world".
2. Upload it into Hadoop (HDFS).
3. Open Pig (the grunt shell).
4. Type the commands:
   abc = LOAD 'path' AS (line:chararray);
   DUMP abc;

Output:

Step 1: Create a text file containing "hello world"

Step 2: Upload it into Hadoop

Step 3: Open Pig

Step 4: Type the commands:
abc = LOAD 'path' AS (line:chararray);
DUMP abc;

Conclusion: Hence we have implemented and run the hello world program successfully.

Post Lab Assignment (Experiment No. 06): Write code using Pig to implement the word count
problem.

Use:

1) String functions: FLATTEN(TOKENIZE)
2) Cluster functions: GROUP BY
3) Aggregate function: COUNT

Experiment No. 07

Aim: To implement the frequent itemset algorithm with MapReduce using Pig.

Practical Objectives:

After completing this experiment students will be able to

1. Implement logic and execute data mining algorithms using mapreduce.

Resources: Computer, VMware installed, IBM Infosphere VM,Pig.

Theory:

For the implementation of frequent itemsets using Pig we use the Apriori algorithm. The
Apriori algorithm for finding frequent pairs is a two-pass algorithm that limits the amount
of main memory needed by using the downward-closure property of support to avoid
counting pairs that will turn out to be infrequent in the end.

Let s be the minimum support required and let n be the number of items. In the first pass,
we read the baskets and count in main memory the occurrences of each item. We then remove
all items whose frequency is less than s to get the set of frequent items. This requires
memory proportional to n.

In the second pass, we read the baskets again and count in main memory only those pairs in
which both items are frequent. This pass requires memory proportional to the square of the
number of frequent items (for the counts), plus a list of the frequent items (so you know
what must be counted). (The figure showing main memory usage in the two passes of Apriori
is omitted here.)

Apriori Algorithm:

1. Load text
2. Tokenize text
3. Retain first letter
4. Group by letter
5. Count occurrences
6. Grab first element
7. Display/store results

The Apriori algorithm uses the monotonicity property to reduce the number of pairs that
must be counted, at the expense of performing two passes over the data rather than one.

INPUT:

Apriori Algorithm and code:

1. Load text:
   abcd = LOAD 'path' AS (text:chararray);

2. Tokenize text:
   tokens = FOREACH abcd GENERATE FLATTEN(TOKENIZE(text)) AS token:chararray;

3. Retain the first letter:
   letters = FOREACH tokens GENERATE SUBSTRING(token, 0, 1) AS letter:chararray;

4. Group by letter:
   lettergroup = GROUP letters BY letter;

5. Count occurrences:
   countper = FOREACH lettergroup GENERATE group, COUNT(letters);

6. Grab the most frequent element:
   orderedcount = ORDER countper BY $1 DESC;
   result = LIMIT orderedcount 1;

7. Display/store the result:
   DUMP result;

OUTPUT:
Conclusion: Hence we have implemented the frequent itemset algorithm successfully.

Post Lab Assignment: Implement and execute the decision tree algorithm using Pig.

Experiment No. 08

Aim: To implement word count by Map Reduce using Eclipse.

Practical Objectives:

After completing this experiment students will be able to

1. Implement logic and execute data mining algorithms using mapreduce.

Resources: Computer, VMware installed, IBM Infosphere VM,Eclipse.

Theory:

To implement the word count program using MapReduce, execute the following steps:

Open Eclipse.

Go to New -> File -> BigInsights Project.

Create a new project.

Go to New -> File -> Java MapReduce Program.

Create the Java MapReduce files:

• Name: MapperAnalysis
• i/p key: LongWritable
• i/p value: Text
• o/p key: Text
• o/p value: IntWritable
• Click on Next
• Name: ReducerAnalysis
• o/p key: Text
• o/p value: IntWritable
• Click on Next
• Name: DriverAnalysis
• Click on Finish

Open MapperAnalysis.java and complete its code.

Open ReducerAnalysis.java and complete its code.

Open DriverAnalysis.java and edit its code. A sketch of the three classes is given below.
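The manual does not reproduce the generated source, so the sketch below shows what the three classes typically contain for word count, using the names entered in the wizard above. Package declarations and any BigInsights-specific boilerplate generated by Eclipse are omitted; in the project these are three separate source files (each class public in its own file), and the input/output paths come from the job arguments configured in the run procedure.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// MapperAnalysis: i/p key LongWritable, i/p value Text, o/p key Text, o/p value IntWritable.
class MapperAnalysis extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(value.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);        // emit (word, 1) for every word in the line
        }
    }
}

// ReducerAnalysis: o/p key Text, o/p value IntWritable.
class ReducerAnalysis extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();                  // add up the 1s emitted for each word
        }
        context.write(key, new IntWritable(sum));
    }
}

// DriverAnalysis: wires the job together; input and output paths come from the job arguments.
public class DriverAnalysis {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(DriverAnalysis.class);
        job.setMapperClass(MapperAnalysis.class);
        job.setReducerClass(ReducerAnalysis.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}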

• Go to Run Configuration.
• Add the project name and the main class.
• Set the running environment to local.
• Before adding the job arguments, open a terminal, go to the path of the input file and
  type the command  chmod o+w <name of file>  and hit Enter.
• Add the input and output paths as job arguments.
• Run (publish and run).

Output:

Step 1: Open Eclipse

Step 2: Go to New -> File -> BigInsights Project

Step 3: Create a new project

Step 4: Go to New -> File -> Java MapReduce Program

Step 5: Create the Java MapReduce files

Name: MapperAnalysis, i/p key: LongWritable, i/p value: Text, o/p key: Text,
o/p value: IntWritable

Click on Next

Name: ReducerAnalysis, o/p key: Text, o/p value: IntWritable

Click on Next

Name: DriverAnalysis

Click on Finish

Step 6: Open MapperAnalysis.java

Step 7: Open ReducerAnalysis.java

Step 8: Open DriverAnalysis.java

Step 9: Run procedure

Go to Run Configuration >> add the project name and main class >> set the running
environment to local >> job arguments. Before adding the job arguments, open a terminal,
go to the path of the input file and type  chmod o+w <name of file>  and hit Enter >>
add the input and output paths >> Run (publish and run).
Conclusion: Hence we have implemented the word count problem using Eclipse and the
MapReduce technique successfully.

Experiment No. 09

Aim: To implement matrix multiplication using Map reduce.

Practical Objectives:

After completing this experiment students will be able to

1. Implement the logic and execute a complex program, with the help of external resources,
using MapReduce.

Theory:

Map reduce
MapReduce is a style of computing that has been implemented in several systems,
including Google’s internal implementation (simply called MapReduce) and the popular
open-source implementation Hadoop which can be obtained, along with the HDFS file
system from the Apache Foundation. You can use an implementation of MapReduce to
manage many large-scale computations in a way that is tolerant of hardware faults. All
you need to write are two functions, called Map and Reduce, while the system manages
the parallel execution, coordination of tasks that execute Map or Reduce, and also deals
with the possibility that one of these tasks will fail to execute. In brief, a MapReduce
computation executes as follows:
1. Some number of Map tasks each are given one or more chunks from a distributed file
system. These Map tasks turn the chunk into a sequence of key-value pairs. The way key-
value pairs are produced from the input data is determined by the code written by the user
for the Map function.
2. The key-value pairs from each Map task are collected by a master controller and sorted
by key. The keys are divided among all the Reduce tasks, so all key-value pairs with the
same key wind up at the same Reduce task.
3. The Reduce tasks work on one key at a time, and combine all the values associated
with that key in some way. The manner of combination of values is determined by the
code written by the user for the Reduce function.

Matrix Multiplication

Suppose we have an n x n matrix M, whose element in row i and column j is denoted by m_ij.
Suppose we also have a vector v of length n, whose jth element is v_j. Then the
matrix-vector product is the vector x of length n, whose ith element is

x_i = m_i1*v_1 + m_i2*v_2 + ... + m_in*v_n,   i.e.   x_i = sum over j of (m_ij * v_j).
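The theory above describes the matrix-vector product, so here is a minimal MapReduce sketch of that case, under the stated assumption that the vector v fits in main memory. Each input line is assumed to hold one matrix entry as "i j m_ij" (with j counted from 0), and vector.txt is a placeholder for a local copy of v made available on every node (for example through the distributed cache). The mapper emits (i, m_ij * v_j) and the reducer sums these partial products to produce x_i; a driver like the one sketched in Experiment No. 08 (with LongWritable/DoubleWritable output types) is needed to run the job.

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper: for each matrix entry (i, j, m_ij) emit the partial product (i, m_ij * v_j).
class MatrixVectorMapper extends Mapper<LongWritable, Text, LongWritable, DoubleWritable> {
    private List<Double> v = new ArrayList<>();

    @Override
    protected void setup(Context context) throws IOException {
        // Assumption: v is small enough for main memory and is present as a local
        // file on every node (e.g. shipped via the distributed cache), one value per line.
        try (BufferedReader reader = new BufferedReader(new FileReader("vector.txt"))) {
            String line;
            while ((line = reader.readLine()) != null) {
                v.add(Double.parseDouble(line.trim()));
            }
        }
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] parts = value.toString().split("\\s+");   // assumed layout: "i j m_ij"
        int i = Integer.parseInt(parts[0]);
        int j = Integer.parseInt(parts[1]);
        double mij = Double.parseDouble(parts[2]);
        context.write(new LongWritable(i), new DoubleWritable(mij * v.get(j)));
    }
}

// Reducer: sum the partial products for row i to obtain x_i.
class MatrixVectorReducer extends Reducer<LongWritable, DoubleWritable, LongWritable, DoubleWritable> {
    @Override
    protected void reduce(LongWritable key, Iterable<DoubleWritable> values, Context context)
            throws IOException, InterruptedException {
        double sum = 0.0;
        for (DoubleWritable value : values) {
            sum += value.get();
        }
        context.write(key, new DoubleWritable(sum));
    }
}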

Conclusion: Hence we have executed the matrix multiplication program using MapReduce
successfully.

Post Lab Assignment (Experiment No. 10): Implement a clustering algorithm using MapReduce.

Experiment No. 11

Aim: To analyze and summarize large data with Graphical Representation Using
Bigsheets.

Practical Objectives:

After completing this experiment students will be able to

1. Create a graph for large amount of filtered or non-filtered data.

Resources: Computer, VMware installed, IBM InfoSphere VM, BigSheets.

Theory:

IBM technologies enrich this open source framework with analytical software, enterprise
software integration, platform extensions, and tools. BigSheets is a browser-based analytic
tool initially developed by IBM's Emerging Technologies group. Today, BigSheets is
included with BigInsights to enable business users and non-programmers to explore and
analyze data in distributed file systems. BigSheets presents a spreadsheet-like interface so
users can model, filter, combine, explore, and chart data collected from various sources.
The BigInsights web console includes a tab at top to access BigSheets.

Figure 1 depicts a sample data workbook in BigSheets. While it looks like a typical
spreadsheet, this workbook contains data from blogs posted to public websites, and
analysts can even click on links included in the workbook to visit the site that published
the source content.

Figure 1 - BigSheets workbook based on social media data, with links to source content

After defining a BigSheets workbook, an analyst can filter or transform its data as desired.
Behind the scenes, BigSheets translates user commands, expressed through a graphical
interface, into Pig scripts executed against a subset of the underlying data. In this manner,
an analyst can iteratively explore various transformations efficiently. When satisfied, the
user can save and run the workbook, which causes BigSheets to initiate MapReduce jobs
over the full set of data, write the results to the distributed file system, and display
the contents of the new workbook. Analysts can page through or manipulate the full set of
data as desired.

(Figure: extracting data to BigSheets)

Complementing BigSheets are a number of ready-made sample applications that business
users can launch from the BigInsights web console to collect data from websites, relational
database management systems (RDBMS), remote file systems, and other sources. We'll
rely on two such applications for the work described here. However, it's important to realize
that programmers and administrators can use other BigInsights technologies to collect,
process, and prepare data for subsequent analysis in BigSheets. Such technologies include
Jaql, Flume, Pig, Hive, MapReduce applications, and others.
Figure 2 : Graph Generation

Conclusion: Hence we have studied and implemented graphs using BigSheets successfully.
Post Lab Assignment:
1) Create an application for data analysis using Pig and generate a graph for its output
using BigSheets.
MINI PROJECT

Eg :

1. Twitter Data Analysis

2. Text Analysis

3. Weather Data Analysis
