Bda Lab

DEPARTMENT OF INFORMATION TECHNOLOGY
Course Code : 16IT611

Course Name : BIG DATA ANALYTICS LABORATORY
Regulation : 2016
Academic Year : 2016-2020
LIST OF EXPERIMENTS:
1. Set up a pseudo-distributed, single-node Hadoop cluster backed by the Hadoop Distributed

File System, running on Ubuntu Linux. After successful installation on one node,
configuration of a multi-node Hadoop cluster(one master and multiple slaves).
2. MapReduce application for word counting on Hadoop cluster
3. Unstructured data into NoSQL data and do all operations such as NoSQL query with API.
4. K-means clustering using map reduce
5. Page Rank Computation
6. Mahout machine learning library to facilitate the knowledge build up in big data analysis.
7. Application of Recommendation Systems using Hadoop/mahout libraries
2
Ex. No:1
Set up a pseudo-distributed, single-node Hadoop cluster backed by the Hadoop
Distributed File System, running on Ubuntu Linux. After successful installation on one node,
configuration of a multi-node Hadoop cluster(one master and multiple slaves).
Aim:
To set up a pseudo-distributed, single-node Hadoop cluster backed by the Hadoop
Distributed File System, running on Ubuntu Linux. After successful installation on one node, do
the configuration of a multi-node Hadoopcluster (one master and multiple slaves).
Procedure:
1. Installing Java
# apt update
# apt install -y oracle-java8-set-default
# update-alternatives --display java
java - manual mode
link best version is /usr/lib/jvm/java-8-oracle/jre/bin/java
link currently points to /usr/lib/jvm/java-8-oracle/jre/bin/java
link java is /usr/bin/java
slave java.1.gz is /usr/share/man/man1/java.1.gz
/usr/lib/jvm/java-8-oracle/jre/bin/java - priority 1081
slave java.1.gz: /usr/lib/jvm/java-8-oracle/man/man1/java.1.gz
As you can see JAVA_HOME should be set to /usr/lib/jvm/java-8-oracle/jre.
Open /etc/environment and update the PATH line to include the Hadoop binary directories.
PATH="/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/usr
/local/hadoop/bin:/usr/local/hadoop/sbin"
Also add a line for the JAVA_HOME variable.
JAVA_HOME="/usr/lib/jvm/java-8-oracle/jre"
Make sure the directory matches the output from update-alternatives above minus the bin/java
part.
2. Adding a dedicated Hadoop user
Add a hadoop user and give them the correct permissions.
# adduser hadoop
# usermod -aG hadoop hadoop
# chown hadoop:root -R /usr/local/hadoop
# chmod g+rwx -R /usr/local/hadoop
lab@MKCE:~$ sudo su – hduser

3. Installing SSH
ssh has two main components:
1. ssh : The command we use to connect to remote machines - the client.
1
2. sshd : The daemon that is running on the server and allows clients to connect to the
server.
The ssh is pre-enabled on Linux, but in order to start sshd daemon, we need to install ssh first.
4. Create and Setup SSH Certificates
Hadoop requires SSH access to manage its nodes, i.e. remote machines plus our local
machine. For our single-node setup of Hadoop, we therefore need to configure SSH access to
localhost. So, we need to have SSH up and running on our machine and configured it to allow
SSH public key authentication.
Hadoop uses SSH (to access its nodes) which would normally require the user to enter a
password. However, this requirement can be eliminated by creating and setting up SSH
certificates using the following commands. If asked for a filename just leave it blank and press
the enter key to continue.
hduser@MKCE:~$ ssh-keygen -t rsa -P "" -f ~/.ssh/id_rsa
Generating public/private rsa key pair.
Your identification has been saved in /home/hduser/.ssh/id_rsa.
Your public key has been saved in /home/hduser/.ssh/id_rsa.pub.
The key fingerprint is:
SHA256:hiDtFJrpe8oQLeJLy/yYO0KVJ9TwKdmyFehLoJ/1Vms hduser@MKCE
The key's randomart image is:
+---[RSA 2048]----+
| .=. |
| . B+oo |
|. O+== |
|.o X*o .. |
|+.=+=...S. |
|o+oo o.E |
|oo. .. . |
|=+=o |
|.O*. |
+----[SHA256]-----+
hduser@MKCE:~$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
hduser@MKCE:~$
hduser@MKCE:~$ chmod 0600 ~/.ssh/authorized_keys
hduser@MKCE:~$ ssh localhost
The authenticity of host 'localhost (127.0.0.1)' can't be established.
ECDSA key fingerprint is SHA256:ix4HPBPh2dwHRS5MjRctoRFi4lnFyClMAcHEg4rFmO8.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'localhost' (ECDSA) to the list of known hosts.
Welcome to Ubuntu 18.10 (GNU/Linux 4.18.0-12-generic x86_64)
hduser@MKCE:~$ javac -version
2
javac 1.8.0_191
hduser@MKCE:~$ readlink -f /usr/bin/javac
/usr/lib/jvm/java-8-oracle/bin/javac
5. Install Hadoop
wget http://apache.claz.org/hadoop/common/hadoop-3.0.3/hadoop-3.0.3.tar.gz
tar -xzvf hadoop-3.1.1.tar.gz
# mv hadoop-3.1.1 /usr/local/hadoop
6. Setup Configuration Files
The following files will have to be modified to complete the Hadoop setup:
hduser@MKCE:~$ nano ~/.bashrc
hduser@MKCE:~$ source ~/.bashrc
1. ~/.bashrc:
Before editing the .bashrc file in our home directory, we need to find the path where Java has
been installed to set the JAVA_HOME environment variable using the following command:
2. hduser@MKCE:~$ nano /usr/local/hadoop/etc/hadoop/hadoop-env.sh
We need to set JAVA_HOME by modifying hadoop-env.sh file. Adding the above statement in
the hadoop-env.sh file ensures that the value of JAVA_HOME variable will be available to
Hadoop whenever it is started up.
The /usr/local/hadoop/etc/hadoop/core-site.xml file contains configuration properties that
Hadoop uses when starting up.
This file can be used to override the default settings that Hadoop starts with.
hduser@MKCE:~$ sudo mkdir -p /app/hadoop/tmp

[sudo] password for hduser:
hduser@MKCE:~$ sudo chown hduser:hadoop /app/hadoop/tmp
hduser@MKCE:~$ nano /usr/local/hadoop/etc/hadoop/core-site.xml
<configuration>
<property>
<name>hadoop.tmp.dir</name>
<value>/app/hadoop/tmp</value>
<description>A base for other temporary directories.</description>
</property>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:54310</value>
<description>The name of the default file system. A URI whose scheme and authority
determine the FileSystem implementation. The uri's scheme determines the config
3
property (fs.SCHEME.impl) naming theFileSystem implementation class. The uri's
authority is used to determine the host, port, etc. for a filesystem.</description>
</property>
</configuration>
hduser@MKCE:~$ nano /usr/local/hadoop/etc/hadoop/mapred-site.xml.template

hduser@MKCE:~$ nano /usr/local/hadoop/etc/hadoop/mapred-site.xml
We need to enter the following content in between the <configuration></configuration> tag:
<configuration>
<property>
<name>mapred.job.tracker</name>
<value>localhost:54311</value>
<description>The host and port that the MapReduce job tracker runs at. If "local", then
jobs are run in-process as a single map and reduce task.
</description>
</property>
</configuration>
hduser@MKCE:~$ sudo mkdir -p /usr/local/hadoop_store/hdfs/namenode

hduser@MKCE:~$ sudo mkdir -p /usr/local/hadoop_store/hdfs/datanode
hduser@MKCE:~$ sudo chown -R hduser:hadoop /usr/local/hadoop_store
hduser@MKCE:~$ nano /usr/local/hadoop/etc/hadoop/hdfs-site.xml
The /usr/local/hadoop/etc/hadoop/hdfs-site.xml file needs to be configured for each host in the
cluster that is being used.
It is used to specify the directories which will be used as the namenode and the datanode on that
host.
Before editing this file, we need to create two directories which will contain the namenode and
the datanode for this Hadoop installation.
This can be done using the following commands:
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
<description>Default block replication.
The actual number of replications can be specified when the file is created.
The default is used if replication is not specified in create time.
</description>
</property>
<property>
4
<name>dfs.namenode.name.dir</name>
<value>file:/usr/local/hadoop_store/hdfs/namenode</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>file:/usr/local/hadoop_store/hdfs/datanode</value>
</property>
</configuration>
7. Format the NewHadoop File system
The Hadoop file system needs to be formatted so that we can start to use it. The format
command should be issued with write permission since it creates current directory
under /usr/local/hadoop_store/hdfs/namenode folder:
hduser@MKCE:~$ hadoop namenode -format
8. Starting Hadoop
Now it's time to start the newly installed single node cluster. We can use start-all.sh
hduser@MKCE:~$ start-all.sh
We can check if it's really up and running:
hduser@MKCE:~$ jps
9026 NodeManager
7348 NameNode
9766 Jps
8887 ResourceManager
7507 DataNode
5
Result:
Thus the installation process to Set up a pseudo-distributed, single-node Hadoop cluster

backed by the Hadoop Distributed File System, running on Ubuntu Linux. After successful
installation on one node, configuration of a multi-node Hadoopcluster(one master and multiple
slaves).
6
Ex:No:2 MapReduce application for word counting on Hadoop cluster.
Aim:
To write a java program for MapReduce application for word counting on Hadoop cluster.
Algorithm:
Step 1: Start the Program.

Step 2:Open Eclipse> File > New > Java Project >WordCount.java
Step 3: Add Following Reference Libraries:
Right Click on Project > Build Path> Add External
/usr/lib/hadoop-0.20/hadoop-core.jar
Step 4: Type the coding.
Step 5: Create a jar file
Right Click on Project> Export> Select export destination as Jar File > next> Finish.
Step 6:Take a text file and move it into HDFS format:
hduser@MKCE:~$ hadoop fs -put wordcount Fileword CountFile
Step 7: Run the jar file:
Step 8: Display the result
Step 9: Stop the Program.
7
Program:
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
public class WordCount {
public static class TokenizerMapper

extends Mapper<Object, Text, Text, IntWritable>{
private final static IntWritable one = new IntWritable(1);

private Text word = new Text();
public void map(Object key, Text value, Context context

) throws IOException, InterruptedException {
StringTokenizer itr = new StringTokenizer(value.toString());
while (itr.hasMoreTokens()) {
word.set(itr.nextToken());
context.write(word, one);
}
}
}
public static class IntSumReducer

extends Reducer<Text,IntWritable,Text,IntWritable> {
private IntWritable result = new IntWritable();
public void reduce(Text key, Iterable<IntWritable> values,

Context context
) throws IOException, InterruptedException {
int sum = 0;
8
for (IntWritable val : values) {
sum += val.get();
}
result.set(sum);
context.write(key, result);
}
}
public static void main(String[] args) throws Exception {

Configuration conf = new Configuration();
Job job = Job.getInstance(conf, "word count");
job.setJarByClass(WordCount.class);
job.setMapperClass(TokenizerMapper.class);
job.setCombinerClass(IntSumReducer.class);
job.setReducerClass(IntSumReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}
9
OUTPUT :
TO CREATE A DIRECTORY IN HDFS:

hduser@MKCE:~$ hadoop fs -mkdir /durga
TO LOAD INPUT FILE:

hduser@MKCE:~$ hdfs -put /Desktop/WordCount/sample.txt /durga/input/
TO EXECUTE:
hduser@MKCE:~$ hadoop jar wordcount /durga/input/sample.txt /durga/output
16/09/16 14:34:16 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your
platform... using builtin-java classes where applicable
16/09/16 14:34:17 INFO Configuration.deprecation: session.id is deprecated. Instead, use
dfs.metrics.session-id
16/09/16 14:34:17 INFO jvm.JvmMetrics: Initializing JVM Metrics with
processName=JobTracker, sessionId=
16/09/16 14:34:17 WARN mapreduce.JobSubmitter: Hadoop command-line option parsing not
performed. Implement the Tool interface and execute your application with ToolRunner to
remedy this.
16/09/16 14:34:17 INFO input.FileInputFormat: Total input paths to process : 1
16/09/16 14:34:17 INFO mapreduce.JobSubmitter: number of splits:1
16/09/16 14:34:18 INFO mapreduce.JobSubmitter: Submitting tokens for job:
job_local364071501_0001
16/09/16 14:34:18 INFO mapreduce.Job: The url to track the job: http://localhost:8080/
16/09/16 14:34:18 INFO mapreduce.Job: Running job: job_local364071501_0001
16/09/16 14:34:18 INFO mapred.LocalJobRunner: OutputCommitter set in config null
16/09/16 14:34:19 INFO mapred.LocalJobRunner: OutputCommitter is
org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter
16/09/16 14:34:19 INFO mapred.LocalJobRunner: Waiting for map tasks
16/09/16 14:34:19 INFO mapred.LocalJobRunner: Starting task:
attempt_local364071501_0001_m_000000_0
16/09/16 14:34:19 INFO mapred.Task: Using ResourceCalculatorProcessTree : [ ]
16/09/16 14:34:19 INFO mapred.MapTask: Processing split:
hdfs://localhost:54310/deepika/wc1:0+712
16/09/16 14:34:19 INFO mapreduce.Job: Job job_local364071501_0001 running in ubermode :
false
16/09/16 14:34:23 INFO mapreduce.Job: map 0% reduce 0%
16/09/16 14:34:24 INFO mapred.MapTask: (EQUATOR) 0 kvi 26214396(104857584)
10
16/09/16 14:34:24 INFO mapred.MapTask: mapreduce.task.io.sort.mb: 100
16/09/16 14:34:24 INFO mapred.MapTask: soft limit at 83886080
16/09/16 14:34:24 INFO mapred.MapTask: bufstart = 0; bufvoid = 104857600
16/09/16 14:34:24 INFO mapred.MapTask: kvstart = 26214396; length = 6553600
16/09/16 14:34:24 INFO mapred.MapTask: Map output collector class =
org.apache.hadoop.mapred.MapTask$MapOutputBuffer
16/09/16 14:34:26 INFO mapred.LocalJobRunner:
16/09/16 14:34:26 INFO mapred.MapTask: Starting flush of map output
16/09/16 14:34:26 INFO mapred.MapTask: Spilling map output
16/09/16 14:34:26 INFO mapred.MapTask: bufstart = 0; bufend = 1079; bufvoid = 104857600
16/09/16 14:34:26 INFO mapred.MapTask: kvstart = 26214396(104857584); kvend =
26214032(104856128); length = 365/6553600
16/09/16 14:34:26 INFO mapred.MapTask: Finished spill 0
16/09/16 14:34:26 INFO mapred.Task: Task:attempt_local364071501_0001_m_000000_0 is
done. And is in the process of committing
16/09/16 14:34:26 INFO mapred.LocalJobRunner: map
16/09/16 14:34:26 INFO mapred.Task: Task 'attempt_local364071501_0001_m_000000_0'
done.
16/09/16 14:34:26 INFO mapred.LocalJobRunner: Finishing task:
attempt_local364071501_0001_m_000000_0
16/09/16 14:34:26 INFO mapred.LocalJobRunner: map task executor complete.
16/09/16 14:34:26 INFO mapred.LocalJobRunner: Waiting for reduce tasks
16/09/16 14:34:26 INFO mapred.LocalJobRunner: Starting task:
attempt_local364071501_0001_r_000000_0
16/09/16 14:34:26 INFO mapred.Task: Using ResourceCalculatorProcessTree : [ ]
16/09/16 14:34:26 INFO mapred.ReduceTask: Using ShuffleConsumerPlugin:
org.apache.hadoop.mapreduce.task.reduce.Shuffle@2ee9ab75
16/09/16 14:34:26 INFO reduce.MergeManagerImpl: MergerManager:
memoryLimit=363285696, maxSingleShuffleLimit=90821424, mergeThreshold=239768576,
ioSortFactor=10, memToMemMergeOutputsThreshold=10
16/09/16 14:34:26 INFO reduce.EventFetcher: attempt_local364071501_0001_r_000000_0
Thread started: EventFetcher for fetching Map Completion Events
16/09/16 14:34:26 INFO reduce.LocalFetcher: localfetcher#1 about to shuffle output of map
attempt_local364071501_0001_m_000000_0 decomp: 1014 len: 1018 to MEMORY
16/09/16 14:34:27 INFO reduce.InMemoryMapOutput: Read 1014 bytes from map-output for
attempt_local364071501_0001_m_000000_0
16/09/16 14:34:27 INFO reduce.MergeManagerImpl: closeInMemoryFile -> map-output of size:
1014, inMemoryMapOutputs.size() -> 1, commitMemory -> 0, usedMemory ->1014
16/09/16 14:34:27 INFO reduce.EventFetcher: EventFetcher is interrupted.. Returning
11
16/09/16 14:34:27 INFO mapred.LocalJobRunner: 1 / 1 copied.
16/09/16 14:34:27 INFO reduce.MergeManagerImpl: finalMerge called with 1 in-memory map-
outputs and 0 on-disk map-outputs
16/09/16 14:34:27 INFO mapred.Merger: Merging 1 sorted segments
16/09/16 14:34:27 INFO mapred.Merger: Down to the last merge-pass, with 1 segments left of
total size: 991 bytes
16/09/16 14:34:27 INFO reduce.MergeManagerImpl: Merged 1 segments, 1014 bytes to disk to
satisfy reduce memory limit
16/09/16 14:34:27 INFO reduce.MergeManagerImpl: Merging 1 files, 1018 bytes from disk
16/09/16 14:34:27 INFO reduce.MergeManagerImpl: Merging 0 segments, 0 bytes from memory
into reduce
16/09/16 14:34:27 INFO mapred.Merger: Merging 1 sorted segments
16/09/16 14:34:27 INFO mapred.Merger: Down to the last merge-pass, with 1 segments left of
total size: 991 bytes
16/09/16 14:34:27 INFO Configuration.deprecation: mapred.skip.on is deprecated. Instead, use
mapreduce.job.skiprecords
16/09/16 14:34:30 INFO mapred.Task: Task:attempt_local364071501_0001_r_000000_0 is
done. And is in the process of committing
16/09/16 14:34:30 INFO mapred.Task: Task attempt_local364071501_0001_r_000000_0 is
allowed to commit now
16/09/16 14:34:30 INFO output.FileOutputCommitter: Saved output of task
'attempt_local364071501_0001_r_000000_0' to
hdfs://localhost:54310/deepika/out2/_temporary/0/task_local364071501_0001_r_000000
16/09/16 14:34:30 INFO mapred.LocalJobRunner: reduce > reduce
16/09/16 14:34:30 INFO mapred.Task: Task 'attempt_local364071501_0001_r_000000_0' done.
16/09/16 14:34:30 INFO mapred.LocalJobRunner: Finishing task:
attempt_local364071501_0001_r_000000_0
16/09/16 14:34:30 INFO mapred.LocalJobRunner: reduce task executor complete.
16/09/16 14:34:31 INFO mapreduce.Job: Job job_local364071501_0001 completed successfully
16/09/16 14:34:31 INFO mapreduce.Job: Counters: 38
File System Counters
FILE: Number of bytes read=8552
FILE: Number of bytes written=507858
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=1424
12
HDFS: Number of bytes written=724
HDFS: Number of read operations=13
HDFS: Number of large read operations=0
HDFS: Number of write operations=4
Map-Reduce Framework
Map input records=10
Map output records=92
Map output bytes=1079
Map output materialized bytes=1018
Input split bytes=99
Combine input records=92
Combine output records=72
Reduce input groups=72
Reduce shuffle bytes=1018
Reduce input records=72
Reduce output records=72
Spilled Records=144
Shuffled Maps =1
Failed Shuffles=0
Merged Map outputs=1
GC time elapsed (ms)=111
CPU time spent (ms)=0
Physical memory (bytes) snapshot=0
Virtual memory (bytes) snapshot=0
Total committed heap usage (bytes)=242360320
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=712
File Output Format Counters
Bytes Written=724
Input File:
Hi How are you, I am fine over here. What about you, where are you working.
13
Output File:
hduser@MKCE:~$ hadoopfs -ls durga
Found 3 items
-rw-r--r-- 1 training supergroup 0 2019-01-23 03:36 /durga/output/_SUCCESS
drwxr-xr-x - training supergroup 0 2019-01-23 03:36 /durga/output/_logs
-rw-r--r-- 1 training supergroup 20 2019-01-23 03:36 durga/output/part-r-00000
hduser@MKCE:~$ hadoop fs -cat /durga/output/part-r-00000
Hi 1
How 1
are 2
you, 2
I 1
am 1
fine 1
over 1
here. 1
What 1
about 1
where 1
you 1
working. 1
Result:
Thus the java program for Map Reduce application for word counting on Hadoop cluster
has been created, executed and verified successfully.
14
Ex:No:3 K-means Clustering Algorithm
Aim:
To write a java program to execute the K-means clustering algorithm using Euclidean distance
measure.
Algorithm:
Step 1: Start the program
Step 2: Create an array will set of values
Step 3: Assign the cluster values in an arrays
Step 4: Compute the distance measure to group items into respective clusters using Euclidean
distance
Step 5: Display the grouping of items in each cluster after each iteration.
Step6: Stop the program
15
Program:
public class KMeansClustering
{
public static void main(String args[]) {
int arr[] = {2, 4, 10, 12, 3, 20, 30, 11, 25}; // initial data
int i, m1, m2, a, b, n = 0;
boolean flag;
float sum1, sum2;
a = arr[0];
b = arr[1];
m1 = a;
m2 = b;
int cluster1[] = new int[arr.length], cluster2[] = new int[arr.length];
do {
sum1 = 0;
sum2 = 0;
cluster1 = new int[arr.length];
cluster2 = new int[arr.length];
n++;
int k = 0, j = 0;
for (i = 0; i <arr.length; i++) {
if ((Math.sqrt(Math.pow((arr[i] - m1), 2))) <=(Math.sqrt(Math.pow((arr[i] - m2), 2))))
{
cluster1[k] = arr[i];
k++;
}
else {
cluster2[j] = arr[i];
j++;
}
}
System.out.println();
for (i = 0; i < k; i++)
{
sum1 = sum1 + cluster1[i];
}
for (i = 0; i < j; i++)
{
sum2 = sum2 + cluster2[i];
}
16
System.out.println("m1=" + m1 + " m2=" + m2);
a = m1;
b = m2;
m1 = Math.round(sum1 / k);
m2 = Math.round(sum2 / j);
flag = !(m1 == a && m2 == b);
System.out.println("After iteration " + n + " , cluster 1 :\n");

for (i = 0; i < cluster1.length; i++)
{
System.out.print(cluster1[i] + "\t");
}
System.out.println("\n");
System.out.println("After iteration " + n + " , cluster 2 :\n");
{
}
} while (flag);
System.out.println("Final cluster 1 :\n");
{
}
System.out.println();
System.out.println("Final cluster 2 :\n");
{
}
}
}
17
Output:
M1=2 M2=4
After iteration1, cluster 1:

230000000
After iteration1,cluster 2:
4 10 12 20 30 11 25 0 0
M1=3 M2=16
2430000000
10 12 20 30 11 25 0 0 0
Result:
Thus the java program to execute the K-means clustering using map reduce has been
created and executed successfully
18
Ex:No:4 Page Rank Computation
Aim:
To write a java program to execute the Page Rank Computation
Algorithm:
Step 2: Read the adjacency matrix representing the link values between the pages
Step 3: Compute the page rank based on the incoming and outgoing links
Step 4: Improve the result using the damping factor for each pages.
Step 5: Display the result for each iteration
19
Program:
importjava.util.*;
import java.io.*;
public class PageRank {
publicint path[][] = new int[10][10];

public double pagerank[] = new double[10];
public void calc(double totalNodes){
doubleInitialPageRank;
doubleOutgoingLinks=0;
doubleDampingFactor = 0.85;
doubleTempPageRank[] = new double[10];
intExtNode;
intIntNode;
int k=1; // For Traversing
int ITERATION_STEP=1;
InitialPageRank = 1/totalNodes;
System.out.printf(" Total Number of Nodes :"+totalNodes+"\t Initial PageRank of All Nodes
:"+InitialPageRank+"\n");
for(k=1;k<=totalNodes;k++)
{
this.pagerank[k]=InitialPageRank;
}
System.out.printf("\n Initial PageRank Values , 0th Step \n");

{
System.out.printf(" Page Rank of "+k+" is :\t"+this.pagerank[k]+"\n");
}
while(ITERATION_STEP<=2) // Iterations
{
{
TempPageRank[k]=this.pagerank[k];
20
this.pagerank[k]=0;
}
for(IntNode=1;IntNode<=totalNodes;IntNode++)
{
for(ExtNode=1;ExtNode<=totalNodes;ExtNode++)
{
if(this.path[ExtNode][IntNode] == 1)
{
k=1;
OutgoingLinks=0;
while(k<=totalNodes)
{
if(this.path[ExtNode][k] == 1 )
{
OutgoingLinks=OutgoingLinks+1;
}
k=k+1;
}
this.pagerank[IntNode]+=TempPageRank[ExtNode]*(1/OutgoingLinks);
}
}
}
System.out.printf("\n After "+ITERATION_STEP+"th Step \n");
ITERATION_STEP = ITERATION_STEP+1;
}
{
this.pagerank[k]=(1-DampingFactor)+ DampingFactor*this.pagerank[k];
}
System.out.printf("\n Final Page Rank : \n");
{
}
}
public static void main(String args[])
21
{
int nodes,i,j,cost;
Scanner in = new Scanner(System.in);
System.out.println("Enter the Number of WebPages \n");
nodes = in.nextInt();
PageRank p = new PageRank();
System.out.println("Enter the Adjacency Matrix with 1->PATH & 0->NO PATH
Between two WebPages: \n");
for(i=1;i<=nodes;i++)
for(j=1;j<=nodes;j++)
{
p.path[i][j]=in.nextInt();
if(j==i)
p.path[i][j]=0;
}
p.calc(nodes);
}
}
22
Output:
Enter the number of web pages: 5

Enter the adjacency matrix 1-> path
01000
00001
11011
00101
00010
Total No.of Nodes:5.0
Initial Page Rank of all nodes:0.2
Initial Page rank Values: 0th step
Page rank of 1 is :0.2

After 1st step

After 2nd step

Page rank of 1 is :0.0750000000000001
Result:
Thus the java program to execute the Page Rank Computation has been created and executed
successfully.
23
Ex:No:5 Mahout machine learning library to facilitate the knowledge build up in big data
analysis
Aim:
To install the Mahout machine learning library to facilitate the knowledge build up in big data
analysis
Procedure:
1.Setting up your Environment
Edit your environment in ~/.bash_profile for Mac or ~/.bashrc for many linux distributions.
Add the following
export MAHOUT_HOME=/path/to/mahout
export MAHOUT_LOCAL=true # for running standalone on your dev machine,
# unset MAHOUT_LOCAL for running on a cluster
You will need $JAVA_HOME, and if you are running on Spark, you will also need
$SPARK_HOME.
2.Using Mahout as a Library
Running any application that uses Mahout will require installing a binary or source version and
setting the environment. To compile from source:
mvn -DskipTests clean install
To run tests do mvn test
To set up your IDE, do mvneclipse:eclipse or mvnidea:idea
To use maven, add the appropriate setting to your pom.xml or build.sbt following the template
below.
To use the Samsara environment you'll need to include both the engine neutral math-scala
dependency:
<dependency>
<groupId>org.apache.mahout</groupId>
<artifactId>mahout-math-scala_2.10</artifactId>
<version>${mahout.version}</version>
</dependency>
and a dependency for back end engine translation, e.g:
<dependency>
<groupId>org.apache.mahout</groupId>
<artifactId>mahout-spark_2.10</artifactId>
<version>${mahout.version}</version>
</dependency>
3. Building From Source
Prerequisites:
24
Linux Environment (preferably Ubuntu 16.04.x) Note: Currently only the JVM-only build will
work on a Mac. gcc> 4.x NVIDIA Card (installed with OpenCL drivers alongside usual GPU
drivers)
Downloads
Install java 1.7+ in an easily accessible directory (for this example, ~/java/)
http://www.oracle.com/technetwork/java/javase/downloads/jdk8-downloads-2133151.html
Create a directory ~/apache/ .
Download apache Maven 3.3.9 and un-tar/gunzip to ~/apache/apache-maven-3.3.9/
.https://maven.apache.org/download.cgi
Download and un-tar/gunzipHadoop 2.4.1 to ~/apache/hadoop-2.4.1/
.https://archive.apache.org/dist/hadoop/common/hadoop-2.4.1/
Download and un-tar/gunzip spark-1.6.3-bin-hadoop2.4 to ~/apache/
.http://spark.apache.org/downloads.html Choose release: Spark-1.6.3 (Nov 07 2016) Choose
package type: Pre-Built for Hadoop 2.4
Install ViennaCL 1.7.0+ If running Ubuntu 16.04+
sudo apt-get install libviennacl-dev
Otherwise if your distribution’s package manager does not have a viennniacl-dev package
>1.7.0, clone it directly into the directory which will be included in when being compiled by
Mahout:
mkdir ~/tmp
cd ~/tmp&&git clone https://github.com/viennacl/viennacl-dev.git
cp -r viennacl/ /usr/local/
cp -r CL/ /usr/local/
Ensure that the OpenCL 1.2+ drivers are installed (packed with most consumer grade NVIDIA
drivers). Not sure about higher end cards.
Clone mahout repository into ~/apache.
git clone https://github.com/apache/mahout.git
Configuration
When building mahout for a spark backend, we need four System Environment variables set:
export MAHOUT_HOME=/home/<user>/apache/mahout
export HADOOP_HOME=/home/<user>/apache/hadoop-2.4.1
export SPARK_HOME=/home/<user>/apache/spark-1.6.3-bin-hadoop2.4
export JAVA_HOME=/home/<user>/java/jdk-1.8.121
Mahout on Spark regularly uses one more env variable, the IP of the Spark cluster’s master node
(usually the node which one would be logged into).
To use 4 local cores (Spark master need not be running)
export MASTER=local[4]
To use all available local cores (again, Spark master need not be running)
export MASTER=local[*]
To point to a cluster with spark running:
25
export MASTER=spark://master.ip.address:7077
Then add these to the path:
PATH=$PATH$:MAHOUT_HOME/bin:$HADOOP_HOME/bin:$SPARK_HOME/bin:$JAVA
_HOME/bin
These should be added to the your ~/.bashrc file.
Building Mahout with Apache Maven
Currently Mahout has 3 builds. From the $MAHOUT_HOME directory we may issue the
commands to build each using mvn profiles.
JVM only:
mvn clean install -DskipTests
JVM with native OpenMP level 2 and level 3 matrix/vector Multiplication
mvn clean install -Pviennacl-omp -Phadoop2 -DskipTests
JVM with native OpenMP and OpenCL for Level 2 and level 3 matrix/vector Multiplication.
(GPU errors fall back to OpenMP, currently only a single GPU/node is supported).
mvn clean install -Pviennacl -Phadoop2 -DskipTests
Testing the Mahout Environment
Mahout provides an extension to the spark-shell, which is good for getting to know the language,
testing partition loads, prototyping algorithms, etc..
To launch the shell in local mode with 2 threads: simply do the following:
$ MASTER=local[2] mahout spark-shell
Result:
Thus the installation process for Mahout Machine learning library to facilitate the
knowledge builds up in big data analysis
26
Ex:No: 6 Application of Recommendation Systems
Aim:
Write a java program for collaborative recommendation system using Cosine Similarity
Matrix and Pearson Correlation Matrix.
Algorithm:
Step 2: Read the relationship between two or more items in an adjacency matrix
Step 3: Compute the relationship matrix using Cosine and Pearson Correlation
Step 4: Find the relationship coefficient value
Step 5: Display the result
27
Program:
importjava.util.Arrays;
publicclass Cosine {
publicstaticdouble similarity(int vec1[],int vec2[])
{
intdop=vec1[0]*vec2[0] + vec1[1]*vec2[1];
double mag1 =Math.sqrt(Math.pow(vec1[0],2)+Math.pow(vec1[1],2));
double mag2 =Math.sqrt(Math.pow(vec2[0],2)+Math.pow(vec2[1],2));
double csim =dop / (mag1 * mag2);
return csim;
}
publicstaticdouble similarity(double vec1[],double vec2[])

{
doubledop=vec1[0]*vec2[0] + vec1[1]*vec2[1];
double mag1=Math.sqrt(Math.pow(vec1[0],2) + Math.pow(vec1[1],2));
double mag2=Math.sqrt(Math.pow(vec2[0],2) + Math.pow(vec2[1],2));
doublecsim=dop/ (mag1 * mag2);
returncsim;
}
publicstaticvoid main(String[] args)

{
int vector1 [] = {1,2};
int vector2 [] = {4,5};
double csim1= Cosine.similarity(vector1, vector2);
System.out.println("\n Cosine similarity between two 2D vectors");
System.out.println("integer 2D vectors");
System.out.println("Vector 1 :" + Arrays.toString(vector1));
System.out.println("similarity value :" + csim1);
double vector3 [] = {5.2, 2.6};

double vector4 [] = {9.8,7.6};
System.out.println("double 2D vectors");
double csim2=Cosine.similarity(vector3, vector4);
28
System.out.println("similarity value :" + csim2);
}
}
Output:
0.9848485
Rating of 0 item 4.0 4.0

Pearson Correlation:
public class Main
{
public static void main(String[] args)
{
float ans=0,sum1=0,sum2=0;
int x[];
double b1;
double a1,c1,d1,e1 = 0,z1=0,w1=0;
int i;
double avg1,avg2;
int a[]= {4,5,3,4,0};
int b[]= {4,5,4,3,3};
for(i=0;i<4;i++)
{
e1=e1+(a[i]*b[i]);
}
for(i=0;i<4;i++)
{
c1=(Math.pow(a[i],2));
d1=(Math.pow(b[i],2));
z1=z1+c1;
w1=w1+d1;
}
29
z1=Math.sqrt(z1);
w1=Math.sqrt(w1);
ans=(float) (e1/(z1*w1));
System.out.println(ans);
for(i=0;i<5;i++)
{
float g=(a[i]*ans)/ans;
float h=(b[i]*ans)/ans;
System.out.println("Rating of "+ i +" item"+ " " +g +" " +h);
}
}
}
Output:
4.851256530
Result:
Thus the java program for Application of Recommendation Systems using Cosine
Similarity matrix and Pearson Correlation Matrix has been created and executed successfully.
30
Ex. No: 7 Unstructured data into NoSQL data and do all operations such as NoSQL query
with Mongo DB
Mongo DB Installation and commands
https://docs.mongodb.com/manual/tutorial/install-mongodb-on-ubuntu/
Steps1: open the terminal: (server)
sudo apt-key adv --keyserver hkp://keyserver.ubuntu.com:80 --recv
9DA31620334BD75D9DCB49F368818C72E52529D4
Step 2. Establish a connection with the local machine and server
echo "deb [ arch=amd64 ] https://repo.mongodb.org/apt/ubuntu bionic/mongodb-org/4.0
multiverse" | sudo tee /etc/apt/sources.list.d/mongodb-org-4.0.list
Step 3. Update all files
sudo apt-get update
Step 4. Install Mongo DB
sudo apt-get install -y mongodb-org
Step 5. Run the basic commands to enable database operations
echo "mongodb-org hold" | sudo dpkg --set-selections
echo "mongodb-org-server hold" | sudo dpkg --set-selections
echo "mongodb-org-shell hold" | sudo dpkg --set-selections
echo "mongodb-org-mongos hold" | sudo dpkg --set-selections
echo "mongodb-org-tools hold" | sudo dpkg --set-selections
Step 6. Run the Server using the following command
Mongod
Client Machine
Open another terminal (client)
Step 1. Start the client service
sudo service mongod start
[initandlisten] waiting for connections on port 27017
Step 2: The Mongo DB service can be stopped using the following command
sudo service mongod stop
Step 3: Mongo DB service can be restarted using the following command
sudo service mongod restart
Step 4. Start the database operation after establishing a connection using the following command
hduser@MKCE:~$ mongo
MongoDB shell version v4.0.4

connecting to: mongodb://127.0.0.1:27017
Implicit session: session { "id" : UUID("58bc451c-54fa-4f00-a8f7-d711737699ad") }
MongoDB server version: 4.0.4
31
Server has startup warnings:
2019-03-21T13:48:36.656+0530 I STORAGE [initandlisten]
2019-03-21T13:48:36.656+0530 I STORAGE [initandlisten] ** WARNING: Using the XFS
filesystem is strongly recommended with the WiredTiger storage
engine
2019-03-21T13:48:36.656+0530 I STORAGE [initandlisten] ** See
http://dochub.mongodb.org/core/prodnotes-filesystem
2019-03-21T13:48:43.235+0530 I CONTROL [initandlisten]
2019-03-21T13:48:43.235+0530 I CONTROL [initandlisten] ** WARNING: Access control is
not enabled for the database.
2019-03-21T13:48:43.235+0530 I CONTROL [initandlisten] ** Read and write access to data
and configuration is unrestricted.
2019-03-21T13:48:43.235+0530 I CONTROL [initandlisten]
---
Enable MongoDB's free cloud-based monitoring service, which will then receive and display
metrics about your deployment (disk utilization, CPU, operation statistics, etc).
The monitoring data will be available on a MongoDB website with a unique URL accessible to
you
and anyone you share the URL with. MongoDB may use this information to make product
improvements and to suggest MongoDB products and deployment options to you.
To enable free monitoring, run the following command: db.enableFreeMonitoring()
To permanently disable this reminder, run the following command: db.disableFreeMonitoring()
Mongo DB Operations
Commands : CRUD Operations
Switch to a new Database
> use db1;
switched to db db1
Create table
> db.createCollection("table1");
{ "ok" : 1 }
> show collections;
table1
> db.table1.insert({name:"vik"});
WriteResult({ "nInserted" : 1 })
> db.table1.find()
{ "_id" : ObjectId("5c9349f1fc5dd9934b37b795"), "name" : "vik" }
> db.table1.insert({name:"vraj"});
> db.table1.update({name:"vraj"},{$set:{name:"newvikram"}});
32
WriteResult({ "nMatched" : 1, "nUpserted" : 0, "nModified" : 1 })
> db.table1.find()
{ "_id" : ObjectId("5c934a3dfc5dd9934b37b796"), "name" : "newvikram" }
> db.table1.insert({name:"abi",ahe:20,city:"salem"});
> db.table1.find()
{ "_id" : ObjectId("5c934b61fc5dd9934b37b797"), "name" : "abi", "ahe" : 20, "city" :
"salem" }
> db.table1.find();
"salem" }
> db.table1.update({name:"newvikram"},{$set:{name:"newram"}})
> db.table1.find();
{ "_id" : ObjectId("5c934a3dfc5dd9934b37b796"), "name" : "newram" }
"salem" }
> db.table1.update({name:"abi"},{$set:{ahe:50}})
Update Database
> db.table1.update({name:"abi"},{$set:{ahe:50,city:"karur"}})
Command to alter Table by adding new Column
> db.table1.update({name:"abi"},{$set:{ahe:50,city:"karur",faname:"rajesh"}})
> db.table1.find();

"karur", "faname" : "rajesh" }
> db.table1.update({name:"abi"},{ahe:50,city:"karur1",faname:"rajesh1"})
33
> db.table1.find();
{ "_id" : ObjectId("5c934b61fc5dd9934b37b797"), "ahe" : 50, "city" : "karur1",
"faname" : "rajesh1" }
Creating new object and insert into table
> db.table1.insert({name:"ab",city:{dno:15,stname:"kamallarst"}}
... )
> db.table1.find();
{ "_id" : ObjectId("5c934b61fc5dd9934b37b797"), "ahe" : 50, "city" : "karur1", "faname" :
"rajesh1" }
{ "_id" : ObjectId("5c934f97bf0cd53d038d8d9a"), "name" : "ab", "city" : { "dno" : 15, "stname"
: "kamallarst" } }
Command to insert many rows at a time
> db.table1.insertMany([{name:"ab",city:{dno:15,stname:"kamallarst"}},{name:"hhh"}])
{
"acknowledged" : true,
"insertedIds" : [
ObjectId("5c9350acbf0cd53d038d8d9b"),
ObjectId("5c9350acbf0cd53d038d8d9c")
]
}
> db.table1.find();
{ "_id" : ObjectId("5c934f97bf0cd53d038d8d9a"), "name" : "ab", "city" : { "dno" : 15,
"stname" : "kamallarst" } }
{ "_id" : ObjectId("5c9350acbf0cd53d038d8d9b"), "name" : "ab", "city" : { "dno" : 15,
{ "_id" : ObjectId("5c9350acbf0cd53d038d8d9c"), "name" : "hhh" }
Remove a data from the table
> db.table1.remove(name:"vik")
2019-03-21T14:25:07.200+0530 E QUERY [js] SyntaxError: missing ) after argument list
@(shell):1:21
34
> db.table1.remove({name:"vik"})
WriteResult({ "nRemoved" : 1 })
> db.table1.remove({city.dno:15})
2019-03-21T14:26:42.240+0530 E QUERY [js] SyntaxError: missing : after property id
@(shell):1:22
> db.table1.remove({name:"abb"})
WriteResult({ "nRemoved" : 1 })
> db.table1.find();
{ "_id" : ObjectId("5c934f97bf0cd53d038d8d9a"), "name" : "ab", "city" : { "dno" : 15,
{ "_id" : ObjectId("5c9350acbf0cd53d038d8d9b"), "name" : "ab", "city" : { "dno" : 15,
{ "_id" : ObjectId("5c9350acbf0cd53d038d8d9c"), "name" : "hhh" }
{ "_id" : ObjectId("5c9350f7bf0cd53d038d8d9e"), "name" : "hkhh" }
Result:
Thus the program for unstructured data and operations to access the data using NoSQL
query with Mongo DB has been created and executed successfully.
35

Bda Lab

Загружено:

Сведения о документе

Оригинальное название

Авторское право

Доступные форматы

Поделиться этим документом

Поделиться или встроить документ

Параметры публикации

Этот документ был вам полезен?

Это неприемлемый материал?

Авторское право:

Доступные форматы

Bda Lab

Загружено:

Авторское право:

Доступные форматы

DEPARTMENT OF INFORMATION TECHNOLOGY

Course Code : 16IT611

1. Set up a pseudo-distributed, single-node Hadoop cluster backed by the Hadoop Distributed

lab@MKCE:~$ sudo su – hduser

hduser@MKCE:~$ sudo mkdir -p /app/hadoop/tmp

hduser@MKCE:~$ nano /usr/local/hadoop/etc/hadoop/mapred-site.xml.template

hduser@MKCE:~$ sudo mkdir -p /usr/local/hadoop_store/hdfs/namenode

7. Format the NewHadoop File system

hduser@MKCE:~$ hadoop namenode -format

Thus the installation process to Set up a pseudo-distributed, single-node Hadoop cluster

Step 1: Start the Program.

public class WordCount {

public static class TokenizerMapper

private final static IntWritable one = new IntWritable(1);

public void map(Object key, Text value, Context context

public static class IntSumReducer

public void reduce(Text key, Iterable<IntWritable> values,

public static void main(String[] args) throws Exception {

TO CREATE A DIRECTORY IN HDFS:

TO LOAD INPUT FILE:

System.out.println("After iteration " + n + " , cluster 1 :\n");

After iteration1, cluster 1:

publicint path[][] = new int[10][10];

public void calc(double totalNodes){

System.out.printf("\n Initial PageRank Values , 0th Step \n");

Enter the number of web pages: 5

Page rank of 1 is :0.2

After 1st step

After 2nd step

publicstaticdouble similarity(double vec1[],double vec2[])

publicstaticvoid main(String[] args)

double vector3 [] = {5.2, 2.6};

Rating of 0 item 4.0 4.0

Mongo DB Installation and commands

MongoDB shell version v4.0.4

{ "_id" : ObjectId("5c9349f1fc5dd9934b37b795"), "name" : "vik" }

Вам также может понравиться