TEAM - POC
Deepak Gattala
Hadoop Administrator
DW Architect ABI/EBI
Spike White
Linux System Administrator
Kerberos Specialist.
Will OBrian
Active Directory and Identity.
Security Analyst.
Deepak Gattala
- Architect
What is Hadoop?
Hadoop is an open-source software framework.
It is an Apache top-level project, but the underlying technology came from a Google white paper about indexing rich textual and structural information.
It is architected to run on a large number of machines that don't share any memory or disks.
Hadoop enables distributed parallel processing of huge amounts of
data across inexpensive, industry-standard servers that both store
and process the data, and can scale without limits.
Prerequisites
The Hadoop framework mainly consists of two important components:
HDFS (Hadoop Distributed File System): a file system written in Java, used for storage, similar to ext3 or ext4 in Linux.
MapReduce: a programming model for processing and generating large data sets with a parallel, distributed algorithm on a cluster. MapReduce is the paradigm used to process data on HDFS; the processing is moved to where the data lives.
Basic Linux commands, e.g. ls, cat, etc. (a minimal Java sketch of the HDFS equivalent follows this list).
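As a minimal sketch of how HDFS can be browsed from Java, assuming the client's classpath carries the cluster configuration and using /user/hue only as an example path, the equivalent of ls looks roughly like this:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Rough equivalent of "ls" on HDFS; /user/hue is a placeholder path.
public class ListHdfsDir {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration(); // loads core-site.xml / hdfs-site.xml from the classpath
    FileSystem fs = FileSystem.get(conf);     // connects to the file system named by fs.defaultFS
    for (FileStatus status : fs.listStatus(new Path("/user/hue"))) {
      System.out.println(status.getPath() + "\t" + status.getLen() + " bytes");
    }
  }
}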
HDFS
[Figure: HDFS architecture. A client sends an input file to the NameNode over TCP/IP networking; the NameNode keeps the metadata, and the DataNodes store the file blocks. Example placement of five blocks at replication 3: DataNode1 holds blocks 1, 4, 5; DataNode2 holds 2, 3, 4; DataNode3 holds 1, 2, 5; DataNode4 holds 3, 4, 5; DataNode5 holds 1, 2, 3.]
Every piece of data is split into blocks and distributed across the cluster. Blocks are typically 64 MB or 128 MB, and the default replication factor is 3.
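As a rough sketch of how this block layout can be inspected from Java, assuming the placeholder file path below, the client API reports which hosts hold each block's replicas:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Prints, for each block of a file, the DataNodes holding a replica.
public class ShowBlockLocations {
  public static void main(String[] args) throws IOException {
    FileSystem fs = FileSystem.get(new Configuration());
    Path file = new Path("/user/hue/word_count_text.txt"); // placeholder path
    FileStatus status = fs.getFileStatus(file);
    BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
    for (int i = 0; i < blocks.length; i++) {
      System.out.println("block " + i + " on hosts: " + String.join(", ", blocks[i].getHosts()));
    }
  }
}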
MapReduce - Example
MapReduce has five stages, commonly described as input splitting, mapping, shuffling, sorting, and reducing.
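For the word count used throughout this deck, for example: the map stage turns each line into (word, 1) pairs, the shuffle and sort stages bring all pairs for the same word to a single reducer, and the reduce stage sums those counts into a final (word, total) pair. The Java, Pig, and Hive versions of this example appear later.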
Hadoop Distributions
Even though Hadoop is an open-source project, some vendors package compatible versions together, add operations tooling, and provide great flexibility.
Below are the top three vendors:
Cloudera
Hortonworks
MapR
The underlying code remains bare-bones Apache open source; however, some vendors attach commercial products and services to specific distributions.
Ecosystem
Hadoop Configuration
Daemons of the Hadoop ecosystem: the NameNode (master), which holds the block information (metadata) for the file system. A minimal configuration sketch follows.
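As a sketch of how these settings are handled in code, assuming placeholder values, a Hadoop Configuration object loads core-site.xml and hdfs-site.xml from the classpath, and individual properties can be read or overridden:

import org.apache.hadoop.conf.Configuration;

// Reads and overrides two common Hadoop properties; the NameNode
// address below is a placeholder, not this POC cluster's real value.
public class ShowConfig {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    conf.set("fs.defaultFS", "hdfs://namenode.example.com:8020"); // where the NameNode listens
    conf.setInt("dfs.replication", 3);                            // default block replication
    System.out.println("NameNode: " + conf.get("fs.defaultFS"));
    System.out.println("Replication: " + conf.get("dfs.replication"));
  }
}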
Spike White
- Sr. System Engineer
Hardware Configuration
There are three different node configurations that are important in the architecture for optimal performance, typically master, worker (data), and edge (gateway) nodes.
Kerberos
The Kerberos protocol is a standard designed to provide strong
authentication within a client/server network environment.
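As a minimal sketch of what this means for a Hadoop client, assuming a placeholder principal and keytab path, the client authenticates from a keytab before touching the cluster:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.security.UserGroupInformation;

// Kerberos login for a Hadoop client; principal and keytab are placeholders.
public class KerberosLogin {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    conf.set("hadoop.security.authentication", "kerberos"); // enable Kerberos auth
    UserGroupInformation.setConfiguration(conf);
    UserGroupInformation.loginUserFromKeytab(
        "svc-hadoop@EXAMPLE.COM", "/etc/security/keytabs/svc-hadoop.keytab");
    System.out.println("Logged in as: " + UserGroupInformation.getLoginUser());
  }
}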
Will OBrian
- Active Directory Services
Active Directory
CM uses the service account servicegtminf.
servicegtminf was given rights to create/delete accounts within the us-poclab.dellpoc.com/Hadoop OU, as well as Full Control rights to any descendant objects (accounts).
Service accounts are created by CM by changing the user principal name.
The account serviceARFSqfwFob is configured to be used for the Sentry service running on ausgtmhadoop07.
Deepak Gattala
- Architect
package org.myorg;

import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;

public class WordCount {

  // Mapper: for each line of input, emit (word, 1) for every token.
  public static class Map extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value,
        OutputCollector<Text, IntWritable> output, Reporter reporter)
        throws IOException {
      String line = value.toString();
      StringTokenizer tokenizer = new StringTokenizer(line);
      while (tokenizer.hasMoreTokens()) {
        word.set(tokenizer.nextToken());
        output.collect(word, one);
      }
    }
  }

  // Reducer: sum the counts for each word; also used as the combiner.
  public static class Reduce extends MapReduceBase
      implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterator<IntWritable> values,
        OutputCollector<Text, IntWritable> output, Reporter reporter)
        throws IOException {
      int sum = 0;
      while (values.hasNext()) {
        sum += values.next().get();
      }
      output.collect(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(WordCount.class);
    conf.setJobName("wordcount");

    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);

    conf.setMapperClass(Map.class);
    conf.setCombinerClass(Reduce.class); // combiner pre-aggregates on the map side
    conf.setReducerClass(Reduce.class);

    conf.setInputFormat(TextInputFormat.class);
    conf.setOutputFormat(TextOutputFormat.class);

    // Input and output HDFS paths come from the command line.
    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));

    JobClient.runJob(conf);
  }
}
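The job is typically packaged into a JAR and submitted from an edge node, for example hadoop jar wordcount.jar org.myorg.WordCount <input dir> <output dir>, where the JAR name and the HDFS paths are placeholders.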
Using Pig
-- Load the text file; each record is one line of text.
a = load '/user/hue/word_count_text.txt';
-- Split each line into words, one word per record.
b = foreach a generate flatten(TOKENIZE((chararray)$0)) as word;
-- Group identical words, then count each group.
c = group b by word;
d = foreach c generate COUNT(b), group;
store d into '/user/hue/pig_wordcount';
Using Hive
Hive came out of Facebook, which uses Hadoop extensively and was looking for a way to give non-Java programmers (data analysts, statisticians, data scientists, etc.) access to the data in its Hadoop clusters.
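A minimal sketch of the same word count in HiveQL, assuming the input file from the Pig example and a placeholder table name docs:

-- One line of text per row.
CREATE TABLE docs (line STRING);
LOAD DATA INPATH '/user/hue/word_count_text.txt' INTO TABLE docs;

-- Split each line into words, then count occurrences of each word.
SELECT word, COUNT(*) AS cnt
FROM docs
LATERAL VIEW explode(split(line, '\\s+')) t AS word
GROUP BY word;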
Hive - Authorization
CREATE ROLE role_name;
DROP ROLE role_name;
GRANT ROLE role_name [, role_name] TO GROUP <groupName> [, GROUP <groupName>];
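As a usage sketch with placeholder role, group, and database names, and a Sentry-style privilege grant:

CREATE ROLE analyst;
GRANT ROLE analyst TO GROUP analysts;
-- With Sentry, privileges are then granted to the role:
GRANT SELECT ON DATABASE poc_db TO ROLE analyst;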
Hive - Authorization
The object hierarchy where security can be applied can be as granular as the following, which with Sentry is typically: server, database, table or view, and URI (HDFS path).
Cloudera Manager
Cloudera provides a web interface, Cloudera Manager, for managing the cluster.
Questions?