
1) What is MapReduce?

MapReduce is a programming framework that allows us to perform parallel processing on large data sets in a distributed environment.

• MapReduce consists of two distinct tasks – Map and Reduce.


• As the name MapReduce suggests, the reducer phase takes place after the mapper phase has been completed.
• So, the first is the map job, where a block of data is read and processed to produce key-value pairs as intermediate outputs.
• The output of a Mapper or map job (key-value pairs) is the input to the Reducer.
• The reducer receives key-value pairs from multiple map jobs.
• Then, the reducer aggregates those intermediate data tuples (intermediate key-value pairs) into a smaller set of tuples or key-value pairs, which is the final output.
• Hadoop MapReduce is a software framework for developing applications which can process huge amounts of data, typically running into terabytes in size. This programming model was initially developed at Google to power their search engine. Hadoop later adopted this model under the Apache open-source license.
• MapReduce has two main functions at its core, namely map() and reduce(). These two operations are inspired by the functional programming language Lisp.
Map Processing:
Given an input file to process, it is divided into smaller chunks (input splits). The MapReduce framework creates a
new map task for each input split.
• The map task reads each record from its input split and maps input key-value pairs to intermediate key-value pairs.
• map(k1, v1) → list(k2, v2), where (k2, v2) is an intermediate key/value pair.
What is the right number of map tasks to create?

It depends on the total size of the input. Typically, the number of map tasks equals the number of blocks in the input file; for example, with a 128 MB block size, a 1 GB input file is split into eight blocks and therefore eight map tasks.

• Mapper outputs are sorted and fed to a partitioner, which partitions the intermediate key/value pairs among the reducers (all intermediate key/value pairs with the same key are sent to the same reducer).
• The MapReduce framework then takes all the intermediate values for a given output key and combines them together into a list (see the mapper sketch below).
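To make the map phase concrete, here is a minimal word-count Mapper sketch using the Hadoop Java API; the class name and the word-count logic are illustrative assumptions, not taken from the text above.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Maps each input record (k1 = byte offset, v1 = line of text) to a list of
// intermediate (k2 = word, v2 = 1) pairs, i.e. map(k1, v1) -> list(k2, v2).
public class WordCountMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(value.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);   // emit an intermediate (word, 1) pair
        }
    }
}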
Reduce Processing:
• Each reduce task receives the output produced after Map Processing (which is a key / list-of-values pair) and then performs an operation on the list of values for each key. It then emits output key-value pairs.

• reduce(k2, [v2]) → (k2, v3) (see the reducer sketch below).
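To complete the picture, a matching word-count Reducer might look like the following sketch, again using the Hadoop Java API with an illustrative class name.

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Receives (k2, [v2]) = (word, list of counts) and emits (k2, v3) = (word, total).
public class WordCountReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable count : values) {
            sum += count.get();                       // aggregate the list of values for this key
        }
        context.write(key, new IntWritable(sum));     // final output key-value pair
    }
}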


3) PIG:
Apache Pig is a platform for managing large data sets which provides a high-level language for analyzing the data. Pig also provides the infrastructure to evaluate these programs. The advantage of Pig programming is that it can easily handle parallel processing of very large amounts of data. Programming on this platform is basically done using the textual language Pig Latin.

Pig Latin comes with the following features:


• Simple programming: it is easy to code, execute and manage
• Better optimization: the system can automatically optimize the execution
• Extensibility: it can be extended to carry out highly specific processing tasks
Pig can be used for the following purposes:
• ETL data pipelines
• Research on raw data
• Iterative processing
The scalar data types in Pig are int, float, double, long, chararray, and bytearray. The complex data types in Pig are map, tuple, and bag.
Map: a set of key-value pairs in which the key is of type chararray and the value can be any Pig data type, including complex types.
Example: ['city'#'bang', 'pin'#560001]
Here city and pin are keys mapping to their values.
Tuple: an ordered collection of fields with a fixed length; a tuple can hold multiple fields of different data types.
Bag: an unordered collection of tuples; tuples in the bag are separated by commas.
Example: {('Bangalore', 560001), ('Mysore', 570001), ('Mumbai', 400001)}
LOAD function:

The LOAD function loads data from the file system. It is a relational operator. The first step in a Pig data-flow script is to specify the input, which is done using the LOAD keyword.

The LOAD syntax is


LOAD 'mydata' [USING function] [AS schema];
Example: A = LOAD 'intellipaat.txt';
A = LOAD 'intellipaat.txt' USING PigStorage('\t');
The relational operators in Pig are:
foreach, order by, filter, group, distinct, join, limit.

foreach: It takes a set of expressions and applies them to every record in the data pipeline, passing the results to the next operator.

A = LOAD 'input' AS (emp_name:chararray, emp_id:long, emp_add:chararray, phone:chararray, preferences:map[]);

B = foreach A generate emp_name, emp_id;


filter: It takes a predicate and allows us to select which records will be retained in the data pipeline.
Syntax: alias = FILTER alias BY expression;
Here alias is the name of the relation, BY is a required keyword, and the expression must evaluate to a Boolean.
Example: M = FILTER N BY F5 == 4;
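The same kind of Pig Latin pipeline can also be driven from Java through Pig's PigServer API. The sketch below is a rough illustration that assumes Pig running in local mode; the input and output paths are placeholders.

import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class PigPipelineExample {
    public static void main(String[] args) throws Exception {
        // Run Pig in local mode; ExecType.MAPREDUCE would target a cluster.
        PigServer pig = new PigServer(ExecType.LOCAL);

        // Register the LOAD / foreach pipeline shown above ('input' is a placeholder path).
        pig.registerQuery("A = LOAD 'input' AS (emp_name:chararray, emp_id:long, "
                + "emp_add:chararray, phone:chararray, preferences:map[]);");
        pig.registerQuery("B = FOREACH A GENERATE emp_name, emp_id;");

        // Store the result relation into a placeholder output path.
        pig.store("B", "output");
    }
}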
Hive:
Apache Hive is data warehouse software that lets you read, write and manage huge volumes of data stored in a distributed environment using SQL. It makes it possible to project structure onto data that is already in storage. Users can connect to Hive using a JDBC driver or a command-line tool.
Hive is an open-source system. We can use Hive for analyzing and querying large datasets stored in Hadoop files. Its query language is similar to SQL. The present version of Hive is 0.13.1.
Hive supports ACID transactions: ACID stands for Atomicity, Consistency, Isolation, and Durability. ACID transactions are provided at the row level, with Insert, Delete, and Update operations, which is how Hive supports ACID transactions.
Hive is not considered a full database. The design rules and constraints of Hadoop and HDFS put restrictions on what Hive can do.
Hive is most suitable for the following data warehouse applications:
• Analysis of relatively static data
• Workloads that do not require fast response times
• Data that does not change rapidly
Hive doesn't provide the fundamental features required for OLTP (Online Transaction Processing). Hive is suitable for data warehouse applications over large data sets.
The two types of tables in Hive are:
1. Managed table
2. External table
We can change settings within a Hive session using the SET command. It helps to change Hive job settings for a specific query.
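As a sketch of the JDBC route mentioned above, the following Java snippet connects to a HiveServer2 instance and runs a query. The connection URL, credentials, table name, and the particular SET property are placeholders chosen for illustration, not values from the text.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcExample {
    public static void main(String[] args) throws Exception {
        // Load the Hive JDBC driver (shipped in the hive-jdbc library).
        Class.forName("org.apache.hive.jdbc.HiveDriver");

        // Placeholder connection details for a local HiveServer2 instance.
        try (Connection con = DriverManager.getConnection(
                "jdbc:hive2://localhost:10000/default", "hiveuser", "");
             Statement stmt = con.createStatement()) {

            // Session-level setting, analogous to the SET command described above.
            stmt.execute("SET hive.exec.dynamic.partition=true");

            // Query a hypothetical table.
            try (ResultSet rs = stmt.executeQuery(
                    "SELECT emp_name, emp_id FROM employees LIMIT 10")) {
                while (rs.next()) {
                    System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
                }
            }
        }
    }
}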
2) a) Hadoop Cluster:
Normally, any set of loosely or tightly connected computers that work together as a single system is called a cluster. In simple words, a computer cluster used for Hadoop is called a Hadoop Cluster.
A Hadoop cluster is a special type of computational cluster designed for storing and analyzing vast amounts of unstructured data in a distributed computing environment. These clusters run on low-cost commodity computers.
A Hadoop cluster has 3 components:
1. Client
2. Master
3. Slave
The role of each component is described below.
Client:
It is neither master nor slave; rather, it plays the role of loading the data into the cluster, submitting MapReduce jobs that describe how the data should be processed, and then retrieving the data to see the response after job completion.
Master:
The master consists of three components: NameNode, Secondary NameNode and JobTracker.
NameNode does NOT store the files but only the files' metadata. It is actually the DataNodes, running on the slave nodes described below, which store the files.
JobTracker coordinates the parallel processing of data using MapReduce.
Secondary NameNode: Don't get confused by the name "Secondary". The Secondary NameNode is NOT a backup or high-availability node for the NameNode.
Slave:
Slave nodes make up the majority of machines in a Hadoop cluster and are responsible for:
• Storing the data
• Processing the computations.
2) b) Administration and maintenance in Hadoop:
“Admins are not only there to stop people from doing stupid things, but also to stop them from doing clever things!”
In the Hadoop world, a systems administrator is called a Hadoop Administrator. Hadoop admin roles and responsibilities include setting up Hadoop clusters. Other duties involve backup, recovery and maintenance. Hadoop administration requires good knowledge of hardware systems and an excellent understanding of the Hadoop architecture.
It’s easy to get started with Hadoop administration because Linux system administration is a pretty well-known beast, and
because systems administrators are used to administering all kinds of existing complex applications. However, there are
many common missteps we’re seeing that make us believe there’s a need for some guidance in Hadoop administration.
Most of these mistakes come from a lack of understanding about how Hadoop works. Here are just a few of the common
issues we find.
With increased adoption of Hadoop in traditional enterprise IT solutions and increased number of Hadoop implementations
in production environment, the need for Hadoop Operations and Administration experts to take care of the large Hadoop
Clusters is becoming vital.
What does a Hadoop Admin do day to day?

• Installation and configuration
• Cluster maintenance
• Resource management
• Security management
• Troubleshooting
• Cluster monitoring
• Backup and recovery tasks
• Aligning with the systems engineering team to propose and deploy new hardware and software environments required for Hadoop and to expand existing environments
• Diligently teaming with the infrastructure, network, database, application and business intelligence teams to guarantee high data quality and availability
Some of the potential problems which Hadoop administrators face in day-to-day operations are caused by:
Human error: Humans cause the most common errors in the health of systems or machines. Even a simple mistake can create a disaster and lead to downtime. To avoid these errors it is important to have a proper, proactive diagnostic process in place.
Misconfiguration: Misconfiguration is another problem that Hadoop administrators come across. Even now, after so much development, Hadoop is still considered a young technology.
