
Agenda

Introduction to Pig
Features of Pig
Pig Usage Scenarios
Getting started with Pig
Working with Pig

Introduction to Pig
Pig is a scripting language for exploring large data sets. Pig Latin is a Hadoop extension that simplifies Hadoop programming by providing a high-level data processing language.
Components of Pig
Pig Latin: a language used to express data flows
Pig Engine: an execution engine that runs on top of Hadoop
Advantages of using Pig
Removes the need for users to tune Hadoop
Insulates users from changes in Hadoop interfaces
Increases productivity. In one test:
10 lines of Pig Latin did the work of 200 lines of Java
What takes 4 hours to write in Java takes about 15 minutes in Pig Latin
Opens the system to non-Java programmers
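For a flavor of that conciseness, a complete load-group-count job fits in a few lines of Pig Latin. A minimal sketch (the file name and schema here are hypothetical):
log = LOAD 'input.log' AS (user:chararray, query:chararray);
grpd = GROUP log BY user;
counts = FOREACH grpd GENERATE group AS user, COUNT(log); -- records per user
STORE counts INTO 'user_counts';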

Features of Pig
Pig Latin is a data flow language
Provides support for data types (long, float, chararray, ...), schemas, and functions
Is extensible and supports User Defined Functions (UDFs); see the sketch after this list
Metadata is not required, but is used when available
Operates on files in HDFS
Provides common operations like JOIN, GROUP, FILTER, and SORT
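Extensibility in practice: a UDF packaged in a jar is registered and then called like a built-in function. A sketch (the jar, package, and class names are hypothetical):
REGISTER myudfs.jar;
A = LOAD 'data' AS (line:chararray);
B = FOREACH A GENERATE myudfs.UPPER(line); -- invoke the UDF by its fully qualified name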

Pig Usage Scenarios

Where can you implement Pig?

Web log processing
Data processing for web search platforms
Ad hoc queries across large data sets
Rapid prototyping of algorithms for processing large data sets

Who uses Pig?
Yahoo, one of the heaviest users of Hadoop, runs 40% of all its Hadoop jobs in Pig.
Twitter is another well-known user of Pig.

Getting started with Pig


Install Pig
Initialize environment variables
Configure Pig to run in different modes
Data types
Pig Latin statements

Installing Pig
To install Pig, untar the .gz file using: tar xvzf pig-0.8.1-bin.tar.gz
Pig works in two modes:
Local Mode
No HDFS is required
Everything runs on the local host, against the local file system
MapReduce Mode
To run Pig scripts in this mode, ensure you have access to HDFS
By default, Pig starts in MapReduce mode

Initializing environment variables


To initialize the environment variables, export the following:
export PIG_HADOOP_VERSION=20
(Specifies the version of Hadoop that is running; in this case, it is 0.20.2.)
export HADOOP_HOME=/home/<user-name>/hadoop-0.20.2
(Specifies the Hadoop installation directory. Typically /home/<user-name>/hadoop-<version>.)
export PIG_CLASSPATH=$HADOOP_HOME/conf
(Specifies the classpath for Pig.)
export PATH=$PATH:/home/<user-name>/pig-0.8.1/bin
(Adds the Pig binaries to the PATH variable.)
export JAVA_HOME=/usr
(Specifies the Java home directory.)

Running Pig in Local mode


Execute the command: pig -x local
The following prompt appears:
grunt>
(Pig scripts are written at this grunt prompt, and every statement ends with a semicolon.)
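For instance, a short local-mode session might look like the following (a sketch; /etc/passwd is just a convenient colon-delimited local file):
grunt> A = LOAD '/etc/passwd' USING PigStorage(':') AS (user:chararray);
grunt> B = LIMIT A 3;
grunt> DUMP B; -- prints the first three user names to the screen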

Running Pig in MapReduce (Hadoop) mode

To run Pig scripts in MapReduce mode, do the following:
Move to the pigtmp directory. Copy the excite.log.bz2 file from the pigtmp directory to an HDFS directory:
bin/hadoop fs -copyFromLocal excite.log.bz2 <hdfs path>
Set the PIG_CLASSPATH environment variable to the location of the cluster configuration directory (the directory that contains the core-site.xml, hdfs-site.xml and mapred-site.xml files):
export PIG_CLASSPATH=/clustername/conf
Set the HADOOP_CONF_DIR environment variable to the same location:
export HADOOP_CONF_DIR=/clustername/conf
Start Pig in MapReduce mode as follows:
pig -x mapreduce
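At the grunt prompt, statements now read from and write to HDFS. A sketch (the path is the HDFS location used in the copy step above; the schema matches the excite log used elsewhere in this deck; Pig reads the bzip2 file directly):
grunt> log = LOAD 'excite.log.bz2' AS (user:chararray, time:long, query:chararray);
grunt> lmt = LIMIT log 4;
grunt> DUMP lmt; -- runs as a MapReduce job and prints four records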

Introducing Data Types


A data type is a data storage format that can contain a specific type or range of values. Some common data types include int, long, double, etc.
Pig supports two categories of data types:
Scalar types
Sample: int, long, double, chararray, bytearray
Complex types
Map: an associative array of key/value pairs
Tuple: an ordered list of data; elements may be of scalar or complex type
Bag: an unordered collection of tuples
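Both categories can be declared together in a LOAD schema. A sketch (the file and field names are hypothetical):
-- f1 is a scalar, f2 is a bag of (x, y) tuples, f3 is a map
A = LOAD 'data' AS (f1:int, f2:bag{t:tuple(x:int, y:int)}, f3:map[]);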

Data Types: Relations, Bags, Tuples, Fields

A relation can be defined in terms of the following:
A field is a piece of data.
Ex: 12.5 or hello world
A tuple is an ordered set of fields. It is most often used as a row in a relation,
and is represented by fields separated by commas, all enclosed by parentheses.
Ex: (12.5,hello world,-2)

Map [key#value]
A map is a set of key/value pairs. Keys must be unique and must be a string (chararray). The value can be any type.
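Values in a map are looked up with the # operator. A sketch (the file and key names are hypothetical):
A = LOAD 'users' AS (name:chararray, attrs:map[]);
B = FOREACH A GENERATE name, attrs#'age'; -- look up the 'age' key in each map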

Data Types: Relations, Bags, Tuples, Fields contd..

A bag is an unordered collection of tuples.
Ex: {(12.5,hello world,-2),(2.87,bye world,10)}
A bag is represented by tuples separated by commas, all enclosed by curly braces.
A relation is a special kind of bag, sometimes called an outer bag. An inner bag is a bag that is a field within some tuple.

Data Types contd..

By default, Pig treats data as untyped.
Users can declare data types at load time:
log = LOAD 'excite-small.log'
AS (user:chararray, time:long, query:chararray);
If a data type is not declared but the script treats a value as a certain type, Pig will assume it is of that type and cast it:
log = LOAD 'excite-small.log' AS (user, time, query);
weighted = FOREACH log
GENERATE time * 100; -- time cast to long
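An explicit cast can also be applied when the default is not what you want. A sketch reusing the untyped log relation above:
weighted = FOREACH log GENERATE (long)time * 100; -- cast time explicitly before multiplying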

Pig Latin Statements


A Pig Latin statement is an operator that takes a relation as input and produces another relation as output. This definition applies to all Pig Latin operators except LOAD and STORE, which read data from and write data to the file system.
Pig Latin statements can span multiple lines and must end with a semicolon ( ; ).
Pig Latin statements are generally organized in the following manner:
A LOAD statement reads data from the file system.
A series of "transformation" statements process the data.
A STORE statement writes output to the file system; or, a DUMP statement displays output to the screen.
EXAMPLE:
log = LOAD '/home/user/pig-0.8.1/tutorial/data/excite-small.log' AS (user, time, query);
lmt = LIMIT log 4;
STORE lmt INTO '/home/user/dir1';

Working with data


Pig Latin allows you to work with data in many ways:

Use the FILTER operator to work with tuples or rows of data.
Use the FOREACH operator to work with columns of data.
Use the GROUP operator to group data in a single relation.
Use the COGROUP and JOIN operators to group or join data in two or more relations.
Use the UNION operator to merge the contents of two or more relations (see the short sketch after this list).
Use the SPLIT operator to partition the contents of a relation into multiple relations.
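UNION is the one operator in this list without a worked example later in the deck, so here is a minimal sketch (file names are hypothetical; both inputs share the same schema):
A = LOAD 'data1' AS (f1:int, f2:int);
B = LOAD 'data2' AS (f1:int, f2:int);
C = UNION A, B; -- C contains every tuple of A and every tuple of B
DUMP C;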

Group
Definition: Groups the data in one or more relations
Syntax:
alias = GROUP alias { ALL | BY expression };
Alias: the name of a relation.
ALL keyword: use ALL if you want all tuples to go to a single group; for example, when doing aggregates across entire relations.
B = GROUP A ALL;
BY keyword: use this clause to group the relation by field, tuple, or expression.
B = GROUP A BY f1;
To group using multiple keys, enclose the keys in parentheses:
B = GROUP A BY (key1,key2);

Group Example
A = LOAD 'student' AS (name:chararray, age:int, gpa:float);
DESCRIBE A;
A: {name: chararray,age: int,gpa: float}
DUMP A;
(John,18,4.0)
(Mary,19,3.8)
(Bill,20,3.9)
(Joe,18,3.8)
B = GROUP A BY age;
DUMP B;
(18,{(John,18,4.0),(Joe,18,3.8)})
(19,{(Mary,19,3.8)})
(20,{(Bill,20,3.9)})
C = FOREACH B GENERATE group, COUNT(A);
DUMP C;
(18,2)
(19,1)
(20,1)
C = FOREACH B GENERATE $0, $1.name;
DUMP C;
(18,{(John),(Joe)})
(19,{(Mary)})
(20,{(Bill)})

Filter
Definition: Selects tuples from a relation based on some condition.
Syntax:
alias = FILTER alias BY expression;
FILTER is commonly used to select the data that you want; or, conversely, to filter out (remove) the data you don't want.
Examples
Suppose we have relation A:
A = LOAD 'data' AS (f1:int, f2:int, f3:int);
DUMP A;
(1,2,3)
(4,2,1)
(8,3,4)
(4,3,3)
(7,2,5)
(8,4,3)
In this example the condition states that if the third field equals 3, then include the tuple in relation X.
X = FILTER A BY f3 == 3;
DUMP X;
(1,2,3)
(4,3,3)
(8,4,3)
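Conditions can be combined with AND, OR, and NOT. A sketch against the same relation A:
Y = FILTER A BY (f1 == 8) OR (f2 == 3);
DUMP Y; -- yields (8,3,4), (4,3,3), (8,4,3), in some order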

Filter contd..
For further examples, consider file A (fields separated by whitespace):
1 2 3
4 2 1
8 3 4
4 3 3
7 2 5
8 4 3
and file B:
2 4
8 9
1 3
2 7
2 9
4 6
4 9

Cogroup
Definition: Groups the data in one or more relations.
Note: The GROUP and COGROUP operators are identical. Both operators work with one or more relations. For readability, GROUP is used in statements involving one relation and COGROUP is used in statements involving two or more relations.
Cogrouping is a generalization of grouping: the keys for two (or more) inputs are collected together.
The output of a cogroup statement is (key, bag1, bag2), where key is the group key, bag1 contains a tuple for every record in the first input with that key, and bag2 contains a tuple for every record in the second input with that key.
Syntax:
alias = COGROUP alias { ALL | BY expression } [, alias { ALL | BY expression } ];

Cogroup contd..
The statement
COGROUP A BY f1, B BY $0;
(where f1 is the first field of A) gives the following output:
<1, {<1, 2, 3>}, {<1, 3>}>
<2, {}, {<2, 4>, <2, 7>, <2, 9>}>
<4, {<4, 2, 1>, <4, 3, 3>}, {<4, 6>,<4, 9>}>
<7, {<7, 2, 5>}, {}>
<8, {<8, 3, 4>, <8, 4, 3>}, {<8, 9>}>

Cogroup contd..
Note that some of the bags are empty, which indicates that no tuples from the corresponding input belong to that group. If we only wish to see groups for which both inputs have at least one tuple, we can write:
C = COGROUP A BY $0 INNER, B BY $0 INNER;
Result :
<1, {<1, 2, 3>}, {<1, 3>}>
<4, {<4, 2, 1>, <4, 3, 3>}, {<4, 6>, <4, 9>}>
<8, {<8, 3, 4>, <8, 4, 3>}, {<8, 9>}>

Flatten Operator
FLATTEN un-nests tuples as well as bags. The idea is the same, but the operation and result are different for each type of structure.
For tuples, FLATTEN substitutes the fields of a tuple in place of the tuple. For example, consider a relation that has a tuple of the form (a, (b, c)). The expression GENERATE $0, FLATTEN($1) will cause that tuple to become (a, b, c).

Flatten Operator cont


For bags, the situation becomes more complicated. When we un-nest a bag, we create new tuples. If we have a relation that is made up of tuples of the form ({(b,c),(d,e)}) and we apply GENERATE FLATTEN($0), we end up with two tuples, (b,c) and (d,e). When we remove a level of nesting in a bag, we sometimes cause a cross product to happen. For example, consider a relation that has a tuple of the form (a, {(b,c),(d,e)}), commonly produced by the GROUP operator. If we apply the expression GENERATE $0, FLATTEN($1) to this tuple, we create new tuples: (a, b, c) and (a, d, e).
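A minimal end-to-end sketch of the group-then-flatten pattern described above (the file name is hypothetical):
A = LOAD 'data' AS (f1:int, f2:int);
B = GROUP A BY f1; -- tuples of the form (group, {bag of A tuples})
X = FOREACH B GENERATE group, FLATTEN(A); -- one output tuple per tuple in each bag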

Join
Definition: Performs an inner join of two or more relations based on common field values.
Syntax:
alias = JOIN alias BY {expression|'('expression [, expression ]')'} (, alias BY {expression|'('expression [, expression ]')'} );
The equi-join of A and B on column 0 can be expressed as follows:
JOIN A BY $0, B BY $0;
which is equivalent to:
X = COGROUP A BY $0 INNER, B BY $0 INNER;
FOREACH X GENERATE FLATTEN(A), FLATTEN(B);
The result is:
<1, 2, 3, 1, 3>
<4, 2, 1, 4, 6>
<4, 3, 3, 4, 6>
<4, 2, 1, 4, 9>
<4, 3, 3, 4, 9>
<8, 3, 4, 8, 9>
<8, 4, 3, 8, 9>

DISTINCT
Definition: Removes duplicate tuples in a relation.
Syntax: alias = DISTINCT alias;
For example, suppose we first say
X = FOREACH A GENERATE $2;
As we know from file A above, this will result in X =
<3>
<1>
<4>
<3>
<5>
<3>
Now, if we say
Y = DISTINCT X;
The output is Y =
<1>
<3>
<4>
<5>

CROSS
Definition: Computes the cross product of two or more relations.
Syntax: alias = CROSS alias, alias [, alias ];
Example:
X = CROSS A, B;
Based on the values of A and B given earlier in the document, the result is X =
<1, 2, 3, 2, 4>
<1, 2, 3, 8, 9>
<1, 2, 3, 1, 3>
<1, 2, 3, 2, 7>
<1, 2, 3, 2, 9>
<1, 2, 3, 4, 6>
<1, 2, 3, 4, 9>
<4, 2, 1, 2, 4>
<4, 2, 1, 8, 9>
... (and so on; with 6 tuples in A and 7 in B, X contains 42 tuples in all)

SPLIT
Definition: Partitions a relation into two or more relations.
Syntax:
SPLIT alias INTO alias IF expression, alias IF expression [, alias IF expression ];
The data flow need not be linear:
A = LOAD 'data';
B = GROUP A BY $0;
C = GROUP A BY $1;
SPLIT can also be used explicitly:
A = LOAD 'data';
SPLIT A INTO NEG IF $0 < 0, POS IF $0 > 0;
B = FOREACH NEG GENERATE ...

SPLIT contd..
For example:
SPLIT A INTO X IF $0 < 7, Y IF ($0 > 2 AND $0 != 7);
X =
<1, 2, 3>
<4, 2, 1>
<4, 3, 3>
and
Y =
<4, 2, 1>
<8, 3, 4>
<4, 3, 3>
<8, 4, 3>

Pig Commands

Pig Command     What it does
load            Read data from the file system.
store           Write data to the file system.
foreach         Apply an expression to each record and output one or more records.
filter          Apply a predicate and remove records that do not return true.
group/cogroup   Collect records with the same key from one or more inputs.
join            Join two or more inputs based on a key.
order           Sort records based on a key.
distinct        Remove duplicate records.
union           Merge two data sets.
split           Split data into two or more sets, based on filter conditions.
stream          Send all records through a user-provided binary.
dump            Write output to stdout.
limit           Limit the number of records.
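Several of these commands compose naturally into one pipeline. A sketch (the file names and the top-10 use case are hypothetical):
A = LOAD 'data' AS (f1:int, f2:int, f3:int);
B = ORDER A BY f1 DESC; -- sort by the first field, descending
C = LIMIT B 10; -- keep only the first 10 records
STORE C INTO 'top10'; -- write the result back to the file system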

Conclusion
Opens up the power of MapReduce
Provides common data processing operations
Supports rapid iteration of ad hoc queries
