
IBM Software

Exercise 1
Working with Pig

Copyright IBM Corporation, 2013
US Government Users Restricted Rights - Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM Corp.

Contents
Lab 1 Working with Pig
1.1 Pig basics


Lab 1 Working with Pig
This exercise gives you the opportunity to work with some of the Pig basics so that you can begin to get
familiar with this environment.
After completing this hands-on lab, you'll be able to:
- Execute Pig statements from the Grunt shell
- Execute a Pig script
- Pass parameters to a Pig script
- Load data for using within Pig

Allow 10 minutes to complete this lab.
This version of the lab was designed using the InfoSphere BigInsights 2.1 Quick Start Edition, but has
been tested on the 2.1.2 image. Throughout this lab you will be using the following account login
information. If your passwords are different, please note the difference.

                         Username   Password
VM image setup screen    root       password
Linux                    biadmin    biadmin

1.1 Pig basics
__ 1. If Hadoop is not running, start it and its components using the icon on the desktop.
__ 2. Open a command line. Right-click the desktop and select Open in Terminal.
__ 3. Next start the Grunt shell. Change to the Pig bin directory and start the shell running in
local mode.
cd $PIG_HOME/bin
./pig -x local
__ 4. Read the data from a comma-separated values file,
/home/labfiles/SampleData/books.csv, into a relation called data using the default
PigStorage() loader.
data = load '/home/labfiles/SampleData/books.csv';
__ 5. Next, access the first field in each tuple and write the results out to the console. You
will have to use the foreach operator to accomplish this. I understand that we have not
covered that operator yet. Just bear with me.
b = foreach data generate $0;
dump b;
First of all, let me explain what you just did. For each tuple in the data bag (or relation), you
accessed the first field ($0; remember, positions begin with zero) and projected it as the
single field of a new relation called b.
The data listed shows a tuple on each line and each tuple contains a single character string
that contains all of the data. This may not be what you expected, since each line contained
several fields separated by commas.
__ 6. The problem in the previous step was that the default field separator is the tab character
(\t). Reread the data, but this time specify a comma as the separating character.
Then project the first field in each tuple into a field in a new relation, and dump the new
relation. (You can use the up and down cursor keys to recall previous commands.)
data = load '/home/labfiles/SampleData/books.csv' using PigStorage(',');
b = foreach data generate $0;
dump b;
Note that the comma-separated values are now individual fields in the tuple, and the first
field appears to be a number.
__ 7. What if you wanted to be able to access the fields in the tuple by name instead of only by
position? You have to specify a schema in the LOAD statement.
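The difference between steps 5 and 6 comes down to which delimiter is used to split each line. The following is a minimal Python sketch of that idea, not Pig code and not how PigStorage() is implemented. The function name split_fields and the sample line are hypothetical; the sample line is not taken from books.csv.

```python
# Illustrative Python only: mimics how a delimiter-based loader splits a line.
# split_fields and sample_line are hypothetical names and data.
def split_fields(line, delimiter="\t"):
    """Split one input line into a tuple of fields on the given delimiter."""
    return tuple(line.rstrip("\n").split(delimiter))

sample_line = "1,Some Author,Some Title,2001"

# Default tab delimiter: the line contains no tabs, so it stays one field.
as_tab = split_fields(sample_line)

# Comma delimiter: the same line now splits into four fields.
as_comma = split_fields(sample_line, ",")

print(as_tab)       # a tuple holding the whole line as a single string
print(as_comma)     # a tuple of four separate strings
print(as_comma[0])  # the counterpart of $0 in the Pig statement
```

With the tab delimiter the whole line is the first field, which matches what the first dump showed. With the comma delimiter the first field is just the leading value.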
data = load '/home/labfiles/SampleData/books.csv' using PigStorage(',') as
(booknum:int, author:chararray, title:chararray, published:int);
b = foreach data generate author, title;
dump b;
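As a rough illustration of what the schema in step 7 adds, the Python sketch below attaches a name and a type to each positional field and then projects two fields by name. It is not Pig code. The names SCHEMA and apply_schema and the sample values are hypothetical, and modeling a failed conversion as None is an assumption made for this sketch.

```python
# Illustrative Python only: attach names and types to positional fields.
# SCHEMA mirrors the field list in the LOAD statement; apply_schema is hypothetical.
SCHEMA = [("booknum", int), ("author", str), ("title", str), ("published", int)]

def apply_schema(fields, schema=SCHEMA):
    """Return a dict mapping each schema name to its converted field value."""
    record = {}
    for (name, caster), raw in zip(schema, fields):
        try:
            record[name] = caster(raw)
        except ValueError:
            # Assumption for this sketch: a value that cannot be converted
            # is represented as a null (None).
            record[name] = None
    return record

row = apply_schema(("1", "Some Author", "Some Title", "2001"))

# The counterpart of: b = foreach data generate author, title;
projected = (row["author"], row["title"])
print(projected)
```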
__ 8. The presentation indicated that both relations and fields were case sensitive. Verify this.
Reread the book information with the following schema.
data = load '/home/labfiles/SampleData/books.csv' using PigStorage(',') as
(f1:int, F1:chararray, f2:chararray, F2:int);
That actually worked: f1 and F1 are treated as two different field names.
__ 9. Now dump your data using the following command.
dump DATA;

You should have gotten an error, because the relation is named data, not DATA.
__ 10. This time dump the data but capitalize the dump command.
DUMP data;
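Steps 8 through 10 show two different rules: relation and field names are matched exactly, while keywords such as dump are accepted in any case. The toy Python lookup below sketches those two rules. It is not how Grunt is implemented, and KEYWORDS, relations and run_dump are hypothetical names.

```python
# Illustrative Python only: a toy lookup, not how Grunt is implemented.
KEYWORDS = {"dump", "load", "foreach"}
relations = {"data": [("1", "Some Author")]}  # hypothetical relation contents

def run_dump(statement):
    """Resolve a two-word statement such as 'DUMP data;'."""
    keyword, alias = statement.rstrip(";").split()
    if keyword.lower() not in KEYWORDS:  # keywords: compared case-insensitively
        raise ValueError("unknown keyword: " + keyword)
    if alias not in relations:           # aliases: compared exactly
        raise KeyError("undefined alias: " + alias)
    return relations[alias]

print(run_dump("DUMP data;"))  # succeeds: only the keyword changed case
```

Calling run_dump("dump DATA;") raises KeyError in this sketch, which corresponds to the error seen in step 9.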
__ 11. Terminate your Grunt shell.
Although the up and down cursor keys recall previous commands, I have found that when a
recalled command spans two lines, I cannot move the cursor to the first line.
Also, I like using a script because I can copy and paste previous commands.
__ 12. Create a script that contains Pig commands. Open another command line and execute
gedit &
__ 13. You can pass parameters to a Pig script. Create a parameter that passes the directory
structure for your data. In the gedit edit area, type the following:
dir = /home/labfiles/SampleData
__ 14. Save your file as /home/biadmin/myparams.
__ 15. Next, in gedit, open a new edit window. Type the LOAD command that will read data from
a file called /home/labfiles/SampleData/pig_bookreviews.json.
As the file extension indicates, this is a JSON file. This requires you to use the
JsonLoader() function.
__ 16. Type the following. Note $dir references your parameter.
data = load '$dir/pig_bookreviews.json' using JsonLoader();
dump data;
__ 17. Save your work to /home/biadmin/pig.script.
__ 18. From the command line, invoke your Pig script and pass in your parameter file.
./pig -x local -param_file ~/myparams ~/pig.script
You should get an error. It says that Pig could not find a schema file. The JsonLoader()
requires a schema. There are no defaults. If you do not code the schema in the LOAD
statement, then the function is expecting to find a schema file in the same directory as
your data.
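Before the loader ever runs, the $dir reference in the script is replaced with the value from the parameter file. The Python sketch below illustrates that substitution on the script text. It is not Pig's preprocessor, and parse_params and substitute are hypothetical names. As a simplification, this sketch leaves an undefined parameter untouched.

```python
import re

# Illustrative Python only: a simplified model of parameter substitution.
def parse_params(text):
    """Parse 'name = value' lines, skipping blank lines and # comments."""
    params = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name, _, value = line.partition("=")
        params[name.strip()] = value.strip()
    return params

def substitute(script, params):
    """Replace each $name with its value; unknown names are left as-is."""
    return re.sub(r"\$(\w+)",
                  lambda m: params.get(m.group(1), m.group(0)),
                  script)

params = parse_params("dir = /home/labfiles/SampleData\n")
script = "data = load '$dir/pig_bookreviews.json' using JsonLoader();"
expanded = substitute(script, params)
print(expanded)
```

The expanded statement contains the full path to pig_bookreviews.json, which is the directory where the loader then looks for a schema file.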
__ 19. It probably would be more informative if you were to look at the format of the data in the
pig_bookreviews.json file so that you can see how the schema matches. For those of you
who are not really motivated, here is a partial look at the data. The important thing to note
is that each JSON record is on a single line.
[{"author": "J. K. Rowlings", "title": "The Sorcerers Stone", "published": 1997,
"reviews": [{"name": "Mary", "stars": 5},{"name": "Tom", "stars": 5}]},
{"author": "David Baldacci", "title": "First Family", "published": 2009,
"reviews": [{"name": "Andrew", "stars": 4},{"name": "Katie", "stars":
4},{"name": "Scott", "stars": 5}]}]
__ 20. Modify your Pig script so that the JsonLoader() has a schema.
data = load '$dir/pig_bookreviews.json' using
JsonLoader('author:chararray, title:chararray, published:int, reviews:{(name:chararray, stars:int)}');
dump data;
__ 21. Run your script again. This should have solved your problem, and this ends the exercise.
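To see how the sample records in step 19 line up with a schema, the Python sketch below reads one JSON record per line and turns it into a tuple whose last field is a list of (name, stars) tuples, which plays the role of a bag. It is not Pig code and does not use JsonLoader(). The name to_tuple is hypothetical, and the record is shortened from the sample shown in step 19.

```python
import json

# Illustrative Python only: one JSON record per line, shortened from step 19.
lines = [
    '{"author": "David Baldacci", "title": "First Family", "published": 2009, '
    '"reviews": [{"name": "Andrew", "stars": 4}, {"name": "Katie", "stars": 4}]}',
]

def to_tuple(line):
    """Convert one JSON line to (author, title, published, reviews)."""
    rec = json.loads(line)
    # The nested reviews array becomes a list of (name, stars) tuples.
    reviews = [(r["name"], r["stars"]) for r in rec["reviews"]]
    return (rec["author"], rec["title"], rec["published"], reviews)

records = [to_tuple(line) for line in lines]
print(records[0])
```

Each top-level key corresponds to one field, and the nested reviews array is the part that needs a nested description in a schema.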

End of exercise


Copyright IBM Corporation 2013.
The information contained in these materials is provided for
informational purposes only, and is provided AS IS without warranty
of any kind, express or implied. IBM shall not be responsible for any
damages arising out of the use of, or otherwise related to, these
materials. Nothing contained in these materials is intended to, nor
shall have the effect of, creating any warranties or representations
from IBM or its suppliers or licensors, or altering the terms and
conditions of the applicable license agreement governing the use of
IBM software. References in these materials to IBM products,
programs, or services do not imply that they will be available in all
countries in which IBM operates. This information is based on
current IBM product plans and strategy, which are subject to change
by IBM without notice. Product release dates and/or capabilities
referenced in these materials may change at any time at IBM's sole
discretion based on market opportunities or other factors, and are not
intended to be a commitment to future product or feature availability
in any way.
IBM, the IBM logo and ibm.com are trademarks of International
Business Machines Corp., registered in many jurisdictions
worldwide. Other product and service names might be trademarks of
IBM or other companies. A current list of IBM trademarks is
available on the Web at Copyright and trademark information at