
Data Manipulation with UNIX

Introduction
Who is this tutorial for?
This course is about manipulating data using the Unix command line utilities. The kind of data files it deals with
are plain text files containing data in row and columns - the layout you would see in a simple database or
spreadsheet. If you can produce plain text data in this format then you can use these techniques to process it.
To complete this course you will need basic command line Unix skills. Specifically you should be happy with the
following commands:

ls

cat

less

head

tail

cd

mkdir

rm

rmdir
You will also need to understand the general idea of redirection and piping.
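If you need a quick reminder of those two ideas, here is a minimal sketch (the file names are just placeholders):

head -5 somefile.txt > preview.txt
cat somefile.txt | wc -l

The first line redirects the output of head into a new file instead of the screen; the second pipes the output of cat into wc, which counts the lines.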

Why read this tutorial?


Besides learning that you can do quite a lot of useful data cleansing, manipulation, and simple reporting from the
command line, this course will improve your general confidence in using Unix.

Displaying file contents


I assume that we are dealing with record oriented data where each line of the file analysed is a case or record - a
collection of data items that belong together.
First let's check the contents of your files with less or head. This will give you a clue to their format. We will start
with a comma delimited file called results.csv. Each line has the following structure

Surname, Sex, Maths_score, English_score, History_score

These five fields are separated by commas. This is a very common file type and easy to work with, but it has a
disadvantage: you may have text fields that contain commas as data. In those cases it is easiest to use another
character as a field delimiter when you create the data (you may be able to do this, for example, with Excel; if you
cannot, you may have to do some clever data munging). I will cover changing the delimiter later.
You can check the file contents with either

less results.csv
or

head results.csv
When we view the file with head we see

ADAMS,1,55,63,65
ALI,1,52,46,35
BAGAL,1,51,58,55
BENJAMIN,1,59,70,68
BLAKEMORE,1,56,38,40
(With less you see a screen at a time, whereas head displays only the first few lines.)


Counting Data Items


First, let's make some simple counts on this file. We can use wc to count characters, words (anything
surrounded by whitespace) and lines

wc results.csv
wc -w results.csv
wc -c results.csv
wc -l results.csv

When we are dealing with record oriented data like ours wc -l will display the number of records.

Selecting Data Items


Selecting Rows
Next, we will select some data using grep. The grep command searches a file line by line. Try the following
command

grep '^R' results.csv
This will display only three rows of the file. The expression in quotes is the search string. If we wish we can
direct the output of this process to a new file, like this

grep '^R' results.csv > outputfile.txt


This command line uses the redirect output symbol. In Unix the default output destination is the screen, and it's
known as stdout (when it needs naming). The default input source is the keyboard, known as stdin. So when
data is coming from or going to anywhere else, we use redirection. We use redirection with > to pass the results
of a process to a new output or with < to get data from a new input. If we want to append the data to the end of
an existing file (as new rows) we use >> instead of >.
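For example, this sketch first writes the R surnames to a new file and then appends the B surnames to the same file (the output file name is just an illustration):

grep '^R' results.csv > r-and-b.txt
grep '^B' results.csv >> r-and-b.txt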
We can use a similar command line to count the rows selected, but this time let's change the grep command
slightly.

grep '^[RB]' results.csv | wc -l


This command line uses the pipe symbol. We use piping with | to pass the results of one process to another
process. If we only wanted a count of the lines that match then instead of piping the result to wc we could use
the -c parameter on grep, like this

grep -c '^[RB]' results.csv


We have used the character class, indicated by [ and ], containing R and B, and grep will succeed if it finds any of
the characters in the class. We enclose this regular expression in single quotes. Also notice that in the cases
above we have used the anchor ^ to limit the match by grep to the start of a line. The anchor $ would limit the
search to matches at the end of the line.
We use grep in this way to select row data.
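For instance, because the History score is the last thing on each line, a small sketch using the $ anchor is

grep '5$' results.csv

which selects the records whose final score ends in a 5.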

More About Searching


The standard form of a basic grep command is

grep [options] 'search_expression' filename


Typically the search expression is a regular expression. The simplest type of expression is a string literal - a
succession of characters each treated literally, that is to say standing for themselves and nothing else. If the
string literal contains a space, we will need to surround it by single quote marks. In our data we might look for the
following

grep 'de souza' results.csv


The next thing to learn is how to match a class of characters rather than a specific character. Consider

grep '[A-Z]' results.csv


This matches any uppercase alphabetic character. Similarly

grep '[1-9]' results.csv



matches any numeric character. In both these cases any single character of the right class causes a successful
match. You can specify the class by listing as well. Consider

grep '[perl]' results.csv


This matches any character from the list p, e, r, l (the order in which they are listed is immaterial).
You can combine a character class and a literal in a search string. Consider

grep 'Grade [BC]' someresults.csv


This search would find lines containing Grade B and lines containing Grade C. Notice that combining literals and
classes means I need quotes.
You can also search using special characters as wildcards. The character . for example, used in a search stands
for any single character except the newline character. So the search

grep '.' results.csv
succeeds for every non-empty line. (If . matched the newline character it would succeed for empty lines as well).
The character * stands for zero or any number of repetitions of the preceding character. So

grep 'a*' results.csv

matches

a
aa
aaa
and so on. Notice the blank line there? Probably not, but it's there. This regular expression matches zero or
more instances of the preceding character.
Suppose that I wish to find a string that contains any sequence of characters followed by, for example, m. The +
repetition operator needs extended regular expressions, so we pass grep the -E option:

grep -E '.+m' results.csv


This is a greedy search: it is not satisfied with the very first successful match; it continues past the first match it
finds to match the longest string it can. For now we will just accept this greedy searching, but if you investigate
regular expressions further you will discover that some versions have non-greedy matching strategies available.
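If your grep supports the -o option (GNU grep does), you can see the greediness directly, because -o prints only the matched text rather than the whole line:

grep -oE '.+m' results.csv

Each piece of output is the longest match that could be made on its line (with our all-uppercase surnames you might substitute a capital letter for the m).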

Selecting Columns
We can also select columns. Because this is a delimited file we can split it into columns at each delimiter - in this
case a comma. This is equivalent to selecting fields from records.
Suppose that we want to extract column three - the Maths score - from our data. We do this with the cut command. Here's an
example

cut -d, -f3 results.csv | head


The first few lines of the resulting display are

55
52
51
59
56
We can display several columns like this

cut -d, -f1-4 results.csv


which displays a contiguous range of columns, or

cut -d, -f1,4 results.csv


which displays a list of separate columns. The -d option on cut specifies the delimiter (your system will have a
default if you don't specify one - find out what it is!) and the -f option specifies the column or field number. We use cut
in this way to select column data.
The general form of the cut command is

cut -ddelimiter -ffieldnumbers datafile


So in the examples, we specified comma as the delimiter and used field 3, the range of fields 1 to 4, and fields 1 and 4.

Selecting Columns and Rows


Suppose that we want to select just some columns for only some rows. We do this by first selecting rows with
grep and passing this to cut to select columns. You can try

grep '^[AR]' results.csv | cut -d, -f1,4 | less


Again, we use piping to pass the results of one process to another. You could also redirect the output to a new
file.

grep '^[AR]' results.csv | cut -d, -f1,4 > resultsa-r.txt

Transforming Data
There is another comma delimited file called gradedresults.csv which has the following structure

Surname, Mean_score, Grade

Currently the grade is expressed as an alphabetic character. You should check this by viewing the surnames
and grades from this file. The command is

cut -d, -f1,3 gradedresults.csv


We can translate the alphabetic grade into a numeric grade (1=A, 2=B etc.) with the command tr. Try this

tr 'A' '1' < gradedresults.csv

Be careful, though: tr translates individual characters, not strings, so this changes every A in the file - including the A
in a surname such as ADAMS - into a 1. tr cannot anchor the change to a particular field; for that you need sed,
covered later, where a pattern such as ,A$ anchors the match to the grade at the end of the line.
In the example tr gets its input from the file by redirection. You can translate several characters at once by listing
them in the two sets - each character in the first set is replaced by the character at the same position in the second
set. For example

tr 'ABC' '123' < gradedresults.csv | less


You can use special characters in a tr command. For example to search for or replace a tab there are two
methods:
1. use the escape string \t to represent the tab

2. at the position in the command line where you want to insert a tab, first type control-v (^v) and then
press the tab key.

There are a number of different escape sequences (1 above) and different control sequences (2 above) to
represent special characters: for example \n represents a new line, ^M is the carriage return character, and in
Perl-style regular expressions \s stands for white space. In general the escape sequence is easier to use.
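As a small example of the escape sequence in use, this sketch converts results.csv from comma delimited to tab delimited (the output file name is just an illustration):

tr ',' '\t' < results.csv > results.tsv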

Sorting
Alphabetically
Unix sorts alphabetically by default. This means that 100 comes before 11.

On Rows
You can sort with the command sort. For example

sort results.csv | less



This sorts the file in UNIX order on each character of the entire line. The default alphanumeric sort order means
that the numbers one to ten would be sorted like this
1, 10, 2, 3, 4, 5, 6, 7, 8, 9
This makes perfect sense but it can be a surprise the first time you see it.

Descending
You can sort in reverse order with the option -r. Like this

sort -r results.csv | less

Numerically
To force a numeric sort, use the option -n.

sort -n results.csv
You can use a sort on numeric data to get maximum and minimum values for a variable. Sort, then pipe to head
-1 and tail -1, which will produce the first and last records in the file.
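For example, a sketch that prints the lowest and then the highest Maths score (field three):

cut -d, -f3 results.csv | sort -n | head -1
cut -d, -f3 results.csv | sort -n | tail -1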

On Columns
To sort on columns you must specify a delimiter with -t and a field number with -k. To sort on the third column of
the results data (the Maths score), try this

sort -n -t , -k3 results.csv | less


(I've used a slightly more verbose method of specifying the delimiter here). You can select rows after sorting, like
this

sort -n -t , -k3 results.csv | grep '^[A]' | less


which shows those pupils with surnames beginning with A, sorted on the third field of the data file.
To sort on multiple columns we use more than one -k parameter. For example, to sort first on Maths score and
then on surname we use

sort -t , -k3n -k1 results.csv | less

Finding Unique Values in Columns


Suppose that you want to know how many different values appear in a particular column. With a little work, you
can find this out using the command uniq. Used alone, uniq tests each line against what preceded it before
writing it out and ignores duplicate lines.
Before we try to use uniq we need a sorted column with some repeated values. We can use cut to extract one.
Test this first

cut -d, -f3 results.csv | less


This should list just the third column of data (the Maths score), which has a few duplicate values.
We pass the output through sort and then to uniq

cut -d, -f3 results.csv | sort | uniq | less


to get data in which the adjacent duplicates have been squeezed to one.
We can now pipe this result to wc -l to get the count of unique values.

cut -d, -f3 results.csv | sort | uniq | wc -l
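uniq also has a -c option that prefixes each value with the number of times it occurs, so a sketch of a simple frequency table for that column is

cut -d, -f3 results.csv | sort | uniq -c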

Joining Data Files


There are two UNIX commands that will combine data from different files: paste and join. We will look first at
paste.


Paste
Paste has two modes of operation depending on the option selected. The first operation is simplest: paste takes
two files and treats each as column data and appends the second to the first. The command is

paste first_file second_file


Consider this file:

one
two
three

Call this first_file. Then let this

four five six
seven eight nine
ten eleven twelve

be second_file. The output would be

one    four five six
two    seven eight nine
three    ten eleven twelve

So paste appends the columns from the second file to the first, row by row, separating them with a tab by default. As
with other commands you can redirect the output to a new file:

paste first_file second_file > new_file


The other use of paste is to linearize a file. Suppose I have a file in the format

Jim
Tyson
UCL
Information Services
You can create this in a text editor. I can use paste to merge the four lines of data into one line

Jim Tyson UCL Information Services


The command is

paste -s file
As well as the -s option, I can add a delimiter character with -d. Try this

paste -d: -s file
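With the four-line file above, this joins the lines with a colon instead of the default tab, so the output should be

Jim:Tyson:UCL:Information Services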

Join
We have seen how to split a data file into different columns and we can also join two data files together. To do
this there must be a column of values that match in each file and the files must be sorted on the field you are
going to use to join them.
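The sample data shown earlier is already in surname order, but in general you would prepare sorted copies first - a sketch, with illustrative output names:

sort -t, -k1,1 results.csv > results.sorted.csv
sort -t, -k1,1 gradedresults.csv > gradedresults.sorted.csv

You would then run join on the sorted copies.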
We start with files where for every row in file one there is a row in file two and vice versa.
Consider our two files. The first has the structure

Surname, Gender, Maths score, English score, History score

The second

Surname, Mean score, Grade

We can see then that these could be joined on the column surname with ease since surname is unique. After
sorting both files we can do this with the command line

join -t, -j1 results.csv gradedresults.csv | less


The option -t specifies the delimiter and -j allows us to specify a single field number when this is the shared field in both files.


If the columns on which to match for joining don't appear in the same position in each file, you can use the -jn m
option several times, where in each case n is the numeric file handle (determined by the order in which you name
the files on the command line) and m is the number of the join field. In fact, we could write

join -t, -j1 1 -j2 1 results.csv gradedresults.csv | less


for the same result as our previous join command.
Essentially, join matches lines on the chosen fields and adds column data. We could send the resulting output to
a new file with > if we wished.
In my example there is (deliberately) one line in file one for each line in file two. There is of course no guarantee
that this will be the case. To list all the lines from a file regardless of a match being found, we use the option -a
and the file handle number.

join -t, -a1 -j1 1 -j2 1 results.csv gradedresults.csv | less


This would list every line of results.csv and only those lines of gradedresults.csv where a match is found.
By default, join displays only items having a matching element in both files. With -a1, as above, we produce a
join where all the rows from the first file named and only the matching rows from the second are selected. Next
we can produce a version where all the rows of the second file are listed, with only matching rows from the first,
using the following

join -t, -a2 -j1 1 -j2 1 results.csv gradedresults.csv | less


And lastly, we can produce all rows from both files, matching or not with

join -t, -a1 -a2 -j1 1 -j2 1 results.csv gradedresults.csv | less


The last thing we should learn about join is how to control the output. The option -o allows us to choose which
data fields from each file are displayed. For example

-o 0,1.2,2.3
displays the match column (always denoted 0), the second column from the first file (1.2) and the third column
from the second file (2.3).
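Putting this together on our data, a sketch that displays the surname (the join field), the Maths score from the first file and the grade from the second is

join -t, -j1 -o 0,1.3,2.3 results.csv gradedresults.csv | less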

sed and AWK - more powerful searching and replacing


sed
Sed is a powerful Unix tool and there are books devoted to explaining it. The name stands for stream editor, a
reminder that it reads and processes files line by line. One of the basic uses of sed is to search a file - much like
grep does - and replace the search expression with some other text specified by the user. An example may
make this clearer

sed 's/abc/def/g' input


After the command name, we have s for substitute, followed by the search string and then the replacement string,
surrounded and separated by /, and then g indicating that this operation is global - we are looking to process every
occurrence of abc in this file. The filename follows, in this case a file called input.
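As a small illustration on our data - picking up the earlier point about anchoring a replacement to the grade field - this sketch turns a grade A at the end of each line of gradedresults.csv into a 1, without touching the A in any surname:

sed 's/,A$/,1/' gradedresults.csv | less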
Some sed Hacks
Rather than pretend to cover sed in any real depth, there follows a very short list of sed tricks that are sometimes
useful in processing data files. These are famous sed one-liners and are listed by Eric Pement on his website at
http://www.pement.org/sed/sed1line.txt.

sed G
Double spaces the file. It reads a line and G appends a newline character. Remember that reading input a line at
a time is basic to sed's operation.

sed '/^$/d;G'
Double spaces a file that already has some blank lines. First remove an empty line then append a newline.

sed 'G;G'
Triple spaces the file.


sed 'n;d'
This removes double line spacing - and does it in a rather crafty way. Assuming that the first line read is not
blank, then all even-numbered lines should be blank, so alternately printing a line out and deleting a line should
result in a single spaced file.

sed '/regex/{x;p;x;}'
This command puts a blank line before every occurrence of the search string regex.

sed -n '1~2p'
This command prints only the odd-numbered lines of a file - equivalently, it deletes the even-numbered lines. (The first~step address syntax is a GNU sed extension.)
I leave the investigation of more sed wizardry to you.

AWK
AWK is a programming language developed specifically for text data manipulation. You can write complete
programs in AWK and execute them in much the same way as a C or Java program (AWK is interpreted, though,
not compiled like C or byte-code compiled like Java).
AWK allows for some sophisticated command line manipulation and I will use a few simple examples to illustrate.
Because our file is comma delimited, we will invoke AWK with the option -F,. AWK will automatically identify the
columns of data and put the fields, a row at a time, into its variables $1, $2, ... $NF. NF holds the number of fields,
so $NF always identifies the last field of data.
So, we can try

awk -F, '{print $2, $NF}' results.csv


We can also find text strings in a particular column, for example with

awk -F, '$n ~ /searchtext/' results.csv


where n in $n is a column number.
The ~ means matches. The expression !~ means does not match.
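For example, a sketch that prints only the rows whose surname begins with AD is

awk -F, '$1 ~ /^AD/' results.csv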
Conditional processing in simple cases can be carried out by just stating the condition before the block of code to
be executed (that is inside the braces). For example

awk -F, '$3 > 55 {print $3}' results.csv


And we can create complex conditions

awk -F, '$3 > 50 || $4 < 50 {print $4}' results.csv


The || means OR and && means AND in awk.
But we can construct more complex processes quite easily. The following code won't be difficult to understand if
you know any mainstream programming language

cut -d, -f3-5 results.csv | awk -F, '{sum = 0; for (i = 1; i <= NF; i++) sum += $i; print sum}'
This code sums the three numeric fields on each row and prints out the result.
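In the same style, a sketch that averages a single column - here the Maths scores - by keeping a running total and dividing by the number of lines at the end:

cut -d, -f3 results.csv | awk '{ sum += $1; n++ } END { print sum / n }'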
As with sed there is a website for useful awk one-liners by Eric Pement at
http://www.pement.org/awk/awk1line.txt

In-line Perl - the Swiss army chainsaw of Unix data manipulation


The Perl programming language has always provided sophisticated data manipulation functions. Learning Perl
would be an even bigger project than learning sed but it is worth knowing at least something about using Perl in
line.


It is possible to use Perl code on the command line. Consider the simple Perl statement

print "hello"
(Programmers among you, notice that I omit the ;). We can execute this directly from the Unix prompt by
invoking the Perl interpreter with the option -e. Try this

perl -e 'print "Hello"'


or better

perl -e 'print "Hello\n"'


You remember \n: it gets us a nice new line. This way of running a Perl program combined with what we can
already do opens up the possibility of sophisticated data transformation. But still the -e option runs a single
(though possibly complex) Perl statement just once. So, we can have

perl -e '$number = 5; $number >= 4 ? print $number : print "less than four"'


Can you work out what you think the result should be? Try it.
The example makes use of the popular but initially puzzling ternary operator, which is a kind of shorthand way of
writing a conditional statement. Here the conditional is read
if $number is greater than or equal to four, print $number, else print the string
less than four
The real value of in-line programming comes when we learn that we can loop through the output of other
command line operations and execute Perl code. We do this with the option -n. Here is an example

cut -d, -f3 results.csv | perl -ne '$_ >= 55 ? print "well done\n" : print "what a shame\n"'


Or we could do some mathematics

cut -d, -f3 results.csv | perl -ne '$n += $_; END { print "$n\n" }'
which will sum the column of numbers.
Another very useful Perl function for command line use is split. In Perl, split takes a string and divides it into
separate data items at a delimiter character and then it puts the results into an array.
To illustrate this try the following

perl -ne '@fields = split(/,/, $_); print $fields[0], "\t", $fields[1], "\t", $fields[2], "\n"' results.csv
In this example the input from each line ($_ in Perl) is split at the comma (/,/ - the slashes are delimiters that mark
out the comma character we are splitting on). The next statement prints the first three fields from the resulting
array, separated by tabs, and ends with a new line. This example uses the escape sequence for tab again: \t.

Final Exercise
To practise and consolidate try the following:
1. Take the original results.csv data and find the average mark for each column of examination marks.
Can you see a way to write this set of values to the end of the file on a row labelled averages? (Hint: >
writes the data from a process to a new file but >> appends it to the end of an existing file.)

2. Take the original results.csv and find the average examination mark for each pupil. Can you add this
new column of data to the original file?

3. Take the original results.csv and find the average examination mark for each pupil and, on the basis of
the following rule, assign them to a stream:
If the average exam mark is greater than or equal to 60 the student is in stream A;
else if the average exam mark is greater than or equal to 50 the student is in stream B;
else the student is in stream C.

Create a new file that includes these two new data items for each pupil.

[TO DO: Add solution.]


THE END.

Index of commands and symbols covered

^
^M  the carriage return character

|
|  pass output to a further command

<
<  take input from the source specified following

>
>  send output to the destination specified following

A
anchor  tie a search to a place in a file or line
$  end of line anchor
AWK  text processing language

C
cut  cut a column of data from a file

G
grep  search a file using regular expressions

H
head  display opening lines of a file

L
less  display a file a screen at a time

N
n  sed command used in the one-liners

P
paste  join two files linearly
Perl  interpreted language that is good for text processing
print  Perl and AWK command to display data on-screen

S
sed  stream editor for text processing
sort  sort data files

T
tr  translate characters in a file

W
wc  count words, lines or characters in a file
wildcard  generalise a search string
*  a wildcard for zero or more characters
