Introduction
Who is this tutorial for?
This course is about manipulating data using the Unix command line utilities. The kind of data files it deals with are plain text files containing data in rows and columns: the layout you would see in a simple database or spreadsheet. If you can produce plain text data in this format then you can use these techniques to process it.
To complete this course you will need basic command line Unix skills. Specifically you should be happy with the
following commands:
ls
cat
less
head
tail
cd
mkdir
rm
rmdir
You will also need to understand the general idea of redirection and piping.
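As a quick refresher, here is a minimal sketch (the file names are invented for the demonstration): `>` sends a command's output to a file, while `|` sends it to another command.

```shell
printf 'cherry\napple\nbanana\n' > fruit.txt   # > writes (or overwrites) fruit.txt
sort fruit.txt | head -2 > top2.txt            # pipe sorted output into head, then redirect
cat top2.txt                                   # shows the first two lines in sorted order
```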
The course data is in a comma delimited file called results.csv with the following structure
Surname,
Sex,
Class,
Maths_score,
English_score,
History_score
The fields are separated by commas. This is a very common file type and easy to work with, but it has a disadvantage: you may have text fields that contain commas as data. In those cases it is easiest to use another character as a field delimiter when you create the data (you may be able to do this, for example, with Excel; if you cannot, you may have to do some clever data munging). I will cover changing the delimiter later.
You can check the file contents with either
less results.csv
or
head results.csv
When we view the file with head we see
ADAMS,1,55,63,65
ALI,1,52,46,35
BAGAL,1,51,58,55
BENJAMIN,1,59,70,68
BLAKEMORE,1,56,38,40
(With less you see a screen at a time, whereas head displays only the first few lines.)
Introduction 1
You can count the contents of a file with wc:
wc results.csv
wc -w results.csv
wc -c results.csv
wc -l results.csv
The first form prints the line, word and character counts together; -w counts only words, -c only characters and -l only lines. When we are dealing with record-oriented data like ours, wc -l will display the number of records.
You can search a file with grep, which takes a search expression and a file name. Try
grep '^R' results.csv
This will display only three rows of the file: the ^ anchors the search to the start of the line, so only surnames beginning with R match. The expression in quotes is the search string. If we wish we can direct the output of this process to a new file, like this
grep '^R' results.csv > r_surnames.csv
The character . matches any single character and the class [0-9] matches any numeric character. In both these cases any single character of the right class causes a successful match. You can also specify a class by listing its members, for example [aeiou]. Consider
grep . results.csv
succeeds for every non-empty line. (If . matched the newline character it would succeed for empty lines as well).
The character * stands for zero or any number of repetitions of a character. So
grep 'a*' results.csv
matches
a
aa
aaa
and so on. Notice the blank line there? Probably not, but it's there. This regular expression matches zero or more instances of the preceding character, so it also matches the empty string.
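One caution worth a quick check: quote regular expressions, otherwise the shell may expand an unquoted * as a filename wildcard before grep ever sees it. A small sketch with an invented file:

```shell
printf 'banana\nkiwi\n\n' > sample.txt
grep -c 'a*' sample.txt    # a* also matches the empty string, so every line matches
grep -c 'aa*' sample.txt   # at least one a required: only banana matches
```

The -c option makes grep print the count of matching lines rather than the lines themselves.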
Suppose that I wish to find a string that contains any sequence of characters followed by, for example, m. The grep command would be
grep '.*m' results.csv
Selecting Columns
We can also select columns. Because this is a delimited file we can split it into columns at each delimiter - in this
case a comma. This is equivalent to selecting fields from records.
Suppose that we want to extract the column of maths scores (the third field) from our data. We do this with the cut command. Here's an example
cut -d, -f3 results.csv
which displays
55
52
51
59
56
We can display several columns by giving a list of fields, like this
cut -d, -f1,3 results.csv
which displays a list of separate columns. The -d option on cut specifies the delimiter (your system will have a default if you don't specify one; find out what it is!) and the -f option specifies the column or field numbers. We use cut in this way to select column data.
The general form of the cut command is
cut -d <delimiter> -f <field list> <file>
Transforming Data
There is another comma delimited file called gradedresults.csv which has the following structure
Surname,
Mean_score,
Grade
Currently the grade is expressed as an alphabetic character. You should check this by viewing the surnames and grades from this file. The command is
cut -d, -f1,3 gradedresults.csv
Suppose now that we want to recode the grade A as the number 1. One tool for this is tr, which translates characters. Consider
tr ,A ,1 < gradedresults.csv
Notice that I included a leading comma in the search and replace strings, intending to catch just the field containing A. Be careful though: tr translates single characters, not strings, so this command maps every comma to a comma and every A to a 1 wherever A occurs, including inside a surname such as ADAMS. To catch just the grade it is better to anchor a substitution to the end of the line with A$, which needs a substitution tool such as sed rather than tr.
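A minimal sketch of the anchored approach in sed, using invented sample rows (tr itself cannot anchor a match, since it translates single characters):

```shell
# Invented two-row sample of a graded results file.
printf 'ADAMS,61,A\nALI,44,C\n' > grades.csv
# Anchor the substitution to the line end so only the grade field changes;
# the A inside the surname ADAMS is untouched.
sed 's/A$/1/' grades.csv
```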
In the example tr gets its input from the file by redirection. You can translate several characters at once by giving longer sets on the command line: each character in the first set is mapped to the character in the same position in the second set. For example
tr AB 12 < gradedresults.csv
translates every A to 1 and every B to 2.
To put a literal tab character into a command you can either:
1.
use the escape sequence \t at the position where you want the tab, or
2.
at the position in the command line where you want to insert a tab, first type control-v (^v) and then
press the tab key.
There are a number of different escape sequences (1 above) and there are different control sequences (2 above) to represent special characters: for example \n, or sometimes ^M, to represent a new line, and \s for white space. In general the escape sequence is easier to use.
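One common use for this is changing the field delimiter, for instance from comma to tab. A sketch using the \t escape with tr and an invented sample file:

```shell
printf 'ADAMS,1,55\nALI,1,52\n' > mini.csv   # invented two-row sample
tr ',' '\t' < mini.csv > mini.tsv            # the \t escape stands for a tab
cat mini.tsv
```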
Sorting
Alphabetically
Unix sorts alphabetically by default. This means that 100 comes before 11.
On Rows
You can sort with the command sort. For example
sort results.csv
This sorts the file in UNIX order on each character of the entire line. The default alphanumeric sort order means
that the numbers one to ten would be sorted like this
1, 10, 2, 3, 4, 5, 6, 7, 8, 9
This makes perfect sense but it can be a surprise the first time you see it.
Descending
You can sort in reverse order with the option -r. Like this
sort -r results.csv
Numerically
To force a numeric sort, use the option -n.
sort -n results.csv
You can use a sort on numeric data to get maximum and minimum values for a variable. Sort, then pipe to head -1 and tail -1, which will produce the first and last records in the file.
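As a sketch, using the five maths scores from the sample rows shown earlier as a one-column file:

```shell
# The five maths scores from the head output, one value per line.
printf '55\n52\n51\n59\n56\n' > scores.txt
sort -n scores.txt | head -1   # minimum
sort -n scores.txt | tail -1   # maximum
```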
On Columns
To sort on columns you must specify a delimiter with -t and a field number with -k. To sort on the third column of the results data, try this
sort -t, -k3 -n results.csv
(the -n is included because the third column is numeric).
Paste
Paste has two modes of operation depending on the option selected. The first operation is simplest: paste takes two files, treats each as column data, and appends the second to the first. The command is
paste first_file second_file
Suppose the first file contains
one
two
three
Call this first_file. Then let this
four five six
seven eight nine
ten eleven twelve
be second_file. The output of the paste command is
one    four five six
two    seven eight nine
three    ten eleven twelve
So paste appends the columns from the second file to the first row by row, separated by a tab. As with other commands you can redirect the output to a new file:
paste first_file second_file > combined_file
Jim
Tyson
UCL
Information Services
You can create this in a text editor. I can use paste to merge the four lines of data into one line
paste -s file
which joins the lines with tabs. As well as the -s option, I can add a delimiter character with -d. Try this
paste -s -d, file
Join
We have seen how to split a data file into different columns and we can also join two data files together. To do
this there must be a column of values that match in each file and the files must be sorted on the field you are
going to use to join them.
We start with files where for every row in file one there is a row in file two and vice versa.
Consider our two files. The first has the structure
Surname,
Gender,
Maths score,
English score,
History score
The second
Surname,
Mean score,
Grade
We can see then that these could be joined on the column surname with ease, since surname is unique. After sorting both files we can do this with the command line
join -t, results.csv gradedresults.csv
By default join matches on the first field of each file.
If the columns on which to match for joining don't appear in the same position in each file, you can use the -1 m and -2 m options (older systems use -jn m), where 1 or 2 (or n) is the numeric file handle (look at the order in which you name the files) and m is the number of the join field in that file. In fact, we could write
join -t, -o 0,1.2,2.3 results.csv gradedresults.csv
where the -o list displays the match column (always denoted 0), the second column from the first file (1.2) and the third column from the second file (2.3).
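Putting the pieces together, here is a self-contained sketch with invented two-row miniatures of the two files, already sorted on surname:

```shell
# Invented, pre-sorted miniatures of results.csv and gradedresults.csv.
printf 'ADAMS,1,55,63,65\nALI,1,52,46,35\n' > r.csv
printf 'ADAMS,61,B\nALI,44,C\n' > g.csv
join -t, r.csv g.csv               # joins on the first field by default
join -t, -o 0,1.3,2.3 r.csv g.csv  # surname, maths score, grade
```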
sed G
Double spaces the file. sed reads a line at a time, and G appends the hold space, preceded by a newline, to each line; since the hold space is empty this adds a blank line after each one. Remember that reading a line at a time is basic to sed's operation.
sed '/^$/d;G'
Double spaces a file that already has some blank lines. First delete any existing empty lines, then append a newline to each remaining line.
sed 'G;G'
Triple spaces the file.
sed 'n;d'
This removes double line spacing, and does it in a rather crafty way. Assuming that the first line is not blank, all the even-numbered lines should be blank, so alternately printing a line (n) and deleting the next (d) results in a single spaced file.
sed '/regex/{x;p;x;}'
This command puts a blank line before every occurrence of the search string regex.
sed -n '1~2p'
This command prints only the odd-numbered lines (the 1~2 address is a GNU sed extension), which has the effect of deleting the even lines.
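These one-liners are easy to verify on a throwaway file; for example, double spacing a file and then undoing it:

```shell
printf 'one\ntwo\nthree\n' > lines.txt
sed G lines.txt              # double spaced
sed G lines.txt | sed 'n;d'  # back to single spacing
```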
I leave the investigation of more sed wizardry to you.
AWK
AWK is a programming language developed specifically for text data manipulation. You can write complete
programs in AWK and execute them in much the same way as a C or Java program (AWK is interpreted though
not compiled like C or byte code compiled like Java).
AWK allows for some sophisticated command line manipulation and I will use a few simple examples to illustrate.
Because our file is comma delimited, we will invoke AWK with the option -F,. AWK will automatically identify the columns of data and put the fields, a row at a time, into its variables $1 to $NF. The built-in variable NF holds the number of fields, so $NF always identifies the last field of data.
So, we can try
awk -F, '{sum=0; for (i=3; i<=NF; i++) sum=sum+$i; print sum}' results.csv
This code sums the three numeric score fields in each row and prints out the result, one total per pupil.
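AWK's pattern-action structure also lets you filter rows on a condition. A sketch with an invented two-row sample, where field 4 is the English score:

```shell
printf 'ADAMS,1,55,63,65\nBLAKEMORE,1,56,38,40\n' > r.csv
# Print surname and English score for rows where the English score is 50 or more.
awk -F, '$4 >= 50 {print $1, $4}' r.csv
```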
As with sed there is a website for useful awk one-liners by Eric Pement at
http://www.pement.org/awk/awk1line.txt
It is possible to use Perl code on the command line. Consider the simple Perl statement
print "hello"
(Programmers among you, notice that I omit the ;.) We can execute this directly from the Unix prompt by invoking the Perl interpreter with the option -e. Try this
perl -e 'print "hello"'
Perl is also handy in a pipeline. Try
cut -d, -f3 results.csv | perl -ne '$n += $_; END { print "$n\n" }'
which will sum the column of numbers.
Another very useful Perl function for command line use is split. In Perl, split takes a string and divides it into separate data items at a delimiter character and then it puts the results into an array.
To illustrate this try the following
perl -ne '@fields = split /,/; print "$fields[0]\n"' results.csv
which splits each record at the commas and prints the first element of the array, the surname.
Final Exercise
To practise and consolidate try the following:
1.
Take the original results.csv data and find the average mark for each column of examination marks.
Can you see a way to write this set of values to the end of the file on a row labelled averages? (Hint: >
writes the data from a process to a new file but >> appends it to the end of an existing file)
2.
Take the original results.csv and find the average examination mark for each pupil. Can you add this
new column of data to the original file?
3.
Take the original results.csv and find the average examination mark for each pupil and on the basis of
the following rule, assign them to a stream
If average exam mark is greater than or equal to 60 the student is in stream A,
else if average exam mark is greater than or equal to 50 the student is in stream B,
else the student is in stream C.
Create a new file that includes these two new data items for each pupil.
Index
^
^M, 5 end of line character
|
|, 3 pass output to a further command
<
<, 5 take input from the source specified following
>
>, 3 send output to the destination specified following
A
anchor, 3 tie a search to a place in a file or line
$, 3 end of line anchor
AWK, 8 text processing language
C
cut, 4 cut a column of data from a file
G
grep, 3 search a file using regular expressions
H
head, 2 display opening lines of a file
L
less, 2 display a file a screen at a time
N
n, 5 sed command to print the current line and read the next
P
paste, 6 join two files linewise
Perl, 9 interpreted language that is good for text processing
print, 9 Perl and AWK command to display data on-screen
S
sed, 8 text processing language
sort, 5 sort data files
T
tr, 5 translate characters in a file
W
wc, 3 count words, lines or characters in a file
wildcard, generalise a search string
*, 4 a wildcard for zero or more characters