
Command line tools - bash, awk and sed

We can only explore a small fraction of the capabilities of the bash shell and command-line utilities
in Linux during this course. An entire course could be taught about bash alone, but that is not the
central topic of interest for this class. There are, however, a few more commands and techniques
that are useful enough to merit mention in this session. Previous exercises have used both awk and
sed, but this class will devote more attention to general principles of these tools, as well as some
additional bash functions. We will also discuss "regular expressions" and file "globs" as part of
learning more about command-line tools, because these are generally applicable to many functions.

The document FileGlobbing.pdf (linked from the web page) contains the information from the
Ubuntu manpage on file globbing. Read this document, then interpret the following globs. What
would each of the following patterns match? Test them using the ls command on the
/data/AtRNAseq/ directory:
ls /data/AtRNAseq/*.gz
ls /data/AtRNAseq/[ct][1-3].*
ls /data/AtRNAseq/[!a-m]*.fastq*
ls /data/AtRNAseq/[a-m]*.fastq*
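
If the /data/AtRNAseq/ directory is not available, the same glob patterns can be tested in a scratch directory of empty files. The sketch below is an assumption for illustration, with file names modeled on the c1-c3 and t1-t3 samples used later in this session:

```shell
# Scratch directory with empty files mimicking the c1-c3/t1-t3 sample
# names used later in this handout (invented for illustration).
tmpdir=$(mktemp -d)
touch "$tmpdir"/c1.fq.gz "$tmpdir"/c2.fq.gz "$tmpdir"/c3.fq.gz \
      "$tmpdir"/t1.fq.gz "$tmpdir"/t2.fq.gz "$tmpdir"/t3.fq.gz \
      "$tmpdir"/notes.txt

gz_count=$(ls "$tmpdir"/*.gz | wc -l)         # every name ending in .gz
ct_count=$(ls "$tmpdir"/[ct][1-3].* | wc -l)  # 'c' or 't', a digit 1-3, then '.'
not_n=$(ls "$tmpdir"/[!n]* | wc -l)           # names NOT beginning with 'n'
echo "$gz_count $ct_count $not_n"

rm -r "$tmpdir"
```

Each glob is expanded by the shell before ls runs, so ls simply lists the names that matched; here all three patterns match the six sequence files, and the last one excludes notes.txt.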

Regular expressions are similar to file globs in the sense that both are ways of writing general
"patterns" that can match many different targets. File globs apply specifically to file and directory
names, while regular expressions apply to any text. Unfortunately there are multiple "flavors" of
regular expressions: basic, extended, Perl-format, and so forth. Different tools use slightly different
syntax and different flavors, so getting familiar enough with regular expressions to make them work
for you is not a trivial task.
The RegularExpressions.pdf document (linked from the web page) has an overview of regular
expressions, although this is by no means a complete explanation of the topic. Read the document,
then interpret the following regular expressions, and note how parentheses are used to indicate
backreferences that would be available for use in a substitution command. NOTE: grouping using
() and alternative search terms using | are capabilities of "extended regular expressions" rather than
"basic regular expressions", so test these regular expressions using the egrep command on a listing
of the files in the /data/AtRNAseq/ directory:
ls /data/AtRNAseq/ | egrep "(fastq|fasta|fq|fa)\.gz"
ls /data/AtRNAseq/ | egrep "(cn|ts)[1-3]ln[^3a-zA-Z]\."
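
The same kind of pattern can be tested without access to the data directory by piping a list of names through egrep; the file names below are invented for illustration:

```shell
# A made-up file listing to stand in for the output of ls /data/AtRNAseq/
names="c1.fq.gz
c2.fq.gz
readme.txt
genome.fasta.gz"

# (fastq|fasta|fq|fa) matches any one of the four alternatives, and \.
# escapes the dot so it matches a literal period before gz.
matched=$(printf '%s\n' "$names" | egrep "(fastq|fasta|fq|fa)\.gz")
echo "$matched"
```

Three of the four names match; readme.txt does not, because it contains none of the alternatives followed by ".gz".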

Note that grep finds matches to the search pattern regardless of the presence of other characters
around the pattern - this is another difference between regular expressions and file globs. A file glob
finds only filenames that are completely matched by the pattern, while a regular expression finds any
text that contains the pattern.

Another version of the grep program is zgrep, which can search within gzipped files without saving
a decompressed version of the file. This is useful if it is necessary to find subsets of data within a
large file of sequence data stored in fastq.gz compressed format. One challenge with using zgrep on
fastq.gz files is that it cannot readily distinguish between the four types of lines found in FASTQ-
format files. We will come back to this point later, when we discuss awk in more depth.
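
A minimal sketch of zgrep, using a small invented FASTQ file compressed with gzip:

```shell
# Build a tiny gzipped FASTQ file; read names and bases are invented.
tmp=$(mktemp -d)
printf '@read1\nACGTACGT\n+\nIIIIIIII\n@read2\nTTTTAAAA\n+\nIIIIIIII\n' \
    | gzip > "$tmp/demo.fq.gz"

# zgrep searches the compressed data directly; nothing is decompressed
# to disk. -c counts the matching lines.
hit=$(zgrep -c "ACGT" "$tmp/demo.fq.gz")
echo "$hit"

rm -r "$tmp"
```

Only the read1 sequence line matches here, but a quality-score line containing the same characters would match just as well - exactly the ambiguity between line types mentioned above.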

With a basic understanding of file globs and regular expressions, we can look at how these are
useful. We will start with commands for locating things. Linux system architecture is complex
enough that it is not always obvious what software is installed on a given system, or if it is not
installed, if it is available in pre-packaged form from the distribution repository. The Boost library
required for compiling TopHat is an example - how can we find out if our Linux system already has
a copy of the Boost library installed? If it does not, how do we know if the library is available in a
Ubuntu package from the repository, or if we should download the source code, compile it, and
install the library? The software for managing packages in Ubuntu and other variants of the Debian
Linux distribution is called dpkg, short for 'Debian package'. The command to search for packages
with names containing the word 'boost' is
dpkg-query -l *boost*
Try this and see what happens. The information returned provides the complete names of packages
that match the search pattern, along with the status of those packages on the local system (in the
first three columns of each line). If the line begins with 'ii', the package is installed and functional
on the local system; if the line begins with 'un', the package is not installed. Try the search again
using different patterns. For example, leave out one or both asterisks:
dpkg-query -l boost*
The characters 'boost' are surrounded by other characters in the names of the software packages
libboost-system and libboost-thread, so both asterisks are required for the search to be successful -
this behaves like a file glob, rather than like a regular expression search. The dpkg-query command
only searches the names of packages, which may or may not include the pattern of interest. To
search for packages that contain the pattern "boost" in either the package name or description, use
the command apt-cache search:
apt-cache search boost
The apt-cache is a list of names and descriptions of all available packages, installed or not, so this
search returns a long list of packages that include the pattern 'boost' somewhere in the name or
description. Note that leading and trailing asterisks are not needed, unlike dpkg-query - this
behaves like a regular expression search rather than a file glob.

Another command for finding software on the local system is 'locate' - this uses a local database of
installed programs, and so it is wise to make sure the database is up-to-date before doing a search.
Execute the following two commands to update the database and then search it:
sudo updatedb
locate boost
Note that the locate command, similar to apt-cache search, does not require leading or trailing
asterisk characters to find the pattern 'boost', even when it is surrounded by other characters. This
highlights a reality of Linux systems - different parts of the same system don't always follow the
same rules, so it is often important to try different versions of a command, or search online for tips
using search keywords that include the specific command you are trying to use and the specific
distribution you are working with.

An even more general tool for searching on the local system is the find utility, which will
return a longer list of files and directories containing the pattern 'boost' than the locate command.
Try:
sudo find /[!a]* -iname '*boost*'

This should be run with sudo privileges if you are searching (as in this example) from the root
directory / rather than in your home directory, because an ordinary user does not have read
privileges on many of the directories searched by the find command, so running the command
without sudo privileges results in a lot of Permission denied error messages. The [!a]* glob
excludes top-level directories whose names begin with 'a' (notably the /afs tree), which would
otherwise produce many more Permission denied error messages. The -iname option tells find to search for the pattern '*boost*' in
filenames, in a case-insensitive manner. Other options allow the find command to search based on
file type, file size, creation date, last modification date, or almost any other characteristic. Try the
search using boost without the flanking asterisks - is the find function more similar to file globs or
to regular expressions?
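
A sketch of find with a quoted pattern and the -type option, using an invented directory layout in a scratch directory:

```shell
# Invented layout: one file and one subdirectory under a scratch root.
tmp=$(mktemp -d)
mkdir "$tmp/sub"
touch "$tmp/Boost_notes.txt" "$tmp/sub/libboost.so"

# Quoting '*boost*' keeps the shell from expanding the glob before find
# sees it; -iname makes the match case-insensitive.
found=$(find "$tmp" -iname '*boost*' | wc -l)

# -type f restricts results to regular files (-type d would be directories).
files_only=$(find "$tmp" -type f | wc -l)
echo "$found $files_only"

rm -r "$tmp"
```

Both names match the case-insensitive pattern, and find descends into the subdirectory automatically.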

Loops and Conditionals


DNA sequence analysis often requires carrying out the same steps of data processing and analysis
separately for multiple sequence files. For example, even our small example RNA-seq experiment
with only two conditions and three replicates of each condition requires processing six sequence
files separately up to the point where the data from the six files are combined into a single object
containing read counts for each transcript from all six samples. There are two general strategies for
simplifying repetitive sets of tasks that need to be completed on multiple files, and then multiple
methods to use either of the two strategies.

The two strategies are serial processing and parallel processing - one either carries out the tasks on
one file at a time, and repeats the same tasks on one file after another until the whole set is finished,
or one carries out the tasks in parallel on all files at the same time. The choice of strategy is based
on the amount of system resources required per task, the number of files to be processed, and the
total amount of resources available. A workstation with 24 processors and 128 GB of RAM could
carry out parallel processing of six RNA-seq data files from the example dataset at the same time,
but the BIT laptops with 4 processors and 8 GB of RAM probably could not. Command-line options
for parallel processing include the commands xargs with tee, or the parallel command; see the
manpages for more information on xargs and tee, and the http://www.gnu.org/software/parallel/
webpage for more information on parallel.
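
As a minimal sketch of the parallel strategy, xargs with the -P option can run several commands at once; the file names below are invented, and gzip stands in for a real processing step:

```shell
# Three invented input files; gzip stands in for a real processing step.
tmp=$(mktemp -d)
touch "$tmp/c1.fq" "$tmp/c2.fq" "$tmp/t1.fq"

# -n 1 gives each gzip command one file name; -P 2 keeps up to two
# gzip processes running at the same time.
ls "$tmp"/*.fq | xargs -n 1 -P 2 gzip

gz_after=$(ls "$tmp"/*.fq.gz | wc -l)
echo "$gz_after"

rm -r "$tmp"
```

The pipeline does not finish until all the gzip jobs have completed, so commands after the xargs line can safely use the output files.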

Loops
One way to carry out serial processing is to create a loop of commands. This is a sequence of
commands provided to the interpreter at the same time, that (1) give a list of items to deal with, and
(2) specify what is to be done with them. The "items to deal with" can be as simple as counting
through numbers in a sequence or processing each file in turn from a list of filenames, while the
specification of "what is to be done with them" can be as simple or complex as desired. Loops are a
general concept that can be applied in many ways using different command interpreters; we will
focus on examples of loops in the bash shell and in the awk text processing language.

Bash loops can have any of three general forms: for loops, while loops, or until loops (see the Bash
Beginners Guide at http://tldp.org/LDP/Bash-Beginners-Guide/html/chap_09.html). The for loop
takes the form 'for X in Y; do Z; done'. X is a variable, Y is a list of values that will each be assigned
to the variable X, and Z is an action or set of actions to be taken. A simple example that echoes
the numbers 1 to 10 to the screen is:
for x in {1..10}; do echo $x; done
This example demonstrates that numerical sequences can be specified using curly brackets and two
dots separating the beginning and ending values. A list of text items, such as file names, can be
provided as a space-separated list with no brackets or quotes (provided the file names do not contain
spaces):
for x in c1 c2 c3 t1 t2 t3; do echo $x; done

In order to help the bash shell recognize the boundaries of the string used as the variable name, we
can put curly brackets around the string, so that adjacent characters are clearly marked as not part of
the name. For example, if we wanted to use the same strategy as before to list the names of the
sequence files, but used the variable only for the parts that are different, we could write:
for x in c1 c2 c3 t1 t2 t3; do echo ${x}.fq.gz; done

Instead of just echoing the filename to the screen, we could just as easily use this loop to carry out
serial alignment of all 6 RNA-seq data files to the chromosome 5 reference sequence to produce
BAM output files:
for x in c1 c2 c3 t1 t2 t3; do bwa mem -t 3 Atchr5 ${x}.fq.gz | samtools
view -SbuF4 - | samtools sort -o ${x}.bam -; done

Note that the value of the variable $x is used twice, once to name the input file to be aligned, and a
second time to name the BAM file produced as output. This points out the advantages of a file-
naming system that keeps all the relevant information together in one part of the filename, so the
name can easily be subdivided into different parts, one specific for the sample identity and another
specific for the file format or stage of processing.

There are also advantages to naming files so that the sample identity segment always consists of the
same number of characters. For example, in analysis of genotyping-by-sequencing data using the
STACKS pipeline, sequence reads for each sample are expected to be stored in separate files. This
can mean dozens or hundreds of different sequence files, so typing out a list of all the filenames as
demonstrated above would be very tedious. If the files are named so that the sample identity always
occupies the first 6 characters (eg p01A01 to p10H12 for ten 96-well plates of samples, each named
for the plate and well position where it is stored), then a loop that processes all 960 files can be
written in three lines:
for file in *.fastq.gz;
do name=${file:0:6}; bwa mem -t 3 refindex/refname ${name}.fastq.gz | samtools
view -SbuF4 - | samtools sort -o ${name}.bam -;
done

It is critical, in using loops of this sort, to make sure that the path information is correct relative to
the working directory from which the command is executed. This example would be executed from
within a directory containing the sequence files, and also containing a sub-directory called refindex
in which the genome reference index called refname is stored. Two more points are worth
mentioning here. First, the list of file names that are assigned to the $file variable is produced using
a file glob - this example is simple, but it can be as complex as necessary to achieve the desired
result. Second, the first 6 characters of each filename are extracted using the syntax
${variable:position:length}, where "position" is 0-indexed. The first character of the filename in
the $file variable is position 0, and a sub-string of length 6 beginning at that first character is
assigned to the variable $name.
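
A quick check of the substring syntax, using an invented sample file name:

```shell
# ${variable:position:length} - note the colons; position is 0-indexed.
file="p01A01.fastq.gz"   # invented sample file name
name=${file:0:6}
echo "$name"             # p01A01
```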

We can store the BAM output alignment files in a different directory than the input sequence files, if
desired, by providing more path information in addition to filenames. For example,
for file in seqdata/*.fastq.gz;
do name=${file:8:6}; bwa mem -t 3 refindex/refname ${file} | samtools
view -SbuF4 - | samtools sort -o bamfiles/${name}.bam -;
done
This bash loop would be executed from a directory containing three sub-directories, the seqdata/
directory with the fastq.gz sequence files for all samples, the refindex/ directory containing the
reference genome index, and the bamfiles/ directory in which the BAM output is to be stored.
Note that the substring assigned to the $name variable now starts at position 8 of the filename
stored in the $file variable, because the path string "seqdata/" occupies positions 0 to 7.

Try the command below to see how files are listed from the /data/AtRNAseq directory:
for file in /data/AtRNAseq/*gz; do echo $file; done

Does this command list only the RNAseq data files? If not, how should it be modified so that it lists only the 6
files of RNAseq data?

We have discussed the text processing language awk and its variant bioawk in the context of
manipulating and summarizing FASTQ sequence and SAM alignment files, but it is a powerful
tool for other kinds of text processing as well. For example, consider the problem of summarizing
a file containing several FASTA-format sequences, each with a single header line and multiple
lines of sequence, like the file test.fa available on the website. If we want to know how long each
of the DNA sequences is, we can use an awk loop to condense all the sequence information onto a
single line, then determine the length of that line.
awk 'BEGIN {RS = ">"; FS = "\n"} {printf "%s\n", ">"$1} {for(i=2;i<=NF;i+=1)
printf "%s", $i} {printf "%s","\n"}' infile.fa

This complex command begins with awk 'BEGIN {RS = ">"; FS = "\n"}': this tells the shell we
are using the awk language, and sets the "record separator" character to ">" and the "field
separator" character to newline. Recall that a "record" in awk refers to a set of data relevant to a
particular observation, and "fields" are subsets of the data for that observation. By changing the
record and field separator values from the defaults (newline and whitespace, respectively) we tell
awk that we want it to treat each FASTA-format sequence as a record, and the individual lines as
fields within that record. The availability of the bioawk variant eliminates the need for this kind of
code for use with FASTA or FASTQ files, but the occasion may arise to use this strategy in other
text-processing applications, so it is useful to know that it is possible.
The printf command provides complete control over the format of printed output, at the cost of
complex syntax. See https://www.gnu.org/software/gawk/manual/html_node/Printf.html for more
information about printf; a brief summary follows.
The "%s" specifies we are printing a 'string', or text; we can specify other types of output with
other characters.
• %d or %i - integer digits
• %e - number in scientific notation, eg 'printf "%4.3e\n", 1950' prints 1.950e+03 with
3 digits to the right of the decimal.
• %c - a character specified by the ASCII character code, eg 'printf "%c", 65' produces
the character 'A'
• %f - a decimal number or 'float', with significant digits specified: 'printf "%4.3f",
1950' prints '1950.000'
Note that printf does not add a newline, so multiple printf statements will all be output to the same
line unless newline characters '\n' are specified as part of the strings to be printed. The next part of
the awk command, {printf "%s\n", ">"$1}, prints the name of the sequence prefixed by a ">"
symbol (which was removed because it is specified as the record separator), followed by a
newline. The loop {for(i=2;i<=NF;i+=1) printf "%s", $i} prints each line of sequence in
turn to output, with no intervening newline characters, for as many lines of sequence as there are in
each record. The final command, {printf "%s","\n"}, prints a newline character at the end of
the sequence so the stage is set for the beginning of the next header line. The input file can be sent
to awk in a pipe, or awk can read directly from a file; default output goes to STDOUT (the screen),
but a file destination can be specified in the print command if desired.
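
To actually report the length of each sequence, the same RS=">" approach can sum the lengths of the sequence fields within each record; the sketch below uses an invented two-sequence test file, and adds NR > 1 because the text before the first ">" forms an empty record that should be skipped.

```shell
# Invented two-sequence test file; with RS=">" the text before the first
# ">" forms an empty record, so NR>1 skips it.
fa=$(mktemp)
printf '>seq1\nACGT\nACGT\n>seq2\nTTT\n' > "$fa"

# Field 1 is the header line; fields 2..NF are sequence lines whose
# lengths are summed for each record.
lengths=$(awk 'BEGIN {RS = ">"; FS = "\n"}
    NR > 1 {len = 0; for (i = 2; i <= NF; i++) len += length($i);
            print $1, len}' "$fa")
echo "$lengths"          # one "name length" pair per sequence

rm "$fa"
```
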
Another example of formatting sequences: the command
awk '{printf "%s%06d\n%s\n", ">",FNR,$0 }' infile.txt > outfile.fa
takes as input a text file of short DNA sequences (such as GBS tags) and converts it to FASTA
format, using as names the number of each sequence in the list. The 'printf "%s%06d\n%s\n"' part
formats the numbers as 6 digits, with leading zeros to keep the sequence names the same length, eg
>000001 to >100000 for a set of 100,000 GBS tags.

Conditionals
The bash shell has multiple types of conditional statements, but those are more complex than we
can dig into today. We will focus on conditional statements in awk, because those are relatively
simple and very useful for dealing with FASTQ formatted sequence files, where each record
occupies four lines in the file.

The mathematical operation "modulo" is valuable in this context - the modulo operation returns the
remainder left over after dividing one number by another. Line number modulo 4 gives a unique
identity to each of the four different types of lines in a FASTQ-format sequence file. In awk
terminology, the operation is written NR%4, and the conditional test is specified with two equal
signs to distinguish it from an assignment operation which is written with one equal sign.
awk 'NR%4==1 {print $0}' prints the first header line with the name of the sequence
awk 'NR%4==2 {print $0}' prints the DNA sequence line
awk 'NR%4==3 {print $0}' prints the header line for the quality scores (often just '+')
awk 'NR%4==0 {print $0}' prints the quality score line
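
These one-liners can be checked on a small invented FASTQ fragment:

```shell
# Invented two-read FASTQ fragment to test the NR%4 selectors.
fq=$(mktemp)
printf '@read1\nACGT\n+\nIIII\n@read2\nTTAA\n+\nIIII\n' > "$fq"

# NR%4==2 picks out every sequence line (lines 2, 6, 10, ...).
seqs=$(awk 'NR%4==2 {print $0}' "$fq")
echo "$seqs"

rm "$fq"
```

The output contains one line per read, each holding only the DNA sequence.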

The awk utility does not have the capacity to read gzipped files directly, but it can accept input re-
directed from another shell command. For example, the command
zcat /data/AtRNAseq/c1.fq.gz | awk 'NR%4==1 {print $0}' | head prints the first 10 lines
of sequence names from the c1.fq.gz file. This can also be written as
awk 'NR%4==1 {print $0}' <(zcat /data/AtRNAseq/c1.fq.gz ) | head
This is an example of redirection - the output of the zcat command is redirected from the screen to
serve as input to the awk command. The second version allows the <(zcat filename) form to be
embedded at any point in a pipe, so it can be used to merge data from different files or different
processing streams together if desired, while the first version uses zcat filename | to start the
pipeline, so it can't be inserted at any other position in the sequence of commands.
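
Because process substitution makes a command's output look like a file, it can feed programs such as paste to merge two compressed streams in one step; this sketch uses invented file contents, and is bash-specific, so run it in bash rather than plain sh:

```shell
# Invented contents; process substitution is a bash feature, so run this
# in bash rather than plain sh.
tmp=$(mktemp -d)
printf 'a\nb\n' | gzip > "$tmp/one.gz"
printf '1\n2\n' | gzip > "$tmp/two.gz"

# Each <(zcat ...) behaves like a file name, so paste can join the two
# decompressed streams line by line.
merged=$(paste <(zcat "$tmp/one.gz") <(zcat "$tmp/two.gz"))
echo "$merged"

rm -r "$tmp"
```

The result pairs the first line of each file, then the second, and so on - the same pattern used to merge, for example, two read files from a paired-end sequencing run.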
