Вы находитесь на странице: 1из 28

LinuxcommandlineexercisesforNGSdataprocessing

byUmerZeeshanIjaz
ThepurposeofthistutorialistointroducestudentstothefrequentlyusedtoolsforNGSanalysisaswellas
givingexperienceinwritingoneliners.Copytherequiredfilestoyourcurrentdirectory,changedirectory(cd)
tothelinuxTutorialfolder,anddoalltheprocessinginside:
[uzi@quince-srv2 ~/]$ cp -r /home/opt/MScBioinformatics/linuxTutorial .
[uzi@quince-srv2 ~/]$ cd linuxTutorial
[uzi@quince-srv2 ~/linuxTutorial]$
IhavedeliberatelychosenAwkintheexercisesasitisalanguageinitselfandisusedmoreoftento
manipulateNGSdataascomparedtotheothercommandlinetoolssuchasgrep,sed,perletc.
Furthermore,havingacommandonawkwillmakeiteasiertounderstandadvancedtutorialssuchasIllumina
AmpliconsProcessingWorkflow.
InLinux,weuseashellthatisaprogramthattakesyourcommandsfromthekeyboardandgivesthemto
theoperatingsystem.MostLinuxsystemsutilizeBourneAgainSHell(bash),butthereareseveraladditional
shellprogramsonatypicalLinuxsystemsuchasksh,tcsh,andzsh.Toseewhichshellyouareusing,type
[uzi@quince-srv2 ~/linuxTutorial]$ echo $SHELL
/bin/bash
Toseewhereyouareinthefilesystem:
[uzi@quince-srv2 ~/linuxTutorial]$ pwd
/home/uzi/linuxTutorial
Listthefilesinthecurrentdirectory:
[uzi@quince-srv2 ~/linuxTutorial]$ ls
data
Nowtrydifferentcommandsfromthesheetgivenbelow:

LinuxCommandsCheatSheet
FileSystem
lslistitemsincurrentdirectory
ls -llistitemsincurrentdirectoryandshowinlongformattoseeperimissions,size,andmodification
date

ls -alistallitemsincurrentdirectory,includinghiddenfiles
ls -Flistallitemsincurrentdirectoryandshowdirectorieswithaslashandexecutableswithastar
ls dirlistallitemsindirectorydir
cd dirchangedirectorytodir
cd ..gouponedirectory
cd /gototherootdirectory
cd ~gototoyourhomedirectory
cd -gotothelastdirectoryyouwerejustin
pwdshowpresentworkingdirectory
mkdir dirmakedirectorydir
rm fileremovefile
rm -r dirremovedirectorydirrecursively
cp file1 file2copyfile1tofile2
cp -r dir1 dir2copydirectorydir1todir2recursively
mv file1 file2move(rename)file1tofile2
ln -s file linkcreatesymboliclinktofile
touch filecreateorupdatefile
cat fileoutputthecontentsoffile
less fileviewfilewithpagenavigation
head fileoutputthefirst10linesoffile
tail fileoutputthelast10linesoffile
tail -f fileoutputthecontentsoffileasitgrows,startingwiththelast10lines
vim fileeditfile
alias name 'command'createanaliasforacommand

System
shutdownshutdownmachine
rebootrestartmachine

dateshowthecurrentdateandtime
whoamiwhoyouareloggedinas
finger userdisplayinformationaboutuser
man commandshowthemanualforcommand
dfshowdiskusage
dushowdirectoryspaceusage
freeshowmemoryandswapusage
whereis appshowpossiblelocationsofapp
which appshowwhichappwillberunbydefault

ProcessManagement
psdisplayyourcurrentlyactiveprocesses
topdisplayallrunningprocesses
kill pidkillprocessidpid
kill -9 pidforcekillprocessidpid

Permissions
ls -llistitemsincurrentdirectoryandshowpermissions
chmod ugo filechangepermissionsoffiletougouistheuser'spermissions,gisthegroup's
permissions,andoiseveryoneelse'spermissions.Thevaluesofu,g,andocanbeanynumberbetween0
and7.
7fullpermissions
6readandwriteonly
5readandexecuteonly
4readonly
3writeandexecuteonly
2writeonly
1executeonly
0nopermissions

chmod 600 fileyoucanreadandwritegoodforfiles


chmod 700 fileyoucanread,write,andexecutegoodforscripts
chmod 644 fileyoucanreadandwrite,andeveryoneelsecanonlyreadgoodforwebpages
chmod 755 fileyoucanread,write,andexecute,andeveryoneelsecanreadandexecutegood
forprogramsthatyouwanttoshare

Networking
wget filedownloadafile
curl filedownloadafile
scp user@host:file dirsecurecopyafilefromremoteservertothedirdirectoryonyourmachine
scp file user@host:dirsecurecopyafilefromyourmachinetothedirdirectoryonaremote
server
scp -r user@host:dir dirsecurecopythedirectorydirfromremoteservertothedirectorydiron
yourmachine
ssh user@hostconnecttohostasuser
ssh -p port user@hostconnecttohostonportasuser
ssh-copy-id user@hostaddyourkeytohostforusertoenableakeyedorpasswordlesslogin
ping hostpinghostandoutputresults
whois domaingetinformationfordomain
dig domaingetDNSinformationfordomain
dig -x hostreverselookuphost
lsof -i tcp:1337listallprocessesrunningonport1337

Searching
grep pattern filessearchforpatterninfiles
grep -r pattern dirsearchrecursivelyforpatternindir
grep -rn pattern dirsearchrecursivelyforpatternindirandshowthelinenumberfound
grep -r pattern dir --include='*.extsearchrecursivelyforpatternindirandonlysearchin
fileswith.extextension
command | grep patternsearchforpatternintheoutputofcommand

find filefindallinstancesoffileinrealsystem
locate filefindallinstancesoffileusingindexeddatabasebuiltfromtheupdatedbcommand.Much
fasterthanfind
sed -i 's/day/night/g' filefindalloccurrencesofdayinafileandreplacethemwithnights
meanssubstitudeandgmeansglobalsedalsosupportsregularexpressions

Compression
tar cf file.tar filescreateatarnamedfile.tarcontainingfiles
tar xf file.tarextractthefilesfromfile.tar
tar czf file.tar.gz filescreateatarwithGzipcompression
tar xzf file.tar.gzextractatarusingGzip
gzip filecompressesfileandrenamesittofile.gz
gzip -d file.gzdecompressesfile.gzbacktofile

Shortcuts
ctrl+amovecursortobeginningofline
ctrl+fmovecursortoendofline
alt+fmovecursorforward1word
alt+bmovecursorbackward1word
Reference:http://cheatsheetworld.com/programming/unix-linux-cheat-sheet/

Exercise1:ExtractingreadsfromaFASTAfilebasedonsupplied
IDs
Awkisaprogramminglanguagewhichallowseasymanipulationofstructureddataandismostlyusedfor
patternscanningandprocessing.Itsearchesoneormorefilestoseeiftheycontainlinesthatmatchwiththe
specifiedpatternsandthenperformassociatedactions.Thebasicsyntaxis:
awk '/pattern1/ {Actions}
/pattern2/ {Actions}' file
TheworkingofAwkisasfollows
Awkreadstheinputfilesonelineatatime.
Foreachline,itmatcheswithgivenpatterninthegivenorder,ifmatchesperformsthecorresponding
action.
Ifnopatternmatches,noactionwillbeperformed.
Intheabovesyntax,eithersearchpatternoractionareoptional,Butnotboth.

Ifthesearchpatternisnotgiven,thenAwkperformsthegivenactionsforeachlineoftheinput.
Iftheactionisnotgiven,printallthatlinesthatmatcheswiththegivenpatternswhichisthedefaultaction.
Emptybraceswithoutanyactiondoesnothing.Itwontperformdefaultprintingoperation.
EachstatementinActionsshouldbedelimitedbysemicolon.
Sayyouhavedata.tsvwiththefollowingcontents:
$ cat data/test.tsv
blah_C1 ACTGTCTGTCACTGTGTTGTGATGTTGTGTGTG
blah_C2 ACTTTATATATT
blah_C3 ACTTATATATATATA
blah_C4 ACTTATATATATATA
blah_C5 ACTTTATATATT
BydefaultAwkprintseverylinefromthefile.
$ awk '{print;}' data/test.tsv
blah_C1 ACTGTCTGTCACTGTGTTGTGATGTTGTGTGTG
blah_C2 ACTTTATATATT
blah_C3 ACTTATATATATATA
blah_C4 ACTTATATATATATA
blah_C5 ACTTTATATATT
Weprintthelinewhichmatchesthepatternblah_C3
$ awk '/blah_C3/' data/test.tsv
blah_C3 ACTTATATATATATA
Awkhasnumberofbuiltinvariables.Foreachrecordi.eline,itsplitstherecorddelimitedbywhitespace
characterbydefaultandstoresitinthe$nvariables.Ifthelinehas5words,itwillbestoredin$1,$2,$3,$4
and$5.$0representsthewholeline.NFisabuiltinvariablewhichrepresentsthetotalnumberoffieldsina
record.
$ awk '{print $1","$2;}' data/test.tsv
blah_C1,ACTGTCTGTCACTGTGTTGTGATGTTGTGTGTG
blah_C2,ACTTTATATATT
blah_C3,ACTTATATATATATA
blah_C4,ACTTATATATATATA
blah_C5,ACTTTATATATT
$ awk '{print $1","$NF;}' data/test.tsv
blah_C1,ACTGTCTGTCACTGTGTTGTGATGTTGTGTGTG
blah_C2,ACTTTATATATT
blah_C3,ACTTATATATATATA
blah_C4,ACTTATATATATATA
blah_C5,ACTTTATATATT
AwkhastwoimportantpatternswhicharespecifiedbythekeywordcalledBEGINandEND.Thesyntaxisas
follows:

BEGIN { Actions before reading the file}


{Actions for everyline in the file}
END { Actions after reading the file }
Forexample,
$ awk 'BEGIN{print "Header,Sequence"}{print $1","$2;}END{print "-------"}'
data/test.tsv
Header,Sequence
blah_C1,ACTGTCTGTCACTGTGTTGTGATGTTGTGTGTG
blah_C2,ACTTTATATATT
blah_C3,ACTTATATATATATA
blah_C4,ACTTATATATATATA
blah_C5,ACTTTATATATT
------WecanalsousetheconceptofaconditionaloperatorinprintstatementoftheformprintCONDITION?
PRINT_IF_TRUE_TEXT:PRINT_IF_FALSE_TEXT.Forexample,inthecodebelow,weidentifysequences
withlengths>14:
$ awk '{print (length($2)>14) ? $0">14" : $0"<=14";}' data/test.tsv
blah_C1 ACTGTCTGTCACTGTGTTGTGATGTTGTGTGTG>14
blah_C2 ACTTTATATATT<=14
blah_C3 ACTTATATATATATA>14
blah_C4 ACTTATATATATATA>14
blah_C5 ACTTTATATATT<=14
Wecanalsouse1afterthelastblock{}toprinteverything(1isashorthandnotationfor{print $0}
whichbecomes{print}aswithoutanyargumentprintwillprint$0bydefault),andwithinthisblock,we
canchange$0,forexampletoassignthefirstfieldto$0forthirdline(NR==3),wecanuse:
$ awk 'NR==3{$0=$1}1' data/test.tsv
blah_C1 ACTGTCTGTCACTGTGTTGTGATGTTGTGTGTG
blah_C2 ACTTTATATATT
blah_C3
blah_C4 ACTTATATATATATA
blah_C5 ACTTTATATATT
Youcanhaveasmanyblocksasyouwantandtheywillbeexecutedoneachlineintheordertheyappear,for
example,ifwewanttoprint$1threetimes(hereweareusingprintfinsteadofprintastheformer
doesn'tputendoflinecharacter),
$ awk '{printf $1"\t"}{printf $1"\t"}{print $1}' data/test.tsv
blah_C1 blah_C1 blah_C1
blah_C2 blah_C2 blah_C2
blah_C3 blah_C3 blah_C3
blah_C4 blah_C4 blah_C4
blah_C5 blah_C5 blah_C5

Although,wecanalsoskipexecutinglaterblocksforagivenlinebyusingnextkeyword:
$ awk '{printf $1"\t"}NR==3{print "";next}{print $1}' data/test.tsv
blah_C1 blah_C1
blah_C2 blah_C2
blah_C3
blah_C4 blah_C4
blah_C5 blah_C5
$ awk 'NR==3{print "";next}{printf $1"\t"}{print $1}' data/test.tsv
blah_C1 blah_C1
blah_C2 blah_C2
blah_C4 blah_C4
blah_C5 blah_C5
Youcanalsousegetlinetoloadthecontentsofanotherfileinadditiontotheoneyouarereading,for
example,inthestatementgivenbelow,thewhileloopwillloadeachlinefromtest.tsvintokuntilno
morelinesaretoberead:
$ awk 'BEGIN{while((getline k <"data/test.tsv")>0) print "BEGIN:"k}{print}'
data/test.tsv
BEGIN:blah_C1

ACTGTCTGTCACTGTGTTGTGATGTTGTGTGTG

BEGIN:blah_C2

ACTTTATATATT

BEGIN:blah_C3

ACTTATATATATATA

BEGIN:blah_C4

ACTTATATATATATA

BEGIN:blah_C5

ACTTTATATATT

blah_C1 ACTGTCTGTCACTGTGTTGTGATGTTGTGTGTG
blah_C2 ACTTTATATATT
blah_C3 ACTTATATATATATA
blah_C4 ACTTATATATATATA
blah_C5 ACTTTATATATT
YoucanalsostoredatainthememorywiththesyntaxVARIABLE_NAME[KEY]=VALUEwhichyoucanlater
usethroughfor(INDEXinVARIABLE_NAME)command:
$ awk '{i[$1]=1}END{for (j in i) print j"<="i[j]}' data/test.tsv
blah_C1<=1
blah_C2<=1
blah_C3<=1
blah_C4<=1
blah_C5<=1
Givenallthatyouhavelearnedsofar,wearegoingtoextractreadsfromaFASTAfilebasedonIDssupplied
inafile.Say,wearegivenaFASTAfilewithfollowingcontents:
[uzi@quince-srv2 ~/linuxTutorial]$ cat data/test.fa
>blah_C1
ACTGTCTGTC

ACTGTGTTGTG
ATGTTGTGTGTG
>blah_C2
ACTTTATATATT
>blah_C3
ACTTATATATATATA
>blah_C4
ACTTATATATATATA
>blah_C5
ACTTTATATATT
andanIDsfile:
[uzi@quince-srv2 ~/linuxTutorial]$ cat data/IDs.txt
blah_C4
blah_C5
Afterlookingatthefile,itisimmediatelyclearthatthesequencesmayspanmultiplelines(forexample,for
blah_C1).IfwewanttomatchanID,wecanfirstlinearizethefilebyusingtheconditionaloperatoras
discussedabovetohavethedelimitedinformationofeachsequenceinoneline,andthenmakelogicto
performfurtherfunctionalityoneachlinelater.Ourlogicisthatforlinesthatcontainheaderinformation/^>/
wecandosomethingdifferently,andforotherlinesweuseprintftoremovenewlinecharacter:
[uzi@quince-srv2 ~/linuxTutorial]$ awk '{printf /^>/ ? $0 : $0}' data/test.fa
>blah_C1ACTGTCTGTCACTGTGTTGTGATGTTGTGTGTG>blah_C2ACTTTATATATT>blah_C3ACTTATATAT
ATATA>blah_C4ACTTATATATATATA>blah_C5ACTTTATATATT
Wecanthenputeachsequenceonaseparatelineandalsoputatabcharacter("\t")betweentheheader
andthesequence:
[uzi@quince-srv2 ~/linuxTutorial]$ awk '{printf /^>/ ? "\n"$0 : $0}'
data/test.fa
>blah_C1ACTGTCTGTCACTGTGTTGTGATGTTGTGTGTG
>blah_C2ACTTTATATATT
>blah_C3ACTTATATATATATA
>blah_C4ACTTATATATATATA
>blah_C5ACTTTATATATT[uzi@quince-srv2 ~/linuxTutorial]$ awk '{printf /^>/ ?
"\n"$0"\t" : $0}' data/test.fa
>blah_C1

ACTGTCTGTCACTGTGTTGTGATGTTGTGTGTG

>blah_C2

ACTTTATATATT

>blah_C3

ACTTATATATATATA

>blah_C4

ACTTATATATATATA

>blah_C5

ACTTTATATATT

WecanthenuseNR==1blocktostopprintinganewlinecharacterbeforethefirstheader(asyoucansee
thereisanemptyspace)andusenexttoignorethelaterblock:

[uzi@quince-srv2 ~/linuxTutorial]$ awk 'NR==1{printf $0"\t";next}{printf /^>/ ?


"\n"$0"\t" : $0}' data/test.fa
>blah_C1

ACTGTCTGTCACTGTGTTGTGATGTTGTGTGTG

>blah_C2

ACTTTATATATT

>blah_C3

ACTTATATATATATA

>blah_C4

ACTTATATATATATA

>blah_C5

ACTTTATATATT

Wecanthenpipethisstreamtoanotherawkstatementusing"\t"asadelimeter(whichyoucanspecify
using-F)andusegsubtoremove>fromthestartofeachlinesinceourIDsfiledoesn'tcontainthat
character:
[uzi@quince-srv2 ~/linuxTutorial]$ awk 'NR==1{printf $0"\t";next}{printf /^>/ ?
"\n"$0"\t" : $0}' data/test.fa | awk -F"\t" '{gsub("^>","",$0);print $0}'
blah_C1 ACTGTCTGTCACTGTGTTGTGATGTTGTGTGTG
blah_C2 ACTTTATATATT
blah_C3 ACTTATATATATATA
blah_C4 ACTTATATATATATA
blah_C5 ACTTTATATATT
NowweloadtheIDs.txtfileintheBEGINblock,storetheIDsinthememory,andinthestreamifthefirst
field($1)matchestheIDstoredinthememory,weoutputtheformattedrecord:
[uzi@quince-srv2 ~/linuxTutorial/data]$ awk 'NR==1{printf $0"\t";next}{printf
/^>/ ? "\n"$0"\t" : $0}' data/test.fa | awk -F"\t" 'BEGIN{while((getline k <
"data/IDs.txt")>0)i[k]=1}{gsub("^>","",$0); if(i[$1]){print ">"$1"\n"$2}}'
>blah_C4
ACTTATATATATATA
>blah_C5
ACTTTATATATT
WithBioawkitismuchsimplerasyoudon'thavetolinearizetheFASTAfileastherecordboundariesarethe
completesequenceboundariesandnotlines:
[uzi@quince-srv2 ~/linuxTutorial/data]$ bioawk -cfastx 'BEGIN{while((getline k
<"data/IDs.txt")>0)i[k]=1}{if(i[$name])print ">"$name"\n"$seq}' data/test.fa
>blah_C4
ACTTATATATATATA
>blah_C5
ACTTTATATATT
Bioawkcanalsotakeotherinputformatsthatyouspecifywith-c,withthefieldnamesasfollows(youcanuse
thecolumnpairsalternatively):
bed: $1:$chrom $2:$start $3:$end $4:$name $5:$score $6:$strand $7:$thickstart
$8:$thickend $9:$rgb $10:$blockcount $11:$blocksizes $12:$blockstarts
sam: $1:$qname $2:$flag $3:$rname $4:$pos $5:$mapq $6:$cigar $7:$rnext
$8:$pnext $9:$tlen $10:$seq $11:$qual

vcf: $1:$chrom $2:$pos $3:$id $4:$ref $5:$alt $6:$qual $7:$filter $8:$info


gff: $1:$seqname $2:$source $3:$feature $4:$start $5:$end $6:$score $7:$filter
$8:$strand $9:$group $10:$attribute
fastx: $1:$name $2:$seq $3:$qual $4:$comment

Exercise2:AlignmentStatisticsforMetagenomics/Population
Genomics
ForthisexercisewewilluseaC.DifficileRibotype078referencedatabasethatcomprisesof61contigs.Even
thoughitisasinglegenomeforwhichwehaveobtainedthesamples,theworkflowgivenbelowremains
similarforthemetagenomicsampleswhenyouhavecompletegenomesinsteadofcontigsinthereference
database(andsoIusethenomenclature:genomes/contigs).Beforeweanalyzeoursamples,wecandosome
qualitycontrolchecksonourrawsequencesusingFastQC.Runningthefollowingcommandwillgeneratea
M120_S2_L001_R1_001_fastqcfolderwithanhtmlpagefastqc_report.htmlinside.Youcanloadit
upinyourbrowsertoassessyourdatathroughgraphsandsummarytables.
[uzi@quince-srv2 ~/linuxTutorial]$ fastqc data/M120_*R1*.fastq
Started analysis of M120_S2_L001_R1_001.fastq
Approx 5% complete for M120_S2_L001_R1_001.fastq
Approx 10% complete for M120_S2_L001_R1_001.fastq
Approx 15% complete for M120_S2_L001_R1_001.fastq
Approx 20% complete for M120_S2_L001_R1_001.fastq
Approx 25% complete for M120_S2_L001_R1_001.fastq
Approx 30% complete for M120_S2_L001_R1_001.fastq
Approx 35% complete for M120_S2_L001_R1_001.fastq
Approx 40% complete for M120_S2_L001_R1_001.fastq
Approx 45% complete for M120_S2_L001_R1_001.fastq
Approx 50% complete for M120_S2_L001_R1_001.fastq
Approx 55% complete for M120_S2_L001_R1_001.fastq
Approx 60% complete for M120_S2_L001_R1_001.fastq
Approx 65% complete for M120_S2_L001_R1_001.fastq
Approx 70% complete for M120_S2_L001_R1_001.fastq
Approx 75% complete for M120_S2_L001_R1_001.fastq
Approx 80% complete for M120_S2_L001_R1_001.fastq
Approx 85% complete for M120_S2_L001_R1_001.fastq
Approx 90% complete for M120_S2_L001_R1_001.fastq
Approx 95% complete for M120_S2_L001_R1_001.fastq
Approx 100% complete for M120_S2_L001_R1_001.fastq
Analysis complete for M120_S2_L001_R1_001.fastq
[uzi@quince-srv2 ~/linuxTutorial]$
Forexample,hereisthefilegeneratedfortheaboveM120_S2_L001_R1_001.fastqfile:
Alternatively,youcanalsotrymyShellutilitiesforQCaswellasShellwrappersforEMBOSSutilities.
Nextweindexourreferencedatabasefile.Indexingspeedsupalignment,allowingthealignertoquicklyfind
short,nearexactmatchestouseasseedsforsubsequentfullalignments.
[uzi@quince-srv2 ~/linuxTutorial/data]$ bwa index Cdiff078.fa

UseBWAMEMtoalignpairedendsequences.Briefly,thealgorithmworksbyseedingalignmentswith
maximalexactmatches(MEMs)andthenextendingseedswiththeaffinegapSmithWatermanalgorithm
(SW).FromBWAdoc,itissuggestedthatfor70bporlongerIllumina,454,IonTorrentandSangerreads,
assemblycontigsandBACsequences,BWAMEMisusuallythepreferredalgorithm.Forshort
sequences,BWAbacktrackmaybebetter.BWASWmayhavebettersensitivitywhenalignmentgaps
arefrequent.
[uzi@quince-srv2 ~/linuxTutorial]$ bwa mem data/Cdiff078.fa
data/M120_*R1*.fastq data/M120_*R2*.fastq > aln-pe.sam
Wehavegeneratedasamfile(alnpe.sam)whichconsistoftwotypesoflines:headersandalignments.
Headersbeginwith@,andprovidemetadataregardingtheentirealignmentfile.Alignmentsbeginwithany
characterexcept@,anddescribeasinglealignmentofasequencereadagainstthereferencegenome.Note
thateachreadinaFASTQfilemayaligntomultipleregionswithinareferencegenome,andanindividualread
canthereforeresultinmultiplealignments.IntheSAMformat,eachofthesealignmentsisreportedona
separateline.Also,eachalignmenthas11mandatoryfields,followedbyavariablenumberofoptionalfields.
Eachofthefieldsisdescribedinthetablebelow:
Col

Field

Description

QNAME

Querytemplate/pairNAME

FLAG

bitwiseFLAG

RNAME

ReferencesequenceNAME

POS

1basedleftmostPOSition/coordinateofclippedsequence

MAPQ

MAPpingQuality(Phredscaled)

CIAGR

extendedCIGARstring

MRNM

MateReferencesequenceNaMe(=ifsameasRNAME)

MPOS

1basedMatePOSistion

TLEN

inferredTemplateLENgth(insertsize)

10

SEQ

querySEQuenceonthesamestrandasthereference

11

QUAL

queryQUALity(ASCII33givesthePhredbasequality)

OPT

variableOPTionalfieldsintheformatTAG:VTYPE:VALUE

12+

whereFLAGisdefinedas:
Flag

Chr

Description

0x0001

thereadispairedinsequencing

0x0002

thereadismappedinaproperpair

0x0004

thequerysequenceitselfisunmapped

0x0008

themateisunmapped

0x0010

strandofthequery(1forreverse)

0x0020

strandofthemate

0x0040

thereadisthefirstreadinapair

0x0080

thereadisthesecondreadinapair

0x0100

thealignmentisnotprimary

0x0200

thereadfailsplatform/vendorqualitychecks

0x0400

thereadiseitheraPCRoranopticalduplicate

SincetheflagsaregivenindecimalrepresentationintheSAMfile,youcanusethislinktocheckwhichflagis
set.WearegoingtouseSAMToolswhichprovidesvarioustoolsformanipulatingalignmentsintheSAM/BAM
format.TheSAM(SequenceAlignment/Map)format(BAMisjustthebinaryformofSAM)iscurrentlythede
factostandardforstoringlargenucleotidesequencealignments.Ifyouaredealingwithhighthroughput
metagenomicwholegenomeshotgunsequencingdata,youwillhavetodealwithSAM/BAMfiles.Seewhat
SAMtoolshavetooffer:

WecanthenuseaprogramSAMstattogetstatisticsonouraln-pe.samfile:
[uzi@quince-srv2 ~/linuxTutorial]$ samstat aln-pe.sam
Runningtheabovecodewillgenerateaalnpe.sam.samstat.htmlfilewhichyoucanopeninyourbrowser(be
patient,ittakesabitoftimetoload).Plotssuchas"ReadsLengthDistributions"and"BaseQuality
Distributions"maybeofinteresttoyou:

NowweconvertSAMfiletothebinaryBAMfile
[uzi@quince-srv2 ~/linuxTutorial]$ samtools view -h -b -S aln-pe.sam > alnpe.bam
Extractonlythosesequencesthatweremappedagainstthereferencedatabase.Use-F 4switch.
[uzi@quince-srv2 ~/linuxTutorial]$ samtools view -b -F 4 aln-pe.bam > alnpe.mapped.bam
Generateafilelengths.genomethatcontainstwoentriesperrow:genomeidentifierandthecorresponding
genomelength:

[uzi@quince-srv2 ~/linuxTutorial]$ samtools view -H aln-pe.mapped.bam | perl ne 'if ($_ =~ m/^\@SQ/) { print $_ }' | perl -ne 'if ($_ =~ m/SN:(.+)\s+LN:
(\d+)/) { print $1, "\t", $2, "\n"}' > lengths.genome
[uzi@quince-srv2 ~/linuxTutorial]$ cat lengths.genome
Cdiff078_C01

9165

Cdiff078_C02

93786

Cdiff078_C03

752

Cdiff078_C04

5361

Cdiff078_C05

70058

Cdiff078_C06

23538

Cdiff078_C07

98418

Cdiff078_C08

361074

Cdiff078_C09

45183

Cdiff078_C10

141523

Cdiff078_C11

21992

Cdiff078_C12

2353

Cdiff078_C13

133975

Cdiff078_C14

3374

Cdiff078_C15

9744

Cdiff078_C16

25480

Cdiff078_C17

293596

Cdiff078_C18

7057

Cdiff078_C19

73989

Cdiff078_C20

248092

Cdiff078_C21

41937

Cdiff078_C22

65693

Cdiff078_C23

21321

Cdiff078_C24

440055

Cdiff078_C25

210910

Cdiff078_C26

164162

Cdiff078_C27

22782

Cdiff078_C28

201701

Cdiff078_C29

13447

Cdiff078_C30

101704

Cdiff078_C31

146436

Cdiff078_C32

61153

Cdiff078_C33

59640

Cdiff078_C34

193273

Cdiff078_C35

18395

Cdiff078_C36

25573

Cdiff078_C37

61616

Cdiff078_C38

4117

Cdiff078_C39

110461

Cdiff078_C40

125351

Cdiff078_C41

38508

Cdiff078_C42

113221

Cdiff078_C43

500

Cdiff078_C44

547

Cdiff078_C45

613

Cdiff078_C46

649

Cdiff078_C47

666

Cdiff078_C48

783

Cdiff078_C49

872

Cdiff078_C50

872

Cdiff078_C51

879

Cdiff078_C52

921

Cdiff078_C53

955

Cdiff078_C54

1217

Cdiff078_C55

1337

Cdiff078_C56

1445

Cdiff078_C57

2081

Cdiff078_C58

2098

Cdiff078_C59

2512

Cdiff078_C60

2800

Cdiff078_C61

4372

[uzi@quince-srv2 ~/linuxTutorial]$
SortBAMfile.ManyofthedownstreamanalysisprogramsthatuseBAMfilesactuallyrequireasortedBAM
file.-mspecifiesthemaximummemorytouse,andcanbechangedtofityoursystem.
[uzi@quince-srv2 ~/linuxTutorial]$ samtools sort -m 1000000000 alnpe.mapped.bam aln-pe.mapped.sorted
Wewillnowusebedtools.ItisaveryusefulsuiteofprogramsforworkingwithSAM/BAM,BED,VCFandGFF
files,filesthatyouwillencoutermanytimesdoingNGSanalysis.-ibamswitchtakesindexedbamfilethatwe
generatedearlier,-dreportsthedepthateachgenomepositionwith1basedcoordinates,and-gisusedto
providethegenomelengthsfilewegeneratedearlier.Thecoverageflagsareexplainedpictoriallyfrom
genomecovmanpage:

Reference:http://bedtools.readthedocs.org/en/latest/_images/genomecov-glyph.png
[uzi@quince-srv2 ~/linuxTutorial]$ bedtools genomecov -ibam alnpe.mapped.sorted.bam -d -g lengths.genome > aln-pe.mapped.bam.perbase.cov
Lookatthefirstfewentriesinthefilegeneratedabove.Firstcolumnisgenomeidentifier,secondcolumnis
positionongenome,andthirdcolumniscoverage.
[uzi@quince-srv2 ~/linuxTutorial]$ head aln-pe.mapped.bam.perbase.cov
Cdiff078_C01

41

Cdiff078_C01

41

Cdiff078_C01

42

Cdiff078_C01

42

Cdiff078_C01

42

Cdiff078_C01

44

Cdiff078_C01

44

Cdiff078_C01

44

Cdiff078_C01

44

Cdiff078_C01

10

44

[uzi@quince-srv2 ~/linuxTutorial]$
Nowwewillcountonlythosepositionswherewehave>0coverage.
[uzi@quince-srv2 ~/linuxTutorial]$ awk -F"\t" '$3>0{print $1}' alnpe.mapped.bam.perbase.cov | sort | uniq -c > aln-pe.mapped.bam.perbase.count
Toseewhatwehavedone,usethecatcommand
[uzi@quince-srv2 ~/linuxTutorial]$ cat aln-pe.mapped.bam.perbase.count
9165 Cdiff078_C01
93786 Cdiff078_C02
752 Cdiff078_C03
5361 Cdiff078_C04
70058 Cdiff078_C05
23538 Cdiff078_C06
98418 Cdiff078_C07
333224 Cdiff078_C08
44803 Cdiff078_C09
141523 Cdiff078_C10
21969 Cdiff078_C11
2292 Cdiff078_C12
133974 Cdiff078_C13
1762 Cdiff078_C14
50 Cdiff078_C15
10232 Cdiff078_C16
293440 Cdiff078_C17
7057 Cdiff078_C18
73989 Cdiff078_C19
248092 Cdiff078_C20
41937 Cdiff078_C21
65447 Cdiff078_C22
21321 Cdiff078_C23
439123 Cdiff078_C24
210910 Cdiff078_C25
164162 Cdiff078_C26
22782 Cdiff078_C27
201701 Cdiff078_C28
13447 Cdiff078_C29
98510 Cdiff078_C30
146261 Cdiff078_C31
61153 Cdiff078_C32
44523 Cdiff078_C33
193180 Cdiff078_C34
18395 Cdiff078_C35

25573 Cdiff078_C36
61616 Cdiff078_C37
4117 Cdiff078_C38
62897 Cdiff078_C39
125351 Cdiff078_C40
38508 Cdiff078_C41
113221 Cdiff078_C42
442 Cdiff078_C43
649 Cdiff078_C46
663 Cdiff078_C47
766 Cdiff078_C48
580 Cdiff078_C51
1110 Cdiff078_C54
1445 Cdiff078_C56
2512 Cdiff078_C59
2800 Cdiff078_C60
[uzi@quince-srv2 ~/linuxTutorial]$
Wewillnowusetheabovefilewithlengths.genometocalculatetheproportionsofgenomes/contigs
coveredusingthefollowingoneliner.Itreadslengths.genomelinebyline,assignsthegenomeidentifierto
myArray[0],it'slengthtomyArray[1].Itthensearchestheidentifierinalnpe.mapped.bam.perbase.count,extractsthebasecount,andusesbctocalculatetheproportions.
[uzi@quince-srv2 ~/linuxTutorial]$ while IFS=$'\t' read -r -a myArray; do echo
-e "${myArray[0]},$( echo "scale=5;0"$(awk -v pattern="${myArray[0]}"
'$2==pattern{print $1}' aln-pe.mapped.bam.perbase.count)"/"${myArray[1]} | bc )
"; done < lengths.genome > aln-pe.mapped.bam.genomeproportion
[uzi@quince-srv2 ~/linuxTutorial]$ cat aln-pe.mapped.bam.genomeproportion
Cdiff078_C01,1.00000
Cdiff078_C02,1.00000
Cdiff078_C03,1.00000
Cdiff078_C04,1.00000
Cdiff078_C05,1.00000
Cdiff078_C06,1.00000
Cdiff078_C07,1.00000
Cdiff078_C08,.92286
Cdiff078_C09,.99158
Cdiff078_C10,1.00000
Cdiff078_C11,.99895
Cdiff078_C12,.97407
Cdiff078_C13,.99999
Cdiff078_C14,.52222
Cdiff078_C15,.00513
Cdiff078_C16,.40156
Cdiff078_C17,.99946
Cdiff078_C18,1.00000
Cdiff078_C19,1.00000

Cdiff078_C20,1.00000
Cdiff078_C21,1.00000
Cdiff078_C22,.99625
Cdiff078_C23,1.00000
Cdiff078_C24,.99788
Cdiff078_C25,1.00000
Cdiff078_C26,1.00000
Cdiff078_C27,1.00000
Cdiff078_C28,1.00000
Cdiff078_C29,1.00000
Cdiff078_C30,.96859
Cdiff078_C31,.99880
Cdiff078_C32,1.00000
Cdiff078_C33,.74652
Cdiff078_C34,.99951
Cdiff078_C35,1.00000
Cdiff078_C36,1.00000
Cdiff078_C37,1.00000
Cdiff078_C38,1.00000
Cdiff078_C39,.56940
Cdiff078_C40,1.00000
Cdiff078_C41,1.00000
Cdiff078_C42,1.00000
Cdiff078_C43,.88400
Cdiff078_C44,0
Cdiff078_C45,0
Cdiff078_C46,1.00000
Cdiff078_C47,.99549
Cdiff078_C48,.97828
Cdiff078_C49,0
Cdiff078_C50,0
Cdiff078_C51,.65984
Cdiff078_C52,0
Cdiff078_C53,0
Cdiff078_C54,.91207
Cdiff078_C55,0
Cdiff078_C56,1.00000
Cdiff078_C57,0
Cdiff078_C58,0
Cdiff078_C59,1.00000
Cdiff078_C60,1.00000
Cdiff078_C61,0
Wehaveatotalof61genomes/contigsinthereferencedatabase.Toseehowmanygenomes/contigswe
recovered,wewillusethefollowingoneliner:
[uzi@quince-srv2 ~/linuxTutorial]$ awk -F "," '{sum+=$NF} END{print "Total

genomes covered:"sum}' aln-pe.mapped.bam.genomeproportion


Total genomes covered:47.5224
Wealsoneedgenome/contigcoverage,whichwecancalculateas:
[uzi@quince-srv2 ~/linuxTutorial]$ bedtools genomecov -ibam alnpe.mapped.sorted.bam -g lengths.genome | awk -F"\t" '!/^genome/{l[$1]=l[$1]+($2
*$3);r[$1]=$4} END {for (i in l){print i","(l[i]/r[i])}}' > alnpe.mapped.bam.genomecoverage
[uzi@quince-srv2 ~/linuxTutorial]$ cat aln-pe.mapped.bam.genomecoverage
Cdiff078_C10,61.5467
Cdiff078_C11,68.9158
Cdiff078_C12,79.7875
Cdiff078_C13,61.2645
Cdiff078_C14,57.3438
Cdiff078_C15,0.0812808
Cdiff078_C16,23.5227
Cdiff078_C17,57.358
Cdiff078_C30,59.3333
Cdiff078_C18,55.5597
Cdiff078_C31,62.147
Cdiff078_C19,56.3139
Cdiff078_C32,66.0493
Cdiff078_C33,48.8165
Cdiff078_C34,65.7106
Cdiff078_C35,62.7728
Cdiff078_C36,62.7535
Cdiff078_C37,67.2169
Cdiff078_C51,1.05916
Cdiff078_C38,61.9871
Cdiff078_C39,37.3289
Cdiff078_C54,6.46754
Cdiff078_C56,815.224
Cdiff078_C59,801.998
Cdiff078_C01,67.3333
Cdiff078_C02,67.4621
Cdiff078_C03,103.848
Cdiff078_C04,65.4128
Cdiff078_C05,66.1244
Cdiff078_C06,66.239
Cdiff078_C07,76.0081
Cdiff078_C20,55.6661
Cdiff078_C08,60.4236
Cdiff078_C21,56.2321
Cdiff078_C09,76.9986
Cdiff078_C22,56.8815
Cdiff078_C23,53.2772
Cdiff078_C24,56.9991

Cdiff078_C25,57.4446
Cdiff078_C26,59.296
Cdiff078_C40,66.0074
Cdiff078_C27,59.4391
Cdiff078_C41,67.5941
Cdiff078_C28,59.8319
Cdiff078_C42,69.4415
Cdiff078_C29,60.961
Cdiff078_C43,4.812
Cdiff078_C46,29.3837
Cdiff078_C60,62.1336
Cdiff078_C47,7.95946
Cdiff078_C48,15.3436
[uzi@quince-srv2 ~/linuxTutorial]$
Sorttheoriginalbamfile
[uzi@quince-srv2 ~/linuxTutorial]$ samtools sort -m 1000000000 aln-pe.bam alnpe.sorted
NowwewillcheckalignmentstatisticsusingthePicardtools.Notethattheawkstatementgivenbelowisused
totransposetheoriginaltableandyoucandowithoutit.
[uzi@quince-srv2 ~/linuxTutorial]$ java -jar $(which
CollectAlignmentSummaryMetrics.jar) INPUT=aln-pe.sorted.bam OUTPUT=alnpe.sorted.alignment_stats.txt REFERENCE_SEQUENCE=data/Cdiff078.fa
[uzi@quince-srv2 ~/linuxTutorial]$ grep -vi -e "^#" -e "^$" alnpe.sorted.alignment_stats.txt | awk -F"\t" '{ for (i=1; i<=NF; i++) {a[NR,i] =
$i}}NF>p{p=NF}END{for(j=1;j<=p;j++){str=a[1,j];for(i=2; i<=NR; i++)
{str=str"\t"a[i,j];} print str}}'
CATEGORY

FIRST_OF_PAIR

SECOND_OF_PAIR PAIR

TOTAL_READS

425271

425038

850309

PF_READS

425271

425038

850309

PCT_PF_READS

PF_NOISE_READS

PF_READS_ALIGNED

407011

405258

812269

PCT_PF_READS_ALIGNED

0.957063

0.953463

0.955263

PF_ALIGNED_BASES

119451610

118113100

237564710

PF_HQ_ALIGNED_READS

401018

399295

800313

PF_HQ_ALIGNED_BASES

118606615

117274833

235881448

PF_HQ_ALIGNED_Q20_BASES

116971078

111640501

228611579

PF_HQ_MEDIAN_MISMATCHES

PF_MISMATCH_RATE

0.002359

0.007186

0.004759

PF_HQ_ERROR_RATE

0.002269

0.007065

0.004653

PF_INDEL_RATE

0.000124

0.00013

0.000127

MEAN_READ_LENGTH

299.093366

298.832657

298.963048

READS_ALIGNED_IN_PAIRS

404714

404545

809259

PCT_READS_ALIGNED_IN_PAIRS

0.994356

0.998241

0.996294

BAD_CYCLES

STRAND_BALANCE

0.500072

0.500484

0.500278

PCT_CHIMERAS

0.014823

0.014668

0.014746

PCT_ADAPTER

0.000285

0.000261

0.000273

SAMPLE
LIBRARY
READ_GROUP
[uzi@quince-srv2 ~/linuxTutorial]$
Thedetaileddescriptionofthesesummarymetricsaregivenhere.Fromthislink,PF_MISMATCH_RATE,
PF_HQ_ERROR_RATE,andPF_INDEL_RATEareofinteresttous.Ascanbeseen,theerrorratesarequite
lowandwecanproceedwiththeanalysis.NextwewouldliketocalculateGCbias.Forthispurpose,wewill
indexaln-pe.mapped.sorted.bamfile.
[uzi@quince-srv2 ~/linuxTutorial]$ samtools index aln-pe.mapped.sorted.bam
[uzi@quince-srv2 ~/linuxTutorial]$ for i in $(samtools view -H alnpe.mapped.sorted.bam | awk -F"\t" '/@SQ/{gsub("^SN:","",$2);print $2}'
);do samtools view -b aln-pe.mapped.sorted.bam $i > alnpe.mapped.sorted.$i.bam; java -Xmx2g -jar $(which CollectGcBiasMetrics.jar)
R=data/Cdiff078.fa I=aln-pe.mapped.sorted.$i.bam O=alnpe.mapped.sorted.${i}_GCBias.txt CHART=aln-pe.mapped.sorted.${i}_GCBias.pdf
ASSUME_SORTED=true; done
Intheaboveoneliner,CollectGcBiasMetrics.jarwillgenerateaGCbiasplotforeachcontig,andwill
looklikethese:

Nowcollateallthetxtfilestogether:
[uzi@quince-srv2 ~/linuxTutorial]$ for i in $(ls *_GCBias.txt); do awk -v
k="$i" '!/^#/ && !/^$/ && !/^GC/ && !/?/{print k"\t"$1"\t"$5}' $i; done | perl
-ane '$r{$F[0].":".$F[1]}=$F[2];unless($F[0]~~@s){push
@s,$F[0];}unless($F[1]~~@m){push @m,$F[1];}END{print
"Contigs\t".join("\t",@s)."\n";for($i=0;$i<@m;$i++){print
$m[$i];for($j=0;$j<@s;$j++){(not defined $r{$s[$j].":".$m[$i]})?print
"\t".0:print"\t".$r{$s[$j].":".$m[$i]};}print "\n";}}' | sed '1s/alnpe\.mapped\.sorted\.//g;1s/_GCBias\.txt//g' > aln-pe.mapped.sorted.bam.gcbias
Donotgetintimidatedbytheperlonelinerintheabovestatement.Ihaveextracteditfrommy
GENERATEtable.shscript.Ifthedatastreamisoftheform[Contig]\t[Feature]\t[Value],thenyou
canpipethestreamtoGENERATEtable.shtoobtainaContigXFeaturetable:
$ cat test.tsv
contig1 F1

12.2

contig1 F2

34.2

contig1 F3

45.2

contig2 F2

56.3

contig2 F3

56.2

contig3 F1

45.4

contig3 F2

56.3

contig4 F1

23.5

contig5 F1

24.5

$ cat GENERATEtable.sh
#!/bin/bash
less <&0| \
perl -ane '$r{$F[0].":".$F[1]}=$F[2];
unless($F[0]~~@s){
push @s,$F[0];}
unless($F[1]~~@m){
push @m,$F[1];}
END{
print "Contigs\t".join("\t",@s)."\n";
for($i=0;$i<@m;$i++){
print $m[$i];
for($j=0;$j<@s;$j++){
(not defined $r{$s[$j].":".$m[$i]})?print
"\t".0:print"\t".$r{$s[$j].":".$m[$i]};}
print "\n";}}'
$ cat test.tsv | ./GENERATEtable.sh
Contigs contig1 contig2 contig3 contig4 contig5
F1

12.2

45.4

23.5

24.5

F2

34.2

56.3

56.3

F3

45.2

56.2

Nowtakealookatthegeneratedtable:
[uzi@quince-srv2 ~/linuxTutorial]$ head aln-pe.mapped.sorted.bam.gcbias
Contigs Cdiff078_C01

Cdiff078_C02

Cdiff078_C03

Cdiff078_C04

Cdiff078_C05

Cdiff078_C06

Cdiff078_C07

Cdiff078_C08

Cdiff078_C09

Cdiff078_C10

Cdiff078_C11

Cdiff078_C12

Cdiff078_C13

Cdiff078_C14

Cdiff078_C15

Cdiff078_C16

Cdiff078_C17

Cdiff078_C18

Cdiff078_C19

Cdiff078_C20

Cdiff078_C21

Cdiff078_C22

Cdiff078_C23

Cdiff078_C24

Cdiff078_C25

Cdiff078_C26

Cdiff078_C27

Cdiff078_C28

Cdiff078_C29

Cdiff078_C30

Cdiff078_C31

Cdiff078_C32

Cdiff078_C33

Cdiff078_C34

Cdiff078_C35

Cdiff078_C36

Cdiff078_C37

Cdiff078_C38

Cdiff078_C39

Cdiff078_C40

Cdiff078_C41

Cdiff078_C42

Cdiff078_C43

Cdiff078_C46

Cdiff078_C47

Cdiff078_C48

Cdiff078_C51

Cdiff078_C54

Cdiff078_C56

Cdiff078_C59

Cdiff078_C60

19.855622

113.514035

5.94012 0

9.265957

21.189287

1.013634

21.79477

2.099257

8.004623

17.472841

4.059842

0.50366 0

3.954422

15.404681

4.36693 0

1.355168

1.898603

1.454454

11.850605

0.919685

0.229238

5.621592
0

3.322151

0.395449

1.167889

2.616454

0.554199

5.625801

3.721214

0.647487

1.787893

0.73865 2.481033

1.011679

2.647623

1.063016

1.10446 0

0.699548

1.484725

5.277073

23.930557
0

4.918962

1.302196

5.292116

0.488503

0.270507

1.796704

8.023401

1.168119

1.534896

0.32406 4.34351 0

0.738781

0.917049

0.646285

0.2347 0

1.275217

0.465283

2.237992

1.725467

1.625944

2.181692

1.672486

1.27811

0.362786

3.35627 0.457715

4.796381

3.673656

0.915163

1.283086

0.909904

2.225468

0.352394

1.259544

13.719311

1.329643

0.540768

0.765043

0.562234

8.036611

10

0.818789

1.092856

0.885359

1.311588

1.97216 0.494173

0.522956

0.080909
0

1.131649
1.071169

0.512673

2.959123

1.437926

1.849716

0.713225

0.63983 0.739728

1.748817

2.30876 1.411724

0.674416

1.696827

1.361078

0.582441

1.147602

1.42304 1.223984

0.463436

0.393384

11

2.714431

1.504252

0.997528

0.321222

0.794326
0
0.187891

2.288661

4.409023

0.569749

0.724603

0
0.315217

0.494047

0.247215

2.378187

1.226165

1.355215

1.150524

1.9538

0.491785

0.769493

0.719303

0.885474

1.083578

1.954739

1.513891

1.138604

0.904377

0.50476 0.713456
2.795913

1.355035

1.755064

1.204967

1.04366 0.260253

1.697555

0.566266

0.927834

1.018818

1.456152

0.857751

0.848433

0.892897

0.335209

0.829905

0.755116

2.490848

1.208937

0
0

YoucanthenusethefollowingRcodetogenerateaGCvsCoveragetablewhichshowsthatatveryGC,
coveragesgodown(notethatthesearethesmoothedvaluesacrossallgenomes/contigs):
Nowwecalculatemeanqualityscorebycycle
[uzi@quince-srv2 ~/linuxTutorial]$ java -Xmx2g -jar $(which
MeanQualityByCycle.jar) INPUT=aln-pe.mapped.sorted.bam OUTPUT=alnpe.mapped.sorted.mqc.txt CHART_OUTPUT=aln-pe.mapped.sorted.mqc.pdf

Wealsocalculatequalityscoredistribution
[uzi@quince-srv2 ~/linuxTutorial]$ java -Xmx2g -jar $(which
QualityScoreDistribution.jar) INPUT=aln-pe.mapped.sorted.bam OUTPUT=alnpe.mapped.sorted.qsd.txt CHART_OUTPUT=aln-pe.mapped.sorted.qsd.pdf

AnotherusefultoolisQualimapwhichoffersMultisampleBAMQC.
Touseit,weneedtogenerateinput.txtfilewhichcontainslistingofBAMfileswewanttocompare.To
savetime,Iamonlyconsidering3files:
[uzi@quince-srv2 ~/linuxTutorial]$ awk '{split($0,k,".");print k[4]"\t"$0}'
<(ls *.sorted.Cdiff078_C4*.bam | head -3) > input.txt
[uzi@quince-srv2 ~/linuxTutorial]$ cat input.txt
Cdiff078_C40

aln-pe.mapped.sorted.Cdiff078_C40.bam

Cdiff078_C41

aln-pe.mapped.sorted.Cdiff078_C41.bam

Cdiff078_C42

aln-pe.mapped.sorted.Cdiff078_C42.bam

[uzi@quince-srv2 ~/linuxTutorial]$ qualimap multi-bamqc -d input.txt -r -outdir


qualimap
ThiswillgenerateaqualimapfolderthatwillcontainamultisampleBamQcReport.htmlfilethatyoucan
loadinyourbrowser.Docheckthe"PCA"and"InsertSizeHistogram"plot.
Youcanusesamtools flagstattogetmappingstatistics:
[uzi@quince-srv2 ~/linuxTutorial]$ samtools flagstat aln-pe.bam
850309 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 duplicates
812269 + 0 mapped (95.53%:-nan%)
850309 + 0 paired in sequencing
425271 + 0 read1
425038 + 0 read2
795935 + 0 properly paired (93.61%:-nan%)
809259 + 0 with itself and mate mapped

3010 + 0 singletons (0.35%:-nan%)


11922 + 0 with mate mapped to a different chr
8256 + 0 with mate mapped to a different chr (mapQ>=5)
WithBioawk,youcandoamazingthings:
Extractingunmappedreadswithoutheaders:
[uzi@quince-srv2 ~/linuxTutorial]$ bioawk -c sam 'and($flag,4)' aln-pe.sam |
less
M01808:26:000000000-A6KTK:1:1101:10450:1106
*

77

TTAAAGTTAAACTTGTCATATTCATATCTGATTTTTCTACTAGATTCCTTTAAGTTATCCGAACATGAAGCAAGTAATT
TATCCTTAATTAAATTATAGACTTTACTTTCTTTATCAGATAAATCTTTAGCTTTTCCAATACCAGATATAGTAGGAAT
AATTGCATAGTGGTCTGTAACTTTAGATGAATTAAAAATAGACTTAAAGTTTGATTCATTGATTTTAAAATCTTCTTCA
AGTCCTTCTAATAATTCTTTCATAGTATTAACCATATCATTGGTTAAATACCTGCTATCCGTTC
CCCCCGGGGGGGGGGFGFGGGGGGGGGGGGGCEGGGGGFGGGGEFEEGGFEGGGDFFGFGGGFFFGGGGGGFFFGGGGF
EFFEBFGGGC>FGGFBFGGGFDF,@DEEFCFGGGGGGGGCEF,?
EGFGGGFDFGGFGGFDCDFFGFFDFFFBFBFFGFFGEFFAAFCEFFFFF5@09*2?
EE@A*>@AEF5@):=>E;EB**9:495*

AS:i:0 XS:i:0

M01808:26:000000000-A6KTK:1:1101:10136:1113
*

77

CTATTGGAACAAGTGGGGAACTGCAGTCGCCTAACAGAGAATATATTCGTTATCGAATTACATTATCTACTCAAGACAC
CAGTAGAACTCCTAAACTTCTTGAAATACAACTACATGATATACCAAAACCTCCTTATGAGAGACTTGGATTTGCAAGA
CCAGTTGTGTTGGATACTAACGGGGCTTGGGAAGCAGTGTTAGAAAATGCCTTTGATATTGTAGTAACAAGTGAAGTAA
ATGGCGCTGATATTCTGGAGTTTAAACTGCCATTTCATGATTCCAAGCGAGAGACATTAGACA
CCCCCGGGGEGGF
Extractingmappedreadswithheaders:
[uzi@quince-srv2 ~/linuxTutorial]$ bioawk -c sam -H '!and($flag,4)' aln-pe.sam
| less
@SQ

SN:Cdiff078_C01 LN:9165

@SQ

SN:Cdiff078_C02 LN:93786

@SQ

SN:Cdiff078_C03 LN:752

@SQ

SN:Cdiff078_C04 LN:5361

@SQ

SN:Cdiff078_C05 LN:70058

@SQ

SN:Cdiff078_C06 LN:23538

@SQ

SN:Cdiff078_C07 LN:98418

@SQ

SN:Cdiff078_C08 LN:361074

@SQ

SN:Cdiff078_C09 LN:45183

@SQ

SN:Cdiff078_C10 LN:141523

@SQ

SN:Cdiff078_C11 LN:21992

@SQ

SN:Cdiff078_C12 LN:2353

@SQ

SN:Cdiff078_C13 LN:133975

@SQ

SN:Cdiff078_C14 LN:3374

@SQ

SN:Cdiff078_C15 LN:9744

@SQ

SN:Cdiff078_C16 LN:25480

:
CreateFASTAfromBAM(usesrevcompifFLAG&16):
[uzi@quince-srv2 ~/linuxTutorial]$ samtools view aln-pe.bam | bioawk -c sam '{
s=$seq; if(and($flag, 16)) {s=revcomp($seq) } print ">"$qname"\n"s}' | less
>M01808:26:000000000-A6KTK:1:1101:19201:1002
NAAAAGAACTGGCAATTGAAAATAATATACCTGTATATCAACCAGTAAAGGCTAGAGATAAAGAATTTATAGATACAAT
TAAATCTTTAAATCCAGATGTAATAGTAGTTGTAGCTTTTGGACAGATACTTCCAAAAGGAATATTAGAGATTCCTAAG
TTTGGATGTATAAATGTTCATGTTTCTTTACTTCCAAAATATAGAGGTGCGGCACCTATAAATTGGGTAATAATAAATG
GTGAAGAAAAGACTGGTGTTACAACTATGTATATGGATGAAGGTCTAGATACTGGA
>M01808:26:000000000-A6KTK:1:1101:19201:1002
NCCAGTATCTAGACCTTCATCCATATACATAGTTGTAACACCAGTCTTTTCTTCACCATTTATTATTACCCAATTTATA
GGTGCCGCACCTCTATATTTTGGAAGTAAAGAAACATGAACATTTATACATCCAAACTTAGGAATCTCTAATATTCCTT
TTGGAAGTATCTGTCCAAAAGCTACAACTACTATTACATCTGGATTTAAAGATTTAATTGTATCTATAAATTCTTTATC
TCTAGCCTTTACTGGTTGATATACAGGTATATTATTTTCAATTGCCAGTTCTTTTA
>M01808:26:000000000-A6KTK:1:1101:12506:1003
NAAAGATATTATTTTTAGCCCTGGTGTTGTACCTGCTGTTGCTATTTTAGTAAGAATATTAACTAATTCTAATGAAGGC
GTGATAATTCAAAAGCCAGTGTATTACCCATTTGAAGCTAAGGTAAAGAGTAATAATAGGGAAGTTGTAAACAATCCTC
TAATATATGAAAATGGGACTTATAGAATGGATTATGATGATTTGGAAGAAAAAGCTAAGTGTAGCAACAATAAAGTACT
GATACTTTGTAGCCCTCACAATCCTGTTGGAAGAGTTTGGAGAGAAGATGAATTAAAAAAGGTT
>M01808:26:000000000-A6KTK:1:1101:12506:1003
NAGATTAAATGTTTTACTTGGAGCTATACATGTAACTATTTTATCCTTGTACTCTGGGCATAATGACTGTAAAGGAGTA
TGTTTAAATCCTTTTCTAATTAAATCAGAATGTATCTCATCAGCTATTATCCATAGGTCATATTTTTTACATATTTCTA
CAACCTTTTTTAATTCATCTTCTCTCCAAACTCTTCCAACAGGATTGTGAGGGCTACAAAGTATCAGTACTTTATTGTT
GCTACACTTAGCTTTTTCTTCCAAATCATCATAATCCATTCTATAAGTCCCATTTTCATATATT
:

Get%GCcontentfromreferenceFASTAfile:
[uzi@quince-srv2 ~/linuxTutorial]$ bioawk -c fastx '{ print ">"$name; print
gc($seq) }' data/Cdiff078.fa | less
>Cdiff078_C01
0.28096
>Cdiff078_C02
0.307669
>Cdiff078_C03
0.514628
>Cdiff078_C04
0.26898
>Cdiff078_C05
0.291059
>Cdiff078_C06
0.286006

>Cdiff078_C07
0.282794
>Cdiff078_C08
0.289484
:

GetthemeanPhredqualityscorefromaFASTQfile:
[uzi@quince-srv2 ~/linuxTutorial]$ bioawk -c fastx '{ print ">"$name; print
meanqual($qual) }' data/M120_S2_L001_R1_001.fastq | less
>M01808:26:000000000-A6KTK:1:1101:19201:1002
37.3788
>M01808:26:000000000-A6KTK:1:1101:12506:1003
36.9867
>M01808:26:000000000-A6KTK:1:1101:19794:1003
37.1694
>M01808:26:000000000-A6KTK:1:1101:20543:1021
37.01
>M01808:26:000000000-A6KTK:1:1101:14616:1037
33.9133
>M01808:26:000000000-A6KTK:1:1101:10885:1044
35.9502
:

Youwanttoseehowmanysequencesareshorter(lessthan1000bp?)
[uzi@quince-srv2 ~/linuxTutorial]$ bioawk -cfastx 'BEGIN{ s = 0} {if
(length($seq) < 1000) s += 1} END {print "Shorter sequences", s}'
data/Cdiff078.fa
Shorter sequences

12

YoucancountsequencesveryeffectivelywithBioawk,becauseNRnowstoresnumberofrecords:
[uzi@quince-srv2 ~/linuxTutorial]$ bioawk -cfastx 'END{print NR}'
data/630_S4_L001_R1_001.fastq
329396

FurtherReading
Inthecontextoftheexercises,itwillbehelpfulifyoucouldreadthroughthefollowingonlinetutorials,thoughit
isnotessential:
Bashtutorial(https://jack.logicalsystems.it/homepage/techinfo/Guida-Bash.txt)
Awkoneliners(http://www.pement.org/awk/awk1line.txt)

Sedoneliners(http://sed.sourceforge.net/sed1line.txt)
Perloneliners(http://www.catonmat.net/download/perl1line.txt)
VItutorial(http://www.nanocontact.cz/~trunec/education/unix/vi-tutor.txt)
LastUpdatedbyDrUmerZeeshanIjazon23/01/2015.

Вам также может понравиться