sudo apt-get install zlib1g-dev

If you hit a permission error, go to the folder and run: chmod 777 . (or: chmod 777 *)

cd /home/<location of mosaik-master> -- (this folder contains demo, src, ...)
cd src
make
//A new folder bin is created beside src.
cd ../demo
sudo ./Build.sh OR sudo bash Build.sh
//In demo/fastq you will see read.mkb and in demo/reference you will see e.coli.dat

sudo ./Align.sh OR sudo bash Align.sh

//In demo/fastq you will see other files
//The resulting BAM file (read.mka.bam) will be found in the demo/fastq directory.
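
The outputs named above can be checked after both scripts have run. The following is a minimal sketch, not part of the MOSAIK distribution: the function name check_demo_outputs is hypothetical, and the three file names are the ones given in the notes above.

```shell
#!/bin/sh
# Hypothetical helper: report whether Build.sh and Align.sh left the files
# named in the notes. Argument 1 is the path to the demo directory.
check_demo_outputs() {
    demo_dir=$1
    status=0
    for f in reference/e.coli.dat fastq/read.mkb fastq/read.mka.bam; do
        if [ -f "$demo_dir/$f" ]; then
            echo "ok       $f"
        else
            echo "missing  $f"
            status=1
        fi
    done
    # Non-zero exit status when at least one file is absent.
    return "$status"
}
```

From inside the demo folder, call it as: check_demo_outputs .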

MOSAIK is a reference-guided assembler comprising two main modular programs:
MosaikBuild and MosaikAligner.

The workflow consists of supplying sequences in FASTA, FASTQ, Illumina Bustard & Gerald, or
SRF file formats and producing results in the BAM format (A binary format for storing sequence
data). A BAM file (.bam) is the binary version of a SAM file. A SAM file (.sam) is a tab-delimited
text file that contains sequence alignment data.
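
Because SAM is tab-delimited text, its columns can be read with standard tools. The sketch below assumes the BAM file has already been converted to SAM by a separate tool (for example samtools view -h, which is not covered in these notes). The function name sam_summary is hypothetical. It relies on the standard SAM column order, in which columns 1, 3, 4 and 5 are read name, reference name, position and mapping quality.

```shell
#!/bin/sh
# Print read name, reference name, position and mapping quality for each
# alignment record in a SAM file. Header lines start with "@" and are skipped.
sam_summary() {
    awk -F '\t' '!/^@/ { print $1, $3, $4, $5 }' "$1"
}
```

Call it as: sam_summary gold.sam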

Build phase:
Step 1: Convert the reference into binary format (e.coli.fa to e.coli.dat)
Step 2: For references larger than 100 million base pairs, make a jump database (e.coli.dat to e.coli.15)
Step 3: Convert the reads to binary format (mate1.fq to read.mkb)
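
Whether Step 2 applies depends on the size of the reference. As a rough sketch, the bases in a FASTA file can be counted and compared with the 100 million base pair threshold. The function names count_bases and needs_jump_db are hypothetical, and the count assumes a plain FASTA file with Unix line endings.

```shell
#!/bin/sh
# Count sequence characters in a FASTA file, ignoring ">" header lines.
count_bases() {
    awk '!/^>/ { n += length($0) } END { printf "%.0f\n", n }' "$1"
}

# Print "yes" when the reference is larger than the threshold from Step 2
# (100 million base pairs unless a second argument is given), else "no".
needs_jump_db() {
    threshold=${2:-100000000}
    if [ "$(count_bases "$1")" -gt "$threshold" ]; then
        echo yes
    else
        echo no
    fi
}
```

Call it as: needs_jump_db reference/e.coli.fa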


MosaikBuild converts various sequence formats into MOSAIK's native read format.

e.coli.fa (sample file for reference sequence)

### Create
# cd ../src
# make

# Convert the reference sequence to our binary format

../bin/MosaikBuild -fr reference/e.coli.fa -oa reference/e.coli.dat

# You may need to create the jump database for large genome (> 100 million basepair)
#../bin/MosaikJump -ia reference/e.coli.dat -hs 15 -out reference/e.coli.15

# Convert the reads to our binary format

../bin/MosaikBuild -q fastq/mate1.fq -q2 fastq/mate2.fq -out fastq/read.mkb -st
(for single-end) ../bin/MosaikBuild -q fastq/mate1.fastq -st illumina -out
mate1.fq (sample file for reads sequence)

Align Phase:

Pass as inputs the reads (read.mkb) and reference file (e.coli.dat) in binary format to
MosaikAligner. (For >100 million basepairs use the jump database created in build phase)


MosaikAligner pairwise aligns each read to a specified series of reference
sequences and produces BAM files as output.

### Create ../bin/MosaikAligner
# cd ../src
# make
# Align the reads


../bin/MosaikAligner -in fastq/read.mkb -out fastq/read.mka -ia reference/e.coli.dat \
    -annpe $ANN_PATH/2.1.26.pe.100.0065.ann -annse $ANN_PATH/2.1.26.se.100.005.ann

# You may need to use the jump database for large genome (> 100 million basepair)
#../bin/MosaikAligner -in fastq/read.mkb -out fastq/read.mka -ia reference/e.coli.dat \
#    -annpe $ANN_PATH/2.1.26.pe.100.0065.ann -annse $ANN_PATH/2.1.26.se.100.005.ann -j

1. pe.ann and se.ann are in MOSAIK/src/networkFile/.
2. read.mka.bam is the resulting BAM, while the other output BAMs are for other
Final output (gold.sam)
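
The aligner invocation can be assembled and printed before it is run, which makes the long command line easier to review. This is a sketch only: the function name aligner_cmd is hypothetical, the default ANN_PATH of ../src/networkFile is an assumption based on the location given in note 1, and the value passed after -j is assumed to be the jump database prefix written by MosaikJump -out (the notes do not show it).

```shell
#!/bin/sh
# Print (do not run) a MosaikAligner command line using only the flags that
# appear in these notes.
# ASSUMPTION: ANN_PATH defaults to ../src/networkFile (relative to demo).
# ASSUMPTION: the optional 4th argument is the jump database prefix for -j.
aligner_cmd() {
    reads=$1 out=$2 ref=$3 jump=${4:-}
    ann_dir=${ANN_PATH:-../src/networkFile}
    cmd="../bin/MosaikAligner -in $reads -out $out -ia $ref"
    cmd="$cmd -annpe $ann_dir/2.1.26.pe.100.0065.ann"
    cmd="$cmd -annse $ann_dir/2.1.26.se.100.005.ann"
    if [ -n "$jump" ]; then
        cmd="$cmd -j $jump"
    fi
    echo "$cmd"
}
```

Call it as: aligner_cmd fastq/read.mkb fastq/read.mka reference/e.coli.dat
Inspect the printed line, then paste it into the shell to run it.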

What's new?
1. A new neural net for mapping quality (MQ) calibration is introduced. Initial testing
using simulated reads shows that this method improves accuracy compared to the
previous MQ scheme.
2. The overall alignment speed is much quicker now due to a banded Smith-Waterman
algorithm implementation. Longer Roche 454 reads align much quicker than
3. A local alignment search option has been added to help rescue mates in
paired-end/mate-pair reads that may be missing due to highly repetitive regions in the
4. SOLiD support has finally come of age. MOSAIK imports and aligns SOLiD
reads in colorspace, but now seamlessly converts the alignments back into basespace.
No more downstream bioinformatics headaches.
5. Robust support for the BAM alignment file format.
6. The command line parameters have been cleaned up and sensible default
parameters have been chosen. In most cases this cuts the ridiculously long command lines
down to simply specifying an input file and an output file.

What makes MOSAIK different?

Unlike many current read aligners, MOSAIK produces gapped alignments using the
Smith-Waterman algorithm.
MOSAIK is written in highly portable C++ and currently targets the following platforms:
Microsoft Windows, Apple Mac OS X, FreeBSD, and Linux operating systems. Other platforms
can easily be supported upon request.

MOSAIK is multithreaded. If you have a machine with 8 processors, you can use all 8
processors to align reads faster while using the same memory footprint as when using one

MOSAIK supports multiple sequencing technologies. In addition to legacy technologies such as
Sanger capillary sequencing, our program supports next generation technologies such as
Roche 454, Illumina, AB SOLiD, and experimental support for the Helicos Heliscope

SOAP: (Short Oligonucleotide Analysis Package)

SOAPaligner/soap2 is a member of SOAP (Short Oligonucleotide Analysis Package). It
is an updated version of the SOAP software for short oligonucleotide alignment. The new program
features super fast and accurate alignment for huge amounts of short reads generated by the
Illumina/Solexa Genome Analyzer. Compared to soap v1, it is one order of magnitude faster. It
requires only 2 minutes to align one million single-end reads onto the human reference genome.
Another remarkable improvement of SOAPaligner is that it now supports a wide range of
read lengths.

System Requirements
1. Hardware:
a) 64-bit x86-64 CPUs with SSE instructions.
b) 8 GB main memory (for a genome as large as the human genome).
c) 8 GB hard disk (for a genome as large as the human genome).
2. Software:
a) 64-bit Linux system (kernel >=2.6).

1. Download SOAPaligner from http://soap.genomics.org.cn/soapaligner.html .
2. In the Linux console, type:
cd <TheDirectoryYouPutTheTarball>
tar zxvf SOAPaligner.tar.gz
cd SOAPaligner
3. Your directory now contains 2 executable files: 2bwt-builder (to format the reference)
and soap (to align).
(same as here : https://github.com/gigascience/bgi-

To run SOAPaligner, we need to build index files for the reference genome, and then search
reads against the formatted index files.

1. Format the reference sequence:
<ExecutablePath>/2bwt-builder <FastaPath/YourFasta>
eg: ./2bwt-builder reference/human_genome.fa

The directory will then contain 13 index files. Their names all start with your FASTA file name
plus .index, e.g. human_genome.fa.index.
The suffixes are *.amb, *.ann, *.bwt, *.fmv, *.hot, *.lkt, *.pac, *.rev.bwt, *.rev.fmv, *.rev.lkt,
*.rev.pac, *.sa, and *.sai.
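
Before aligning, it can help to confirm that all 13 index files were written. The sketch below loops over the suffixes listed above. The function name check_soap_index is hypothetical.

```shell
#!/bin/sh
# List any of the 13 index files that are missing for a given prefix,
# e.g. reference/human_genome.fa.index. Returns non-zero if any is missing.
check_soap_index() {
    prefix=$1
    missing=0
    for suffix in amb ann bwt fmv hot lkt pac rev.bwt rev.fmv rev.lkt rev.pac sa sai; do
        if [ ! -f "$prefix.$suffix" ]; then
            echo "missing: $prefix.$suffix"
            missing=$((missing + 1))
        fi
    done
    [ "$missing" -eq 0 ]
}
```

Call it as: check_soap_index reference/human_genome.fa.index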

2. Alignment quick start:

For alignment of single-end reads:
./soap -a <reads_a> -D <index.files> -o <output>
./soap -a read/mate1.fq -D reference/human_genome.fa.index -o <output>

For paired-end reads:

./soap -a <reads_a> -b <reads_b> -D <index.files> -o <PE_output> -2 <SE_output> \
    -m <min_insert_size> -x <max_insert_size>

./soap -a read/mate1.fq -b read/mate2.fq -D reference/human_genome.fa.index \
    -o <PE_output> -2 <SE_output> -m <min_insert_size> -x <max_insert_size>

NOTE: For the -D option, the program accepts only the prefix of your index files, such as


-D STR   Prefix name for the reference index [*.index]

-a STR   Query file, for SE read alignment or one end of PE reads
-b STR   Query b file, one end of PE reads
-o STR   Output file for alignment results
-2 STR   Output file for reads that are mapped but unpaired in PE alignment
-u STR   Output file for unmapped reads, [none]
-m INT   Minimal insert size allowed for PE, [400]
-x INT   Maximal insert size allowed for PE, [600]
-n INT   Filter out low-quality reads that contain more than INT Ns, [5]
-t       Output read id instead of read name, [none]
-r INT   How to report repeat hits, 0=none; 1=random one; 2=all, [1]
-R       RF alignment for long insert size (>= 2k bp) PE data, [none] FR alignment
-l INT   For long reads with a high error rate at the 3'-end that
         cannot be aligned over their whole length, first align the 5' INT bp
         subsequence as a seed, [256] use whole length of the read
-v INT   Total mismatches allowed in one read, [2]
-M INT   Match mode for each read or the seed part of a read, which
         should not contain more than 2 mismatches, [4]
           0: exact match only
           1: 1 mismatch match only
           2: 2 mismatch match only
           3: [gap] (coming soon)
           4: find the best hits
-p INT   Number of threads, [1]
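
The paired-end options above can be combined into one command line. The following sketch prints the command rather than running it, so it can be reviewed first. The function name soap_pe_cmd is hypothetical. The insert size limits fall back to the defaults listed above (400 and 600) when they are not given.

```shell
#!/bin/sh
# Print (do not run) a paired-end soap command built from the options above.
# Arguments: reads_a reads_b index_prefix pe_output se_output [min] [max]
soap_pe_cmd() {
    reads_a=$1 reads_b=$2 index=$3 pe_out=$4 se_out=$5
    min_insert=${6:-400}   # default for -m from the option list
    max_insert=${7:-600}   # default for -x from the option list
    echo "./soap -a $reads_a -b $reads_b -D $index -o $pe_out -2 $se_out -m $min_insert -x $max_insert"
}
```

Call it as: soap_pe_cmd read/mate1.fq read/mate2.fq reference/human_genome.fa.index <PE_output> <SE_output>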

SOAPaligner needs about 2 hours to format the reference sequence and build indexing tables.
RAM usage depends on the total size of the reference sequence. For the human
reference genome, it occupies 7 GB of RAM.

Table 1. Performance of aligning 1 million single-end reads (35bp read length) or 1 million read
pairs onto the human reference genome

Program               Time (sec),         Time (sec),         RAM (GB)
                      single-end reads    paired-end reads
SOAPaligner (soap2)   120                 505                 6.8
soap                  1700+               5743                13.4
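
The ratios implied by Table 1 can be worked out directly. This is a small arithmetic sketch using only the figures in the table. The function name ratio is hypothetical, and the single-end figure is a lower bound because the soap time is given as 1700+.

```shell
#!/bin/sh
# Ratio of two numbers, rounded to one decimal place.
ratio() {
    awk -v a="$1" -v b="$2" 'BEGIN { printf "%.1f\n", a / b }'
}

echo "single-end speedup: $(ratio 1700 120)x (lower bound)"
echo "paired-end speedup: $(ratio 5743 505)x"
echo "RAM, soap vs soap2: $(ratio 13.4 6.8)x"
```

This prints 14.2x, 11.4x and 2.0x, which matches the earlier statement that soap2 is about one order of magnitude faster than soap v1 and shows that it uses about half the memory.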

Future Development

1. Binary soap alignment output, and .gz input and output;