
NON-CODING RNA PREDICTION OF CLINICALLY

IMPORTANT MYCOPLASMA BY COMPARATIVE


GENOMIC ANALYSIS

Dissertation submitted to the Madurai Kamaraj University in partial fulfillment
of the requirements for the degree of Master of Science in Biotechnology

School of Biotechnology
Madurai Kamaraj University
Madurai

Regn. No:A242009
OBJECTIVES:
• To choose the best possible approach for predicting ncRNAs.
• To standardize the procedure required for the selected approach.
• To identify and characterize the ncRNAs of clinically important Mycoplasma.
• To lay the groundwork for automating the ncRNA prediction procedure.
Past
• Sequence similarity search, statistical analysis, transcription signal analysis,
comparative genomic analysis.
• Existing methods are biased towards particular classes of ncRNAs.
• Examples: tRNAscan-SE, MiRscan, etc.

QRNA - A Blend
• Secondary structure alone is not statistically significant for detecting ncRNAs.
• Sequences that code for proteins or perform other important functions are
conserved across related organisms.
• QRNA was developed to distinguish conserved RNA secondary structures from the
background of other conserved sequences.
OUTLINE
INTERGENIC REGIONS OF ORGANISM OF INTEREST
↓ blastn
SEARCH FOR HOMOLOGY ACROSS RELATED ORGANISMS
↓ Perl scripts
PARSE THE ALIGNMENTS WITH CERTAIN CUTOFFS
↓
THE ALIGNMENTS WERE GIVEN AS INPUT FOR THE QRNA
↓
PUTATIVE ncRNA
PROTEIN CODING REGION → INTERGENIC REGION
.ptt file
↓
Co-ordinates of protein coding regions
↓
Intergenic region co-ordinates
↓
Intergenic region co-ordinates with difference > 50 nucleotides
↓
Range file
↓
Intergenic sequence extraction by the EMBOSS application
extractseq -regions @rangefile -separate
GENOME LENGTH COMPARISON OF THE MYCOPLASMA

Organism          Genome size (bp)
M.gallisepticum   996,422
M.genitalium      580,074
M.mycoides        1,211,703
M.penetrans       1,358,633
M.pneumoniae      816,394
M.pulmonis        963,879

[Bar graph: genome size comparison; axis "Genome length", 0 to 1,500,000; bars for M.gen, M.pne, M.pul, M.gal, M.myc, M.pen]

M.gen - Mycoplasma genitalium
M.pne - Mycoplasma pneumoniae
M.pul - Mycoplasma pulmonis
M.gal - Mycoplasma gallisepticum
M.myc - Mycoplasma mycoides
M.pen - Mycoplasma penetrans
MYCOPLASMA GENOME – INTERGENIC REGION

[Bar graph: percentage of intergenic region (vertical axis 0% to 100%) for M.gen, M.pne, M.pen, M.gal, M.pul, M.myc, and Homo sapiens]

BAR GRAPH SHOWING THE PERCENTAGE OF INTERGENIC
REGION IN THE GENOME OF MYCOPLASMA
PROTEIN TABLE OF THE GENOME

Mycoplasma genitalium G37 complete genome - 0..580074
480 proteins

Location      Strand  Length  PID      Gene   Synonym  Code  COG  Product
735..1829     +       364     3844620  MG001  -        -     -    (dnaN)
1829..2761    +       310     1045670  MG002  -        -     -    dnaJ
2846..4798    +       650     1045671  MG003  -        -     -    (gyrB)
4813..7323    +       836     1045672  MG004  -        -     -    (gyrA)
7295..8548    +       417     1045673  MG005  -        -     -    (serS)
8552..9184    +       210     1045674  MG006  -        -     -    (tmk)
9157..9921    +       254     1045675  MG007  -        -     -    hypothetical
9924..11252   +       442     1045676  MG008  -        -     -    (tdhF)
……            ……..    …       …..      ………..  ………      ..    ..   …
Protein Co-ordinates → Intergenic Co-ordinates

Protein start  Protein end  Intergenic start  Intergenic end
735            1829         1                 734
1829           2761         2762              2845
2846           4798         4799              4812
4813           7323         7324              7294
7295           8548         8549              8551
8552           9184         9185              9156
9157           9921         …….               …….
…….            …….
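The step from protein co-ordinates to intergenic co-ordinates can be sketched as below. This is only an illustration in Python, not the programme used in this work; the function names `parse_location` and `intergenic_gaps` are hypothetical. It assumes 1-based inclusive co-ordinates and proteins sorted by start position, and uses the MG002 to MG007 rows of the protein table as input. Overlapping neighbouring genes give a gap whose end precedes its start, that is, a negative length.

```python
def parse_location(location):
    """Split a .ptt Location string such as '735..1829' into (start, end)."""
    start, end = location.split("..")
    return int(start), int(end)


def intergenic_gaps(proteins):
    """Return (start, end, length) for the gap between consecutive proteins.

    proteins: list of (start, end) tuples sorted by start, 1-based inclusive.
    length = end - start + 1, so overlapping genes give a negative length.
    """
    gaps = []
    for (_, prev_end), (next_start, _) in zip(proteins, proteins[1:]):
        gap_start = prev_end + 1
        gap_end = next_start - 1
        gaps.append((gap_start, gap_end, gap_end - gap_start + 1))
    return gaps


# Location strings copied from the protein table (MG002 to MG007).
locations = ["1829..2761", "2846..4798", "4813..7323",
             "7295..8548", "8552..9184", "9157..9921"]
proteins = [parse_location(loc) for loc in locations]

for gap_start, gap_end, length in intergenic_gaps(proteins):
    print(gap_start, gap_end, length, sep="\t")
```

The region before the first gene (1 to 734 here) is not handled by this sketch and would need one extra step.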
CURING OF INTERGENIC REGIONS

Raw intergenic coordinates

Starting  Ending  Length
1         734     734
2762      2845    84
4799      4812    14
7324      7294    -29
8549      8551    3
9185      9156    -28
9922      9923    2
11253     11251   -1
12041     12068   28
12726     12701   -24
13566     13569   4
14434     14395   -38
15317     15555   239

[Bar graph: number of intergenic regions (vertical axis 0 to 1200), before and after curing, for M.gen, M.pne, M.pen, M.gal, M.pul, and M.myc]

GRAPH SHOWING THE CULLING OF THE INTERGENIC SEQUENCES BY THE
C PROGRAMME THAT SELECTS ONLY THE REGIONS WHOSE LENGTH IS
GREATER THAN OR EQUAL TO 50 NUCLEOTIDES

Intergenic region coordinates which are more than 50 nucleotides in length

Starting  Ending  Length
1         734     734
2762      2845    84
15317     15555   239
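The curing step amounts to a length filter over the raw co-ordinates. The sketch below is in Python for illustration only (the work itself used a C programme that is not shown); the function name `cure` is hypothetical, and the threshold is assumed to be inclusive (length of at least 50), following the graph caption.

```python
MIN_LENGTH = 50  # threshold quoted on the slide; assumed inclusive


def cure(regions, min_length=MIN_LENGTH):
    """Keep (start, end) regions whose length is at least min_length.

    Length is end - start + 1, so regions produced by overlapping genes
    (negative length) are discarded together with the short ones.
    """
    kept = []
    for start, end in regions:
        length = end - start + 1
        if length >= min_length:
            kept.append((start, end, length))
    return kept


# Raw intergenic co-ordinates copied from the table on this slide.
raw = [(1, 734), (2762, 2845), (4799, 4812), (7324, 7294), (8549, 8551),
       (9185, 9156), (9922, 9923), (11253, 11251), (12041, 12068),
       (12726, 12701), (13566, 13569), (14434, 14395), (15317, 15555)]

for start, end, length in cure(raw):
    print(start, end, length, sep="\t")
```

Run on the thirteen raw regions listed above, the filter keeps the same three regions as the slide: 1..734, 2762..2845, and 15317..15555.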
INTERGENIC SEQUENCES
>L43967_2762_2845 Mycoplasma genitalium G37 intergenic sequence
AAAACCTTTCATTTTTAATGTGTTATAATTATTTGTTATGCCATAAATTTAGTTTGTGGC
AAAAGCTTCTGTACTGTTTATTTA
>L43967_15317_15555 Mycoplasma genitalium G37 intergenic sequence
ACCCTCAACCTCCTGAGTGCAAATCAGGTGCTCTATCAGTTGAGCTACATCCCCATTATT
GGTGGAAGTAAATGGACTTGAACCATCGACCTCACCCTTATCAGGGGTGTGCTCTAACCA
ACTGAGCTATACTTCCAAGCATAATCCTAAGGGTATTTAACTAATTATTATAACAATTTT
AATTTAACCAAAATACCCCTCGAATTTTAACAGTTTTTATAATCAAAACAGCTAATTTT
>L43967_19760_19824 Mycoplasma genitalium G37 intergenic sequence
ATAAATTTAATAGTGTTGAAAGACAAACATTATTAATTTTTGATCAGCTAAATAAAACAA
AGCAA
>L43967_20356_20543 Mycoplasma genitalium G37 intergenic sequence
CTCAAAAAACTAATACATCAAACTTCAACCGTTTACTTTTTTATGAACAAGCACTACAAA
GGTTTTATGAAGAATTATTTCAAATAGATTATTTAAGAAGATTTGAAAACATTCCCATTA
AAGATAAGAATCAAATTGCGCTTTTTAAAACTGTTTTTGATGATTACAAAACCATTGATT
TAGCAGAA
…………………………………………………………………………………………………………..

Intergenic sequences extracted in Fasta format


Similarity Search - WU BLAST 2.0
• Six genome databases were made, each excluding one organism.
• The intergenic sequences of each organism were searched for similarity (blastn)
against the database that does not contain that organism's genome.
Organism         Database created  Organisms in database
M.gallisepticum  gempppdb          M.genitalium, M.mycoides, M.penetrans, M.pneumoniae, M.pulmonis
M.genitalium     gampppdb          M.gallisepticum, M.mycoides, M.penetrans, M.pneumoniae, M.pulmonis
M.mycoides       ggpppdb           M.gallisepticum, M.genitalium, M.penetrans, M.pneumoniae, M.pulmonis
M.penetrans      ggmpnpudb         M.gallisepticum, M.genitalium, M.mycoides, M.pneumoniae, M.pulmonis
M.pneumoniae     ggmpepudb         M.gallisepticum, M.genitalium, M.mycoides, M.penetrans, M.pulmonis
M.pulmonis       ggmpepndb         M.gallisepticum, M.genitalium, M.mycoides, M.penetrans, M.pneumoniae

Table showing the list of databases made and the organisms in each
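The leave-one-out layout of the six databases can be written down in a few lines. This is a minimal Python sketch of the bookkeeping only; it does not build BLAST databases, and the function name `leave_one_out` is hypothetical.

```python
ORGANISMS = ["M.gallisepticum", "M.genitalium", "M.mycoides",
             "M.penetrans", "M.pneumoniae", "M.pulmonis"]


def leave_one_out(organisms):
    """Map each query organism to the list of all the other organisms."""
    return {query: [other for other in organisms if other != query]
            for query in organisms}


# One line per query organism, listing the genomes it is searched against.
for query, members in leave_one_out(ORGANISMS).items():
    print(f"{query}: {', '.join(members)}")
```

Each query organism is therefore searched against exactly five genomes, never its own.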


Parsing alignments - Factors
• A Perl script, blastn2qrnadepth.pl, is used to parse the blast alignments.
• Factors considered in parsing
  – I trimming
    • E-value
    • Minimum and maximum identity of alignments
    • Length of the alignment
  – II trimming
    • Score
    • Depth of alignments
    • Shift
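The first trimming can be pictured as a simple filter on three properties of each alignment. The sketch below is a Python illustration, not the logic of blastn2qrnadepth.pl itself: the `Hit` record and the function `first_trimming` are hypothetical, the example hits are made up, and treating every cutoff as inclusive is an assumption. The second trimming (score, depth, shift) is not sketched because its exact rules are not described here.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """Hypothetical record for one blastn alignment."""
    evalue: float
    percent_id: float
    length: int


def first_trimming(hits, max_evalue=0.01, min_id=0.0, max_id=100.0,
                   min_length=1):
    """Keep hits that pass the E-value, identity, and length cutoffs."""
    return [hit for hit in hits
            if hit.evalue <= max_evalue
            and min_id <= hit.percent_id <= max_id
            and hit.length >= min_length]


# Made-up hits for illustration; the second one fails the E-value cutoff.
hits = [Hit(1e-20, 92.0, 180), Hit(0.5, 70.0, 40), Hit(0.003, 66.0, 62)]
print(len(first_trimming(hits)), "of", len(hits), "hits kept")
```

The default cutoffs are the ones printed in the parsing report (minimum length 1, maximum E-value 0.01, identity between 0 and 100).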
Parsing alignments – QRNA input

• The Perl script generates several files
– QRNA input file: filename.q
• A collection of sequences in FASTA format, in which each consecutive pair
of sequences holds the two components of an alignment, with gaps left in
place.
– Parsing report file: filename.q.rep
• A report of the blastn alignments that were pruned in the process of
creating the QRNA input file.
QRNA input file
>L43967_15317_15555-1>179-Mycoplasma
ACCCTCAACCTCCTGAGTGCAAATCAGGTGCTCTATCAGTTGAGCTACATCCCCATTATT
GGTGGAAGTAAATGGACTTGAACCATCGACCTCACCCTTATCAGGGGTGTGCTCTAACCA
ACTGAGCTATACTTCCAAGCATAATCCTAAGGGTAT-TTAACTA-ATTATTATAACAATT
T
>gb-U00089--19096>19275-Mycoplasma
ACCCTCAACCTCCTGAGTGCAAATCAGGTGCTCTATCAGTTGAGCTACATCCCCATTATT
GGTGGAAGTAAATGGACTTGAACCATCGACCTCACCCTTATCAGGGGTGTGCTCTAACCA
ACTGAGCTATACTTCCAGGCAAAATCTTC-GTACAGGTTCGCTTCATAATTATATTAATT
T
>L43967_19760_19824-5<65-Mycoplasma
TTGCTTTGTTTTATTTAGCTGATCAA-AAATTAATAATGTTTGTCTTTCAACACTATTAA
AT
>emb-BX293980.1--57200>57261-Mycoplasma
TTGTTTTGTTTTATTTAATTGATCAATAAATTGATTTAGTTTATCTTTATTTATTAATAA
AT
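Because the .q file stores each alignment as two consecutive FASTA records, it can be read back two records at a time. The following Python sketch is an illustration only; the function names `read_fasta` and `alignment_pairs` are hypothetical, and the example is the second alignment shown above. Since gaps are left in place, both members of a pair should have the same length.

```python
def read_fasta(lines):
    """Yield (header, sequence) records from FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def alignment_pairs(lines):
    """Group consecutive records two by two as (query, subject)."""
    records = list(read_fasta(lines))
    if len(records) % 2:
        raise ValueError("expected an even number of sequences")
    return [(records[i], records[i + 1]) for i in range(0, len(records), 2)]


example = """\
>L43967_19760_19824-5<65-Mycoplasma
TTGCTTTGTTTTATTTAGCTGATCAA-AAATTAATAATGTTTGTCTTTCAACACTATTAA
AT
>emb-BX293980.1--57200>57261-Mycoplasma
TTGTTTTGTTTTATTTAATTGATCAATAAATTGATTTAGTTTATCTTTATTTATTAATAA
AT
""".splitlines()

for (q_name, q_seq), (s_name, s_seq) in alignment_pairs(example):
    print(q_name, s_name, len(q_seq), len(s_seq))
```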
Parsing Report File
FILE: genblast
DIR: /home/kalyankpy/coput2/blast//
FIRST TRIMMING
Minimum length = 1
Maximum Evalue = 0.01
Minimum %id = 0
Maximum %id = 100
SECOND TRIMMING
Alignments culled by = SC
Depth of alignments = 1
shift =1

113-QUERY: L43967_546708_546877 Mycoplasma genitalium G37 intergenic sequence


Total # alignments: 1121 After First trimming: 88 After Second trimming: 2
57-QUERY: L43967_325878_326027 Mycoplasma genitalium G37 intergenic sequence
Total # alignments: 152 After First trimming: 3 After Second trimming: 3
……………………………………………………………………………………………….
……………………………………………………………………………………………….

Total #Queries 122


Total #Alignments 53927 ave_len = 309.5
After first trimming 18851 ave_len = 552.6
After second trimming 386 ave_len = 404.2
No. of blastn hits selected for QRNA input

Organism  No. of blastn hits selected for QRNA  No. of alignments
M.gen     386                                   53927
M.pne     565                                   44433
M.pul     1012                                  360830
M.gal     850                                   154026
M.myc     1787                                  430551
M.pen     1852                                  560263

GRAPH SHOWING NUMBER OF ALIGNMENTS SELECTED FOR QRNA INPUT FOR EACH
GENOME THROUGH THE PERL SCRIPT
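To see how strongly the parsing step reduces the data, the share of alignments that reach QRNA can be computed from the two columns of the table. This is a small Python sketch; the figures are copied from the table and the function name `percent_selected` is hypothetical.

```python
# (hits selected for QRNA, total blastn alignments), from the table above.
counts = {
    "M.gen": (386, 53927),
    "M.pne": (565, 44433),
    "M.pul": (1012, 360830),
    "M.gal": (850, 154026),
    "M.myc": (1787, 430551),
    "M.pen": (1852, 560263),
}


def percent_selected(selected, total):
    """Percentage of the blastn alignments that were kept for QRNA."""
    return 100.0 * selected / total


for organism, (selected, total) in counts.items():
    print(f"{organism}: {percent_selected(selected, total):.2f}% selected")
```

For every genome in the table, fewer than 2% of the alignments are passed on to QRNA.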
QRNA – PARAMETERS

• Scanning window approach
  – Window = 150 nt; Extension = 50 nt
• Maximum length 9999999
• Local Viterbi algorithm
• RIBOPROB matrix
• Shuffling the sequence while maintaining the composition
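The scanning window idea can be illustrated with a short Python sketch. It assumes that the 50 nt "extension" is the step by which the 150 nt window slides, and that the last window is simply cut off at the end of the alignment; how QRNA itself treats the final window is not stated here, so that part is an assumption. The function name `windows` is hypothetical.

```python
def windows(length, window=150, slide=50):
    """Return half-open (start, end) windows covering an alignment.

    Positions are 0-based; (0, 150) covers alignment columns 0..149.
    """
    spans = []
    for start in range(0, length, slide):
        end = min(start + window, length)
        spans.append((start, end))
        if end == length:
            break  # the alignment end has been reached
    return spans


# A 664-column alignment, the length of the first alignment in the output.
for start, end in windows(664):
    print(start, end - 1)
```

With these assumptions the first window covers columns 0 to 149 of the alignment.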
QRNA OUTPUT
#---------------------------------------------------------------------
# qrna 2.0.1 (Tue Aug 19 11:30:55 CDT 2003) using squid 1.5m (Sept 1997)
#---------------------------------------------------------------------
# PAM model = BLOSUM62
#---------------------------------------------------------------------
# RNA model = /mix_tied_linux.cfg
# RIBOPROB matrix = /RIBOPROB85-60.mat
#---------------------------------------------------------------------
# seq file = /home/kalyankpy/perlscriptresult/genblast.q
# #seqs: 772 (max_len = 3420)
#---------------------------------------------------------------------
# window version: window = 150 slide = 50 -- length range = [0,9999999]
#---------------------------------------------------------------------
# 1 [both strands] (sre_shuffled)
>L43967_1_734-90>722-Mycoplasma (664)
>gb-U00089--130>767-Mycoplasma (664)

length of whole alignment after removing common gaps: 664


Divergence time (variable): 0.401
[alignment ID = 61.75 MUT = 29.67 GAP = 8.58
………………………………………………………… ……………….. ( CONTD..)
QRNA OUTPUT
length alignment: 150 (id=61.33) (mut=32.67) (gap=6.00)(sre_shuffled)

posX: 0-149 [0-145](146) -- (0.42 0.08 0.06 0.43)


posY: 0-149 [0-144](145) -- (0.37 0.11 0.06 0.46)

L43967_1_734-90 TTAATTTTATTAAAACTATAACTTATTTTTTATAAACATTCTATGTTTTT
gb-U00089--130> TTTATTTTATTAAAATTATAATGTATTTTTGTTAAATTTT.TAATTCTTT

………………………………………………………………………………………………………………………………………………………………………………

LOCAL_DIAG_VITERBI -- [Inside SCFG]

OTH ends *(+) = (0..[150]..149)


OTH ends (-) = (0..[150]..149)

COD ends *(+) = (120..[27]..146)


COD ends (-) = (41..[12]..52)

RNA ends *(+) = (0..[21]..20)


RNA ends (-) = (0..[150]..149)

winner = OTH
OTH = 184.281 COD = 166.408 RNA = 179.710
logoddspostOTH = 0.000 logoddspostCOD = -17.873 logoddspostRNA = -4.571
sigmoidalOTH = 4.571 sigmoidalCOD = -17.932 sigmoidalRNA = -4.571
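The three kinds of scores in the last lines of this output are related by simple arithmetic. The Python sketch below is an interpretation that happens to reproduce the printed numbers, not code taken from QRNA: it assumes the model scores are in bits (log base 2), that "logoddspost" is each score minus the OTH score, and that "sigmoidal" is the log odds of one model against the other two combined with equal prior weight. The function names are hypothetical.

```python
import math


def log2_sum_exp2(values):
    """Return log2(sum(2**v)) without overflow for large v."""
    top = max(values)
    return top + math.log2(sum(2.0 ** (v - top) for v in values))


def log_odds_vs_oth(scores):
    """Score of each model minus the OTH score."""
    return {name: score - scores["OTH"] for name, score in scores.items()}


def sigmoidal(scores):
    """Log2 odds of each model against the other models combined."""
    result = {}
    for name, score in scores.items():
        others = [s for other, s in scores.items() if other != name]
        result[name] = score - log2_sum_exp2(others)
    return result


# Scores copied from the window shown above.
scores = {"OTH": 184.281, "COD": 166.408, "RNA": 179.710}

for name, value in log_odds_vs_oth(scores).items():
    print(f"logoddspost{name} = {value:.3f}")
for name, value in sigmoidal(scores).items():
    print(f"sigmoidal{name} = {value:.3f}")
```

Under this reading, a window is called RNA only when the RNA score clearly exceeds both the OTH and COD scores; here OTH wins, so the sigmoidal score is positive for OTH and negative for RNA and COD.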
[Bar graph: number of ncRNAs predicted (vertical axis 0 to 60) for M.pen, M.myc, M.gal, M.pul, M.pne, M.gen]

Number of ncRNAs predicted for each organism
Length range of non-coding RNAs predicted

[Chart: length in nt (vertical axis 0 to 350) of the predicted non-coding RNAs for M.pen, M.myc, M.gal, M.pul, M.pne, M.gen]

PICTURE SHOWING THE LENGTH RANGE OF NON-CODING RNAs.
(Vertical bars represent the spread of scores
and horizontal bars represent the average)
Putative Vs Annotated
• The predicted ncRNAs were searched for similarity against the
biochemically characterized ncRNAs of bacteria (Non-coding RNA database
at http://biobases.ibch.poznan.pl/nc, updated 2002).
• A putative ncRNA was found to be similar to the Mc_MCS4 ncRNA of
Mycoplasma capricolum.
• Mc_MCS4 has already been characterized as having extensive homology with
the eukaryotic U6 snRNA.
• Another motif in one of the putative ncRNAs was found to be conserved
across E.coli, S.typhi, and K.pneumoniae as part of the MicF ncRNA in these
organisms.
• MicF has been characterized as regulating the expression of the OmpF
protein in these organisms.
• Similarity was also found with the OxyS ncRNA of E.coli.
• OxyS has been found to modulate the expression of various genes in
response to hydrogen peroxide.
- In Eukaryotes
• Similarity was observed with a few miRNAs present in the miRNA database
(Rfam miRNA registry).
• The same stretch of sequence was present in human, rat, and mouse miRNAs.
• Small stretches of similarity were also observed with various ncRNAs that
play a role in the regulation of development.
Sequences producing High-scoring Segment Pairs: Score P(N) N

hsa-mir-190 MI0000486 Homo sapiens miR-190 stem-loop 91 0.26 1


rno-mir-190 MI0000933 Rattus norvegicus miR-190 stem-loop 91 0.26 1
mmu-mir-190 MI0000232 Mus musculus miR-190 stem-loop 86 0.48 1

>hsa-mir-190 MI0000486 Homo sapiens miR-190 stem-loop


Length = 85

Minus Strand HSPs:

Score = 91 (19.7 bits), Expect = 0.31, P = 0.26


Identities = 45/68 (66%), Positives = 45/68 (66%), Strand = Minus / Plus

Query: 77 AGGTTTAGGTGTTCT-TATTT-ATTTATTAGGTTGTTTAGTT--TC-AATTATTTTTGGA 23
||| | |||| | | ||| || |||||||||||| | || || || ||| | | |
Sbjct: 4 AGGCCTCTGTGTGATATGTTTGATATATTAGGTTGTT-ATTTAATCCAACTATATATCAA 62

Query: 22 ATACTAGT 15
| | || |
Sbjct: 63 ACA-TATT 69
>Hs_NTT
Length = 17,572

Plus Strand HSPs:

Score = 116 (23.5 bits), Expect = 0.025, P = 0.024


Identities = 60/94 (63%), Positives = 60/94 (63%), Strand = Plus / Plus

Query: 11 TATTTAATATTTATAATTGCTATTTAGCATCTTAAAA-AAGA-CG-TCTTT-AAA-TATA 65
|| |||| | || ||| | | || | |||| | ||| | |||| ||| ||||
Sbjct: 5336 TACATAAT-TAGATCATTTATTCTAAGTAAATTAAGAGAAGCTCTATCTTCCAAAATATA 5394

Query: 66 GATAGTTATACTAATTAGAAAATAGTTAAT-AAG 98
|||| | || ||| |||| | ||||| |||
Sbjct: 5395 GATATCTCTAGCAAT-AGAAGAGTTTTAATTAAG 5427
CONCLUSIONS
• Comparative genomic analysis was selected for ncRNA prediction.
• The procedure for the prediction was standardized.
• One of the putative ncRNAs was found to be similar to an already
characterized ncRNA from the same genus.
• A conserved region of MicF was also found in a putative ncRNA.
• A counterpart of a eukaryotic miRNA was identified in Mycoplasma.
Future Plans
• To develop programmes that derive the intergenic region co-ordinates from
the protein table file.
• To verify the validity of the predictions beyond the homologous regions
found in bacteria.
• To extend the prediction procedure to eukaryotes.
• To develop a procedure for classifying the predicted ncRNAs into
subclasses.
• To identify the functions of the putative ncRNAs by searching for their
effector targets.
• To automate the whole procedure.
ACKNOWLEDGMENTS
Dr. Z. A. Rafi
Dr. S. Krishnaswamy
The Whole SBT family
Ministry of Human Resource Development
Department of Education
Department of Science and Technology
Department of Biotechnology
All my classmates