MSC Project

NON-CODING RNA PREDICTION OF CLINICALLY IMPORTANT MYCOPLASMA BY COMPARATIVE GENOMIC ANALYSIS
Dissertation submitted to the Madurai Kamaraj University in partial fulfillment of the requirements for the degree of Master of Science in Biotechnology
School of Biotechnology
Madurai Kamaraj University
Madurai
Regn. No:A242009
OBJECTIVES:
• To choose the best possible approach to predict ncRNAs.
• To standardize the procedure required for the selected approach.
• To identify and characterize the ncRNAs from clinically important Mycoplasma.
• To form the base for automating the ncRNA prediction procedure.
Past
• Sequence similarity search, statistical analysis, transcription signal analysis, comparative genomic analysis.
• Existing methods are biased towards particular classes of ncRNAs only.
• Examples: tRNAscan-SE, MiRscan, etc.
QRNA - A Blend
• Secondary structure alone is not statistically significant for the detection of ncRNAs.
• Sequences that code for proteins or perform other important functions are conserved across related organisms.
QRNA was developed to pick out conserved RNA secondary structures from the background of other conserved sequences.
OUTLINE
INTERGENIC REGIONS OF ORGANISM OF INTEREST
↓ blastn
SEARCH FOR HOMOLOGY ACROSS RELATED ORGANISMS
↓ Perl scripts
PARSE THE ALIGNMENTS WITH CERTAIN CUTOFFS
↓
ALIGNMENTS GIVEN AS INPUT TO QRNA
↓
PUTATIVE ncRNA
PROTEIN CODING REGION → INTERGENIC REGION
.ptt file
↓
Co-ordinates of protein coding regions
↓
Intergenic region co-ordinates
↓
Intergenic region co-ordinates with length > 50 nucleotides
↓
Range file
↓
Intergenic sequence extraction by EMBOSS application
extractseq -regions @rangefile -separate
GENOME LENGTH COMPARISON OF THE MYCOPLASMA
Organism          Abbreviation   Genome size (bp)
M.gallisepticum   M.gal          996,422
M.genitalium      M.gen          580,074
M.mycoides        M.myc          1,211,703
M.penetrans       M.pen          1,358,633
M.pneumoniae      M.pne          816,394
M.pulmonis        M.pul          963,879
[Bar chart: genome length (0 to 1,500,000) for M.gen, M.pne, M.pul, M.gal, M.myc and M.pen]
MYCOPLASMA GENOME – INTERGENIC REGION
[Bar graph: percentage of intergenic region in the genome (0% to 100%) for M.gen, M.pne, M.pen, M.gal, M.pul, M.myc and Homo sapiens]
BAR GRAPH SHOWING THE PERCENTAGE OF INTERGENIC REGION IN THE GENOME OF MYCOPLASMA
PROTEIN TABLE OF THE GENOME
Mycoplasma genitalium G37 complete genome - 0..580074

480 proteins
Location      Strand  Length  PID      Gene   Synonym  Code  COG  Product
735..1829     +       364     3844620  MG001  -        -     -    (dnaN)
1829..2761    +       310     1045670  MG002  -        -     -    dnaJ
2846..4798    +       650     1045671  MG003  -        -     -    (gyrB)
4813..7323    +       836     1045672  MG004  -        -     -    (gyrA)
7295..8548    +       417     1045673  MG005  -        -     -    (serS)
8552..9184    +       210     1045674  MG006  -        -     -    (tmk)
9157..9921    +       254     1045675  MG007  -        -     -    hypothetical
9924..11252   +       442     1045676  MG008  -        -     -    (tdhF)
…… …….. … ….. ……….. ……… .. .. .. …
Protein Co-ordinates → Intergenic Co-ordinates
Protein start  Protein end   Intergenic start  Intergenic end
735            1829          1                 734
1829           2761          2762              2845
2846           4798          4799              4812
4813           7323          7324              7294
7295           8548          8549              8551
8552           9184          9185              9156
9157           9921          …….               …….
…….            …….
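The conversion from protein co-ordinates to intergenic co-ordinates can be sketched as follows. This is only an illustration in Python, not the program used in the project; the function names `parse_location` and `intergenic_regions` are hypothetical. It assumes the genes are sorted by start position and uses 1-based inclusive co-ordinates, so overlapping neighbours give a negative length (these are removed later by the curing step). The region after the last gene is not reported.

```python
from typing import List, Tuple


def parse_location(location: str) -> Tuple[int, int]:
    """Split a .ptt location such as '735..1829' into two integers."""
    start, end = location.split("..")
    return int(start), int(end)


def intergenic_regions(genes: List[Tuple[int, int]]) -> List[Tuple[int, int, int]]:
    """Return (start, end, length) for the gap in front of each gene.

    `genes` holds 1-based inclusive (start, end) pairs sorted by start.
    """
    regions = []
    previous_end = 0  # nothing precedes the first base of the genome
    for start, end in genes:
        gap_start = previous_end + 1
        gap_end = start - 1
        # The length is negative when neighbouring genes overlap.
        regions.append((gap_start, gap_end, gap_end - gap_start + 1))
        previous_end = end
    return regions


# Location column of the first eight rows of the protein table shown earlier.
locations = ["735..1829", "1829..2761", "2846..4798", "4813..7323",
             "7295..8548", "8552..9184", "9157..9921", "9924..11252"]
genes = [parse_location(loc) for loc in locations]
regions = intergenic_regions(genes)
for region in regions:
    print(*region, sep="\t")
```

Unlike the slide, the sketch also reports the one-base overlap between the first two genes (1829 is both an end and a start) as a region of length -1.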
CURING
Curing of Intergenic Regions

Raw intergenic co-ordinates
Starting  Ending  Length
1         734     734
2762      2845    84
4799      4812    14
7324      7294    -29
8549      8551    3
9185      9156    -28
9922      9923    2
11253     11251   -1
12041     12068   28
12726     12701   -24
13566     13569   4
14434     14395   -38
15317     15555   239

Intergenic region co-ordinates which are more than 50 nucleotides in length
Starting  Ending  Length
1         734     734
2762      2845    84
15317     15555   239

[Graph: number of intergenic regions (0 to 1200) before and after curing for M.gen, M.pne, M.pen, M.gal, M.pul and M.myc]
GRAPH SHOWING THE CULLING OF THE INTERGENIC SEQUENCES BY THE C PROGRAMME THAT SELECTS THE REGIONS WHOSE LENGTH IS GREATER THAN OR EQUAL TO 50 NUCLEOTIDES ONLY
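The curing step can be illustrated with a short sketch. The project used a C programme for this; the Python version below is a stand-in. The function names `cure_regions` and `range_file_text` are hypothetical, and the cutoff is taken as "at least 50 nucleotides", following the graph caption. The range-file layout (one start and one end per line, separated by whitespace) is my understanding of what `extractseq -regions @rangefile` accepts and should be checked against the EMBOSS documentation.

```python
from typing import Iterable, List, Tuple

Region = Tuple[int, int]


def cure_regions(regions: Iterable[Region], min_length: int = 50) -> List[Region]:
    """Keep regions whose inclusive length is at least `min_length`."""
    kept = []
    for start, end in regions:
        length = end - start + 1  # negative for overlapping genes
        if length >= min_length:
            kept.append((start, end))
    return kept


def range_file_text(regions: Iterable[Region]) -> str:
    """Format regions as one 'start<TAB>end' line each (assumed layout)."""
    return "".join(f"{start}\t{end}\n" for start, end in regions)


# Raw intergenic co-ordinates listed on the slide.
raw = [(1, 734), (2762, 2845), (4799, 4812), (7324, 7294), (8549, 8551),
       (9185, 9156), (9922, 9923), (11253, 11251), (12041, 12068),
       (12726, 12701), (13566, 13569), (14434, 14395), (15317, 15555)]
cured = cure_regions(raw)
print(range_file_text(cured), end="")
```

Only three of the thirteen raw regions survive, matching the cured table; the last one, 15317 to 15555, is 239 nucleotides long.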
INTERGENIC SEQUENCES
>L43967_2762_2845 Mycoplasma genitalium G37 intergenic sequence
AAAACCTTTCATTTTTAATGTGTTATAATTATTTGTTATGCCATAAATTTAGTTTGTGGC
AAAAGCTTCTGTACTGTTTATTTA
ACCCTCAACCTCCTGAGTGCAAATCAGGTGCTCTATCAGTTGAGCTACATCCCCATTATT
GGTGGAAGTAAATGGACTTGAACCATCGACCTCACCCTTATCAGGGGTGTGCTCTAACCA
ACTGAGCTATACTTCCAAGCATAATCCTAAGGGTATTTAACTAATTATTATAACAATTTT
AATTTAACCAAAATACCCCTCGAATTTTAACAGTTTTTATAATCAAAACAGCTAATTTT
ATAAATTTAATAGTGTTGAAAGACAAACATTATTAATTTTTGATCAGCTAAATAAAACAA
AGCAA
CTCAAAAAACTAATACATCAAACTTCAACCGTTTACTTTTTTATGAACAAGCACTACAAA
GGTTTTATGAAGAATTATTTCAAATAGATTATTTAAGAAGATTTGAAAACATTCCCATTA
AAGATAAGAATCAAATTGCGCTTTTTAAAACTGTTTTTGATGATTACAAAACCATTGATT
TAGCAGAA
…………………………………………………………………………………………………………..
Intergenic sequences extracted in Fasta format

Similarity Search - WU BLAST 2.0
• Six genome databases were made, each excluding one organism.
• Intergenic sequences of each organism were searched for similarity (blastn) against the database that does not contain that organism's genome.
Organism          Database created   Organisms in database
M.gallisepticum   gempppdb           M.genitalium, M.mycoides, M.penetrans, M.pneumoniae, M.pulmonis
M.genitalium      gampppdb           M.gallisepticum, M.mycoides, M.penetrans, M.pneumoniae, M.pulmonis
M.mycoides        ggpppdb            M.gallisepticum, M.genitalium, M.penetrans, M.pneumoniae, M.pulmonis
M.penetrans       ggmpnpudb          M.gallisepticum, M.genitalium, M.mycoides, M.pneumoniae, M.pulmonis
M.pneumoniae      ggmpepudb          M.gallisepticum, M.genitalium, M.mycoides, M.penetrans, M.pulmonis
M.pulmonis        ggmpepndb          M.gallisepticum, M.genitalium, M.mycoides, M.penetrans, M.pneumoniae
Table showing the list of databases made and the organisms
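The leave-one-out design of the databases can be written down in a few lines. The sketch below only computes which genomes belong in the database for each query organism. It does not build any BLAST database, and the function name `leave_one_out` is hypothetical.

```python
from typing import Dict, List

ORGANISMS = ["M.gallisepticum", "M.genitalium", "M.mycoides",
             "M.penetrans", "M.pneumoniae", "M.pulmonis"]


def leave_one_out(organisms: List[str]) -> Dict[str, List[str]]:
    """Map each query organism to the genomes that go into its database."""
    return {query: [name for name in organisms if name != query]
            for query in organisms}


databases = leave_one_out(ORGANISMS)
for query, members in databases.items():
    print(f"{query}: {', '.join(members)}")
```

Each query is searched against the other five genomes, so a hit can never be a match of an intergenic sequence against its own genome.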

Parsing alignments - Factors
• The Perl script blastn2qrnadepth.pl is used to parse the blastn alignments.
• Factors considered in parsing
– First trimming
• E-value
• Minimum and maximum identity of alignments
• Length of the alignment
– Second trimming
• Score
• Depth of alignments
• Shift
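The first trimming can be pictured as a simple filter over the alignments. The sketch below is not blastn2qrnadepth.pl: the `Alignment` record, the function `first_trimming` and the two example hits are hypothetical. The default cutoffs are the ones printed in the parsing report in this document (minimum length 1, maximum E-value 0.01, identity between 0 and 100). The second trimming (score, depth, shift) is left out because the slides do not describe how it works.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class Alignment:
    """Minimal stand-in for one parsed blastn alignment (hypothetical)."""
    query: str
    subject: str
    evalue: float
    percent_identity: float
    length: int


def first_trimming(alignments: Iterable[Alignment],
                   min_length: int = 1,
                   max_evalue: float = 0.01,
                   min_identity: float = 0.0,
                   max_identity: float = 100.0) -> List[Alignment]:
    """Keep alignments that pass the length, E-value and identity cutoffs."""
    return [a for a in alignments
            if a.length >= min_length
            and a.evalue <= max_evalue
            and min_identity <= a.percent_identity <= max_identity]


# Two made-up hits: the second fails the E-value cutoff.
hits = [Alignment("query_A", "subject_X", 1e-20, 62.0, 660),
        Alignment("query_A", "subject_Y", 0.5, 70.0, 40)]
kept = first_trimming(hits)
print(len(kept), "of", len(hits), "alignments kept")
```

Narrowing the identity range, for example `min_identity=65`, would discard alignments that are too divergent to be informative.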
Parsing alignments – QRNA input
• The Perl script generates several files:
– QRNA input file: filename.q
• A collection of sequences in FASTA format, in which each consecutive pair of sequences holds the two components of an alignment, with gaps left in place.
– Parsing report file: filename.q.rep
• A report of the blastn alignments that were pruned in the process of creating the QRNA input file.
QRNA input file
>L43967_15317_15555-1>179-Mycoplasma
ACTGAGCTATACTTCCAAGCATAATCCTAAGGGTAT-TTAACTA-ATTATTATAACAATT
T
>gb-U00089--19096>19275-Mycoplasma
ACTGAGCTATACTTCCAGGCAAAATCTTC-GTACAGGTTCGCTTCATAATTATATTAATT
T
>L43967_19760_19824-5<65-Mycoplasma
TTGCTTTGTTTTATTTAGCTGATCAA-AAATTAATAATGTTTGTCTTTCAACACTATTAA
AT
>emb-BX293980.1--57200>57261-Mycoplasma
TTGTTTTGTTTTATTTAATTGATCAATAAATTGATTTAGTTTATCTTTATTTATTAATAA
AT
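The pairing convention of the .q file (query sequence followed by its aligned subject, gaps kept) can be checked with a small reader. This is a sketch only; `read_fasta` and `alignment_pairs` are hypothetical helper names. The sample is the first pair from the excerpt above, truncated exactly as it appears on the slide.

```python
from typing import Iterable, List, Tuple

Record = Tuple[str, str]  # (header without '>', sequence with gaps)


def read_fasta(lines: Iterable[str]) -> List[Record]:
    """Read FASTA records, joining wrapped sequence lines."""
    records: List[Record] = []
    name = None
    chunks: List[str] = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks)))
            name, chunks = line[1:], []
        else:
            chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


def alignment_pairs(records: List[Record]) -> List[Tuple[Record, Record]]:
    """Group consecutive records into (query, subject) alignment pairs."""
    if len(records) % 2 != 0:
        raise ValueError("expected an even number of sequences")
    pairs = []
    for i in range(0, len(records), 2):
        query, subject = records[i], records[i + 1]
        if len(query[1]) != len(subject[1]):
            raise ValueError(f"unequal aligned lengths in pair {i // 2 + 1}")
        pairs.append((query, subject))
    return pairs


sample = [
    ">L43967_15317_15555-1>179-Mycoplasma",
    "ACTGAGCTATACTTCCAAGCATAATCCTAAGGGTAT-TTAACTA-ATTATTATAACAATT",
    "T",
    ">gb-U00089--19096>19275-Mycoplasma",
    "ACTGAGCTATACTTCCAGGCAAAATCTTC-GTACAGGTTCGCTTCATAATTATATTAATT",
    "T",
]
pairs = alignment_pairs(read_fasta(sample))
print(len(pairs), "pair(s); aligned length", len(pairs[0][0][1]))
```

Because gaps are left in place, the two members of a pair must have the same length; the check in `alignment_pairs` catches files where that does not hold.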
Parsing Report File
FILE: genblast
DIR: /home/kalyankpy/coput2/blast//
FIRST TRIMMING
Minimum length = 1
Maximum Evalue = 0.01
Minimum %id = 0
Maximum %id = 100
SECOND TRIMMING
Alignments culled by = SC
Depth of alignments = 1
shift =1
113-QUERY: L43967_546708_546877 Mycoplasma genitalium G37 intergenic sequence

Total # alignments: 1121 After First trimming: 88 After Second trimming: 2
57-QUERY: L43967_325878_326027 Mycoplasma genitalium G37 intergenic sequence
Total # alignments: 152 After First trimming: 3 After Second trimming: 3
……………………………………………………………………………………………….
……………………………………………………………………………………………….
Total #Queries 122

Total #Alignments 53927 ave_len = 309.5
After first trimming 18851 ave_len = 552.6
After second trimming 386 ave_len = 404.2
Organism   No. of alignments selected for QRNA input   No. of blastn hits
M.gen      386                                          53927
M.pne      565                                          44433
M.pul      1012                                         360830
M.gal      850                                          154026
M.myc      1787                                         430551
M.pen      1852                                         560263
GRAPH SHOWING NUMBER OF ALIGNMENTS SELECTED FOR QRNA INPUT FOR EACH GENOME THROUGH THE PERL SCRIPT
QRNA – PARAMETERS
• Scanning window approach
– Window = 150 nt; Slide (extension) = 50 nt
• Maximum length: 9999999
• Local Viterbi algorithm
• RIBOPROB matrix
• Shuffling the sequence while maintaining the composition
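The scanning window approach can be sketched as follows: a 150-nt window is moved along the alignment in steps of 50 nt, and each window is scored separately. The function name `scanning_windows` is hypothetical, and the handling of the final, shorter window is an assumption of this sketch rather than documented QRNA behaviour. Positions are 0-based and inclusive, as in the QRNA output.

```python
from typing import List, Tuple


def scanning_windows(alignment_length: int, window: int = 150,
                     slide: int = 50) -> List[Tuple[int, int]]:
    """Return 0-based inclusive (start, end) windows over an alignment."""
    windows = []
    start = 0
    while start < alignment_length:
        end = min(start + window, alignment_length) - 1
        windows.append((start, end))
        if end == alignment_length - 1:
            break  # the alignment end has been reached
        start += slide
    return windows


# The 664-column alignment reported in the QRNA output of this project.
windows = scanning_windows(664)
for start, end in windows:
    print(start, end)
```

An alignment shorter than the window is covered by a single window spanning its full length.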
QRNA OUTPUT
#---------------------------------------------------------------------
# qrna 2.0.1 (Tue Aug 19 11:30:55 CDT 2003) using squid 1.5m (Sept 1997)
#---------------------------------------------------------------------
# PAM model = BLOSUM62
#---------------------------------------------------------------------
# RNA model = /mix_tied_linux.cfg
# RIBOPROB matrix = /RIBOPROB85-60.mat
#---------------------------------------------------------------------
# seq file = /home/kalyankpy/perlscriptresult/genblast.q
# #seqs: 772 (max_len = 3420)
#---------------------------------------------------------------------
# window version: window = 150 slide = 50 -- length range = [0,9999999]
#---------------------------------------------------------------------
# 1 [both strands] (sre_shuffled)
>L43967_1_734-90>722-Mycoplasma (664)
>gb-U00089--130>767-Mycoplasma (664)
length of whole alignment after removing common gaps: 664

Divergence time (variable): 0.401
[alignment ID = 61.75 MUT = 29.67 GAP = 8.58
………………………………………………………… ……………….. ( CONTD..)
QRNA OUTPUT
length alignment: 150 (id=61.33) (mut=32.67) (gap=6.00)(sre_shuffled)
posX: 0-149 [0-145](146) -- (0.42 0.08 0.06 0.43)

posY: 0-149 [0-144](145) -- (0.37 0.11 0.06 0.46)
L43967_1_734-90 TTAATTTTATTAAAACTATAACTTATTTTTTATAAACATTCTATGTTTTT
gb-U00089--130> TTTATTTTATTAAAATTATAATGTATTTTTGTTAAATTTT.TAATTCTTT
………………………………………………………………………………………………………………………………………………………………………………
LOCAL_DIAG_VITERBI -- [Inside SCFG]
OTH ends *(+) = (0..[150]..149)

OTH ends (-) = (0..[150]..149)
COD ends *(+) = (120..[27]..146)

COD ends (-) = (41..[12]..52)
RNA ends *(+) = (0..[21]..20)

RNA ends (-) = (0..[150]..149)
winner = OTH
OTH = 184.281 COD = 166.408 RNA = 179.710
logoddspostOTH = 0.000 logoddspostCOD = -17.873 logoddspostRNA = -4.571
sigmoidalOTH = 4.571 sigmoidalCOD = -17.932 sigmoidalRNA = -4.571
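The three kinds of score in this output are related by simple arithmetic, which the sketch below reproduces. The relationships are inferred from the printed numbers, assuming the scores are in bits (log base 2); they are not taken from the QRNA documentation. The function names are hypothetical. The log-odds posterior is taken as each model's score minus the OTH score, and the sigmoidal score as each model's score minus the log2 of the summed likelihoods of the other two models.

```python
import math
from typing import Dict, List


def log2_sum(values: List[float]) -> float:
    """log2 of a sum of powers of two, computed without overflow."""
    top = max(values)
    return top + math.log2(sum(2.0 ** (v - top) for v in values))


def posterior_log_odds(scores: Dict[str, float]) -> Dict[str, float]:
    """Score of each model relative to the OTH (null) model."""
    return {model: score - scores["OTH"] for model, score in scores.items()}


def sigmoidal_scores(scores: Dict[str, float]) -> Dict[str, float]:
    """Score of each model relative to the two competing models combined."""
    result = {}
    for model, score in scores.items():
        others = [s for m, s in scores.items() if m != model]
        result[model] = score - log2_sum(others)
    return result


# Scores printed for the first window of the output above.
scores = {"OTH": 184.281, "COD": 166.408, "RNA": 179.710}
winner = max(scores, key=scores.get)
log_odds = posterior_log_odds(scores)
sigmoidal = sigmoidal_scores(scores)
print("winner =", winner)
for model in scores:
    print(f"{model}: logodds = {log_odds[model]:.3f}  sigmoidal = {sigmoidal[model]:.3f}")
```

With these inputs the sketch agrees with the printed values to three decimals. For this window the winner is OTH, so the window is not called as a putative ncRNA.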
[Bar graph: number of ncRNAs predicted (0 to 60) for M.pen, M.myc, M.gal, M.pul, M.pne and M.gen]
Number of ncRNAs predicted for each organism
[Graph: length range of the predicted non-coding RNAs, length in nt (0 to 350), for M.pen, M.myc, M.gal, M.pul, M.pne and M.gen]
PICTURE SHOWING THE LENGTH RANGE OF NON-CODING RNAs.
(Vertical bars represent the spread of scores and horizontal bars represent the average)
Putative Vs Annotated
• The predicted ncRNAs were searched for similarity against the biochemically characterized ncRNAs of bacteria (Non-coding RNA database at http://biobases.ibch.poznan.pl/nc, updated 2002).
• One putative ncRNA was found to be similar to the Mc_MCS4 ncRNA of Mycoplasma capricolum.
• Mc_MCS4 has already been characterized as having extensive homology with the eukaryotic U6 snRNA.
• Another motif in one of the putative ncRNAs was found to be conserved across E.coli, S.typhi and K.pneumoniae as a part of the MicF ncRNA in these organisms.
• MicF has been characterized as regulating the expression of the OmpF protein in these organisms.
• Similarity was also found with the OxyS ncRNA of E.coli.
• OxyS was found to modulate the expression of various genes in response to hydrogen peroxide.
- In Eukaryotes
• Similarity was observed with a few miRNAs present in the miRNA database (Rfam miRNA registry).
• The same stretch of sequence was present in human, rat and mouse miRNA.
• Small stretches of similarity were also observed with various ncRNAs that play a role in the regulation of development.
Sequences producing High-scoring Segment Pairs: Score P(N) N
hsa-mir-190 MI0000486 Homo sapiens miR-190 stem-loop 91 0.26 1

rno-mir-190 MI0000933 Rattus norvegicus miR-190 stem-loop 91 0.26 1
mmu-mir-190 MI0000232 Mus musculus miR-190 stem-loop 86 0.48 1
>hsa-mir-190 MI0000486 Homo sapiens miR-190 stem-loop

Length = 85
Minus Strand HSPs:
Score = 91 (19.7 bits), Expect = 0.31, P = 0.26

Identities = 45/68 (66%), Positives = 45/68 (66%), Strand = Minus / Plus
Query: 77 AGGTTTAGGTGTTCT-TATTT-ATTTATTAGGTTGTTTAGTT--TC-AATTATTTTTGGA 23
||| | |||| | | ||| || |||||||||||| | || || || ||| | | |
Sbjct: 4 AGGCCTCTGTGTGATATGTTTGATATATTAGGTTGTT-ATTTAATCCAACTATATATCAA 62
Query: 22 ATACTAGT 15
| | || |
Sbjct: 63 ACA-TATT 69
>Hs_NTT
Length = 17,572
Plus Strand HSPs:
Score = 116 (23.5 bits), Expect = 0.025, P = 0.024

Identities = 60/94 (63%), Positives = 60/94 (63%), Strand = Plus / Plus
Query: 11 TATTTAATATTTATAATTGCTATTTAGCATCTTAAAA-AAGA-CG-TCTTT-AAA-TATA 65
|| |||| | || ||| | | || | |||| | ||| | |||| ||| ||||
Sbjct: 5336 TACATAAT-TAGATCATTTATTCTAAGTAAATTAAGAGAAGCTCTATCTTCCAAAATATA 5394
Query: 66 GATAGTTATACTAATTAGAAAATAGTTAAT-AAG 98
|||| | || ||| |||| | ||||| |||
Sbjct: 5395 GATATCTCTAGCAAT-AGAAGAGTTTTAATTAAG 5427
CONCLUSIONS
• Comparative genomic analysis was selected for the ncRNA prediction.
• The procedure for the prediction was standardized.
• One of the putative ncRNAs was found to be similar to an already characterized ncRNA from the same genus.
• A conserved region of MicF was also found in a putative ncRNA.
• Identification of the eukaryotic miRNA counterpart in Mycoplasma.
Future Plans
• To develop programmes that derive the intergenic region co-ordinates from the protein table file given as input.
• To verify the genuineness of the predictions beyond the homologous regions found in bacteria.
• To extend the prediction procedure to Eukaryotes.
• To develop the procedure required for classifying the predicted ncRNAs into subclasses.
• To identify the functions of the putative ncRNAs by searching for their effector targets.
• To automate the whole procedure.
ACKNOWLEDGMENTS
Dr. Z. A. Rafi
Dr. S. Krishnaswamy
The Whole SBT family
Ministry of Human Resource Development
Department of Education
Department of Science and Technology
Department of Biotechnology
All my classmates
