14. Web Programming and Bioinformatics
In Section 3.2, we learned how the syntax of the Pascal programming language can be defined with a CFG. (Appendix A
gives the complete definition of Pascal as a syntax flow graph.) Other programming languages can also be defined
formally with a grammar, apart from a few cases of context dependency, such as the double declaration of a variable
in a block and the use of numeric labels in DO loops in FORTRAN.
This chapter first shows that the major parts of two popular Web languages, HTML (Hypertext Markup Language) and
XML (Extensible Markup Language), the latter designed for e-commerce, can also be defined in terms of a CFG. The
chapter then presents a potential application of formal languages to molecular biology.
Web Programming and Bioinformatics
14.1 Hypertext Markup Language (HTML)
In this section, to see how our knowledge of formal languages applies to
investigating the properties of programming languages, we first examine HTML.
The left box below shows an informal definition (itemized for convenience) of
an HTML list of the kind that commonly appears in a text. To the right are the
CFG rules translated from the informal definition.
14.2 DTD and XML
The major objective of HTML is to define the format of a document, so it is not
well suited to describing the semantics of the text, which is needed in areas
such as e-commerce. XML was introduced to overcome this drawback. More precisely,
XML is the language (i.e., the set of documents) written according to a DTD
(Document Type Definition), which can be transformed into a CFG. Every DTD has
the following form:
<!DOCTYPE name-of-DTD [ List-of-Element-Definitions ]>
where each element definition has the following format:
<!ELEMENT element-name (Element-Description)>
This format can be transformed into the following CFG rules, where DTD corresponds
to the start symbol S.
DTD → <!DOCTYPE name-of-DTD [ List-of-Element-Definitions ]>
List-of-Element-Definitions → Element-Definition List-of-Element-Definitions
| Element-Definition
Element-Definition → <!ELEMENT element-name (Element-Description)>
<!DOCTYPE PcSpecs [
<!ELEMENT PCS (PC*)>
<!ELEMENT PC (MODEL, PRICE, PROCESSOR, RAM, DISK+)>
<!ELEMENT MODEL (#PCDATA)>
<!ELEMENT PRICE (#PCDATA)>
<!ELEMENT PROCESSOR (MANF, MODEL, SPEED)>
<!ELEMENT MANF (#PCDATA)>
<!ELEMENT MODEL (#PCDATA)>
<!ELEMENT SPEED (#PCDATA)>
<!ELEMENT RAM (#PCDATA)>
<!ELEMENT DISK (HARDDISK | CD | DVD) >
<!ELEMENT HARDDISK (MANF, MODEL, SIZE)>
<!ELEMENT SIZE (#PCDATA)>
<!ELEMENT CD (SPEED)>
<!ELEMENT DVD (SPEED)>
]>
Here, we show how to transform each element definition of PcSpecs into CFG
rules (shown in the box).
.....
DISK → <DISK>HARDDISK</DISK> |
<DISK>CD</DISK> | <DISK>DVD</DISK>
.....
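The transformation can be sketched in Python as follows. This is only an illustration under simplifying assumptions: element descriptions are flat (no nested parentheses), the nonterminal Text stands for #PCDATA, and the helper nonterminals ending in _star and _plus are names made up here to express the * and + operators.

```python
import re

# Matches one definition of the form <!ELEMENT element-name (Element-Description)>.
ELEMENT_RE = re.compile(r"<!ELEMENT\s+(\w+)\s*\((.*?)\)\s*>")

SAMPLE_DTD = """
<!ELEMENT PCS (PC*)>
<!ELEMENT PC (MODEL, PRICE, PROCESSOR, RAM, DISK+)>
<!ELEMENT DISK (HARDDISK | CD | DVD) >
<!ELEMENT SIZE (#PCDATA)>
"""


def item_to_symbol(item, rules):
    """Map one item of an element description to a grammar symbol.

    Items ending in * or + get a helper nonterminal (hypothetical names)
    whose rules are added to `rules`.
    """
    item = item.strip()
    if item == "#PCDATA":
        return "Text"  # assumed nonterminal for character data
    if item.endswith("*"):
        base, helper = item[:-1], item[:-1] + "_star"
        rules[helper] = [f"{base} {helper}", "ε"]  # zero or more
        return helper
    if item.endswith("+"):
        base, helper = item[:-1], item[:-1] + "_plus"
        rules[helper] = [f"{base} {helper}", base]  # one or more
        return helper
    return item


def dtd_to_cfg(dtd_text):
    """Return a dict mapping each nonterminal to its list of rule bodies."""
    rules = {}
    for name, description in ELEMENT_RE.findall(dtd_text):
        bodies = []
        for alternative in description.split("|"):
            symbols = [item_to_symbol(i, rules) for i in alternative.split(",")]
            bodies.append(f"<{name}>" + " ".join(symbols) + f"</{name}>")
        rules[name] = bodies
    return rules


if __name__ == "__main__":
    for head, bodies in dtd_to_cfg(SAMPLE_DTD).items():
        print(head, "→", " | ".join(bodies))
```

For the DISK element this yields the same three alternatives as the rule shown above.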
<PCS>
<PC>
<MODEL>1234</MODEL>
<PRICE>$3000</PRICE>
<PROCESSOR> . . . . </PROCESSOR>
<RAM>512</RAM>
<DISK><HARDDISK>
<MANF>Superdc</MANF>
<MODEL>xx1000</MODEL>
<SIZE>62Gb</SIZE>
</HARDDISK></DISK>
<DISK><CD>
<SPEED>32x</SPEED>
</CD></DISK>
</PC>
<PC> . . . . </PC>
</PCS>
In XML we again see the context dependency caused by matching tags. As in the
previous example, if we restrict the tags to a finite set of reserved words,
such as MODEL, PRICE, DISK, etc., XML turns out to be a CFL.
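With a finite tag set, checking that tags match is the kind of work a pushdown automaton does with one stack symbol per tag. The sketch below illustrates the idea in Python; the function name tags_match and the regular expression are choices made here, and the check covers only the nesting of tags, not the element descriptions of the DTD.

```python
import re

# Finite set of reserved tag names, taken from the PcSpecs DTD.
TAGS = {"PCS", "PC", "MODEL", "PRICE", "PROCESSOR", "MANF", "SPEED",
        "RAM", "DISK", "HARDDISK", "SIZE", "CD", "DVD"}

# Captures an optional "/" (closing tag) and the tag name.
TAG_RE = re.compile(r"<(/?)(\w+)>")


def tags_match(document, tags=TAGS):
    """Return True if every opening tag is closed in properly nested order."""
    stack = []
    for closing, name in TAG_RE.findall(document):
        if name not in tags:
            return False          # tag outside the reserved set
        if not closing:
            stack.append(name)    # push on an opening tag
        elif not stack or stack.pop() != name:
            return False          # closing tag does not match the top of the stack
    return not stack              # accept only if nothing is left open


if __name__ == "__main__":
    print(tags_match("<PC><MODEL>1234</MODEL></PC>"))
    print(tags_match("<PC><MODEL>1234</PC></MODEL>"))
```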
Warning Signs
Break Time
- On children's alphabet blocks: Letters may be used to construct words, phrases and sentences that may be
deemed offensive.
- On a microscope: Objects are smaller and less alarming than they appear.
- On a revolving door: Passenger compartments for individual use only.
- On work gloves: For best results, do not leave at crime scene.
- On a palm sander: Not to be used to sand palms.
- On a calendar: Use of term "Sunday" for reference only. No meteorological warranties express or implied.
- On a blender: Not for use as an aquarium.
- Dennis -
14.3 Gene code and grammar
Quite a few linguistic terms, such as code, express, translate, and edit, have
been adopted by biologists, especially in the field of molecular biology, ever since
James Watson and Francis Crick proposed the structure of DNA (deoxyribonucleic
acid) in 1953. This suggests that biology has problems that we can approach with
the knowledge of automata and formal languages. As an example, this section shows
how genes can be expressed in terms of a grammar.
Before the example, we need a very brief introduction to the process of protein
synthesis. A genome is a long double-helix DNA strand built from four chemical
bases: adenine, thymine, cytosine, and guanine, denoted by A, T, C, and G, respectively.
Each base in a double helix forms one of the four complementary pairs, A-T, T-A,
C-G, and G-C as the following example shows. (Thus, given a DNA single strand,
we can figure out its complementary strand.)
ACTTAACGGCGAT
TGAATTGCCGCTA
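As a small illustration, the complementary strand can be computed base by base. The Python sketch below pairs the bases position by position, as the two aligned strands above are written; the function name is chosen here for the example.

```python
# Translation table for the four complementary pairs A-T, T-A, C-G, G-C.
COMPLEMENT = str.maketrans("ATCG", "TAGC")


def complementary_strand(strand):
    """Return the strand that pairs, position by position, with `strand`."""
    return strand.translate(COMPLEMENT)


if __name__ == "__main__":
    print(complementary_strand("ACTTAACGGCGAT"))  # TGAATTGCCGCTA
```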
Proteins are synthesized in the following three steps. (1) A single-strand segment
(i.e., a gene) of the genome is transcribed into an RNA. The RNA is the complementary
strand of the source strand, with every base T replaced by U (uracil). Hence, RNAs
consist of the four bases A, C, G, and U.
(1) RNA transcription
DNA strand          5’ end  TCAAGCT
DNA source strand   3’ end  AGTTCGA
RNA                 5’ end  UCAAGCU  3’ end
(2) In eukaryotes, only the substrings useful for protein synthesis, called exons, are
spliced (i.e., cut out and concatenated) to form an mRNA (messenger RNA). (3)
Finally, ribosomes, large molecular assemblies traveling along the mRNA, translate
the mRNA into a protein, which is an amino-acid sequence.
[Figure: transcribed RNA (exons and introns) → (2) splicing → messenger RNA → (3) translation → protein]
There are 20 different amino acids. Hence, a protein can be represented as a string
over an alphabet of 20 characters. The translation is done by triplets (i.e., groups of
three adjacent bases), also called codons, according to the table shown below. The
translation begins with the start codon ATG and continues until it meets one of the
stop codons TAA, TAG, and TGA. For every codon scanned, a specific amino acid is
attached to the next position of the protein being synthesized. Notice that because
there are 4 × 4 × 4 = 64 different codons and only 20 amino acids, several codons are
translated into the same amino acid. (Following convention, we shall represent codons
using the character T rather than U.)
Amino acid   Codons
Ala          GCA GCC GCG GCT
Arg          AGA AGG CGA CGC CGG CGT
Asp          GAC GAT
Asn          AAC AAT
Cys          TGC TGT
Glu          GAA GAG
Gln          CAA CAG
Gly          GGA GGC GGG GGT
His          CAC CAT
Ile          ATA ATC ATT
Leu          CTA CTC CTG CTT TTA TTG
Lys          AAA AAG
Met          ATG (start)
Phe          TTC TTT
Pro          CCA CCC CCG CCT
Ser          AGC AGT TCA TCC TCG TCT
Thr          ACA ACC ACG ACT
Trp          TGG
Tyr          TAC TAT
Val          GTA GTC GTG GTT
Stop         TAA TAG TGA
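Transcription and translation as described above can be sketched in Python. This is a simplified illustration, not a biological model: it ignores splicing, pairs bases position by position, and starts translating at the first ATG it finds. The names CODON_TABLE, transcribe, and translate are chosen here for the example.

```python
# Codon table (standard genetic code), written with T rather than U.
CODON_TABLE = {
    "Ala": "GCA GCC GCG GCT", "Arg": "AGA AGG CGA CGC CGG CGT",
    "Asp": "GAC GAT", "Asn": "AAC AAT", "Cys": "TGC TGT",
    "Glu": "GAA GAG", "Gln": "CAA CAG", "Gly": "GGA GGC GGG GGT",
    "His": "CAC CAT", "Ile": "ATA ATC ATT",
    "Leu": "CTA CTC CTG CTT TTA TTG", "Lys": "AAA AAG", "Met": "ATG",
    "Phe": "TTC TTT", "Pro": "CCA CCC CCG CCT",
    "Ser": "AGC AGT TCA TCC TCG TCT", "Thr": "ACA ACC ACG ACT",
    "Trp": "TGG", "Tyr": "TAC TAT", "Val": "GTA GTC GTG GTT",
    "Stop": "TAA TAG TGA",
}

# Inverted table: codon -> amino acid name (or "Stop").
AMINO_ACID = {codon: name
              for name, codons in CODON_TABLE.items()
              for codon in codons.split()}


def transcribe(source_strand):
    """Step (1): complement of the source strand, with T replaced by U."""
    return source_strand.translate(str.maketrans("ATCG", "UAGC"))


def translate(sequence):
    """Step (3): read codons from the first ATG up to the first stop codon."""
    sequence = sequence.replace("U", "T")   # use the T convention of the table
    start = sequence.find("ATG")
    if start < 0:
        return []                           # no start codon found
    protein = []
    for i in range(start, len(sequence) - 2, 3):
        amino = AMINO_ACID[sequence[i:i + 3]]
        if amino == "Stop":
            break
        protein.append(amino)
    return protein


if __name__ == "__main__":
    print(transcribe("AGTTCGA"))            # UCAAGCU
    print(translate("CCATGGCTTGTTAAGG"))    # ['Met', 'Ala', 'Cys']
```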
This three-step process for protein synthesis is called the central dogma of
molecular biology. Because of its many exceptional cases, it is no longer accepted as
a generally valid theory. However, to show a potential application of formal languages
and automata theory to molecular biology, we shall present an example based on this
dogma. As we did in Section 14.1 for HTML, we will transform the informal definition
of genes into a formal definition in terms of a grammar. (This example is an excerpt,
lightly edited, from David Searls’ paper “The Linguistics of DNA,” which appeared in
the journal American Scientist, Vol. 80, 1992.)
• The transcript, which is the part of the gene that is copied into a messenger RNA,
has flanking 5’- and 3’-untranslated-regions around a coding-region initiated by a
start-codon.
<gene> → <transcript>
<transcript> → <5’-untranslated-region><start-codon><coding-region>
<3’-untranslated-region>
• A stop-codon is any of the three triplets TAA, TAG and TGA, whereas a codon is
any of the remaining 61 possible triplets that are translated into an amino acid
according to the rule given in the codon table.
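The count of 61 can be checked by enumerating all triplets over the four bases. The short Python sketch below is only a sanity check of the arithmetic; the variable names are chosen here.

```python
from itertools import product

STOP_CODONS = {"TAA", "TAG", "TGA"}

# All 4 x 4 x 4 triplets over the bases A, T, C, G.
triplets = {"".join(t) for t in product("ATCG", repeat=3)}

# Every triplet that is not a stop codon is a codon in the sense of the rule above.
codons = triplets - STOP_CODONS

if __name__ == "__main__":
    print(len(triplets), len(codons))  # 64 61
```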
• The first two bases of a codon partially specify the amino acid into which it is
translated. Every codon starting with GC is translated into alanine, and every codon
starting with CG into arginine, independent of the third base. Every codon starting
with GA is translated into aspartic acid if the third base is a pyrimidine (i.e., C or T),
and into glutamic acid if the third base is a purine (i.e., A or G). Here are some formal
representations of such translation rules specified in the codon table.
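The three cases just described can be written as a small decision function. The Python sketch below covers only the prefixes GC, CG, and GA named in the text and returns None for every other codon; the function name is chosen here.

```python
PYRIMIDINES = {"C", "T"}
PURINES = {"A", "G"}


def amino_acid_for(codon):
    """Apply the GC, CG, and GA rules; return None for codons they do not cover."""
    prefix, third = codon[:2], codon[2]
    if prefix == "GC":
        return "Ala"                      # alanine, any third base
    if prefix == "CG":
        return "Arg"                      # arginine, any third base
    if prefix == "GA":
        # aspartic acid for a pyrimidine, glutamic acid for a purine
        return "Asp" if third in PYRIMIDINES else "Glu"
    return None


if __name__ == "__main__":
    for codon in ("GCT", "CGA", "GAC", "GAG"):
        print(codon, amino_acid_for(codon))
```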
All the grammar rules presented up to this point are context-free. But the following
biological fact requires a rule from a higher level of the Chomsky hierarchy.
• Introns can be transposed to the left or right.
The following slide shows a grammar that specifies the set of “genes” that we have
developed in this section.
<gene> → <transcript>
<transcript> → <5’-untranslated-region><start-codon><coding-region><3’-untranslated-region>
<coding-region> → <codon><coding-region> | <splice><coding-region> | <stop-codon>
---------------------------------------------------------------------------
<codon> → <Ala> | <Arg> | <Asp> | <Asn> | <Cys> | <Glu> | <Gln> | <Gly> | <His> | <Ile> |
<Leu> | <Lys> | <Met> | <Phe> | <Pro> | <Ser> | <Thr> | <Trp> | <Tyr> | <Val>
<start-codon> → <Met>
<stop-codon> → TAA | TAG | TGA
-----------------------------------------------------------------------------
<Ala> → GC<base>
<Arg> → CG<base> | AG<purine>
<Asp> → GA<pyrimidine>
<Asn> → AA<pyrimidine>
<Cys> → TG<pyrimidine>
<Glu> → GA<purine>
<Gln> → CA<purine>
<Gly> → GG<base>
<His> → CA<pyrimidine>
<Ile> → AT<pyrimidine> | ATA
<Leu> → TT<purine> | CT<base>
<Lys> → AA<purine>
<Met> → ATG
<Phe> → TT<pyrimidine>
<Pro> → CC<base>
<Ser> → AG<pyrimidine> | TC<base>
<Thr> → AC<base>
<Trp> → TGG
<Tyr> → TA<pyrimidine>
<Val> → GT<base>
-------------------------------------------------------------------------------
<base> → A | T | G | C
<pyrimidine> → C | T
<purine> → A | G
-------------------------------------------------------------------------------
<splice> → <intron>
<intron> → GT<intron-body>AG
-------------------------------------------------------------------------------
<intron><base> → <base><intron> <base><intron> → <intron><base>
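The context-free part of this grammar can be exercised with a short Python sketch. It expands the amino-acid rules into concrete codons, with the rules written to agree with the codon table (so arginine also includes AG<purine> and asparagine is AA<pyrimidine>), and then recognizes a string of the form <start-codon><coding-region> that contains no <splice>. The untranslated regions, introns, and the context-sensitive transposition rules are left out, and all names in the code are chosen here for illustration.

```python
# Base classes and amino-acid rules, written to agree with the codon table.
BASE_CLASSES = {"<base>": "ATGC", "<pyrimidine>": "CT", "<purine>": "AG"}
AMINO_RULES = {
    "Ala": ["GC<base>"], "Arg": ["CG<base>", "AG<purine>"],
    "Asn": ["AA<pyrimidine>"], "Asp": ["GA<pyrimidine>"],
    "Cys": ["TG<pyrimidine>"], "Glu": ["GA<purine>"],
    "Gln": ["CA<purine>"], "Gly": ["GG<base>"],
    "His": ["CA<pyrimidine>"], "Ile": ["AT<pyrimidine>", "ATA"],
    "Leu": ["TT<purine>", "CT<base>"], "Lys": ["AA<purine>"],
    "Met": ["ATG"], "Phe": ["TT<pyrimidine>"], "Pro": ["CC<base>"],
    "Ser": ["AG<pyrimidine>", "TC<base>"], "Thr": ["AC<base>"],
    "Trp": ["TGG"], "Tyr": ["TA<pyrimidine>"], "Val": ["GT<base>"],
}
STOP_CODONS = {"TAA", "TAG", "TGA"}


def expand(body):
    """Expand one rule body such as 'GC<base>' into its terminal codons."""
    for symbol, bases in BASE_CLASSES.items():
        if body.endswith(symbol):
            prefix = body[: -len(symbol)]
            return [prefix + base for base in bases]
    return [body]  # already a terminal triplet, e.g. 'ATG'


# codon -> amino acid, derived from the grammar rules.
CODON_TO_AMINO = {codon: name
                  for name, bodies in AMINO_RULES.items()
                  for body in bodies
                  for codon in expand(body)}


def derive(sequence):
    """Return the amino acids of <start-codon><coding-region>, or None.

    The sequence must start with ATG, consist of whole codons, and end
    with its first and only stop codon.
    """
    if len(sequence) % 3 or not sequence.startswith("ATG"):
        return None
    triplets = [sequence[i:i + 3] for i in range(0, len(sequence), 3)]
    *codons, last = triplets
    if last not in STOP_CODONS:
        return None
    if any(codon not in CODON_TO_AMINO for codon in codons):
        return None
    return [CODON_TO_AMINO[codon] for codon in codons]


if __name__ == "__main__":
    print(len(CODON_TO_AMINO))        # 61
    print(derive("ATGGCTTGTTAA"))     # ['Met', 'Ala', 'Cys']
```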
The grammar that we have just developed is incomplete because no rules are defined
for the nonterminal symbols <5’-untranslated-region>, <3’-untranslated-region> and
<intron-body>. The two untranslated regions flanking the coding-region usually
contain segments regulating the gene expression. The contents and the function of
these parts are not yet completely understood. The intron-body parts generally consist
of a long repeating string, whose exact role is also under investigation. We need more
research before we can present a formal model for these regions.
What else is love but understanding and rejoicing in the fact that another person lives, acts, and experiences
otherwise than we do...? - Friedrich Nietzsche -
Just because you love someone doesn’t mean you have to be involved with them. Love is not a bandage to cover
wounds. - Hugh Elliot -