Genome Organisation

PHAR2811 Dales lecture 4
Synopsis: C-value paradox, different classes of DNA, repetitive DNA and disease. If
protein-coding portions of the human genome make up only 1.5%, what is the rest doing?
Transcription of rRNA and tRNA by RNA pol I and III. Other non-coding RNAs, snoRNAs,
miRNAs, natural antisense RNAs and their processing and function.
Definitions:
Genome: the total amount of genetic material, stored as DNA. The nuclear genome refers to
the DNA in the chromosomes contained in the nucleus; in the case of humans the DNA in
the 46 chromosomes. It is the nuclear genome that defines a multicellular organism; it will be
the same for almost all cells of the organism. You can have organelle genomes as well,
such as the mitochondrial genome. When you want to identify or distinguish one organism
from another, such as in forensic testing, you investigate the genome.
Transcriptome: The total amount of genetic information which has been transcribed by the
cell. This information will be stored as RNA. The transcriptome is unique to a cell type and is
a measure of the gene expression. Different cells within an organism will have different
transcriptomes. Cell types can be identified by their transcriptome.
Proteome: The cell's complete protein output. This reflects all the mRNA sequences
translated by the cell. Cell types have different proteomes and these can be used to identify
a particular cell.
Genome Investigations: what can we learn?

There are a number of investigations that can be carried out on the genomes of various organisms; this
is known as comparative genomics and is an emerging field of study in its own right. Having access to
the sequence of a number of genomes has opened this area up enormously. Even before complete
sequences were available, a number of parameters could be measured in various genomes and some
information about genome organisation in multicellular organisms could be gleaned. Some of the
measurements which have been made are:
The base composition of a vast array of organisms; both prokaryotic and eukaryotic
The repetitive and unique sequence content
The number of genes and their distribution throughout the genome
The genome organisation: base composition

The base composition of DNA in an organism is a fixed value and it is expressed as the % of (G + C)
of the total genome. The variation of this value between different prokaryotes is large; this is
surprising given that many of the individual proteins produced by the species have similar amino acid
sequences. Prokaryotic genomic DNA can have as little as 25 % (G + C) in Mycoplasma genitalium to
as high as ~72 % in Micrococcus lysodeikticus. Eukaryotic genomic DNA does not display the same
variation between species. The % (G + C) composition of most plant and animal species falls within a
narrow range, averaging at 39% with a variation of only 6%.
Base composition within the genome.

In prokaryotes the bases are distributed evenly throughout the genome with a slightly lower (G + C)
content in promoter and intergenic regions; these often have A + T rich segments which melt more
readily than G+C rich regions. The relatively constant base distribution within a given bacterial
genome suggests that, although the nucleotide pool sizes inside the cell may be unequal, the system
has evolved over many generations to be this way, and the rate of DNA replication is constant.
The distribution of the (G + C) content throughout each eukaryotic genome, however, varies
significantly. Whereas the mean variation in % (G + C) content throughout the E. coli genome is
only 8.6 %, in eukaryotes this variation is over 30 %. Certain regions of eukaryotic
genomic DNA are (A + T) rich, with a % (G + C) content as low as 18 %, while other
regions have a (G + C) content as high as 70 %.
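The variation in base composition along a genome can be measured by computing % (G + C) in successive windows. The following is a minimal sketch in Python; the function names, the window size and the toy sequence are illustrative assumptions and are not taken from any published analysis.

```python
def gc_percent(seq: str) -> float:
    """Return % (G + C) of a DNA sequence, ignoring symbols other than A, C, G, T."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    total = gc + at
    return 100.0 * gc / total if total else 0.0


def windowed_gc(seq: str, window: int, step: int) -> list[float]:
    """% (G + C) for successive windows of length `window`, moved `step` bases at a time."""
    return [gc_percent(seq[i:i + window])
            for i in range(0, len(seq) - window + 1, step)]


# Toy sequence (made up): an AT-rich stretch followed by a GC-rich stretch.
toy = "ATATTAATAT" + "GCGGCCGCGC"
values = windowed_gc(toy, window=10, step=10)
spread = max(values) - min(values)  # difference between richest and poorest window
print(values, spread)
```

On a real genome the window would be thousands of bases long; the spread between the richest and poorest windows is one simple way to express the variation discussed above.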
The genome organisation: repetitive and unique sequences

Recapping: the C-value paradox
The C-value is the total number of DNA bases in the genome (per haploid set of chromosomes).
When you compare this to the complexity of the organism you find a massive disparity. Some
organisms seem to have far too much DNA for their complexity, e.g. the carp has 52 chromosomes
while the alligator that eats it has 16! Some flowers have far more genetic material than humans!
Clearly the amount of DNA is not proportional to the amount required to produce all the proteins made by the
organism, or to the organism's position on the food chain.
Table 1 Examples of organisms, their genome size, number of genes, and % of genome single copy.
(- = not given)

Organism                      Size of genome (kbp)   Estimated # genes   % single copy
Viruses
  Simian Virus 40 (SV40)                       5.1                   -             100
  Bacteriophage φX174                          5.4                   -             100
  Bacteriophage λ                             48.5                   -             100
Bacteria
  M. genitalium                                580                 470             >90
  E. coli                                    4,639               4,405              92
Yeast
  S. cerevisiae                             12,100               6,200              90
Round Worm
  C. elegans                                97,000              19,000               -
Fruit Fly
  D. melanogaster                          180,000              13,600              60
Mammals
  M. musculus                            2,500,000              30,000              70
  H. sapiens                             3,240,000              30,000              64
Mustard Plant
  A. thaliana                              125,000              25,500              80
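One way to see the disparity in Table 1 is to divide genome size by gene number, giving the average amount of DNA per gene. The short Python sketch below does this for four of the organisms; the dictionary and function names are illustrative, and the pairing of sizes with gene counts assumes the rows of Table 1 are read as shown in the comments.

```python
# (genome size in kbp, estimated number of genes), read from Table 1
table1 = {
    "M. genitalium": (580, 470),
    "E. coli": (4_639, 4_405),
    "S. cerevisiae": (12_100, 6_200),
    "H. sapiens": (3_240_000, 30_000),
}


def kbp_per_gene(size_kbp: float, n_genes: int) -> float:
    """Average stretch of genome (kbp) available per gene."""
    return size_kbp / n_genes


spacing = {name: kbp_per_gene(*row) for name, row in table1.items()}
for name, value in spacing.items():
    print(f"{name:15s} {value:8.1f} kbp per gene")
```

On these figures the bacteria and yeast come out at roughly 1 to 2 kbp per gene, while the human genome has about 108 kbp per gene.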
Only about 1.5% of the DNA in the genome actually codes for proteins. What is
the rest of it doing?
A large amount of non-coding DNA is a feature of the genomes of complex eukaryotes. As I
mentioned in the last lecture, the selection pressure on prokaryotes is to be able to proliferate rapidly
when conditions are suitable, i.e. when there is a good supply of nutrients. They need to be able to swing into
action quickly under these circumstances and make hay while the sun shines, so to speak. Complex
multicellular organisms do not have the same selection pressure. In fact, if a particular cell in a
multicellular organism takes it upon itself to divide rapidly and uncontrollably, we call it cancer.
Because the drive to genome efficiency is not so prevalent in eukaryotes (a bacterium that wants
to divide rapidly has to copy its genome rapidly, so it doesn't want any redundant sequences in the
DNA), they can afford to have larger amounts of non-coding DNA. While this explains why the extra
DNA can persist without causing too much trouble to the organism, it doesn't account for it.
Let's take a historical perspective for a moment and see when the extra DNA was first discovered.
Let's go back to the melting and re-annealing of DNA covered last year. An interesting technique was
pioneered 20-odd years ago which gained the name of Cot plots; it gave the first hint as to the
abundance and diversity of sequences in the genome.
You isolate the DNA from a species, cut it up into manageable lengths and then melt it. Once it
is fully single stranded you monitor, by absorbance at 260 nm, the re-annealing kinetics of the sample.
If all the sequences were represented once in the genome then you would have the standard
reassociation curve, as all the fragments would re-anneal at the same rate. This is what happens with
E. coli DNA, giving a nice reverse melting curve.
If, however, your DNA sample contains some sequences that are represented more than once in the
genome then you get a much more interesting (and more difficult to analyse) plot. Those sequences
that have the most repeats will hybridise quickest. Those with multiple copies will also hybridise
quickly, while the more complex single copy sequences take the longest to re-anneal. You plot the
curve with A260 on the y axis and basically time (actually the initial concentration × time, or Cot) on a
log scale on the x axis. The plot is then analysed by computer, assuming second order kinetics for
rehybridisation. The kinetics are second order because re-annealing is a bimolecular reaction and the rate of
association is proportional to the concentration of both strands.
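For ideal second order kinetics the fraction of DNA still single stranded is C/C0 = 1 / (1 + k × Cot), so half the DNA has re-annealed when Cot = 1/k. The Python sketch below applies this to a mixture of sequence classes. The function names are mine, and the two rate constants are made-up values chosen only to show the shape of the curve; they are not measurements.

```python
def fraction_single_stranded(cot: float, k: float) -> float:
    """Ideal second order reassociation: C/C0 = 1 / (1 + k * Cot)."""
    return 1.0 / (1.0 + k * cot)


def mixed_curve(cot: float, components: list[tuple[float, float]]) -> float:
    """Fraction still single stranded for a mixture of sequence classes.

    components: (fraction of genome, rate constant k) pairs; fractions sum to 1.
    """
    return sum(frac * fraction_single_stranded(cot, k) for frac, k in components)


# Hypothetical genome: 40 % fast-annealing (repetitive), 60 % slow (single copy).
# The rate constants are illustrative assumptions only.
genome = [(0.4, 100.0), (0.6, 0.01)]

for exponent in range(-4, 5):
    cot = 10.0 ** exponent
    print(f"Cot = 1e{exponent:+d}   single stranded = {mixed_curve(cot, genome):.3f}")
```

Printed against the log of Cot, the values fall in two steps, one for each class, which is the biphasic look of a genome containing both repeated and unique sequences.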
The rate of rehybridisation is dependent on the complexity of the DNA. Complexity is defined as the
number of bases in a unique (non-repeating) sequence. For example, poly U has a complexity of 1; the sequence
AGTTCAGTTCAGTTCAGTTC has a complexity of 5. The Cot for a given sequence of DNA is
dependent on its complexity.
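The two examples above can be checked with a few lines of Python. This is only a sketch: the function name is mine, and it assumes the sequence consists of an exact whole number of copies of its repeat unit.

```python
def complexity(seq: str) -> int:
    """Length of the shortest unit that, repeated in tandem, reproduces `seq`."""
    n = len(seq)
    for unit_len in range(1, n + 1):
        # A candidate unit must divide the length and rebuild the sequence exactly.
        if n % unit_len == 0 and seq[:unit_len] * (n // unit_len) == seq:
            return unit_len
    return n  # only reached for an empty sequence


print(complexity("UUUUUUUUUU"))            # poly U
print(complexity("AGTTCAGTTCAGTTCAGTTC"))  # four copies of AGTTC
```

A sequence with no internal repeat has a complexity equal to its own length, which is why single copy DNA is the most complex class.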
The plot of a complex mixture of DNA sequences has a bi- or triphasic look about it.
Essentially the human genomic DNA Cot curve has 40% fast annealing, low Cot sequence elements
and 60% slow renaturing, high Cot unique sequences, i.e. one copy per genome. The human genome
can be divided up into 4 classes: highly repetitive (hundreds to millions of copies), moderately
repetitive (10s to hundreds of copies), slightly repetitive (1-10 copies) and single copy sequences.
The repetitive DNA, by sequence analysis, was found to contain the short repeat sections found in
satellite DNA (so named from its behaviour on CsCl density ultracentrifugation),
and sequences of normal length for genes that have large numbers of copies.
Highly repetitive sequences: One group of highly repetitive DNA is the simple sequence DNA,
which contains thousands of copies of a simple sequence repeated in tandem. The repeat sequence can
be as short as 5 bases. The repeat sequences are known as:
Short tandem repeats (STRs)
Microsatellites (1-13 bases)
Minisatellites (14-500 bases)
These sequences are often found clustered around the centromere or telomeres. The term satellite came
about from the DNA's behaviour in a CsCl gradient. Because much of the simple sequence repeat DNA is AT
rich, it has a lower density than the more GC rich genomic DNA and therefore bands differently on
the gradient, as a satellite. These repeats account for 10-20 % of higher eukaryotic DNA.
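To make the idea of a tandem repeat concrete, the Python sketch below names a repeat by the length of its unit, using the size ranges quoted above, and counts the longest run of back-to-back copies of a chosen unit. The function names and the toy sequence are illustrative assumptions.

```python
def classify_repeat_unit(unit_len: int) -> str:
    """Name a tandem repeat by the length of its repeat unit (ranges from these notes)."""
    if 1 <= unit_len <= 13:
        return "microsatellite"
    if 14 <= unit_len <= 500:
        return "minisatellite"
    return "outside the ranges given"


def longest_tandem_run(seq: str, unit: str) -> int:
    """Largest number of back-to-back copies of `unit` found anywhere in `seq`."""
    if not unit:
        raise ValueError("repeat unit must not be empty")
    best = 0
    for start in range(len(seq) - len(unit) + 1):
        copies = 0
        # Count consecutive copies of the unit beginning at this position.
        while seq.startswith(unit, start + copies * len(unit)):
            copies += 1
        best = max(best, copies)
    return best


toy = "TTCAGCAGCAGCAGGA"  # made-up sequence containing CAG repeated 4 times
print(classify_repeat_unit(len("CAG")), longest_tandem_run(toy, "CAG"))
```

Every start position is tried in turn, which is slow for long sequences but avoids missing a run that begins part-way through an earlier, shorter one.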
The moderately repetitive DNA includes those sequences which have multiple copies in the
genome, designed to increase the rate or amount of gene product, and some regulatory sequences
found scattered throughout the genome. Sequences such as the ribosomal RNA and transfer RNA,
which are required in large amounts for protein synthesis have many copies on the genome. The
histone sequences also have large copy numbers in the genome.
Another group of moderately repetitive DNA sequences are those scattered throughout the genome,
known as SINEs (short interspersed elements) and LINEs (long interspersed elements). Some famous
SINEs and LINEs: Alu repeats are the major SINE in primate genomes, including our own. They are ~300 bp long
and about a million such sequences exist scattered throughout the genome. They account for ~10%
of the genome. They are transcribed into RNA but have no known function. They are known as Alu
repeats because they contain the recognition sequence for the restriction enzyme AluI. The most
common LINE in the human genome is L1, a 6,000 bp sequence which is repeated some 50,000 times
in the human genome. L1 sequences are also transcribed and some even encode proteins! Their
function in the cell is unknown. Both the Alu and L1 sequences are transposable elements, capable of
moving to different sites in the genome.
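The ~10% figure for Alu repeats can be checked with simple arithmetic: unit length × copy number ÷ genome size. The Python sketch below does this for the Alu and L1 figures quoted above, using a genome size of 2.9 Gbp (the value given in Table 2) and assuming every copy is full length, which is a simplification.

```python
GENOME_BP = 2.9e9  # approximate size of the human genome (Table 2)


def genome_share(unit_bp: float, copies: float, genome_bp: float = GENOME_BP) -> float:
    """Percentage of the genome occupied if every copy were full length."""
    return 100.0 * unit_bp * copies / genome_bp


alu = genome_share(300, 1_000_000)  # ~300 bp, about a million copies
l1 = genome_share(6_000, 50_000)    # 6,000 bp, some 50,000 copies
print(f"Alu: {alu:.1f} %   L1: {l1:.1f} %")
```

Both come out at about 10% of the genome on these numbers, so the two families together would account for roughly a fifth of our DNA.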
The group of sequences with a small number of copies in the genome includes such sequences as
the globins. This family of genes contains a number of closely related sequences which, varying by only a
few bases in the code, will cross-hybridise. These are also known as gene clusters.
The final group, the single copy sequences, makes up the vast majority of genes in the genome (a gene
being a functional unit which codes for a single polypeptide chain). This group is the most complex
and takes the longest to re-anneal, hence the log scale on the time axis.
The highly repetitive DNA re-anneals in seconds while the most complex single copy group takes
hours or days to re-anneal.
So our genome contains highly repetitive DNA which doesn't code for proteins. It also contains some
multi-gene families and multiple copies of some genes. Together these make up 40% of the genome by Cot plot
analysis. What of the other 60%, the unique sequences? Remember only 1-2% of the genome is coding
sequence. How do we account for this discrepancy?
Further investigations of eukaryotic genes (the 60%) found they were interrupted by large sections of
non-coding sequence called introns. This does not happen in bacteria. These stretches of DNA, which
can make up over 90% of the gene by number of bases, are cut out after transcription when the mRNA is
processed. The coding sections are called exons.
Pseudogenes
The eukaryotic genome also contains pseudogenes that occupy a significant proportion of the
genome. There are actually two classes of pseudogenes. Class I pseudogenes have arisen by gene
duplication and have subsequently been inactivated by various mutations (insertions,
substitutions or deletions). These pseudogenes are often found near their functional gene counterpart.
The second class, class II pseudogenes, are processed sequences (lacking introns and often containing
a vestigial poly(A) tail) and have originated during evolution from mRNA that was copied by reverse
transcriptase back into DNA. The sequence was then inserted into the genome by a retrotransposition
event. The footprints of this event are evident in the direct repeats that flank the pseudogene; the
repeats have facilitated its insertion back into the genome. These pseudogenes are usually found a long
way from the functional parent gene. Pseudogenes do not code for functional proteins and they are not
translated. The exact number of pseudogenes in the human genome is unknown although estimates,
using various search criteria, have identified some 2,900 regions which probably represent processed
pseudogenes. The pattern that has emerged from analysis of the human genome is that those
sequences which tend to give rise to more pseudogenes have shorter than average transcripts and are
sequences that are involved in nuclear regulation and translation (ribosomal proteins account for 67%,
lamin receptors 10%, translation elongation factors 5%). The common theme amongst these
sequences may actually be the increased level of transcription of these sequences.
Regulatory regions
Other forms of non-coding DNA also play an important role in gene transcription and contribute to
the increased non-coding DNA found in eukaryotes. The promoter regions of eukaryotic genomes
are substantially larger than their prokaryotic counterparts. Transcriptional regulation in eukaryotes is
a very complex process, often involving enhancers and upstream binding sites for regulatory elements.
These regions can cover thousands of bp. Intergenic DNA is estimated to occupy between 63 and 75%
of the total base pairs in the human genome. The longest stretch of non-coding DNA, termed a gene
desert, is on chromosome 13. It is 3,038,416 bp long.
What have we learnt from sequencing the human genome?
This was the big event of the last decade of last century. Whole genome analysis has confirmed the
earlier observations concerning gene density in eukaryotes. Below is the table with some general
statistics gleaned from the human genome project.
Table 2 General characteristics of the Human Genome.
Human Genome: General Statistics

Approximate size of the genome              2.9 Gbp
% (A + T)                                   54
% (G + C)                                   38
% undetermined bases in genome              (not given)
Most GC-rich region (50 kb)                 Chr 2 (66%)
Most AT-rich region (50 kb)                 Chr X (25%)
Number of genes                             26,383 - 39,114
Most gene-rich chromosome                   Chr. 19 (23 genes/Mb)
Least gene-rich chromosomes                 Chr. 13 (5 genes/Mb) and Chr. Y (5 genes/Mb)
Average gene length                         27 kbp
Gene with the most exons                    Titin (234 exons)
% of genome containing repeat sequences     35
% exon base pairs                           1.1 - 1.4
% intron base pairs                         24 - 36
% intergenic DNA (bp)                       64 - 75
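As a rough cross-check, the figures in Table 2 can be tested against one another: the number of genes multiplied by the average gene length should cover about the same share of the genome as the exon and intron percentages added together. The Python sketch below does the arithmetic; the variable names are mine and the calculation assumes genes do not overlap.

```python
GENOME_BP = 2.9e9               # approximate size of the genome
AVG_GENE_BP = 27_000            # average gene length
GENE_COUNT = (26_383, 39_114)   # low and high estimates of gene number
EXON_PCT = (1.1, 1.4)           # % exon base pairs, low and high
INTRON_PCT = (24.0, 36.0)       # % intron base pairs, low and high

# Share of the genome covered by genes, estimated two ways.
from_gene_count = tuple(100.0 * n * AVG_GENE_BP / GENOME_BP for n in GENE_COUNT)
from_exon_intron = tuple(e + i for e, i in zip(EXON_PCT, INTRON_PCT))

print("genes x average length: %.1f - %.1f %%" % from_gene_count)
print("exons + introns:        %.1f - %.1f %%" % from_exon_intron)
```

The two estimates agree to within about a percentage point (roughly 25 to 37% of the genome), and the remainder matches the intergenic figure in the last row of the table.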
Repetitive DNA and disease.

Trinucleotide repeats (TNR: multiple tandem copies of a 3 nucleotide sequence) are a
specialised type of repeat sequence found in the genome. They come about from mutations during
replication, recombination or repair in both somatic and germline cells. This process, known as
dynamic mutation, gives rise to unstable repeats. There are an increasing number of genetic disorders
which result from expansion of trinucleotide repeats; many of them are neurological disorders. A
static mutation (and I will discuss this more in a later lecture) is one that occurs in the germline (sperm
or egg cells, which have undergone meiosis), is passed on to the next generation and is stably
retained in the genome of the somatic (mitotic) cells. This mutation is present in the genome of all
somatic cells to the same level. Unlike static mutations, dynamic mutations change; they continue to
mutate between different tissues and across generations. The longer the tract length (i.e. the number of
repeats), the more likely the repeat is to continue to mutate. This leads to increased severity with
successive generations or, in some diseases, to a decreasing age of onset. In other words, with each
generation the disorder becomes worse and/or you start to get the symptoms earlier. This is known as
genetic anticipation. What causes this continuing mutation is unknown, but some unusual single
stranded DNA structures are thought to form during repair, recombination and replication. The longer
the tract, the more likely these aberrant structures are to form, hence perpetuating the mutation process.
The most common are fragile X syndrome (FRAXA), one of the most common inheritable forms
of intellectual disability, and Huntington's disease. The expansion of these repeat sequences is sometimes
found in the coding region of the gene, leading to altered protein function (or gain of function, as it is
described in the literature), or in the non-coding region, where you see loss of function. Very recently,
repeats have been found which act at the RNA level, producing a pathogenic RNA species which results in aberrant
RNA-protein interactions. This also leads to neuronal dysfunction.
I would like to briefly consider 2 diseases. The fragile X syndrome (FRAXA) results from multiple
copies of the sequence CGG (the expansion) in the 5' UTR of the fragile X syndrome gene, FMR1,
which causes transcriptional silencing of the gene and so loss of its protein product, FMRP. The number of
repeats is very important to the final severity of the disease: 5-50 copies have no effect, 50-200
result in an intermediate and distinct syndrome, fragile X tremor/ataxia syndrome (FXTAS), while >200 copies
give rise to the full-blown mutation. Mutations in certain regions of the FMR1 coding region produce
a defective protein, which gives rise to the same phenotype as the gene silencing.
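The repeat ranges just quoted amount to a simple lookup, sketched below in Python. The function name is mine, and because the quoted ranges share their end points I have assumed that exactly 50 copies counts as "no effect" and exactly 200 as intermediate; treat those two boundaries as assumptions.

```python
def fmr1_category(cgg_repeats: int) -> str:
    """Map a CGG repeat count in the FMR1 5' UTR to the ranges quoted in these notes."""
    if cgg_repeats > 200:
        return "full mutation (fragile X syndrome)"
    if cgg_repeats > 50:    # assumption: 50 itself falls in the lower range
        return "intermediate (associated with FXTAS)"
    if cgg_repeats >= 5:
        return "no effect"
    return "below the ranges quoted"


for n in (30, 120, 250):
    print(n, fmr1_category(n))
```

This is an illustration of the ranges only, not a diagnostic rule.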
Obviously this protein is important to the cell, particularly neurons. The protein and its mRNA
localise in dendritic spines. The expression of the protein is up-regulated in response to stimulation
from glutamate receptors and it is involved in translational repression at synapses. The absence of the
protein results in the neuronal dysfunction. Within the cell it is located largely in the cytoplasm but
does move to the nucleus also. The protein, FMRP, has 3 RNA binding domains and associates with
polyribosomes. It seems to be involved in translational repression of a group of mRNA targets, i.e.
this protein binds to other mRNA sequences and regulates their translation, probably by mediating
ribosome association and recruiting interfering RNA processes. The target mRNA species it binds to
are sequences involved in the cytoskeleton, neuronal development and synaptic transmission.
FMRP is also expressed in the liver, lung, kidney, spinal cord and gastrointestinal tract. These are not
areas of significant problems for sufferers of fragile X syndrome. Two other proteins, FXR1 and
FXR2, have similar functional and structural features, and maybe they can compensate for a lack of
FMRP in those tissues. Evidently, FXR1 and FXR2 are not able to compensate for an absence of
FMRP in the brain and testes.
Huntington's disease.
George Huntington (1850-1916) described the condition while working as a newly qualified
doctor in the rural general practice of his father and grandfather on Long Island, New York
State. Together their observations covered 78 years. George Huntington did not continue
working on hereditary chorea but went into general practice in Ohio. He never published
another paper in his life - yet his name is remembered from the single one he did write.
A second example is the expansion of the repeats in the coding region of a gene. This is obviously
going to alter the function of the protein product. To date 9 separate disorders are associated with the
expansion of a CAG repeat in the coding region of various genes. CAG codes for glutamine and the
expansion results in multiple copies of glutamine in the affected protein (polyglutamine disorders).
This gain of function outcome accounts for the neurodegenerative symptoms of Huntington's
disease. The affected protein is expressed widely in the CNS, particularly in certain neuron
populations. The result of the polyglutamine in the protein is a misfolded protein which, in the case of
Huntington's disease, aggregates and is sequestered into inclusion bodies complete with the
chaperones. This is thought to eventually overload the chaperone and ubiquitin systems. Other
evidence suggests that the inclusion bodies are a protective response and that the mutant protein
actually initiates a cascade of aberrant protein-protein interactions which affect many processes,
resulting in neuronal dysfunction and death (as always).
The discovery of the HD gene in 1993 resulted in a direct genetic test to make or confirm a diagnosis
of HD in an individual who is exhibiting HD-like symptoms. Using a blood sample, the genetic test
analyzes DNA for the HD mutation by counting the number of repeats in the HD gene region.
Individuals who do not have HD usually have 28 or fewer CAG repeats. Individuals with HD usually
have 40 or more repeats. A small percentage of individuals, however, have a number of repeats that
falls within a borderline region (see Table 3).
Table 3

No. of CAG repeats   Outcome
≤ 28                 Normal range; individual will not develop HD
29-34                Individual will not develop HD but the next generation is at risk
35-39                Some, but not all, individuals in this range will develop HD; next
                     generation is also at risk
≥ 40                 Individual will develop HD
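The counting step of the test and the ranges in Table 3 can be sketched in a few lines of Python. The function names and the toy sequence are made up for illustration, and the boundaries follow the text above (28 or fewer is normal, 40 or more is affected); this is not a description of how a diagnostic laboratory actually sizes the repeat.

```python
import re


def longest_cag_run(seq: str) -> int:
    """Number of CAG copies in the longest uninterrupted CAG tract in `seq`."""
    runs = re.finditer(r"(?:CAG)+", seq.upper())
    return max((len(m.group(0)) // 3 for m in runs), default=0)


def hd_outcome(cag_repeats: int) -> str:
    """Outcome for a given CAG repeat count, following Table 3."""
    if cag_repeats <= 28:
        return "normal range; individual will not develop HD"
    if cag_repeats <= 34:
        return "individual will not develop HD but the next generation is at risk"
    if cag_repeats <= 39:
        return "some, but not all, individuals will develop HD; next generation is also at risk"
    return "individual will develop HD"


toy = "ATG" + "CAG" * 30 + "CCGCCA"  # made-up sequence with 30 CAG repeats
n = longest_cag_run(toy)
print(n, "-", hd_outcome(n))
```

For the toy sequence the count is 30, which falls in the 29-34 row of the table.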
The third problem associated with these trinucleotide expansions concerns the production of a
pathogenic RNA species. I will cover this later.
