Bioinformatics Notes

1. Horizontal Gene Transfer

Horizontal gene transfer (HGT), also referred to as lateral gene transfer, is a phenomenon in
which genetic material is passed between organisms without descent. An organism can thus obtain
genetic material directly from another organism rather than from a parent. While HGT is
relatively uncommon in eukaryotes, it is quite common in prokaryotes (bacteria and archaea).
The discovery of HGT is important because it reveals an additional mode of inheritance and
shifts our conception of evolution. For instance, HGT among prokaryotes speeds up the spread of
antibiotic resistance in bacteria.
HGT can be identified through three major criteria. The first is nucleotide composition.
Many species have a characteristic GC (or, equivalently, AT) content across their genes. If a
gene has a GC content that is remarkably different from that of the other genes in the genome,
then the gene may have originated from HGT.
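The nucleotide composition criterion can be sketched in a few lines of Python. This is only a
minimal illustration: the function names, the 0.10 cutoff, the genome-wide GC value, and the toy
sequences are all assumptions made for the example, not values from these notes.

```python
def gc_content(seq: str) -> float:
    """Fraction of G and C bases in a DNA sequence (0.0 for an empty string)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def flag_atypical_genes(genes: dict[str, str], genome_gc: float,
                        threshold: float = 0.10) -> list[str]:
    """Names of genes whose GC content differs from the genome-wide GC content
    by more than `threshold` (a hypothetical cutoff chosen for illustration)."""
    return [name for name, seq in genes.items()
            if abs(gc_content(seq) - genome_gc) > threshold]


# Toy sequences, invented for illustration only.
genes = {
    "geneA": "ATGGCGCCGGCGTGCGGCTAA",
    "geneB": "ATGGCCGGCGCGCTGGCCTGA",
    "geneC": "ATGAAATTTATTAATAAATAA",  # AT-rich outlier
}
# Assume a GC-rich host genome (70% GC); only the AT-rich gene is flagged.
print(flag_atypical_genes(genes, genome_gc=0.70))
```

A flagged gene is only a candidate: an unusual GC content hints at HGT but does not prove it.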
The second criterion is codon usage. Many species prefer to use a specific codon for a given
amino acid. A gene that shows a very different codon usage from the rest of the genome may
therefore have originated from HGT.
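As a rough sketch of the codon usage criterion, the code below compares the codon frequencies of
a gene against those of a reference set. The distance measure (half the sum of absolute
frequency differences) and the toy sequences are assumptions chosen for simplicity. Note that it
compares raw codon frequencies rather than the preference among synonymous codons for each
amino acid, which a real analysis would more likely use.

```python
from collections import Counter


def codon_frequencies(cds: str) -> dict[str, float]:
    """Relative frequency of each codon in an in-frame coding sequence.
    Trailing bases that do not form a full codon are ignored."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3
    counts = Counter(cds[i:i + 3] for i in range(0, usable, 3))
    total = sum(counts.values())
    return {codon: n / total for codon, n in counts.items()}


def usage_distance(freq_a: dict[str, float], freq_b: dict[str, float]) -> float:
    """Half the sum of absolute frequency differences.
    0 means identical usage, 1 means no codons in common."""
    codons = set(freq_a) | set(freq_b)
    return 0.5 * sum(abs(freq_a.get(c, 0.0) - freq_b.get(c, 0.0)) for c in codons)


# Toy sequences, invented for illustration only.
reference = codon_frequencies("CTGCTGCTGGCGGCGAAA")  # host prefers CTG, GCG, AAA
native = codon_frequencies("CTGCTGGCGAAA")           # same preferences
foreign = codon_frequencies("TTATTAGCAAAG")          # same amino acids, other codons

print(usage_distance(reference, native))   # small distance
print(usage_distance(reference, foreign))  # large distance
```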
The third criterion is phylogenetic position. It is possible to construct a phylogenetic tree
from a gene (i.e. the ancestry of that gene). If the phylogenetic position of a gene within an
organism deviates from that of the organism's other genes, this hints that the gene may have
originated from HGT.
An interesting aspect of HGT is that microbes transfer operational genes far more often than
informational genes. Informational genes are those involved in processing genetic information
(replication, transcription, and translation), while operational genes are those involved in
housekeeping functions such as metabolism.

2. C-Value Paradox and N-Value Paradox

Genome studies long rested on the assumption that organisms of higher complexity should
possess more genetic material and more genes. Scientists devised the terms C-Value and N-Value
(or G-Value) to quantify genetic material. The C-Value refers to the amount of DNA in a haploid
genome, while the N-Value simply refers to the number of genes an organism possesses.
However, this assumption turned out to be wrong. Humans have a C-Value of about 3.3 Gb, while
the common onion has nearly five times that at 15 Gb. Even more surprising, some unicellular
eukaryotic amoebae have reported C-Values of 290 to 690 Gb!
This is often dubbed the C-Value paradox, since it was assumed that more complex organisms
should have more genetic material. Similarly, before the sequencing of the human genome was
completed, some experts predicted that the number of protein-coding genes would exceed 50,000.
It therefore came as a shock when the International Human Genome Sequencing Consortium (2004)
reported that the human gene count was only around 20,000 to 25,000. This is dubbed the N-Value
paradox.
It turns out that C-Values do not correlate with organismal complexity. While this is not much
of a paradox at all, it led to the idea that a large portion of an organism's genome is
noncoding, often highly repetitive DNA, so-called junk DNA. These sequences appear not to
contribute to the cell. In fact, deleting such sequences seems to have no effect on the
organism's survival.
Higher N-Values in plants may be explained by the fact that many plants are polyploid. Rice,
for instance, is an ancient polyploid. However, animals like the sea urchin have N-Values
similar to that of humans. One suggested explanation is that sea urchins have an immense
repertoire of genes related to immunity.
All animals have innate immunity, that is, the non-specific immune reaction to a pathogen, but
only the gnathostomes (jawed vertebrates), like us humans, possess antibody-based adaptive
immunity. Adaptive immunity lets us tailor a specific response to a specific pathogen by
shuffling pre-existing gene segments to create a diverse library of antibodies. Animals without
it, such as sea urchins, may therefore require more immune genes to achieve the same level of
defense.
Regardless of the reason, it is clear that C-Values and N-Values do not correlate with
morphological complexity. One may call human beings the most complex organisms, but humans are
clearly not leading the race in either C-Value or N-Value.

3. CpG Sites
CpG sites refer to the cytosine-phosphate-guanine sequence in DNA. Simply put, a CpG site is
where a cytosine is immediately followed by a guanine on the same strand. The human genome has
a GC content of about 42%, thus roughly 21% cytosine and 21% guanine. If bases occurred
independently, the probability of finding a cytosine followed directly by a guanine would be

$$0.21 \times 0.21 = 0.0441$$

CpG dinucleotides would therefore be expected at a frequency of about 4.41% throughout the
human genome. However, the observed frequency is only about 1%, far less than predicted. A
proposed explanation for this is the tendency of 5-methylcytosine to undergo spontaneous
deamination into thymine.
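The expected-versus-observed comparison above can be reproduced with a short Python sketch. The
function names and the toy sequence are made up for illustration. The expected value uses the
same simplifying assumption as the text: C and G each make up half of the GC content, and
neighbouring bases are independent.

```python
def expected_cpg_frequency(gc_content: float) -> float:
    """Expected CpG frequency if C and G each make up half of the GC content
    and neighbouring bases are independent of each other."""
    f_c = f_g = gc_content / 2
    return f_c * f_g


def observed_cpg_frequency(seq: str) -> float:
    """Fraction of adjacent base pairs (dinucleotide positions) that read CG."""
    seq = seq.upper()
    if len(seq) < 2:
        return 0.0
    return seq.count("CG") / (len(seq) - 1)


print(f"{expected_cpg_frequency(0.42):.4f}")  # 0.0441, the figure derived in the text

toy = "ATCGGATTCAGGCATTACGT"  # invented sequence, for illustration only
print(observed_cpg_frequency(toy))
```

Run on a real genome, the second function would be the one expected to return a value well below
the first.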
DNA methylation serves as a way to control gene expression. Methylation involves the addition
of a methyl group to a nucleotide in a DNA region. This marks the region in such a way that its
transcription is typically repressed.
Methylation of cytosine produces 5-methylcytosine. However, 5-methylcytosine is quite prone to
deamination, which converts it into thymine. As such, the number of CpG sites in methylated DNA
is very low compared to non-methylated DNA.
However, ancestral CpG sites in regions that do not get methylated survive over evolutionary
timescales, while methylated sites do not. As a result, there are specific regions in the
genome where the frequency of CpG sites is higher than elsewhere. These regions are referred to
as CpG islands, and they are defined as regions having a GC content higher than 50%. CpG
islands are associated with particular genes, so they are sometimes useful in the identification of
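A sliding-window scan for candidate CpG islands might look like the sketch below. Only the GC
content threshold of 50% comes from these notes. The 200 bp window and the observed/expected CpG
ratio above 0.6 are additional, commonly used criteria that are assumed here, and the function
names and test sequence are invented for illustration.

```python
def is_cpg_island_window(window: str, min_gc: float = 0.5,
                         min_obs_exp: float = 0.6) -> bool:
    """True if the window has GC content above `min_gc` and an observed/expected
    CpG ratio above `min_obs_exp` (the ratio cutoff is an assumed criterion)."""
    window = window.upper()
    n = len(window)
    c, g = window.count("C"), window.count("G")
    if n == 0 or c == 0 or g == 0:
        return False
    gc = (c + g) / n
    # Observed CpG count divided by the count expected from base composition.
    obs_exp = window.count("CG") * n / (c * g)
    return gc > min_gc and obs_exp > min_obs_exp


def scan_for_islands(seq: str, window: int = 200, step: int = 1) -> list[int]:
    """Start positions of windows that satisfy the island criteria."""
    return [start for start in range(0, len(seq) - window + 1, step)
            if is_cpg_island_window(seq[start:start + window])]


# Toy sequence: an artificial CpG-rich block flanked by AT-rich DNA.
seq = "AT" * 100 + "CG" * 100 + "AT" * 100
print(scan_for_islands(seq, window=200, step=100))
```

The ratio test is what separates a true CpG island from a region that is merely GC-rich but
depleted of CpG.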

4. Single Nucleotide Polymorphisms
DNA sequence variation undoubtedly occurs within a population of organisms. Such variation
includes insertions, deletions, duplications, translocations, etc. However, the most common
form of DNA variation is the single nucleotide polymorphism (SNP).
SNPs are DNA variations that affect only a single nucleotide. They arise from mutation, but may
or may not have harmful effects. For instance, the degeneracy of the genetic code allows for
silent mutations. SNPs that do not change the resulting protein sequence are referred to as
synonymous SNPs, while SNPs that change the resulting protein sequence are referred to as
non-synonymous SNPs. A non-synonymous SNP may be either a missense or a nonsense mutation.
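The distinction between synonymous, missense, and nonsense SNPs can be illustrated with the
standard genetic code. The sketch below is a simplified classifier for a single codon on the
coding strand. The function name is hypothetical, and a change that turns a stop codon into an
amino acid is simply reported as missense.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order; "*" marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def classify_snp(codon: str, position: int, new_base: str) -> str:
    """Classify a single-base change at `position` (0, 1 or 2) of a codon."""
    codon = codon.upper()
    mutated = codon[:position] + new_base.upper() + codon[position + 1:]
    before, after = CODON_TABLE[codon], CODON_TABLE[mutated]
    if before == after:
        return "synonymous"
    if after == "*":
        return "nonsense"
    return "missense"


print(classify_snp("GAA", 2, "G"))  # GAA -> GAG, Glu -> Glu
print(classify_snp("GAG", 1, "T"))  # GAG -> GTG, Glu -> Val
print(classify_snp("TGG", 2, "A"))  # TGG -> TGA, Trp -> stop
```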
The alternative nucleotides found at a SNP site are referred to as SNP alleles, and their
frequencies may differ between family lineages or geographical locations. SNP alleles are
passed on to offspring, which allows them to be used in DNA fingerprinting and forensic
science. An interesting thing about SNPs is that research has suggested that most SNPs have
only two alleles (biallelic), such that they are easily
SNPs can occur in both coding and non-coding regions of the genome. SNPs in coding regions may
be non-synonymous and hence potentially cause disease. SNPs in non-coding regions may also
cause disease, since some non-coding regions are transcribed into functional non-coding RNAs
(e.g. tRNAs, rRNAs, etc.). As such, the identification of SNPs is essential in uncovering a
potential genetic disease.
Since an individual possesses a multitude of SNPs, a combination of such SNP alleles inherited
together is referred to as a haplotype. The International HapMap Project collects haplotypes
from individuals around the world and curates them into a database. This, however, turns out to
be a difficult task, because sometimes it is very hard to identify
The HapMap project is particularly useful to field biologists, as it may give insight into the
migratory and interbreeding patterns of populations.
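As a small illustration of how haplotypes can be tabulated, the sketch below treats each
haplotype as the string of alleles found at an ordered set of SNP sites on one chromosome and
counts how often each occurs. The data are invented and the function name is hypothetical. This
is not the HapMap pipeline.

```python
from collections import Counter


def haplotype_frequencies(chromosomes: list[str]) -> dict[str, float]:
    """Relative frequency of each haplotype, most common first.
    Each string lists the alleles at the same ordered SNP sites."""
    counts = Counter(chromosomes)
    total = len(chromosomes)
    return {hap: n / total for hap, n in counts.most_common()}


# Alleles at three hypothetical SNP sites on six sampled chromosomes.
sample = ["ACT", "ACT", "GCT", "ACT", "GTA", "GCT"]
print(haplotype_frequencies(sample))
```

Comparing such frequency tables between populations is one way differences in ancestry or
interbreeding could show up.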

5. ENCODE Project
The Encyclopedia of DNA Elements (ENCODE) Project is a multinational research project that aims
to identify all functional elements in the human genome. While most protein-coding genes have
been studied intensively and are well understood, they account for only about 1.5% of the whole
human genome. The rest, which is non-coding DNA, was believed to consist mostly of junk DNA.
A small portion of non-coding DNA is transcribed into components of the regulome, that is,
components that regulate the activity of other genes or proteins. Mapping genomic regions to
specific regulatory components and determining how they influence gene transcription could
reveal connections between gene expression and certain diseases.
The ENCODE project was launched in 2003 and published its genome-wide results in 2012, in which
the whole human genome was surveyed for function. Surprisingly, around 80% of non-coding DNA is
transcribed: approximately 20% was deemed functional and attributed to the regulation of gene
expression, while the other 60% consists of transcribed regions with no known function.
This led ENCODE researchers to conclude that 80% or more of the genome is associated with some
biochemical activity (function), implying that this DNA is not junk after all. However, this
conclusion sparked controversy, as 60% of non-coding DNA is merely transcribed without any
known function. It is the same as saying that 80% of the job of the genome is to get
transcribed, with 60% being transcribed for no apparent reason.
Nevertheless, the ENCODE project revealed how complex gene regulation is. For instance, it
found that the expression of coding genes is controlled by various regulatory sites located
both near and far from the actual gene.

6. Adam and Eve

Variations in mitochondrial DNA (mtDNA) sequences are also associated with populations, similar
to how SNP haplotypes may be associated with a certain group of people. These mtDNA haplotypes
can be traced back to form an ancestry in which the common ancestor of all mtDNA haplotypes is
referred to as the mitochondrial Eve. mtDNA haplotypes are often described relative to the
Cambridge Reference Sequence. For instance, the mtDNA haplotypes H1 and H3 are prevalent in
Europe, but not in other parts of the world.
Since mtDNA is passed down maternally without any recombination, the mitochondrial Eve
represents the most recent woman from whom all living humans descend through an unbroken
maternal line. Analysis of mtDNA haplotypes suggests that the mitochondrial Eve may have lived
in East Africa.
This does not mean that the mitochondrial Eve was the only woman alive at that time, but she is
the only one who produced a direct, unbroken female lineage to all humans living today. The
maternal lineages of her contemporaries simply died out as time progressed, for example when a
woman had only sons.
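Why a single maternal lineage eventually remains can be seen in a toy simulation. The sketch
below traces maternal lineages backward in time under strong simplifying assumptions that are
not part of these notes: a constant number of women per generation, non-overlapping generations,
and each woman's mother chosen uniformly at random. The function name is hypothetical.

```python
import random


def generations_to_maternal_mrca(n_women: int, seed: int = 0) -> int:
    """Number of generations back until all maternal lineages of the present-day
    women merge into a single ancestor (a toy 'mitochondrial Eve')."""
    rng = random.Random(seed)
    lineages = set(range(n_women))  # one maternal lineage per living woman
    generations = 0
    while len(lineages) > 1:
        # Each remaining lineage picks a mother at random in the previous
        # generation; lineages that pick the same mother merge into one.
        lineages = {rng.randrange(n_women) for _ in lineages}
        generations += 1
    return generations


print(generations_to_maternal_mrca(50))
```

In every run the other women of the ancestor's generation existed as well, but their maternal
lines are no longer represented among the present-day sample.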
This finding is strong evidence for the Recent African Origin (RAO) theory, which states that
modern humans originated in Africa before migrating all over the world.
The male analogue of the mitochondrial Eve is the Y-Chromosomal Adam (Y-MRCA), where MRCA
stands for most recent common ancestor. Y-Chromosomal Adam represents the most recent common
ancestor from whom all living males are descended patrilineally. This doesn't mean that Y-MRCA
and mitochondrial Eve lived during the same time period. In fact, Y-MRCA is estimated to have
lived much later than mitochondrial Eve, though they both lived in
It is very probable that most males in the past already belonged to the mitochondrial Eve
lineage when, at some point, the Y-MRCA lineage arose. By chance, only males carrying both the
mitochondrial Eve mtDNA lineage and the Y-MRCA lineage left descendants to the present. Hence
all males today are part of both the mitochondrial Eve lineage and the Y-MRCA lineage.