
Phylogenetic and Phylogeographic Ancestral Inference of the Brahmin Population in Uttar Pradesh

Rohit Chandra Dept. of CSE IIIT Delhi rohit11090@iiitd.ac.in Suhanth Boddu Dept. of CSE, IIIT Delhi suhanth11114@iiitd.ac.in Divyanshu Bansal Dept. of CSE, IIIT Delhi divyanshu11045@iiitd.ac.in

Abstract: India is one of the most ethnically diverse regions of the world and is considered to be a major Paleolithic migration route. India is home to several autochthonous subhaplogroups of haplogroup N, one of the signature haplogroups that define the out-of-Africa migrations. Here we study the phylogeny of the Brahmin population of the north Indian state of Uttar Pradesh using coalescent theory, haplotype trees and the infinite alleles model on our mtDNA dataset. We use well-defined, established methods with new assumptions and modifications to suit our dataset. Based on the observed results we also discuss the possible phylogeography of the population. We estimate the time to the most recent common maternal ancestor of the population under study and comment on the possible place of origin and migration paths of this ancestor. This study uses mitochondrial DNA control region sequence data for analysis.

Key words: Mitochondrial DNA (mtDNA), coalescent, haplogroup, haplotype, phylogeny, Brahmin, phylogeny tree, D-Loop, hypervariable regions (HVR), control region, single nucleotide polymorphism (SNP).

I. INTRODUCTION

Uttar Pradesh is the largest state of India by population and the fifth largest by area. With a population of 200 million it is one of the most diverse regions in terms of mtDNA haplogroups, as is evident from the fact that we were able to identify 25 different haplogroups in our small dataset of 50 sequences. Roughly 12% of UP's population consists of Brahmins, and this is the population of interest for our study.

Mitochondrial DNA in humans contains about 16500 base pairs and codes for 13 proteins, 22 tRNA genes and 2 rRNA genes. Mitochondrial DNA is inherited exclusively from the mother, which makes it ideal for studying maternal lineage far back in time. For phylogeny studies a part of the mtDNA called the D-Loop is of vital importance. A D-loop is a DNA structure where the two strands of a double-stranded DNA molecule are separated for a stretch and held apart by a third strand of DNA. The D-Loop forms a part of the non-coding control region of mtDNA and consists of 1100 base pairs. It sits between bp 16001-574 of the circular genome. Contained within this D-Loop are 2 highly polymorphic regions known as hypervariable regions. HVR1 is located between bp 16001-16560 while HVR2 is located in bp 1-574. Hypervariable regions are ideal for studying ancestral relationships among organisms as they are highly polymorphic. We focus on the HVR1 region for our study.

Mutations in the mitochondrial DNA result in the substitution of one of the bases A, C, G or T by another one. These mutations are of two types: transversions, which substitute a purine (A, G) for a pyrimidine (C, T) or vice versa, and transitions, which substitute one purine for the other or one pyrimidine for the other. For ancestral inference it makes sense to study the mutation patterns among different individuals and use these patterns to obtain results. While making the distance matrices for further analysis we include only the sites that vary among the individuals and remove the rest of the sequence to reduce complexity. These sites are called segregation sites, and individuals having the same mutation patterns make a lineage. It is safe to assume that people belonging to the same lineage based on HVR regions will have the same ancestral history regardless of mutations in other regions of mtDNA.

II. METHODOLOGY

A. Dataset

The dataset presented here corresponds to a subset of the data collected by Palanichamy et al. [2] from the states of Uttar Pradesh, West Bengal and Andhra Pradesh. From this dataset 47 sequences were identified that pertained to the Brahmin population of Uttar Pradesh. The dataset is publicly available on the NCBI website with the accession numbers AY713976 - AY714050. The infinitely many sites model [1] assumes each polymorphism to be a unique mutation. Since the HVR regions of the individuals varied, we took the region lying between bp 16024 - 16569 in order to maintain uniformity of the data. This gave a constant figure of 546 bases for all the sequences. It was ensured that no mutation site was left unaccounted for by this adjustment. These filtered sequences were then aligned using ClustalW multiple sequence alignment in order to get the segregation sites.
The same was done for the HVR2 sequences as well. Sequences differing by two or fewer transitions were grouped together under a lineage. However, sequences with a single transversion difference were considered distinct. 36 lineages were identified, and this narrowed down the data to 36 sequences of 546 base pairs. After the identification of the segregation sites the sequences were aligned to the revised Cambridge Reference Sequence (rCRS) to get the mutation patterns. These patterns were graphed and 65 different site patterns emerged. These 36 sequences of 65 letters each made the final version of the dataset. The mtDNA haplogroups in the dataset are H, W, I, J3, R, T2, K1, U3, U7, HV, V and their subhaplogroups. The data is further divided into Bhargava, Chaturvedi and other Brahmins on the basis of surnames.

B. Making sense of the data - The Tree

The Analyses of Phylogenetics and Evolution (APE) package [3] in R provides convenient methods for creating and analysing genealogical trees. The Phangorn and Phyclust packages are useful in creating distance matrices and unrooted trees. Using the unrooted tree as reference, various rooted trees can be created and analysed. In order to arrive at one tree for analysis we select the most parsimonious construction for the unrooted tree.
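The Dataset steps above (keeping only the segregation sites, then grouping sequences that differ by at most two transitions and no transversions) can be sketched in Python. This is a minimal illustration and not the authors' program: the function names, the toy sequences and the assumption that the input is already aligned and contains only the bases A, C, G and T are all hypothetical.

```python
PURINES = frozenset("AG")
PYRIMIDINES = frozenset("CT")


def segregating_sites(seqs, offset=16024):
    """Return positions (numbered from `offset`) where aligned sequences differ."""
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("sequences must be aligned to equal length")
    return [offset + i for i in range(length) if len({s[i] for s in seqs}) > 1]


def classify(a, b):
    """Label a substitution between two different bases."""
    within_class = {a, b} <= PURINES or {a, b} <= PYRIMIDINES
    return "transition" if within_class else "transversion"


def count_differences(s1, s2):
    """Count (transitions, transversions) between two aligned sequences."""
    transitions = transversions = 0
    for a, b in zip(s1, s2):
        if a == b:
            continue
        if classify(a, b) == "transition":
            transitions += 1
        else:
            transversions += 1
    return transitions, transversions


def same_lineage(s1, s2, max_transitions=2):
    """Grouping rule from the text: at most two transitions, no transversion."""
    transitions, transversions = count_differences(s1, s2)
    return transversions == 0 and transitions <= max_transitions


# Toy aligned sequences (hypothetical, not taken from the dataset).
toy = ["ACGTAC", "ACATAC", "ACGTCC"]
sites = segregating_sites(toy)
```

With the toy input, the first two sequences differ by one G/A transition and would share a lineage, while the third differs from the first by an A/C transversion and would be kept distinct.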

Fig. 1. Unrooted genealogical tree for the mtDNA data. 21 refers to sequence AY714021.

The sequence data is used for creating a Euclidean distance matrix representing the genetic distance between pairs of sequences. The UPGMA (unweighted pair group method with arithmetic mean) method clusters the two closest sequences, then computes the new distance matrix using the arithmetic mean of the distances to that first cluster. This process is repeated until all the sequences are grouped. A variation of the UPGMA method called the neighbour joining method [5] pulls out a pair of sequences at each iteration so that the total length of the branches on the tree is minimized. After a pair of nodes is pulled out, it forms a cluster in the tree and is included in further rounds of iteration. Here also a new distance matrix is generated at each iteration. These methods help us in arriving at the tree with optimal parsimony, which can be used for MRCA studies.

C. The Most Recent Common Ancestor

The infinite alleles model supposes that each mutation creates a new allele and that all allele types are equally different from each other. This model can be applied while calculating the time to the most recent common ancestor (TMRCA) for two individuals (two genes, to be accurate). This model takes into account mutation rate, time, sample size and the number of matching markers [14]. Two sequences differing by a single mutation will have a match on 1050 markers [546 of HVR1 and 504 of HVR2] for our dataset. The chi-squared test comes in handy here as it can be used to verify the TMRCA and get a confidence interval for our estimates.

The probability that two lineages coalesce in the immediately preceding generation is the probability that they share a parental DNA sequence. In a diploid population with a constant effective population size and $2N_e$ copies of each locus, there are $2N_e$ potential parents in the previous generation, so the probability that two alleles share a parent is $1/(2N_e)$ and, correspondingly, the probability that they do not coalesce is $1 - 1/(2N_e)$. The number of generations back to coalescence is therefore geometrically distributed, and the probability of coalescing exactly $t$ generations back is given by the formula $P(t) = (1 - 1/(2N_e))^{t-1} \, (1/(2N_e))$.

Ewens' sampling formula gives us an estimate of how many different alleles are observed a given number of times in the sample. This can come in handy while estimating the TMRCA in different regions of the cladogram. We can take the substitutions in different branches to follow a Poisson process with rate $\theta/2$ on each branch.

While scoring for the TMRCA we give a higher score to transversions and a lower one to transitions, since a transition indicates a closer ancestral relationship while a transversion signifies a greater mutation. The p-value obtained by the chi-squared test helps in getting a confidence interval estimate for our TMRCA calculations. Parsons et al. [4] suggest methods for mapping substitutions and a mutation rate of $3 \times 10^{-5}$ per nucleotide per transmission event for D-loop HVR regions. Two of these transmission events correspond to one generation, and this is the value used for getting the time to the most recent common ancestor. For our study we have fixed a period of 15 years for one transmission event. As we will see, this gives a 50% confidence interval on our calculations. The average mtDNA haplotype mutation rate of $3 \times 10^{-5}$ gives us the period for which a region of the control region remains unchanged: $3 \times 10^{-5} \times 1050 = 0.0315$ per transmission event. This suggests that a combined HVR1 + HVR2 region of mtDNA can survive for 32 generations or 960 years (based on our fixed transmission event time). A similar estimate for the HVR1 region gives a figure of 62 generations. Thus the common female ancestor (MRCA) of two people who randomly match exactly for the combined HVR1+HVR2 mtDNA haplotype can go back almost 1000 years.

D. Resources

R packages: APE [3], Phangorn, Phyclust, rgl, igraph; ClustalW multiple sequence alignment; Python; SnapGene Viewer; NCBI genome browser; mtDB Human Mitochondrial Genome Database (www.mtdb.igp.uu.se) [13].

III. RESULTS AND DISCUSSION

With the unrooted tree for the 36 sequences we can arrive at multiple rooted graphs. However, as described above, we prefer the most parsimonious reconstruction and use distance-based neighbour joining to create the tree. The first step towards this tree is performing the clustering, which can be fed as input for the optimal parsimonious reconstruction.
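To make the clustering step concrete, here is a small pure-Python sketch of UPGMA (size-weighted average linkage). It is only an illustration under stated assumptions: the study itself relied on R packages, and the function name, the labels and the toy distance matrix below are hypothetical.

```python
from itertools import combinations


def upgma(labels, matrix):
    """Cluster by UPGMA and return a nested tuple (left, right, height).

    `matrix` is a symmetric list of lists of pairwise distances.
    """
    # Active clusters: id -> (subtree, number of leaves in it)
    clusters = {i: (labels[i], 1) for i in range(len(labels))}
    # Pairwise distances keyed by (smaller id, larger id)
    dist = {(i, j): matrix[i][j] for i, j in combinations(range(len(labels)), 2)}
    next_id = len(labels)

    while len(clusters) > 1:
        i, j = min(dist, key=dist.get)          # closest pair of clusters
        height = dist[(i, j)] / 2               # node height is half the distance
        (tree_i, n_i), (tree_j, n_j) = clusters.pop(i), clusters.pop(j)

        # Distance from the merged cluster to each remaining cluster is the
        # size-weighted arithmetic mean of the two old distances.
        merged = {}
        for k in clusters:
            d_ik = dist[(min(i, k), max(i, k))]
            d_jk = dist[(min(j, k), max(j, k))]
            merged[(k, next_id)] = (n_i * d_ik + n_j * d_jk) / (n_i + n_j)

        # Drop every distance that involved the two merged clusters.
        dist = {pair: d for pair, d in dist.items() if i not in pair and j not in pair}
        dist.update(merged)
        clusters[next_id] = ((tree_i, tree_j, height), n_i + n_j)
        next_id += 1

    (tree, _size), = clusters.values()
    return tree


# Toy distances (hypothetical): A and B are close, C is equally far from both.
toy_labels = ["A", "B", "C"]
toy_matrix = [[0, 2, 6],
              [2, 0, 6],
              [6, 6, 0]]
toy_tree = upgma(toy_labels, toy_matrix)
```

In the toy case A and B merge first at height 1.0, and C joins that pair at height 3.0. Neighbour joining differs in how it picks the pair to merge, so this sketch should not be read as the tree-building procedure used for the figures.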

Fig. 2. Cluster dendrogram for the dataset

The MPR function [10] can now provide the phylogeny tree with optimal parsimony, which facilitates our MRCA analysis. We magnify subregions in this tree to perform the MRCA calculations. To exemplify the method we take the sequences lying on the two extreme ends of the tree and perform the MRCA analysis. The tree gives us the sequence AY714025 as the closest relative of the common ancestor of the individuals included in the dataset. As we move rightwards the number of mutations increases, but stronger intra-region relationships are observed, suggesting similar mutation patterns among individuals. Since we are analysing a group of autochthonous haplogroups, this observation also suggests common geographical origins and migrations. We know haplogroup L3 in Africa to be the ancestor of the macrohaplogroups M and N, which further branched out into J, K, I, W, R and U in Europe and South-East Asia about 40000-70000 years ago. In the TMRCA calculation a perfect match on all the markers suggests that the common maternal ancestor of these individuals may go back as far as 32 generations. Since our dataset with 65 segregation sites has the mutation difference between sequences limited to at most 8, we can rule out an origin of the most recent common ancestor outside the subcontinent. The sequence AY714025, which makes the leftmost branch of the tree, belongs to the mtDNA haplogroup U2, which is found in Punjab in Pakistan in addition to the northern and eastern states of India. As the maximum number of transmission events to the most recent common ancestor in our study reaches a figure of 500, we can safely assume the origins of the MRCA to be in one of the above-mentioned regions. For the MRCA calculations the markers include mutations in the HVR2 region in addition to the HVR1 mutations.

Fig. 3. Optimal parsimony tree for the dataset. Optimal parsimonious reconstruction returns a parsimony value of 107
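A parsimony value such as the one quoted in the caption counts the minimum number of state changes a tree requires. The sketch below uses Fitch's algorithm for unordered characters as a stand-in to show how such a count arises; it is not the MPR routine [10] used in the study, and the tree shape, leaf names and bases are hypothetical.

```python
def fitch_site(tree, states):
    """Return (possible ancestral states, minimum changes) for one site.

    `tree` is a leaf name (str) or a pair of subtrees; `states` maps
    each leaf name to its base at this site.
    """
    if isinstance(tree, str):
        return {states[tree]}, 0
    left, right = tree
    left_set, left_cost = fitch_site(left, states)
    right_set, right_cost = fitch_site(right, states)
    shared = left_set & right_set
    if shared:
        # The children agree on at least one state: no extra change needed.
        return shared, left_cost + right_cost
    # Otherwise one change is required somewhere below this node.
    return left_set | right_set, left_cost + right_cost + 1


def parsimony_score(tree, alignment):
    """Sum the per-site Fitch costs over all columns of the alignment."""
    n_sites = len(next(iter(alignment.values())))
    total = 0
    for i in range(n_sites):
        states = {name: seq[i] for name, seq in alignment.items()}
        total += fitch_site(tree, states)[1]
    return total


# Toy rooted tree and two-site alignment (hypothetical).
toy_tree = (("s1", "s2"), ("s3", "s4"))
toy_alignment = {"s1": "AC", "s2": "AT", "s3": "GC", "s4": "GT"}
toy_score = parsimony_score(toy_tree, toy_alignment)
```

In the toy case the first site needs one change and the second needs two, so the score is 3. Comparing this score across candidate trees is what makes one reconstruction "more parsimonious" than another.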

The data obtained from the right branch of the tree pertains to the haplogroups U7, K1, K8, H13, HV2 and V. This dataset has 11 segregation sites. A distance matrix for all the sequences pertaining to these haplogroups was constructed using the K80 model [15], which distinguishes two kinds of mutations (transitions and transversions) and allots separate probabilities to them, giving a better bound than the default Jukes and Cantor model.
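For reference, the K80 (Kimura two-parameter) distance between two aligned sequences is $d = -\tfrac{1}{2} \ln\big((1 - 2P - Q)\sqrt{1 - 2Q}\big)$, where $P$ and $Q$ are the observed proportions of transitions and transversions. The following is a minimal sketch of that formula, not the R routine used for Fig. 4; the function name, the choice to skip positions with gaps or ambiguous bases, and the toy sequences are assumptions.

```python
import math

PURINES = frozenset("AG")
PYRIMIDINES = frozenset("CT")


def k80_distance(seq1, seq2):
    """Kimura two-parameter distance between two aligned sequences."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to equal length")
    compared = transitions = transversions = 0
    for a, b in zip(seq1.upper(), seq2.upper()):
        # Assumption: positions with gaps or ambiguous bases are skipped.
        if a not in "ACGT" or b not in "ACGT":
            continue
        compared += 1
        if a == b:
            continue
        if {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
            transitions += 1
        else:
            transversions += 1
    if compared == 0:
        raise ValueError("no comparable positions")
    p = transitions / compared       # proportion of transitions (P)
    q = transversions / compared     # proportion of transversions (Q)
    # math.log raises ValueError if the sequences are too divergent
    # for the model (argument not positive).
    return -0.5 * math.log((1 - 2 * p - q) * math.sqrt(1 - 2 * q))


# Toy pair (hypothetical): one transition and one transversion in 10 sites.
toy_d = k80_distance("AAAAAAAAAA", "GAAAAAAAAC")
```

For the toy pair the raw proportion of differing sites is 0.2, while the K80 distance is about 0.234, which reflects the correction for multiple substitutions at the same site.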

Fig. 4. Distance matrix created using the K80 model.

A normal distribution is plotted to estimate the number of transmission events between 2 sequences by taking the sequences 2 at a time from each branch of the tree. As described above, 2 transmission events are taken to be one generation and each transmission event adds 15 years to the time to the most recent common ancestor. Given below is the normal distribution obtained by taking the sequences of a Brahmin of haplogroup K1 lying at the rightmost edge of the tree and a Chaturvedi of haplogroup V, corresponding to accession numbers AY714004 and AY713979 respectively. The normal distribution does not show any rise until it reaches the value of 87. This is expected, since the calculations performed on our model showed that the control region can survive up to 62 transmission events without any mutations. The normal distribution reaches a maximum at 472, suggesting a gap of 236 generations back to the most recent common ancestor of the two individuals. A similar calculation done for the whole tree returns a result of 536 transmission events for the MRCA. Taking into account our assumptions regarding the transmission events, this returns a value of 8040 years. This enables us to conclude with 50% confidence that the most recent maternal ancestor of the population under study existed about 8000 years ago. It has been established that the split from haplogroup L3 and the expansion of haplogroups M and N started occurring about 40000-70000 years ago. Among the haplogroups studied, haplogroup R has the oldest origins, going back 65,000 years, while T is the youngest and is estimated to have lived in the region around the Mediterranean Sea around 17,000 years ago, which suggests a westward migration from the subcontinent through the Middle East [11]. Keeping these figures in mind, it is highly unlikely that the MRCA of the population under study, with origins going back 8000 years, came from outside the subcontinent.
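The arithmetic behind these figures can be checked with a few lines of Python. This is only a sketch of the stated assumptions (15 years per transmission event, 2 events per generation, a mutation rate of $3 \times 10^{-5}$ per nucleotide per event); the function and constant names are hypothetical, and the "expected events before a mutation" is taken here as the reciprocal of the per-event mutation probability, which appears to be how the survival figures of roughly 32 and 62 quoted in the text arise.

```python
YEARS_PER_EVENT = 15          # fixed period assumed for one transmission event
EVENTS_PER_GENERATION = 2     # two transmission events make one generation
MUTATION_RATE = 3e-5          # per nucleotide per transmission event [4]


def events_to_years(events):
    """Convert a number of transmission events to years."""
    return events * YEARS_PER_EVENT


def events_to_generations(events):
    """Convert a number of transmission events to generations."""
    return events / EVENTS_PER_GENERATION


def expected_events_before_mutation(n_sites, rate=MUTATION_RATE):
    """Reciprocal of the per-event probability that any of n_sites mutates."""
    return 1 / (rate * n_sites)


whole_tree_years = events_to_years(536)           # MRCA of the whole tree
pair_generations = events_to_generations(472)     # AY714004 vs AY713979
combined_survival = expected_events_before_mutation(1050)   # HVR1 + HVR2
hvr1_survival = expected_events_before_mutation(546)        # HVR1 only
```

Under these assumptions 536 events correspond to 8040 years and 472 events to 236 generations, matching the text, while the two survival values come out near 31.7 and 61.1.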

Fig. 5. The normal distribution for estimation of TMRCA of AY714004 and AY713979

Fig. 6. Magnified sub-portion of the rightmost branch depicting mutations and the number of transmission events.

It is important to note that most of the studied haplogroups are subgroups of macrohaplogroup N, which is scarce in the Dravidian and North Eastern populations of India. Haplogroup M is the dominant haplogroup among these populations. Out-of-Africa migrations suggest that the split of haplogroup N originated in the North Western part of the Indian subcontinent. Mitochondrial DNA migration patterns suggest that the descendants of haplogroup N had expanded into Europe and developed new subhaplogroups about 8000-17000 years ago [11]. So we can conclude that the most recent maternal ancestor of the population under study originated not more than 8000 years ago in North-Western or Central India.

The descendants of haplogroup R in India, namely the haplogroups J, T and group R0, show high haplogroup diversity, suggesting their autochthonous status [6]. South Asia lies on the route of the earliest dispersals out of Africa and is vital for phylogeography studies of the Western European population [11]. Although the spread of haplogroup R is very wide, it is unlikely that any subhaplogroup in Europe or Central Asia (migration routes out of India) has descended from the most recent common ancestor of the population under study.

IV. FUTURE WORK

The present study has been undertaken using the infinite alleles model on only a subset of the sequences sequenced by Palanichamy et al. from Uttar Pradesh, Andhra Pradesh, West Bengal and the North Eastern States. We would like to undertake a similar study on the whole dataset using the stepwise mutation model, which tries to better account for the actual mutational process that occurs at microsatellite markers by scoring the marker lengths. The stepwise mutation model looks at the frequency spectrum of the mismatches, namely how many loci show no mismatches, 1 mismatch, 2 mismatches and so on. We would focus on the relationships between the different autochthonous haplogroups and subhaplogroups in different parts of India. We also want to study the subcontinent-specific phylogeography and migration patterns of the autochthonous subhaplogroups.

During the course of the project we developed a Python program for identifying the mutation sites in the dataset. It takes as input a file containing sequences in PHYLIP or FASTA format, loads the data into a matrix and searches for dissimilar bases at each index. The indexes were numbered 16024-16569 for the HVR1 region. It aligns the mutation sites to the revised Cambridge Reference Sequence (rCRS) and returns the number and indexes of the segregation sites as its output. We plan to come up with a GUI-based version of this program, which we plan to call MutaSeg.

REFERENCES
[1] Griffiths, R. C., and Simon Tavaré. Ancestral inference in population genetics. Statistical Science (1994): 307-319.
[2] Palanichamy, et al. Phylogeny of Mitochondrial DNA Macrohaplogroup N in India, Based on Complete Sequencing: Implications for the Peopling of South Asia. The American Journal of Human Genetics, 75 (2004), 966-978.
[3] Paradis E., Claude J. and Strimmer K. 2004. APE: analyses of phylogenetics and evolution in R language. Bioinformatics 20: 289-290.
[4] Parsons, Thomas J., et al. A high observed substitution rate in the human mitochondrial DNA control region. Nature genetics 15.4 (1997): 363-368.
[5] Saitou, Naruya, and Masatoshi Nei. The neighbor-joining method: a new method for reconstructing phylogenetic trees. Molecular biology and evolution 4.4 (1987): 406-425.
[6] Karmin, Monika. Human mitochondrial DNA haplogroup R in India: dissecting the phylogenetic tree of South Asian-specific lineages. Diss. 2005.
[7] Rosenberg, Noah A., and Magnus Nordborg. Genealogical trees, coalescent theory and the analysis of genetic polymorphisms. Nature Reviews Genetics 3.5 (2002): 380-390.
[8] van Oven M, Kayser M. 2009. Updated comprehensive phylogenetic tree of global human mitochondrial DNA variation. Hum Mutat 30(2):E386-E394.
[9] Metspalu, Mait, et al. Most of the extant mtDNA boundaries in south and southwest Asia were likely shaped during the initial settlement of Eurasia by anatomically modern humans. BMC genetics 5.1 (2004): 26.
[10] Narushima, H. and Hanazawa, M. A more efficient algorithm for MPR problems in phylogeny. Discrete Applied Mathematics, 80 (1997), 231-238.
[11] Richards, Martin B., et al. Phylogeography of mitochondrial DNA in western Europe. Annals of human genetics 62.3 (1998): 241-260.
[12] Maji, Suvendu, S. Krithika, and T. S. Vasulu. Phylogeographic distribution of mitochondrial DNA macrohaplogroup M in India. Journal of genetics 88.1 (2009): 127-139.
[13] Ingman, M. and Gyllensten, U. Human Mitochondrial Genome Database, a resource for population genetics and medical sciences. Nucleic Acids Res 34, D749-D751 (2006).
[14] Walsh, Bruce. Estimating the time to the most recent common ancestor for the Y chromosome or mitochondrial DNA for a pair of individuals. Genetics 158.2 (2001): 897-912.
[15] Kimura, Motoo. A simple method for estimating evolutionary rates of base substitutions through comparative studies of nucleotide sequences. Journal of molecular evolution 16.2 (1980): 111-120.
