
A Comparison of Identity by Descent and Identity by State Computational Methods in Determining and Visualizing Candidate Regions for Autosomal Recessive and Autosomal Dominant Genetic Traits through SNP Microarray Analysis

Alexander Bisignano1,*

1 Chromosoft LLC, P.O. Box 281, Colts Neck, NJ 07722, USA
* info@chromosoft.org, Phone: 1-732-858-1622, Fax: 1-732-758-0214

Keywords: Genetics, Linkage, Chromosome Mapping
Word Count:

ABSTRACT

INTRODUCTION

Single nucleotide polymorphism (SNP) genotype data present an opportunity to identify chromosomal segments shared between individuals with recent common ancestry. The identification of shared regions has applications in determining degree of relationship, mapping meiotic recombinations, mapping uncharacterized high-penetrance single-gene disorders, and assessing pre-conception, pre-implantation, or prenatal disease risk. Two distinct approaches to determining shared genomic regions from high-density SNP data are identity by descent (IBD) and identity by state (IBS) analyses.

IBD analysis requires knowledge of pedigree structure to determine the haplotypes inherited within nuclear families. For example, given maternal and paternal genotypes AC and CC, it is possible to determine which of the two maternal chromosomes was transmitted to the child at that locus. If the child's genotype is CC, then the maternal chromosome carrying C at this position was transmitted. If the child's genotype is AC, then the maternal chromosome carrying A at this position was transmitted. In this way, a locus where one parent's genotype is heterozygous and the other parent's genotype is homozygous provides phase-informative data for the haplotype the child received from the heterozygous parent. Haplotype information for multiple siblings can then be used to assess allele sharing.

While IBD analysis requires data from the nuclear family and knowledge of the pedigree structure, IBS analysis can determine allele sharing, given high enough SNP density, without any prior knowledge of a relationship between individuals. As previously described [Roberson and Pevsner], at each locus the sharing status of two individuals is assigned one of three IBS values: 0, 1, or 2. A value of zero applies when the genotypes are completely different (e.g. AA and BB). A value of one is given when only one allele is shared between the two (e.g. AA and AB), and a value of two is given when both alleles are shared (e.g. AA and AA, AB and AB, or BB and BB). A certain proportion of IBS-2, IBS-1, and IBS-0 calls is expected between any two individuals. However, a consistent lack of IBS-0 calls over an extended range indicates that the two individuals share a haplotype in that region. Continuing this logic, a consistent lack of both IBS-0 and IBS-1 calls indicates that the individuals share two haplotypes in the region. Thus, with high-density data, IBS can also be used to determine allele sharing.

IBD and IBS analysis methods provide different advantages when analyzing consanguineous or nuclear family samples. IBS provides a clear advantage when no pedigree information is available. The two methods can be used separately or in conjunction to assess family or group data and determine which alleles are shared. Here, two specific analysis methods (one IBS and one IBD) and their corresponding programs (HaplOlap and NucleOlap) are presented, and an analysis is undertaken to demonstrate which technique is most appropriate and accurate for different kinds of genetic data.
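The per-locus IBS values described in the Introduction can be illustrated with a minimal Python sketch. This is not the HaplOlap source (which is written in Java); the function name `ibs` and the two-character string encoding of genotypes are assumptions made only for illustration.

```python
from collections import Counter


def ibs(genotype_a: str, genotype_b: str) -> int:
    """Return the number of alleles (0, 1, or 2) shared by two unphased genotypes.

    Genotypes are assumed to be two-character strings such as "AA" or "AB";
    allele order within a genotype is ignored.
    """
    # Multiset intersection keeps, for each allele, the smaller of the two counts.
    shared = Counter(genotype_a) & Counter(genotype_b)
    return sum(shared.values())


# The examples given in the text.
print(ibs("AA", "BB"))  # completely different genotypes
print(ibs("AA", "AB"))  # one allele shared
print(ibs("AB", "AB"))  # both alleles shared
```

Using a multiset intersection rather than string equality means that "AB" and "BA" are treated as the same unphased genotype.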

METHODS

SNP microarray genotype data were obtained through analysis of either blood (Bonei Olam) or buccal (23andMe) samples. Genotyping took place at offsite labs contracted by the respective institutions. The data were formatted and curated using Microsoft Excel. Processing was performed on a Windows Vista 64-bit machine with 6 GB of RAM and an Intel Core i7 920 CPU (2.67 GHz). The NucleOlap and HaplOlap programs were written in Java and compiled on the most recent release of the Java Development Platform.

IBD Analysis Method

NucleOlap was written to implement IBD methods for identifying candidate regions responsible for autosomal recessive or autosomal dominant genetic disorders. It requires data from both parents and at least two offspring, at least one of whom must be affected. Up to ten offspring can be analyzed at once with the graphical user interface. The algorithm employed by NucleOlap for autosomal recessive disorders proceeds as follows: 1) identify informative SNPs from parental data; 2) use the informative SNPs to phase progeny data; 3) compare progeny haplotypes to determine whether alleles are shared and to eliminate miscalls; 4) eliminate or retain candidate regions depending on affected/unaffected status and shared/unshared status between siblings.

1. Identification of Informative SNPs from Parental Data

Phase-informative SNPs were identified as loci where one parent is heterozygous and the other parent is homozygous. Inference of offspring alleles from parental data has been discussed at length previously [Qian and Beckman].

Fig 1. Identification of informative SNPs: SNPs where one parent is homozygous and the other parent is heterozygous provide phase information for the offspring. In the example above, informative SNPs are circled.

2. Reconstruction of Progeny Haplotypes from Parental Informative SNPs

Based on the informative SNPs identified from the parents, each offspring's genotypes can be reduced to parentally inherited haplotypes.
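Steps 1 and 2 can be sketched in Python as follows. This is an illustrative reading of the rule stated in the text (one parent heterozygous, the other homozygous), not NucleOlap's actual Java code; the function names and the string encoding of genotypes are hypothetical.

```python
from typing import Optional


def is_informative(parent_a: str, parent_b: str) -> bool:
    """True when exactly one parent is heterozygous and the other is homozygous."""
    het_a = parent_a[0] != parent_a[1]
    het_b = parent_b[0] != parent_b[1]
    return het_a != het_b


def transmitted_allele(het_parent: str, hom_parent: str, child: str) -> Optional[str]:
    """Allele passed on by the heterozygous parent at an informative SNP.

    Returns None when the child's genotype is inconsistent with the parents,
    which in practice would be treated as a miscall.
    """
    from_hom = hom_parent[0]          # the homozygous parent can only give this allele
    alleles = list(child)
    if from_hom not in alleles:
        return None
    alleles.remove(from_hom)          # what is left came from the heterozygous parent
    remaining = alleles[0]
    return remaining if remaining in het_parent else None


# The example from the Introduction: mother AC, father CC.
print(transmitted_allele("AC", "CC", "CC"))  # maternal C transmitted
print(transmitted_allele("AC", "CC", "AC"))  # maternal A transmitted
```

Returning `None` for inconsistent genotypes keeps miscalls out of the reconstructed haplotype instead of forcing a guess.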

Fig 2. Based on the genotype of the child at each informative locus, transmitted haplotypes can be constructed. Note: these are post-meiotic recombination haplotypes.

3. Comparison of Progeny Haplotypes

Progeny haplotypes are compared to determine whether the same or different haplotypes were inherited for each region of the maternal and paternal chromosomes.

4. Elimination of Regions Depending on Status of Affected/Unaffected Progeny Haplotypes

The overlap status from the haplotype comparisons is then assessed to determine which regions are candidate regions for the phenotype of interest. Autosomal recessive analysis proceeds as follows: 1) all affected children must share both haplotypes in the same region, so any region where the affected children do not have the same haplotypes is discarded; 2) unaffected children cannot share both haplotypes with the affected children in the same region, so regions where any unaffected child shares both haplotypes with the affected children are removed from the list of candidate regions.
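The two recessive elimination rules above can be expressed as a short filter. The sketch below assumes each child has already been reduced to one (maternal, paternal) haplotype label per region; the labels ("M1", "P2", and so on), the region indexing, and the function name are hypothetical and do not come from NucleOlap.

```python
from typing import List, Sequence, Tuple

Diplotype = Tuple[str, str]  # (maternal haplotype label, paternal haplotype label)


def recessive_candidates(
    affected: Sequence[Sequence[Diplotype]],
    unaffected: Sequence[Sequence[Diplotype]],
) -> List[int]:
    """Return indices of regions that survive both recessive rules.

    Rule 1: every affected child carries the same diplotype in the region.
    Rule 2: no unaffected child carries that same diplotype.
    """
    kept = []
    for region in range(len(affected[0])):
        diplotypes = {child[region] for child in affected}
        if len(diplotypes) != 1:
            continue  # affected children differ: discard
        shared = next(iter(diplotypes))
        if any(child[region] == shared for child in unaffected):
            continue  # an unaffected child matches: discard
        kept.append(region)
    return kept


# Two affected children and one unaffected child over three regions.
affected = [
    [("M1", "P1"), ("M1", "P2"), ("M2", "P2")],
    [("M1", "P1"), ("M2", "P2"), ("M2", "P2")],
]
unaffected = [[("M1", "P1"), ("M1", "P1"), ("M1", "P2")]]
print(recessive_candidates(affected, unaffected))
```

In this toy input, region 0 is dropped by rule 2, region 1 by rule 1, and only region 2 remains a candidate.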

Fig 3. Affected children are compared to determine the regions where they all share both maternal and paternal haplotypes. Regions where they do not all share the same diplotype are discarded as candidate regions for causing the phenotype of interest. Discarded regions are shown in gray.

Fig 4. Unaffected children are compared to the candidate regions from the analysis of affected progeny (Fig 3), and any candidate regions where unaffected children share the same diplotype as the affected children are discarded. Examples of such regions are circled above.

For autosomal dominant disorders, the algorithm is simpler. Transmitted haplotypes are assessed only for the affected parent. The children's haplotypes are then compared to one another to determine where they inherited the same haplotype. Regions are considered candidate regions only if all affected children have inherited the same haplotype and no unaffected child shares this haplotype.

IBS Analysis Method

The IBS analysis proceeds as previously described [Roberson and Pevsner]. However, more than two individuals can be analyzed at once by taking the total identity by state (TIBS) at each locus across all individuals. This requires a pairwise comparison of every combination of individuals, so the number of comparisons performed by HaplOlap grows quadratically with group size: n individuals require n(n-1)/2 pairwise comparisons.

Calculation of TIBS

To determine TIBS, the lowest IBS call between any two members of the group is taken. For example, in a group of 9 individuals, if every pair shares IBS-1 except for one pair that shares IBS-0, then the TIBS for that locus is IBS-0. This is illustrated in the table below.

Table 1. Sample Determination of TIBS

                Locus 1   Locus 2   Locus 3   Locus 4
Individual 1    AA        AA        AA        AB
Individual 2    AA        AA        AB        AB
Individual 3    BB        AA        AA        AB
TIBS            0         2         1         2
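The TIBS rule (take the lowest pairwise IBS call in the group) can be checked against Table 1 with a few lines of Python. This is a sketch under the same assumptions as before: genotypes are two-character strings, and the function names are illustrative rather than taken from HaplOlap.

```python
from collections import Counter
from itertools import combinations
from typing import Sequence


def ibs(genotype_a: str, genotype_b: str) -> int:
    """Number of alleles (0, 1, or 2) shared by two unphased genotypes."""
    return sum((Counter(genotype_a) & Counter(genotype_b)).values())


def tibs(genotypes: Sequence[str]) -> int:
    """Total IBS at one locus: the minimum IBS over every pair in the group."""
    return min(ibs(a, b) for a, b in combinations(genotypes, 2))


# Genotypes from Table 1, one list per individual, one entry per locus.
table1 = {
    "Individual 1": ["AA", "AA", "AA", "AB"],
    "Individual 2": ["AA", "AA", "AB", "AB"],
    "Individual 3": ["BB", "AA", "AA", "AB"],
}

# zip(*...) regroups the data locus by locus.
calls = [tibs(locus) for locus in zip(*table1.values())]
print(calls)

# Number of pairwise comparisons per locus for a group of 9 individuals.
pairs_for_nine = len(list(combinations(range(9), 2)))
print(pairs_for_nine)
```

The last two lines show the cost of the pairwise step: a group of n individuals needs n(n-1)/2 comparisons per locus.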

Once the TIBS calls are determined, IBS analysis proceeds as previously described [Roberson and Pevsner]. Regions containing TIBS-0, TIBS-1, and TIBS-2 calls have no haplotype common to all individuals. Regions lacking TIBS-0 calls have one haplotype common to all individuals, and regions lacking both TIBS-0 and TIBS-1 calls have two haplotypes shared by all individuals.

A major difference between the IBS and IBD methods is that IBD can use unaffected individuals to further narrow the candidate genomic regions within a nuclear family. In its current implementation, IBS cannot do this. For this reason, IBD is able to narrow the possible loci responsible for a genetic trait more than IBS can for nuclear families.

HaplOlap and NucleOlap: GUI Implementation

HaplOlap and NucleOlap are both written in the Java programming language and have graphical user interfaces (GUIs) created using NetBeans. They are platform independent and require only the latest version of the Java Runtime Environment to run. The programs require a separate text file for each individual with four columns summarizing that individual's genotypes (rsid, chromosome, position, genotype), although the input format can be customized. NucleOlap can handle up to 12 input files: 2 parents and 10 children. HaplOlap can handle up to 10 input files: any 10 individuals being studied. Both are available through Chromosoft (http://www.chromosoft.org).
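For readers preparing input, a four-column genotype file of the kind described above could be read as in the following Python sketch. The text does not state the delimiter or whether comment lines are allowed; tab separation and "#"-prefixed comments are assumptions here, and the rsid values in the sample are made up for illustration.

```python
import io
from typing import Iterator, NamedTuple, TextIO


class SnpCall(NamedTuple):
    rsid: str
    chromosome: str
    position: int
    genotype: str


def read_genotypes(handle: TextIO) -> Iterator[SnpCall]:
    """Yield one SnpCall per data line of a four-column genotype file.

    Assumes tab-separated columns in the order rsid, chromosome, position,
    genotype, and skips blank lines and lines starting with '#'.
    """
    for line in handle:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        rsid, chromosome, position, genotype = line.split("\t")
        yield SnpCall(rsid, chromosome, int(position), genotype)


# Hypothetical sample content; a real file would be opened with open(path).
sample = io.StringIO(
    "# rsid\tchromosome\tposition\tgenotype\n"
    "rs0000001\t1\t1000\tAA\n"
    "rs0000002\t20\t2500\tAC\n"
)
records = list(read_genotypes(sample))
print(records)
```

Keeping the chromosome as a string avoids special handling for non-numeric chromosome labels.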

RESULTS

To compare the NucleOlap (IBD) and HaplOlap (IBS) analysis methods, four analyses were performed. The first verified that each program identified the expected amount of diplotype sharing between siblings. The second compared each program's ability to identify candidate regions responsible for a recessive disease inherited in a nuclear family. The third compared each program's ability to identify candidate regions responsible for a dominant disease inherited in a nuclear family. The final analysis compared each program's ability to identify candidate regions for a dominant mutation inherited by three first cousins.

Shared Diplotypes between Siblings Agree with Expected Proportion of Sharing

In classical models of genetic inheritance it is expected that, on average, two siblings share both parental haplotypes in 25% of their genome, one parental haplotype in 50% of their genome, and no parental haplotypes in 25% of their genome. While individual deviations from this model are to be expected, over multiple samples the observed sharing should approach this ratio.

Analysis with NucleOlap (IBD)

Data obtained from 23andMe's genetic testing service for a family of four siblings were used to test adherence to the 25:50:25 haplotype sharing ratio. Using both NucleOlap and HaplOlap, we quantified the percentage of the genome for which each pair of siblings shares both parental haplotypes. Table 2 summarizes the NucleOlap results.

Table 2. NucleOlap - Diplotype Sharing Between Siblings
Children Compared      Length of Shared Diplotype (BP)   % of Autosomes
Child 1 v. Child 2     745,451,161                       26.03
Child 1 v. Child 3     822,081,169                       28.70
Child 1 v. Child 4     850,754,064                       29.70
Child 2 v. Child 3     693,380,355                       24.21
Child 2 v. Child 4     992,476,348                       34.65
Child 3 v. Child 4     710,340,415                       24.80
Mean                   802,413,919                       28.01
SD                     111,765,738                       3.90

The four children were compared to find the length (in base pairs) of their autosomes over which they shared both parental haplotypes (diplotypes). The total length of all autosomes is 2,864,255,922 base pairs (build 36.1). On average, the siblings shared 802,413,919 base pairs, or 28.01%, with a standard deviation of 111,765,738 base pairs, or 3.90%. Given the low sample number, this average percentage of shared diplotypes between siblings is not significantly higher than the expected 25%. Thus, NucleOlap's identification of shared diplotypes between two siblings agrees with expected values.

Analysis with HaplOlap (IBS)

The same data were analyzed using the IBS program, HaplOlap. The degree of allele sharing was assessed 500 SNPs at a time for every pairwise combination of the siblings.

Table 3. HaplOlap - Diplotype Sharing Between Siblings
Children Compared      Length of Shared Diplotype (BP)   % of Autosomes
Child 1 v. Child 2     701,222,413                       24.48
Child 1 v. Child 3     777,776,972                       27.15
Child 1 v. Child 4     803,119,210                       28.04
Child 2 v. Child 3     654,135,366                       22.84
Child 2 v. Child 4     953,306,549                       33.28
Child 3 v. Child 4     658,278,776                       22.98
Mean                   757,973,214                       26.46
SD                     113,602,284                       3.97

The four children showed approximately the same pattern of diplotype sharing when analyzed with the IBS method. The children shared diplotypes in 26.46% of their genome, with a standard deviation of 3.97%. Although these values do not match the NucleOlap values precisely, the consistency of the pattern and the proximity to 25% indicate general concordance between the two methods.
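The summary rows of the NucleOlap sharing table can be reproduced from the six pairwise lengths. The Python sketch below uses only figures quoted in the text; the variable names are illustrative, and the use of the sample standard deviation (n - 1 denominator) is an inference from the fact that it reproduces the reported value.

```python
from statistics import mean, stdev

AUTOSOME_LENGTH_BP = 2_864_255_922  # total autosomal length quoted in the text (build 36.1)

# Shared diplotype lengths (bp) for each sibling pair, as reported for NucleOlap.
shared_bp = [
    745_451_161,  # Child 1 v. Child 2
    822_081_169,  # Child 1 v. Child 3
    850_754_064,  # Child 1 v. Child 4
    693_380_355,  # Child 2 v. Child 3
    992_476_348,  # Child 2 v. Child 4
    710_340_415,  # Child 3 v. Child 4
]


def percent_of_autosomes(length_bp: float) -> float:
    """Express a length in base pairs as a percentage of the autosomes."""
    return 100 * length_bp / AUTOSOME_LENGTH_BP


mean_bp = mean(shared_bp)
sd_bp = stdev(shared_bp)  # sample standard deviation (n - 1 denominator)

print(round(mean_bp), round(percent_of_autosomes(mean_bp), 2))
print(round(sd_bp), round(percent_of_autosomes(sd_bp), 2))
```

The same calculation applied to the HaplOlap lengths would give the corresponding summary rows for that table.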

Fig 5. Ideograms showing the results from the three experiments. A: analysis of an autosomal recessive mutation in a nuclear family. B: analysis of an autosomal dominant mutation in a nuclear family. C: analysis of an autosomal dominant mutation in an extended family. The left column of ideograms shows the graphical output of NucleOlap (IBD). Green regions indicate candidate regions and black regions indicate regions that have been ruled out. The right column of ideograms shows the graphical output of HaplOlap (IBS). Green regions are regions where affected children all share the same diplotype (these are the candidate regions for A). Red regions are regions where affected children all share one haplotype (these are the candidate regions for B and C). For all ideograms, a blue dot indicates the position of the gene/region where the deleterious mutation has been previously mapped via other methods.

Confirmation of a Previously Identified Recessive Mutation within a Nuclear Family

Anonymous data from a family with three children were used to demonstrate each program's ability to narrow down the candidate regions for a recessive mutation. Two of the three progeny were affected with a genetic disorder that showed linkage to chromosome 20. Each program was tested for its ability to: 1) narrow down the candidate regions within the genome that could be responsible for this trait; 2) retain the linked chromosome 20 region as one of the candidate regions.

Analysis with NucleOlap (IBD)

In the IBD recessive model, each affected child after the first should narrow the candidate genomic regions by approximately 75%, and each unaffected child should narrow them by approximately 25%. Therefore, the expected proportion of the genome that should remain as a candidate for causing the recessive trait after analysis is:

1*0.25*0.75 = 0.1875 or 18.75%

After analyzing these genomes with NucleOlap, only 14.18% of the autosomal genome, or 406,037,312 base pairs, remains, in 28 candidate regions (green regions, figure 5-a-i). Moreover, the linked region was contained within one of the two candidate regions identified on chromosome 20.

Analysis with HaplOlap (IBS)

In the IBS recessive model, each affected child after the first should narrow the candidate genomic regions by approximately 75%. There is no further region elimination for unaffected children.
Therefore, the average expected proportion of the genome that should remain as a candidate for causing the recessive trait in a family with two affected children is:

1*0.25 = 0.25 or 25%

After analyzing the two affected children with HaplOlap, only 18.19% of the autosomal genome, or 520,878,287 base pairs, remains, in 34 candidate regions (green regions, figure 5-a-ii). Again, the linked region on chromosome 20 was contained within one of these candidate regions.

Summary of the Recessive Model in a Nuclear Family

NucleOlap and HaplOlap both succeeded in narrowing down the potential causative candidate loci for this family with two affected children and one unaffected child. Table 4 summarizes their performance. Table 4. NucleOlap v. HaplOlap Analysis of a Nuclear Family with an Autosomal Recessive Mutation
Criteria                                   NucleOlap       HaplOlap
Linked Region Retained?                    Yes             Yes
Autosomal Length Remaining (base pairs)    406,037,312     520,878,287
Percentage of Autosome Remaining           14.18%          18.19%
Number of Regions Remaining (discrete)     28              34
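The expected proportions quoted for the recessive model follow a simple product rule, sketched below in Python. The function name and the `use_unaffected` flag are illustrative; the factors (0.25 per affected child after the first, 0.75 per unaffected child) are the approximations stated in the text.

```python
AUTOSOME_LENGTH_BP = 2_864_255_922  # total autosomal length quoted in the text


def expected_recessive_fraction(
    n_affected: int, n_unaffected: int = 0, use_unaffected: bool = True
) -> float:
    """Expected fraction of the genome left as candidate under the recessive model.

    Each affected child after the first keeps about 25% of the remaining regions.
    Each unaffected child keeps about 75%, but only when the method can use
    unaffected children (IBD); set use_unaffected=False for the IBS model.
    """
    fraction = 0.25 ** (n_affected - 1)
    if use_unaffected:
        fraction *= 0.75 ** n_unaffected
    return fraction


# Family in the text: two affected children and one unaffected child.
ibd_expected = expected_recessive_fraction(2, 1)
ibs_expected = expected_recessive_fraction(2, 1, use_unaffected=False)
print(ibd_expected, ibs_expected)

# Observed fractions, computed from the reported base-pair totals.
ibd_observed = 406_037_312 / AUTOSOME_LENGTH_BP
ibs_observed = 520_878_287 / AUTOSOME_LENGTH_BP
print(round(100 * ibd_observed, 2), round(100 * ibs_observed, 2))
```

These are average expectations; a single family can fall above or below them, as the observed values here do.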

Confirmation of a Previously Identified Dominant Mutation

Deidentified data were used to demonstrate each program's ability to properly identify candidate loci for an autosomal dominant disease. A family of six siblings was used in this test. Three of the children suffered from an autosomal dominant genetic disorder that has previously been linked to a region of chromosome 3. NucleOlap and HaplOlap were tested for their ability to: 1) narrow down the candidate regions within the genome that could be responsible for this trait; 2) identify the linked region within one of the candidate regions.

Analysis with NucleOlap (IBD)

In our dominant model, one parent is known to be affected. With the IBD analysis, siblings are assessed only for this parent's haplotype. For every affected or unaffected child after the first, the candidate genomic regions should be narrowed down by 50%. With six siblings, the expected proportion of the autosomes that should remain as a candidate for causing the dominant trait after analysis is:

1*0.5^5 = 0.03125 or 3.13%

After analyzing the genomes, it was found that 1.39% of the autosomes, or 39,948,360 base pairs, remained, in a total of 4 candidate regions (green regions, figure 5-b-i). The linked region (blue circle) was contained within one of the small regions identified on chromosome 3, showing that NucleOlap succeeded both in narrowing down the regions that could be responsible for this trait and in retaining the proper locus within the candidate regions.

Analysis with HaplOlap (IBS)

While a dominant model assumes that one parent is affected, IBS does not rely on pedigree structure and can analyze siblings without the parents' genotypes. The major drawback, once again, is that the candidate regions are not narrowed down by analyzing unaffected siblings. Therefore, only the three affected children are used in this analysis. For each affected child, the candidate regions are narrowed down by approximately 25%. If the affected parent is analyzed along with the affected children, the candidate regions are narrowed down by another 50%. For our data, the expected proportion of the genome remaining is:

1*0.75*0.75*0.75*0.5 = 0.2109 or 21.09%

After analysis, 30.58% of the genome, or 876,174,348 base pairs, remains as a candidate for causing this dominant mutation. In total, there are 79 candidate regions (red regions, figure 5-b-ii). The linked region is properly identified within one of the regions on chromosome 3 (blue circle).

Summary of the Dominant Model in a Nuclear Family

NucleOlap and HaplOlap both reduced the percentage of the genome that could contain the dominant mutation affecting this nuclear family. Both programs did so while retaining the actual mutation within the candidate loci. The results are summarized in table 5.

Table 5. NucleOlap v. HaplOlap Analysis of a Nuclear Family with an Autosomal Dominant Mutation
Criteria                                   NucleOlap       HaplOlap
Actual Mutation Retained?                  Yes             Yes
Autosomal Length Remaining (base pairs)    39,948,360      876,174,348
Percentage of Autosome Remaining           1.39%           30.58%
Number of Regions Remaining (discrete)     4               79
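The dominant-model expectations can be written the same way. In this Python sketch the function names are hypothetical, and the per-individual factors (0.5 per additional child for IBD; 0.75 per affected child and 0.5 for the affected parent for IBS) are the approximations given in the text.

```python
def expected_dominant_ibd(n_children: int) -> float:
    """IBD model: each child after the first halves the candidate regions."""
    return 0.5 ** (n_children - 1)


def expected_dominant_ibs(n_affected: int, parent_included: bool) -> float:
    """IBS model: each affected child keeps about 75% of the candidate regions;
    including the affected parent keeps about half of what remains."""
    fraction = 0.75 ** n_affected
    if parent_included:
        fraction *= 0.5
    return fraction


# Family in the text: six siblings, three of them affected, affected parent analyzed.
ibd_fraction = expected_dominant_ibd(6)
ibs_fraction = expected_dominant_ibs(3, parent_included=True)
print(ibd_fraction, round(ibs_fraction, 4))
```

The gap between the two expectations comes from the IBD method's use of all six siblings, whereas the IBS method uses only the affected individuals.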

Confirmation of a Previously Identified Dominant Mutation Found Between First Cousins

A deidentified dataset of three first cousins who share a dominant genetic disorder linked to chromosome 11 was used for the final experiment. On average, a pair of first cousins is expected to share 25% of their genome. Therefore, the average total genomic overlap (regions of the genome shared by all three) expected between 3 first cousins is:

1*0.25*0.25 = 0.0625 or 6.25%

Analysis with NucleOlap (IBD)

Analysis with NucleOlap is not feasible for two reasons: 1) there is no parental data for these cousins; 2) the precise pedigree structure for these cousins is not known. Therefore, IBD analysis provides no reduction in the amount of the genome that may cause this dominant trait.

Analysis with HaplOlap (IBS)

Following analysis, 1.44% of the autosomes, or 41,150,414 base pairs in 4 regions (red regions, figure 5-c-ii), remain as causative candidates for this autosomal dominant genetic trait. Moreover, the linked region is properly located within the region identified on chromosome 11 (blue circle).

Summary of the Dominant Model in an Extended Family

In the case of an extended family without known pedigree structure, only HaplOlap (IBS) is able to narrow down the candidate genomic regions for the trait of interest. The results are summarized in table 6.

Table 6. NucleOlap v. HaplOlap Analysis of an Extended Family with an Autosomal Dominant Mutation
Criteria                                   NucleOlap        HaplOlap
Actual Mutation Retained?                  Yes              Yes
Autosomal Length Remaining (base pairs)    2,864,255,922    41,150,414
Percentage of Autosome Remaining           100.00%          1.44%
Number of Regions Remaining (discrete)     22               4

DISCUSSION

Analysis of high-density SNP data presents many challenges, including data management, data formatting, computational intensity, memory allocation, data visualization, and interpretation of results. Yet SNP microarray data offer a quick, efficient method of analysis that can aid many different types of genetic studies. Here, HaplOlap and NucleOlap are presented for the simple and rapid analysis of familial and consanguineous high-density SNP microarray datasets.

Shared Diplotypes between Siblings Agree with Expected Proportion of Sharing

NucleOlap and HaplOlap both found diplotype sharing between siblings (28.01% ± 3.90% and 26.46% ± 3.97%, respectively) to be acceptably close to the expected 25%. The difference between the two programs in the proportion of shared diplotypes reflects their different analysis methods. NucleOlap uses identity by descent to identify meiotic recombination events through the analysis of informative SNPs (figure 1), whereas HaplOlap uses the identity by state method to identify regions based on the proportion of genotypes shared within a selected region. Since the HaplOlap heuristic employed in this analysis examines 500 SNPs at a time and then makes an IBS call based on the proportion of IBS-0, IBS-1, and IBS-2 calls in that window, it is subject to error if the 500 SNPs span an actual meiotic recombination event. Therefore, the NucleOlap method is currently more accurate in estimating the precise location of recombinations. The number of SNPs used in the HaplOlap heuristic can be changed from a user input panel in the GUI.

Confirmation of Previously Identified Mutations

To assess the value and validity of the programs in identifying candidate loci for novel genetic mutations, three different tests were employed, and each program achieved varying levels of success. The identity by descent method (NucleOlap) proved superior when analyzing nuclear family data, likely due to its ability to incorporate unaffected individuals and the fact that its algorithm does not rely on error-prone heuristics. However, when pedigree structure is unknown or parental data are unavailable, the identity by state method is the better option, since IBD analysis cannot be performed. Table 7 summarizes which algorithms are more appropriate for various types of datasets.

[Table 7 to be constructed]

Other Programs for Familial and Consanguineous Studies

Merlin
SNPDuo++
Homozygosity Haplotype

ACKNOWLEDGEMENTS

REFERENCES
