
Available online at www.sciencedirect.com

ScienceDirect

A genomic view of short tandem repeats


Melissa Gymrek1,2

Short tandem repeats (STRs) are some of the fastest mutating loci in the genome. Tools for accurately profiling STRs from high-throughput sequencing data have enabled genome-wide interrogation of more than a million STRs across hundreds of individuals. These catalogs have revealed that STRs are highly multiallelic and may contribute more de novo mutations than any other variant class. Recent studies have leveraged these catalogs to show that STRs play a widespread role in regulating gene expression and other molecular phenotypes. These analyses suggest that STRs are an underappreciated but rich reservoir of variation that likely makes significant contributions to Mendelian diseases, complex traits, and cancer.

Addresses
1 Department of Medicine, University of California San Diego, La Jolla, CA 92093, USA
2 Department of Computer Science and Engineering, University of California San Diego, La Jolla, CA 92093, USA

Corresponding author: Gymrek, Melissa (mgymrek@ucsd.edu)

Current Opinion in Genetics & Development 2017, 44:9-16
This review comes from a themed issue on Molecular and genetic bases of disease
Edited by Nancy Bonini, Edward Lee and Wilma Wasco

http://dx.doi.org/10.1016/j.gde.2017.01.012
0959-437X/© 2017 Elsevier Ltd. All rights reserved.

Introduction
Short tandem repeats (STRs), also known as microsatellites, consist of repeating motifs of 1-6 base pairs (bp) and comprise about 3% of the human genome [1]. Their repetitive nature induces slippage events during DNA replication that result in frequent mutations in the number of repeats. As a result, STRs exhibit mutation rates that are orders of magnitude higher than other types of variation [2], and thus contribute a large fraction of human genetic variation.

A role for STRs in human disease was established over two decades ago, with independent discoveries of trinucleotide expansions resulting in Fragile X Syndrome [3,4] and spinal and bulbar muscular atrophy [5]. Since then, STR expansions have been implicated in dozens of disorders [6]. Further work has shown that these expansions induce a variety of pathogenic effects (Figure 1), including polyglutamine aggregation [6], hypermethylation [7], RNA toxicity [8], and repeat associated non-ATG (RAN) translation [9]. Smaller pathogenic repeats have also been shown to affect RNA splicing (cystic fibrosis [10]) or regulate gene expression (progressive myoclonus epilepsy [11] and Gilbert syndrome [12]). Many of these mechanisms are present across multiple loci, indicating that they likely represent genome-wide phenomena.

The majority of repeat disorders identified so far follow autosomal dominant inheritance patterns that were readily identified using linkage analysis in pedigrees. However, STRs may contribute to a variety of inheritance modes not amenable to traditional linkage techniques. For instance, STRs are predicted to contribute a higher number of de novo mutations per generation than any other type of variation [13], but the role of de novo STRs in spontaneous conditions such as autism and neurodevelopmental disorders has so far not been interrogated. Furthermore, STRs are often highly multi-allelic, and thus may generate complex inheritance patterns not well captured by linkage or analysis of bi-allelic single nucleotide polymorphisms (SNPs).

Despite the clear implication of STRs in disease, they have been notably missing from medical sequencing studies. Next-generation sequencing (NGS) has the potential to profile more than a million STRs, but genotyping STRs from NGS has proven challenging. Thus, STRs are often filtered from sequencing pipelines due to their low quality calls [14,15], and even known pathogenic STR mutations cannot be detected in most cases [16]. The intronic GGGGCC repeat implicated in FTD and ALS [8,17], identified via a combination of linkage and NGS, is the only exception known to the author. Notably, this repeat was not identified through repeat-aware genotyping methods, but rather through anomalies in coverage and a cluster of erroneous SNPs resulting from poor sequence alignment at the expanded repeat.

New bioinformatics tools and advances in sequencing technologies are beginning to overcome these challenges and are providing the first genome-wide portrait of STR variation at a population scale. Here, I review advances over the last several years in STR profiling and how these are leading to an improved understanding of the role of STRs in human traits. Finally, I comment on remaining challenges in analyzing low complexity regions of the genome and prospects of emerging long read technologies to help overcome these hurdles.
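To make the definition of an STR used in the Introduction concrete (tandem copies of a 1-6 bp motif), the following is a minimal sketch of a perfect-repeat scanner. It is not the detection method of any tool discussed in this review; the function name `find_strs`, the `min_copies` threshold, and the example sequence are assumptions chosen for illustration only.

```python
import re

def find_strs(seq, min_copies=3, max_motif=6):
    """Return (start, motif, copies) for perfect tandem repeats.

    Illustrative only: scans for motifs of 1..max_motif bp repeated at
    least min_copies times, and ignores interrupted or impure repeats.
    """
    hits = []
    for k in range(1, max_motif + 1):
        # A k-bp motif followed by at least (min_copies - 1) more copies.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (k, min_copies - 1))
        for m in pattern.finditer(seq):
            motif = m.group(1)
            # Skip motifs that are just a shorter motif repeated (e.g. "AA").
            if any(motif == motif[:j] * (k // j)
                   for j in range(1, k) if k % j == 0):
                continue
            hits.append((m.start(), motif, len(m.group(0)) // k))
    return sorted(hits)

# Hypothetical sequence with a T homopolymer and a CAG trinucleotide repeat.
example = "GG" + "T" * 5 + "GA" + "CAG" * 4 + "TC"
print(find_strs(example))
```

Real STR reference sets are typically built with dedicated repeat finders that tolerate imperfections; the regular expression above only illustrates what "repeating motifs of 1-6 bp" means in practice.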


Figure 1

[Figure: two groups of panels. GENE REGULATION: transcription factor binding; regulatory element spacing; unusual DNA secondary structure; DNA hypermethylation/heterochromatinization. TRANSCRIBED AND TRANSLATED REPEATS: alternative splicing (HnRNP L binding); RNA/protein aggregates, including RNA hairpins and RAN translation of CAG and CUG repeat transcripts (CAG-polyGln, AGC-polySer, GCA-polyAla; CUG-polyLeu, UGC-polyCys, GCU-polyAla) into toxic RNA and protein aggregates.]

Mechanisms by which STRs affect phenotypes.


A schematic view of known and proposed mechanisms by which STRs might regulate gene expression and function. Top: from left to right, STRs
may form transcription factor binding sites [12,36], affect spacing between regulatory elements [38], induce unusual DNA secondary structures
such as Z-DNA [61], or modulate epigenetic properties such as DNA methylation [62] and heterochromatinization [63]. Bottom: from left to right,
STRs may mediate effects at the RNA and protein level by modulating alternative splicing through RNA secondary structure [10], affecting RNA
protein binding sites [64], or forming toxic RNA and protein aggregates [9]. (Purple boxes = genes; black lines = DNA; red boxes = STRs; blue
circles = DNA/RNA binding proteins; gray circles = amino acids; green circles = DNA modifications).
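The RAN translation panel of Figure 1 lists a different homopolymeric peptide for each reading frame of a CAG or CUG repeat transcript. A short sketch can reproduce that frame-to-product mapping. The codon assignments are from the standard genetic code; the function name `repeat_frame_products` and the tract length of 10 copies are assumptions for illustration.

```python
# Subset of the standard genetic code covering the codons that occur
# in CAG and CUG repeat tracts.
CODON_TABLE = {
    "CAG": "Gln", "AGC": "Ser", "GCA": "Ala",
    "CUG": "Leu", "UGC": "Cys", "GCU": "Ala",
}

def repeat_frame_products(rna_motif, copies=10):
    """Translate a pure repeat tract in each of the three reading frames.

    Returns {frame: "polyXxx"}; every codon within a frame of a pure
    trinucleotide repeat is identical, so each frame gives a homopolymer.
    """
    tract = rna_motif * copies
    products = {}
    for frame in range(3):
        codons = [tract[i:i + 3] for i in range(frame, len(tract) - 2, 3)]
        residues = {CODON_TABLE[c] for c in codons}
        assert len(residues) == 1  # pure repeat: one residue per frame
        products[frame] = "poly" + residues.pop()
    return products

print(repeat_frame_products("CAG"))  # sense transcript
print(repeat_frame_products("CUG"))  # antisense transcript
```

This only illustrates why a single expanded repeat can yield several distinct homopolymeric products; it says nothing about which frames are actually translated in vivo.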

Genotyping STRs from high-throughput sequencing data
STRs are challenging to genotype from NGS (Figure 2). First, short reads often do not span entire repeats, effectively reducing the number of informative reads. Second, STR variations present as large insertions or deletions that may be difficult to align to a reference genome, and thus introduce significant mapping bias toward shorter alleles. Finally, PCR amplification during library preparation often introduces stutter noise in the number of repeats at STRs.

A variety of bioinformatic methods have been developed to overcome these challenges, many of which are summarized in Table 1. Some use custom alignment techniques to avoid mapping biases imposed by standard short read aligners. One example, lobSTR [18], rapidly detects reads with a repetitive signature using an entropy metric. It then aligns only non-repetitive flanking regions of the read to the reference genome and employs a model of STR stutter errors to determine the maximum likelihood genotype at each locus. STR-FM [19] uses a similar technique, with an improved detection method based on string matching that shows higher sensitivity to pick up shorter repeats and homopolymer runs.

Other tools, such as RepeatSeq [20], save substantial run time by using pre-existing alignments from indel-tolerant aligners such as BWA [21]. This approach is often more sensitive, but is highly affected by the quality of the upstream alignments and may be strongly biased toward shorter alleles if the aligner cannot identify large insertions or deletions. The updated BWA-MEM [22] algorithm exhibits higher sensitivity to larger indels, eliminating much of this bias. A new generation of STR genotyping tools uses BWA-MEM alignments as input combined with improved error models to obtain greater genotyping accuracy. popSTR [23] uses population information to train locus and individual-specific error profiles. HipSTR [24] uses a repeat-aware Hidden Markov Model to realign reads, trains locus-specific stutter models, and uses flanking SNPs to physically phase STRs. These methods are more computationally expensive, but show more than 97% accuracy on high coverage data against gold standard techniques.
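The idea of combining a stutter error model with maximum likelihood genotyping, mentioned above, can be sketched in a few lines. This is a toy model and not the implementation used by lobSTR, HipSTR, or any other tool: the stutter probabilities in `STUTTER`, the floor value, and the function names are hypothetical, and real tools learn such parameters from data.

```python
import math
from itertools import combinations_with_replacement

# Hypothetical stutter model: probability that a read shows
# (observed - true) repeat copies. Contractions are assumed more
# likely than expansions; values are illustrative, not fitted.
STUTTER = {-2: 0.01, -1: 0.06, 0: 0.90, 1: 0.03}
FLOOR = 1e-6  # probability assigned to any larger stutter step

def read_prob(observed, allele):
    """P(read shows `observed` copies | it came from `allele`)."""
    return STUTTER.get(observed - allele, FLOOR)

def ml_genotype(read_counts):
    """Return the diploid genotype (a, b) maximizing the read likelihood.

    Each read is assumed to come from either allele with probability 0.5.
    """
    candidates = range(min(read_counts) - 1, max(read_counts) + 2)
    best, best_ll = None, -math.inf
    for a, b in combinations_with_replacement(candidates, 2):
        ll = sum(math.log(0.5 * read_prob(r, a) + 0.5 * read_prob(r, b))
                 for r in read_counts)
        if ll > best_ll:
            best, best_ll = (a, b), ll
    return best

# Reads as repeat counts: mostly 5 copies with a few stutter products.
print(ml_genotype([5, 5, 6, 4, 5, 5, 5]))
# Two well-separated alleles plus one stutter read.
print(ml_genotype([5, 5, 5, 8, 8, 8, 7]))
```

The sketch shows why an explicit stutter model matters: without it, the single reads at 4, 6, or 7 copies could be mistaken for additional alleles.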


Figure 2

(a)
True allele           ACTACGATCACACACACAC----TGTGATCGAT

Observed (with PCR)   ACTACGATCACACACACAC----TGTGATCGAT
                      ACTACGATCACACACACACAC--TGTGATCGAT
                      ACTACGATCACACACACACACACTGTGATCGAT
                      ACTACGATCACACACAC------TGTGATCGAT

Observed (PCR free)   ACTACGATCACACACACAC----TGTGATCGAT
                      ACTACGATCACACACACAC----TGTGATCGAT
                      ACTACGATCACACACACAC----TGTGATCGAT
                      ACTACGATCACACACACAC----TGTGATCGAT

(b)
Reference             ACTACGATCACACACTCTCTCTCTGTGATCGAT

Input alignments      ACTACGATCACAC--TCTCTC--TGTGATCGAT
                      ACTACGATCACAC----TCTCTCTGTGATCGAT
                      ACTACGATCACAC--TCTCTC--TGTGATCGAT
                      ACTACGATC--ACACTCTCTC--TGTGATCGAT

Repeat aware          ACTACGATC----ACACTCTCTCTGTGATCGAT
alignment             ACTACGATC----ACACTCTCTCTGTGATCGAT
                      ACTACGATC----ACACTCTCTCTGTGATCGAT
                      ACTACGATC----ACACTCTCTCTGTGATCGAT

(c)
[Panel: a reference locus with 100 copies of CAG, shown with short reads (e.g. Illumina) that do not span the repeat and long reads (e.g. PacBio) that do.]

Challenges and solutions for genotyping STRs from sequencing data.

(a) PCR amplification introduces stutter errors in the number of repeats. The first box shows example reads observed in traditional sequencing protocols, with frequent PCR errors. The bottom shows example reads from PCR-free data, showing greatly reduced stutter noise. (b) Local sequence alignment is challenging at complex repeats. The first box shows example reads aligned with a repeat agnostic aligner. Multiple best alignments are possible, and thus inconsistent alignments are returned. The bottom box shows example reads realigned using a repeat-aware technique such as that used in HipSTR [24], which aligns repeat-containing reads simultaneously and consistently. (c) Long repeats cannot be spanned by short reads. The first box shows example Illumina reads mapped to an STR locus with 100 copies of a CAG motif. The bottom box shows example long reads (e.g., PacBio), which can span thousands of base pairs with a single read. In all figures, red represents the STR and black represents non-repetitive flanking regions. Plots do not represent real data.

Surveying STR variation at scale
Until recently, most studies of STR variation focused on only several thousand of the more than 1 million genomic STRs. The advent of NGS and novel STR genotyping tools described above have allowed the first genome-wide characterization of STR variation. These studies have grown steadily in size, with early studies surveying just two [25] or several [26,27] genomes, and the most recent panels investigating hundreds to thousands of whole genome sequencing datasets. An initial population-scale panel of STR variation profiled exonic repeats in more than 500 exome sequencing datasets [28]. This study found that more than 94% of repeat variants were missed by standard variant calling pipelines. Subsequently, a genome-wide study profiled more than 700 000 loci in over 1000 individuals from the 1000 Genomes Project sequenced to low coverage [29]. The combination of higher coverage and PCR-free NGS datasets with continued advances in short read alignment and STR genotyping tools has led to significant improvements over these initial catalogs. A recent study of 300 individuals sequenced by the Simons Genome Diversity Project [30] characterized more than a million STR loci with 92% accuracy compared to several hundred STRs genotyped by capillary electrophoresis. These catalogs have revealed that STRs often exhibit more than a dozen common alleles per locus and that each individual harbors tens of thousands of STR variants compared to the reference genome.

Genome-wide STR catalogs will provide an important resource for future medical genetics studies. First, they provide a framework in which to interpret the relevance of specific STR alleles observed in disease contexts. Population-specific per-locus allele frequencies serve as a valuable reference point for prioritizing and filtering candidate variants, as has been successfully demonstrated in the case of SNPs or short indels [31]. Furthermore, data for hundreds of thousands of loci allows estimation of the effects of sequence features on STR variability [29,30] and accurate estimation of per-locus mutation rates as has recently been demonstrated on the Y chromosome [13]. Thus, it is now possible to determine the significance of a de novo mutation based on predictions of the underlying mutation rate. Second, large STR catalogs are allowing the first genome-wide association studies of the contributions of STRs to complex traits, as described below. Overall, these new resources will pave the way toward including STRs in standard NGS pipelines in medical genetics applications.

An emerging role for STRs in complex traits
Genome-wide association studies (GWAS) based on SNP genotyping have made tremendous progress in identifying genomic loci associated with disease. However, in almost all cases, genome-wide significant hits fail to explain a majority of trait heritability [32]. One potential source of missing heritability lies in variants such as STRs that are not accessible from traditional SNP arrays, as has been hypothesized by a growing number of groups [33,34]. Concrete examples of this phenomenon are now being discovered. For instance, systematic dissection of the strongest schizophrenia association signal revealed a recurrent copy number variation (CNV) not in strong LD with any single SNP to be the causal variant [35].


Table 1

Tools for genome-wide profiling of short tandem repeats

Method | Description | Language/platform | STR ref | Multi-sample | Models PCR errors | Uses existing alignment

STR-specific tools
lobSTR [18] | Genotypes 1-6 bp STRs using flank-mapping and maximum likelihood-based genotyping. | C++ | Y | Y | Y | Y (a)
RepeatSeq [20] | Genotypes 1-6 bp STRs using existing alignments followed by Bayesian model selection. | C++ | Y | N | Y | Y
STR-FM [19] | Genotypes 1-6 bp STRs using flank-mapping and maximum likelihood-based genotyping. | Python/Galaxy | Y | N | Y | N
TSSV [65] | Targeted profiling of complex allelic variants in pure and mixed genomes. | Python | Y (b) | N | N | N
popSTR [23] | Genotypes 2-6 bp STRs using existing alignments and trains individual-specific error models. | C++ | Y | Y | Y | Y
HipSTR [24] | Genotypes 1-6 bp STRs using repeat aware realignment and learns locus-specific error models. Phases STRs onto SNP haplotypes. | C++ | Y | Y | Y | Y
TRhist [60] | Detects expanded 2-6 bp STRs using hybrid short and long read approach. | Java | N | N | N | N
VNTRseek [56] | Detects minisatellites (≥7 bp). | C | Y | N | Y | N

STR profiling in cancer
MSIseq [66] | MSI decision tree classifier using existing genotype calls for tumor samples. | R | Y | NA | N | Y (c)
MSIsensor [67] | Identifies changes in length distributions of STRs in paired tumor/normal samples. | C++ | Y | NA | N | Y
mSINGS [68] | Identifies changes in fraction of unstable STRs for tumor vs. a control population. | Python | Y | NA | N | Y

(a) lobSTR has a custom alignment module, but can also genotype using alignments created from other tools.
(b) TSSV reference consists of flanking sequences and a regular expression describing the target locus.
(c) Uses existing genotype calls, not sequence alignments.

While most well-studied STRs lie in or near protein coding regions, it is becoming increasingly clear that STRs located in non-coding regions play an important regulatory role. Dozens of single gene studies have identified STRs that control expression of nearby genes [11,36-39] via a variety of mechanisms, summarized in Figure 1. Furthermore, genome-wide analyses revealed that STRs are enriched in human promoter and enhancer regions [40,41] and are a hallmark of enhancers in Drosophila [41]. Population-scale STR catalogs combined with large genomics datasets have allowed the first systematic identification of STRs linearly associated with gene expression and DNA methylation on a genome-wide scale [34,42]. STRs were shown to contribute a median of 10-15% of cis heritability of gene expression mediated by all common variants in lymphoblastoid cell lines [34]. Taken together, these studies point to an important regulatory role of STRs, strongly supporting the hypothesis that they will contribute to complex traits in humans.

Despite growing evidence that STRs contribute to the genetic architecture of complex traits, several barriers have prevented the integration of STRs into standard GWAS pipelines. Because of their high mutation rates and frequent recurrent mutations, STRs have diminished LD with nearby SNPs [29], limiting the power of SNP-based GWAS to detect underlying STR associations (Figure 3) and so far precluding accurate phasing and imputation into SNP datasets. Furthermore, high error rates reduce power to detect associations, even when full sequence data is available [34]. An added challenge is determining the correct statistic to test for association [33]. Traditional GWAS techniques assume bi-allelic variants, and test for statistical associations between the number of reference alleles (0, 1, or 2) and the trait of interest. STRs often exhibit complex allelic spectra with many common alleles, and will require different regression approaches. Future development of STR-aware imputation and association techniques is likely to increase our understanding of the contribution of STRs to complex traits.

Figure 3

[Figure: (a) power (0.0 to 1.0) versus effect size (0.0 to 0.1) for the causal STR, (AC)n, and the strongest linked SNP, A/G; (b) phenotype versus number of AC repeats (1 to 15); (c) phenotype versus SNP genotype (AA, AG, GG).]

SNP-based GWAS is underpowered for detecting underlying STR associations.

(a) Power of a causal STR (red) vs. a linked SNP (black) to detect an underlying STR association. An STR with a mutation rate of 10^-4 mutations per generation was simulated on a SNP haplotype background. A quantitative phenotype was simulated using an additive model in which the phenotype depends on the sum of the two STR alleles at a locus. The plot shows the power of the STR and strongest linked SNP (r^2 = 0.37) to detect an association, showing that a single SNP is often underpowered to detect STR associations. (b) and (c) show example relationships between the causal STR (b) and the strongest linked SNP (c) and phenotype. Gray dashed lines denote the average value of the phenotype. See Supplementary Note for simulation details.

A genomic portrait of STR variability in cancer
Beyond inherited disorders, STRs have been shown to play a critical role in certain cancers. Microsatellite instability (MSI) is a tumor phenotype seen in colorectal [43] and other cancers [44-46] in which defects in the mismatch repair system lead to hypermutability at STR loci. MSI has significant prognostic value [47], and may in some cases be an actionable target [48]. It is traditionally diagnosed based on a handful of STR markers (e.g., the Promega Corp. or Bethesda Panel [49]). However, these panels offer only a limited view of genomic features of MSI and thus suffer from limited accuracy and resolution.

NGS now enables profiling MSI on a genome-wide scale. Besides the STR-specific tools described above, a variety of tools have been designed specifically to detect MSI by comparing NGS from normal and tumor samples (Table 1). The availability of large cancer NGS datasets such as TCGA (http://cancergenome.nih.gov/) has provided the first genome-wide portraits of MSI in a variety of cancer types [50,51]. These studies find that MSI exhibits tumor type-specific mutation patterns that were only revealed from genome-wide analyses, and show that NGS provides highly accurate diagnoses of MSI phenotypes.

Genome-wide STR genotyping has the potential to transform applications in cancer research beyond the study of classical MSI. For instance, STR genotyping from NGS has proven effective for reconstructing cancer cell lineages from single cells [52], which provides an in-depth view of a cancer's evolution over time. In another example, the fusion protein EWS-FLI1 in Ewing Sarcoma has been shown to bind specifically to GGAA repeats [53] to establish an oncogenic regulatory program [54]. A recent study found that a common variant disrupting a GGAA repeat may contribute to Ewing Sarcoma susceptibility [55]. Finally, I envision repeats will soon be added to the repertoire of variation profiled in cancer NGS studies, potentially shedding light on an uncharacterized set of driver mutations.

Conclusions
A recent proliferation of tools now allows fast and accurate profiling of STRs from NGS, with quality approaching that obtained for other variant types. These methods have led to the generation of population-wide and genome-wide catalogs of STR variation that are now enabling the first large-scale studies of STRs in a variety of medical genetics applications, including complex traits, Mendelian diseases, and cancer.

Despite recent progress, important challenges remain. STR genotyping still requires separate pipelines beyond standard SNP and indel calling approaches, and integrating call sets across multiple variant types is challenging. Additionally, long or complex repeats, such as those involved in most characterized trinucleotide expansion disorders, are still called incorrectly in most cases. Furthermore, repeats with longer motif sizes (>6 bp) are almost completely ignored by NGS algorithms [56]. Finally, most statistical genetics techniques have been built assuming bi-allelic variants. Thus, standard statistical genetics approaches must be modified to accommodate variants with complex allelic spectra.

New sequencing technologies have the potential to overcome many of these challenges. Long-read technologies can now sequence reads dozens of kilobases long, thus spanning most known STR expansions [57] as well as repeats with longer motifs. These methods as well as synthetic long read platforms, such as 10X Genomics [58], can also provide long-range phasing information for reconstruction of phased haplotypes of STRs and nearby variants that may be used for downstream imputation. Hybrid approaches combining short and long read data have shown promising results in accurately resolving long and complex regions [59,60] while maintaining the per-base accuracy afforded by short reads. As methods for STR genotyping mature, including STRs and other repeats in standard sequencing pipelines will add a valuable layer of information that will likely provide a new reservoir of yet undiscovered disease variants.

Conflict of interest
The author has no conflict of interest to disclose.

Appendix A. Supplementary data
Supplementary data associated with this article can be found, in the online version, at http://dx.doi.org/10.1016/j.gde.2017.01.012.

Acknowledgements
I apologize to those whose work could not be discussed due to space limitations. I thank Thomas Willems, Joe Gleeson, Jonathan Sebat, and Alon Goren for helpful discussions and comments on the manuscript.

References and recommended reading
Papers of particular interest, published within the period of review, have been highlighted as:

• of special interest
•• of outstanding interest

1. Lander ES, Linton LM, Birren B, Nusbaum C, Zody MC, Baldwin J, Devon K, Dewar K, Doyle M, FitzHugh W et al.: Initial sequencing and analysis of the human genome. Nature 2001, 409:860-921.

2. Sun JX, Helgason A, Masson G, Ebenesersdottir SS, Li H, Mallick S, Gnerre S, Patterson N, Kong A, Reich D et al.: A direct characterization of human mutation based on microsatellites. Nat Genet 2012, 44:1161-1165.

3. Kremer EJ, Pritchard M, Lynch M, Yu S, Holman K, Baker E, Warren ST, Schlessinger D, Sutherland GR, Richards RI: Mapping of DNA instability at the fragile X to a trinucleotide repeat sequence p(CCG)n. Science 1991, 252:1711-1714.

4. Verkerk AJ, Pieretti M, Sutcliffe JS, Fu YH, Kuhl DP, Pizzuti A, Reiner O, Richards S, Victoria MF, Zhang FP: Identification of a gene (FMR-1) containing a CGG repeat coincident with a breakpoint cluster region exhibiting length variation in fragile X syndrome. Cell 1991, 65:905-914.

5. La Spada AR, Wilson EM, Lubahn DB, Harding AE, Fischbeck KH: Androgen receptor gene mutations in X-linked spinal and bulbar muscular atrophy. Nature 1991, 352:77-79.

6. Mirkin SM: Expandable DNA repeats and human disease. Nature 2007, 447:932-940.

7. Sutcliffe JS, Nelson DL, Zhang F, Pieretti M, Caskey CT, Saxe D, Warren ST: DNA methylation represses FMR-1 transcription in fragile X syndrome. Hum Mol Genet 1992, 1:397-400.

8. DeJesus-Hernandez M, Mackenzie IR, Boeve BF, Boxer AL, Baker M, Rutherford NJ, Nicholson AM, Finch NA, Flynn H, Adamson J et al.: Expanded GGGGCC hexanucleotide repeat in noncoding region of C9ORF72 causes chromosome 9p-linked FTD and ALS. Neuron 2011, 72:245-256.

9. Pearson CE: Repeat associated non-ATG translation initiation: one DNA, two transcripts, seven reading frames, potentially nine toxic entities! PLoS Genet 2011, 7:e1002018.

10. Hefferon TW, Groman JD, Yurk CE, Cutting GR: A variable dinucleotide repeat in the CFTR gene contributes to phenotype diversity by forming RNA secondary structures that alter splicing. Proc Natl Acad Sci U S A 2004, 101:3504-3509.

11. Borel C, Migliavacca E, Letourneau A, Gagnebin M, Bena F, Sailani MR, Dermitzakis ET, Sharp AJ, Antonarakis SE: Tandem repeat sequence variation as causative cis-eQTLs for protein-coding gene expression variation: the case of CSTB. Hum Mutat 2012, 33:1302-1309.

12. Hsieh T-Y, Shiu T-Y, Huang S-M, Lin H-H, Lee T-C, Chen P-J, Chu H-C, Chang W-K, Jeng K-S, Lai MMC et al.: Molecular pathogenesis of Gilbert's syndrome: decreased TATA-binding protein binding affinity of UGT1A1 gene promoter. Pharmacogenet Genom 2007, 17:229-236.

13. Willems T, Gymrek M, Poznik GD, Tyler-Smith C, 1000 Genomes Project Chromosome Y Group, Erlich Y: Population-scale sequencing data enable precise estimates of Y-STR mutation rates. Am J Hum Genet 2016, 98:919-933.

14. Kirby A, Gnirke A, Jaffe DB, Barešová V, Pochet N, Blumenstiel B, Ye C, Aird D, Stevens C, Robinson JT et al.: Mutations causing medullary cystic kidney disease type 1 lie in a large VNTR in MUC1 missed by massively parallel sequencing. Nat Genet 2013, 45:299-303.

15. Li H: Toward better understanding of artifacts in variant calling from high-coverage samples. Bioinformatics 2014, 30:2843-2851.

16. Keogh MJ, Chinnery PF: Next generation sequencing for neurological diseases: new hope or new hype? Clin Neurol Neurosurg 2013, 115:948-953.

17. Renton AE, Majounie E, Waite A, Simon-Sanchez J, Rollinson S, Gibbs JR, Schymick JC, Laaksovirta H, van Swieten JC, Myllykangas L et al.: A hexanucleotide repeat expansion in C9ORF72 is the cause of chromosome 9p21-linked ALS-FTD. Neuron 2011, 72:257-268.

18. Gymrek M, Golan D, Rosset S, Erlich Y: lobSTR: a short tandem repeat profiler for personal genomes. Genome Res 2012, 22:1154-1162.

19. Fungtammasan A, Ananda G, Hile SE, Su MS-W, Sun C, Harris R, Medvedev P, Eckert K, Makova KD: Accurate typing of short tandem repeats from genome-wide sequencing data and its applications. Genome Res 2015, 25:736-749.

20. Highnam G, Franck C, Martin A, Stephens C, Puthige A, Mittelman D: Accurate human microsatellite genotypes from high-throughput resequencing data using informed error profiles. Nucleic Acids Res 2013, 41:e32.

21. Li H, Durbin R: Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics 2009, 25:1754-1760.

22. Li H: Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. arXiv [q-bio.GN] 2013. [no volume].

23. Kristmundsdottir S, Sigurpalsdottir BD, Kehr B, Halldorsson BV: popSTR: population-scale detection of STR variants. Bioinformatics 2016 http://dx.doi.org/10.1093/bioinformatics/btw568 https://www.ncbi.nlm.nih.gov/pubmed/27591079.
This study develops a method for genotyping STRs called popSTR that builds locus- and individual-specific error profiles by leveraging population information. It relies on previously generated alignments, reducing total analysis time compared to previous tools that realign raw reads.

24. Willems T, Zielinski D, Gordon A, Gymrek M, Erlich Y: Genome-wide profiling of heritable and de novo STR variations. bioRxiv 2016 http://dx.doi.org/10.1101/077727 http://www.biorxiv.org/content/early/2016/09/27/077727.



This paper introduces a haplotype-based tool called HipSTR for geno- 38. Rockman MV, Wray GA: Abundant raw material for cis-
typing, phasing, and calling de novo STR variants from whole genome regulatory evolution in humans. Mol Biol Evol 2002, 19:1991-
sequencing data. HipSTR infers both the sequence and length of STR 2004.
alleles with 99% accuracy, far greater than previous tools.
39. Shimajiri S, Arima N, Tanimoto A, Murata Y, Hamada T, Wang KY,
25. Payseur BA, Jing P, Haasl RJ: A genomic portrait of human Sasaguri Y: Shortened microsatellite d(CA)21 sequence down-
microsatellite variation. Mol Biol Evol 2011, 28:303-312. regulates promoter activity of matrix metalloproteinase
9 gene. FEBS Lett 1999, 455:70-74.
26. McIver LJ, Fondon JW 3rd, Skinner MA, Garner HR: Evaluation of microsatellite variation in the 1000 Genomes Project pilot studies is indicative of the quality and utility of the raw data and alignments. Genomics 2011, 97:193-199.

27. O'Dushlaine CT, Shields DC: Marked variation in predicted and observed variability of tandem repeat loci across the human genome. BMC Genom 2008, 9:175.

28. McIver LJ, McCormick JF, Martin A, Fondon JW 3rd, Garner HR: Population-scale analysis of human microsatellites reveals novel sources of exonic variation. Gene 2013, 516:328-334.

29. Willems T, Gymrek M, Highnam G, 1000 Genomes Project Consortium, Mittelman D, Erlich Y: The landscape of human STR variation. Genome Res 2014, 24:1894-1904.
This study generated the first population- and genome-wide catalog of STR variation by profiling STRs in the 1000 Genomes Project. The catalog provides the first genome-wide portrait of allele frequency spectra across STRs, linkage disequilibrium between STRs and SNPs, and sequence determinants of STR variability.

30. Mallick S, Li H, Lipson M, Mathieson I, Gymrek M, Racimo F, Zhao M, Chennagiri N, Nordenfelt S, Tandon A et al.: The Simons Genome Diversity Project: 300 genomes from 142 diverse populations. Nature 2016, 538(7624):201-206 http://dx.doi.org/10.1038/nature18964 https://www.ncbi.nlm.nih.gov/pubmed/27654912.
This study reports high coverage PCR-free whole genomes for 300 individuals. Supplemental Chapter 6 builds on Ref. [29] to build a deep genome-wide catalog of STR variation with accurate individual genotypes and including homopolymers. The catalog shows that STRs provide detailed information about population structure.

31. Lek M, Karczewski KJ, Minikel EV, Samocha KE, Banks E, Fennell T, O'Donnell-Luria AH, Ware JS, Hill AJ, Cummings BB et al.: Analysis of protein-coding genetic variation in 60,706 humans. Nature 2016, 536:285-291.

32. Yang J, Benyamin B, McEvoy BP, Gordon S, Henders AK, Nyholt DR, Madden PA, Heath AC, Martin NG, Montgomery GW et al.: Common SNPs explain a large proportion of the heritability for human height. Nat Genet 2010, 42:565-569.

33. Press MO, Carlson KD, Queitsch C: The overdue promise of short tandem repeat variation for heritability. Trends Genet 2014, 30:504-512.
This paper reviews arguments that STRs likely contribute to complex trait heritability and outlines challenges in incorporating STRs into genotype-phenotype maps.

34. Gymrek M, Willems T, Guilmatre A, Zeng H, Markus B, Georgiev S, Daly MJ, Price AL, Pritchard JK, Sharp AJ et al.: Abundant contribution of short tandem repeats to gene expression variation in humans. Nat Genet 2016, 48:22-29.
This study performed the first genome-wide quantification of the contribution of STRs to gene expression variation in humans. It found 2,060 expression associated STRs and showed that these account for 10-15% of heritability mediated by all cis variants. This and Ref. [42] show that STRs are likely to play an important role in regulating gene expression in cis.

35. Sekar A, Bialas AR, de Rivera H, Davis A, Hammond TR, Kamitaki N, Tooley K, Presumey J, Baum M, Van Doren V et al.: Schizophrenia risk from complex variation of complement component 4. Nature 2016, 530:177-183.

36. Contente A, Dittmer A, Koch MC, Roth J, Dobbelstein M: A polymorphic microsatellite that mediates induction of PIG3 by p53. Nat Genet 2002, 30:315-320.

37. Gebhardt F, Zanker KS, Brandt B: Modulation of epidermal growth factor receptor gene transcription by a polymorphic dinucleotide repeat in intron 1. J Biol Chem 1999, 274:13176-13180.

40. Sawaya S, Bagshaw A, Buschiazzo E, Kumar P, Chowdhury S, Black MA, Gemmell N: Microsatellite tandem repeats are abundant in human promoters and are associated with regulatory elements. PLoS One 2013, 8:e54710.

41. Yanez-Cuna JO, Arnold CD, Stampfel G, Boryn LM, Gerlach D, Rath M, Stark A: Dissection of thousands of cell type-specific enhancers identifies dinucleotide repeat motifs as general enhancer features. Genome Res 2014, 24:1147-1156.
This study revealed that dinucleotide repeat motifs are highly enriched in enhancers in Drosophila, particularly in enhancers that are broadly active across cell types. Initial analyses show these motifs are also enriched in human regulatory regions, further suggesting a role of STRs in gene regulation.

42. Quilez J, Guilmatre A, Garg P, Highnam G, Gymrek M, Erlich Y, Joshi RS, Mittelman D, Sharp AJ: Polymorphic tandem repeats within gene promoters act as modifiers of gene expression and DNA methylation in humans. Nucleic Acids Res 2016, 44:3750-3762.
This study detected associations between more than 100 STRs and expression or DNA methylation of nearby genes. This and Ref. [34] show that STRs are likely to play an important role in regulating gene expression in cis.

43. Vilar E, Gruber SB: Microsatellite instability in colorectal cancer-the stable evidence. Nat Rev Clin Oncol 2010, 7:153-162.

44. Meyer LA, Broaddus RR, Lu KH: Endometrial cancer and Lynch syndrome: clinical and pathologic considerations. Cancer Control 2009, 16:14-22.

45. Singer G, Kallinowski T, Hartmann A, Dietmaier W, Wild PJ, Schraml P, Sauter G, Mihatsch MJ, Moch H: Different types of microsatellite instability in ovarian carcinoma. Int J Cancer 2004, 112:643-646.

46. Pritchard CC, Morrissey C, Kumar A, Zhang X, Smith C, Coleman I, Salipante SJ, Milbank J, Yu M, Grady WM et al.: Complex MSH2 and MSH6 mutations in hypermutated microsatellite unstable advanced prostate cancer. Nat Commun 2014, 5:4988.

47. Onda M, Nakamura I, Suzuki S, Takenoshita S, Brogren CH, Stampanoni S, Li D, Rampino N: Microsatellite instability in thyroid cancer: hot spots, clinicopathological implications, and prognostic significance. Clin Cancer Res 2001, 7:3444-3449.

48. Le DT, Uram JN, Wang H, Bartlett BR, Kemberling H, Eyring AD, Skora AD, Luber BS, Azad NS, Laheru D et al.: PD-1 blockade in tumors with mismatch-repair deficiency. N Engl J Med 2015, 372:2509-2520.

49. Murphy KM, Zhang S, Geiger T, Hafez MJ, Bacher J, Berg KD, Eshleman JR: Comparison of the microsatellite instability analysis system and the Bethesda panel for the determination of microsatellite instability in colorectal cancers. J Mol Diagn 2006, 8:305-311.

50. Kim T-M, Laird PW, Park PJ: The landscape of microsatellite instability in colorectal and endometrial cancer genomes. Cell 2013, 155:858-868.

51. Hause RJ, Pritchard CC, Shendure J, Salipante SJ: Classification and characterization of microsatellite instability across 18 cancer types. Nat Med 2016, 22(11):1342-1350 http://dx.doi.org/10.1038/nm.4191 https://www.ncbi.nlm.nih.gov/pubmed/27694933.
This study explores the genome-wide landscape of MSI in cancer. The authors find tumor-type specific MSI signatures, show evidence that MSI is a continuous phenotype correlated with survival, and develop a highly accurate MSI classifier based on whole genome sequencing data.

52. Biezuner T, Spiro A, Raz O, Amir S, Milo L, Adar R, Chapal-Ilani N, Berman V, Fried Y, Ainbinder E et al.: A generic, cost-effective and scalable cell lineage analysis platform. Genome Res 2016, 26(11):1588-1599 http://dx.doi.org/10.1101/gr.202903.115 https://www.ncbi.nlm.nih.gov/pubmed/27558250.

53. Guillon N, Tirode F, Boeva V, Zynovyev A, Barillot E, Delattre O: The oncogenic EWS-FLI1 protein binds in vivo GGAA microsatellite sequences with potential transcriptional activation function. PLoS One 2009, 4:e4932.

54. Riggi N, Knoechel B, Gillespie SM, Rheinbay E, Boulay G, Suva ML, Rossetti NE, Boonseng WE, Oksuz O, Cook EB et al.: EWS-FLI1 utilizes divergent chromatin remodeling mechanisms to directly activate or repress enhancer elements in Ewing sarcoma. Cancer Cell 2014, 26:668-681.

55. Grunewald TGP, Bernard V, Gilardi-Hebenstreit P, Raynal V, Surdez D, Aynaud M-M, Mirabeau O, Cidre-Aranaz F, Tirode F, Zaidi S et al.: Chimeric EWSR1-FLI1 regulates the Ewing sarcoma susceptibility gene EGR2 via a GGAA microsatellite. Nat Genet 2015, 47:1073-1078.

56. Gelfand Y, Hernandez Y, Loving J, Benson G: VNTRseek-a computational tool to detect tandem repeat variants in high-throughput sequencing data. Nucleic Acids Res 2014, 42:8884-8894.

57. Loomis EW, Eid JS, Peluso P, Yin J, Hickey L, Rank D, McCalmon S, Hagerman RJ, Tassone F, Hagerman PJ: Sequencing the unsequenceable: expanded CGG-repeat alleles of the fragile X gene. Genome Res 2013, 23:121-128.

58. Zheng GXY, Lau BT, Schnall-Levin M, Jarosz M, Bell JM, Hindson CM, Kyriazopoulou-Panagiotopoulou S, Masquelier DA, Merrill L, Terry JM et al.: Haplotyping germline and cancer genomes with high-throughput linked-read sequencing. Nat Biotechnol 2016, 34:303-311.

59. Mostovoy Y, Levy-Sakin M, Lam J, Lam ET, Hastie AR, Marks P, Lee J, Chu C, Lin C, Dakula et al.: A hybrid approach for de novo human genome sequence assembly and phasing. Nat Methods 2016, 13:587-590.

60. Doi K, Monjo T, Hoang PH, Yoshimura J, Yurino H, Mitsui J, Ishiura H, Takahashi Y, Ichikawa Y, Goto J et al.: Rapid detection of expanded short tandem repeats in personal genomics using hybrid sequencing. Bioinformatics 2014, 30:815-822.

61. Rothenburg S, Koch-Nolte F, Rich A, Haag F: A polymorphic dinucleotide repeat in the rat nucleolin gene forms Z-DNA and inhibits promoter activity. Proc Natl Acad Sci U S A 2001, 98:8985-8990.

62. Stoger R, Kajimura TM, Brown WT, Laird CD: Epigenetic variation illustrated by DNA methylation patterns of the Fragile-X gene FMR1. Hum Mol Genet 1997, 6:1791-1801.

63. Kumari D, Usdin K: Chromatin remodeling in the noncoding repeat expansion diseases. J Biol Chem 2009, 284:7413-7417.

64. Hui J, Hung L-H, Heiner M, Schreiner S, Neumuller N, Reither G, Haas SA, Bindereif A: Intronic CA-repeat and CA-rich elements: a new class of regulators of mammalian alternative splicing. EMBO J 2005, 24:1988-1998.

65. Anvar SY, van der Gaag KJ, van der Heijden JWF, Veltrop MHAM, Vossen RHAM, de Leeuw RH, Breukel C, Buermans HPJ, Verbeek JS, de Knijff P et al.: TSSV: a tool for characterization of complex allelic variants in pure and mixed genomes. Bioinformatics 2014, 30:1651-1659.

66. Huang MN, McPherson JR, Cutcutache I, Teh BT, Tan P, Rozen SG: MSIseq: software for assessing microsatellite instability from catalogs of somatic mutations. Sci Rep 2015, 5:13321.

67. Niu B, Ye K, Zhang Q, Lu C, Xie M, McLellan MD, Wendl MC, Ding L: MSIsensor: microsatellite instability detection using paired tumor-normal sequence data. Bioinformatics 2014, 30:1015-1016.

68. Salipante SJ, Scroggins SM, Hampel HL, Turner EH, Pritchard CC: Microsatellite instability detection by next generation sequencing. Clin Chem 2014, 60:1192-1199.

Current Opinion in Genetics & Development 2017, 44:9-16 www.sciencedirect.com
