IESP Workshop on Environmental Modelling

Large Scale Earthquake Simulation on Supercomputing Platforms

M. Bader, A. Breuer, A. Heinecke, S. Rettenberger (Technische Universität München)
C. Pelties, A.-A. Gabriel (Ludwig-Maximilians-Universität München)

HPC Meets Geoscience

Alexander Breuer, Alice-Agnes Gabriel, Alexander Heinecke, Christian Pelties, Sebastian Rettenberger

The Simulation Pipeline

phenomenon, process, etc.
  --(modelling)--> mathematical model
  --(numerical treatment)--> numerical algorithm
  --(parallel implementation)--> simulation code
  --(visualization)--> results to interpret
  --(embedding)--> statement / tool

validation: feedback loop from the results back to the mathematical model

Faster, Bigger, More

Why parallel high performance computing:

Response time (speed-up): compute a problem in 1/p of the time (see the definitions after this list)
- speed up engineering processes
- real-time simulations (tsunami warning?)

Problem size (scale-up): compute a p-times bigger problem
- simulation of multi-scale phenomena
- maximal problem size that fits into the machine

Throughput: compute p problems at once
- case and parameter studies, statistical risk scenarios, etc.
- massively distributed computing (e.g. SETI@home)
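For reference (not on the original slide), the speed-up and parallel efficiency quoted in the scaling results later in the talk follow the usual definitions, with T(p) the time to solution on p processes:

    S(p) = \frac{T(1)}{T(p)}, \qquad E(p) = \frac{S(p)}{p} = \frac{T(1)}{p \, T(p)}

For weak scaling, where the problem size grows with p, the efficiency is measured against a small baseline run with p_0 processes, i.e. E_weak(p) = T(p_0) / T(p).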

Overview and Agenda

Trends in Supercomputing:
- current and future supercomputing architectures
- roadmap to exascale performance

Seismic Simulations with SeisSol:
- dynamic rupture and seismic wave propagation
- unstructured tetrahedral meshes
- scenarios: Mount Merapi, Landers Earthquake

Optimisation for Petascale Platforms:
- code generation to optimize element-local matrix kernels
- hybrid MPI/OpenMP parallelisation
- offload scheme to address multiphysics

Performance on Heterogeneous Supercomputers:
- scalability tests on Tianhe-2 and Stampede
- wave propagation vs. multiphysics earthquake simulation
Part I
Past and Present Trends in High Performance Computing

Four Horizons for Enhancing the Performance . . .
. . . of Parallel Simulations Based on Partial Differential Equations
(David Keyes, 2000)

1. Expanded Number of Processors
   in 2000: 1000 cores; in 2010: 200,000 cores
2. More Efficient Use of Faster Processors
   PDE working sets, cache efficiency
3. More Architecture-Friendly Algorithms
   improve temporal/spatial locality
4. Algorithms Delivering More Science per Flop
   adaptivity (in space and time), higher-order methods, fast solvers

D. Keyes: Four horizons for enhancing the performance of parallel simulations based on partial differential equations. In: Euro-Par 2000 Parallel Processing, LNCS 1900, pp. 1-17, 2000.
Computational Science Demands a New Paradigm

Computational simulation must meet three challenges to become a mature partner of theory and experiment
(Post & Votta, 2005)

1. performance challenge
   exponential growth of performance, massively parallel architectures
2. programming challenge
   new (parallel) programming models
3. prediction challenge
   careful verification and validation of codes; towards reproducible simulation experiments

D. E. Post and L. G. Votta: Computational science demands a new paradigm. Physics Today 58(1), pp. 35-41, 2005.

Free Lunch is Over (*)

. . . actually already over for quite some time!

Speedup of software only due to parallelism:
- CPU clock speed has stalled
- instruction-level parallelism per core has stalled
- power consumption has stalled(!)
- number of cores is growing
- size of vector units is growing

(*) Quote and image taken from:
H. Sutter: The Free Lunch Is Over: A Fundamental Turn Toward Concurrency in Software. Dr. Dobb's Journal 30(3), March 2005.

Manycore CPU: Intel MIC Architecture

Intel MIC Architecture:

[Block diagram: many vector IA cores, each with a coherent cache, connected by an interprocessor network to fixed function logic and the memory and I/O interfaces]

- an Intel co-processor architecture
- many cores and many, many more threads
- standard IA programming and memory model

(source: Intel/K. Skaugen, SC'10 keynote presentation)

Manycore CPU: Intel Xeon Phi Coprocessor

- coprocessor: works as an extension card on the PCI bus
- 60 cores, 4 hardware threads per core
- simpler architecture for each core, but wider vector computing unit (8 double-precision floats; see the sketch after this slide)
- next generation (Knights Landing) announced to be available as standalone CPU
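Not from the slide, but as a minimal illustration of what the wider vector unit means in practice: a loop like the following maps to 8 double-precision lanes per 512-bit vector instruction once the compiler vectorises it. The function and array names are made up; compile with OpenMP SIMD support (e.g. -fopenmp-simd).

    #include <cstddef>

    // y[i] += a * x[i]: with 512-bit vector registers, 8 doubles are
    // processed per fused multiply-add instruction.
    void daxpy(std::size_t n, double a, const double* x, double* y) {
        #pragma omp simd
        for (std::size_t i = 0; i < n; ++i) {
            y[i] += a * x[i];
        }
    }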

Manycore GPU: NVIDIA Fermi

- 512 CUDA cores; a CUDA core executes a floating point or integer instruction per clock for a thread
- the cores are organized in 16 streaming multiprocessors (SMs) of 32 cores each
- six 64-bit memory partitions (384-bit memory interface), supporting up to 6 GB of GDDR5 DRAM
- a host interface connects the GPU to the CPU via PCI-Express
- the GigaThread global scheduler distributes thread blocks to the SM thread schedulers

[Die diagram: the 16 SMs are positioned around a common L2 cache; each SM strip contains the scheduler and dispatch units, the execution units, and the register file and L1 cache]

(source: NVIDIA Fermi Whitepaper)
Manycore GPU: NVIDIA Fermi (2)

Third Generation Streaming Multiprocessor
- the third generation SM introduces several architectural innovations that make it not only the most powerful SM yet built, but also the most programmable and efficient

512 High Performance CUDA cores
- each SM features 32 CUDA processors, a fourfold increase over prior SM designs
- each CUDA processor has a fully pipelined integer arithmetic logic unit (ALU) and floating point unit (FPU)
- prior GPUs used IEEE 754-1985 floating point arithmetic; the Fermi architecture implements the new IEEE 754-2008 standard, providing the fused multiply-add (FMA) instruction for both single and double precision arithmetic
- FMA improves over a multiply-add (MAD) instruction by doing the multiplication and addition with a single final rounding step, with no loss of precision in the addition; FMA is more accurate than performing the operations separately; GT200 implemented double precision FMA
- in GT200, the integer ALU was limited to 24-bit precision for multiply operations, so multi-instruction emulation sequences were required for integer arithmetic; in Fermi, the newly designed integer ALU supports full 32-bit precision for all instructions

[Diagram: Fermi Streaming Multiprocessor (SM) with instruction cache, two warp schedulers with dispatch units, a 32,768 x 32-bit register file, 32 cores (each with dispatch port, operand collector, FP unit, INT unit, result queue), 16 load/store units, 4 SFUs, interconnect network, 64 KB shared memory / L1 cache and uniform cache]

(source: NVIDIA Fermi Whitepaper)

The Memory Challenge

- memory per core is decreasing (Xeon Phi: ~100 MB)
- performance of memory is falling behind:
  - CPU speedup: 60 % per year
  - memory bandwidth improvement: 25 % per year
  - memory latency improvement: 5 % per year

Roofline Model: (see the formula after this slide)

[Log-log plot: attainable GFlop/s over operational intensity (Flops/Byte, 1/8 to 64); horizontal ceilings for peak FP performance, without vectorization and without instruction-level parallelism; diagonal ceilings for peak stream bandwidth, without NUMA and without unit stride; example kernels SpMV, 5-pt stencil and 100x100 matrix multiplication marked along the intensity axis]
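The roofline plot encodes a simple upper bound that is not spelled out on the slide. In formula form, with operational intensity I (Flops/Byte), peak memory bandwidth b, and peak floating-point performance P_peak:

    P_\text{attainable}(I) = \min\left( P_\text{peak},\; b \cdot I \right)

Low-intensity kernels such as SpMV or the 5-point stencil are limited by the bandwidth roof, while a 100x100 dense matrix multiplication has enough Flops per byte to reach the compute roof.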


Top 500 (www.top500.org) June 2014

Top 500 Spotlights: Tianhe-2 and Titan

Tianhe-2/MilkyWay-2, Intel Xeon Phi (NUDT)
- 3.1 mio cores(!): Intel Ivy Bridge and Xeon Phi
- Linpack benchmark: 33.8 PFlop/s
- 17 MW power(!!)
- Knights Corner / Intel Xeon Phi / Intel MIC as accelerator: 57 cores, roughly 1.1 GHz

Titan, Cray XK7 with NVIDIA K20X (ORNL)
- 18,688 compute nodes; 300,000 AMD Opteron cores
- 18,688 NVIDIA Tesla K20X GPUs
- Linpack benchmark: 17.6 PFlop/s
- 8.2 MW power

Top 500 Spotlights: Sequoia and K Computer

Sequoia, IBM BlueGene/Q (LLNL)
- 98,304 compute nodes; 1.6 mio cores
- 18-core CPU (PowerPC; 16 compute, 1 OS, 1 redundant)
- Linpack benchmark: 17.1 PFlop/s
- 8 MW power

K Computer, SPARC64 (RIKEN, Kobe)
- 88,128 processors; 705,024 cores
- SPARC64 VIIIfx 2.0 GHz (8-core CPU); successor announced to have 32 cores
- Linpack benchmark: 10.51 PFlop/s
- 12 MW power

Performance Development in Supercomputing

(source: www.top500.org)
An Exascale Roadmap
Aggressively Designed Strawman Architecture

Level       What                          Perf.    Power    RAM
FPU         FPU, regs., instr.-memory     1.5 GF   30 mW    -
Core        4 FPUs, L1                    6 GF     141 mW   -
Proc. Chip  742 cores, L2/L3, intercon.   4.5 TF   214 W    -
Node        proc. chip, DRAM              4.5 TF   230 W    16 GB
Group       12 proc. chips, routers       54 TF    3.5 kW   192 GB
Rack        32 groups                     1.7 PF   116 kW   6.1 TB
System      583 racks                     1 EF     67.7 MW  3.6 PB

approx. 285,000 cores per rack; 166 mio cores in total

Source: ExaScale Computing Study: Technology Challenges in Achieving Exascale Systems
Exascale Roadmap: Should You Bother?

Your department's compute cluster in 5 years? A Petaflop system!
- one rack of the Exaflop system, using the same/similar hardware
- extrapolated example machine:
  - peak performance: 1.7 PFlop/s
  - 6 TB RAM, 60 GB cache memory
  - total concurrency: 1.1 × 10^6
  - number of cores: 280,000
  - number of chips: 384

Source: ExaScale Software Study: Software Challenges in Extreme Scale Systems

Your Department's PetaFlop/s Cluster in 5 Years?

Piz Daint (CSCS Lugano, Switzerland; Top 500 #6)
- 5,272 compute nodes
- per node: 1 Xeon E5-2670, 1 NVIDIA Tesla K20X GPU
- Linpack benchmark: 6.3 PFlop/s
- 2.3 MW power

Stampede (TACC, Austin; Top 500 #7)
- 102,400 cores (incl. Xeon Phi: MIC/many integrated cores)
- Linpack benchmark: 5 PFlop/s
- Intel Xeon Phi as accelerator (61 cores, 1.1 GHz); wider vector FP units: 64 bytes (i.e., 16 floats, 8 doubles)
- 4.5 MW power
International Exascale Software Project Roadmap

Towards an Exa-Flop/s Platform in 2018 (www.exascale.org):

1. technology trends
   concurrency, reliability, power consumption, . . .
   blueprint of an exascale system: 10-billion-way concurrency, 100 million to 1 billion cores, 10-to-100-way concurrency per core, hundreds of cores per die, . . .
2. science trends
   climate, high-energy physics, nuclear physics, fusion energy sciences, materials science and chemistry, . . .
3. X-stack (software stack for exascale)
   energy, resiliency, heterogeneity, I/O and memory
4. politico-economic trends
   exascale systems run by government labs, used by CSE scientists
Part II
SeisSol: An ADER-DG Code on Unstructured Tetrahedral Meshes

Dynamic Rupture and Earthquake Simulation

Tohoku subduction zone: CAD model and tetrahedral mesh (C. Pelties)

Use of Adaptive Tetrahedral Meshes:
- curved subduction zones that meet the surface at shallow angles (high impact on uplift for tsunamigenic earthquakes)
- complicated fault systems with multiple branches
- non-linear multiphysics dynamic rupture simulation
- goal: automated meshing process (incl. CAD generation)


Dynamic Rupture and Earthquake Simulation

Landers fault system: simulated ground motion and tetrahedral mesh

Application Scenario: Mount Merapi

- simulate monitoring of tremors in a stratovolcano
- seismic wave propagation in highly complicated topography
- point source representing a seismic moment tensor
- accurate modeling of highest frequencies (~20 Hz)
Application Scenario: Mount Merapi

[Plots: vertical stress sigma_zz [Pa] over time [s] and the corresponding amplitude spectrum over frequency [Hz], comparing fine and coarse mesh]
1992 Landers M7.2 Earthquake

- multiphysics simulation of dynamic rupture and resulting ground motion of a M7.2 earthquake
- fault inferred from measured data, regional topography from satellite data, physically consistent stress and friction parameters
- 1D velocity structure, low velocity near surface
Multiphysics Dynamic Rupture Simulation

- spontaneous rupturing due to exceeded stress limits, including rupture jumps, fault branching, etc.
- tackles fundamental questions on the dynamics of earthquakes in natural fault zone systems


Landers Earthquake Results

Observations:
- complex rupture dynamics (fault branching, etc.)
- high-frequency signals (up to 10 Hz) from the rupture propagate directly into the wave field
- accelerograms with frequencies up to 10 Hz
- ground shaking in the engineering frequency band
- 42 s simulated time

Part III
Optimizing SeisSol for Petascale Seismic Simulations on SuperMUC
PRACE ISC Award 2014

Seismic Wave Propagation with SeisSol

Elastic Wave Equations (velocity-stress formulation):

    q_t + A q_x + B q_y + C q_z = 0,
    \quad q = (\sigma_{11}, \sigma_{22}, \sigma_{33}, \sigma_{12}, \sigma_{23}, \sigma_{13}, u, v, w)^T

- the nine-dimensional vector of unknowns q contains the normal stresses \sigma_{11}, \sigma_{22}, \sigma_{33}, the shear stresses \sigma_{12}, \sigma_{23}, \sigma_{13}, and the particle velocities u, v, w in x-, y- and z-direction
- the Jacobians A, B and C are sparse 9x9 matrices built from the Lamé parameters \lambda(x,y,z) and \mu(x,y,z) (\mu is the shear modulus, \lambda has no direct physical interpretation) and the density \rho(x,y,z) > 0
- ADER-DG: high-order discontinuous Galerkin discretisation with high approximation order in space and time
- additional features: local time stepping, high accuracy of earthquake faulting (full frictional sliding)
- Dumbser, Käser et al. [3,5]

SeisSol in a Nutshell

ADER-DG update scheme for element k:

    Q_k^{n+1} = Q_k^n
      - \frac{|S_k|}{|J_k|} M^{-1} \sum_{i=1}^{4} F^{-,i}     I(t^n, t^{n+1}, Q_k^n)      N_{k,i} A_k^+      N_{k,i}^{-1}
      - \frac{|S_k|}{|J_k|} M^{-1} \sum_{i=1}^{4} F^{+,i,j,h} I(t^n, t^{n+1}, Q_{k(i)}^n) N_{k,i} A_{k(i)}^- N_{k,i}^{-1}
      + M^{-1} K^\xi   I(t^n, t^{n+1}, Q_k^n) A_k
      + M^{-1} K^\eta  I(t^n, t^{n+1}, Q_k^n) B_k
      + M^{-1} K^\zeta I(t^n, t^{n+1}, Q_k^n) C_k

Cauchy-Kovalewski (ADER) time integration:

    I(t^n, t^{n+1}, Q_k^n) = \sum_{j} \frac{(t^{n+1} - t^n)^{j+1}}{(j+1)!} \, \frac{\partial^j}{\partial t^j} Q_k(t^n)

with the time derivatives obtained recursively from

    \frac{\partial}{\partial t} Q_k = - M^{-1} \left( (K^\xi)^T Q_k A_k + (K^\eta)^T Q_k B_k + (K^\zeta)^T Q_k C_k \right)

A plain code sketch of this recursion follows below.

[Figures: structure of the code; matrix (sparsity) patterns of the ADER-DG scheme in SeisSol]
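To make the Cauchy-Kovalewski recursion above concrete, here is a plain, unoptimised sketch in C++. The Mat helper, the argument names and the matrix layout (B basis functions x 9 quantities) are illustrative assumptions, not SeisSol's actual data structures, which use the generated kernels discussed on the following slides.

    #include <array>
    #include <vector>

    // Minimal dense matrix type: row-major, sizes fixed at construction.
    struct Mat {
        int rows, cols;
        std::vector<double> a;
        Mat(int r, int c) : rows(r), cols(c), a(r * c, 0.0) {}
        double&       operator()(int i, int j)       { return a[i * cols + j]; }
        const double& operator()(int i, int j) const { return a[i * cols + j]; }
    };

    // C = A * B (no blocking, no vectorisation -- just the maths).
    Mat matmul(const Mat& A, const Mat& B) {
        Mat C(A.rows, B.cols);
        for (int i = 0; i < A.rows; ++i)
            for (int k = 0; k < A.cols; ++k)
                for (int j = 0; j < B.cols; ++j)
                    C(i, j) += A(i, k) * B(k, j);
        return C;
    }

    // ADER/Cauchy-Kovalewski time integration for one element:
    //   I(t^n, t^{n+1}, Q) = sum_j  dt^{j+1}/(j+1)!  d^j/dt^j Q
    // with the derivative recursion
    //   d^{j+1}/dt^{j+1} Q = - sum_c  minvKT[c] * (d^j/dt^j Q) * jac[c],
    // where minvKT[c] stands for M^{-1} (K^c)^T (c = xi, eta, zeta)
    // and jac[c] are the element-local Jacobians A_k, B_k, C_k (9 x 9).
    Mat ck_time_integration(const Mat& Q,                       // B x 9 DOFs
                            const std::array<Mat, 3>& minvKT,   // B x B each
                            const std::array<Mat, 3>& jac,      // 9 x 9 each
                            double dt, int order) {
        Mat deriv = Q;                // j-th time derivative, starts with j = 0
        Mat timeInt(Q.rows, Q.cols);  // accumulated time integral
        double factor = dt;           // dt^{j+1} / (j+1)!
        for (int j = 0; j < order; ++j) {
            for (std::size_t i = 0; i < timeInt.a.size(); ++i)
                timeInt.a[i] += factor * deriv.a[i];
            // next time derivative
            Mat next(Q.rows, Q.cols);
            for (int c = 0; c < 3; ++c) {
                Mat tmp = matmul(matmul(minvKT[c], deriv), jac[c]);
                for (std::size_t i = 0; i < next.a.size(); ++i) next.a[i] -= tmp.a[i];
            }
            deriv = next;
            factor *= dt / (j + 2);   // advance to dt^{j+2} / (j+2)!
        }
        return timeInt;
    }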
Optimisation of Sparse Matrix Operations

Apply sparse matrices to multiple DOF-vectors Q_k
(a polynomial basis of degree N has (N+1)(N+2)(N+3)/6 basis functions, e.g. 20 for degree 3 and 35 for degree 4)

[Figure: sparsity patterns of the matrices appearing in the ADER time integration and in the volume integration of the B = 35 scheme; the volume integration multiplies the sparse stiffness matrices K^xi, K^eta, K^zeta and the sparse Jacobians A_k, B_k, C_k with the time-integrated unknowns I(t^n, t^{n+1}, Q_k^n)]

Code Generator for Sparse Kernels: (Breuer et al. [1])
- avoid the overhead of CSR (or similar) data structures; store only the CSR elements vector
- full unrolling of all element operations using a code generator (see the sketch after this slide)
- use intrinsics and apply blocking to improve vectorisation
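A sketch (not SeisSol's actual generated code) of the difference between a generic CSR-style sparse multiply and a generated, fully unrolled kernel for a fixed sparsity pattern. The two-row pattern, the function names and the leading dimension of 9 are purely illustrative; the real generated kernels additionally use AVX intrinsics and register blocking.

    // Generic CSR-style multiply, C += A_sparse * B, where B and C are
    // DOF matrices with 9 columns (quantities), stored row-major:
    void spmm_csr(int nRows, const int* rowPtr, const int* colIdx,
                  const double* val, const double* B, double* C) {
        for (int i = 0; i < nRows; ++i)
            for (int k = rowPtr[i]; k < rowPtr[i + 1]; ++k)
                for (int q = 0; q < 9; ++q)
                    C[i * 9 + q] += val[k] * B[colIdx[k] * 9 + q];
    }

    // What a generated kernel looks like instead: the sparsity pattern of a
    // hypothetical matrix is hard-wired, only the non-zero values are stored,
    // and every operation is unrolled so the inner loop over the 9 quantities
    // vectorises cleanly:
    void spmm_generated(const double* val, const double* B, double* C) {
        // row 0 has non-zeros in columns 1 and 4 (illustrative pattern)
        for (int q = 0; q < 9; ++q)
            C[0 * 9 + q] += val[0] * B[1 * 9 + q] + val[1] * B[4 * 9 + q];
        // row 2 has a single non-zero in column 7
        for (int q = 0; q < 9; ++q)
            C[2 * 9 + q] += val[2] * B[7 * 9 + q];
        // ... one such block per non-zero row, emitted by the generator ...
    }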


Optimization of CK Kernels: Code Generator


Optimization of Sparse Kernels: Results

- up to 5.0 GFLOPS per core (Intel SNB-EP, 2.7 GHz)
- factor 2-5 speedup due to generated intrinsics code (dual-socket compute node, SNB-EP, 2.7 GHz)

[Bar chart: GFLOPS of classic vs. generated kernels over the number of basis functions (10, 20, 35, 56); the generated kernels reach roughly 41-56 GFLOPS vs. 8-25 GFLOPS for the classic code, i.e. speedups of about 2.1x to 5.2x]

- less effective for the boundary kernel (vs. time & volume kernel)

Dense Kernels: Code Generator


Optimisation of Sparse Matrix Operations

Apply sparse matrices to multiple DOF-vectors Q_k

[Figure: sparsity patterns in the evaluation of the ADER time integration for a fifth-order method, (a) computation of the first time derivative d/dt Q_k, (b) computation of the second derivative d^2/dt^2 Q_k; the recursion generates zero-blocks in the derivatives, and ivory blocks hit zero blocks]

Dense vs. Sparse Kernels: (Breuer et al. [2])
- generate customized dense kernels, optimized for (fixed) small sizes
- for sparse and dense kernels: exploit the zero-blocks generated during the recursive CK computation

Performance Optimization

Switch between Sparse/Dense Kernels:
- auto-tuning approach on benchmark scenarios
- measure sparse vs. dense performance for each matrix
- select sparse vs. dense kernel based on best time to solution (see the sketch after this slide)

[Table: sparse vs. dense kernel choice for each matrix and polynomial order, including the boundary kernels; the annotated percentages range from 9 % to 26 %]

Hybrid MPI+OpenMP Parallelisation:
- careful OpenMP parallelisation of all parts (not only the main kernels: communication buffers, etc.)
- targeted at manycore platforms, such as Intel Xeon Phi
- OpenMP improvements for Xeon Phi also lead to noticeable improvements for standard CPUs
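A minimal sketch of the selection step described above, assuming each matrix has a sparse and a dense kernel variant that can be timed on a benchmark scenario. The Kernel type, the Variant enum and the repetition count are illustrative, not SeisSol's actual auto-tuning infrastructure.

    #include <chrono>
    #include <functional>

    using Kernel = std::function<void()>;

    enum class Variant { Sparse, Dense };

    // Time a few repetitions of each variant on a benchmark scenario and
    // keep whichever gives the best time to solution for this matrix.
    Variant pick_variant(const Kernel& sparseKernel, const Kernel& denseKernel,
                         int reps = 100) {
        auto time = [reps](const Kernel& k) {
            auto start = std::chrono::steady_clock::now();
            for (int i = 0; i < reps; ++i) k();
            return std::chrono::steady_clock::now() - start;
        };
        return time(sparseKernel) <= time(denseKernel) ? Variant::Sparse
                                                       : Variant::Dense;
    }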



Mesh Generation and Partitioning

Mesh Generation:
- high-quality meshes required (shallow subduction zones, complicated fault structures)
- with 10^8 - 10^9 grid cells
- using SimModeler by Simmetrix (http://simmetrix.com/)

Two-stage approach to provide parallel mesh partitions:
- graph-based partitioning (ParMETIS)
- create customised parallel format (based on netCDF) for mesh partitions
- highly scalable mesh input via netCDF/MPI-IO in SeisSol

Parallel Mesh Input

- mesh and partition information combined in a single mesh file
  (file dimensions: #partitions, #vertices per element, max. #elements per partition)
- include partition boundaries
- binary file format (HDF5/netCDF)
- only collective MPI I/O operations to access data (see the sketch after this slide)

New Mesh Pipeline:
[Diagram: new mesh pipeline involving the CAD model, SimModeler/GAMBIT, the Gambit mesh, and PUMGen using the Simulation Modeling Suite C++ API and ParMETIS]
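A minimal sketch of the collective access pattern, using plain MPI-IO rather than the netCDF layer SeisSol actually builds on. The file name, the element counts and the assumption of equally sized partitions are illustrative only.

    #include <mpi.h>
    #include <vector>

    // Every rank reads its own partition from one shared mesh file with a
    // single collective call, so the MPI library can aggregate the requests
    // into large contiguous accesses.
    std::vector<double> read_partition(MPI_Comm comm, const char* path,
                                       MPI_Offset elemsPerRank, int valuesPerElem) {
        int rank;
        MPI_Comm_rank(comm, &rank);

        MPI_File fh;
        MPI_File_open(comm, path, MPI_MODE_RDONLY, MPI_INFO_NULL, &fh);

        std::vector<double> buf(static_cast<std::size_t>(elemsPerRank) * valuesPerElem);
        MPI_Offset offset = static_cast<MPI_Offset>(rank) * buf.size() * sizeof(double);

        // collective read: all ranks of the communicator participate
        MPI_File_read_at_all(fh, offset, buf.data(), static_cast<int>(buf.size()),
                             MPI_DOUBLE, MPI_STATUS_IGNORE);
        MPI_File_close(&fh);
        return buf;
    }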


Mesh Initialization

[Plot: strong scaling of mesh initialization on SuperMUC for the LOH.1 benchmark (7,252,482 cells); runtime (s) over #tasks (256 to 16000), comparing the GAMBIT Neutral + Metis partition file against the netCDF file]

- runtime and memory requirements are reduced to O(#cells/#partitions)
- largest test: mesh with 8,847,360,000 tetrahedrons, read in 23 seconds on 9216 nodes/ranks


SuperMUC @ Leibniz Supercomputing Centre

- 9216 compute nodes (18 thin node islands): 147,456 cores
  (per node: 2 Intel SNB 8-core CPUs, Xeon E5-2680, 2.7 GHz)
- Infiniband FDR10 interconnect (fat tree on islands)
- #10 in Top 500 (Nov 2013): 2.897 PFlop/s
Landers 1992 Production Run on SuperMUC

[Plot: achieved FLOP/s and % peak performance (non-zeros only) over the total number of nodes (256 to 9216)]

Strong Scaling:
- 191 million tetrahedrons; 220,982 element faces on the fault
- 6th order, 96 billion degrees of freedom
- 1.30 PFLOPS, equiv. to 40.7 % peak efficiency on 147,456 cores
- 80 % parallel efficiency (compared to 4096 cores)

Landers 1992 Production Run on SuperMUC

[Plot: see previous slide]

Production Simulation:
- 7 h 15 min computing time on SuperMUC (234,456 time steps)
- output at 23 receivers and of the fault
- 1.25 PFLOPS sustained performance (SuperMUC record for a simulation of that complexity)


Part IV
Dynamic Rupture Simulations on Heterogeneous Supercomputers
Finalist for ACM Gordon Bell Prize 2014

Supercomputing Platforms

Stampede @ TACC, Austin
- 6400 compute nodes, 522,080 cores
- per node: 2 SNB-EP (8c) + 1 Xeon Phi SE10P
- Mellanox FDR 56 interconnect (fat tree)
- #7 in Top 500: 5.168 PFlop/s

Tianhe-2 @ NSCC, Guangzhou
- 8000 compute nodes used, 1.6 mio cores
- per node: 2 IVB-EP (12c) + 3 Xeon Phi 31S1P
- TH2-Express custom interconnect
- #1 in Top 500: 33.862 PFlop/s
Optimization for Intel Xeon Phi Platforms

Offload Scheme:
- addresses load imbalances of the multiphysics simulation with Xeon Phi (see the sketch after this slide)
- hides communication between nodes

[Diagram: host and Xeon Phi connected via PCIe.
 Host: time integration of MPI boundary cells; download cells for receivers, DR, MPI; MPI communication and receiver output; dynamic rupture fluxes and fault output; upload MPI-received cells; upload dynamic rupture updates; download all data (if required); plot wave field (if required).
 Xeon Phi: time integration of non-MPI cells and volume integration; wave propagation fluxes; apply dynamic rupture updates and pack transfer data]

OpenMP parallelisation:
- addresses manycore parallelism with 1-3 coprocessors
- careful parallelisation of all loops
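A minimal sketch of the asynchronous offload pattern from the diagram, assuming the Intel compiler's offload pragmas (LEO). The function names and data layout are invented, and SeisSol's actual scheme interleaves more steps (dynamic rupture fluxes, receiver output, etc.) as shown above.

    // work that runs on the coprocessor: time integration of interior cells
    __attribute__((target(mic)))
    void integrate_interior_cells(double* dofs, int n) {
        for (int i = 0; i < n; ++i) dofs[i] += 1.0;   // placeholder for the real kernels
    }

    // work that stays on the host
    void integrate_mpi_boundary_cells(double* dofs, int n) {
        for (int i = 0; i < n; ++i) dofs[i] += 1.0;   // placeholder
    }
    void exchange_ghost_layers(double*, int) { /* MPI_Isend/Irecv/Waitall in the real code */ }

    void timestep(double* dofs, int n, double* boundaryDofs, int nBnd) {
        int signalTag = 0;

        // 1) launch the bulk of the work on the coprocessor, asynchronously
        #pragma offload target(mic:0) inout(dofs : length(n)) signal(&signalTag)
        integrate_interior_cells(dofs, n);

        // 2) meanwhile the host integrates the MPI boundary cells and overlaps
        //    the neighbour exchange with the coprocessor computation
        integrate_mpi_boundary_cells(boundaryDofs, nBnd);
        exchange_ghost_layers(boundaryDofs, nBnd);

        // 3) wait for the coprocessor before the dynamic rupture / flux step
        #pragma offload_wait target(mic:0) wait(&signalTag)
    }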


Weak Scaling of Wave Propagation

[Plot: % parallel efficiency (85 to 100) over # nodes (1 to 9216) for Tianhe-2 (1, 2, 3 cards), Stampede, and SuperMUC (classic and gr. buff.)]

- goal: test scalability towards large problem sizes
- cubic domain, uniformly refined tetrahedral cells
- weak scaling: 400,000 elements per card/node
Weak Scaling of Wave Propagation

[Plot: see above]

- more than 90 % parallel efficiency on Tianhe-2 and Stampede
- 87 % on full SuperMUC (no overlapping)

Weak Scaling of Wave Propagation

[Plot: see above]

- 8.6 PFlop/s on Tianhe-2 (8000 nodes)
- 2.3 PFlop/s on Stampede (6144 nodes)
- 1.6 PFlop/s on SuperMUC (9216 nodes)

Weak Scaling: Peak Efficiency

[Plot: % of hardware peak (up to 60) and % of non-zero peak (up to 30) over # nodes (16 to 9216) for Tianhe-2, Stampede, and SuperMUC (classic and gr. buff.)]

Strong Scaling of Landers Scenario

[Plot: % parallel efficiency (40 to 100) over # nodes (256 to 9216) for Tianhe-2 (1, 2, 3 cards), Stampede, and SuperMUC (classic and gr. buff.)]

- 191 million tetrahedrons; 220,982 element faces on the fault
- 6th order, 96 billion degrees of freedom


Strong Scaling of Landers Scenario

[Plot: see above]

- more than 85 % parallel efficiency on Stampede and Tianhe-2 (when using only one Xeon Phi per node)
- multiple-Xeon-Phi performance suffers from MPI communication
Strong Scaling of Landers Scenario

[Plot: see above]

- 3.3 PFlop/s on Tianhe-2 (7000 nodes)
- 2.0 PFlop/s on Stampede (6144 nodes)
- 1.3 PFlop/s on SuperMUC (9216 nodes)
Landers Strong Scaling: Peak Efficiency

[Plot: % of hardware peak (up to 50) and % of non-zero peak (up to 25) over # nodes (256 to 9216) for Tianhe-2, Stampede, and SuperMUC (classic and gr. buff.)]


Summary and Key Findings

Multiphysics Dynamic Rupture Simulations with SeisSol:
- high-order ADER-DG on unstructured adaptive meshes
- focus on complicated geometries and rupture physics
- non-linear interaction of rupture process and seismic waves

Petascale Performance on Heterogeneous Platforms:
- exploits the high computational intensity of ADER-DG
- requires careful tuning of the entire simulation pipeline
- code generation and auto-tuning to accelerate element kernels
- scalable mesh input (and output) for 200M cells on 147k cores
- offload scheme for multiphysics with Xeon Phi
- approx. factor 6 improvement in time-to-solution
- factor 20-200 improvement in problem size
Next Steps and Future Challenges . . .

from one simulation to many simulations:
- ensemble simulations for parameter studies (e.g. sensitivity studies)
- uncertainty quantification (e.g. hazard maps)

coupling with other models:
- geodynamics simulations to infer realistic loads on faults; towards seismic cycle simulations
- simulate tsunamigenic earthquakes to obtain time-dependent displacements as initial condition for tsunamis

challenges on the road to exascale:
- resilience, reliability and uniformity of hardware
- new measures of performance: energy efficiency
- how to deal with the generated output (e.g. high-order data)

References

[1] A. Breuer, A. Heinecke, M. Bader, C. Pelties: Accelerating SeisSol by generating vectorized code for sparse matrix operators. In: Advances in Parallel Computing 25, IOS Press, 2014. Proceedings of ParCo 2013.

[2] A. Breuer, A. Heinecke, S. Rettenberger, M. Bader, A.-A. Gabriel, C. Pelties: Sustained Petascale Performance of Seismic Simulations with SeisSol on SuperMUC. In: Supercomputing, LNCS 8488, pp. 1-18, 2014. PRACE ISC Award 2014.

[3] M. Dumbser, M. Käser: An arbitrary high-order discontinuous Galerkin method for elastic waves on unstructured meshes - II. The three-dimensional isotropic case. Geophys. J. Int. 167(1), 2006.

[4] A. Heinecke, A. Breuer, S. Rettenberger, M. Bader, A.-A. Gabriel, C. Pelties, A. Bode, W. Barth, X.-K. Liao, K. Vaidyanathan, M. Smelyanskiy, P. Dubey: Petascale High Order Dynamic Rupture Earthquake Simulations on Heterogeneous Supercomputers. Gordon Bell Prize Finalist 2014.

[5] M. Käser, M. Dumbser, J. de la Puente, H. Igel: An arbitrary high-order Discontinuous Galerkin method for elastic waves on unstructured meshes - III. Viscoelastic attenuation. Geophys. J. Int. 168(1), 2007.

[6] C. Pelties, J. de la Puente, J.-P. Ampuero, G. B. Brietzke, M. Käser: Three-dimensional dynamic rupture simulation with a high-order discontinuous Galerkin method on unstructured tetrahedral meshes. J. Geophys. Res.: Solid Earth, 117(B2), 2012.