HPC Best Practices for FEA

John Higgins, PE
Senior Application Engineer
ANSYS, Inc., May 18, 2012

Agenda
- Overview
- Parallel Processing Methods
- Solver Types
- Performance Review
- Memory Settings
- GPU Technology
- Software Considerations
- Appendix

Overview
Basic information: a model and a machine go in; the output data is the elapsed time.

The need for speed:
- Implicit structural FEA codes
- Mesh fidelity continues to increase
- More complex physics being analyzed
- Lots of computations!

Overview
Basic information:
- A model: size / number of DOF, analysis type
- A machine: number of cores, RAM
These feed the solver configuration, which determines the output data: the elapsed time.

Analyzing the model before launching the run helps you choose the most suitable solver configuration on the first attempt.

Overview
The full picture, from inputs to outputs:

Basic information
- A model: size / number of DOF, analysis type
- A machine: number of cores, RAM

Solver Configuration
- Parallel processing method: Shared Memory (SMP) or Distributed Memory (DMP)
- Solver type: Direct (Sparse) or Iterative (PCG)
- Memory settings

Information during the solve
- Resource Monitor: CPU, memory, disk, network

Output data
- Elapsed time
- Equation solver computational rate
- Equation solver effective I/O rate (bandwidth)
- Total memory used (in-core / out-of-core?)

Parallel Processing

Parallel Processing Hardware
Workstation/Server (single box):
- Shared memory (SMP), or
- Distributed memory (DMP)

Parallel Processing Hardware
Cluster (workstation cluster, node cluster):
- Distributed memory (DMP) across multiple boxes

Parallel Processing Hardware + Software

                     Laptop/Desktop or
                     Workstation/Server    Cluster
ANSYS                YES                   SMP (per node)
Distributed ANSYS    YES                   YES

Distributed ANSYS Design Requirements
- No limitation in simulation capability
- Reproducible and consistent results
- Support for all major platforms

Distributed ANSYS Architecture
Domain decomposition approach:
- Break the problem into N pieces (domains)
- Solve the global problem independently within each domain
- Communicate information across the boundaries as necessary
[Figure: a mesh partitioned into domains across Processors 1-4]

Distributed ANSYS Architecture
[Figure: process 0 (host) performs the domain decomposition and hands domains 0 through n-1 to processes 0 through n-1; each process runs element formation, assembly, solve, and element output on its own domain, with interprocess communication between processes; the results are combined at the end]

Distributed ANSYS Solvers
- Distributed sparse (default): supports all analyses supported by DANSYS (linear, nonlinear, static, transient)
- Distributed PCG: for static and full transient analyses
- Distributed LANPCG (eigensolver): for modal analyses
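
In MAPDL the solver type is picked with EQSLV (MODOPT for modal analyses), while distributed execution is controlled by the launch options (-dis -np N). A minimal sketch, assuming a static analysis:

  /solu
  eqslv,sparse         ! distributed sparse when launched with -dis
  ! eqslv,pcg,1e-8     ! or the PCG solver, with its convergence tolerance
  ! modopt,lanpcg,10   ! for a modal analysis: PCG Lanczos, here 10 modes
  solve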


Benefits of Distributed ANSYS
- The entire SOLVE phase runs in parallel; more computations performed in parallel means faster solution time
- Better speedups than SMP: can achieve > 4x on 8 cores (try getting that with SMP!)
- Can be used for jobs running on up to hundreds of cores
- Can take advantage of resources on multiple machines, so a whole new class of problems can be solved
- Memory usage and bandwidth scale
- Disk (I/O) usage scales (i.e., parallel I/O)

Solver Types

Solver Types: Solution Overview
Approximate share of CPU time by phase:
- Prep data: 5%
- Element formation: 10%
- Solution procedures / global assembly: 10%
- Solve [K]{x} = {b}: 70%
- Element stress recovery: 5%

The equation solver dominates solution CPU time, so pay attention to the equation solver! It also consumes the most system resources (memory and I/O).


System Resolution: Solver Architecture
[Figure: data flow between files and in-core objects; the database feeds element formation (esav, emat), symbolic assembly produces the .full file, which goes to the sparse solver or the PCG solver, and output/element output is written to .rst/.rth]

Solver Types: SPARSE (Direct)
Files:
- LN09
- *.BCS: statistics from the sparse solver
- *.full: assembled stiffness matrix

Solver Types: SPARSE (Direct)
PROS
- More robust on poorly conditioned problems (shells and beams)
- A solution is always guaranteed
- Fast for the 2nd and later solves (multiple load cases)
CONS
- Factoring the matrix and solving are resource intensive
- Large memory requirements

Solver Types: PCG (Iterative)
- Minimization of the residual/potential energy (standard conjugate gradient method), with residual {r} = {f} - [K]{u}
- An iterative process requiring a convergence test (PCGTOL)
- Preconditioned CG is used instead to reduce the number of iterations: a preconditioner [Q] ~ [K]^-1 is applied, [Q] being much cheaper to compute than [K]^-1
[Figure: convergence of the residual vs. number of iterations]

Solver Types: PCG (Iterative)
For an ill-conditioned model, PCGTOL needs to be tightened (1e-9 or 1e-10) so that ANSYS follows the same path (equilibrium iterations) as the direct solver.
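
A minimal sketch of tightening the tolerance in MAPDL (the second argument of EQSLV is the PCG tolerance; 1.0e-8 is the usual default):

  /solu
  eqslv,pcg,1e-10   ! tighter PCGTOL for an ill-conditioned model
  solve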


Solver Types: PCG (Iterative)
Files:
- *.PC*
- *.PCS: iterative solver statistics

Solver Types: PCG (Iterative)
PROS
- Lower memory requirements
- Better suited to larger, well-conditioned problems
CONS
- Not useful with near-rigid or rigid body behavior
- Less robust on ill-conditioned models (shells and beams, inadequate boundary conditions (rigid body motions), strongly elongated elements, nearly singular matrices): it becomes more difficult to approximate [K]^-1 with [Q]

Solver Types: PCG (Iterative)
Level of difficulty: the LOD number is available in the solver output (solve.out), but it can also be seen, along with the number of PCG iterations required to reach a converged solution, in the jobname.PCS file.

Solver Types: PCG (Iterative)
Other ways to evaluate ill-conditioning: an error message is also an indication. Although the message proposes changing a MULT coefficient, the model should be carefully reviewed first, and the SPARSE solver considered for the resolution instead.

Solver Types
Comparative
[Table: side-by-side comparison of the sparse and PCG solvers]

Performance Review

Performance Review
Process resource monitoring (only available on Windows 7): Windows Resource Monitor is a powerful tool for understanding how your system resources are used by processes and services in real time.

Performance Review
How to access the Resource Monitor:
- From the OS Task Manager (Ctrl + Shift + Esc), or
- Click Start, click in the Start Search box, type resmon.exe, and then press ENTER.

Performance Review
Process resource monitoring: CPU
[Screenshots: CPU usage during a Shared Memory (SMP) run vs. a Distributed Memory (DMP) run]

Performance Review
Process resource monitoring: memory
[Screenshots: memory usage before and during the solve, and the corresponding information in solve.out]


Performance Review
ANSYS End Statistics: basic information about the analysis solve, including the total elapsed time, is directly available at the end of the solver output file (*.out), in the Solution Information.

Performance Review
Other main output data to check:

Output Data                   Description
Elapsed Time (sec)            Total time of the simulation
Solver rate (Mflops)          Speed of the solver
Bandwidth (Gbytes/s)          I/O rate
Memory Used (Mbytes)          Memory required
Number of iterations (PCG)    Available for PCG only

Performance Review
[Screenshots: elapsed time, solver rate, bandwidth, memory used, and number of iterations as reported in the PCG (*.PCS) and SPARSE (*.BCS) statistics files]

Memory Settings

Memory Settings
SPARSE: Solver Output Statistics >> Memory Checkup
[Screenshot: memory statistics in the sparse solver output]

Memory Settings: Test Case 1
Test case 1: small model (needs 4 GB of scratch memory, less than the RAM). Reference machine: 6 GB RAM, so there is enough memory, but:
- Default, BCSOPTION,,OPTIMAL: elapsed time = 146 sec
- BCSOPTION,,INCORE: elapsed time = 77 sec
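
A minimal sketch of setting the sparse solver memory mode in MAPDL (the second field of BCSOPTION is the memory mode):

  /solu
  bcsoption,,incore     ! force in-core factorization (only when the scratch data fits in RAM)
  ! bcsoption,,optimal  ! default: optimal out-of-core
  solve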

[Screenshots: sparse solver memory statistics for test case 1 with BCSOPTION,,OPTIMAL (default) and BCSOPTION,,INCORE]

Memory Settings: Test Case 2
Test case 2: large model (needs 21.1 GB of scratch memory, more than the RAM). Reference machine: DELL M6400, 12 GB RAM, 2 SATA 7200 rpm disks in RAID 0.
Do not force in-core mode when the available memory is not enough!
- Default, BCSOPTION,,OPTIMAL: elapsed time = 1249 sec
- BCSOPTION,,INCORE: elapsed time = 4767 sec

[Screenshots: sparse solver memory statistics for test case 2 with BCSOPTION,,OPTIMAL (default) and BCSOPTION,,INCORE]

Memory Settings
SPARSE: Solver Output Statistics >> Memory Checkup. Example: 206 MB is available for the sparse solver at the time of factorization.
- This is sufficient to run in optimal out-of-core mode (which requires 126 MB) and obtain good performance.
- If more than 1547 MB were available, it would run fully in-core: the best performance.
- Avoid minimum out-of-core mode (memory less than 126 MB).

Summary of the modes for this model: in-core above 1547 MB; optimal out-of-core between 126 MB and ~1.5 GB; minimum out-of-core below 126 MB.

Memory Settings
SPARSE: three memory modes can be observed, from best to worst performance:
- In-core mode (optional): requires the most memory, performs no I/O
- Optimal out-of-core mode (default): balances memory usage and I/O
- Minimum core mode (not recommended): requires the least memory, performs the most I/O

Memory Settings: Test Case 3
Test case 3: a trap to avoid; launching a run on a network share (or a slow drive).
Solving on a local disk vs. a slow disk (networked or USB):
- Local disk: elapsed time = 1998 sec
- Slow disk: elapsed time = 3185 sec

Performance Review
PCG: solver statistics, *.PCS file >> number of iterations
[Screenshot: the PCS file reports the number of cores used (SMP, DMP), the preconditioner level from PCGOPT,Lev_Diff, and the number of iterations: an important statistic!]

Performance Review
PCG: solver statistics, *.PCS file >> number of iterations
Check the total number of PCG iterations:
- Fewer than 1000 iterations: good performance
- More than 1000 iterations: performance deteriorates; try increasing Lev_Diff (PCGOPT), as in the sketch below
- More than 3000 iterations: assuming you have already tried increasing Lev_Diff, either abandon PCG and use the sparse solver, or improve element aspect ratios, boundary conditions, and/or contact conditions
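
A minimal sketch of raising the preconditioner level in MAPDL (the first field of PCGOPT is Lev_Diff, 1 to 5; a higher level builds a stronger but more expensive preconditioner):

  /solu
  eqslv,pcg
  pcgopt,3     ! raise Lev_Diff when the iteration count is high
  solve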


Performance Review
If there are too many iterations:
- Use parallel processing
- Use PCGOPT,Lev_Diff
- Refine your mesh
- Check for excessively high stiffnesses

Performance Review
PCG: solver statistics, *.PCS file >> efficiency & solver memory usage
[Screenshot: efficiency and memory usage section of the PCS file]

GPU Technology

GPU Technology
Graphics processing units (GPUs):
- Widely used for gaming and graphics rendering
- Recently made available as general-purpose accelerators
- Support double-precision arithmetic
- Performance exceeding the latest multicore CPUs

So how can ANSYS Mechanical make use of this new technology to reduce the overall time to solution?

GPU Technology: Introduction
CPUs and GPUs are used in a collaborative fashion, connected by a PCI Express channel:
- CPU: a multi-core processor, typically 4-12 cores; powerful and general purpose
- GPU: a many-core processor, typically hundreds of cores; great for highly parallel code

GPU Accelerator Capability
Motivation: the equation solver dominates solution time, so it is the logical place to add GPU acceleration. Approximate share of solution time by phase:
- Element formation: 5%-30%
- Solution procedures: 5%-10%
- Global assembly: 5%-10%
- Equation solver (e.g., [A]{x} = {b}): 60%-90%
- Element stress recovery: 1%-10%

GPU Accelerator Capability
Accelerates the sparse direct solver (Boeing/DSP):
- The GPU is only used to factor a dense frontal matrix
- The decision on whether to send data to the GPU is based on the frontal matrix size:
  - Too small: too much overhead, so it stays on the CPU
  - Too large: it exceeds the GPU memory, so it stays on the CPU

GPU Accelerator Capability
Supported hardware:
- Currently recommending NVIDIA Tesla 20-series cards; support was recently added for the Quadro 6000
- Requires a larger power supply (one card needs about 225 W) and an open 2x form factor PCIe x16 Gen2 slot
- Supported on Windows/Linux 64-bit

                      Tesla C2050      Tesla C2070      Quadro 6000
Power                 225 W            225 W            225 W
CUDA cores            448              448              448
Memory                3 GB             6 GB             6 GB
Memory bandwidth      144 GB/s         144 GB/s         144 GB/s
Peak speed (SP/DP)    1030/515 Gflops  1030/515 Gflops  1030/515 Gflops

ANSYS Mechanical SMP GPU Speedup
[Chart: solver kernel speedups and overall speedups. System: Intel Xeon 5560 processors (2.8 GHz, 8 cores total), 32 GB RAM, Windows XP SP2 (64-bit), Tesla C2050 (ECC on; WDDM driver)]

Distributed ANSYS GPU Speedup @ 14.0
Vibroacoustic harmonic analysis of an audio speaker, using the direct sparse solver. Quarter-symmetry model with 700K DOF: 657,424 nodes, 465,798 elements, higher-order acoustic fluid elements (FLUID220/221).

Distributed ANSYS results (baseline is 1 core):

Cores    GPU    Speedup
2        no     2.25
4        no     4.29
2        yes    11.36
4        yes    11.51

With a GPU, ~11x speedup on 2 cores! DANSYS is 15-25% faster than SMP with the same number of cores.
[Chart: speedups for DANSYS, SMP, DANSYS+GPU, and SMP+GPU. Windows workstation: two Intel Xeon 5530 processors (2.4 GHz, 8 cores total), 48 GB RAM, NVIDIA Quadro 6000]

ANSYS Mechanical 14.0 Performance for Tesla C2075
V13sp-5 model: turbine geometry, 2,100K DOF, SOLID187 elements, static nonlinear, one iteration, direct sparse solver.

[Chart: ANSYS Mechanical times in seconds (lower is better), Xeon 5670 2.93 GHz Westmere (dual socket) alone vs. the same CPUs plus a Tesla C2075:]

Cores    CPU only    CPU + C2075    GPU speedup
1        1848 s      444 s          4.2x
2        1192 s      342 s          3.5x
4        846 s       314 s          2.7x
6        564 s       273 s          2.1x
8        516 s       270 s          1.9x
12       399 s

Adding a Tesla C2075 to 6 cores makes the run 46% faster than 12 CPU cores alone, with 6 cores left available for other tasks.
Results from an HP Z800 workstation: 2 x Xeon X5670 2.93 GHz, 48 GB memory, CentOS 5.4 x64; Tesla C2075, CUDA 4.0.17.

GPU Accelerator Capability
V13sp-5 benchmark (turbine model)
[Chart: sparse solver factorization speed (Mflops, up to ~180,000) vs. front size (0 to 1300 MB)]

ANSYS Mechanical Multi-Node GPU
Solder joint benchmark (4 MDOF, creep strain analysis; solder balls, mold, PCB), R14 Distributed ANSYS with and without GPU.
[Chart: total speedup at 16, 32, and 64 cores, with and without GPU; values shown include 1.7x, 1.9x, 3.2x, 3.4x, and 4.4x]
Linux cluster: each node contains 12 Intel Xeon 5600-series cores, 96 GB RAM, an NVIDIA Tesla M2070, and InfiniBand.
Results courtesy of MicroConsult Engineering, GmbH.

Trends in Performance by Solver Type
Comparative trends: elapsed time vs. number of DOF for PCG, sparse, and sparse+GPU. With multiple cores and GPUs, all trends can change due to differences in speedup.

Three areas can be defined:
I.   SPARSE is more efficient
II.  Either SPARSE or PCG can be used
III. The PCG solver works faster, since it needs fewer I/O exchanges with the hard disk

You need to evaluate the sparse and PCG behavior and speedup on your own model!

Other Software Considerations
Tips and tricks on performance gains:
- Some considerations on the scalability of DANSYS
- Working with solution differences
- Working with a case that does not scale (or hardly scales)
- Working with programmable features for parallel runs

Scalability Considerations
- Load balance
- Improvements in domain decomposition
- Amdahl's law (see the formula after this list)
- Algorithmic enhancements: every part of the code has to run in parallel
- User-controllable items:
  - Contact pair definitions: big contact pairs hurt load balance (one contact pair is put into one domain in our code)
  - CE definitions: many CE terms hurt load balance and Amdahl's law (CEs need communication among the domains in which they are defined)
- Use the best and most suitable hardware possible (CPU speed, memory, I/O, and interconnects)
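
For reference, Amdahl's law bounds the achievable speedup: if a fraction p of the work runs in parallel on N cores,

  Speedup(N) = 1 / ((1 - p) + p/N)

so even with p = 0.95, the speedup can never exceed 1/0.05 = 20x, no matter how many cores are added.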


Scalability Considerations: Contact
- Avoid overlapping contact surfaces if possible
- Define potential contact surfaces in smaller pieces: define a half circle as the target rather than the full circle
- Avoid defining a whole exterior surface as one target piece; break pairs into smaller pieces if possible
- Remember: one whole contact pair is processed on one processor (contact work cannot be spread out)

Scalability Considerations: Contact
Trim: avoid defining unused surfaces as contact or target, i.e., reduce the potential contact definition to the minimum:
- In rev. 12.0: use the new control CNCHECK,TRIM (see the sketch below)
- In rev. 11.0: turn NLGEOM,OFF when defining contact pairs in WB; WB then automatically turns on a facility like CNCHECK,TRIM internally
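
A minimal sketch of trimming contact pairs in MAPDL, issued after the contact pairs are defined and before solving:

  cncheck,trim    ! remove contact/target elements far from the possible contact zone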

Scalability Considerations: Remote Load/Displacement
Point load distribution (remote load): all nodes connected to one RBE3 node have to be grouped into the same domain. This hurts load balance! Try to reduce the number of RBE3 nodes.
[Figure: a point moment distributed to the internal surface of a hole, and the resulting deformed shape]

Example of Bonded Contact and Remote Loads: Universal Joint Model
- 14 bonded contact pairs, with internal CEs generated by the bonded contact
- Torque defined by an RBE3 on the end surface only: good practice
Because this model uses small pieces of contact and a small RBE3, it scales well in DANSYS.

Working With Solution Differences in Parallel Runs
- Most solution differences come from contact applications when comparing NP = 1 against NP = 2, 3, 4, 5, 6, 7, ...
- Check the contact pairs to make sure there is no case of bifurcation, and plot the deformations to see what is happening.
- Tighten the CNVTOL convergence tolerance to check solution accuracy (see the sketch below). If the solutions differ by less than, say, 1%, then parallel computing can make some difference in convergence, but all the solutions are acceptable.
- If the solution is well defined and all input settings are correct, report the case to ANSYS, Inc. for investigation.
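
A minimal sketch of tightening the force convergence tolerance in MAPDL (the third field of CNVTOL is the tolerance; 0.005, i.e. 0.5%, is the usual default for forces):

  /solu
  cnvtol,f,,0.001   ! tighten force convergence from 0.5% to 0.1%
  solve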


Working With a Case of Poor Scalability
No scalability (speedup) at all, or even slower than NP = 1:
- Is the problem too small (normally it should have more than 50K DOF)?
- Do I have a slow disk, or is the problem so big that the I/O size exceeds the memory I/O buffer?
- Is every node of my machines connected only through the public network?
- Look at the scalability of the solver first, not the entire run (e.g., /prep7 time is mostly not scalable)
- Resume the data at the /solu level; don't read in the input files on every run
- etc.

Working With a Case of Poor Scalability
Yes, I have scalability, but it is poor (say, speedup < 2x):
- Is this GigE or another slow interconnect?
- Are all processors sharing one disk (an NFS mount)?
- Are other people running jobs on the same machine at the same time?
- Do I have many big contact pairs, or a remote load or displacement tied to a major portion of the model?
- Am I using a generation of dual/quad-core CPUs where the memory bandwidth is completely shared among the cores?
- Look at the scalability of the solver first, not the entire run (e.g., /prep7 time is mostly not scalable)
- Resume the data at the /solu level; don't read in the input files on every run
- etc.

APPENDIX

Platform MPI Installation for ANSYS 14
Note for ANSYS Mechanical R13 users:
- Do not uninstall HP-MPI; it is required for compatibility with R13.
- Verify that HP-MPI is installed in its default location, C:\Program Files (x86)\Hewlett-Packard\HP-MPI; this is required for ANSYS Mechanical R13 to execute properly.

Platform MPI Installation for ANSYS 14
- Run setup.exe of the ANSYS R14 installation as Administrator
- Install Platform MPI
- Follow the Platform MPI installation instructions

Platform MPI Installation for ANSYS 14
Note for ANSYS Mechanical R13 users: customers who have R13 installed and wish to continue using R13 should run the following command to ensure compatibility:
"%AWP_ROOT140%\commonfiles\MPI\Platform\8.1.2\Windows\HPMPICOMPAT\hpmpicompat.bat"
(by default: C:\Program Files\ANSYS Inc\v140\commonfiles\MPI\Platform\8.1.2\Windows\HPMPICOMPAT\hpmpicompat.bat)
The command will display a dialog box with the title "ANSYS 13.0 SP1 Help".

Platform MPI Installation for ANSYS 14
To finish the installation:
- Go to %AWP_ROOT140%\commonfiles\MPI\Platform\8.1.2\Windows\setpcmpipassword.bat
(by default: C:\Program Files\ANSYS Inc\v140\commonfiles\MPI\Platform\8.1.2\Windows\setpcmpipassword.bat)
- Run "setpcmpipassword.bat", type your Windows user password, and press Enter.

Test MPI Installation for ANSYS 14
The installation is now finished. To verify that it works properly:
- Edit the file "test_mpi14.bat" attached in the .zip:
  "c:\program files\ansys inc\v140\ansys\bin\winx64\ansys140" -mpitest -mpi pcmpi -np 2
- Change the ANSYS path and the number of processors if necessary (-np x)
- Save and run the file "test_mpi14.bat"
[Screenshot: the expected result of the MPI test]

Test Case: Batch Launch (Sparse Solver)
- The file "cube_sparse_hpc.txt" is an input file for a simple analysis (pressure on a cube).
- Edit the file "job_sparse.bat" and change the ANSYS path and/or the number of processors if necessary.
- You can change the number of mesh divisions of the cube (-ndiv xx) to try out the performance of your machine.
- Save and run the file "job_sparse.bat".

Options used in "job_sparse.bat":
-b : batch mode
-np : number of processors
-j : jobname
-ndiv : number of divisions (for this example only)
-i : input file
-o : output file
-acc nvidia : use GPU acceleration
-mpi pcmpi : Platform MPI
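
For reference, a minimal APDL sketch of what a "pressure on a cube" input can look like (this is an illustration, not the actual contents of cube_sparse_hpc.txt; the element choice and material values are made up):

  /prep7
  et,1,solid185        ! 8-node structural solid (assumed element choice)
  mp,ex,1,2e11         ! Young's modulus, steel-like (illustrative)
  mp,prxy,1,0.3        ! Poisson's ratio
  block,0,1,0,1,0,1    ! unit cube
  esize,0.05           ! element size, i.e. 20 divisions per edge (the role of -ndiv)
  vmesh,all
  nsel,s,loc,z,0
  d,all,all            ! fix the bottom face
  nsel,s,loc,z,1
  sf,all,pres,1e6      ! pressure on the top face
  allsel
  /solu
  eqslv,sparse         ! direct sparse solver
  solve
  finish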


Test Case: Batch Launch (Sparse Solver)
- You can watch your processors working in the Windows Task Manager (Ctrl + Shift + Esc). Example with 6 processes requested:
[Screenshot: Task Manager during the run]
Advice: do not request all the available processors if you want to do something else while the job is running.

Test Case: Batch Launch (Sparse Solver)
Once the run is finished:
- Read the .out file to collect all the information about the solver output.
- The main pieces of information are:
  - Elapsed time (sec)
  - Latency time from master to core
  - Communication speed from master to core
  - Equation solver computational rate
  - Equation solver effective I/O rate

Test Case: Workbench Launch
- Open a Workbench project with ANSYS R14
- Open Mechanical
- Go to Tools -> Solve Process Settings -> Advanced
- Check "Distributed Solution", specify the number of processors to use, and enter the additional command (-mpi pcmpi) as shown below; there is also the possibility to use the GPU here:
[Screenshots: the Solve Process Settings dialog]

Test Case: Workbench Launch
In the analysis settings:
- You can choose the solver type (Direct = sparse, Iterative = PCG)
- Solve your model
- Read the solver output from the Solution Information

Appendix
Automated runs for a model:
- Compare customer results with an ANSYS reference
- A first step for an HPC test on a customer machine

General View
The goal of this Excel file is twofold:
- On the one hand, it writes the batch launch commands for multiple analyses into a file (job.bat)
- On the other hand, it extracts information from the different solve.out files and writes it into Excel
[Screenshot: the INPUT DATA and OUTPUT DATA areas of the spreadsheet]

INPUT DATA
[Screenshot: the input area, with numbered callouts]
1. Names of the machines used for the solve with PCMPI (up to 3). Not required if the solve is performed on a single machine.

2. Run parameters:

Parameter          Description                          Choice
Machine            Number of machines used              1, 2, or 3
Solver             Type of solver used                  sparse or pcg
Division           Divisions of the edge for meshing    Any integer
Release            ANSYS release                        140 or 145
GPU                Use GPU acceleration                 yes or no
np total           Total number of cores                No choice (calculated value)
np / machine       Number of cores per machine          Any integer
PCG level          Only available for the PCG solver    1, 2, 3, or 4
Simulation method  Shared or Distributed Memory         SMP or DMP

3. Create a job.bat file with all the input data given in the Excel file.

OUTPUT DATA
[Screenshot: the output area, with numbered callouts]

1. Read the information from all the *.out files. NB: all the files must be in the same directory. If a *.out file is not found, a pop-up appears:
- Continue: skip this file and go to the next one
- STOP: stop reading the remaining *.out files

2. Extracted output data:

Output Data                   Description
Elapsed Time (sec)            Total time of the simulation
Solver rate (Mflops)          Speed of the solver
Bandwidth (Gbytes/s)          I/O rate
Memory Used (Mbytes)          Memory required
Number of iterations (PCG)    Available for PCG only

All this information (elapsed time, solver rate, bandwidth, memory used, number of iterations) is extracted from the *.out files.
[Screenshots: PCG and SPARSE output excerpts]

3. Hyperlinks are automatically created to open the different *.out files directly from Excel. NB: if an error occurred during the solve (*** ERROR ***), it is automatically highlighted in the Excel file.

And now: we are waiting for your feedback and your results!

Any suggestions or questions for improving the Excel tool: gabriel.messager@ansys.com

THANK YOU