for FEA
John Higgins, PE
Senior Application Engineer
Agenda
- Overview
- Parallel Processing Methods
- Solver Types
- Performance Review
- Memory Settings
- GPU Technology
- Software Considerations
- Appendix
Overview

Basic information
- A model: size / number of DOF; analysis type
- A machine: number of cores; RAM

Solver Configuration
- Parallel processing method: Shared Memory (SMP) or Distributed Memory (DMP)
- Solver type: Direct (Sparse) or Iterative (PCG)
- Memory settings

Information during the solve
- Resource Monitor: CPU, memory, disk, network

Output data
- Elapsed time
- Equation solver computational rate
- Equation solver effective I/O rate (bandwidth)
- Total memory used (in-core / out-of-core?)
Parallel Processing
Workstation

Cluster
                          Laptop/Desktop or
                          Workstation/Server    Cluster
ANSYS (SMP)               YES                   -
Distributed ANSYS (DMP)   YES                   YES
[Figure: domain decomposition - the model is split into domains, one per processor (Processors 1-4); each processor works on its own domain, and the processors communicate information across the domain boundaries as necessary.]
[Figure: Distributed ANSYS flow - domain decomposition assigns the element domains 0 .. n-1 to parallel processes; each process performs element formation, assembly, solve, and element output for its own domain, with interprocess communication between the steps, and the per-domain results are combined at the end.]
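The decompose / solve-per-domain / combine structure above can be mimicked in a toy sketch. This is not how Distributed ANSYS is implemented (it uses MPI across processes); the thread pool and the "partial result" arithmetic below are stand-ins chosen only to illustrate the stages.

```python
from concurrent.futures import ThreadPoolExecutor

def decompose(elements, n_domains):
    """Round-robin domain decomposition of an element list."""
    return [elements[i::n_domains] for i in range(n_domains)]

def solve_domain(domain):
    """Stand-in for element formation, assembly, solve and element
    output on a single domain."""
    return sum(e * e for e in domain)  # a pretend partial result

def distributed_solve(elements, n_domains=4):
    """Decompose, solve each domain in its own worker, then combine."""
    domains = decompose(elements, n_domains)
    with ThreadPoolExecutor(max_workers=n_domains) as pool:
        partials = list(pool.map(solve_domain, domains))
    return sum(partials)  # the "combining results" step

# The combined answer matches a serial pass over all elements:
assert distributed_solve(list(range(1, 101))) == sum(e * e for e in range(1, 101))
```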
- Distributed PCG: for static and full transient analyses
- Distributed LANPCG (eigensolver): for modal analyses
Solver Types
Solver Types
Solution Overview

Solution phases: Prep Data, Element Formation, Solution Procedures, Global Assembly, Solver.
The solver takes roughly 70% of the solution CPU time; the surrounding phases are on the order of 5% each.

The equation solver dominates solution CPU time, so pay attention to the equation solver.
The equation solver also consumes the most system resources (memory and I/O).
Solver Types
Solution Overview

Prep Data -> Element Formation -> Solution Procedures -> Global Assembly -> Solve [K]{x} = {b} (system resolution)
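The [K]{x} = {b} step can be illustrated with the two solver families this deck compares: a direct solve (a naive stand-in for the Sparse solver) and an iterative conjugate-gradient solve (PCG without its preconditioner). A minimal pure-Python sketch on a tiny symmetric positive-definite system; the matrix and routines are illustrative only, not ANSYS internals.

```python
def solve_direct(A, b):
    """Naive dense Gaussian elimination (stand-in for the direct sparse solver)."""
    n = len(b)
    A = [row[:] for row in A]
    x = b[:]
    for k in range(n):                       # forward elimination
        for i in range(k + 1, n):
            f = A[i][k] / A[k][k]
            for j in range(k, n):
                A[i][j] -= f * A[k][j]
            x[i] -= f * x[k]
    for i in range(n - 1, -1, -1):           # back substitution
        s = sum(A[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (x[i] - s) / A[i][i]
    return x

def solve_cg(A, b, tol=1e-10, max_it=1000):
    """Unpreconditioned conjugate gradient; PCG adds a preconditioner."""
    n = len(b)
    x = [0.0] * n
    r = b[:]
    p = r[:]
    rs = sum(v * v for v in r)
    for it in range(max_it):
        Ap = [sum(A[i][j] * p[j] for j in range(n)) for i in range(n)]
        alpha = rs / sum(p[i] * Ap[i] for i in range(n))
        x = [x[i] + alpha * p[i] for i in range(n)]
        r = [r[i] - alpha * Ap[i] for i in range(n)]
        rs_new = sum(v * v for v in r)
        if rs_new < tol:
            return x, it + 1                 # converged: solution + iteration count
        p = [r[i] + (rs_new / rs) * p[i] for i in range(n)]
        rs = rs_new
    return x, max_it

# A small SPD "stiffness" matrix and load vector:
A = [[4.0, 1.0, 0.0], [1.0, 3.0, 1.0], [0.0, 1.0, 2.0]]
b = [1.0, 2.0, 3.0]
xd = solve_direct(A, b)
xi, iters = solve_cg(A, b)
assert all(abs(xd[i] - xi[i]) < 1e-6 for i in range(3))
```

Both paths give the same answer here; the trade-offs (memory and I/O for the direct factorization vs. iteration count for CG) are what the following slides measure.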
Solver Architecture

[Figure: file flow - database -> element formation (esav, emat) -> symbolic assembly (full file) -> PCG solver -> element output -> results files (rst, rth).]
Sparse Solver
Filing
- LN09: solver scratch file
- *.BCS: statistics from the Sparse solver
- *.full: assembled stiffness matrix
PCG convergence: the number of iterations is governed by the convergence tolerance (PCGTOL).
Solver Types
Comparative
Performance Review
Performance Review
Process resource monitoring (only available on Windows 7)
Windows Resource Monitor is a powerful tool for understanding how your system resources are used by processes and services in real time.
Performance Review
How to access the Resource Monitor:
- From the OS Task Manager (Ctrl + Shift + Esc), or
- Click Start, type resmon.exe in the Start Search box, and then press ENTER.
Performance Review
Process Resource Monitoring - CPU
Shared Memory (SMP)
Performance Review
Process Resource Monitoring - Memory
Before the solve:
Performance Review
ANSYS End Statistics
Performance Review
Other main output data to check:
- Bandwidth (Gbytes/s): the solver's effective I/O rate
- Memory required
Performance Review
- Elapsed time
- Solver rate
- Bandwidth
- Memory used
- Number of iterations
Memory Settings
Memory Settings
SPARSE: Solver Output Statistics >> Memory Checkup

Default: BCSOPTION,,OPTIMAL
To force the in-core mode: BCSOPTION,,INCORE

Example: 206 MB available for the Sparse solver at time of factorization; out-of-core below 126 MB.
Memory Settings
SPARSE: 3 memory modes can be observed

Performance, from best to worst:
1. In-core (best)
2. Optimal out-of-core
3. Minimum out-of-core (worst)
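This checkup is easy to express as logic: given the memory the Sparse solver can obtain and the in-core and minimum out-of-core requirements reported in its statistics, the mode follows. The thresholds below echo the example figures from these slides (206 MB available, 126 MB minimum); the function is an illustrative sketch, not an ANSYS API.

```python
def sparse_memory_mode(available_mb, incore_mb, min_ooc_mb):
    """Pick the sparse solver memory mode from the memory checkup numbers.

    available_mb : memory available to the solver at factorization time
    incore_mb    : memory needed to run fully in-core
    min_ooc_mb   : minimum memory needed for the out-of-core mode
    """
    if available_mb >= incore_mb:
        return "in-core (best performance)"
    if available_mb >= min_ooc_mb:
        return "optimal out-of-core"
    return "minimum out-of-core (worst performance)"

# With 206 MB available but (say) 300 MB needed for in-core:
print(sparse_memory_mode(available_mb=206, incore_mb=300, min_ooc_mb=126))
```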
Performance Review
PCG: Solver Statistics (*.PCS file) >> Number of Iterations
- Number of cores used (SMP, DMP)
- Difficulty level from PCGOPT,Lev_Diff
- Number of iterations: an important statistic!
Performance Review
PCG: Solver Statistics (*.PCS file) >> Number of Iterations

Check the total number of PCG iterations (controlled via PCGOPT):
- Fewer than 1000 iterations: good convergence.
- More than 3000 iterations: assuming you have already tried increasing Lev_Diff, either abandon PCG and use the Sparse solver, or improve element aspect ratios, boundary conditions, and/or contact conditions.
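That rule of thumb is easy to automate when post-processing solver output. The "Number of Iterations" line format below is a made-up stand-in for the real *.PCS layout; adapt the regular expression to the actual file contents.

```python
import re

def check_pcg_iterations(pcs_text, lev_diff_tried=True):
    """Pull the iteration count out of PCG statistics text and apply
    the <1000 / >3000 rule of thumb from the slides."""
    m = re.search(r"Number of Iterations\s*:\s*(\d+)", pcs_text)
    iters = int(m.group(1))
    if iters < 1000:
        return iters, "OK"
    if iters > 3000:
        return iters, ("switch to Sparse or fix mesh/BCs/contact"
                       if lev_diff_tried else "try increasing Lev_Diff first")
    return iters, "acceptable, consider raising Lev_Diff"

sample = "Number of Iterations : 3472"
print(check_pcg_iterations(sample))
```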
Performance Review
PCG: Solver Statistics (*.PCS file) >> Number of Iterations
Performance Review
PCG: Solver Statistics (*.PCS file) >> Efficiency & Solver Memory Usage
GPU Technology
GPU Technology
Graphics processing units (GPUs)
- Widely used for gaming and graphics rendering
- Recently made available as general-purpose accelerators
- Support for double-precision arithmetic
- Performance exceeding the latest multi-core CPUs
GPU, connected to the host over a PCI Express channel:

Multi-core processors (CPU):
- Typically 4-12 cores
- Powerful, general purpose

Many-core processors (GPU):
- Typically hundreds of cores
- Great for highly parallel code
Approximate share of solution time:
- Element Formation: 5%-30%
- Solution Procedures: 5%-10%
- Global Assembly: 5%-10%
- Solver: the remaining (dominant) share
GPU cards (NVIDIA Tesla C2070, NVIDIA Quadro 6000, among others):
- Power: 225 Watts
- CUDA cores: 448
- Memory: 3-6 GB
- Memory bandwidth: 144 GB/s
- Peak performance: 1030/515 Gflops (single/double precision)
[Table: per-solver kernel speedups and overall speedups.]
Benchmark system: Intel Xeon 5560 processors (2.8 GHz, 8 cores total), 32 GB of RAM, Windows XP SP2 (64-bit), Tesla C2050 (ECC on; WDDM driver).
[Chart: speedups for SMP and Distributed ANSYS (DANSYS), with and without GPU - 2.25x and 4.29x without GPU, rising to 11.36x and 11.51x with GPU.]
V13sp-5 benchmark model: turbine geometry, 2,100 K DOF, SOLID187 elements, static nonlinear, one iteration, direct sparse solver.
Results from an HP Z800 workstation, 2 x Xeon X5670 2.93 GHz, 48 GB memory, CentOS 5.4 x64; Tesla C2075, CUDA 4.0.17.

[Chart, lower is better: elapsed time (s) without vs with GPU across 1-12 cores (1 and 2 sockets) - e.g. 1848 vs 444 at 1 core (4.2x), 1192 vs 342 (3.5x), 846 vs 314 (2.7x), 564 vs 273 (2.1x), 516 vs 270 (1.9x); the GPU speedup shrinks as the core count grows.]
[Chart: electronics package model (solder balls, mold, PCB) - elapsed time without vs with GPU at 16, 32 and 64 cores; GPU speedups of roughly 1.7x to 3.4x.]
With multiple cores & GPUs, all trends can change due to the speedup differences.

[Chart: elapsed time vs number of DOF for the sparse, sparse+GPU, PCG1 and PCG2 solvers.]

The PCG solver works faster since it needs fewer I/O exchanges with the hard disk.
Need to evaluate Sparse & PCG behavior & speedup on your own model!
Scalability Considerations
- Load balance
- Improvements to domain decomposition
- Amdahl's Law
- Algorithmic enhancements: every part of the code has to run in parallel

User-controllable items:
- Contact pair definitions: big contact pairs hurt load balance (one contact pair is assigned to a single domain in the code)
- CE definitions: many CE terms hurt load balance and Amdahl's law (CEs need communication among the domains in which they are defined)
- Use the best and most suitable hardware possible (speed of the CPU, memory, I/O and interconnects)
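Amdahl's law, mentioned above, is worth making concrete: the serial fraction (load imbalance, CE communication, any non-parallel code) caps the achievable speedup no matter how many cores are added. A small sketch:

```python
def amdahl_speedup(parallel_fraction, n_cores):
    """Amdahl's law: speedup = 1 / (serial + parallel/n).

    Even a small serial fraction dominates at high core counts, which
    is why big contact pairs and heavy CE coupling hurt scalability.
    """
    serial = 1.0 - parallel_fraction
    return 1.0 / (serial + parallel_fraction / n_cores)

# Even with 95% of the work parallel, 64 cores deliver well under 64x:
print(round(amdahl_speedup(0.95, 64), 1))
```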
Good practices:
- Split a potential contact surface into smaller pieces.
- Internal CEs are generated by bonded contact; check the deformed shape.
- Defining torque by RBE3 on the end surface only is good practice.
- Check the contact pairs to make sure we don't have a case of bifurcation, and plot them.
APPENDIX
- Change the ANSYS path and the number of processors if necessary (-np x)
- Save and run the file "test_mpi14.bat"
- The expected result is shown below :
Batch launch flags:
- -b: batch mode
- -j: jobname
- -i: input file
- -o: output file
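For illustration, these flags can be assembled into a launch line programmatically. The executable name "ansys145" is an assumption for release 14.5; substitute the executable from your own installation.

```python
def ansys_batch_cmd(jobname, input_file, output_file, exe="ansys145"):
    """Build the argument list for a batch solve using the flags above.

    "ansys145" is a hypothetical executable name for release 14.5;
    adjust it to match your installation.
    """
    return [exe, "-b", "-j", jobname, "-i", input_file, "-o", output_file]

cmd = ansys_batch_cmd("bench1", "model.inp", "solve.out")
print(" ".join(cmd))
```

The list form can be passed to subprocess.run directly, avoiding shell-quoting issues.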
Possibility to use a GPU
Appendix
Automated runs for a model

General view
The goal of this Excel file is twofold:
- On the one hand, it writes the batch launch commands for multiple analyses into a file (job.bat).
- On the other hand, it extracts information from the different solve.out files and writes it into Excel.
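The extraction half of that workflow can be sketched in a few lines. The "Elapsed Time (sec) = ..." pattern below is an assumption about the solve.out line format; adjust the regular expression to the lines your output files actually contain.

```python
import re

def elapsed_times(out_texts):
    """Collect the elapsed time from several solve.out contents.

    out_texts maps a file name to its text; a missing timing line
    yields None, mirroring the pop-up case described below.
    """
    pat = re.compile(r"Elapsed Time \(sec\)\s*=\s*([\d.]+)")
    results = {}
    for name, text in out_texts.items():
        m = pat.search(text)
        results[name] = float(m.group(1)) if m else None
    return results

runs = {"run1.out": "Elapsed Time (sec) =      846.0",
        "run2.out": "no timing line here"}
print(elapsed_times(runs))
```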
INPUT DATA
INPUT DATA (1)
Name of the machines used for the solve with PCMPI (up to 3).
Not required if the solve is performed on a single machine.
INPUT DATA (2)

Description          Choice
Machine              1, 2 or 3
Solver               sparse or pcg
Division             any integer
Release              140 or 145
GPU                  yes or no
np total             any integer
PCG level            1, 2, 3 or 4
Simulation method    SMP or DMP
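The choices in this table are easy to validate before job.bat is written. A small sketch using the allowed values straight from the table (the two "any integer" fields are left unchecked):

```python
# Allowed values, taken from the input-data table above.
ALLOWED = {
    "Machine": {1, 2, 3},
    "Solver": {"sparse", "pcg"},
    "Release": {140, 145},
    "GPU": {"yes", "no"},
    "PCG level": {1, 2, 3, 4},
    "Simulation method": {"SMP", "DMP"},
}

def validate(row):
    """Return the names of the fields whose value is not allowed."""
    return [k for k, allowed in ALLOWED.items()
            if k in row and row[k] not in allowed]

row = {"Machine": 1, "Solver": "pcg", "Release": 145,
       "GPU": "maybe", "PCG level": 2, "Simulation method": "DMP"}
print(validate(row))
```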
INPUT DATA (3)
Create a job.bat file with all the input data given in the Excel file.
OUTPUT DATA
OUTPUT DATA (1)
Read the information from all the *.out files.
NB: all the files must be in the same directory.
If a *.out file is not found, a pop-up will appear:
OUTPUT DATA (2)

Extracted for both PCG and SPARSE runs:
- Elapsed time
- Solver rate
- Bandwidth (Gbytes/s): I/O rate
- Memory required / used
- Number of iterations (PCG)
OUTPUT DATA (3)

And now: we await your feedback and your results!
THANK YOU