HPC Best Practices for FEA

John Higgins, PE
Senior Application Engineer
ANSYS, Inc., May 18, 2012

Agenda
- Overview
- Parallel Processing Methods
- Solver Types
- Performance Review
- Memory Settings
- GPU Technology
- Software Considerations
- Appendix

Overview
Basic information: a model and a machine go in; the output data is the elapsed time.

The need for speed:
- Implicit structural FEA codes
- Mesh fidelity continues to increase
- More complex physics being analyzed
- Lots of computations!

Overview
Basic information:
- A model: size / number of DOF, analysis type
- A machine: number of cores, RAM
These feed the solver configuration, which determines the output data: the elapsed time.

Analyzing the model before launching the run helps you choose the most suitable solver configuration on the first attempt.

Overview
The full picture, from inputs to outputs:

Basic information
- A model: size / number of DOF, analysis type
- A machine: number of cores, RAM

Solver Configuration
- Parallel processing method: Shared Memory (SMP) or Distributed Memory (DMP)
- Solver type: Direct (Sparse) or Iterative (PCG)
- Memory settings

Information during the solve
- Resource Monitor: CPU, memory, disk, network

Output data
- Elapsed time
- Equation solver computational rate
- Equation solver effective I/O rate (bandwidth)
- Total memory used (in-core / out-of-core?)

Parallel Processing

Parallel Processing Hardware
Workstation/Server (single box):
- Shared memory (SMP), or
- Distributed memory (DMP)

Parallel Processing Hardware
Cluster (workstation cluster, node cluster):
- Distributed memory (DMP) across multiple boxes

Parallel Processing Hardware + Software

                     Laptop/Desktop or
                     Workstation/Server    Cluster
ANSYS                YES                   SMP (per node)
Distributed ANSYS    YES                   YES

Distributed ANSYS Design Requirements
- No limitation in simulation capability
- Reproducible and consistent results
- Support for all major platforms

Distributed ANSYS Architecture
Domain decomposition approach:
- Break the problem into N pieces (domains)
- Solve the global problem independently within each domain
- Communicate information across the boundaries as necessary
[Figure: a mesh partitioned into domains across Processors 1-4]

Distributed ANSYS Architecture
[Figure: process 0 (host) performs the domain decomposition and hands domains 0 through n-1 to processes 0 through n-1; each process runs element formation, assembly, solve, and element output on its own domain, with interprocess communication between processes; the results are combined at the end]

Distributed ANSYS Solvers
- Distributed sparse (default): supports all analyses supported by DANSYS (linear, nonlinear, static, transient)
- Distributed PCG: for static and full transient analyses
- Distributed LANPCG (eigensolver): for modal analyses
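
In MAPDL the solver type is picked with EQSLV (MODOPT for modal analyses), while distributed execution is controlled by the launch options (-dis -np N). A minimal sketch, assuming a static analysis:

  /solu
  eqslv,sparse         ! distributed sparse when launched with -dis
  ! eqslv,pcg,1e-8     ! or the PCG solver, with its convergence tolerance
  ! modopt,lanpcg,10   ! for a modal analysis: PCG Lanczos, here 10 modes
  solve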


Benefits of Distributed ANSYS
- The entire SOLVE phase runs in parallel; more computations performed in parallel means faster solution time
- Better speedups than SMP: can achieve > 4x on 8 cores (try getting that with SMP!)
- Can be used for jobs running on up to hundreds of cores
- Can take advantage of resources on multiple machines, so a whole new class of problems can be solved
- Memory usage and bandwidth scale
- Disk (I/O) usage scales (i.e., parallel I/O)

Solver Types

Solver Types: Solution Overview
Approximate share of CPU time by phase:
- Prep data: 5%
- Element formation: 10%
- Solution procedures / global assembly: 10%
- Solve [K]{x} = {b}: 70%
- Element stress recovery: 5%

The equation solver dominates solution CPU time, so pay attention to the equation solver! It also consumes the most system resources (memory and I/O).


System Resolution: Solver Architecture
[Figure: data flow between files and in-core objects; the database feeds element formation (esav, emat), symbolic assembly produces the .full file, which goes to the sparse solver or the PCG solver, and output/element output is written to .rst/.rth]

Solver Types: SPARSE (Direct)
Files:
- LN09
- *.BCS: statistics from the sparse solver
- *.full: assembled stiffness matrix

Solver Types: SPARSE (Direct)
PROS
- More robust on poorly conditioned problems (shells and beams)
- A solution is always guaranteed
- Fast for the 2nd and later solves (multiple load cases)
CONS
- Factoring the matrix and solving are resource intensive
- Large memory requirements

Solver Types: PCG (Iterative)
- Minimization of the residual/potential energy (standard conjugate gradient method), with residual {r} = {f} - [K]{u}
- An iterative process requiring a convergence test (PCGTOL)
- Preconditioned CG is used instead to reduce the number of iterations: a preconditioner [Q] ~ [K]^-1 is applied, [Q] being much cheaper to compute than [K]^-1
[Figure: convergence of the residual vs. number of iterations]

Solver Types: PCG (Iterative)
For an ill-conditioned model, PCGTOL needs to be tightened (1e-9 or 1e-10) so that ANSYS follows the same path (equilibrium iterations) as the direct solver.
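
A minimal sketch of tightening the tolerance in MAPDL (the second argument of EQSLV is the PCG tolerance; 1.0e-8 is the usual default):

  /solu
  eqslv,pcg,1e-10   ! tighter PCGTOL for an ill-conditioned model
  solve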


Solver Types: PCG (Iterative)
Files:
- *.PC*
- *.PCS: iterative solver statistics

Solver Types: PCG (Iterative)
PROS
- Lower memory requirements
- Better suited to larger, well-conditioned problems
CONS
- Not useful with near-rigid or rigid body behavior
- Less robust on ill-conditioned models (shells and beams, inadequate boundary conditions (rigid body motions), strongly elongated elements, nearly singular matrices): it becomes more difficult to approximate [K]^-1 with [Q]

Solver Types: PCG (Iterative)
Level of difficulty: the LOD number is available in the solver output (solve.out), but it can also be seen, along with the number of PCG iterations required to reach a converged solution, in the jobname.PCS file.

Solver Types: PCG (Iterative)
Other ways to evaluate ill-conditioning: an error message is also an indication. Although the message proposes changing a MULT coefficient, the model should be carefully reviewed first, and the SPARSE solver considered for the resolution instead.

Solver Types
Comparative
[Table: side-by-side comparison of the sparse and PCG solvers]

Performance Review

Performance Review
Process resource monitoring (only available on Windows 7): Windows Resource Monitor is a powerful tool for understanding how your system resources are used by processes and services in real time.

Performance Review
How to access the Resource Monitor:
- From the OS Task Manager (Ctrl + Shift + Esc), or
- Click Start, click in the Start Search box, type resmon.exe, and then press ENTER.

Performance Review
Process resource monitoring: CPU
[Screenshots: CPU usage during a Shared Memory (SMP) run vs. a Distributed Memory (DMP) run]

Performance Review
Process resource monitoring: memory
[Screenshots: memory usage before and during the solve, and the corresponding information in solve.out]


Performance Review
ANSYS End Statistics: basic information about the analysis solve, including the total elapsed time, is directly available at the end of the solver output file (*.out), in the Solution Information.

Performance Review
Other main output data to check:

Output Data                   Description
Elapsed Time (sec)            Total time of the simulation
Solver rate (Mflops)          Speed of the solver
Bandwidth (Gbytes/s)          I/O rate
Memory Used (Mbytes)          Memory required
Number of iterations (PCG)    Available for PCG only

Performance Review
[Screenshots: elapsed time, solver rate, bandwidth, memory used, and number of iterations as reported in the PCG (*.PCS) and SPARSE (*.BCS) statistics files]

Memory Settings

Memory Settings
SPARSE: Solver Output Statistics >> Memory Checkup
[Screenshot: memory statistics in the sparse solver output]

Memory Settings: Test Case 1
Test case 1: small model (needs 4 GB of scratch memory, less than the RAM). Reference machine: 6 GB RAM, so there is enough memory, but:
- Default, BCSOPTION,,OPTIMAL: elapsed time = 146 sec
- BCSOPTION,,INCORE: elapsed time = 77 sec
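
A minimal sketch of setting the sparse solver memory mode in MAPDL (the second field of BCSOPTION is the memory mode):

  /solu
  bcsoption,,incore     ! force in-core factorization (only when the scratch data fits in RAM)
  ! bcsoption,,optimal  ! default: optimal out-of-core
  solve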

[Screenshots: sparse solver memory statistics for test case 1 with BCSOPTION,,OPTIMAL (default) and BCSOPTION,,INCORE]

Memory Settings: Test Case 2
Test case 2: large model (needs 21.1 GB of scratch memory, more than the RAM). Reference machine: DELL M6400, 12 GB RAM, 2 SATA 7200 rpm disks in RAID 0.
Do not force in-core mode when the available memory is not enough!
- Default, BCSOPTION,,OPTIMAL: elapsed time = 1249 sec
- BCSOPTION,,INCORE: elapsed time = 4767 sec

[Screenshots: sparse solver memory statistics for test case 2 with BCSOPTION,,OPTIMAL (default) and BCSOPTION,,INCORE]

Memory Settings
SPARSE: Solver Output Statistics >> Memory Checkup. Example: 206 MB is available for the sparse solver at the time of factorization.
- This is sufficient to run in optimal out-of-core mode (which requires 126 MB) and obtain good performance.
- If more than 1547 MB were available, it would run fully in-core: the best performance.
- Avoid minimum out-of-core mode (memory less than 126 MB).

Summary of the modes for this model: in-core above 1547 MB; optimal out-of-core between 126 MB and ~1.5 GB; minimum out-of-core below 126 MB.

Memory Settings
SPARSE: three memory modes can be observed, from best to worst performance:
- In-core mode (optional): requires the most memory, performs no I/O
- Optimal out-of-core mode (default): balances memory usage and I/O
- Minimum core mode (not recommended): requires the least memory, performs the most I/O

Memory Settings: Test Case 3
Test case 3: a trap to avoid; launching a run on a network share (or a slow drive).
Solving on a local disk vs. a slow disk (networked or USB):
- Local disk: elapsed time = 1998 sec
- Slow disk: elapsed time = 3185 sec

Performance Review
PCG: solver statistics, *.PCS file >> number of iterations
[Screenshot: the PCS file reports the number of cores used (SMP, DMP), the preconditioner level from PCGOPT,Lev_Diff, and the number of iterations: an important statistic!]

Performance Review
PCG: solver statistics, *.PCS file >> number of iterations
Check the total number of PCG iterations:
- Fewer than 1000 iterations: good performance
- More than 1000 iterations: performance deteriorates; try increasing Lev_Diff (PCGOPT), as in the sketch below
- More than 3000 iterations: assuming you have already tried increasing Lev_Diff, either abandon PCG and use the sparse solver, or improve element aspect ratios, boundary conditions, and/or contact conditions
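
A minimal sketch of raising the preconditioner level in MAPDL (the first field of PCGOPT is Lev_Diff, 1 to 5; a higher level builds a stronger but more expensive preconditioner):

  /solu
  eqslv,pcg
  pcgopt,3     ! raise Lev_Diff when the iteration count is high
  solve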


Performance Review
If there are too many iterations:
- Use parallel processing
- Use PCGOPT,Lev_Diff
- Refine your mesh
- Check for excessively high stiffnesses

Performance Review
PCG: solver statistics, *.PCS file >> efficiency & solver memory usage
[Screenshot: efficiency and memory usage section of the PCS file]

GPU Technology

GPU Technology
Graphics processing units (GPUs):
- Widely used for gaming and graphics rendering
- Recently made available as general-purpose accelerators
- Support double-precision arithmetic
- Performance exceeding the latest multicore CPUs

So how can ANSYS Mechanical make use of this new technology to reduce the overall time to solution?

GPU Technology: Introduction
CPUs and GPUs are used in a collaborative fashion, connected by a PCI Express channel:
- CPU: a multi-core processor, typically 4-12 cores; powerful and general purpose
- GPU: a many-core processor, typically hundreds of cores; great for highly parallel code

GPU Accelerator Capability
Motivation: the equation solver dominates solution time, so it is the logical place to add GPU acceleration. Approximate share of solution time by phase:
- Element formation: 5%-30%
- Solution procedures: 5%-10%
- Global assembly: 5%-10%
- Equation solver (e.g., [A]{x} = {b}): 60%-90%
- Element stress recovery: 1%-10%

GPU Accelerator Capability
Accelerates the sparse direct solver (Boeing/DSP):
- The GPU is only used to factor a dense frontal matrix
- The decision on whether to send data to the GPU is based on the frontal matrix size:
  - Too small: too much overhead, so it stays on the CPU
  - Too large: it exceeds the GPU memory, so it stays on the CPU

GPU Accelerator Capability
Supported hardware:
- Currently recommending NVIDIA Tesla 20-series cards; support was recently added for the Quadro 6000
- Requires a larger power supply (one card needs about 225 W) and an open 2x form factor PCIe x16 Gen2 slot
- Supported on Windows/Linux 64-bit

                      Tesla C2050      Tesla C2070      Quadro 6000
Power                 225 W            225 W            225 W
CUDA cores            448              448              448
Memory                3 GB             6 GB             6 GB
Memory bandwidth      144 GB/s         144 GB/s         144 GB/s
Peak speed (SP/DP)    1030/515 Gflops  1030/515 Gflops  1030/515 Gflops

ANSYS Mechanical SMP GPU Speedup
[Chart: solver kernel speedups and overall speedups. System: Intel Xeon 5560 processors (2.8 GHz, 8 cores total), 32 GB RAM, Windows XP SP2 (64-bit), Tesla C2050 (ECC on; WDDM driver)]

Distributed ANSYS GPU Speedup @ 14.0
Vibroacoustic harmonic analysis of an audio speaker, using the direct sparse solver. Quarter-symmetry model with 700K DOF: 657,424 nodes, 465,798 elements, higher-order acoustic fluid elements (FLUID220/221).

Distributed ANSYS results (baseline is 1 core):

Cores    GPU    Speedup
2        no     2.25
4        no     4.29
2        yes    11.36
4        yes    11.51

With a GPU, ~11x speedup on 2 cores! DANSYS is 15-25% faster than SMP with the same number of cores.
[Chart: speedups for DANSYS, SMP, DANSYS+GPU, and SMP+GPU. Windows workstation: two Intel Xeon 5530 processors (2.4 GHz, 8 cores total), 48 GB RAM, NVIDIA Quadro 6000]

ANSYS Mechanical 14.0 Performance for Tesla C2075
V13sp-5 model: turbine geometry, 2,100K DOF, SOLID187 elements, static nonlinear, one iteration, direct sparse solver.

[Chart: ANSYS Mechanical times in seconds (lower is better), Xeon 5670 2.93 GHz Westmere (dual socket) alone vs. the same CPUs plus a Tesla C2075:]

Cores    CPU only    CPU + C2075    GPU speedup
1        1848 s      444 s          4.2x
2        1192 s      342 s          3.5x
4        846 s       314 s          2.7x
6        564 s       273 s          2.1x
8        516 s       270 s          1.9x
12       399 s

Adding a Tesla C2075 to 6 cores makes the run 46% faster than 12 CPU cores alone, with 6 cores left available for other tasks.
Results from an HP Z800 workstation: 2 x Xeon X5670 2.93 GHz, 48 GB memory, CentOS 5.4 x64; Tesla C2075, CUDA 4.0.17.

GPU Accelerator Capability
V13sp-5 benchmark (turbine model)
[Chart: sparse solver factorization speed (Mflops, up to ~180,000) vs. front size (0 to 1300 MB)]

ANSYS Mechanical Multi-Node GPU
Solder joint benchmark (4 MDOF, creep strain analysis; solder balls, mold, PCB), R14 Distributed ANSYS with and without GPU.
[Chart: total speedup at 16, 32, and 64 cores, with and without GPU; values shown include 1.7x, 1.9x, 3.2x, 3.4x, and 4.4x]
Linux cluster: each node contains 12 Intel Xeon 5600-series cores, 96 GB RAM, an NVIDIA Tesla M2070, and InfiniBand.
Results courtesy of MicroConsult Engineering, GmbH.

Trends in Performance by Solver Type
Comparative trends: elapsed time vs. number of DOF for PCG, sparse, and sparse+GPU. With multiple cores and GPUs, all trends can change due to differences in speedup.

Three areas can be defined:
I.   SPARSE is more efficient
II.  Either SPARSE or PCG can be used
III. The PCG solver works faster, since it needs fewer I/O exchanges with the hard disk

You need to evaluate the sparse and PCG behavior and speedup on your own model!

Other Software Considerations
Tips and tricks on performance gains:
- Some considerations on the scalability of DANSYS
- Working with solution differences
- Working with a case that does not scale (or hardly scales)
- Working with programmable features for parallel runs

Scalability Considerations
- Load balance
- Improvements in domain decomposition
- Amdahl's law (see the formula after this list)
- Algorithmic enhancements: every part of the code has to run in parallel
- User-controllable items:
  - Contact pair definitions: big contact pairs hurt load balance (one contact pair is put into one domain in our code)
  - CE definitions: many CE terms hurt load balance and Amdahl's law (CEs need communication among the domains in which they are defined)
- Use the best and most suitable hardware possible (CPU speed, memory, I/O, and interconnects)
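
For reference, Amdahl's law bounds the achievable speedup: if a fraction p of the work runs in parallel on N cores,

  Speedup(N) = 1 / ((1 - p) + p/N)

so even with p = 0.95, the speedup can never exceed 1/0.05 = 20x, no matter how many cores are added.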


Scalability Considerations: Contact
- Avoid overlapping contact surfaces if possible
- Define potential contact surfaces in smaller pieces: define a half circle as the target rather than the full circle
- Avoid defining a whole exterior surface as one target piece; break pairs into smaller pieces if possible
- Remember: one whole contact pair is processed on one processor (contact work cannot be spread out)

Scalability Considerations: Contact
Trim: avoid defining unused surfaces as contact or target, i.e., reduce the potential contact definition to the minimum:
- In rev. 12.0: use the new control CNCHECK,TRIM (see the sketch below)
- In rev. 11.0: turn NLGEOM,OFF when defining contact pairs in WB; WB then automatically turns on a facility like CNCHECK,TRIM internally
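
A minimal sketch of trimming contact pairs in MAPDL, issued after the contact pairs are defined and before solving:

  cncheck,trim    ! remove contact/target elements far from the possible contact zone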

Scalability Considerations: Remote Load/Displacement
Point load distribution (remote load): all nodes connected to one RBE3 node have to be grouped into the same domain. This hurts load balance! Try to reduce the number of RBE3 nodes.
[Figure: a point moment distributed to the internal surface of a hole, and the resulting deformed shape]

Example of Bonded Contact and Remote Loads: Universal Joint Model
- 14 bonded contact pairs, with internal CEs generated by the bonded contact
- Torque defined by an RBE3 on the end surface only: good practice
Because this model uses small pieces of contact and a small RBE3, it scales well in DANSYS.

Working With Solution Differences in Parallel Runs
- Most solution differences come from contact applications when comparing NP = 1 against NP = 2, 3, 4, 5, 6, 7, ...
- Check the contact pairs to make sure there is no case of bifurcation, and plot the deformations to see what is happening.
- Tighten the CNVTOL convergence tolerance to check solution accuracy (see the sketch below). If the solutions differ by less than, say, 1%, then parallel computing can make some difference in convergence, but all the solutions are acceptable.
- If the solution is well defined and all input settings are correct, report the case to ANSYS, Inc. for investigation.
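
A minimal sketch of tightening the force convergence tolerance in MAPDL (the third field of CNVTOL is the tolerance; 0.005, i.e. 0.5%, is the usual default for forces):

  /solu
  cnvtol,f,,0.001   ! tighten force convergence from 0.5% to 0.1%
  solve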


Working With a Case of Poor Scalability
No scalability (speedup) at all, or even slower than NP = 1:
- Is the problem too small (normally it should have more than 50K DOF)?
- Do I have a slow disk, or is the problem so big that the I/O size exceeds the memory I/O buffer?
- Is every node of my machines connected only through the public network?
- Look at the scalability of the solver first, not the entire run (e.g., /prep7 time is mostly not scalable)
- Resume the data at the /solu level; don't read in the input files on every run
- etc.

Working With a Case of Poor Scalability
Yes, I have scalability, but it is poor (say, speedup < 2x):
- Is this GigE or another slow interconnect?
- Are all processors sharing one disk (an NFS mount)?
- Are other people running jobs on the same machine at the same time?
- Do I have many big contact pairs, or a remote load or displacement tied to a major portion of the model?
- Am I using a generation of dual/quad-core CPUs where the memory bandwidth is completely shared among the cores?
- Look at the scalability of the solver first, not the entire run (e.g., /prep7 time is mostly not scalable)
- Resume the data at the /solu level; don't read in the input files on every run
- etc.

APPENDIX

Platform MPI Installation for ANSYS 14
Note for ANSYS Mechanical R13 users:
- Do not uninstall HP-MPI; it is required for compatibility with R13.
- Verify that HP-MPI is installed in its default location, C:\Program Files (x86)\Hewlett-Packard\HP-MPI; this is required for ANSYS Mechanical R13 to execute properly.

Platform MPI Installation for ANSYS 14
- Run setup.exe of the ANSYS R14 installation as Administrator
- Install Platform MPI
- Follow the Platform MPI installation instructions

Platform MPI Installation for ANSYS 14
Note for ANSYS Mechanical R13 users: customers who have R13 installed and wish to continue using R13 should run the following command to ensure compatibility:
"%AWP_ROOT140%\commonfiles\MPI\Platform\8.1.2\Windows\HPMPICOMPAT\hpmpicompat.bat"
(by default: C:\Program Files\ANSYS Inc\v140\commonfiles\MPI\Platform\8.1.2\Windows\HPMPICOMPAT\hpmpicompat.bat)
The command will display a dialog box with the title "ANSYS 13.0 SP1 Help".

Platform MPI Installation for ANSYS 14
To finish the installation:
- Go to %AWP_ROOT140%\commonfiles\MPI\Platform\8.1.2\Windows\setpcmpipassword.bat
(by default: C:\Program Files\ANSYS Inc\v140\commonfiles\MPI\Platform\8.1.2\Windows\setpcmpipassword.bat)
- Run "setpcmpipassword.bat", type your Windows user password, and press Enter.

Test MPI Installation for ANSYS 14
The installation is now finished. To verify that it works properly:
- Edit the file "test_mpi14.bat" attached in the .zip:
  "c:\program files\ansys inc\v140\ansys\bin\winx64\ansys140" -mpitest -mpi pcmpi -np 2
- Change the ANSYS path and the number of processors if necessary (-np x)
- Save and run the file "test_mpi14.bat"
[Screenshot: the expected result of the MPI test]

Test Case: Batch Launch (Sparse Solver)
- The file "cube_sparse_hpc.txt" is an input file for a simple analysis (pressure on a cube).
- Edit the file "job_sparse.bat" and change the ANSYS path and/or the number of processors if necessary.
- You can change the number of mesh divisions of the cube (-ndiv xx) to try out the performance of your machine.
- Save and run the file "job_sparse.bat".

Options used in "job_sparse.bat":
-b : batch mode
-np : number of processors
-j : jobname
-ndiv : number of divisions (for this example only)
-i : input file
-o : output file
-acc nvidia : use GPU acceleration
-mpi pcmpi : Platform MPI
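
For reference, a minimal APDL sketch of what a "pressure on a cube" input can look like (this is an illustration, not the actual contents of cube_sparse_hpc.txt; the element choice and material values are made up):

  /prep7
  et,1,solid185        ! 8-node structural solid (assumed element choice)
  mp,ex,1,2e11         ! Young's modulus, steel-like (illustrative)
  mp,prxy,1,0.3        ! Poisson's ratio
  block,0,1,0,1,0,1    ! unit cube
  esize,0.05           ! element size, i.e. 20 divisions per edge (the role of -ndiv)
  vmesh,all
  nsel,s,loc,z,0
  d,all,all            ! fix the bottom face
  nsel,s,loc,z,1
  sf,all,pres,1e6      ! pressure on the top face
  allsel
  /solu
  eqslv,sparse         ! direct sparse solver
  solve
  finish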


Test Case: Batch Launch (Sparse Solver)
- You can watch your processors working in the Windows Task Manager (Ctrl + Shift + Esc). Example with 6 processes requested:
[Screenshot: Task Manager during the run]
Advice: do not request all the available processors if you want to do something else while the job is running.

Test Case: Batch Launch (Sparse Solver)
Once the run is finished:
- Read the .out file to collect all the information about the solver output.
- The main pieces of information are:
  - Elapsed time (sec)
  - Latency time from master to core
  - Communication speed from master to core
  - Equation solver computational rate
  - Equation solver effective I/O rate

Test Case: Workbench Launch
- Open a Workbench project with ANSYS R14
- Open Mechanical
- Go to Tools -> Solve Process Settings -> Advanced
- Check "Distributed Solution", specify the number of processors to use, and enter the additional command (-mpi pcmpi) as shown below; there is also the possibility to use the GPU here:
[Screenshots: the Solve Process Settings dialog]

Test Case: Workbench Launch
In the analysis settings:
- You can choose the solver type (Direct = sparse, Iterative = PCG)
- Solve your model
- Read the solver output from the Solution Information

Appendix
Automated runs for a model:
- Compare customer results with an ANSYS reference
- A first step for an HPC test on a customer machine

General View
The goal of this Excel file is twofold:
- On the one hand, it writes the batch launch commands for multiple analyses into a file (job.bat)
- On the other hand, it extracts information from the different solve.out files and writes it into Excel
[Screenshot: the INPUT DATA and OUTPUT DATA areas of the spreadsheet]

INPUT DATA
[Screenshot: the input area, with numbered callouts]
1. Names of the machines used for the solve with PCMPI (up to 3). Not required if the solve is performed on a single machine.

2. Run parameters:

Parameter          Description                          Choice
Machine            Number of machines used              1, 2, or 3
Solver             Type of solver used                  sparse or pcg
Division           Divisions of the edge for meshing    Any integer
Release            ANSYS release                        140 or 145
GPU                Use GPU acceleration                 yes or no
np total           Total number of cores                No choice (calculated value)
np / machine       Number of cores per machine          Any integer
PCG level          Only available for the PCG solver    1, 2, 3, or 4
Simulation method  Shared or Distributed Memory         SMP or DMP

3. Create a job.bat file with all the input data given in the Excel file.

OUTPUT DATA
[Screenshot: the output area, with numbered callouts]

1. Read the information from all the *.out files. NB: all the files must be in the same directory. If a *.out file is not found, a pop-up appears:
- Continue: skip this file and go to the next one
- STOP: stop reading the remaining *.out files

2. Extracted output data:

Output Data                   Description
Elapsed Time (sec)            Total time of the simulation
Solver rate (Mflops)          Speed of the solver
Bandwidth (Gbytes/s)          I/O rate
Memory Used (Mbytes)          Memory required
Number of iterations (PCG)    Available for PCG only

All this information (elapsed time, solver rate, bandwidth, memory used, number of iterations) is extracted from the *.out files.
[Screenshots: PCG and SPARSE output excerpts]

3. Hyperlinks are automatically created to open the different *.out files directly from Excel. NB: if an error occurred during the solve (*** ERROR ***), it is automatically highlighted in the Excel file.

And now: we are waiting for your feedback and your results!

Any suggestions or questions for improving the Excel tool: gabriel.messager@ansys.com

THANK YOU