
Institute for Computational and Mathematical Engineering


Rocks Clusters and Object Storage

Steve Jones

Technology Operations Manager, Institute for Computational and Mathematical Engineering, Stanford University

Larry Jones

Vice President, Product Marketing

Panasas Inc.


Research Groups


Flow Physics and Computation

Aeronautics and Astronautics

Chemical Engineering

Center for Turbulence Research

Center for Integrated Turbulence Simulations

Thermo Sciences Division

Funding

Sponsored Research (AFOSR/ONR/DARPA/DURIP/ASC)


Active collaborations with the Labs

Buoyancy driven instabilities/mixing - CDP for modeling plumes (Stanford/SNL)

LES Technology - Complex Vehicle Aerodynamics using CDP (Stanford/LLNL)

Tsunami modeling - CDP for Canary Islands Tsunami Scenarios (Stanford/LANL)

Parallel I/O & Large-Scale Data Visualization - UDM integrated in CDP (Stanford/LANL)

Parallel Global Solvers - HyPre Library integrated in CDP (Stanford/LLNL)

Parallel Grid Generation - Cubit and related libraries (Stanford/SNL)

Merrimac - Streaming Supercomputer Prototype (Stanford/LLNL/LBNL/NASA)


Affiliates Program



The Research

MOLECULES TO PLANETS!


Tsunami Modeling


Preliminary calculations


Ward & Day, Geophysical J. (2001)


Landslide Modeling


• Extends existing Lagrangian particle-tracking capability in CDP

• Collision model based on the distinct element method*

• Originally developed for the analysis of rock mechanics problems

* Cundall P.A., Strack O.D.L., A discrete numerical model for granular assemblies, Géotechnique 29, No. 1, pp. 47–65 (1979).



“Some fear flutter because they don’t understand it, and some fear it because they do.” – von Kármán



9/12/97



Limit Cycle Oscillation



[Figure: torsional frequency (Hz) and torsional damping ratio (%) vs. Mach number (0.6 to 1.4), comparing flight test data (clean wing) with the 3D simulation; computed in 48 minutes on 240 processors.]


Databases?

Desert Storm (1991)

Iraq War (2003)

400,000 configurations to be flight tested


Potential

[Figure: damping coefficient (%) of the 1st torsion mode vs. Mach number (0.6 to 1.2), comparing flight test data, FOM (1,170 s.), and TP-ROM (5 s.).]


A Brief Introduction to Clustering and Rocks


Brief History of Clustering (very brief)


• NOW pioneering the vision for clusters of commodity processors

– David Culler (UC Berkeley) started in the early ’90s

– SunOS / SPARC

– First generation of Myrinet, active messages

– Glunix (Global Unix) execution environment

• Beowulf popularized the notion and made it very affordable

– Thomas Sterling, Donald Becker (NASA)

– Linux

© 2005 UC Regents


Types of Clusters

• Highly Available (HA)

– Generally small, less than 8 nodes

– Redundant components

– Multiple communication paths

– This is NOT Rocks

• Visualization clusters

– Each node drives a display

– OpenGL machines

– This is not core Rocks

– But there is a Viz Roll

• Computing (HPC clusters)

– AKA Beowulf

– This is core Rocks


© 2005 UC Regents


Definition: HPC Cluster Architecture


© 2005 UC Regents


The Dark Side of Clusters


• Clusters are phenomenal price/performance computational engines…

– Can be hard to manage without experience

– High performance I/O is still unresolved

– The effort to find where something has failed grows at least linearly with cluster size

• Not cost-effective if every cluster “burns” a person just for care and feeding

• Programming environment could be vastly improved

• Technology is changing rapidly

– Scaling up is becoming commonplace (128-256 nodes)

© 2005 UC Regents


Minimum Components

• Power

• Local hard drive

• Ethernet

• i386 (Athlon/Pentium), x86_64 (Opteron/EM64T), or ia64 (Itanium) server

© 2005 UC Regents


Optional Components

• High performance network

– Myrinet

– Infiniband (SilverStorm or Voltaire)

• Network addressable power distribution unit

• Keyboard/video/mouse network not required

– Non-commodity

– How do you manage your network?


© 2005 UC Regents


The Top 2 Most Critical Problems

• The largest problem in clusters is software skew

– When SW configuration on some nodes is different than on others

– Small differences (minor version numbers on libraries) can cripple a parallel program

• The second most important problem is adequate job control of the parallel process

– Signal propagation

– Cleanup
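A quick way to spot skew on a running Rocks cluster is to ask every node the same question with the stock cluster-fork tool (a minimal sketch; the packages queried here are only examples):

# Run the same query on every compute node and eyeball the answers for an odd one out.
cluster-fork 'rpm -q glibc'     # library/package versions
cluster-fork 'uname -r'         # kernel versions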

© 2005 UC Regents


Rocks (Open source clustering distribution)


• Technology transfer of commodity clustering to application scientists

– “Make clusters easy”

– Scientists can build their own supercomputers and migrate up to national centers as needed

• Rocks is a cluster on a CD

– Red Hat Enterprise Linux (open source and free)

– Clustering software (PBS, SGE, Ganglia, NMI)

– Highly programmatic software configuration management

• Core software technology for several campus projects


– BIRN, Center for Theoretical Biological Physics, EOL, GEON, NBCR, OptIPuter

• First SW release Nov. 2000

• Supports x86, Opteron, EM64T, and Itanium


© 2005 UC Regents


Philosophy


• The care and feeding of a system is not fun

• System administrators cost more than clusters

– 1 TFLOP cluster is less than $200,000 (US)

– Close to actual cost of a full-time administrator

• System administrator is the weakest link in the cluster

– Bad ones like to tinker

– Good ones still make mistakes


© 2005 UC Regents


Philosophy (continued)

• All nodes are 100% automatically configured

– Zero “hand” configuration

– This includes site-specific configuration

• Run on heterogeneous standard high volume components

– Use components that offer the best price/performance

– Software installation and configuration must support different hardware

– Homogeneous clusters do not exist

– Disk imaging requires a homogeneous cluster


© 2005 UC Regents


Philosophy (continued)

• Optimize for installation

– Get the system up quickly

– In a consistent state

– Build supercomputers in hours not months

• Manage through re-installation

– Can re-install 128 nodes in under 20 minutes

– No support for on-the-fly system patching

• Do not spend time trying to maintain system consistency

– Just re-install

– Can be batch driven

• Uptime in HPC is a myth

– Supercomputing sites have monthly downtime

– HPC is not HA
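Managing through re-installation can indeed be batch driven with the same two commands that appear later in the Panasas integration slide (a sketch, run as root on the frontend):

# Rebuild the distribution the nodes install from, then kick every node into a re-install.
rocks-dist dist
cluster-fork '/boot/kickstart/cluster-kickstart'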


© 2005 UC Regents


Rocks Basic Approach

1. Install a frontend

– Insert Rocks Base CD

– Insert Roll CDs (optional components)

– Answer 7 screens of configuration data

– Drink coffee (takes about 30 minutes to install)

2. Install compute nodes

– Login to frontend

– Execute insert-ethers

– Boot compute node with Rocks Base CD (or PXE)

– Insert-ethers discovers nodes

– Go to step 3

3. Add user accounts

4. Start computing
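As a rough transcript, steps 2-4 look like this from the frontend shell (a sketch; the user name and job script are illustrative, and qsub assumes the SGE or PBS roll is installed):

insert-ethers                   # step 2: select "Compute", then CD/PXE-boot each node;
                                #         nodes are discovered and install as compute-0-0, compute-0-1, ...
useradd jdoe                    # step 3: add a user account on the frontend
su - jdoe                       # step 4: start computing
qsub myjob.sh                   #         submit work to the scheduler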


• Optional Rolls

– Condor

– Grid (based on NMI R4)

– Intel (compilers)

– Java

– SCE (developed in Thailand)

– Sun Grid Engine

– PBS (developed in Norway)

– Area51 (security monitoring tools)

– Many others…

© 2005 UC Regents


The Clusters


Iceberg

• 600 Processor Intel Xeon 2.8GHz

• Fast Ethernet

– Install Date 2002

• 1 TB Storage

• Physical installation - 1 week

• Rocks installation tuning - 1 week



Iceberg at Clark Center

• One week to move and rebuild the cluster

• Then running jobs again



Top 500 Supercomputer



Nivation


• 164 Processor Intel Xeon 3.0GHz

• 4GB RAM per node

• Myrinet

• Gigabit Ethernet

• Two 1 TB NAS Appliances

• 4 Tools Nodes



[Diagram: Nivation network layout. Campus backbone; frontend server; tools-1 through tools-4; two NFS appliances on a GigE net (annotated "Eliminated Bottlenecks", "Redundancy", "400 MBytes/sec"); compute nodes interconnected by Myrinet; one element annotated "Huge Bottleneck / Single Point of Failure".]


Panasas Integration in less than 2 hours

• Installation and configuration of Panasas Shelf - 1 hour

• Switch configuration changes for link aggregation - 10 minutes

• Copy RPM to /home/install/contrib/enterprise/3/public/i386/RPMS - 1 minute

• create/edit extend-compute.xml - 5 minutes

# Add panfs to fstab
REALM=10.10.10.10
mount_flags="rw,noauto,panauto"
/bin/rm -f /etc/fstab.bak.panfs
/bin/rm -f /etc/fstab.panfs
/bin/cp /etc/fstab /etc/fstab.bak.panfs
/bin/grep -v "panfs://" /etc/fstab > /etc/fstab.panfs
/bin/echo "panfs://$REALM:global /panfs panfs $mount_flags 0 0" >> /etc/fstab.panfs
/bin/mv -f /etc/fstab.panfs /etc/fstab
/bin/sync

# Register the panfs init script and run the local check script
/sbin/chkconfig --add panfs
/usr/local/sbin/check_panfs

# Exclude the panfs file system and /panfs path from the nightly slocate update
LOCATECRON=/etc/cron.daily/slocate.cron
LOCATE=/etc/sysconfig/locate
LOCTEMP=/tmp/slocate.new
/bin/cat $LOCATECRON | sed "s/,proc,/,proc,panfs,/g" > $LOCTEMP
/bin/mv -f $LOCTEMP $LOCATECRON
/bin/cat $LOCATECRON | sed "s/\/afs,/\/afs,\/panfs,/g" > $LOCTEMP
/bin/mv -f $LOCTEMP $LOCATECRON

• [root@rockscluster]# rocks-dist dist ; cluster-fork ‘/boot/kickstart/cluster-kickstart’ - 30 minutes

• /etc/auto.home

userX -fstype=panfs panfs://10.x.x.x/home/userX - script it to save time
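The "script it" step can be a short loop over the existing home directories (a sketch; assumes homes already live under the /panfs mount and that 10.x.x.x stands in for the real realm address, as on the slide):

# Emit one /etc/auto.home entry per home directory on the Panasas realm.
REALM=10.x.x.x                  # placeholder, as above
for dir in /panfs/home/*; do
    user=$(basename "$dir")
    echo "$user -fstype=panfs panfs://$REALM/home/$user"
done >> /etc/auto.home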


Benchmarking Panasas using bonnie++

#!/bin/bash

#PBS -N BONNIE

#PBS -e Log.d/BONNIE.panfs.err

#PBS -o Log.d/BONNIE.panfs.out

#PBS -m aeb

#PBS -M hpcclusters@gmail.com

#PBS -l nodes=1:ppn=2

#PBS -l walltime=30:00:00

PBS_O_WORKDIR='/home/sjones/benchmarks'
export PBS_O_WORKDIR

### ---------------------------------------
### BEGINNING OF EXECUTION
### ---------------------------------------

echo The master node of this job is `hostname`
echo The job started at `date`
echo The working directory is `echo $PBS_O_WORKDIR`
echo This job runs on the following nodes:
echo `cat $PBS_NODEFILE`

### end of information preamble

cd $PBS_O_WORKDIR
cmd="/home/tools/bonnie++/sbin/bonnie++ -s 8000 -n 0 -f -d /home/sjones/bonnie"
echo "running bonnie++ with: $cmd in directory "`pwd`
$cmd >& $PBS_O_WORKDIR/Log.d/run9/log.bonnie.panfs.$PBS_JOBID
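The job file above was presumably submitted once per node under test, with the bonnie++ -d target pointed at NFS or PanFS storage to produce the result sets that follow (a sketch; the job file name is illustrative):

# Launch 8 (or 16) concurrent bonnie++ jobs; with ppn=2 each fills one dual-processor node.
for i in $(seq 1 8); do qsub bonnie.pbs; done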


NFS - 8 Nodes

Version 1.03           Sequential Output          Sequential Input    Random Seeks
                       -Block-      -Rewrite-     -Block-
Machine         Size   K/sec  %CP   K/sec  %CP    K/sec  %CP          /sec  %CP
compute-3-82   8000M    2323    0     348    0     5119    1          51.3    0
compute-3-81   8000M    2333    0     348    0     5063    1          51.3    0
compute-3-80   8000M    2339    0     349    0     4514    1          52.0    0
compute-3-79   8000M    2204    0     349    0     4740    1          99.8    0
compute-3-78   8000M    2285    0     354    0     3974    0          67.9    0
compute-3-77   8000M    2192    0     350    0     5282    0          46.8    0
compute-3-74   8000M    2292    0     349    0     5112    1          45.4    0
compute-3-73   8000M    2309    0     358    0     4053    0          64.6    0

17.80MB/sec for concurrent write using NFS with 8 dual processor jobs

36.97MB/sec during read process
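The summary figures appear to be the per-node block columns summed and converted to MB/sec; a sketch of the arithmetic, assuming the table above is saved as nfs8.txt (hypothetical file name) with the columns shown:

# Sum block-write (field 3) and block-read (field 7) K/sec over the data rows.
awk 'NR > 3 { w += $3; r += $7 }
     END    { printf "write %.2f MB/sec   read %.2f MB/sec\n", w/1024, r/1024 }' nfs8.txt
# prints approximately: write 17.85 MB/sec   read 36.97 MB/sec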


PanFS - 8 Nodes

Version 1.03           Sequential Output          Sequential Input    Random Seeks
                       -Block-      -Rewrite-     -Block-
Machine         Size   K/sec  %CP   K/sec  %CP    K/sec  %CP          /sec  %CP
compute-1-18   8000M   20767    8    4154    3    24460    7          72.8    0
compute-1-17   8000M   19755    7    4009    3    24588    7         116.5    0
compute-1-16   8000M   19774    7    4100    3    23597    7          96.4    0
compute-1-15   8000M   19716    7    3878    3    25384    8         213.6    1
compute-1-14   8000M   19674    7    4216    3    24495    7          72.8    0
compute-1-13   8000M   19496    7    4236    3    24238    7          71.0    0
compute-1-12   8000M   19579    7    4117    3    23731    7          97.1    0
compute-1-11   8000M   19688    7    4038    3    24195    8         117.7    0

154MB/sec for concurrent write using PanFS with 8 dual processor jobs

190MB/sec during read process


NFS - 16 Nodes

Version 1.03           Sequential Output          Sequential Input    Random Seeks
                       -Block-      -Rewrite-     -Block-
Machine         Size   K/sec  %CP   K/sec  %CP    K/sec  %CP          /sec  %CP
compute-3-82   8000M    1403    0     127    0     2210    0         274.0    2
compute-3-81   8000M    1395    0     132    0     1484    0          72.1    0
compute-3-80   8000M    1436    0     135    0     1342    0          49.3    0
compute-3-79   8000M    1461    0     135    0     1330    0          53.7    0
compute-3-78   8000M    1358    0     135    0     1291    0          54.7    0
compute-3-77   8000M    1388    0     127    0     2417    0          45.5    0
compute-3-74   8000M    1284    0     133    0     1608    0          71.9    0
compute-3-73   8000M    1368    0     128    0     2055    0          54.2    0
compute-3-54   8000M    1295    0     131    0     1650    0          47.4    0
compute-2-53   8000M    1031    0     176    0      737    0          18.3    0
compute-2-52   8000M    1292    0     128    0     2124    0         104.1    0
compute-2-51   8000M    1307    0     129    0     2115    0          48.1    0
compute-2-50   8000M    1281    0     130    0     1988    0          92.2    1
compute-2-49   8000M    1240    0     135    0     1488    0          54.3    0
compute-2-47   8000M    1273    0     128    0     2446    0          52.7    0
compute-2-46   8000M    1282    0     131    0     1787    0          52.9    0

20.59MB/sec for concurrent write using NFS with 16 dual processor jobs

27.41MB/sec during read process


PanFS - 16 Nodes

Version 1.03           Sequential Output          Sequential Input    Random Seeks
                       -Block-      -Rewrite-     -Block-
Machine         Size   K/sec  %CP   K/sec  %CP    K/sec  %CP          /sec  %CP
compute-1-26   8000M   14330    5    3392    2    28129    9          54.1    0
compute-1-25   8000M   14603    5    3294    2    30990    9          60.3    0
compute-1-24   8000M   14414    5    3367    2    28834    9          55.1    0
compute-1-23   8000M    9488    3    2864    2    17373    5         121.4    0
compute-1-22   8000M    8991    3    2814    2    21843    7         116.5    0
compute-1-21   8000M    9152    3    2881    2    20882    6          80.6    0
compute-1-20   8000M    9199    3    2865    2    20783    6          85.2    0
compute-1-19   8000M   14593    5    3330    2    29275    9          61.0    0
compute-1-18   8000M    9973    3    2797    2    18153    5         121.6    0
compute-1-17   8000M    9439    3    2879    2    22270    7          64.9    0
compute-1-16   8000M    9307    3    2834    2    21150    6          99.1    0
compute-1-15   8000M    9774    3    2835    2    20726    6          77.1    0
compute-1-14   8000M   15097    5    3259    2    32705   10          60.6    0
compute-1-13   8000M   14453    5    2907    2    36321   11         126.0    0
compute-1-12   8000M   14512    5    3301    2    32841   10          60.4    0
compute-1-11   8000M   14558    5    3256    2    33096   10          62.2    0

187MB/sec for concurrent write using PanFS with 16 dual processor jobs

405MB/sec during read process

Capacity imbalances across jobs - 33MB/sec increase from the 8-job to the 16-job run


Panasas statistics during write process

[pancli] sysstat storage

IP             CPU   Disk   Ops/s    KB/s              Capacity (GB)
               Util  Util            In       Out      Total  Avail  Reserved
10.10.10.250    55%   22%     127    22847     272       485    367        48
10.10.10.253    60%   24%     140    25672     324       485    365        48
10.10.10.245    53%   21%     126    22319     261       485    365        48
10.10.10.246    55%   22%     124    22303     239       485    366        48
10.10.10.248    57%   22%     134    24175     250       485    369        48
10.10.10.247    52%   21%     124    22711     233       485    366        48
10.10.10.249    57%   23%     135    24092     297       485    367        48
10.10.10.251    52%   21%     119    21435     214       485    366        48
10.10.10.254    53%   21%     119    21904     231       485    367        48
10.10.10.252    58%   24%     137    24753     300       485    366        48
Total "Set 1"   55%   22%    1285   232211    2621      4850   3664       480

Sustained BW 226 MBytes/Sec during 16 1GB concurrent writes


Panasas statistics during read process

[pancli] sysstat storage

IP             CPU   Disk   Ops/s    KB/s              Capacity (GB)
               Util  Util            In       Out      Total  Avail  Reserved
10.10.10.250    58%   95%     279      734   21325       485    355        48
10.10.10.253    60%   95%     290      727   22417       485    353        48
10.10.10.245    54%   92%     269      779   19281       485    353        48
10.10.10.246    59%   95%     290      779   21686       485    354        48
10.10.10.248    60%   95%     287      729   22301       485    357        48
10.10.10.247    52%   91%     256      695   19241       485    356        48
10.10.10.249    57%   93%     276      708   21177       485    356        48
10.10.10.251    49%   83%     238      650   18043       485    355        48
10.10.10.254    45%   82%     230      815   15225       485    355        48
10.10.10.252    57%   94%     268      604   21535       485    354        48
Total "Set 1"   55%   91%    2683     7220  202231      4850   3548       480

Sustained BW 197 MBytes/Sec during 16 1GB concurrent sequential reads


This is our typical storage utilization with the cluster at 76%

[pancli] sysstat storage

IP             CPU   Disk   Ops/s    KB/s              Capacity (GB)
               Util  Util            In       Out      Total  Avail  Reserved
10.10.10.250     6%    5%      35      292     409       485    370        48
10.10.10.253     5%    4%      35      376     528       485    368        48
10.10.10.245     4%    3%      29      250     343       485    368        48
10.10.10.246     6%    4%      28      262     373       485    369        48
10.10.10.248     5%    3%      27      234     290       485    372        48
10.10.10.247     3%    3%       1        1       2       485    370        48
10.10.10.249     5%    3%      48      258     365       485    371        48
10.10.10.251     4%    3%      46      216     267       485    369        48
10.10.10.254     4%    3%      32      256     349       485    370        48
10.10.10.252     4%    3%      34      337     499       485    370        48
Total            4%    3%     315     2482    3425      4850   3697       480

Sustained BW 2.42 Mbytes/sec in - 3.34 Mbytes/sec out

[root@frontend-0 root]# showq

ACTIVE JOBS--------------------
JOBNAME   USERNAME   STATE     PROC   REMAINING             STARTTIME

8649      sjones     Running      2   1:04:55:08    Sun May 15 18:20:23
8660      user1      Running      6   1:23:33:15    Sun May 15 18:58:30
8524      user2      Running     16   2:01:09:51    Fri May 13 20:35:06
8527      user3      Running     16   2:01:23:19    Fri May 13 20:48:34
8590      user4      Running     64   3:16:42:50    Sun May 15 10:08:05
8656      user5      Running     16   4:00:55:36    Sun May 15 18:20:51
8647      user6      Running      5   99:22:50:42   Sun May 15 18:15:58

     7 Active Jobs   125 of 164 Processors Active (76.22%)
                      65 of  82 Nodes Active      (79.27%)


Panasas Object Storage


Requirements for Rocks

• Performance


– High read concurrency for parallel application and data sets

– High write bandwidth for memory checkpointing, interim and final output

• Scalability

– More difficult problems typically mean larger data sets

– Scaling cluster nodes requires scalable IO performance

• Management

– Single system image maximizes utility for user community

– Minimize operations and capital costs


Shared Storage: The Promise


• Shared storage cluster computing

– Compute anywhere model

• Partitions available globally; no replicas required (shared datasets)

• No data staging required

• No distributed data consistency issues

• Reliable checkpoints; application reconfiguration

• Results gateway

– Enhanced reliability via RAID

– Enhanced manageability

• Policy-based management (QoS)


Shared Storage Challenges


• Performance, scalability & management

– Single file system performance limited

– Multiple volumes and mount points

– Manual capacity and load balancing

– Large quantum upgrade costs


Motivation for New Architecture


• A highly scalable, interoperable, shared storage system

– Improved storage management

• Self-management, policy-driven storage (e.g. backup and recovery)

– Improved storage performance

• Quality of service, differentiated services

– Improved scalability

• Of performance and metadata (e.g. free block allocation)

– Improved device and data sharing

• Shared devices and data across OS platforms


Next Generation Storage Cluster

[Diagram: Linux compute cluster connected to the Panasas Storage Cluster; a control path runs to the metadata managers while parallel data paths run directly to the object storage devices. Single step: perform the job directly from the high-I/O Panasas Storage Cluster.]

• Scalable performance

– Offloaded data path enables direct disk-to-client access

– Scale clients, network and capacity

– As capacity grows, performance grows

• Simplified and dynamic management

– Robust, shared file access by many clients

– Seamless growth within a single namespace eliminates time-consuming admin tasks

• Integrated HW/SW solution

– Optimizes performance and manageability

– Ease of integration and support


Object Storage Fundamentals


• An object is a logical unit of storage

– Lives in flat name space with ID

• Contains application data & attributes

– Metadata: block allocation, length

– QoS requirements, capacity quota, etc.

• Has file-like methods

– create, delete, read, write

• Three types of objects:

– Root Object - one per device

– Group Object - a “directory” of objects

– User Object - for user data

• Objects enforce access rights

– Strong capability-based access control



Panasas ActiveScale File System



Panasas Hardware Design

– Stores objects using SATA (500 GB)

– Stores up to 5 TB per shelf; 55 TB per rack!

– 16-Port GE Switch

– Orchestrates system activity

– Balances objects across StorageBlades

– Redundant power + battery

KEY: Hardware maximizes the next generation file system


Ease of Management


Panasas Addresses Key Drivers of TCO

Management Problem                          Panasas Solution
- Multiple physical & logical data sets     + Single, seamless namespace
- Manual allocation of new storage          + Automatic provisioning
- Ongoing adjustments for efficiency        + Dynamic load balancing
- System backup                             + Scalable snapshots and backup
- Downtime and recovery                     + Advanced RAID
- Security breaches                         + Capability-controlled access over IP

80% of Storage TCO

Source: Gartner


Industry-Leading Performance


• Breakthrough data throughput AND random I/O

– Tailored offerings deliver performance and scalability for all workloads


Support Rocks Clusters


1. Register Your Rocks Cluster

http://www.rocksclusters.org/rocks-register/

Cluster       Organization        Processor Type   CPU count   CPU Clock (GHz)   FLOPS (GFLOPS)
Iceberg       Bio-X @ Stanford    Pentium 4              604              2.80          3382.40
Firn          ICME @ Stanford     Pentium 3              112              1.00           112.00
Sintering     ICME @ Stanford     Opteron                 48              1.60           153.60
Regelation    ICME @ Stanford     Pentium 4                8              3.06            48.96
Gfunk         ICME @ Stanford     Pentium 4              164              2.66           872.48
Nivation      ICME @ Stanford     Pentium 4              172              3.06          1052.60

Current Rocks Statistics (as of 06/26/2005 22:55:00 Pacific)

486 Clusters

29498 CPUs

134130.8 GFLOPS

2. Demand your vendors support Rocks!


Thank you


For more information about Steve Jones:

High Performance Computing Clusters http://www.hpcclusters.org

For more information about Panasas:

http://www.panasas.com

For more information about Rocks:

http://www.rocksclusters.org