
The 3rd International Conference on Emerging Ubiquitous Systems and Pervasive Networks
www.iasks.org/conferences/EUSPN2011
Amman, Jordan
October 10-13, 2011

Challenges to High Productivity Computing Systems and Networks

Mohammad Malkawi
Dean of Engineering, Jadara University
mmalkawi@aimws.com

Outline

High Productivity Computing Systems (HPCS) - The Big Picture
The Challenges
IBM PERCS
Cray Cascade
SUN Hero Program
Cloud Computing

HPCS: The Big Picture

Manufacture and deliver a petaflop-class computer
Complex architecture
High performance
Easier to program
Easier to use

HPCS Goals

Productivity: reduce code development time
Processing power: floating-point and integer arithmetic
Memory: large size, high bandwidth and low latency
Interconnection: large bisection bandwidth

HPCS Challenges

High effective bandwidth: high-bandwidth, low-latency memory systems
Balanced system architecture: processors, memory, interconnects, programming environments
Robustness: hardware and software reliability, compute-through-failure, intrusion identification and resistance techniques

HPCS Challenges

Performance measurement and prediction: a new class of metrics and benchmarks to measure and predict the performance of system architectures and application software
Scalability: adapt and optimize to changing workloads and user requirements, e.g., multiple programming models, selectable machine abstractions, and configurable software/hardware architectures

Productivity Challenges

Quantify productivity for code development and production
Identify the characteristics of application codes, workflow, bottlenecks and obstacles
Capture lessons learned, so that decisions by the productivity team and the vendors are based on real data rather than anecdotal evidence

Did Not Learn the Lessons

[Figure 2: Defect arrival rate for R8, R9 and R10]

Productivity Dilemma - 1

Diminishing productivity is alarming in:
Coding
Debugging
Optimizing
Modifying
Over-provisioning hardware
Running high-end applications

Productivity Dilemma - 2

Not long ago, a computational scientist could personally write, debug and optimize code to run on a leadership-class high performance computing system without the help of others.
Today, programming a cluster of machines is significantly more difficult than traditional programming, and the scale of the machines and problems has increased more than 1,000 times.

Productivity Dilemma - 3

Owning and running high-end computational facilities for nuclear research, seismic modeling, gene sequencing or business intelligence takes a sizeable investment in staffing, procurement and operations.
Applications achieve 5 to 10 percent of the theoretical peak performance of the system.
Applications must be restarted from scratch every time a hardware or software failure interrupts the job.

HPCS Trends: Productivity Crisis

High Productivity Computing

Scaling the program without scaling the programmer
Bandwidth enables productivity and allows for simpler programming environments and systems with greater fault tolerance

Language Challenges

MPI is a fairly low-level programming interface: reliable, predictable, and it works
Extensions of Fortran, C and C++
New languages with a higher level of abstraction, to improve legacy applications and scale to petascale levels:
SUN Fortress
IBM X10
Cray Chapel
OpenMP

Global View Programming Model

Global-view programs present a single, global view of the program's data structures.
Execution begins with a single main thread; parallel execution then spreads out dynamically as work becomes available.
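
As a rough analogy (my sketch, not from the talk), the single-main-thread, work-spreading shape can be seen in plain C with OpenMP tasks; the HPCS global-view languages (X10, Chapel, Fortress) go further by also making distributed data structures look global. The process() routine and chunk count below are placeholders.

/* Minimal OpenMP-in-C sketch of the "single main thread, work spreads
 * out dynamically" model described above (an analogy, not an HPCS
 * language). */
#include <stdio.h>
#include <omp.h>

static long process(int chunk)            /* placeholder work item */
{
    long s = 0;
    for (int i = 0; i < 1000000; i++)
        s += (long)chunk * i % 7;
    return s;
}

int main(void)
{
    long total = 0;
    #pragma omp parallel                  /* a thread team exists...     */
    #pragma omp single                    /* ...but one thread drives it */
    {
        for (int c = 0; c < 64; c++) {
            #pragma omp task firstprivate(c) shared(total)
            {
                long r = process(c);      /* work spreads out dynamically */
                #pragma omp atomic
                total += r;
            }
        }
    }                                     /* implicit barrier: all tasks done */
    printf("total = %ld\n", total);
    return 0;
}

Compile with gcc -fopenmp; the point is only the execution shape: one logical program, one starting thread, parallelism created as work appears.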

Unprecedented Performance Leap

Performance targets require aggressive improvements in system parameters traditionally ignored by the "Linpack" benchmark.
Improve system performance under the most demanding benchmarks (GUPS).
Determine whether general applications will be written or modified to benefit from these features.

Trade-Offs

Portability versus innovation
Abstraction versus difficulty of programming and performance overhead
Shared memory versus message passing

Cost of Petascale Computing

Requires petabytes of memory
On the order of 10^6 processors
Hundreds of petabytes of disk storage for capacity and bandwidth
Power consumption and cost for DRAM and disks (tens of megawatts)
Operational cost

The DARPA HPCS Program

The first major program to devote effort to making high-end computers more user-friendly:
Mask the difficulty of developing and running codes on HPCS systems
Mask the challenge of getting good performance for a general code
Fast, large, low-latency RAM
Fast processing
A quantitative measure of productivity

IBM HPCS EXAMPLE

IBM HPCS Program: PERCS (2011)

Productive, Easy-to-use, Reliable Computing System
Rich programming environment:
Develop new applications and maintain existing ones
Support existing programming models and languages
Scalability to the peta level
Automated performance tuning tasks
Rich graphical interfaces
Automated monitoring and recovery tasks
Fewer system administrators handling larger systems more effectively

IBM Blue Gene HPCS Base

IBM Approach - Hardware

Innovative processor chip design, leveraging the POWER processor server line
Lower soft error rates (SER)
Reduce memory access latency by placing the processors close to large memory arrays
Multiple chip configurations to suit different workloads

IBM Approach - Software

A large set of tools integrated into a modern, user-friendly programming environment
Support for legacy programming models and languages (MPI, OpenMP, C, C++, Fortran, etc.)
Support for emerging ones (PGAS)
Design of a new experimental programming language, called X10

X10 Features

Designed for parallel processing from the ground up
Falls into the Partitioned Global Address Space (PGAS) category
Balances high-level abstraction against exposing the topology of the system
Asynchronous interactions among the parallel threads
Avoids the blocking synchronization style

CRAY HPCS EXAMPLE

Multiple Processing Technologies

In high performance computing, one size does not fit all:
Heterogeneous computing using custom processing technologies
Performance achieved via deeper pipelining and more complex microarchitectures
Introduction of multi-core processors:
Further stresses processor-memory balance issues
Drives up the number of processors required to solve large problems

Specialized Computing Technologies

Vector processing and field programmable gate arrays (FPGAs):
Ability to extract more performance from the transistors on a chip with less control overhead
Allow higher processor performance with lower power
Reduce the number of processors required to solve a given problem
Vector processors tolerate memory latency extremely well

Specialized Computing Technologies

Multithreading improves latency tolerance
The Cascade design will combine multiple computing technologies:
Pure scalar nodes, based on Opteron microprocessors
Nodes providing vector, massively multithreaded, and FPGA-based acceleration
Nodes that can adapt their mode of operation to the application

Cray: The Cascade Approach

Scalable, high-bandwidth system:
Globally addressable memory
Heterogeneous processing technologies
Fast serial execution
Massive multithreading
Vector processing and FPGA-based application acceleration
Adaptive supercomputing: the system adapts to the application rather than requiring the programmer to adapt the application to the system

Cascade Approach

Builds on the Cray T3E massively parallel system
Uses a best-of-class microprocessor
Processors directly access global memory with very low overhead and at very high data rates
Hierarchical address translation allows the processors to access very large data sets without suffering from TLB faults
AMD's Opteron will be the base processor for Cascade

Cray Adaptive Supercomputing

The system adapts to the application:
The user logs into a single system and sees one global file system
The compiler analyzes the code to determine which processing technology best fits it
The scheduling software automatically deploys the code on the appropriate nodes

Balanced Hardware Design

A balanced hardware design complements processor flops with memory, network and I/O bandwidth:
Scalable performance
Improved programmability and breadth of applicability
Balanced systems also require fewer processors to scale to a given level of performance, reducing failure rates and administrative overhead

Cray: System Bandwidth Challenge

The Cascade program is attacking this problem on two fronts:
Signalling technology
Network design
Provide truly massive global bandwidth at an affordable cost
A key part of the design is a common, globally addressable memory across the whole machine
Efficient, low-overhead communication

Cray: System Bandwidth Challenge

Accessing remote data is as simple as issuing a load or store instruction, rather than calling a library function to pass messages between processors
Allows many outstanding references to be overlapped with each other and with ongoing computation

Cray Programming Model

Support MPI for legacy purposes
Unified Parallel C (UPC) and Coarray Fortran (CAF):
Simpler and easier to write than MPI
Reference memory on remote nodes as easily as referencing memory on the local node
Data sharing is much more natural
Communication overhead is much lower
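
A hedged illustration (my sketch, not Cray's code): UPC and CAF express a remote read as an ordinary array or coarray reference; standard MPI one-sided communication in C gives a rough, portable approximation of that one-sided style, in contrast to two-sided send/receive.

/* Hedged C sketch: standard MPI one-sided communication as a rough
 * stand-in for the PGAS style described above. UPC/CAF would express
 * the same remote read as a plain array reference; here MPI_Get plays
 * that role. */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* Each rank exposes one double through a window; together these
     * behave like a small globally addressable array of size nprocs. */
    double mine = 100.0 * rank, remote = 0.0;
    MPI_Win win;
    MPI_Win_create(&mine, sizeof(double), sizeof(double),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    int target = (rank + 1) % nprocs;     /* read the neighbour's value */

    MPI_Win_fence(0, win);
    MPI_Get(&remote, 1, MPI_DOUBLE, target, 0, 1, MPI_DOUBLE, win);
    MPI_Win_fence(0, win);                /* remote value is now usable */

    printf("rank %d read %.1f from rank %d\n", rank, remote, target);

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}

No receiving code runs on the target side; that one-sided property is what keeps communication overhead low in the models the slide describes.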

Chapel: The Cray HPCS Language

Support for graphs, hash tables, sparse arrays, and iterators
Ability to separate the specification of an algorithm from the structural details of the computation, including:
Data layouts
Work decomposition and communication
Simplifies the creation of the basic algorithms
Allows these structural components to be gradually tuned over time

Cray's Programming Tools

Reduce the complexity of working on highly scalable applications
The Cascade debugger solution will:
Focus on data rather than control
Support application porting
Allow scaling commensurate with the application
Integrated development environment (IDE)

Cascade Performance Analysis Tools

Hardware performance counters
Software introspection techniques
Present the user with insight, rather than statistics
Act as a parallel programming expert:
Provide high-level feedback on program behaviour
Provide suggestions for program modifications to remove key bottlenecks or otherwise improve performance

SUN HPCS EXAMPLE

Evolution of HPCS at SUN

Grid:
Loosely coupled heterogeneous resources
Multiple administrative domains
Wide area network
Clusters:
Tightly coupled high performance systems
Message passing (MPI)
Ultrascale:
Distributed scalable systems
High productivity shared memory systems
High bandwidth, global address space, unified administration tools

SUN Approach: The Hero System

Rich bandwidth
Low latencies
Very high levels of fault tolerance
Highly integrated toolset to scale the program and not the programmers
Multithreading technologies (> 100 concurrent threads)

SUN Approach: The Hero System

Globally addressable memory
System-level and application checkpointing
Hardware and software telemetry for dramatically improved fault tolerance
The system appears more like a flat memory system
Focus on solving the problem at hand rather than making elaborate efforts to distribute data in a robust manner

Definition: Bisection Bandwidth

A standard metric for a system's ability to globally move data
Split a system into equal halves such that there is the minimum number of connections across the split; the bandwidth across the split is the bisection bandwidth
Example: an all-to-all interconnect between 8 cabinets has 28 total connections, of which 16 cross the bisection and 12 do not (checked in the short calculation below)
High-bandwidth optical connections are key to meeting the HPCS peta-scale bisection bandwidth target
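
The counts in the example can be verified with a short calculation (my sketch, not from the slides): an all-to-all network on n cabinets has n(n-1)/2 links, and an equal split leaves (n/2)^2 of them crossing the cut.

/* Quick check of the 8-cabinet all-to-all example:
 * total links = n(n-1)/2, links crossing an equal split = (n/2)^2. */
#include <stdio.h>

int main(void)
{
    int n = 8;                          /* cabinets, fully connected */
    int total    = n * (n - 1) / 2;     /* 28 links in all           */
    int crossing = (n / 2) * (n / 2);   /* 16 links cross the cut    */
    int within   = total - crossing;    /* 12 links stay on one side */

    printf("total=%d crossing=%d within=%d\n", total, crossing, within);
    /* Bisection bandwidth = crossing links x per-link bandwidth. */
    return 0;
}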

System Bandwidth Over Time

A giant leap in productivity is expected

High Bandwidth Required by HPCS

Radical changes from today's architecture are necessary

Motivation for Higher Bandwidth

Growing BW demand in HPCS

Multicore CPUs: aggregation of multiple cores is unstoppable, and copper interconnects are stressed at very large scale
Silicon photonics is the solution, since it brings the potential of unlimited BW on the best medium, allowing for large aggregation of multicore CPUs

Growing BW demand in HPCS

Clusters are growing in number of nodes and in performance per node
Interconnects are the limiting factor in BW, latency and distance
Protocols reduce latency, while copper increases latency
Silicon photonics brings high BW and low latency

Growing BW demand in HPCS

Storage I/O BW is increasing exponentially due to faster data rates and the parallelism introduced by striping technologies
WDM will eventually allow 10 Tb of data to be transmitted down a single piece of fiber
Silicon photonics is at the beginning of its life cycle, with headroom for explosive BW growth without any increase in latency or reduction in reach

Proximity + CMOS Photonics

Proximity Communication -2

Proximity Communication -3

Proximity Communication

Capacitive coupling enables high-speed data communication between neighboring chips without the need for wires of any kind
Metal plates on one chip are aligned with metal plates on a neighboring chip, and data is transferred between them
Reduced power
Improved cross-section bandwidth and communication power

Proximity Communication - SUN

3.6 x 4.1 mm test chip
0.35 um technology
50 um bit pitch
1.35 Gbps/channel for 16 simultaneous channels
< 10^-12 BER @ 1 Gbps
3.6 mW/channel static power
3.9 pJ/bit dynamic power
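
A back-of-envelope reading of these figures (my arithmetic, not from the slide): 16 channels at 1.35 Gbps give about 21.6 Gbps aggregate, and 3.9 pJ/bit at 1.35 Gbps is roughly 5.3 mW of dynamic power per channel on top of the 3.6 mW static figure.

/* Back-of-envelope numbers implied by the test-chip figures above. */
#include <stdio.h>

int main(void)
{
    double gbps_per_ch = 1.35;            /* Gbps per channel        */
    int    channels    = 16;
    double static_mw   = 3.6;             /* mW per channel, static  */
    double pj_per_bit  = 3.9;             /* dynamic energy per bit  */

    double agg_gbps  = gbps_per_ch * channels;    /* ~21.6 Gbps           */
    double dyn_mw_ch = pj_per_bit * gbps_per_ch;  /* pJ/bit * Gbit/s = mW */
    double total_mw  = channels * (static_mw + dyn_mw_ch);

    printf("aggregate bandwidth : %.1f Gbps\n", agg_gbps);
    printf("dynamic power/chan  : %.2f mW\n", dyn_mw_ch);
    printf("total power (16 ch) : %.1f mW\n", total_mw);
    return 0;
}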

Proximity Communication -4

Proximity Communication -5

Low Cost, Low Power Optics

DWDM CMOS Photonics

CMOS Photonics Module

SUN Programming Model

Simpler code with high-bandwidth shared memory
[Chart: NAS Parallel Benchmark CG (Conjugate Gradient), lines of code]

SUN Fortress Language

To do for Fortran what Java did for C:
Catch stupid mistakes
Extensive libraries
Platform independence
Security model
Type safety
Multithreading
Dynamic compilation

Object-Based Smart Storage

Object storage file systems for massive scalability and extreme performance

Ultra-scale Computing in 2010

Simpler development environments will make HPC more accessible to a diverse range of users
Lone researchers and small teams will once again be able to harness the computational power of leadership-class systems
Many gaps between commercial and scientific computing will narrow

Cloud Computing

Service computing
The net is the computer
More than 100 vendors
Growing fast
Programming environment

BACKUP SLIDES

HPCS Technologies
Some Publicly Announced Projects

IBM HPCS - PERCS

Open source operating systems and hypervisors will provide HPC-oriented:
Virtualization
Security
Resource management
Affinity control
Resource limits
Checkpoint-restart and reliability features that will improve the robustness and availability of the system

MPI Paradigm

Writing applications in MPI requires breaking up all the data and computation into a large number of discrete pieces, and then using library code to explicitly bundle up data and pass it between processors in messages whenever processors need to share data.
It's a cumbersome affair that distracts scientists from their primary focus.
Once an application is written, it's generally a time-consuming process to debug and tune it.
Traditional debugging models just don't scale well to thousands or tens of thousands of processors (try opening 10,000 debugger windows, one for each thread!).
Trying to figure out why your application isn't getting the performance you think it should is also exceedingly difficult at large scales.
Traditional profiling and even sophisticated statistics-gathering may be insufficient to ascertain why performance is lagging, much less how to change the code to improve it.
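
For readers who have not written MPI, a minimal C sketch of the explicit decompose-and-message style described above (a generic 1D halo exchange; my example, not code from the talk, with NLOCAL and the array u chosen for illustration):

/* Minimal MPI sketch of the explicit style described above: the data
 * is split into per-rank pieces, and boundary values must be bundled
 * into messages and exchanged by hand before each compute step. */
#include <stdio.h>
#include <mpi.h>

#define NLOCAL 1000                       /* interior points per rank */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* u[0] and u[NLOCAL+1] are halo cells owned by the neighbours. */
    double u[NLOCAL + 2];
    for (int i = 0; i <= NLOCAL + 1; i++)
        u[i] = rank;

    int left  = (rank == 0)          ? MPI_PROC_NULL : rank - 1;
    int right = (rank == nprocs - 1) ? MPI_PROC_NULL : rank + 1;

    /* Explicitly pass boundary data: send my edges, receive the halos. */
    MPI_Sendrecv(&u[1],          1, MPI_DOUBLE, left,  0,
                 &u[NLOCAL + 1], 1, MPI_DOUBLE, right, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    MPI_Sendrecv(&u[NLOCAL],     1, MPI_DOUBLE, right, 1,
                 &u[0],          1, MPI_DOUBLE, left,  1,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    if (rank == 0)
        printf("rank 0 halo from the right: %.1f\n", u[NLOCAL + 1]);

    MPI_Finalize();
    return 0;
}

Even this tiny kernel shows the bookkeeping (halo cells, neighbour ranks, matched tags) that the slide calls cumbersome; real applications multiply it across dimensions and data structures.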

Productivity Challenges

The time spent trying to structure an application to fit the attributes of the target machine.
If the machine is a cluster with limited interconnect bandwidth:
The programmer must carefully minimize communication
Any sparse data to be communicated must first be bundled together into larger messages to reduce communication overheads (see the sketch below)
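
A hedged sketch of that bundling in C with MPI (my example, not from the talk): rather than one message per nonzero entry, the sparse (index, value) pairs are gathered into contiguous buffers and shipped as a single batch.

/* Aggregating sparse data before sending: one batched transfer instead
 * of one tiny message per nonzero entry.  Run with at least two ranks,
 * e.g. mpirun -np 2. */
#include <stdio.h>
#include <mpi.h>

#define N 10000

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {                               /* producer */
        double x[N] = {0};
        x[7] = 1.5; x[42] = -2.0; x[9001] = 3.25;  /* sparse data */

        int idx[N]; double val[N]; int nnz = 0;
        for (int i = 0; i < N; i++)                /* bundle once */
            if (x[i] != 0.0) { idx[nnz] = i; val[nnz] = x[i]; nnz++; }

        MPI_Send(&nnz, 1,   MPI_INT,    1, 0, MPI_COMM_WORLD);
        MPI_Send(idx,  nnz, MPI_INT,    1, 1, MPI_COMM_WORLD);
        MPI_Send(val,  nnz, MPI_DOUBLE, 1, 2, MPI_COMM_WORLD);
    } else if (rank == 1) {                        /* consumer */
        int nnz;
        MPI_Recv(&nnz, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        int idx[N]; double val[N];
        MPI_Recv(idx, nnz, MPI_INT,    0, 1, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Recv(val, nnz, MPI_DOUBLE, 0, 2, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        printf("received %d aggregated entries, first = (%d, %.2f)\n",
               nnz, idx[0], val[0]);
    }

    MPI_Finalize();
    return 0;
}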

Productivity Challenges

If the machine uses conventional microprocessors:
Care must be taken to maximize cache reuse
Eliminate global memory references, which tend to stall the processor
If the machine looks like a hammer:
You'd better make all your codes look like nails!
This can lead to "unnatural" algorithms and data structures, which significantly reduce programmer productivity
