


“A Technology that makes Supercomputer Personal”


•A supercomputer is a computer at the frontline of current
processing capacity, particularly in speed of calculation.
Performance is measured in petaflops (PFlop/s).

•Supercomputers are used for highly calculation-intensive
tasks, such as molecular modeling and climate research.

•Developed by companies such as Cray, IBM, and Hewlett-Packard.

•As of July 2009, the Cray Jaguar was the fastest supercomputer.

•In July 2010, “Nebulae” of NSCS overtook the Cray Jaguar in
peak performance, at 2.98 petaflops (PFlop/s).

•Nebulae used nVidia's Tesla C2050 GPUs with CUDA
technology to boost its performance.

Nebulae Specification

•A graphics processing unit (GPU), also called a visual
processing unit (VPU), is a specialized processor that
offloads 3D or 2D graphics rendering from the microprocessor.

•Used in embedded systems, mobile phones, personal
computers, workstations, and game consoles.

•GPU forms: inbuilt (integrated), dedicated, and hybrid.

•Predominantly, the GPU was a supplement to the CPU. Now an
nVidia Tesla GPU can compute up to 14 times faster than a CPU.
[nVidia GeForce GTX 280 GPU vs Intel Core i7 960]

GPU Computing
•The excellent floating-point performance of GPUs led to
the advent of General-Purpose Computing on GPUs (GPGPU).
•GPU computing is the use of a GPU to do general-purpose
scientific and engineering computing.
•The model for GPU computing is to use a CPU and GPU
together in a heterogeneous computing model.


•Sequential part → CPU
•Computational (parallel) part → GPU
•Computation and sequential work proceed simultaneously.
•Huge performance boost compared with the traditional way
of computing.
•Known as GPGPU.

GPGPU (General Purpose GPU)

•Required recognized graphics programming languages such as
OpenGL and Cg.

Disadvantages of GPGPU
•Required graphics-oriented languages.
•Difficult for users to program in graphics languages.
•Developers had to make scientific applications look like
graphics applications.
•Restricted to the stream-processing model.


•CUDA – Compute Unified Device Architecture
•A parallel computing architecture.
•The parallel, or “many-core”, architecture runs thousands of
threads simultaneously.
•The GPU serves as a general computing engine.
•Scientific applications are compiled directly.
•Programmers use “C for CUDA” (C with nVidia extensions).
•Compiled through a PathScale Open64-based C compiler.
•Third-party wrappers are available for Python, Fortran,
Java, and MATLAB.
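As a concrete illustration of “C for CUDA”, here is a minimal sketch (not from the slides; names like `vecAdd` are illustrative) of a kernel, written in plain C with nVidia extensions, and its launch from host code:

```cuda
#include <cuda_runtime.h>

// __global__ marks a kernel: ordinary C code that runs on the GPU.
// Each thread adds exactly one element of the vectors.
__global__ void vecAdd(const float *a, const float *b, float *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n)                                      // guard the tail
        c[i] = a[i] + b[i];
}

int main(void)
{
    const int n = 1024;
    size_t bytes = n * sizeof(float);

    float *a, *b, *c;
    cudaMalloc((void **)&a, bytes);   // device-side allocations
    cudaMalloc((void **)&b, bytes);
    cudaMalloc((void **)&c, bytes);

    // Launch 4 blocks of 256 threads = 1024 threads, one per element.
    vecAdd<<<4, 256>>>(a, b, c, n);
    cudaDeviceSynchronize();          // wait for the kernel to finish

    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}
```

The triple-angle-bracket launch syntax `<<<blocks, threads>>>` is the main nVidia extension; everything else is standard C.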

Advantages of CUDA
•CUDA with industry-standard C
 ✓Write a program for one thread
 ✓Instantiate it on many parallel threads
 ✓Familiar programming model and language

•CUDA is a scalable parallel programming model
 ✓Programs run on any number of processors
without recompiling
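The scalability claim can be made concrete with a grid-stride loop, a common CUDA idiom (this sketch is ours, not from the slides): the kernel's correctness does not depend on how many blocks, or how many processors, the runtime actually provides.

```cuda
// Scale n floats by s, regardless of launch size. With more
// processors the scheduler runs more blocks concurrently and each
// thread does fewer loop iterations; the same binary works either way.
__global__ void scale(float *x, float s, int n)
{
    int stride = gridDim.x * blockDim.x;   // total threads in the grid
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
        x[i] *= s;
}
// scale<<<2, 128>>>(x, 2.0f, n) and scale<<<64, 256>>>(x, 2.0f, n)
// both produce the same result, with no recompilation.
```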


•Scattered reads (arbitrary addressing)

•Shared memory (16 KB)
•Faster downloads to, and read-backs from, the GPU
•Full support for integer and bitwise operations,
including integer texture lookups

CUDA Programming Model
•Parallel code (a kernel) is launched and
executed on a device by many threads
•Threads are grouped into thread blocks
•Parallel code is written for a single thread
 ✓Each thread is free to execute a unique code path
 ✓Built-in thread and block ID variables
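The built-in ID variables can be sketched as follows (illustrative code, not from the slides): every thread computes a unique global index from `threadIdx`, `blockIdx`, and `blockDim`, and may branch on it to take its own code path.

```cuda
// Each thread identifies itself from the built-in variables:
//   threadIdx - ID of the thread within its block
//   blockIdx  - ID of the block within the grid
//   blockDim  - number of threads per block
__global__ void whoAmI(int *out)
{
    int globalId = blockIdx.x * blockDim.x + threadIdx.x;

    // Threads are free to follow unique code paths; here only the
    // first thread of each block records its global ID.
    if (threadIdx.x == 0)
        out[blockIdx.x] = globalId;
}
```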
CUDA Architecture
•The CUDA architecture consists of several components:

•Parallel compute engines

•OS kernel-level support

•User-mode driver (device API)

•ISA (Instruction Set Architecture) for parallel computing
Tesla 10 Series

•CUDA computing with Tesla T10

•240 SP (single-precision) processors at 1.45 GHz: 1 TFLOPS peak
•30 DP (double-precision) processors at 1.44 GHz: 86 GFLOPS peak
•128 threads per processor: 30,720 threads total

Thread Hierarchy
•Threads launched for a parallel
section are partitioned into
thread blocks.
 ➤Grid = all blocks for a given kernel launch

•A thread block is a group of threads that can
 ✓Synchronize their execution
 ✓Communicate via shared memory
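Both abilities of a thread block can be shown in one sketch (ours, not from the slides; `blockSum` is an illustrative name): a block-level sum in which threads communicate through `__shared__` memory and synchronize with `__syncthreads()`.

```cuda
// Sum one block's worth of input. Assumes the kernel is launched
// with 256 threads per block (a power of two), matching buf[].
__global__ void blockSum(const float *in, float *out)
{
    __shared__ float buf[256];            // visible to all threads in the block
    int t = threadIdx.x;
    buf[t] = in[blockIdx.x * blockDim.x + t];
    __syncthreads();                      // all loads done before reducing

    // Tree reduction: halve the active threads each round.
    for (int half = blockDim.x / 2; half > 0; half /= 2) {
        if (t < half)
            buf[t] += buf[t + half];
        __syncthreads();                  // round fully finished before the next
    }
    if (t == 0)
        out[blockIdx.x] = buf[0];         // one partial sum per block
}
```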
Warps and Half Warps
•Threads of a block are executed in groups of 32 called
warps; a half-warp is 16 threads.
GPU Memory Allocation / Release
•The host (CPU) manages device (GPU) memory:
•cudaMalloc (void ** pointer, size_t nbytes)
•cudaMemset (void * pointer, int value, size_t count)
•cudaFree (void * pointer)
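A short usage sketch of the calls listed above (error checking omitted for brevity). `cudaMemcpy` is included as an assumption about how data normally gets back to the host; it is not listed on the slide.

```cuda
#include <cuda_runtime.h>

int main(void)
{
    const int n = 256;
    float *d_data = 0;                   // device pointer, held by the host

    cudaMalloc((void **)&d_data, n * sizeof(float));  // allocate on the GPU
    cudaMemset(d_data, 0, n * sizeof(float));         // zero it, byte-wise

    float h_data[256];                   // host-side buffer
    cudaMemcpy(h_data, d_data, n * sizeof(float),
               cudaMemcpyDeviceToHost);  // read back to the host

    cudaFree(d_data);                    // release device memory
    return 0;
}
```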
Why should I use a GPU as a processor?

•Compared to the latest quad-core CPU, Tesla 20-series
GPU computing processors deliver equivalent
performance at 1/20th the power consumption and
1/10th the cost.
•When a computational fluid dynamics problem is solved,
it takes
 ➤9 minutes on a Tesla S870 (4 GPUs)
 ➤12 hours on one 2.5 GHz CPU core

• Double-precision performance

 • Intel Core i7 980XE: 107.6 GFLOPS
 • AMD Hemlock 5970 (GPU): 928 GFLOPS
 • nVidia Tesla S2050 & S2070 (GPU): 2.1–2.5 TFLOPS

•GeForce 8800 GTX (GPU): 346 GFLOPS
•Core 2 Duo E6600: 38 GFLOPS
•Athlon 64 X2 4600+: 19 GFLOPS

Applications of CUDA
1.Accelerated rendering of 3D graphics
2.Video forensics
3.Molecular dynamics
4.Computational chemistry
5.Life sciences
6.Bioinformatics
7.Medical imaging
8.Gaming industry
9.Weather and ocean modeling
10.Electronic design automation (real-time
cloth simulation, OptiTex.com)
11.Video imaging
12.Video acceleration
The G80 Architecture
•Supported C
•Used a scalar thread processor, which
eliminated the need for managing vector registers.
•SIMT (Single Instruction, Multiple Thread)
•Shared memory
•Next major version: GT200 (GeForce GTX 280,
Quadro FX 5800, Tesla T10)
•Increased CUDA cores from 128 to 240.
•Double-precision floating point introduced.
Next Generation CUDA
•The next-generation CUDA
architecture, codenamed
Fermi, is the most advanced
GPU architecture ever built.
Its features include
•512 CUDA cores
•3.0 billion transistors
•NVIDIA Parallel DataCache™
•NVIDIA GigaThread™ Engine
•ECC support
Next Generation CUDA Compute Architecture
• Greatly improved double-precision performance.
• ECC memory, with triple memory redundancy.
• A true cache hierarchy.
• More than 16 KB of SM shared memory, to speed up applications.
• Faster context switches between application programs, and
faster graphics and computing interoperation.
• Faster read-modify-write atomic operations, which users
requested for their parallel algorithms.
Architectural Improvements
• Third Generation Streaming Multiprocessor (SM)
• Second Generation Parallel Thread Execution ISA
• Improved Memory Subsystem
• NVIDIA GigaThread™ Engine
Third Generation Streaming Multiprocessor
•32 CUDA cores per SM, 4x over GT200
•8x the peak double precision floating point
performance over GT200.
•Dual Warp Scheduler that schedules and
dispatches two warps of 32 threads per clock
•64 KB of RAM with a configurable partitioning of
shared memory and L1 cache
Second Generation Parallel Thread
Execution ISA
•Unified address space with full C++ support
•Optimized for OpenCL and DirectCompute
•Full IEEE 754-2008 32-bit and 64-bit precision
•Full 32-bit integer path with 64-bit extensions
•Memory access instructions to support the
transition to 64-bit addressing
•Improved performance through predication
Improved Memory Subsystem
•NVIDIA Parallel DataCache™ hierarchy with
Configurable L1 and Unified L2 Caches
•First GPU with ECC memory support
•Greatly improved atomic memory operations
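Atomic memory operations are the read-modify-write primitives mentioned earlier; a typical use is a histogram, sketched below (illustrative code, not from the slides), where many threads may increment the same bin concurrently.

```cuda
// Count byte values into 256 bins. Without atomicAdd, two threads
// incrementing the same bin could lose updates; the atomic
// read-modify-write makes the increment safe.
__global__ void histogram(const unsigned char *data, int n,
                          unsigned int *bins)
{
    int stride = gridDim.x * blockDim.x;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
        atomicAdd(&bins[data[i]], 1u);   // safe concurrent increment
}
```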
NVIDIA GigaThread™ Engine

•10x faster application context switching
•Concurrent kernel execution
•Out of Order thread block execution
•Dual overlapped memory transfer engines
•Nexus → the first development environment for massively
parallel computing.
•Supports massively parallel CUDA C, OpenCL, and DirectCompute.
•Works with Visual Studio 2010.
•Manages massive parallelism.
•Real-time benchmarking.
CUDA computing represents a new direction in parallel
computing. It is the result of a radical rethinking of the
role, purpose, and capability of the GPU. With this
technology, atomic operations can run up to twenty times
faster. On the software side, it gives more power to
computation and provides high performance. It can be used
in massively parallel GPU computing applications. In
general, we can say CUDA is a technology that makes the
"Supercomputer Personal". With its combination of
groundbreaking performance, functionality, and simplicity,
CUDA computing represents a revolution in GPU computing.
References
1) “GPU Acceleration of Object Classification Algorithms
Using nVidia CUDA”, Jesse Patrick Harvey,
Rochester Institute of Technology, 2009
2) “CUDA by Example: An Introduction to General-Purpose
GPU Programming”, Jason Sanders, Edward Kandrot,
Addison-Wesley, 2010
3) “High Performance and Hardware Aware Computing”,
Rainer Buchty, Jan Philipp, Addison-Wesley, 2009
4) nVidia whitepaper from
5) http://www.manifold.net/doc/nvidia_cuda.htm
Any Questions?