
AnySL

Efficient and Portable Multi-Language Shading


Philipp Slusallek
Sebastian Hack, Ralf Karrenberg, Dmitri Rubinstein

German Research Center for Artificial Intelligence (DFKI)


Intel Visual Computing Institute
Saarland University

Monday, August 15, 2011

Saarbrücken


Saarland Campus


Computer Science
at the Saarland Campus


Multimodal Computing and Interaction


Shaders

Programmable Shading

Allows for controlling core rendering features

Today: Many different shading languages

HLSL, GLSL, Cg, RenderMan, MetaSL, OSL, OpenRL, many C++ dialects, ...

Mostly the same features, expressed differently

We need a portable way to exchange materials

Material properties, light emission, participating media, ...

Common specification of shading features


Ease implementation for different renderers and HW

Here: Efficient and Portable Implementation


Shaders

A plug-in for the innermost loops

From one-liners to thousands of lines of code


Run for every new ray, surface hit, light sample, ...

Sometimes, once for every MADD along ray

Efficient implementation

Low overhead interface to renderer

Ideally works directly on internal data structures

Highly optimized code for specific HW architectures

Use of SIMD (SSE, AVX, PTX, ...)


Implementation Choices
[Diagram: renderer (data, code) and shader DSO/DLL (glue code, C++ shader), connected through a C/C++ API/ABI]

Shader code in C++

API specifies interface to renderer


Separate C/C++ compilation to DLL/DSO
API gets mapped directly to platform specific ABI

Predefined data layout, function call overhead


No optimization options in interface


Implementation Choices
[Diagram: renderer (data, code) and shader DSO/DLL (SL shader), connected through a generated API/ABI]

Using a Shading Language Compiler

Compiler can transform and optimize shader code

Requires renderer and language specific compiler

E.g. use of renderer internal APIs: No glue code


Transform shaders to SIMD
Most renderers support only one shading language

Renderer-specific code gets embedded in result


Implementation Choices
[Diagram: SL shader, compiler (LLVM), API, renderer glue code, embedded compiler (LLVM) inside the renderer, optimized shader alongside the renderer's data and code]

AnySL: Embedded SL Compiler

Any language compiled into portable format


Types, data layout, interface not fixed yet
Renderer supplies implementations at runtime
Embedded compiler links and optimizes code


AnySL: Portable Shading

Any Shading Language Supported

Currently: RenderMan, C++ dialects, JavaScript, ...

Common Intermediate Format

Independent of renderer and HW architecture

Easy Implementation by Renderer

Need only supply the glue code

Different Backends

Ray Tracing: PBRT, Manta, RTfact, ...
Rasterization: Deferred shading (with RTT)
HW: x86, SSE, AVX, PTX, OpenCL, GLSL, ...


AnySL & XML3D: Interactive RenderMan in Your Web Browser


AnySL

Implementation


AnySL: Implementation

Designing an Interpreter: Options

Many OP-codes with large switch() statement


Replace OP-codes with function calls

Subroutine Threaded Code

Long list of function calls

Nice for portability, implementations can be replaced

Even for control flow (if) and types (allocate a float)


E.g.: use predication for if or substitute own float type

Can be directly encoded in compiled code

Use LLVM bitcode for representation (efficiency)
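
A minimal sketch of the subroutine-threaded idea, written in Python for brevity (AnySL itself encodes the call sequence in LLVM bitcode). The shader is a flat list of calls, and the renderer supplies the implementation of each call at runtime. All names here (`SHADER`, `run_threaded`, `scalar_ops`, `packet_ops`, and the operation names) are hypothetical and chosen only for illustration.

```python
# Shader as a flat list of (operation, destination, arguments) calls.
# It computes: out = max(dot_nl, zero) * kd
SHADER = [
    ("max", "t0", ("dot_nl", "zero")),
    ("mul", "out", ("t0", "kd")),
]

def run_threaded(program, ops, env):
    """Execute each call in order, using the implementations in `ops`."""
    env = dict(env)  # do not modify the caller's inputs
    for op, dest, args in program:
        env[dest] = ops[op](*(env[a] for a in args))
    return env

# One possible renderer: plain scalar floats.
scalar_ops = {
    "max": max,
    "mul": lambda a, b: a * b,
}

# Another possible renderer: 4-wide packets, modelled here as tuples.
packet_ops = {
    "max": lambda a, b: tuple(max(x, y) for x, y in zip(a, b)),
    "mul": lambda a, b: tuple(x * y for x, y in zip(a, b)),
}
```

The same `SHADER` list runs unchanged with either table of implementations, which is the portability argument made above. The table lookup in this sketch stands in for what an embedded compiler would resolve and inline.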


Subroutine Threaded Code


Original shader code

Handling control flow: RM illuminance loop

Conversion to Threaded Code

Mapping to Threaded Code

Its implementation (supplied by renderer)
Possible implementation (supplied by renderer)

But Interpreters are Slow?!?

STC is used for portable representation only

Type Replacement

Eliminated at runtime with embedded compiler


Substitute own types and operators
Inline all interpreter calls
Perform all usual scalar optimization

Can be used for special shader functionality

Taking derivatives of arbitrary expressions


Bounding the result of shader over intervals

E.g. using Affine Arithmetic [Heidrich et al., 1998]
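
Type replacement can be sketched as follows, assuming plain interval arithmetic rather than the affine arithmetic cited above (it is simpler to show). The shader body is written once and is agnostic to its numeric type. The names `Interval`, `as_interval`, and `shader` are hypothetical.

```python
class Interval:
    """Closed interval [lo, hi] with conservative arithmetic."""

    def __init__(self, lo, hi):
        self.lo, self.hi = lo, hi

    def __add__(self, other):
        other = as_interval(other)
        return Interval(self.lo + other.lo, self.hi + other.hi)

    __radd__ = __add__

    def __mul__(self, other):
        other = as_interval(other)
        # The extremes of a product lie at the corner combinations.
        products = [self.lo * other.lo, self.lo * other.hi,
                    self.hi * other.lo, self.hi * other.hi]
        return Interval(min(products), max(products))

    __rmul__ = __mul__


def as_interval(x):
    """Promote a plain number to a degenerate interval [x, x]."""
    return x if isinstance(x, Interval) else Interval(x, x)


def shader(u, v):
    """Type-agnostic shader body: works for floats and for Interval."""
    return u * v + 0.5 * u
```

Calling `shader(2.0, 3.0)` evaluates a single point, while `shader(Interval(0.0, 1.0), Interval(-1.0, 2.0))` bounds the result over the whole parameter box. The interval result is conservative: it ignores that `u` appears twice, which is the kind of dependency that affine arithmetic tracks.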


How it All Fits Together


Special Functionality

Derivatives of arbitrary expressions

Implemented through Automatic Differentiation

Each type stores and maintains (2) derivatives


Each operation updates value and derivatives
Input provides initial derivatives (e.g. w.r.t. screen space)

Bounding the value of a shader over interval

Implemented through Interval or Affine Arithmetic

Each type stores and maintains value plus interval

AA: plus terms for linear dependencies on (all) input values

Each operation updates value and interval


Input provides initial interval (e.g. w.r.t. parameter space)

All of this maps nicely to Type Replacement
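
A minimal sketch of forward-mode automatic differentiation with two derivatives per value, as described above. The type `Dual2`, the helper `lift`, and the example shader `stripes` are hypothetical names. Only addition, multiplication, and sine are covered here.

```python
import math


class Dual2:
    """A value together with two partial derivatives (d/dx, d/dy)."""

    def __init__(self, val, dx=0.0, dy=0.0):
        self.val, self.dx, self.dy = val, dx, dy

    def __add__(self, other):
        other = lift(other)
        return Dual2(self.val + other.val,
                     self.dx + other.dx,
                     self.dy + other.dy)

    __radd__ = __add__

    def __mul__(self, other):
        other = lift(other)
        # Product rule, applied to each partial derivative.
        return Dual2(self.val * other.val,
                     self.dx * other.val + self.val * other.dx,
                     self.dy * other.val + self.val * other.dy)

    __rmul__ = __mul__


def lift(x):
    """Promote a plain number to a constant (zero derivatives)."""
    return x if isinstance(x, Dual2) else Dual2(x)


def sin(x):
    """Sine with the chain rule applied to both derivatives."""
    x = lift(x)
    c = math.cos(x.val)
    return Dual2(math.sin(x.val), c * x.dx, c * x.dy)


def stripes(s, t):
    """Example shader expression written against the replaced type."""
    return sin(10.0 * s) * t
```

The inputs carry the initial derivatives: for example `s = Dual2(0.0, 1.0, 0.0)` and `t = Dual2(2.0, 0.0, 1.0)` state that `s` varies with screen-space x and `t` with screen-space y. The result of `stripes(s, t)` then holds the shaded value along with its rate of change in both directions.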


Results
Automatic differentiation for anti-aliasing

Point sampling


Analytic AA: Blend to average near Nyquist

Optimization:
Packet-Based Shading

Modern ray tracers shoot packets of rays

Exploit SIMD instructions of modern CPUs

Can execute one instruction on n floats at once


Current architectures:

SSE (4), AVX (8), KNF (16), GPU (32)

Shader function has to shade n hit points at once


AnySL:
Packetized Shaders

Writing packetized shaders is REALLY HARD

Not an option for any application

You may not want to do this by hand:


AnySL:
Packetized Shaders

Given: A shader, represented as a control-flow graph of scalar instructions

Needed: A packetized shader, i.e. a new shader that executes k instances of the original shader at once

Control flow of instances can diverge!


Main Issues: Control Flow

Diverging control flow of a shader

Need to efficiently merge flows again!

Shaders are nested in a deep recursion

Must handle closures and reordering of packets


Packetized Shaders

Approach

Program transformation
Flatten control flow
Every instance executes all instructions
Mask out wrong results
Loops are iterated until last instance is done
Already exited instances are invalidated
Simulate what GPUs do in HW
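
The transformation above can be sketched on a small example. Python lists stand in for SIMD registers, and `select` stands in for a hardware blend instruction. The scalar shader and all function names are hypothetical, and the rewrite is done by hand here to show the shape of the result, whereas AnySL derives it automatically.

```python
def scalar_shader(x):
    """Original scalar code with a branch and a loop."""
    if x > 0:
        y = x * 2
    else:
        y = -x
    while y < 10:
        y = y + 3
    return y


def select(mask, a, b):
    """Per-lane blend: take the lane from a where mask is set, else from b."""
    return [va if m else vb for m, va, vb in zip(mask, a, b)]


def packet_shader(xs):
    """Flattened version: every lane executes every instruction."""
    mask = [x > 0 for x in xs]
    then_vals = [x * 2 for x in xs]   # 'then' side, computed for all lanes
    else_vals = [-x for x in xs]      # 'else' side, computed for all lanes
    ys = select(mask, then_vals, else_vals)  # mask out the wrong results

    # The loop runs until the last lane is done.
    active = [y < 10 for y in ys]
    while any(active):
        stepped = [y + 3 for y in ys]
        ys = select(active, stepped, ys)  # exited lanes keep their value
        active = [a and y < 10 for a, y in zip(active, ys)]
    return ys
```

For any input packet, `packet_shader(xs)` should agree lane by lane with `[scalar_shader(x) for x in xs]`. The cost is that lanes which leave the loop early still occupy a slot until the slowest lane finishes.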


AnySL:
Dealing With Data Divergence

SSE has no gather/scatter support

Need to resort to serial load/store

Data must be in multiples of four and properly aligned


Extract individual values from SSE vector
Load/Store
Merge/blend results back into SSE register
Very expensive (lots of dependencies)

Calling non-packetized functions

Essentially, the same as scatter/gather


E.g. hand-crafted SSE noise() function
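
A sketch of the serial fallback described above, with tuples standing in for 4-wide SSE vectors. Both helper names are hypothetical. The second function shows why calling a non-packetized function has the same structure as a gather.

```python
def gather(table, indices):
    """Emulated gather: one scalar load per lane, then merge."""
    lanes = []
    for i in indices:            # extract each index from the index vector
        lanes.append(table[i])   # serial scalar load
    return tuple(lanes)          # merge the results back into one vector


def call_scalar_per_lane(fn, packet, mask):
    """Call a non-packetized function once per active lane.

    Inactive lanes keep their previous value.
    """
    return tuple(fn(x) if m else x for x, m in zip(packet, mask))
```

For example, `gather([10, 20, 30, 40, 50], (4, 0, 0, 2))` performs four separate loads. On real hardware each extract and merge step adds a dependency, which is the expense noted above.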


Packetized Shader

Results

Packet size of 4 (SSE)


Completely automated (LLVM)
Shaders are packetized automatically
On average 3.2x speedup for complete rendering
Not specific to graphics
Can be used wherever data parallelism is available


AnySL Results


Applications Beyond Graphics

Whole Function Vectorization

Transform a function over one or more scalar parameters into a function over SIMD parameters
Maintaining semantics within each SIMD lane
Application to shader code & packet ray tracing

OpenCL-Compiler

Simply add an OpenCL-Frontend


Re-use existing AnySL backends
Currently fastest OpenCL compiler for CPUs & GPUs


AnyDSL

Vision

Language, enabling domain specific environments

A new base language (others are too complex already)


New environments can be written in AnyDSL

Meta programming

Think libraries of types, code, syntax, etc.

Ensures predictable performance


Programmer directly controls which parts of a program are evaluated at compile time
Convenient syntax, no special templates

Implicit support for parallelism

Based on continuation passing style


ECOUSS Project

Efficient and Open Compiler Environment for Semantically Annotated Parallel Simulations
German National Project

Application Partners

Supercomputing Center HLRS, Stuttgart


Cray Computer
BMW Group
Boehringer Ingelheim (pharmaceuticals)

Research Partners

Intel Visual Computing Institute


German Research Center for Artificial Intelligence (DFKI)
Karlsruhe Institute of Technology


Conclusions

AnySL

Shaders are compiled to platform-independent code

Reduce work for the renderer implementer

Eliminates interfaces and optimizes code

High-performance through packetization

Need only supply renderer-specific code and link to AnySL

Highly-optimizing JIT compiler within the renderer

Can be produced from any shading language

Significant speedup on benchmarks (~3.2x)


Eliminated need for SIMD shader coding

Many applications beyond graphics
