
Chapter Nineteen

The Programmable Pipeline


Introduction
Before the advent of dedicated 3D accelerator hardware in home computers, the graphics programmer
was tasked with writing all of the code to determine the color of every pixel in the final rendered image.
These graphics engines, referred to as software engines, were kept reasonably simple with regard to the
algorithms used to determine these colors and often ran at very low resolutions, relative to modern
standards, in order to maintain acceptable performance levels. The programmer had a certain amount of
freedom in terms of the lighting models employed and the methods used to determine the position of
each vertex in the scene and the color of each pixel on the screen. However, imagination and creativity
were curtailed by reality -- anything other than basic lighting and transformation algorithms running on
the CPU would not execute at interactive frame rates on anything but the most powerful of computers.

In 1995 a revolution occurred when 3Dfx® released a 3D graphics accelerator, the Voodoo1™, that was
sold in large quantities to the gaming market. Although crude by today’s standards (it accelerated
rasterization of primitives only), it carried the visual quality of PC games to new and unexpected
heights. Games could now be run at higher resolutions with hardware assisted bilinear filtering to
remove the blocky, pixelated look that gamers had been accustomed to. For the first time, PC games
rivaled the visual quality of games found on dedicated arcade level hardware.

Seeing the unanticipated uptake of 3D hardware by the gaming public, other companies naturally
followed along with their own 3D accelerators in order to carve out their piece of this growing and very
lucrative market. It wasn’t long before 3D hardware accelerators became commonplace, even in
modestly priced PCs. As such, their presence in end-user systems could be relied upon by game
developers and games began to be produced that actually required such hardware be installed in order to
run. The first wave of graphics accelerators did not support transformation and lighting in hardware, and
as such this was still the job of the CPU. Nevertheless, a new age in PC gaming had begun!

Microsoft® was quick to jump on the new features that 3D hardware acceleration offered and DirectX
was updated to interface seamlessly with this new generation of hardware. 3D graphics APIs such as
DirectX and OpenGL had enjoyed some success up to this point, but when the market was flooded with
many different graphics cards, all quite different from one another under the hood, programmers were
quick to utilize these APIs for game development because they provided a consistent interface to the
variety of hardware configurations available. The programmer no longer had to write multiple code
paths throughout the engine to handle the variety of hardware that might be installed. Furthermore,
graphics card giants such as nVidia® and ATI® (now AMD®) worked closely with Microsoft® to make
sure that the drivers provided with their cards interfaced with DirectX and supported as many of its
features in hardware as possible.

Graphics programmers could now use DirectX to transform, light, and render their primitives, relying on
the API and the underlying HAL driver provided by the graphics card manufacturer to make use of any
hardware rasterization assistance that may be present in the form of a 3D graphics accelerator. DirectX
also provided a complete software transformation and lighting module that could be used to transform
and light primitives at optimal speeds on the CPU, so the engine programmer no longer had to write
these pieces himself. Thus, DirectX provided a high-level graphics rendering framework that would
automatically make use of any 3D hardware installed on an end-user system and provided efficient
software solutions for transformation and lighting, for which no hardware acceleration alternative
existed at the time. Yet interestingly, many graphics programmers still preferred to implement their own
vertex transformation and lighting modules and utilized DirectX only to make use of 3D hardware
during rasterization. This afforded the programmer a degree of flexibility and creativity to implement
lighting and transformation code using algorithms of their choosing, which enabled their games to have
a unique look compared to other titles.

The DirectX transformation and lighting (T&L) module has been discussed thoroughly throughout this
course series. The Blinn/Phong model is used for computing lighting results at the vertex level which is
then interpolated linearly across primitives to generate per pixel colors. The transformation module
works by setting states such as the world, view, and projection matrices which collectively help us map
world locations to screen pixels when a primitive is rendered. We also saw in Module II how support for
multi-matrix transformations per-vertex (e.g., skinning) is made possible using the DirectX
transformation module by setting multiple matrices in the matrix palette. While programmers did not
initially move over to using the DirectX T&L module in large numbers, something was about to happen
that would change that outlook significantly.

With the uptake of graphics accelerators on such a large scale, companies began researching ways to
improve their technologies for getting even better gaming performance on the PC. The next logical step
was to find a way to accelerate not only the rasterization of primitives, but the entire transformation and
lighting pipeline (geometry processing). The same transformation and lighting algorithms used by
DirectX’s own software module, chosen for its speed over its complexity, were hardwired into the
graphics chips to provide hardware accelerated support for geometry transformation and lighting. For
programmers already utilizing the DirectX T&L module, the change from software to hardware was
transparent. You could work with DirectX’s T&L interface just as before, but it would automatically
utilize hardware features where available (DirectX 7 onwards). Thus, developers quickly moved over to
using APIs like DirectX and OpenGL for managing their complete T&L pipeline, since programmers
and publishers alike were eager to boast support for T&L hardware in their latest engine. Implementing
software-based transformation and lighting code was no longer a viable option if they wanted their
games to compete, since hardware T&L allowed games to run in even higher resolutions with more
polygon throughput, leaving the CPU free to concentrate on other tasks such as AI, physics, and game
state management.

In 1999, the first hardware accelerated T&L graphics card was released by nVidia® -- the geForce™
256. The “ge” stood for geometry acceleration as this was the first 3D card that supported geometry
transformation and lighting in hardware. The geForce™ 256 didn’t stay on top of the pile for very long
as nVidia® released the geForce2™ GTS (the first in a series of geForce™ 2 cards) in 2000, which
also boasted T&L support but outperformed its predecessor by up to 40%.

Note: The GTS qualifier to the geForce2 name is an abbreviation for Giga Texel Shader, so named for its
texel processing rate (1.6 billion per second).

In that same year ATI® released their first generation Radeon™ card (initially codenamed Rage 6 after
their previous range of cards) which was comparable in specification to the geForce™ 2 (DirectX 7,
T&L hardware support, etc.) but with a third texture unit compared to the geForce™ 2’s two texture
units. The geForce™ 2, with its extremely reasonable price tag, sold in massive numbers and catapulted
T&L hardware into mainstream PCs and thus, increased uptake of DirectX 7’s (the first T&L compliant
version of DirectX) transformation and lighting interfaces into mainstream game development.

Although these early T&L capable cards undoubtedly made another vast improvement to both the visual
quality and performance of games, with hindsight many game programmers refer to this period as the
‘dark ages’ with respect to artistic creativity. The problem with this wave of T&L cards is that they
accelerated rendering using specific algorithms that were ‘fixed’ on the graphics chip. These cards were
hardwired to use the basic Gouraud interpolated / Blinn-Phong vertex lighting model (i.e., a hardware
accelerated version of DirectX’s own fixed-function pipeline).

Although multi-texturing capabilities were supported on these chips, making techniques like light
mapping and even bump mapping (with limitations) possible, programmers were pretty much stuck with
the lighting model and texturing algorithms that these chips (and DirectX) provided. As a game
programmer you definitely wanted to utilize this ‘fixed-function pipeline’ since doing so would ensure
that your game ran as fast as possible in the highest available resolutions. However, because almost
every game released during this period utilized the same algorithms, they started to look very similar to
one another. The textures and the geometry were different, but they all used the same hardware
accelerated fixed-function transformation and lighting algorithms. Because the position of each vertex
and the color of each pixel was now determined on the graphics card, the programmer had no way to
access these algorithms or change their behaviors. This stifled creativity for programmer and artist alike
because instead of being able to come up with new and innovative techniques to implement lighting and
shadows, they had to live with what the graphics hardware (via a fixed-function interface) provided.

Throughout this time, and for nearly an entire decade before, Pixar Animation Studios had been using a
software package called RenderMan™ on their high-end graphics systems to develop visually stunning
and critically acclaimed animated feature films like Toy Story, A Bug’s Life, and Finding Nemo. The
graphics in these movies wowed audiences worldwide with their ‘never before seen’ effects, made
possible due to the RenderMan™ approach of allowing the artist/programmer access to its lighting and
transformation algorithms. Because the RenderMan™ T&L pipeline was programmable, it could be
modified to support a nearly unlimited variety of algorithms. It was this software that was to influence
the next generation in 3D accelerators for home computers.

RenderMan™ provided programmable access to the underlying algorithms used to determine the color
of each pixel on the screen and of each object in the scene. If a given object required special
transformation or lighting consideration, a separate program, called a shader, could be written and
applied to that object. This program would contain the code to transform and light the object when it
was rendered. Each object (or polygon) in the scene could have its own shader program assigned to it,
describing to the RenderMan™ system the algorithms that should be used to transform, light, and shade
that particular entity. Because of the complexity of the RenderMan™ system, it was limited to a
software-only implementation. This is fine in the movie business, where a computer can be left to render
a single frame over minutes or even hours if necessary, but has no applicability in the real-time graphics
world on consumer level PCs and consoles.

In March of 2001 (three months after nVidia® acquired 3Dfx®) the first of a new generation of graphics
cards was released – nVidia’s geForce™ 3. In response to developers requesting greater choice and
flexibility in the graphics pipeline, the geForce™ 3 supported programmable shading capability (similar
in concept to RenderMan™) with real-time performance and at a consumer level price tag. The
geForce™ 3 supported both programmable vertex and pixel shaders. This was required under the
DirectX 8.0 specification, which had also evolved to provide interfaces to this new programmable
pipeline. Any graphics card wishing to boast full support for DirectX 8.0 now had to also support
programmable vertex and pixel shaders.

Note: It is a little known fact that the geForce2™ actually had shaders, but they were non-
programmable and thus hardly ever used outside of simple tech demos.

The geForce™ 3 enjoyed undisputed performance supremacy throughout its lifetime. Unlike most
nVidia® graphics products, it was aimed at the high-end and upper-midrange gaming market, and at no
time was there a cheap, entry-level version of it -- nor was there any need for one, as the numerous
geForce™ 2 variants were well-placed to serve the mass-market. In early 2002 ATI® released their first
DirectX 8 (shader capable) card in the form of the Radeon™ 8500. This was the video card that put
ATI® on a competitive footing with nVidia® and separated them from the other failing graphics card
companies of the period.

All of this new 3D hardware provided a means (via APIs like DirectX) to allow the game developer to
program the GPU to achieve whatever transformation and lighting techniques were desired on a per-
object or even per-polygon basis. This is done by writing shader programs that are uploaded to the GPU
at runtime and executed.

Note: Fortunately for us, shader programs can be written inside effect files and will be automatically
uploaded to the GPU and utilized when we invoke an ID3DXEffect. This means we have a very small
learning curve with regards to using shaders in our current programs.

This programmable pipeline allowed the programmer to bypass the fixed-function transformation and
lighting algorithms normally carried out per vertex and supply his or her own code that will still be
executed in hardware. From the programmer’s perspective it is the best of both worlds -- much more
control of the rendering results, just as it was in the days of the software engine, but with the custom
code being executed on a dedicated graphics chip for top performance.

As shader capable cards became commonplace, the visual quality of PC and console games entered a
new realm of excellence as shaders allowed developers to create specialized per-pixel lighting
techniques, realistic shadows, new special effects, and much more. When we look at the games market
today, we see that games are released that rely on shader hardware being present and shader capable
cards are the norm, even in the very lowest end of today’s range of PCs. The latest games now look
amazing (e.g., the Crysis series, Metal Gear Solid 4, etc.) and very distinctly different from one another
since they all use customized rendering code in their shader programs.

With the general adoption of shader capable cards, the progress in technology has been mostly measured
in terms of shader model (we’ll talk more about shader models in a bit). The geForce™ 3, being the first
shader capable card, supported vertex shader model 1.1 and pixel shader model 1.1. The shaders on such
cards could only support a very limited number of instructions and blending operations. Shader
programs had to be kept very short and only a relatively small number of constant registers were
available for passing data from the application into the shader.

You will recall from the last chapter that a constant register is really just a block of memory on the
graphics card that we use to store application data for use in a shader program. When using effect files
that contain shader programs, effect parameters are automatically uploaded into the constant registers of
the graphics card when the effect is invoked. Shaders are essentially just functions that are executed
either per-vertex or per-pixel and the constant registers provide a means for us to send data (our
parameters) into those functions to configure their behavior. When using effect files that contain
shaders, all we have to do is set our effect parameters and rely on the D3DX framework to upload the
effect parameters into the appropriate constant registers when the effect is triggered. Inside the shader
function, these parameters can be accessed just like traditional variables and will be used to influence
texture sampling, vertex transformations, color blending operations, and so on.
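
To make this concrete, the sketch below shows a typical C++ flow for invoking a shader-driven effect. The parameter names ("WorldViewProj", "LightDirection") are assumptions that would need to match your own effect file; only the ID3DXEffect and device calls themselves are standard D3DX / Direct3D 9 API.

#include <d3dx9.h>

// Hedged sketch: how application data reaches the shader's constant registers via an effect.
// "WorldViewProj" and "LightDirection" are assumed parameter names declared in the effect file.
void DrawWithEffect( IDirect3DDevice9 * pDevice, ID3DXEffect * pEffect,
                     const D3DXMATRIX & matWorldViewProj, UINT nTriangleCount )
{
    D3DXVECTOR4 vLightDir( 0.0f, -1.0f, 0.0f, 0.0f );

    // These calls only cache the values inside the effect object...
    pEffect->SetMatrix( "WorldViewProj", &matWorldViewProj );
    pEffect->SetVector( "LightDirection", &vLightDir );

    // ...the cached data is uploaded into the card's constant registers when the effect is invoked.
    UINT nPasses = 0;
    pEffect->Begin( &nPasses, 0 );
    for ( UINT i = 0; i < nPasses; ++i )
    {
        pEffect->BeginPass( i );
        pDevice->DrawPrimitive( D3DPT_TRIANGLELIST, 0, nTriangleCount );
        pEffect->EndPass();
    }
    pEffect->End();
}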

The Radeon™ 8500 was slightly later to market and supported shader model 1.4, which had been
implemented in the latest revision of DirectX (version 8.1). DirectX 8.1 added shader models 1.2, 1.3,
and 1.4 and a DirectX 8 compliant card had to be capable of supporting at least one or more of these
shader models. Shader model 1.2 and 1.3 were rather incremental technology upgrades. Pixel shader 1.2
for example added four new instructions to the shader language and two new arithmetic operations to
that of 1.1 to aid in making shader writing more efficient. Even less significant was shader model 1.3
which added a single new instruction to the shader language. Shader model 1.4 really improved things
significantly and allowed for higher shader function instruction counts and access to six textures,
compared to only four in the previous shader models. In pixel shader models prior to 1.4, the shader had
been limited to containing no more than four texture lookup operations and eight arithmetic operations
(for color blending, etc). In pixel shader model 1.4, not only was this increased to six texture operations
and eight arithmetic operations, but also the concept of multiple phases (like multiple passes) was
added. Each phase was allowed to execute six texture and eight arithmetic operations, significantly
increasing the overall instruction count of 1.4 pixel shaders and the types of effects that could be
achieved using them.

DirectX 9.0 introduced shader model 2.0, which brought with it a significant increase in shader
complexity. Unlike the shader models that had come before, where only simple programs could be
written, vertex shader model 2.0 supported 256 instructions (double the 128-instruction limit of vertex shader 1.1). Further, it
supported static flow control, making conditional code paths and loops possible within the vertex shader.
More impressive was the increase in instructions allowed in a pixel shader, which had grown from a
measly 6+8 per phase in 1.4 to an impressive 96 (32 texture and 64 arithmetic). There were several point
releases of shader model 2.0 (2.a and 2.b) which added features to the base 2.0 model. Shader model 2.a
was implemented to expose features present in nVidia® cards at the time that went beyond the base 2.0
specification, including extra registers for data storage and an even larger instruction count in the pixel
shader -- 512. Shader model 2.a also allowed static flow control inside the pixel shader. Shader model
2.b was implemented to support the extra features available in ATI® cards at the time and also supported
a 512 instruction count limit for the pixel shader. Pixel shader model 2.b also provided 32 temporary
registers on the hardware for data storage during intermediate calculations compared to 12 registers in
2.0 and 22 in 2.a. However, pixel shader model 2.b did not support static flow control. In many ways,
the 2.x models were really just proprietary stepping stones between shader models 2.0 and 3.0. They
provided developers access to the growing list of features on the latest cards which had gone beyond the
2.0 shader model specification laid down by DirectX 9.0, but their vendor-centric nature arguably made
them less palatable.

Shader model 2.0 level cards provided much more than just higher instruction counts. They also
provided access to multiple render targets and true 16 and 32-bit floating point operations inside the
pixel shader (previous shader models had used fixed-point 16-bit vector instructions). The shader model
2.0 upgrade also provided support for sampling up to 16 textures simultaneously.

With DirectX 9.0c, support for shader model 3.0 was brought to market and once again we saw a
significant jump in terms of available features. Dynamic flow control was now introduced in both vertex
and pixel shaders and instruction limits were fixed at an absolute minimum of 512 for both vertex and
pixel shaders. This opened the door to the development of long complex shaders and thus some
incredible visuals which could be done in a single pass. With shader model 3.0, the vertex shader could
even access certain types of textures, allowing vertices to be procedurally generated/updated inside the
shader using data fetched from a texture (e.g., a height map).

Note: Static flow control essentially means that conditional code paths (if/else blocks) can be specified
inside the shader, but that the code path is chosen by testing only values stored in the constant registers.
Since the constant registers are set prior to the draw call (by invoking the ID3DXEffect for example) this
means the same code path will be taken for every vertex/pixel rendered by the draw call. With dynamic
flow control, the conditional test parameters can be passed as inputs into the shader per vertex or per
pixel or even generated on the fly. As such, a different code path can be executed for each vertex or
pixel inside of the same single draw call (at considerably more expense however).

In most cases, a graphics adapter will be backward compatible with all previous shader models. For
example, an adapter that supports shader model 3.0 will also support 1.4 and 2.0. The only exceptions
here are the 2.a and 2.b shader models which are only supported by nVidia® and ATI® cards,
respectively.

Below is a table showing just a few of the consumer level graphics cards that appeared as the shader
models evolved through to the eventual 3.0 level in DirectX 9.0c.

Graphics Card              VS Version          PS Version

geForce 3                  vs 1.1              ps 1.1
Radeon 8500 – 9200         vs 1.1              ps 1.1 – ps 1.4
geForce 4 Ti               vs 1.1              ps 1.1 – ps 1.3
Radeon 9500 – 9800         vs 1.1 – vs 2.0     ps 1.1 – ps 2.0
geForce FX                 vs 1.1 – vs 2_a     ps 1.1 – ps 2_a
Radeon X800                vs 1.1 – vs 2.0     ps 1.1 – ps 2_b (not ps 2_a)
geForce 6800/7800/7900     vs 1.1 – vs 3.0     ps 1.1 – ps 3.0

Cards that support only 1.x shader models are considered to be DirectX 8 cards, whereas graphics cards
that support shader models 2.0 and 3.0 are considered DirectX 9 level cards. More correctly, a card that
supports shader model 2.0 is referred to as a DirectX 9.0 card and one that supports shader model 3.0 is
referred to as a DirectX 9.0c card.

Since this is primarily a DirectX 9 course, we will write our programs targeting shader model 2.0+ level
hardware. DirectX 9 has been out for many years now and it is not unreasonable to expect the public to
have compliant hardware. Of course, by using the high level shading language and effect files, it is a simple
matter to supply fallback effects which contain shader programs targeted at earlier 1.x shader models
(which would contain fewer instructions and would not use elements of the shader language that only
became available in later models). However, the feature sets of the earlier 1.x shader models are rather
measly by 2.0 and 3.0 standards, so you may find their limitations too great to warrant supporting.

The DirectX 9 SDK documentation has a wealth of information about the capabilities of the various
shader models including instruction limits, the various instructions available in each version of the
shader language, number of textures supported, etc. Simply type “vertex shader” or “pixel shader” in the
keyword edit box of the Index tab and you will see a list of topics covering a wide variety of
information. Check out Vertex Shader Differences and Pixel Shader Differences for a quick overview of
the different capabilities in each shader model. You will also find (under Vertex Shader Instructions and
Pixel Shader Instructions) all of the instructions supported by the language for each shader model.
Because you already have this information at your disposal, we will not list all such tables and charts
here, but will instead focus on how to write shaders by example, leaving you to reference the
documentation when needed. However, below we show some of the key differences between the vertex
and pixel shader model versions.

Note: The following tables list only some of the differences between shader models. There are many
differences in practice between each shader model, such as additional hardware registers for storing and
working with data passed into the shader, extra instructions both for arithmetic and for controlling static
and dynamic flow, etc. Check the DirectX 9 documentation for the exact details for each shader model.

Vertex Shader Model Differences

Model   Instruction Slots                                     Static Flow   Dynamic Flow   Vertex Texture
                                                              Control       Control        Lookup
1.x     128                                                   No            No             No
2.0     256                                                   Yes           No             No
2.x     256                                                   Yes           Yes            No
3.0     512 minimum, and up to the number of slots in         Yes           Yes            Yes
        D3DCAPS9.MaxVertexShader30InstructionSlots
        (typically in the tens of thousands on modern cards)

Pixel Shader Model Differences

Model   Instruction Slots                                     Static Flow   Dynamic Flow   Other Significant
                                                              Control       Control        Extras
1.1     12 (4 texture + 8 arithmetic)                         No            No
1.2     12 (4 texture + 8 arithmetic)                         No            No
1.3     12 (4 texture + 8 arithmetic)                         No            No
1.4     6 texture + 8 arithmetic per phase                    No            No
2.0     96 (32 texture + 64 arithmetic)                       No            No
2.x     96 minimum, and up to the number of slots in          Yes           Yes            No separate texture
        D3DCAPS9.PS20Caps.NumInstructionSlots                                              instruction limit
3.0     512 minimum, and up to the number of slots in         Yes           Yes            No separate texture
        D3DCAPS9.MaxPixelShader30InstructionSlots                                          instruction limit

Now that we know some of the history behind the evolution of shaders, let us really start drilling down
on what they are and where they fit into the DirectX pipeline.

19.1 Introducing Vertex & Pixel Shaders
In Figure 19.1 we see a high level version of the Direct3D 9.0 graphics pipeline. The application passes
the vertex stream to the D3D pipeline via a call to one of the DrawPrimitive functions. Direct3D has
support for higher-order surfaces (curved surfaces), which allow a surface to be passed as a series of
control points that mathematically describe the surface. A tessellator unit can then generate the desired
number of vertices to represent that surface on the fly.

Note: We will not be discussing higher order surfaces in this course. While they have been supported for
some time, they have not been widely adopted for game development. However, as of DirectX 9, the
tessellation unit can also be used to look up per-vertex displacement values from a displacement texture
and pass them on to the vertex shader. Tessellation is a much more fully featured system in DX11.

At this point, when a vertex shader has not been set on the device, the vertices will be passed to the
Direct3D fixed-function transformation and lighting pipeline where they will be transformed and lit using
the methods we have discussed numerous times throughout this series. It is in this section that the vertex
light/color values are calculated and the position is transformed into clip space using the currently set
world, view, and projection matrices. However, if a vertex shader has been set on the device prior to the
draw call, the fixed-function T&L module will not be invoked to transform the vertices. Instead, the
vertices will be sent, one at a time, into the vertex shader function. The vertex shader now has the
responsibility of transforming each vertex into clip space, calculating any required color values,
transforming any normals, texture coordinates, and so on.

There are two important things to remember here. First, the vertex shader is executed once for each
vertex in the vertex stream. This means that if your DrawPrimitive call passed a stream of 100 vertices
in, the vertex shader program will be executed 100 times. Second, with each invocation the shader is
passed one vertex from the stream (including its position, texture coordinates, etc.) and it must, at a
minimum, convert the input position into clip space so that it is ready to be processed by the rest of the
Direct3D pipeline.

Figure 19.1

You might be wondering why you would want to write your own transformation and lighting shader
code instead of relying on the Direct3D pipeline to do it for you. Well, first bear in mind that in a vertex
shader we are free to do whatever we want; the only minimum requirement is that we provide an output
clip space position. We can use a proprietary lighting model (e.g., something other than vanilla Blinn-
Phong) if we are doing any lighting at the vertex level. We can procedurally generate additional texture
coordinates on the fly to perform all sorts of interesting effects in real-time (reflection, shadows,
animations, etc.). We can even generate data that we simply want passed along to our pixel shader,
where some additional work will take place. The point is that we can do quite a bit more than just
transform a vertex position and generate a lighting color or two for interpolation purposes.

Note: From DirectX 10 forward, with the exception of the output merging (blending) stage, most of the
fixed-function pipeline is gone completely. You will have to write vertex and pixel shaders to do just
about everything, so writing your own shaders will be mandatory for all future 3D development using
DirectX.

We can see in Figure 19.1 that after leaving the vertex shader, the clip space vertex is passed back to the
automated Direct3D pipeline. At this point the usual processes kick in, including backface culling,
frustum and user-defined clip plane clipping, and projection via the divide by w. The resulting post-
projection 2D vertices are then transformed and scaled to fit the dimensions of the viewport such that the
vertex is in screen space and the first major leg of its journey through the pipeline is complete.

Looking at Figure 19.1 we can see that once the vertices are in screen space they are reassembled back
into triangle primitives and rasterization can occur. The vertices are used to form the points of a triangle
and the Direct3D pipeline then interpolates the vertex positions to generate the position of each fragment
(i.e., potential pixel) inside the triangle. We call them potential pixels because they may be rejected
further along in the pipeline. As each fragment inside the triangle is visited during the rasterization
process, the position, color, and texture coordinates are calculated using linear interpolation (based on
the position of the fragment relative to the three vertices) of the data stored at each of the three vertices.
This is what allows us to store different colors in each vertex and have the color smoothly blend from
one to another across the face of the triangle (when Gouraud shading is enabled).

Note: Vertex texture coordinates are automatically linearly interpolated by the Direct3D pipeline, and you
will see later how this storage space is often used to store many types of data. It is rare, for example,
that we would ever need to use all eight sets of texture coordinates per vertex solely for texture
sampling, so we can leverage any available coordinate slots to pass along other bits of information that
would be useful to the pixel shader. For example, inside the vertex shader you might compute and store
a light to vertex direction vector inside the second set of texture coordinates output by the shader. As the
pipeline simply sees this data as just another set of texture coordinates, it will automatically linearly
interpolate it across the face, generating unique light to pixel vectors that can then be used inside the
pixel shader to do per-pixel lighting calculations.

Each fragment in the triangle is processed one at a time (technically, hardware can and will
simultaneously process multiple fragments in parallel, but for the sake of our current conversation, we
will simply consider that a behind-the-scenes optimization, not a fundamental shader execution
concept). When pixel shaders are not in use, the fragment is passed to the fixed-function Direct3D color
blender whose properties we control via the texture stage and sampler states. These inform the fixed-
function pipeline how the colors of the current fragment should be blended together with any colors
sampled from currently set texture maps. The fragment contains its own unique texture coordinates at
this point (interpolated from the vertex texture coordinates in the previous phase) and how these texture
coordinates are used is controlled by the texture stages. We have already seen how sampled texture
colors can be blended together with interpolated diffuse and specular colors as well as colors sampled
from other texture stages. We can also set texture transformations on a per-stage basis to alter a
fragment’s texture coordinates prior to performing a texture lookup. The output of the fixed-function
texture cascade is the final color of the fragment, which is then passed on to the rest of the pipeline
(which we will get back to in a moment).

The fixed-function texture cascade has given us loyal service up until now, but it is also this component
that imposes the largest set of limitations on our visual quality. The color blending operations that we
wish to perform must be available to us as texture stage state color operations because we have no direct
control over the code that produces the final color otherwise. As commercial games advance towards
more visual realism, fixed-function states simply are not capable of getting us where we need to be. It is
this limitation that pixel shaders liberate us from because we can now replace the fixed-function blender
with our own pixel shading programs. While vertex shaders are extremely powerful and fully necessary,
it is pixel shaders that are mostly responsible for the huge leap in game visuals that we have seen in
recent years. They afford the programmer complete control over the color of each fragment as it passes
through the pipeline. Just as the vertex shader is executed once per vertex, the pixel shader is invoked to
process each fragment of the primitive currently being rasterized.

Suffice it to say that an inefficient pixel shader will hurt performance, bearing in mind that it generally
gets executed with much greater frequency than vertex programs. Indeed it is for this reason that early
pixel shader models insisted on pixel shader programs being very small and limited to a small set of
simple instructions. Nowadays, hardware has advanced to a point where complex programs can be
executed per pixel and many of the performance concerns have been reduced in later shader models
(pixel processing parallelism in hardware plays a very big role here). While pixel shaders can certainly
be larger than they were a few years back, it still stands to reason that you should try to keep them
as efficient as possible given their frequency of execution. This is particularly true given the fact that
desired screen resolutions have also trended upwards in recent years.

Note: It is possible to use vertex shaders without pixel shaders and vice versa although you will most
often choose to use both simultaneously. This is because your pixel shaders will often require per-vertex
data to have been calculated and interpolated.

Pixel shaders open the way for new ideas that were very hard to implement prior to their introduction.
For example, we know that a pixel shader gets executed for each fragment of a rendered primitive. Thus,
if we render a single polygon that covers the entire frame buffer (e.g., a quad), we know that a pixel
shader will be invoked for each pixel in that frame buffer. Now, imagine that the frame buffer has
already had a scene rendered to it but we have not yet presented it. Consider what is possible if we
copied the frame buffer data into a texture, set a pixel shader that can sample that texture, and then
rendered a full-screen quad. The pixel shader will be invoked for each pixel in the frame buffer and can
sample the matching color from the input texture. With access to that color, we can now do whatever we
want with it (e.g., brighten it, change its contrast, map it to another color, etc.) prior to overwriting the
color in the frame buffer or even alpha blending the output right on top of the existing color. Thus, pixel
shaders can be used not only during the rendering of an object’s polygons to perform lighting, shadows,
reflections, and bump/normal mapping (to name but a few ideas), they can also be invoked to process an
entire final rendered image prior to presentation. This latter idea is referred to as image (post-)processing
and you will see later in this course how it is used to good effect in our updated lab project rendering
system.

Note: When we intend to perform image processing, we will often use a texture as a temporary render
target instead of the usual backbuffer. This allows us to avoid having to copy/transfer the data out from
the backbuffer and is thus more efficient. In DX9 we can even instruct the device to render directly to
multiple render targets simultaneously (up to four).
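
As a rough illustration of that idea, the sketch below creates a texture render target and temporarily routes the pipeline output into it. The function name, the A8R8G8B8 format choice, and the omission of error checking are assumptions made for brevity; the actual lab project code will differ.

#include <d3d9.h>

// Hedged sketch: render the scene into a texture instead of the back buffer so a
// post-processing pixel shader can sample it later. Error checking is omitted.
IDirect3DTexture9 * RenderSceneToTexture( IDirect3DDevice9 * pDevice, UINT Width, UINT Height )
{
    IDirect3DTexture9 * pSceneTexture    = NULL;
    IDirect3DSurface9 * pSceneSurface    = NULL;
    IDirect3DSurface9 * pOldRenderTarget = NULL;

    // Render target textures must be created in the default pool.
    pDevice->CreateTexture( Width, Height, 1, D3DUSAGE_RENDERTARGET,
                            D3DFMT_A8R8G8B8, D3DPOOL_DEFAULT, &pSceneTexture, NULL );
    pSceneTexture->GetSurfaceLevel( 0, &pSceneSurface );

    pDevice->GetRenderTarget( 0, &pOldRenderTarget );   // remember the back buffer
    pDevice->SetRenderTarget( 0, pSceneSurface );

    // ... render the scene here exactly as normal ...

    pDevice->SetRenderTarget( 0, pOldRenderTarget );    // restore the back buffer
    pOldRenderTarget->Release();
    pSceneSurface->Release();

    // The caller can now bind this texture and draw a full-screen quad with a
    // post-processing pixel shader, releasing the texture when finished.
    return pSceneTexture;
}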

The job of the pixel shader is pretty simple -- it has to generate an output color (i.e., the color of the
fragment) which will then be handed back and processed by the rest of the Direct3D pipeline. With
DirectX 9.0 and shader model 2.0, support for multiple render targets (up to four) was introduced, thus
allowing the pixel shader to output four colors simultaneously (one for each render target). You also
have the option of outputting a custom depth value for the fragment if you wish to calculate this
information in a custom manner. Generally speaking, you will probably not use pixel shader depth
output very often -- just let Direct3D use the default depth value it calculated for the fragment. Indeed, it
can actually harm performance on modern hardware which supports early culling of fragments based on
depth testing. This feature, called early Z-cull, will not work if you return depth from your pixel shader.
See below for more details.

Referring back to Figure 19.1 once again, we can see that after the pixel shader has been executed for
the current fragment, the result is fed into an occlusion testing system. This is a visibility mechanism
introduced with DirectX 9.0 which allows the developer to draw a primitive and get back the number of
pixels that passed the depth test. If an occlusion query is performed and returns 0, it means none of the
pixels passed the depth test and the primitive need not be rendered. If a non-zero count is returned, the
developer knows that the object/primitive should be rendered. To make this as efficient as possible, a
low polygon placeholder object that bounds the more complex object is often used for the test (a box for
example) so that not too much time is spent processing unnecessary detail. If zero is returned in such a
case, it means the bounding geometry is not visible from the current camera position and the object(s)
inside it (which might be comprised of many thousands of polygons) need not be rendered.
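
A minimal sketch of how such an occlusion query might be issued through the Direct3D 9 interfaces follows. The DrawBoundingBox and DrawComplexObject helpers are hypothetical stand-ins for your own rendering code, and a real application would normally avoid stalling on GetData in a tight loop.

// Hedged sketch of an occlusion query. DrawBoundingBox() and DrawComplexObject() are
// hypothetical helpers; a real renderer would also avoid busy-waiting on GetData().
void DrawIfVisible( IDirect3DDevice9 * pDevice )
{
    IDirect3DQuery9 * pQuery = NULL;
    if ( FAILED( pDevice->CreateQuery( D3DQUERYTYPE_OCCLUSION, &pQuery ) ) )
        return;

    pQuery->Issue( D3DISSUE_BEGIN );
    DrawBoundingBox( pDevice );                 // cheap, low-poly stand-in geometry
    pQuery->Issue( D3DISSUE_END );

    DWORD nVisiblePixels = 0;
    while ( pQuery->GetData( &nVisiblePixels, sizeof(DWORD), D3DGETDATA_FLUSH ) == S_FALSE )
        ;                                       // spin until the result arrives

    if ( nVisiblePixels > 0 )
        DrawComplexObject( pDevice );           // only render the expensive mesh if visible

    pQuery->Release();
}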

It seems a shame that, according to the diagram, occlusion and depth testing are performed after the
invocation of the pixel shader. If you look at Figure 19.1 you will notice that depth testing with the z-
buffer is actually performed pretty late in the pipeline. This means that all pixels, even those that will
ultimately fail the depth test later in the pipeline, still have to pass through the pixel shader and the rest
of the pipeline processes before the depth test. The result is a lot of wasted processing for pixels that will
never be visible. Fortunately, as mentioned above, most modern graphics cards implement depth
rejection systems prior to the pixel shader to address this issue. They are usually broad phase solutions,
so they are not as accurate as the per-pixel depth test that happens later in the pipeline, but even so, they
can reject a lot of pixels early on and save many needless invocations of the pixel shader. Their broad
phase nature (they actually deal with blocks of pixels simultaneously) means that some pixels that are
not visible will still pass through the pixel shader pipeline to later be rejected by the conventional depth
test. But for the most part, early Z culling does an excellent job of reducing pixel processing and
contributes significantly to the speed of modern cards.

Note: To make sure you get the best performance from early depth testing systems, it is a good idea to
render your scene in front-to-back order when possible and/or even to do a depth-only draw pass to pre-
fill the depth buffer before doing your more expensive shading passes (more on these ideas as we
progress in the course).

The pixel is next passed into the scissor test, which is sort of like a stencil test that deals with only
rectangular clipping areas. You use the IDirect3DDevice9::SetScissorRect method to define a
rectangular region on the render target in which all visible pixels are to be contained. The pixel is then
compared against the bounds of this rectangle and rejected if found to exist outside.
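
For example, a hedged sketch of restricting rasterization to a small region might look like the following (pDevice is assumed to be a valid IDirect3DDevice9 pointer and the 256x256 region is illustrative).

// Hedged sketch (pDevice assumed): confine rasterized pixels to a 256x256 region in the
// top-left corner of the render target.
RECT rcScissor = { 0, 0, 256, 256 };                        // left, top, right, bottom
pDevice->SetScissorRect( &rcScissor );
pDevice->SetRenderState( D3DRS_SCISSORTESTENABLE, TRUE );   // pixels outside the rect are rejected

// ... draw ...

pDevice->SetRenderState( D3DRS_SCISSORTESTENABLE, FALSE );  // restore the default behavior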

If the pixel survives the scissor test, the alpha test is next (if alpha testing has been enabled), where its
alpha value will be compared against the alpha testing reference value set by the application. We saw in
Module I how to activate alpha testing and set up the alpha reference value and comparison functions
used in the test. If the pixel fails the alpha test, it is rejected and no further processing occurs. If it
survives, it is passed along to the depth/stencil test where its depth value is tested against the pixel in the
backbuffer or render target texture that it ultimately intends to overwrite. By default, if the depth value
of the current pixel is larger than that of the pixel already stored in the render target at that location, the
pixel is rejected and no further processing occurs. Like the alpha test, the stencil test can make
comparisons between a reference value and, in this case, a previously written stencil mask to determine
whether or not the pixel should overwrite the existing frame buffer value or be thrown away.
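
As a quick refresher, an alpha test configuration along the lines we used in Module I might look like the following sketch (pDevice is assumed and the 0x80 reference value is an arbitrary choice for illustration).

// Hedged refresher (pDevice assumed): reject fragments whose alpha falls below the
// reference value of 0x80 -- an arbitrary value chosen purely for illustration.
pDevice->SetRenderState( D3DRS_ALPHATESTENABLE, TRUE );
pDevice->SetRenderState( D3DRS_ALPHAREF,        0x00000080 );
pDevice->SetRenderState( D3DRS_ALPHAFUNC,       D3DCMP_GREATEREQUAL );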

If the pixel passes the depth/stencil test then it will ultimately be placed in the render target. The next
stage in the pipeline is the computation of the fog factor to blend the pixel’s current color with the
currently active fog color set by the application. As discussed in Module I, the amount of fog color that
is blended with the pixel’s current color is a function of the distance of the pixel from the camera
combined with the current fog configuration settings that have been set on the device.

Note: In shader model 3.0, the fog module is no longer available and the pixel shader is tasked with
having to compute fog on its own. This encourages the shader programmer to use whatever fog formulas
he/she wishes (which was fairly common anyway – even prior to model 3.0). If you are targeting shader
model 2.0 (the minimum for this course), the fixed-function fog module is still available for use.
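
For reference, a typical fixed-function linear fog setup resembles the following sketch; the fog color and the start/end distances are illustrative values only (pDevice assumed).

// Hedged sketch (pDevice assumed): fixed-function linear pixel (table) fog. The grey fog
// color and the 50/400 unit start/end distances are illustrative values only.
float fFogStart = 50.0f, fFogEnd = 400.0f;
pDevice->SetRenderState( D3DRS_FOGENABLE,    TRUE );
pDevice->SetRenderState( D3DRS_FOGCOLOR,     D3DCOLOR_XRGB( 128, 128, 128 ) );
pDevice->SetRenderState( D3DRS_FOGTABLEMODE, D3DFOG_LINEAR );
pDevice->SetRenderState( D3DRS_FOGSTART,     *(DWORD*)&fFogStart );   // floats are passed as raw DWORD bits
pDevice->SetRenderState( D3DRS_FOGEND,       *(DWORD*)&fFogEnd );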

Next we enter the alpha blending module where, if alpha blending is enabled, the color of the pixel is
computed by blending its current color with that of the pixel it will replace in the destination render
target. If alpha blending has not been enabled then no color blending with the render target occurs and
the pixel will simply overwrite the destination pixel currently contained in the frame buffer.

Note: Care is required when using floating point textures as render targets (e.g., for techniques like high
dynamic range lighting) when targeting shader model 2.0 hardware. Shader model 2.0 hardware does
not generally support alpha blending when the render target is a floating point format. The geForce™
6800 (sm 3.0) and later cards in the DirectX 9c line do support it, but even then, often only on 16-bit
render targets (e.g., D3DFMT_A16B16G16R16F). The Radeon™ 9500 – X800 shader model 2.0(x) cards
do not support the alpha blending of floating point render targets at all. That said, 16-bit render target
alpha blending support is available on all model 3.0 compliant hardware, which is pretty standard
nowadays, and generally offers adequate precision for most tasks (including HDR).
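
Rather than relying on assumptions about particular cards, an application can query the runtime directly. The sketch below asks whether post-pixel-shader (alpha) blending is supported on a 16-bit floating point render target texture; the default adapter and X8R8G8B8 display format are illustrative choices and pD3D is an assumed IDirect3D9 pointer.

// Hedged sketch (pD3D assumed): query support for alpha blending into a 16-bit floating
// point render target texture on the default adapter.
HRESULT hr = pD3D->CheckDeviceFormat( D3DADAPTER_DEFAULT, D3DDEVTYPE_HAL, D3DFMT_X8R8G8B8,
                                      D3DUSAGE_RENDERTARGET | D3DUSAGE_QUERY_POSTPIXELSHADER_BLENDING,
                                      D3DRTYPE_TEXTURE, D3DFMT_A16B16G16R16F );

bool bCanBlendFloatTarget = SUCCEEDED( hr );   // if false, fall back to a non-blended path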

Technically, the pixel now enters the dithering stage, although this feature is pretty much defunct on
today’s hardware. Dithering was popular back in the days of 4 and 8-bit color systems, where additional
colors could be simulated by placing two colors next to each other to create the appearance of a third
composited color. Since the pixels are spatially very close together, they essentially merge with respect to
human vision and generate a perceived color that is a mix of the two (e.g., black and white producing a
gray color). Generally speaking, on today’s hardware dithering will not be performed when you are
rendering to render targets that have more than 8-bit color precision (which is almost always the case).
Dithering is enabled on the device by setting the D3DRS_DITHERENABLE render state.

At this point, the pixel has made its way through the entire pipeline and is ready to be placed in the
render target. Which color channels (r,g,b,a) are updated will be a function of the target itself (what
channels does it actually have?), as well as a user-defined color write mask, which by default updates all
destination channels. The default render target will be the backbuffer, but again, we can alternatively
route the output of the pipeline to textures as well. These textures can then have additional post-
rendering processes (filters, etc.) applied to them before being copied into the backbuffer prior to
presentation.

Note: Before shader model 2.0, the device could only output to a single render target at a time -- either
the back buffer or a texture created with the appropriate usage flags. In shader model 2.0, support was
added for up to four render targets that the pixel shader can output color information to simultaneously.
Not all cards supported the maximum of four, but many did. Shader model 3.0 compliance however
requires support for four targets, and availability is fairly widespread at this point.
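
A simplified sketch of binding two render targets is shown below. The pColorSurface and pNormalSurface surfaces are hypothetical (created in the same way as the render target texture shown earlier), and the device cap check guards against hardware exposing fewer simultaneous targets.

// Hedged sketch: bind two render targets at once. pColorSurface and pNormalSurface are
// hypothetical surfaces created with D3DUSAGE_RENDERTARGET, as shown earlier.
D3DCAPS9 Caps;
pDevice->GetDeviceCaps( &Caps );

if ( Caps.NumSimultaneousRTs >= 2 )
{
    pDevice->SetRenderTarget( 0, pColorSurface );    // receives the pixel shader's COLOR0 output
    pDevice->SetRenderTarget( 1, pNormalSurface );   // receives the pixel shader's COLOR1 output

    // ... draw; the pixel shader outputs one color per bound target ...

    pDevice->SetRenderTarget( 1, NULL );             // unbind the extra target when finished
}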

19.1.1 Vertex & Pixel Shader Hardware

When shaders were first introduced with DirectX 8.0, shader programs had to be written using a low-
level assembly language. Those of you that have had some experience with assembly language
programming know that it can be a fairly laborious undertaking at times. There are intricate memory
management requirements at the register level and even very simple ideas can require many lines of
instructions. As a result, assembly code tends to look a bit cryptic to anyone other than the person that
wrote it (and sometimes even to him/her a couple of days after the fact!), so it is a less than ideal choice
when working with a team of programmers. Further, while most commercial game programmers have a
thorough understanding of high-level languages like C, C++, or Java, the same cannot be said of low-
level languages like assembly (or in this case, shader assembly language). As a result, shader creation
was not immediately accessible to people who would normally have no trouble writing C++ graphics
code. Because of the learning curve, many amateur programmers and likely even some professionals
held back and did not immediately make the transition to shader development in large numbers. For
some time after their introduction, shaders were more of a technical feature boasted by a few of the top
development houses than a sweeping new concept being integrated across the board.

To address this problem, DirectX 9.0 saw the introduction of High Level Shader Language (HLSL).
HLSL is a high-level language similar to C that allows us to write shaders in a more human readable
form and compile them into byte code either at runtime or at development time with the HLSL compiler
(which incidentally does a good job optimizing your shader code in the process). This allows shader
developers to concentrate on the tasks the shader code should perform rather than issues such as which
registers to use to temporarily store a value or which instructions to use to fit a certain shader model
target. Because HLSL is essentially like the C language but with extended graphics functionality, any
coder who already programs in C/C++ can immediately jump straight in and start writing shader code. It
is not surprising then that, according to nVidia®, the incorporation of shaders into commercial game
projects increased by a factor of 10 after HLSL was released.

It is now the norm to write shaders in HLSL (or another high level shader language like GLSL for
OpenGL development) and it is quite rare that you will write any assembly shaders, although the option
is still supported in DX9 (not so in DX11!) and they can still be embedded in effect files. Should you
feel like tweaking your code at the lowest level, be warned -- modern HLSL compilers are able to
compile high-level shader code into byte code that is as efficient as (if not more efficient than) code
hand-rolled using assembly language. This is not to say that you cannot squeeze out a few extra cycles if
you write assembly shaders in certain situations, you can. However, unless you really know what you
are doing, you will probably code solutions that are either slower or run at the same speed as those
compiled and optimized by the HLSL compiler, losing valuable development time in the process.

Note: While you should probably tend to avoid writing assembly language shaders in the general case,
this does not mean that you should avoid studying the shader assembly language once you have some
experience with HLSL shader development under your belt. At times it can be helpful to be able to
examine the assembly language code generated by the compiler if things are not working as expected or
if you think there might be some room for optimization.

Because virtually all shader development is done using HLSL these days, such shaders will be the focus
of this course and we will not spend any time looking at assembly language equivalents. With HLSL,
just like standard high level languages such as C or C++, we can use variable names to alias values
stored in registers and functions to perform operations on those values. In fact, we can rely on the D3DX
effect framework to upload all effect parameters into the appropriate shader registers on our behalf. We
usually won’t even need to know which registers are being used since we will reference those values via
parameter name inside the shader/effect.

Now, before we start writing any shader code, it should prove helpful to take a brief look at the
architecture of shader hardware. Since this is a DirectX 9 course and shader model 2.0 will be our
minimum target platform (shader model 3.0 will be our maximum), a good place to start will be the
vertex shader unit as given by the shader model 2.0 specifications (Figure 19.2). The model 3.0
architecture is quite similar in terms of overall form and function, so you should not have any trouble
making the leap to the more advanced architectures once you understand the bigger picture.

The Vertex Shader Unit

Note: The hardware unit on the card that executes vertex shader program code is called the Vertex
Shader and likewise for the chip that executes pixel shader code. This can be somewhat confusing during
early discussions because the term shader is more commonly used when speaking of the programs we
will execute, not the hardware itself. To avoid confusion throughout the rest of this discussion, we will
refer to the programs that execute on the shader hardware as being shaders (a vertex shader or a pixel
shader) and the hardware that executes the code as the shader unit.

Figure 19.2

Note: The layout can vary across different hardware but this is the basic architecture demanded of
shader model 2.0 capable hardware.

A vertex shader can be fed two types of input data -- uniform and non-uniform. Uniform data is data that
remains constant throughout the execution of the shader for each vertex in the vertex stream and is
stored in constant registers (Figure 19.2). For example, when using effect files, the effect framework
will take care of loading any parameter values required by the shader into constant registers on the
hardware when the effect is invoked. When we later execute the DrawPrimitive method, the constant
registers will have been filled appropriately and the vertex shader is executed for each vertex in the
vertex stream (i.e., each vertex specified by the draw call). Because the constant registers are set prior to
the draw call and remain set during the invocation of the shader for each vertex in the stream, every
invocation of the vertex shader will have access to this same set of unchanging constant variables whilst
processing each vertex in the stream.

Elements such as the current world and view matrices or light properties are typical choices for constant
register storage since we expect them to be the same for every vertex in the stream. The constant
registers are read-only and cannot be written to or updated by the shader itself. They exist simply to
facilitate communication of static data between the application about to invoke the shader and the shader
program itself. Under the shader model 2.0 specification, constant registers can store integers, floats, or
boolean values.

Note: Registers in a vertex shader are four component vectors of a given type (integer, float or bool)
and as such, each register can be used to store four values of the given type ( [x,y,z,w] or [r,g,b,a] for
example). One benefit of this approach is that it allows us to process the register on a per-component
basis by using masking (e.g., MyRegister.x or MyRegister.z) as well as process all its components as a
single entity when the need arises to perform vector/vector operations. For example, a 4D dot product
can be carried out with a single operation between two registers inside the shader. Likewise, we could
perform a multiply between only the first two components of each source register by using the (.xy)
suffix for example. The architecture has been designed very much with 3D mathematics in mind, to
facilitate both scalar and vector operations using the same registers.

The constant registers are handy for accessing static data that does not change on a per-vertex basis
during the execution of the DrawPrimitive call, but what about data that does? Position, texture
coordinates, and normals, for example, are usually unique to each vertex and, as such, are normally part
of the vertex stream itself. When we create a vertex structure for fixed-function use, these items will be
interwoven with other possible vertex elements in a vertex buffer and passed along to the transformation
pipeline. Undoubtedly this data is quite important to the shader as it will need access to the model/world
space position in order to calculate a final clip space position at the very least. Data that is specified at the
vertex level, and thus can vary with each invocation of the shader, is referred to as non-uniform data.

As you might expect, there are dedicated registers for dealing with exactly this type of data. The input
registers are essentially a direct feed of the vertex as specified in the vertex stream. Figure 19.2 shows
the input registers (v0 - v15) and how they get information to the shader.

So how does this all work? When we design our vertex structures for use with shaders, we also create an
additional descriptive object called a vertex declaration. A vertex declaration informs DirectX about the
types, sizes, and order of the information stored in our vertex (e.g., position, normal, texture coordinate,
etc). This is similar to, but more flexible than, using fixed-function FVF flags because FVF flags enforce
certain data ordering rules (e.g., position must be defined before the normal) that declarations do not.

A simple vertex declaration is shown below to help you visualize the role these declarations will play.

D3DVERTEXELEMENT9 dwDecl3[] =
{
{0, 0, D3DDECLTYPE_FLOAT3, D3DDECLMETHOD_DEFAULT,
D3DDECLUSAGE_POSITION, 0},
{0, 12, D3DDECLTYPE_D3DCOLOR, D3DDECLMETHOD_DEFAULT,
D3DDECLUSAGE_COLOR, 0},
{0, 16, D3DDECLTYPE_D3DCOLOR, D3DDECLMETHOD_DEFAULT,
D3DDECLUSAGE_COLOR, 1},
D3DDECL_END()
};

This example describes a vertex that contains a position, followed by two color components that we
might use to transport per-vertex diffuse and specular values. Don’t worry about what all the flags and
parameters are for now. We will see later that they describe things like the byte offset from the
beginning of the structure, data element size, etc. More important now is the penultimate value of each
element that describes its usage. It is here that we will choose from a number of possible flags to identify
the type of data at a high level. For example, using the D3DDECLUSAGE_POSITION flag informs the
pipeline that the first member in the vertex will contain the vertex position (usually model or world
space).

Once we have defined a vertex layout using an array of D3DVERTEXELEMENT9 structures (as per
the above example), we will call the IDirect3DDevice9::CreateVertexDeclaration method to arrange this
information into a more optimal form that Direct3D can work with at runtime. The result is an
IDirect3DVertexDeclaration9 interface that we will use later on during rendering.
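
For illustration, creating the declaration object from the dwDecl3 array above (and binding it later at render time) might look like the following sketch; error checking is omitted and pD3DDevice is assumed to be a valid device pointer.

IDirect3DVertexDeclaration9 * pDecl = NULL;

// Bake the element array into a declaration object the runtime can use
pD3DDevice->CreateVertexDeclaration( dwDecl3, &pDecl );

// Later, before rendering geometry stored in this layout (takes the place of SetFVF)
pD3DDevice->SetVertexDeclaration( pDecl );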

Similar to setting FVF flags as we have done in the past, before we draw any geometry we will need to
provide information about the format of the vertices about to enter the pipeline. To accomplish this we
call the IDirect3DDevice9::SetVertexDeclaration method (passing our IDirect3DVertexDeclaration9),
just as we might have previously called the IDirect3DDevice9::SetFVF method. When the draw call is
issued, the pipeline can now examine the declaration to route vertex components into the appropriate
input registers. Using the current example, vertex position would be placed in register v0, the first color
in register v1, and the second in v2.

If we were using assembly language to write shaders, we would work with these input registers directly.
With HLSL, we will instead use semantics to identify parameters in our shaders as being of a specific
type (e.g., position) using keywords that Direct3D understands (e.g., POSITION, NORMAL, etc.). You
will see later that when we do this, coupled with our use of vertex declarations, we no longer need to
have any worries about which data went into which registers, and can instead access the input register
data using meaningful variable names (e.g., Position or ModelSpacePosition).

Note: Since you can use shaders without using effect files, there are of course functions that exist to
allow manual filling of registers (the same functions used by effects behind the scenes). Separate
methods also exist to create vertex and pixel shaders outside of effects and pass them to the pipeline for
use and we will look at some of these a little later. When using effect files that contain embedded
shaders however (as we will generally be doing for our lab projects), the shaders are automatically bound
to the device and the constant and input registers populated when we invoke the effect.

The constant and vertex stream input registers are read-only, so there is clearly going to be a need for
temporary registers that can both be read from and written to. Such registers are used to store results for
the various calculations performed during shader execution. For example, the loop counter register aL is used
for keeping track of loops, where a counter must be read and incremented for each iteration of the loop.
The address register a0 can be used to perform dynamic indexed addressing into the constant register array, which
might be useful if your constant registers contained a palette of matrices for blending. Once again, you
are reminded that we are talking about shader programming at a very low level at the moment. When
writing HLSL shaders, we won’t need to be involved with managing these registers directly.

The r0 – rN registers are available for storing any intermediate data used throughout the execution of the
shader. This is very similar to the way that we use local variables in a C++ function. Remember that
when using HLSL you will not have to concern yourself with working with such registers directly. The
HLSL compiler is smart enough to know that if you declare a variable that is used temporarily in your
shader -- which is pretty much any variable you define which is not a constant register alias (uniform),
an input register alias (non-uniform), or an output register -- the variable must be a temporary one whose
lifetime will not exceed the current invocation of the shader.

So we have now seen that the vertex shader is supplied with three vital resources to perform its tasks.
The vertex input registers contain the data (position, normal, etc.) for the current vertex for which the
shader is being invoked. The constant registers provide the shader with additional data from the
application which can be used to produce the output of the shader (light source properties,
transformation matrices, etc.). The temporary registers, with their read and write access, provide the
shader with a bank of local variables that can be used during execution to store the results of one
calculation for input to another.

Once the shader has completed any tasks we coded for it, it will need to pass result data out to the
rasterization module where the vertices will be assembled into triangles. The output registers provide
the means for data transport from the shader to the rest of the pipeline. Looking at the shader unit
illustration in Figure 19.2 you can see that the output registers have an ‘o’ prefix. The most important
and commonly used is the oPos register which should be assigned the clip space position of the vertex.
Once the shader has terminated, the later stages of the pipeline will read the oPos register to fetch the
transformed vertex position and perform the divide by w to project the coordinate into 2D space, where
triangle assembly and rasterization can begin.

Every vertex shader needs to output a position via the oPos register, but there are also other output
registers such as oD0 and oD1 that can be used to output two colors (e.g., diffuse and specular). In shader
model 2.0 there are eight texture coordinate registers that you can use (oT0…oT7) to pass along texture
coordinates for sampling or just about any arbitrary data you wish. The texture coordinate registers are
commonly used to transport a variety of pieces of information that will be useful in our pixel shaders
since we often do not need all eight registers for actual texture lookup coordinates. You are reminded
that these registers are four component floating point registers and thus, eight registers means we have
32 floats that can be used to carry data out of the vertex shader and into the pixel shader. As long as the
scalar or vector values can be interpreted linearly, we can take advantage of this storage space. Light and
camera direction vectors, color scaling values, tangent space coordinate axes, etc. are just a few
examples of the virtually unlimited types of information we can pass via texture coordinate registers.

A key point to remember is that any properties generated in the vertex shader will be linearly
interpolated over the surface of the triangle during rasterization. Thus a vector that runs between, say, a
light source and each vertex position will be interpolated such that when the vector arrives in the pixel
shader, it will point at the exact position on the surface of the triangle that we care about (i.e., the
location represented by the pixel we will be processing). Of course, since the interpolation is linear, we
have to make sure that the values we pass along are able to be interpolated linearly. For example, doing
a projective division (e.g., divide by w) on a position -- perhaps to generate a texture coordinate -- is a
non-linear operation, so we will not be able to do it in the vertex shader and expect that our interpolated
values will be correct in the pixel shader. In this particular case, we would have to output the original
un-projected position with w in the fourth component since this can all be linearly interpolated. Once it
arrives in the pixel shader, we will have to do the projection divide ourselves.

The vertex shader also has a fog output register that you can use for passing along your own fog factor.
This fog factor will be used later in the pipeline by Direct3D during the per-pixel fog calculations. Once
again, this value is interpolated during the rasterization phase to generate per-pixel fog factors which
will be used to blend between the current source color of the pixel being rendered and the fog color.

Although we will not usually need to interact with the registers directly when coding in HLSL, the
instruction and register count limits still exist nonetheless. The HLSL compiler is, after all, only
compiling your HLSL into assembly language equivalent shaders, so it is good to be aware of these
concepts. For example, in shader model 2.0 there are 12 temporary registers that can be used for storing
intermediate calculations during shader execution (in 3.0 this jumps to a more manageable 32). If we
compile an HLSL shader for the 2.0 target and our code is constructed in such a way that it uses more
than 12 temporary variables simultaneously, the compile process will fail, resulting in an error indicating
that it ran out of temporary registers. Below we see a table describing the number of each register type
available in shader models 1.1 – 3.0.

Vertex Shader Input Registers

Name                         Register   VS_1_1                VS_2_0                 VS_2_x                 VS_3_0

Input Registers              v#         16                    16                     16                     16
Constant Float Registers     c#         * see note (min 96)   * see note (min 256)   * see note (min 256)   * see note (min 256)
Constant Integer Registers   i#         n/a                   16                     16                     16
Constant Boolean Registers   b#         n/a                   16                     16                     16
Temporary Registers          r#         12                    12                     ** see note            32
Address Register             a0         1                     1                      1                      1
Loop Counter Register        aL         n/a                   1                      1                      1
Sampler                      s#         n/a                   n/a                    n/a                    4

* Equal to D3DCAPS9.MaxVertexShaderConst
** Equal to D3DCAPS9.VS20Caps.NumTemps

A couple of things are worthy of note here if you intend to compile your shaders to target earlier shader
model hardware.

First, in all cases, the maximum number of constant float registers (the most commonly used by far) that
may be available is not set in stone in the specification. Instead, this value can be queried from the caps
of the device. In all cases, the number of constant float registers available is specified in the
MaxVertexShaderConst member of the D3DCAPS9 structure. While no maximum is specified, a
minimum number is required. This is useful because if you wish to play it safe you can make sure that
you use no more than the number of constant registers specified as the minimum for the shader model
target. If targeting shader model 1.x for example, you have at least 96 constant registers available,
although a given card may support more.
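
If you would rather not rely on the guaranteed minimums, the actual limits can be queried from the device caps at startup, along the lines of the sketch below.

D3DCAPS9 caps;
pD3DDevice->GetDeviceCaps( &caps );

// Number of float constant registers (c#) available to the vertex shader
DWORD nConstants = caps.MaxVertexShaderConst;

// Number of temporary registers (r#) reported by vs_2_x class hardware
DWORD nTemps = caps.VS20Caps.NumTemps;

// Highest vertex shader version supported (e.g., compare against D3DVS_VERSION(3,0))
DWORD vsVersion = caps.VertexShaderVersion;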

Note: You are reminded that registers are four component floats unless otherwise stated. Thus, a 4x4
matrix passed to the shader by the application will consume four constant registers. As it turns out, all
registers in the above table are four component vectors, with two exceptions -- the boolean constant
registers and the loop counter register are not vectors and store only a single component.

Second, integer and boolean constant registers were only introduced with shader model 2.0, as was the
loop counter register. 1.x shaders cannot perform loops or any type of branching within the code path.

Finally, shader model 3.0 adds four additional registers for texture sampling (s0-s3). While this was
previously a register type available only to pixel shaders, shader model 3.0 opened the door to doing
texture lookups in the vertex shader. This is a very powerful feature, but it is also fairly expensive on
early model 3.0 hardware, so be aware of the performance implications if you decide to use it.

Note: If you are supporting 1.x shader models and your HLSL code contains loops, the HLSL compiler
will not necessarily fail. Instead, the compiler will try to unroll the loop, eliminating it from the shader and
replacing it with separate instructions. Failure will occur if unrolling the loop ultimately generates more
instructions or register allocations than are supported by the model.

Let us now look at the output registers table which will summarize the vertex shader registers we
discussed towards the end of our Figure 19.2 examination.

Vertex Shader Output Registers 1.1 – 2.x

Name Register VS_1_1 VS_2_0 VS_2_x VS_3_0


Position Register oPos 1 1 1 n/a
Fog Register oFog 1 1 1 n/a
Point Size Register oPts 1 1 1 n/a
Diffuse Color Register oD0 1 1 1 n/a
Specular Color Register oD1 1 1 1 n/a
Texture Coordinate Registers oT# 8* see note 8 8 n/a

* Some older model 1.x cards do not support all eight texture coordinate channels. For example, the GeForce 3 supported only
four oT registers whilst the Radeon 8500 supported six.

Notice that these output registers are constant across all shader models except 3.0 or higher, in which
case the output registers have been completely overhauled. When programming assembly style shaders
for vertex shader model 3.0+ there are no longer specific registers in which the position, fog, or color
must be placed. All of the above registers are gone and have been replaced with a bank of multipurpose
registers that we can use to store the data we wish to output in any order. Once again, semantics are used
to clearly label the intended use of an output register so that the correct data gets routed to the
appropriate place for later use in the pipeline. With semantics, it doesn’t matter whether our shader
stored its transformed position value in the first or tenth output register, as long as it has been labeled
with the appropriate semantic, the pipeline will know that the position is contained in that register and
can forward it to the other processes correctly.

The 12 new output registers in shader model 3.0 are labeled o# (with # being a number between 0 and 11)
and can be used to store anything that the user wants to be interpolated by the rasterizer and fed into the
pixel shader. This array of registers, just like the constant float registers, can be indexed into using the
address register (a0). Each register is four components in size and can be used to store normals, texture
coordinates, colors, and anything else the shader may wish to output.

While we won’t be doing any significant shader assembly coding in this course, a small snippet of an
assembly style vertex shader is shown below. Don’t worry too much if you don’t understand it
completely; its purpose is to demonstrate how the registers are accessed and used when you are writing
your shaders in assembly language. In this course we will write our shaders in HLSL and things will be
much more pleasant for us.

For the purposes of this example, do not worry about how the input data gets into the registers via the
vertex stream or how the application sets the values of the constant registers -- this will all be discussed
shortly. We will just assume that the application has passed a combined 4x4 world/view/projection
matrix that it wishes the shader to use to transform the vertex position. Since a single register is a 4D
float vector, a 4x4 matrix (four 4D vectors) will need to be stored in four constant registers by the
application prior to the shader being invoked. In this example, we will assume the application has stored
the four columns of the (originally row-major) matrix in constant float registers c0 – c3. We will also
assume that the model space position of the vertex is stored in input register v0, the diffuse color is
stored in input register v5, and a single set of texture coordinates is stored in input register v2.

dp4 oPos.x , v0 , c0    ; Calculate transformed X
dp4 oPos.y , v0 , c1    ; Calculate transformed Y
dp4 oPos.z , v0 , c2    ; Calculate transformed Z
dp4 oPos.w , v0 , c3    ; Calculate transformed W
mov oD0 , v5            ; Copy diffuse color unchanged into output register
mov oT0 , v2            ; Copy texture coordinates unchanged into output register

We learned way back in Chapter 1 that transforming a vector4 by a 4x4 matrix is essentially just four
dot product operations between the input vector and the columns of the matrix, where each dot product
calculates the value of one of the components of the output vector. Thus, to transform our model space
vertex (stored in v0) by our matrix (stored in constants c0-c3) we simply need to perform a dot product
between v0 and each of the constant registers. The assembly shader language uses the dp4 instruction for
this purpose (it performs a dot product between two 4D vectors). Notice how we also use component
masking on the resulting oPos register to indicate the resulting component in the output position register
where we would like the result of each dot product stored. For example, in the first line we perform the
calculation of the transformed x position of the vertex by performing a dot product between the vertex
position and the first column of the matrix. By using the .x mask, we are instructing the shader that we
would like the single scalar result of the dot product to be stored in the x component of oPos (the output
position register). After doing this for each of the four components of the output position register, oPos
will contain our final clip space coordinate.

We can theoretically calculate the color of the vertex and/or texture coordinates using any formulas that
suit our purposes, however in this example we have simply copied over the diffuse color and texture
coordinate from the vertex unchanged into the appropriate color and texture coordinate output registers.
We can see another standard assembly style instruction being used here – mov – which is used to copy
(“move”) the value from one register into another. When this shader terminates, oPos will contain the
clip space position of the vertex, oD0 will contain the diffuse color of the vertex and oT0 will contain
the texture coordinates. The position output by the shader will later undergo the perspective divide and
all output components will be interpolated during rasterization to create per-pixel inputs to a pixel
shader (or to the fixed-function color blender if no pixel shader is provided).

Note: When using vertex shaders, the Direct3D lighting pipeline is no longer available to calculate your
vertex colors. The vertex shader replaces this module, so you will now be responsible for calculating your
own lighting (either per-vertex, per-pixel, or using some combination of the two).

Again, although we will not be writing our shader code in assembly language in this course, this snippet
gives you a general idea about what they can look like and how, whether we write in HLSL or assembly,
the hardware registers are being used. With a discussion of the underlying vertex shader mechanics out
of the way, let us now examine the pixel shader unit in the same way.

The Pixel Shader Unit

For each primitive in a DrawPrimitive call, the vertex shader is invoked to transform each of its vertices
into clip space. We have discussed previously how the vertex shader is responsible for not only calculating
the position, but all data that is to be interpolated during the rasterization phase. This includes texture
coordinates (or whatever data is stored in the texture coordinates) and vertex colors. After the vertices
have been transformed, they undergo clipping and of course, the projective divide by w. At this point, we
know the area of the projection window that will be occupied by the primitive and the rasterization phase
begins. It is during the rasterization phase that the pixel shader is going to be invoked.

Figure 19.3
During rasterization, the edges of the primitive are known and scanline conversion begins. The rasterizer
steps across the surface of the primitive one pixel at a time and calculates the colors, texture coordinates,
and position for each pixel by interpolating the values stored in the vertices (calculated by the vertex
shader). For each pixel, once its interpolated colors and texture coordinates are known, this data is
passed to the pixel shader to calculate the final color of the pixel based on the inputs it has been
provided and any textures that may have been set.

The pixel shader completely replaces the fixed-function color blender and texture sampling operations
that used to be available via the SetTextureStageState and SetSamplerState methods. A pixel shader will
contain the code that will interface with the sampler units to sample data from any textures currently
bound to the device. Unlike the fixed-function color blender, where you have a limited number of
blending operations to choose from, in the pixel shader, you can combine the colors you sample from
textures in just about any way you wish, using a variety of intrinsic math functions available via the
shader language.

Figure 19.3 shows the pixel shader unit according to the specification for shader model 2.0. Although
most of the register management will be hidden from us when writing HLSL shaders, particularly in
conjunction with effect files, a discussion of the underlying hardware and the available registers is still
important since it makes us aware of certain limitations under which we must operate.

As with vertex shaders, we have both uniform and non-uniform input to the shader. Uniform input
comes in via the constant registers which are each 4D vectors. Under the 2.0 specification, we have 32
float registers, 16 integer registers and 16 boolean registers. The boolean registers are the exception in
that they are not four components but instead store a single value. The constant registers are used in
exactly the same way as in the vertex shader -- they allow the application to communicate information to
the shader that will help it calculate the pixel color correctly. Constant registers could contain light
source information, material coefficients for a per-pixel lighting equation, etc. As with the vertex shader
constant store, when using effect files, any effect parameters used by the shader will automatically be
uploaded into the constant registers when the effect is invoked. Inside the HLSL shader, we can simply
work with the value by its variable/parameter name. When not using effects, Direct3D has functions
which allow us to manually set the values of the constant registers (more on these later) and when
writing assembly style shaders, these values will be accessed and used within the shader using the
register name directly (e.g., c0 for constant register 0).
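
When bypassing effects, filling a pixel shader constant register manually follows the same pattern we saw for the vertex shader; a brief sketch (the register index and values here are arbitrary):

// Place a light color in pixel shader constant register c0
D3DXVECTOR4 vecLightColor( 1.0f, 0.95f, 0.8f, 1.0f );
pD3DDevice->SetPixelShaderConstantF( 0, (const float*)&vecLightColor, 1 );

// SetPixelShaderConstantI and SetPixelShaderConstantB serve the integer and boolean banks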

The non-uniform input is the data that the rasterizer has interpolated from the vertices of the primitive
prior to invoking the pixel shader. This can be the interpolated diffuse and specular color of the pixel
(passed in through input registers v0 and v1) and/or the interpolated texture coordinate sets. A maximum
of eight 4D texture coordinates can be passed to the shader and this is done through the texture
coordinate registers (named t#), of which there are eight available.

Another set of registers available to us are the sampler registers (s#) which allow us to communicate
with sampler units. There are 16 sampler units and therefore, 16 sampler registers under the 2.0 pixel
shader specification. These sampler registers provide access to textures and the appropriate sampling
settings, such as whether or not bilinear filtering should be used. We saw in the last chapter how we set
up these sampler registers in our effect files to sample from a specific texture in the appropriate way. In
an HLSL pixel shader, we have access to a nice set of functions to aid us in using these sampler units to
sample colors from our textures.
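
Outside of an effect file, the application binds a texture to a sampler register and configures its filtering and addressing through the familiar device calls; a quick sketch for sampler s0 (pDiffuseTexture is an assumed, already-created texture):

// Bind a texture to sampler register s0 and request bilinear filtering with wrapping
pD3DDevice->SetTexture( 0, pDiffuseTexture );
pD3DDevice->SetSamplerState( 0, D3DSAMP_MINFILTER, D3DTEXF_LINEAR );
pD3DDevice->SetSamplerState( 0, D3DSAMP_MAGFILTER, D3DTEXF_LINEAR );
pD3DDevice->SetSamplerState( 0, D3DSAMP_ADDRESSU,  D3DTADDRESS_WRAP );
pD3DDevice->SetSamplerState( 0, D3DSAMP_ADDRESSV,  D3DTADDRESS_WRAP );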

Figure 19.3 shows that the pixel shader also has access to a number of temporary registers which are
both readable and writable. They are used for temporary data storage during the execution of the shader
and for keeping track of loops, etc., just as we saw in the vertex shader.

Finally, we come to the output registers, which are where the purpose of the pixel shader becomes clear.
Quite simply, in most cases the pixel shader has a single task -- to output a color via the oC0 register.
Optionally, the pixel shader can also load a custom depth for the pixel into the oDepth register, but if
this is not done, the standard depth calculation has still been performed and that value will be used
instead. In shader model 2.0, support for up to four separate output render targets has been added. As
such, there are four oC# (color) registers labeled oC0 – oC3. oC0 must always be written to and is the
one you will place your final color in when using a standard single backbuffer/target rendering
approach. These three optional color registers exist to facilitate the composition of data in multiple
render targets simultaneously via a single pixel shader.
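
Writing to the additional color registers only makes sense if extra render targets have been bound beforehand. A rough sketch of attaching a second target is shown below; pSecondTexture is an assumed texture created with D3DUSAGE_RENDERTARGET, and the hardware must report more than one simultaneous render target in its caps.

D3DCAPS9 caps;
pD3DDevice->GetDeviceCaps( &caps );

if ( caps.NumSimultaneousRTs > 1 )
{
    IDirect3DSurface9 * pSecondTarget = NULL;
    pSecondTexture->GetSurfaceLevel( 0, &pSecondTarget );

    // Target 0 remains the primary target (written via oC0); anything the
    // pixel shader writes to oC1 ends up in this second surface.
    pD3DDevice->SetRenderTarget( 1, pSecondTarget );
}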

19.2 Using HLSL Shaders in Effects

When shaders were first introduced with DirectX 8, the hardware was pretty simplistic by today’s
standards and the shader language was quite small and very limited. At the time, shaders had to be
programmed in assembly language and, while it wasn’t terribly difficult to work with, it did introduce a
learning curve that initially pushed many developers away from embracing them. As the hardware
became more complex and the shader language was extended with each new shader model, shader
programs started to become more complex and a great deal larger in size. By the introduction of shader
model 2.0, it became clear to Microsoft, and many others, that a more friendly and efficient way of
writing shaders was needed. To be clear, we are not referring to efficiency in terms of shader execution
speed, but from a development timeframe standpoint. Assembly is not as user-friendly or as easily read
as high level languages like C and C++, so such shaders usually take longer to develop, are more prone
to error, and are quite a bit harder to debug.

Microsoft addressed this issue with the introduction of the High Level Shading Language (HLSL) in
DirectX 9.0. HLSL was based on the C language and thus was instantly familiar to almost all coders in
the field. Beyond being easier to understand, it also allowed developers to focus on the tasks that their
code should perform without worrying about what data was stored in which registers. Developers, large
and small alike, shifted to shader development in large numbers, and the industry has not looked back
since.

Shaders can be written in any text editor (e.g., Notepad or your C++ IDE) in the same way effects can. In
fact, you will see shortly that an HLSL source file is really just one or more variable declarations and
one or more functions. For example, a vertex shader is ultimately just a function that will be called by
the pipeline when a vertex is about to be processed; likewise for the pixel shader. The vertex shader and
pixel shader functions can be declared in the same text file along with any variables that are used by one
or both of the shader functions. You might imagine then that an HLSL source file might at most contain
two HLSL functions -- a vertex shader and a pixel shader -- but this is not necessarily the case. Just like
standard C and C++, complex code can be broken into a series of smaller utility functions that can each
be called by the master function in order to complete the task. This is a method we use all the time as
programmers so that we don’t have to debug single functions with thousands of lines of code. A shader
source file might contain dozens of functions which are all called by either the vertex or pixel shader
master functions. In fact, your effect file might even contain dozens of master shader functions. Perhaps
the effect contains multiple techniques of varying complexity which each have their own shader
functions aimed at a specific shader model target. Furthermore, it is quite common for a single multi-
pass technique to utilize different vertex or pixel shaders in each pass.

In many ways, you have already started down the path of shader programming, although you might not
know it yet. For example, the parameter types we used in the previous chapter to define our effect files
are actually considered part of the HLSL language. The only thing missing from our effect files in the
previous chapter for them to be considered shader effects is the code for the vertex and pixel shader
functions. You may recall that within the techniques of our prior example effects we set the
VertexShader and PixelShader states to NULL (their default value), which informed the pipeline that we
required the fixed-function pipeline to transform our vertices and color our pixel output. However, had
we defined some HLSL vertex and pixel shader functions in those effect files instead, we could have
assigned the function names to these effects like so:

// ----- Variable / Parameters -----

matrix WorldViewProj;
float3 LightDirection;

// ----------- Shaders ---------------

// Vertex shader
void MyVertexShader()
{
    // HLSL Code Goes Here
}

// 1st pass pixel shader
void MyPixelShader_Pass1()
{
    // HLSL Code Goes Here
}

// 2nd pass pixel shader
void MyPixelShader_Pass2()
{
    // HLSL Code Goes Here
}

// --------- Techniques ---------------

Technique MyExampleTechnique
{
    pass p0
    {
        ...
        ... Set other states here
        ...

        VertexShader = compile vs_2_0 MyVertexShader();
        PixelShader  = compile ps_2_0 MyPixelShader_Pass1();
    }

    pass p1
    {
        ...
        ... Set other states here
        ...

        VertexShader = compile vs_2_0 MyVertexShader();
        PixelShader  = compile ps_2_0 MyPixelShader_Pass2();
    }
}

The above effect file is incomplete, but it does provide an example of how shader functions can be
defined in an effect file and then referenced by the various techniques. As we saw in the last chapter,
this same effect file can contain multiple techniques, including those that do not reference shader
functions at all (to provide fixed-function fallbacks for non-shader capable hardware). The above code
also demonstrates that there is really nothing mystical or difficult about incorporating shaders into an
application, particularly when using the effect framework. With effects, shaders are simply functions
defined in your effect file that any pass of any technique can utilize to override the fixed-function
behaviors for the processing of vertex/pixel data. Whenever we invoke an ID3DXEffect that assigns
anything other than NULL to the VertexShader and PixelShader states, the function specified will be
uploaded to the hardware and used for the processing of all geometry and pixels that utilize that effect.

Studying the above code there are some very important observations we can make. First, notice that
when we assign a shader function name to either the VertexShader or PixelShader states, we precede the
function name with a compile directive followed by the compile target (shader model) for which we
would like the shader function to be compiled. Remember, when we load an effect file, it must first be
compiled into an ID3DXEffect before it can be used. When the effect contains no shader code, this is a
simple task of just building a list of device state assignments for each technique. However, when the
effect file contains shader functions, these functions must be compiled. It is at effect compile time that
the compile directives are executed to compile the HLSL functions into shader byte code. Behind the
scenes, this means the vertex and pixel shader compilers will be invoked, so they must be told about the
shader model you intend the compiled code to target. This is important because the shader compilers
must know which instruction sets they have at their disposal during compilation. For example, if we
issued the following state assignment in one of our techniques…

VertexShader = compile vs_1_1 MyShader();

…the vertex shader compiler would know that if the function code contains any loops it must try to
unroll them because loops are not supported in the 1.1 instruction set. Furthermore, the shader model
target allows the compiler to know how many registers in each pool it has access to and the maximum
instruction counts allowed, etc.

If the vertex or pixel shader compiler is unable to compile the shader code to the requested shader model
target, the effect will fail to compile and the compiler will return a list of the errors (such as ‘ran out of
temporary registers’, etc).

Looking once again at our partial effect file above, we can see that this code also demonstrates an
example where the same technique can use different shaders per pass. Both passes utilize the same
vertex shader code, but each pass calls upon a different pixel shader. If the effect you are trying to create
demands it, you can write techniques with multiple passes where each pass uses completely different
vertex and pixel shaders, and of course different device states as well.

Finally, the variable declarations are worth revisiting for a moment. The variables we have defined for
the sake of demonstration are a combined world/view/projection matrix and a 3D vector which should
be filled with a light direction vector (we will confine our examples to a single light source for now). As
discussed in the previous chapter, these parameters can be set by the application via the ID3DXEffect
interface. If these parameters are used within a given technique, then it allows the application to
communicate data to the effect about how certain device states should be set based on the current state
of the simulation. However, much more important is the fact that these parameters can be referenced
inside the shader functions to provide the application a means to influence shader execution.

When the effect is compiled, the D3DX effect framework is clever enough to keep track of the effect
parameters that are being used by our shader functions. This is important because, behind the scenes, the
shader can only access such data if it is stored in its constant registers. As discussed in the last chapter,
when the application sets the values of these parameters, they are first stored in a system memory copy
inside the effect. It is only later, when the effect is applied (BeginPass), that this data is uploaded to the
hardware so that shaders can access it. In our shader code we can access these parameters using the
names we used to define them in the effect, but under the hood the shader compiler will have mapped
these variables as aliases to explicit registers in the constant register bank.
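
As a reminder of how this looks from the application side, the sketch below sets the two example parameters and then invokes the technique; at BeginPass the stored values are copied into the appropriate constant registers and the compiled shaders are bound to the device. The variables matWorldViewProj and vecLightDir are assumed application data.

// Update effect parameters from application data
pEffect->SetMatrix( "WorldViewProj", &matWorldViewProj );
pEffect->SetValue ( "LightDirection", &vecLightDir, sizeof(D3DXVECTOR3) );

// Render with the technique -- shaders and constants are applied in BeginPass
UINT nPasses = 0;
pEffect->SetTechnique( "MyExampleTechnique" );
pEffect->Begin( &nPasses, 0 );

for ( UINT i = 0; i < nPasses; ++i )
{
    pEffect->BeginPass( i );
    // ... issue DrawPrimitive / DrawIndexedPrimitive calls here ...
    pEffect->EndPass();
}

pEffect->End();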

19.2.1 Compiling Effects with Shaders

In the previous chapter we looked at how to compile effects at development time using the command
line effect compiler, fxc.exe, which ships with the DirectX SDK. This is an alternative way of compiling
effects which allows you to remove the overhead of effect compilation from the runtime component and
distribute only the binary version of your effects to keep out prying eyes.

Fortunately, shaders do not change things very much and we will invoke the effect compiler in exactly
the same way as we did for non-shader effects, although we may change a few input settings. For
example, below we see our example from the previous chapter where we used fxc.exe to compile an
effect file called Terrain.fx. We specify (via the /Fo switch) that the compiled output file should be
called Terrain.fxo. Recall the error that was output when we initially attempted to do this:

C:\dx9sdk\Utilities\Bin\x86>fxc.exe Terrain.fx /Fo Terrain.fxo

Microsoft (R) D3DX9 Shader Compiler 9.29.952.3111


Copyright (C) Microsoft Corporation 2002-2009. All rights reserved.

error X3501: 'main': entrypoint not found

compilation failed; no code produced

The error occurs because fxc.exe is an HLSL compiler, not a generic effect file compiler, and we had no
shader main entrypoint. To fix this issue, we initially invoked the compiler again with the '/?' parameter.
This triggered the compiler to output all the commands and switches it supports (the output is shown
below), which is a handy trick to remember.

C:\dx9sdk\Utilities\Bin\x86>fxc.exe /?

Microsoft (R) Direct3D Shader Compiler 9.29.952.3111


Copyright (C) Microsoft Corporation 2002-2009. All rights reserved.

Usage: fxc <options> <file>

/?, /help print this message

/T<profile> target profile


/E<name> entrypoint name
/I<include> additional include path
/Vi display details about the include process

/Od disable optimizations


/Op disable preshaders
/O{0,1,2,3} optimization level 0..3. 1 is default
/WX treat warnings as errors
/Vd disable validation
/Zi enable debugging information
/Zpr pack matrices in row-major order
/Zpc pack matrices in column-major order

/Gpp force partial precision


/Gfa avoid flow control constructs
/Gfp prefer flow control constructs
/Gdp disable effect performance mode
/Ges enable strict mode
/Gec enable backwards compatibility mode
/Gis force IEEE strictness
/Gch compile as a child effect for FX 4.x targets

/Fo<file> output object file


/Fc<file> output assembly code listing file
/Fx<file> output assembly code and hex listing file
/Fh<file> output header file containing object code
/Fe<file> output warnings and errors to a specific file
/Vn<name> use <name> as variable name in header file
/Cc output color coded assembly listings
/Ni output instruction numbers in assembly listings

/P<file> preprocess to file (must be used alone)

@<file> options response file


/dumpbin load a binary file rather than compiling
/Qstrip_reflect strip reflection data from 4_0+ shader bytecode
/Qstrip_debug strip debug information from 4_0+ shader bytecode

/compress compress DX10 shader bytecode from files


/decompress decompress bytecode from first file, output files should
be listed in the order they were in during compression

/D<id>=<text> define macro

/LD Load d3dx9_31.dll
/nologo suppress copyright message

<profile>: cs_4_0 cs_4_1 cs_5_0 ds_5_0 fx_2_0 fx_4_0 fx_4_1 fx_5_0 gs_4_0
gs_4_1 gs_5_0 hs_5_0 ps_2_0 ps_2_a ps_2_b ps_2_sw ps_3_0 ps_3_sw ps_4_0
ps_4_0_level_9_1 ps_4_0_level_9_3 ps_4_0_level_9_0 ps_4_1 ps_5_0 tx_1_0
vs_1_1 vs_2_0 vs_2_a vs_2_sw vs_3_0 vs_3_sw vs_4_0 vs_4_0_level_9_1
vs_4_0_level_9_3 vs_4_0_level_9_0 vs_4_1 vs_5_0

Obviously many of these switches only have applicability when dealing with effects that contain shaders
since they influence such things as whether matrices should be stored in row or column-major order,
whether an assembly listing should be generated when the shader is compiled (very useful for seeing
what your shader looks like in its assembly form) as well as things such as whether or not the compiler
should try to optimize your code. More importantly however, and the key to fixing our fixed-function
compilation case, was the profile switch (/T). Recall that for our fixed-function-only effect to compile
successfully, we had to specify a target profile that specifically requested an effect compile (/T fx_2_0).
Note the leading 'fx' aspect of the indicated profile.

Examining the possible profiles we can choose from in the above listing, we can see that it contains all
of the shader models we have been discussing thus far. As such, if we did happen to have shaders in our
effect file, shouldn’t we be instructing the compiler to use some of these instead? Otherwise, how would
the compiler know which profile to compile under when compiling all the shader functions in our effect?
Well, this isn’t necessary in the case of effect files because the information about which profile to use is
described by the techniques themselves. For example, in the previous example effect file we saw the
following:

VertexShader = compile vs_2_0 MyVertexShader();

When the HLSL compiler (fxc.exe) is instructed to compile an effect using the /T fx_2_0 switch,
specifying the shader model profile is not necessary because each target is explicitly stated in the
technique itself, as the above line demonstrates.

Note: We can pass fxc.exe a file that contains multiple shader functions and ask it to extract individual
shaders for standalone compilation via the /E switch. In this case, we will need to provide shader names
as entrypoints and also be specific as to the shader model we'd like to target. The compiled results will
go to an output file that is not a binary version of a D3DX effect, but is instead a binary instance of a
shader object. You will use the effect compiler this way if you intend to bypass the D3DX effect
framework and instead make your application responsible for binding vertex and pixel shaders to the
device and uploading constant data to the shader registers manually (more on this in a moment).

However, assuming that we had an effect file called ‘MyEffect.fx’ which contained multiple techniques
and multiple shaders, the following line would compile the effect, along with all of its shader code, into
a binary effect object called ‘MyEffect.fxo’.

C:\dx9sdk\Utilities\Bin\x86>fxc.exe MyEffect.fx /Fo MyEffect.fxo

Note: If you are still wondering what the other fxc.exe compiler profiles are and when they would be
used with the command line tool, just hang on until the next section when we have a brief discussion
about using shaders without effects.

We have no problem with compilation this time around because we do indeed have shaders in the file
and thus entry points that the compiler can find. The /T fx_2_0 switch is unnecessary here.

So as it turns out, compiling effects which do contain shaders is not all that different from compiling
effects that do not (i.e., fixed-function-only). In the same way, we can invoke the HLSL compiler from
within the Visual C++ IDE, just as discussed in the previous chapter. Figure 19.4 reminds you about
how to invoke the command line compiler from the IDE and set the command line parameters that
should be passed in.

Figure 19.4

While fxc.exe is handy, for practical reasons it is actually the final method of compilation that we
discussed in the previous chapter that we will use in this course. That is, we will supply text-based effect
files in the data directory of our lab projects and compile the effects at runtime using the D3DXEffect
framework methods. This will allow you to examine and alter the source code of our effect files for
experimentation purposes and see the results immediately reflected in the next run of the application.
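
For reference, the runtime route boils down to a single call such as the sketch below; any compilation problems, including shader compile failures, are reported through the returned error buffer.

LPD3DXEFFECT pEffect = NULL;
LPD3DXBUFFER pErrors = NULL;

// Compile the text-based effect (and any shaders it contains) at runtime
HRESULT hr = D3DXCreateEffectFromFile( pD3DDevice, "MyEffect.fx",
                                       NULL, NULL, 0, NULL,
                                       &pEffect, &pErrors );

if ( FAILED( hr ) && pErrors )
{
    // The buffer contains the compiler's error text (e.g., 'ran out of temporary registers')
    OutputDebugStringA( (const char*)pErrors->GetBufferPointer() );
}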

19.3 Using HLSL Shaders without Effects

Before we get into the details of writing shader code, it is important that we understand that D3DX
effects really only provide us with a convenient way to work with shaders and device states, but they are
by no means mandatory. In Direct3D other shader-oriented functions and types exist, including
standalone vertex shader and pixel shader compilers, which allow us to use shaders without the
ID3DXEffect interface. While we are generally going to be using effect files for our lab projects in this
course, primarily for convenience, examining standalone shader usage will help us better understand
what the effect framework is doing for us behind the scenes when an effect which contains a shader is
invoked.

As far as writing the source code for standalone HLSL shaders, the shader functions and parameters are
all written in the exact same way using any text editor. The only real difference is that a standalone
HLSL source file will not contain techniques or passes. It will simply contain the parameter declarations
that the shaders use and the shader functions themselves. For example, the following pseudocode shows
how a source file containing a single vertex shader and the parameters it uses might be laid out.

// ----- Variable / Parameters -----

matrix WorldViewProj;
float3 LightDirection;

// ----------- Shaders ---------------

void MyVertexShader( . . . )
{
    // HLSL Code Goes Here
}

As you can see, even though we are not using effect files, it doesn’t change the fact that we can still
work with the registers using HLSL parameters. The code is exactly the same in both cases. However,
things on the application and compilation side are very different. We will no longer use the
ID3DXEffect interface to send values into the shader parameters and we will not be using the D3DX
effect specific creation and compilation functions to load and prepare what we need. The application
will now be responsible for all of the tasks that would have been performed automatically, behind the
scenes, when using effects. As it happens, working with standalone HLSL shaders is not terribly
difficult if that is the route you wish to take. A whole host of functions exist in the API to load and
compile standalone shaders, bind them to the device at render time, and load values into the parameters
(constant registers) used by the shader.

As with effects, there are three basic ways to compile standalone shaders. You can compile the shader
using the command line compiler (fxc.exe), set up the IDE to invoke the command line compiler for
each shader in your project, or use a suite of D3DX functions that exist to load and compile these
shaders at runtime.

Let’s start with the command line compiler which, whether being run from the command line or via the
IDE, covers the first two scenarios. Just as before, we invoke the command line compiler in the same
way but with one difference -- since we will not be sending the compiler an effect, but rather a
standalone vertex or pixel shader, we will no longer use the fx_2_0 target profile. Unlike shader
assignments contained inside effect files, a standalone shader source file does not contain any
information about which shader model the compiler should target with its compiled code. Therefore, we
have to use one of the explicit profiles available to inform it of this requirement (effect compiler profiles
relevant to this discussion shown below).

<profile>: vs_1_1 vs_2_0 vs_2_a vs_2_sw vs_3_0 vs_3_sw ps_1_1 ps_1_2 ps_1_3
ps_1_4 ps_2_0 ps_2_a ps_2_b ps_2_sw ps_3_0 ps_3_sw

For example, above we see the portion of the compiler's profile list (printed when fxc.exe is called with
no parameters) that is relevant to DirectX 9 vertex and pixel shaders. So, if our source file contained a
single vertex shader and was called MyVertexShader.vsh and we wanted to compile this shader such that
it would run on shader model 2.0 hardware, we would do the following:

fxc.exe MyVertexShader.vsh /T vs_2_0 /Fo MyVertexShader.vso

In this instance, the final standalone compiled vertex shader would be stored in the file
MyVertexShader.vso and it is this binary version of the file that would be loaded and used by the
runtime component.
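
At runtime, the precompiled bytecode simply needs to be read from disk and handed to the device; a minimal sketch (error handling omitted):

#include <fstream>
#include <iterator>
#include <vector>

// Read the compiled shader bytecode produced by fxc.exe
std::ifstream file( "MyVertexShader.vso", std::ios::binary );
std::vector<char> byteCode( ( std::istreambuf_iterator<char>( file ) ),
                            std::istreambuf_iterator<char>() );

// Create the vertex shader object from the bytecode
IDirect3DVertexShader9 * pVertexShader = NULL;
pD3DDevice->CreateVertexShader( reinterpret_cast<const DWORD*>( &byteCode[0] ),
                                &pVertexShader );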

Note: In order to compile shaders against profiles earlier than shader model 2.0 (such as vs_1_1), you
may be required to use an earlier version of 'fxc.exe' than that included with the most recent DirectX SDK
release. When compiling effects and shaders in code through the various D3DX effect and shader
compilation functions and interfaces, the 'D3DXSHADER_USE_LEGACY_D3DX9_31_DLL' flag can be used
to enable these earlier profiles.

The profiles supported by the HLSL compiler relevant to DirectX 9 shaders are listed below.

Target Profile Type Notes


vs_1_1 Vertex Shader Profile
vs_2_0 Vertex Shader Profile
vs_2_a Vertex Shader Profile nVidia Cards
vs_2_sw Vertex Shader Profile Software emulation. Must use
software vertex processing device.
vs_3_0 Vertex Shader Profile
vs_3_sw Vertex Shader Profile Software emulation. Must use
software vertex processing device.
ps_1_1 Pixel Shader Profile
ps_1_2 Pixel Shader Profile
ps_1_3 Pixel Shader Profile
ps_1_4 Pixel Shader Profile
ps_2_0 Pixel Shader Profile
ps_2_a Pixel Shader Profile nVidia Cards
ps_2_b Pixel Shader Profile ATI Cards
ps_2_sw Pixel Shader Profile Software emulation. Only available
via reference rasterizer.
ps_3_0 Pixel Shader Profile
ps_3_sw Pixel Shader Profile Software emulation. Only available
via reference rasterizer.

As you can see from the above list, for shader model 2.0 and above, there exist software emulation
shaders which come in very handy if you are developing for a shader model which might not be
supported on the development machine. It is also useful if you are writing a shader model 2.0 shader for
example, but the compiler is giving you errors stating that you have exceeded some of the 2.0 limits. It is
quite common for this to happen and for the shader developer to use this information to optimize the
shader code to use fewer registers, instructions, etc. However, early in the process you often just want to
know if the code works before going down this road, so you can either increase the target shader model
if the card supports it, or specify the software profiles where the limits are eased. For example, the
vs_2_sw and ps_2_sw targets relax the limits to the maximums of the 2.x models. Once you know that your
algorithms actually work, you can then revisit the code and try to fit it into fewer instructions/registers or
if not, decide to target a higher shader model for that particular shader (ideally supplying a less complex
shader as a fallback for earlier models) or consider a multi-pass solution where possible.

Note: In order to use any of the vertex shader software models, you must have a software device or a
mixed-mode device running in software vertex processing mode. Software pixel shaders are only
supported on the reference rasterizer.
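
As a rough sketch of what the note above implies (pD3D, hWnd and d3dpp are assumed to be set up as usual), a mixed-mode device can be switched into software vertex processing before using a shader compiled against one of the vs_x_sw profiles:

// Create a mixed-mode device that can toggle between hardware and software vertex processing
pD3D->CreateDevice( D3DADAPTER_DEFAULT, D3DDEVTYPE_HAL, hWnd,
                    D3DCREATE_MIXED_VERTEXPROCESSING,
                    &d3dpp, &pD3DDevice );

// Switch to software vertex processing before rendering with a vs_x_sw compiled shader
pD3DDevice->SetSoftwareVertexProcessing( TRUE );

// The ps_x_sw profiles, by contrast, require a D3DDEVTYPE_REF (reference rasterizer) device.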

Of course, you may decide not to use the command line compiler at all and prefer to compile your
shader source code at runtime (similar to what we saw with effect files) using the many Direct3D
methods that are available for this task. Before we progress on to the actual shader language, let us take
a very brief look at some of the Direct3D functions that are available for this task.

Compiling Standalone Shaders at Runtime

The D3DXCompileShaderFromFile function can be used to load and compile a shader source file. As
you would imagine, there are also versions of this function to compile a shader stored as a resource or in
memory. Looking at the function below, you can see that many of the parameters mirror those used
during the D3DXCreateEffectFromFile function.

HRESULT D3DXCompileShaderFromFile
(
LPCSTR pSrcFile,
CONST D3DXMACRO* pDefines,
LPD3DXINCLUDE pInclude,
LPCSTR pFunctionName,
LPCSTR pProfile,
DWORD Flags,
LPD3DXBUFFER* ppShader,
LPD3DXBUFFER * ppErrorMsgs,
LPD3DXCONSTANTTABLE * ppConstantTable
);

LPCSTR pSrcFile
This parameter is used to pass a pointer to a string containing the name of the file that contains the
shader function to be compiled into a vertex or pixel shader.

CONST D3DXMACRO * pDefines
This is used in exactly the same way as in the call to the D3DXCreateEffectFromFile. It allows us to
pass in macros that will be defined in the shader (or you can pass NULL). This is very useful if your
shader code is using the pre-processor to perform conditional compiles based on runtime conditions. For
more information, consult the previous chapter.

LPD3DXINCLUDE pInclude
Exactly the same as in the call to the D3DXCreateEffectFromFile function. It allows us to pass in a
custom #include file handler. When loading shader source code from a file, NULL can be passed since
the default behavior is for an #include directive inside the effect file to load the file from disk. However,
if you are compiling from a shader in memory or stored as a resource, an interface of this type must be
passed if your code contains #include directives. For more information see the previous chapter.

LPCSTR pFunctionName
When not using effect files, every shader will need to be created individually by our application and set
to the device prior to rendering any geometry that uses it. However, that does not mean that we cannot
store the source code for multiple shaders in a single source file. This parameter allows us to specify the
name of the shader function that this invocation of the function should compile. If a source file
contained three shaders and you wanted to use them all in your application, you would call this function
three times to compile each shader function into a usable vertex or pixel shader. As a pixel or vertex
shader is essentially just a function defined in the source file, it is via this parameter that we inform the
function about which one we want compiled in the current invocation.

LPCSTR pProfile
In order for this function to compile your shader, you must inform it of the target profile you wish the
shader code to be compiled to. This string should contain one of the target profiles that we examined in
the above table (e.g., “vs_3_0” to compile the specified function into a shader model 3.0 vertex shader).

DWORD Flags
This is used in exactly the same way as in the call to D3DXCreateEffectFromFile. Here we can pass one
or more D3DXSHADER flags that allow us to control the compilation procedure of the shader. For
example, if we were trying to do some shader debugging we might pass the following,
D3DXSHADER_SKIPOPTIMIZATION|D3DXSHADER_DEBUG, which would instruct the compiler not to
perform any optimization of the source code and to compile it as is. It would also instruct the compiler
to include debug information so that we could step into and through our code as needed.

LPD3DXBUFFER* ppShader
You will see in a moment that your ultimate goal is to end up with an IDirect3DVertexShader9 or
IDirect3DPixelShader9 interface which you can later bind to the device via a call to either
IDirect3DDevice9::SetVertexShader or IDirect3DDevice9::SetPixelShader. However, this function does
not take that final step for you. Instead, it compiles the source code into binary form and returns that to
us in an ID3DXBuffer object. This buffer contains the same results that would be output from the
command line compiler. If compilation was successful, the data can be extracted from the buffer and
passed along to either the IDirect3DDevice9::CreateVertexShader function or the
IDirect3DDevice9::CreatePixelShader function to complete the process.

LPD3DXBUFFER * ppErrorMsgs
On function return, this buffer will store a string containing any error or warning messages that were
generated during compilation of the shader.

LPD3DXCONSTANTTABLE * ppConstantTable
This final parameter is an interesting one as it provides us with a way to pass application values into
shader parameters, much like we saw with the ID3DXEffect interface. Just because we are not using
effect files, doesn’t mean that we have to force our application to deal with the constant registers
directly. This interface has many methods that allow us to set the values of parameters by name, index,
etc. giving us the same abstraction from the underlying register usage as the D3DX effect system.

The best way to understand all of these ideas would be to see them in action. In the following code
snippet we will load a shader source file called MyShader.vsh and compile a vertex shader it contains. We
will also use the constant table interface to set the world matrix and a float called ‘Power’. The file is
assumed to contain the source for multiple shaders, but in this instance we are only interested in
compiling the shader whose function is called ‘MyWater’.

D3DXMATRIX mtxWorld; // App provided data
float fPower;        // App provided data

LPDIRECT3DVERTEXSHADER9 g_pVertexShader  = NULL;
LPD3DXCONSTANTTABLE     g_pConstantTable = NULL;
LPD3DXBUFFER            pCode            = NULL;
DWORD                   dwShaderFlags    = 0;

// If the debugging define is set, disable shader optimizations during compile
#ifdef DEBUG_VS
dwShaderFlags |= D3DXSHADER_SKIPOPTIMIZATION | D3DXSHADER_DEBUG;
#endif

// Load shader source and compile into bytecode buffer
D3DXCompileShaderFromFile( "MyShader.vsh",
                           NULL, NULL,
                           "MyWater",
                           "vs_3_0",
                           dwShaderFlags,
                           &pCode,
                           NULL,
                           &g_pConstantTable );

// Create the final vertex shader interface.
pD3DDevice->CreateVertexShader( (DWORD*)pCode->GetBufferPointer(),
                                &g_pVertexShader );

// Set world matrix and power parameters
g_pConstantTable->SetMatrix( pD3DDevice, "WorldMatrix", &mtxWorld );
g_pConstantTable->SetFloat ( pD3DDevice, "Power", fPower );

The ID3DXConstantTable interface has methods to set/get the values of floats, integers, vectors, and
matrices (including in array form), so you can probably imagine now how it is that the effect system is
able to provide us with similar functionality (literally wrapping such ideas on our behalf).
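
Since each of these 'by name' calls performs a string lookup inside the constant table, a common refinement is to retrieve a D3DXHANDLE for each constant once, up front, and then set values through the cached handles each frame. The following is just a small sketch of that idea using the constants from the example above (the handle variable names are our own):

D3DXHANDLE hWorldMatrix = g_pConstantTable->GetConstantByName( NULL, "WorldMatrix" );
D3DXHANDLE hPower       = g_pConstantTable->GetConstantByName( NULL, "Power" );

// Later, per frame / per object, set the values through the cached handles
g_pConstantTable->SetMatrix( pD3DDevice, hWorldMatrix, &mtxWorld );
g_pConstantTable->SetFloat ( pD3DDevice, hPower, fPower );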

When the above code completes, g_pVertexShader (of type IDirect3DVertexShader9*) will contain our
created vertex shader object, and g_pConstantTable (of type ID3DXConstantTable*) will grant us
access to the constants specific to that shader. During the rendering of our scene, prior to drawing any
subset of geometry, we would first assign the vertex (and/or pixel) shader to the device. Continuing the
above example, we might imagine that later on in the rendering code we have something like the
following:

pD3DDevice->Clear( 0, NULL, D3DCLEAR_TARGET | D3DCLEAR_ZBUFFER, 0, 1.0f, 0 );

if( SUCCEEDED( pD3DDevice->BeginScene() ) )
{
    // Set vertex declaration
    pD3DDevice->SetVertexDeclaration( g_pVertexDeclaration );

    // Set vertex shader
    pD3DDevice->SetVertexShader( g_pVertexShader );

    // Set vertex buffer
    pD3DDevice->SetStreamSource( 0, pVertexBuffer, 0, sizeof(MyVertex) );

    // Set index buffer
    pD3DDevice->SetIndices( pIndexBuffer );

    // Draw object
    pD3DDevice->DrawIndexedPrimitive( D3DPT_TRIANGLELIST,
                                      0,
                                      0,
                                      dwNumVertices,
                                      0,
                                      dwNumIndices / 3 );

    pD3DDevice->EndScene();
}

As you can see, setting a vertex or pixel shader on the device is no different than setting other device
resources such as the vertex buffer, index buffer, or textures. We can also imagine how our attribute
structure might now contain pointers to vertex/pixel shaders that should be used to render that attribute.

Hopefully it is plain to see that when using effect files, the same thing essentially happens behind the
scenes. When we place a line in our technique such as…

VertexShader = compile vs_3_0 MyShader();

…two ideas are at work here. First, when the effect is compiled, the 'compile vs_3_0' directive is
informing the effect compiler how to compile the shader (using all the same ideas we saw above).
Second, when the effect is invoked, the VertexShader = MyShader assignment simply instructs the
effect to call the IDirect3DDevice9::SetVertexShader method for us behind the scenes.

There is one method used in the above code that we have not yet discussed in any detail:
IDirect3DDevice9::SetVertexDeclaration. Whether you are using effect files or standalone vertex and
pixel shaders, you must still create a vertex declaration and set it to the device prior to invoking the
effect or shader(s). We mentioned vertex declarations briefly earlier in the text, but they are covered in
more detail in the next section.

19.4 Vertex Declarations

Before we start to write some simple vertex and pixel shaders in the next section, as final preparation for
that topic we need to examine how the data in the vertex stream gets passed into the vertex shader, and
similarly, how outputs from a vertex shader wind up as inputs to a pixel shader.

When using the fixed-function pipeline, we never had to create a vertex declaration to describe our
vertex structure because we could use FVF flags instead (although declarations can be used there too). A vertex declaration
is really just a more versatile and flexible way to describe vertex structures to the pipeline. When using
shaders, the SetFVF function is rendered obsolete and should be replaced with a call to
IDirect3DDevice9::SetVertexDeclaration. They serve roughly the same purpose, but declarations allow
us to use vertex structures with data arranged in an arbitrary order as well as to pass data to the shader
for which no FVF flags exist.

The legacy FVF flags enforced certain ordering rules on the vertex structure, such as requiring the
position to be specified before the normal, the texture coordinates after the normal, and so on. This is a
key limitation that is removed with declarations and we are free to arrange our vertex data in whatever
order we’d like. In addition, the fixed-function FVF flags provide support for only a small number of
vertex component types, which are no longer sufficient to cope with the growing range of data types that
we might wish to send into a vertex shader. The declaration exists to inform the pipeline, prior to a
vertex shader being invoked, exactly which bits of data exist where in the vertex structure. The pipeline
needs to know this so that it can extract the data from the stream and set up the non-uniform vertex
shader input registers with the correct data. So in short, a declaration is just a descriptive structure that
explains our vertex arrangement and the data it contains.

The IDirect3DDevice9::SetVertexDeclaration method accepts a single parameter -- an already created
IDirect3DVertexDeclaration9 interface. Therefore, any vertex structures that we use in our vertex buffers
must have an appropriate declaration created before they can be rendered. To create a vertex declaration,
we use the IDirect3DDevice9::CreateVertexDeclaration method:

HRESULT CreateVertexDeclaration
(
CONST D3DVERTEXELEMENT9* pVertexElements,
IDirect3DVertexDeclaration9** ppDecl
);

This method accepts two parameters, the second of which will be used to return the address of the
created IDirect3DVertexDeclaration9 interface for the vertex layout specified. The first parameter
requires a little more analysis.

To the first parameter of the above method we pass an array of D3DVERTEXELEMENT9 structures
describing the format of our vertex. Each element in the array describes exactly one member of our
vertex structure. So, if we had created the following vertex structure in our application…

struct Vertex
{
D3DXVECTOR3 Position;
D3DXVECTOR2 TexCoords0;
D3DXVECTOR2 TexCoords1;
D3DXVECTOR3 Normal;
};

…we would provide a D3DVERTEXELEMENT9 array that contains four elements (plus a 'terminating'
element to demark the end of the array as we will see a little later on): one for our position, two for our
two sets of texture coordinates, and one for our normal. Notice that we are not forced to use the FVF
ordering rules, and as such we were free to place the normal after the texture coordinates in this case.
The D3DVERTEXELEMENT9 structure is defined as shown below, followed by an explanation of its
members. Remember, each element we place in the array describes exactly one member in the vertex
structure.

typedef struct D3DVERTEXELEMENT9
{
WORD Stream;
WORD Offset;
BYTE Type;
BYTE Method;
BYTE Usage;
BYTE UsageIndex;
} D3DVERTEXELEMENT9, *LPD3DVERTEXELEMENT9;

WORD Stream
This is a very interesting member as it leads us to a brief discussion of the use of multiple vertex
streams. Up to this point, we have only rendered geometry using a single vertex buffer. When setting
that vertex buffer on the device we have always used the SetStreamSource method with a stream number
parameter of 0 (the first stream). However, it is possible to set multiple vertex buffers at the same time
and have the pipeline extract bits of data from each one (each buffer represents input to a ‘stream’).

Using our current example vertex structure, we might decide to create one vertex buffer which contains
the position and texture coordinates of each vertex, but the normals might be contained in a completely
different vertex buffer for some purpose that makes sense in our application. In order to render the
object correctly, we could assign the first vertex buffer (containing position and texture coordinates) to
stream 0 and the second vertex buffer (containing the normals) to stream 1. A single vertex declaration
allows us to identify vertex members in different streams, and this is exactly what this member is for. In
the current example, the first, second, and third elements in the D3DVERTEXELEMENT9 array would
contain a value of 0 in the 'Stream' member, and the fourth element in the array, describing the normals
in the second buffer, would specify a stream index value of 1. To render the object, you would then
simply set the declaration and assign the two vertex buffers to streams 0 and 1, respectively. The
DrawPrimitive method can then be invoked and, behind the scenes, the vertex data will be extracted
from the two streams accordingly, assembled, and routed to the vertex shader for processing.

Obviously, if a single vertex buffer contains all of the data for each vertex, the stream member of every
element in the array will be 0 and only a single vertex buffer would need to be bound to stream 0 of the
device (as we have always done).
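
As a quick illustration of the two-stream arrangement just described (a sketch only -- the structure, buffer, and declaration names below are hypothetical), the application side might look something like this at render time:

// Stream 0: position plus two sets of texture coordinates
struct PosTexVertex { D3DXVECTOR3 Position; D3DXVECTOR2 TexCoords0; D3DXVECTOR2 TexCoords1; };

// Stream 1: just the normals
struct NormalVertex { D3DXVECTOR3 Normal; };

pD3DDevice->SetVertexDeclaration( g_pTwoStreamDecl );
pD3DDevice->SetStreamSource( 0, pVBPosTex,  0, sizeof(PosTexVertex) );
pD3DDevice->SetStreamSource( 1, pVBNormals, 0, sizeof(NormalVertex) );

We will see the matching declaration for this exact layout a little later in this section.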

WORD Offset
This member is used to supply the offset, in bytes, of the vertex component being described (i.e.
position, normal, etc.), relative to the beginning of each individual vertex in the buffer. Continuing with
our prior multiple-stream example, the first element in the D3DVERTEXELEMENT9 array would
contain a value 0 for its 'Offset' member because this element describes the position stored at the
beginning of the vertex in the first stream's buffer. The second element in our array, which describes the
first set of texture coordinates, would have an offset of 12 because the 12 byte position vector --
comprised of three four-byte floats -- comes before it in the vertex structure. The third element in our
array, the second set of texture coordinates, would have an offset of 20 bytes since it is preceded by the
12 bytes of position data plus the 8 bytes (2 floats) of the first set of texture coordinates. Our fourth
element (the normal) would, like position, have an offset of 0 in this example since it is stored at the
beginning of the vertex data in the second stream buffer (remember that in this example the vertex data
is spread over two streams).

When dealing with vertex data in a single stream, which is often the case, the offset of any member is
simply the sum of all the space taken up by the members defined before it in the vertex structure.
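
As a small practical aside, rather than adding these byte offsets up by hand it is often safer to let the compiler compute them with the standard offsetof() macro. A sketch, assuming the single-stream version of our example Vertex structure:

#include <cstddef>   // offsetof

WORD PositionOffset = (WORD)offsetof( Vertex, Position );   // 0
WORD Tex0Offset     = (WORD)offsetof( Vertex, TexCoords0 ); // 12
WORD Tex1Offset     = (WORD)offsetof( Vertex, TexCoords1 ); // 20
WORD NormalOffset   = (WORD)offsetof( Vertex, Normal );     // 28 in the single-stream layout

These values can then be plugged directly into the 'Offset' member of each element, and they remain correct even if the vertex structure is later rearranged.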

BYTE Type
This member informs the hardware of the data type of the vertex component this element is describing,
using one of the available values from the D3DDECLTYPE enumeration. For example, for both the
elements describing the 3D vertex position and normal vectors we would need to supply
D3DDECLTYPE_FLOAT3. Alternatively, to describe an array of two 16 bit 'short' values, we would
specify D3DDECLTYPE_SHORT2, and so on. Hopefully you get the idea.

The following list outlines all the members of this enumeration that can be used to describe the various
types of data that might be stored in our vertex type. For a full list and description of all these members,
consult the SDK documentation (search for D3DDECLTYPE). Most should require no explanation.

typedef enum D3DDECLTYPE
{
D3DDECLTYPE_FLOAT1 = 0,
D3DDECLTYPE_FLOAT2 = 1,
D3DDECLTYPE_FLOAT3 = 2,
D3DDECLTYPE_FLOAT4 = 3,
D3DDECLTYPE_D3DCOLOR = 4,
D3DDECLTYPE_UBYTE4 = 5,
D3DDECLTYPE_SHORT2 = 6,
D3DDECLTYPE_SHORT4 = 7,
D3DDECLTYPE_UBYTE4N = 8,
D3DDECLTYPE_SHORT2N = 9,
D3DDECLTYPE_SHORT4N = 10,
D3DDECLTYPE_USHORT2N = 11,
D3DDECLTYPE_USHORT4N = 12,
D3DDECLTYPE_UDEC3 = 13,
D3DDECLTYPE_DEC3N = 14,
D3DDECLTYPE_FLOAT16_2 = 15,
D3DDECLTYPE_FLOAT16_4 = 16,
D3DDECLTYPE_UNUSED = 17,
} D3DDECLTYPE, *LPD3DDECLTYPE;

BYTE Method
This member is used to instruct the vertex tessellator unit about how the vertex data should be provided
to the vertex shader. Most of the time you will use D3DDECLMETHOD_DEFAULT which basically instructs
the tessellator to copy the vertex data (spline data for patches) as is, with no additional calculations
required. When the tessellator is used, the vertex values described by this element will be interpolated
across the tessellated geometry; otherwise the vertex data is simply copied into the input registers.

BYTE Usage
This member is very important because it allows us to label the data in such a way that the pipeline will
know what the data will be used for when passed into the shader. In this member we must supply a value
from the D3DDECLUSAGE enumeration:

typedef enum D3DDECLUSAGE
{
D3DDECLUSAGE_POSITION = 0,
D3DDECLUSAGE_BLENDWEIGHT = 1,
D3DDECLUSAGE_BLENDINDICES = 2,
D3DDECLUSAGE_NORMAL = 3,
D3DDECLUSAGE_PSIZE = 4,
D3DDECLUSAGE_TEXCOORD = 5,
D3DDECLUSAGE_TANGENT = 6,
D3DDECLUSAGE_BINORMAL = 7,
D3DDECLUSAGE_TESSFACTOR = 8,
D3DDECLUSAGE_POSITIONT = 9,
D3DDECLUSAGE_COLOR = 10,
D3DDECLUSAGE_FOG = 11,
D3DDECLUSAGE_DEPTH = 12,
D3DDECLUSAGE_SAMPLE = 13,
} D3DDECLUSAGE, *LPD3DDECLUSAGE;

As you can see, there are values to describe the vertex position, normal, blend weight, color, texture
coordinates, etc. as well as usage types for the tangent and binormal vectors generally required for
normal mapping techniques (more on these later). Even depth, fog, and point sprite size are represented
here.

Given our example vertex structure, we would supply the following usage values for each component:

struct Vertex
{
D3DXVECTOR3 Position; // D3DDECLUSAGE_POSITION
D3DXVECTOR2 TexCoords0; // D3DDECLUSAGE_TEXCOORD
D3DXVECTOR2 TexCoords1; // D3DDECLUSAGE_TEXCOORD
D3DXVECTOR3 Normal; // D3DDECLUSAGE_NORMAL
};

The usage value is very important because the pipeline needs to know exactly what data in the buffer
represents positions, normals, texture coordinates, etc. so that it can load them into the appropriate
registers. This is actually a two-step process as you will see in a moment when we start discussing the
mandatory use of semantics to describe effect/shader parameters. The declaration describes what the
member contains (e.g., position) whereas defining an input parameter in your vertex shader with the
POSITION semantic instructs the pipeline about where that data should be bound. So in effect, the
declaration says, “this is what the data is” and the semantic says, “this is where the data goes”.

Note: The texture coordinate slots can be used to represent any arbitrary data that you wish to have
interpolated over the primitive. Under typical mesh rendering circumstances, it is rare that you will want
to use all 8 texture coordinate slots for actual texture coordinates, particularly given that the registers are
four components wide. However, regardless of how we intend to use the texture coordinate slots in
practice, the D3DDECLUSAGE_TEXCOORD flag is still the correct usage type for such an element.
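
As a quick illustrative fragment (the structure and member names here are purely hypothetical, and full declaration examples follow shortly), a per-vertex 'velocity' vector that we simply want interpolated across the primitive could be described like this:

struct MotionVertex
{
    D3DXVECTOR3 Position;   // real position data
    D3DXVECTOR3 Velocity;   // arbitrary per-vertex data we want interpolated
};

D3DVERTEXELEMENT9 motionDecl[] =
{
    { 0,  0, D3DDECLTYPE_FLOAT3, D3DDECLMETHOD_DEFAULT, D3DDECLUSAGE_POSITION, 0 },
    { 0, 12, D3DDECLTYPE_FLOAT3, D3DDECLMETHOD_DEFAULT, D3DDECLUSAGE_TEXCOORD, 0 },
    D3DDECL_END()
};

Even though the second element is not a texture coordinate at all, D3DDECLUSAGE_TEXCOORD is still the appropriate usage to have the data delivered to (and interpolated through) a texture coordinate register.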

BYTE UsageIndex
The final member of the D3DVERTEXELEMENT9 structure provides additional information that helps
define the usage member more clearly. For example, a vertex can contain more than one color
(interpreted as diffuse and specular in the traditional pipeline) but both of these would be assigned the
D3DDECLUSAGE_COLOR usage value. How would the pipeline know which color is to be placed in
the first color register (that we might want to interpret as diffuse), and which one in the second
(interpreted as specular)? It is this information that this final member describes. In this case, specifying 0
in this member will instruct the hardware to place that color in the first color register, and a usage index
of 1 will instruct it to place it in the second color register.

Likewise, we might have a vertex that requires all eight texture coordinate slots. All of these members
would be labeled with the D3DDECLUSAGE_TEXCOORD usage flag, but again, how would the
pipeline know which texture coordinate slots should receive which data? In this example, each texture
coordinate element would provide an index between 0 and 7 in this member, describing which of the
eight texture coordinate registers the data should be set to.

This might all sound a bit complicated, but it gets a lot clearer when we look at some vertex declaration
examples. Starting with our example vertex, which has position and two sets of texture coordinates in
the first stream and its normals in a second stream, the declaration would look something like the
following:

LPDIRECT3DVERTEXDECLARATION9 g_pVertexDeclaration = NULL;

D3DVERTEXELEMENT9 decl[] =
{
{ 0, 0, D3DDECLTYPE_FLOAT3, D3DDECLMETHOD_DEFAULT, D3DDECLUSAGE_POSITION, 0 },
{ 0, 12, D3DDECLTYPE_FLOAT2, D3DDECLMETHOD_DEFAULT, D3DDECLUSAGE_TEXCOORD, 0 },
{ 0, 20, D3DDECLTYPE_FLOAT2, D3DDECLMETHOD_DEFAULT, D3DDECLUSAGE_TEXCOORD, 1 },
{ 1, 0, D3DDECLTYPE_FLOAT3, D3DDECLMETHOD_DEFAULT, D3DDECLUSAGE_NORMAL , 0 },
D3DDECL_END()
};

pd3dDevice->CreateVertexDeclaration( decl, &g_pVertexDeclaration );

Notice how we first create an array of D3DVERTEXELEMENT9 structures with one element for each
of the four members of our vertex (plus the array terminator D3DDECL_END()). The normal in this
example is stored in its own vertex buffer in a second stream which is why it has an offset of 0, and a
stream index of 1 instead of 0. Once we define our array, we pass it to the device's
CreateVertexDeclaration method, which returns an IDirect3DVertexDeclaration9 interface that can be
bound to the device at render time.

Just to make sure that you have a handle on the concept of creating declarations, let’s take a look at a
few more examples before moving on. This next example shows a declaration for a vertex structure that
contains position, diffuse color, and specular color. In this example the vertex data is assumed to be
contained in a single stream (vertex buffer).

struct Vertex
{
float x; // Model space position
float y;
float z;
DWORD Diffuse; // Diffuse color
DWORD Specular; // Specular color
};

D3DVERTEXELEMENT9 dwDecl[] =
{
{0, 0, D3DDECLTYPE_FLOAT3, D3DDECLMETHOD_DEFAULT, D3DDECLUSAGE_POSITION, 0},
{0, 12, D3DDECLTYPE_D3DCOLOR, D3DDECLMETHOD_DEFAULT, D3DDECLUSAGE_COLOR , 0},
{0, 16, D3DDECLTYPE_D3DCOLOR, D3DDECLMETHOD_DEFAULT, D3DDECLUSAGE_COLOR , 1},
D3DDECL_END()
};

In another case, we might declare a vertex structure that adds two sets of texture coordinates as in the
following example:

struct Vertex
{
float x; // Model space position
float y;
float z;
DWORD Diffuse; // Diffuse color
DWORD Specular; // Specular color
float tu1 , tv1; // Tex coords 0
float tu2 , tv2; // Tex coords 1
};

D3DVERTEXELEMENT9 dwDecl[] =
{
{0, 0, D3DDECLTYPE_FLOAT3, D3DDECLMETHOD_DEFAULT,D3DDECLUSAGE_POSITION, 0},
{0, 12, D3DDECLTYPE_D3DCOLOR, D3DDECLMETHOD_DEFAULT,D3DDECLUSAGE_COLOR , 0},
{0, 16, D3DDECLTYPE_D3DCOLOR, D3DDECLMETHOD_DEFAULT,D3DDECLUSAGE_COLOR , 1},
{0, 20, D3DDECLTYPE_FLOAT2, D3DDECLMETHOD_DEFAULT,D3DDECLUSAGE_TEXCOORD, 0},
{0, 28, D3DDECLTYPE_FLOAT2, D3DDECLMETHOD_DEFAULT,D3DDECLUSAGE_TEXCOORD, 1},
D3DDECL_END()
};

Remember to pay close attention to the last entry for each D3DVERTEXELEMENT9 definition, which
describes the usage index. For the color members, it describes which of the two represents our diffuse
vertex color, and which is the specular. For the texture coordinate components it describes which texture
coordinate slots / registers the data will be assigned to.

Finally, let’s take a look at a slightly more complex example that uses three vertex streams. In this case
the vertex data will be described by three separate vertex structures. In the first vertex buffer (which
contains vertices of type POSCOLORVERTEX), the model space position and diffuse and specular
colors are stored. In a second vertex buffer, the first set of texture coordinates will be stored. And last, in
a third vertex buffer a second set of texture coordinates will be stored.

// Stream 1
struct POSCOLORVERTEX
{
FLOAT x, y, z;
DWORD diffColor, specColor;
};

// Stream 2
struct TEXC0VERTEX
{
FLOAT tu1, tv1;
};

// Stream 3
struct TEXC1VERTEX
{
FLOAT tu2, tv2;
};

The declaration that tells the pipeline how to extract the vertex data from our three separate streams will
be defined as follows:

D3DVERTEXELEMENT9 dwDecl3[] =
{
{0,0, D3DDECLTYPE_FLOAT3, D3DDECLMETHOD_DEFAULT, D3DDECLUSAGE_POSITION, 0},
{0,12, D3DDECLTYPE_D3DCOLOR, D3DDECLMETHOD_DEFAULT, D3DDECLUSAGE_COLOR, 0},
{0,16, D3DDECLTYPE_D3DCOLOR, D3DDECLMETHOD_DEFAULT, D3DDECLUSAGE_COLOR, 1},
{1,0, D3DDECLTYPE_FLOAT2, D3DDECLMETHOD_DEFAULT, D3DDECLUSAGE_TEXCOORD, 0},
{2,0, D3DDECLTYPE_FLOAT2, D3DDECLMETHOD_DEFAULT, D3DDECLUSAGE_TEXCOORD, 1},
D3DDECL_END()
};

Of course, before rendering the object that uses this data, and after setting the declaration, we would
need to make sure that we bind all three vertex buffers to the first three streams of the device.

pD3DDevice->SetStreamSource(0, m_pVBPosColor, 0, sizeof(POSCOLORVERTEX));
pD3DDevice->SetStreamSource(1, m_pVBTexC0, 0, sizeof(TEXC0VERTEX));
pD3DDevice->SetStreamSource(2, m_pVBTexC1, 0, sizeof(TEXC1VERTEX));

We now have a pretty good idea about where vertex declarations fit into the picture. Remember, they
must be used and set in this same way whether you are using the effect file framework or standalone
shaders. The effect framework does not take care of setting your vertex declaration -- this must be done
manually by your application either directly by calling the device's 'SetVertexDeclaration()' method, or
indirectly by your use of ID3DXMesh (which handles the creation and application of the declaration for
you).
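
To make the division of responsibilities concrete, a typical effect-based draw might be structured as in the following sketch -- 'pEffect' is assumed to be a valid ID3DXEffect, the technique name is purely illustrative, and the declaration, buffers, and counts are those from our earlier examples:

// The application is responsible for the declaration and the streams...
pD3DDevice->SetVertexDeclaration( g_pVertexDeclaration );
pD3DDevice->SetStreamSource( 0, pVertexBuffer, 0, sizeof(MyVertex) );
pD3DDevice->SetIndices( pIndexBuffer );

// ...while the effect binds the shaders and states for each pass
UINT nPasses = 0;
pEffect->SetTechnique( "SomeTechnique" );
pEffect->Begin( &nPasses, 0 );
for ( UINT i = 0; i < nPasses; ++i )
{
    pEffect->BeginPass( i );
    pD3DDevice->DrawIndexedPrimitive( D3DPT_TRIANGLELIST, 0, 0,
                                      dwNumVertices, 0, dwNumIndices / 3 );
    pEffect->EndPass();
}
pEffect->End();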

19.5 Shader Semantics

The vertex declaration tells the pipeline exactly where your data is stored and what that data represents.
However, when we call DrawPrimitive, Direct3D will need to fetch the data for the current vertex from
the relevant streams and load it into vertex shader input registers. When using HLSL, we generally do
not want to deal with those registers directly, so instead we use parameters to alias the registers in which
data has been loaded. It will be much nicer inside our shader function to refer to the vertex position with
a variable name like Position instead of having to work with a register like v0. This is where semantics
come in.

In the previous chapter we talked briefly about how you can define your own semantics for labeling
effect parameters. This can be useful if you have several developers working on shaders at the same time
who all have different variable naming conventions and coding styles. You could define a list of
semantics which you ask all of your programmers to use and these would describe to the engine exactly
which variable in a given shader should receive a specific piece of information (e.g., a world matrix). In
this instance semantics were purely optional, but when working with HLSL shaders, there are a certain
number of standard semantics that must be used to label your non-uniform input parameters (e.g., data
from the vertex stream). These semantics tell Direct3D exactly which input parameters should receive
which information from the pipeline's data stream(s). In this section, we will also take our first look at
how a shader function is defined.

Below is a list of the standard vertex shader input semantics that must be used to describe the flow of
information.

Vertex Shader Input Semantic   Description

BINORMAL[n]                    Binormal vector
BLENDINDICES[n]                Blend indices
BLENDWEIGHT[n]                 Blend weights
COLOR[n]                       Diffuse and specular color
NORMAL[n]                      Normal vector
POSITION[n]                    Vertex position in object space
POSITIONT                      Transformed vertex position. POSITIONT tells the runtime that the
                               vertex is transformed and that the vertex shader should not be executed.
PSIZE[n]                       Point size
TANGENT[n]                     Tangent vector
TESSFACTOR[n]                  Tessellation factor
TEXCOORD[n]                    Texture coordinates

Note: n is an optional integer between 0 and the number of resources supported (PSIZE0, COLOR1, etc.)

A few of these items won’t make a lot of sense until we start developing certain shaders later in the
course, but most should be fairly obvious. There are semantics to inform the shader which parameters
should receive the vertex position, color semantics to describe which parameters should receive up to
two colors for each vertex (if defined and described by the declaration), and of course, semantics to label
normals and sets of texture coordinates.

For illustration, let us imagine that we are writing a simple vertex shader that transforms our vertex
position into clip space, computes a diffuse color for it, and passes along a set of texture coordinates for
sampling in the pixel shader (where our interpolated vertex color can also be used for modulation).

Objects that will be assigned this shader/effect will have a vertex structure that looks something like:

struct UnLitVertex
{
D3DXVECTOR3 Position; // Model space position
D3DXVECTOR3 Normal; // Normal
float tu, tv; // Texture coords
};

Here is the matching vertex declaration:

D3DVERTEXELEMENT9 dwDecl3[] =
{
{0,0, D3DDECLTYPE_FLOAT3, D3DDECLMETHOD_DEFAULT, D3DDECLUSAGE_POSITION, 0},
{0,12, D3DDECLTYPE_FLOAT3, D3DDECLMETHOD_DEFAULT, D3DDECLUSAGE_NORMAL, 0},
{0,24, D3DDECLTYPE_FLOAT2, D3DDECLMETHOD_DEFAULT, D3DDECLUSAGE_TEXCOORD, 0},
D3DDECL_END()
};

Our vertex shader function will need to accept three parameters (non-uniform) from the vertex stream.
Since it is going to transform the model space position into clip space, it will also need access to a
combined world/view/projection matrix which the application will need to set prior to invoking the
shader. Finally, the shader will need access to a light direction vector (we are assuming a single light
source in this example, like the sun) which is another constant the application will need to provide.

Below is how our shader might look. In this example, we have called the vertex shader program
ProcessVertex, so if you were compiling this as a standalone shader, this would be the shader entry point
name. When using effects, you would assign this shader inside a given technique as VertexShader =
compile vs_3_0 ProcessVertex() (assuming you wanted to target the result at the 3.0 shader model).

// Application defined constants
matrix WorldViewProj;
float3 LightDirection;

// Vertex shader
void ProcessVertex( float3 ModelSpacePosition : POSITION,
float3 Normal : NORMAL,
float2 TextureCoord0 : TEXCOORD0 )
{
// ... code goes here ...
}

At the top of the source file we define two 'global' parameters. Parameters defined in this way describe
uniform (constant) data, which occupy slots in the shader’s constant memory. As we learned in the
previous chapter, these types of variables can be set by the application using the ID3DXEffect::SetX
methods. Alternatively, as we saw earlier, in the case of a standalone shader that is not defined within an
effect file, the ID3DXConstantTable::SetX methods can be used. Being constant parameters, the
application will set the light direction vector and the transformation matrix prior to invoking the shader
(via an effect or otherwise) so that when drawing takes place, this data will be loaded into the required
constant registers for access by the shader program.
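
For example, the application-side setup for these two constants might look something like the following sketch, assuming the shader lives in an effect and 'pEffect' is a valid ID3DXEffect (for a standalone shader, the equivalent ID3DXConstantTable::SetX calls would be used instead):

// Combine the matrices and hand the result to the effect parameter 'WorldViewProj'
D3DXMATRIX mtxWVP = mtxWorld * mtxView * mtxProj;
pEffect->SetMatrix( "WorldViewProj", &mtxWVP );

// Supply a world space direction for the 'LightDirection' parameter
D3DXVECTOR3 vecLightDir( 0.0f, -1.0f, 0.0f );   // e.g. sunlight shining straight down
pEffect->SetValue( "LightDirection", &vecLightDir, sizeof(D3DXVECTOR3) );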

Having figured out what constants our shader will need and having defined them, our next step is to
figure out what non-uniform inputs our shader function will require. Remember, uniform inputs are
parameters that represent the constant registers and non-uniform inputs are parameters that will receive
the per-vertex information from the stream via the vN registers.

As our vertex structure contains a position, a normal, and a single set of texture coordinates, it stands to
reason that our shader function should accept at least these three parameters (although this is not a
requirement -- it can accept fewer). This can be seen in the above code sample. Notice that we label each
dynamic input with one of the standard vertex shader input semantics. The vertex declaration instructs
the pipeline how the various components of our vertex are stored in the vertex stream, and here we see
the semantics being used to tell the pipeline which shader function parameters should receive these
components. As you can see, a shader function looks quite similar to a standard C function, with the
exception of the semantics appended to each input parameter.

The above example is of course incomplete because a vertex shader that outputs nothing is invalid -- at a
minimum, a vertex shader function must output a transformed clip space position. In our particular case,
we want the vertex shader to output a transformed position, a set of texture coordinates, and a diffuse
color. So how do we tell the pipeline what information we are passing out of the shader? That is what
output parameters and shader output semantics are for.

A vertex shader can output two colors, a position, a fog factor, a point size (for use with point sprites),
and multiple sets of texture coordinates. Thus, semantics exist to label data as such.

Vertex Shader Output Semantic   Description

COLOR[n]                        Diffuse or specular color. Any vertex shader prior to vs_3_0 should
                                clamp a parameter that uses this semantic to [0, 1]. A vs_3_0 vertex
                                shader has no restriction on the data range.
FOG                             Vertex fog.
POSITION[n]                     Position of a vertex in homogeneous clip space. Every vertex shader
                                must write out a parameter with this semantic.
PSIZE                           Point size.
TEXCOORD[n]                     Texture coordinates.

How do we return multiple values from a function? The following example provides one way to do it
and completes our shader function prototype:

// Application supplied data
matrix WorldViewProj;
float3 LightDirection;

// Vertex Shader
void ProcessVertex( float3 ModelSpacePosition : POSITION,
float3 Normal : NORMAL,
float2 TextureCoord0 : TEXCOORD0,
out float4 ClipSpacePosition : POSITION,
out float4 DiffuseColor : COLOR0,
out float2 Tex0 : TEXCOORD0 )
{
// Code goes here
}

As you can see, the HLSL compiler understands the out modifier which informs the shader that these
parameters will be populated with the data that the shader will return. In fact, both in and out modifiers
are supported, but since in is the default, it is implicitly assumed for any parameters with no modifier
assigned. When the shader is called, any parameters that use the in modifier (either implicitly or
explicitly) are populated with data from the vertex stream for the current vertex for which the shader is
being invoked. Any parameters with the out modifer contain no initial values when the shader is
invoked. It is the job of the shader to populate these parameters with the final outputs. To use a C
analogy, we can think of shaders with output parameters as being like C functions where we pass the
address of a variable that will receive the final result. Of course, in the case of the shader, the shader
output parameters are never returned to the application, but instead are passed through to the rest of the
pipeline.

So, in the above example, the shader would receive the model space position, normal, and texture
coordinates of each vertex for which it was invoked. Using this information and a series of intrinsic
functions exposed by HLSL, we would transform the vertex position by the (often pre-combined) world,
view, and projection matrices and store the result in the ClipSpacePosition output parameter. We would
also use both the vertex normal and the light direction vector (a constant parameter set by the
application) to calculate a final color for the vertex which we would store in the DiffuseColor output
parameter. Finally, assuming we do not wish to adjust the texture coordinates, we would simply copy
the input texture coordinates into the output parameter Tex0. When the shader function reaches its end,
all of the data relating to this vertex that we wish to pass on to the rest of the pipeline will have been
calculated and because of the use of output semantics, the pipeline will know exactly which parameters
we have placed the data into.
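
Although we will develop the lighting code properly a little later, a rough sketch of the completed function might look like the following (this assumes LightDirection points from the light toward the scene, and that the normal is expressed in the same space as that vector):

void ProcessVertex( float3 ModelSpacePosition : POSITION,
                    float3 Normal             : NORMAL,
                    float2 TextureCoord0      : TEXCOORD0,
                    out float4 ClipSpacePosition : POSITION,
                    out float4 DiffuseColor      : COLOR0,
                    out float2 Tex0              : TEXCOORD0 )
{
    // Transform the model space position into clip space
    ClipSpacePosition = mul( float4( ModelSpacePosition, 1.0f ), WorldViewProj );

    // Simple N.L diffuse term using the application supplied light direction
    float fDiffuse = max( 0.0f, dot( normalize( Normal ), -LightDirection ) );
    DiffuseColor   = float4( fDiffuse, fDiffuse, fDiffuse, 1.0f );

    // Pass the texture coordinates straight through
    Tex0 = TextureCoord0;
}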

Note: Just because an application side vertex structure declares certain elements, does not mean we
must define input parameters for all of those components. For example, if we wanted to ignore the
vertex normal because we have no wish to work with it in our shader, we would simply not declare it as
an input to the shader function.

Pixel shader functions are defined and written in exactly the same way as vertex shaders and similarly
look just like regular C functions. Of course, because the nature of the data passed into and out of a pixel
shader is quite different from that exchanged with a vertex shader, the input and output semantics are
different. Below is a list of input semantics for pixel shaders which identify data passed in by the prior
pipeline stages.

Pixel Shader Input Semantic     Description

COLOR[n]                        Diffuse or specular color. For shaders prior to vs_3_0 and ps_3_0,
                                this data ranges between 0 and 1, inclusive. Starting with ps_3_0,
                                there is no restriction on the data range.
TEXCOORD[n]                     Texture coordinates.
VFACE                           Floating-point scalar that indicates a back-facing primitive. A
                                negative value faces backwards, while a positive value faces the
                                camera. (ps_3_0 only)
VPOS                            Contains the current pixel (x, y) location. (ps_3_0 only)

The last two entries in the above table are only available on shader model 3.0+ hardware, but can be
very useful. VFACE allows us to check if the primitive for which the pixel shader is being invoked is
back facing. If so, we can do things like kill the execution of the shader or run alternative calculations
via dynamic branching. The VPOS semantic allows us to get the actual screen pixel coordinates of the
fragment for which the shader is being invoked, which can be helpful when doing things like
dynamically generating certain types of texture coordinates. However, for shader model 2.x and below,
we have only two input semantics that the pixel shader understands -- color and texture coordinates.

Remember where in the pipeline the pixel shader is invoked. For example, if we refer back to the vertex
shader function we started writing above, we discussed how this shader would be executed for each
vertex in the primitive and would output a clip space position, a diffuse color, and a set of texture
coordinates. This data is passed to the rest of the pipeline where it eventually reaches the rasterizer. The
clip space position output by each vertex has, at this point, undergone the w divide and the 2D projection
of the primitive is now known. Once inside the rasterizer, the vertices of the primitive are now 2D
projections of the positions output from the vertex shader, but each still contains the color and texture
coordinates that we output. When rasterization begins, the color and texture coordinates at each vertex
are interpolated linearly. For each pixel in the primitive, the pixel shader is invoked and the interpolated
texture coordinates and colors are passed to the shader. Thus, the texture coordinates and colors output
from the vertex shader form the inputs for the pixel shader, but not directly -- what the pixel shader gets
are the interpolated results. This is required of course because we want our pixel shader to sample
textures using the coordinates specific to its unique position on the surface of the primitive, not to those
of the vertices.

Finally, most pixel shaders have a singular objective -- to output color(s). You can also calculate your
own depth value per-pixel and return it using the DEPTH output semantic, but this is optional and
generally will not be a common occurrence for the majority of your pixel shaders.

The pixel shader output semantics are listed below. You are reminded that you can define multiple color
and depth output parameters (replacing n with a number) since a pixel shader can write to multiple
render targets at the same time.

Pixel Shader Output Semantic    Description

COLOR[n]                        Output color.
DEPTH[n]                        Output depth.
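
As a purely hypothetical sketch of what such a shader's outputs might look like when targeting ps_3_0 with multiple render targets bound (the values written here are just placeholders):

void ProcessPixelMRT( float2 TexCoord : TEXCOORD0,
                      out float4 Color0 : COLOR0,    // first render target
                      out float4 Color1 : COLOR1,    // second render target
                      out float  Depth  : DEPTH )    // optional per-pixel depth
{
    Color0 = float4( 1.0f, 0.0f, 0.0f, 1.0f );
    Color1 = float4( 0.0f, 1.0f, 0.0f, 1.0f );
    Depth  = 0.5f;
}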

Based on what we now know about input and output parameters, let’s update our example to provide
both a vertex and a pixel shader that are designed to be used together. As before, we are including no
code inside the shader functions at the moment (we will do that shortly). For now, we just want to make
sure that you understand the various ways that data can get into and out of a shader.

// Application defined constants
matrix WorldViewProj;
float3 LightDirection;

// Vertex Shader
void ProcessVertex( float3 ModelSpacePosition : POSITION,
float3 Normal : NORMAL,
float2 TextureCoord0 : TEXCOORD0,
out float4 ClipSpacePosition : POSITION,
out float4 DiffuseColor : COLOR0,
out float2 Tex0 : TEXCOORD0
)
{
// Shader code goes here
}

// Pixel Shader
void ProcessPixel ( float4 Diffuse : COLOR0,
float2 TexCoord: TEXCOORD0,
out float4 ColorOut: COLOR )
{
// Shader code goes here
}

// Technique
technique DiffuseTexture
{
pass P0
{
// compile shaders
VertexShader = compile vs_3_0 ProcessVertex();
PixelShader = compile ps_3_0 ProcessPixel();
}
}

At the top of the file we define the transformation matrix and the light direction vector that will be used
by the vertex shader. The application will be responsible for setting, via the ID3DXEffect::SetX
methods, the transformation matrix, the light vector, and any other information needed within the
compiled shaders not shown in the above example (such as textures, etc.). If you skip down to the
technique definition for a moment, you can see how our vertex and pixel shader functions are assigned
to the technique’s VertexShader and PixelShader states (compiled to a 3.0 shader model target in this
example).

The vertex shader acts as before, calculating the clip space position and diffuse color of the vertex. It
also passes through the texture coordinates. During rasterization, the pixel shader will be invoked for
each pixel within the bounds of the primitive and is provided with the interpolated texture coordinates
and color. At this point the pixel shader could use one of its intrinsic functions (e.g., a function like
tex2D) to sample a color from a texture that we had bound to a sampler. We will see this aspect in more
detail in a moment, but for now just know that you pass it a reference to the sampler you wish to sample
from and a set of texture coordinates (in this case, the interpolated texture coordinates passed into the pixel
shader) and it returns the color of the texture at those coordinates. The pixel shader could then modulate
this sampled color with the interpolated diffuse color that was calculated per-vertex in the vertex shader
to produce the final output color of the pixel. The end result here would be similar to Direct3D’s own
vertex lighting system when used with basic texture blending (1x modulation in this case). However, the
great thing about working with pixel shaders is that we can move our lighting calculations so that they
are done per-pixel instead. For example, instead of using the light direction vector in the vertex shader to
calculate a per-vertex diffuse color that is merely interpolated over the surface for each pixel, we could
perform the diffuse calculation in the pixel shader, giving us true per-pixel lighting. More on this in a
moment.
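
To make that description a little more concrete, one possible body for the ProcessPixel function might look like the sketch below. It assumes a sampler called BaseMapSampler has been declared in the effect and bound to the base texture (samplers are covered properly a little later in the chapter):

void ProcessPixel ( float4 Diffuse  : COLOR0,
                    float2 TexCoord : TEXCOORD0,
                    out float4 ColorOut : COLOR )
{
    // Sample the texture at the interpolated coordinates...
    float4 TexColor = tex2D( BaseMapSampler, TexCoord );

    // ...and modulate with the interpolated per-vertex diffuse color
    ColorOut = TexColor * Diffuse;
}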

Before we dive into writing the shader internals, we need to conclude our examination of the many ways
that you can define input and output parameters in HLSL. So far, we have seen just one method of
defining input and output parameters, but there are alternative approaches that are worthy of attention
before moving on.

First, shader functions, like functions in C/C++, can return a value. For example, in the case of a pixel
shader we are often only interested in returning a single RGBA color value, so we could for instance
modify our previous example to something more along the lines of the following:

// Pixel Shader
float4 ProcessPixel ( float4 Diffuse : COLOR0,
float2 TexCoord: TEXCOORD0 ) : COLOR
{
float4 MyColor;

// Code goes here

return MyColor;
}

Here we have removed the output parameter from the function's parameter list and instead now use
'float4' as the function return type (returning a 4D float vector). Of course, we still need to let the
pipeline know what this output data represents, so we must supply a semantic. Notice that when
applying a semantic to describe data returned directly by a shader function, we assign the semantic after
the parameter list. Using this approach, we no longer have to store the resulting color in an output
parameter but can instead simply return the color from the function when we are ready. You may also
mix the two approaches we have seen to this point such that a vertex shader function, for example, might
return the transformed clip position via its return value, but output the color and texture coordinates
through parameters with the out modifier.
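
A quick sketch of that mixed approach (the function name here is our own, and the diffuse value is just a placeholder):

float4 ProcessVertexMixed( float3 ModelSpacePosition : POSITION,
                           float2 TextureCoord0      : TEXCOORD0,
                           out float4 DiffuseColor   : COLOR0,
                           out float2 Tex0           : TEXCOORD0 ) : POSITION
{
    DiffuseColor = float4( 1.0f, 1.0f, 1.0f, 1.0f );   // placeholder color
    Tex0         = TextureCoord0;

    // The clip space position is handed back as the function's return value
    return mul( float4( ModelSpacePosition, 1.0f ), WorldViewProj );
}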

Input / Output Using Structures

A popular I/O alternative is to take advantage of HLSL structure support. We can define structures to
contain our input and output data and instead pass those in and out of functions. A shader function can
then return a single structure that contains all of the output data. To demonstrate, let’s use our previous
vertex and pixel shader examples and adjust them to use the structure-based model.

// Vertex Shader I/O Data
struct VS_INPUT
{
float3 Pos : POSITION;
float3 Normal : NORMAL;
float2 Tex : TEXCOORD0;
};

struct VS_OUTPUT
{
float4 Pos : POSITION;
float4 Diff : COLOR0;
float2 Tex : TEXCOORD0;
};

// Vertex Shader
VS_OUTPUT ProcessVertex( VS_INPUT In )
{
// Initialize the output structure to zero
VS_OUTPUT Out = (VS_OUTPUT) 0;

// Code goes here
Out.Pos = etc;
Out.Diff = etc;
Out.Tex = In.Tex;

return Out;
}

// Pixel Shader
float4 ProcessPixel ( VS_OUTPUT In ) :COLOR
{
float4 OutColor = 0;

// Code Goes Here
OutColor = In.Diff * etc + etc;

return OutColor;
}

The vertex shader now has a single parameter: the VS_INPUT structure. Due to the fact that all of the
structure members supply input semantics, the pipeline will populate them with the appropriate data
from the vertex stream prior to passing it to the shader function. The shader function also returns its data
in a single structure whose members are similarly marked with vertex shader output semantics. It will be
the job of the vertex shader to populate the output structure prior to returning it. Hopefully you can see
that these structures just act as transport containers for input and output parameters to and from the
shaders.

Note: The vertex shader will need to provide values for every member in the output structure. At the top
of the function it casts '0' to the structure type and assigns it to the output object, a handy shortcut that
ensures all members are initialized to zero. This kind of initialization will be unnecessary if you ensure
that all members are populated at some point prior to the shader exiting.

After the shader function has declared an output structure and populated it with the necessary values, the
entire structure is then returned from the function where its values will pass on to the rest of the pipeline.

As we saw earlier, the pixel shader in our example requires two pieces of information during
rasterization -- a set of texture coordinates and an interpolated diffuse color. In our original approach,
this meant we defined two separate input parameters with the COLOR and TEXCOORD semantics.
Since our vertex output structure already declares members with these semantics, we can in fact reuse it
in the pixel shader for input too in this case. Of course, the vertex output structure also has a position
defined, which is not a valid pixel shader input type, but as long as we do not try to access the position
member inside the pixel shader, the pipeline will not try to populate it (for pixel shader use). Instead, the
rasterizer will simply ignore any such members of the input structure used by the pixel shader and only
populate (in this instance) the color and texture coordinate members with their interpolated values.
Because we only wish to return a single value from the pixel shader, we do not use an output structure,
but simply return a float4 with the COLOR semantic. An output structure (or multiple output
parameters) would be required however if we wanted to write to multiple render targets.

Note: The fact that we are using the vertex output structure as the pixel shader input structure is a
common (but completely optional) technique that saves the definition of two separate structures. Using
the same structure as the output and input structures of the vertex shader and pixel shader, respectively,
however does not imply a direct exchange of data between the two. The data passed into the pixel
shader will be calculated by the rasterizer based on the data passed out of the vertex shader, but it is
obviously not a direct copy -- it was interpolated. We are simply using the parameter type (a structure in
this case) to conveniently carry that data in and out of the function.

Uniform Inputs to Shader Functions

All of the shaders that we have looked at to this point have accepted inputs that represent dynamic data --
that is, they are being read from a vertex stream, or output by the vertex shader, interpolated, and passed
as input to the pixel shader. There is however, another type of data that can be passed in to a shader
function that, just like the data stored in the shader constant registers, is uniform in nature. We can
leverage an understanding of what the compiler can do when presented with this type of data to perform
some useful tricks.

For example, let us imagine that we wish to write a technique and an accompanying shader that can
process up to eight point lights. We might define some arrays at the top of our effect file to store the
eight light positions and their diffuse colors:

// Application supplied data
matrix WorldViewProj;
texture BaseMap;

// Lights
float3 LightPosition[ 8 ];
float4 LightDiffuse[ 8 ];
int NumberOfLights;

// Sampler state definitions
sampler BaseMapSampler = sampler_state
{
Texture = <BaseMap>;
MinFilter = Linear;
MagFilter = Linear;
MipFilter = Linear;
AddressU = wrap;
AddressV = wrap;
};

Going a step further, let’s say that our application collected together lists of up to eight lights that can
influence the geometry it is about to render (or pre-batched them into lighting groups similar to those we
used in Module I). Before invoking the shader designed to work with those lights, the application will
use the ID3DXEffect interface to set the properties of the (up to) eight light sources to the associated
parameter arrays we see above.

Notice above that there is a parameter named NumberOfLights which will also be set by the application
to inform our effect of how many of the possible eight lights are influencing the collection of primitives
about to be rendered. Now, while we could write a single shader that automatically loops through the
eight lights for every invocation of the shader, one of our goals is to make our shaders as fast as
possible. This is especially important when we consider that pixel shaders, for example, are executed for
each pixel comprising the primitive currently being rendered. Conditional branching logic and loops in
shaders can incur a performance penalty, so it is often best not to use them when they can be easily
avoided. Furthermore, loops are not supported in shader models prior to 2.x and as such, a different
approach would have to be employed regardless.

Note: Remember that just because loops are not supported natively in earlier shader models does not
mean that you are unable to use an HLSL loop in your shader source code. If a 1.x or 2.0 shader model
is targeted, the compiler will attempt to unroll the loop if it can determine the required number of
iterations in advance. Of course, this may cause you to exceed the maximum number of instructions in
these earlier models if the loops are large.

An efficient way to tackle this problem is to compile an array of shaders that can be indexed using
uniform parameter inputs. In our case, we will actually create nine different vertex shaders: a shader that
deals with 0 lights, another that deals with 1 light, one that deals with 2 lights, and so on. In this way, we
can choose to execute the most efficient shader for a given group of primitives based on the number of
lights that influence them.

Note: As with all forms of optimization, you will want to test different strategies for best performance.
While conditionals and loops in shader code can indeed be slow, switching shaders also comes with some
costs and should be done as infrequently as possible. In the current case being discussed, the shader
array method is likely to outperform branching as a general rule because the number of shader changes
is likely to be quite low when batching is employed. We just want to emphasize that you will want to be
very careful about making any assumptions about what is 'faster' based solely on what your gut tells you
should be faster, especially in a parallel system -- test (you will often be surprised)!

In the next section of the effect file we will define the input and output structures used by the vertex
shader and write some of the shader function. Notice how this version of the vertex shader takes as its
first parameter an integer that is marked with the uniform modifier, in addition to its input structure. You
will see what use this has in a moment.

struct VS_INPUT
{
float3 Position : POSITION;
float3 Normal : NORMAL;
float2 BaseUV : TEXCOORD0;
};

struct VS_OUTPUT
{
float4 Position : POSITION;
float4 Diffuse : COLOR0;
float2 BaseUV : TEXCOORD0;
};

VS_OUTPUT MyVShader( uniform int LightCount, VS_INPUT vertex )
{
VS_OUTPUT Output = (VS_OUTPUT)0;

// CODE GOES HERE
// for ( int i = 0; i < LightCount; i++ )
// {
//    Execute some code here involving the LightPosition and
//    LightDiffuse arrays to calculate a final vertex color.
// }

// Send the output
return Output;
};

Although we still have not shown a complete vertex shader code listing, the comments give you some
idea of how it will work. The supplied LightCount parameter is used in the HLSL source to define a loop
that will iterate and process each light property from the two lighting arrays. For example, if this shader
is supplied with a LightCount of 5, it means the collection of primitives currently being rendered is
influenced by five lights and as such, the application will have already placed the light properties
(position and colors) in the first five elements of the global LightPosition and LightDiffuse arrays prior
to invoking the shader.

An interesting side-effect of using a shader input marked with the uniform keyword as the variable
controlling the number of iterations of the lighting loop, is that the compiler will automatically unroll the
loop on our behalf, thus avoiding the expense of flow control as well as allowing us to target shader
models of 2.0 or below where loops are not supported.

It is worth noting that when a high level shader function accepts an input parameter marked with the
uniform modifier, the value for this parameter can only be supplied at compile time. In our example
therefore, we would construct an array containing the nine shaders we need in the following way:

VertexShader vsArray[9] = { compile vs_2_0 MyVShader(0),
                            compile vs_2_0 MyVShader(1),
compile vs_2_0 MyVShader(2),
compile vs_2_0 MyVShader(3),
compile vs_2_0 MyVShader(4),
compile vs_2_0 MyVShader(5),
compile vs_2_0 MyVShader(6),
compile vs_2_0 MyVShader(7),
compile vs_2_0 MyVShader(8)
};

As you can see, we have defined an array of nine vertex shaders where, in each case, the function we
compile is the same (i.e., the vertex shader function above). However, the uniform input parameter
LightCount is assigned a value from 0 to 8. So although we are compiling the same shader function in
every case, by specifying a different value for the LightCount uniform input, the final compiled shader
code generated will be different for each element in the shader array. That is to say, for each permutation
the compiler will have unrolled the loop only as many times as is absolutely necessary based on the
value of that uniform input. In this way, although it is true to say that uniform inputs to shader functions
are constant in nature -- just like the global variables we use to transport data from the application -- it is
probably easier to view these inputs more like the literals we examined in the previous chapter since
they are supplied at compile time and thus do not consume a constant register.

Note: Notice that when compiling a shader in this way, the non-uniform parameters can be ignored since
they will be populated by the pipeline. All we have to pass at compile time are the values for our uniform
parameter(s). In this case only a single uniform parameter (LightCount) was required.

When this effect file is compiled, our vertex shader function will be compiled nine times to create nine
completely different and unique vertex shader permutations that are stored in a global array. Assuming
that the application will, prior to invoking the effect for a given collection of primitives, make sure that
it sets the appropriate value to the global NumberOfLights parameter (defined at the top of the file), the
technique can use this value to index into our vertex shader array and dynamically choose the most
efficient shader for the job:

Technique BasicRender
{
pass p0
{
CullMode = CCW;
ZEnable = true;
ZWriteEnable = true;
AlphaBlendEnable = false;
AlphaTestEnable = false;

VertexShader = < vsArray[ NumberOfLights ] >;

PixelShader = NULL;

Sampler[0] = (BaseMapSampler);
TexCoordIndex[0] = 0;

ColorArg1[0] = Texture;
ColorArg2[0] = Diffuse;
ColorOp[0] = Modulate;

AlphaOp[0] = Disable;
}
}

The most important line in the technique shown above is where the device VertexShader state is
assigned. In this case, it is assigned one of the vertex shader permutations from the
shader array based on the number of lights required. Compiling shaders in this way is quite common so
that we can hand-pick the best shader for any given batch of polygons on the fly. However, it is also
important to note that as we increase the number of uniform inputs, we also increase the number of
permutations of our shaders. Using this approach, it is not uncommon to see the shader permutation
count climb up into the thousands. Compile times and memory usage for those shaders will obviously
increase as well, so some care needs to be taken when contemplating the design that will work best for
your application.

At this point, we know just about everything we need to know about compiling shaders and loading
them into our application. We also know how to pass constant data and the data from our vertex
stream(s) into our shaders for whatever work needs doing. What is left at this stage is learning how to
actually write an HLSL shader, and that is exactly what we will do in the remaining sections of this
chapter. However, before we jump straight into shader code, given that we have not talked about
lighting much since Module I in this series, it might be helpful to step through a quick refresher on the
lighting model we will be using in most of the shaders we will encounter in this chapter's lab projects
and in those to come.

Most of the next section will be a review of basic lighting concepts that you have seen before, but there
are a few theoretical ideas included that we have not discussed previously. If you are very comfortable
with the Blinn-Phong lighting model, you may prefer to skip the next section and move straight on to the
shaders. However, we encourage you to take the time to quickly refresh your memory about these topics
since they will come up again and again throughout the remainder of the course.

19.6 The Phong Illumination Model (A Review)


A lighting/illumination model is a set of mathematical equations that model the interaction between light
sources in the scene and the scene’s surfaces. It will account for the geometric relationship between
lights and surfaces, the color and intensity of the light, and the reflective properties of the surfaces. The
goal is to generate a color for the location at which light is being sampled (a sample point). In a vertex
lighting system, sampling locations will be the vertices of the scene geometry. For texture-based lighting
(e.g., the lightmapping system we will study later in the course), the sample points will be the world
space positions of our lightmap texels. With per-pixel lighting, those sample points will theoretically be
the individual pixels in our backbuffer, mapped back to the world.

The lighting models used by most 3D APIs are all very similar and can produce fairly nice results, but
not surprisingly, they are not terribly realistic. Primarily, most of these systems are based on a lighting
model called the Phong illumination model, developed in 1975 and named after its creator Phong Bui-
Tuong. As we are well aware by now, when using the API’s fixed-function lighting model, the lighting
pipeline will calculate the final color values on our behalf as the vertices are transformed and rendered.
However, we do have to pass additional information to the pipeline to help it calculate those per-vertex
colors correctly. For starters, vertices will need to provide normals describing the orientation of the
surface at that location so that the direction of the light can be factored in. We must also supply
information about how a vertex (the sample point) reflects or absorbs incoming light energy. For this,
we generally use the parent surface’s material. When computing direct lighting from physical sources,
the lighting model uses the location and range of light sources in the scene to determine which points are
within range of a given light. For any point on a surface that is within range of a light, the lighting model
is run. The color emitted from that light is adjusted based on the angle of orientation between the sample
point's normal (i.e., the vertex normal in this case) and the direction of the incoming light. Further, a
distance-based attenuation is usually applied so that sample points receive less energy as they get further
from the light source origin. Ultimately, the lighting model produces an RGB color contribution from
that light source for that sample point. As different surfaces can absorb or reflect light energy in
different proportions, this contribution is scaled by these reflective properties. If a surface has a 50%
absorption rate, the light’s contribution would be scaled by half and this would be the final color added
to the sample point’s color container (vertex, texel, or pixel). Thus, the final color of a sample point will
be the sum of all light contributions that influence it, scaled by the reflective properties of the surface to
which that sample point belongs. This process is repeated for each sample point in the scene.

Note: We should be careful not to confuse the Phong Illumination Model with the Phong Shading
Model, both of which were pioneered by the same person. Phong shading is a per-pixel shading
technique that is an alternative to Gouraud shading. It produces better results, but with more
computational overhead. The DirectX fixed-function pipeline does not support Phong shading, but it
does support the Phong Illumination Model for calculating the color of each vertex when lighting is
enabled.

19.6.1 Local Illumination
The Phong illumination model is actually a combination of three different components: ambient
illumination, Lambertian (diffuse) reflection, and Phong specular reflection. The Phong model expands
Lambertian reflection, which models directional light for matte surfaces, by adding a specular reflection
component to the lighting equation. The specular component of the model attempts to simulate the
highlights that appear on shiny surfaces such as plastics, metals, or mirrors. These highlights are
intended to represent the visible reflection of the light source itself. Surface reflectance is represented by
three distinct components that can be calculated separately and then linearly combined. It is considered a
local illumination model because it deals with the sample point in isolation from the rest of the
environment. That is, it considers only light emitted from actual light sources and does not account for
reflected light from other elements in the environment. It also does not account for other surfaces in the
environment blocking incoming light from the source (i.e., shadows).

So, a local illumination model calculates the color of a sample point as if that point is in a vacuum that
contains only itself and the light source(s). Any influential light sources contribute their color and the
surface’s material is used to determine how much of that light is absorbed and how much is reflected.
The absorbed fraction is discarded and the reflected light becomes the color of the object in that
environment as we perceive it. If a nearby light source does not directly contribute to a surface, that
surface will not be illuminated in any way in a local illumination model. It is essentially a one-off
transference of energy / color. The light source transmits its energy and this in turn is scaled by the
reflectance properties to determine a final contribution.

Of course in real life, this is not exactly what transpires. For example, a surface that is not directly
affected by any light sources may still be illuminated by light that is reflecting off nearby surfaces or the
atmosphere (e.g., ‘sky light’). The radiosity lighting model (discussed later in this course) is a global
illumination model because it approximates this behavior -- the color of a surface is not only determined
by the color of the light sources that influence it directly (i.e., direct lighting), but also by the light
sources whose energy is reflected onto it from nearby surfaces and/or the atmosphere (i.e., indirect
lighting). The Phong Illumination Model, being a local illumination model, does not facilitate this latter
behavior and as such, an ambient term was added to the lighting equation to very roughly simulate this
global interaction of photons bouncing between surfaces. The ambient term is usually just a constant and
is really just an ad-hoc way of illuminating surfaces that are not directly affected by nearby light
sources. It certainly does not come close quality-wise to simulating proper global illumination effects in
its default form.

The following simplistic description of the Phong illumination equation shows how the color
contribution from a single light source is determined with respect to a single sample point. The final
color is the linear combination of three RGB color values -- ambient, diffuse, and specular. These three
colors are returned from the three separate components of the lighting model and are scaled by the
reflectance properties of the material assigned to the surface.

Color = Ambient Contribution * Material.Ambient
+ Diffuse Contribution * Material.Diffuse
+ Specular Contribution * Material.Specular

Note: In the above equation, Ambient Contribution, Diffuse Contribution, and Specular Contribution
are not simply the three colors emitted from the light source. They are the colors emitted from the light
source after having passed through their respective parts of the lighting model. For example, Diffuse
Contribution represents the diffuse color emitted from the light source after it has been scaled by the
cosine of the angle between the incident light vector and the surface normal, and distance based
attenuation has been applied.

How the separate components of the lighting model generate their resulting colors will be reviewed
shortly. In the above equation, Ambient Contribution describes the RGB result of the ambient
component of the lighting model; it describes how much ambient color from the light source currently
being processed will be received (in DirectX, a light source can include an ambient color term, although
often we will treat this as a global concept). Diffuse Contribution is also an RGB color that is returned
from the diffuse equation of the lighting model. It describes how much of the current light’s diffuse
color reaches the surface location after factoring in the distance to the surface and its relative orientation
with respect to the path light is traveling. Finally, the same is true of Specular Contribution that
describes a third RGB color that is the result of the specular aspect of the lighting model.

Before we sum these three colors to generate the final color for the current light source being processed,
they are each scaled by the reflectance properties of the sample point’s material. We can think of these,
for the time being, as describing the material/reflective properties of the surface. Each is expressed by a
number between 0.0 and 1.0 that describes how much of the incoming color received from the light
source actually gets added to the final color (i.e., is reflected) and how much gets absorbed. For
example, if Material.Ambient = 0.75, then ¼ of the ambient light received from the light source would
be absorbed by the surface and ¾ of the ambient color received from the light source would be reflected
(ultimately into the viewer's eye) and thus added to the sample point color. In this example (as is the
case with DirectX’s lighting module) a surface has the ability to specify different reflectance properties
for ambient, diffuse and specular light for more flexibility. We could thus set Material.Ambient=1.0,
Material.Diffuse=1.0, and Material.Specular=0.0. This would describe the surface as reflecting all
incoming ambient and diffuse light but completely absorbing all incoming specular light. In such a
scenario no specular highlights would ever be visible on this surface. We will review the material values
a little later but for now just remember that they are used to scale the three colors output from the
lighting model for a given light source.

So, our lighting model is going to be divided into three separate components (ambient, diffuse, and
specular) and each component will model the interaction between light sources and the surfaces in a
different way. Let us now review each component separately to see how they will contribute.

19.6.2 Ambient Lighting
The ambient component of the Phong illumination model is one of the oldest and simplest illumination
models attempted in computer graphics, despite the fact that we are not modeling a physical light type.
In this model, it is assumed that the light is not coming from any one direction; instead, the light is
assumed to be traveling equally in all directions everywhere (spherically) throughout the world. Figure
19.5 shows a cylinder and a sphere lit purely by ambient light. Because the light is not coming from any
particular direction, but rather from all directions, the color is applied to all surfaces equally. Since every
face has now received exactly the same color, we appear to have lost all surface detail. This is even more
evident when you examine Figure 19.6.

The leftmost image in Figure 19.6 depicts a small section of terrain that has been lit using a very small
ambient contribution. Indeed, diffuse color is the primary contributor. The ambient light in the scene is
extremely minimal and only used to provide a limited base level of illumination. As the diffuse model
scales light contributions based on the direction of the light, we can see that some surfaces are more
brightly lit than others because they are more directly aligned to the incident light vector. It should also
be noted that every surface of this terrain has been rendered using the same material reflectance
properties. Thus, any differences in illumination amongst faces are due only to surface orientation
variance with respect to the incident light vector. As you can see, the terrain has considerable
illumination detail in the diffuse case on the left. The rightmost image demonstrates the same terrain
geometry being rendered with the same surface material for each face. The difference in this example is
that we have set a high ambient light value (it is the main color contributor to each vertex) with a beige
color. Since the global ambient color is added regardless of surface orientation or the orientation of any
incident light vector, every surface is now the same beige color as well.

Figure 19.6: Diffuse Lighting (left), Global Ambient Lighting (right)

In many basic lighting model implementations, the ambient lighting value is simply specified as a single
constant color value used to contribute to all geometry in the world. In the case where a single global
ambient color is being used to specify a scene-wide ambient color (the D3DRS_AMBIENT render state
if you are using the fixed-function renderer), the ambient component of the lighting model calculates the
ambient contribution for each sample point using the following simple equation:

Ambient Contribution = G.Ambient

In the above equation, G.Ambient is the global ambient color that has been specified for the application.

As you can see, when only a global ambient color value is used, the ambient component of the lighting
model simply returns this color as the result so that it can be scaled by the ambient reflectance property
of the sample point’s material. This provides a minimum illumination level for the entire scene. Just
remember that since this ambient contribution will be scaled by the material’s ambient reflectance
property, the ambient color will only be collected by the sample point if the sample point reflects
ambient light.

Using this simple form of ambient lighting, we can start to flesh out our lighting model equation so that
when calculating the color contribution from a single light source, the original equation:

Sample Point Color = Ambient Contribution * Material.Ambient
                   + Diffuse Contribution * Material.Diffuse
                   + Specular Contribution * Material.Specular

becomes:

Sample Point Color = G.Ambient * Material.Ambient
                   + Diffuse Contribution * Material.Diffuse
                   + Specular Contribution * Material.Specular

While many lighting systems allow only the setting of a single global ambient light color for the entire
scene, in DirectX you can also specify an ambient color per light source. This may seem strange at first
when we consider that ambient light is not affected by the orientation of a light source with respect to
the sampling point. However, a light source can have a limited range and can even be set up with
distance-based attenuation. This means that we can set up the light such that it contributes less ambient
color to sample points as they approach the outer ranges of the light source’s range, or none, for those
outside the range. More importantly, it gives us the ability to control the color of the simulated ambient
light (generally based on the diffuse color of the light source). While orientation with respect to the light
source will not be a factor, the position of the light source with respect to the sampling point, and thus
the distance between them, can be considered. This is a very helpful upgrade to the standard global
ambient model. Artists can even create localized ambient light on surfaces through the use of hand-
placed dedicated ambient light sources throughout the scene.

Considering that the DirectX lighting model can handle both a global ambient color and per-light
ambient color with distance based attenuation, the ambient model will have to be extended a little from
that outlined previously. The overall color contributed to a sample point from the ambient component of
the pipeline can now be described as the global ambient color plus the sum of the distance-attenuated
ambient colors emitted from all light sources that are within range of the point:

Total Ambient Contribution = Ma × (Ga + ∑ La × Att)
Ga = Global ambient color (RGB)
La = Ambient color of each light source within range of the sample point (RGB)
Ma = The ambient reflectance property of the material assigned to this sample point (RGB)
Att = Attenuation factor between 0.0 and 1.0 used to scale the light contribution with distance

The more complete ambient lighting equation for our model is shown above. It calculates the total
ambient light collected from all influential light sources for a single sample point. As you can see, the
ambient contribution is the sum of the ambient color from each light source, scaled by the attenuation
factor, added to the global ambient color. This provides us with the total ambient contribution for the
sample point. This value is then scaled by the sample point’s ambient reflectance property to determine
exactly how much of this ambient light is absorbed and how much is reflected.
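
Expressed as shader code, the ambient term boils down to a small accumulation loop. The following is a
minimal HLSL sketch; the names (GlobalAmbient, LightAmbient, LightAtten, MaterialAmbient) are our
own placeholders rather than anything DirectX defines, and in practice LightCount would be a
compile-time constant (one of the uniform permutation parameters discussed earlier) so the loop can be
unrolled:

// Hypothetical sketch of the total ambient term for a single sample point.
float3 ComputeAmbient( float3 GlobalAmbient,      // Ga
                       float3 LightAmbient[ 4 ],  // La for each in-range light
                       float  LightAtten[ 4 ],    // Att for each in-range light
                       int    LightCount,
                       float3 MaterialAmbient )   // Ma
{
    float3 Total = GlobalAmbient;

    // Sum the distance-attenuated ambient color of every light within range.
    for ( int i = 0; i < LightCount; i++ )
        Total += LightAmbient[ i ] * LightAtten[ i ];

    // Scale the result by the material's ambient reflectance.
    return Total * MaterialAmbient;
}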

Even if you do not intend to hand-place specific light sources for the sole purpose of adding ambient
light to your scene, it is still often desirable to supply a global ambient color to ensure that any surfaces
not exposed to direct light will still be dimly lit and visible. However, in either case you will normally
want to keep the overall ambient contribution to the scene fairly low.

It should be noted that ambient lighting is view independent and in practical terms is essentially just an
extension of standard diffuse lighting (it is commonly referred to as indirect diffuse lighting). Thus,
unlike the specular lighting component we will review shortly, the amount of ambient light reflected by
a sample point looks the same when viewed from any angle. Consequently, this component is also a
candidate for offline pre-calculation, and this is exactly what we will do when we write our lightmap
compiler later in the course. In that case we will use a more sophisticated ambient lighting model than a
simple global color – our preferred ambient model will attempt to simulate actual indirect lighting (light
bouncing off nearby surfaces) via a process called radiosity that will look considerably more realistic
than the default DirectX fixed-function technique. Finally, it is worth noting that there are plenty of
tricks and techniques we can use in our shaders to improve the look of ambient lighting. These can range
from basic scaling and biasing to improve the look of the global ambient on normal mapped (bumped)
surfaces, to full blown real-time indirect lighting systems that attempt to simulate light bounces on the
fly. We'll discuss these techniques a bit later on.

19.6.3 Diffuse Lighting


Beyond indirect incident light, light is also received from light sources that have a fixed location in
space. Therefore the light that reaches a surface is traveling along a certain direction with respect to the
sample point, striking the surface at a specific angle.

The diffuse component of the Phong illumination model we are discussing is based on the Lambertian
model of reflection that states that for ideally diffuse (totally matte) surfaces, the amount of light
reflected from the surface is determined by the cosine of the angle between the surface normal (i.e., the
normal at the sampling point) and the incident light vector. The incident light vector is a unit length
vector describing the direction from the sampling point to the light source. Thus, if both the sample
point normal and the incident light vector are unit length, we can use the dot product to calculate the
cosine of the angle and scale the incoming diffuse color emitted by the light source by the result.

Figure 19.7 shows four geometric primitives being lit by diffuse lighting alone. A light source has been
positioned off to the right of these objects and is emitting white light from right to left. Because both the
orientation of the vertex normals and the direction the light is facing are factored into the diffuse
reflection model when the color contribution is being calculated, we can see that the surfaces of the
objects that are facing predominantly to the right are being lit much more intensely by the light source
and are receiving more of the full color of the light. This is because the vertex normals and the incident
light vector (the vector from the vertex to the light source) become more aligned for these surfaces and
the result of the dot product is nearer to 1.0. Since the color emitted from the light source is scaled by
the result of this dot product, vertices whose normals are facing directly toward the light source will
receive that light’s full diffuse color.

On the left side of the meshes shown in Figure 19.7, the surfaces receive no diffuse light and remain in
shadow. This is because the angle between the vertex normals and the light vector is greater than 90
degrees. As we travel from right to left, the resulting dot product approaches zero, smoothly decreasing
the light source color contribution until we eventually reach the fully shadowed area.

Figure 19.8 illustrates that as the angle between the sample point normal and the incident light vector
increases, the result of the dot product approaches zero and the amount of light that reaches the surface
is reduced accordingly (Lambert’s Cosine Law). So we can think of the result of the dot product as a
value that can be used to scale the individual color components of the incoming diffuse light from a
given light source. If the incident light vector is perpendicular to the surface, the light is striking the
sample point directly and its full intensity is used. When a light vector strikes the surface at some angle,
its intensity is diminished based on that angle. This scaling value should be set to zero if the angle
between the incident light vector and the normal is larger than 90 degrees (π/2). In such a case, the
surface would be facing away from the light source and should not receive any of the energy emitted
from it.

Just as each light source in DirectX can have an ambient color, a light source can also emit a separate
diffuse color which is used as input into the diffuse component of the lighting model. While this is not at
all how light sources work in the real world, it does give us more control over the effects we want to
create artistically. The diffuse color emitted from each light source in the scene is also scaled using a
distance attenuation factor. As such, a sample point will only receive diffuse color from a light source if
the sample point is within its influential range. With this in mind, the total diffuse color contribution
calculated by our lighting model for a sample point is:

Total Diffuse Contribution = Md × ∑ Ld × (L • N) × Att

Ld = Diffuse color of each light source within range of the sample point (RGB)
Md = Diffuse reflectance property of the material assigned to this sample point (RGB)
L = Unit length direction vector from sample point to light source
N = Unit length normal of the sample point (e.g., vertex/pixel normal)
Att = Attenuation factor between 0.0 and 1.0 used to scale the light contribution with distance

This equation calculates the diffuse contribution of all light sources in the scene for a given sample point
and, as such, describes the final diffuse color of the sample point.
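
In shader code, the core of this term is just a clamped dot product. A minimal HLSL sketch for a single
light source might look like the following (the parameter names are hypothetical):

// Hypothetical sketch of the diffuse term for one light at one sample point.
// N = unit sample point normal, SamplePos / LightPos = world space positions.
float3 ComputeDiffuse( float3 N, float3 SamplePos, float3 LightPos,
                       float3 LightDiffuse, float Attenuation,
                       float3 MaterialDiffuse )
{
    // Unit length incident light vector (from the sample point toward the light).
    float3 L = normalize( LightPos - SamplePos );

    // Lambert's cosine law, clamped to zero for points facing away from the light.
    float NdotL = saturate( dot( N, L ) );

    return MaterialDiffuse * LightDiffuse * NdotL * Attenuation;
}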

We now know how the per-sample point diffuse and ambient colors are generated in our lighting model.
Ignoring the specular component for the time being, recall that the resulting colors calculated from each
component of the lighting model are added together to create the final color for the sample point.

Figure 19.9 shows another example of two meshes being lit by a single light source emitting bright
white light. This time, however, we have also configured the light source to include a dark grey ambient
light color. As you can see, the ambient color of the light source is fed into the ambient component of
the lighting model and is added to each vertex uniformly. The material used by all surfaces is assumed
to reflect all diffuse and ambient light.

We can see in this example that the surfaces that would normally be completely shadowed when using
only the diffuse illumination model still receive ambient color, because this is not dependent on the
direction of the light source with respect to the sample points. So the ambient color acts like a minimum
level of illumination for each sample point, including those facing away from the light source.

An important point to remember about the diffuse lighting model is that it is view independent, just like
the ambient lighting model. As mentioned, you can even think of the ambient model as an extension of
diffuse lighting in that it is intended to substitute for more costly surface-to-surface diffuse reflection
calculations. Since the view direction is not factored into the lighting equation, the diffuse illumination
will look the same regardless of the angle from which we view the sample point. Therefore, the diffuse
lighting model is making the assumption that the light that strikes a sample point is reflected equally in
all directions, which we can represent as a hemisphere centered on the sample point aligned with its
surface. This generally holds true for matte surfaces.

When designing the diffuse reflection model, the assumption is that, unlike a perfectly smooth surface
(which produces highlights whose positions are dependent on the view angle of the object and the light
source), a matte surface viewed under a microscope is constructed from a large number of microfacets
that are oriented randomly in different directions. This is what
makes a matte surface rough to the touch. As light hits a matte surface, it strikes these randomly oriented
micro-facets which reflect the light off in what would seem, collectively, like all possible directions
(Figure 19.10).

The diagram in Figure 19.10 depicts a section of a rough surface under the microscope. The circular
inset shows the microscopic view of the surface sample being examined. We can see that a matte surface
up close is actually like a rough terrain of micro-facets, each with varying normals pointing into the
hemisphere that surrounds the surface at that location. A ray of light striking a single micro-facet would
be reflected about its normal in a very specific direction. However, if we consider a surface region
consisting of a large set of these micro-facets, incident light will ultimately be reflected in what amounts
to all directions represented by the hemisphere when all of their normals are taken into account. The
Lambertian model is based on this observation -- this is why the assumption is made that light that
strikes a diffuse surface is scattered equally in all directions over the hemisphere surrounding the
surface. When surfaces or sample points have their diffuse color calculated in this way, they are referred
to as Lambertian diffusers or ‘perfect’ diffusers. It is because of this effect that the diffuse reflection
looks the same irrespective of viewing angle. Interestingly, since a view direction is not needed to
calculate the diffuse color of the sample point, diffuse color becomes a potential candidate for offline
calculation. Of course, this assumes that the surface and light source are static -- if their relationship
changes, the lighting would obviously have to be recomputed.

Revisiting our simplified version of the Phong illumination equation, we can now flesh it out with what
we have already reviewed:

Sample Point Color = Ambient Contribution * Material.Ambient
                   + Diffuse Contribution * Material.Diffuse
                   + Specular Contribution * Material.Specular

The above equation expands to the following when we insert what we now know about the ambient and
diffuse terms:

Sample Point Color = Ma × (Ga + ∑ La × Att)
                   + Md × ∑ Ld × (L • N) × Att
                   + (Specular Contribution * Material.Specular)

19.6.4 Specular Lighting


The specular component of the Phong illumination model is responsible for adding view dependent
highlights to our surfaces. The specular mathematical model we use is completely empirical. That is, it
is not actually based on any particular physical law, but rather on observation of how these highlights
seem to behave in the real world. Ultimately, it looks sufficiently convincing when not compared too
closely to the results in nature. While there are more mathematically correct specular reflection models
available, many of which have been introduced since the Phong specular equation debuted, few have
reached the same level of popularity for real-time use due to their more significant computational
overhead.

Figure 19.11 shows a white sphere lit (at the vertex level) by a single point light source. The light emits
no ambient light, green diffuse light, and white specular light. We can see that the specular highlight on
the sphere is created by adding the white specular color emitted from the light to the vertices of the
surfaces where specular highlights should occur. The specular equation of our lighting model will be
responsible for determining which vertices/pixels receive a specular contribution and the strength of that
contribution.

A perfect specular reflector is a surface that is totally smooth, even at the microscopic level. When a thin
beam of light strikes a perfect specular reflector, that light is reflected off of the surface in the mirror
direction. This vector is referred to as the reflection vector. In mathematical terms, we say that the angle
between the surface normal of a perfect specular reflector and the reflection vector is equal to the angle
between the surface normal and the incident light vector. When viewing a surface that is a perfect
specular reflector, a highlight is only observed if the reflection vector coincides exactly with the viewing
vector. We can think of this intuitively as the beam of light striking the surface and bouncing off directly
into our eye. This causes a bright highlight as demonstrated in Figure 19.12.

In Figure 19.12 we can imagine the point at the center of the surface as being the sample point that we
are calculating the specular contribution for. As light strikes the sample point, the incident light vector is
reflected with respect to the surface normal, generating the reflection vector (R) in the mirror direction.
If the reflection vector and the (negated) view vector are the same, then a strong specular contribution
should be perceived at that sampling point.
Figure 19.12
When viewing a 'perfect' specular reflector, if the view direction is not exactly coincident with the
reflection vector, then no highlight is observed by the viewer and there is no specular contribution. In
other words, a highlight is only observed if we are looking directly into the reflection vector's path (i.e.,
the direction in which the light is bouncing off the surface). If the viewer were to change position
and/or orientation, the specular reflection would no longer be observed at that point on the surface. This
is also true in real life -- when we observe a shiny surface, the highlights change as the viewing angle
changes.

In nature, the reflected beam of light generally has the same wavelength composition as the incident
light, so the color of the specular highlight produced is the same as the light source color (i.e., the
highlight is literally a reflective visualization of the actual light source). Many mathematical models
(including the DirectX lighting model) deliberately ignore this fact for artistic reasons and allow light
sources to emit, and materials to reflect, diffuse and specular colors separately. We demonstrated this
peculiarity in Figure 19.11. The light source is emitting green diffuse light which is responsible for
illuminating most of the surface, yet emits white specular light. While not particularly realistic, this does
afford us more flexibility when lighting our objects.

Studying our vertex-lit sphere in Figure 19.11, we can see that, for vertices where the reflection vector
of the light source about the vertex normal is coincident with the view direction vector, a specular
contribution is generated by the lighting model and added to the color of those vertices. It should be
noted that in this figure the sphere is not a perfect specular reflector, which is why the specular lighting
is spread over a large number of vertices instead of being limited to a single vertex where the reflection
vector and the view vector are exactly coincident. In practice, most surfaces are not perfect specular
reflectors (even mirrors have microscopic imperfections) and the Phong illumination model allows us to
model this much more common case of imperfect specular reflection.

When viewing the Phong illumination equation, at first you may be surprised by the way the diffuse and
specular contributions are combined. While we may tend to think of surfaces in the real world as either
being matte or shiny, in reality most surfaces are a combination of the two. Thus, on areas of a shiny
surface where the specular highlight cannot be seen by the viewer, regular diffuse reflection is observed.
This can be seen clearly in the image of the green sphere in Figure 19.11 where only a small number of
vertices are being strongly influenced by the specular component of the lighting model. Over the rest of
the sphere, the green diffuse reflection dominates.

Imperfect specular reflectors can be macroscopically very smooth, but at the microscopic level, the
surface is still composed of tiny microfacets, as in the matte surface we talked about earlier. Of course,
smoother surfaces would theoretically exhibit more uniformity in the distribution of these microfacets
(i.e., more of them will have normals that are aligned with the macroscopic surface normal). Each
microfacet is itself a perfect specular reflector, but when the microfacets are slightly misaligned from
one another, the surface is still going to be microscopically somewhat rough. Thus, when a beam of light
strikes an area of an imperfect reflector, each micro-facet in that region reflects the light in a slightly
different direction (just not necessarily as widely distributed as a matte surface).

So, a perfect specular reflector reflects a single tight beam and the resulting highlight is visible only on a
very small area of the surface (i.e., that exact area where the view direction and the reflected incident
light vector are coincident). An imperfect specular reflector scatters the beam of light since the
microfacets from which the light is reflecting are somewhat misaligned. The result is that the eye can
now see a specular highlight covering a larger area of the surface, albeit with diminished intensity.

Figure 19.13 depicts how incident light is reflected off an imperfect reflector, generating multiple
reflected light directions in an ever-widening cone of reflected light as the surface gets rougher. When
the view vector is not exactly aligned with the mirrored reflection vector, a highlight can still be seen,
but the intensity falls off as the angle between the view vector and the reflection vector increases. This
provides specular light over a wider surface area. Of course, the specular component of the Phong
illumination model has to (and does) allow for approximating this phenomenon so that we can more
accurately model what we see in the real world.

Figure 19.13

Note: Because a sample point only receives a specular contribution if the incident light is reflected about
its normal in the direction of the viewer, the view direction must also be factored into this equation. This
means that we will not be able to use an offline tool to pre-compute the final specular color contributions
for the sample points in our scene.

Here is the specular equation from the Phong illumination model, followed by its explanation:

Specular Contribution = Ms × ( Ls × cos^n(β) )

In the above equation, Ls is the specular color emitted by the light source and Ms is the specular
reflectance property of the material assigned to the sample point being considered. As you can see, the
specular color of the light is multiplied by cos^n(β), which calculates the amount of incoming specular
light reaching the sample point that is reflected in the eye direction. For now, think of this as a value
between 0.0 and 1.0, where a value of 0.0 describes the surface as being incapable of specular reflection
and a value of 1.0 describes the sample point as reflecting all specular contributions that reach it. The
resulting color is scaled by the specular reflectance property of the surface.

So what is the cos^n(β) part of the equation intended to accomplish? Well, we know that for a perfect
specular reflector the view vector must be coincident with the reflection vector of the incident light
vector in order for us to observe it. β in this equation is the angle between the reflection vector and the
(negated) view vector. If both the view vector and the reflection vector are unit length vectors, cos^n(β)
can be found by performing the dot product between the reflection vector and the negated view vector
and raising the result to the power of n. It becomes apparent then that the term is at its maximum when
the reflection vector and negated view vector are equal, returning a result of 1 raised to the power of n
(which will obviously just be 1). This means that if we observe a surface where the reflection vector is
pointing right back out at us, then the specular contribution is equal to the specular color assigned to the
light source. If the angle between the reflection vector and the negated view vector is larger than 90
degrees then the specular contribution should be zero (we will want to detect and clamp this).

Figure 19.14

Raising the result of cos( β ) to a power n provides a way to control the specular intensity as the view
angle and the reflection vector become more misaligned (Figure 19.14). The exponential nature of the
equation means that falloff can be extremely rapid with high powers since we are dealing with fractional
values for the cosine in all but the cases of 0 and 1. Effectively, n describes how ‘shiny’ the surface is
and its value can range from [1, ∞] (this is usually a property of the material). As we raise the value of n,
the surface appears shinier, and the falloff happens quickly when the view vector and the reflection
vector are not exactly aligned. In this case, a very bright, highly localized highlight can be observed on
the surface. The larger the value of n, the closer to a perfect specular reflector the surface becomes,
generating very bright highlights, but only when the view vector and the reflection vector are very close
to coincident. While this approximation may have little physical validity, the visual results are generally
quite convincing.

We can use the dot product to calculate cos( β ) and since n is a material property that is known to us,
the specular equation shown previously can be rewritten as:

Specular Contribution = Ms × ( Ls × (V • R)^n )

In the above equation, V is the negated unit length view direction vector and R is the reflection vector of
the incident light vector about the sample point’s normal. Note that we currently do not have direct
access to R, so we will have to calculate this term.

Reflecting a vector about a normal can be intuitively understood by examining the relationship of
similar triangles. Take a look at Figure 19.15. N is the unit length normal of the sample point for which
we are calculating the specular contribution and L is the unit length incident light vector. Given just
these two ingredients, we need to find the unit length light reflection vector R that is used in the specular
contribution calculation above.

Figure 19.15

In Figure 19.15 we can see that if we could find vectors P and S, then knowing what we do about vector
addition, P + S would result in vector R. We do not have vector P yet, but it is easy to calculate -- L • N
returns the length of the vector L scaled by the cosine of the angle α , which essentially gives us the
length of vector P (i.e., the length of the adjacent side of the triangle -- the projection of L onto N). In
the above diagram the length of vector P (written ||P||) is labeled n. So n describes how far we would
have to travel along vector N (the surface normal) to reach location P.

||P|| = L • N

Therefore we can find vector P by scaling normal vector N by ||P|| (n in the diagram):

P = (L • N ) N

In Figure 19.16, n is also used to denote ||P|| (the length of vector P).

Figure 19.16

We now have vector P, so all we need to do to produce the reflection vector R is to calculate the vector
labelled S in Figure 19.16 and add it to P. We can calculate S by subtracting vector L from P and then
adding the result to P to get R (refresh yourself on vector addition if you cannot see this). Therefore,
if

S=P–L

And

R=P+S

Then we can also say that

R=P+P–L

This can be rewritten in its more standard form as:

R = 2P - L

We have already discovered that we can calculate vector P using the incident light vector and the sample
point normal:

P = (L • N ) N

Using substitution:

R = 2( L • N ) N − L

Figure 19.17 should hopefully make clear how 2P - L allows us to arrive at vector R. In the following
diagram, we see the more intuitive version of R where R = P – L + P, but you should be able to see that 2P
- L is the same thing.

Figure 19.17

We now know how to calculate R for the incident light vector L, so we can plug it straight into our
specular reflection equation (shown again below as a reminder) to determine the specular contribution of
a light source at a sample point:

Specular Contribution = Ms × ( Ls × (V • R)^n )

Note: You can reflect either the light vector or the view vector with respect to the surface normal and
the dot product result will be the same. Light vector reflection is probably more intuitive, but if for
whatever reason reflecting the view vector is more convenient for you when you are writing your shaders
(e.g., you need it for something else and want to reuse it for lighting), it is not going to pose a
problem.
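
As a quick aside, HLSL provides a reflect intrinsic (listed later in this chapter) that computes exactly this
2(L • N)N − L style reflection for us, provided we hand it the incident vector pointing toward the surface.
A minimal sketch of the Phong specular term for one light might therefore look like the following (all
parameter names are our own placeholders):

// Hypothetical sketch of the Phong specular term for one light.
// N = unit normal, L = unit vector from sample point to light,
// V = unit vector from sample point to viewer (the negated view direction).
float3 ComputePhongSpecular( float3 N, float3 L, float3 V,
                             float3 LightSpecular, float3 MaterialSpecular,
                             float SpecularPower, float Attenuation )
{
    // reflect() expects the incident vector pointing toward the surface, so we pass -L.
    // The result is equivalent to R = 2(L • N)N - L.
    float3 R = reflect( -L, N );

    // Clamp V • R so back-facing reflections contribute nothing, then apply the power n.
    float Highlight = pow( saturate( dot( V, R ) ), SpecularPower );

    return MaterialSpecular * LightSpecular * Highlight * Attenuation;
}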

Although we now have a nice high level understanding of the Phong specular equation, it is important to
note that there is a more popular alternative approach. In 1977, James F. Blinn modified the equation
such that the need for a reflection vector was removed. The Blinn modification made the calculation
more feasible for real-time rendering and is the form of the equation adopted by the DirectX fixed-
function lighting model (as well as many others). This modified version of the equation is commonly
referred to as the Blinn-Phong model.

Blinn’s Variation
Jim Blinn deduced that the need for the reflection vector could be avoided if one could accept a minor
loss in mathematical accuracy with respect to the original Phong equation. Instead of measuring the
specular contribution and falloff using the angle between the view vector (V) and the reflection vector
(R), we can instead measure the angle between the sample point normal and the so-called half vector
(H). The half vector is a vector that lies halfway between the incident light vector and the view vector.
The sample point normal is already known to us, as are the others we'll need, so calculating the half
vector is a simpler (and cheaper) process than determining the reflection vector. It can be calculated like
so:

H = (L + V) / 2

Looking at Figure 19.18, we can think of the half vector (H) as the normal for a hypothetical surface
where the incident light vector would be perfectly reflected into the view direction. (In point of fact, H
is viewed as a sort of representative microfacet normal in more advanced reflectance models that we
will discuss later in the course, but we'll keep things simple right now.) Therefore, when performing
N • H we can understand it as measuring the angle between the ideal surface, where the viewer would
get perfect specular reflection, and the actual surface for which we are calculating the specular
contribution. If you imagine drawing a perpendicular line at the base of H in the Figure 19.18 diagram,
this would be the ideal surface for the viewer to receive specular reflection.

While the Phong and Blinn-Phong specular equations are not equivalent to one another in terms of
visual results, they both provide a way to adjust the intensity of specular contributions with respect to
how misaligned the viewer is from the perfect reflection vector. They both, therefore, provide us with a
way to simulate, in a controllable way, the imperfect specular reflectors that occur frequently in the real
world.

Note: On modern hardware, the performance difference between Phong and Blinn-Phong specular
lighting is generally not very significant. The Phong approach will require a few more shader instructions
(not very costly ones though), so that does at least minimally need to be considered. While Blinn-Phong
is technically going to be the more optimal of the two, don’t necessarily rule out supporting the Phong
equation if you prefer the visual results for certain materials. Perhaps somewhat counterintuitively, it is
interesting to note that Blinn's approach is not only a bit faster, but arguably tends to produce highlights
that are considered more realistic looking than pure Phong for many common materials. An alternative
approach is to select Blinn-Phong as the default (as we do in our demos) and then adjust its power term
to approximate Phong where desired (scaling it up by a factor of around 4 gets reasonably close).

Using the Blinn variation of the specular reflection equation:

Specular Contribution = ( Ls × (V • R)^n ) × Ms

becomes, in its entirety:

Specular Contribution = ( Ls × ( N • ((L + V) / 2) )^n ) × Ms

We can use this instead of the original and slightly more costly Phong version, shown below in its fully
expanded form for comparison:

Specular Contribution = ( Ls × (V • (2(L • N)N − L))^n ) × Ms

Note: Just because the view and light vectors are normalized does not mean that the half vector will be.
Technically, the half vector should be normalized to ensure proper dot product behavior. Therefore, if you
are going to perform the normalization step (highly recommended), the division by 2 is unnecessary. The
half vector equation in this case is then simply H = normalize( L + V ).
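
Putting that together, a minimal HLSL sketch of the Blinn-Phong specular term (again using hypothetical
parameter names) reduces to just a handful of instructions:

// Hypothetical sketch of the Blinn-Phong specular term for one light.
// N = unit normal, L = unit vector to the light, V = unit vector to the viewer.
float3 ComputeBlinnPhongSpecular( float3 N, float3 L, float3 V,
                                  float3 LightSpecular, float3 MaterialSpecular,
                                  float SpecularPower, float Attenuation )
{
    // Normalizing L + V gives us the half vector directly; no division by 2 is needed.
    float3 H = normalize( L + V );

    // N • H replaces the V • R term used by the original Phong equation.
    float Highlight = pow( saturate( dot( N, H ) ), SpecularPower );

    return MaterialSpecular * LightSpecular * Highlight * Attenuation;
}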

19.6.5 The Blinn-Phong Illumination Model


We now understand the ambient, diffuse, and specular equations that comprise the Blinn-Phong
illumination model. Let us conclude this part of the discussion by plugging all of the pieces we have
reviewed over the last few sections into our original formula:

Sample Point Color = Ma × (Ga + ∑ La × Att)
                   + Md × ∑ Ld × (L • N) × Att
                   + Ms × ∑ Ls × (N • ((L + V) / 2))^n × Att

Ga = Global ambient color (RGB)


La = Ambient color of each light source within range of the sample point (RGB)
Ma = Ambient reflectance property of the material assigned to this sample point (RGB)
Ld = Diffuse color of each light source within range of the sample point (RGB)
Md = Diffuse reflectance property of the material assigned to this sample point (RGB)
Ls = Specular color of each light source within range of the sample point (RGB)
Ms = Specular reflectance property of the material assigned to this sample point (RGB)
L = Unit length direction vector from sample point to light source
N = Unit length normal of the sample point (vertex normal for example)
V = Unit length view direction vector (negated)
n = Power (shininess) of the surface. Range [1, ∞]. Higher value = shinier material / smaller highlight
Att = Attenuation factor scales the light contribution based on distance. Range = [0,1].

Above we see that the final color at a sample point is the sum of the ambient, diffuse, and specular
contributions from all influential light sources scaled by the ambient, diffuse, and specular reflectance
properties of the surface and the light’s attenuation factor. We have not discussed how this attenuation
factor (Att) is calculated, so if you are rusty, you are encouraged to go back and review Chapter 5
(Module I) where we discuss light sources and materials in detail.

As a final note just to wrap up this section, we will remind you that there is technically one last potential
contributor to the final light color at the sample point -- self-emissive lighting. Emissive lighting is also
discussed in detail in Chapter 5, so we suggest reviewing that material if you've forgotten how it works.
Since emissive lighting is literally just added on top of the ambient, diffuse, and specular colors
computed above, let's quickly include it in our mathematical model for completeness:

Sample Point Color = Ma × (Ga + ∑ La × Att)
                   + Md × ∑ Ld × (L • N) × Att
                   + Ms × ∑ Ls × (N • ((L + V) / 2))^n × Att
                   + Me

Me = Emissive lighting property of the material assigned to this sample point (RGB)
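
To tie the whole model together, here is a hedged HLSL sketch showing how these terms might be
accumulated per light for a single sample point. The light and material arrays, the SpecularPower value,
and the ComputeAttenuation helper are all hypothetical effect-level constants and functions rather than
anything DirectX provides:

// Hypothetical sketch: the full Blinn-Phong result for one sample point.
// N = unit normal, SamplePos = world space position, V = unit vector to the viewer.
float3 ComputeSamplePointColor( float3 N, float3 SamplePos, float3 V, int LightCount )
{
    // Start with the global ambient term and the material's emissive term.
    float3 Color = MaterialAmbient * GlobalAmbient + MaterialEmissive;

    for ( int i = 0; i < LightCount; i++ )
    {
        float3 L   = normalize( LightPosition[ i ] - SamplePos );
        float  Att = ComputeAttenuation( i, SamplePos );   // distance-based falloff

        // Per-light ambient, diffuse (N dot L), and specular (N dot H) contributions.
        Color += MaterialAmbient * LightAmbient[ i ] * Att;
        Color += MaterialDiffuse * LightDiffuse[ i ] * saturate( dot( N, L ) ) * Att;

        float3 H = normalize( L + V );
        Color += MaterialSpecular * LightSpecular[ i ]
                 * pow( saturate( dot( N, H ) ), SpecularPower ) * Att;
    }

    return Color;
}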

We have now reviewed the foundation of the lighting model that we will use throughout the rest of the
course. As we build out some new lab projects over time we will do some things to improve the realism
of this model, but fundamentally the math will be similar to what we have seen here. Understand that
while more advanced lighting models do exist, the Phong and Blinn-Phong models remain very popular
in today’s games due to their good visuals, relative simplicity, and high performance.

At this point, we are definitely ready to start looking at some shaders.

19.7 Writing HLSL Shaders

Coming from a C/C++ programming background, you will find HLSL an easy transition to make. Like
most high level languages such as C or C++, its grammar is formally described in Backus-Naur Form
(BNF), and its statements are composed in a way that is very easy to understand. You will see that
variable/parameter assignments, conditional statements, loops, and mathematical operations are all
virtually identical to their C/C++ counterparts.

As with all languages, there are a number of keywords that are reserved by the language itself (as
parameter modifiers, conditional statement keywords, and so on), as well as a few keywords that are not
yet used by HLSL but are reserved for future use. You must make sure that any variable or function
names do not clash with these reserved words, which form the core of the language.

HLSL Keywords

asm               asm_fragment     bool              column_major     compile
compile_fragment  const            discard           decl             do
double            else             extern            false            float
for               half             if                in               inline
inout             int              matrix            out              pass
pixelfragment     return           register          row_major        sampler
sampler1D         sampler2D        sampler3D         samplerCube      sampler_state
shared            stateblock       stateblock_state  static           string
struct            technique        texture           texture1D        texture2D
texture3D         textureCube      true              typedef          uniform
vector            vertexfragment   void              volatile         while

Many of the keywords in the above table you will likely have seen used already since it also contains
effect file keywords such as pass, technique, etc. This list also includes all of the basic variable types
that are supported in HLSL, such as bool, float, half, int, matrix, sampler, texture, and vector. Many
others you will also recognize as having matching C++ counterparts, such as true, false, if, else, void,
etc. For those in the above list
which we have not yet come across, we will discuss them as we encounter them throughout our example
code discussions.

In addition to the above list of keywords, there is another list of reserved words which must not be used
as variable or function names. The keywords in the following table are not available in HLSL as of
shader model 3.0 but are reserved for implementation into HLSL in the future.

HLSL Reserved Keywords

auto       break          case        catch        default
delete     dynamic_cast   enum        explicit     end
goto       long           mutable     namespace    new
operator   private        protected   public       reinterpret_cast
short      signed         sizeof      static_cast  switch
template   this           throw       try          typename
union      unsigned       using       virtual

As with most high level languages, just the keywords and the basic variable types alone don’t help us to
accomplish very much. In C, we have the C runtime library that contains hundreds of utility functions
such as printf, scanf, etc. to which all C programs have access. These functions, whilst not technically
keywords in the C language, are considered part of the language all the same. In HLSL, we are provided
with a set of intrinsic functions. Dozens of these intrinsic functions exist to perform all manner of
operations ranging from mathematical dot products, calculating cosines and performing vector matrix
multiplications, to sampling textures, normalizing vectors, and accessing and working with colors. All
HLSL shaders have access to them, and every shader will use one or more of these intrinsics to do their
job. There are too many intrinsic functions to attempt to explain each and every one in a laundry list
fashion (which would take more time than is warranted) so we will simply list some of the commonly
used functions below and then discuss them when we encounter them in our code. The purpose of most
intrinsics will be obvious by their name alone, while others will require some explanation when we use
them.

Note: If you would prefer not to wait to read what some of these do, there is a complete description of
each in the DirectX 9.0 SDK documentation.

HLSL Intrinsic Functions

abs         acos        all         any         asin        atan         atan2        ceil
clamp       clip        cos         cosh        cross       D3DCOLORtoUBYTE4          ddx          ddy
degrees     determinant distance    dot         exp         exp2         faceforward  floor
fmod        frac        frexp       fwidth      isfinite    isinf        isnan        ldexp
length      lerp        lit         log         log10       log2         max          min
modf        mul         noise       normalize   pow         radians      reflect      refract
round       rsqrt       saturate    sign        sin         sincos       sinh         smoothstep
sqrt        step        tan         tanh        tex1D       tex1Dgrad    tex1Dbias    tex1Dlod
tex1Dproj   tex2D       tex2Dbias   tex2Dgrad   tex2Dlod    tex2Dproj    tex3D        tex3Dbias
tex3Dgrad   tex3Dlod    tex3Dproj   texCUBE     texCUBEbias texCUBEgrad  texCUBElod   texCUBEproj
transpose

As you can see, many standard mathematical functions we would expect to use as graphics developers
are represented. There is also a variety of texture lookup functions which can be used inside the pixel
shader (and some even in the vertex shader in model 3.0) to sample a texture currently assigned to any
of the sampler units.
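
As a tiny taste of what these look like in practice, the fragment below strings a few of the most common
intrinsics together. The variable names are simply placeholders for values a shader would typically have
available:

// Hypothetical fragment showing several common intrinsics working together.
float4 WorldPos   = mul( float4( ModelPos, 1.0f ), WorldMatrix );  // vector * matrix transform
float3 N          = normalize( WorldNormal );                      // ensure a unit length normal
float  NdotL      = saturate( dot( N, LightDir ) );                // clamped cosine term
float4 TexelColor = tex2D( BaseMapSampler, TexCoords );            // sample the base texture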

With these preliminaries out of the way, let’s get started writing shader code. We outlined earlier in this
chapter how shaders represent, in many ways, a return to the days of the software engine for graphics
developers. We will have the freedom to implement our own vertex and pixel processing code but have
it execute on the graphics hardware, in parallel to our application, for optimal performance. To make
certain that we are all on the same page with respect to what part the vertex and pixel shaders play in the
pipeline, let us first discuss things from a software engine standpoint. That is, let us consider how we
might implement these parts of a software engine abstracted from shader syntax and grammar. We will
start with the notion that we are developing a 3D graphics engine in software that works with triangles,
just like D3D. To keep things simple let's just assume for now that all we want to do is transform
vertices and read from a single base texture for each pixel for coloring. We'd likely wind up writing
some functions that, at a high level, might look something like the following in C++:

void DrawPrimitive( tTriangle & Triangle,
D3DXMATRIX& WorldViewProjectionMatrix,
tTexture & BaseTexture )
{
D3DXVECTOR4 TransformedVertices[ 3 ];
D3DXVECTOR2 TransformedTextureCoordinates[ 3 ];

// For each vertex in our triangle...


for ( int i = 0; i < 3; i++ )
ProcessVertex( Triangle.Position[ i ],
Triangle.TextureCoordinates[ i ],
WorldViewProjectionMatrix,
TransformedVertices[ i ],
TransformedTextureCoordinates[ i ] );

// Clip Triangle
// Divide by W (2D projection of vertices)

// Send the triangle to our interpolator to break it up into fragments


InterpolateOverTriangle( TransformedVertices,
TransformedTextureCoordinates,
BaseTexture );
}

The DrawPrimitive function in this example conceptually imitates the Direct3D DrawPrimitive function.
In our case, we pass a triangle structure containing the vertices, a combined world/view/projection
matrix to transform vertices into clip space, and a base texture that should be used to texture the triangle
when it is rendered.

Our method loops through each of the triangle's vertices and invokes a function called ProcessVertex to
transform them (one at a time) into clip space. We can think of this function as the software equivalent
of a vertex shader program. Once all vertices in the triangle are transformed and subsequently projected,
we call the InterpolateOverTriangle method to rasterize the triangle into the frame buffer. Without
delving too deeply into the details of writing rasterizer code, the basic structure we might imagine the InterpolateOverTriangle function to have is shown below. We can think of it as a scaled-down version of Direct3D's triangle setup and rasterization module. Of course, Direct3D offers many more features than we will in our example, but the important point is that this function would be supplied with the 2D (projected) vertex positions, their texture coordinates, etc., and would set up a loop to visit every pixel on the surface of the triangle.

void InterpolateOverTriangle( D3DXVECTOR4 TransformedPositions[],
                              D3DXVECTOR2 TransformedTexCoords[],
                              tTexture & BaseTexture )
{
    // Perform interpolation over the triangle, viewport clipping, etc.
    while( !FinishedInterpolating )
    {
        // Break up our triangle into scanlines and step across each scanline
        // based on the number of pixels we have to process...
        ...
        ...
        ...

        // Get interpolated texture coordinate for this location on the triangle
        D3DXVECTOR2 TexCoords = InterpolateTextureCoordinates( TransformedTexCoords,
                                                               CurrentBarycentricCoordinate );

        // For each pixel that survives, let's just sample our base texture
        FrameBufferColor[ n ] = ProcessPixel( TexCoords, BaseTexture );
    }
}

Inside the loop, for each pixel we call a function called 'InterpolateTextureCoordinates' which, as its
name implies, is designed to calculate the texture coordinates through linear interpolation for the current
pixel being processed. Once the texture coordinates (and any other per-vertex attributes, like colors)
have been interpolated for the given pixel, the 'ProcessPixel' function is then called to calculate the final
color for the pixel. The resulting color is stored in the frame buffer (we are ignoring depth/stencil testing
and other similar features to keep things simple) and then the next pixel in the primitive would be
processed. The important point to recognize is that the 'ProcessPixel' function is our software equivalent
of a pixel shader. Here the function is supplied with a texture and a set of texture coordinates which it
can use to sample a color and return it for storage.

Going back to our 'ProcessVertex' function, what sort of calculations need to take place within this
function? Well, at a bare minimum, we know that we will need to transform the model space positions
of each vertex into clip space to ensure that the rest of the pipeline has workable primitives for post-
projection rasterization. For this example, let's just assume that our texture coordinates are not going to
require any particular transformation, so we can just safely pass them through the function unchanged.
Thus we might end up with a ProcessVertex function that looks like:

void ProcessVertex( const D3DXVECTOR3& ModelSpacePosition,
                    const D3DXVECTOR2& TextureCoordinateIn,
                    const D3DXMATRIX& WorldViewProjectionMatrix,
                    D3DXVECTOR4& ClipSpacePosition,
                    D3DXVECTOR2& TextureCoordinateOut )
{
    // Transform model space vertex to clip space.
    D3DXVec3Transform( &ClipSpacePosition,
                       &ModelSpacePosition,
                       &WorldViewProjectionMatrix );

    // Just copy the texture coordinates since we don't need to do any
    // transformation in this example.
    TextureCoordinateOut = TextureCoordinateIn;
}

The model space position of the vertex passed in is transformed by the combined world/view/projection
matrix supplied, and the resulting clip space vertex position is stored in the output parameter. Since we
know that our texture coordinates do not need to be transformed, we just copy them over into the output
parameter.

During rasterization the texture coordinates are interpolated, so for each pixel on the surface of the primitive a unique set of texture coordinates is generated. These texture coordinates are then passed into the 'ProcessPixel' function, which returns the final color to store in the frame buffer. Our 'ProcessPixel' function is going to be very simple in this example -- it will just use the input texture
coordinates to sample the input texture. In the following code, we are assuming that we have written a
function called 'SampleTexture' that can extract a color from the supplied texture based on a set of 2D
coordinates.

D3DXCOLOR ProcessPixel( const D3DXVECTOR2& TextureCoordinate,
                        const tTexture& BaseTexture )
{
    return SampleTexture( BaseTexture, TextureCoordinate );
}

Although we grant that these code snippets oversimplify the pipeline, conceptually the ideas should make sense to you given what you already know about what DirectX does when you call a function like DrawPrimitive. In the next section we will see how these ideas translate to shaders proper.

19.7.1 Simple Texturing Shaders

We know that when using DirectX, most of the tasks outlined in the last example are performed for us
behind the scenes. In the fixed-function pipeline we set some states, call one of the DrawPrimitive
methods, and that’s pretty much all there is to it. In such a case, functions like 'ProcessVertex' and
'ProcessPixel' contain code that is hardwired into the fixed-function pipeline to calculate the vertex/pixel
data based on the current state of the transform matrices and the texture stage cascade. However, in the
programmable pipeline, 'ProcessVertex' and 'ProcessPixel' can be replaced by our own custom code. All
the other parts of the Direct3D pipeline remain in place, so you certainly are not tasked with writing
your own rasterization or clipping modules. We are simply going to substitute the vertex and pixel color
processing routines with our own shader functions. So then let’s see how we could port our previous
'ProcessVertex' function written in C/C++ over to an HLSL shader:

// Parameters
float4x4 WorldViewProjectionMatrix;

// Vertex Shader
void ProcessVertex( float3 ModelSpacePosition : POSITION,
float2 TextureCoordinateIn : TEXCOORD0,
out float4 ClipSpacePosition : POSITION,
out float2 TextureCoordinateOut : TEXCOORD0 )
{
// Transform model space vertex to clip space.
ClipSpacePosition = mul( float4( ModelSpacePosition, 1 ),
WorldViewProjectionMatrix );

// Just copy the texture coordinates since we don't need to
// do any transformation in this example.
TextureCoordinateOut = TextureCoordinateIn;
}

The similarities are quite striking.

First, notice that we define a 4x4 matrix parameter at the top of the file that will be set by the application
(e.g., via the ID3DXEffect::SetMatrix method) with the current combined world/view/projection matrix
prior to the shader being invoked. Also take note that we use the semantics we talked about earlier to
bind the data in the vertex stream to parameters coming into the function. Finally, you can see the out
modifiers being used to describe to the pipeline that the final two parameters will contain the resulting
transformed position and texture coordinates when the shader is complete.

The first line of the function calls the mul HLSL intrinsic function which, similar to the
'D3DXVec3Transform' function we used in our earlier C/C++ example, is used to transform a vector by
a matrix (or more accurately multiply two matrices together -- remember that vectors are just 1x3 or 1x4
matrices). You will see this function used many times in our shaders since it is the primary means for
transforming vectors.

There is a little but important 'gotcha' to watch out for in the first line of this shader function. Because the model space position in the vertex stream is a 3D vector, but the transformation matrix is a 4x4 matrix, the inner dimension rule states that the vector and matrix cannot technically be multiplied. However, as discussed in Module I of this series, in such cases the 3D vector is just a shorthand way of representing a 4D vector with a w component of 1.0. Thus, inside the parentheses of the mul call, we construct a 4D vector with a w component of 1 to resolve the issue.

Note: In the above paragraph we claimed that technically the multiplication should not be possible, but
in HLSL multiplying a 3D vector by a 4x4 matrix is not considered an error. The mul instruction is
adaptable and if the input is a 3D vector, it will instead treat the 4x4 matrix like a 3x3 matrix (resolving
the inner dimension rule) and carry out the process. The result would be a scale and/or rotation being
applied to our vector because we know that this is what the upper 3x3 elements in our matrix are
capable of providing. Translation will not occur in this case however because those values are ignored.
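
To make the distinction concrete, here is a small, hypothetical HLSL function (not part of the chapter's shader) showing the two forms side by side. 'Position', 'Direction' and 'WorldMatrix' are placeholder names, and an explicit (float3x3) cast is used in the second line to make the 'rotation/scale only' intent obvious rather than relying on implicit truncation:

float4x4 WorldMatrix;

void TransformExamples( float3 Position, float3 Direction,
                        out float4 FullTransform, out float3 RotatedOnly )
{
    // Promote to 4D so the translation stored in the fourth row is applied.
    FullTransform = mul( float4( Position, 1.0f ), WorldMatrix );

    // Cast to 3x3 so only the rotation/scale portion is applied -- this is
    // effectively what happens when a float3 is passed to mul unmodified.
    RotatedOnly = mul( Direction, (float3x3)WorldMatrix );
}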

As you can see, the mul intrinsic accepts two parameters -- the first in this case is the vector we wish to
transform and the second is the transformation matrix itself. The resulting 4D vector will now describe a
position in clip space and is stored directly in the output parameter. Since we wish to perform no special
transformations on the texture coordinates at the vertex level, we simply copy those over from the vertex
stream into the relevant output parameter.

And there is your first fully functional vertex shader! While potentially anti-climactic, this shader would
compile and could be used by any geometry that needs only a single base texture. Admittedly, it is a
very simple example, but you should be encouraged by how much it just looks like a regular C/C++
function.

To complete our exercise, let's take a look at our pixel processing function and see what it might look
like as a pixel shader:

// Parameter declarations
texture BaseMap;

// Sampler State definitions
sampler BaseMapSampler = sampler_state
{
Texture = <BaseMap>;
MinFilter = Linear;
MagFilter = Linear;
MipFilter = Linear;
AddressU = wrap;
AddressV = wrap;
};

// Pixel Shader
float4 ProcessPixel( float2 TextureCoordinates : TEXCOORD0 ) : COLOR
{
return tex2D( BaseMapSampler, TextureCoordinates );
}

In our earlier C/C++ pseudo-code the 'ProcessPixel' function was supplied directly with the texture we
wanted to sample, but we know that this isn’t necessarily how things work with effects / shaders. In this
case, we define a texture parameter inside the shader source file (as shown above) and then define the
sampler state with which we wish the texture to be sampled. So in the above example, assuming the use
of effects, our application would call the ID3DXEffect::SetTexture method to set the value of the
BaseMap parameter to a real texture. You can also see that we have defined a sampler object to which we have bound this texture and which we have configured to use trilinear filtering when sampling.

Like our C/C++ example, the pixel shader code itself contains only a single line, and it is here that we
encounter a call to one of the most commonly used HLSL pixel shader intrinsic functions -- tex2D. This
method is provided with a sampler object that describes the texture to read and sampling methods to use
in its first parameter, and it simply returns the color sampled at the location expressed by the set of 2D
texture coordinates also provided to it as the second parameter. The states of the sampler object are
obeyed, so the color is fetched from the texture using whatever anti-aliasing features (bilinear/trilinear
filtering, mip-mapping, etc.) and texture coordinate addressing modes (clamping, wrapping, etc.) you
require.

In the above code we can see that the pixel shader accepts a single parameter from the rasterizer -- the
set of interpolated texture coordinates for the pixel currently being processed. Since the job of the pixel
shader is to return the color of the pixel, the function returns a single float4 and utilizes the 'COLOR'
semantic. The pipeline will scoop up this value and continue the process from there (alpha blending,
etc.).

And there is your first pixel shader! Together, in a single effect file, the final result might look like this:

// Parameters
float4x4 WorldViewProjectionMatrix;
texture BaseMap;

// Sampler State definitions
sampler BaseMapSampler = sampler_state
{
Texture = <BaseMap>;
MinFilter = Linear;
MagFilter = Linear;
MipFilter = Linear;
AddressU = Clamp;
AddressV = Clamp;
};

// Vertex Shader
void ProcessVertex( float3 ModelSpacePosition : POSITION,
float2 TextureCoordinateIn : TEXCOORD0,
out float4 ClipSpacePosition : POSITION,
out float2 TextureCoordinateOut : TEXCOORD0 )
{
// Transform model space vertex to clip space.
ClipSpacePosition = mul( float4( ModelSpacePosition, 1 ),
WorldViewProjectionMatrix );

// Just copy the texture coordinate since we don't need to
// do any transformation to it in this example.
TextureCoordinateOut = TextureCoordinateIn;
}

// Pixel Shader
float4 ProcessPixel( float2 TextureCoordinates : TEXCOORD0 ) : COLOR
{
return tex2D( BaseMapSampler, TextureCoordinates );
}

// Techniques
technique MyFirstShaderTexture
{
pass P0
{
VertexShader = compile vs_2_0 ProcessVertex();
PixelShader = compile ps_2_0 ProcessPixel();
}
}

So now we have seen firsthand that shaders are really just custom data processing functions that we
can execute on the GPU instead of on the CPU. We use them to process our vertex and/or pixel data
according to whatever our application’s needs might be. With this basic level of exposure now out of the
way, we can start to develop some more advanced ideas.
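
Before moving on, it may help to sketch what the application side of using this effect might look like. This is a hypothetical fragment only -- it assumes an ID3DXEffect pointer (pEffect), a prepared combined matrix (matWVP) and texture (pTexture), and that the parameter and technique names match those declared in the effect file above:

// Hypothetical application-side usage of the effect (names are placeholders).
pEffect->SetMatrix( "WorldViewProjectionMatrix", &matWVP );
pEffect->SetTexture( "BaseMap", pTexture );
pEffect->SetTechnique( "MyFirstShaderTexture" );

UINT nPasses = 0;
pEffect->Begin( &nPasses, 0 );
for ( UINT i = 0; i < nPasses; ++i )
{
    pEffect->BeginPass( i );
    // ... issue DrawPrimitive / DrawIndexedPrimitive calls here ...
    pEffect->EndPass();
}
pEffect->End();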

19.7.2 Textured Diffuse Lighting Shaders (Per-Vertex)

For our next example, let's increase the complexity of our shader just a little bit by adding diffuse
lighting calculations in the vertex shader. This will produce results quite similar to Direct3D’s vertex
lighting system, where the diffuse color is calculated at the vertex and is then interpolated over the
primitive. The pixel shader will be provided with the interpolated texture coordinates and the interpolated
diffuse color calculated at each of the vertices. Inside the pixel shader, the final color of the pixel will be
calculated by modulating the diffuse color with the color sampled from the base map.

Because we want our shader to calculate the diffuse color of the vertex, our vertex structure and
declaration will need to contain a normal. To keep things simple for now we will assume that the vertex
shader will work with a single directional light source (like the Sun). As such, two non-uniform
parameters will need to be declared that will contain the world space light direction vector and the color
of the light source, as shown below.

// Parameters
float4x4 WorldViewProjectionMatrix;
float4x4 WorldMatrix;
float3 LightDirection;
float4 LightDiffuseColor;
texture BaseMap;

// Sampler State definitions
sampler BaseMapSampler = sampler_state
{
Texture = <BaseMap>;
MinFilter = Linear;
MagFilter = Linear;
MipFilter = Linear;
AddressU = Clamp;
AddressV = Clamp;
};

Prior to rendering any primitives using this shader/effect, the application will need to make sure that the
light parameters are set with the correct world space direction vector and light color so that they will be
accessible in the shader.

In order to calculate the diffuse color of the vertex, our vertex shader will need to perform a dot product
between the light direction vector and the vertex normal. However, the normal passed into the shader is
assumed to be in model space in our example and the light vector is in world space, and we know that
they must both be in the same space for the calculation to be correct. The vertex shader will need to
transform the vertex normal into world space to perform this calculation, so you can see in the above
parameter list that, in addition to the combined world/view/projection matrix used to transform the
vertex position, the application must also pass the world matrix separately so that we can transform
vectors from model to world space. The combined matrix we use for position transformations will not
suffice because it is already concatenated with the view and projection matrices (for transformations to
clip space).

So, prior to invoking this shader, the application will need to set the combined transformation matrix, a
standalone world matrix, a base texture, and the light’s world space direction and color.

Below is the final vertex shader which performs the calculations to generate the diffuse color of the
vertex.

// The Vertex Shader
void ProcessVertex( float3 ModelSpacePosition : POSITION,
float3 ModelSpaceNormal : NORMAL,
float2 TextureCoordinateIn : TEXCOORD0,
out float4 ClipSpacePosition : POSITION,
out float2 TextureCoordinateOut : TEXCOORD0,
out float4 DiffuseColor : COLOR0 )
{
// Transform model space vertex position to clip space.
ClipSpacePosition = mul (float4(ModelSpacePosition,1),
WorldViewProjectionMatrix );

// Transform the model space normal to world space and renormalize.
float3 WorldSpaceNormal = normalize( mul( ModelSpaceNormal, WorldMatrix ) );

// Compute a diffuse color using a directional light source.
DiffuseColor = LightDiffuseColor *
saturate( dot( WorldSpaceNormal, LightDirection ) );

// Just copy the texture coordinates since we don't need to do any
// transformation in this example.
TextureCoordinateOut = TextureCoordinateIn;
}

Notice that the shader now has an additional input from the vertex stream labeled with the NORMAL
semantic and that we also output a diffuse color alongside the texture coordinate set and clip space
position.

Inside the function we begin by transforming the model space vertex position into clip space using the
mul function and the combined world/view/projection matrix. Next, we transform the normal into world
space by calling the mul function again, but this time we specify the model space normal and the world
matrix as the two arguments. Notice that in the case of the normal we do not perform the 'float4(v,1)' expansion, because here we only want it to be transformed by the upper 3x3 portion of the matrix (no translation). The result is passed to the normalize intrinsic function which
ensures that our new world space normal is unit length.

Note: Technically, the normal should be multiplied by the inverse transpose of the world matrix. This
ensures that non-uniform scaling will work properly. For the moment, you can either assume that the
world matrix is presented in this form or that we are not supporting non-uniform scales.

At this point we have the vertex normal in world space along with the light direction vector, so we
calculate the cosine of the angle between the two by performing a dot product between the two vectors.
Notice that HLSL offers the dot intrinsic function for performing the dot product between two vectors.
We also use the saturate intrinsic which clamps the results to the [0, 1] range (i.e., if the result is
negative it is clamped to zero and if it is larger than 1.0 it is clamped to 1). The saturated dot result is then
multiplied by the light’s diffuse color (set by the application) to generate the final color. The resulting
color is stored in the DiffuseColor output parameter so that it is passed out of the shader and on to the
rest of the pipeline. With the diffuse color and clip space position calculated, all that’s left to do is copy
over the texture coordinates from the vertex stream into the appropriate output parameter and our task is
complete.

The texture coordinates and diffuse color output by our vertex shader will be interpolated over the
surface of the primitive and eventually passed into each invocation of the pixel shader. Since our pixel
shader will now be interested in receiving the interpolated diffuse color as well as the texture
coordinates it asked for in the previous example, we assign an additional input parameter with the
COLOR0 semantic. The remainder of the effect file is shown below.

// Pixel Shader
float4 ProcessPixel( float4 DiffuseColor : COLOR0,
                     float2 TextureCoordinates : TEXCOORD0 ) : COLOR
{
    return tex2D( BaseMapSampler, TextureCoordinates ) * DiffuseColor;
}

// Techniques
technique DiffuseTextureTechnique
{
pass P0
{
VertexShader = compile vs_2_0 ProcessVertex();
PixelShader = compile ps_2_0 ProcessPixel();
}
}

As you can see, the pixel shader uses the supplied texture coordinates to sample the texture and
modulates the result with the interpolated diffuse color that came in from the rasterizer.

19.7.3 Textured Diffuse Lighting Shaders (Per-Pixel)

What if we want a bit more accuracy in our calculations and decide that rather than interpolate our
vertex colors using Gouraud shading, we want to compute a diffuse color explicitly at each pixel? This
is something that we cannot do with the fixed-function pipeline, but with shaders, per-pixel lighting is a
breeze. We will just need to make a few small adjustments to move the lighting calculations out of the
vertex shader and into the pixel shader.

We will separate the coverage of this code into a few sections just as we did with the previous example
since there are some very important points to make. All of the same constant data will be used this time
around, and the application will still be responsible for supplying this data prior to invoking the effect.
However, we do have something important to consider when thinking about the vertex structure and
declaration we are going to use.

In the previous example, we calculated the diffuse lighting using the dot product between the light
direction vector and the world space vertex normal. However, how do we do this in the pixel shader?
The pixel shader can access the application provided world space light direction vector through its
constant data registers. However, unlike a vertex shader, a pixel shader is not automatically supplied
with a normal -- a vertex normal has no context in the pixel shader. As it happens, we are now going to
see an example of something we discussed earlier -- using texture coordinates as generic containers for
data that we wish to transport from the vertex shader into the pixel pipeline.

Since texture coordinate registers are really just four-component float registers that are interpolated
across the surface of the primitives being rendered, we can store anything we like in them, including
normal vectors. With this in mind, our vertex shader will now output two sets of texture coordinates,
even though we know that we only intend to sample a single texture. Our shader will transform its
normal into world space, as it did before, but store the resulting world space normal in the second
texture coordinate set. Below we see the first section of the effect file, including the vertex shader code.

Note: Although we are using a second set of texture coordinates here, this does not mean that our input
vertex stream and declaration have to define this second set. We are not asking to be provided with a
second set of texture coordinates for input into the vertex shader. Instead, we are taking advantage of
another texture coordinate output register set for passing data out of the vertex shader and on to the
pixel shader.

// Parameters
float4x4 WorldViewProjectionMatrix;
float4x4 WorldMatrix;
float3 LightDirection;
float4 LightDiffuseColor;
texture BaseMap;

// Sampler State definitions
sampler BaseMapSampler = sampler_state
{
Texture = <BaseMap>;
MinFilter = Linear;
MagFilter = Linear;
MipFilter = Linear;
AddressU = wrap;
AddressV = wrap;
};

// Vertex Shader
void ProcessVertex( float3 ModelSpacePosition : POSITION,
float3 ModelSpaceNormal : NORMAL,
float2 TextureCoordinateIn : TEXCOORD0,
out float4 ClipSpacePosition : POSITION,
out float2 TextureCoordinateOut : TEXCOORD0,
out float3 WorldSpaceNormal : TEXCOORD1 )
{
// Transform model space vertex to clip space.
ClipSpacePosition = mul( float4( ModelSpacePosition, 1 ),
WorldViewProjectionMatrix );

// Transform our input (model space) normal to world space and output it.
WorldSpaceNormal = normalize( mul( ModelSpaceNormal, WorldMatrix ) );

// Just copy the texture coordinates
TextureCoordinateOut = TextureCoordinateIn;
}

Our third output parameter is now labelled as a texture coordinate instead of as a color output. In this
case, TEXCOORD1 indicates that the second texture coordinate register will be used to transport this
information. Of course, we know that this second set of texture coordinates does not contain texture
coordinates at all, but instead contains the world space normal of the vertex.

Since anything in the texture coordinate registers will be linearly interpolated over the surface of the
primitive, this means that the three world space normals of the triangle’s vertices will be involved in the
calculation to generate a specific world space normal for each pixel. Our pixel shader will need access to
the contents of both sets of texture coordinates now, so we modify the input parameter list. The final
section of the effect file is shown below.

// Pixel Shader
float4 ProcessPixel( float2 TextureCoordinates : TEXCOORD0,
                     float3 WorldSpaceNormal : TEXCOORD1 ) : COLOR
{
// Renormalize the input normal since it will no longer
// be unit length after interpolation across the surface.
WorldSpaceNormal = normalize( WorldSpaceNormal );

// Compute a diffuse color.
float4 DiffuseColor = LightDiffuseColor *
saturate( dot( WorldSpaceNormal, LightDirection ) );

// Modulate with base texture color and return result.
return tex2D( BaseMapSampler, TextureCoordinates ) * DiffuseColor;
}

// Techniques
technique DiffuseTextureTechnique
{
pass P0
{
VertexShader = compile vs_2_0 ProcessVertex();
PixelShader = compile ps_2_0 ProcessPixel();
}
}

When writing our pixel shader, we know that the first set of texture coordinates should be used to
sample from the base map, just as before. However, the second set will contain the normal so that we
can perform a dot product with the light direction vector, now performed at the pixel level. Since the
normal has been linearly interpolated based on the three normals computed at the triangle vertices, it is
no longer guaranteed to be unit length, so it is generally a good idea to normalize it first.

Once we have a unit length normal vector (notice how the normalize intrinsic is available to both the
vertex and pixel shader) we perform the dot product between the normal and the incident light direction
vector as before in our calculation of the diffuse color. Finally, the diffuse color is modulated with the
color sampled from the base map (using the first set of texture coordinates) and the resulting color is
returned.

19.7.4 Skinning with Per-Pixel Diffuse and Specular Lighting Shaders

So far we have only examined some fairly simple shaders, but hopefully you are already starting to see
how flexible shaders can be. Let’s step up the complexity level just a little bit more and write a shader
that adds skinning support to our vertex shader and adds specular lighting calculations to our pixel
shader.

Recall from Module II that skinning is the process of transforming the vertices of a mesh using a
weighted blend of multiple matrices. In this context, each matrix that influences a vertex is referred to as
a bone matrix. Whether using shaders or the fixed-function pipeline, our skinned meshes are created in
exactly the same way (e.g., D3DXConvertToIndexedBlendedMesh). Once we have our skinned mesh
generated and the bone matrices stored, we render the mesh in a very specific way. Before rendering any
subset of the mesh, we assign the bone matrices that influence the vertices of that subset to the device’s
matrix palette.

Our skinned mesh would have been created with a vertex structure that contained weights. These
weights tell Direct3D how much each of the four possible matrices influences a single vertex. If a batch
of vertices were influenced by the same four bones, they would exist in the same subset and would be
rendered together (non-indexed skinning). Behind the scenes, the pipeline would transform the vertex by
each of the four matrices to generate four intermediate vertex positions. Finally, these intermediate
positions would be combined using the weights of the vertex to scale the strength of each contribution.

With indexed skinning, we didn’t have to break the mesh into groups of polygons all influenced by the
same four bones. Assuming the matrix palette was large enough, we could set all of the bone matrices on
the device in one go. The individual subsets still contained weights, as in the non-indexed case, but we
included an additional four bytes worth of indices (4 indices) per vertex to select the appropriate
matrices from the bone matrix palette.

Our skinned meshes are going to be created in exactly the same way this time around, although now we
will use a declaration to inform the pipeline about our vertex structure and the fact that it contains
weights and indices alongside the usual data. One thing to bear in mind is that when using shaders, we
no longer have access to the device's matrix palette -- it is only used for fixed-function transforms. This
is not actually a problem of course because we can just pass the shader our palette of matrices via the
constant registers. That is, we can define an array of matrices in our effect file and have the application
populate that array with the bone matrices prior to invoking the shader. Inside the shader, our code will
access those matrices in a logical manner and perform all of the skinning calculations that the fixed-
function pipeline used to do on our behalf.

Note: Module II spent a lot of time on the subject of skinning, so we will not discuss all of that
information again here. If you are feeling a little out of practice on this subject, we suggest you spend a
few moments revisiting those chapters before moving on.

Our current task is going to be somewhat more complex than those we have previously tackled, and we
will cover the vertex and pixel shaders separately. On the application side, we will have ultimately
constructed an indexed skinned mesh with a vertex structure that looks something like the following:

struct MySkinnedVertex
{
D3DXVECTOR3 Position;
D3DXVECTOR3 Normal;
D3DXVECTOR2 TexCoords;
float BlendWeights[3];
DWORD BlendIndices;
};

Our vertices have a model space position, space for up to three weights, and an additional DWORD used
to store four one-byte indices. This implies a maximum palette size of 256 matrices, which should be
more than enough -- you will probably not have that many constant registers to spare anyway. Since our
vertex and pixel shaders will also be responsible for lighting calculations, we also include a vertex
normal and a set of texture coordinates.

Note: You are reminded that a vertex of this type can be influenced by a maximum of four bones even
though there are only three blend weights. As the sum of all blend weights must amount to 1.0, the
fourth weight can be calculated on the fly and need not be stored in the vertex structure.

Our application would also need to create a suitable declaration for this vertex structure, which would
look something like what we see next. Notice that we are now using a few more of those usage flags to
identify the blend weights and indices.

D3DVERTEXELEMENT9 dwDecl3[] =
{
{ 0,0, D3DDECLTYPE_FLOAT3,D3DDECLMETHOD_DEFAULT, D3DDECLUSAGE_POSITION, 0},
{ 0,12, D3DDECLTYPE_FLOAT3,D3DDECLMETHOD_DEFAULT, D3DDECLUSAGE_NORMAL, 0},
{ 0,24, D3DDECLTYPE_FLOAT2,D3DDECLMETHOD_DEFAULT, D3DDECLUSAGE_TEXCOORD, 0},
{ 0,32, D3DDECLTYPE_FLOAT3,D3DDECLMETHOD_DEFAULT, D3DDECLUSAGE_BLENDWEIGHT, 0},
{ 0,44, D3DDECLTYPE_UBYTE4,D3DDECLMETHOD_DEFAULT, D3DDECLUSAGE_BLENDINDICES,0},
D3DDECL_END()
};
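
As a reminder (and purely as a hypothetical sketch, since the surrounding framework code is not shown here), the application would turn this element array into a declaration object and bind it before rendering, along the following lines. The device pointer name is assumed:

// Hypothetical usage -- assumes a valid LPDIRECT3DDEVICE9 named pDevice.
LPDIRECT3DVERTEXDECLARATION9 pDecl = NULL;

// Create the declaration object from the element array shown above.
if ( SUCCEEDED( pDevice->CreateVertexDeclaration( dwDecl3, &pDecl ) ) )
{
    // Bind it prior to issuing draw calls that use MySkinnedVertex data.
    pDevice->SetVertexDeclaration( pDecl );
}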

With respect to the shader code, the shader functions are starting to look a little unwieldy given all of the
input and output parameters in use, so let’s tidy that up a bit by using structures instead. Below, you can
see the input structure that we will use in our vertex shader, which now includes members to house the
blend weights and indices contained in the vertex stream.

struct VertexShaderInput
{
float3 Position : POSITION;
float3 Normal : NORMAL;
float2 TexCoords : TEXCOORD0;
float3 BlendWeights : BLENDWEIGHT;
float4 BlendIndices : BLENDINDICES;
};

There is a little quirk when defining the indices field as an input parameter which might seem confusing
at first, but it is just one of those things you have to get used to. You will notice that while we defined
the indices in our vertex stream as a DWORD (four one-byte indices), in our shader input structure the
indices are defined as a float4. As it happens, the Direct3D pipeline will extract your indices from the
vertex stream and load them into a float4 register (recall that all of the vertex shader input registers share
this characteristic). This all happens behind the scenes, but it is one of those things you will just have to
remember when defining your shader inputs.

For consistency, we'll also define an output structure for our vertex shader that will ultimately make its
way as input to our pixel shader:

struct VertexShaderOutput
{
float4 ClipSpacePosition : POSITION;
float2 TexCoords : TEXCOORD0;
float3 WorldSpaceNormal : TEXCOORD1;
float3 WorldSpaceViewVector : TEXCOORD2;
};

As before, the vertex shader will output the clip space position, texture coordinates, and a vertex normal
to be interpolated over the triangle during rasterization. This time around, we have an additional output
parameter that will be used to carry a world space vertex-to-camera ('view') vector so that we can
perform specular lighting computations in the pixel shader.

What other non-uniform parameters will our shader need in order to carry out its task? Well, as before,
we will need the application to set the light direction and light diffuse color for the lighting calculations
that will happen in the pixel shader. Theoretically, we will also need a world matrix that will be used to
transform the vertex normal into world space (although this matrix may need to be generated on the fly
when skinning as we will see). You will recall from the previous example that this world space normal
is interpolated over the triangle to provide the pixel shader with a world space normal for lighting.

// Parameters
float3 LightDirection;
float4 LightDiffuseColor;
float4x4 WorldMatrix;
float3x3 WorldITMatrix; // Inverse transpose
float4x4 ViewProjectionMatrix;

texture BaseMap;

// Sampler State definitions
sampler BaseMapSampler = sampler_state
{
Texture = <BaseMap>;
MinFilter = Linear;
MagFilter = Linear;
MipFilter = Linear;
AddressU = wrap;
AddressV = wrap;
};

In our previous example, we had access to a combined world/view/projection matrix for transforming
the vertex position from model space to clip space. In this shader we will do things a bit differently,
using a two step process -- we will compute intermediate vertex positions in world space and then
transform them to clip space subsequently. Since we already have access to the world matrix (and in the
above example we are also adding a 3x3 version of its inverse-transpose counterpart), we now define a
separate combined view/projection matrix. This will be used to complete the transformation to clip
space after the vertex position has been calculated in world space.

What other parameters are we going to need the application to set? Well, we want to perform specular
lighting calculations in our shader now, so we will also want the application to supply us with the light
source specular color and the power to be used during the specular calculations. Also, as discussed a
moment ago, in order to compute the resulting specular color we'll need to know the relationship
between the vertices and the camera, so we will ask the application to provide us with the world space
camera position too. Then, in the vertex shader we will calculate the view vector by subtracting the vertex position from the camera position. This view vector will be output (per vertex) and interpolated to
provide our pixel shader with a view vector it can then use in its specular lighting calculations.

float4 LightSpecularColor;
float SpecularPower;
float3 CameraPos;

Finally, we need a place for the application to store the palette of bone matrices (or, more accurately, 'vertex blending' matrices) that the vertex shader will index into based on the blend indices provided at
each vertex, so we need to define a matrix array:

float4x4 BlendMatrices[ MAX_BONES ];
int BlendingInfluences;

Notice that, in addition to the matrix palette, we also define a parameter named BlendingInfluences that allows the application to tell the shader the maximum number of matrices that can possibly influence the vertices of the primitives being rendered. Having access to this value not only
allows us to optimize our shaders to perform the minimum amount of work necessary at each vertex as
we will see in a moment, but it will also allow our shader code to function in cases where we might want
to use the same shader functions for both static and skinned meshes simultaneously. For example, in the
lab projects accompanying this chapter, BlendingInfluences will contain a value of 0 if the geometry
being rendered represents a regular 'static' mesh, or a value ranging from 1 through 4 if the geometry
being rendered is part of a skinned mesh. In the former, regular mesh, case we can simply transform the
vertex using the object level world transform instead of the blending matrices. However, if we are
processing a skin, then the vertex being rendered may contain 1, 2, 3, or 4 bone influences (indexes into
the matrix palette) that we need to account for.

It would be a shame to execute a vertex shader that performs four matrix blending calculations if the
current vertex we are transforming only uses one or two. This is inefficient and we'd like to save cycles
wherever we can. To address this issue, we will adopt an approach -- similar to the case we examined
earlier when processing up to eight light sources per vertex -- that uses multiple compiled permutations
of shaders based on uniform function inputs. Although we will only write a single vertex shader
function, we will compile it five times, each with a different value supplied to that uniform parameter.
This will generate five unique shaders that we will store in an array. The technique will then select the
appropriate shader based on the value contained in the BlendingInfluences variable. If this sounds a bit
confusing at the moment, it will make sense when we review the technique definitions at the bottom of
the effect file. For now, let’s look at the HLSL source code to our vertex shader function one section at a
time.

The shader takes two parameters. The first is the uniform input parameter, supplied when compiling the
array of vertex shaders, that tells the HLSL compiler the total number of blending matrices (if any) that
this permutation of the shader should use to transform each vertex (0 through 4). The end result is going
to be five unique vertex shaders in our array, each one tailored for a specific number of influencing
matrices. Our runtime technique will index into that shader array to execute the shader appropriate to the
current batch of polygons being rendered based on the value in the BlendingInfluences parameter. The
second function parameter is our new vertex input structure defining the data in the vertex stream.

VertexShaderOutput ProcessVertex( uniform int NumBlendingInfluences,
                                  VertexShaderInput In )
{
// Zero out our output structure, just in case we don't intend to
// fill out every value.
VertexShaderOutput Out = (VertexShaderOutput)0;

// Since we are supporting skinning, we'll need to do a bit of matrix
// blending so that our position and normal can be properly transformed
// to world space.
float4x4 WorldTransform;
float3x3 WorldITTransform;
GetWorldTransform( NumBlendingInfluences, In.BlendWeights,
                   In.BlendIndices, WorldTransform, WorldITTransform );

Inside the vertex shader we declare two local matrix variables. The first is a regular 4x4 matrix that will
be used to store the computed world matrix, while the second is a 3x3 matrix that will contain the
inverse-transpose version of the same matrix used for the transformation of normals should it be
necessary. How this world matrix is calculated depends on whether this shader permutation is required
to perform vertex blending (skinning) or not. If the NumBlendingInfluences parameter is 0 then it means
that we wish to compile this shader for regular, non-skinned meshes and transform our vertex data by
the standard world transformation matrices supplied by the application. If the number of influences is
non-zero, then we of course need to do some additional work to compute final blended matrices that we
can use for transformation in the same way.

In this case, the GetWorldTransform function shown in the above vertex shader example performs this
decision making and any subsequent work involved on our behalf. As such we pass the
NumBlendingInfluences value, and the blending weights and indices read from the vertex stream and
expect to simply get back two fully computed matrices. It is important to note that the
GetWorldTransform function called in the above shader is not an intrinsic function or any part of the
HLSL; it is a utility function that we will write ourselves shortly. This is a great example of how we can
break complex tasks down into multiple smaller functions in HLSL, which can then be called from the
main shader function (just as we do all the time when we write our C++ code). The GetWorldTransform
function will be examined in a moment, but before we look at the rest of the top-level shader function,
let’s just discuss what it does at a high level.

When skinning, we have a single vertex which is going to be transformed by multiple matrices. If we
use an example in which the vertices are influenced by four matrices, we could transform the original
model space position with each of the four matrices individually to create four unique world space
positions. Using a linear combination, these four positions could then be scaled by their corresponding
weight provided along with the vertex data (all weights add up to 1) and added, where the resulting sum
is our final skinned position. This is a perfectly acceptable approach, but in this shader we are going to
go about it in a slightly different way. Instead of transforming the vertex by each of the four matrices,
and then scaling and adding the four results, we will scale the blending matrices directly and combine them
into a single world matrix. When we transform our model space position by this single combined matrix,
it will be the equivalent of doing the four separate vertex transformations and combining them
afterwards. Overall, this will keep the instruction count down and tend to be more efficient, particularly
when we also have to transform other components besides the position (e.g., normals, tangent space
vectors, etc.).

So the GetWorldTransform function is supplied the influence count and the blending weights and
indices stored in the vertex. It uses the weights to scale the corresponding matrices in the palette,
indexed by the vertex, and returns the combined world matrix and its inverse-transpose counterpart. In
the above code, you can see that whether we are using skinning or not, we ultimately expect to end up
with a world matrix in our local WorldTransform variable which can then be used to transform the
model space vertex position to a world space position. In addition, its inverse-transpose counterpart will
be stored in WorldITTransform that can be used to transform the normal into world space as well, as
demonstrated in the next section of the shader code below.

// Transform our position to world space using our properly
// generated world matrix.
float3 WorldPosition = mul( float4( In.Position, 1 ), WorldTransform ).xyz;

// We'll also want our normal in world space. (Note that we don't cast
// to float4 as we don't want to translate our normal, only rotate it.)
Out.WorldSpaceNormal = mul( In.Normal, WorldITTransform );

Once again, while it is not always necessary to do so, remember that for correct results, the normal
should be transformed by the inverse-transpose of the world matrix rather than the original. The inverse-
transpose matrix tends to pose more of a problem when skinning because we are dealing with multiple
matrices (up to four world matrices and thus eight total) per vertex. Ultimately, there are three
commonly employed solutions to this problem.

The first is to disallow non-uniform scales (the source of the problem) in your world matrices, which is
the simplest solution, but also a potentially significant imposition on your artists, particularly when
dealing with animation where non-uniform scales are commonplace.

The second option is to pre-compute the matrices on the CPU and upload them along with the world
matrices. Since this means consuming twice as many registers, you wind up effectively cutting your
matrix palette size in half. The matrix blending function would also need to be adjusted to apply
weighting to both sets of matrices (effectively doubling the number of instructions). Finally, by cutting
the effective size of the bone palette in half, you are going to wind up with more subsets per mesh,
requiring more draw calls, so there will be a potential performance penalty here as well.

The third approach (the one we will use in our examples) would be to compute the inverse transpose of
the matrix generated during matrix blending directly in the shader. While the transpose is not a problem,
there is no HLSL intrinsic that is available to compute the inverse of a matrix. This means having to
write a utility function to do the job yourself. While computing the inverse is not going to be terribly
expensive, particularly on modern hardware, there will still be some overhead associated with doing so.

Which of these methods (if any) you choose is up to you. We've used all three successfully in various
projects, although the lighting artifacts that can come with opting to use the original, unmodified matrix
can often be quite unpleasant. For the lab project code, we will opt for either the second or the third
option. If you are targeting older hardware, option 2 may give the best performance. If your models only
have 6 or 7 bones per attribute and can fit within a single palette, even with their IT matrices, it is a very
good choice. On modern hardware with very fast GPUs, option 3 is a nice choice and usually our
preference, but it can be harder to justify on older cards. Computing a matrix inverse per vertex, which
generally amounts to thousands per model, can have an impact. This gets compounded to tens or even
hundreds of thousands of inverse computations per frame if features that require multiple renderings of
the skin are employed (very common). With the CPU pre-computation approach (option 2), if a model
uses 35 bones, we can compute the 35 IT matrices once and reuse them over all vertices. Indeed, we
could even cache those results for use over multiple render calls if desired. The cost of uploading that
many constants to the hardware can, however, be prohibitive in many cases too. As mentioned before,
benchmarking is key, so just bear these pointers in mind.
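
As a point of reference for option 2, the CPU-side pre-computation is straightforward with D3DX. The sketch below is hypothetical (the function and array names are ours, not taken from the lab projects) and simply builds an inverse-transpose partner for each bone matrix before both sets are uploaded to the effect:

// Hypothetical CPU-side pre-computation for option 2 (names are placeholders).
// Builds an inverse-transpose partner for each bone's world matrix so both
// arrays can be uploaded to the effect as constants.
void ComputeBoneITMatrices( const D3DXMATRIX * pBoneMatrices,
                            D3DXMATRIX * pBoneITMatrices,
                            DWORD BoneCount )
{
    for ( DWORD i = 0; i < BoneCount; ++i )
    {
        D3DXMATRIX mtxInverse;

        // Invert the bone's world matrix (determinant output not required).
        D3DXMatrixInverse( &mtxInverse, NULL, &pBoneMatrices[ i ] );

        // Transpose the inverse to get the inverse-transpose used for normals.
        D3DXMatrixTranspose( &pBoneITMatrices[ i ], &mtxInverse );
    }
}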

In order for our pixel shader to do specular lighting calculations, it will need a view vector calculated
between the camera and the sample point. Although the sample point is technically going to be the pixel,
we don’t have access to the world space position of the pixel in the vertex shader, so what we will do
instead is calculate the vertex-to-camera vector and store the result in a set of output texture coordinates. During rasterization, this vector will be interpolated to generate a per-pixel view vector with which to
perform our specular calculations.

// Compute our view vector (for specular lighting in the pixel shader)
// between the world space camera position (supplied by the application
// read from a constant) and the world space vertex position (computed
// earlier in this shader).
Out.WorldSpaceViewVector = CameraPos - WorldPosition;

Since our vertex position is currently in world space, we must transform it by our combined
view/projection matrix to transform it into clip space for output to the rest of the pipeline. Finally, we
copy over the texture coordinates from the vertex stream into the output parameters and we are done.
Below is the remainder of the vertex shader code.

// Transform world space vertex position to a clip space position.
Out.ClipSpacePosition = mul( float4( WorldPosition, 1 ),
ViewProjectionMatrix );

// Just copy the texture coordinates since we don't need to do any
// transformation in this example.
Out.TexCoords = In.TexCoords;

// Send our data to the interpolator.
return Out;
}

We still have to look at the GetWorldTransform function that manages the core of our skinning matrix
generation. The code is shown below and should require very little explanation. The function just
fetches the bone matrices referenced by the current vertex, scales them by their assigned weights, and
then adds them together into a combined final matrix.

void GetWorldTransform( int NumInfluences, float3 BlendWeights, float4 Indices,
                        out float4x4 m, out float3x3 mIT )
{
    // Convert indices to integer representation.
    int BlendIndices[4] = (int[4])Indices;

    // Blending required?
    if ( NumInfluences == 0 )
    {
        // Just use the object level matrices as they were provided.
        m   = WorldMatrix;
        mIT = WorldITMatrix;

    } // End if not blended
    else
    {
        if ( NumInfluences == 1 )
        {
            m = BlendMatrices[ BlendIndices[ 0 ] ];

        } // End if 1 matrix
        else if ( NumInfluences == 2 )
        {
            BlendWeights.y = 1.0 - BlendWeights.x;
            m = BlendWeights.x * BlendMatrices[ BlendIndices[ 0 ] ];
            m = m + BlendWeights.y * BlendMatrices[ BlendIndices[ 1 ] ];

        } // End if 2 matrices
        else if ( NumInfluences == 3 )
        {
            BlendWeights.z = 1.0 - (BlendWeights.x + BlendWeights.y);
            m = BlendWeights.x * BlendMatrices[ BlendIndices[ 0 ] ];
            m = m + BlendWeights.y * BlendMatrices[ BlendIndices[ 1 ] ];
            m = m + BlendWeights.z * BlendMatrices[ BlendIndices[ 2 ] ];

        } // End if 3 matrices
        else if ( NumInfluences == 4 )
        {
            float w = 1.0f - (BlendWeights.x + BlendWeights.y + BlendWeights.z);
            m = BlendWeights.x * BlendMatrices[ BlendIndices[ 0 ] ];
            m = m + BlendWeights.y * BlendMatrices[ BlendIndices[ 1 ] ];
            m = m + BlendWeights.z * BlendMatrices[ BlendIndices[ 2 ] ];
            m = m + w * BlendMatrices[ BlendIndices[ 3 ] ];

        } // End if 4 matrices

        // Compute inverse transpose matrix (from the upper 3x3 portion only).
        float3 v0x1     = cross( (float3)m[0], (float3)m[1] );
        float  recipDet = 1.0f / dot( v0x1, (float3)m[2] );
        mIT = float3x3( cross( (float3)m[1], (float3)m[2] ),
                        cross( (float3)m[2], (float3)m[0] ),
                        v0x1 ) * recipDet;

    } // End if blended
}

Note: Even when calling a separate function from our top level shader as is the case here, when
supplying a uniform input integer value as we are here, conditional statements like the one above work
more like pre-processor directives during compilation. Depending on the path chosen, a completely
different shader will be compiled. So, when this function is called with a NumInfluences of 0, the skinning
code will be completely ignored by the HLSL compiler and will not be part of the final compiled shader.
That is why specifying different values for the NumBlendingInfluences uniform (in our top level shader
function) each time we compile it will generate shaders specifically suited to each situation.
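
Although the full technique block appears later in the effect file, a hedged preview of the arrangement being described might look like the following sketch. The array and technique names here are placeholders of our own and not necessarily those used in the lab projects:

// Hypothetical preview of the shader-array / technique arrangement.
// One permutation of ProcessVertex is compiled for each possible number
// of blending influences (0 through 4).
VertexShader VSPermutations[ 5 ] =
{
    compile vs_2_0 ProcessVertex( 0 ),
    compile vs_2_0 ProcessVertex( 1 ),
    compile vs_2_0 ProcessVertex( 2 ),
    compile vs_2_0 ProcessVertex( 3 ),
    compile vs_2_0 ProcessVertex( 4 )
};

technique SkinnedDiffuseSpecular
{
    pass P0
    {
        // Select the permutation that matches the current batch.
        VertexShader = ( VSPermutations[ BlendingInfluences ] );
        PixelShader  = compile ps_2_0 ProcessPixel();
    }
}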

As we learned in Module II, we actually compute the nth blend weight on the fly, so for the four matrix
case, we are only given three weights and we dynamically generate the fourth. That is what the first line
of each conditional case is doing. Obviously in the single matrix case there is no need to do this since
that single matrix represents the full influence on the vertex.

Our new pixel shader is very similar to the previous pixel shader but with some added code to handle the
specular lighting calculation. Notice again that we can use the vertex shader output structure as the input
for the pixel shader since it contains members that represent all of the information we need (the texture
coordinates, the interpolated view vector, and the interpolated pixel normal), just so long as we don’t try
to access any of the members of this structure that have no context within the pixel shader (e.g., clip
space position). The entire pixel shader is shown below with a small walkthrough afterwards.

float4 ProcessPixel( VertexShaderOutput In ) : COLOR
{
    // Renormalize the input normal and view vector as they get
    // unnormalized during interpolation
    float3 WorldSpaceNormal     = normalize( In.WorldSpaceNormal );
    float3 WorldSpaceViewVector = normalize( In.WorldSpaceViewVector );

    // Compute a diffuse color.
    float4 DiffuseColor = LightDiffuseColor *
                          saturate( dot( WorldSpaceNormal, LightDirection ) );

    // Compute a specular color. We'll use the Phong specular
    // model for illustration.
    float3 ReflectionVector = reflect( -LightDirection, WorldSpaceNormal );
    float4 SpecularColor    = LightSpecularColor *
                              pow( saturate( dot( WorldSpaceViewVector,
                                                  ReflectionVector ) ),
                                   SpecularPower );

    // Modulate with base texture color and return result.
    return tex2D( BaseMapSampler, In.TexCoords ) *
           DiffuseColor + SpecularColor;
}

As discussed, whenever normalized vectors (such as our normal or view vectors) are interpolated during
rasterization, they will likely not be unit length when they reach the pixel shader. Thus, the first thing we
do in the above shader code is renormalize both the interpolated pixel normal and view vector (using the
normalize intrinsic). We then calculate the diffuse color exactly as we did before, using the standard
clamped cosine calculation. Next we calculate the reflection vector by negating the light direction vector and perform the standard Phong calculation (again clamping the dot product so that it cannot go negative before raising it to the specular power) to generate the specular color of the pixel.

Finally, with our specular and diffuse colors calculated, we sample the base map and modulate it with
the diffuse lighting color before adding on the specular lighting results. This is then returned as the final
color for our pixel.

You might be wondering why we did not include any material reflectance values in our lighting
calculations above. While we absolutely could have done so, just by passing them along from the
application as additional shader constants and then including them in our modulation (and we do just
this in the accompanying lab projects), we decided to save two multiplications here simply by asking the
application to do so once on the CPU by pre-modulating the light colors that we input to the pixel shader.
Little tricks like this can help keep performance up, which is of particular importance as we start doing
more and more work per pixel. It also means that we have a couple of extra instruction slots available if
we need them to add new features in older shader models.
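
A hypothetical fragment of what that CPU-side pre-modulation might look like is shown below. It assumes an ID3DXEffect pointer and D3DXCOLOR values for the light and material; the names are placeholders:

// Hypothetical application-side pre-modulation of light and material colors.
// The shader then receives the already-combined value in LightDiffuseColor.
D3DXCOLOR FinalDiffuse( LightDiffuse.r * MaterialDiffuse.r,
                        LightDiffuse.g * MaterialDiffuse.g,
                        LightDiffuse.b * MaterialDiffuse.b,
                        LightDiffuse.a * MaterialDiffuse.a );

pEffect->SetVector( "LightDiffuseColor", (D3DXVECTOR4*)&FinalDiffuse );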

As an additional example, if you wanted to simulate modulate 2X to brighten things up, you could
tack on an extra multiply like so:

return tex2D( BaseMapSampler, In.TexCoords )
       * DiffuseColor * 2.0f + SpecularColor;

However, since the 2.0f and the light and material colors are all constant during the lifetime of the
shader, there is technically no reason to waste time doing this work per pixel when such values can be
pre-computed on the CPU side. We could just multiply our LightDiffuseColor by 2.0 before passing it
along to the shader and we’d get the same result. One multiplication on the CPU is certainly going to be
better than the millions that will occur as we visit every pixel, but there is a downside to moving operations like this over to the application -- namely, it makes the shader code a bit misleading to study because the visual results on screen no longer seem to match what we would expect based on the code alone. Worse, some other developer might come along and think that the parameter values they are working with are the proper originals, when in fact you've manipulated them on the CPU for optimization purposes and they no longer reflect the true values.
very good idea, but you need to be aware of the dangers of doing so before you jump right in. A
preferable approach would be to actually leave all of the math in the HLSL code so that it is clear what
is happening, but to have some application side process come along that recognizes the math that can be
done on the CPU and do it automatically before the shaders are triggered. Thankfully, the DirectX Effect
framework assists us in this regard once more:

The Pre-Shader

When you can eliminate an instruction through the use of a pre-process, that is generally going to be a
good thing. Fortunately for us, the shader compiler actually has built-in functionality that looks for
instructions that can be computed on the CPU and then reused in the shaders over multiple executions.
Used in conjunction with the D3DX effect system (which does the CPU side math), this is referred to as
pre-shading and we will see many examples of this as we progress.

Just to illustrate the point, let’s consider our modulate 2X example above. The shader compiler will
recognize that the multiplication could be done once on the CPU in advance and will take steps to ensure
that this is so. It is worth bearing in mind that the compiler which extracts pre-shader qualified code is
very smart, but not necessarily perfect, so it is wise to verify the results by checking the assembly
output. To be safe, you can try to help it more easily recognize these cases where possible. For example,
we could add the following line to our shader code:

. . .

// Double the diffuse light intensity
LightDiffuseColor *= 2.0f;

// Compute a diffuse color.
float4 DiffuseColor = LightDiffuseColor *
                      saturate( dot( WorldSpaceNormal, LightDirection ) );

. . .

LightDiffuseColor is a value that will not change during the lifetime of the execution of the shader, even
if we scale it by two. The effect compiler will catch this case with ease and ultimately the effect pre-
shader system will compute the adjusted value for LightDiffuseColor on the CPU and just set it as a
constant before the shader is executed. This means we will not need to pay the price for the
multiplication when the shader is running, saving a cycle and freeing up an instruction slot. It also
means that our code clearly communicates its intentions to the reader. Note that it doesn't matter that the
light color might change; as long as it can't change during the lifetime of the shader (which is true in this
case), we can rely on the pre-shader to handle this bit of code for us.

If you are using a tool like NVIDIA PerfHUD™ (which you most definitely should be doing if you're
using NVIDIA hardware!) you can actually examine your shader code and look at the assembly that is
generated post-compile. Among the items you will see listed (in their own separate little section) are any
instructions like these that the pre-shader was able to take care of. If for some reason
you do not see your instructions there when you know you should, you can either try adjusting the code
to make the case more obvious as we did above, or just do the calculations yourself on the CPU (bearing
in mind the dangers mentioned earlier). Although it is rare for the pre-shader to miss such cases, it never
hurts to confirm that your shader is doing exactly what you think it should. Again, this is a benefit of
working with tools like NVIDIA PerfHUD™ and having at least a rudimentary comfort level with
assembly language shader code.
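If you prefer to work from the command line instead, the standalone fxc compiler that ships with the DirectX SDK can dump the generated code listing to a text file for the same kind of inspection (the file names below are illustrative):

fxc.exe /T fx_2_0 /Fc MyEffect.asm MyEffect.fx

You can then search the listing for the instructions you expect the compiler to have moved out of the per-pixel path.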

For completeness, the entire effect file discussed in this section is shown below.

// Parameters
float4x4 WorldMatrix;
float3x3 WorldITMatrix;        // Inverse transpose
float4x4 ViewProjectionMatrix;
float3   CameraPos;

float3   LightDirection;
float4   LightDiffuseColor;
float4   LightSpecularColor;
float    SpecularPower;

// MAX_BONES is assumed to be #defined by the application when the effect is compiled.
float4x4 BlendMatrices[ MAX_BONES ];

int      BlendingInfluences;

texture  BaseMap;

// Sampler State definitions
sampler BaseMapSampler = sampler_state
{
    Texture   = <BaseMap>;
    MinFilter = Linear;
    MagFilter = Linear;
    MipFilter = Linear;
    AddressU  = wrap;
    AddressV  = wrap;
};

// Input / Output Structures
struct VertexShaderInput
{
    float3 Position      : POSITION;
    float3 Normal        : NORMAL;
    float2 TexCoords     : TEXCOORD0;
    float3 BlendWeights  : BLENDWEIGHT;
    float4 BlendIndices  : BLENDINDICES;
};

struct VertexShaderOutput
{
    float4 ClipSpacePosition    : POSITION;
    float2 TexCoords            : TEXCOORD0;
    float3 WorldSpaceNormal     : TEXCOORD1;
    float3 WorldSpaceViewVector : TEXCOORD2;
};

// Forward declaration of the utility function defined later in this file
// (HLSL requires a function to be declared before it is called).
void GetWorldTransform( int NumInfluences, float3 BlendWeights, float4 Indices,
                        out float4x4 m, out float3x3 mIT );

//////////////////////////////////////
// Vertex Shader
//////////////////////////////////////
VertexShaderOutput ProcessVertex( uniform int NumBlendingInfluences,
                                  VertexShaderInput In )
{
    // Zero out our output structure, just in case we don't intend to
    // fill out every value.
    VertexShaderOutput Out = (VertexShaderOutput)0;

    // Since we are supporting skinning, we'll need to do a bit of matrix
    // blending so that our position and normal can be properly transformed
    // to world space.
    float4x4 WorldTransform;
    float3x3 WorldITTransform;
    GetWorldTransform( NumBlendingInfluences, In.BlendWeights,
                       In.BlendIndices, WorldTransform, WorldITTransform );

    // Transform our position to world space using our properly
    // generated world matrix.
    float3 WorldPosition = mul( float4( In.Position, 1 ), WorldTransform ).xyz;

    // We'll also want our normal in world space. (Note that we don't cast
    // to float4 as we don't want to translate our normal, only rotate it.)
    Out.WorldSpaceNormal = mul( In.Normal, WorldITTransform );

    // Compute our view vector (for specular lighting in the pixel shader).
    Out.WorldSpaceViewVector = CameraPos - WorldPosition;

    // Transform world space vertex position to a clip space position.
    Out.ClipSpacePosition = mul( float4( WorldPosition, 1 ),
                                 ViewProjectionMatrix );

    // Just copy the texture coordinates since we don't need to do any
    // transformation in this example.
    Out.TexCoords = In.TexCoords;

    // Send our data to the interpolator.
    return Out;
}

// Utility function called by the vertex shader
void GetWorldTransform( int NumInfluences, float3 BlendWeights, float4 Indices,
                        out float4x4 m, out float3x3 mIT )
{
// Convert indices to integer representation.
int BlendIndices[4] = (int[4])Indices;

// Blending required?
if ( NumInfluences == 0 )
{
// Just use the object level matrices as they were provided.
m = WorldMatrix;
mIT = WorldITMatrix;

} // End if not blended
else
{
if ( NumInfluences == 1 )
{
m = BlendMatrices[ BlendIndices[ 0 ] ];

} // End if 1 matrix
else if ( NumInfluences == 2 )
{
BlendWeights.y = 1.0 - BlendWeights.x;
m = BlendWeights.x * BlendMatrices[ BlendIndices[ 0 ] ];
m = m + BlendWeights.y * BlendMatrices[ BlendIndices[ 1 ] ];

} // End if 2 matrices
else if ( NumInfluences == 3 )
{
BlendWeights.z = 1.0 - (BlendWeights.x + BlendWeights.y);
m = BlendWeights.x * BlendMatrices[ BlendIndices[ 0 ] ];
m = m + BlendWeights.y * BlendMatrices[ BlendIndices[ 1 ] ];
m = m + BlendWeights.z * BlendMatrices[ BlendIndices[ 2 ] ];

} // End if 3 matrices
else if ( NumInfluences == 4 )
{
float w = 1.0f - (BlendWeights.x + BlendWeights.y + BlendWeights.z);
m = BlendWeights.x * BlendMatrices[ BlendIndices[ 0 ] ];
m = m + BlendWeights.y * BlendMatrices[ BlendIndices[ 1 ] ];
m = m + BlendWeights.z * BlendMatrices[ BlendIndices[ 2 ] ];
m = m + w * BlendMatrices[ BlendIndices[ 3 ] ];

} // End if 4 matrices

// Compute inverse transpose matrix.
float3 v0x1 = cross( (float3)m[0], (float3)m[1] );
float recipDet = 1.0f / dot( v0x1, (float3)m[2] );
mIT = float3x3( cross( (float3)m[1], (float3)m[2] ),
                cross( (float3)m[2], (float3)m[0] ), v0x1 ) * recipDet;

} // End if blended
}

//////////////////////////////////////
// Pixel Shader
//////////////////////////////////////
float4 ProcessPixel( VertexShaderOutput In ) : COLOR0
{
    // Renormalize the input normal and view vectors as they get
    // unnormalized during interpolation
    float3 WorldSpaceNormal     = normalize( In.WorldSpaceNormal );
    float3 WorldSpaceViewVector = normalize( In.WorldSpaceViewVector );

    // Compute a diffuse color.
    float4 DiffuseColor = LightDiffuseColor *
                          saturate( dot( WorldSpaceNormal, LightDirection ) );

    // Compute a specular color. We'll use the Phong specular
    // model for illustration.
    float3 ReflectionVector = reflect( -LightDirection, WorldSpaceNormal );
    float4 SpecularColor    = LightSpecularColor *
                              pow( saturate( dot( WorldSpaceViewVector, ReflectionVector ) ),
                                   SpecularPower );

    // Modulate with base texture color and return result.
    return tex2D( BaseMapSampler, In.TexCoords ) * DiffuseColor + SpecularColor;
}

//////////////////////////////////////
// Shader Indexing Arrays
//////////////////////////////////////
VertexShader vsArray[5] = {
compile vs_2_0 ProcessVertex(0), // BlendingInfluences = 0
compile vs_2_0 ProcessVertex(1), // BlendingInfluences = 1
compile vs_2_0 ProcessVertex(2), // BlendingInfluences = 2
compile vs_2_0 ProcessVertex(3), // BlendingInfluences = 3
compile vs_2_0 ProcessVertex(4) // BlendingInfluences = 4
};

//////////////////////////////////////
// Techniques
//////////////////////////////////////
Technique BranchRender
{
pass p0
{
CullMode = CCW;
ZEnable = true;
ZWriteEnable = true;
AlphaBlendEnable = false;
AlphaTestEnable = false;

VertexShader = <vsArray[ BlendingInfluences ]>;
PixelShader  = compile ps_2_0 ProcessPixel();
}
}

Notice in the above code (just above the technique definition) that we compile five different versions of
our vertex shader with different values for the single uniform input parameter, and then index into this
array inside the technique to select the best shader for the job. This indexing is based on the value the
application has stored in the BlendingInfluences parameter.
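On the application side, feeding that parameter and invoking the technique is straightforward. Below is a minimal sketch, assuming pEffect is a valid ID3DXEffect, NumInfluences is the influence count the application has computed for this subset, and the mesh data is otherwise ready to draw:

// Tell the effect how many blending influences this subset uses (0 - 4); the
// technique then indexes into vsArray with this value to pick the right shader.
pEffect->SetInt( "BlendingInfluences", NumInfluences );

UINT PassCount = 0;
pEffect->SetTechnique( "BranchRender" );
pEffect->Begin( &PassCount, 0 );
pEffect->BeginPass( 0 );

// ... draw the skinned mesh subset here ...

pEffect->EndPass();
pEffect->End();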

19.7.5 Introducing Normal Mapping

Although 3D technology is progressing at an astounding rate, DirectX 9 class cards are just not fast
enough to model pixel-level detail at the vertex level. Back in Module I we discussed how texture
mapping allows us to add color detail to a polygon in order to perceive it as having much more
geometric detail than it actually does. If we consider a quad with a 256x256 texture mapped to it, it
simulates what the quad would look like if it was tessellated into 65,536 separate little quads, each with
their own unique color. While texture mapping solves this problem very efficiently by painting this color
detail onto a low-polygon set, those polygons are still perceived to have an extremely flat topology
because we have not adjusted lighting to account for this new simulated surface roughness.

In real life, the vast majority of surfaces we encounter exhibit a degree of roughness. If we examine the
side profile of such a surface we can clearly see that the topology (perhaps at the microscopic level) is
quite uneven (Figure 19.19). The only way we could fix such a problem with our current objects is to
model this detail with polygons, which would be prohibitively expensive on our target hardware.

Yet, when playing a computer game, we rarely get close enough to objects for any great period of time to
examine them at this level. We are usually too busy firing at monsters or driving over terrain to stop
extremely close to a polygon and examine its side profile. Thus, we might imagine that we would not
notice the lack of surface roughness, but actually, even when some distance from our game objects, we
can still tell that they are completely smooth and flat. Why is this?

Figure 19.19

Stepping back into the real world for a moment, imagine standing in front of, but several feet away from,
an interior wooden door that has four recessed decorative panels. Even though you are looking at the
door front on, and thus cannot possibly see its side profile, you can still immediately tell that the
decorative panels are recessed into the wood. This is because the way that light reacts with a surface
plays a big factor in how we perceive it. Certain bumps on the surface are facing away from the light
source and thus receive less light, and this is the case with the recessed panel edges in the door. As a
light source moves across the surface, the bumps on the surface are evident as the light seems to shift in
and out of the grooves, scratches, and depressions. If we examine our example bumpy surface again but
this time draw a representation of the normals of each facet (Figure 19.20), we can see that even
standard diffuse calculations would result in very different lighting results at each location were we to
utilize these normals.

Figure 19.20

To be sure, we are not contemplating anything new here so far. We have been lighting our objects
dynamically since Module I; Figure 19.20 could very well be the side profile of a common terrain. In
fact, way back in the very first un-textured terrain demo of Module I, we generated normals for our
terrain faces and used them with the Direct3D pipeline to dynamically light our terrain. Recall that we
animated those light sources and watched as the lights passed over the bumpy terrain. As they did so, the
color values computed changed and shifted from light to dark due to changes in the light direction vector
with respect to each vertex and its normal.

Of course, in the case of the terrain geometry, it actually was bumpy. However, this doesn’t change the
fact that if we looked down on our terrain from a very high vantage point, we couldn’t possibly see all of
the bumps in the geometry. Yet, the way the light sources interacted with the normals of the terrain
facets still conveyed a bumpy surface to our eyes. If we had enough processing power to model every
quad of every object in our game world as a heightmap, we would be all set. However, we simply don’t
have that kind of processing power to play with when working with DirectX 9 level hardware.

The key point is that when a surface is bumpy, it is the way the light reacts with the bumps on the
surface that allows us to perceive it as such. Even if the example geometry shown in Figure 19.21 was a
completely flat surface in our game engine, it could still be made to look bumpy, most of the time, if we
had a way to represent the various different normals where each facet exists.

In Figure 19.21, our terrain-like set of facets has been replaced by a single flat quad. However, we have
maintained the normal information for those facets for use in lighting calculations. As long as we were
not at an angle such that we could see the quad from its side profile, it would look bumpy to us because
light would interact with the normals in the exact same way as if the bumps on the surface really existed.
So, what we need is a way to store more detailed normal information for individual faces, without
putting more strain on the vertex processing pipeline.

Figure 19.21

We already know how to use texture maps to model a much higher level of surface color detail without
having to tessellate it, so what we need to do here is very similar. We need a way to specify detailed
normals for a single face so that they can be used for lighting calculations within the pixel shader. In our
prior examples, we calculated our lighting normals in the vertex shader and simply interpolated them to
provide a per-pixel normal for the pixel shader. However, vertex normals are smoothly interpolated over
the surface and provide no ability to model bumps or fluctuations on the surface of an individual
primitive. What we need is a way to supply the pixel shader with specific normals for each pixel, so that
we can model the normals of a bumpy surface accurately within the pixel shader.

As clearly implied, texture mapping will come to the rescue here and provide us with a container to store
per-pixel normals. Just as we used textures to store color (diffuse maps) and other details (detail maps),
we can also create a texture called a normal map for the storage of normals. In this case, each texel in the
normal map will store a 3D normal vector which can be sampled by the pixel shader using a standard
texture read operation. This normal can then be used in our lighting calculations. Textures provide us
with the means to specify very detailed normal information for a single primitive and thus model some
incredibly realistic looking environments.

This idea, generally called normal mapping (although sometimes loosely referred to as bump mapping),
seems a little strange at first since it is not a technique that introduces actual surface features into our
geometry. Instead, it is a lighting trick that fools the eye into seeing a bumpy surface where no physical
bumps exist (i.e., geometrically). Nevertheless, it is very effective, despite it being a reasonably simple
texturing technique as Figure 19.22 demonstrates.

Figure 19.22

Figure 19.22 shows two objects rendered with normal maps applied. We can see immediately that the
polygons look incredibly different from any we have previously rendered in this series. The objects do
not even have very high detail base maps, but because they have normal maps applied, whose normals
are being used to model per-pixel lighting, the grout of the brick wall really does look recessed
and the monster’s skin looks quite scaly and bumpy. For comparison, Figure 19.23 shows the same
monster and a few sections of wall being diffusely lit by our previous pixel shader (without base
textures). Figure 19.24 shows the same scene with a new pixel shader that supports normal mapping.

Figure 19.23 Figure 19.24

Although per-pixel lighting calculations are being performed in Figure 19.23, all the pixel shader has to
work with are interpolated vertex normals. These normals, output by the vertex shader, are interpolated
smoothly over the surface to generate a per-pixel normal and as such, all we can hope to achieve is smooth Gouraud
shading. Although it may be hard to believe, Figure 19.24 shows the exact same geometry being
rendered, only this time with normal maps applied. The image speaks for itself (keep in mind that the
wall and floor panels are literally just single quads).

In Figure 19.24 we are supplying the pixel shader with a normal map for each primitive we render.
Inside the pixel shader, the base map’s UV coordinates are used to sample a color from a normal map.
The pixel shader then converts this color into a normal (with each component in the [-1, 1] range) which
is then dotted with the light direction vector to generate a diffuse color using the standard Lambertian
calculation. The same normal vector is also used to add a bit of specular lighting as well. Keep in mind
that the individual bricks that we can see on the walls do not actually exist geometrically; they are
simply tricks of the light that cast certain pixels into shade and others into brightness. And best of all,
although this technique adds an almost unbelievable amount of additional perceived surface detail, it is
extremely efficient. We've simply replaced the interpolated vertex normal vector with one read from a
texture -- the math is still exactly the same.

Figures 19.25 and 19.26 show the same scene as above but with the base maps applied this time. We can
see in Figure 19.25 how the geometry itself, even with the base map applied, is pretty low detail. It
actually looks a little like commercial games used to look about a decade ago. In Figure 19.26 we
add a couple of normal maps into the mix and suddenly we see visuals that start to resemble today’s
commercial games.

Scene with Normal and Diffuse Maps

Figure 19.25 : Without Normal Mapping Figure 19.26 : With Normal Mapping

It is worth noting that you may see the terms ‘bump mapping’ and ‘normal mapping’ used to refer to the
same thing, mostly because the purpose of these techniques is to create the appearance of bumpy (or
more detailed) surfaces through the use of custom texture maps. However, bump mapping can more
accurately be thought of as a collective term which encompasses a number of ways to produce a similar
result. The two most popular in use today are called 'dU-dV Mapping' and 'Normal Mapping'. dU-dV
mapping is an approach to adding additional perceived detail (bumps) to a surface, where the texels of the
texture map store delta values used to perturb the interpolated normal from the vertices to generate a
unique normal for each pixel. Normal mapping is a more precise technique (and the one used in this
course) in which the texels of the map contain specifically encoded normals that are sampled and used
directly in the lighting calculations.

Creating Normal Maps

There are several ways to create normal maps using standard heightmaps as a starting point. You could
draw your heightmap in an art package (in grayscale) where white pixels represent the highest points on
the surface and black represent the lowest. If you are an Adobe Photoshop™ user, then a plug-in exists
(as part of the free NVIDIA Texture Tools package) that allows you to convert a grayscale heightmap
into a normal map and save the results. Alternatively, DirectX has a function that allows you to create a
normal map from a heightmap image loaded at runtime. This function is called
D3DXComputeNormalMap and we will see it in just a moment.

Regardless of whether you choose to generate your normal map from a heightmap at runtime
(D3DXComputeNormalMap) or prefer to hand author your heightmap and convert it into a normal map
in a third-party tool, the conversion process uses the heightmap to generate a normal for each pixel by
calculating the slopes between neighboring pixels (where white = high and black = low). The result is a
texture where each pixel contains an encoded normal. Unless you are using floating point textures
(pretty rare for this purpose) or textures with a signed integer pixel format, the x, y, and z components of
each normal are remapped from the [-1.0, 1.0] range into the unsigned [0, 1] range and stored in the
texel RGB components. There is nothing special about the image itself in this case; it could be just a
standard file like a .jpg, although .dds is normally preferred. It’s what the values of those colors mean to
us inside the pixel shader that is really important. Bear in mind that they are not colors per se, but are
instead three component <x,y,z> normal vectors encoded into the RGB components of each texel for
storage.

Quite often you will start the normal map creation process by using your diffuse map as a guide for the
source of your heightmap. For example, our brick wall has a diffuse texture as shown in Figure 19.27.

Figure 19.27 Figure 19.28

In order to create the heightmap for the above wall texture, we could simply load it into a paint package
and convert it to a gray scale image. We could then remove the detail within each brick (optional) so that
we get a situation where the grout is considered low (black) and the face of the individual bricks is
considered high. Additional detail (bumps, cracks, etc.) could be painted in as desired.

The heightmap in Figure 19.28 could then be converted into a normal map texture using the
D3DXComputeNormalMap function (as one possible approach) and the result would be a texture where
each texel contains a normal. We would then sample this texture (using the diffuse map's UV
coordinates most likely so that we get a 1:1 mapping between the features) inside the pixel shader to
fetch the normal for the current pixel being processed. After conversion back into the [-1,1] range, we
can use this normal in the diffuse and specular lighting calculations, resulting in the following rendered
results:

Figure 19.29

The rightmost image in Figure 19.29 really shows how the light reacts with the per-pixel normals to
create the illusion of a very detailed surface. However if you look towards the end of the wall in the
leftmost image, you can see that the wall is just a single quad and the grout is not really recessed at all.

Of course, normally you will want to be a bit more precise when creating your heightmaps as a source
for normal map generation. While converting the diffuse texture to grayscale provides a good starting
point (especially for less than artistic people, like many of us), you will probably need to make some
more detailed edits to get it to look just right in practice.

The creature shown in Figure 19.22 uses a single diffuse map in which the colors for
all of its various body parts are contained.
To create the height map for this creature,
the diffuse map was first converted to a
grayscale image. The resulting image then
underwent some adjustments to darken
areas that we wanted to appear more
recessed and lighten others that we
wanted to stand proud of the surrounding
surface. Still, this was not enough by
itself. While the various sections of the
monster’s body looked protruded or
recessed, the actual skin of the creature
did not look as bumpy as we wanted. As a
final step therefore, the heightmap was
then merged with a grayscale detail map
containing patterns that looked a bit like
dinosaur skin. The result was a nice noisy
heightmap (see Figure 19.30).

Figure 19.30

The finished height map was then
converted into a normal map using the D3DXComputeNormalMap function which utilized the height
information given for each texel, in conjunction with the height of neighboring texels, to generate the
final per-texel normals much as we did with our terrain heightmap and vertex normals back in Module I.
These normals are encoded into the unsigned [0,1] range and stored in the RGB components of the
texture just like standard color values. A final call to the D3DXSaveTextureToFile function saved the
created normal map out to disk.

Note: We opted to load in the height map and convert it into a normal map then save the resulting file
at development time. Alternatively, you could perform this conversion into a normal map for all of your
height maps at load time, but this will cause some delay at the start of your application. By converting it
into a normal map at development time and then saving the resulting image file, your runtime engine can
simply load the texture and immediately use it without further calculation. When you consider that these
days, for each diffuse map used by your scene you will probably want a matching normal map,
generating them offline could reduce loading times significantly.

So what does a normal map look like? Well, considering the fact that the color values in the image are
actually packed normal vectors, it looks a little strange. Normal maps will tend to take on a purple-blue
color because the z component (which more often than not gets packed into the blue channel) is usually
dominant. Below we see what the converted normal maps for our creature and brick wall heightmaps
look like when loaded into an image viewer.

Figure 19.31

For each texel in the map a unit length normal vector will have been computed and encoded for storage
as a color value. The encoding process is generally relatively simple. In its most common form, each of
the three components of the vector <x, y, z> will be converted to an 8-bit representation (-1.0 to +1.0
mapped into the 0-255 range) and placed directly into the R, G, and B components of the texel. The
following code might be used to achieve this:

unsigned char Red   = (unsigned char)(((Normal.x + 1) / 2) * 255);
unsigned char Green = (unsigned char)(((Normal.y + 1) / 2) * 255);
unsigned char Blue  = (unsigned char)(((Normal.z + 1) / 2) * 255);

More comprehensive encoding techniques do exist, some of which can be used more reliably with
compressed textures, but in the above example the normal vector has simply been converted and stored
in the red, green and blue components of each texel, respectively.
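For reference, the decode performed later in the pixel shader is simply the inverse of this mapping. Because the sampler already returns each channel as a value in the [0, 1] range, each component of the sampled color c is expanded back into a signed normal component with:

$$\mathbf{N} = 2\mathbf{c} - 1$$

Ideally this is followed by a renormalize, since texture filtering and 8-bit quantization can leave the decoded vector slightly off unit length.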

So, we have learned that in order to create a normal map we can use any art package we wish. We
simply first create a height map (usually grayscale) and then, using D3DX or a third party tool, convert
it into a normal map ready for use by our shader.

Note: A popular alternative method for normal map generation involves initially creating a very high
polygon count (i.e., highly detailed) model to start, alongside the traditional lower polygon count model
used at runtime. Then, an empty normal map texture is mapped to the low polygon model to create a set
of sampling points (i.e., the texels of the normal map) on its surface. Each of these sampling points can
then be projected, so to speak, onto the higher polygon model in order to find the more detailed surface
information. The surface normal at that location on the high polygon mesh is then captured and stored in
the texel. This approach has the obvious benefit of being able to capture the surface features that are
deliberately hand modelled geometrically versus relying on the much less accurate diffuse texture whose
job is certainly not to do this. Using the diffuse texture is also hit and miss when it comes to textures
based on photographs, since very often lighting and shadows are already part of the color and are not
necessarily easy to remove. To be sure, a good diffuse texture would usually require that any lighting and
shadow be removed, but you need to be a reasonably skilled texture artist to do this effectively.

Let’s have a closer look now at the D3DX function responsible for loading in a height map and
converting it to a normal map. The D3DXComputeNormalMap function assumes that you have already
loaded your height map image, or otherwise created it as a Direct3D texture. This texture is provided to
the D3DXComputeNormalMap function, which will in turn populate a second texture of matching
dimensions that must also have been created by the application in advance. This second texture will
contain the final normal map, where each pixel describes an encoded normal.

HRESULT D3DXComputeNormalMap
(
LPDIRECT3DTEXTURE9 pTexture,
LPDIRECT3DTEXTURE9 pSrcTexture,
CONST PALETTEENTRY * pSrcPalette,
DWORD Flags,
DWORD Channel,
FLOAT Amplitude
);

LPDIRECT3DTEXTURE9 pTexture
The address of a pre-created Direct3D texture object into which this function will encode the generated
normals. The dimensions of this texture should match those of the source texture specified in the next
parameter.

LPDIRECT3DTEXTURE9 pSrcTexture
The address of the source texture containing the per-texel height values (your height map image), loaded
by your application in the usual way (e.g., using D3DXCreateTextureFromFile or other mechanism).
This function will examine the heights at each texel and will, based on neighboring texel heights,
generate a normal that suitably describes the direction in which any facet, should it exist, would be oriented.

CONST PALETTEENTRY * pSrcPalette


Contains the source texture's color palette if your heightmap has been loaded with a palettized format
(usually 256 colors or less), or NULL otherwise.

DWORD Flags
Zero or more flags that describe how the pixels in the height map should be addressed. The possible
constants are shown below.

D3DX_NORMALMAP_MIRROR_U
    Indicates that pixels off the edge of the texture on the u-axis should be mirrored, not wrapped.

D3DX_NORMALMAP_MIRROR_V
    Indicates that pixels off the edge of the texture on the v-axis should be mirrored, not wrapped.

D3DX_NORMALMAP_MIRROR
    Same as specifying D3DX_NORMALMAP_MIRROR_U | D3DX_NORMALMAP_MIRROR_V.

D3DX_NORMALMAP_INVERTSIGN
    Inverts the direction of each normal.

D3DX_NORMALMAP_COMPUTE_OCCLUSION
    Computes a per-texel occlusion term and encodes it into the alpha. An alpha of 1 means that the
    texel is not obscured in any way, and an alpha of 0 means that the texel is completely obscured.

DWORD Channel
It is sometimes the case that you will have your height map data stored in a 24 or 32-bit image format.
Thus, the function needs to know which color channel should be used to access the heights in the normal
generation equation. The possible values are shown below.

D3DX_CHANNEL_RED        Indicates the red channel should be used.

D3DX_CHANNEL_BLUE       Indicates the blue channel should be used.

D3DX_CHANNEL_GREEN      Indicates the green channel should be used.

D3DX_CHANNEL_ALPHA      Indicates the alpha channel should be used.

D3DX_CHANNEL_LUMINANCE  Instructs the normal map generation to use the luminance value of the RGB
                        color of the pixel.

FLOAT Amplitude
This parameter allows us to scale the intensity of the heights used in normal map generation. A larger
amplitude value causes the slopes to become more exaggerated and the normals to become more angled,
and vice versa. In short, it makes our 'bumps' more pronounced when we use a larger value.
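To tie these parameters together, here is a minimal sketch of the offline conversion described earlier. It assumes an ANSI build, a valid device pointer pDevice, and illustrative file names, and it omits error checking for brevity:

LPDIRECT3DTEXTURE9 pHeightMap = NULL;
LPDIRECT3DTEXTURE9 pNormalMap = NULL;

// Load the grayscale height map image.
D3DXCreateTextureFromFile( pDevice, "wall_height.bmp", &pHeightMap );

// Create a destination texture with matching dimensions to receive the normals.
D3DSURFACE_DESC Desc;
pHeightMap->GetLevelDesc( 0, &Desc );
D3DXCreateTexture( pDevice, Desc.Width, Desc.Height, 1, 0,
                   D3DFMT_A8R8G8B8, D3DPOOL_MANAGED, &pNormalMap );

// Generate per-texel normals from the luminance of the height map. The
// amplitude of 2.0f is an arbitrary choice that exaggerates the bumps a little.
D3DXComputeNormalMap( pNormalMap, pHeightMap, NULL, 0,
                      D3DX_CHANNEL_LUMINANCE, 2.0f );

// Save the result so the runtime can simply load the finished normal map.
D3DXSaveTextureToFile( "wall_normal.dds", D3DXIFF_DDS, pNormalMap, NULL );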

At this point, we know what normal maps are and how to create and load them. Let us now discuss how
we can use normal maps inside our shaders.

Tangent Space

While sampling the normal from the normal map inside the shader and using it within a lighting
computation is a trivial process, we do have a problem that must be addressed first relating to the space
in which the normals in the map are expressed.

Although it is entirely possible to store normals within the normal map in any space of our choosing, let
us imagine for a moment that the normals we have generated are expressed in either world or object
space as would generally be the case when working with vertex normals. When we generate a normal
map texture it is very common for us to want to apply that same texture map to many different surfaces
within the scene, each with their own orientation and potential object transform, just as we do with our
diffuse color maps for example. Since the normals that we ultimately need access to within the lighting
calculation are designed to represent the orientation of the surface at each texel, how would this work if
the normals were encoded in world or object space and applied to two separate surfaces (of the same
object in the latter case), each with different real world orientations? In such a case we would potentially
need to create two different normal map textures that could be applied to each surface independently so
that the world or object space normals would accurately take into account the orientation of their
respective surfaces. As you might imagine therefore, encoding the normals in either of these two spaces
could lead to a significant increase in texture memory requirements because it is likely that we will need
unique normal maps in many cases.

In an attempt to solve this problem, it will often be the case that the normals stored in a normal map will
instead be expressed relative to a space that is local to the texture itself. This space is most commonly
referred to as tangent space. In this space, the normals in the normal map are almost always created such
that a normal perfectly perpendicular to the surface is facing fully along the surface’s local Z axis
<0,0,1> (i.e., a non-angled normal matches the direction of the polygon (or vertex) normal).

To help clarify let’s examine the data that might be stored in a normal map where the normals have been
encoded in tangent space.

Figure 19.32 – Normal Map Figure 19.33 – Normal Encoding

Figure 19.32 depicts a typical two dimensional normal map texture similar to the one we looked at
earlier. As we know, the difference between a normal map and most other types of texture is that instead
of a color they are intended to represent a unique direction vector for each texel within the map.

Were we able to visualize the texture / tangent space normals, Figure 19.33 depicts a small 6x6 section
of the normal map so that we can see what the normal vectors might look like. While the encoding of the
vector itself is somewhat outside the scope of this discussion, what is important to understand is just
what these unit length normal vectors actually represent when expressed in tangent space.

As we have established above, when representing these normals in tangent space, no assumption can or
should be made about the surfaces to which the normal map will actually be assigned at the point when
the normals are being computed. These normal maps might be assigned to thousands of completely
arbitrary polygons throughout the scene. Given this fact, the only reliable space in which these direction
vectors / normals can exist is that of the texture itself.

Take the following three example tangent space normal vectors:

Normal A = <1, 0, 0>
Normal B = <0, 1, 0>
Normal C = <0, 0, 1>

If we were to encounter these vectors when sampling from a tangent space normal map, what would
they actually represent?

The first vector, Normal A, contains values that describe a direction pointing precisely along the +X
axis <1,0,0> in tangent space. Irrespective of the space in which the surface being rendered exists, or
how the texture was mapped to that surface, a normal pointing along the +X axis in this way will always
correspond to the U axis of the two dimensional normal map texture as shown in Figure 19.34 (labelled
“Normal +X”).

Figure 19.34 – Normal Map Axes

The second vector, Normal B, similarly contains values that point precisely along a major axis. In this
case it is +Y <0,1,0> which corresponds directly to the texture's V axis.

Normal C contains a value that describes a direction pointing along the +Z axis <0,0,1> in tangent
space. With respect to a two dimensional texture, the existence of a third axis (Z) may be a little difficult
to understand. Remember however that the normal map is intended to represent fully three dimensional
vectors that describe the topology of the surface. If that surface was completely flat, the normals will
always point directly away from the surface, along its computed surface normal. In the case of a normal
map, the +Z axis (labelled “Normal +Z” in Figure 19.34) points directly away from the “texture”. With
respect to tangent space, this is effectively the same as saying that the normal will be pointing directly
away from the surface being rendered and to which the texture is mapped.

At this stage then, we know that each of these normals will be expressed within the space of the texture
itself, whereby the +X axis corresponds to the texture U axis, the +Y axis corresponds to the texture V
axis, and the +Z axis points “out” of the texture.

With normals expressed relative to this fixed tangent space, it should be possible for us to assign the
same normal map to any surface or object in the scene without the need to create unique instances of the
texture in each case. The problem now however is that the incident light direction vector used during the
lighting calculations and the normals read from the normal map exist in different spaces. In our previous
shaders, the light vector was supplied to the shader by the application as a world space direction vector.
Performing a dot product between a world space light direction vector and a tangent space per-pixel
normal would produce wildly inaccurate results since, as we know, the Lambertian and Phong
calculations only work if the vectors share the same coordinate system.

To solve this problem we either need to transform world space light vectors into this fixed texture or
tangent space, transform our tangent space normals into world space, or transform both vectors into
some other shared space (e.g., view space), prior to performing the lighting calculations.

Whichever approach we choose we will need to know, or be able to compute, the three vectors that
describe the axes of tangent space relative to our target space (i.e., world or object space) at the point the
lighting computation is run. Together these three vectors will form an orthonormal basis at each vertex
which is all we need to construct the columns of a matrix that will perform a transform from world or
object space into tangent space inside the vertex (or pixel) shader or vice versa.
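To make the transform itself concrete: with T, B and N expressed in world space, converting a world space light vector L into tangent space amounts to projecting it onto each of those three axes, i.e. taking three dot products:

$$\mathbf{L}_{tangent} = \big(\, \mathbf{L} \cdot \mathbf{T},\;\; \mathbf{L} \cdot \mathbf{B},\;\; \mathbf{L} \cdot \mathbf{N} \,\big)$$

which is exactly what multiplying L by a matrix whose columns are T, B and N produces.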

We already have access to one of these axis direction vectors -- the vertex normal corresponding to the
positive Z axis of the tangent space coordinate system in object space. What we don't yet have access to
in our shaders are the two additional axis direction vectors corresponding to the positive X axis
representing the U direction of the normal map texture (also called the tangent vector), and the positive
Y axis vector representing the V direction of the normal map texture (also called the bitangent or
binormal vector).

Note: It has become commonplace in most cases to refer to these two additional vectors as the tangent
and binormal vector. This is in fact somewhat of a misnomer and only applies with respect to curves,
which can have two normals and one tangent. When talking about surfaces, in which there can only ever
be one normal, these vectors are more accurately described as the tangent and bitangent (i.e., the two
additional vectors computed are both considered tangential to the surface).

To avoid confusion however, for the purpose of this discussion we’ll be using the more common terms,
binormal and tangent, also adopted by Direct3D.

As you may have gathered, getting access to this information isn't necessarily trivial. Due to the fact that
the Z axis direction vector of the tangent space transformation matrix is based on the vertex normal of the
triangle being rendered, this matrix may in fact be different for every triangle that we draw. Indeed,
given that the other two axis vectors must describe a direction equivalent to the normal map's U and V
axes as they have been mapped to each surface, even the texture coordinate components of each triangle
must be considered. How then might we generate the additional information we need, and how do we
get that information to the shader on a per-surface basis?

Generating Tangent Vectors

The first step in preparing your geometry for normal mapping is to add space to the vertex structure
(application side) to store the two additional vectors that form the tangent space transformation matrix
(tangent and binormal) in addition to the normal. Theoretically you could opt to store only the tangent
vector and calculate the binormal inside the vertex shader by performing a cross product between the
vertex normal and its tangent vector, but for the sake of keeping this discussion simple, we will assume
that your mesh vertices define both a tangent and binormal.

struct NormalMapVertex
{
D3DXVECTOR3 Position;
D3DXVECTOR3 Normal;
D3DXVECTOR3 Tangent;
D3DXVECTOR3 Binormal;
D3DXVECTOR2 UV;
};
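Incidentally, if you do take the option mentioned above of storing only the tangent, the binormal can be rebuilt in the vertex shader with a single cross product (the sign may need flipping where the texture coordinates are mirrored, which is why some engines store a handedness value alongside the tangent):

$$\mathbf{B} = \mathbf{N} \times \mathbf{T}$$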

We will compute and store these two additional vectors (tangent and binormal) per vertex at mesh
creation time since generating them dynamically, while possible, is usually too expensive to consider as
a general solution. Forgetting about how we calculate these additional tangent space vectors for the
moment, it is clear that we will have initial access to this information in the vertex shader. It is here that
we will pack these three vectors (normal, tangent, and binormal) into a 3x3 transformation matrix to take
us from world space to tangent space (or vice versa if desired). This means that we can transform the
world space light direction vector by this matrix to create a tangent space light direction vector per
vertex. This value can then be output in a set of texture coordinates so that the final tangent space light
vector passed into the pixel shader is an interpolation of the tangent space light vectors output by each
vertex. If we need additional tangent space vectors inside the pixel shader (e.g., the view vector for
specular lighting), the same strategy can be applied at the vertex level.

Note: We are also free to pass the entire 3x3 tangent space matrix along to the pixel shader via output
parameters. This would be something that we would do if we wanted to compute all of our lighting in
world space. In this case, we would use the matrix in our pixel shader to transform our tangent space
normal (pulled from the normal map) into world space before doing any lighting. Additionally, we would
simply leave our light and camera direction vectors in world space and not bother with them at all in the
vertex shader.

Of course, now that we have made additional space in our vertex structure for both the tangent and
binormal vectors we will also need to define a suitable declaration so that our shader knows that these
additional vectors exist. As it happens, there are usage types and input shader semantics specifically for
identifying tangent and binormal vectors:

D3DVERTEXELEMENT9 dwDecl3[] =
{
    { 0,  0, D3DDECLTYPE_FLOAT3, D3DDECLMETHOD_DEFAULT, D3DDECLUSAGE_POSITION, 0 },
    { 0, 12, D3DDECLTYPE_FLOAT3, D3DDECLMETHOD_DEFAULT, D3DDECLUSAGE_NORMAL,   0 },
    { 0, 24, D3DDECLTYPE_FLOAT3, D3DDECLMETHOD_DEFAULT, D3DDECLUSAGE_TANGENT,  0 },
    { 0, 36, D3DDECLTYPE_FLOAT3, D3DDECLMETHOD_DEFAULT, D3DDECLUSAGE_BINORMAL, 0 },
    { 0, 48, D3DDECLTYPE_FLOAT2, D3DDECLMETHOD_DEFAULT, D3DDECLUSAGE_TEXCOORD, 0 },
    D3DDECL_END()
};

So how do we populate these vertex structure members with valid tangent space axis vectors?

Just as D3DX provides the D3DXComputeNormals function to automatically calculate normals for your
vertices, it also exposes a D3DXComputeTangentFrame function (and an extended version) to
automatically calculate the normals, tangents, and binormals for all of your mesh’s vertices. The simpler
non-extended version is shown next.
HRESULT D3DXComputeTangentFrame( ID3DXMesh * pMesh, DWORD dwOptions );

To this function you simply pass your mesh and some optional flags that govern how the calculations
are performed (you may like to refer to the SDK documentation for a full list of these). On successful
return, the binormal and tangent components of your mesh vertices should then contain the computed
tangent space axis vectors.
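As a quick sketch, assuming pMesh is an ID3DXMesh whose declaration contains NORMAL, TANGENT and BINORMAL elements (such as the declaration shown above), the call itself is a one-liner:

// Let D3DX generate the per-vertex normals, tangents and binormals in place.
HRESULT hr = D3DXComputeTangentFrame( pMesh, 0 );
if ( FAILED( hr ) )
{
    // The call can fail if, for example, the vertex declaration is missing one
    // of the required elements.
}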

While this approach is relatively straightforward, in our sample code we opted not to use the provided
D3DX function, and instead adopted a function that was originally written by respected game
mathematician and author Eric Lengyel. Not only did we feel it provided better control over the process
(since we could see exactly what it was doing) but it also gives us a chance to look at the internals so
that we can understand what is happening under the hood. Eric's function does not require that you pass
a slew of parameters like some of the more complex D3DX versions, nor does it require any particular
mesh layout. It simply accepts a list of triangle indices in addition to an array of vertices which it will
then populate with the tangent and binormal vectors we need. The D3DX functions are arguably more
flexible and configurable, but we have found that Eric’s code does an excellent job and is far more
educational.

While the concept of tangent space is easy enough to understand and using it is even easier (as we will
see momentarily), the math behind tangent space generation is not necessarily obvious and immediately
intuitive to everyone. For those of you who are happy to make use of the provided D3DX tangent frame
generation function, you may prefer to skip ahead to the next section which deals with constructing the
final HLSL shaders we need to integrate tangent space normal maps. Otherwise, let's take a brief look at
how the tangent frames can be computed manually.

Note: Much of the following discussion can also be found in the excellent book “Mathematics for 3D
Game Programming & Computer Graphics” by Eric Lengyel. Indeed the general approach outlined here is
based on Eric’s work on the subject, so a special thank you goes out to him. Eric’s code implementation
for the computation of surface tangent vectors can also be found at the following URL:

http://www.terathon.com/code/tangent.php

The ultimate purpose of the surface tangent computation is to determine the two vectors, in addition to
the surface normal, that can be used to describe the transformation between world (or object) space and
that of a texture applied to each surface. These three vectors -- including the surface normal --
essentially describe the three axes of the surface (X, Y, Z) as they exist in world or object space. Figure
19.35 below demonstrates this principle:

Figure 19.35 – Tangent Vectors

We know that these tangent space axes are intended to allow us to transform information to and from
texture / tangent space, and as a result it follows that the manner in which the texture map has been
applied to each surface is of critical importance.

Figure 19.36 demonstrates how the orientation of the tangent space axis vectors (corresponding to the U
& V axes of the texture) are not guaranteed to match up with those of the space in which all other
information is expressed. In this case, the +X and +Y axes on which the normals in the normal map are
based in no way relate to the +X and +Y axes of the world in which, for instance, our lights might exist.

Figure 19.36 – Tangent Space vs. World Space

In order to compute the binormal and tangent vectors that correspond to the U and V axes of the texture
therefore, we must base this information on both the orientation of the surface to which the normal map
is assigned and at least one set of texture coordinates that will be used to determine how the normal map
has been mapped to that surface.

Note: The artist need not always supply a unique set of texture coordinates solely for the purposes of
applying normal maps, and indeed in this example we will not do so. Due to the fact that the normal map
will often match the diffuse/base texture also applied to that surface, it is generally the case that we can
use the base texture coordinates for this purpose. In either case however, the texture coordinates used
in the tangent vector computation must be identical to those used when sampling from the normal map.

In order to approach the construction of an appropriate formula capable of computing the tangent and
binormal vectors, we first need to establish some known property that will help us understand the
relationship of the unknown values (in this case, the texture's U and V axis vectors) to the values that we
do know (surface information such as vertex positions and texture coordinates).

In the case of computing surface tangent vectors we can make the following fundamental assumption:

Given any point ‘P’ that falls somewhere within the triangle formed by vertices E, F & G, that point can
be calculated based on known offset or texture coordinate values (‘s’ and ‘t’) assuming tangent and
binormal vectors (‘T’ and ‘B’) are known.

It may not be immediately apparent as to the exact meaning of this statement, so let’s take a look at an
example of this in action (Figure 19.37).

Figure 19.37 – Point from Texture Coordinates

It’s worth noting that at this stage in the process we are absolutely not making the assumption that the
tangent and binormal vectors are unit length. They will ultimately be normalized at the end of the
process for use during rendering. At this point it is important that these vectors describe the magnitude
of an individual tiling of the texture, and are aligned to the U & V axes of the texture as dictated by the
triangle’s texture coordinates and object space scale.

Assuming that we already had access to these arbitrary length tangent and binormal vectors, we could
compute any point P known to sit on that surface. To do so, given the two known offset/texture
coordinate values s and t, we simply offset our starting position taken from vertex E along the tangent
vector by a distance of s, and then offset that result along the binormal vector by the distance described
by t.

This all sounds pretty logical, but what does this have to do with the surface tangent vector calculation?

The reason this is important is because it gives us a potential starting point that will allow us to derive a
formula to compute the tangent and binormal vectors we’re after. The formula used to describe the
operation depicted in Figure 19.37 contains only the unknown values we are attempting to compute (the
tangent T and the binormal B) in addition to the values we do know (vertex positions and texture
coordinates). In this case, the texture coordinates and vectors are used to compute a position, whereas
we need to convert known positions and texture coordinates into the appropriate tangent axis vectors.
However, by way of some linear algebra operations, we should be able to re-arrange this into a useful
equation.

Let’s start with the formulae used to describe the operation outlined in Figure 19.37. In this case we will
look at the two formulae that will arrive at a position equivalent to those of the other two vertices F &
G. We’ll discuss why this is important shortly.

In the following formulae E, F and G refer to the positions of the three vertices in the triangle, and T &
B are the tangent and binormal vectors, respectively. In addition, we also use $(E_s, E_t)$, $(F_s, F_t)$
and $(G_s, G_t)$ to describe the texture coordinate offset values stored at the vertices E, F & G
respectively.

$$P = E + (F_s - E_s)T + (F_t - E_t)B$$
$$Q = E + (G_s - E_s)T + (G_t - E_t)B$$

Where the resulting values P & Q also equal:

$$P = F, \qquad Q = G$$

Based on our earlier discussion relating to the diagram shown in Figure 19.37, the two formulae given
above should be relatively easy to follow. Starting at vertex E, we offset that position along the vectors
T and B respectively, based on the difference between the texture coordinates of the point we’re moving
to (the coordinates of each of the other vertices in this case) and those of the reference vertex E.

Put another way, the difference (or deltas) of the texture coordinates along each edge of the triangle will
cause us to arrive at the absolute extremes of that triangle when offsetting along the vectors T & B.

This is an important point to understand. In the first formula, the resulting position would match that of
vertex F precisely, and in the second the result would match the position of vertex G. As a result, we can
consider P & Q to be known values that correspond to the vertex information already available to us.

The fact that we are taking into account more than one edge in our calculation is also important. When
the artist is applying a texture map to a surface, it is not necessarily required that they be applied using a
planar mapping approach (where the mapping is essentially square). The fact of the matter is, we need to
understand how the texture has been applied to the entire triangle and as a result we must take into
account the mapping of every possible point on the surface of the triangle.

Returning to our formulae, we can make things a little easier here by substituting the individual texture
coordinates used directly in the formulae outlined earlier for variables describing the delta values
directly:

$$s_1 = F_s - E_s, \qquad t_1 = F_t - E_t$$
$$s_2 = G_s - E_s, \qquad t_2 = G_t - E_t$$

In addition, because the position of the starting / reference point (E in this example) is largely
unimportant with respect to the final computation of the tangent and binormal vectors -- since they
describe only a direction and magnitude -- we can simplify these formulae somewhat further by
removing the reference to the position of vertex E and simply assume a standard origin of <0,0,0>.

With the delta computations substituted, our new simplified formulae should look something along the
lines of the following:

$$P = s_1 T + t_1 B$$
$$Q = s_2 T + t_2 B$$

Where the resulting values P & Q also now equal:

$$P = F - E, \qquad Q = G - E$$

We know that before we removed E as the initial starting point, the result of each computation matched
the position of the vertices F & G. Because we are no longer using the position of vertex E as a starting
point, the resulting values P & Q will instead now describe a position that is relative to the origin. This
relationship to the origin will however be identical to the relationship that each vertex originally had
relative to E. As a result of the fact that the relationship has been maintained, both P & Q can still be
considered to be known values (i.e. F-E and G-E).

We now have two formulae which describe the mapping for the entire extents of the triangle along each
edge. Unfortunately, we are also left with 6 unknown values on the right hand side of the equation (the
x, y, and z components of both the tangent and binormal vectors). Solving for multiple unknowns is
tricky at the best of times, and in this case we need to write an equation in which our 6 unknown values
can be computed, leaving only our known values (texture coordinates and positions) on the right hand
side.

In order to solve the two separate equations for these multiple unknowns, rewriting the formula in
matrix form is the ideal solution. We can treat each matrix as an individual entity within the equation
and in theory we should be left with a single matrix that contains the 6 values we are interested in.

The following equation shows how we might rewrite the above formulae in matrix form:
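
| P.x  P.y  P.z |   | s1  t1 |   | T.x  T.y  T.z |
| Q.x  Q.y  Q.z | = | s2  t2 | * | B.x  B.y  B.z |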

Hopefully it is clear that the above equation is essentially the same in practice as the original formulae we started with. Remember that during matrix multiplication, we compute the dot product of each row of the first matrix (containing the texture coordinate deltas) with each column of the second matrix (containing the tangent and binormal vectors). Now that this has been rewritten in matrix form, the task of solving for vectors T & B becomes a somewhat simpler problem.

What we need to do now is to re-arrange this such that the matrix containing the values we are
attempting to find stands alone on the left-hand side of the equation. To do so we need to apply some
fundamental algebraic principles:

Take a look at the following equation:
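
A = B * C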

This is about as simple an equation as we can get. Notice however that this follows the same form as the
matrix-based equation we are working with. A (the result) equals B (the texture coordinate delta matrix)
multiplied by C (the matrix containing the tangent and binormal vectors).

In this simplified version of our equation we need to isolate C such that it stands alone on one side of the equation. To do so, we must move the variable B to the opposite side of the equal sign. Because C is simply multiplied by B, we can achieve this by dividing both sides of the equation by B.

Alternatively, this can be achieved by multiplying each side by the reciprocal (i.e., the inverse) of B as
shown below:
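
A * (1/B) = B * C * (1/B)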

We’ll use this reciprocal / inverse approach for the moment because it is much more applicable when we
know that the variables we’re working with are in matrix form.

We can simplify greatly at this point. Whenever we end up with a result based on the multiplication of
two variables (B & C), and we subsequently divide that result by one of those same variables (B), what
remains will be a value equal to the other variable (C) i.e. BC/B = C. The above formula can therefore
be written as follows:
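
A * (1/B) = C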

So, following the above logic, it makes sense that we can isolate our unknown vectors by adopting a similar approach. In the above formula we arrived at C (the matrix containing the tangent and binormal vectors) simply by multiplying A (the matrix containing the position deltas P & Q) by the reciprocal or the inverse of B (the matrix containing the texture coordinate deltas). Bear in mind that, unlike the scalar case, matrix multiplication is not commutative, so the inverse must be applied on the appropriate side, as the final formula below shows.

Rather than step through every stage of the inverse and simplification of the matrix values, the following
formula shows the resulting simplified equation following the rearrangement and inverse operations.
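
| T.x  T.y  T.z |                          |  t2  -t1 |   | P.x  P.y  P.z |
| B.x  B.y  B.z |  =  1/(s1*t2 - s2*t1) *  | -s2   s1 | * | Q.x  Q.y  Q.z |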

This is in fact our final formula. Evaluating it returns to us the six unknowns (two three-dimensional vectors) which constitute the tangent and binormal vectors.

At this stage however, remember that they will not necessarily be unit length. Although their direction
will be aligned to the U & V axes of the texture applied to the surface, their length will in fact be equal
to the length of a single tiling of the texture in object space along their respective axes. You can think of
these two vectors as forming the edges of an object space rectangle as it would fit precisely around a
single tile of the texture, irrespective of the underlying geometry.

With these formulae to hand, we can now start writing code.

Under normal circumstances we would be computing a distinctly separate unit length tangent and
binormal vector for each vertex in the triangle based on the vertex normal rather than that of the triangle.
In this first block of code however, we will concentrate on computing a tangent and binormal vector for
an individual triangle as a whole in order to demonstrate the principle.

The first thing we need to do is to compute the known values used as input into the formula outlined
above. These consist of the position delta values P and Q, in addition to the texture coordinate deltas s1,
s2, t1 and t2. We will assume that the vertices are passed in array form through a parameter named
simply v.

Recall that P=F-E and Q=G-E based on our original discussion of the tangent vector calculation:
// Position deltas P & Q
P = v[1].Pos - v[0].Pos;
Q = v[2].Pos - v[0].Pos;

// Texture coordinate deltas
s1 = v[1].u - v[0].u;
t1 = v[1].v - v[0].v;
s2 = v[2].u - v[0].u;
t2 = v[2].v - v[0].v;

With the known variables computed, we can now move on to computing the tangent and binormal
matrix.

If you refer back to the original formula, notice that the result of the matrix multiplication will be scaled
by a value equal to 1/(s1*t2-s2*t1). Since each of the six values resulting from the matrix multiply
will need to be scaled by this value, we will pre-compute this part in advance to save us a little time
during execution. In addition, this will allow us to test for cases in which the texture coordinates are
effectively degenerate, potentially causing a divide by zero.

r = (s1 * t2 - s2 * t1);
if ( fabs( r ) < 1e-5 ) return false;
r = 1.0f / r;

All we need to do now is to perform the actual matrix multiplication itself. As mentioned, each of the
resulting values will be scaled by our pre-computed scalar value contained in the variable r such that it
takes the form “(1/(s1*t2-s2*t1))* Matrix Result”.
T.x = r * (t2 * P.x - t1 * Q.x);
T.y = r * (t2 * P.y - t1 * Q.y);
T.z = r * (t2 * P.z - t1 * Q.z);
B.x = r * (s1 * Q.x - s2 * P.x);
B.y = r * (s1 * Q.y - s2 * P.y);
B.z = r * (s1 * Q.z - s2 * P.z);

There is one outstanding issue that needs to be addressed here, and that is the case in which the artist has
mirrored or inverted the texture coordinates that have been mapped to the surface. In the case of normal
maps, this can be quite a problem. The matrix that is generated will not have the correct “handedness”
and will effectively cause the normals to be inverted later on. As a result, we must check for this case
and flip the sign of the binormal if necessary.

if ( D3DXVec3Dot( D3DXVec3Cross( &vCross, &vNormal, &T ), &B ) < 0.0f )
{
// Flip the binormal
B = -B;

} // End if coordinates inverted

Finally, we can normalize the resulting vectors and return them to the calling function.

D3DXVec3Normalize( &outTangent, &T );
D3DXVec3Normalize( &outBinormal, &B );
return true;

For the sake of completeness, the entire code listing for this example function follows below:

bool CalcTangentSpace( CVertex v[], const D3DXVECTOR3 & vNormal,
D3DXVECTOR3 & outTangent, D3DXVECTOR3 & outBinormal)
{
D3DXVECTOR3 P, Q, T, B, vCross;
float s1, t1, s2, t2, r;

// Compute the known variables P & Q, where "P = F-E" and "Q = G-E"
// based on our original discussion of the tangent vector calculation.
P = v[1].Pos - v[0].Pos;
Q = v[2].Pos - v[0].Pos;

// Also compute the known variables <s1,t1> and <s2,t2>. Recall that
// these are the texture coordinate deltas similarly for "F-E" and
// "G-E".
s1 = v[1].u - v[0].u;
t1 = v[1].v - v[0].v;
s2 = v[2].u - v[0].u;
t2 = v[2].v - v[0].v;

// Next we can pre-compute part of the equation we developed
// earlier: "1/(s1 * t2 - s2 * t1)". We do this in two separate
// stages here in order to ensure that the texture coordinates are
// not invalid (can happen with degenerate triangles in fans).
r = (s1 * t2 - s2 * t1);
if ( fabs( r ) < 1e-5 ) return false;
r = 1.0f / r;

// All that's left for us to do now is to run the matrix multiplication
// and multiply the result by the scalar portion we precomputed earlier.
// (Such that it takes the form (1/(s1 * t2 - s2 * t1)) * Matrix)
T.x = r * (t2 * P.x - t1 * Q.x);
T.y = r * (t2 * P.y - t1 * Q.y);
T.z = r * (t2 * P.z - t1 * Q.z);
B.x = r * (s1 * Q.x - s2 * P.x);
B.y = r * (s1 * Q.y - s2 * P.y);
B.z = r * (s1 * Q.z - s2 * P.z);

// Compute the "handedness" of the tangent and binormal. This ensures
// the inverted / mirrored texture coordinates still have an accurate
// matrix.
if ( D3DXVec3Dot( D3DXVec3Cross( &vCross, &vNormal, &T ), &B ) < 0.0f )
{
// Flip the binormal
B = -B;

} // End if coordinates inverted

// Normalize the vectors and return them
D3DXVec3Normalize( &outTangent, &T );
D3DXVec3Normalize( &outBinormal, &B );
return true;
}

As mentioned however, this is not quite the end of the story. The above code will only compute a single
tangent and binormal vector for the triangle as a whole. We need to compute a unique set of vectors for
each vertex in a polygon or mesh that potentially contains many thousands of triangles. Therefore, some
slight changes will need to be made to the above function in order to support this.

Let’s quickly run through the final code that can be used to compute per-vertex tangent space
information for every triangle in our candidate mesh, stopping to discuss any significant changes that
have been introduced.

In this example code we are passed the vertex data via the parameter v, and the triangle indices through
the parameter Indices.

bool CalcTangentSpace( CVertex v[], int nNumVerts,
ULONG Indices[], int nNumTris )
{
D3DXVECTOR3 P, Q, T, B, vCross, vNormal;
float s1, t1, s2, t2, r;
int i, i1, i2, i3;

// Allocate storage space for the tangent and binormal vectors
// that we will effectively need to average for shared vertices.
D3DXVECTOR3 * pTangents = new D3DXVECTOR3[nNumVerts];
D3DXVECTOR3 * pBinormals = new D3DXVECTOR3[nNumVerts];
memset( pTangents, 0, sizeof(D3DXVECTOR3) * nNumVerts );
memset( pBinormals, 0, sizeof(D3DXVECTOR3) * nNumVerts );

// Iterate through each triangle in the mesh
for ( i = 0; i < nNumTris; ++i )
{
// Compute the indices that reference the correct vertices
// of the triangle we're currently processing.
i1 = Indices[i*3];
i2 = Indices[(i*3)+1];
i3 = Indices[(i*3)+2];

Although the very start of the function is slightly different in this case, it should still be relatively self-
explanatory. Here we must allocate some temporary storage that will be used to sum together each of the
individual per-triangle tangent and binormal vectors at each vertex. These will eventually be normalized
such that we end up with an average at each vertex.

Next we are simply looping through each triangle in the mesh, retrieving the indices for each of the three
vertices in the triangle.

What follows is largely identical to that of the previous function, with the exception that we are now
using the indices to select the vertices from the vertex array.

// Compute the known variables P & Q, where "P = F-E" and "Q = G-E"
// based on our original discussion of the tangent vector
// calculation.
P = v[i2].Pos - v[i1].Pos;
Q = v[i3].Pos - v[i1].Pos;

// Also compute the known variables <s1,t1> and <s2,t2>. Recall that
// these are the texture coordinate deltas similarly for "F-E"
// and "G-E".
s1 = v[i2].u - v[i1].u;
t1 = v[i2].v - v[i1].v;
s2 = v[i3].u - v[i1].u;
t2 = v[i3].v - v[i1].v;

// Next we can pre-compute part of the equation we developed
// earlier: "1/(s1 * t2 - s2 * t1)". We do this in two separate
// stages here in order to ensure that the texture coordinates
// are not invalid.
r = (s1 * t2 - s2 * t1);
if ( fabs( r ) < 1e-5 ) continue;
r = 1.0f / r;

// All that's left for us to do now is to run the matrix
// multiplication and multiply the result by the scalar portion
// we precomputed earlier.
T.x = r * (t2 * P.x - t1 * Q.x);
T.y = r * (t2 * P.y - t1 * Q.y);
T.z = r * (t2 * P.z - t1 * Q.z);
B.x = r * (s1 * Q.x - s2 * P.x);
B.y = r * (s1 * Q.y - s2 * P.y);
B.z = r * (s1 * Q.z - s2 * P.z);

The following piece of the function is again slightly different. Here we store the computed tangent and
binormal vectors by adding them to any previous values computed at the referenced vertices. Once
completed, we move on to the next triangle in the list.

// Add the tangent and binormal vectors (summed average) to
// any previous values computed for each vertex.
pTangents[i1] += T;
pTangents[i2] += T;
pTangents[i3] += T;
pBinormals[i1] += B;
pBinormals[i2] += B;
pBinormals[i3] += B;

} // Next Triangle

At this stage, we have computed all of the individual per-triangle tangent and binormal vectors and
stored them. In cases where vertices are shared between different triangles, the tangent and binormal
vectors have been summed.

We must now take this information and compute the final vectors that will represent the correct tangent
space matrix at each vertex in the mesh. We begin by iterating through each vertex for which the tangent
vectors have been computed, retrieving both the computed tangent vector and the original vertex normal.

// Generate final tangent vectors for each vertex
for ( i = 0; i < nNumVerts; i++ )
{
// Retrieve the normal vector from the vertex and the computed
// tangent vector.
vNormal = v[ i ].Normal;
T = pTangents[ i ];

// Gram-Schmidt orthogonalize
T = T - vNormal * D3DXVec3Dot( &vNormal, &T );
D3DXVec3Normalize( &T, &T );

The last section of the above code is new to us: the “Gram-Schmidt orthogonalization”. Put simply, given two direction vectors, this code adjusts the first vector such that it is guaranteed to be orthogonal to the second.

In this case, we need to ensure that the tangent vector has a direction which is orthogonal to the
potentially arbitrary vertex normal computed elsewhere by our application. Remember that the normal
of the vertex may well be different to the normal of the triangle / surface.

The same “orthogonalization” process technically needs to be undertaken for the binormal too, but in
this case it must be guaranteed to be orthogonal to both the vertex normal and our new tangent vector.
Therefore, we can more easily determine this simply by re-computing the vector using the cross product.

// Calculate the new orthogonal binormal
D3DXVec3Cross( &B, &vNormal, &T );
D3DXVec3Normalize( &B, &B );

The remainder of the function deals with the “handedness” question once again, and finally stores the
computed tangent and binormal before cleaning up and returning.

// Compute the "handedness" of the tangent and binormal. This
// ensures the inverted / mirrored texture coordinates still have
// an accurate matrix.
if ( D3DXVec3Dot( D3DXVec3Cross(&vCross,&vNormal,&T),
&pBinormals[i] ) < 0.0f )
{
// Flip the binormal
B = -B;

} // End if coordinates inverted

// Store these values
v[i].Tangent = T;
v[i].Binormal = B;

} // Next vertex

// Clean up and return
delete []pTangents;
delete []pBinormals;
return true;
}

That concludes our discussion of the computation of surface tangent vectors. Once again, for the sake of
completeness, the full uninterrupted code listing is included below.

bool CalcTangentSpace( CVertex v[], int nNumVerts,
ULONG Indices[], int nNumTris )
{
D3DXVECTOR3 P, Q, T, B, vCross, vNormal;
float s1, t1, s2, t2, r;
int i, i1, i2, i3;

// Allocate storage space for the tangent and binormal vectors
// that we will effectively need to average for shared vertices.
D3DXVECTOR3 * pTangents = new D3DXVECTOR3[nNumVerts];
D3DXVECTOR3 * pBinormals = new D3DXVECTOR3[nNumVerts];
memset( pTangents, 0, sizeof(D3DXVECTOR3) * nNumVerts );
memset( pBinormals, 0, sizeof(D3DXVECTOR3) * nNumVerts );

// Iterate through each triangle in the mesh
for ( i = 0; i < nNumTris; ++i )
{
// Compute the indices that reference the correct vertices
// of the triangle we're currently processing.
i1 = Indices[i*3];
i2 = Indices[(i*3)+1];
i3 = Indices[(i*3)+2];

// Compute the known variables P & Q, where "P = F-E" and "Q = G-E"
// based on our original discussion of the tangent vector
// calculation.
P = v[i2].Pos - v[i1].Pos;
Q = v[i3].Pos - v[i1].Pos;

// Also compute the known variables <s1,t1> and <s2,t2>. Recall that
// these are the texture coordinate deltas similarly for "F-E"
// and "G-E".
s1 = v[i2].u - v[i1].u;
t1 = v[i2].v - v[i1].v;
s2 = v[i3].u - v[i1].u;
t2 = v[i3].v - v[i1].v;

// Next we can pre-compute part of the equation we developed
// earlier: "1/(s1 * t2 - s2 * t1)". We do this in two separate
// stages here in order to ensure that the texture coordinates
// are not invalid.
r = (s1 * t2 - s2 * t1);
if ( fabs( r ) < 1e-5 ) continue;
r = 1.0f / r;

// All that's left for us to do now is to run the matrix
// multiplication and multiply the result by the scalar portion
// we precomputed earlier.
T.x = r * (t2 * P.x - t1 * Q.x);
T.y = r * (t2 * P.y - t1 * Q.y);
T.z = r * (t2 * P.z - t1 * Q.z);
B.x = r * (s1 * Q.x - s2 * P.x);
B.y = r * (s1 * Q.y - s2 * P.y);
B.z = r * (s1 * Q.z - s2 * P.z);

// Add the tangent and binormal vectors (summed average) to
// any previous values computed for each vertex.
pTangents[i1] += T;
pTangents[i2] += T;
pTangents[i3] += T;
pBinormals[i1] += B;
pBinormals[i2] += B;
pBinormals[i3] += B;

} // Next Triangle

// Generate final tangent vectors for each vertex
for ( i = 0; i < nNumVerts; i++ )
{
// Retrieve the normal vector from the vertex and the computed
// tangent vector.
vNormal = v[ i ].Normal;
T = pTangents[ i ];

// Gram-Schmidt orthogonalize
T = T - vNormal * D3DXVec3Dot( &vNormal, &T );
D3DXVec3Normalize( &T, &T );

// Calculate the new orthogonal binormal
D3DXVec3Cross( &B, &vNormal, &T );
D3DXVec3Normalize( &B, &B );

// Compute the "handedness" of the tangent and binormal. This
// ensures the inverted / mirrored texture coordinates still have
// an accurate matrix.
if ( D3DXVec3Dot( D3DXVec3Cross(&vCross,&vNormal,&T),
&pBinormals[i] ) < 0.0f )
{
// Flip the binormal
B = -B;

} // End if coordinates inverted

// Store these values
v[i].Tangent = T;
v[i].Binormal = B;

} // Next vertex

// Clean up and return
delete []pTangents;
delete []pBinormals;
return true;
}
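
For reference, invoking this routine is then just a single pre-processing call made once the mesh data has been loaded. The buffer names and counts below are placeholders used purely for illustration and are not part of any particular framework:

// Generate per-vertex tangents and binormals once, before the vertex
// buffer is filled (pVertices, pIndices and the counts are placeholders).
CalcTangentSpace( pVertices, nNumVertices, pIndices, nNumIndices / 3 );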

The Normal Mapping Shaders

In our next example effect file we will look at how a normal map texture can be used to begin to
approach the kinds of stunning visuals we see in today’s per-pixel lit games. This time around we will
shake things up again and instead of using another fixed directional light source, we will allow the
application to provide attributes (position, color, and range) for a single point light.

Let us first define the parameters we are going to need. We will need the application to pass the world space position, diffuse color, and range of the point light. We will also need the world matrix of the object (along with its inverse transpose, used to transform the normal, tangent, and binormal) so that we can transform the model space vertex position into world space and calculate the world space light vector (from the vertex to the light position). We will also need a combined view/projection matrix to finally transform the world space vertex position into clip space.

// Parameters filled out by our application
// Matrix Parameters
float4x4 WorldMatrix;
float3x3 WorldITMatrix; // Inverse transpose
float4x4 ViewProjectionMatrix;

// Light Parameters
float3 LightPosition;
float4 LightDiffuse;
float LightRange;

We will also define two textures and their matching samplers because each primitive we render will now
have two textures applied: a diffuse map and a normal map. In this example we are assuming that the
diffuse map and the normal map share the same mapping with the surface and as such, the diffuse map
texture coordinates can be used to sample both textures inside the pixel shader.

// Textures and Sampler Definitions
texture BaseMap;
texture NormalMap;

// Base Map
sampler BaseMapSampler = sampler_state
{
Texture = <BaseMap>;
MinFilter = Linear;
MagFilter = Linear;
MipFilter = Linear;
AddressU = Wrap;
AddressV = Wrap;
};

// Normal Map
sampler NormalMapSampler = sampler_state
{
Texture = <NormalMap>;
MinFilter = Linear;
MagFilter = Linear;
MipFilter = Linear;
AddressU = Wrap;
AddressV = Wrap;
};

With our constant parameters defined, let’s see what the vertex shader input and output structures will
look like now.

struct VertexShaderInput
{
float3 Position : POSITION;
float3 Normal : NORMAL;
float3 Tangent : TANGENT;
float3 Binormal : BINORMAL;
float2 TexCoords : TEXCOORD0;
};

struct VertexShaderOutput
{
float4 Position : POSITION;
float2 TexCoords : TEXCOORD0;
float3 LightDir : TEXCOORD1;
};

The input structure looks much as we would imagine and mirrors the layout of our vertex in the vertex
stream. With respect to the output structure, the first member will obviously be used to return the clip
space vertex position, as always. The first set of output texture coordinates will be used to pass along the
actual texture coordinates defined for the vertex. As before, we will simply copy these over from the
vertex stream with no changes. The second set of texture coordinates will be used to output the tangent space light direction vector (pointing from the vertex toward the light).
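
On the application side, the vertex format that feeds this input structure must of course expose matching TANGENT and BINORMAL elements. The following is only a sketch of what such a Direct3D 9 vertex declaration might look like; the byte offsets assume that the CVertex members are tightly packed in the order position, normal, tangent, binormal, texture coordinates, which may not match your own layout.

// Hypothetical vertex declaration matching VertexShaderInput. Offsets assume
// float3 Pos, float3 Normal, float3 Tangent, float3 Binormal, float2 UV.
D3DVERTEXELEMENT9 Declaration[] =
{
    { 0,  0, D3DDECLTYPE_FLOAT3, D3DDECLMETHOD_DEFAULT, D3DDECLUSAGE_POSITION, 0 },
    { 0, 12, D3DDECLTYPE_FLOAT3, D3DDECLMETHOD_DEFAULT, D3DDECLUSAGE_NORMAL,   0 },
    { 0, 24, D3DDECLTYPE_FLOAT3, D3DDECLMETHOD_DEFAULT, D3DDECLUSAGE_TANGENT,  0 },
    { 0, 36, D3DDECLTYPE_FLOAT3, D3DDECLMETHOD_DEFAULT, D3DDECLUSAGE_BINORMAL, 0 },
    { 0, 48, D3DDECLTYPE_FLOAT2, D3DDECLMETHOD_DEFAULT, D3DDECLUSAGE_TEXCOORD, 0 },
    D3DDECL_END()
};

// Create the declaration object (pD3DDevice is assumed to be a valid device).
IDirect3DVertexDeclaration9 * pDeclaration = NULL;
pD3DDevice->CreateVertexDeclaration( Declaration, &pDeclaration );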

The first thing the shader does is transform the vertex position into world space using the application-provided world matrix, and the normal into world space using the inverse transpose of that matrix (WorldITMatrix). We make sure to re-normalize the world space vertex normal.

VertexShaderOutput NormalMapVertexShader( VertexShaderInput In )
{
VertexShaderOutput Out = (VertexShaderOutput)0;

// Transform the position and normal into world space
float3 WorldPosition = mul( float4(In.Position, 1), WorldMatrix);
float3 Normal = normalize( mul( In.Normal, WorldITMatrix) );

Now it's time to build a transformation matrix that describes the tangent space coordinate system stored at this vertex.

// Compute our tangent to world space transform matrix (we will use its
// transpose later to go from world space into tangent space)
float3x3 TangentToWorldSpace;
TangentToWorldSpace[0] = normalize( mul( In.Tangent, WorldITMatrix ) );
TangentToWorldSpace[1] = normalize( mul( In.Binormal, WorldITMatrix ) );
TangentToWorldSpace[2] = Normal;

As you can see, we do this by transforming the tangent and binormal vectors into world space (the normal is already in world space at this point) and storing the three resulting world space axis vectors in the three rows of our local tangent space matrix. At this point, we can think of this matrix as a means for transforming vectors from tangent space into world space, but what we want is the opposite -- a matrix that will transform our world space vectors into tangent space. Fortunately, we know that we can do so by simply multiplying our vectors with the transpose of this matrix, which you will see happen in a moment. (Notice in the above code another way in which we can access and set the components of a matrix. In this example we are using array access to set each row of the matrix as a single 3D vector.)

Note: The transformation of the binormal and tangent vectors to world space should be done using the
inverse transpose of the world matrix, just as we discussed earlier for normals.

Our next step is to calculate the incident light vector that will be used in the pixel shader to calculate our lighting. Since a point light emits light equally in all directions, we can calculate this vector by subtracting the world space vertex position from the world space light position (now you know why we took the intermediate step of transforming the vertex position into world space rather than straight into clip space):

// Compute light direction
float3 LightDir = LightPosition - WorldPosition;

Now we can go ahead and transform the world space vertex position into a final clip space position by
multiplying it with the combined view/projection matrix, making sure we store the result in the
appropriate output parameter.

// Transform to clip space
Out.Position = mul(float4(WorldPosition, 1), ViewProjectionMatrix);

Our next step is to transform the world space incident light vector into tangent space. We currently have
a matrix that transforms from tangent space into world space, but we need the opposite. As it happens,
rather than manually transposing our matrix, we can simply reverse the order in which we supply the
vector and the matrix to the mul intrinsic function. This will automatically cause the vector/matrix multiplication to be performed as if the matrix were transposed. Very handy indeed.

// Transform the light direction into tangent space using the transpose
// mul( Matrix, Vector ) == Transposed Transform
Out.LightDir.xyz = mul( TangentToWorldSpace, LightDir );

The resulting tangent space light vector is stored in the appropriate output parameter. Notice that we do
not normalize the vector. This is because we want to maintain its length so that we can perform the point
light attenuation computation with respect to the light’s range in the pixel shader. It is important to bear
in mind that the tangent space transform is just a reorientation of the vector (i.e., just a rotation – notice
the 3x3 nature of the matrix), which means that its length is preserved, post-transformation.

Finally, we copy the texture coordinates over from the input stream that we will use for sampling our
diffuse and normal maps in the pixel shader.

// Copy the texture coordinates through
Out.TexCoords = In.TexCoords;

// Send the output
return Out;
}

Our vertex shader is now complete. It outputs a clip space position, a tangent space lighting vector, and
a set of texture coordinates. We can use the same output structure as the input for our pixel shader,
shown below in its entirety.

float4 NormalMapPixelShader( VertexShaderOutput In ) : COLOR
{
// Sample our diffuse texture
float4 DiffuseColor = tex2D( BaseMapSampler, In.TexCoords );

// Sample the normal for this pixel from the normal map
float3 Normal = tex2D( NormalMapSampler, In.TexCoords ).rgb;
Normal = normalize( 2.0f * Normal - 1.0f );

// Compute the length of the light direction vector (distance to pixel)
// and then use it for linear attenuation
float fLength = length( In.LightDir );
float fAtten = saturate( 1.0f - ( fLength / LightRange ) );

// Normalize the light direction
float3 L = In.LightDir / fLength;

// Calculate the lambertian cosine (N dot L)
float fNDotL = saturate( dot( Normal, L ) );

// Calculate final color
return DiffuseColor * LightDiffuse * fAtten * fNDotL;
}

In the first line we sample the diffuse map using the interpolated base map texture coordinates and store
the resulting color in the DiffuseColor temporary variable. We then call the tex2D intrinsic function
again using the same texture coordinates, but this time to sample a texel from the normal map, storing
the sampled RGB components into a temporary variable of type float3 called Normal. We follow with:

Normal = normalize( 2.0f * Normal - 1.0f );

This line of code is responsible for converting the unsigned color values sampled from the normal map
into the [-1, 1] range so that it represents our original normal once again. Why does this work? Well,
when the normal map was generated, the normal vectors (originally in the range [-1, 1]) were mapped
into the [0, 1] range so that they could be stored in a texture using an unsigned texel format (technically,
the [0, 1] range value will be converted to the [0, 255] range for an unsigned 8-bit per channel integer
texture format, but pixel shaders always get back the results in the original [0, 1] range). This would
have been fundamentally achieved during normal encoding using the formula '(Normal.xyz + 1) / 2' or
some variant of it. As you can see therefore, the code used in the shader is intended simply to reverse the
effects of this conversion.
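
As a concrete illustration of that encode step, a tool that writes normal maps might use a small helper along the following lines. This is an assumed sketch for illustration only, not code taken from any particular tool:

// Map a signed normal component from [-1, 1] into an unsigned 8-bit channel.
// The shader's "2.0f * Normal - 1.0f" simply reverses this mapping.
unsigned char EncodeNormalComponent( float n )
{
    float c = ( n + 1.0f ) * 0.5f;                 // [-1, 1] -> [0, 1]
    return (unsigned char)( c * 255.0f + 0.5f );   // [0, 1]  -> [0, 255]
}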

After we have sampled, unpacked, and renormalized the per-texel normal, we then compute the distance
between the light source and the pixel by using another new HLSL intrinsic called length. This function,
like its D3DX counterparts D3DXVec*Length, returns the magnitude of the floating point vector
supplied to its only parameter (which can be of any vector type, including float, float2, float3 or float4).
In this case, the length returned represents the magnitude of the input incident light direction vector
computed in the vertex shader, and interpolated per pixel. Although this is a tangent space vector with
respect to direction, its length in tangent space will be no different than its length in world space because
the tangent space transform simply performed a reorientation of the vector (again, rotation only -- no
scaling or translation takes place). We will use this distance, along with the light’s range, to generate a
linear attenuation value that we use later on to adjust the intensity of our final lighting result. This
distance (length) also provides us with what we need to normalize the light direction for the purposes of
computing our Lambertian term. Since we have already paid the price for the square root computation
when we asked for the length of the vector, we can optionally avoid using the normalize intrinsic (which
would internally compute the vector magnitude a second time) by simply dividing it by the length we
just computed.

The rest is simple enough. Now that we have a normalized light direction and a pixel normal in tangent
space, we can perform a dot product between them to produce the Lambertian cosine scalar (stored in
the variable fNDotL in this example). We then merge this with our distance attenuation factor and modulate by the light's diffuse color and the color we sampled earlier from the diffuse map (our reflectance), and we are done.
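
All that remains in the effect file is a technique that binds these two functions together into a renderable pass. A minimal sketch is shown below; the vs_2_0 / ps_2_0 compile targets are an assumption and can be raised to whichever shader model your project requires.

// A minimal technique definition for this effect (shader model 2.0 assumed).
technique NormalMapping
{
    pass p0
    {
        VertexShader = compile vs_2_0 NormalMapVertexShader();
        PixelShader  = compile ps_2_0 NormalMapPixelShader();
    }
}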

This is actually the first of many normal mapped shader implementations we will encounter in this
course, and we really just wanted to whet your appetite here. While the code we use will be slightly
different as we move forward in later lessons, the core ideas will all remain exactly the same. Given how
important a technique this is, it would probably be a good idea for you to go back and quickly give this
whole section one more read. The concepts we saw here are going to come up time and time again, so
please be sure that you are comfortable with normal mapping before you move on to the later lighting
chapters (where it will be assumed you are familiar with all of this material).

Conclusion
At this point we have just scratched the surface of what we can do with shaders, but you now have a
solid foundation with respect to the basics of writing HLSL shaders and integrating them with effects.
Fortunately for us, they pretty much look and behave like standard C/C++ functions with extra graphics
functionality, so they are pretty easy to sink our teeth into. Whereas in this chapter we mostly just took steps to replace the old fixed-function transformation and lighting pipeline with a more hand-rolled approach using shaders, as the course progresses you will see shaders being used for all sorts of very interesting new jobs that we were never really able to do very easily before, if at all (e.g., image post-processing, high dynamic range lighting, real-time shadows, and much more). So, this is only a first step,
but it is a fundamentally important one, so be sure you are comfortable with the material presented here
before moving on. Experimenting with the provided lab projects, or even writing your own based on the
information outlined in this chapter, will prove to be a great way of getting a handle on shaders.
