
Master's Thesis in Computer Science

Data Oriented Interactive Water


An Interactive Water Simulation For PlayStation 3

J OEL L ENNARTSSON
June 25, 2012

Examiner: Ingemar Ragnemalm

Supervisor: Jens Ogniewski

Supervisor at DICE: Torbjörn Söderman

Abstract

In this report, a method for simulating interactive height-field based water on a parallel architecture is presented. The simulation is designed for faster than real-time applications and is highly suitable for video games on current generation home computers. Specifically, the implementation proposed in this report targets the Sony PlayStation 3. This platform requires code to be both highly parallelized and data oriented in order to take advantage of the available hardware, which makes it an ideal platform for evaluating parallel code. The simulation captures the dispersive property of water and is scalable from small collections of water to large lakes. It also uses dynamic Level Of Detail to achieve constant performance while at the same time presenting high fidelity animated water to the player. This report describes the simulation method and implementation in detail, along with a performance analysis and discussion.

Contents

1 Introduction
2 Background And Related Works
  2.1 Navier-Stokes Equations
  2.2 Numerical Methods
    2.2.1 Lagrangian Particles
    2.2.2 Eulerian Grids
  2.3 Height Field Methods
    2.3.1 Shallow Water Equations
    2.3.2 Linear Wave Theory
    2.3.3 Wave Equation
  2.4 Related Works
    2.4.1 Gerstner Waves
    2.4.2 Fast Fourier Transforms
    2.4.3 Semi-Lagrangian Method
    2.4.4 Wave Particles
    2.4.5 Detailed Flow
    2.4.6 Choppy Waves
3 Architectures
  3.1 PC
    3.1.1 Hardware
    3.1.2 Performance
  3.2 Xbox 360
    3.2.1 Hardware
    3.2.2 Performance
  3.3 PlayStation 3
    3.3.1 Hardware
    3.3.2 Performance
4 Parallelization And Optimization
  4.1 Parallel Methods
    4.1.1 Task Parallelism
    4.1.2 Data Parallelism
  4.2 Data Oriented Design
    4.2.1 Object Oriented Programming
    4.2.2 Cache Inefficiency of OOP
    4.2.3 Structure Of Arrays
  4.3 Optimization Details
    4.3.1 SIMD Vectorization
    4.3.2 Software Pipelining
    4.3.3 Careful Branching
    4.3.4 Load-Hit-Stores
5 Algorithm
  5.1 Earlier Work
    5.1.1 Surface Cells
    5.1.2 Dispersion By Convolution
    5.1.3 Kernel Approximation
    5.1.4 Laplacian Pyramids
    5.1.5 Grid Summation
    5.1.6 Level Of Detail
    5.1.7 Interaction
    5.1.8 Stitching
  5.2 Implementation Issues
    5.2.1 Parallelism
    5.2.2 Scaling
    5.2.3 Border Copying
    5.2.4 Data Locality
  5.3 Improved Algorithm
    5.3.1 Homogeneous Grids
    5.3.2 Quad-trees
    5.3.3 Large Cells
    5.3.4 Memory Layout
6 Implementation
  6.1 Engine
    6.1.1 Frame Overview
  6.2 Simulation
    6.2.1 Data Layout
    6.2.2 Grid Dimensions
  6.3 Interaction Setup
  6.4 Frame Setup
    6.4.1 Level Of Detail
    6.4.2 Fading
  6.5 Update Passes
    6.5.1 SPE Jobs
    6.5.2 Applying Disturbances
    6.5.3 Wave Propagation
    6.5.4 Border Copying
    6.5.5 Grid Summation
  6.6 Rendering
    6.6.1 Drawing
    6.6.2 Dispatch
7 Results And Analysis
  7.1 Previous Implementation
  7.2 Parallel Implementation
  7.3 Analysis
8 Discussion
  8.1 Design
    8.1.1 Homogeneous Grids
    8.1.2 Improved Level Of Detail
    8.1.3 Fading
  8.2 Future Implementations
    8.2.1 Enhanced Effects
    8.2.2 Ambient Waves
    8.2.3 Customized Interaction
    8.2.4 Situational Level Of Detail
    8.2.5 Better Boundary Conditions
    8.2.6 Non-Linear Texture Mapping
    8.2.7 Performance Efficient Flow Simulation
    8.2.8 Minimized Vertices
    8.2.9 Mesh Reduction
    8.2.10 Unified Quad-tree
    8.2.11 Multiple Convolutions
  8.3 Conclusion
Bibliography

Chapter 1

Introduction
This is a thesis report for a bachelor's degree in computer science at Linköping University, Sweden, written in 2012. The work for this thesis was done at EA DICE (DICE) and implemented in the Frostbite 2 game engine. This thesis is the continued work of a master's thesis by Björn Ottosson [Ott11], also done at DICE, which presented a height-field based method for simulating real-time interactive water with the dispersion property. Ottosson's work was prototyped inside the Frostbite 2 engine developed at DICE and managed to simulate and render high quality water waves in under 3 milliseconds per frame on a single Intel Xeon core. On a system where multiple processing cores are available, however, which has long since been the norm, the simulation will not utilize more than one core. On the PC platform, running on only a single core is not a huge issue since the water simulation can be included as a graphics option for high-end PCs. However, this is not possible on consoles, where the hardware is identical for all users and resources are sparse. Since DICE produces multi-platform games, the simulation needs to work, with similar visual quality, on at least PC, Microsoft Xbox 360 and Sony PlayStation 3, preferably still under 3 milliseconds on each platform. Because of the linear nature of programs and code, it is often difficult to properly utilize all available cores on a system; it is easy to just let most of the program run on the main processor. The assumption is that all three platforms would benefit from a simulation which is able to use multiple processing cores simultaneously, mainly because secondary processors might go unused otherwise, but also to let the main thread perform other important tasks. This is especially true on the PlayStation 3 since it, along with six specialized vector processors, only has one general purpose core available.
The goal for this thesis work was therefore to produce an adaptation of the existing simulation better suited for parallel architectures. This includes redesigning large portions of the previous algorithm, mostly changing implementation details but also how the algorithm works in general. For this assignment, the PlayStation 3 was chosen as the target platform for implementation since its architecture places such high demands on how well the code is parallelized. Another reason for this decision is that a simulation working on this platform is reasonably simple to port to other parallel architectures afterwards. The result is a high quality interactive water simulation, running at good performance, utilizing all of the PlayStation 3's vector processors simultaneously. This report contains some general background information on real-time water simulation along with an overview of both the previous and the parallelized algorithm. Techniques and theory used in the simulation are described in detail, along with a description of the implementation. There is also a chapter dedicated to data oriented programming, since this is of high importance when optimizing for performance in games. Results are presented in detail along with a performance analysis and comparisons with the previous water simulation. Finally, there is a discussion around possible ways to improve the simulation in the future and what needs to be done to get it ready for production.

Chapter 2

Background And Related Works


Computer generated water in games and movies has always been a popular field of research since water occurs so naturally in many scenes and situations. Complex behavior and appearance in everything from small puddles to large windy oceans helps to build an immersive experience for the viewer. Waves, splashes, refraction and reflections are all examples of important attributes that contribute to the illusion of water. These are all complex properties that, if not emulated in a convincing way, can quickly make the water look unrealistic. Because water surrounds us daily, we are very good at determining how a body of water should behave and look in different situations. Realistically simulating every aspect of water, however, is very computationally demanding. Water simulation has been used to great extent in the movie industry. Movies like Titanic and Waterworld were groundbreaking with their renditions of ocean water at a high degree of realism [Tes04b], and the industry has since produced increasingly realistic computer generated water. The movie industry has the luxury of being able to simulate and render water off-line, performing the simulation and rendering of frames on clusters of computers over a period of time, much, much slower than the speed at which the movie is played back to the audience. This is why water in movies can look incredibly realistic, and it gives an enormous advantage in visual quality over games, which have to be able to render water in real-time. In fact, games usually incorporate other elements which share the same available performance, so the water animation needs to be much faster than real-time. Another advantage of movie water is that the behavior of the water is completely deterministic and can be customized to each event in a scene. While this is possible for games as well, the water animation cannot be precomputed if the player is able to interact with it in any way.
To allow such interactivity, the water must be able to realistically simulate waves depending on the actions of the player. With the additional computational costs of an interactive simulation, it becomes increasingly difficult to produce a high quality water simulation within the time constraints that a game enforces. Even if the water simulation is kept within the allotted time frame, it needs to be able to adjust its performance cost according to how much, and in what detail, water is visible in a scene. The process of determining at which fidelity to present elements to the player is referred to as Level Of Detail. Presumably, the player is also free to move around and look at other things than water, so the simulation should dynamically scale accordingly. Level Of Detail allows a highly detailed presentation of the water while also freeing up resources when possible.

2.1

Navier-Stokes Equations

For most fluid simulations, realistically calculating the flow of a fluid is a process of numerically solving the Navier-Stokes equations (2.1) and the volume conservation equation (2.2). These equations describe the motion of fluid substances and the physical relations between attributes such as viscosity, density and fluid pressure. While generally considered a good model of fluid dynamics, the equations are costly to solve numerically. Simplifications to the model are therefore necessary for them to be usable in real-time applications. In a game, the purpose of a fluid simulation is often only to render visually convincing fluid animation. In order to reduce complexity, the physical correctness of the simulation can therefore be compromised without the player noticing [Sch07]. For many fluid simulations, the Navier-Stokes equations can be reduced to the Incompressible Navier-Stokes equations (2.3) and the volume conservation equation (2.4). The following formulae describe the Navier-Stokes equations and the volume conservation equation:

\[ \rho\left(\frac{\partial v}{\partial t} + v \cdot \nabla v\right) = -\nabla p + \nabla \cdot T + f \]  (2.1)

\[ \frac{\partial \rho}{\partial t} + \nabla \cdot (\rho v) = 0 \]  (2.2)

Where v is the flow velocity, \rho is the fluid density, p is the pressure, T is the stress tensor and f represents body forces. \nabla is the vector of all partial derivatives. The following formulae describe the Incompressible Navier-Stokes equations and the volume conservation equation:

\[ \frac{\partial v}{\partial t} + v \cdot \nabla v = -\frac{1}{\rho}\nabla p + \nu \nabla^2 v + f \]  (2.3)

\[ \nabla \cdot v = 0 \]  (2.4)

Where \nu is the constant viscosity term and \nabla^2 the vector Laplacian. These equations assume a constant, homogeneous density across the whole fluid body and replace the stress tensor term with the viscosity term. For the majority of water waves, the viscosity is assumed to be zero and as such, the viscosity term can be eliminated.

2.2

Numerical Methods

To be able to numerically integrate the Navier-Stokes equations, the simulation must be able to model the movement of fluid. The two main categories of methods for modeling water in simulations are Eulerian Grids and Lagrangian Particles [DYQKEH10]. While Lagrangian methods are able to model water very realistically, Eulerian methods are popular in games because of their high performance benefits. However, the latter can easily become unstable if the duration of time between two simulated frames, the time step, is too great.

2.2.1

Lagrangian Particles

With Lagrangian Particles, the water is simulated as a system of discrete particles that interact with each other. Each particle represents a small body of water that collides with and is attracted to other particles depending on mass and speed. Because each particle of water is simulated explicitly, it is trivial to preserve mass and energy, which makes it a very stable model. Performance-wise for games, it is difficult to implement a real-time simulation solely based on Lagrangian Particles, since a very large number of particles is required to render water of decent visual quality. The simulation can also easily become wasteful for sections of water that are not visible or not in motion. This model is used in the software application RealFlow [rea] to perform incredibly realistic off-line rendering of water.

2.2.2

Eulerian Grids

In this method, a uniform grid of cells is used as a fixed frame of reference. Water movement is simulated by keeping track of fluid properties such as velocity, density and pressure in each cell. Simulation is done by integrating the values of each cell based on the time step, and this approach is able to simulate large bodies of water very efficiently depending on the resolution of the grid. When using Eulerian grids, the conservation of mass and energy must be taken into special account, since it is not handled implicitly as is the case with particle systems [Kal08]. The length of the time step is also an important factor in this model. The frame rate in games might vary, and a large time step might lead to an overestimation of the values in a cell when integrating. In such a situation, multiple sequential integration errors can cause additional energy to be generated, which might result in the simulation exploding.

The algorithm in this thesis uses an Eulerian Grid approach to simulate water surfaces. This numerical method was chosen for simplicity and performance.

2.3

Height Field Methods

Height field methods are a specialization of Eulerian grids in which a uniform, two-dimensional grid is used in conjunction with height values to produce a topographic surface. It is assumed that simulating the motion of whole volumes of resting water is unnecessary, since such motion remains largely invisible to the viewer [Lov02]. Instead, height-field methods simplify water simulation by only modeling the surface of the water as a height function of spatial surface coordinates. By using 2D grids to simulate 3D water volumes, the complexity of the simulation is effectively reduced by one dimension. Using this technique in games, high quality water animation can be achieved at low performance cost. The height-field constraint does, however, limit the simulation by not allowing breaking waves and spray, since the water cannot have more than one height level for a given surface coordinate.
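To make the representation concrete, a height field can be sketched as a uniform 2D array of height samples from which 3D surface vertices are reconstructed on demand. The class and names below are hypothetical illustrations, not taken from the thesis implementation:

```python
import numpy as np

# A height field stores one height per (x, z) surface coordinate on a
# uniform 2D grid; the full 3D surface point is reconstructed on demand.
class HeightField:
    def __init__(self, nx, nz, cell_size):
        self.cell_size = cell_size
        self.height = np.zeros((nx, nz))  # one float per cell

    def surface_point(self, i, j):
        # Map grid indices to a 3D vertex: x/z from the grid, y from the field.
        return (i * self.cell_size, float(self.height[i, j]), j * self.cell_size)

field = HeightField(4, 4, cell_size=0.5)
field.height[2, 1] = 0.3              # a small wave crest
print(field.surface_point(2, 1))      # (1.0, 0.3, 0.5)
```

Note how a single float per cell suffices; this one-height-per-coordinate layout is exactly why overhanging or breaking waves cannot be expressed.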

2.3.1

Shallow Water Equations

A height-field model commonly used in oceanic modeling is the Shallow Water Equations, which assume that the length of the water waves is significantly larger than the mean water depth. The Shallow Water Equations are:

\[ \frac{\partial v}{\partial t} + g \nabla h + (v \cdot \nabla) v = 0 \]  (2.5)

\[ \frac{\partial h}{\partial t} + \nabla \cdot \left((h + b)\, v\right) = 0 \]  (2.6)

Where v is the horizontal flow velocity, g is the acceleration of gravity, h the water height from the mean surface level and b the water depth from that level. These equations are derived from the incompressible Navier-Stokes equations (2.3), assuming zero viscosity, but ignore the flow perpendicular to the water surface [DYQKEH10]. Only horizontal flow is taken into account and as such, rivers and water currents can be simulated. Because of the wave length assumption, however, the model is only suitable for simulating the movement of water on a macro scale, such as tidal waves. Smaller waves break that assumption and will be simulated incorrectly.

2.3.2

Linear Wave Theory

Linear Wave Theory is a different height-field model that assumes that the surface displacement is relatively insignificant compared to the mean water depth. It also assumes incompressibility and zero viscosity, but contrary to the Shallow Water Equations, horizontal flow is not modeled. It is therefore unable to

simulate moving water across a surface, but makes no assumption regarding the wave length in relation to water depth. This makes the model suitable for simulating a large range of resting water volumes, from oceans to swimming pools. Linear Wave Theory gives the following formula for the wave speed c:

\[ c = \sqrt{\frac{g}{k} \tanh(kh)} \]  (2.7)

Where g is the acceleration of gravity, k the angular wave number and h the water depth. For shallow and deep water respectively, the formula can be split into equations 2.8 and 2.9 depending on the wave length \lambda:

\[ c = \sqrt{gh}, \quad \lambda \gg h \]  (2.8)

\[ c = \sqrt{\frac{g}{k}}, \quad \lambda \ll h \]  (2.9)

Dispersion And Kelvin Wakes

An important property captured by Linear Wave Theory is the relation between wave speed, wave length and water depth. Unlike sound waves, water waves propagate at different speeds depending on the wave length, and become slower as the wave frequency increases [Lov02]. This property is known as dispersion and becomes significant when the wave length is significantly smaller than the mean water depth, see equation 2.9. For simulating realistic ocean waves this property is essential, and it is responsible for the characteristic V-shapes of the waves in the wakes of moving ships. These are referred to as Kelvin Wakes.
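To make the two limits concrete, the sketch below (plain Python, not from the thesis) evaluates equation 2.7 and compares it against the shallow- and deep-water approximations 2.8 and 2.9:

```python
import math

def wave_speed(k, h, g=9.81):
    """Phase speed from Linear Wave Theory, equation 2.7: c = sqrt(g/k * tanh(kh))."""
    return math.sqrt(g / k * math.tanh(k * h))

depth = 10.0
# Shallow-water limit (wave length >> depth): tanh(kh) ~ kh, so c ~ sqrt(g*h).
c_shallow = wave_speed(k=0.001, h=depth)
# Deep-water limit (wave length << depth): tanh(kh) ~ 1, so c ~ sqrt(g/k).
c_deep = wave_speed(k=10.0, h=depth)
# Dispersion in deep water: shorter waves (larger k) travel slower.
print(c_shallow, c_deep)
```

Evaluating the full formula once per wave number like this also shows why dispersion matters in deep water: doubling k lowers c by a factor of about sqrt(2), whereas in the shallow limit c depends on depth alone.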

2.3.3

Wave Equation

The Wave Equation is similar to the equations derived from Linear Wave Theory, but assumes that the wave speed is constant for all wave lengths:

\[ \frac{\partial^2 u}{\partial t^2} = c^2 \nabla^2 u \]  (2.10)

Where u is the water height and c is the constant wave speed. Without the dispersive property, a group of waves composed of multiple frequencies propagates through the medium without deformation. This is suitable for modeling light and sound waves, but does not handle water waves realistically [Lov02]. Despite being a technically incorrect simulation of water, the Wave Equation has been widely used in games for interactive water because of its low demand on resources. A visual comparison between Linear Wave Theory and the Wave Equation can be found in figure 2.1.
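A minimal explicit finite-difference integration of equation 2.10 on a periodic Eulerian grid might look as follows. This is an illustrative sketch, not the scheme used in the thesis; it also shows the kind of time-step restriction (CFL condition) that makes explicit grid methods unstable for large time steps:

```python
import numpy as np

def wave_step(u, u_prev, c, dt, dx):
    """One explicit step of the 2D wave equation (2.10): u_tt = c^2 * laplacian(u).
    Stable only while the CFL condition c*dt/dx <= 1/sqrt(2) holds."""
    lap = (np.roll(u, 1, 0) + np.roll(u, -1, 0) +
           np.roll(u, 1, 1) + np.roll(u, -1, 1) - 4.0 * u) / dx ** 2
    # Central difference in time: u_next = 2u - u_prev + (c*dt)^2 * laplacian(u).
    return 2.0 * u - u_prev + (c * dt) ** 2 * lap

n = 32
u = np.zeros((n, n))
u[n // 2, n // 2] = 1.0          # a single point disturbance
u_prev = u.copy()                # the surface starts at rest
for _ in range(20):
    u, u_prev = wave_step(u, u_prev, c=1.0, dt=0.01, dx=0.1), u
```

With this scheme every frequency travels at the same speed c, which is precisely the non-dispersive behavior visible in the right image of figure 2.1.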


Figure 2.1: A comparison between simulations using Linear Wave Theory (left image) and the Wave Equation (right image). Waves are generated by a point-shaped object moving from left to right at constant speed. Darker values represent negative wave amplitudes while brighter values represent positive. Image courtesy of Ottosson [Ott11].

2.4

Related Works

Much research has been focused on simulating water realistically, both in engineering for physics simulation purposes and for visual presentation in movies and games. While a full review of the field of water simulation is out of scope for this report, this section brings up a few novel methods presented in other articles. For a good historical perspective on the research done within fluid simulation, see Computer Graphics For Water Modeling And Rendering: A Survey [Igl04].

2.4.1

Gerstner Waves

In 1986, Fournier and Reeves presented what is possibly the first application of Gerstner Waves in the computer graphics field [FR86]. Gerstner Waves approximate the movement of points along a water surface as sinusoidal functions of the amplitude, direction and wavelength of traveling waves [Tes04b]. This results in surface points moving in elliptical motions when affected by waves, and gives the surface a choppy look with sharp tops and flattened valleys. Since the motions are dependent on wave length, the dispersion property is also modeled, along with easy detection of breaking waves or spray. Gerstner Waves also model wave refraction along shores, which means that the elliptical motions become smaller over shallow depths. This gives ocean waves the characteristic behavior where incoming wavefronts align with the shoreline.


2.4.2

Fast Fourier Transforms

A common technique for height-field methods that employ uniform grids is to propagate waves using the Fourier domain. Ordinarily, a surface is represented by discrete spatial height values, one for each cell. Assuming the surface is the result of super-positioned sinusoidal frequencies with different amplitudes and phases, the height field can be transformed into the Fourier domain, or frequency space. In this domain the surface is instead represented by complex values holding the amplitude and phase of all present wave frequencies, sampled at the same resolution as the original height field. Performing wave propagation in the Fourier domain is done at low performance cost simply by modifying the phase of each frequency. Insomniac Games uses this method to simulate dispersive waves [Day09], implemented in the game Resistance 2. With the invention of the discrete Fast Fourier Transform, a highly optimized algorithm for transforming a sampled signal into the Fourier domain, Fourier based methods became viable for real-time applications. Jensen and Golias [JG01] present an implementation of this method for use in games. Since the sine waves described in the Fourier domain are periodic, this method is suitable for rendering large oceans, where the waves for a surface patch can be calculated once and then tiled over the whole surface. Using this method for interactive water can prove difficult, however, since dynamic local distortions are not periodic in nature.
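As an illustration of the idea (not the implementation of any of the cited works), a periodic 1D height field can be advanced by rotating each frequency's phase, here with the deep-water dispersion relation omega(k) = sqrt(g*k):

```python
import numpy as np

def propagate_fft(height, dt, dx, g=9.81):
    """Advance a periodic 1D height field by dt with a per-frequency phase
    shift, using the deep-water dispersion relation w(k) = sqrt(g*k)."""
    spectrum = np.fft.rfft(height)
    k = 2.0 * np.pi * np.fft.rfftfreq(height.size, d=dx)  # angular wave numbers
    spectrum *= np.exp(-1j * np.sqrt(g * k) * dt)         # rotate each phase
    return np.fft.irfft(spectrum, n=height.size)

x = np.linspace(0.0, 2.0 * np.pi, 64, endpoint=False)
h = np.sin(4.0 * x)                        # one traveling wave component
h_next = propagate_fft(h, dt=0.1, dx=x[1] - x[0])
```

Because only phases change, amplitudes (and thus wave energy) are preserved exactly, and each frequency travels at its own speed, giving dispersion for free. The periodicity that makes the result tileable is also what makes local, non-periodic disturbances awkward to fit into this scheme.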

2.4.3

Semi-Lagrangian Method

A problem with Eulerian grid methods is that in each simulated step, data can only be transferred between immediate neighbors on the grid. When simulating water flow of high velocity, where mass might be transferred over more than one grid cell per step, the ordinary grid model is insufficient. A Semi-Lagrangian method also uses a uniform grid of cells, but calculates displaced mass, or advection, via tracer particles. For every advection step, a temporary tracer particle is simulated for each cell, which travels backwards along the flow vector field of the surface, subtracting mass from the calculated origin and adding that mass to the current cell. Kallin [Kal08] uses a modified version of the Shallow Water Equations together with a Semi-Lagrangian grid to simulate rivers flowing over arbitrary terrain. All grid methods, including the Semi-Lagrangian, suffer from dissipation when performing advection, that is, loss of data when adding mass to non-cell-centered locations. Kim et al. [KLLR07] present an error correcting algorithm that compensates for the data loss, but such an algorithm might significantly lower performance.
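A gather-style sketch of backward tracing in 1D (hypothetical names; the cited works operate on 2D grids with full velocity fields):

```python
import numpy as np

def advect(field, velocity, dt, dx):
    """Semi-Lagrangian advection: for each cell, trace a particle backwards
    along the flow and sample the field at its origin. The linear
    interpolation here is the source of the dissipation mentioned above."""
    n = field.size
    origin = (np.arange(n) - velocity * dt / dx) % n   # backtraced position
    i0 = np.floor(origin).astype(int)
    frac = origin - i0
    return (1.0 - frac) * field[i0] + frac * field[(i0 + 1) % n]

mass = np.zeros(8)
mass[2] = 1.0
# A flow of 2.5 cells per step moves mass across several cells at once,
# which plain neighbor-to-neighbor transfer on the grid cannot represent.
moved = advect(mass, velocity=2.5, dt=1.0, dx=1.0)
```

The backtraced origin of 2.5 cells lands between two samples, so the unit of mass ends up split across two cells: this smearing at non-cell-centered locations is exactly the dissipation the error correcting scheme of Kim et al. tries to undo.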

2.4.4

Wave Particles

A novel approach to the Lagrangian method, called Wave Particles, is proposed by Yuksel [YHK07]. This can be thought of as a sparse system of Lagrangian

Particles where particles are simulated over a static height field. It differs from other Lagrangian Particle systems by only simulating particles where there are surface waves. Since particles are only created when disturbing the water, a calm body of water means that no particles need to be simulated. Yuksel uses the Wave Equation for propagating Wave Particles, which are splatted onto the height field before rendering. This can be performed on top of an existing static height-field water simulation to add interactivity.

2.4.5

Detailed Flow

Even if there is not enough performance available to include a physics based water simulation in a game, water animation can still be achieved by other means. Early games, for example, used scrolling textures on top of static meshes to give the impression of flowing water. Vlachos [Vla10] describes a similar technique, used in Portal 2 and Left 4 Dead 2, for visualizing currents in water by using a precomputed flow vector map. This method is based on work by Max and Becker [MB96] and uses image advection to distort the normal map of the water surface with great results. With this technique, static objects that intersect the water surface can be mirrored in the flow map, emulating flow around such objects.
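The core of the idea can be sketched as a texture-coordinate computation: scroll the normal-map UVs along the flow vector, and blend two phases offset by half a cycle so each one can snap back invisibly instead of stretching forever. This is a simplified, hypothetical version; the actual shader in [Vla10] additionally staggers the phase per pixel with noise:

```python
def flow_uv(u, v, flow, time, cycle=1.0):
    """Scroll texture coordinates along a per-pixel flow vector. Two phases
    offset by half a cycle are blended so each can reset without a visible
    pop, avoiding unbounded distortion of the normal map."""
    fx, fy = flow
    phase0 = time % cycle
    phase1 = (time + 0.5 * cycle) % cycle
    uv0 = (u - fx * phase0, v - fy * phase0)
    uv1 = (u - fx * phase1, v - fy * phase1)
    # Weight for uv1, peaking exactly when uv0 snaps back (phase0 == 0).
    weight = abs(2.0 * phase0 / cycle - 1.0)
    return uv0, uv1, weight
```

The renderer would sample the normal map at both UV pairs and lerp by the weight, so the discontinuity of each phase reset is always hidden behind a zero blend weight.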

2.4.6

Choppy Waves

Waves simulated using sinusoidal functions have a tendency to look very round and smooth. Large, or steep, ocean waves created during windy conditions do not look like this, but appear more sharp and choppy. Together with Tessendorf [Tes04b], Jensen and Golias [JG01] suggest that the sinusoidal shapes of the waves can be altered to achieve such a choppy look. This is done, after the simulation step, by displacing the height-field vertices near steep waves before rendering the surface. For very steep waves, displaced vertices will start to overlap, causing inverted waves. This can be used for creating breaking waves and foam effects where overlaps occur.
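The displacement idea can be sketched in grid space: push each vertex horizontally along the height gradient so vertices bunch up at crests, sharpening the tops. This is an illustrative stand-in; Tessendorf computes an equivalent offset in the Fourier domain alongside the height spectrum:

```python
import numpy as np

def choppy_offsets(height, dx, choppiness=1.0):
    """Per-vertex horizontal offsets toward nearby crests, taken from the
    central-difference gradient of the height field (periodic borders)."""
    gx = (np.roll(height, -1, 0) - np.roll(height, 1, 0)) / (2.0 * dx)
    gz = (np.roll(height, -1, 1) - np.roll(height, 1, 1)) / (2.0 * dx)
    # Moving vertices uphill bunches them at crests, giving sharp tops;
    # too much choppiness makes neighbors cross over, i.e. inverted waves.
    return choppiness * gx, choppiness * gz

h = np.zeros((8, 8))
h[4, 4] = 1.0                       # a single crest
ox, oz = choppy_offsets(h, dx=1.0)
```

Detecting where displaced vertices cross over their neighbors gives exactly the overlap condition the text mentions for triggering breaking-wave and foam effects.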


Chapter 3

Architectures
With several competing gaming systems on the market, high quality games are often produced for multiple platforms to reach as big an audience as possible. For this reason the Frostbite 2 engine is designed to run on at least PC, Microsoft Xbox 360 and Sony PlayStation 3. Any water simulation in Frostbite 2 should therefore be capable of running on all those platforms at full speed, with comparable visual results. This chapter gives an overview of the architectures targeted by Frostbite 2 in general and by this thesis specifically. The primary positive and negative aspects of each platform are discussed, along with a more detailed description of the PlayStation 3 as a reference for upcoming chapters.

3.1

PC

The PC is the most versatile of all platforms and at the same time the one that puts the largest demands on the hardware compatibility of a published game. The game has to be compatible with a wide range of different hardware and software configurations. Frostbite 2 solely supports the Microsoft Windows operating system, since most PC players use this OS for gaming. While PCs today are available with both 32-bit and 64-bit architectures, consoles still only allow 32-bit applications. Because of this, the 32-bit architecture is often preferred on PCs for multi-platform games.

3.1.1

Hardware

Since the PC is not a closed platform like the consoles, there are many companies that manufacture different systems, and consumers are often free to arrange and replace parts of their PC's hardware in any way they desire. This means that a PC game must be compatible with as many hardware configurations as possible. Usually a minimum requirement on processing power and memory is posed to reduce the range of supported hardware. However, even with a fairly narrow range, there is still room for vendor-specific differences regarding graphics

cards, sound cards, input devices and so on. Frostbite 2 uses the Microsoft DirectX library as a platform interface, which relieves the developer of many of these problems by abstracting the hardware.

Memory

Memory on a PC is often available in abundance compared to current generation consoles, and even if there is not enough free physical memory, a hard drive can be used, at the cost of latency, as swap space. It is not uncommon for computer games today to use 1-2 GB of memory, and with 64-bit operating systems the physical memory available on a system will most likely not be a limiting factor. Depending on the CPU, a PC often has a variety of techniques and hardware for automatically reducing the issues inherent in memory management. For example, when fetching memory that needs to be loaded into the cache, a CPU with Out-Of-Order Execution can execute independent instructions ahead of time. This reduces stalls that would otherwise result from waiting for cache operations. The downside of such techniques is that the developer might be unaware of the performance impact certain code might have on other platforms.

3.1.2 Performance

With the variety of hardware available, performance on the PC platform differs greatly between consumers. For consumers who do not own the latest in PC gaming technology, it is important to keep the minimum hardware requirement as low as possible. At the same time, the PC market is highly focused on the audiovisual quality of a game, so for consumers with more processing power and memory available, a game should try to take advantage of that performance. This puts high demands on customizability that do not exist on consoles. The user should be able to control the quality of the experience by adjusting settings for graphics and audio. Perhaps most importantly, there should be options available for configuring keyboard and mouse.

3.2 Xbox 360

With the Microsoft Xbox 360 controlling a substantial share of the console market, it is naturally one of the main platforms for the Frostbite 2 engine. The Xbox 360 hardware layout is largely similar to a PC, which generally makes code developed for Windows and DirectX easy to port. The biggest difference compared to a PC is that, being a console, all systems for all consumers are alike performance-wise. The immediate benefit of this is of course that a game that works on one system will work on all, which removes a lot of compatibility issues. Working with the same hardware over a long period of time also tends to bring out more efficient ways of utilizing the hardware during the lifespan of the console.

3.2.1 Hardware

The Xbox 360 is based on a 64-bit Power-PC architecture with 3 general purpose in-order CPU cores (PPUs), 512MB system memory and a powerful GPU. Each core runs at 3.2GHz and supports two simultaneous hardware threads (hyper threading) with dual pipelines and duplicated register sets. As a result, the CPU appears to the operating system as if it has 6 individual cores. Along with 64-bit integer and float arithmetic units, all cores are fitted with a 128-bit SIMD vector unit which allows the CPU to perform multiple arithmetic operations per cycle. Together with a competent GPU with unified shader processors running at 500MHz, the Xbox 360 has a lot of performance to offer. Since it is so similar to a PC, experienced programmers can quickly start developing, which makes it a popular platform.

Memory One of the key differences compared to a standard PC is that the 512MB system memory is shared by the CPU and the GPU at the same time. On a PC, the GPU usually has separate memory, which means that data needs to be copied from main memory to GPU memory for rendering. Since the main memory is equally accessible to both processing units on the Xbox 360, such copying can be avoided to increase performance. Another difference is the amount of memory available: having only 512MB of system memory means that a lot of data has to reside on the optical storage. While some systems have access to a hard drive, the game still needs to be designed for a console without one. This means that data constantly needs to be streamed from the DVD and replaced when no longer needed. Reading data from the DVD is very slow compared to a hard drive, which puts large demands on streaming techniques.

3.2.2 Performance

While the 6 virtual cores can theoretically yield double the performance compared to a single thread per physical core, this is often not the case. They help the developer utilize more of the dual pipeline structure but can also lead to unpredictable results. Since 2 threads executing on the same core cannot use the same core component, for example the vector unit, at the same time, getting good performance from hyper threading relies on the threads using different parts of the processor. If the developer does not have full control over which threads execute on each core, two computationally intensive threads might run on the same core, which would generate unnecessary stalls from waiting on processor components. Two threads running on one core will also share the same L1 cache, which can lead to unpredictable cache misses. While it is easy to get code running on this machine, the developer needs to be proficient at optimizing to get the most out of the performance.


3.3 PlayStation 3

The third architecture that Frostbite 2 supports is Sony's PlayStation 3. It is the main console competitor to the Xbox 360 in terms of graphics, and since it uses a heterogeneous CPU architecture it is a unique system on the console market today. The hardware is highly shaped around parallel computing and is therefore also by far the most difficult of the platforms mentioned in this thesis to develop for. Since single-threaded code alone would leave most of the processing resources completely unused, games on the PlayStation 3 have to do as many computations as possible in parallel in order to compete with other gaming platforms. This makes the system a good benchmark for testing parallel code designs and is the reason it is the target architecture for this thesis. If a simulation runs well on the PlayStation 3, it is highly likely to run well on any architecture that can use multiple processing units. A more thorough description of the PlayStation 3 can be found in A Rough Guide To Scientific Computing On The PlayStation 3 [BLK+ 07].

Figure 3.1: Simplified schematic of the architecture of the PlayStation 3. Shown in detail is the layout of the CPU (CELL) and its connections to memory, input/output devices and the GPU (RSX).

3.3.1 Hardware

Not counting the GPU (RSX), the PlayStation 3 has two different types of processing cores, 9 in total, and separate memory access controllers. The architecture is 64-bit but only allows 32-bit applications due to operating system constraints. It has 512MB physical memory available, but unlike the Xbox 360, the main memory is physically split between the CPU and GPU with 256MB each. Because of all its processing units, the PlayStation 3 is the console with the highest theoretical computing power. However, since most of the processors require customized code, it is often difficult to utilize it to the full extent in games. In addition, the GPU is in most cases inferior to the one in the Xbox 360. For example, the RSX does not have a unified shader architecture.

Cell Broadband Engine The CPU of the PlayStation 3 is the Cell Broadband Engine (CELL), see figure 3.1, a system-on-chip designed by Sony, Toshiba and IBM for parallel computation-heavy tasks. The CELL is a 64-bit architecture which consists of one general purpose processor (PPE) and 8 secondary processors specialized for vector math operations (SPEs), connected via the Element Interconnect Bus (EIB). The secondary processors have access to the main memory through memory controllers but do not have the ability to read and write directly to it. Instead they need to copy the data to a local storage before computation and then copy the resulting data back to the main memory. Because of these limitations, a common technique is to have the PPE run a main thread that controls and schedules batches of tasks executed on the SPEs.

PPE The PPE is a 64-bit processor that supports the Power-PC instruction set along with VMX, the Power-PC SIMD extension for performing 128-bit vector operations. It operates at 3.2GHz with a 32KB L1 cache and supports hyper threading with separate registers for each hardware thread. While the performance per thread under hyper threading might be lower than if only a single thread is used, the combined performance of both threads is often higher. The PPE is very similar to the processing cores of the Xbox 360, with the same SIMD capabilities and support for the same instruction set. It is, however, the only processor on the Cell chip capable of executing Power-PC instructions. This means that for ordinary code, compiled for Power-PC, the PlayStation 3 has only about a third of the processing power of the Xbox 360.

SPE What the Cell processor lacks in general processing power it delivers with the so called Synergistic Stream Processing Elements (SPEs).
While the CELL contains 8 SPEs, only 6 of them are available for use, since one is exclusively assigned to the operating system and one is a failsafe backup. The processing unit of each SPE is called an SPU. An SPU is a single-core in-order processor optimized for running computationally intensive code at high speeds. Similar to the PPE, it has a clock speed of 3.2GHz, but with a limited instruction set customized for SIMD operations only. It is equipped with a dual pipeline to be able to issue two instructions each cycle. Each execution unit in the SPU is assigned to one of those pipelines, which means that instructions of the same type are always scheduled to the same pipeline. To fully utilize both pipelines and thus maximize performance, the programmer needs to organize instructions according to which pipeline they use.

Local Store The SPEs do not use a conventional cache. In order to still provide cache-like functionality, each SPE contains a single high speed memory called the Local Store (LS). The LS is 256KB in size and is similar to an L1 cache, with the exception that the programmer must load it manually from main memory via the Memory Flow Controller (MFC). The LS must be able to hold both the instructions for the currently executing program and the working data, which puts high demands on memory management from the developer. A positive aspect of the Local Store is that all the unpredictability of a normal cache is eliminated in favor of loading memory explicitly.

MFC The MFC of each SPE handles all the data transfers between the Local Store and the system's main memory via DMA instructions. The latency of a DMA operation is comparable to that of transferring memory to the cache in an ordinary CPU, and transfers are optimized for sizes in multiples of 16B or 128B. To avoid stalls produced by reading from and writing to main memory, DMA instructions are handled asynchronously, letting the SPU query the state of ongoing memory transfers via different channels. Since the DMA operations are asynchronous, it is a common technique to double-buffer data fetches by loading one segment of memory while working on another. By shuffling memory this way, the SPE can work on a larger total set of data than can fit into the Local Store.
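The double-buffering scheme can be sketched as follows. This is a hypothetical illustration in which std::memcpy stands in for the asynchronous mfc_get/mfc_put DMA transfers of a real SPE; the chunk size and function names are invented for the example.

```cpp
#include <cstring>
#include <cassert>

const int CHUNK = 4;  // elements per simulated DMA transfer

void process(float* buf, int n) {
    for (int i = 0; i < n; ++i) buf[i] *= 2.0f;  // the actual work
}

// Streams `total` floats through two small "local store" buffers. While one
// chunk is being processed, the next chunk is already being fetched.
void stream_double_buffered(float* main_mem, int total) {
    float ls[2][CHUNK];  // the two local-store buffers
    std::memcpy(ls[0], main_mem, CHUNK * sizeof(float));  // prefetch chunk 0
    for (int c = 0; c < total / CHUNK; ++c) {
        int cur = c & 1, nxt = cur ^ 1;
        if ((c + 1) * CHUNK < total)  // start "DMA" of the next chunk
            std::memcpy(ls[nxt], main_mem + (c + 1) * CHUNK,
                        CHUNK * sizeof(float));
        process(ls[cur], CHUNK);      // work on the current chunk meanwhile
        std::memcpy(main_mem + c * CHUNK, ls[cur], CHUNK * sizeof(float));
    }
}
```

On real hardware the fetch of the next chunk would return immediately and only be waited on once processing of the current chunk is finished; here the copies are synchronous, so only the buffering pattern is demonstrated.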

3.3.2 Performance

As mentioned before, when developing games for the PlayStation 3, programmers should use the SPE processors as much as possible. Because the SPEs only have vector registers, ordinary single-variable operands need to be converted to vector instructions before processing. Those results are then written back to memory using several shuffle instructions. To avoid unnecessary instructions, programmers should rewrite their code to use only vector operations. Code that efficiently uses vector instructions can be more than 4 times as fast as ordinary code. Since the SPEs are so good at processing large amounts of vector data, Frostbite 2 also uses them for vertex operations that would otherwise be done on the GPU. The algorithm in this thesis is able to make use of an arbitrary number of processors. This enables the PlayStation 3 to easily outperform the Xbox 360 if only considering the water simulation.


Chapter 4

Parallelization And Optimization


These days parallelism is an important aspect of any real time system. Processors are getting more and more cores in order to increase performance, and to fully utilize modern computers it is increasingly vital to be able to run simulations on multiple cores simultaneously. One of the largest obstacles when designing code for parallel computation is the manner in which data is accessed in memory. A parallel simulation is split up into several jobs, or threads, each working on either separate tasks or partial data for the same task. Often, performance does not scale linearly with the number of processing cores compared to the single-threaded simulation. Stalls arise from different tasks that require access to the same data, or from dependencies on previously executed tasks. By structuring the memory layout and the data flow of the simulation, these stalls can be greatly reduced. This chapter introduces some of the parallelization techniques referred to in this thesis along with a few general methods for optimizing jobs on vector processors.

4.1 Parallel Methods

According to Flynn's Taxonomy [Fly72], there exist two major parallel architectures for distributing work load over multiple processing cores: Multiple Instruction Multiple Data (MIMD) and Single Instruction Multiple Data (SIMD). While these classifications refer to hardware configurations, they can also be used to describe parallel design patterns. MIMD executes different instructions on multiple streams of data and will be referred to as Task Parallelism. SIMD, on the other hand, executes the same instruction, or function, on different parts of the same data stream. This is referred to as Data Parallelism.


4.1.1 Task Parallelism

Task Parallelism is often found in multi-threaded operating systems, where parallelism is achieved by having multiple separate processes executing simultaneously. For this reason, it is also called Process Parallelism. For a simulation that consists of different tasks, or steps, code designed with this method lets each processing core take care of a whole step of the simulation for every frame. This is preferable if a task can operate on a section of data independently of other tasks. If a task is dependent on one or several other tasks during the same simulated frame, a pipelining structure needs to be implemented to achieve parallelism over all cores. A good example of this is a multi-threaded game engine where rendering for one frame is done at the same time as the simulation for the next frame. While being an intuitive way of parallelizing code, this method does not scale well with additional processing cores, since the level of parallelism is limited to the number of tasks that can be run in parallel. When adding more computational cores, it might be difficult to create more tasks for these processors. Furthermore, if the tasks for one frame are dependent on both tasks from the same frame and tasks from a previous frame, it might not be possible to use pipelining.

4.1.2 Data Parallelism

Data Parallelism is a good alternative to Task Parallelism for simulations that need to take advantage of an arbitrary number of processing cores. Instead of splitting the work load over the different tasks of a simulation, code can be designed in a way that lets multiple processors work on the same task. If a task consists of performing the same operation on a large set of data, each processed part of that data can be viewed as a partial result of the complete task. Full parallelism can be achieved by letting each thread calculate a partial result. This means that the number of processors that can work on a single task is only limited by the amount of data processed. By parallelizing with focus on data rather than individual tasks, dependencies between tasks are trivial to resolve, since all jobs in a task can be synchronized to the completion of the previous task. To ensure that the simulation time scales well with both the number of processors and the size of the data, data should be distributed over jobs in a way that creates good load balancing. When dealing with simulations that have a lot of dependencies within the same task, parallelization will be harder to achieve. A solution to this might be to redesign the simulation with smaller tasks so that it only has task-to-task dependencies. [HS86] is a good resource on how to redesign algorithms with focus on data parallelization.
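The load-balanced splitting described above can be sketched as a small helper. This is an illustrative function (name and signature invented for the example) that divides n elements into per-job ranges differing in size by at most one element; each [begin, end) range would then be handed to one worker core.

```cpp
#include <vector>
#include <utility>
#include <cassert>

// Splits n elements into `jobs` contiguous ranges of nearly equal size.
std::vector<std::pair<int, int>> split_ranges(int n, int jobs) {
    std::vector<std::pair<int, int>> ranges;
    int base = n / jobs, extra = n % jobs, begin = 0;
    for (int j = 0; j < jobs; ++j) {
        int len = base + (j < extra ? 1 : 0);  // first `extra` jobs get one more
        ranges.push_back(std::make_pair(begin, begin + len));
        begin += len;
    }
    return ranges;
}
```

Because the ranges are disjoint and cover the data exactly, each job can run without locking, and a barrier after all jobs finish gives the synchronization point for the next task.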


4.2 Data Oriented Design

The available performance of a system is dominated by the speed at which processors are able to execute instructions and the rate at which data can be read from memory. The rate of performance improvements in processors has, however, by far exceeded that in memory over the years [Car02]. Because of this, the processor-memory gap has grown larger and memory access speed has become the biggest performance bottleneck today. To be able to fully utilize the speed of the CPU, developers should design code with memory efficiency in mind, so called Data Oriented Design (DOD).

4.2.1 Object Oriented Programming

Object Oriented Programming (OOP), which is used widely in software development, groups data according to the objects to which it logically belongs. This often makes heavy use of classes and inheritance, which might lead to class explosions and large executables [Fre11]. The purpose of the programming constructs associated with OOP is to make the code structured and easily manageable by abstracting and isolating data. They do not focus on organizing memory efficiently.

4.2.2 Cache Inefficiency of OOP

The main focus of DOD is to minimize the number of data transfers needed between the main memory and the cache. Using OOP features like encapsulation and polymorphism often means extensive use of virtual function calls. Each virtual function call needs to do a virtual table look-up before it is known which function actually implements the call [DH96]. Performing these look-ups means that the tables need to be fetched into the cache before the instructions of the implementing function can be loaded. If a large collection of encapsulated objects is iterated over, like calling the update method on all objects in a game, many cache misses are generated by the virtual table look-ups. Since the cache does all memory fetches in blocks the size of a cache line, even for single values, much of the memory loaded into the cache is wasted.

4.2.3 Structure Of Arrays

To maximize the use of every memory block loaded into the cache when iterating over objects, the data associated with each object needs to be organized in groups according to how it is used rather than which object it belongs to [Col10]. Consider an iteration over a collection of objects, each with many different attributes, where each iteration only reads a specific attribute. If each logical object is represented by a continuous block of memory, the cache needs to be updated every time the specific attribute of each object is accessed. If, however, the data for all objects is stored as one continuous memory block per attribute, the cache can contain the specific attribute for several objects at the same time. This is the difference between an array of game objects, see figure 4.1, and a single structure with arrays for all object attributes, see figure 4.2. By designing the code around collections of objects instead of the individual objects, the cache can be utilized optimally.
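The two layouts can be contrasted in code. The attribute names below are hypothetical (invented for the example); the point is only the memory layout corresponding to figures 4.1 and 4.2.

```cpp
#include <cassert>

const int N = 7;

struct ObjectAoS {        // figure 4.1: one contiguous block per object
    float posX, posY, health, mass;
};

struct ObjectsSoA {       // figure 4.2: one contiguous array per attribute
    float posX[N], posY[N], health[N], mass[N];
};

// Iterating over a single attribute touches contiguous memory in the SoA
// layout, so every cache line fetched is fully used.
float total_mass(const ObjectsSoA& o) {
    float sum = 0.0f;
    for (int i = 0; i < N; ++i) sum += o.mass[i];
    return sum;
}
```

With ObjectAoS, the same loop would stride past three unused attributes per object, wasting most of each cache line it loads.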

Figure 4.1: Example of OOP organization of memory. The layout of 7 objects with 4 attributes each is shown. Each object is stored sequentially, left to right.

Figure 4.2: Example of DOD organization of memory. The layout of 7 objects with 4 attributes each is shown. Each attribute of all objects is stored sequentially, left to right.

4.3 Optimization Details

Often, most of the execution time in a simulation is spent iterating over small sections of code, such as inner loops. It is therefore important to optimize those sections as much as possible for real-time applications. Presented here are some of the most important aspects of code optimization.

4.3.1 SIMD Vectorization

The term SIMD can also refer to the architecture of a vector processor, which can perform the same operation on multiple scalar values at once. To fully utilize the performance that a SIMD processor core can deliver, data should be organized in a way that allows operations to be done in vector format. By also organizing data as Structures Of Arrays, great speedups can be achieved. For example, a dot product of two 3D vectors of 32-bit floats, computed with scalar multiplication, requires 3 multiplications and 2 additions. By instead storing the 3D components of 4 vectors in 3 128-bit SIMD operands, 4 dot products can be performed with 3 vector multiply-and-add instructions, which is more than 6 times faster [Col11]. The value of vectorizing code becomes even greater on the SPEs, since the SPU has no scalar arithmetic unit. This means that ordinary scalar operations are performed using the vector unit instead. Such an operation has a lot of costly overhead, since the scalar operands need to be shifted inside the vector register before the operation and shifted back before writing. See Preferred Slot in A Rough Guide To Scientific Computing On The PlayStation 3 [BLK+ 07].
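The SoA dot-product layout can be sketched as follows. Plain 4-element arrays stand in for the 128-bit SIMD registers; on real hardware each pass of the loop body would be a single vector multiply-and-add instruction, and the function name is invented for the example.

```cpp
#include <cassert>

// Computes four dot products at once: the x, y and z components of four 3D
// vectors are packed into three 4-wide "registers", and three
// multiply-and-add passes produce all four results.
void dot4(const float ax[4], const float ay[4], const float az[4],
          const float bx[4], const float by[4], const float bz[4],
          float out[4]) {
    for (int i = 0; i < 4; ++i)  // each i is one lane of the vector registers
        out[i] = ax[i] * bx[i] + ay[i] * by[i] + az[i] * bz[i];
}
```

Note that the inputs must already be stored component-wise (SoA); with an array of xyz-structs, extra shuffle instructions would be needed to build the packed registers first.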

4.3.2 Software Pipelining

Out-Of-Order processors have the ability to execute instructions in an order other than that specified by the programmer. While one instruction waits for its operands, instructions that have no active dependencies can be executed in the meantime, reducing stalls. Consoles today use In-Order processors which, in order to save chip space and production costs, must execute instructions in the order in which they are defined. To avoid stalling the processor pipeline, a method called Software Pipelining can be used. By rearranging instructions in a way that minimizes stalls, many processing cycles can be saved [Eng10, Cof11]. For example, given a loop that performs one read and one multiplication, the multiplication might need to wait 4 cycles for the read to finish, see figure 4.3. The programmer can unroll the loop by doing 4 loop iterations at a time, grouping the read instructions in front of all multiplications. This way, each read operation is completed just in time for the corresponding multiplication, as shown in figure 4.4.
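The unrolling transform above can be sketched in source form. This is an illustrative example (names invented): four reads are grouped ahead of four multiplications so that, on an in-order core, each value would arrive just as its multiplication issues; on a scalar host it is functionally identical to the rolled loop.

```cpp
#include <cassert>

// Scales n floats by k, unrolled by 4 in the software-pipelining style.
void scale_unrolled(const float* in, float* out, int n, float k) {
    int i = 0;
    for (; i + 4 <= n; i += 4) {
        float a = in[i], b = in[i + 1];      // the grouped reads
        float c = in[i + 2], d = in[i + 3];
        out[i]     = a * k;                  // multiplies issued afterwards
        out[i + 1] = b * k;
        out[i + 2] = c * k;
        out[i + 3] = d * k;
    }
    for (; i < n; ++i) out[i] = in[i] * k;   // leftover iterations
}
```

In practice this scheduling is often done by the compiler, but on in-order SPUs hand-arranged loops like this can still make the difference shown in the cycle diagrams.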

Figure 4.3: Simplified cycle diagram of a simple two-instruction loop on an in-order processor. The first 4 iterations are shown. 20 cycles are needed to execute 8 instructions. Note that branching instructions are not shown.


Figure 4.4: Simplified cycle diagram of the same loop unrolled 4 times. The first 8 iterations are shown. 16 cycles are needed to execute 16 instructions. Note that branching instructions are not shown.

4.3.3 Careful Branching

Another drawback of In-Order processors is the inability to mitigate branching stalls. Since instructions cannot execute in advance, misprediction of a branch results in a flush of the whole execution pipeline [Col11]. The programmer should avoid branches in sections of code that run frequently, like small computation loops. If branching cannot be avoided, special hint instructions can be used to control prediction. If a branch is more likely to take a specific path, the hint instruction can be executed a few cycles ahead to force the pipeline to start issuing instructions on that path before the outcome of the branch is calculated. In special cases, branching can be avoided completely, at the cost of executing both branches, by using result masking. With this method, the results of both branches are calculated and then masked to yield the correct result. This is very useful on architectures with very high branching costs.
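Result masking can be sketched for a scalar select. This is an illustrative example: the ternary that builds the mask is only for demonstration on a host CPU; on SIMD hardware the comparison instruction itself produces an all-ones or all-zeros mask, so no branch is executed at all.

```cpp
#include <cstdint>
#include <cstring>
#include <cassert>

// Branch-free select: returns ifGE when x >= threshold, otherwise ifLT.
// Both candidate results exist up front and a bit mask blends them.
float select_ge(float x, float threshold, float ifGE, float ifLT) {
    uint32_t mask = (x >= threshold) ? 0xFFFFFFFFu : 0u;
    uint32_t a, b, r;
    std::memcpy(&a, &ifGE, sizeof a);      // reinterpret the float bits
    std::memcpy(&b, &ifLT, sizeof b);
    r = (a & mask) | (b & ~mask);          // blend the two results by the mask
    float out;
    std::memcpy(&out, &r, sizeof out);
    return out;
}
```

This is the same pattern as the vector select instructions found in VMX and on the SPU, where whole branches inside inner loops can be replaced by one compare and one blend.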

4.3.4 Load-Hit-Stores

One of the largest causes of performance loss is the Load-Hit-Store [Hei08], which is the process of loading a variable from cache or main memory and storing it in a processor register. If a variable is written to just before it is loaded into another register, any instruction that uses that register must wait for the writing and loading operations to finish. Since a write often needs to update the cache, it can take many cycles before the read instruction can be executed, and on an In-Order processor this results in a pipeline stall. On Power-PC processors, which have separate registers for integers, floats and vectors, Load-Hit-Store situations can easily arise from simple type conversions. Since there are no instructions on the Power-PC processor that can move values from one type of register to another, type-converting a variable means that the data needs to pass through the cache between registers. While it would have been too time consuming to fully implement all of these optimizations in this thesis work, the design choices in the water simulation algorithm have been made with them in mind.

Chapter 5

Algorithm
The aim of this chapter is to give a detailed description of the water simulation presented in this report, along with a summary of the earlier work this thesis is based on. The issues with the previous algorithm are brought up together with an explanation of the improved algorithm and how it aims to solve these problems. This chapter focuses on the theory of the algorithm and some of the methods involved rather than the actual implementation, which will be presented in the next chapter.

5.1 Earlier work

Ottosson's method for simulating interactive water [Ott11] uses a height-field model based on Linear Wave Theory to effectively represent waves. The simulation is able to render bodies of resting water which can be interacted with by physical objects moving through, or under, the surface. The method is capable of simulating both very small and very large waves simultaneously over water surfaces ranging in size from small puddles to big oceans.

5.1.1 Surface Cells

Each water surface is uniformly divided into a grid of smaller square sections, or cells, which are simulated individually, see figure 5.1. One such cell represents a fixed portion of the total height field for the whole surface, for example a 32x32 grid. To allow propagation of waves over the whole surface, the data along the borders of these cells, produced during a simulation step, is copied to neighboring cells in preparation for the next frame. To adjust the fidelity of the water simulation, the size of these cells can be set as desired, with the default size being 6x6 world units (measured in meters).



Figure 5.1: An illustration of a water surface divided into cells, where each cell may or may not contain a grid. In this illustration, 6x6m cells are shown together with a 32x32 grid.

5.1.2 Dispersion By Convolution

The method is heavily based on the iWave article [Tes04a], which presents a convolution based method for simulating water as an alternative to Fourier based methods. The article introduces and derives a way of expressing a dispersive propagation, normally done in the Fourier domain, as a convolution kernel applied directly to the height field (an alternative derivation of the same convolution kernel can be found in Ottosson's report [Ott11]). This method produces excellent results and is presented by Tessendorf as a viable alternative for real-time water simulation in games. However, simulating a high range of water wave lengths in this way requires a very large convolution kernel.

5.1.3 Kernel Approximation

Convolution of a 2D field is a costly operation whose performance is highly dependent on the size of the kernel. To reduce the cost of a convolution, it is desirable to have a kernel that is separable. A separable kernel reduces the number of multiply-and-add operations needed from n²m to 2nm, where n is the width of the kernel and m the number of elements in the data field. Since the convolution kernel used in the iWave implementation is not separable, an approximation is used instead. The kernel is approximated by convolving the height-field data with a Gaussian kernel and subtracting the convolved data from the original height field [Ott11]. This results in a convolution operation that reasonably approximates the iWave kernel while still being separable.
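The separable blur-and-subtract scheme can be sketched on a tiny field. This is a minimal illustration, not the thesis implementation: a 3-tap Gaussian [1 2 1]/4 is applied along rows and then along columns (2nm multiply-adds instead of n·n·m), and the blurred field is subtracted from the original; the field size and kernel width are invented for the example.

```cpp
#include <cassert>

const int W = 4, H = 4;

void blur_rows(const float* src, float* dst) {
    for (int y = 0; y < H; ++y)
        for (int x = 0; x < W; ++x) {
            float l = src[y * W + (x > 0 ? x - 1 : x)];      // clamp at edges
            float c = src[y * W + x];
            float r = src[y * W + (x < W - 1 ? x + 1 : x)];
            dst[y * W + x] = 0.25f * l + 0.5f * c + 0.25f * r;
        }
}

void blur_cols(const float* src, float* dst) {
    for (int y = 0; y < H; ++y)
        for (int x = 0; x < W; ++x) {
            float u = src[(y > 0 ? y - 1 : y) * W + x];
            float c = src[y * W + x];
            float d = src[(y < H - 1 ? y + 1 : y) * W + x];
            dst[y * W + x] = 0.25f * u + 0.5f * c + 0.25f * d;
        }
}

// High-pass approximation: original minus the separably blurred field.
void high_pass(const float* height, float* out) {
    float tmp[W * H], blurred[W * H];
    blur_rows(height, tmp);       // first separable pass
    blur_cols(tmp, blurred);      // second separable pass
    for (int i = 0; i < W * H; ++i)
        out[i] = height[i] - blurred[i];
}
```

A flat field passes through unchanged by the blur, so its high-pass result is zero everywhere, which matches the intuition that the operation isolates the wave detail.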


5.1.4 Laplacian Pyramids

Propagating water waves of different wavelengths requires a kernel that is proportional to the width of the largest wave simulated. If a single convolution kernel were used for waves of both centimeters and meters in length, the size of the kernel would need to be very large. Even if a separable kernel is used, this severely degrades performance. To be able to simulate a high range of wavelengths, a multi-resolution height-field approach is applied. Burt and Adelson [BA83] introduced Laplacian Pyramids, a method of dividing an image or a height field into several grids, each containing data within a specific frequency range. Each grid is also of dimensions proportional to the frequencies it holds; in this case, a grid is half the resolution of the next grid in the pyramid. Each surface cell contains a pyramid that holds decomposed waves in different grids down to a resolution of 4x4, as shown in figure 5.2 (it is important to note that, while of different resolutions, all grids in a pyramid cover the same area in world space). These grids can then be simulated separately with a convolution kernel much smaller than if all waves had been propagated on the same grid.
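The decomposition can be illustrated in one dimension. This is a simplified sketch: each level stores the detail lost when the signal is halved in resolution, and the final entry stores the low-frequency residual; pair-averaging and nearest-neighbour upsampling stand in for the filters used in practice.

```cpp
#include <vector>
#include <cstddef>
#include <cassert>

std::vector<float> downsample(const std::vector<float>& s) {
    std::vector<float> d(s.size() / 2);
    for (std::size_t i = 0; i < d.size(); ++i)
        d[i] = 0.5f * (s[2 * i] + s[2 * i + 1]);   // average pairs
    return d;
}

std::vector<float> upsample(const std::vector<float>& s) {
    std::vector<float> u(s.size() * 2);
    for (std::size_t i = 0; i < s.size(); ++i)
        u[2 * i] = u[2 * i + 1] = s[i];            // nearest-neighbour
    return u;
}

// Returns the detail bands fine-to-coarse, followed by the residual.
std::vector<std::vector<float>> build_pyramid(std::vector<float> s, int levels) {
    std::vector<std::vector<float>> pyr;
    for (int l = 0; l < levels; ++l) {
        std::vector<float> low = downsample(s);
        std::vector<float> up = upsample(low);
        for (std::size_t i = 0; i < s.size(); ++i)
            s[i] -= up[i];        // what the coarser grid cannot represent
        pyr.push_back(s);
        s = low;
    }
    pyr.push_back(s);             // low-pass residual
    return pyr;
}
```

Summing each detail band with the upsampled coarser levels reconstructs the original signal exactly, which is the property the grid-summation step of the simulation relies on.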

Figure 5.2: An illustration of one surface cell with a 4-level Laplacian Pyramid containing grids of sizes 32x32, 16x16, 8x8 and 4x4.

5.1.5 Grid Summation

Waves are stored separately in the pyramid levels between frames, as described above, but in order to render the simulation, a pyramid needs to be merged into a single height field for each surface cell. To generate the total surface elevation for a section of water, the height data of all grids in the same pyramid is accumulated into a height field of the same resolution as the most detailed grid of that pyramid. This merging process is done cumulatively, going from the lowest grid resolution to the highest. To merge two grids, where one grid is half the resolution of the other, an up-scaling algorithm is used to interpolate data between two pixels on the lower grid. Since each grid is assumed to represent a frequency range of sinusoidal waves, summing a linearly up-scaled version of the lower grid with the higher is insufficient. This method instead uses a bicubic up-scaling algorithm, which provides an interpolation that is close enough to sinusoidal for good visual results.
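A one-dimensional sketch of the cumulative merge is shown below. Catmull-Rom cubic interpolation is used here as a hypothetical stand-in for the exact bicubic up-scaling of the implementation; the function names are invented for the example.

```cpp
#include <cassert>

// Catmull-Rom interpolation between p1 and p2 with neighbours p0 and p3.
float catmull_rom(float p0, float p1, float p2, float p3, float t) {
    return 0.5f * ((2.0f * p1) + (-p0 + p2) * t +
                   (2.0f * p0 - 5.0f * p1 + 4.0f * p2 - p3) * t * t +
                   (-p0 + 3.0f * p1 - 3.0f * p2 + p3) * t * t * t);
}

// Upscales an n-sample coarse grid to 2n samples cubically and adds it
// into the finer grid, accumulating the surface elevation.
void merge_coarse_into_fine(const float* coarse, int n, float* fine) {
    for (int i = 0; i < n; ++i) {
        int im = i > 0 ? i - 1 : 0;              // clamp neighbours at edges
        int ip = i < n - 1 ? i + 1 : n - 1;
        int ipp = i < n - 2 ? i + 2 : n - 1;
        fine[2 * i] += coarse[i];                // coincident sample
        fine[2 * i + 1] += catmull_rom(coarse[im], coarse[i],
                                       coarse[ip], coarse[ipp], 0.5f);
    }
}
```

Applied level by level from the 4x4 grid upward, this accumulates the full elevation in a buffer of the finest grid's resolution.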

5.1.6 Level Of Detail

To gain additional performance over the original iWave algorithm [Tes04a], Level Of Detail is applied, taking advantage of the aforementioned pyramid grid structure. It is assumed that in video games, the detail of a water simulation can be adjusted depending on the distance from the viewer without losing great amounts of visual quality. A section of a water surface viewed from afar or from a narrow angle does not need the same attention to detail as a section close or perpendicular to the player's point of view. In other words, the frequency range of waves that can be visually distinguished decreases with distance from the viewer. Since the water is already divided into grids containing discrete ranges of wavelengths, it is trivial to disregard the high-frequency pyramid grids when simulating and rendering surface sections at a distance. Adjusting the level of detail by dynamically changing the number of grids used in a surface cell can create a popping effect if the visual changes are too great. In a game where the point of view is highly mobile, fading the amplitude of newly added or removed grids over time is important to reduce this effect.
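A selection rule of this kind can be sketched as follows. This is a hypothetical policy, not the one used by the simulation: one high-frequency grid is dropped each time the cell's distance to the viewer doubles beyond a base distance, never going below one grid; the actual thresholds would be tuned per game.

```cpp
#include <cassert>

// Returns how many pyramid grids to simulate for a cell at `distance`.
int grids_for_distance(float distance, float baseDistance, int maxGrids) {
    int grids = maxGrids;
    for (float d = baseDistance; d <= distance && grids > 1; d *= 2.0f)
        --grids;  // discard the most detailed remaining grid
    return grids;
}
```

The returned count would then be faded in or out over a few frames rather than applied instantly, to avoid the popping effect described above.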

5.1.7 Interaction

Another key feature of the iWave method is the ease with which interaction can be performed. Interaction is done by distorting the surface before propagating waves. Since waves are not propagated or stored in the Fourier domain, distortions can be applied directly to the height field around interacting objects. Taking care to preserve total mass, the height field is, in this implementation, raised in front of moving objects and lowered behind them. This, together with the dispersion property, creates realistic looking movement through the water with the characteristic Kelvin wakes. Storing different lengths of waves in different grids, however, requires distortions to be decomposed into the corresponding frequency ranges of the grids they are added to.
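The mass-preserving distortion can be illustrated in one dimension. This is a hedged sketch with invented names: an object at cell pos moving in the +x direction raises the cell ahead of it and lowers the cell behind it by the same amount, so the sum of all heights is unchanged; the real implementation works on the 2D field and decomposes the distortion into the pyramid's frequency bands.

```cpp
#include <cassert>

// Applies a mass-preserving push/pull around a moving object.
void apply_object_distortion(float* height, int n, int pos, float amount) {
    if (pos + 1 < n) height[pos + 1] += amount;   // bow wave ahead
    if (pos - 1 >= 0) height[pos - 1] -= amount;  // trough behind
}
```

Because the raised and lowered amounts cancel, the distortion injects wave energy without adding or removing water from the surface.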

5.1.8 Stitching

Each simulated section of the water surface is carefully tted to its neighboring sections to produce the full surface mesh for the water. When two neighboring


grids, sharing the same edge, are of different resolution, the in-between vertices of the grid with higher resolution need to be handled. This is done to avoid gaps in the combined rendered surface mesh. Fitting two such grids is done by fading the height data of the high-resolution grid close to the seam to match the lower one. To achieve this, the high-resolution features of one of the grids are faded out and linear interpolation is applied to the offending vertices to place them on the seam. This removes much of the need for an explicit stitching scheme, which could impact performance further. However, two neighboring grids of different resolution will still leave the mesh with T-junctions. Rendering meshes with T-junctions can result in small artifacts, holes, between polygons. These artifacts have proven not to be a problem in the Frostbite 2 engine. If these potential glitches do present issues in the future, they can be completely removed with, for example, a frame of small vertical skirts around each surface section, as Zhao and Ma show [ZM09].
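The seam fitting can be sketched as a one-dimensional operation along a shared edge. The sketch below is illustrative rather than the engine's code: it assumes every second vertex of the high-resolution edge coincides with a vertex of the coarse neighbor, and places each in-between vertex on the straight segment between its two shared neighbors, which removes the T-junction gap along that edge.

```python
def fit_seam(edge_heights):
    """Fit the edge of a high-resolution grid to a low-resolution neighbor.

    edge_heights: heights along the shared edge; even indices are vertices
    that also exist in the coarse grid, odd indices are the in-between
    vertices that would otherwise form T-junctions."""
    fitted = list(edge_heights)
    for i in range(1, len(fitted) - 1, 2):
        # Linear interpolation between the two shared coarse vertices
        # places the in-between vertex exactly on the coarse edge.
        fitted[i] = 0.5 * (fitted[i - 1] + fitted[i + 1])
    return fitted
```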

5.2

Implementation Issues

While a competent method for simulating interactive water, Ottosson's implementation does have a few issues preventing it from being useful in the Frostbite 2 engine, most of which the new implementation in this thesis seeks to eliminate.

5.2.1

Parallelism

Sony's PlayStation 3, which is the target architecture for this thesis, employs six special stream processors (SPEs) along with a conventional CPU (the PPE) that together make up the CELL chip. To run any water simulation effectively on this platform, the simulation needs to take advantage of those secondary cores. Without parallelizing code specifically for these processors, they would go completely unused. Since the previous implementation runs on a single core only, this leaves a large area open for improvement.

5.2.2

Scaling

The method of uniformly dividing a water surface into small sections, or cells, does not scale well to larger water surfaces. If an ocean is divided into cells of the same size as those used for a puddle-sized surface, memory is wasted on areas of the ocean that are not presently simulated. Ideally, a simulation should dynamically adjust the size and number of cells needed for a given water surface.


5.2.3

Border Copying

Borders consist of data outside of the effective simulation data on a grid. Border data is copied from one grid in a pyramid to surrounding pyramids containing a grid of the same resolution. Since the wave propagation is done with a convolution kernel of the same size regardless of grid resolution, the width of the borders is constant, in this case 4. To be able to simulate larger water surfaces, an increasing number of cells with pyramids is needed. Each pyramid has at least a lowest-resolution grid of size 4x4. This means that a grid of that size is really 12x12 including borders, thus having a worst-case memory footprint that is 9 times larger than the actual simulated data, see the comparison in figure 5.3.

Figure 5.3: Comparison of the memory used for borders between a 4x4 grid and a 56x56 grid. Bright tiles represent simulated data and darkened tiles represent borders.
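The 9x worst case follows directly from the constant border width. A small sketch of the arithmetic (with hypothetical helper names):

```python
def stored_cells(sim, border=4):
    """Cells stored for a sim x sim grid with a border-wide frame on all sides."""
    return (sim + 2 * border) ** 2

# A 4x4 grid is stored as 12x12, i.e. 9 times the simulated data.
worst_case = stored_cells(4) / (4 * 4)          # 144 / 16 = 9.0

# For comparison, a 56x56 simulated area stored as 64x64 wastes
# (64^2 - 56^2) / 64^2 of its memory on borders, just under a quarter.
large_grid_waste = (stored_cells(56) - 56 * 56) / stored_cells(56)
```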

5.2.4

Data Locality

Since simulated data of the smallest resolution is copied 4 times per grid, a lot of memory accesses from different addresses are potentially performed. This usually generates many cache misses. To gain as much performance from an algorithm as possible, special attention must be given to the memory layout of the data used. It is important to localize data according to how and when it is accessed, to avoid stalls caused by updating the cache. This is especially true for the SPEs, since the programmer has to perform all memory accesses by hand, and performance benefits if memory can be accessed in larger chunks at a time. Grouping data in this manner is also very useful when parallelizing code in general, since it may remove the need for critical sections.


5.3

Improved algorithm

To combat the issues with the previous method, a new, improved algorithm was designed. The goal of this algorithm is first and foremost to address the performance issues and make the method scale better on parallel architectures. The method presented in this report therefore differs mostly in implementation details, and the description of the techniques presented earlier in this chapter is largely applicable to the new adaptation as well. However, there are some notable differences in design, brought up here.

5.3.1

Homogeneous Grids

The previous algorithm used many pyramids of predetermined resolutions to construct the surface mesh. Since a limited amount of memory is available, these pyramids were stored in separate pools, one for each resolution. These were distributed over cells according to the Level Of Detail. One could therefore end up in a situation where one or more low-resolution pyramids are needed while only high-resolution pyramids are available. A solution to this is to use only one pool, populated with separate grids of the same resolution, from which pyramids are constructed dynamically, as shown in figure 5.4. This way, grids become all-purpose and can be used for both low- and high-resolution pyramids, which makes the Level Of Detail scheme more flexible.

Figure 5.4: An illustration of one surface cell with a complete 4-level quad-tree containing 85 grids. For simplicity reasons, the grid resolution is shown as 4x4.


5.3.2

Quad-trees

By only using grids of the same resolution, keeping in mind that resolution still scales by 2 between levels, it is no longer possible to construct straightforward pyramids for each cell. Instead, a quad-tree structure is used, which means that instead of having a one-dimensional stack of grids with decreasing resolution, each grid has up to 4 other grids attached as children. In contrast to the previous pyramids, where all grids take up the same world-space area, a grid child only takes up a quarter of the area of the underlying grid. This effectively doubles the world-space resolution of the children relative to their parent. By arranging grids in this manner, one can treat each quadrant of a grid as the basis of a virtual pyramid with the corresponding child as its next level. This means that there is no longer an upper resolution limit, since grid children can always be added on top of existing grids. The quad-tree method is a common solution for adaptive grids and is used extensively in terrain rendering. Some good resources on how to implement such trees together with continuous Level Of Detail can be found in the works by Ulrich [Ulr00], Livny et al. [LKES09], and Pajarola [Paj98].
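A minimal sketch of such a quad-tree node, using hypothetical names: each child grid has the same sample resolution as its parent but covers a quarter of the world-space area, doubling the sample density of its quadrant.

```python
class GridNode:
    """One fixed-resolution grid in a surface cell's quad-tree."""

    def __init__(self, size_m, level=0):
        self.size_m = size_m          # world-space edge length of this grid
        self.level = level
        self.children = [None] * 4    # one optional child per quadrant

    def add_child(self, quadrant):
        # Same grid resolution over half the edge length, so the child
        # doubles the world-space sample density of its quadrant.
        child = GridNode(self.size_m / 2.0, self.level + 1)
        self.children[quadrant] = child
        return child

root = GridNode(64.0)              # lowest-frequency grid of the cell
child = root.add_child(0)          # refine quadrant 0
grandchild = child.add_child(3)    # no upper resolution limit
```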

5.3.3

Large cells

Previously, the size of each surface cell was the size of the area that the highest-resolution pyramid occupied. For a high-resolution water simulation, these cells had to be quite small. In the improved algorithm, a cell contains a complete quad-tree, which means that instead of cells being the size of the highest-frequency grid used, they are now the size of the lowest. In the simplest scenario, the whole water surface contains only one large cell. In order to limit the maximum length of simulated wavelengths, a surface can be divided into any number of cells.

5.3.4

Memory Layout

When using uniform grid sizes, the memory usage goes down drastically. For a grid of size 64x64, the memory wasted on borders is always just below 25%, which is a large improvement over the worst-case scenario of the previous implementation. Same-resolution grids also provide greater data locality, which is another important advantage over the older algorithm. Before, the propagation algorithm was performed on a large number of small grids, along with the grids of higher resolutions. Performing propagation on many small grids results in much overhead and many cache misses. With all grids being of equal size, there are more contiguous segments of memory that can be worked on for each propagation, which helps avoid cache misses. Large grids also benefit parallelization on the SPEs, since more memory can be fetched with a single read instruction.


Chapter 6

Implementation
The goal of the thesis presented in this report is to adapt an existing water simulation for parallel architectures. The implementation of the improved method relies heavily on the theory and techniques already employed by the previous work. Therefore, most of the new work done in the thesis is aimed at implementation details and how to design the code to run in parallel. This chapter aims to describe in detail how the improved algorithm presented in the previous chapter is implemented and what happens during one frame of simulation. The low-level implementation details will be specific to the Frostbite 2 engine, but should be easy to adapt for any parallel game software. Prior to implementing the simulation in Frostbite 2, the algorithm was prototyped in a standalone application with limited mouse interaction, using GLFW/OpenGL as a framework.

6.1

Engine

The Frostbite 2 engine is a highly scalable engine designed to run on an arbitrary number of processors in parallel. To manage running the numerous thread instances, called jobs, it uses a central job manager. The developer can use the job manager to set up computation jobs in a deferred manner, with dependencies on other jobs if so desired. These jobs are then scheduled by the job manager to run when there are available time slots on one of the processors. In an environment with such a high focus on parallelism, writing data oriented code is very important. With good data oriented code, each job should be able to work on an isolated block of data, minimizing the memory that is shared with other jobs.


Figure 6.1: Timing diagram showing an example of a job schedule for the first couple of frames of a simulation. Indices indicate which frame's data is being processed. Horizontal length is approximately proportional to wall-clock time.

6.1.1

Frame Overview

Since a game needs to present discrete frames to the viewer, the engine uses a main thread to synchronize jobs each graphical frame. The data containing the height field of the water simulation is updated once per such frame and then rendered to the screen during the subsequent frames. This yields a three-step process for the water simulation, see figure 6.1. At the beginning of a frame, the grids of a water surface are simulated and the height field is updated. After a complete simulation step, the render data for that frame is submitted to a dispatch job where GPU buffers are filled with the mesh and texture data. Finally, the GPU renders meshes to the screen with the desired shaders applied, using the data in the buffers. To avoid having to wait for the GPU to fully render a frame before continuing with the next one, which would leave processors idle, the simulation for the next frame is able to start right after the dispatch job has been set up. This means that the simulation update for frame Fi runs at the same time as the dispatch job for frame Fi−1 and the screen rendering for frame Fi−2.

6.2

Simulation

Each frame, the height field for the whole water surface is updated with new distortion data made by interacting objects and is then simulated before rendering. Since the simulation update for a frame runs at the same time as the rendering of the previous frame, the height field data needs to be duplicated over two buffers. This is achieved by a double-buffering mechanism where the buffer holding the height data for the current frame is being written to while the other buffer, containing the previous height data, is being read from by the dispatch job. To synchronize these buffers, a pointer is swapped on the main thread at the start of the update. After the buffers are swapped, the grids are rearranged according to the Level Of Detail and then simulated through a number of passes, as shown in figure 6.2.


6.2.1

Data Layout

The water surface is built up from an array of square cells, each containing a quad-tree of square grids. These grids are stored in a global pool for all surfaces in the game world and are distributed according to the Level Of Detail scheme. Each of these grids represents either a section of the whole surface height-field or a partial height-field used in the summation with other grids. By using a quad-tree structure for distributing grids in each cell instead of a list, all grids can be of constant size. This greatly simplifies the data layout, since the data for all grids can be stored as a simple indexed array. Each grid also has attributes describing, for example, the location and scale of that grid. The attributes for all grids are each stored separately using the Structures Of Arrays (4.2.3) design pattern to reduce cache misses when reading the same attribute for many grids. The data block of a grid is split into data fields of the same dimensions containing partial height data Hpartial(x, y), total height data Htotal(x, y) and velocity data V(x, y). A simulated grid requires Hpartial and V for both the current and the previous frame, along with a double-buffered Htotal. This means that each grid contains two copies of each data field (Hpartial, Htotal, V), for a total of 6 fields. Each field is stored as one float32 array in order to enable SIMD operations to process 4 values at once. The whole simulation is performed on the partial height data, using the velocity data of the previous frame, which is then accumulated into the total height data together with up-scaled data from lower-level grids.
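As a rough sketch of this layout (hypothetical names, pool size shrunk for readability): attributes live in Structure Of Arrays form, and each grid's six fields are flat float32 arrays.

```python
from array import array

N = 64          # grid dimension (simulated area plus borders)
NUM_GRIDS = 4   # size of the global grid pool (toy value)

# Structure Of Arrays: one flat array per attribute, indexed by grid id,
# so a pass reading one attribute for many grids walks contiguous memory.
grid_scale = array('f', [1.0] * NUM_GRIDS)
grid_pos_x = array('f', [0.0] * NUM_GRIDS)
grid_pos_y = array('f', [0.0] * NUM_GRIDS)

def make_field():
    # One data field: a flat float32 array, friendly to 4-wide SIMD.
    return array('f', [0.0] * (N * N))

def make_grid_fields():
    # Six fields per grid: (H_partial, H_total, V), each double-buffered.
    return {name: [make_field(), make_field()]
            for name in ('h_partial', 'h_total', 'velocity')}

pool = [make_grid_fields() for _ in range(NUM_GRIDS)]
```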

6.2.2

Grid Dimensions

The choice of grid dimensions is an important factor when optimizing the simulation for the target hardware. When using grids of small dimensions, more grids can be stored in the global pool for the same memory cost, which leads to greater freedom when distributing grids. However, each grid comes with an overhead, both in attributes and in the communication with nearby grids. This needs to be taken into account if memory operations have a high latency, which is the case for the SPEs. Having grid dimensions that are powers of 2 speeds up DMA transfers, which work best in blocks of 16 or 128 bytes. Also, keeping in mind the limited size of the SPE Local Store (3.3.1) and that vectorized instructions work best on multiples of 4, the grid size 64x64 was chosen for the Frostbite 2 implementation.
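A back-of-the-envelope check of this choice, assuming the six float32 fields per grid described in the data layout above: a 64x64 grid occupies 96 KiB, which fits in the 256 KiB SPE Local Store with room to spare for code and buffered transfers.

```python
N = 64              # chosen grid dimension
FIELDS = 6          # (H_partial, H_total, V), each double-buffered
FLOAT_BYTES = 4     # float32

bytes_per_grid = FIELDS * N * N * FLOAT_BYTES   # 98304 B = 96 KiB
local_store_bytes = 256 * 1024                  # SPE Local Store capacity
```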


Figure 6.2: Job dependency graph for one step of simulation. For simplicity, only 3 worker threads are shown for the update passes.

6.3

Interaction Setup

Interaction with the water is done by displacing the height field with the desired shape before performing the wave propagation. Interaction data is gathered during a frame from objects that are intersecting the water surface and written to a double-buffered array, synchronized with the rest of the simulation. To avoid having to read properties directly from the intersecting objects, all data necessary for applying a disturbance is written into the array. All disturbances are abstracted as ellipses coupled with a vertical force. An object in the water generates the ellipses best matching its intersection with the resting water surface plane. In order to conserve mass when moving along the surface, two overlapping ellipses with opposing vertical forces are generated, which separate when the object is in motion. The length of the separation is determined by the frame delta-time and results in water being pushed up in front of a moving object, depending on its speed.
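The opposing-ellipse idea can be sketched as follows (hypothetical names, one axis only; a sketch, not the engine's representation). Two disturbances with equal and opposite vertical force are emitted, separated along the velocity by a distance that grows with speed and frame delta-time, so the net displaced volume is zero:

```python
def make_disturbance_pair(center_x, velocity_x, dt, radius, force):
    """Return two overlapping elliptical disturbances: water raised
    ahead of the moving object and lowered behind it."""
    offset = 0.5 * velocity_x * dt     # separation grows with speed
    return [
        (center_x + offset, radius, +force),   # pushed up in front
        (center_x - offset, radius, -force),   # pulled down behind
    ]

pair = make_disturbance_pair(center_x=0.0, velocity_x=2.0, dt=0.1,
                             radius=1.0, force=0.5)
# Opposing vertical forces cancel, conserving the total displaced volume.
net_force = sum(force for _, _, force in pair)
```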

6.4

Frame Setup

Before any simulation can be done, the attributes of all grids must be determined and updated. These attributes include the position and scale of each grid and its


position in the quad-tree, along with information about the neighboring same-dimension grids.

6.4.1

Level Of Detail

Updating the grids' positions in the quad-trees is done by a Level Of Detail manager (LOD manager). The LOD manager distributes and culls grids depending on the position of the camera, as shown in figure 6.3. Since removing grids from the simulation effectively destroys waves that cannot be restored, removing grids must be done with care. By not taking the direction of the camera into account, the LOD manager ensures that a player can't destroy surrounding waves simply by turning the camera on the spot. In order to determine which grids should be culled and where grids need to be added, a sorted priority list of candidates is constructed every frame, which is then truncated to the same size as the global grid pool. Candidates can be existing grids, grids that should be added as a child to another grid, or root-level grids that should be added to an empty cell. The priority of a candidate is determined by the position of the camera and the distance to the surface section of the candidate relative to its scale. The formula is shown here:

pcand = (s / |v|) (n · v / |v|)   (6.1)

Where s is the scale of the candidate, v is the distance vector from the nearest point inside the section to the camera, and n is the normal of the surface. By using this formula, where the surface-relative angle to the candidate section is taken into account, the LOD manager will down-prioritize candidates that are seen narrowly. It is important not to distribute grids unnecessarily in a situation where the water surface is too far away from the camera to be seen; therefore, a lower threshold exists that prunes candidates that are not interesting, regardless of other candidates.
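A sketch of this priority computation, reading the formula as the candidate's scale over its distance, weighted by the cosine of the angle between the surface normal and the normalized view vector, so grazing views are down-prioritized:

```python
import math

def candidate_priority(scale, view_vec, normal):
    """Priority of a grid candidate: (s / |v|) * (n . v / |v|)."""
    dist = math.sqrt(sum(c * c for c in view_vec))
    cos_angle = sum(n * c for n, c in zip(normal, view_vec)) / dist
    return (scale / dist) * cos_angle

up = (0.0, 1.0, 0.0)   # resting water surface normal
# The same section at the same distance, seen top-down versus grazing.
top_down = candidate_priority(8.0, (0.0, 10.0, 0.0), up)
grazing = candidate_priority(8.0, (9.95, 1.0, 0.0), up)
```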

Figure 6.3: An example of grid distribution over a surface with 4 cells at a given camera position. For simplicity reasons, the grid resolution is shown as 4x4.


6.4.2

Fading

When the LOD manager culls a grid, it is not removed immediately. This is done to prevent popping, where the camera sees large, quick changes in the detail of the water simulation. Instead, when a grid is culled or added, the height data of that grid is faded towards either zero or one over a short period of time. Each frame, the fade levels of all grids are updated and the grids that have faded away completely are removed. This results in smooth transitions between levels of detail when the camera is moving around. In order not to get sharp transitions between a fully visible grid and a grid with a lesser fade value, each grid stores 8 additional fade values, one for each potential neighbor. These values are linearly interpolated to produce a seamless topography of fade levels across the whole water surface.
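The per-frame fade bookkeeping might look like the following sketch (hypothetical names; the fade duration is an assumed constant). Each grid's fade level moves towards its target, and grids that have faded out completely are returned for removal:

```python
def update_fades(fades, targets, dt, fade_time=0.5):
    """Advance fade levels towards their targets (1 = fully visible,
    0 = culled) and return the ids of grids that can now be removed."""
    step = dt / fade_time
    removable = []
    for gid, level in fades.items():
        target = targets[gid]
        if level < target:
            fades[gid] = min(level + step, target)
        else:
            fades[gid] = max(level - step, target)
        if target == 0.0 and fades[gid] == 0.0:
            removable.append(gid)
    return removable

fades = {'a': 1.0, 'b': 0.2}
targets = {'a': 1.0, 'b': 0.0}   # 'b' has been culled by the LOD manager
gone = update_fades(fades, targets, dt=0.1)
```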

6.5

Update Passes

Updating the total height field, Htotal, of all grids is done in four passes, where each pass is fully parallelized and able to run on all 6 SPEs simultaneously on the PlayStation 3, see figure 6.2. The parallelization is done on a grid level, which means that each job in a pass is responsible for updating a subset of all currently simulated grids. The simulation adheres to the Data Parallel type of parallelization, since computation in jobs is split over data rather than tasks. Therefore, the simulation can easily be scaled by adjusting the number of simulated grids. The data needed for each pass is generated by the main thread and stored in separate arrays. Each job in each pass parses its designated array, performs the necessary computation on the grids described in that array, writes data to those grids and then returns. After each pass, all jobs are synchronized before spawning the jobs for the next pass, since each pass is fully dependent on the previous one.

6.5.1

SPE Jobs

Reading from and writing to fragmented memory in an SPE program requires a lot of DMA operations, which would lead to slow code. By isolating the data that is updated by a job, grids can be read from and updated with few DMA operations and without dependencies on other same-pass jobs. The synchronization between each pass also completely removes the need for thread locks for so-called critical sections. By structuring memory this way, the transition to SPE code is made very simple, requiring only an entry point and some additional DMA operations on top of the standard code. If an SPE job has to process more than one grid, it uses a double-buffered scheme that fetches the data for the next grid while processing the current one.


6.5.2

Applying Disturbances

Generating waves on a surface is done by displacing the height field before propagating waves. The first pass handles the disturbances produced by interacting objects. A simple rectangle-overlap test is made for all the disturbance data that was generated during the last frame, which is then applied to the partial height data of the intersecting grids. The disturbance data consists of elliptical shapes that approximate the surface intersections made by the intersecting objects. The elliptical shapes are also dependent on the vertical and horizontal velocity of the object that generated the distortion, as described by Ottosson [Ott11]. For each ellipse, a bounding rectangle is calculated and all the intersecting coordinates in the grid height-fields are modified along the edge of the ellipse. Since each grid has a specific range of wave frequencies that it can simulate, the distortion should not add frequencies outside that range. To accomplish this, the edge of the ellipse is band-pass filtered before it is applied, see figures 6.4 and 6.5. For each intersecting pixel in the grid, the distance to the edge of the ellipse is calculated and then adjusted according to the force of the disturbance and the filter kernel. By using a filter kernel that is defined over a fixed width of grid pixels instead of world units, the same kernel can be used on grids of all dimensions. After all disturbances have been applied to the partial height fields of all intersecting grids, those waves are propagated during the subsequent pass.


[Plots: Partial Height Field at 1 m/pixel, 0.5 m/pixel, 0.25 m/pixel, 0.125 m/pixel and 0.0625 m/pixel]

Figure 6.4: Decomposition of an ellipse with a downward force. The figure shows cross-sections of partial height fields of different scale when disturbed at origin by an ellipse with radius 2m.

[Plot: Total Height Field at 0.0625 m/pixel]

Figure 6.5: The total distortion of the water surface accumulated from all partial height fields in figure 6.4.

6.5.3

Wave Propagation

Propagating waves is done by applying the approximated convolution kernel to the existing height field and then using the resulting data to calculate the


vertical velocities for all vertices in the height field. The convolution operation is the most performance-costly operation of the whole simulation and scales with the square of the convolution kernel size. To minimize the cost of this operation, a kernel width of 5 is used. Taking advantage of the separability of the approximated kernel, the operation is done in two passes, one vertical and one horizontal. A property of this simulation is that waves are automatically reflected against empty borders. Since a lot of grids will have empty borders without actually being on the edge of the total water surface, these reflections need to be eliminated. To reduce this effect, the amplitudes of the velocity field are slowly faded towards sides that have no neighboring grid, using the fade levels of each grid.
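The two-pass trick relies on the kernel being separable: convolving the rows and then the columns with a 1-D kernel equals one 2-D convolution with the outer product of that kernel with itself. A sketch with an arbitrary symmetric 5-tap kernel (not the thesis's actual propagation kernel); out-of-grid samples are treated as zero, which is also what causes reflections against empty borders:

```python
def convolve_rows(field, kernel):
    """1-D convolution of every row; samples outside the grid are zero."""
    r = len(kernel) // 2
    out = []
    for row in field:
        n = len(row)
        out.append([sum(kernel[k + r] * row[x + k]
                        for k in range(-r, r + 1) if 0 <= x + k < n)
                    for x in range(n)])
    return out

def convolve_separable(field, kernel):
    # Horizontal pass, then a vertical pass done as a horizontal pass
    # on the transposed field.
    horizontal = convolve_rows(field, kernel)
    vertical = convolve_rows([list(c) for c in zip(*horizontal)], kernel)
    return [list(r) for r in zip(*vertical)]

KERNEL = [0.05, 0.2, 0.5, 0.2, 0.05]   # symmetric 5-tap example

# Impulse response: a single unit peak spreads into the outer product
# of the kernel with itself.
field = [[0.0] * 5 for _ in range(5)]
field[2][2] = 1.0
result = convolve_separable(field, KERNEL)
```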

6.5.4

Border Copying

The next pass handles copying data between all the grids of the water surface. Since convolution is used for wave propagation, each grid needs data outside the simulated area. This data is part of the total 64x64 grid and forms a frame, or border, around the simulated data. The width of the border was chosen to be 4 (which leaves a 56x56 simulated area), to easily allow larger convolution kernels (a maximum size of 9) and to make the code more suitable for SIMD operations. To enable a wave to travel from one grid to another, sections of simulated data from the source grid are written to the borders of the neighboring grids in all directions. This results in a maximum of 8 memory blocks that need to be copied for each grid, without doing any calculations on that data. Doing this on the SPEs will basically only use DMA instructions, so it is vitally important to issue as few of those as possible. Writing horizontal borders to main memory can be done with a single DMA instruction, since that memory is sequential. However, vertical borders need one DMA operation per line, since that memory is non-sequential. To avoid writing non-sequential data, borders are instead copied from neighboring grids into the working grid. The border data for the working grid is assembled, after which the whole working grid is written back to memory. By only reading from neighboring grids, the result can be written back to main memory using only a single DMA operation, without synchronization with other border copy jobs.
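A sketch of the read-only border assembly for one working grid (hypothetical layout: 16x16 arrays with a 4-wide border around an 8x8 simulated area, shrunk from 64x64 for readability). Only the west and east neighbors are shown; the full pass handles all 8 directions:

```python
def assemble_borders(grid, neighbors, n=16, border=4):
    """Fill the border frame of `grid` (an n x n list of lists) by
    reading edge strips from same-resolution neighbors, so the result
    can be written back with a single sequential transfer."""
    sim = n - 2 * border   # width of the simulated area
    west = neighbors.get('west')
    if west is not None:
        for y in range(border, border + sim):
            for b in range(border):
                # Rightmost simulated columns of the west neighbor
                # become this grid's left border.
                grid[y][b] = west[y][sim + b]
    east = neighbors.get('east')
    if east is not None:
        for y in range(border, border + sim):
            for b in range(border):
                grid[y][border + sim + b] = east[y][border + b]
    return grid

# West neighbor whose cells record their own column index.
west = [[float(x) for x in range(16)] for _ in range(16)]
work = [[0.0] * 16 for _ in range(16)]
assemble_borders(work, {'west': west})
```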

6.5.5

Grid Summation

The final pass of the simulation update accumulates overlapping grids and updates the total height data for each grid. The total height data for a grid contains the combined wave information of all underlying grid sections and the local partial height data. In order to be able to add the partial data to grid data of a lower resolution, the lower-resolution data needs to be scaled up first. This upscale operation uses bi-cubic interpolation to approximate the shapes of waves in a higher resolution than the source data. This pass is not trivial to parallelize, since the summation of a grid is directly dependent on the summations of all underlying grids and is thus recursive in its

nature. To achieve parallelism in a scalable manner, it is observed that grid Gi roughly only uses 1/4 of the up-scaled data from grid Gi−1, 1/16 of the data from grid Gi−2, 1/64 of the data from grid Gi−3, and so forth. By reading only the appropriate sections from all the underlying grids, full parallelization can be achieved per grid. This leads to a lot of small memory accesses for every grid and many potentially overlapping calculations, due to the fixed-size borders needed for every section. However, these penalties are overshadowed by the scalability that comes from every grid being able to perform the full summation without dependencies on the summation of other grids. To optimize further, the total height data for each grid is only calculated for visible quadrants.
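The accumulation itself can be sketched as follows, using bilinear up-scaling in place of the bi-cubic interpolation the implementation uses, to keep the example short. Each grid's total height is its local partial height plus the up-scaled total of the grid below it:

```python
import math

def upsample2x(coarse):
    """Double the resolution of a square height field with bilinear
    filtering (the implementation uses bi-cubic; bilinear is shorter)."""
    n = len(coarse)
    fine = [[0.0] * (2 * n) for _ in range(2 * n)]
    for y in range(2 * n):
        for x in range(2 * n):
            fy, fx = (y - 0.5) / 2.0, (x - 0.5) / 2.0
            y0 = min(max(int(math.floor(fy)), 0), n - 1)
            x0 = min(max(int(math.floor(fx)), 0), n - 1)
            y1, x1 = min(y0 + 1, n - 1), min(x0 + 1, n - 1)
            ty = min(max(fy - y0, 0.0), 1.0)
            tx = min(max(fx - x0, 0.0), 1.0)
            top = (1 - tx) * coarse[y0][x0] + tx * coarse[y0][x1]
            bottom = (1 - tx) * coarse[y1][x0] + tx * coarse[y1][x1]
            fine[y][x] = (1 - ty) * top + ty * bottom
    return fine

def accumulate_total(partial, coarse_total):
    """H_total = local H_partial + up-scaled total of the underlying grid."""
    up = upsample2x(coarse_total)
    return [[partial[y][x] + up[y][x] for x in range(len(up))]
            for y in range(len(up))]

coarse = [[1.0, 1.0], [1.0, 1.0]]           # flat underlying grid
partial = [[0.25] * 4 for _ in range(4)]    # local high-frequency waves
total = accumulate_total(partial, coarse)
```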

6.6

Rendering

After the simulation update for a frame is completed, all grid attributes necessary for rendering are copied to allow the next update to run while rendering for the current frame is done. After copying these attributes, drawing lists are populated with draw-function calls for all surface meshes that will be sent to the GPU during the next frame. To render the complete surface of the water, each visible quadrant of all grids is drawn as a separate mesh. If a surface cell does not contain any grids, a simple flat quad is drawn in that surface section. To produce the grid meshes, the world coordinates of each element in the total height field of all grids are written to a vertex buffer. That data is also written to a texture atlas, which is used to calculate surface normals in a fragment shader during rendering. After all data is sent to the GPU, the drawing lists produced earlier are processed to render the water surface. Writing data and rendering are done asynchronously with the simulation update on different threads, which is why the vertex buffer and texture buffer have to be double-buffered. The index buffer, which is also needed for rendering, is constant for all configurations of grids and does not need to be double-buffered.

6.6.1

Drawing

Drawing is done in a separate job that is dependent on the simulation jobs being finished. The Frostbite 2 engine uses a deferred renderer, which means that no actual rendering is done in this stage; only drawing lists are generated. During the drawing job, the quad-trees in each surface cell are parsed and data from the grids is collected. For each grid in a tree, it is determined for all quadrants whether the quadrant is visible or hidden underneath other grids. For all visible quadrants, the position, scale and local height data associated with the current grid are stored in a list. If a cell does not contain a quad-tree, data representing an empty cell is stored in another list in a similar manner. After all relevant grid data has been collected, drawing lists are created with an entry for each draw call. A draw call entry consists of pointers to the index, vertex and texture data blocks, along with shader parameters. It is important

to note that, at this stage, only the index buffer contains the correct data for rendering. The vertex and texture buffers are filled with the data for this frame later, during the dispatch.

6.6.2

Dispatch

Before anything is rendered to the screen, a dispatch job is run where the vertex and texture data are written to GPU buffers and the drawing lists created earlier are dispatched. Filling the vertex and texture buffers with data is done by parsing the lists of visible grid quadrants and empty cells. The vertex buffer is an array of vertex objects for all vertices that are rendered during a frame, where each vertex object contains world and texture coordinates. Since the vertex buffer can easily grow very large when rendering high-resolution water, it is important to keep the size of the vertices as small as possible to avoid memory bandwidth bottlenecks. The current implementation uses vertices of 16B in size, including 12B for three float32 world coordinates and 2B for two uint8 texture coordinates. The vertices are padded to 16B with 2B of dummy data, since both the Xbox 360 and the PlayStation 3 need 16B-aligned vertices. The height data used for calculating surface normals is stored in a monochromatic 16-bit texture for all grids. After the vertex and texture buffers have been filled with data, the drawing lists are sent to the GPU, which then renders the water surface to the screen. The result of a completed rendered frame can be viewed in figure 6.6.

Figure 6.6: A screenshot of the Frostbite 2 engine, using a test level with a simulated lake. The frame is taken shortly after a point-sized object has interacted with the water surface.


Chapter 7

Results and analysis


In this chapter, the performance of the parallelized implementation is compared to the implementation of the previous work and then discussed. Benchmarks are done with a varying number of grids/pyramids for both implementations on PC. Performance results on Xbox 360 and PlayStation 3 are only available for the parallel implementation since the old implementation was not designed for those platforms.

7.1

Previous Implementation

These are the results of the previous implementation running on a PC with an Intel Xeon x5550 2.67GHz processor with 8 virtual cores, 12GB RAM and an Nvidia GeForce GTX 470 graphics card. The results are taken from Ottosson's report [Ott11] and presented with some additional data, see table 7.1. Here, it is shown how many vertices are simulated in total and roughly how many uniform 64x64 grids in the parallel implementation this corresponds to, not including border data. To further simplify such comparisons and to show how well the algorithm scales with simulated area, the simulation time spent per grid (using the estimated number of grids) is also shown. The resolutions specified represent the number of complete pyramids, with top-level grids of the given resolution, available. Each pyramid includes grids from the specified resolution down to a size of 4x4. The time for simulation is the total time spent on propagating waves and copying borders, while rendering is the time it takes to generate vertex buffers and textures. Times are measured in wall-clock time and are presented in milliseconds per frame.


P64x64  P32x32  P16x16  PAll  tsim  trender  v      G64x64  tsim/G64x64
4       12      0       16    2.0   0.5      38144  12      0.16
4       12      36      52    3.1   0.7      50240  16      0.19
4       16      64      84    4.3   0.9      65088  21      0.21
6       24      0       30    4.0   0.7      65376  21      0.19
6       18      54      78    4.5   1.0      75360  24      0.19
6       24      96      126   6.5   1.3      97632  31      0.21

Table 7.1: Results of the previous implementation, where Pn is the number of pyramids used with the specified top-level resolution. Durations per frame t are shown in milliseconds for the simulation, the rendering and the time spent per grid. v is the number of simulated vertices and G64x64 is roughly the corresponding number of uniform grids.

As can be seen in table 7.1, the only configuration that performed well enough (< 3ms) was the one that roughly corresponds to 12 64x64 grids. This is certainly enough for a moderately detailed simulation. This configuration, however, does not contain any 16x16 pyramids, which makes it difficult to simulate larger sections of low resolution waves. Considering this, the second configuration might be more suitable even if it performs slightly worse.

7.2 Parallel Implementation

The results for the parallelized implementation are presented for each platform in tables 7.2, 7.3 and 7.4. Each table shows the execution time for different grid pool sizes, for single-threaded execution and for parallel execution with 6 threads. The grid pool sizes were chosen to best match the range of total simulated data that was presented in Ottosson's report, using a resolution of 64x64 for each grid. All data is shown in milliseconds per frame, wall-clock time, where the pass-specific times are execution times for a single thread. For each platform, the average speedup, Sp = T1/Tp, and efficiency, Sp/p, of the parallelization are calculated, where p is the number of threads and Tn is the total wall-clock time for n threads. These measurements were made in-engine on a testing level with a large, approximately 200x200m, water surface. The camera was positioned close to the surface to maximize the surface detail for each benchmark. No disturbances were made, so that pass is not included in these measurements. All times were measured by observing in-engine timers displayed on the screen. As those timers fluctuated somewhat, the times shown here are an estimated average. The PC used in this benchmark is equipped with an Intel Xeon X5650 2.67GHz processor with 12 virtual cores, 12GB RAM and an ATI Radeon HD 5700 graphics card.
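The speedup and efficiency measures used here can be expressed directly; a minimal sketch:

```cpp
// Average speedup and parallel efficiency as defined above:
// Sp = T1 / Tp, efficiency = Sp / p, where Tp is the total
// wall-clock time when running with p threads.
double speedup(double t1, double tp)           { return t1 / tp; }
double efficiency(double t1, double tp, int p) { return speedup(t1, tp) / p; }
```

For example, the PlayStation 3 figures in table 7.4 (tsim1 = 10.3, tsim6 = 2.2 for the smallest pool) give a speedup of about 4.7 for that pool size, in line with the reported average of 4.63.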


G64x64  theight1  tborders1  tsummation1  tsim1  tsim6  trender1  tsim1/G64x64
12      2.0       <0.1       1.9          4.1    1.2    0.4       0.10
16      2.7       <0.1       2.6          5.4    1.5    0.5       0.09
24      4.1       <0.1       3.9          8.2    2.8    0.8       0.12
32      5.4       0.1        5.3          10.9   3.2    1.1       0.10

Average speedup: Sp = 3.34, efficiency: Sp/p = 0.56

Table 7.2: Results of the parallel implementation on a PC with 12 virtual cores. Frame times for both 1 thread (tsim1, trender1) and 6 threads (tsim6) are displayed along with the average speedup and average efficiency.

G64x64  theight1  tborders1  tsummation1  tsim1  tsim6  trender1  tsim1/G64x64
12      8.3       0.1        7.6          16.2   4.9    5.8       0.41
16      10.5      0.2        11.2         22.3   6.2    6.7       0.42
24      16.5      0.3        16.8         34.5   9.8    10.6      0.44
32      21.7      0.4        23.3         46.3   15.7   13.7      0.43

Average speedup: Sp = 3.34, efficiency: Sp/p = 0.56

Table 7.3: Results of the parallel implementation on Xbox 360. Frame times for both 1 thread (tsim1, trender1) and 6 threads (tsim6) are displayed along with the average speedup and average efficiency.

G64x64  theight1  tborders1  tsummation1  tsim1  tsim6  trender1  tsim1/G64x64
12      5.5       <0.1       4.2          10.3   2.2    3.0       0.18
16      7.4       0.1        5.8          13.6   3.2    4.1       0.20
24      11.1      0.2        8.9          20.5   4.2    5.5       0.18
32      14.8      0.3        12.2         27.6   5.8    7.3       0.18

Average speedup: Sp = 4.63, efficiency: Sp/p = 0.77

Table 7.4: Results of the parallel implementation on PlayStation 3. Frame times for both 1 thread (tsim1, trender1) and 6 threads (tsim6) are displayed along with the average speedup and average efficiency.

7.3 Analysis

Comparing the parallel implementation to Ottosson's results, it is clear that the old algorithm was parallelizable with good results. The new parallel algorithm


outperforms the previous one when looking at wall-clock time for both rendering and simulation. It is important to note, however, that the single-thread performance of the previous algorithm is about 200% better than the new one. This is mainly because the parallel implementation is not yet adapted for SIMD instructions. With SIMD optimizations, the parallel implementation is expected to exceed the previous code in single-threaded mode as well. While not as heavily optimized, the new algorithm, running in parallel, scales almost perfectly linearly with the amount of grid space that is simulated. This is an improvement over Ottosson's implementation, which spends more time per grid area the more pyramids are used, because of the overhead that comes with using many smaller grids to make up the base levels of the pyramids. The performance on the PlayStation 3 is as expected. With a parallel efficiency of 77%, it is the platform that gains the most from the parallel simulation compared to the single thread. This is because the simulation can utilize 6 physical cores at the same time, which suffer from no shared cache issues. The significant single-thread advantage of the PlayStation 3 over the Xbox 360 is because that thread runs on a dedicated SPE instead of the PPE and, as such, does not have to share resources with other running threads. Since the code is also optimized for SPEs, it is assumed that the PPE suffers from cache issues, which explains some of the poor performance. Since the Xbox 360 only has 3 physical cores, it cannot take advantage of the parallelization as well as the PlayStation 3. The speedup, however, being super-linear in relation to the 3 cores, shows the performance effects that hyper-threading can provide. Looking at the performance of the PC, it is interesting to see that it has the same speedup as the Xbox 360. This might also be explained by hyper-threading issues, since the PC's 12 cores are all logical, distributed over 6 physical cores. That many physical cores should be able to provide a speedup closer to that of the PlayStation 3, but suboptimal job distribution may lead to all 6 threads being run on only 3 physical cores. In the previous implementation, it was assumed that the wave propagation step outweighed the up-scaling step in terms of performance cost. Since the up-scaling step contains memory access overheads that are dependent on the grid pool size, these operations could become the dominating cost factor when the rest of the code is vectorized. The effects of frequent memory accesses are apparent on the Xbox 360, where the up-scaling cost surpasses the cost of propagation when the grid pool size is increased, dominating the total simulation time. The render step, where GPU buffers for vertices and textures are filled with the height-field data, is a big bottleneck, especially on the consoles, mainly due to memory bandwidth issues. The implementation on PlayStation 3 uses the Cell memory for storage, since writing to it is about 4 times as fast as writing to RSX memory. This is not a viable solution, however, since Cell memory is needed for other parts of the engine. Ottosson used a method for reducing the resolution of the drawn mesh in relation to the simulated height-field, which reduces the stress on bandwidth greatly without compromising visual quality too much. This feature was not implemented in the parallel simulation due to time constraints, but once implemented, it should reduce the time it takes to construct the GPU buffers by at least 75%.


Chapter 8

Discussion
This thesis presents an algorithm that continues the work on a dispersive water wave simulation implemented in the game engine Frostbite 2, developed by DICE. It improves on the previous implementation by being able to run in parallel, making it possible to utilize modern multi-core architectures like the PlayStation 3. By simulating water using a homogeneous pool of resources, level designers can easily adjust the fidelity of the simulation according to the available performance. This also enables continuous Level Of Detail, which results in high quality simulation of water that can be viewed both near and far away simultaneously. In this chapter, the strengths and weaknesses of the new design elements are discussed along with suggestions on how to evolve the simulation further.

8.1 Design

The design of the new algorithm was done with parallelism as the main mode of dealing with performance. Since Data Oriented Design is very well suited for multi-threaded applications, the implementation was done with the location and access pattern of memory in mind. By focusing on parallelizing on a task level, some of the instruction-level parallelism of the previous implementation was lost. However, the new design leaves room for such optimizations to be added easily, which is promising for future console iterations of the simulation.

8.1.1 Homogeneous Grids

It was assumed that the wave propagation code was a bottleneck of the previous simulation, which is why the new code was designed around making propagation as fast as possible. By using only a single resource pool populated with grids of the same resolution, larger sections of memory can be processed in the same loop, increasing data locality and performance. Not having small grids also means that the memory footprint of the old implementation was reduced significantly.

However, using only same-size grids means that the up-scaling process needs to access unaligned and non-sequential blocks of memory when reading a section of an underlying grid. This is not a big problem on the SPEs, since memory can be managed asynchronously by the DMA unit, but it needs to be cache optimized on PC and Xbox 360 to avoid stalls.
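The homogeneous pool idea can be sketched as a flat array holding every grid's height data back to back, so one sequential loop touches the whole pool. The names and the trivial per-cell update below are placeholders for the real propagation stencil:

```cpp
#include <cstddef>
#include <vector>

// Illustrative sketch of a homogeneous grid pool: every grid has the
// same resolution, so all height data sits in one contiguous array
// and can be processed in a single pass with good data locality.
struct GridPool
{
    static const int kRes = 64;          // 64x64 cells per grid
    std::vector<float> heights;          // gridCount * kRes * kRes floats

    explicit GridPool(int gridCount)
        : heights(static_cast<std::size_t>(gridCount) * kRes * kRes, 0.0f) {}

    // Placeholder "propagation": one sequential pass over the pool.
    // The real stencil would read neighbouring cells; the point here
    // is only the single contiguous loop over all grids at once.
    void propagate(float damping)
    {
        for (std::size_t i = 0; i < heights.size(); ++i)
            heights[i] *= damping;
    }
};
```

Because the grids are identical in size, no per-grid dispatch or indirection is needed inside the hot loop, which is what makes the layout friendly to both caches and SPE DMA streaming.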

8.1.2 Improved Level Of Detail

The biggest functional gain from using one single pool of grids is that the Level Of Detail can work on any scale. A camera can go from looking at water waves centimeters in length at close range to surveying a whole ocean in one continuous transition without visual popping or decreased performance. A large problem with the previous LOD system was the limited number of pyramids that were available for each resolution. Since the cells that divided the surface were fixed in size, pyramids of high resolution would be rendered at an unnecessarily high detail when looking at the water from a distance. From longer distances, there would not be enough pyramids available to cover all visible cells, severely limiting the dynamic range of water waves that could be simulated. By sorting all grids and empty grid slots according to their visible area on screen, the LOD system can balance the simulation fidelity, dynamically adding and removing grids of smaller scales as the camera moves through the world, to present the player with the optimal surface resolution. Special care must however be taken not to remove grids that might only be momentarily out of sight, since removing grids results in the loss of wave information. Such disappearing waves might break the illusion of a persistent world.
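The sorting-based balancing described above might be sketched as follows; the types and the fixed grid budget are hypothetical stand-ins for the engine's actual structures:

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical sketch of the LOD balancing: grid slots are sorted by
// their visible area on screen, and the fixed pool of simulation grids
// is assigned to the highest-priority slots first.
struct GridSlot
{
    float screenArea; // projected visible area of this cell on screen
    bool  hasGrid;    // whether a simulation grid is assigned to it
};

void assignGrids(std::vector<GridSlot>& slots, int poolSize)
{
    // biggest on-screen area first
    std::sort(slots.begin(), slots.end(),
              [](const GridSlot& a, const GridSlot& b)
              { return a.screenArea > b.screenArea; });

    // hand out the limited pool of grids in priority order
    for (std::size_t i = 0; i < slots.size(); ++i)
        slots[i].hasGrid = (static_cast<int>(i) < poolSize);
}
```

A production version would also apply hysteresis before revoking a grid, to avoid dropping wave data for cells that are only momentarily out of sight, as the paragraph above cautions.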

8.1.3 Fading

To make sure that waves are not reflected against cell edges that are empty, the simulation velocities need to be slowly faded towards such edges. This was previously done during the rendering build process but is now done at the same time as the grid data is up-scaled. Since the up-scaling process is done cumulatively, going from lower-resolution grids to higher, the fading can be done over multiple grid levels using the fade levels of the underlying grid. This also takes some load off the single-threaded rendering step, which can be observed in the tables in the results chapter. To accommodate the fixed-size grids, quad-trees are used to distribute them over the cells of the water surface. Since the base grids on the water surface cover a much larger area than before, the fade levels of those grids affect the simulation to a much larger extent. This means that, with accumulated fade levels over multiple levels of grids, the edges towards non-simulated cells can look dull and waste much simulation data.


8.2 Future Implementations

While the water simulation implemented in this thesis is competent and able to render high quality interactive water, there are still many features missing from the algorithm in this report that would enhance the user experience or otherwise reduce the remaining issues.

8.2.1 Enhanced Effects

A realistic water simulation can look impressive, but without effects like, for example, spray, foam and breaking waves, the water will look flat, and since the simulation in this thesis is based on a height-field, none of these effects can be simulated properly. Even without realistic simulation, games have managed to produce high quality illusions of water through particle effects and decals. Adding such details will greatly enhance the experience for the player and is vital if the simulation is to be used in a high-production game.

8.2.2 Ambient Waves

The simulation presented in this thesis is able to realistically simulate ocean waves over large areas of water. However, the majority of waves on an ocean are often created by large, external forces such as hard weather. These waves, called ambient waves, would be very costly to simulate using the interactive simulation. In order to get detailed waves for a whole ocean at a low performance cost, a deterministic simulation can be used for ambient waves that the player cannot affect. This could be combined with the interactive wave simulation to let artists precisely direct ambient water animation while still allowing interaction from the player. When objects intersecting the surface are hit by statically simulated waves, they could then generate dynamic interactive waves as a result.

8.2.3 Customized Interaction

Water in games can be affected by many factors like, for example, gusts of wind, reflecting shorelines and rain. For the simulation to handle such a wide range of factors, a highly complex algorithm would be required. In order to save performance, predictable interaction can instead be modeled directly by artists, allowing the water behavior to be customized for each scene. Ideally, there should exist an interface for generating custom disturbances at specific locations on water surfaces. This way, turbulence generated by hovering helicopters, ripples created by rocks and splashes from bullets being fired into the water can be created artificially, without physically correct interactions. The API should let disturbances be made with varied shapes, sizes and forces to enable as much artistic freedom as possible.
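Such an interface could look roughly like this; the shape set, names and parameters are purely hypothetical and only illustrate the kind of API the text proposes:

```cpp
#include <cstddef>
#include <vector>

// Hypothetical artist-facing disturbance API, as proposed above.
enum class DisturbanceShape { Point, Disc, Ring, Line };

struct Disturbance
{
    DisturbanceShape shape;
    float x, z;    // position on the water surface (m)
    float radius;  // extent of the disturbance (m)
    float force;   // signed vertical impulse applied to the surface
};

class WaterSurface
{
public:
    // queue a disturbance to be applied in the next simulation step
    void addDisturbance(const Disturbance& d) { m_pending.push_back(d); }
    std::size_t pendingCount() const { return m_pending.size(); }

private:
    std::vector<Disturbance> m_pending; // consumed by the disturbance pass
};
```

Gameplay code could then fake a helicopter downwash as a repeated `Disc` disturbance or a bullet splash as a single `Point` impulse, without any physically correct interaction model.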


8.2.4 Situational Level Of Detail

The Level Of Detail used in the current implementation is competent but only considers the position of the viewer. Having the camera close to the water surface does not necessarily mean that the surrounding area contains the most detailed information. If the player is looking at an interacting object some meters away without generating any local disturbances, the LOD system should prioritize the area around the interacting object, thus increasing the simulation fidelity at that location. This should be trivial to implement by introducing more factors into the priority calculation for each surface area. Another improvement of the current LOD system would be to take permanently occluded water sections into account when constructing the priority list. As has already been mentioned in section 6.4.1, the LOD system should not remove temporarily occluded surface grids because of rapid player movement. However, when designing a level, the surface plane of the water is often bigger than the actual visible water. This creates surface sections that are permanently occluded by static terrain and should be ignored when distributing grids.

8.2.5 Better Boundary Conditions

The velocity fading that is done towards empty borders to prevent reflections could be removed completely with better boundary conditions. If the open edges can be approximated to give the effect of waves continuing and disappearing instead of reflecting, wave information would not be lost along empty borders. The need for fading would then only exist for reducing the visible difference between two levels of detail, which could be done without affecting the simulated waves. There are water volumes, like swimming pools, where total reflection along the borders is desired, so the boundary conditions should be configurable for each surface.

8.2.6 Non-Linear Texture Mapping

In order to calculate the normals for the surface mesh, the height-field is rendered to a grey-scale texture map. This texture map is of limited bit-depth, which makes small waves that oscillate close to zero generate large differences in the normal map. This can make a near-calm water surface flicker and look bad. It is assumed that large waves do not need the same height resolution as smaller waves. Instead of linearly mapping the height values to the texture, a logarithmic mapping can be performed, which would make the resolution relative to the water height.
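A sign-preserving logarithmic mapping of this kind could be sketched as follows; the compression constant k and the function name are assumptions made for the example, not part of the thesis implementation:

```cpp
#include <cmath>
#include <cstdint>

// Hypothetical sign-preserving logarithmic mapping of a height value
// h in [-maxH, maxH] to a 16-bit texel. Heights near zero get far more
// of the texel range than large waves, reducing quantization flicker
// on near-calm water. k controls how strong the compression is.
uint16_t encodeHeightLog(float h, float maxH, float k = 100.0f)
{
    float n = std::log1p(k * std::fabs(h) / maxH) / std::log1p(k); // [0, 1]
    float s = (h < 0.0f) ? -n : n;                                 // [-1, 1]
    return static_cast<uint16_t>((s * 0.5f + 0.5f) * 65535.0f + 0.5f);
}
```

The shader side would invert the mapping before computing normals; the extra transcendental cost is paid once per texel, not per rendered pixel, if the decode is folded into the normal-map generation.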


8.2.7 Performance Efficient Flow Simulation

Linear Wave Theory does not handle water flow, but with image advection techniques, flow can still be simulated by transposing wave information along defined flow vectors. Doing so would require good error correction algorithms, which would impair performance to a large degree. Image advection over Eulerian grids without error correction normally leads to some dissipation, where high-frequency information is lost. This effect might be overlooked in other simulations in order to still get advection on a larger scale. However, since this implementation stores wavelengths in different grids according to their frequency, every grid would suffer from dissipation relative to the wavelengths stored in that grid. This would effectively kill most waves within just a few frames of advection, which is why an error correction algorithm is essential for flow in this implementation.

8.2.8 Minimized Vertices

As shown in the results, a large portion of the time for animating water is consumed by writing data to GPU buffers. It is therefore important to minimize the amount of data needed for each vertex in the surface mesh. By identifying data that is constant over many vertices, that data can be sent to the shaders once per draw call instead. If, for example, a draw call is made for each grid and the position of that grid is sent to the shader, each vertex only needs to contain information relative to the grid origin. Sending only relative information for each vertex would remove the need for the high-precision data types that are otherwise required to specify precise world coordinates. It should be noted, however, that on certain systems draw calls are very expensive and should in such cases not be used extensively.
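The grid-relative encoding could be sketched like this, with illustrative names; the reconstruction functions show what the vertex shader would perform from the per-draw-call constants:

```cpp
#include <cstdint>

// Constants uploaded once per draw call (one call per grid): the grid
// origin in world space and the size of one cell. Names are illustrative.
struct GridConstants { float originX, originZ, cellSize; };

// Per-vertex data shrinks to two small integers: the vertex's cell
// indices within its grid, instead of full float world coordinates.
struct PackedVertex { uint8_t ix, iz; };

// What the vertex shader would reconstruct from the two pieces:
float worldX(const GridConstants& g, const PackedVertex& v)
{ return g.originX + g.cellSize * v.ix; }

float worldZ(const GridConstants& g, const PackedVertex& v)
{ return g.originZ + g.cellSize * v.iz; }
```

With 64x64 grids the indices fit comfortably in a uint8, cutting the per-vertex position data from 12B of float32 down to 2B, at the cost of one draw call per grid.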

8.2.9 Mesh Reduction

Another feature that was lost from the previous implementation was the resolution reduction of surface meshes. It is not necessary to render the water surface mesh at the same resolution as the simulation, since the smaller waves can be visualized through the normal mapping alone. By using a full-resolution normal map on a mesh of reduced resolution, large amounts of GPU memory can be saved without compromising visual quality to any larger extent. This reduction requires some special attention when fading between detail levels, which is why there was not enough time to include it in the new fading procedure.

8.2.10 Unified Quad-tree

One of the bigger problems with the previous implementation was the number of fixed-size cells needed to render larger water surfaces. While the inclusion of per-cell quad-trees reduced this problem somewhat, memory still scales

poorly with very large water surfaces, leaving many cells empty. Instead, a single quad-tree should be used for a whole surface. By specifying maximum and minimum grid sizes for each surface and allowing empty nodes in the quad-tree, the range of simulated wavelengths could be controlled. This would also simplify code and waste less memory for each surface.

8.2.11 Multiple Convolutions

With the parallel implementation designed for optimal performance when propagating waves, improvements might be made to the convolution step without too much loss of performance. A better approximation of the iWave kernel could be achieved with multiple Gaussian kernels of varying width. This would result in more physically correct dispersion and Kelvin wakes much closer to those of the iWave algorithm.
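Building such a kernel as a weighted sum of Gaussians might look as follows; the weights and widths passed in would need to be fitted to the iWave kernel, and the function here is only a structural sketch:

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Illustrative sketch: approximate a 1D convolution kernel as a
// weighted sum of Gaussians of different widths. The caller supplies
// per-Gaussian weights and sigmas; these are placeholders here, not
// fitted iWave coefficients.
std::vector<float> sumOfGaussians(int radius,
                                  const std::vector<float>& weights,
                                  const std::vector<float>& sigmas)
{
    std::vector<float> kernel(2 * radius + 1, 0.0f);
    for (std::size_t g = 0; g < weights.size(); ++g)
        for (int x = -radius; x <= radius; ++x)
            kernel[x + radius] += weights[g]
                * std::exp(-(float(x) * x) / (2.0f * sigmas[g] * sigmas[g]));
    return kernel;
}
```

Since each Gaussian is separable, the multiple convolutions could still be run as cheap horizontal and vertical passes, keeping the cost compatible with the parallel propagation design.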

8.3 Conclusion

The implementation presented in this report has successfully managed to parallelize an existing method for simulating interactive water. While the main target of this work has been the PlayStation 3, the simulation is well suited for any parallel architecture. Being parallelized on a data level, and thus able to make use of an arbitrary number of processing cores, the method is also well suited for future parallel architectures. The next step in Frostbite 2 for this simulation is to optimize the code on an instruction level. With these optimizations, the simulation should be ready for production on PC as well as Xbox 360 and PlayStation 3.


Bibliography

[BA83] P. Burt and E. Adelson. The Laplacian pyramid as a compact image code. Communications, IEEE Transactions on, 31(4):532-540, April 1983.

[BLK+07] A. Buttari, P. Luszczek, J. Kurzak, J. Dongarra, and G. Bosilca. A rough guide to scientific computing on the PlayStation 3. Cell, 5(11):675-675, 2007.

[Car02] Carlos Alberto G. Carvalho. The gap between processor and memory speeds. In ICCA'02: 3rd Internal Conference on Computer Architecture, 2002.

[Cof11] Christina Coffin. SPU based deferred shading in Battlefield 3 for PlayStation 3. In Game Developers Conference 2011, 2011.

[Col10] Daniel Collin. Introduction to data oriented design. In DICE Coders Day 2010 November, 2010.

[Col11] Daniel Collin. Culling the Battlefield: data oriented design in practice. In Game Developers Conference 2011, 2011.

[Day09] Mike Day. Insomniac's water rendering system. http://www.insomniacgames.com/tech/articles/0409/files/water.pdf, March 2009. Retrieved February 22nd 2012.

[DH96] Karel Driesen and Urs Hölzle. The direct cost of virtual function calls in C++. SIGPLAN Not., 31:306-323, October 1996.

[DYQKEH10] Li Dong, Liu You-Quan, Bao Kai, and Wu En-Hua. Real-time shallow water simulation on terrain. In Proceedings of the 9th ACM SIGGRAPH Conference on Virtual-Reality Continuum and its Applications in Industry, VRCAI '10, pages 331-338, New York, NY, USA, 2010. ACM.

[Eng10] Pål-Kristian Engstad. Introduction to SPU optimizations. http://www.naughtydog.com/docs/gdc2010/intro-spu-optimizations-part-1.pdf, 2010.

[Fly72] Michael J. Flynn. Some computer organizations and their effectiveness. Computers, IEEE Transactions on, C-21(9):948-960, September 1972.

[FR86] Alain Fournier and William T. Reeves. A simple model of ocean waves. SIGGRAPH Comput. Graph., 20:75-84, August 1986.

[Fre11] Andreas Fredriksson. Executable bloat - how it happens and how we can fight it. http://publications.dice.se/publications.asp, 2011.

[Hei08] Becky Heineman. Common performance issues in game programming. http://www.gamasutra.com/view/feature/3687/sponsored_feature_common_.php, June 2008. Retrieved February 22nd 2012.

[HS86] W. Daniel Hillis and Guy L. Steele, Jr. Data parallel algorithms. Commun. ACM, 29:1170-1183, December 1986.

[Igl04] A. Iglesias. Computer graphics for water modeling and rendering: a survey. Future Generation Computer Systems, 20:1355-1374, 2004.

[JG01] Lasse Staff Jensen and Robert Golias. Deep-water animation and rendering. http://www.gamasutra.com/gdce/2001/jensen/jensen_01.html, 2001. Retrieved February 22nd 2012.

[Kal08] Daniel Kallin. Real time large scale fluids for games. Linköping Electronic Conference Proceedings (Proceedings of SIGRAD 2008), 2008.

[KLLR07] ByungMoon Kim, Yingjie Liu, Ignacio Llamas, and Jarek Rossignac. Advections with significantly reduced dissipation and diffusion. IEEE Transactions on Visualization and Computer Graphics, 13:135-144, January 2007.

[LKES09] Yotam Livny, Zvi Kogan, and Jihad El-Sana. Seamless patches for GPU-based terrain rendering. Vis. Comput., 25:197-208, February 2009.

[Lov02] J. Loviscach. A convolution-based algorithm for animated water waves. In Eurographics, volume 2, pages 381-389, 2002.

[MB96] Nelson Max and Barry Becker. Flow visualization using moving textures, 1996.

[Ott11] Björn Ottosson. Rapid, stable fluid dynamics for computer graphics. Master's thesis, Kungliga Tekniska högskolan, 2011. Retrieved from http://www.nada.kth.se/utbildning/grukth/exjobb/rapportlistor/2011/index.html February 22nd 2012.

[Paj98] Renato Pajarola. Large scale terrain visualization using the restricted quadtree triangulation. In Proceedings of the conference on Visualization '98, VIS '98, pages 19-26, Los Alamitos, CA, USA, 1998. IEEE Computer Society Press.

[rea] RealFlow. http://www.realflow.com/.

[Sch07] R. Schuster. Algorithms and data structures of fluids in computer graphics. Retrieved from http://www.cg.tuwien.ac.at/courses/Seminar/WS2007/arbeit_schuster.pdf February 22nd 2012, 2007.

[Tes04a] Jerry Tessendorf. Interactive water surfaces. In Game Programming Gems 4. Charles River Media, 2004.

[Tes04b] Jerry Tessendorf. Simulating ocean surface. SIGGRAPH 2004 Course Notes, 2004. Retrieved from http://tessendorf.org/reports.html September 10th 2010.

[Ulr00] Thatcher Ulrich. Continuous LOD terrain meshing using adaptive quadtrees. http://www.gamasutra.com/view/feature/3434/continuous_lod_terrain_meshing_.php, February 2000. Retrieved February 22nd 2012.

[Vla10] Alex Vlachos. Water flow in Portal 2. Advances in Real-Time Rendering in 3D Graphics and Games, SIGGRAPH 2010 Course Slides, 2010. Retrieved from http://advances.realtimerendering.com/s2010/index.html February 22nd 2012.

[YHK07] Cem Yuksel, Donald H. House, and John Keyser. Wave particles. ACM Transactions on Graphics (Proceedings of SIGGRAPH 2007), 26(3):99, 2007.

[ZM09] Yuxin Zhao and Yan Ma. A modified LOD terrain model based on quadtree algorithm. In Proceedings of the 2009 International Joint Conference on Computational Sciences and Optimization - Volume 02, CSO '09, pages 259-263, Washington, DC, USA, 2009. IEEE Computer Society.
