
CONCURRENCY AND COMPUTATION: PRACTICE AND EXPERIENCE

Concurrency Computat.: Pract. Exper. 2012; 24:880–894


Published online 7 July 2011 in Wiley Online Library (wileyonlinelibrary.com). DOI: 10.1002/cpe.1778

SPECIAL ISSUE PAPER

Rapid computation of value and risk for derivatives portfolios


Stephen Weston 1,2,*, James Spooner 3, Sébastien Racanière 3 and Oskar Mencer 3

1 Credit Quantitative Research, J.P. Morgan, London EC2Y 5AJ, U.K.
2 The University of Warwick, Coventry CV4 7AL, U.K.
3 Maxeler Technologies, Imperial College, London W6 9JH, U.K.

*Correspondence to: Stephen Weston, Credit Quantitative Research, J.P. Morgan, London EC2Y 5AJ, U.K.
E-mail: stephen.p.weston@jpmorgan.com

SUMMARY
We report new results from an on-going project to accelerate derivatives computations. Our earlier work was
focused on accelerating the valuation of credit derivatives. In this paper, we extend our work in two ways: by
applying the same techniques, first, to accelerate the computation of portfolio level risk for credit derivatives
and, second, to different asset classes using a different type of mathematical model, which together present
challenges that are quite different to those dealt with in our earlier work. Specifically, we report acceleration
of more than 270 times relative to a single Intel core for a multi-asset Monte Carlo model. We also explore
the implications for risk. Copyright © 2011 John Wiley & Sons, Ltd.
Received 27 March 2011; Accepted 2 April 2011
KEY WORDS: FPGA; J.P. Morgan; Maxeler; acceleration; credit derivatives; Monte Carlo

1. INTRODUCTION
One of the key enablers of the growth and innovation in the global derivatives markets has been
the development and intensive use of complex mathematical models. As the use of derivative instruments based on such models has expanded across asset classes, the process of valuing and managing
the risk of such complex portfolios has grown to a point where thousands of CPU cores are used
for the daily calculation of value and risk. Unfortunately, CPU cores consume vast amounts of
electricity both for powering the CPUs themselves as well as for cooling.
In 2005, the world's estimated 27 million servers consumed around 0.5% of all electricity produced on the planet, a figure that is closer to 1% when the energy for associated cooling and auxiliary
equipment (e.g., backup power, power conditioning, power distribution, air handling, lighting, and
chillers) is included [1]. Although it is true that the purchase costs of hardware are falling, such
savings are being increasingly offset by rapidly rising power-related indirect costs [2]. These costs
have led many large financial institutions to search for ways to continue to add greater computational
power, with reduced capital and operating costs.
In this paper, we follow on from our earlier work [3] and report new results from the collaborative project between J.P. Morgan in London and the acceleration solutions provider Maxeler
Technologies, based on work that took place during the period from late 2009 through to the end
of 2010. All of the computations reported in this paper were carried out on the MaxRack hybrid
(field-programmable gate array (FPGA) and Intel CPU) cluster solution delivered by Maxeler. The
contributions of this new paper therefore are as follows:



• Extension to risk computations of a > 30× performance acceleration by building a customized high-efficiency high-performance computing system.
• Extension of the acceleration approach from valuation to risk measurement for credit derivatives.
• Extension of the acceleration approach to a multi-variate Monte Carlo derivative pricing model.
• Enabling the computation of an order of magnitude more scenarios, with direct impact on the credit derivatives business at J.P. Morgan.

2. RELATED WORK
There is a substantial and continually expanding body of work in the area of acceleration for applications in finance [4]. The underlying theme of existing work has been to adapt technology to
accelerate the performance of computationally demanding valuation models. This report of our work
continues the theme but is distinguished from our earlier paper in three ways. The first distinguishing
feature is the computation of risk for credit derivatives. The second is the application to asset classes
beyond credit. The third is the extension to a multi-asset modeling framework to calculate the value
and the risk of derivative contracts. The model on which the computations are based is a multi-asset
Monte Carlo simulation framework.
The combination of the new method and increased scope presents a range of distinct computational, algorithmic and performance challenges [5]. A further distinguishing feature of our acceleration results is that they have always been driven by the needs of a real-world trading portfolio that
includes live credit derivatives and interest rate trades from a major investment bank, rather than
relying on theoretical portfolios and artificial data sets.
3. DERIVATIVES
3.1. Background
According to the June 2010 survey by the Bank for International Settlements [6], positions in over-the-counter derivatives stood at $583 trillion at the end of June 2010, a level 15% higher than
3 years previously. However, the slower overall growth in outstanding notional amounts conceals
significant variations across risk categories. The highest growth was recorded in the interest rate
segment of the derivatives markets at 25%, bringing the share of this risk category in the market
total to 82%. In comparison, the credit markets declined by some 30–40% over the same period.
Products with a single underlying credit (such as a bond) account for approximately 57% of the
notional outstanding, with the remainder accounted for by products with multiple underlying credits.
In our previous paper, we reported results based exclusively on our work on a single credit derivatives model. In this paper, we continue that theme but expand the scope to discuss how acceleration
impacts upon risk calculations. In addition, given the clear importance of other classes of derivatives
(particularly those dependent on interest rates), we also report results of accelerating a cross-asset,
multi-variate Monte Carlo model for valuation of a broader class of underlying assets.
3.2. Credit derivatives concepts
In the finance world, credit contracts allow an entity (e.g., a company or government) to issue bonds
in exchange for an amount of cash (the notional). These products are traded over the counter, rather than being exchange traded like equities. The entity issuing the bond (issuer) pays
the purchaser a regular payment (or coupon) for the use of the money. When the bond matures, the
issuer repays the bond; if the issuer is unable to pay, it is said to have defaulted, and the
purchaser stands to lose some or all of the money.
Credit derivatives are contracts that let an entity transfer the risk of some number of credits (such
as a bond) defaulting, to a third party, in exchange for regular payments. The simplest example
of such a contract is a credit default swap (CDS), where the risk of a single underlying credit is


exchanged for regular payments. CDS contracts resemble an insurance contract on a bond defaulting. The key difference between a bond and a CDS is that the party buying the risk does not pay
anything up front. Building on CDS contracts, CDS indexes (CDSIs) allow the trading of risk using
a portfolio of credits. A collateralized default obligation (CDO) is a specialization of a CDSI in
which the total loss pool is divided into tranches. A tranche allows an investor to buy or sell protection for losses in a certain part of the pool. Figure 1 shows a tranche between 5% and 6% of the
losses in a pool. If an investor sells protection on this tranche, they will not be exposed to any losses
until 5% of the total pool has been wiped out. The lower (less senior) tranches have higher risk,
and those selling protection on these tranches will receive higher levels of compensation than those
investing in upper (more senior) tranches. CDOs can also be defined on non-index portfolios; such
a CDO is dubbed bespoke and presents additional challenges because the behavior of a given bespoke
portfolio is not easy to observe in the market.
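For concreteness, the loss taken by a tranche with attachment point A and detachment point D, expressed as a fraction of the tranche notional, follows the standard tranche loss function (a textbook formula, quoted here for illustration rather than taken from the model described later):

\[
  \mathrm{TrancheLoss}(L) \;=\; \frac{\min\bigl(\max(L - A,\, 0),\; D - A\bigr)}{D - A},
\]

so for the 5–6% tranche of Figure 1 (A = 5%, D = 6%), a pool loss of 5.5% wipes out half of the tranche, while pool losses below 5% leave it untouched.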
The modeling of behavior for a pool of credits becomes complex as it is important to model the
correlation between companies or governments defaulting. Corporate defaults, for example, tend to
be significantly correlated because all firms are exposed to a common or correlated set of economic
risk factors, so that the failure of a single firm tends to weaken the remaining firms [7].
3.2.1. Credit models. The market standard approach to pricing CDO tranches where the underlying
is a bespoke portfolio of reference credits is to use the base correlation methodology and a convolution algorithm to sum conditionally independent loss random variables, with the addition of
a coupon model to price exotic coupon features [8]. Since mid-2007, the standard approach in
the credit derivatives markets has moved away from the original Gaussian copula model [9]. Two
approaches to valuation now dominate the modeling of tranches: the random factor loading model
and the stochastic recovery model [7, 10]. In this paper, we have adopted the latter approach and use
the following algorithm as a basis for accelerating pricing:
1. In the first step, the loss distribution is discretized and computed using numerical convolution,
given the conditional survival probabilities and losses resulting from the copula model.
2. We then use the standard method of discretizing over the two closest bins with a weighting
such that the expected loss is conserved.
3. The final loss distribution is computed using a weighted sum over all of the market factors
evaluated using the copula model.
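As an illustrative sketch of steps 1 and 2 (not the production library code; edge handling is simplified and the standard-normal CDF is written in terms of std::erfc), the per-name update of the conditional loss distribution can be expressed in C++ as follows:

#include <cmath>
#include <cstddef>
#include <vector>

// One-factor Gaussian copula: default probability of a name conditional on the
// market factor M, given flat correlation rho and the pre-computed default
// threshold x = Phi^{-1}(q) for the name.
double conditional_default_prob(double x, double rho, double M) {
    double z = (x - std::sqrt(rho) * M) / std::sqrt(1.0 - rho);
    return 0.5 * std::erfc(-z / std::sqrt(2.0));   // standard normal CDF
}

// Add one name to the conditional loss distribution 'dist' (indexed by loss bin).
// 'dist' should start as {1, 0, 0, ...} before the first name is added.
// The name's loss (in bin units) is split over the two closest bins with weights
// (1 - w) and w so that the expected loss is conserved (step 2 of the algorithm).
void add_name(std::vector<double>& dist, double prob, double loss) {
    int    n = static_cast<int>(loss);   // lower of the two bins
    double w = loss - n;                 // weight pushed to the upper bin
    // A scratch buffer avoids overwriting bins that are still needed; the FPGA
    // streaming version achieves the same effect with stream offsets.
    std::vector<double> next(dist.size(), 0.0);
    for (std::size_t k = 0; k < dist.size(); ++k) {
        double out = dist[k] * (1.0 - prob);                 // name survives
        if (k >= static_cast<std::size_t>(n))
            out += dist[k - n] * prob * (1.0 - w);           // loss lands in bin n
        if (k >= static_cast<std::size_t>(n) + 1)
            out += dist[k - n - 1] * prob * w;               // loss lands in bin n + 1
        next[k] = out;
    }
    dist.swap(next);
}

Looping this update over all names for a given market factor, and then forming the weighted sum over market factors, reproduces steps 1 and 3 above.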

Figure 1. Tranched credit exposure.




Figure 2 shows pseudocode for the algorithm. For brevity, the edge cases of the convolution and
the detail of the copula and recovery model have been removed.
3.2.2. Computing risk for credit products. Computing risk sensitivities for CDO tranches presents a
further computational challenge beyond those already presented by simple valuation. The problem
arises from the fact that calculating the sensitivity to changes in the term structure of default for
each credit in the CDO basket involves re-computing after perturbing the credit spread at each point
on the default curve for every credit in the CDO basket. For example, if each credit curve is constructed using eight maturity points and there are, say, 100 credits in the basket, then computing
spread sensitivity for that basket involves using a brute force approach of valuing a tranche a total
of 801 times ((8 × 100) + 1 base valuation), a huge computational requirement for a portfolio that
contains many thousands of tranches.
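A minimal sketch of this brute-force bump-and-revalue approach (with invented interfaces; the production pricer and curve objects are far richer) makes the 801-valuation count explicit:

#include <functional>
#include <vector>

struct CreditCurve { std::vector<double> spreads; };   // e.g. eight maturity points (hypothetical type)

// Brute-force spread sensitivities: bump every point on every curve and revalue.
// 'price_tranche' is whatever tranche pricer is in use (passed in, so the sketch stays generic).
// For 100 credits with eight curve points each this is 800 revaluations plus 1 base valuation.
std::vector<double> spread_sensitivities(
        std::vector<CreditCurve> basket,
        const std::function<double(const std::vector<CreditCurve>&)>& price_tranche,
        double bump = 1e-4) {
    const double base = price_tranche(basket);
    std::vector<double> deltas;
    for (CreditCurve& curve : basket) {
        for (double& s : curve.spreads) {
            s += bump;                                           // perturb one curve point
            deltas.push_back((price_tranche(basket) - base) / bump);
            s -= bump;                                           // restore the curve
        }
    }
    return deltas;
}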
In some cases, it is possible to employ a number of measures that help to reduce the need for
repetitively calculating the sensitivities to the same names, as well as using partial derivatives to
reduce the number and complexity of the computations. However, these measures do not provide
total relief from the computational burden, with the result that it is frequently preferable to use the
brute force approach for the sake of consistency and reliability.
3.2.3. Accelerated collateralized default obligation pricing. The current computational challenge
faced by the credit hybrids business within J.P. Morgan is that each day it needs to calculate risk and
fair value for hundreds of thousands of credit derivative instruments. A large proportion of these are
standard single name CDSs, which require very little time to calculate fair value and risk. However,
a substantial minority of the instruments are tranched credit derivatives that require the use of complex models such as the model discussed in Section 3.2.1. The computational cost of these daily
runs is such that without the application of the acceleration techniques reported in this paper, they
could only be meaningfully carried out overnight. To complete the task, approximately 800 standard
Intel Cores are used. Even with such resources available, the calculation time is around 3 1/2 h, and
the total end-to-end run time is close to 7 h when batch preparation and results write-back are taken
into consideration.
It is important to be precise about the gains made from the acceleration process. Consequently,
being clear about the performance and efficiency of the original C++ code is vital if the acceleration
achieved is to be isolated and measured accurately. Several points can be made about the original
source C++ code. First, the entire C++ library from which the bespoke tranche pricing code was
extracted has been written from scratch within the credit quantitative research team at J.P. Morgan by
highly experienced programmers since the beginning of 2008. Although no library is ever fully optimal, considerable attention has been given to performance optimization. Optimization techniques

for i in 0 ... markets-1
  for j in 0 ... names-1
    prob = cum_norm((inv_norm(Q[j]) - sqrt(p)*M) / sqrt(1-p));
    loss = calc_loss(prob, Q2[j], RR[j], RM[j]) * notional[j];
    n = integer(loss);
    L = fractional(loss);
    for k in 0 ... bins-1
      if j == 0
        dist[k] = k == 0 ? 1.0 : 0.0;
      dist[k] = dist[k]*(1-prob) +
                dist[k-n]*prob*(1-L) +
                dist[k-n-1]*prob*L;
      if j == credits-1
        final_dist[k] += weight[i] * dist[k];
    end # for k
  end # for j
end # for i

Figure 2. Pseudocode for the bespoke collateralized default obligation tranche pricing algorithm.


such as dead code elimination, expression simplification, function in-lining, loop collapsing, and
fusion have all been employed to improve the efficiency of the source C++ code. Pointer optimization, try/catch block optimization, as well as virtual function optimization have also all been used
wherever possible and practical to improve the execution speed of the original source code. Therefore, as far as was practically possible, the acceleration results reported in this paper are attributable
purely to the development tools, technology, and the process.
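As a generic illustration of one of the techniques listed above (our own example, not code taken from the J.P. Morgan library), loop fusion merges two passes over the same data into one, reducing loop overhead and improving cache behaviour while producing identical results:

#include <cstddef>
#include <vector>

// Before fusion: two separate passes over the loss bins.
void scale_then_accumulate_v1(std::vector<double>& dist, std::vector<double>& total,
                              double weight) {
    for (std::size_t k = 0; k < dist.size(); ++k) dist[k] *= weight;   // pass 1: scale
    for (std::size_t k = 0; k < dist.size(); ++k) total[k] += dist[k]; // pass 2: accumulate
}

// After fusion: one pass, same result, better locality.
void scale_then_accumulate_v2(std::vector<double>& dist, std::vector<double>& total,
                              double weight) {
    for (std::size_t k = 0; k < dist.size(); ++k) {
        dist[k] *= weight;
        total[k] += dist[k];
    }
}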
With this information, the acceleration stage of the project focused on designing a solution capable of dealing with only the complex bespoke tranche products, with a specific goal of exploring the
High Performance Computing architecture design space to maximize the acceleration.
3.3. A wider range of underlying assets
In this section, we outline the general form of a Monte Carlo simulation model designed to provide
value and risk calculations in a multi-asset setting. The model provides an integrated framework for
pricing cross-asset derivative instruments and is scalable with respect to the number of assets. The
model owes its intellectual origins to the work of Sankarasubramanian and Ritchken [11] and is
used across a range of valuation and risk situations within J.P. Morgan.
To price a derivative that involves several assets, the user defines which assets to diffuse and the
list of key dates. The model then diffuses the assets along a path in the risk-neutral measure utilizing
user-specified correlations. The final step on each path is to calculate the discounted payoff. These
two steps are repeated for the required number of paths and averaged to arrive at the final Monte
Carlo price. The model can be used to value derivatives where the payoff depends on some combination of foreign exchange, credit, equities, and interest rates. Figure 3 provides an overview of
the algorithmic structure of the model and loosely identifies the key blocks that were the target for
acceleration. Example payoffs that can be dealt with by the model include, but are not limited to,
products such as foreign exchange chooser turbos, where the payoff is the maximum or the minimum of two vanilla turbo coupons (also known as power reverse dual currency notes) with different
foreign coupons, and credit knock-out cross-currency swaps, where the payout is a cross-currency
swap with coupons payable contingent upon survival of the specified underlying credit.
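A schematic C++ version of this valuation loop (a sketch only; diffuse_assets and discounted_payoff are placeholders for the proprietary correlated-diffusion and payoff-script machinery) looks as follows:

#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical per-path state: diffused rate curves, FX rates, equity levels, ...
struct PathState { std::vector<double> observables; };

// Risk-neutral Monte Carlo price: average the discounted payoff over all paths.
double monte_carlo_price(
        std::size_t num_paths,
        const std::function<PathState(std::size_t path)>& diffuse_assets,
        const std::function<double(const PathState&)>& discounted_payoff) {
    double sum = 0.0;
    for (std::size_t p = 0; p < num_paths; ++p) {
        PathState state = diffuse_assets(p);    // diffuse all assets to every key date on this path
        sum += discounted_payoff(state);        // evaluate and discount the payoff on this path
    }
    return sum / static_cast<double>(num_paths);
}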

Figure 3. Algorithmic structure of the multi-asset Monte Carlo model. The loop over time steps generates pseudorandom numbers and transforms them into uncorrelated and then correlated normals; looping over paths, interest rate, foreign exchange, and equity diffusion parameters are calculated, the assets diffused, and forwards computed; calibration is enhanced via moment matching; finally, the indices referenced in the payoff calculation are computed and the payoff evaluated.




Such products are notoriously difficult to value because in the case of (for example) simulating
interest rates using a Monte Carlo model, there is a requirement to diffuse an entire interest rate
curve that can involve quarterly points out as far as 50 years, not just a spot quantity such as an
equity price. This presents a substantial computational challenge: as the payoffs for such derivatives
increase in complexity, and as the number of assets/underlyings grows, Monte Carlo simulation
becomes an increasingly necessary tool. However, Monte Carlo methods are highly computationally
intensive, with the result that the number of simulations required to achieve stable and accurate
results rapidly escalates beyond what is feasible with current technology when multiple stochastic
factors are involved.
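The familiar convergence argument from standard Monte Carlo theory (not a model-specific result) makes the difficulty concrete: for N independent paths with discounted payoffs P_i of standard deviation sigma, the estimator and its standard error are

\[
  \widehat{V}_N = \frac{1}{N}\sum_{i=1}^{N} P_i,
  \qquad
  \operatorname{se}\bigl(\widehat{V}_N\bigr) = \frac{\sigma}{\sqrt{N}},
\]

so halving the pricing error requires roughly four times as many paths, while each path itself becomes more expensive as the number of diffused factors grows.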
4. MAXELER ACCELERATION PROCESS
The key to maximizing the speed of the final design is a systematic and disciplined end-to-end
acceleration methodology. Maxeler follows the four stages in the acceleration process shown in
Figure 4 from the initial C++ design to the final implementation, which ensure we arrive at the
optimal solution.
4.1. Analysis stage – collateralized default obligation pricing
In the analysis stage, we conducted a detailed examination of the algorithms contained in the original C++ model code. Through extensive code and data profiling with Maxeler Parton (a profiling
tool suite), we were able to clearly understand and map the relationships between the computation and the input and output data. This analysis covered acquiring a full understanding of how
the core algorithm performs in practice, which allowed us to identify the major computational,
data movement, and storage costs. Dynamic analysis using call graphs of the running software,
combined with detailed analysis of data values and runtime performance, was a necessary step in
identifying bottlenecks in execution as well as memory utilization patterns.
Profiling led to focusing on accelerating two main areas of the computation:

Figure 4. Iterative Maxeler process for accelerating software.




• Calculation of the conditional survival probabilities (copula evaluation).
• Calculation of the probability distribution (convolution).

4.2. Analysis stage – Monte Carlo


A further computational challenge was presented by the more than 100 unique payoff profiles within
the overall 400+ trade population that the Monte Carlo model was used to value. Analysis showed
that the payoff scripts accounted for 20% of CPU runtime, and so it was critical to include them
within the acceleration to maximize code coverage and hence overall acceleration. Moreover, payoff computations required index data that were computed and stored on the FPGA. Streaming this
data over Peripheral Component Interconnect Express (PCIe) would have been too slow. Detailed
profiling of the payoffs revealed that they could be split into two broad categories, namely, simple
and complex. In the first category, which accounted for around 80% of the population, were simple
payoffs such as notional multiplied by the spot foreign exchange rate. In the second category were
complex payoffs that included features such as capped and floored payoffs (mainly the maximum
or the minimum of some underlying over one or more periods), as well as look-back features that
require data from an entire path not just at a series of discrete points.
Profiling of the payoffs revealed that using only the 17 simple operators shown in Table I, the
functionality of approximately 99% of the trade population could be captured using only a single
accelerated FPGA kernel. Mindful of the need to guard against the use of FPGAs within the financial
arena becoming an overly academic exercise, on-going profiling showed that it was worth trading
off a small (typically 1–2%) loss of overall system performance resulting from leaving a very small
number of payoffs in software, in favor of significantly reducing the time to market of the completed
FPGA solution.
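As a hedged illustration of how such a small operator set can cover most payoffs (the operator names follow Table I, but the expression-tree evaluator itself is invented for this sketch and is not the FPGA kernel), a simple payoff reduces to a tiny tree of these operators:

#include <algorithm>
#include <cmath>
#include <memory>
#include <vector>

// Subset of the Table I operator set, evaluated over a small expression tree.
enum class Op { CONST, VAR, ADD, SUB, MUL, DIV, MIN, MAX, NEG, ABS };

struct Node {
    Op op;
    double value = 0.0;                 // used by CONST
    int var_index = -1;                 // used by VAR: index into per-path observables
    std::shared_ptr<Node> lhs, rhs;
};

double eval(const Node& n, const std::vector<double>& obs) {
    switch (n.op) {
        case Op::CONST: return n.value;
        case Op::VAR:   return obs[n.var_index];
        case Op::ADD:   return eval(*n.lhs, obs) + eval(*n.rhs, obs);
        case Op::SUB:   return eval(*n.lhs, obs) - eval(*n.rhs, obs);
        case Op::MUL:   return eval(*n.lhs, obs) * eval(*n.rhs, obs);
        case Op::DIV:   return eval(*n.lhs, obs) / eval(*n.rhs, obs);
        case Op::MIN:   return std::min(eval(*n.lhs, obs), eval(*n.rhs, obs));
        case Op::MAX:   return std::max(eval(*n.lhs, obs), eval(*n.rhs, obs));
        case Op::NEG:   return -eval(*n.lhs, obs);
        case Op::ABS:   return std::fabs(eval(*n.lhs, obs));
    }
    return 0.0;
}

"Notional multiplied by the spot foreign exchange rate" is then just MUL(CONST(notional), VAR(fx_spot_index)), and a capped payoff wraps the same tree in a MIN node; look-back features, by contrast, need path history and fall into the complex category described above.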
4.3. Transformation stage
The transformation stage examined the heart of how loops and control are structured within the
existing C++ code. This stage also identified key data layout transformations necessary to accelerate the performance of the core algorithm in the model. Data storage abstraction using object
orientation needed to be re-evaluated, as data needed to be passed efficiently to the accelerator with
low CPU overhead.
The results of code profiling and analysis for CDO pricing were based on the 898 lines of
the original C++ model code. See Section 5.6 for an analysis of the lines of code required for
implementation.
4.4. Partitioning stage
The aim for the partitioning stage of our acceleration process is to create a contiguous block of operations that is both tractable to accelerate and able to achieve the maximum possible runtime coverage,
when balanced against overall CPU performance and data input and output considerations.
The profiling identified in Section 4.1 gave us the necessary insight to make partitioning decisions.
4.5. Implementation stage – collateralized default obligation pricing
A general bespoke tranche pricer using standard probability domain convolution was implemented.
The implementation featured multiple kernels (independent units within the design) and allowed
for arbitrary replication of compute pipelines within these kernels. This allowed the copula and
convoluter to be placed in separate kernels, such that they could run independently of each other
within the chip and could be sized according to the bounds of the loops described in Figure 2.
Table I. Simple operators used to model payoffs.

ADD     AND     ABS
CMPL    COND    DIV
EQ      GT      GTE
LT      LTE     MAX
MIN     MUL     NEG
OR      SUB



// Shared library and call overhead 1.2%
for d in 0 ... dates-1
  // Curve Interpolation 19.7%
  for j in 0 ... names-1
    Q[j] = Interpolate(d, curve)
  end # for j
  for i in 0 ... markets-1
    for j in 0 ... names-1
      // Copula Section 22.0%
      prob = cum_norm((inv_norm(Q[j]) - sqrt(p)*M) / sqrt(1-p));
      loss = calc_loss(prob, Q2[j], RR[j], RM[j]) * notional[j];
      n = integer(loss);
      L = fractional(loss);
      for k in 0 ... bins-1
        if j == 0
          dist[k] = k == 0 ? 1.0 : 0.0;
        // Convolution 51.4%
        dist[k] = dist[k]*(1-prob) +
                  dist[k-n]*prob*(1-L) +
                  dist[k-n-1]*prob*L;
        // Market factor integration 5.1%
        if j == credits-1
          final_dist[k] += weight[i] * dist[k];
      end # for k
    end # for j
  end # for i
end # for d
// Code outside main loop 0.5%

Figure 5. Profiled version of original pricing algorithm in pseudocode form.


4.6. Implementation stage – Monte Carlo pricing


The algorithm in Figure 3 was implemented using three kernels running asynchronously. The clock
frequency of each kernel was tailored to maximize the usage of each kernel while minimizing the
constraint placed on the chip design. The generated pseudorandom numbers were stored in the
high-capacity on-card memory and made available for all risk runs.

5. IMPLEMENTATION DETAILS
The FPGA design comprised arithmetic data paths for the computations (the kernels) and modules orchestrating the data input/output for these kernels (the manager). Separating computation
and communication into kernels and manager allows for highly parallel pipelined kernels. This
parallelism is the key to achieving the performance reported in Section 6. Speedups were further
increased by replicating computation across several independent pipelines within kernels and by using multiple kernels with different levels of parallelism. The number of pipelines and kernels that could be
mapped to the accelerator is limited by the size of the FPGAs used in the MaxNode and available
parallelization in the application.


5.1. Acceleration hardware – collateralized default obligation pricing


For the CDO result measurements in this paper, we use the J.P. Morgan MaxRack configured with
MaxNode-1821 compute nodes.
Figure 6 sketches the system architecture of a MaxNode-1821. Each node has eight Intel Xeon
cores and two Xilinx FPGAs on a MAX2 card connected to the CPU via PCIe. A MaxRing high-speed interconnect is also available, providing a dedicated high-bandwidth communication channel
directly between the FPGAs.
5.2. Acceleration hardware – Monte Carlo pricing
For the Monte Carlo result measurements in this paper, we use the J.P. Morgan MaxRack configured
with MaxNode-1834 compute nodes with Maxeler MAX3 accelerator cards.
Each node has eight Intel Xeon cores and four Xilinx Virtex-6 FPGAs connected to the CPU via
PCIe. A MaxRing high-speed interconnect is also available, providing a dedicated high-bandwidth
communication channel directly between the FPGAs.
5.3. Implementation tools
In the past, a key issue with FPGA based acceleration has been the complexity of the programming
task. Maxeler provides a programming environment called MaxCompiler, which raises the level of
abstraction of FPGA design to enable rapid development and modification of streaming applications,
even when faced with frequent design updates and fixes.
MaxCompiler allows the FPGA design with managers and kernels discussed in Section 5 to be
implemented efficiently in Java, without resorting to lower-level languages such as VHDL.
5.4. Implementation for collateralized default obligation pricing
One of the main focus points during this stage was finding a design that balanced arithmetic optimizations, desired accuracy, power consumption, and reliability. Two design points were selected
for implementation: a full precision design for fair value calculations and a reduced precision design
for scenario analysis. The full precision variant has been designed for an average relative accuracy of 10^-8
and the reduced precision variant for an average relative accuracy of 10^-4. These design points share identical implementations, with only the compile-time parameters of precision, parallelism, and clock
frequency varying.
The copula and convolution kernels were built to implement one or more pipelines that effectively
parallelize loops within the computation. Because the convolution uses each value generated by the
copula model many times, the kernel and manager components were scalable to enable the exact
ratio of copula to convolution resources to be tailored to the workload for the design. Figure 7 shows
the structure of the convolution kernel in the design.
Figure 6. MaxNode-1821 architecture diagram: eight Intel Xeon cores connected via PCIe (Peripheral Component Interconnect Express) to two Xilinx field-programmable gate arrays (FPGAs), which are linked directly to each other by the MaxRing interconnect.


Figure 7. Convoluter architecture diagram.

When implemented in MaxCompiler, the code for the convolution in the inner loop of Figure 2
resembles the original structure. The first difference is that there is an implied loop, as data streams
through the design, rather than the design iterating over the data. Another key difference is that the
code now operates in terms of a data stream (rather than on an array), such that the code is now
describing a streaming graph, with offsets forward and back in the stream as necessary to perform
the convolution. Figure 8 shows the core of the code for the convoluter kernel.
5.5. Implementation for Monte Carlo pricing
There is a growing body of published work on the implementation of a Monte Carlo model for pricing derivatives (see [12] as a recent example). In contrast, the work reported in this paper is novel in
three respects:
• Multi-asset classes.
• Flexible design to accommodate idiosyncratic payoff functions.
• Incorporation of accurate and stable finite difference risk calculations.

A single bitstream was developed for the Monte Carlo model. When running the bitstream with
a single process, there is no scope for overlapping, with the consequence that the FPGA is idle for over
50% of the overall execution time, as can be seen in Figure 9.
Random numbers are loaded in the main application and can then be read by more than one process.
As a result of adopting this approach, it was found to be possible to overlap the software and
run three concurrent processes so that the FPGA was kept permanently busy. This was achieved by
loading the random numbers in a separate process that is run in parallel with the software setup. The
advantages of this enhancement can be seen clearly in Figure 10. Thorough and consistent application of this approach to calculating the risk measures enabled optimization of the wall-clock time by
using pre-computation of the random numbers and by re-using many data structures that in software
are created and destroyed inside the risk loop.
HWVar d = io.input("inputDist", _distType);
HWVar p = io.input("probNonzeroLoss", _probType);
HWVar L = io.input("lowerProportion", _propType);
HWVar n = io.input("discretisedLoss", _operType);
HWVar lower = stream.offset(-n-1, -maxBins, 0, d);
HWVar upper = stream.offset(-n, -maxBins, 0, d);
HWVar o = ((1-p)*d + L*p*lower + (1-L)*p*upper);

io.output("outputDist", _distType, o);

Figure 8. MaxCompiler code for adding a name to a basket.




Figure 9. Monte Carlo run time (in seconds) with a single process and one field-programmable gate array (FPGA).

Figure 10. Monte Carlo run time (in seconds, zoomed in) with three processes and one field-programmable gate array (FPGA): random numbers are loaded while the input file is parsed, and one process is always using the FPGA. RNG, random number generator.

5.6. Implementation effort
In general, software programming in a high-level language such as C++ is much easier than
interacting directly with FPGA hardware using a low-level hardware description language.
Therefore, in addition to performance, development and support time are increasingly recognized
as significant components of overall effort when finding a solution from the software engineering
perspective. For the purposes of this paper, we measure programming effort as one aspect in the
examination of programming models. Because it is difficult to get accurate development-time statistics for coding applications and also to measure the quality of code, we use lines of code (LOC) as
our metric to estimate programming effort.
Applying the LOC approach, we find that the original C++ code for the fair value and risk calculations was 898 lines, of which 81 lines of code were shared between the fair value and risk


calculations. Because of the low-level nature of coding for the FPGA architecture when compared
with standard C++, the original 898 lines of code generated 3696 lines of code for the FPGA (or
a growth factor of over three) to replicate the fair value and risk computations (of which 285 lines
were shared).
On the multi-asset Monte Carlo model, the core algorithms reside in 26,560 lines of C++ and C
code, to which 5188 lines of acceleration interface and model code were added for integration and testing. The implementation of the accelerated application itself required 4537 lines of MaxCompiler
code.
5.7. Implementation summary
Starting with the program that runs on a CPU and iterates in time, we transformed the program
into the spatial domain running on the MaxNode, creating a structure on the FPGA that matched the
dataflow structure of the program (at least the computationally intensive parts). Thus, we optimized
the computer based on the program, rather than optimizing the program based on the computer.
Obtaining the results of the computation then became a simple matter of streaming the data through
the Maxeler MaxRack system.
In particular, the fact that we can use fixed-point representation for many of the variables in our
application is a big advantage, because FPGAs offer the opportunity to optimize programs on the
bit level, allowing us to pick the optimal representation for the internal variables of an algorithm,
choosing precision and range for different encodings such as floating and fixed point.
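As a simple illustration of the kind of bit-level choice this enables (our own C++ sketch of the idea, not MaxCompiler code), a probability that always lies in [0, 1) can be carried as an unsigned integer with an implied binary point:

#include <cstdint>

// A probability in [0, 1) held as a 24-bit fraction: roughly 7 decimal digits of
// precision while using far fewer FPGA resources than a double-precision float.
constexpr int           kFracBits = 24;
constexpr std::uint32_t kOne      = 1u << kFracBits;

std::uint32_t to_fixed(double p)         { return static_cast<std::uint32_t>(p * kOne + 0.5); }
double        to_double(std::uint32_t f) { return static_cast<double>(f) / kOne; }

// Fixed-point multiply: widen to 64 bits, multiply, shift the binary point back.
std::uint32_t mul_fixed(std::uint32_t a, std::uint32_t b) {
    return static_cast<std::uint32_t>((static_cast<std::uint64_t>(a) * b) >> kFracBits);
}

The full and reduced precision design points described in Section 5.4 trade precision against resources in a similar compile-time fashion.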
6. RESULTS
6.1. Credit derivatives
As a benchmark for demonstration and performance evaluation, a fixed population of 29,250 CDO
tranches was used. This population comprised real bespoke tranche trades of varying maturity,
coupon, portfolio composition, and attachment/detachment points.
We performed a standard fair value calculation for bespoke tranches. The results of this
calculation are reported for the full population of 29,250 bespoke tranches.
Table II shows the speedup achieved by a MaxNode-1821 over an eight-core (Intel Xeon E5430
2.66 GHz) server, both nodes using multiple threads to price up to eight tranches in parallel.
Figure 11 shows the CPU profile of the code running with FPGA acceleration. Using the numbers from Table III, we can also see that the power usage per node is decreased by 3%, even with a
31× increase in computational performance. It follows that the speedup per Watt is actually greater
than the speedup per cubic foot.
Reporting speedup statistics for risk calculations reveals two interesting points. First, one of the
main challenges to accelerating repetitive risk calculations involved keeping the FPGAs busy for
the maximum period of time. This turned out to be a non-trivial task that required significant effort
and was accomplished in our project by developing a bespoke task distribution system that enables
data to be streamed on and off the FPGA cluster in such a way as to minimize time spent waiting
for input and output. The result of the performance increase was to enable the CDO trading desk
to evaluate thousands of plausible risk scenarios during the trading day, both in response to and
in anticipation of developments in the financial markets. A typical example of this has been the
development of the capability to evaluate the potential risk impact of one or more credits defaulting
and/or spreads widening in advance of the event occurring such that trading decisions can be made
Table II. MaxNode-1821 versus eight-core Xeon server speedup.

Precision             Speedup
Full precision          31×
Reduced precision       37×

// Shared library and call overhead ~5%
for d in 0 ... dates-1
  // Curve Interpolation 54.5%
  for j in 0 ... names-1
    Q[j] = Interpolate(d, curve)
  end # for j
  for i in 0 ... markets-1
    for j in 0 ... names-1
      // Copula Section 9%
      prob = cum_norm((inv_norm(Q[j]) - sqrt(p)*M) / sqrt(1-p));
      loss = calc_loss(prob, Q2[j], RR[j], RM[j]) * notional[j];
      // (FPGA data preparation and post-processing) 11.2%
      n = integer(loss);
      L = fractional(loss);
      for k in 0 ... bins-1
        if j == 0
          dist[k] = k == 0 ? 1.0 : 0.0;
        dist[k] = dist[k]*(1-prob) +
                  dist[k-n]*prob*(1-L) +
                  dist[k-n-1]*prob*L;
        if j == credits-1
          final_dist[k] += weight[i] * dist[k];
      end # for k
    end # for j
  end # for i
end # for d
// Other overhead (object construction, etc.) 19.9%

Figure 11. Profiled version of the field-programmable gate array (FPGA) version of the pricing algorithm in pseudocode form.
Table III. Power usage for 1U compute nodes when idle and while processing.

Platform                                                         Idle (W)   Processing (W)
Dual Xeon L5430 2.66 GHz Quad Core, 48 GB DDR DRAM                 168           246
(as above) with MAX2-4412C Dual Xilinx SX240T, 24 GB DDR DRAM      191           238

DDR DRAM, double-data-rate synchronous dynamic random access memory.

in close to real time, which was previously impossible. Such scenario evaluation capability provides
real commercial value in highly volatile market conditions.
6.2. A wider range of underlying assets
Table IV shows the same aggressive pipelining of results in the Monte Carlo model as we employed
with the CDO model, accelerating the computation time by 278.76×. We calculate this at the portfolio level by recording the calculation time of the model running in software on one core and dividing
by the calculation time on four FPGAs contained in one node running 13 risk measures from the
portfolio. This includes three kernels that were used to implement the following:
• Random number correlation, index generation, and diffusion (currently foreign exchange, interest rates, and equity).
• Internal Calibration Engine search for second-order moment matching.
• Payoff evaluation.


Table IV. MaxNode-1834 versus single Xeon Core for Monte Carlo.

Metric          Single Xeon Core (s)   MAX node (s)   Speedup
Kernel only            13,213              47.4         279×
Wall clock             13,240              65.8         201×

The software spends an extra 37 s in the outer loops, and we preliminarily accelerated this by
partitioning the risk run into eight parallel jobs. When running on four FPGAs, the wall-clock time
reduces to 65.8 s, yielding a wall-clock or end-to-end time speedup of 201× when compared with a
single core.
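For clarity, the speedups quoted in Table IV follow directly from the measured times:

\[
  \frac{13{,}213\ \text{s}}{47.4\ \text{s}} \approx 279 \quad \text{(kernel only)},
  \qquad
  \frac{13{,}240\ \text{s}}{65.8\ \text{s}} \approx 201 \quad \text{(wall clock)}.
\]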
7. CONCLUSIONS
As already noted, the results reported in our previous work and extended in this paper were achieved
using a hybrid CPU-FPGA accelerated machine. A question of interest is how these results compare with what could be achieved using a graphics processing unit (GPU)-accelerated approach. It
is worth noting that at the beginning of the project, the same C++ code for bespoke tranche pricing
was migrated to work on the C1060 Tesla card from NVidia (Santa Clara, CA, USA). After several
weeks of work, the best acceleration that was achieved was limited to approximately 12×. This was
found to be broadly in line with expectations based on Che et al. [13], who provided a useful table
that characterizes the types of problems that tend to perform well on GPUs and FPGAs.
The credit hybrids trading group within J.P. Morgan is reaping substantial benefits from applying
acceleration technology. The order-of-magnitude increase in available compute has led to three key
benefits:
• Current computations run in much less time.
• Additional applications and algorithms, or those that were once impossible to resource, are now possible.
• Operational costs resulting from given computations are dramatically reduced.

The design features of the Maxeler hybrid computer mean that its low consumption of electricity,
physically small footprint, and low heat output make it an extremely attractive alternative to using
the traditional cluster of standard cores.
One of the key insights of the project has been the substantial benefits to be gained from changing
the computer to fit the algorithm, not changing the algorithm to fit the computer (which would be the
usual approach with standard CPU cores). We have found that executing complex calculations in customizable hardware with Maxeler infrastructure is much faster than executing them in
software. However, the major benefit comes not from the welcome cost savings arising from running
computation more rapidly, but from the potential revenue benefits of the improved risk management
that is possible with accelerated, and hence more frequent, computation and analysis of value and
risk in complex derivative portfolios.
8. FUTURE WORK
The project is now looking to deliver the next generation of multi-node Maxeler hybrid computer, designed to provide portfolio-level risk analysis across a wider range of derivatives models for
more trading desks in near real time.
The promise conveyed in the results reported in this paper has subsequently led to the project
being expanded. In credit derivatives, a second, more general, CDO model based on the random
factor loading approach of Andersen and Sidenius [10] is currently undergoing migration to run on the
architecture reported in this paper.
In addition, the Monte Carlo approach is now also being applied to trades whose payoffs are
principally dependent on equity underlyings. Further and more wide-reaching applications of the


acceleration approach are currently in the evaluation stage but are expected to lead to similar gains
across a wider range of computational challenges.
REFERENCES
1. Koomey JG. Worldwide electricity used in data centers. Environmental Research Letters 2008; 3(3):034008.
2. Koomey JG, Belady C, Patterson M, Santos A, Lange KD. Assessing trends over time in performance, costs and energy use for servers. Technical Report, Microsoft, Intel and Hewlett-Packard Corporation, 17 August 2009. Available from: http://download.intel.com/pressroom/pdf/computertrendsrelease.pdf [accessed on September 2, 2010].
3. Weston S, Marin JT, Spooner J, Pell O, Mencer O. Accelerating the computation of portfolios of tranched credit derivatives. IEEE Workshop on High Performance Computational Finance (WHPCF), New Orleans, LA, 14 November 2010; 1–8.
4. Kaganov A, Chow P, Lakhany A. FPGA acceleration of Monte-Carlo based credit derivative pricing. International Conference on Field Programmable Logic and Applications, Heidelberg, 8–10 September 2008; 329–334.
5. Andersen LBG, Sidenius J, Basu S. All your hedges in one basket. Risk Magazine, November 2003; 67–72.
6. BIS. Semiannual OTC derivatives statistics at end-June 2010. Bank for International Settlements, December 2010. Available from: http://www.bis.org/statistics/derstats.htm [accessed on September 2, 2010].
7. Amraoui S, Hitier S. Optimal stochastic recovery for base correlations. Technical Report, BNP Paribas, 2008.
8. McGinty L, Beinstein E, Ahluwalia R, Watts M. Credit correlation: a guide. Technical Report, J.P. Morgan, March 2004.
9. Li DX. On default correlation: a copula function approach. Journal of Fixed Income 2000; 9(4):43–54.
10. Andersen LBG, Sidenius J. Extensions to the Gaussian copula: random recovery and random factor loadings. Journal of Credit Risk 2004; 1(1):22–70.
11. Ritchken P, Sankarasubramanian L. Volatility structures of forward rates and the dynamics of the term structure. Mathematical Finance 1995; 5(1):55–72. Available from: http://ideas.repec.org/a/bla/mathfi/v5y1995i1p55-72.html [accessed on September 2, 2010].
12. Tian X, Benkrid K, Gu X. High performance Monte-Carlo based option pricing on FPGAs. IAENG Journal Engineering Letters, Special Issue on High Performance Reconfigurable Systems 2008; 16(3):434–442.
13. Che S, Li J, Sheaffer JW, Skadron K, Lach J. Accelerating compute-intensive applications with GPUs and FPGAs. Symposium on Application Specific Processors, Anaheim, CA, 2008; 101–107. Available from: http://doi.ieeecomputersociety.org/10.1109/SASP.2008.4570793 [accessed on September 2, 2010].

