
JOURNAL OF COMPUTER SCIENCE AND ENGINEERING, VOLUME 17, ISSUE 2, FEBRUARY 2013

Parallel Collections: A Free Lunch?


Ron Coleman and Udaya Ghattamaneni
Abstract: With the advent of inexpensive multicore systems, there has come a renaissance of interest in parallel functional programming and languages. This paper examines one of these languages, Scala, and in particular its parallel collections, as representative of the new trend, and asks for the first time whether their mathematical elegance translates into runtime performance. We compare parallel collections to MapReduce in analyzing the fair value of riskless bond portfolios using naïve, fine-grain, coarse-grain, and memory-bound parallel algorithms composed of compute- and IO-intensive operations. The experimental data suggest that while parallel collections underperform MapReduce for larger workloads, the effect size is small, the cost is little, and the return on investment is possibly large.

Index Terms: bond portfolio analysis, parallel functional languages, Scala

1 INTRODUCTION

Functional programming proponents have long maintained that elaboration of the lambda calculus lends itself to mathematical expressiveness and avoids the concurrency hazards (e.g., side effects, race conditions, deadlocks, thread management, etc.) that are the bane of shared-state parallel computing. [1] [2] Yet parallel functional programming has remained largely an academic interest, mostly ignored by the mainstream programming community. [3] One could argue that parallel functional programming was ahead of its time, anticipating the era of commodity multicore processors in which, some observers have suggested, the "free lunch" is over: clock speeds have been decreasing, or at least not increasing significantly, necessitating a turn toward parallel programming. [4] This may explain the recent popular interest in functional programming languages, [5] including Scala, [6] which blends object and functional styles and runs on the Java Virtual Machine, [7] which is to say, practically everywhere (e.g., desktops, servers, browsers, cell phones, tablets, TV set tops, etc.). Scala supports task-level parallelism based on actors [8] and, lately, a form of data-level parallelism [9] called parallel collections [10] [11] that are modeled on serial collections (e.g., lists, arrays, vectors, etc.) except that they are backed transparently by threads. This simple extension of the Scala standard library makes parallel collections virtually free, since a programmer might apply them naively, without change to a serial implementation, which is a windfall for productivity, design, verification, debugging, and testing. However, little is known about how parallel collections, or any of the other parallel functional programming languages gaining new or renewed attention, perform in relation to alternatives on modern multicore processors. Investigators have studied functional language performance on older, special-purpose hardware and for compute-intensive kernels, which at best solves only half the problem, since practical

applications typically must also fetch data and update a repository, involving high-latency IO operations that are often the bottleneck. [12] [13] [14] Thus, we sought to determine whether and to what degree the productivity potential of parallel collections translates into performance on a non-trivial application involving compute-intensive and IO-intensive operations. We tested parallel collections against a pure Scala version of MapReduce from Haller and Sommers [15] to analyze riskless bond portfolios using naïve, fine-grain, coarse-grain, and memory-bound algorithms on commodity hardware. MapReduce is probably one of the most, if not the most, versatile and successful parallel processing paradigms to emerge in some years. [16] As for bond portfolio pricing, it falls under the rubric of computational finance and typically requires a combination of numerical and IO programming. [17] [18] Bond pricing, furthermore, is related to or informs the valuation of a number of other financial instruments, including annuities, mortgage securities, bond derivatives, and interest rate swaps, which are among the most heavily traded financial contracts in the world. [19] The experimental results indicate that parallel collections achieve superlinear speedup with efficiency of ~200% and that the naïve algorithm performs nearly as well as, if not better than, the other algorithms. Still, parallel collections underperform MapReduce by as much as 35% on larger workloads. While this difference is statistically significant (p < 0.01), we believe it is premature to conclude that parallel collections are not the free lunch they appear to be since (a) in this case the effect size is small, a matter of a few seconds, or Z = 1 [20], and (b) the use of parallel collections didn't cost much in time or labor: to be precise, just four keystrokes. In other words, we speculate that the productivity savings afforded by parallel collections applied naively, evidently the whole point of them, may offset, possibly many times over, a marginal loss in machine performance, yielding a large return on investment. For these reasons, we propose that further study is necessary.

Ron Coleman and Udaya Ghattamaneni are with Marist College, Poughkeepsie, NY 12601.

2 METHODS
2.1 Basic Ideas
In this section, we expose the basic ideas through a trivial problem, namely, to multiply a range of numbers, 1-5, using serial collections, parallel collections, and MapReduce. We will extend and enrich this idea later to value portfolios of bonds. The serial solution to multiply {1, 2, 3, 4, 5} by 2 is below.

(1 to 5).map(x => x * 2)
// Output: Vector(2, 4, 6, 8, 10)

Snippet 1

The core of this snippet is x => x * 2, an anonymous function literal object, which operates over the range collection in O(N) time. For parallel collections, we have:

(1 to 5).par.map(x => x * 2)
// Output: ParVector(10, 6, 2, 4, 8)

Snippet 2

This snippet is virtually identical to the serial version with the exception of .par, which converts the serial range collection into a parallel collection. This simple use of .par is the naïve application of parallel collections, in which the map method invokes the function object on the collection in parallel in O(1) time. Note that because of the threaded execution, the output ordering is unspecified. Using the Scala mapreduceBasic method from Haller and Sommers [15] to solve the same problem, we have:

mapreduceBasic(
  // input collection
  (1 to 5).foldLeft(List[(Int, Int)]()) { (list, n) => list ::: List((n, n)) },
  // map function object
  (i: Int, x: Int) => { List((i, x * 2)) },
  // reduce function object
  (i: Int, mappings: List[Int]) => { List(mappings(0)) }
)
// Output: Map(5 -> List(10), 1 -> List(2), 2 -> List(4), 3 -> List(6), 4 -> List(8))

Snippet 3

This solution takes three parameters: the input collection, the map function object, and the reduce function object. The input collection is a regular serial list of 2-tuples of the index of the element and the element, which in this case happen to be the same. The map function object takes the index and the element pair and returns a serial list of 2-tuples, the element index and the mapping result. Finally, the reduce function object takes the index of the mapping result and the mapping result and returns the reduced result. Here, however, reduce trivially returns the first (and only) value of the mapping result, since this problem calls for merely multiplying values, not also reducing them. Thus, the MapReduce version has O(1) time complexity, too. We note, however, that the code varies significantly from the serial and parallel collections versions, which is suggestive of what might be expected from a more complex problem. In all three cases (serial, parallel collections, and MapReduce) we replace the simple multiply-by-two function objects with portfolio valuation function objects. They implement database queries to retrieve portfolios and bonds, value them according to the pricing theory, and update the database with the portfolio price. The programming details can be found in the source online at the Scaly project site. [21]

2.2 Pricing theory
The fair value of a simple bond, b, is functionally defined as follows [22]:

P(b, r) = Σ_{t=1}^{nT} C / (1 + r_t)^(t/n) + M / (1 + r_T)^T        (1)

where r is the time-dependent market rate or yield curve; C is the coupon payment; n is the coupon payment frequency per annum; T is the time to maturity in years; and M is the face value of the bond due at maturity. Equation 1 gives the value of a bond as the sum of its discounted cash flows. The fair value of a portfolio, π_j, with a collection of Q bonds, is functionally defined as follows:

P(π_j) = Σ_{q=1}^{Q} P(b_q, r)        (2)
Equation 2 gives the value of a portfolio of bonds as the sum of the prices of its constituent bonds.
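To make the computation concrete, here is a minimal Scala sketch of Equations 1 and 2. The Bond case class, its field names, and the yieldCurve accessor are illustrative assumptions on our part, not the types used in the Scaly source.

// Hypothetical bond representation; field names mirror the symbols above.
case class Bond(C: Double, n: Int, T: Int, M: Double)

// Equation 1: the bond price is the sum of its discounted coupon payments
// plus the discounted face value. yieldCurve(t) is an assumed accessor for
// the market rate r_t at payment period t; the face value is discounted at
// the final-period rate.
def priceBond(b: Bond, yieldCurve: Int => Double): Double = {
  val periods = b.n * b.T
  val coupons = (1 to periods).map { t =>
    b.C / math.pow(1 + yieldCurve(t), t.toDouble / b.n)
  }.sum
  coupons + b.M / math.pow(1 + yieldCurve(periods), b.T)
}

// Equation 2: the portfolio price is the sum of its constituent bond prices.
def pricePortfolio(bonds: Seq[Bond], yieldCurve: Int => Double): Double =
  bonds.map(priceBond(_, yieldCurve)).sum

In the serial, parallel collections, and MapReduce versions, a function object of this kind replaces the multiply-by-two function object of Snippets 1-3.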


2.3 Bond portfolio generation
We generate bonds that model a range of computational scenarios. Specifically, we have the collections n = {1, 4, 12, 52, 365}, T = {1, 2, 3, 4, 5, 7, 10, 30}, and α = {0.005, 0.01, 0.02, 0.03, 0.04, 0.05}, where the elements of n are payment frequencies per annum, the elements of T are maturities in years, and the elements of α are coefficients. We derive the parameters of a bond object from the bond generator equations below:

M = 1000                (3a)
n = n[ξ]                (3b)
T = T[ξ]                (3c)
C = (M / T) · α[ξ]      (3d)

where ξ is a uniform random deviate over the size of the respective collection. We invoke Equations 3a-3d a total of 5,000 times to produce a bond inventory, B, which we store in an indexed database described below. We generate a portfolio by first selecting its size, Q, per the equation below:

Q = k + σ·ε             (4)

where ε is a Gaussian random deviate with mean 0 and standard deviation 1. The parameters k and σ are set to 60 and 20, respectively. In other words, each portfolio has an average of 60 ± 20 bonds. Finally, we generate a universe, W, of these portfolios, where |W| = 100,000.
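A sketch of the bond and portfolio-size generators follows, under the symbol reconstruction above (α for the coefficient collection, ξ for the uniform index, ε for the Gaussian deviate) and reusing the hypothetical Bond class from the pricing sketch; it is illustrative only, not the Scaly implementation.

import scala.util.Random

val nColl = List(1, 4, 12, 52, 365)                     // payment frequencies per annum
val TColl = List(1, 2, 3, 4, 5, 7, 10, 30)              // maturities in years
val alpha = List(0.005, 0.01, 0.02, 0.03, 0.04, 0.05)   // coefficients

val rng = new Random

// Equations 3a-3d: each parameter is drawn by a uniform random index (ξ)
// over the size of the respective collection.
def generateBond(): Bond = {
  val M = 1000.0                                         // (3a)
  val n = nColl(rng.nextInt(nColl.size))                 // (3b)
  val T = TColl(rng.nextInt(TColl.size))                 // (3c)
  val C = (M / T) * alpha(rng.nextInt(alpha.size))       // (3d)
  Bond(C, n, T, M)
}

// Equation 4: portfolio size Q = k + σ·ε with k = 60, σ = 20, ε ~ N(0, 1),
// clamped here to at least one bond for safety.
def portfolioSize(k: Double = 60, sigma: Double = 20): Int =
  math.max(1, math.round(k + sigma * rng.nextGaussian()).toInt)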

2.4 Database schema
We store the portfolios and bonds in MongoDB, a document-oriented persistent database. [23] The schema is organized in third object normal form (3ONF). [24] Thus, the portfolios and the bond inventory are stored in two distinct collections referenced by primary key. Consequently, a portfolio is a unique id and a list of bond ids, and a bond is a 5-tuple of its unique id and the four parameters C, n, T, and M. To evaluate Equation 2, a total of 2 + Q accesses are needed: one to fetch portfolio π_j; Q accesses to retrieve each bond b_q of π_j; and finally one access to update π_j with its price after computing Equation 1 for each bond. This design could be improved for performance, e.g., by denormalizing the database. However, we decided to use this schema in the interest of best practices, to establish a baseline for further exploration in future efforts.

2.5 Algorithms
We developed two serial algorithms, of the naïve and memory-bound kinds, and four parallel algorithms, of the naïve, memory-bound, fine-grain, and coarse-grain kinds, for both parallel collections and MapReduce. Thus, there are a total of ten algorithms. The serial algorithms serve as the baseline against which to measure the speedup and efficiency of the parallel algorithms. The serial naïve kind is modeled on Snippet 1; it evaluates Equation 1 and Equation 2 by naively accessing the bonds and portfolios only when needed. The serial memory-bound kind first fetches the portfolios and bonds into a memory cache and then evaluates Equation 1 and Equation 2 from this cache. The naïve parallel collections version is identical to the serial one except that we added .par, as suggested by Snippet 2, to price the portfolios in parallel; we made no other changes. The naïve MapReduce version is modeled on Snippet 3 to price the portfolios (i.e., evaluate Equation 2) in parallel. The parallel memory-bound algorithms do the same as the serial version except that they pre-fetch the portfolios and bonds in parallel and then price the portfolios in parallel. With parallel collections, we use a parallel vector to invoke the parallel queries and then parallel-fold the vector to merge the results into a parallel list (i.e., the memory cache) in O(log N) time. With MapReduce, we spawn one actor per query in parallel and then merge the results into a serial list as the results arrive, in O(N) time. We must merge the MapReduce list this way to avoid having to synchronize the memory cache and the hazards that may entail. The fine-grain parallel algorithms exploit parallelism within the portfolio: since the value of each bond is independent, the bonds can be analyzed in parallel. For parallel collections, after fetching the bond ids for a given portfolio into a parallel vector inside the portfolio valuation function object, we invoke the map method with the bond valuation function object (i.e., Equation 1) on the parallel vector of bond ids; this returns a parallel vector of bond prices in O(1) time. We then parallel-reduce these prices in O(log N) time per Equation 2. For MapReduce, the map function object evaluates the bonds of a given portfolio in parallel in O(1) time per Equation 1, and then the reduce function object merges the bond prices in O(log N) time per Equation 2. Finally, the coarse-grain parallel algorithms are similar to the naïve kind except that they price the portfolios in chunks. The chunk size is the workload size divided by the number of hyperthreads. For parallel collections, the workload is a collection of collections: the outer parallel collection has as many elements as there are hyperthreads, and each element of this outer collection is a collection of portfolio ids. For MapReduce, we use the Haller and Sommers method coarseMapReduce, which has parameters to specify the number of mappers and reducers, respectively, both of which we set to the number of hyperthreads. For both parallel collections and MapReduce, we determined the number of hyperthreads programmatically at runtime using the Java Runtime class, whose availableProcessors method returns the number of hyperthreads.
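For illustration, here is a rough sketch of the naïve and coarse-grain parallel collections variants. The data-access helpers (fetchPortfolio, fetchBond, updatePrice) are stand-ins for the MongoDB queries, and priceBond is the hypothetical pricing function from the earlier sketch; none of these names come from the Scaly source.

// Stand-ins for the MongoDB accesses described in Section 2.4.
def fetchPortfolio(id: Int): Seq[Int] = sys.error("stub: query a portfolio's bond ids")
def fetchBond(id: Int): Bond = sys.error("stub: query a bond document")
def updatePrice(id: Int, price: Double): Unit = sys.error("stub: write the portfolio price back")

// Price one portfolio: fetch its bonds, apply Equation 1 to each, sum per
// Equation 2, and update the database.
def pricePortfolioById(id: Int, yieldCurve: Int => Double): Unit = {
  val price = fetchPortfolio(id).map(b => priceBond(fetchBond(b), yieldCurve)).sum
  updatePrice(id, price)
}

// Naive parallel collections algorithm: the only change from the serial
// version is .par, which prices the portfolios across threads.
def priceNaive(ids: Seq[Int], yieldCurve: Int => Double): Unit =
  ids.par.foreach(pricePortfolioById(_, yieldCurve))

// Coarse-grain variant: one chunk of portfolio ids per hyperthread, with the
// number of hyperthreads obtained at runtime from availableProcessors.
def priceCoarse(ids: Seq[Int], yieldCurve: Int => Double): Unit = {
  val threads = Runtime.getRuntime.availableProcessors
  val chunkSize = math.max(1, ids.size / threads)
  ids.grouped(chunkSize).toSeq.par.foreach { chunk =>
    chunk.foreach(pricePortfolioById(_, yieldCurve))
  }
}

A fine-grain variant would instead apply .par to the bond ids inside pricePortfolioById, pricing the bonds of a single portfolio in parallel.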

3 EXPERIMENTAL DESIGN
3.1 Environment setup
The test environment consisted of four hosts networked on an Ethernet LAN in a public computing center of commodity machines. Each host had an Intel Xeon W3540 2.93 GHz processor with four cores (i.e., two hyperthreads per core, or eight hyperthreads total) [25] and 4 GB RAM, running Microsoft Windows 7 [26] and MongoDB 1.8.3. [27] We used Eclipse 3.7.1 (Indigo) [28] with the Scala IDE plugin version 2.0 [29] to compile the code and Scala 2.9.2 [30] to execute the trials on the 64-bit JVM. [7] As Figure 1 below suggests, host C was the compute server; host S was the MongoDB shard server; and hosts A and B together contained the 100,000 portfolios and 5,000 bonds striped evenly on secondary storage (i.e., A and B each had 50,000 portfolios and 2,500 bonds).
Figure 1. Experimental configuration (compute server C, shard server S, and storage hosts A and B)

3.2 Trials
A single trial consisted of a workload that we processed in five separate runs to obtain stable runtime statistics. [31] The workloads consisted of w = 2^x randomly sampled portfolios from the database, where x = 0, 1, ..., 10. In other words, the workloads ranged from 1 to 1024 portfolios, each portfolio with a mean of 60 bonds. The last workload, w = 1024, we call the terminal trial, and we analyze it in detail.
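For reference, the workload sizes w = 2^x described above can be enumerated in a single line of Scala:

val workloads = (0 to 10).map(x => 1 << x)  // Vector(1, 2, 4, ..., 512, 1024)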

3.3 Statistical calculations
To calculate speedup and efficiency, we compute the serial time, T1, from the serial algorithm and the parallel time, TN, from the parallel algorithm. In both cases, we use the median runtime of the trial to compute the speedup, R, and the efficiency, e, namely:

R = T1 / TN        (5)
e = R / N          (6)

We observed the median differences in the terminal trial between parallel collections and MapReduce. We then applied the Mann-Whitney test [32] to the results of all five runs to robustly assess whether the differences were statistically significant.
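A small sketch of Equations 5 and 6 as applied here, taking the runtimes of a trial's five runs and the hyperthread count N as inputs (the function names are ours):

// Median of the runtimes of the five runs in a trial.
def median(runtimes: Seq[Double]): Double = {
  val s = runtimes.sorted
  if (s.size % 2 == 1) s(s.size / 2)
  else (s(s.size / 2 - 1) + s(s.size / 2)) / 2.0
}

// Equation 5: speedup R = T1 / TN from the median serial and parallel runtimes.
def speedup(serialRuns: Seq[Double], parallelRuns: Seq[Double]): Double =
  median(serialRuns) / median(parallelRuns)

// Equation 6: efficiency e = R / N, where N is the number of hyperthreads.
def efficiency(r: Double, n: Int): Double = r / n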

4 RESULTS
The charts below (Figures 2 through 5) show, respectively, results comparing the naïve, fine-grain, coarse-grain, and memory-bound parallel collections (PC) and MapReduce (MR) algorithms. The X-axis is the workload size, that is, w = 2^x, where w is the number of portfolios. The left Y-axis is the speedup, R. The right Y-axis is the efficiency, e. Each data point on a curve is the median runtime of the trial.

Figure 4. Comparison of coarse-grain algorithms

Note that the coarse-grain results are undefined for w = {2^0, 2^1, 2^2} (i.e., 1, 2, and 4) portfolios, since these workloads do not divide evenly across eight hyperthreads.

Figure 5. Comparison of memory-bound algorithms

Figure 2. Comparison of naive parallel algorithms

The table below gives the terminal trial results with the median runtimes and differences in seconds.

Table 1. Terminal trial, w = 1024, median runtimes in seconds.

Algorithm       MR     PC     Diff   % loss (Diff/PC)
Naive           7.5    11.6   4.1    35.3
Fine-grain      10.1   11.1   1.0    9.0
Coarse-grain    7.6    11.0   3.4    30.9
Memory-bound    10.3   13.0   2.7    20.7
The rank sums for MapReduce in each algorithm case are 15 (i.e., parallel collections underperform MapReduce in every run). This gives U, the Mann-Whitney statistic, a value of 25, with p = 0.006 and an effect size of Z = 1, or one standard deviation.
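For concreteness, a simplified sketch of the rank-sum computation behind these numbers (no tie correction, which suffices when all ten runtimes are distinct); the function name and structure are ours, not the paper's tooling:

// Mann-Whitney U from rank sums over the pooled runtimes of two samples:
// U1 = R1 - n1(n1 + 1)/2, and symmetrically for U2. Ties are not handled.
def mannWhitney(a: Seq[Double], b: Seq[Double]): (Double, Double) = {
  val pooled = (a.map(x => (x, "a")) ++ b.map(x => (x, "b"))).sortBy(_._1)
  val ranked = pooled.zipWithIndex.map { case ((_, tag), i) => (tag, i + 1.0) }
  val rA = ranked.collect { case ("a", rank) => rank }.sum
  val rB = ranked.collect { case ("b", rank) => rank }.sum
  val uA = rA - a.size * (a.size + 1) / 2.0
  val uB = rB - b.size * (b.size + 1) / 2.0
  (uA, uB)
}

With n1 = n2 = 5 and MapReduce fastest in every run, the MapReduce rank sum is 15, giving U values of 0 and 25, consistent with the statistic reported above.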

Figure 3. Comparison of fine-grain algorithms

5 DISCUSSION
From these data we make the following observations:

- Naïve applications of parallel collections often perform as well as or better than the other applications of parallel collections. In other words, second-order efforts do not appear to pay off much.
- While the performance curves for both parallel collections and MapReduce trend generally upward for larger workloads, parallel collections underperform MapReduce, but only clearly for w > 32 portfolios. This is an interesting threshold, probably related to the thread pool size.
- For w = 1024 portfolios, the distinction between parallel collections and MapReduce is fairly well established, although the effect size is only a few seconds.

6 CONCLUSIONS
Is it better to add a .par to an existing algorithm and get superlinear speedup at the opportunity cost of a few seconds of machine time but no effort on the programmer's part? Or is it better to rewrite the algorithm and possibly capture those few seconds, at the expense of having to redesign and rewrite the algorithm with all the costs and risks associated therein? The data we report cannot directly address these questions, for the answers depend on many factors we have not analyzed. Nevertheless, we argue on the side of conserving programmer time rather than machine time where they diverge, which suggests a need for further study.

ACKNOWLEDGMENT
This research has been funded in part by grants from the National Science Foundation, Academic Research Infrastructure award number 0963365 and Major Research Instrumentation award number 1125520.

REFERENCES

[1] G. Michaelson, Introduction to Functional Programming through Lambda Calculus, Dover, 2011.
[2] Z. Horvath, V. Zsok, P. Achten, and P. Koopman, Trends in Functional Programming, vol. 10, Gutenberg, 2011.
[3] P. McKenney, Is Parallel Programming Hard, And, If So, What Can You Do About It?, http://kernel.org/pub/linux/kernel/people/paulmck/perfbook/perfbook.html, 2011.
[4] H. Sutter, "The Free Lunch is Over: The Fundamental Turn Toward Concurrency in Software," Dr. Dobb's Journal, vol. 30, no. 3, 2005.
[5] B.A. Tate, Seven Languages in Seven Weeks, The Pragmatic Programmers, 2010.
[6] M. Odersky, L. Spoon, and B. Venners, Programming in Scala, 2nd ed., Artima, 2011.
[7] Oracle Corp., www.oracle.com, 2012.
[8] V. Subramaniam, Programming Concurrency on the JVM: Mastering Synchronization, STM, and Actors, 2011.
[9] G. Perrin (ed.) and A. Darte (ed.), The Data Parallel Programming Model: Foundations, HPF Realization, and Scientific Applications (Lecture Notes in Computer Science), Springer, 1996.
[10] B. Lester, Data Parallel Programming in Scala, Scala Days 2010, EPFL, Lausanne, Switzerland, 15-16 April 2010.
[11] A. Prokopec, A Generic Parallel Collection Framework, EPFL Infoscience, 2011.
[12] H.W. Loidl et al., "Comparing Parallel Functional Languages: Programming and Performance," Higher-Order and Symbolic Computation, vol. 16, Kluwer Academic Publishers, 2003, pp. 203-251.
[13] D. Spoonhower, G.E. Blelloch, and P.B. Gibbons, "Space Profiling for Parallel Functional Programs," ICFP '08, 2008.
[14] R. Loogen, Y. Ortega-Mallen, and R. Pena-Mari, "Parallel Functional Programming in Eden," Journal of Functional Programming, vol. 15, no. 3, May 2005.
[15] P. Haller and F. Sommers, Actors in Scala, Artima, 2012.
[16] D. Miner and A. Shook, MapReduce Design Patterns, O'Reilly, 2012.
[17] O. Ugur, Introduction to Computational Finance, Imperial College Press, 2008.
[18] M. Constantino and C.A. Brebbia (eds.), Computational Finance and Its Applications, WIT Press, 2006.
[19] J. Hull, Options, Futures, and Other Derivatives, 8th ed., Prentice Hall, 2011.
[20] R. Coe, "It's the Effect Size, Stupid," British Educational Research Association Annual Conference, Exeter, England, 12-14 September 2002.
[21] Scaly project, http://code.google.com/p/scaly/
[22] F. Fabozzi and S. Mann, Introduction to Fixed Income Analytics, 2nd ed., 2010.
[23] K. Chodorow and M. Dirolf, MongoDB: The Definitive Guide, O'Reilly, 2010.
[24] V. Merunka et al., "Normalization Rules of the Object-Oriented Data Model," EOMAS '09: Proceedings of the International Workshop on Enterprise & Organizational Modeling and Simulation, 2009.
[25] Intel Corp., Intel Xeon Processor W3540, http://ark.intel.com/products/39719/Intel-Xeon-Processor-W3540(8M-Cache-2_93-GHz-4_80-GTs-Intel-QPI), 2012.
[26] Microsoft Corp., www.microsoft.com, 2012.
[27] 10gen, Inc., www.mongodb.org, 2012.
[28] The Eclipse Foundation, www.eclipse.org, 2012.
[29] Scala IDE for Eclipse, www.scala-ide.org, 2012.
[30] École Polytechnique Fédérale de Lausanne (EPFL), www.scala-lang.org, 2012.
[31] A. Georges, D. Buytaert, and L. Eeckhout, "Statistically Rigorous Java Performance Evaluation," OOPSLA 2007.
[32] J. Conover, Practical Non-Parametric Statistics, Wiley, 1999.

Ron Coleman: Ph.D., Computer Science, Polytechnic Institute of New York University (1990), NYC, NY, United States; B.S., Computer Science, City College of New York, NYC, NY, United States; IBM Supercomputer Systems Lab (1989-1992); Citibank Pre-Settlement Risk Technology (1992-2000). Member of ACM and IEEE. Research interests include parallel computing, simulation, and games.

Udaya Ghattamaneni: M.S., Computer Science, Marist College (2012); B.T., Computer Science, Aurora Technological and Research Institute (ATRI), Uppal, Hyderabad, AP, India.
