Bo WANG¹, Yi SHAN¹, Jing YAN², Yu WANG¹, Ningyi XU², Huazhong YANG¹
¹ Department of Electronic Engineering, Tsinghua University, Beijing, China
² Hardware Computing Group, Microsoft Research Asia

Outline
- Motivation
- Proposed solution: FPGA + MapReduce
- Case study: RankBoost acceleration
- Summary

The Power Barrier
[Chart omitted: power trends for single-core vs. parallel designs. Source: Shekhar Borkar, Intel]

Cost and Energy Are Still a Big Issue

Challenges
- General-purpose CPU architecture
  - Memory wall: CPUs are too fast and memory bandwidth is too slow, so caches consume much of the chip real estate
  - Power wall: most power goes to non-arithmetic work (out-of-order execution, branch prediction); higher frequency brings higher leakage power; large caches add more
- Traditional parallel programming
  - The programmer must manage concurrency explicitly

Customized Domain-Specific Computing for Machine Learning
- Primary goal of this project: automatically exploit the parallelism in machine-learning algorithms with 100x performance/power efficiency
- A few facts
  - We have sufficient computing power for most applications*
  - Each user/enterprise needs high computing power only for selected tasks in its domain* (here, machine learning)
  - Application-specific integrated circuits (ASICs) can deliver 10,000x+ better power/performance efficiency, but are too expensive to design and manufacture*
  - MapReduce is a successful programming framework for ML/DM
- Approach
  - A "supercomputer in a box" built from reconfigurable hardware: field-programmable gate arrays (FPGAs) plus CPUs
  - Parallel hardware programming with the MapReduce framework
* Jason Cong, FPL'09 keynote, "Customizable Domain-Specific Computing"

The Big Picture
[Diagram omitted: machine-learning applications are written as a MapReduce description of the algorithm in C/C++ (programming side); a high-level synthesis tool, guided by user constraints, maps the mappers and reducers onto CPUs and FPGAs joined by an interconnection network, with a scheduler, data manager, and memories underneath (architecture side)]

Field-Programmable Gate Array Defined
- A field-programmable semiconductor device: its functionality can be changed after deployment
- Arbitrary logic is created with gate arrays: islands of reconfigurable logic in a sea of reconfigurable interconnects
[Example omitted: Y = i0 + i1 + i2*i3 mapped onto the reconfigurable fabric of an Altera Stratix]
- The desired functionality is implemented in hardware, e.g. X = 3*Y + 5*Z, using Hardware Description Languages (HDLs)
- C/C++-to-HDL compilation tools exist, e.g. AutoPilot (http://www.deepchip.com/items/0482-06.html)
- The CPU runs the application; the FPGA is the application.

Why Use an FPGA?
- High flexibility: logic customized to the application, matched down to the bit level, best exploits the application's parallelism and locality
- High computation density: equivalent to several Pentium cores
- High I/O bandwidth: up to hundreds of Gbps
- High internal memory bandwidth: up to tens of Tbps, with a customized memory hierarchy and no cache misses
- Tracks Moore's Law
- Compared to an ASIC: much lower design cost
- Compared to a GPU: bit-level flexibility and lower power

FPGA-Based High-Performance Computing
- 10x to 10,000x speedups reported
- Conferences: FCCM, FPGA, FPT, FPL, SC, ICS
- Domains: scientific computing, machine learning, data mining, graphics, financial computing, ...
- Challenges: solutions are ad hoc, and design productivity is low

Framework: MapReduce
- The programmer supplies only the functionality; the MapReduce runtime handles parallelization, data distribution, fault tolerance, and load balancing (e.g. over Web request logs)
- Two primitives, shown on the word-count example:

    map(input):
      for each word in input:
        emit(word, 1)

    reduce(key, values):
      int sum = 0
      for each value in values:
        sum += value
      emit(key, sum)
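The two word-count primitives can be sketched in C++, the language the deck targets for MapReduce descriptions. This is a minimal in-memory sketch, not the framework's actual API: `emit` is modeled as appending to a vector, and the runtime's shuffle step is approximated by grouping intermediate pairs in a `std::map`.

```cpp
#include <map>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

using KV = std::pair<std::string, int>;

// map: for each word in the input, emit (word, 1).
std::vector<KV> map_words(const std::string& input) {
    std::vector<KV> out;
    std::istringstream iss(input);
    std::string word;
    while (iss >> word) out.push_back({word, 1});
    return out;
}

// reduce: sum the values collected for one key.
KV reduce_count(const std::string& key, const std::vector<int>& values) {
    int sum = 0;
    for (int v : values) sum += v;
    return {key, sum};
}

// A toy "runtime": shuffle the intermediate pairs by key, then reduce each group.
std::map<std::string, int> word_count(const std::vector<std::string>& inputs) {
    std::map<std::string, std::vector<int>> shuffled;
    for (const auto& in : inputs)
        for (const auto& kv : map_words(in))
            shuffled[kv.first].push_back(kv.second);
    std::map<std::string, int> result;
    for (const auto& [key, values] : shuffled)
        result.insert(reduce_count(key, values));
    return result;
}
```

The point of the framework is exactly this separation: only `map_words` and `reduce_count` are application code; everything in `word_count` is what the runtime (or, in FPMR, the on-chip infrastructure) provides.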
FPGA MapReduce (FPMR) Framework
[Diagram omitted: the CPU talks to the FPGA over PCIe/HyperTransport. On chip, a data controller moves data between global memory and the workers' local memories; a <key,value> generator feeds the mappers; mappers write intermediate <key,value> pairs; reducers consume them; a merger produces the final output; a processor scheduler enables workers and passes parameters. Numbered arrows (1-6) trace this dataflow.]
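The numbered dataflow in the framework figure can be mimicked in software. The sketch below is an analogy only, with invented names such as `Fpmr`: the real FPMR workers are hardware blocks, whereas here a task queue plays the scheduler, a fixed number of mapper "slots" stand in for the hardware mappers, and a shared buffer stands in for global memory.

```cpp
#include <functional>
#include <queue>
#include <utility>
#include <vector>

// Software analogy of the FPMR dataflow: a task queue (scheduler),
// a fixed number of mapper "slots" (hardware mappers), a shared
// intermediate buffer (global memory), and one reducer.
struct Fpmr {
    int num_mappers;                        // parallel mapper slots on the "FPGA"
    std::vector<std::pair<int, int>> inter; // intermediate <key,value> buffer

    // Host enqueues input chunks; the scheduler hands them to mapper slots.
    void run_map(const std::vector<int>& inputs,
                 const std::function<std::pair<int, int>(int)>& mapper) {
        std::queue<int> tasks;
        for (int x : inputs) tasks.push(x);
        while (!tasks.empty()) {
            // In hardware, up to num_mappers tasks run concurrently;
            // this simulation simply drains the queue slot by slot.
            for (int s = 0; s < num_mappers && !tasks.empty(); ++s) {
                inter.push_back(mapper(tasks.front()));  // emit intermediate pair
                tasks.pop();
            }
        }
    }

    // The reducer folds the intermediate buffer; in FPMR the merger
    // would then stream the result back to the CPU over PCIe.
    int run_reduce(const std::function<int(int, int)>& reducer, int init) {
        int acc = init;
        for (const auto& kv : inter) acc = reducer(acc, kv.second);
        return acc;
    }
};
```

For example, `Fpmr f{4, {}};` followed by mapping each input to its square and reducing with addition computes a sum of squares; the decomposition, not the arithmetic, is what carries over to the hardware design.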
Major Building Blocks
- Processors (workers) with pre-defined interfaces: the mapper and the reducer
- On-chip scheduler: schedules work dynamically, monitors worker status, keeps queues to record pending tasks
- Data-access infrastructure
  - Interconnection network: message passing and shared memory
  - Storage hierarchy: global memory, local memory, and register file
  - Data controller: connects the CPU, the memories, and the workers

Parallelism
- Task-level/data-level parallelism: among mappers/reducers
- Instruction-level parallelism: within each worker

Case Study: RankBoost
- An extension of AdaBoost to ranking problems [Yoav Freund, 2003]
- Learns a ranking function by combining weak learners
- Weak learners are usually represented by decision stumps over features
- Slow with large numbers of features and training samples; e.g. a Web search engine may need weeks to reach an optimal result
[Algorithm figure omitted]

RankBoost: Mapper and Reducer

    map(int key, pair value):
      // key: feature index fi
      // value: (bin_fi, document)
      for each document d in value:
        hist_fi(bin_fi(d)) = hist_fi(bin_fi(d)) + pi(d)
      EmitIntermediate(fi, hist_fi)

    reduce(int key, array value):
      // key: feature index fi
      // value: histograms hist_fi, fi = 1..Nf
      for each histogram hist_fi:
        for i = Nbin - 1 down to 0:
          integral_fi(i) = hist_fi(i) + integral_fi(i+1)
        EmitIntermediate(fi, integral_fi)

RankBoost on FPMR
- Mapping RankBoost onto FPMR means deciding the <key,value> format and the number of mappers/reducers
[Diagram omitted: the FPMR framework specialized for RankBoost, with the generator emitting <bin(d), pi(d)> pairs and the mappers producing intermediate <fi, hist_fi(bin)> pairs]
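The RankBoost map/reduce pair can be checked in plain C++ before committing it to hardware. A minimal sketch under stated assumptions: `pi(d)` denotes the per-document weight (the Greek letter was lost in the slide text), the mapper accumulates a weight histogram over one feature's bins, and the reducer builds the integral histogram from the top bin down, matching the reduce loop integral(i) = hist(i) + integral(i+1).

```cpp
#include <vector>

// One training document as seen by one feature: its bin index and weight pi(d).
struct Doc {
    int bin;
    double pi;
};

// map: accumulate the weight histogram of one feature over nbin bins.
std::vector<double> map_hist(const std::vector<Doc>& docs, int nbin) {
    std::vector<double> hist(nbin, 0.0);
    for (const Doc& d : docs)
        hist[d.bin] += d.pi;  // hist_fi(bin_fi(d)) += pi(d)
    return hist;
}

// reduce: integral histogram, summed from the highest bin downward:
// integral(i) = hist(i) + integral(i+1), with integral(nbin) = 0.
std::vector<double> reduce_integral(const std::vector<double>& hist) {
    const int nbin = static_cast<int>(hist.size());
    std::vector<double> integral(nbin + 1, 0.0);
    for (int i = nbin - 1; i >= 0; --i)
        integral[i] = hist[i] + integral[i + 1];
    return integral;
}
```

The downward sweep is why the hardware reducer only needs an adder and a register chain: each output depends on one histogram entry and the previous partial sum.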
Mapper & Reducer Structure
[Block diagrams omitted: the mapper holds hist_fi in a dual-port RAM fed through shift registers, with bin and pi FIFOs, a floating-point adder, and a local memory with its own address generator; the reducer combines a floating-point adder, a floating-point comparator, and a maximum register over its local memory]

Target Accelerator
- PCI Express x8 interface (Xilinx Virtex-5 LXT FPGA)
- Altera Stratix II FPGA for computation
- 2 DDR2 modules, 16 GB total, 6.25 GB/s; plus SRAMs
- Designed in the Hardware Computing Group, MSRA

Experimental Results

    #mappers  #reducers  WL / s   Total / s  Speedup (WL)  Speedup (Total)
    1         1          320.9    321.96     0.33          0.33
    2         1          160.5    161.52     0.65          0.65
    4         1          80.22    81.293     1.30          1.30
    8         1          40.11    41.181     2.60          2.56
    16        1          20.06    21.125     5.20          4.99
    32        1          10.09    11.159     10.33         9.44
    52        1          6.228    7.297      16.74         14.44
    64        1          5.107    6.176      20.42         17.06
    128       1          2.616    3.685      39.87         28.59
    146       1          2.242    3.311      46.52         31.82
    Optimized software   104.3    105.37     1             1

- 31.82x speedup with 146 parallel mappers; a fully manual design achieves 33.5x

Scalability

    #mappers  1    2    4    8    16   32   52   64   128  146
    ALUT      1%   2%   3%   5%   10%  19%  31%  38%  75%  86%
    Register  1%   2%   4%   6%   11%  17%  32%  39%  81%  89%

[Chart omitted: speedup vs. number of mappers, WL and Total curves, with and without CDP]

Design Productivity
- Manual design: more than 3 months after the hardware circuit board was ready
- FPGA-based MapReduce: weeks, most of which went into data layout and performance tuning

Summary
- Designed the building blocks for MapReduce on an FPGA
- Achieved results comparable to a manual design
- Future work
  - Use C-to-HDL compilers to further increase design productivity
  - Build a runtime for multiple machines
  - Try more cases to build a tunable library

Thanks!
ningyixu@microsoft.com