
FPGA-based MapReduce

Framework for Machine Learning


Bo WANG¹, Yi SHAN¹, Jing YAN², Yu WANG¹,
Ningyi XU², Huazhong YANG¹

¹ Department of Electronic Engineering,
Tsinghua University, Beijing, China
² Hardware Computing Group,
Microsoft Research Asia
1
Outline
Motivation
Proposed solution: FPGA+MapReduce
Case study: RankBoost acceleration
Summary
2
The Power Barrier
Source: Shekhar Borkar, Intel
[Chart: the power barrier; the annotation reads "parallel".]
3
Cost and Energy are still a Big Issue
4
Challenges
General purpose CPU architecture
Memory wall
CPUs are too fast; memory bandwidth is too slow
Cache Real Estate
Power Wall
Most power goes to non-arithmetic operations
(out-of-order execution, branch prediction)
Higher freq: higher leakage power
Large cache
Traditional parallel programming
Need to manage the concurrency explicitly
5
Customized Domain Specific
Computing for Machine Learning
Primary goal of this project
Automatically utilize the parallelism in machine learning
algorithms with 100x performance/power efficiency
A few facts
We have sufficient computing power for most applications*
Each user/enterprise needs high computation power only for
selected tasks in its domain* (machine learning)
Application-specific integrated circuits (ASICs) can deliver
10,000x+ better power/performance efficiency, but are too
expensive to design and manufacture*
MapReduce is a successful programming framework for ML/DM
Approach
Supercomputer in a box with reconfigurable hardware Field
Programmable Gate Array (FPGA) and CPUs
Parallel hardware programming with MapReduce framework
*Jason Cong, FPL09 Keynote, Customizable Domain-Specific Computing
6
The Big Picture

[Figure: "Supercomputer in a box". Machine learning applications are
expressed as a MapReduce description of the algorithm in C/C++;
together with user constraints, a high-level synthesis tool maps the
Mappers and Reducers onto CPUs and FPGAs connected by an
interconnection network, with a scheduler, data manager, and
memories (MEM).]
7
Field-Programmable Gate Array Defined
Field-programmable semiconductor device
Change functionality after deployment
Create arbitrary logic with gate arrays
Gate arrays: islands of reconfigurable logic in a sea of
reconfigurable interconnects.
8
Y = i0 + i1 + i2 * i3
Islands of reconfigurable logic in a sea of
reconfigurable interconnects (Altera Stratix)
9
Field-Programmable Gate Array Defined
Field-programmable semiconductor device
Change functionality after deployment
Create arbitrary logic with gate arrays
Gate arrays: islands of reconfigurable logic in a sea of
reconfigurable interconnects.
Implement desired functionality in hardware
Example: X = 3*Y + 5*Z
Hardware Description Languages (HDLs)
C/C++ to HDL compilation tools: AutoPilot
http://www.deepchip.com/items/0482-06.html
CPU runs the application, FPGA is the
application.
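The X = 3*Y + 5*Z example above can be written as the kind of plain
C function a C/C++-to-HDL flow such as AutoPilot can synthesize into
two constant multipliers and an adder in the fabric (the function name
is illustrative, not from the slides):

```cpp
// The slide's example X = 3*Y + 5*Z as a synthesizable C function:
// a C-to-HDL tool can turn this into two constant multipliers and
// one adder instantiated directly in the FPGA fabric.
int compute_x(int y, int z) {
  return 3 * y + 5 * z;
}
```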
10
Why use FPGA?
High flexibility
Customized logic for
application
Match the application in bit
level
Best utilize parallelism and
locality in application
High computation density
Equivalent to several Pentium-class cores
High I/O bandwidth
Up to 100s Gbps
High internal memory
bandwidth
Up to 10s Tbps
Customized memory
hierarchy with no cache
miss
Tracks Moore's Law
Compared to ASIC
Much lower design cost
Compared to GPU
Bit level flexibility
Lower power
11
FPGA-based High Performance
Computing
10X ~ 10,000X speedup reported
Conferences: FCCM, FPGA, FPT, FPL, SC, ICS
Domains: scientific computing, machine learning,
data mining, graphics, financial computing, etc.
Challenges
Ad-hoc solutions
Design productivity
12
Framework: MapReduce
MapReduce programmer: functionality
MapReduce runtime: parallelization, data distribution,
fault tolerance, load balance
Two primitives (Word Count example):
Map (input)
  for each word in input
    emit (word, 1)
Reduce (key, values)
  int sum = 0;
  for each value in values
    sum += value;
  emit (key, sum)
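The two primitives can be sketched as a plain-C++ software model
(illustrative only; in FPMR they become hardware workers, and emitting
is modeled here by returning pairs):

```cpp
#include <string>
#include <utility>
#include <vector>

// Map: for each word in the input, emit (word, 1).
std::vector<std::pair<std::string, int>> word_count_map(
    const std::vector<std::string>& words) {
  std::vector<std::pair<std::string, int>> emitted;
  for (const auto& w : words) emitted.emplace_back(w, 1);
  return emitted;
}

// Reduce: sum all counts emitted under one key.
int word_count_reduce(const std::vector<int>& values) {
  int sum = 0;
  for (int v : values) sum += v;
  return sum;
}
```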
13
The Big Picture

[Figure: the "Supercomputer in a box" overview repeated: a MapReduce
description of the algorithm in C/C++ plus user constraints, compiled
by a high-level synthesis tool onto CPUs and FPGAs with scheduler,
data manager, and memories.]
14
FPGA MapReduce (FPMR) Framework
[Figure: FPMR block diagram. The CPU sends data and parameters to the
FPGA over a PCIe / HyperTransport link. On chip, a <key,value>
generator feeds the Mappers; intermediate <key,value> pairs flow
through local memory to the Reducers and a Merger; a processor
scheduler enables the workers, and a data controller moves data
between global memory, local memory, and the workers. Numbered arrows
(1 to 6) mark the dataflow order.]
15
Major Building Blocks
Processors (workers) with pre-defined interfaces
Mapper and reducer
On-chip scheduler
Dynamic scheduling
Monitors worker status
Queues to record task status
Data access infrastructure
Interconnection network
Message passing and shared memory
Storage hierarchy
Global memory, local memory, and register file
Data controller
CPU, memories, and workers
16
Parallelism
Task level/data level parallelism
Among mappers/reducers
Instruction level parallelism
Within each worker
17
Case study: RankBoost
An extension of AdaBoost to ranking problems
[Yoav Freund, 2003]
Learn a ranking function by combining weak
learners
Weak learners are usually represented by decision
stumps on features
Slow with a large number of features and training
samples
e.g., a Web search engine:
weeks to get optimal results
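A decision-stump weak learner can be sketched in C++ (names and the
threshold convention are illustrative assumptions, not the slides'
implementation):

```cpp
#include <vector>

// Illustrative sketch: a RankBoost weak learner is a decision stump on
// a single feature, outputting 1 when the document's feature value
// exceeds a threshold and 0 otherwise.
struct Stump {
  int feature;       // feature index fi
  double threshold;  // decision threshold
};

int weak_learner(const Stump& h, const std::vector<double>& doc_features) {
  return doc_features[h.feature] > h.threshold ? 1 : 0;
}
```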
18
Case study: RankBoost
19
RankBoost: mapper and reducer
map (int key, pair value):
  // key : feature index fi
  // value : document bins bin_fi and weights pi(d)
  for each document d in value :
    hist_fi(bin_fi(d)) = hist_fi(bin_fi(d)) + pi(d)
  EmitIntermediate (fi, hist_fi);

reduce (int key, array value) :
  // key : feature index fi
  // value : histograms hist_fi, fi = 1..N_f
  for each histogram hist_fi
    for i = N_bin - 1 to 0
      integral_fi(i) = hist_fi(i) + integral_fi(i+1)
    EmitIntermediate (fi, integral_fi)
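The mapper and reducer above can be modeled in software as a histogram
build followed by a suffix sum (a minimal sketch; it assumes each
document arrives as a (bin_fi(d), pi(d)) pair, and the names are
illustrative):

```cpp
#include <utility>
#include <vector>

// Mapper: accumulate document weights pi(d) into the histogram bin
// selected by bin_fi(d).
std::vector<double> build_histogram(
    const std::vector<std::pair<int, double>>& docs, int n_bins) {
  std::vector<double> hist(n_bins, 0.0);
  for (const auto& doc : docs) hist[doc.first] += doc.second;
  return hist;
}

// Reducer: integral_fi(i) = hist_fi(i) + integral_fi(i+1), scanned
// from the highest bin down to bin 0, i.e. a suffix sum over the
// histogram with integral_fi(N_bin) = 0.
std::vector<double> integral_histogram(const std::vector<double>& hist) {
  std::vector<double> integral(hist.size(), 0.0);
  double acc = 0.0;
  for (int i = static_cast<int>(hist.size()) - 1; i >= 0; --i) {
    acc += hist[i];
    integral[i] = acc;
  }
  return integral;
}
```

Note that integral(i) includes hist(i) itself, matching the recurrence
in the reducer.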
20
RankBoost on FPMR
Map RankBoost on FPMR
Decide <key, value>
#mapper/#reducer
[Figure: RankBoost mapped onto FPMR. The CPU sends <bin(d), pi(d)>
pairs over PCI-E; the <key,value> generator feeds the Mappers, which
emit intermediate <fi, hist_fi(bin)> pairs through local memory to the
Reducers and the Merger, coordinated by the processor scheduler and
data controller; bin(d) and pi(d) reside in global memory.]
21
Mapper & Reducer Structure
[Figure: Mapper datapath. A dual-port hist_f RAM with shift registers;
MUXes select the read and write addresses and the DataIn port (8'b0
for clearing); a Bin FIFO and a Pi FIFO feed a floating-point adder
(with 32'b0 as the initial operand); a local-memory address generator
drives the read/write ports. Reducer datapath: a floating-point adder,
a MUX with 32'b0, a floating-point comparator (ageb) with a maximum
register, plus its own local memory and address generator.]
22
Target Accelerator
PCI Express x8 interface
(Xilinx V5 LXT FPGA)
Altera Stratix II FPGA
DDR2 modules x2, 16 GB,
6.25 GB/s; SRAMs
Designed in HCG, MSRA
23
Experimental results
#mapper  #reducer  WL /s   Total /s  Speedup (WL)  Speedup (Total)
1        1         320.9   321.96    0.33          0.33
2        1         160.5   161.52    0.65          0.65
4        1         80.22   81.293    1.30          1.30
8        1         40.11   41.181    2.60          2.56
16       1         20.06   21.125    5.20          4.99
32       1         10.09   11.159    10.33         9.44
52       1         6.228   7.297     16.74         14.44
64       1         5.107   6.176     20.42         17.06
128      1         2.616   3.685     39.87         28.59
146      1         2.242   3.311     46.52         31.82
Optimized software 104.3   105.37    1             1
31.82X speedup with 146 parallel mappers
Manual design: 33.5x
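The speedup columns are simply the optimized-software time divided by
the FPMR time for the same column, e.g. 104.3 s / 2.242 s for WL at
146 mappers:

```cpp
// Arithmetic behind the speedup columns: optimized-software time
// divided by FPMR time for the same column (WL or Total).
double speedup(double software_s, double fpmr_s) {
  return software_s / fpmr_s;
}
```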
24
Scalability
Mapper    1    2    4    8    16   32   52   64   128  146
ALUT      1%   2%   3%   5%   10%  19%  31%  38%  75%  86%
Register  1%   2%   4%   6%   11%  17%  32%  39%  81%  89%

[Figure: speedup vs. number of mappers (0 to 300), four curves:
WL with CDP, Total with CDP, WL w/o CDP, Total w/o CDP.]
25
Design Productivity
Manual design
More than 3 months after the hardware circuit
board was ready
FPGA-based MapReduce
Weeks
Data layout and performance tuning took time
26
Summary
Designed building blocks for MapReduce on FPGA
Achieved results comparable to the manual design
Future work
Use C2HDL compilers to further improve design
productivity
Build a runtime for multiple machines
Try more cases to build a tunable library
27
Thanks!
ningyixu@microsoft.com
28
