
Loop fusion in Haskell

Roman Leshchinskiy

Programming Languages and Systems


University of New South Wales
What is this about?

What I do
Data Parallel Haskell
compiles nested data-parallel programs to flat data-parallel ones
lots of arrays and collective operations involved
What is this about?

What I do
Data Parallel Haskell
compiles nested data-parallel programs to flat data-parallel ones
lots of arrays and collective operations involved

zipWith (-)
(zipWith (*)
(zipWith (-) (replicate_s segd as1) xs)
(zipWith (-) (replicate_s segd bs1) ys))
(zipWith (*)
(zipWith (-) (replicate_s segd bs2) ys)
(zipWith (-) (replicate_s segd as2) xs))
What is this about?

What I do
Data Parallel Haskell
compiles nested data-parallel programs to flat data-parallel ones
lots of arrays and collective operations involved
What other people do
array programs with lots of collective operations

return . foldl’ hash 5381


. map toLower
. filter isAlpha =<< readFile f
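(Aside: the hash step folded over the characters is never shown; given the 5381 seed, a plausible djb2-style definition — purely an assumption for illustration — would be:)

-- hypothetical accumulator for the fold above (djb2-style; not from the talk)
hash :: Int -> Char -> Int
hash h c = h * 33 + fromEnum c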
What is this about?

What I do
Data Parallel Haskell
compiles nested data-parallel programs to flat data-parallel ones
lots of arrays and collective operations involved
What other people do
array programs with lots of collective operations
What everybody wants
no temporary arrays
fused loops
C-like speed
Loop fusion is easy!

foo xs = map (*5) (map (+3) xs)


Loop fusion is easy!

foo xs = map (*5) (map (+3) xs)

RULES
"map/map" map f (map g xs) = map (f . g) xs
Loop fusion is easy!

foo xs = map (*5) (map (+3) xs)


bar ys = filter even (filter (<42) ys)

RULES
"map/map" map f (map g xs) = map (f . g) xs
Loop fusion is easy!

foo xs = map (*5) (map (+3) xs)


bar ys = filter even (filter (<42) ys)

RULES
"map/map" map f (map g xs) = map (f . g) xs
"filter/filter" filter f (filter g xs)
= filter (λ x → f x && g x) xs
Loop fusion is easy!

foo xs = map (*5) (map (+3) xs)


bar ys = filter even (filter (<42) ys)
baz zs = map (+1) (filter even zs)

RULES
"map/map" map f (map g xs) = map (f . g) xs
"filter/filter" filter f (filter g xs)
= filter (λ x → f x && g x) xs
Loop fusion is easy!

foo xs = map (*5) (map (+3) xs)


bar ys = filter even (filter (<42) ys)
baz zs = map (+1) (filter even zs)

RULES
"map/map" map f (map g xs) = map (f . g) xs
"filter/filter" filter f (filter g xs)
= filter (λ x → f x && g x) xs
"map/filter" map f (filter g xs) = mapFilter f g xs
Loop fusion is easy!

foo xs = map (*5) (map (+3) xs)


bar ys = filter even (filter (<42) ys)
baz zs = map (+1) (filter even zs)

RULES
"map/map" map f (map g xs) = map (f . g) xs
"filter/filter" filter f (filter g xs)
= filter (λ x → f x && g x) xs
"map/filter" map f (filter g xs) = mapFilter f g xs
"map/mapFilter" map f (mapFilter g h xs)
= mapFilter (f . g) h xs
"mapFilter/filter" mapFilter f g (filter h xs)
= mapFilter f (λ x → g x && h x) xs
...
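(mapFilter itself is never defined on these slides; a minimal single-pass sketch of what such a combinator could look like:)

-- possible definition of the auxiliary combinator the rules keep introducing
mapFilter :: (a -> b) -> (a -> Bool) -> [a] -> [b]
mapFilter f g xs = [ f x | x <- xs, g x ]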
Loop fusion is easy!

foo xs = map (*5) (map (+3) xs)

bar ys = filter even (filter (<42) ys)
baz zs = map (+1) (filter even zs)

RULES
"map/map" map f (map g xs) = map (f . g) xs
"filter/filter" filter f (filter g xs)
= filter (λ x → f x && g x) xs
"map/filter" map f (filter g xs) = mapFilter f g xs
"map/mapFilter" map f (mapFilter g h xs)
= mapFilter (f . g) h xs
"mapFilter/filter" mapFilter f g (filter h xs)
= mapFilter f (λ x → g x && h x) xs
...

(stamped across the slide: BAD IDEA)
The challenge

use a constant number of rewrite rules


don’t require new rules for new combinators
make adding new combinators easy
fuse everything!
don’t require specialised compiler support
handle both sequential and parallel loops
Sequential loops
Streams

data Step s a = Yield a s


| Done
data Stream a = ∃s. Stream (s → Step s a) s
Streams

data Step s a = Yield a s stepper


| Done
data Stream a = ∃s. Stream (s → Step s a) s state

stepper produces next element and state from current state


similar to an iterator
actually encodes an anamorphism (unfold)
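(Aside: nothing ties a Stream to an array; any unfold fits. A small self-contained illustration, not from the talk, with the slide's types repeated in plain Haskell:)

{-# LANGUAGE ExistentialQuantification #-}

data Step s a = Yield a s | Done
data Stream a = forall s. Stream (s -> Step s a) s

-- hypothetical example: the Ints from lo to hi as a stream,
-- i.e. an unfold over the counter state
enumFromToS :: Int -> Int -> Stream Int
enumFromToS lo hi = Stream step lo
  where
    step i | i <= hi   = Yield i (i + 1)
           | otherwise = Done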
Streams

data Step s a = Yield a s


| Done
data Stream a = ∃s. Stream (s → Step s a) s

stepper produces next element and state from current state


similar to an iterator
actually encodes an anamorphism (unfold)

sumS :: Num a ⇒ Stream a → a


sumS (Stream step s) = go 0 s
where go z s = case step s of
Yield x s’ → go (z+x) s’
Done → z
Streams

data Step s a = Yield a s


| Done
data Stream a = ∃s. Stream (s → Step s a) s

stepper produces next element and state from current state


similar to an iterator
actually encodes an anamorphism (unfold)

stream :: Array a → Stream a


stream arr = Stream step 0
where step i | i < length arr = Yield (arr ! i) (i+1)
| otherwise = Done
Streams

data Step s a = Yield a s


| Done
data Stream a = ∃s. Stream (s → Step s a) s

stepper produces next element and state from current state


similar to an iterator
actually encodes an anamorphism (unfold)

mapS :: (a → b) → Stream a → Stream b


mapS f (Stream step s) = Stream step’ s
where step’ s = case step s of
Yield x s’ → Yield (f x) s’
Done → Done
Streams

data Step s a = Yield a s


| Done
data Stream a = ∃s. Stream (s → Step s a) s

stepper produces next element and state from current state


similar to an iterator
actually encodes an anamorphism (unfold)

unstream :: Stream a → Array a


unstream (Stream step s) = <allocate, fill and freeze>
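(A concrete reading of "allocate, fill and freeze", sketched with Data.Vector standing in for the abstract Array type — an assumption; the real code fills a mutable array and freezes it:)

import qualified Data.Vector as V

-- sketch: materialise a stream by unfolding into a vector,
-- using the Stream/Step definitions above
unstreamV :: Stream a -> V.Vector a
unstreamV (Stream step s0) = V.unfoldr next s0
  where
    next s = case step s of
               Yield x s' -> Just (x, s')
               Done       -> Nothing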
Stream fusion in three easy steps
Stream fusion in three easy steps
Step 1: implement array operations in terms of streams

sum :: Num a ⇒ Array a → a


sum = sumS . stream

map :: (a → b) → Array a → Array b


map f = unstream . mapS f . stream
Stream fusion in three easy steps
Step 1: implement array operations in terms of streams

sum :: Num a ⇒ Array a → a


sum = sumS . stream

map :: (a → b) → Array a → Array b


map f = unstream . mapS f . stream

Step 2: inline them


sumsq :: Num a ⇒ Array a → a
sumsq = sum . map (λx -> x*x)
Stream fusion in three easy steps
Step 1: implement array operations in terms of streams

sum :: Num a ⇒ Array a → a


sum = sumS . stream

map :: (a → b) → Array a → Array b


map f = unstream . mapS f . stream

Step 2: inline them


sumsq :: Num a ⇒ Array a → a
sumsq = sum . map (λx -> x*x)
= sumS . stream . unstream . mapS (λx -> x*x) . stream
Stream fusion in three easy steps
Step 1: implement array operations in terms of streams

sum :: Num a ⇒ Array a → a


sum = sumS . stream

map :: (a → b) → Array a → Array b


map f = unstream . mapS f . stream

Step 2: inline them


sumsq :: Num a ⇒ Array a → a
sumsq = sum . map (λx -> x*x)
= sumS . stream . unstream . mapS (λx -> x*x) . stream
Step 3: eliminate stream/unstream pairs
"stream/unstream" stream (unstream s) = s
Stream fusion in three easy steps
Step 1: implement array operations in terms of streams

sum :: Num a ⇒ Array a → a


sum = sumS . stream

map :: (a → b) → Array a → Array b


map f = unstream . mapS f . stream

Step 2: inline them


sumsq :: Num a ⇒ Array a → a
sumsq = sum . map (λx -> x*x)
= sumS . mapS (λx -> x*x) . stream
Step 3: eliminate stream/unstream pairs
"stream/unstream" stream (unstream s) = s
Stream fusion in three easy steps
Step 1: implement array operations in terms of streams

sum :: Num a ⇒ Array a → a

sum = sumS . stream

map :: (a → b) → Array a → Array b

map f = unstream . mapS f . stream

Step 2: inline them

sumsq :: Num a ⇒ Array a → a
sumsq = sum . map (λx -> x*x)
= sumS . mapS (λx -> x*x) . stream

Step 3: eliminate stream/unstream pairs
"stream/unstream" stream (unstream s) = s

(stamped across the slide: Let GHC do the rest)
Optimising stream operations

sumsq xs = sumS (mapS square ( stream xs))


Optimising stream operations

sumsq xs = sumS (mapS square ( stream xs))

inline

stream :: Array a → Stream a


stream arr = Stream step 0
where step i | i < length arr = Yield (arr ! i) (i+1)
| otherwise = Done
Optimising stream operations

sumsq xs = sumS ( mapS square (Stream step1 0))


where
step1 i = case i < length xs of
True → Yield (xs ! i) (i+1)
False → Done
Optimising stream operations

sumsq xs = sumS ( mapS square (Stream step1 0))


where
step1 i = case i < length xs of
True → Yield (xs ! i) (i+1)
False → Done

inline

mapS :: (a → b) → Stream a → Stream b
mapS f (Stream step s) = Stream step’ s
where step’ s = case step s of
Yield x s’ → Yield (f x) s’
Done → Done
Optimising stream operations

sumsq xs = sumS (Stream step2 0)


where
step1 i = case i < length xs of
True → Yield (xs ! i) (i+1)
False → Done

step2 i = case step1 i of


Yield x i’ → Yield (square x) i’
Done → Done
Optimising stream operations

sumsq xs = sumS (Stream step2 0)


where
step1 i = case i < length xs of
True → Yield (xs ! i) (i+1)
False → Done

step2 i = case step1 i of


Yield x i’ → Yield (square x) i’
Done → Done

inline

sumS :: Num a ⇒ Stream a → a
sumS (Stream step s) = go 0 s
where go z s = case step s of
Yield x s’ → go (z+x) s’
Done → z
Optimising stream operations

sumsq xs = go 0 0
where
step1 i = case i < length xs of
True → Yield (xs ! i) (i+1)
False → Done

step2 i = case step1 i of


Yield x i’ → Yield (square x) i’
Done → Done

go z i = case step2 i of
Yield x i’ → go (z+x) i’
Done → z
Optimising stream operations

sumsq xs = go 0 0
where
step1 i = case i < length xs of
True → Yield (xs ! i) (i+1)
False → Done

step2 i = case step1 i of


Yield x i’ → Yield (square x) i’
Done → Done

go z i = case step2 i of
Yield x i’ → go (z+x) i’
Done → z
inline
Optimising stream operations

sumsq xs = go 0 0
where
step1 i = case i < length xs of
True → Yield (xs ! i) (i+1)
False → Done

go z i = case (case step1 i of


Yield x i’ → Yield (square x) i’
Done → Done) of
Yield x i’ → go (z+x) i’
Done → z
Optimising stream operations

sumsq xs = go 0 0
where
step1 i = case i < length xs of
True → Yield (xs ! i) (i+1)
False → Done

case of case
go z i = case (case step1 i of


Yield x i’ → Yield (square x) i’
Done → Done) of
Yield x i’ → go (z+x) i’
Done → z
Optimising stream operations

sumsq xs = go 0 0
where
step1 i = case i < length xs of
True → Yield (xs ! i) (i+1)
False → Done

go z i = case step1 i of
Yield x i’ → go (z + square x) i’
Done → z
Optimising stream operations

sumsq xs = go 0 0
where
step1 i = case i < length xs of
True → Yield (xs ! i) (i+1)
False → Done

go z i = case step1 i of
Yield x i’ → go (z + square x) i’
Done → z
inline
Optimising stream operations

sumsq xs = go 0 0
where
go z i = case (case i < length xs of
True → Yield (xs ! i) (i+1)
False → Done) of
Yield x i’ → go (z + square x) i’
Done → z
Optimising stream operations
case of case
sumsq xs = go 0 0
where
go z i = case (case i < length xs of
True → Yield (xs ! i) (i+1)
False → Done) of
Yield x i’ → go (z + square x) i’
Done → z
Optimising stream operations

sumsq xs = go 0 0
where
go z i = case i < length xs of
True → go (z + square (xs ! i)) (i+1)
False → z
Optimising stream operations

sumsq xs = go 0 0
where
go z i = case i < length xs of
True → go (z + square (xs ! i)) (i+1)
False → z

optimal loop
no Stream or Step values ever created
only general-purpose optimisations
will be optimised further (unboxing etc.)
requires a great compiler (thanks GHC team!)
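(For GHC to do all this, every combinator has to be inlined; stream-fusion libraries mark each definition accordingly, roughly like this sketch:)

{-# INLINE stream #-}
{-# INLINE unstream #-}
{-# INLINE mapS #-}
{-# INLINE sumS #-}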
Why does it work?

sumsq xs = go 0 0
where
step1 i = case i < length xs of
True → Yield (xs ! i) (i+1)
False → Done

step2 i = case step1 i of


Yield x i’ → Yield (square x) i’
Done → Done

go z i = case step2 i of
Yield x i’ → go (z+x) i’
Done → z
Why does it work?

sumsq xs = go 0 0
where
step1 i = case i < length xs of            non-recursive
True → Yield (xs ! i) (i+1)
False → Done

step2 i = case step1 i of


Yield x i’ → Yield (square x) i’
Done → Done

go z i = case step2 i of
Yield x i’ → go (z+x) i’
Done → z
Why does it work?

sumsq xs = go 0 0
where
step1 i = case i < length xs of            non-recursive
True → Yield (xs ! i) (i+1)
False → Done

step2 i = case step1 i of                  non-recursive
Yield x i’ → Yield (square x) i’
Done → Done

go z i = case step2 i of
Yield x i’ → go (z+x) i’
Done → z
Why does it work?

sumsq xs = go 0 0
where
step1 i = case i < length xs of            non-recursive
True → Yield (xs ! i) (i+1)
False → Done

step2 i = case step1 i of                  non-recursive
Yield x i’ → Yield (square x) i’
Done → Done

go z i = case step2 i of                   recursive
Yield x i’ → go (z+x) i’
Done → z
A slight problem

filterS :: (a → Bool) → Stream a → Stream a


filterS f (Stream step s) = Stream step’ s
where
step’ s = case step s of
Yield x s’
| f x → Yield x s’
| otherwise → step’ s’
Done → Done
A slight problem

filterS :: (a → Bool) → Stream a → Stream a


filterS f (Stream step s) = Stream step’ s
where
step’ s = case step s of
Yield x s’
| f x → Yield x s’
| otherwise → step’ s’              recursive
Done → Done
Extending streams

Idea: allow a loop iteration not to produce an element


Extending streams

Idea: allow a loop iteration not to produce an element

data Step s a = Yield a s


| Skip s
| Done
Extending streams

Idea: allow a loop iteration not to produce an element

data Step s a = Yield a s


| Skip s
| Done

filterS :: (a → Bool) → Stream a → Stream a


filterS f (Stream step s) = Stream step’ s
where
step’ s = case step s of
Yield x s’
| f x → Yield x s’
| otherwise → Skip s’
Skip s’ → Skip s’
Done → Done
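(Consumers have to handle Skip as well; sumS from earlier gains one case — a sketch mirroring the previous definition, assuming the three-constructor Step above:)

sumS :: Num a => Stream a -> a
sumS (Stream step s0) = go 0 s0
  where
    go z s = case step s of
               Yield x s' -> go (z + x) s'
               Skip    s' -> go z s'        -- no element produced this iteration
               Done       -> z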
Extending streams

Idea: allow a loop iteration not to produce an element

data Step s a = Yield a s


| Skip s
| Done

filterS :: (a → Bool) → Stream a → Stream a


filterS f (Stream step s) = Stream step’ s
where
step’ s = case step s of
Yield x s’
| f x → Yield x s’
| otherwise → Skip s’               non-recursive
Skip s’ → Skip s’
Done → Done
Stream fusion - summary

encode loops by streams


implement array operations in terms of streams
eliminate stream/unstream pairs (temporaries)
stream producers are non-recursive
standard optimisations remove overhead (loop fusion)
Stream fusion - summary

encode loops by streams


implement array operations in terms of streams
eliminate stream/unstream pairs (temporaries)
stream producers are non-recursive
standard optimisations remove overhead (loop fusion)

Standard optimisations: inlining, case-of-case, worker/wrapper
transformation, SpecConstr, LiberateCase, specialisation ...
Parallel loops
DPH on multicores

Evaluation strategy after vectorisation


operations are data parallel and flat
executed by a gang of worker threads
essentially fork-join parallelism
DPH on multicores

Evaluation strategy after vectorisation


operations are data parallel and flat
executed by a gang of worker threads
essentially fork-join parallelism

mapP :: (a → b) → Array a → Array b


mapP f xs = <split xs across workers>
<map f over each chunk>
<collect local results>
DPH on multicores

Evaluation strategy after vectorisation


operations are data parallel and flat
executed by a gang of worker threads
essentially fork-join parallelism
f is sequential

mapP :: (a → b) → Array a → Array b


mapP f xs = <split xs across workers>
<map f over each chunk>
<collect local results>
DPH on multicores

Evaluation strategy after vectorisation


operations are data parallel and flat
executed by a gang of worker threads
essentially fork-join parallelism

mapP :: (a → b) → Array a → Array b


mapP f xs = <split xs across workers>
<map f over each chunk>
<collect local results>

sumP :: Num a ⇒ Array a → a


sumP xs = <split xs across workers>
<sum each chunk>
<reduce local sums>
DPH on multicores

Evaluation strategy after vectorisation


operations are data parallel and flat
executed by a gang of worker threads
essentially fork-join parallelism

sumsqP = sumP . mapP square


DPH on multicores

Evaluation strategy after vectorisation


operations are data parallel and flat
executed by a gang of worker threads
essentially fork-join parallelism

sumsqP xs = <split xs across workers>


<map square over each chunk>
<collect local results>
<split results across workers>
<sum each chunk>
<reduce local sums>
Distributed types

Idea: let’s make the evaluation strategy explicit! (Keller 1999)


Distributed types

Idea: let’s make the evaluation strategy explicit! (Keller 1999)


data Dist a a is distributed across threads
Dist (Array a) each thread has a local array (chunk)
Dist Double each thread has a local Double
Distributed types

Idea: let’s make the evaluation strategy explicit! (Keller 1999)


data Dist a a is distributed across threads
Dist (Array a) each thread has a local array (chunk)
Dist Double each thread has a local Double

splitD distribute an array across threads


joinD collect thread-local chunks

splitD :: Array a → Dist (Array a)


joinD :: Dist (Array a) → Array a
Distributed types

Idea: let’s make the evaluation strategy explicit! (Keller 1999)


data Dist a a is distributed across threads
Dist (Array a) each thread has a local array (chunk)
Dist Double each thread has a local Double

splitD distribute an array across threads


joinD collect thread-local chunks

mapD execute a sequential operation in each thread


sumD compute sum of local values

splitD :: Array a → Dist (Array a)


joinD :: Dist (Array a) → Array a
mapD :: (a → b) → Dist a → Dist b
sumD :: Num a ⇒ Dist a → a
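(A toy multicore model of these combinators, with a list of per-worker chunks standing in for Dist and lists standing in for Array — the names, gang size and chunking strategy are assumptions, not the DPH implementation:)

import Control.DeepSeq (NFData)
import Control.Parallel.Strategies (parMap, rdeepseq)

newtype Dist a = Dist [a]     -- one value per worker

gangSize :: Int
gangSize = 4                  -- assumed fixed number of workers

splitD :: [a] -> Dist [a]
splitD xs = Dist (chunk xs)
  where
    n = max 1 ((length xs + gangSize - 1) `div` gangSize)
    chunk [] = []
    chunk ys = let (c, rest) = splitAt n ys in c : chunk rest

joinD :: Dist [a] -> [a]
joinD (Dist chunks) = concat chunks

mapD :: NFData b => (a -> b) -> Dist a -> Dist b
mapD f (Dist xs) = Dist (parMap rdeepseq f xs)   -- each worker's value in parallel

sumD :: Num a => Dist a -> a
sumD (Dist xs) = sum xs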
Programming with distributed types

mapP f xs = <split xs across workers>


<map f over each chunk>
<collect local results>
Programming with distributed types

mapP f = joinD -- collect


. mapD (map f) -- map f over chunks
. splitD -- split
Programming with distributed types

mapP f = joinD -- collect


. mapD (map f) -- map f over chunks
. splitD -- split

sumP xs = <split xs across workers>


<sum each chunk>
<reduce local sums>
Programming with distributed types

mapP f = joinD -- collect


. mapD (map f) -- map f over chunks
. splitD -- split

sumP = sumD -- reduce


. mapD sum -- sum each chunk
. splitD -- split
Fusing distributed types

sumsqP = sumP . mapP square


Fusing distributed types

sumsqP = sumD -- reduce


. mapD sum -- sum each chunk
. splitD -- split
. joinD -- collect
. mapD (map square) -- map square over chunks
. splitD -- split
Fusing distributed types

sumsqP = sumD -- reduce


. mapD sum -- sum each chunk
. splitD -- split
. joinD -- collect
. mapD (map square) -- map square over chunks
. splitD -- split

RULES
splitD (joinD xs) = xs
Fusing distributed types

sumsqP = sumD -- reduce


. mapD sum -- sum each chunk
. mapD (map square) -- map square over chunks
. splitD -- split

RULES
splitD (joinD xs) = xs
Fusing distributed types

sumsqP = sumD -- reduce


. mapD sum -- sum each chunk
. mapD (map square) -- map square over chunks
. splitD -- split

RULES
splitD (joinD xs) = xs
mapD f (mapD g xs) = mapD (f . g) xs
Fusing distributed types

sumsqP = sumD -- reduce


. mapD (sum . map square) -- work
. splitD -- split

RULES
splitD (joinD xs) = xs
mapD f (mapD g xs) = mapD (f . g) xs
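(Written as GHC rewrite rules, a syntax sketch:)

{-# RULES
"splitD/joinD" forall xs.     splitD (joinD xs)  = xs
"mapD/mapD"    forall f g xs. mapD f (mapD g xs) = mapD (f . g) xs
  #-}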
Fusing distributed types

sumsqP = sumD -- reduce


. mapD (sum . map square) -- work
. splitD -- split
stream fusion

RULES
splitD (joinD xs) = xs
mapD f (mapD g xs) = mapD (f . g) xs
Distributed types on multicores

data Dist a a is distributed across threads

splitD distribute xs across threads


joinD collect thread-local chunks
mapD execute a sequential operation in each thread

splitD/joinD eliminate communication


mapD/mapD eliminate synchronisation
Distributed types on clusters

data Dist a a is distributed across nodes

splitD scatter
joinD gather
mapD execute operation on each node

splitD/joinD eliminate communication


mapD/mapD eliminate synchronisation
Distributed types on GPUs

data Dist a a is in GPU memory

splitD CPU −→ GPU transfer


joinD GPU −→ CPU transfer
mapD execute kernel on the GPU

splitD/joinD eliminate memory transfers (communication)


mapD/mapD fuse kernels (synchronisation)
Distributed types – summary

encode parallel loops as split/work/join


eliminate unnecessary split/join pairs
fuse sequential work (stream fusion)
very general mechanism for fusing parallel computations
applicable to a wide range of architectures
again, no specialised compiler support
Obligatory benchmark

[benchmark plot: "Runtime @ greyarea", log scale 1-10000, against 1-64 (threads),
curves for sumsq, dotp and smvm, each in Haskell and in C]
Parting thoughts

it’s nice, it’s easy to use, it works


high-level functional programs compiled to highly efficient code
even parallel ones!
rewrite rules + great optimiser = win
DPH doesn’t require any special-purpose optimisations
try this in an imperative language...
Parting thoughts

it’s nice, it’s easy to use, it works


high-level functional programs compiled to highly efficient code
even parallel ones!
rewrite rules + great optimiser = win
DPH doesn’t require any special-purpose optimisations
don’t try this in an imperative language...
Parting thoughts

it’s nice, it’s easy to use, it works


high-level functional programs compiled to highly efficient code
even parallel ones!
rewrite rules + great optimiser = win
DPH doesn’t require any special-purpose optimisations
don’t try this in an imperative language...

Stream fusion: dph, bytestring, vector, uvector


Distributed types: dph
