
Diskless Checkpointing

15 Nov 2001
Motivation
Checkpointing on Stable Storage
Disk access is a major bottleneck!

Incremental Checkpointing
Copy-on-write
Compression
Memory Exclusion
Diskless Checkpointing
Diskless?
Extra memory is available (e.g. in a NOW, a network of workstations)
Use memory instead of disk

Good:
Network Bandwidth > Disk Bandwidth
Bad:
Memory is not stable
Bottom-line
NOW with (n+m) processors
The application runs on exactly n procs and should proceed as long as:
The number of processors in the system is at least n
The failures occur within certain constraints
Available Processors (n+m)
= Application Processors (n) + Chkpnt Processors (m)
Overview
Coordinated Chkpnt (Sync-and-Stop)

To checkpoint,
Application Proc: Chkpnt the state in memory
Chkpnt Proc: Encode the application chkpnts and store the encodings in memory

To recover,
Non-failed Procs: Roll back
Replacement processors are chosen for the failed procs
Replacement Proc: Calculate the chkpnts of the failed procs using the other chkpnts & encodings
Outline

Application Processor Chkpnt
Disk-based
Diskless
Incremental
Forked (or copy-on-write)
Optimization

Encoding the chkpnts
Parity (RAID level 5)
Mirroring
1-Dimensional Parity
2-Dimensional Parity
Reed-Solomon Coding
Optimization

Result
Application Processor Chkpnt
Goal

The processor should be able to roll back to its
most recent chkpnt.

Need to tolerate failures during chkpnting
Make sure that each coordinated chkpnt
remains valid until the next coordinated chkpnt
has been completed.
Disk-based Chkpnt
To chkpnt
Save all values in the stack,
heap, and registers to disk
To recover
Overwrite the address space
with the stored checkpoint
Space Demands
2M in disk
(M: the size of an application processor's
address space)
Simple Diskless Chkpnt
To chkpnt
Wait until encoding calculated
Overwrite diskless chkpnts in
memory
To recover
Roll back from the in-memory
chkpnts
Space Demands
Extra M in memory
(M: the size of an application processor's
address space)
Incremental Diskless Chkpnt
To chkpnt
Initially set all pages R_ONLY
On page fault, copy & set RW
To recover
Restore all RW pages
Space Demands
Extra I in memory
(I: the incremental chkpnt size)
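The page-protection scheme can be modeled in a few lines of Python. This is a toy sketch only: real implementations protect pages with mprotect() and copy them in a SIGSEGV handler; the class and method names here are illustrative.

```python
class IncrementalChkpnt:
    """Toy model of incremental diskless chkpnt: after a chkpnt all pages
    are conceptually R_ONLY; the first write to a page since the chkpnt
    "faults", we copy the page, then let it become RW."""

    def __init__(self, pages):
        self.pages = list(pages)   # current address space, one object per page
        self.saved = {}            # page index -> copy taken on first write

    def checkpoint(self):
        self.saved = {}            # conceptually: set all pages R_ONLY again

    def write(self, idx, data):
        if idx not in self.saved:  # simulated page fault on an R_ONLY page
            self.saved[idx] = self.pages[idx]   # copy, then "set RW"
        self.pages[idx] = data

    def recover(self):
        for idx, old in self.saved.items():     # restore only the RW pages
            self.pages[idx] = old
```

Note that only the first write to a page pays the copy cost; later writes to the same page are free, and recovery touches only the I modified pages.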
Forked Diskless Chkpnt
To chkpnt
Application clones itself
To recover
Overwrite the state with the
clone's copy, or let the clone
assume the role of the application
Space Demands
Extra 2I in memory
(I: the incremental chkpnt size)
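Forked chkpnting maps directly onto fork(): the child inherits a copy-on-write image of the parent's address space and can save it while the parent keeps computing. A minimal sketch (POSIX only; the callback name is illustrative):

```python
import os

def forked_checkpoint(write_chkpnt):
    """Clone the process; the child saves the frozen image
    while the parent continues the application immediately."""
    pid = os.fork()
    if pid == 0:              # child: copy-on-write snapshot of parent memory
        write_chkpnt()        # save/encode the chkpnt at leisure
        os._exit(0)           # exit without running parent cleanup handlers
    return pid                # parent: resume computing right away
```

The extra space cost is only the pages the parent modifies while the child is alive, which is where the "2I in memory" bound comes from.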
Optimizations
Breaking the chkpnt into chunks
Efficient use of memory

Sending Diffs (Incremental)
Bitwise xor of the current copy and chkpnt copy
Unmodified pages need not be sent

Compressing Diffs
Unmodified regions of memory become runs of zeros in the diff, which compress well
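The diff and compression optimizations can be sketched together in Python. This is an illustrative model, assuming a 4 KB page size and zlib for compression; a real system would operate on the live address space.

```python
import zlib

PAGE = 4096

def compressed_diffs(chkpnt: bytes, current: bytes):
    """Yield (page_no, compressed xor diff) for modified pages only.
    The xor of the old and new copies is zero wherever memory is
    unchanged, so unmodified regions compress to almost nothing."""
    for off in range(0, len(current), PAGE):
        old, new = chkpnt[off:off + PAGE], current[off:off + PAGE]
        if old != new:                      # unmodified pages need not be sent
            diff = bytes(a ^ b for a, b in zip(old, new))
            yield off // PAGE, zlib.compress(diff)

def apply_diff(page: bytes, compressed: bytes) -> bytes:
    """Recover the other copy by xoring the diff back in (xor is its own inverse)."""
    diff = zlib.decompress(compressed)
    return bytes(a ^ b for a, b in zip(page, diff))
```

Because xor is its own inverse, the same diff converts the old page into the new one and vice versa, so it serves both chkpnting and rollback.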
Application Processor Chkpnt (review)

Simple Diskless Chkpnt:
Extra M in memory

Incremental Diskless Chkpnt:
Extra I in memory

Forked Diskless Chkpnt:
Extra 2I in memory, less CPU activity

Optimizations:
Chkpnt into chunks, diffs, and compressed diffs
Encoding the chkpnts
Goal

Extra chkpnt processors should store enough
information that the chkpnts of failed processors
may be reconstructed.

Notation:
Number of chkpnt processors (m)
Number of application processors (n)
Parity (RAID level 5, m=1)
(Figure: n=4 application processors and one chkpnt processor; b_i^j denotes the j-th byte of application processor i's chkpnt)

To chkpnt,
b_ckp^j = b_1^j ⊕ b_2^j ⊕ ... ⊕ b_n^j
On failure of the i-th proc,
b_i^j = b_ckp^j ⊕ b_1^j ⊕ ... ⊕ b_{i-1}^j ⊕ b_{i+1}^j ⊕ ... ⊕ b_n^j
Can tolerate:
Only one processor failure
Remarks:
Chkpnt processor is a bottleneck of
communication and computation
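Parity encoding and recovery are both plain bytewise xor. A minimal Python sketch, assuming the chkpnts are equal-length byte strings:

```python
def parity_encode(chkpnts):
    """b_ckp^j = b_1^j xor ... xor b_n^j, computed bytewise."""
    enc = bytearray(len(chkpnts[0]))
    for ck in chkpnts:
        for j, byte in enumerate(ck):
            enc[j] ^= byte
    return bytes(enc)

def parity_recover(surviving, encoding):
    """The lost chkpnt is the xor of the encoding with every survivor,
    which is just another parity computation."""
    return parity_encode(list(surviving) + [encoding])
```

Recovery reuses the encoder because xor is associative and self-inverse: xoring the encoding with the n-1 surviving chkpnts cancels them out, leaving the missing one.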
Mirroring (m=n)
(Figure: n=m=4; b_i^j denotes the j-th byte of application processor i's chkpnt; chkpnt processor i stores b_ckp_i^j)

To chkpnt,
b_ckp_i^j = b_i^j (each chkpnt processor keeps a full copy of its application processor's chkpnt)
On failure of the i-th proc,
b_i^j = b_ckp_i^j
Can tolerate:
Up to n processor failures
Except the failure of both an application
processor and its chkpnt processor
Remarks:
Fast, no calculation needed
1-Dimensional Parity (1<m<n)
(Figure: n=4 application processors partitioned into m=2 groups; b_i^j denotes the j-th byte of application processor i's chkpnt; each chkpnt processor stores the parity of its own group)
To chkpnt,
Application processors are partitioned
into m groups.
The i-th chkpnt processor calculates the parity
of the chkpnts in group i
On failure of the i-th proc,
Same as in Parity encoding

Can tolerate:
One processor failure per group
Remarks:
More efficient in communication and
computation
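A minimal sketch of the grouped encoding in Python. The round-robin group assignment is an illustrative choice; any partition into m groups works.

```python
def xor_pages(chkpnts):
    """Bytewise parity of a list of equal-length chkpnts."""
    enc = bytearray(len(chkpnts[0]))
    for ck in chkpnts:
        for j, b in enumerate(ck):
            enc[j] ^= b
    return bytes(enc)

def one_d_parity(chkpnts, m):
    """Partition n application chkpnts into m groups (round-robin here)
    and keep one parity per group.  A failure in group g is repaired
    exactly as in plain Parity, using only that group's members and
    its chkpnt processor."""
    groups = [[ck for i, ck in enumerate(chkpnts) if i % m == g]
              for g in range(m)]
    return [xor_pages(grp) for grp in groups]
```

Each group behaves as an independent RAID-5 set, so both the encoding traffic and the recovery work are divided among the m chkpnt processors.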
2-Dimensional Parity
(Figure: n=4 application processors arranged in a 2x2 grid, m=4 chkpnt processors, one per row and one per column; b_i^j denotes the j-th byte of application processor i's chkpnt)
To chkpnt,
Application processors are arranged
logically in a two-dimensional grid
Each chkpnt processor calculates the
parity of the row or the column
On failure of the i-th proc,
Same as in Parity encoding

Can tolerate:
Any two-processor failures
Remarks:
Multicast
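The row/column encoding can be sketched as follows (illustrative Python; only the encoding is shown). Any two failures leave some row or column with a single hole, which plain parity fills, and the repaired value then unlocks the other failure.

```python
def xor_pages(chkpnts):
    """Bytewise parity of a list of equal-length chkpnts."""
    enc = bytearray(len(chkpnts[0]))
    for ck in chkpnts:
        for j, b in enumerate(ck):
            enc[j] ^= b
    return bytes(enc)

def two_d_parity(grid):
    """grid: application chkpnts arranged r x c.  One chkpnt
    processor holds the parity of each row and one of each column,
    so m = r + c chkpnt processors for n = r * c application procs."""
    row_par = [xor_pages(row) for row in grid]
    col_par = [xor_pages(list(col)) for col in zip(*grid)]
    return row_par, col_par
```

Note each application chkpnt contributes to two encodings (its row and its column), which is why multicast helps here.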
Reed-Solomon Coding (m)
To chkpnt,
Vandermonde matrix F, s.t. f(i,j)=j^(i-1)
Use matrix-vector multiplication to calculate chkpnt
To recover,
Use Gaussian Elimination

Can tolerate:
Any m failures
Remarks:
Use Galois Fields to perform arithmetic
Computation overhead
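A compact sketch of the scheme over GF(2^8): encode with the Vandermonde matrix f(i,j) = j^(i-1), recover erased data words by Gaussian elimination. One caveat, hedged here deliberately: with a raw Vandermonde matrix some mixed data/checksum erasure patterns give a singular system (later work fixes this with a modified matrix), so this toy recovers data erasures using the leading checksums, where the submatrix is itself Vandermonde and invertible.

```python
# GF(2^8) arithmetic with the polynomial x^8 + x^4 + x^3 + x + 1 (0x11B)
def gf_mul(a, b):
    p = 0
    while b:
        if b & 1:
            p ^= a
        b >>= 1
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
    return p

def gf_pow(a, e):
    r = 1
    for _ in range(e):
        r = gf_mul(r, a)
    return r

def gf_inv(a):
    return gf_pow(a, 254)          # a^(2^8 - 2) = a^-1 in GF(2^8)

def rs_encode(data, m):
    """Checksum i = sum_j f(i,j)*d_j with f(i,j) = (j+1)^i (j^(i-1), 1-indexed)."""
    checks = []
    for i in range(m):
        c = 0
        for j, d in enumerate(data):
            c ^= gf_mul(gf_pow(j + 1, i), d)
        checks.append(c)
    return checks

def rs_recover(data, checks, lost):
    """data has None at erased positions; solve a k x k system over
    GF(2^8) by Gaussian elimination using the first k checksums."""
    k = len(lost)
    A = [[gf_pow(j + 1, i) for j in lost] for i in range(k)]
    b = []
    for i in range(k):
        s = checks[i]
        for j, d in enumerate(data):
            if d is not None:
                s ^= gf_mul(gf_pow(j + 1, i), d)   # move knowns to the RHS
        b.append(s)
    for col in range(k):                           # elimination over GF(2^8)
        piv = next(r for r in range(col, k) if A[r][col])
        A[col], A[piv], b[col], b[piv] = A[piv], A[col], b[piv], b[col]
        inv = gf_inv(A[col][col])
        A[col] = [gf_mul(inv, x) for x in A[col]]
        b[col] = gf_mul(inv, b[col])
        for r in range(k):
            if r != col and A[r][col]:
                f = A[r][col]
                A[r] = [x ^ gf_mul(f, y) for x, y in zip(A[r], A[col])]
                b[r] ^= gf_mul(f, b[col])
    out = list(data)
    for idx, j in enumerate(lost):
        out[j] = b[idx]
    return out
```

Checksum 0 (exponent 0) degenerates to plain parity, so Parity is the m=1 special case; each additional checksum row buys tolerance of one more failure at the cost of GF multiplications.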
Optimizations
Sending and calculating the encoding
in RAID level 5-based schemes (e.g. Parity)
(a) DIRECT: every processor sends its chkpnt to C1, which becomes the bottleneck
(b) FAN-IN: pairwise tree reduction completes in log(n) steps
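The FAN-IN variant can be sketched as a pairwise xor tree (illustrative Python; on the real network each level's xors run in parallel on different processors):

```python
def fan_in_xor(chkpnts):
    """Combine n chkpnts into one parity in ceil(log2(n)) rounds:
    at each round, pairs of processors xor their partial results,
    so no single node receives all n messages as in DIRECT."""
    level = [bytes(c) for c in chkpnts]
    while len(level) > 1:
        nxt = [bytes(x ^ y for x, y in zip(a, b))
               for a, b in zip(level[0::2], level[1::2])]
        if len(level) % 2:            # odd one out is carried to the next round
            nxt.append(level[-1])
        level = nxt
    return level[0]
```

The result is identical to the sequential xor at C1; only the communication pattern changes, trading n-1 serialized messages at one node for log(n) parallel rounds.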
Encoding the Chkpnts (review)

Parity (RAID level 5, m=1)
Only one failure, bottleneck

Mirroring (m=n)
Up to n failures (unless both app and chkpnt fail), fast

1-Dimensional Parity
One failure per group, more efficient than Parity

2-Dimensional Parity
Any two failures, comm overhead w/o multicast

Reed-Solomon Coding
Any m failures, computation overhead

DIRECT vs. FAN-IN
Testing Applications (1)

CPU-Intensive parallel programs
Instances that took 1.5~2 hrs on 16 processors

NBODY : N-body interactions among particles in a system
Particles are partitioned among processors
Location field of each particle is updated
Expectation:
Poor with incremental chkpnt
Good with diff-based compression

MAT : FP matrix product of two square matrices (Cannon's alg.)
All three matrices are partitioned in square blocks among processors
In each step, accumulate the product and pass the input submatrices along
Expectation:
Incremental chkpnt
Very poor with diff-based compression
Testing Applications (2)
PSTSWM : Nonlinear shallow water equations on a rotating sphere
Most pages are modified, but only a few bytes per page
Expectation:
Poor with incremental chkpnt
Good with diff-based compression

CELL : Parallel cellular automaton simulation program
Two (sparse) grids of cellular automata (current/next)
Expectation:
Poor with incremental chkpnt
Good with compression

PCG : Ax=b for a large, sparse matrix
First, converted to a small, dense format
Expectation:
Incremental chkpnt
Very poor with diff-based compression
Diskless Checkpointing
20 Nov 2001
Disk-based vs. Diskless Chkpnt

                    Disk-based                      Diskless
Where to chkpnt?    In stable storage               In local memory
How to recover?     Restore from stable storage     Re-calculate from chkpnts & encodings
Remarks             Can tolerate whole-system       Cannot tolerate whole-system
                    failure                         failure
                    Low BW to stable storage        Memory is much faster, but
                                                    encoding (+communication) overhead
Recalculate the lost chkpnt?
                 Error Detection & Correction       Chkpnt Recovery
                 in Digital Communication           in Diskless Chkpnt

1-bit Parity     11001011[1] (right)                11001011[1] (chkpnt)
(m=1)            11000011[1] (detectable)           1100X011[1] (tolerable)
                 11001011[0] (detectable)           11001011[X] (tolerable)
                 11000011[0] (oops)                 1100X011[X] (intolerable)

Mirroring        11001011[11001011] (right)         11001011[11001011] (right)
(m=n)            11001011[11001010] (detectable)    11001011[1100101X] (tolerable)
                 11001011[00111100] (detectable)    11001011[XXXXXXXX] (tolerable)
                 11001010[11001010] (oops)          1100101X[1100101X] (intolerable)

Remarks
- Difference: in a chkpnt system we can easily tell which node failed (erasures, not silent errors)
- Some codes can be used to recover from errors in digital communication, too (e.g. Reed-Solomon)
Performance
Criteria
Latency: time between chkpnt initiated and ready for recovery
Overhead: increase in execution time with chkpnt

Applications
App      Description
NBODY    N-body interactions
PSTSWM   Nonlinear shallow water equations on a rotating sphere
CELL     Parallel cellular automaton
MAT      FP matrix multiplication (Cannon's)
PCG      PCG for a sparse matrix

Pattern: NBODY, PSTSWM, and CELL modify most pages but only a few
bytes per page; MAT and PCG update only small parts, but update
them in their entirety
Implementation

BASE : No chkpnt
DISK-FORK : Disk-based chkpnt w/ fork()

SIMP : Simple diskless
INC : Incremental diskless
FORK : Forked diskless
INC-FORK : Incremental, forked diskless

C-SIMP : w/ diff-based compression
C-INC
C-FORK
C-INC-FORK

Experiment Framework

Network of 24 Sun Sparc5 w/s connected to each other by a
fast, switched Ethernet: ~ 5MB/s

Each w/s has
96MB of physical memory
38MB of local disk storage

Local disks have a bandwidth of 1.7 MB/s, and NFS over Ethernet
achieved a bandwidth of 0.13 MB/s

Discussion
Latency: diskless has much lower latency than disk-based.
Lowers the expected running time of the application in the
presence of failures (has small recovery time)
Overhead: comparable
Recommendations
DISK-FORK:
If chkpnts are small
If the likelihood of wholesale system failures is high

C-FORK:
If many pages, but a few bytes per page are modified

INC-FORK:
If only an insignificant number of pages are modified
Reference

J. S. Plank, K. Li, and M. A. Puening. "Diskless
checkpointing." IEEE Transactions on Parallel &
Distributed Systems, 9(10):972-986, Oct. 1998
