
UNIVERSITY OF MASSACHUSETTS
Dept. of Electrical & Computer Engineering

Fault Tolerant Computing


ECE 655 Checkpointing II

Fall 2006

ECE655/Ckpt Part.11 .1

Copyright 2004 Koren & Krishna

Checkpoint Placement - Notations


• Checkpoint placement is a tradeoff between cost and benefit, aimed at minimizing the expected execution time of a long job
• Cost - the time to store a checkpoint, which can be large
• $t_x$ - execution time without checkpoints
• $t_c$ - average time of taking a checkpoint
• $N$ - decision variable: the number of checkpoints placed uniformly in the job, chosen to minimize the total execution time $T_{tot}(N)$
• $X_x = t_x / N$ - time between consecutive checkpoints
• Failures occur with rate $P$
• Failures are transient - they go away after a mean lifetime $t_f$
• The system then rolls back to the latest checkpoint
• Checkpoints are kept in secure memory - uncorrupted by failure

Checkpoint Placement - Analytical Model

• $t_l$ - total time lost for every transient failure
• $t_f$ - time the system is down
• If the failure occurs during checkpointing
   ◦ probability $p_c = t_c / (t_c + X_x)$
   ◦ lost time $X_x + t_c/2$
• If the failure occurs during execution
   ◦ probability $p_x = X_x / (t_c + X_x)$
   ◦ lost time $X_x/2$
• $t_l = t_f + p_c (X_x + t_c/2) + p_x X_x/2 = t_f + (t_c + X_x)/2$
• This result is intuitive - $(t_c + X_x)/2$ is half the interval $t_c + X_x$
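The simplification of the lost-time expression can be checked by putting both terms over the common denominator $t_c + X_x$. A short worked derivation, using only the slide's own symbols, with $p_c = t_c/(t_c + X_x)$ and $p_x = X_x/(t_c + X_x)$:

```latex
\begin{align*}
p_c\left(X_x + \tfrac{t_c}{2}\right) + p_x\,\tfrac{X_x}{2}
  &= \frac{t_c\left(X_x + \tfrac{t_c}{2}\right) + \tfrac{X_x^2}{2}}{t_c + X_x} \\
  &= \frac{t_c^2 + 2\,t_c X_x + X_x^2}{2\,(t_c + X_x)}
   = \frac{(t_c + X_x)^2}{2\,(t_c + X_x)}
   = \frac{t_c + X_x}{2}
\end{align*}
```

Adding the down time $t_f$ gives $t_l = t_f + (t_c + X_x)/2$.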

Optimal Checkpoint Placement


• Assume $P$ is sufficiently small so that the probability of failure during rollback is negligible
• The expected number of failures during the total execution time of $t_x + N t_c$ is $P (t_x + N t_c)$
• Total time taken:
  $T_{tot}(N) = t_x + N t_c + P\,[t_x + N t_c]\,[t_f + (t_c + t_x/N)/2]$
• Select $N$ so as to minimize $T_{tot}(N)$:
  $\partial T_{tot}(N) / \partial N = t_c + P\,t_c\,(t_c/2 + t_f) - (P\,t_x^2)/(2N^2)$
• Setting the derivative to zero, we obtain
  $N_{opt} = t_x \sqrt{\dfrac{P}{t_c\,(2 + P t_c + 2 P t_f)}}$

Optimal Value of N
• $N_{opt}$ must be a whole number - whichever of the floor or the ceiling gives the smaller $T_{tot}(N)$
• Optimal inter-checkpoint interval - $X_{opt} = t_x / N_{opt}$
• Exercise - relax the assumption that the probability of additional failures during the recovery process is negligible
• Uniform placement is optimal if the checkpointing cost is constant throughout the execution
• If the checkpoint size, and hence the checkpointing time, varies greatly from one part of the execution to another, the optimal time between checkpoints is not constant
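The floor-or-ceiling rule can be sketched in a few lines of Python. This is a minimal illustration of the uniform-placement model above, not a reference implementation; the function names and the parameter name `rate` (standing for the failure rate $P$) are assumptions made for the example.

```python
import math


def total_time(n, t_x, t_c, t_f, rate):
    """Expected total execution time T_tot(N) with n uniformly placed checkpoints."""
    busy = t_x + n * t_c                          # useful work plus checkpointing
    lost_per_failure = t_f + (t_c + t_x / n) / 2  # t_l from the analytical model
    return busy + rate * busy * lost_per_failure


def optimal_checkpoints(t_x, t_c, t_f, rate):
    """Evaluate the closed-form N_opt, then pick the better integer neighbour."""
    n_real = t_x * math.sqrt(rate / (t_c * (2 + rate * t_c + 2 * rate * t_f)))
    # At least one checkpoint, so that t_x / n stays defined.
    candidates = {max(1, math.floor(n_real)), max(1, math.ceil(n_real))}
    return min(candidates, key=lambda n: total_time(n, t_x, t_c, t_f, rate))
```

For example, `optimal_checkpoints(t_x=1000.0, t_c=1.0, t_f=2.0, rate=0.001)` evaluates the closed form (a little above 22 for these illustrative numbers) and returns whichever neighbouring integer yields the smaller `total_time`.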


Optimal Checkpoint Placement - An Instruction Level Model


• The probability of a fault during an instruction execution depends on the functional units used and on its execution time
• Decision variable $M$ - number of instructions between consecutive checkpoints
• Minimizing $W$ - time spent per instruction
• Instruction set partitioned into $N$ subsets of similar instructions
• For a type $i$ instruction - execution time $T_i$, failure rate $P_i$, frequency $f_i$ ($\sum_i f_i = 1$)
• $s$ ($1-s$) - fraction of permanent (transient) faults
• $Q_i$ - repair rate of a transient failure in a type $i$ instruction

Instruction Level Model - Notations


• Possible events during an instruction execution:
• $H^{c}$ - instruction is completed successfully when first executed - probability $P^{c}$
• $H^{RB}$ - instruction fails, failure identified, program rolled back to last checkpoint, instruction completed - probability $P^{RB}$
• $H^{PF}$ - program rollback fails, program fails, program reloaded and restarted - probability $P^{PF}$
• $P_i^{c}$, $P_i^{RB}$, $P_i^{PF}$ - conditional probabilities for a type $i$ instruction
• These conditional probabilities will be calculated and then averaged


Instruction Level Model - Further Notations


• For a system with failure rate $P$ and repair rate $Q$:
• Probability of no faults in the time interval $(0, t)$
• Probability of transition from the fault-free state at time 0 to the fault-free state at time $t$
• For $0 \le t_1 \le t$, probability of transition from the fault-free state at time 0 to the fault-free state at time $t$ with at least one fault during $(0, t_1)$


Instruction Level Model - Probabilities


• $M$ - number of instructions between checkpoints
• $m$ - number of instructions between the failing instruction and the last checkpoint; $m = 1, \ldots, M$ with probability $1/M$ each
• $P_{i,m}^{RB}$ - conditional probability of successful rollback given type $i$ and $m$ instructions to the last checkpoint
• $H_1$ - setup time needed to initiate a program rollback, including the time needed to load the information saved in the last checkpoint


Instruction Level Model - Calculating W

• $X$ - mean time to successfully execute an instruction
• $T_s$ - time taken for checkpointing
• Time spent per instruction - $W = X + T_s / M$
• Increasing $M$ - the first term increases, the second decreases
• $T = \sum_i f_i T_i$ - mean execution time of a fault-free instruction
• $H_2$ - average time required for diagnosis and repair
• $L$ - average number of instructions per program
• $X$ includes $W$ as a term
• $X$ is substituted in $W$, and the result is solved for $W$



Optimal value of M
• Solving for $W$ -
• Finding the optimal value of $M$ which minimizes $W$ - done iteratively
• The initial value is obtained by substituting 1 for the denominator and 0 for $H_2$, taking the derivative with respect to $M$ and setting it equal to 0
• Initial value of $M$ -
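The shape of the tradeoff in $W = X + T_s/M$ can be illustrated with a toy search. The sketch below swaps in an exhaustive scan over integer $M$ for the iterative procedure, and `toy_mean_exec` is a purely hypothetical stand-in for $X$ (a linear growth in $M$ chosen only so the example runs); it is not the model's actual expression for $X$.

```python
def time_per_instruction(m, t_s, mean_exec):
    """W = X + T_s / M, where mean_exec(m) plays the role of X."""
    return mean_exec(m) + t_s / m


def best_interval(t_s, mean_exec, m_max):
    """Exhaustive scan for the integer M in 1..m_max that minimizes W."""
    return min(range(1, m_max + 1),
               key=lambda m: time_per_instruction(m, t_s, mean_exec))


def toy_mean_exec(m, t=1.0, k=1e-4):
    """Hypothetical X(M): fault-free time t plus a rollback penalty growing with M."""
    return t + k * m
```

With this stand-in, `best_interval(100.0, toy_mean_exec, 10_000)` lands where the growing first term and the shrinking second term balance, near $\sqrt{T_s/k}$.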


CARER: Cache-Aided Rollback Error Recovery


• Reducing checkpointing time allows more frequent checkpoints - reducing the penalty of rollback upon failure
• The CARER scheme reduces the time required to take a checkpoint by marking the process footprint in main memory and cache as parts of the checkpointed state
• This assumes that the memory and cache are far less prone to failure than the processor
• Checkpointing consists of storing the processor's registers in main memory; the checkpoint includes the process's footprint in main memory plus any lines of the cache marked as being part of the checkpoint


Checkpoint Bit For Each Cache Line


• This requires a hardware modification: an extra checkpoint bit associated with each cache line
• When this bit is 1: the corresponding line is unmodifiable, i.e., the line is part of the latest checkpoint - it may not be updated without forcing a checkpoint to be taken immediately
• If the bit = 0: the processor is free to modify the word
• The process's footprint in main memory and the marked lines in the cache do double duty as both memory and part of the checkpoint - so there is less freedom in deciding when checkpoints have to be taken


Forced Checkpointing
• Checkpointing is forced when
   ◦ A line marked unmodifiable is to be updated
   ◦ Anything in the main memory is to be updated
   ◦ An I/O instruction is executed or an external interrupt occurs
• Taking a checkpoint involves:
   ◦ (a) saving the processor registers in memory
   ◦ (b) setting to 1 the checkpoint bit associated with each valid cache line


Roll Back - Cost


• Rolling back to the previous checkpoint is very simple: restore the registers, and mark invalid all the lines in the cache with checkpoint bit = 0
• The cost:
   ◦ A checkpoint bit for every cache line
   ◦ Every write-back of a cache line into main memory involves taking a checkpoint
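The checkpoint and rollback steps can be mimicked in software. The toy model below is a sketch only: the class name `CarerCache`, the dictionary-based cache, and the absence of line eviction are all assumptions made for illustration. It also assumes that, once a forced checkpoint has been taken, the old contents of the unmodifiable line are written back to main memory and its bit is cleared so that the update can proceed.

```python
class CarerCache:
    """Toy model of a cache with one checkpoint bit per line (no eviction)."""

    def __init__(self, memory):
        self.memory = dict(memory)   # main memory, assumed failure-free
        self.cache = {}              # addr -> {"value": ..., "ckpt": 0 or 1}
        self.regs = {}               # processor registers
        self.saved_regs = {}         # register copy kept in main memory
        self.checkpoints_taken = 0

    def checkpoint(self):
        self.saved_regs = dict(self.regs)     # (a) save the registers
        for line in self.cache.values():      # (b) mark every valid line
            line["ckpt"] = 1
        self.checkpoints_taken += 1

    def _fetch(self, addr):
        # Lines brought in after the last checkpoint start with bit 0.
        if addr not in self.cache:
            self.cache[addr] = {"value": self.memory[addr], "ckpt": 0}
        return self.cache[addr]

    def read(self, addr):
        return self._fetch(addr)["value"]

    def write(self, addr, value):
        line = self._fetch(addr)
        if line["ckpt"] == 1:
            # Updating an unmodifiable line forces a checkpoint first.
            self.checkpoint()
            self.memory[addr] = line["value"]  # assumed write-back of old contents
            line["ckpt"] = 0
        line["value"] = value

    def rollback(self):
        # Restore the registers and invalidate every line with bit 0.
        self.regs = dict(self.saved_regs)
        self.cache = {a: l for a, l in self.cache.items() if l["ckpt"] == 1}
```

After `rollback()`, a read of an invalidated address misses in the cache and refetches the checkpointed value from main memory, which is why no separate copy of the memory footprint is needed.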


Checkpointing in Distributed Systems


• Distributed system: processors and associated memories connected by a network
   ◦ Each processor may have local disks
   ◦ There can also be a network file system accessible by all processors
• Processes are connected by directional channels - point-to-point connections from one process to another
   ◦ Assume channels are error-free and deliver messages in the order in which they were sent


Deterministic and Non-deterministic Events


• A non-deterministic event: its occurrence cannot be predicted based on prior state(s) of the system
• A deterministic event can be predicted
• Process execution is a sequence of deterministic events, interrupted now and then by some non-deterministic events
• Example: a program controlling a pressure valve of a chemical reactor - an endless loop with frequent inputs from pressure sensors, followed by control decisions
• The value of an input is a non-deterministic event: it cannot be predicted based on the results of prior processing

Piecewise Deterministic Process


• However, once the input is available, the rest is predictable (assuming no failures)
• A process execution can be regarded as piecewise deterministic:
• It consists of time-slices, each of which begins with some non-deterministic event
• Given information about the non-deterministic event and the state of the process at the beginning of that time-slice, we can predict every event that happens during the time-slice
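Piecewise determinism is what makes log-based replay possible: if the non-deterministic inputs are recorded, re-running the deterministic steps from a saved state reproduces the same execution. The sketch below illustrates this with a made-up valve controller; `control_step`, `run`, the state fields, and the thresholds are all hypothetical.

```python
def control_step(state, pressure):
    """Deterministic step: the next state depends only on the state and the input."""
    opening = state["valve"]
    if pressure > state["limit"]:
        opening = min(100, opening + 10)   # open the valve further
    else:
        opening = max(0, opening - 5)      # close it a little
    return {"valve": opening, "limit": state["limit"], "steps": state["steps"] + 1}


def run(state, inputs, log=None):
    """Apply control_step to each input; optionally record the inputs in log."""
    for pressure in inputs:
        if log is not None:
            log.append(pressure)           # record the non-deterministic event
        state = control_step(state, pressure)
    return state


checkpoint = {"valve": 50, "limit": 30.0, "steps": 0}
sensor_readings = [28.5, 31.2, 35.0, 29.9]   # non-deterministic in real life
log = []
final = run(checkpoint, sensor_readings, log)
replayed = run(checkpoint, log)              # recovery: replay from the checkpoint
```

Here `replayed` equals `final`, because each time-slice is fully determined by the starting state and the logged input.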


Process/Channel/System State
• The state of a process has the obvious meaning
• The state of the channel at t is the set of messages carried by it up to time t (and the order of receipt)
• The state of the distributed system is the aggregate of the states of the individual processes and channels
• The state is said to be consistent if, for every message delivery, there is a corresponding message-sending event
• A state violating this - one in which a message has been delivered that had not yet been sent - violates causality; such a message is called an orphan
• The converse is consistent - a system state may reflect the sending of a message but not its receipt

Consistent and Inconsistent States


• Two processes, P and Q, each have two checkpoints taken; message m is sent by P to Q
• Sets of checkpoints representing consistent system states:
• {P_1, Q_1}: Neither checkpoint has any information about m
• {P_2, Q_2}: P_2 indicates that m was sent; Q_2 indicates that it was received
• {P_2, Q_1}: P_2 indicates that m was sent; Q_1 has no record of receiving m


Inconsistent States
• In contrast, the set {P_1, Q_2} is an inconsistent state: P_1 has no record of m being sent, while Q_2 records that m was received, i.e., m is an orphan message
• The sets of checkpoints that represent a consistent system state are said to form a recovery line - we can roll the system back to them and restart from there:
• {P_1, Q_1}: Rolling back P to P_1 undoes the sending of m, and rolling back Q to Q_1 means that Q does not have any record of m
• Restarting from these checkpoints, P will again send out m, which Q will receive
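The consistency rule can be checked mechanically once each checkpoint records which messages it has seen sent and received. The sketch below encodes the P/Q example with message m; the dictionary layout and the function names `orphans` and `in_transit` are assumptions made for illustration.

```python
def orphans(cut):
    """Messages recorded as received somewhere in the cut but sent nowhere in it."""
    sent = set().union(*(c["sent"] for c in cut.values()))
    received = set().union(*(c["received"] for c in cut.values()))
    return received - sent


def in_transit(cut):
    """Messages recorded as sent but not as received; recovery must replay these."""
    sent = set().union(*(c["sent"] for c in cut.values()))
    received = set().union(*(c["received"] for c in cut.values()))
    return sent - received


# Checkpoints of the two-process example: P sends m to Q.
P_1 = {"sent": set(), "received": set()}
P_2 = {"sent": {"m"}, "received": set()}
Q_1 = {"sent": set(), "received": set()}
Q_2 = {"sent": set(), "received": {"m"}}

for name, cut in [("{P_1, Q_1}", {"P": P_1, "Q": Q_1}),
                  ("{P_2, Q_2}", {"P": P_2, "Q": Q_2}),
                  ("{P_2, Q_1}", {"P": P_2, "Q": Q_1}),
                  ("{P_1, Q_2}", {"P": P_1, "Q": Q_2})]:
    verdict = "inconsistent" if orphans(cut) else "consistent"
    print(name, verdict, "replay:", sorted(in_transit(cut)))
```

A set of checkpoints is a recovery line exactly when `orphans` returns an empty set; a non-empty `in_transit` result does not make the state inconsistent, but those messages have to be made available again on restart.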

Inconsistent States - Cont.


• {P_2, Q_1}: Rolling back P to P_2 means that it will not retransmit m; however, rolling back Q to Q_1 means that Q has no record of ever having received m
• The recovery process has to be able to play back m to Q - this can be done by adding it to the checkpoint of P or by keeping a separate message log that records everything received by Q
• Sometimes checkpoints can be useless - they will never form part of a recovery line, so taking them is a waste of time


Useless Checkpoints

• Q_2 is a useless checkpoint
• Q_2 records the receipt of m1, but not the sending of m2
• {P_1, Q_2} cannot be consistent (otherwise m1 would become an orphan); similarly, {P_2, Q_2} cannot be consistent (since otherwise m2 would become an orphan)
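A checkpoint is useless when no consistent set of checkpoints contains it, which can be found by brute force for small examples. In the sketch below, the checkpoint contents are a guess at a scenario matching the bullets, and Q_3 is a hypothetical later checkpoint of Q, added only so that P_2 has a consistent partner; the function names are likewise assumptions.

```python
from itertools import product


def is_consistent(cut):
    """True if every message recorded as received is also recorded as sent."""
    sent = set().union(*(c["sent"] for c in cut))
    received = set().union(*(c["received"] for c in cut))
    return received <= sent


def useless_checkpoints(checkpoints):
    """checkpoints: process -> {checkpoint name: {"sent": set, "received": set}}."""
    procs = list(checkpoints)
    useful = set()
    # Try every combination of one checkpoint per process.
    for names in product(*(checkpoints[p] for p in procs)):
        cut = [checkpoints[p][n] for p, n in zip(procs, names)]
        if is_consistent(cut):
            useful.update(names)
    all_names = {n for p in procs for n in checkpoints[p]}
    return all_names - useful


example = {
    "P": {"P_1": {"sent": set(), "received": set()},
          "P_2": {"sent": {"m1"}, "received": {"m2"}}},
    "Q": {"Q_1": {"sent": set(), "received": set()},
          "Q_2": {"sent": set(), "received": {"m1"}},
          "Q_3": {"sent": {"m2"}, "received": {"m1"}}},  # hypothetical
}
print(useless_checkpoints(example))
```

The enumeration is exponential in the number of processes, so it is only meant to make the definition concrete, not to serve as a practical detection method.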


The Domino Effect


• If checkpoints are not coordinated (directly, by message passing, or indirectly, by synchronized clocks), a single failure can cause a domino effect
• When P suffers a transient failure, it rolls back to checkpoint P_3
• Since message f was sent after P_3, Q has to roll back (otherwise Q would have a message that was never sent: an orphan message)
• P must then roll back further, to P_2, because Q's rollback undoes the sending of message e, which P had received
• This continues until all processes have rolled back to their starting positions
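The cascade can be reproduced with a small rollback-propagation loop. The message pattern below is a hypothetical reconstruction (messages a to f alternating between Q and P, with a checkpoint after each receipt), chosen to be consistent with the bullets above; `recovery_line` and the history layout are assumptions made for illustration.

```python
def recovery_line(history, failed):
    """Roll processes back until no recorded receipt lacks a recorded send.

    history: process -> list of (sent, received) snapshots, oldest first.
    The last snapshot is the current volatile state; earlier ones are checkpoints.
    Returns process -> index of the snapshot it ends up at.
    """
    pos = {p: len(h) - 1 for p, h in history.items()}
    pos[failed] -= 1                       # the failed process loses its current state
    while True:
        sent = set().union(*(history[p][i][0] for p, i in pos.items()))
        moved = False
        for p, i in list(pos.items()):
            if history[p][i][1] - sent:    # p remembers receiving an orphan
                pos[p] = i - 1             # roll p back one more checkpoint
                moved = True
        if not moved:
            return pos


history = {
    # P_0 (start), P_1, P_2, P_3, then P's state at the moment of failure
    "P": [(set(), set()), (set(), {"a"}), ({"b"}, {"a", "c"}),
          ({"b", "d"}, {"a", "c", "e"}), ({"b", "d", "f"}, {"a", "c", "e"})],
    # Q_0 (start), Q_1, Q_2, then Q's current state
    "Q": [(set(), set()), ({"a"}, {"b"}), ({"a", "c"}, {"b", "d"}),
          ({"a", "c", "e"}, {"b", "d", "f"})],
}
line = recovery_line(history, failed="P")
```

For this history the loop ends with both processes at index 0, their starting states: each rollback orphans a message received by the other process, which is the domino effect.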

Lost Message
• Messages can be lost due to rollback:
• Suppose Q rolls back to Q_1 after receiving message x from P
   ◦ The record of having received x is lost
• If P does not roll back to P_2, it is as if P had sent a message which was never received by Q
• Lost messages do not violate causality - they are similar to messages lost due to network problems
   ◦ Handled by retransmission
• However, if Q sent an ACK of x to P before rolling back, then that ACK will be an orphan message unless P rolls back to P_2

Example of Livelock

• Livelock - another problem that can arise in distributed checkpointed systems
• Q sends P a message q1; P sends Q a message p1
   ◦ Then P fails at the point shown, before receiving q1
   ◦ To prevent p1 from being orphaned, Q must roll back to Q_1
   ◦ In the meantime, P recovers, rolls back to P_2, sends another copy of p1, and then receives the copy of q1 that was sent before all the rollbacks began
   ◦ However, since Q has rolled back, this copy of q1 is now orphaned, and so P has to repeat its rollback
   ◦ This, in turn, orphans the second copy of p1 as well, forcing Q to also repeat its rollback
   ◦ This dance of the rollbacks can continue indefinitely
