CHKPNT 2

UNIVERSITY OF MASSACHUSETTS
Dept. of Electrical & Computer Engineering
Fault Tolerant Computing

ECE 655 Checkpointing II
Fall 2006
ECE655/Ckpt Part.11 .1
Copyright 2004 Koren & Krishna
Checkpoint Placement - Notations

- Checkpoint placement - a tradeoff between cost and benefits, aimed at minimizing the expected execution time of a long job
- Cost - the time to store a checkpoint, which can be large
- $t_x$ - execution time without checkpoints
- $t_c$ - average time of taking a checkpoint
- $N$ - decision variable - number of checkpoints placed uniformly in the job, chosen to minimize the total execution time $T_{tot}(N)$
- $X_x = t_x / N$ - time between consecutive checkpoints
- Failures occur with rate $P$
- Failures are transient - they go away after a mean lifetime $t_f$
- The system then rolls back to the latest checkpoint
- Checkpoints are kept in secure memory - uncorrupted by failure
Checkpoint Placement Analytical Model
- $t_l$ - total time lost for every transient failure
- $t_f$ - time the system is down
- If the failure occurs during checkpointing:
  - probability $p_c = t_c / (t_c + X_x)$
  - lost time $X_x + t_c/2$
- If the failure occurs during execution:
  - probability $p_x = X_x / (t_c + X_x)$
  - lost time $X_x / 2$
- $t_l = t_f + p_c (X_x + t_c/2) + p_x X_x / 2 = t_f + (t_c + X_x)/2$
- This result is intuitive - $(t_c + X_x)/2$ is half the interval $t_c + X_x$
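The identity for $t_l$ can be checked numerically. The following is a minimal sketch; the function name `expected_lost_time` and the sample values are illustrative assumptions, not part of the original slides.

```python
def expected_lost_time(t_f, t_c, x_x):
    """Expected time lost per transient failure, t_l.

    t_f: mean down time, t_c: checkpointing time,
    x_x: execution time between consecutive checkpoints.
    """
    interval = t_c + x_x
    p_c = t_c / interval   # failure hits while a checkpoint is being taken
    p_x = x_x / interval   # failure hits during ordinary execution
    return t_f + p_c * (x_x + t_c / 2) + p_x * (x_x / 2)


# Illustrative values only: the weighted sum equals t_f plus half the interval.
t_l = expected_lost_time(t_f=1.0, t_c=2.0, x_x=8.0)
half_interval_form = 1.0 + (2.0 + 8.0) / 2
```

Both expressions agree because $p_c (X_x + t_c/2) + p_x X_x/2 = (t_c + X_x)^2 / (2 (t_c + X_x))$.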
Optimal Checkpoint Placement

- Assume $P$ is sufficiently small so that the probability of failure during rollback is negligible
- The expected number of failures during the total execution time of $t_x + N t_c$ is $P (t_x + N t_c)$
- Total time taken:
  $T_{tot}(N) = t_x + N t_c + P [t_x + N t_c][t_f + (t_c + t_x/N)/2]$
- Select $N$ so as to minimize $T_{tot}(N)$:
  $\partial T_{tot}(N) / \partial N = t_c + P t_c (t_c/2 + t_f) - (P t_x^2)/(2N^2)$
- Setting the derivative to zero, we obtain
  $N_{opt} = t_x \sqrt{\dfrac{P}{t_c (2 + P t_c + 2 P t_f)}}$
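The differentiation step can be spelled out as follows. This is a worked sketch using only the symbols defined on this slide, with the product rule applied to the failure term.

```latex
\begin{align*}
T_{tot}(N) &= t_x + N t_c
   + P\,(t_x + N t_c)\left(t_f + \frac{t_c}{2} + \frac{t_x}{2N}\right) \\
\frac{\partial T_{tot}}{\partial N}
  &= t_c + P\left[\,t_c\left(t_f + \frac{t_c}{2} + \frac{t_x}{2N}\right)
     - (t_x + N t_c)\,\frac{t_x}{2N^2}\right] \\
  &= t_c + P\,t_c\left(\frac{t_c}{2} + t_f\right) - \frac{P\,t_x^2}{2N^2} \\
\frac{\partial T_{tot}}{\partial N} = 0
  &\;\Longrightarrow\;
  N^2 = \frac{P\,t_x^2}{t_c\,(2 + P\,t_c + 2\,P\,t_f)}
\end{align*}
```

The two terms in $t_c t_x / (2N)$ cancel in the second line, and taking the positive square root gives the optimal number of checkpoints.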
Optimal Value of N
- $N_{opt}$ must be a whole number - whichever of the floor or the ceiling minimizes $T_{tot}(N)$
- Optimal inter-checkpoint interval - $X_{opt} = t_x / N_{opt}$
- Exercise - relax the assumption that the probability of additional failures during the recovery process is negligible
- Uniform placement - optimal if the checkpointing cost is constant throughout the execution
- If the checkpoint size, and hence the checkpointing time, varies greatly from one part of the execution to another, the optimal time between checkpoints is not constant
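Choosing the whole-number value of $N$ can be sketched in a few lines. The function names and the numeric values below are illustrative assumptions; only the formulas for $T_{tot}(N)$ and the real-valued optimum come from the slides.

```python
import math


def total_time(n, t_x, t_c, t_f, rate):
    """T_tot(N) for n uniformly placed checkpoints and failure rate `rate` (P)."""
    return t_x + n * t_c + rate * (t_x + n * t_c) * (t_f + (t_c + t_x / n) / 2)


def optimal_checkpoints(t_x, t_c, t_f, rate):
    """Pick the floor or the ceiling of the real-valued optimum, whichever is cheaper."""
    n_real = t_x * math.sqrt(rate / (t_c * (2 + rate * t_c + 2 * rate * t_f)))
    # At least one checkpoint interval is needed, so clamp candidates to >= 1.
    candidates = {max(1, math.floor(n_real)), max(1, math.ceil(n_real))}
    return min(candidates, key=lambda n: total_time(n, t_x, t_c, t_f, rate))


# Illustrative parameters (arbitrary time units).
n_opt = optimal_checkpoints(t_x=1000.0, t_c=1.0, t_f=2.0, rate=0.01)
x_opt = 1000.0 / n_opt  # optimal inter-checkpoint interval X_opt
```

Checking only the two neighbouring integers is sufficient here because $T_{tot}(N)$ is the sum of a term linear in $N$ and a term proportional to $1/N$, so it is convex for $N > 0$.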
Optimal Checkpoint Placement - An Instruction Level Model

- The probability of a fault during an instruction's execution depends on the functional units used and on its execution time
- Decision variable $M$ - number of instructions between consecutive checkpoints
- Objective - minimizing $W$, the time spent per instruction
- The instruction set is partitioned into $N$ subsets of similar instructions
- For a type $i$ instruction - execution time $T_i$, failure rate $P_i$, frequency $f_i$ ($\sum_i f_i = 1$)
- $s$ ($1-s$) - fraction of permanent (transient) faults
- $Q_i$ - repair rate of a transient failure in a type $i$ instruction
Instruction Level Model - Notations

- Possible events during an instruction execution:
  - $H^{c}$ - the instruction is completed successfully when first executed - probability $P^{c}$
  - $H^{RB}$ - the instruction fails, the failure is identified, the program is rolled back to the last checkpoint, and the instruction is completed - probability $P^{RB}$
  - $H^{PF}$ - the program rollback fails, the program fails, and the program is reloaded and restarted - probability $P^{PF}$
- $P_i^{c}$, $P_i^{RB}$, $P_i^{PF}$ - conditional probabilities for a type $i$ instruction
- These conditional probabilities are calculated and then averaged
Instruction Level Model - Further Notations

- For a system with failure rate $P$ and repair rate $Q$:
  - Probability of no faults in the time interval $(0,t)$
  - Probability of transition from the fault-free state at time 0 to the fault-free state at time $t$
  - For $0 \le t_1 \le t$, probability of transition from the fault-free state at time 0 to the fault-free state at time $t$ with at least one fault during $(0,t_1)$
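As an illustration, these three quantities have standard closed forms if the system is assumed to be a two-state (fault-free/faulty) Markov process with exponentially distributed times to failure and to repair. That model, the function names, and the sample rates are assumptions made for this sketch and may differ from the expressions used in the lecture.

```python
import math


def prob_no_fault(t, fail_rate):
    """Probability of no faults at all in (0, t) for a Poisson fault process."""
    return math.exp(-fail_rate * t)


def prob_fault_free_at(t, fail_rate, repair_rate):
    """Probability of being fault-free at time t, starting fault-free at time 0."""
    total = fail_rate + repair_rate
    return repair_rate / total + (fail_rate / total) * math.exp(-total * t)


def prob_fault_free_with_early_fault(t, t1, fail_rate, repair_rate):
    """Fault-free at 0 and at t, with at least one fault during (0, t1), 0 <= t1 <= t.

    Obtained by removing the paths that stay fault-free throughout (0, t1)
    and then return to the fault-free state after a further t - t1.
    """
    return (prob_fault_free_at(t, fail_rate, repair_rate)
            - prob_no_fault(t1, fail_rate)
            * prob_fault_free_at(t - t1, fail_rate, repair_rate))


# Illustrative rates: P = 0.1 and Q = 2.0 per unit time.
p_clean = prob_no_fault(1.0, 0.1)
p_up = prob_fault_free_at(1.0, 0.1, 2.0)
p_early = prob_fault_free_with_early_fault(1.0, 0.5, 0.1, 2.0)
```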
Instruction Level Model - Probabilities

- $M$ - number of instructions between checkpoints
- $m$ - number of instructions between the failing instruction and the last checkpoint; $m = 1, \ldots, M$ with probability $1/M$ each
- $P_{i,m}^{RB}$ - conditional probability of a successful rollback given a type $i$ instruction and $m$ instructions to the last checkpoint
- $H_1$ - setup time needed to initiate a program rollback, including the time needed to load the information saved in the last checkpoint
Instruction Level Model - Calculating W
- $X$ - mean time to successfully execute an instruction
- $T_s$ - time taken for checkpointing
- Time spent per instruction - $W = X + T_s / M$
- Increasing $M$ - the first term increases, the second decreases
- $T = \sum_i f_i T_i$ - mean execution time of a fault-free instruction
- $H_2$ - average time required for diagnosis and repair
- $L$ - average number of instructions per program
- $X$ includes $W$ as a term; $X$ is substituted into $W$, and the result is solved for $W$

Optimal value of M
- Solving for $W$ gives an expression for $W$ as a function of $M$
- The optimal value of $M$, which minimizes $W$, is found iteratively
- The initial value of $M$ is obtained by substituting 1 for the denominator and 0 for $H_2$ in that expression, taking the derivative with respect to $M$, and setting it equal to 0
CARER: Cache-Aided Rollback Error Recovery

- Reducing the checkpointing time allows more frequent checkpoints, reducing the penalty of a rollback upon failure
- The CARER scheme reduces the time required to take a checkpoint by marking the process footprint in main memory and cache as parts of the checkpointed state
- This assumes that the memory and cache are far less prone to failure than the processor
- Checkpointing consists of storing the processor's registers in main memory; the checkpoint also includes the process's footprint in main memory plus any lines of the cache marked as being part of the checkpoint
Checkpoint Bit For Each Cache Line

- This requires a hardware modification: an extra checkpoint bit associated with each cache line
- When this bit is 1, the corresponding line is unmodifiable, i.e., the line is part of the latest checkpoint and may not be updated without forcing a checkpoint to be taken immediately
- If the bit is 0, the processor is free to modify the word
- The process's footprint in main memory and the marked lines in the cache do double duty as both memory and part of the checkpoint, leaving less freedom when deciding when checkpoints have to be taken
Forced Checkpointing
- Checkpointing is forced when:
  - a line marked unmodifiable is to be updated
  - anything in the main memory is to be updated
  - an I/O instruction is executed or an external interrupt occurs
- Taking a checkpoint involves:
  - (a) saving the processor registers in memory
  - (b) setting to 1 the checkpoint bit associated with each valid cache line
Roll Back - Cost

- Rolling back to the previous checkpoint is very simple: restore the registers, and mark invalid all the lines in the cache with checkpoint bit = 0
- The cost:
  - a checkpoint bit for every cache line
  - every write-back of a cache line into main memory involves taking a checkpoint
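The checkpoint-bit mechanism and the rollback rule can be illustrated with a toy software model. This is only a sketch under stated assumptions: the class and method names are hypothetical, one address maps to one cache line, and a write to a marked line is modeled as writing the old content back to main memory (which forces a checkpoint) before the line is modified with its bit cleared. Real CARER hardware is more involved.

```python
from dataclasses import dataclass


@dataclass
class Line:
    value: int
    valid: bool = True
    ckpt: bool = False  # checkpoint bit: True means part of the latest checkpoint


class CarerModel:
    """Toy model of a cache with per-line checkpoint bits (hypothetical API)."""

    def __init__(self):
        self.memory = {}           # main memory: address -> value
        self.lines = {}            # cache: address -> Line
        self.registers = {}
        self.saved_registers = {}
        self.checkpoints_taken = 0

    def take_checkpoint(self):
        # (a) save the registers, (b) set the bit of every valid cache line.
        self.saved_registers = dict(self.registers)
        for line in self.lines.values():
            if line.valid:
                line.ckpt = True
        self.checkpoints_taken += 1

    def read(self, addr):
        line = self.lines.get(addr)
        if line is None or not line.valid:
            line = Line(self.memory.get(addr, 0))  # refill from main memory
            self.lines[addr] = line
        return line.value

    def write(self, addr, value):
        line = self.lines.get(addr)
        if line is not None and line.valid and line.ckpt:
            # Assumption: preserve the checkpointed content in main memory;
            # updating main memory forces a checkpoint.
            self.memory[addr] = line.value
            self.take_checkpoint()
        self.lines[addr] = Line(value, valid=True, ckpt=False)

    def rollback(self):
        # Restore the registers and invalidate every line with bit = 0.
        self.registers = dict(self.saved_registers)
        for line in self.lines.values():
            if not line.ckpt:
                line.valid = False
```

In this model a rollback never copies data: lines modified since the checkpoint are simply invalidated, and the next read refills them from main memory.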
Checkpointing in Distributed Systems

- Distributed system: processors and associated memories connected by a network
  - Each processor may have local disks
  - There can be a network file system accessible by all processors
- Processes are connected by directional channels
  - Point-to-point connections from one process to another
  - Assume channels are error-free and deliver messages in the order received
Deterministic and Non-deterministic Events

- A non-deterministic event: its occurrence cannot be predicted based on prior state(s) of the system
- A deterministic event can be predicted
- Process execution is a sequence of deterministic events, interrupted now and then by some non-deterministic events
- Example: a program controlling a pressure valve of a chemical reactor - an endless loop that takes frequent inputs from pressure sensors and then makes control decisions
- The value of an input is a non-deterministic event: it cannot be predicted based on the results of prior processing
Piecewise Deterministic Process

- However, once the input is available, the rest is predictable (assuming no failures)
- A process execution can be regarded as piecewise deterministic:
  - It consists of time-slices, each of which begins with some non-deterministic event
  - Given information about the non-deterministic event and the state of the process at the beginning of that time-slice, we can predict every event that happens during the time-slice
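The piecewise-deterministic view suggests that a checkpointed state plus a log of the non-deterministic inputs is enough to reconstruct later states. The sketch below illustrates this for the pressure-valve example; the control rule, the set point, and all names are hypothetical and chosen only to make the replay concrete.

```python
def control_step(valve_opening, pressure):
    """Deterministic update: open the valve further when pressure is high."""
    target = 100  # hypothetical pressure set point
    adjustment = (pressure - target) // 10
    return max(0, min(100, valve_opening + adjustment))


def run(state, readings, log=None):
    """Apply the deterministic step to each sensor reading, logging inputs if asked."""
    for pressure in readings:
        if log is not None:
            log.append(pressure)  # record the non-deterministic event
        state = control_step(state, pressure)
    return state


checkpoint_state = 50                 # valve opening saved in a checkpoint
input_log = []
sensor_readings = [120, 95, 140, 100]  # unpredictable inputs, made up here

# Normal execution: inputs are logged as they arrive.
state_before_failure = run(checkpoint_state, sensor_readings, input_log)

# Recovery: replaying the logged inputs from the checkpoint reproduces the state.
recovered_state = run(checkpoint_state, input_log)
```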
Process/Channel/System State
- The state of a process has the obvious meaning
- The state of a channel at time $t$ is the set of messages carried by it up to time $t$ (and the order of receipt)
- The state of the distributed system is the aggregate of the states of the individual processes and channels
- The state is said to be consistent if, for every message delivery, there is a corresponding message-sending event
- A state violating this, i.e., one in which a message is delivered that had not yet been sent, violates causality; such a message is called an orphan
- The converse is consistent - a system state may reflect the sending of a message but not its receipt
Consistent and Inconsistent States

- Two processes, P and Q, each take two checkpoints; message m is sent by P to Q
- Sets of checkpoints representing consistent system states:
  - {P_1, Q_1}: neither checkpoint has any information about m
  - {P_2, Q_2}: P_2 indicates that m was sent; Q_2 indicates that it was received
  - {P_2, Q_1}: P_2 indicates that m was sent; Q_1 has no record of receiving m
Inconsistent States
- In contrast, the set {P_1, Q_2} is an inconsistent state: P_1 has no record of m being sent, while Q_2 records that m was received, i.e., m is an orphan message
- A set of checkpoints that represents a consistent system state is said to form a recovery line; we can roll the system back to it and restart from there
- {P_1, Q_1}: rolling back P to P_1 undoes the sending of m, and rolling back Q to Q_1 means that Q does not have any record of m
  - Restarting from these checkpoints, P will again send out m, which Q will receive
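The consistency rule (no message may be recorded as received unless its sending is also recorded) can be checked mechanically. The sketch below encodes the P/Q example with made-up local event times; the function names and the time values are assumptions for illustration.

```python
from itertools import product


def is_consistent(cut, messages):
    """True if no message is an orphan with respect to the given checkpoints.

    cut: process name -> local time of the chosen checkpoint.
    messages: (sender, sent_at, receiver, received_at) in local times.
    """
    for sender, sent_at, receiver, received_at in messages:
        receipt_recorded = received_at < cut[receiver]
        send_recorded = sent_at < cut[sender]
        if receipt_recorded and not send_recorded:
            return False  # receipt without a matching send: orphan message
    return True


def recovery_lines(checkpoints, messages):
    """Enumerate every combination of one checkpoint per process that is consistent."""
    names = sorted(checkpoints)
    lines = []
    for combo in product(*(checkpoints[n].items() for n in names)):
        cut = {n: time for n, (_, time) in zip(names, combo)}
        if is_consistent(cut, messages):
            lines.append(tuple(label for label, _ in combo))
    return lines


# Hypothetical timeline: m is sent and received between the two checkpoints.
checkpoints = {"P": {"P_1": 1, "P_2": 3}, "Q": {"Q_1": 1, "Q_2": 3}}
messages = [("P", 2, "Q", 2)]  # message m from P to Q
lines = recovery_lines(checkpoints, messages)
```

With this timeline the only rejected combination is {P_1, Q_2}, matching the discussion above.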
Inconsistent States - Cont.

- {P_2, Q_1}: rolling back P to P_2 means that it will not retransmit m; however, rolling back Q to Q_1 means that Q has no record of ever having received m
- The recovery process has to be able to play back m to Q; this can be done by adding m to the checkpoint of P or by keeping a separate message log that records everything received by Q
- Sometimes checkpoints can be useless - they will never form part of a recovery line, so taking them is a waste of time
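The playback requirement for {P_2, Q_1} amounts to finding the messages whose sending is recorded in the sender's checkpoint but whose receipt is not recorded in the receiver's. A minimal sketch, with hypothetical names and made-up local times:

```python
def in_transit_messages(cut, messages):
    """Return names of messages whose send is recorded but whose receipt is not.

    cut: process name -> local time of the checkpoint it rolls back to.
    messages: (name, sender, sent_at, receiver, received_at) in local times.
    """
    pending = []
    for name, sender, sent_at, receiver, received_at in messages:
        send_recorded = sent_at < cut[sender]
        receipt_recorded = received_at < cut[receiver]
        if send_recorded and not receipt_recorded:
            pending.append(name)  # must be replayed from a log after recovery
    return pending


# Hypothetical local times: P_1 = 1, P_2 = 3, Q_1 = 1; m is sent and received at 2.
messages = [("m", "P", 2, "Q", 2)]
replay_for_p2_q1 = in_transit_messages({"P": 3, "Q": 1}, messages)
replay_for_p1_q1 = in_transit_messages({"P": 1, "Q": 1}, messages)
```

For {P_1, Q_1} nothing needs to be replayed, because P re-executes the send itself.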
Useless Checkpoints
- Q_2 is a useless checkpoint
- Q_2 records the receipt of m1, but not the sending of m2
- {P_1, Q_2} cannot be consistent (otherwise m1 would become an orphan); similarly, {P_2, Q_2} cannot be consistent (since otherwise m2 would become an orphan)
The Domino Effect

- If checkpoints are not coordinated (directly, through message passing, or indirectly, through synchronized clocks), a single failure can cause a domino effect
- When P suffers a transient failure, it rolls back to checkpoint P_3
- Since message f was sent after P_3, Q has to roll back (otherwise Q would have a message that was never sent: an orphan message)
- P will then roll back to P_2, since Q sent a message e to P
- This continues until all processes have rolled back to their starting positions
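The cascading rollback can be simulated by repeatedly rolling back any process that holds an orphan message. The sketch below uses a made-up timeline on a single shared clock (an assumption for simplicity), in which every checkpoint interval is crossed by a message; the function name and all times are hypothetical.

```python
import math


def recovery_line(checkpoints, messages, failed):
    """Roll processes back until no orphan message remains.

    checkpoints: process name -> list of checkpoint times.
    messages: (sender, sent_at, receiver, received_at).
    Returns process name -> time rolled back to (math.inf = no rollback).
    """
    cut = {p: math.inf for p in checkpoints}
    cut[failed] = max(checkpoints[failed])  # failed process: latest checkpoint
    changed = True
    while changed:
        changed = False
        for sender, sent_at, receiver, received_at in messages:
            send_undone = sent_at >= cut[sender]
            receipt_kept = received_at < cut[receiver]
            if send_undone and receipt_kept:
                # Orphan: roll the receiver back to before the receipt.
                cut[receiver] = max(c for c in checkpoints[receiver]
                                    if c <= received_at)
                changed = True
    return cut


# Hypothetical uncoordinated checkpoints and criss-crossing messages.
checkpoints = {"P": [0, 10, 20], "Q": [0, 15]}
messages = [
    ("Q", 5, "P", 6),
    ("P", 12, "Q", 13),
    ("Q", 17, "P", 18),   # plays the role of message e
    ("P", 22, "Q", 23),   # plays the role of message f
]
line = recovery_line(checkpoints, messages, failed="P")
```

The loop terminates because each rollback strictly lowers one process's rollback point, and the set of checkpoints is finite. In this timeline both processes end up at their initial state.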
Lost Message
- Messages can be lost due to rollback
- Suppose Q rolls back to Q_1 after receiving message x from P
  - The record of having received x is lost
- If P does not roll back to P_2, it is as if P had sent a message which was never received by Q
- Lost messages do not violate causality; they are similar to messages lost due to network problems
  - They can be handled by retransmission
- However, if Q sent an ACK of x to P before rolling back, then that ACK will be an orphan message unless P rolls back to P_2
Example of Livelock
- Livelock - another problem that can arise in distributed checkpointed systems
- Q sends P a message q1
- P sends Q a message p1
- Then, P fails at the point shown, before receiving q1
- To prevent p1 from being orphaned, Q must roll back to Q_1
- In the meantime, P recovers, rolls back to P_2, sends another copy of p1, and then receives the copy of q1 that was sent before all the rollbacks began
- However, since Q has rolled back, this copy of q1 is now orphaned, and so P has to repeat its rollback
- This, in turn, orphans the second copy of p1 as well, forcing Q to also repeat its rollback
- This dance of the rollbacks can continue indefinitely
