Fall 2006
ECE655/Ckpt Part.11 .1
• $X_x = t_x / N$: time between consecutive checkpoints
• Failures occur with rate $P$
• Failures are transient: they go away after a mean lifetime $t_f$
• The system then rolls back to the latest checkpoint
• Checkpoints are kept in secure memory, uncorrupted by failure
Copyright 2004 Koren & Krishna
• $t_l$: total time lost for every transient failure
• $t_f$: time the system is down
• If the failure occurs during checkpointing: probability $p_c = t_c / (t_c + X_x)$, lost time $X_x + t_c/2$
• If the failure occurs during execution: probability $p_x = X_x / (t_c + X_x)$, lost time $X_x/2$
• $t_l = t_f + p_c (X_x + t_c/2) + p_x X_x/2 = t_f + (t_c + X_x)/2$
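The simplification of the lost time to t_f + (t_c + X_x)/2 can be checked numerically. The sketch below is only an illustration: the function names and the sample values are assumptions, not part of the slides.

```python
def lost_time_weighted(t_f, t_c, x_x):
    """Expected lost time per failure, weighting the two cases by probability."""
    p_c = t_c / (t_c + x_x)  # failure hits during checkpointing
    p_x = x_x / (t_c + x_x)  # failure hits during execution
    return t_f + p_c * (x_x + t_c / 2) + p_x * (x_x / 2)


def lost_time_closed_form(t_f, t_c, x_x):
    """Simplified form of the same quantity."""
    return t_f + (t_c + x_x) / 2


# Hypothetical sample values (arbitrary time units)
weighted = lost_time_weighted(t_f=2.0, t_c=1.0, x_x=9.0)
closed = lost_time_closed_form(t_f=2.0, t_c=1.0, x_x=9.0)
print(weighted, closed)  # the two values agree up to rounding
```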
• $N_{opt} = t_x \sqrt{\dfrac{P}{t_c\,(2 + P t_c + 2 P t_f)}}$
Optimal Value of N
• $N_{opt}$ must be a whole number: whichever of $\lfloor N_{opt} \rfloor$ and $\lceil N_{opt} \rceil$ minimizes $T_{tot}(N)$
• Optimal inter-checkpoint interval: $X_{opt} = t_x / N_{opt}$
• Exercise: relax the assumption that the probability of additional failures during the recovery process is negligible
• Uniform placement is optimal if the checkpointing cost is constant throughout the execution
• If the checkpoint size, and hence the checkpointing time, varies greatly from one part of the execution to another, the optimal time between checkpoints is not constant
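The floor-or-ceiling choice can be sketched in a few lines. The expression for T_tot(N) does not appear in this part of the slides, so the sketch assumes a first-order model, T_tot(N) = N (X_x + t_c)(1 + P t_l) with t_l = t_f + (t_c + X_x)/2, whose minimum over real N is t_x sqrt(P / (t_c (2 + P t_c + 2 P t_f))). The function names and sample values are hypothetical.

```python
import math


def t_tot(n, t_x, t_c, t_f, rate):
    """Assumed first-order total execution time with n checkpoints."""
    x_x = t_x / n                    # time between consecutive checkpoints
    t_l = t_f + (t_c + x_x) / 2      # lost time per transient failure
    return n * (x_x + t_c) * (1 + rate * t_l)


def n_opt_real(t_x, t_c, t_f, rate):
    """Real-valued optimum; rate plays the role of P in the slides."""
    return t_x * math.sqrt(rate / (t_c * (2 + rate * t_c + 2 * rate * t_f)))


def n_opt_integer(t_x, t_c, t_f, rate):
    """Pick whichever of floor/ceiling gives the smaller total time."""
    n_real = n_opt_real(t_x, t_c, t_f, rate)
    candidates = {max(1, math.floor(n_real)), max(1, math.ceil(n_real))}
    return min(candidates, key=lambda n: t_tot(n, t_x, t_c, t_f, rate))


# Hypothetical sample values (arbitrary time units)
params = dict(t_x=1000.0, t_c=1.0, t_f=2.0, rate=0.01)
n_best = n_opt_integer(**params)
print(n_best, params["t_x"] / n_best)  # N_opt and the interval X_opt
```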
executed: probability $P^{c}$
• $H^{RB}$: the instruction fails, the failure is identified, the program is rolled back to the last checkpoint, and the instruction is completed: probability $P^{RB}$
• $H^{PF}$: the program rollback fails, the program fails, and the program is reloaded and restarted: probability $P^{PF}$
• $P_i^{c}$, $P_i^{RB}$, $P_i^{PF}$: conditional probabilities for a type $i$ instruction
• These conditional probabilities will be calculated and then averaged:
fault-free state at time 0 to the fault-free state at time $t$ with at least one fault during $(0, t_1)$
and last checkpoint: $m = 1, \ldots, M$, with probability $1/M$ each
• $P_{i,m}^{RB}$: conditional probability of successful rollback given a type $i$ instruction and $m$ instructions to the last checkpoint
• $H_1$: setup time needed to initiate a program rollback, including the time needed to load the information saved in the last checkpoint
• $X$: mean time to successfully execute an instruction
• $T_s$: time taken for checkpointing
• Time spent per instruction: $W = X + T_s / M$
• Increasing $M$: the first term increases, the second decreases
• $T = \sum_i f_i T_i$: mean execution time of a fault-free instruction
• $H_2$: average time required for diagnosis and repair
• $L$: average number of instructions per program
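The trade-off in W = X + T_s / M can be illustrated with a toy scan over M. The dependence of X on M is not given in this part of the slides, so the sketch assumes a purely hypothetical linear form X(M) = x0 + k M; the names `x0`, `k`, and `best_m` and all numbers are illustrative only.

```python
def time_per_instruction(m, t_s, x_of_m):
    """W = X + T_s / M, with X supplied as a function of M."""
    return x_of_m(m) + t_s / m


def best_m(t_s, x_of_m, m_max):
    """Scan M = 1..m_max and return the value that minimizes W."""
    return min(range(1, m_max + 1),
               key=lambda m: time_per_instruction(m, t_s, x_of_m))


# Hypothetical model: X grows linearly with the checkpoint spacing M
x0, k = 1.0, 0.001


def toy_x(m):
    return x0 + k * m


m_star = best_m(t_s=40.0, x_of_m=toy_x, m_max=1000)
print(m_star, time_per_instruction(m_star, 40.0, toy_x))
```

With this toy form the continuous optimum is sqrt(T_s / k); a different X(M) would shift the minimum but not the shape of the trade-off.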
Optimal value of M
• Solving for $W$ -
Forced Checkpointing
• Checkpointing is forced when:
  - A line marked unmodifiable is to be updated
  - Anything in the main memory is to be updated
  - An I/O instruction is executed or an external interrupt occurs
• Taking a checkpoint involves:
  (a) saving the processor registers in memory
  (b) setting to 1 the checkpoint bit associated with each valid cache line
Process/Channel/System State
• The state of a process has the obvious meaning
• The state of a channel at time t is the set of messages carried by it up to time t (and the order of their receipt)
• The state of the distributed system is the aggregate of the states of the individual processes and channels
• The state is said to be consistent if, for every message delivery, there is a corresponding message-sending event
• A state that violates this (a message delivered that had not yet been sent) violates causality; such a message is called an orphan
• The converse is consistent: a system state may reflect the sending of a message but not its receipt
Inconsistent States
• In contrast, the set {P_1, Q_2} is an inconsistent state: P_1 has no record of m being sent, while Q_2 records that m was received, i.e., m is an orphan message
• A set of checkpoints that represents a consistent system state is said to form a recovery line: we can roll the system back to these checkpoints and restart from there
• {P_1, Q_1}: rolling back P to P_1 undoes the sending of m, and rolling back Q to Q_1 means that Q does not have any record of m
• Restarting from these checkpoints, P will again send out m, which Q will receive
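The consistency test can be sketched as a check for orphans. The representation below (each checkpoint recording the sets of messages it has seen sent and received) and the names `Checkpoint`, `orphans`, and `is_consistent` are assumptions made for illustration; only P_1, Q_1, Q_2, and the message m come from the slide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Checkpoint:
    name: str
    sent: frozenset = frozenset()      # messages recorded as sent
    received: frozenset = frozenset()  # messages recorded as received


def orphans(state):
    """Messages recorded as received whose sending no checkpoint records."""
    all_sent = set().union(*(c.sent for c in state))
    all_received = set().union(*(c.received for c in state))
    return all_received - all_sent


def is_consistent(state):
    """A set of checkpoints is consistent if it contains no orphan."""
    return not orphans(state)


p1 = Checkpoint("P_1")                           # taken before P sends m
q1 = Checkpoint("Q_1")                           # taken before Q receives m
q2 = Checkpoint("Q_2", received=frozenset({"m"}))  # taken after Q receives m

print(is_consistent((p1, q1)))  # recovery line
print(orphans((p1, q2)))        # m is an orphan here
```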
Useless Checkpoints
• Q_2 is a useless checkpoint
• Q_2 records the receipt of m1, but not the sending of m2
• {P_1, Q_2} cannot be consistent (otherwise m1 would become an orphan); similarly, {P_2, Q_2} cannot be consistent (otherwise m2 would become an orphan)
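A useless checkpoint is one that appears in no consistent set, which can be found by enumerating every combination. The sketch below encodes the slide's scenario as sets of sent and received messages; that encoding, the extra checkpoints Q_1 and Q_3 (Q_3 assumed to be taken after Q sends m2), and the function names are assumptions for illustration.

```python
from itertools import product

# name -> (messages recorded as sent, messages recorded as received)
P = {
    "P_1": (set(), set()),       # before sending m1
    "P_2": ({"m1"}, {"m2"}),     # after sending m1 and receiving m2
}
Q = {
    "Q_1": (set(), set()),       # hypothetical initial checkpoint
    "Q_2": (set(), {"m1"}),      # after receiving m1, before sending m2
    "Q_3": ({"m2"}, {"m1"}),     # hypothetical: after sending m2
}


def is_consistent(*records):
    """No orphan: every received message is recorded as sent somewhere."""
    sent = set().union(*(s for s, _ in records))
    received = set().union(*(r for _, r in records))
    return received <= sent


def useless_checkpoints(p_ckpts, q_ckpts):
    """Checkpoints that belong to no consistent (P, Q) pair."""
    useful = set()
    for (p_name, p_rec), (q_name, q_rec) in product(p_ckpts.items(),
                                                    q_ckpts.items()):
        if is_consistent(p_rec, q_rec):
            useful.update((p_name, q_name))
    return sorted((set(p_ckpts) | set(q_ckpts)) - useful)


print(useless_checkpoints(P, Q))
```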
failure
message passing or indirectly - synchronized clocks): a single failure can cause a domino effect
• When P suffers a transient failure, it rolls back to checkpoint P_3
• Since message f was sent after P_3, Q has to roll back (otherwise Q would have a record of a message that was never sent: an orphan message)
• P then has to roll back to P_2, since Q sent a message e to P and Q's rollback undoes that sending
• This continues until all processes have rolled back to their starting positions
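The propagation of rollbacks can be sketched as a fixed-point computation: keep rolling back any receiver that still records a message whose sending has been undone. The timeline below (checkpoint times, message times, and the names `rollback_line`, `checkpoints`, `messages`) is entirely hypothetical and only mimics the staggered pattern that produces a domino effect. A checkpoint at time c is assumed to capture exactly the events that happened before c, and event times are assumed never to coincide with checkpoint times.

```python
import math


def rollback_line(checkpoints, messages, failed, fail_time):
    """Return, per process, the checkpoint time it ends up restarting from.

    checkpoints: {process: list of checkpoint times, including 0.0}
    messages: list of (sender, send_time, receiver, recv_time)
    math.inf means the process did not have to roll back.
    """
    restart = {p: math.inf for p in checkpoints}
    restart[failed] = max(t for t in checkpoints[failed] if t <= fail_time)
    changed = True
    while changed:
        changed = False
        for sender, send_t, receiver, recv_t in messages:
            undone = send_t > restart[sender]      # sending was rolled back
            recorded = recv_t < restart[receiver]  # receipt still remembered
            if undone and recorded:                # orphan: receiver rolls back
                restart[receiver] = max(
                    t for t in checkpoints[receiver] if t < recv_t)
                changed = True
    return restart


# Hypothetical staggered checkpoints and messages
checkpoints = {"P": [0.0, 3.0, 6.0], "Q": [0.0, 2.0, 5.0]}
messages = [
    ("P", 1.0, "Q", 1.5),
    ("Q", 2.5, "P", 2.8),
    ("P", 4.0, "Q", 4.5),
    ("Q", 5.5, "P", 5.8),
    ("P", 7.0, "Q", 7.5),
]
print(rollback_line(checkpoints, messages, failed="P", fail_time=8.0))
```

In this timeline every checkpoint is separated from the previous one by a message crossing in the opposite direction, so both processes end up at time 0.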
Lost Message
• Messages lost due to rollback:
• Suppose Q rolls back to Q_1 after receiving message x from P; the record of having received x is lost
• If P does not roll back to P_2, it is as if P had sent a message that was never received by Q
• Lost messages do not violate causality; they are similar to messages lost due to network problems and can be handled by retransmission
• However, if Q sent an ACK of x to P before rolling back, then that ACK will be an orphan message unless P rolls back to P_2
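The difference between a lost message and an orphan can be sketched by comparing what the restored states record. The dictionary layout, the `classify` helper, and the message label `ack_x` are assumptions for illustration; "lost" here means the sending is recorded but the receipt is not.

```python
def classify(records):
    """Split messages into lost (sent, not received) and orphan (received, not sent)."""
    sent = set().union(*(r["sent"] for r in records))
    received = set().union(*(r["received"] for r in records))
    return {"lost": sent - received, "orphan": received - sent}


q_at_q1 = {"sent": set(), "received": set()}  # Q rolled back to Q_1

# Case 1: P keeps its current state, which records sending x
p_current = {"sent": {"x"}, "received": set()}
print(classify([p_current, q_at_q1]))

# Case 2: Q had also sent an ACK of x, which P received before Q rolled back
p_with_ack = {"sent": {"x"}, "received": {"ack_x"}}
print(classify([p_with_ack, q_at_q1]))

# Case 3: P rolls back to P_2 (assumed to precede sending x)
p_at_p2 = {"sent": set(), "received": set()}
print(classify([p_at_p2, q_at_q1]))
```

Case 1 is tolerable (x can be retransmitted), case 2 violates causality through the ACK, and case 3 removes both problems.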
that can arise in distributed checkpointed systems
• Q sends P a message q1, and P sends Q a message p1
• Then P fails at the point shown, before receiving q1
• To prevent p1 from being orphaned, Q must roll back to Q_1
• In the meantime, P recovers, rolls back to P_2, sends another copy of p1, and then receives the copy of q1 that was sent before all the rollbacks began
• However, since Q has rolled back, this copy of q1 is now orphaned, so P has to repeat its rollback
• This, in turn, orphans the second copy of p1 as well, forcing Q to repeat its rollback too
• This dance of the rollbacks can continue indefinitely