… higher processor performance, given that thermal and power problems impose limits on the performance of single-core designs. Accordingly, several chip manufacturers have already released, or will soon release, chips with dual cores, and it …

… by Fedorova et al. [12] pertaining to throughput-oriented systems. They noted that L2 misses affect …

… a scheduling policy that reduces concurrency within groups. …

… either wt(T) or wt(T) + 1.
GIS task model. A GIS task system is obtained by removing subtasks from a corresponding IS (or GIS) task system. Specifically, in a GIS task system, a task T, after releasing subtask Ti, may release subtask Tk, where k > i + 1, instead of Ti+1, with the following restriction: r(Tk) − r(Ti) is at least ⌊(k − 1)/wt(T)⌋ − ⌊(i − 1)/wt(T)⌋. In other words, r(Tk) is not smaller than what it would have been if Ti+1, Ti+2, . . . , Tk−1 were present and released as early as possible. For the special case where Tk is the first subtask released by T, r(Tk) must be at least ⌊(k − 1)/wt(T)⌋.

[Figure: a Pfair schedule over slots 0–13 of a megatask's component tasks (weights 1/3, 1/8, and 11/12) and the corresponding fictitious task F (3/8). Scheduled slots are marked with an "X," and a deadline miss is indicated.]
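For illustration, the release-time restriction above can be evaluated directly. The following Python sketch (an illustrative aid, using a hypothetical task of weight 2/7) computes the earliest legal release of a skipped-ahead subtask:

```python
from fractions import Fraction
from math import floor

def earliest_release(k, i, r_i, wt):
    """Earliest legal r(T_k) when T_k is released after T_i in a GIS system:
    r(T_k) - r(T_i) >= floor((k-1)/wt(T)) - floor((i-1)/wt(T))."""
    return r_i + floor((k - 1) / wt) - floor((i - 1) / wt)

# Hypothetical task T of weight 2/7: after T_1 is released at time 0, T may
# skip T_2 and release T_3 no earlier than floor(2/(2/7)) - floor(0) = 7,
# i.e., exactly when T_3 would have been released had T_2 been present.
print(earliest_release(k=3, i=1, r_i=0, wt=Fraction(2, 7)))  # -> 7
```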
[Figure 4 (plots): L2 miss rates for the three schemes. Top row: x-axis "System Util" (1.8–3.6); bottom row: x-axis "Min Task Util Allowed During Task Set Generation" (0.05–0.55); the three columns correspond to total WSSs of 1.5, 1.75, and 2.0 times the L2 cache size.]
48 task sets. (This variation is due to discarded task sets, primarily in the partitioning case, and the way the data is organized.) In interpreting this data, note that, because an L2 miss incurs a time penalty at least an order of magnitude greater than a hit, even when miss rates are relatively low, a miss-rate difference can correspond to a significant difference in performance. For example, see Fig. 5, which gives the number of cycles-per-memory-reference for the data shown in Fig. 4(b). Although the speed of the SESC simulator severely constrained the number and length of our simulations, we also ran a small subset of our task sets for 100 quanta (as opposed to 20) and saw approximately the same results. This further justifies 20 quanta as a reasonable "stopping point."

As seen in the bottom-row plots, the L2 miss rate increases with increasing task utilizations. This is because the heaviest tasks have the largest WSSs and thus are harder to place onto a small number of cores. The top-row plots show a similar trend as the total system utilization increases from 2.0 to 2.5. Beyond this point, however, miss rates level off or decrease. One explanation for this may be that our task-generation process may leave little room to improve miss rates at total utilizations beyond 2.5. The fact that the three schemes approximately converge beyond this point supports this conclusion. With respect to total WSS, at 1.5 times the L2 cache size (left column), megatasking is the clear winner. At 1.75 times (middle column) and 2.0 times (right column), megatasking is still the winner in most cases, but less substantially, because all schemes are less able to improve L2 cache performance. This is particularly noticeable in the 2.0-times case.

Two anomalies are worth noting. First, in inset (e), Pfair slightly outperforms megatasking at the 3.5 system-utilization point. This may be due to miss-rate differences in the scheduling code itself. Second, at the right end point of each plot (3.5 system utilization or 0.5 task utilization), partitioning sometimes wins over the other two schemes, and sometimes loses. These plots, however, are misleading in that, at high utilizations, many of the task sets were disqualified under partitioning. Thus, the data at these points is somewhat skewed. With only non-disqualified task sets plotted (not shown), all three schemes have similar curves, with megatasking always winning.

In addition to the data shown, we also performed similar experiments in which per-task utilizations were capped. We found …

[Figure 5 plot: "Cycles per Mem Ref vs. Min Task Util (768K WSS)"; curves for Partitioning, Pfair, and Megatasks.]
Figure 5: Cycles-per-memory-reference for the data in Fig. 4(b).
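The impact of the hit/miss cost gap noted above can be seen with a back-of-the-envelope calculation. In the sketch below, the 10-cycle hit and 200-cycle miss penalty are assumed round numbers for illustration, not measured simulator parameters:

```python
def cycles_per_reference(hit_cycles, miss_penalty, miss_rate):
    """Average cost of a memory reference for a given L2 miss rate."""
    return hit_cycles + miss_rate * miss_penalty

# Assumed costs: 10-cycle hit, 200-cycle miss penalty (an order of
# magnitude apart, as in the text).
for miss_rate in (0.05, 0.10):
    print(f"miss rate {miss_rate:.0%}: "
          f"{cycles_per_reference(10, 200, miss_rate):.0f} cycles/ref")
# A five-point miss-rate difference (5% vs. 10%) raises the average cost
# from 20 to 30 cycles per reference, a 50% increase.
```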
Algorithm              No. Instr. (Min., Avg., Max.)   No. Mem. Acc. (Min., Avg., Max.)
Partitioning           (177.36, 467.83, 647.64)        (51.51, 131.50, 182.20)
Pfair                  (229.87, 452.41, 613.77)        (65.96, 124.21, 178.04)
Pfair with Megatasks   (232.23, 495.47, 666.62)        (66.16, 137.62, 182.41)

Table 4: (Min., Avg., Max.) instructions and memory accesses completed over all non-disqualified task sets for each scheduling policy, in millions. From Table 3, every (almost every) task set included in the partitioning counts is included in the Pfair (megatasking) counts.

Thus, these results should give a reasonable indication of how the different migration, preemption, and scheduling costs of the three schemes impact the amount of "useful work" that is completed. As seen, megatasking is the clear winner by 5–6% on average and by as much as 30% in the worst case (as seen by the minimum values).

These experiments should certainly not be considered definitive. Indeed, devising a meaningful random task-set generation process is not easy, and this is an issue worthy of further study. Nonetheless, for the task sets we generated, megatasking is clearly the best scheme. Its use is much more likely to result in a schedulable system, in comparison to partitioning, and also in lower L2 miss rates (and as seen in Sec. 4.1, for some specific task sets, miss rates may be dramatically lower).
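The "5–6% on average" and "as much as 30%" figures follow directly from the instruction counts in Table 4, as this quick check shows:

```python
# (Min., Avg., Max.) instructions completed, in millions (Table 4).
instr = {
    "Partitioning":         (177.36, 467.83, 647.64),
    "Pfair":                (229.87, 452.41, 613.77),
    "Pfair with Megatasks": (232.23, 495.47, 666.62),
}

base, mega = instr["Partitioning"], instr["Pfair with Megatasks"]
print(f"average case: +{100 * (mega[1] - base[1]) / base[1]:.1f}%")  # +5.9%
print(f"worst (min.) case: +{100 * (mega[0] - base[0]) / base[0]:.1f}%")  # +30.9%
```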
5 Concluding Remarks

We have proposed the concept of a megatask as a way to reduce miss rates in shared caches on multicore platforms. We have shown that deadline misses by a megatask's component tasks can be avoided by slightly inflating its weight and by using Pfair scheduling algorithms to schedule all tasks. We have also given deadline tardiness thresholds that apply in the absence of reweighting. Finally, we have assessed the benefits of megatasks through an extensive experimental investigation. While the theoretical superiority of Pfair-related schemes over other approaches is well known, these experiments are the first (known to us) that show a clear performance advantage of such schemes over the most common multiprocessor scheduling approach, partitioning.

Our results suggest a number of avenues for further research. First, more work is needed to determine if the deadline tardiness bounds given in Sec. 3 are tight. Second, we would like to extend our results to SMT systems that support multiple hardware thread contexts per core, as well as asymmetric multicore designs. Third, as noted earlier, timing analysis on multicore systems is a subject that deserves serious attention. Fourth, we have only considered static, independent tasks in this paper. Dynamic task systems and tasks with dependencies warrant attention as well. Fifth, in some systems, it may be useful to actually encourage some tasks to be co-scheduled, as in symbiotic scheduling [14, 18, 21]. Thus, it would be interesting to incorporate symbiotic scheduling techniques within megatasking. Finally, a task's weight may actually depend on how tasks are grouped, because its execution rate will depend on cache behavior. This gives rise to an interesting synthesis problem: as task groupings are determined, weight estimates will likely decrease, due to better cache behavior, and this may enable better groupings. Thus, the overall system design process may be iterative in nature.

References

[1] A. Agarwal, M. Horowitz, and J. Hennessy. An analytical cache model. ACM Trans. on Comp. Sys., 7(2):184–215, 1989.
[2] J. Anderson, J. Calandrino, and U. Devi. Real-time scheduling on multicore platforms (full version). http://www.cs.unc.edu/~anderson/papers.
[3] J. Anderson and A. Srinivasan. Early-release fair scheduling. Proc. of the 12th Euromicro Conf. on Real-Time Sys., pp. 35–43, 2000.
[4] J. Anderson and A. Srinivasan. Pfair scheduling: Beyond periodic task systems. Proc. of the 7th Int'l Conf. on Real-Time Comp. Sys. and Applications, pp. 297–306, 2000.
[5] J. Anderson and A. Srinivasan. Mixed Pfair/ERfair scheduling of asynchronous periodic tasks. Journal of Comp. and Sys. Sciences, 68(1):157–204, 2004.
[6] S. Baruah, N. Cohen, C.G. Plaxton, and D. Varvel. Proportionate progress: A notion of fairness in resource allocation. Algorithmica, 15:600–625, 1996.
[7] E. Berg and E. Hagersten. StatCache: A probabilistic approach to efficient and accurate data locality analysis. Proc. of the 2004 IEEE Int'l Symp. on Perf. Anal. of Sys. and Software, 2004.
[8] J. Carpenter, S. Funk, P. Holman, A. Srinivasan, J. Anderson, and S. Baruah. A categorization of real-time multiprocessor scheduling problems and algorithms. In Joseph Y. Leung, editor, Handbook on Scheduling Algorithms, Methods, and Models, pp. 30.1–30.19. Chapman Hall/CRC, Boca Raton, Florida, 2004.
[9] H. Chen, K. Li, and B. Wei. Memory performance optimizations for real-time software HDTV decoding. Journal of VLSI Signal Processing, pp. 193–207, 2005.
[10] Z. Deng, J.W.S. Liu, L. Zhang, M. Seri, and A. Frei. An open environment for real-time applications. Real-Time Sys. Journal, 16(2/3):155–186, 1999.
[11] P. Denning. Thrashing: Its causes and prevention. Proc. of the AFIPS 1968 Fall Joint Comp. Conf., Vol. 33, pp. 915–922, 1968.
[12] A. Fedorova, M. Seltzer, C. Small, and D. Nussbaum. Performance of multithreaded chip multiprocessors and implications for operating system design. Proc. of the USENIX 2005 Annual Technical Conf., 2005. (See also Technical Report TR-17-04, Div. of Engineering and Applied Sciences, Harvard Univ., Aug. 2004.)
[13] P. Holman and J. Anderson. Guaranteeing Pfair supertasks by reweighting. Proc. of the 22nd Real-Time Sys. Symp., pp. 203–212, 2001.
[14] R. Jain, C. Hughes, and S. Adve. Soft real-time scheduling on simultaneous multithreaded processors. Proc. of the 23rd Real-Time Sys. Symp., pp. 134–145, 2002.
[15] S. Kim, D. Chandra, and Y. Solihin. Fair cache sharing and partitioning on a chip multiprocessor architecture. Proc. of the Int'l Conf. on Parallel Architectures and Compilation Techniques, 2004.
[16] D. Liu and Y. Lee. Pfair scheduling of periodic tasks with allocation constraints on multiple processors. Proc. of the 12th Int'l Workshop on Parallel and Distributed Real-Time Sys., 2004.
[17] M. Moir and S. Ramamurthy. Pfair scheduling of fixed and migrating periodic tasks on multiple resources. Proc. of the 20th Real-Time Sys. Symp., pp. 294–303, 1999.
[18] S. Parekh, S. Eggers, H. Levy, and J. Lo. Thread-sensitive scheduling for SMT processors. http://www.cs.washington.edu/research/smt/.
[19] J. Renau. SESC website. http://sesc.sourceforge.net.
[20] S. Shankland and M. Kanellos. Intel to elaborate on new multicore processor. http://news.zdnet.co.uk/hardware/chips/0,39020354,39116043,00.htm, 2003.
[21] A. Snavely, D. Tullsen, and G. Voelker. Symbiotic job scheduling with priorities for a simultaneous multithreading processor. Proc. of ACM SIGMETRICS 2002, 2002.
[22] A. Srinivasan and J. Anderson. Optimal rate-based scheduling on multiprocessors. Proc. of the 34th ACM Symp. on Theory of Comp., pp. 189–198, 2002.
[23] X. Vera, B. Lisper, and J. Xue. Data caches in multitasking hard real-time systems. Proc. of the 24th Real-Time Sys. Symp., 2003.
[24] S. Viswanathan and T. Imielinski. Metropolitan area video-on-demand service using pyramid broadcasting. IEEE Multimedia Systems, pp. 197–208, 1996.

Appendix: Detailed Proofs

In this appendix, detailed proofs are given. We begin by providing further technical background on Pfair scheduling [3, 4, 5, 6, 22].

Ideal fluid schedule. Of central importance in Pfair scheduling is the notion of an ideal fluid schedule, which is defined below and depicted in Fig. 6. Let ideal(T, t1, t2) denote the processor share (or allocation) that T receives in an ideal fluid schedule in [t1, t2). ideal(T, t1, t2) is defined in terms of share(T, u), which is the share (or fraction) of slot u assigned to task T. share(T, u) is, in turn, defined in terms of a similar per-subtask function f, where f(Ti, u) denotes the share of subtask Ti in slot u of its window (see Fig. 6 for an example); share(T, u) is the sum of f(Ti, u) over all subtasks Ti of T.

Figure 6: Allocation in an ideal fluid schedule for the first two subtasks of a task T of weight 2/7. The share of each subtask in each slot of its window (f(Ti, u)) is marked. In (a), no subtask is released late; in (b), T2 is released late. share(T, 3) is either 2/7 or 1/7, depending on when subtask T2 is released.

Lag in an actual schedule. The difference between the total processor allocation that a task receives in the fluid schedule and in an actual schedule S is formally captured by the concept of lag. Let actual(T, t1, t2, S) denote the total actual allocation that T receives in [t1, t2) in S. Then, the lag of task T at time t is

    lag(T, t, S) = ideal(T, 0, t) − actual(T, 0, t, S)
                 = Σ_{u=0}^{t−1} share(T, u) − Σ_{u=0}^{t−1} S(T, u).   (10)

(For conciseness, when unambiguous, we leave the schedule implicit and use lag(T, t) instead of lag(T, t, S).) A schedule for a GIS task system is said to be Pfair iff

    (∀t, T ∈ τ :: −1 < lag(T, t) < 1).   (11)

Informally, each task's allocation error must always be less than one quantum. The release times and deadlines in (1) are assigned such that scheduling each subtask in its window is sufficient to ensure (11). Letting 0 ≤ t′ ≤ t, from (10), we have

    lag(T, t + 1) = lag(T, t) + share(T, t) − S(T, t),   (12)

and

    lag(T, t + 1) = lag(T, t′) + ideal(T, t′, t + 1) − actual(T, t′, t + 1).   (13)

Another useful definition, the total lag for a task system τ in a schedule S at time t, LAG(τ, t), is given by

    LAG(τ, t) = Σ_{T ∈ τ} lag(T, t).   (14)

Figure 7: Classification of three GIS tasks T, U, and V at time t. The slot in which each subtask is scheduled is indicated by an "X." [The marks distinguish subtasks scheduled in A(t), B(t), and I(t).]
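The definitions above can be exercised with a short sketch. It assumes the standard synchronous-periodic window formulas r(Ti) = ⌊(i − 1)/wt(T)⌋ and d(Ti) = ⌈i/wt(T)⌉ (the setting of Fig. 6(a)) and that no subtask is released late, in which case share(T, u) = wt(T) in every slot; the schedule chosen is hypothetical:

```python
from fractions import Fraction
from math import ceil, floor

def window(i, wt):
    """Pfair window [r(T_i), d(T_i)) of subtask T_i (i = 1, 2, ...) for a
    synchronous periodic task of weight wt, with no late releases."""
    return floor((i - 1) / wt), ceil(i / wt)

def lag(wt, sched, t):
    """lag(T, t) per (10), with share(T, u) = wt in every slot, so that
    ideal(T, 0, t) = wt * t; sched lists the slots in which T is scheduled."""
    return wt * t - sum(1 for u in sched if u < t)

wt = Fraction(2, 7)                                  # the task of Fig. 6
sched = [window(i, wt)[1] - 1 for i in range(1, 5)]  # each T_i in its last slot
print(sched)                                         # -> [3, 6, 10, 13]
assert all(-1 < lag(wt, sched, t) < 1 for t in range(15))  # (11) holds
```

Because each subtask is scheduled within its window, the lag stays within (−1, 1) at every slot boundary, which is exactly condition (11).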
[Figure 8 annotations: share(V, t) + share(V, u) < wt(V); V is in B(t) and is inactive in [t + 1, u); any task scheduled in [t + 1, u) is in A(t); for t < v < u, only tasks in A(v) are active in v; the slots t, t + 1, and u are marked on the time line.]
Figure 8: Lemma 4. The slot in which a subtask is scheduled is indicated with an "X." If T is in B(t), subtasks like Ti or Tj cannot exist. Also, a task in B(t) is inactive in [t + 1, u).

Lemma 4 Let t < td − 1 be a fully- or partially-idle slot in Sγ and let u < td be the earliest busy slot after t (i.e., t + 1 ≤ u < td) in Sγ. Then, ideal(γ, t, u + 1) = Σ_{s=t}^{u} Σ_{T ∈ γ} share(T, s) ≤ I + f + Σ_{s=t}^{u−1} |A(s)|·Wmax.

Proof sketch: The allocation to γ in slots t and u is at most I + f (because this bounds from above the total weight of tasks in B(t) ∪ A(t)) plus the cumulative weights of tasks scheduled in t (i.e., tasks in A(t)), which is at most |A(t)|·Wmax. Finally, it can be shown that the ideal allocation to γ in a slot s in [t + 1, u) is at most |A(s)|·Wmax. Adding all of these values, we get the value indicated in the lemma. A full proof is available in [2].

The next lemma concerns fully-idle slots.

Lemma 5 Let t < td be a fully-idle slot in Sγ. Then all slots in [0, t + 1) are fully idle in Sγ.

Proof: Suppose, to the contrary, that some subtask Ti is scheduled before t. Then, removing Ti from Sγ will not cause any subtask scheduled after t to shift to the left to t or earlier. (If such a left displacement occurs, then the displaced subtask should have been scheduled at t even when Ti is included.) Hence, even if every subtask scheduled before t is removed, the deadline miss at td cannot be eliminated. This contradicts (T2).

We are now ready to prove the main lemma, which shows that the lag inequality, if violated, is restored by td.

Lemma 6 Let t < td be a slot such that LAG(γ, t) ≤ lag(F, t), but LAG(γ, t + 1) > lag(F, t + 1). Then, there exists a time u, where t + 1 < u ≤ td, such that LAG(γ, u) ≤ lag(F, u).

Proof: Let ∆LAG(γ, t1, t2) = LAG(γ, t2) − LAG(γ, t1), where t1 < t2, and let ∆lag(F, t1, t2) be analogously defined. It suffices to show that ∆LAG(γ, t, u) ≤ ∆lag(F, t, u), where u is as defined in the statement of the lemma.

By the statement of the lemma and Lemma 3, there is at least one hole in t, and hence, t is either fully- or partially-idle. Assume the latter. By this assumption and Lemma 5, we have the following.

(I) No slot in [t, td) is fully-idle.

Because t is partially-idle, by Lemma 2(d), t < td − 1 holds. Let I denote the interval [t, td). We first partition I into disjoint subintervals as shown in Fig. 9, where each subinterval is either partially-idle or busy. Because t is partially-idle, the first subinterval is partially-idle. Similarly, because there is no hole in td − 1, the last subinterval is busy. By (I), no slot in [t, td) is fully-idle. Therefore, the intermediate subintervals of I alternate between busy and partially-idle, in that order. In the rest of the proof, the notation in Fig. 10 will be used, where the subintervals Hk, Bk, and Ik, where 1 ≤ k ≤ n, are as depicted in Fig. 9.

In order to show that there exists a u, where t + 1 < u ≤ td and ∆LAG(γ, t, u) ≤ ∆lag(F, t, u), we compute the ideal and actual allocations to the tasks in γ and to F in I. By Lemma 4, the total allocation to the tasks in γ in Hk and the first slot of Bk in the ideal schedule is given by ideal(γ, t^s_Hk, t^s_Bk + 1) ≤ I + f + Σ_{i=1}^{hk} |A(t + Pk−1 + i − 1)| · Wmax. By (20), the tasks in γ are allocated at most Wsum = I + f time in each slot in the ideal schedule. Hence, the total ideal allocation to the tasks in γ in Ik, which is comprised of Hk and Bk, is given by ideal(γ, t^s_Hk, t^e_Bk) ≤ Σ_{i=1}^{hk} (|A(t + Pk−1 + i − 1)| · Wmax) + (I + f) + (bk − 1)·(I + f) = Σ_{i=1}^{hk} (|A(t + Pk−1 + i − 1)| · Wmax) + bk·(I + f). Thus, the total ideal allocation to the tasks in γ in I is given by

    ideal(γ, t, td) ≤ Σ_{k=1}^{n} [ Σ_{i=1}^{hk} (|A(t + Pk−1 + i − 1)| · Wmax) + bk·(I + f) ].   (23)

The number of processors executing tasks of γ in Sγ is |A(t′)| for a slot t′ with a hole, and is I (resp., I + 1) for a busy tight (resp., non-tight) slot. Hence,

    actual(γ, t, td) = Σ_{k=1}^{n} [ Σ_{i=1}^{hk} |A(t + Pk−1 + i − 1)| + I·b^T_k + (I + 1)·(bk − b^T_k) ].   (24)

By (23) and (24), we have

    ∆LAG(γ, t, td)
    = LAG(γ, td) − LAG(γ, t)
    = ideal(γ, t, td) − actual(γ, t, td)   {by (16)}
    ≤ Σ_{k=1}^{n} [ Σ_{i=1}^{hk} (|A(t + Pk−1 + i − 1)|·(Wmax − 1)) + b^T_k·f + (bk − b^T_k)(f − 1) ]   (30)
    ≤ Σ_{k=1}^{n} [ Σ_{i=1}^{hk} (Wmax − 1) + b^T_k·f + (bk − b^T_k)(f − 1) ]
        {Wmax ≤ 1, and hence, (30) decreases with increasing |A(t + Pk−1 + i − 1)|. However, by (I), H1, . . . , Hn are partially-idle, so |A(t + Pk−1 + i − 1)| ≥ 1, 1 ≤ k ≤ n.}
[Figure 9 shows the interval I = [t, td) partitioned into the subintervals I1 = H1 ∪ B1, I2 = H2 ∪ B2, . . . , In = Hn ∪ Bn, where each Hk = [t^s_Hk, t^e_Hk) consists of partially-idle slots and each Bk = [t^s_Bk, t^e_Bk) of busy slots, with t^e_Hk = t^s_Bk and t^e_Bk = t^s_H(k+1); the boundary points t, t + h1, t + h1 + b1, . . . , td are marked on the time line.]
Figure 9: Subintervals of the interval I = [t, td) as explained in Lemma 6. Sample windows and allocations for the fictitious task corresponding to γ (after reweighting) are shown below the time line.
    Hk  =def  [t^s_Hk, t^e_Hk)    {s = "start", e = "end"; Bk and Ik are defined analogously}
    L   =def  Σ_{k=1}^{n} (hk + bk)   (25)
    L^N =def  Σ_{k=1}^{n} (h^N_k + b^N_k)   (27)
    Pk  =def  Σ_{i=1}^{k} (hi + bi)   (28)
    P0  =def  0   (29)

Figure 10: Notation for Lemma 6.
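To make this notation concrete, the following sketch derives hk, bk, Pk, and L from an arbitrary illustrative classification of the slots of I (not data from the paper); computing L^N would additionally require the tight/non-tight classification of each slot:

```python
from itertools import groupby

# Slots of I = [t, t_d), each partially-idle ('H') or busy ('B'); by (I)
# there are no fully-idle slots, so the subintervals alternate
# H_1, B_1, ..., H_n, B_n.
slots = list("HHBHBBBHB")                      # hypothetical, n = 3
groups = [(kind, len(list(g))) for kind, g in groupby(slots)]

h = [n for kind, n in groups if kind == "H"]   # h_k = |H_k|
b = [n for kind, n in groups if kind == "B"]   # b_k = |B_k|

P = [0]                                        # P_0 = 0              (29)
for hk, bk in zip(h, b):                       # P_k = sum (h_i + b_i) (28)
    P.append(P[-1] + hk + bk)

L = P[-1]                                      # L = |I|              (25)
print(h, b, P, L)                              # [2, 1, 1] [1, 3, 1] [0, 3, 7, 9] 9
```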
Continuing the above derivation,

    = Σ_{k=1}^{n} (hk·(Wmax − 1) + b^T_k·f + (bk − b^T_k)(f − 1))
    = Σ_{k=1}^{n} (hk·(Wmax − 1) + bk·f − bk + b^T_k)
    = Σ_{k=1}^{n} (hk·((Wmax − f − 1) + f) + bk·f − bk + b^T_k)
    = Σ_{k=1}^{n} ((hk + bk)·f + hk·(Wmax − f − 1) − b^N_k)   {bk = b^T_k + b^N_k}
    = L·f + Σ_{k=1}^{n} (hk·(Wmax − f − 1) − b^N_k)   {by (25)}
    = L·f + Σ_{k=1}^{n} (h^T_k·(Wmax − f − 1) + h^N_k·(Wmax − f − 1) − b^N_k)   {hk = h^T_k + h^N_k}
    ≤ L·f + Σ_{k=1}^{n} (h^N_k·(Wmax − f − 1) − b^N_k)   {Wmax ≤ 1}
    = L·f − L^N + Σ_{k=1}^{n} h^N_k·(Wmax − f)   {by (27)}
    ≤ L·f + L^N·(Wmax − f − 1), if Wmax > f;  L·f − L^N, if Wmax ≤ f.   (31)

We now determine the change in F's lag across I. F receives an ideal allocation of f + ∆f in every slot. Hence, by (25), the total ideal allocation to F in I is L·(f + ∆f). … In the rest of the proof, we assume Wmax > f. In this case, by (31), ∆LAG(γ, t, td) ≤ L·f + L^N·(Wmax − f − 1), and by (34), ∆lag(F, t, td) = L(f + ∆f) − L^N. By Lemma 2(c),

    LAG(γ, td) = 1
    ⇒ ∆LAG(γ, t, td) + LAG(γ, t) = 1
    ⇒ L·f + L^N·(Wmax − f − 1) + LAG(γ, t) ≥ 1
    ⇒ L·f + L^N·(Wmax − f − 1) + 1 > 1   {from the statement of the lemma and (18), LAG(γ, t) < 1}
    ⇒ L > (L^N·(1 + f − Wmax))/f.   (35)

Because Wmax > f, by (7) and (17), ∆f ≥ ((Wmax − f)/(1 + f − Wmax))·f holds. Hence, by (35), L·∆f > L^N·(Wmax − f) holds. Therefore, using the expressions derived above for ∆LAG and ∆lag, ∆LAG(γ, t, td) − ∆lag(F, t, td) ≤ L^N·(Wmax − f) − L·∆f < 0 follows, establishing the lemma.

Let t be the largest u satisfying (22). Then, by Lemma 6, there exists a t′ ≤ td such that LAG(τ, t′) ≤ lag(F, t′). If t′ = td, then (21) is contradicted, and if t′ < td, then (21) contradicts the maximality of t. Theorem 1 follows. (This result can be extended to apply when "early" subtask releases are allowed, as defined in [5], at the expense of a slightly more complicated proof.)
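As a numeric sanity check of the last step (the values below are arbitrary choices satisfying Wmax > f and (35), not parameters from the paper):

```python
from fractions import Fraction as F

f, Wmax, LN = F(1, 4), F(1, 2), 6      # arbitrary, with 0 < f < Wmax <= 1

# Reweighting bound on the fictitious task's weight increase (by (7), (17)):
delta_f = (Wmax - f) / (1 + f - Wmax) * f
# Any L satisfying (35), i.e., L > LN * (1 + f - Wmax) / f:
L = LN * (1 + f - Wmax) / f + 1

assert L * delta_f > LN * (Wmax - f)       # the inequality closing the proof
print(L * delta_f, ">", LN * (Wmax - f))   # 19/12 > 3/2
```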