CPI2: CPU performance isolation for shared compute clusters
Xiao Zhang
Eric Tune
Robert Hagmann
Rohit Jnagal
Vrigo Gokhale
John Wilkes
Google, Inc.
{xiaozhang, etune, rhagmann, jnagal, vrigo, johnwilkes}@google.com
Abstract
[Figure 1: CDFs of (a) the number of tasks per machine and (b) the number of threads per machine.]
1. Introduction
Google's compute clusters share machines between applications to increase the utilization of our hardware. We provision user-facing, latency-sensitive applications to handle
their peak load demands, but since it is rare for all the applications to experience peak load simultaneously, most machines have unused capacity, and we use this capacity to run
batch jobs on the same machines. As a result, the vast majority of our machines run multiple tasks (Figure 1). The number of tasks per machine is likely to increase as the number
of CPU cores per machine grows.
Unfortunately, interference can occur in any processor resource that is shared between threads of different jobs, such as processor caches and memory access paths. This interference can negatively affect the performance of latency-sensitive tasks.
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
EuroSys'13, April 15-17, 2013, Prague, Czech Republic.
Copyright © 2013 ACM 978-1-4503-1994-2/13/04...$15.00
2. Background
[Figures: normalized IPS and TPS of a job over time (minutes); normalized CPI and latency over a day (hours); scatter plots of normalized TPS versus normalized IPS and of normalized latency versus normalized CPI.]
3. CPI as a metric
[Figure: (a)-(c) normalized latency versus normalized CPI, showing individual CPI samples and the average CPI. Architecture sketch: a cluster scheduler and per-machine agents, each agent managing several tasks.]
to each machine that is running a task from that job. Anomalies are detected locally, which enables rapid responses and
increases scalability.
[Figure: CPI of a job over a week (Tue.-Sat., with 6am/6pm marks). Data-pipeline sketch: per-machine agents running tasks send CPI samples to a CPI sample aggregator, which sends smoothed, averaged CPI specs back to the machines.]
    string jobname;
    string platforminfo;  // e.g., CPU type
    int64  timestamp;     // microsec since epoch
    float  cpu_usage;     // CPU-sec/sec
    float  cpi;
    Job    mean CPI  stddev  tasks
    Job A  0.88      0.09    312
    Job B  1.36      0.26    1040
    Job C  2.03      0.20    1250

[Figure: distribution of CPI samples (sample percentage versus CPI) with a fitted GEV function; markers at μ+σ, μ+2σ, and μ+3σ.]
Since the CPI changes only slowly with time (see Figure 5), the CPI spec also acts as a predicted CPI for the normal behavior of a job. Significant deviations from that behavior suggest an outlier, which may be worth investigating.
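That outlier test can be sketched as follows. The 2σ threshold and the three-violations-in-five-minutes rule are the operating parameters reported in this paper; the function name and sliding-window bookkeeping are illustrative:

```python
from collections import deque

def make_outlier_detector(mean_cpi, stddev_cpi, n_sigma=2.0,
                          max_violations=3, window_secs=300):
    """Flag a task when its CPI exceeds mean + n_sigma * stddev
    at least max_violations times within a sliding time window."""
    threshold = mean_cpi + n_sigma * stddev_cpi
    violations = deque()  # timestamps of recent threshold crossings

    def observe(timestamp, cpi):
        # Drop violations that have aged out of the window.
        while violations and timestamp - violations[0] > window_secs:
            violations.popleft()
        if cpi > threshold:
            violations.append(timestamp)
        return len(violations) >= max_violations  # True => anomaly

    return observe

# A job with mean CPI 1.36 and stddev 0.26 (Job B in the table above),
# so the outlier threshold is 1.36 + 2 * 0.26 = 1.88:
observe = make_outlier_detector(1.36, 0.26)
assert observe(0, 1.5) is False      # below threshold
assert observe(60, 2.1) is False     # 1st violation
assert observe(120, 2.3) is False    # 2nd violation
assert observe(180, 2.2) is True     # 3rd violation within 5 minutes
```

Keeping the state per task, and resetting it when the job's CPI spec is recomputed, matches the local, per-machine detection described above.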
Because the important latency-sensitive applications typically have hundreds to thousands of tasks and run for many
days or weeks, it is easy to generate tens of thousands of
samples within a few hours, which helps make the CPI spec
statistically robust. Historical data about prior runs is incorporated using age-weighting, by multiplying the CPI value
from the previous day by about 0.9 before averaging it with
the most recent day's data. We do not perform CPI management for applications with fewer than 5 tasks or fewer than
100 CPI samples per task.
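The age-weighting can be sketched like this; the 0.9 decay factor is from the text, while the function name and the fold-per-day structure are assumptions:

```python
def age_weighted_mean_cpi(daily_means, decay=0.9):
    """Fold per-day mean CPI values into a single predicted CPI.

    Each day, the previous estimate is multiplied by ~0.9 before being
    averaged (with weight) against the new day's mean, so stale
    behavior fades out over a few days.
    """
    spec = daily_means[0]
    for today in daily_means[1:]:
        spec = (decay * spec + today) / (decay + 1.0)
    return spec

# A job whose daily mean CPI drifts upward over three days; the spec
# lands between the old and new behavior, biased toward recent days:
spec = age_weighted_mean_cpi([1.0, 1.2, 1.3])
assert 1.2 < spec < 1.3
```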
4.2 Identifying antagonists
For each aligned sample i, with antagonist CPU usage u_i (normalized to [0, 1]) and victim CPI c_i, the per-sample contribution to the correlation is

    u_i (1 - c_threshold / c_i)    if c_i >= c_threshold
    u_i (c_i / c_threshold - 1)    if c_i < c_threshold
The final correlation value will be in the range [-1, 1]. Intuitively, the correlation increases if a spike in the antagonist's CPU usage coincides with high victim CPI, and decreases if high CPU usage by the antagonist coincides with low victim CPI. A single correlation analysis typically takes about 100s to perform.
The higher the correlation value, the greater the accuracy
in identifying an antagonist (section 7). In practice, requiring
a correlation value of at least 0.35 works well.
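A sketch of this correlation, assuming the two per-sample cases shown above are summed and averaged over the aligned series (the aggregation step is an assumption; usage is taken as normalized to [0, 1]):

```python
def antagonist_correlation(victim_cpi, antagonist_usage, cpi_threshold):
    """Correlate a suspect's CPU usage with victim CPI spikes.

    victim_cpi and antagonist_usage are aligned time series. Each
    sample contributes u * (1 - threshold/cpi) when victim CPI is
    high and u * (cpi/threshold - 1) (a negative value) when it is
    low, so the average lies in [-1, 1].
    """
    total = 0.0
    for cpi, usage in zip(victim_cpi, antagonist_usage):
        if cpi >= cpi_threshold:
            total += usage * (1.0 - cpi_threshold / cpi)
        else:
            total += usage * (cpi / cpi_threshold - 1.0)
    return total / len(victim_cpi)

# A suspect that is busy exactly when the victim's CPI spikes scores
# above the 0.35 threshold; one busy only in quiet periods scores < 0.
cpi = [1.0, 8.0, 1.0, 8.0]
assert antagonist_correlation(cpi, [0.0, 1.0, 0.0, 1.0], 2.0) > 0.35
assert antagonist_correlation(cpi, [1.0, 0.0, 1.0, 0.0], 2.0) < 0
```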
This algorithm is deliberately simple. It would fare less
well if faced with a group of antagonists that together cause
significant performance interference, but which individually
did not have much effect (e.g., a set of tasks that took turns
filling the cache). In future work, we hope to explore other
ways of decreasing the number of indeterminate cases, such
as by looking at groups of antagonists as a unit, or by combining active measures with passive ones.
5. Dealing with antagonists
6. Case studies

Suspected antagonists, ranked by correlation:

    Job                 Type               Correlation
    video processing    batch              0.46
    content digitizing  latency-sensitive  0.44
    image front-end     latency-sensitive  0.43
    BigTable tablet     latency-sensitive  0.39
    storage server      latency-sensitive  0.39

[Figure: CPI of the victim over time, 2:00-2:30.]

6.1
    Parameter                         Value
    Collection granularity            task
    Sampling duration                 10 seconds
    Sampling frequency                every 1 minute
    Aggregation granularity           job × CPU type
    Predicted CPI recalculated        every 24 hours (goal: 1 hour)
    Required CPU usage                0.25 CPU-sec/sec
    Outlier threshold 1               2σ (σ: standard deviation)
    Outlier threshold 2               3 violations in 5 minutes
    Antagonist correlation threshold  0.35
    Hard-capping quota                0.1 CPU-sec/sec
    Hard-capping duration             5 mins

[Figure: CPI of the victim over time, 14:30-16:30.]
Suspected antagonists, ranked by correlation:

    Job                    Type               Correlation
    a production service   latency-sensitive  0.66
    compilation            latency-sensitive  0.63
    security service       latency-sensitive  0.58
    statistics             latency-sensitive  0.53
    data query/analysis    latency-sensitive  0.53
    maps service           latency-sensitive  0.43
    image render           latency-sensitive  0.37
    ads serving            latency-sensitive  0.37
    scientific simulation  batch              0.36

[Figure: CPI of the victim, CPU usage, and number of threads over time, 13:00-15:00.]

6.2
Many best-effort jobs are quite robust when their tasks experience CPU hard-capping: the tasks enter a lame-duck mode and offload work to others. Once hard-capping expires, they resume normal execution. Figure 12(a) shows the victim's CPI dropping while the antagonist is throttled, and for a while afterwards; in this case we throttled the antagonist twice. Figure 12(b) shows the behavior of the antagonist. During normal execution, it has about 8 active threads. When it is hard-capped, the number of threads rapidly grows to around 80. After the hard-capping stops, the thread count drops to 2 (a self-induced lame-duck mode) for tens of minutes before reverting to its normal 8 threads.
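Hard-capping here presumably relies on Linux CPU bandwidth control (see the acknowledgements). A minimal sketch of imposing and lifting a cap through the cgroup-v1 interface files; the helper names and cgroup directory are hypothetical:

```python
import os

def hard_cap(cgroup_dir, cpu_sec_per_sec, period_us=100_000):
    """Cap a cgroup at the given CPU rate via CFS bandwidth control.

    Writing quota = rate * period to cpu.cfs_quota_us limits the
    group to `cpu_sec_per_sec` CPU-seconds per wall-clock second;
    the hard-capping quota used in this paper is 0.1 CPU-sec/sec,
    applied for 5 minutes.
    """
    quota_us = int(cpu_sec_per_sec * period_us)
    with open(os.path.join(cgroup_dir, "cpu.cfs_period_us"), "w") as f:
        f.write(str(period_us))
    with open(os.path.join(cgroup_dir, "cpu.cfs_quota_us"), "w") as f:
        f.write(str(quota_us))

def remove_cap(cgroup_dir):
    """Lift the cap: a quota of -1 means 'no limit' to CFS."""
    with open(os.path.join(cgroup_dir, "cpu.cfs_quota_us"), "w") as f:
        f.write("-1")
```

In practice the agent would call `hard_cap(antagonist_cgroup, 0.1)`, sleep for the hard-capping duration, and then call `remove_cap`.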
On the other hand, some tasks don't tolerate CPU hard-capping, preferring to terminate themselves if their performance suffers too much.
[Figure 12: (a) CPI of the victim; (b) CPU usage and number of threads of the hard-capped antagonist, 13:00-16:30.]
Figure 10: Case 3: the CPI and CPU usage of the victim: the CPI changes are self-inflicted.
Figure 13: Case 6: CPU usage of a throttled suspect antagonist (a MapReduce worker) and the CPI of a victim (a
latency-sensitive service). The MapReduce batch survived
the first throttling (from 16:48 to 16:53, indicated by shaded
area) but exited abruptly during the second throttling (from
17:12 to 17:17, indicated by shaded area).
Figure 14: CPU utilization, antagonist detection, and increase in CPI. (a) Calculated antagonist correlation versus observed CPU utilization. (b) CDF of observed CPU utilization on the machine. (c) Observed victim CPI divided by the job's mean CPI versus the observed CPU utilization. (d) CDFs of observed CPI divided by the job's mean CPI, in cases where an antagonist was identified and when no antagonist could be identified. All but graph (d) show data points sampled at the time when an antagonist was detected.
7. Large-scale evaluation

7.2
[Figure 15: true/false antagonist-detection rates and relative CPI versus the correlation threshold, for production and nonproduction jobs.]
Figure 16: Antagonist-detection accuracy and CPI improvement for production jobs. (a) Detection rates versus the antagonist correlation threshold value. (b) Detection rates versus how much the CPI increases, expressed in standard deviations. (c) Observed relative victim CPI (see Figure 15) versus the victim's CPI degradation (CPI before throttling divided by the job's mean CPI). (d) CDF of the victim's relative CPI. The antagonist correlation threshold is 0.35 in (b), (c) and (d).
8. Related Work
9. Future Work

10. Conclusion
We have presented the design, implementation, and evaluation of CPI2, a CPI-based system for large clusters to detect and handle CPU performance isolation faults. We showed that CPI is a reasonable performance indicator and described the data-gathering pipeline and local analyses that CPI2 performs to detect and ameliorate CPU-related performance anomalies, automatically, using CPU hard-capping of antagonist tasks.
We demonstrated CPI2's usefulness in solving real production issues. It has been deployed in Google's fleet. The beneficiaries include end users, who experience fewer performance outliers; system operators, who have a greatly reduced load tracking down transient performance problems; and application developers, who experience a more predictable deployment environment.
In future work, we will be exploring adaptive throttling and making job placement antagonist-aware automatically. Even before these enhancements are applied, we believe that CPI2 is a powerful, useful tool.
Acknowledgements
This work would not have been possible without the help
and support of many colleagues at Google. In particular,
the data pipeline was largely built by Adam Czepil, Pawe
Stradomski, and Weiran Liu. They, along with Kenji Kaneda,
Jarek Kusmierek, and Przemek Broniek, were involved in
many of the design discussions. We are grateful to Stephane
Eranian for implementing per-cgroup performance counts
in Linux, and to him and David Levinthal for their help
on capturing performance counter data. We also thank Paul
Turner for pointing us to Linux CPU bandwidth control.