
Energy- and Performance-Aware Scheduling of Tasks on Parallel and Distributed Systems


HAFIZ FAHAD SHEIKH, University of Texas at Arlington
HENGXING TAN, University of Florida at Gainesville
ISHFAQ AHMAD, University of Texas at Arlington
SANJAY RANKA and PHANISEKHAR BV, University of Florida at Gainesville

Enabled by high-speed networking in commercial, scientific, and government settings, the realm of high-performance computing is burgeoning with greater amounts of computational and storage resources. Large-scale systems such as computational grids consume a significant amount of energy due to their massive sizes. The energy and cooling costs of such systems are often comparable to the procurement costs over a three-year period. In
this survey, we will discuss allocation and scheduling algorithms, systems, and software for reducing power
and energy dissipation of workflows on the target platforms of single processors, multicore processors, and
distributed systems. Furthermore, recent research achievements will be investigated that deal with power
and energy efficiency via different power management techniques and application scheduling algorithms.
The article provides a comprehensive presentation of the architectural, software, and algorithmic issues for energy-aware scheduling of workflows on single-core, multicore, and parallel architectures. It also includes a
systematic taxonomy of the algorithms developed in the literature based on the overall optimization goals
and characteristics of applications.
Categories and Subject Descriptors: C.4 [Computer System Organization]: Performance of Systems
General Terms: Algorithms, Performance, Measurement
Additional Key Words and Phrases: Energy-aware scheduling, task allocation algorithms, dynamic voltage
and frequency scaling, dynamic power management
ACM Reference Format:
Sheikh, H. F., Tan, H., Ahmad, I., Ranka, S., and Bv, P. 2012. Energy- and performance-aware scheduling of
tasks on parallel and distributed systems. ACM J. Emerg. Technol. Comput. Syst. 8, 4, Article 32 (October
2012), 37 pages.
DOI = 10.1145/2367736.2367743 http://doi.acm.org/10.1145/2367736.2367743

1. INTRODUCTION

Massive energy consumption is an escalating threat to the environment. The explosive growth of computers significantly increases the consumption of precious natural resources such as oil and coal, aggravating the looming energy shortage. Studies have reported that computers consume more than 8% of the total energy produced, and this fraction is growing [Andreae 1991; Green Grid 2012]. A Dataquest report [1992] stated that the total power consumed by processors in PCs worldwide was 160 MW in 1992 and had grown to 9,000 MW by 2001.
This work was supported by a grant from the National Science Foundation under contract no. CCF-0905308 and CRS-0905196.
Authors' addresses: H. F. Sheikh, University of Texas at Arlington, TX; email: Hafizfahad.sheikh@mavs.uta.edu; H. Tan, University of Florida at Gainesville, FL; I. Ahmad (corresponding author), University of Texas at Arlington, TX; email: iahmad@uta.edu; S. Ranka, P. Bv, University of Florida at Gainesville, FL.

A large percentage (around 42%) of U.S. firms are already expecting to reach their power and cooling capacities within the next few years [Uptime 2012]. To address these issues, IT organizations are motivated to seek better management methods for meeting increasing computing, network, and storage demands while lowering energy usage, remaining competitive, and staying able to meet future business needs. Power-aware computing is important to large-scale systems because they burn a large portion of the energy consumed by IT devices (see the EPA's report to Congress [USEPA 2007]).
Novel resource management strategies that consider energy consumption upfront as a
precious resource and provide means to manage it at scale along with performance are
sorely needed.
Enabled by high-speed networking in commercial, scientific, and government settings [ATLAS 1999; CMS 2012; Loveday 2002], the realm of high-performance computing is burgeoning with both computational and storage resources [ATLAS 1999; Loveday 2002; NASAES 2012; CMS 2012]. Large-scale systems such as computational grids
consume a significant amount of energy due to their massive sizes [Aea 2008; Feng and
Cameron 2007]. The energy and cooling costs of such systems are often comparable to
the procurement costs over a three-year period [ORNL 2012; Bland 2006].
In this survey, we will discuss allocation and scheduling algorithms, systems, and
software for reducing power and energy dissipation of workflows on the target platforms of single processors, multicore processors, and distributed systems. Furthermore,
recent research achievements will be investigated that deal with power and energy efficiency via different power management techniques and scheduling algorithms.
The key contributions of our survey are the following:
(1) It provides a comprehensive presentation of the architectural, software, and algorithmic issues for energy-aware scheduling of workflows on single-core, multicore,
and parallel architectures.
(2) It provides a systematic taxonomy of the algorithms developed in the literature
based on the overall optimization goals and characteristics of applications.
The rest of the article is organized as follows. In Section 2, we define the workflow allocation problem. In Section 3 we provide a general overview of different energy-aware mechanisms and systems. Section 4 organizes the research efforts for energy-aware task allocation into different categories while providing details of each work.
Section 5 discusses some important aspects based on the details presented in Section 4.
Section 6 concludes the work and highlights some of the future research directions.
2. ENERGY-AWARE TASK ALLOCATION (EATA) PROBLEM

In this section we will first formulate the energy-aware task allocation problem in general terms, without assuming a particular platform or workload. Then we present the various models used to represent different types of applications, as well as systems, when solving the aforesaid problem.
2.1. Problem Formulation

The goal of the energy-aware task scheduling problem is to assign tasks to one or
more cores so that performance and energy objectives are simultaneously met. Thus,
algorithms for EATA typically need to solve a multiobjective optimization problem (i.e.,
reduce energy or enhance performance, or both). There is no unique solution [Ghazaleh et al. 2003; Subrata et al. 2008] for a MultiObjective Optimization (MOO) problem. As illustrated in Figure 1, the MOO problem arises when multiple objective functions conflict: for a given application and processor system, denoted (J, M), neither power consumption nor schedule length can be improved without trading off the other along the curve AB, the so-called Pareto-optimal frontier. Point A is the operating point at which power consumption is minimized, while point B minimizes the schedule length. From the given Pareto front, we can select appropriate operating points along curve AB to suit various constraints and requirements. The MOO problem combined with Multiple Elements optimization (ME) becomes the MEMO problem.

Fig. 1. Illustration of the MOO-NLP problem (adapted from Ahmad et al. [2008]).
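To make the Pareto-front idea concrete, the following minimal sketch (ours, not from the cited works) filters the nondominated (energy, schedule length) points from a set of candidate schedules for (J, M); the candidate values are hypothetical.

```python
# Minimal sketch: keep only Pareto-optimal (energy, schedule-length) pairs.
# Candidate schedule costs below are hypothetical.

def pareto_front(points):
    """Return points not dominated in both energy and schedule length."""
    front = []
    for p in points:
        dominated = any(q[0] <= p[0] and q[1] <= p[1] and q != p
                        for q in points)
        if not dominated:
            front.append(p)
    return sorted(front)

# (energy in J, schedule length in s) for five candidate schedules
candidates = [(10.0, 9.0), (12.0, 6.0), (15.0, 5.0), (14.0, 8.0), (20.0, 5.5)]
print(pareto_front(candidates))   # -> [(10.0, 9.0), (12.0, 6.0), (15.0, 5.0)]
```

Every point on the returned front is a defensible operating point; which one to pick depends on the constraints captured by the three goal classes below.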
There are three broad goals that have been found to be of practical importance.
(1) Minimize energy consumption with an allowable reduction in quality of service, known here as Performance-Constrained Energy Optimization (PCEO). A bounded amount of additional time (performance degradation) is permitted for completing tasks, provided a given execution (response) time requirement is not exceeded; the resulting slack can be turned into direct and indirect energy savings.
(2) Minimize the time required for execution when given a total energy budget. This can be seen as the opposite end of the spectrum from the first goal, in that energy is fixed and performance is maximized. This survey refers to this kind of approach as Energy-Constrained Performance Optimization (ECPO).
(3) Optimize a combination of the previous two goals by balancing performance and energy efficiency simultaneously. The target of such approaches is to conform to the imposed timing and energy requirements while minimizing the penalty induced by any requirement violation, known here as Dual Energy and Performance Optimization (DEPO).

2.2. Power and Energy Model

Before introducing power management techniques, we begin with the fundamental terms and equations. The formal representation of the relationship between power and energy can be defined as

E = \int_T P \, dt,

where P denotes the power, T denotes the time duration, and E denotes the energy consumed. Energy represents the accumulated work performed during a time period, measured in joules (J), whereas power is the rate of work over time, measured in watts (W); as the time interval shrinks to a point, power becomes the instantaneous work rate at that point. Recognizing this difference makes it clear that, in some cases, reducing power consumption may not reduce energy usage. For instance, if we lower the CPU speed, the power consumption decreases; however, the execution time may be prolonged due to the decreased performance, which may lead to the same energy usage to complete the assigned workloads. An estimate of the power consumption of a task or a set of tasks is a key element in static scheduling and allocation schemes for improving the energy consumption of different platforms. Accurate power consumption profiles can be obtained via profiling, but this can only be done in a complete sense if the workload is fully known ahead of time. Since this is often not the case, the power consumption needs to be estimated.
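A small worked example makes the power/energy distinction concrete; the numbers are hypothetical, and the cubic voltage-scaling case anticipates the dynamic power model introduced next.

```python
# Worked example: lower power does not always mean lower energy (E = P * t).
# All constants are illustrative.

cycles = 1e9                       # total work, in CPU cycles

# Full speed: 1 GHz at 10 W
f1, p1 = 1.0e9, 10.0
print("full speed: ", p1 * (cycles / f1), "J")   # 10.0 J in 1.0 s

# Half frequency, voltage unchanged (P ~ f): power halves, time doubles
f2, p2 = 0.5e9, 5.0
print("half f only:", p2 * (cycles / f2), "J")   # 10.0 J -- nothing saved

# Half frequency with proportionally lower voltage (P ~ V^2 f ~ f^3)
p3 = 10.0 * 0.5 ** 3
print("half f + V: ", p3 * (cycles / f2), "J")   # 2.5 J -- a real saving
```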
Leakage and Dynamic Power. The power expenditure in CMOS circuits has two components: dynamic power and leakage power. Leakage current in any circuit under applied voltage generates leakage power [Chandrakasan et al. 1992]; this kind of power is determined only by the voltage and the circuit physics. Since leakage power has no relationship with the clock rate of computer components or the workloads applied to them, it was ignored by most early research on power-aware computing. In contrast, dynamic power consumption results from switching activities in the circuit and, in general, depends on the clock rate and the application workloads applied to the components. Formally, dynamic power consumption comes from two sources: switched capacitance and short-circuit current [Venkatachalam and Franz 2005]. Of the two, switched capacitance plays the primary role in dynamic power dissipation, and short-circuit current is hard to reduce. Thus, the dynamic power model is usually abstracted as

P_{Dynamic} = C_c V^2 f,

where the dynamic power P_{Dynamic} is determined by the switching capacitance C_c, the frequency f (clock rate), and the supply voltage V [Venkatachalam and Franz 2005]. Among these parameters, the switching capacitance is a physical characteristic of the circuit and is therefore considered constant. Many techniques can scale the supply voltage and frequency over a wide range, so these two parameters attract a large portion of the attention in power-conscious computing research.
However, leakage power can no longer be neglected relative to dynamic power. In general, with each new process technology introduced to meet increasing computational demand, the size of CMOS devices shrinks by approximately 30%. The decrease in device dimensions, coupled with the reduction of threshold voltage, can increase leakage power by as much as five times with each technology generation [Borkar 1999]. Therefore, it is necessary to consider both leakage and dynamic power during energy-aware scheduling. It is possible to reduce dynamic power consumption using DVFS by lowering the voltage level when slack is available. However, the longer execution time caused by voltage reduction leads to higher dissipation of leakage power. Reducing the voltage level beyond a certain point is not beneficial due to the dominance of leakage power [Jejurikar et al. 2004], as the increase in leakage power exceeds the decrease in dynamic power. Both leakage and dynamic power should therefore be considered in scheduling decisions.
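This trade-off has a simple numeric illustration: with a constant leakage term added to the cubic dynamic term, total energy is minimized at a "critical" frequency, below which slowing down only prolongs leakage. The sketch below uses hypothetical constants, not values from Jejurikar et al. [2004].

```python
# Sketch: total energy vs. frequency when leakage is included.
# E(f) = (P_leak + k * f**3) * (cycles / f); all constants hypothetical.

P_LEAK = 2.0       # W, leakage power (assumed constant here)
K = 8.0e-27        # W / Hz^3, dynamic power coefficient
CYCLES = 1e9       # work to be done

def energy(f_hz):
    return (P_LEAK + K * f_hz ** 3) * (CYCLES / f_hz)

for f in [0.2e9, 0.4e9, 0.5e9, 0.8e9, 1.0e9]:
    print(f"{f / 1e9:.1f} GHz -> {energy(f):.2f} J")
# 0.2 GHz -> 10.32 J   (too slow: leakage dominates)
# 0.5 GHz ->  6.00 J   (critical frequency: minimum energy)
# 1.0 GHz -> 10.00 J   (too fast: dynamic power dominates)
```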
2.3. Application Model

Parallel applications, such as scientific, mathematical, and bioinformatics problems, typically consist of workflows involving multiple tasks, with or without precedence constraints. In some cases, the user or application is interested in the completion of the entire workflow; in other cases, tasks are considered individually, without dependency constraints. In each of the following subsections, we briefly describe the characteristics of a large class of these applications.
Independent tasks. Numerous studies [Lu et al. 2000; Yu and Prasanna 2002; Aydin et al. 2004; Ge et al. 2005; Zhu and Mueller 2005; Shin and Choi 2007; Seo et al. 2008; Zhang and Chatha 2007] build their research on an independent task model that does not consider precedence relations between tasks. These models are most useful for general-purpose CPUs and applications such as desktops and servers. For parallel computing environments, communication between tasks is not considered in the independent task model.
In general, three kinds of independent task models are of interest: periodic, aperiodic, and sporadic tasks [Buttazzo 2005]. Identical jobs that repeat continuously after a constant period are called periodic tasks. If a periodic task with period T is first activated at time Φ, then the triggering/activation time of the kth instance is Φ + (k − 1)T. A periodic task is usually characterized by its worst-case execution time (wcet) and its relative deadline (D). Aperiodic tasks differ from periodic tasks in having a variable time between task repetitions. A hard aperiodic task can be characterized by its arrival time, worst-case execution time, and relative deadline; a soft aperiodic task has no strict deadline. Sporadic tasks consist of an infinite sequence of identical activities with irregular activations. Though such tasks can be triggered arbitrarily, every consecutive pair of activations is separated by a defined minimum interarrival time. The attributes of a sporadic task are usually its relative deadline, minimum interarrival time, and wcet. Scheduling algorithms based on these task models will be investigated in later sections.
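The three models can be captured in a few lines of code; the following sketch is ours, with field names chosen for illustration rather than taken from the cited literature.

```python
# Sketch of the three independent-task models; attribute names are assumptions.
from dataclasses import dataclass
from typing import Optional

@dataclass
class PeriodicTask:
    phase: float      # phi, release time of the first instance
    period: float     # T
    wcet: float       # worst-case execution time
    deadline: float   # D, relative to each activation

    def activation(self, k: int) -> float:
        """Activation time of the kth instance: phi + (k - 1) * T."""
        return self.phase + (k - 1) * self.period

@dataclass
class AperiodicTask:
    arrival: float
    wcet: float
    deadline: Optional[float] = None   # None models a soft aperiodic task

@dataclass
class SporadicTask:
    min_interarrival: float   # minimum gap between consecutive activations
    wcet: float
    deadline: float           # relative deadline

t = PeriodicTask(phase=0.0, period=10.0, wcet=2.0, deadline=10.0)
print(t.activation(3))   # 20.0
```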
In heterogeneous grid computing environments, applications are often modeled with the bag-of-tasks (BoT) application model [Chung et al. 1999; Kim et al. 2007; Lee and Zomaya 2007], which consists of a set of independent tasks without any inter-task communication. An application is only considered finished when the whole task set completes its execution. A typical BoT application, termed J, is composed of multiple independent heterogeneous tasks, denoted by {T1, T2, . . . , Tn}, without precedence relationships among them. BoT models can be distinguished as compute intensive or data intensive [Lee and Zomaya 2007]. In the compute-intensive BoT model, the major cost (execution time) comes from computation, while the time spent retrieving data and transporting it (from one node to another) is small and can be ignored. For the data-intensive BoT model, one needs to consider the cost of retrieving data. The data sources of task Ti in job J can be a set of objects, denoted by {Ii,1, Ii,2, . . . , Ii,d}. The cost of transferring input data among tasks, that is, changing the relationship between data objects and tasks, is considerable and can impact the total performance. Furthermore, a data source can supply data for multiple tasks. Such a data-intensive model is more realistic than the compute-intensive model, especially in distributed systems, because data transportation may happen frequently between nodes or servers.

Fig. 2. A simple task graph.

Dependent tasks. When tasks are constrained by precedence relationships, the DAG (Directed Acyclic Graph) model is commonly used to represent them [Kang and Ranka 2008a, 2008b; Lee and Zomaya 2009]. A DAG (Figure 2) illustrates the workflow among the tasks of an application in terms of G = (N, E), in which N denotes the set of nodes representing tasks, while E denotes the set of edges representing the dependency relationships among nodes. An edge between two nodes can also represent inter-task communication. The critical path is the longest path in the graph; given a task, the predecessor with the latest finish time is called its MIP (Most Influential Parent).
Node weights quantify computation costs, while the weight on an edge between two tasks represents the communication cost between them. In practice, most algorithms only charge communication cost for edges whose endpoint tasks are assigned to different processors or servers; that is, the communication cost between tasks allocated to the same processor or server is ignored. A typical use of the DAG application model is task scheduling via slack reclamation, the details of which will be discussed in the following sections. When distributing slack to tasks (equivalent to inserting a task into the free time slots between two other tasks), the precedence relationships reflected by the graph cannot be violated.
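The quantities defined above (earliest finish times, critical-path length, and each node's MIP) can be computed in one pass over a topological order. The sketch below is ours; the node and edge weights are illustrative, not the exact values of Figure 2.

```python
# Sketch: earliest finish times, critical-path length, and each node's
# MIP (most influential parent) for a DAG G = (N, E).
# Weights below are illustrative, not taken from Figure 2.

comp = {"T0": 2, "T1": 3, "T2": 1, "T3": 4}           # computation costs
edges = {("T0", "T1"): 3, ("T0", "T2"): 2,            # communication costs
         ("T1", "T3"): 5, ("T2", "T3"): 3}
order = ["T0", "T1", "T2", "T3"]                      # a topological order

finish, mip = {}, {}
for n in order:
    preds = [(u, w) for (u, v), w in edges.items() if v == n]
    if preds:
        # the predecessor delivering its data last is the MIP
        u, w = max(preds, key=lambda p: finish[p[0]] + p[1])
        mip[n], start = u, finish[u] + w
    else:
        mip[n], start = None, 0
    finish[n] = start + comp[n]

print(finish)                 # {'T0': 2, 'T1': 8, 'T2': 5, 'T3': 17}
print(max(finish.values()))   # critical-path length: 17
print(mip["T3"])              # 'T1'
```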

As we observed from the previous discussion, tasks represented by different models may require different considerations during the task scheduling and allocation phase. Similarly, the opportunities for saving energy can be different based on the type of task model.
3. ENERGY-AWARE MECHANISMS AND SYSTEMS

In this section we will discuss the basic underlying techniques used by various energy-aware task scheduling and allocation schemes.
3.1. Mechanisms and Methodologies for Power Management

3.1.1. ACPI. Almost all modern computer systems provide power management functionality. Initially, such power-aware policies or mechanisms were implemented inside the computer BIOS or chip firmware, making the configuration of device power events inflexible. Several leading organizations in the PC industry agreed on a common standard for the configuration interface between computer hardware and software: ACPI (Advanced Configuration and Power Interface), published in 1999 [ACPI 1999], which provides operating-system-level power management (OSPM) to take over the control of device configuration and power management events from the hardware. Thus, the OS as well as upper-layer applications have a general interface for accessing the configuration and power information of individual devices or the whole computer system. Furthermore, ACPI defines multiple power states for both the system and individual devices; the system or device works in one of these states, each of which represents a different upper bound on power supply. For example, G0-G3 are defined as the global power states of the whole system. Among them, G0 represents the working state, in which the system has maximal power consumption, while G3 represents the mechanical off state, in which the system is almost off and the only power consumption is from the RTC battery on the motherboard. Similarly, device power states are represented by D0-D3, and C0-C3 represent the processor power states, since the CPU occupies the largest portion of the power consumption of the whole system. Moreover, ACPI also defines performance states, represented by P0-Pn, and sleep states, represented by S0-Sn, where the maximal value of n depends on the device. Given the performance states, ACPI provides the system functionality to dynamically scale the voltage or frequency to achieve different performance levels. Among the sleep states, deeper sleep states imply less idle power consumption; however, more power is necessary to reactivate the device from such a state. ACPI now also provides more advanced features such as mobile device management and flexible thermal management. ACPI brings significant flexibility and convenience for designing and implementing power management policies and algorithms at the OS and application levels.
3.1.2. DVFS. From the dynamic power equation in Section 2.2, power consumption can be reduced by scaling the voltage and frequency of the CPU. Since the supply voltage applied to a circuit interacts with its frequency, dynamic frequency scaling is in most scenarios considered the same technique as dynamic voltage scaling, termed Dynamic Voltage and Frequency Scaling (DVFS). The relationship between supply voltage and frequency can be approximated as [Venkatachalam and Franz 2005]

f \propto \frac{(V - V_{th})^{\alpha}}{V},

where f denotes the frequency, V denotes the supply voltage, V_{th} denotes the threshold voltage, and \alpha is a constant determined by the circuit physics. Since, for a given circuit, the threshold voltage is almost constant, power consumption can be modeled

as a function of frequency or supply voltage alone. In particular, when \alpha is approximated as 2, the relationship between frequency and supply voltage is linear, and dynamic power is then cubic in frequency (or supply voltage). More importantly, since CPU execution time is inversely proportional to the clock rate (frequency), the problem of minimizing power consumption under a performance constraint can be modeled as a convex optimization problem. This is the basis of many energy-aware scheduling algorithms, which we will investigate in the following sections.
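The practical consequence of this convexity, that spreading the work evenly over the available time minimizes energy, can be checked numerically. The sketch below is ours and assumes the cubic model P = k f^3 with hypothetical constants.

```python
# Numeric check of the convexity argument: under P = k * f^3, executing c
# cycles at frequency f costs E = (k * f**3) * (c / f) = k * c * f**2, so a
# single constant speed that exactly meets the deadline minimizes energy.
# All constants are hypothetical.

K = 1e-27                  # W / Hz^3
c1, c2 = 4e8, 6e8          # cycles of two task segments
D = 1.0                    # deadline in seconds

def energy(f1, f2):
    assert c1 / f1 + c2 / f2 <= D + 1e-12, "deadline violated"
    return K * (c1 * f1 ** 2 + c2 * f2 ** 2)

f_const = (c1 + c2) / D    # constant speed meeting the deadline exactly
print(energy(f_const, f_const))    # ~1.00 J  (minimum)
print(energy(1.6e9, 0.8e9))        # ~1.41 J  (same deadline, more energy)
print(energy(2.0e9, 0.75e9))       # ~1.94 J  (more skew, still worse)
```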
In practice, DVFS may not achieve the ideal results because of two kinds of problems [Venkatachalam and Franz 2005]. First, the nature of workloads may be unpredictable; such complications can result from preemption by high-priority processes or from inaccurate estimates of future task execution times. Second, the nondeterminism and anomalies of real systems introduce errors into the idealized model in which dynamic power dissipation is cubic in supply voltage and execution time is inversely proportional to clock frequency.
Several empirical approaches have been used to apply DVFS to real systems, most of which can be categorized as interval-based, intertask, or intratask approaches [Venkatachalam and Franz 2005]. Interval-based approaches estimate CPU utilization in the near future by analyzing historical CPU utilization. Instead of analyzing utilization information, intertask approaches observe the characteristics of individual tasks, such as execution time and deadline, and assign an appropriate voltage and CPU frequency to each task, thus reducing the total power usage. Intratask approaches, on the other hand, are based on the fact that the workload and resource requirements of a task may vary over its execution period; such approaches divide each task into fine-grained time slots and assign a possibly different CPU voltage and frequency to each slot.
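As a concrete illustration of the interval-based flavor, the following sketch predicts the next interval's utilization with an exponentially weighted moving average and picks the lowest frequency step that covers it; the frequency levels and smoothing factor are our assumptions, not values from the cited work.

```python
# Sketch of an interval-based DVFS policy: predict next-interval CPU
# utilization from history, then pick the lowest sufficient frequency.
# Frequency levels and alpha are hypothetical.

FREQ_LEVELS = [0.4e9, 0.8e9, 1.2e9, 1.6e9]   # available DVFS steps (Hz)
F_MAX = FREQ_LEVELS[-1]

def next_frequency(util_history, alpha=0.5):
    pred = util_history[0]
    for u in util_history[1:]:                 # EWMA over past intervals
        pred = alpha * u + (1 - alpha) * pred
    demand = pred * F_MAX                      # cycles/s the workload needs
    for f in FREQ_LEVELS:                      # lowest level that suffices
        if f >= demand:
            return f
    return F_MAX

print(next_frequency([0.9, 0.6, 0.4, 0.3]))   # falling load -> 0.8 GHz step
print(next_frequency([0.2, 0.5, 0.8, 0.95]))  # rising load -> 1.6 GHz step
```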
3.1.3. Scheduling Considerations with DVFS. In general, in a DVFS-enabled processor, a higher processing speed results in faster processing of tasks and hence a shorter schedule length, but consumes more energy. In contrast, slowing down cores lowers energy consumption, but at the expense of increased execution time. Thus, the two objectives are incommensurate and in conflict with one another. In designing an energy-efficient system (hardware and software), the following issues are important.

(1) Optimizing power may not optimize energy. As mentioned in previous sections, energy can be considered the accumulation of power over a period of time [Gonzalez and Horowitz 1996]. A system that aggressively decreases the clock rate to reduce power may significantly increase execution time, thereby increasing the overall energy consumption.
(2) Simplistic power-aware designs may increase power or energy consumption [Stout 2006; Pering et al. 2000]. Systematic and overall reduction in power can lead to better overall energy consumption [Usami and Horowitz 1995].
(3) Optimizing the average power consumption often optimizes the maximum power consumption [Liang and Ahmad 2006]. However, in some cases this can lead to an increase in peak power requirements [Luo and Jha 2000].
(4) The benefits of voltage scaling may be limited by the cost of overheads. Some analytical estimates of such overheads can be found in Brooks et al. [2000] and Burd et al. [2000], but there still seems to be a need for more accurate simulation.
Effect of CPU switching overheads. An important consideration in scheduling is the DVFS switching overhead, which has been studied for various available processors [Albonesi 2002; Brook and Rajamani 2003]. Another factor is that unused cores waste energy unless completely powered down (and repowering may incur new overhead). Switching takes a certain amount of time that, despite continuous improvements, incurs delays [Albonesi 2002]. DVFS switching is typically done in a synchronous or asynchronous manner. Traditional DVS-enabled processors use synchronous switching, in which execution is blocked. In asynchronous switching, execution is not blocked, but a ramp-up effect occurs in which a surge of current raises the voltage to the level required for the new frequency. Asynchronous overhead is relatively lower, but frequent switching can outweigh the benefit. Sophisticated DVFS switching algorithms have been proposed [Albonesi 2002; Zhu and Mueller 2005], offering the selection of an appropriate frequency to meet the task slack. These factors should be considered to fine-tune scheduling algorithms.
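One way such factors enter a scheduler is as a break-even test performed before each transition; the sketch below is our illustration, with hypothetical overhead constants rather than measured values from the cited processors.

```python
# Sketch: accept a DVFS switch only if it pays for its own overhead.
# Overhead constants are hypothetical; real values are platform-specific.

SWITCH_TIME = 50e-6      # s, time lost per transition (synchronous model)
SWITCH_ENERGY = 1e-4     # J, energy cost of a down/up transition pair

def worth_switching(slack, extra_time, e_high, e_low):
    """slack: spare time before the deadline at the high setting;
    extra_time: additional execution time at the low setting;
    e_high / e_low: task energy at the two settings."""
    meets_deadline = extra_time + 2 * SWITCH_TIME <= slack
    saves_energy = (e_high - e_low) > SWITCH_ENERGY
    return meets_deadline and saves_energy

# Task with 5 ms slack: 12 mJ at high speed vs. 7 mJ (and 3 ms longer) low.
print(worth_switching(slack=5e-3, extra_time=3e-3, e_high=12e-3, e_low=7e-3))
# True: 3.1 ms fits in 5 ms of slack, and saving 5 mJ beats the 0.1 mJ cost
```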
3.1.4. Energy-Aware Software. Power efficiency is critical for wireless and mobile devices because of the strict power budget imposed by the battery. In Pering et al. [2006], the authors propose a framework for wireless devices, called CoolSpots, that can switch among different radio interfaces, such as WiFi and Bluetooth, thereby achieving substantial energy savings and increasing battery lifetime. In Flinn and Satyanarayanan
[2004], the authors investigate extending the adaptation mechanisms of the Odyssey system to energy-constrained environments. Built on the Coda system, a distributed file system for mobile devices, Odyssey manages all the resources and makes decisions to scale application quality to adapt to different scenarios. For example, over a weak network connection, the video stream transported between devices can be set to low quality to save bandwidth. Similarly, under a limited energy budget, PowerScope, the power management tool in Odyssey, lets the server (the device running applications and performing computation) negotiate with the client (the device sending requests and receiving results) to adjust the required application quality. In this way, short-term power consumption can be predicted from the agreed quality requirements so that global power conservation can be achieved. In Oikawa and Rajkumar [1999], the authors extend the Linux kernel to integrate dynamic power management techniques including DVFS. The resource kernel can be migrated to OSs other than Linux with only small changes, hence the name Portable RK. RK can isolate resources for applications with different performance or power requirements, so global energy savings can be achieved. PowerDial is a middleware system for grid computing environments that dynamically adapts the behavior of running applications to respond to
fluctuations in load, power, or any other event that threatens the ability of the computing platform to deliver adequate computing power to satisfy demand [Hoffmann et al.
2011]. PowerDial uses dynamic influence tracing to transform static application configuration parameters into dynamic control variables stored in the address space of the running application. These control variables are exposed as a set of dynamic knobs, and a control system uses the knobs to change the configuration of the running application (and therefore the point in the trade-off space where it executes) in the face of load fluctuations, power fluctuations, or similar events, without interrupting service or otherwise perturbing the execution.
For large-scale data centers, power and energy consumption account for a major part of the operating cost. While architecture-level dynamic power management techniques like DVFS are widely adopted by current computing servers, programmers continue to seek ways to make their programs more energy efficient. Many cluster computing environments provide software to help programmers estimate the cost of each block of their source code; however, these tools cannot evaluate, and self-adjust for, the impact of their estimates on program behavior. In Baek and Chilimbi [2010], the authors propose Green, a framework that supports approximating functions

and loop blocks in programs. Furthermore, it also provides a model to evaluate the loss in performance (QoS) due to the approximation. During this calibration process, the approximation can be adjusted in a timely and accurate fashion.
For dynamic task scheduling, that is, when the global task queue and resource allocation are decided before execution but may change during runtime, resource and energy monitoring techniques play a key role at both the architecture level and the system level. In Ge et al. [2010], the authors propose a power monitoring tool, PowerPack, which can obtain component-level power measurements at function-level granularity in large-scale systems. This framework provides functionality to monitor, track, and analyze the performance and power consumption of applications in distributed systems. The PowerPack toolkit is capable of profiling the power consumption of both individual components and application functions.

4. EATA ALGORITHMS
4.1. Overview

Scheduling algorithms on single processors are comparatively simple. Since offline task scheduling on single processors can be shown to have globally optimal solutions, online algorithms and complex task models for single processors have been attracting recent research interest. Moreover, even though the CPU occupies the largest portion of power consumption among computer components, the power savings obtainable by scaling down other devices, such as caches and communication links, cannot be ignored when the number of CPUs in the system is small. Unlike the CPU and storage devices, whose speed can be scaled, devices that cannot be scaled pose more complex power management issues. A typical case is that prolonging the execution time of tasks on a nonscaling device can increase its energy dissipation, while the same action on the CPU can reduce energy consumption. Thus, power-aware scheduling algorithms should also take the effects on these nonscaling devices into account. In Swaminathan and Chakrabarty [2005], the authors propose switch-off policies for devices that have no workload during a period; their strategy can be applied to systems without DVFS. In Jejurikar and Gupta [2005b], a heuristic is provided for systems with discrete execution states to determine the appropriate state (speed) of each device. Similarly, Mochocki et al. [2007] take not only the CPU but also the network interface into account to conserve energy. Zhang et al. [2005] introduce a cache-tuning policy to save energy by reconfiguring cache modes.
Most of the energy supplied to multicore systems is consumed by CPUs. Thus, dynamic scaling techniques for the CPU, like DVFS, can effectively achieve power savings for multicore systems. There has been considerable research on developing algorithms for scheduling and assigning tasks on DVFS-enabled multicore processors [Aydin et al. 2004; Zhu et al. 2003; AMD 2008]. These mainly address the independent or real-time task model. As a derivative of DVFS in scheduling, slack reclamation techniques have been proposed for various system and application models [Felter et al. 2005; Feng and Cameron 2007; Kang and Ranka 2008a, 2008b]. The basic idea is motivated by the fact that some tasks may complete earlier than their required deadline, leaving unused time periods that can be allotted to incomplete tasks [Aydin et al. 2004; Jejurikar and Gupta 2005a]. Besides device scaling techniques, another general methodology for saving energy in multicore systems is workload consolidation [Jerger et al. 2007], which reduces power dissipation by scheduling workloads onto a minimal set of active computing resources. A special case of this technique is VM consolidation on distributed server clusters. From the perspective of scheduling algorithms, multicore systems are very similar to distributed systems. Since architectural details are out of the scope of this survey, we only give a high-level classification of the uniprocessing and multiprocessing platforms on which different scheduling algorithms run.


Table I. Objective Functions to Perform Energy and Execution Time Trade-Offs for Each Scenario

Function | Objective | Description
Performance-Constrained Energy Optimization (PCEO) | Minimize energy consumption with permitted loss in quality of service. | Assume a schedule for a workflow that optimizes the execution time on a set of processors is available. Determine an adjusted schedule that minimizes the energy by considering an additional slack allowed over the execution time.
Energy-Constrained Performance Optimization (ECPO) | Minimize the execution time under the given energy requirement. | Determine the normal energy requirement of an application, and then find the minimum execution time with the total energy budget reduced by a given factor, say 70%.
Dual Energy and Performance Optimization (DEPO) | Minimize the overall penalty for violating the timing and energy constraints. | A budget is given for the energy and execution requirements of all the cores, and the violation of any constraint incurs a penalty. The overall penalty is to be minimized.

4.2. Algorithm Taxonomy

In this section, we will first present a taxonomy of the algorithms and approaches used for energy-aware scheduling, followed by the details of the work related to each category. While summarizing the corresponding research for multicore and distributed systems, we also present some details of the approaches used for single processor systems; this helps illustrate how the scope of the problem, as well as the solution techniques, change with the type of system under consideration. The main objective functions for trading off performance and energy are summarized in Table I.
(1) Scenario 1. Suppose there is a need to encode real-time H.264 video at 30 frames per second to ensure the best possible quality for a live video stream, or a need to solve a set of fluid dynamics equations for a weather forecast to be broadcast in that evening's news. In these cases, there are deadlines (although not hard deadlines like those in hard real-time systems) that need to be met along with the goal of minimizing energy consumption. This corresponds to the PCEO function.
(2) Scenario 2. It is an unusually hot July and the peak month in energy consumption (depending on location, the peak month could also be December or January). Energy is scarce and increasingly expensive. A system manager has to ration energy based on a given energy budget. A physics user of the machine needs to execute an application including FFT (Fast Fourier Transform) and sparse linear algebra tasks, which takes hours to days. The manager determines the application's normal power requirement and then reduces the allocated energy budget by a given factor (say 70%). This corresponds to the ECPO function.
(3) Scenario 3. A user has to complete her parallel programming assignment, for which there are penalties attached to late submission, but she has a limited energy quota. The scheduling algorithm should optimize both energy and performance. The objective, in this case, is to find the optimal matches between tasks and processors that balance minimizing the total energy used against reducing the makespan. Variations on this general scenario are possible; for example, deviation from constraints incurs penalties, and the balance has to be achieved by considering such penalties. This corresponds to the DEPO function.
Table II. Related Work for PCEO on Single Processor Platforms

Paper | Algorithmic Approach | Performance Definition | Task Model | Power Control Strategy | Components
[Tomiyama et al. 1998] | Heuristics | Instruction cache misses | Dependent tasks, represented by DAG | Instruction scheduling | Cache
[Shin and Choi 1999] | Heuristic | Deadline | Independent, periodic tasks | DVFS | CPU
[Lu et al. 2000] | Greedy heuristic | Timing constraint | Independent tasks | DPM | CPU, disk, network
[Zhang et al. 2002] | LP (algorithm) | Deadline, precedence constraints | Dependent | Discrete and continuous DVS | CPU
[Aydin et al. 2004] | Greedy heuristic | Deadline | Independent, periodic tasks | Discrete DVFS levels | CPU
[Gniady et al. 2004] | Predictor | Mis-predictions | — | DPM | I/O devices
[Zhu and Mueller 2005] | Heuristic | Deadline | Independent, periodic, fully preemptive | DVFS (discrete, feedback) | CPU
[Zhuo and Chakrabarti 2005] | Greedy heuristic | Time complexity | Periodic | DVFS | CPU
[Zhang and Chatha 2007] | Approximate (FPTAS) | Deadline, quality bound | Periodic, independent | DVFS (continuous) | CPU

Additional scenarios are possible but here we focus on the preceding three scenarios since these fundamentally capture various energy and performance trade-offs. A
specific scenario may be applied by the energy manager based on the type of applications and users. For example, there may be some applications that have very strict
performance requirements and thus need high quality of service. To meet both the
performance and energy requirements, the manager may select different scenarios
over a period of time to satisfy the demands of her users.
4.2.1. Performance-Constrained Energy Optimization (PCEO) Algorithms. PCEO algorithms refer to scheduling algorithms that minimize energy consumption under performance constraints. The performance metric varies across scenarios and can be expressed in terms of execution time, response time, QoS, SLA, and so on. Since, in popular power models, power can be modeled as a polynomial in the reciprocal of a performance metric (for example, the dynamic power model), a small sacrifice in performance can bring significant power savings. Thus, PCEO problems attract much attention in energy-conscious computing research.
PCEO on Single Processor Platforms. First we discuss related work on performance-constrained energy optimization schemes for single processor systems (Table II).
Compiler-Optimized Energy Savings. To reduce overall system power consumption, many high-performance embedded processors leverage on-chip caches, because driving off-chip caches is power intensive and on-chip caches reduce data transfers among chips on the system board. Even with on-chip caches, however, the power requirement of the caches is still very high, about 70% of total chip power [Ahmad et al. 1998; Tomiyama et al. 1998]. A compiler optimization technique proposed in Tomiyama et al. [1998] aims to reduce the energy consumed per instruction cache miss. The power consumed by on-chip drivers is minimized by reducing the data transfers between the on-chip cache and main memory. Although many compiler optimization techniques for improving cache performance have been proposed, such as prefetching of instructions and data, loop transformations for data caches, and code placement for instruction caches [Abdelzaher and Lu 2001], few strides have been made in compiler optimization towards reducing the average energy consumption per cache miss. The technique is very effective for embedded system design since it requires neither additional hardware cost nor a loss in system performance. Furthermore, most compiler optimizations targeting cache miss reduction and the proposed technique are not exclusive but complementary. Experimental results show that the proposed scheduling algorithm reduces the transitions between on-chip cache and memory by up to 28% without incurring excessive overhead.
Energy Savings through Task Prioritizing. Shin and Choi [1999] take a simple but effective approach to modifying fixed-priority scheduling, commonly used in real-time system schedulers, to achieve energy savings while still meeting hard real-time deadlines. They consider static scheduling problems for periodic tasks in embedded real-time systems on a single core using dynamic voltage scaling. The authors first identify the key characteristics of periodic real-time systems that can be exploited for energy savings: the execution time variation between the worst and best cases, and idle time intervals. In addition to exploiting execution time variations and idle intervals, dynamic processor speed variation is applied for maximum benefit. They state that their power-conscious scheduling policies can be applied, with only slight changes, to a conventional fixed-priority scheduler implemented in a real-time kernel.
A fixed-priority scheduler is described as utilizing two queues: a run queue and a delay queue. The run queue contains tasks waiting to be executed, ordered by priority, while the delay queue contains tasks that have already executed in the current period and are waiting to run in the next period, ordered by release time. Once the scheduler is invoked, it checks whether a task should be promoted from the delay queue to the run queue. If a task or tasks are moved into the run queue, their priorities are compared with the active task to check whether the active task should be swapped with a higher-priority task in the run queue. The Low-Power Fixed Priority Scheduler (LPFPS) adds two cases to the scheduler for when the run queue is empty. In the absence of an active task, the scheduler brings the processor to an idle state. However, if there is an active task, the processor's speed is scaled down to a level just sufficient to complete the task in time to start the next task; a heuristic computes the appropriate speed ratio.
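The two added LPFPS cases can be summarized in a few lines; this sketch is our paraphrase of the mechanism described above, with a simplified speed-ratio rule rather than Shin and Choi's exact heuristic.

```python
# Sketch of the two LPFPS special cases when the run queue is empty.
# The speed-ratio rule below is a simplification, not the paper's formula.

F_MAX = 1.0   # normalized full speed

def lpfps_action(remaining_wcet, now, next_release):
    """remaining_wcet: remaining full-speed execution time of the lone
    active task, or None if there is no active task."""
    if remaining_wcet is None:
        return ("power_down_idle", None)          # case 1: nothing to run
    window = next_release - now                   # time until the next task
    ratio = min(F_MAX, remaining_wcet / window)   # just fast enough to finish
    return ("scale_speed", ratio)

print(lpfps_action(None, now=0.0, next_release=5.0))  # ('power_down_idle', None)
print(lpfps_action(2.0, now=0.0, next_release=5.0))   # ('scale_speed', 0.4)
```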
Energy Optimization across Multiple Devices. Lu et al. [2000] present a dynamic, multidevice, energy-optimizing approach under performance constraints for non-real-time systems with independent tasks. The work focuses on ordering task execution through the operating system scheduler so as to adjust the lengths of idle periods for the multiple devices that a process may utilize during execution. Adjusting idle period lengths gives the scheduler better opportunities for power management by minimizing state changes and by running processes that depend on the same devices together. The authors prove that long, clustered idle times provide the best results. They present a greedy scheduling algorithm centered on the concept of a Required Device Set (RDS): the set of devices a task needs to perform its function. It is assumed that the scheduler can predict which devices a task will use. The scheduling algorithm attempts to order tasks based on their RDS and runs in O(n log n) time. A scheduling simulator is used to analyze the performance of the proposed algorithm under a Linux-based system, using different workloads with various levels of RDS prediction accuracy and timing constraints. Energy savings are observed in all cases, and the algorithm performs better with higher RDS prediction accuracy and looser timing constraints. Experimentation shows power savings of up to 33% over the baseline scheduling.
Joint DVS and Task Assignment. Zhang et al. [2002] present a two-phase framework to minimize the energy consumption of real-time tasks. The proposed approach combines task assignment, time scheduling, and voltage selection to allocate a set of dependent real-time tasks to DVFS-enabled processors. To generate a schedule with the best slow-down opportunity (which results in minimum performance degradation), they first apply Earliest-Deadline-First (EDF) scheduling, ordering the tasks by priority and using a best-fit strategy to allocate tasks to multiple processors. Then, the voltage selection problem for the Directed Acyclic Graph (DAG) is formulated as an Integer Programming (IP) problem, and the optimal solution is obtained by solving the IP formulation exactly. By limiting the range of discrete voltage selections, the problem can be solved approximately by a heuristic in polynomial time. The proposed approximation for the problem under discrete voltage settings produces results within 3% of the optimal solution.
Static and Dynamic Voltage Assignment for Single Processors. Aydin et al. [2004] present solutions to the problem of scheduling real-time, periodic tasks while taking into consideration power consumption and the task completion deadlines inherent to real-time systems. Their approach targets single-core DVS-capable architectures using an intertask scheduling technique, in which processor speed assignments are made at the task level, at task dispatch and completion times. The authors use a three-level approach: a static level, a reclaiming level, and a speculation level, and give algorithms for each level and their interactions. First, the static level computes optimal speeds for tasks, assuming a worst-case workload for each task arrival. An instance of this problem was shown by Aydin in previous work to be equivalent to an instance of the Reward-Based Scheduling (RBS) problem with concave reward functions [Aydin et al. 2001], so RBS solutions can be reused. However, the authors state that an RBS solution alone would be too conservative as a complete solution to the real-time DVS scheduling problem, given the amount of variation in actual workloads. The reclaiming level is an online mechanism that improves on the static schedule by using the actual workload to reclaim unused time and save energy. The authors describe their reclaiming algorithm as a generic dynamic reclaiming algorithm (GDRA): a greedy algorithm that attempts to allocate the largest possible amount of slack time to the first task satisfying a proper priority requirement. Task deadlines are maintained by examining the execution times of the remaining tasks in the static schedule. Finally, the speculation level is an online speculative mechanism for speed adjustment that uses the average-case workload to predict earlier finishing times for upcoming executions. This online, adaptive, speculative speed adjustment mechanism must weigh how aggressively to apply speculative speed reductions against guaranteeing timing constraints in aggressive modes. The authors' algorithm is unique in that it reduces the speed of dispatched tasks exclusively, borrowing time from other available tasks, and the level of aggressiveness can be adjusted. The results show that the best solution is reached by tuning the aggressiveness of speed reductions to the expected workloads. Experimental results show that applying the dynamic reclaiming algorithm on top of the static optimal algorithm provides up to 50% energy savings over the static optimal algorithm alone for workloads that deviate significantly from their worst-case requirements. Adding aggressive techniques through the proposed speculative adjustment algorithm improves the dynamic reclaiming results by a further 10-15%. The authors' AGR2 speculative algorithm comes within 10% of the lower bound set by an optimal, theoretical clairvoyant algorithm for DVS scheduling.
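The flavor of the reclaiming level can be conveyed with a toy loop; the following is our sketch in the spirit of a generic dynamic reclaiming scheme, not Aydin et al.'s exact GDRA, and all task parameters are hypothetical.

```python
# Sketch of generic dynamic slack reclamation: slack left by tasks that
# finish under their WCET is handed to the next dispatched task, which then
# runs just fast enough. Not the GDRA algorithm itself; values hypothetical.

def dispatch_speed(wcet, static_speed, reclaimed_slack):
    """Lower the precomputed static speed using reclaimed slack. Times are
    expressed at full speed; speeds are normalized to f_max = 1.0."""
    budget = wcet / static_speed + reclaimed_slack   # time now available
    return max(0.1, wcet / budget)                   # speed floor is assumed

slack = 0.0
for wcet, actual, static_speed in [(4.0, 2.5, 0.8), (3.0, 3.0, 0.8)]:
    s = dispatch_speed(wcet, static_speed, slack)
    slack = wcet / s - actual / s        # reserved time left unused
    print(f"run at {s:.2f} * f_max, slack handed on: {slack:.2f} s")
# run at 0.80 * f_max, slack handed on: 1.88 s
# run at 0.53 * f_max, slack handed on: 0.00 s
```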
Reducing Energy in I/O Devices. In Gniady et al. [2004], the authors propose a technique that predicts I/O utilization using program counters and reduces energy dissipation by turning off idle devices. The work focuses on disk devices; however, the idea can be extended to other devices, such as displays or Wi-Fi interfaces. Motivated by the behavior of branch predictors in CPUs, the technique, named PCAP (Program Counter Access Predictor), analyzes the past activities of an I/O device and recognizes behavior patterns, thereby predicting utilization in the near future. If the device is predicted to be idle, the predictor can decide to turn it off to reduce power consumption. A key feature of PCAP is the small energy penalty incurred when a mis-prediction happens. Furthermore, the pattern recognition process not only analyzes the statistics of the device's past behavior, but also takes into account the type of user and application.
Hybrid Static and Dynamic Slack Allocation. Zhuo and Chakrabarti [2005] present a hybrid static and dynamic slack allocation approach combining feedback control with DVS schemes. They target battery-aware task scheduling, based on the assumption that all task information, such as deadlines and execution times, is given in advance, which is common in embedded environments. They propose a novel EDF algorithm for dynamic scheduling of periodic tasks. Their algorithm adopts a dynamic version of the average-rate heuristic and delays slack incorporation to achieve better performance. Experiments show better results, at lower time complexity, than the near-optimal approach in Albonesi [2002].
Feedback-Control-Based Scheduling. Zhu and Mueller [2005] propose a PID (Proportional-Integral-Derivative) controller-based approach that modifies the EDF scheduling scheme. They target hard real-time systems with dynamic workloads. With the help of the operating system, they improve the Earliest Deadline First (EDF) scheduling algorithm by incorporating DVFS and feedback techniques. A task is executed in two stages. In the first stage, the voltage/frequency is scaled to match the task's average execution time, with the feedback mechanism helping to decide the appropriate voltage/frequency for the task. In the second stage, the scaling depends on the execution status of the first stage so that the deadline is met. Their experimental results show that up to 29% energy savings can be achieved by the two-stage algorithm over a simple DVFS algorithm.
EDF and Rate Monotonic Scheduling. Zhang and Chatha [2007] investigate the energy efficiency obtained by applying DVFS under the RM (Rate Monotonic) and EDF (Earliest Deadline First) schemes in embedded systems. These two schemes are the most frequently used scheduling algorithms on single processor systems, especially for periodic tasks. RM is a static scheme that assigns the highest priority to the task with the shortest period, while EDF dynamically maintains a priority queue of tasks ordered by their deadlines. The problem is formulated as selecting the appropriate CPU speed (voltage/frequency) at which to execute each task, so that the total energy dissipation is minimized without violating the rules defined by RM or EDF; this problem is proved NP-complete. Several works give similar approaches: pseudo-polynomial time in Zhang et al. [2002], and fully polynomial time in Zhang and Chatha [2007] and Chen and Mishra [2009]. The approximation scheme in Zhang and Chatha [2007] has the lowest complexity of the three.
PCEO on Multicore and Distributed Platforms. Table III presents selected research for the Performance-Constrained Energy Optimization (PCEO) scenario. The performance constraint can include response time,
Table III. Related Work for PCEO on Multicore and Distributed Platforms

Paper | Algorithmic Approach | Performance Definition | Task Model | Power Control Strategy | Platform | Components
[Schmitz and Al-Hashimi 2001] | Heuristic | Overhead | Dependent | Discrete DVFS | Multicore | CPU
[Elnozahy et al. 2002] | Algorithm | SLA (response time) | Independent | DVFS, switch ON/OFF | Cluster, homogeneous | CPU
[Yu and Prasanna 2002] | LR heuristic | Response time | Independent, periodic | Discrete DVS | Multicore | CPU
[Ge et al. 2005] | Heuristic, profile based | Execution time | Independent | DVS | Cluster, homogeneous | CPU
[Mishra et al. 2003] | Greedy, heuristic | Deadline, precedence constraints | Dependent | Continuous DVS | Multicore | CPU
[Zhu et al. 2003] | Heuristic | Deadline, overhead | Independent and dependent, periodic | DVS | Multicore | CPU
[Stout 2006] | Algorithm | Execution time | Dependent | DPM | Multicore | CPU
[Kang and Ranka 2008b] | Heuristic, optimal (using LP) | Deadline | Independent and dependent | DVS | Multicore | CPU
[Nathuji 2008] | Heuristic | Execution time and QoS | Independent | VM consolidation | Cluster | Server
[Qi and Zhu 2008] | Heuristic | Deadline | Real-time | DVFS, block partitions | CMP (multicore) | CPU
[Seo et al. 2008] | Heuristic | Deadline | Real-time | DVFS | Multicore | CPU
[Srikantaiah et al. 2008] | Bin-packing heuristic | Throughput | Independent | Switch ON/OFF | Data center, homogeneous | CPU
[Ghasemazar et al. 2010] | Heuristic | Throughput | Independent | DVFS, core consolidation | CMP (multicore) | CPU
[Petrucci et al. 2010] | Heuristic for MIP | Execution time | Independent | VM consolidation | Cluster | Server
throughput, task acceptance rate, or any other quantitative representation of output


of the computational system under consideration.
Genetic Algorithm for Energy Reduction. Schmitz and Al-Hashimi [2001] proposed an energy-optimizing heuristic algorithm with performance constraints for offline scheduling of dependent tasks on distributed, embedded systems with multiple processing elements. The authors claim that their approach was unique at the time of writing in that it considers the power profiles and variations of DVS processing elements (referred to in the article as DVS-PEs). Voltage selection is done for each task based on the power dissipation caused by that task. The algorithm accepts a task graph, the mapping of tasks onto processing elements, the schedule of tasks and communications, execution times, power dissipations, and a minimal schedule extension as inputs. Scheduling and mapping techniques are based on genetic algorithms, similar to other cited approaches. The authors' algorithm is said to run in polynomial time and successfully identifies refined voltage selections based on task-level power dissipation. They implemented their algorithm on a Pentium-III-based Linux system and performed experiments for two and four processing elements. Their
experiments showed that there is an advantage to taking the power variations of processing elements into consideration and that their algorithm performs voltage selection successfully when compared to DVS algorithms that do not take variations into account. A significant energy consumption reduction of up to 80.7% is observed for design space exploration when a genetic algorithm adopts the proposed heuristic.
Multiple Power Management Policies. In one of the investigations on the power efficiency of clusters, Elnozahy et al. [2002] proposed a power-aware resource management method to reduce the overhead of unnecessary operating cost in pure Web-service environments (Web servers). The performance model is based on SLAs (Service-Level Agreements), measured in this model by response time, and on the degree of load balancing for a homogeneous cluster. The power model is based on DVFS on the CPU and on switching physical nodes on and off (VOVO). Five power adaptation policies are applied to the resource management strategy: VOVO (Vary-On/Vary-Off), IVS (Independent Voltage Scaling), CVS (Coordinated Voltage Scaling), VOVO-IVS (combined policy), and VOVO-CVS (coordinated combined policy), among which VOVO-CVS is stated to be the most advanced. VOVO-CVS dynamically decides the CPU frequency thresholds and then determines the schedule for turning nodes on or off. The overall CPU frequency can then be estimated to compute the expected response time, and, by comparison against the performance constraint, the optimal number of nodes and the CPU frequency of those nodes can be obtained. These policies were tested in a single-application cluster environment. The results show that while the energy savings achieved by VOVO depend on workload intensity, VOVO-CVS outperforms all the tested approaches in terms of energy savings. The energy savings obtained through VOVO-CVS are shown to be up to 18.2% better than VOVO.
Integer Linear Programming. In Yu and Prasanna [2002], the authors study the problem of allocating independent periodic tasks to a heterogeneous real-time computing environment in which DVFS is supported by each computing component. In general, this allocation problem is NP-complete. First, an ILP (Integer Linear Programming) formulation for the allocation problem is presented [Aydin et al. 2004; Yu and Prasanna 2002]. An extended linearization heuristic (LR-Heuristic) [Aydin et al. 2004; Yu and Prasanna 2002] is then used to solve the problem. The authors then analyze the applicability of the heuristic, which works only for a limited number of tasks. The experimental results show that the greedy approach can be up to 90% off from the optimal, while the maximum deviation of the LR-Heuristic is only 15% for small problems. However, for large problems, the LR-Heuristic can be up to 40% better than the greedy approach.
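The general shape of such an ILP can be sketched as follows (the notation is ours and only indicative; it should not be read as the exact formulation of Yu and Prasanna [2002]). A binary variable x_{ijk} selects task i on processor j at speed level k, with per-choice energy e_{ijk} and utilization u_{ijk}:

\begin{align*}
\min\quad & \sum_{i}\sum_{j}\sum_{k} e_{ijk}\, x_{ijk} \\
\text{s.t.}\quad & \sum_{j}\sum_{k} x_{ijk} = 1 \qquad \forall i \quad \text{(each task assigned exactly once)} \\
& \sum_{i}\sum_{k} u_{ijk}\, x_{ijk} \le 1 \qquad \forall j \quad \text{(per-processor schedulability)} \\
& x_{ijk} \in \{0,1\}.
\end{align*}

The linearization heuristic relaxes the integrality constraint and rounds the fractional solution, which is why its quality degrades as the number of tasks grows.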
Distributed DVS Scheduling. In Ge et al. [2005], the authors propose a performance-constrained distributed DVFS scheduling mechanism for power-aware, high-performance computing clusters. Within the cluster, each member has multiple power-performance modes determined by scaling techniques such as DVFS. Performance (measured by execution time) and energy usage can be derived from the duration spent in each mode and the time spent on transitions between modes. They further investigate distributed DVFS techniques for task scheduling in power-aware cluster systems. Both the system-driven DVFS (driven by CPU speed) and the user-driven DVFS (driven by command-line settings) are transparent to the applications. Moreover, DVFS can also be driven by source-code instructions with precomputed performance profiles. The presented results show that considerable energy savings can be achieved via distributed DVFS scheduling (a maximum of 36% savings) without degrading performance. However, the amount of energy conservation may differ among compositions of various workloads, application types, and system configurations. The
results also corroborate the effect of external DVFS through the observation that, in most cases, complex internal scheduling cannot obtain significantly greater energy savings than a straightforward external scheduling method such as user-driven DVFS.
Combined Static and Dynamic Slack Allocation. In Mishra et al. [2003], the authors investigate the slack allocation problem within the scope of power management in distributed real-time systems. They assume that the task model is communication intensive and allow dependency relationships between tasks. They propose a two-step slack allocation policy applied to an existing scheduling queue generated by a scheduling process [Selvakumar and Siva Ram Murthy 1994]. The first step, named P-SPM (static power management for parallelism), statically assigns the slack of the whole queue to parts of the scheduling queue (the queue is divided into parts by different parallelism levels). The second step dynamically allocates slack along the queue via a greedy selection strategy: always choose the first ready-to-run task for slack allocation. Note that this greedy gap-filling strategy may change the initial scheduling order. The experimental results show that the static slack assignment algorithm can achieve 10% mean energy conservation compared to an earlier slack assignment method that simply assigns slack proportional to task length. Further, the combination of the proposed static and dynamic algorithms can obtain more energy conservation.
Slack Sharing and Reclamation. In Zhu et al. [2003], the authors propose two energy-aware slack reclamation algorithms for task scheduling in multicore systems that support both independent and dependent task models. The main idea of their slack reclamation technique is to reallocate the unused time of tasks that complete early to other running tasks, so that the CPU speed can be slowed down to achieve total energy conservation. They first extend the greedy algorithm of Azevedo et al. [2001] to the scope of global scheduling defined in Albonesi [2002] and show that it may violate the deadline. Instead, they introduce the shared slack reclamation algorithm, a new global slack reclamation solution. However, shared slack reclamation is not perfect either: when this algorithm is applied to list scheduling, the deadline constraint may still be violated. They therefore propose a modified solution, FLSSR, that finishes the execution of tasks before the deadline. They also take into account the overhead of transitions between different voltage or frequency levels because, in practice, DVFS scaling is discrete rather than continuous. The experiments show that their algorithm can obtain almost the same energy conservation via discrete DVFS as with continuous DVFS.
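The core mechanism can be illustrated with a small Python sketch (our own simplification of the shared-slack idea, not the FLSSR algorithm itself):

# Illustrative sketch: when a task finishes earlier than its worst-case
# estimate, the unused time is added to a shared slack pool, and the
# next task uses that slack to run at a lower frequency while still
# meeting its worst-case finish time.

def run_with_shared_slack(tasks, f_max):
    """tasks: list of (wcet_at_fmax, actual_time_at_fmax) pairs, in
    scheduled order. Returns the frequency used for each task."""
    slack = 0.0
    frequencies = []
    for wcet, actual in tasks:
        budget = wcet + slack                    # time the task may occupy
        freq = max(f_max * wcet / budget, 0.1)   # slow down to fill budget
        frequencies.append(freq)
        elapsed = actual * (f_max / freq)        # work stretched by scaling
        slack = budget - elapsed                 # leftover feeds the pool
    return frequencies

print(run_with_shared_slack([(2.0, 1.0), (2.0, 2.0), (3.0, 1.5)], f_max=1.0))
# -> [1.0, 0.666..., 1.0]: the second task reclaims the first task's slack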
Peak Power Minimization. In parallel computing systems, total energy consumption may not be the primary concern, because energy is supplied by an external source and normally does not deplete over time. What matters is whether the required peak power will readily be available from that external source. Stout [2006] investigates the problem of peak power minimization from the perspective of parallel algorithms operating on a grid composed of a large number of small, simple processors, in which each processor is connected to its neighbors. Such a structure is common in sensor networks, cellular automata, and some supercomputer systems. Standard mesh-based parallel algorithms work on the assumption that all processors operate simultaneously, but this assumption is unrealistic. Based on this fact, Stout [2006] designs near-optimal algorithms to reduce the peak power for some basic problems, including labeling elements in an image and calculating the distance between them, calculating the minimum spanning tree of a graph, and determining whether a graph is biconnected. To capture which processors are running simultaneously at a given time instant, the article introduces the squirrel model. A squirrel represents an active processor. Squirrels carry information in a limited number of words; they can track their
location and leave information at a location. The communication paths among processors can be obtained by tracking the squirrels' movements. The problem is thus converted into minimizing the number of active squirrels at any given time.
Static and Dynamic Slack Assignment. Kang and Ranka [2008b] explored novel algorithms for scheduling DAGs on DVS-enabled parallel and distributed machines in both static and dynamic environments for minimizing energy. The proposed schemes can be used for both homogeneous and heterogeneous parallel machines. The scheduling of DAG-based applications with the goal of DVS-based energy minimization broadly consists of two steps: assignment and slack allocation. They present algorithms for static slack allocation, dynamic slack allocation, static assignment, and dynamic assignment. Kang and Ranka [2008b] describe an LP-based approach for slack allocation, with the goal of minimizing total energy consumption under deadline constraints. They extend this version by considering data transfer times in the form of precedence relationships among tasks. Their scheme improves the running time and memory requirements of the LP-based approach by combining compatible-task-matrix and search-space reduction techniques. The dynamic slack allocation algorithm is built on a k-descendant look-ahead approach. However, for readjusting a schedule, it considers only the tasks directly influenced by the early or late finish time of a certain task, rather than all the tasks within a certain range of time. This approach outperforms some of the previous approaches [Felter et al. 2005; Li 2008] in both time and memory, in both the overestimation (i.e., the task finishes earlier than its estimated time) and underestimation (i.e., the task finishes later than its estimated time) cases.
Power Management with Virtual Machine Consolidation. Nathuji et al. [2008] investigate the possibility of adopting virtualization techniques to reduce power consumption in distributed environments. The authors observe that, although switching between the hard power states of machine hardware can scale down the supply voltage and so reduce power consumption, the range between two neighboring hard power states can still be split via software techniques. These finer-grained ranges can clearly improve power efficiency, and they find that this can be achieved by utilizing virtualization techniques supported by distributed operating systems. To evaluate the power efficiency of VMs, the authors design a software framework integrated with the Xen hypervisor to coordinate VM scheduling on clustered servers. For VMs running on machines with a power-aware operating system, the authors define local policies to schedule the VM allocation locally, optimizing power efficiency under execution time or QoS constraints. By extending power-aware task scheduling strategies for nonvirtualized machines, the local policies can be made explicit and practical. However, for operating systems without support for hard power states, the framework must provide the functionality to remotely manage VM activities such as migration or idling to meet power requirements. These global policies are attractive but are not clearly illustrated.
Block Partitioning of Multicore Systems with DVFS. The work in Qi and Zhu [2008] explores core consolidation at the hardware level, where it is assumed that the cores of the system are grouped/partitioned into several blocks. Each block can then be controlled by a separate power supply and thus can change its frequency/voltage independently of other blocks. The main motivation is to avoid the excessive complexity of per-core control as well as the excessive power consumption of common (chip-wide) frequency scaling. Similar to Ghasemazar et al. [2010], a two-phase approach is used. In the first phase, static slack is used to minimize the number of blocks required to complete the tasks. Various configurations of symmetric and asymmetric partitions were evaluated in terms of their energy efficiency to find the optimal partitioning approach. As expected, having a small number of cores per block significantly reduces energy consumption. However, the difference in energy between 1 and 4 cores per block was not significant; therefore, it can be deduced that a reasonably small number of cores per block can guarantee good energy efficiency. Dynamic slack is distributed at runtime among all the active cores by reducing their voltage/frequency settings, thereby reducing energy consumption. Synthetic workloads comprised of real-time tasks were used to evaluate the efficiency of the proposed approach. A baseline approach where all cores run at maximum frequency and are switched off when idle was used for comparison.
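The granularity trade-off can be seen in a few lines of Python (our own illustration; since the cores in one block share a voltage/frequency domain, the block must run at the frequency of its most demanding core):

# Illustrative sketch: cores grouped into blocks share one frequency
# domain, so each block runs at the highest frequency demanded by any
# of its cores. Finer partitions waste less frequency headroom.

def block_frequencies(core_demand, cores_per_block):
    """core_demand: required frequency of each core; returns one
    frequency per block (the max over the cores in that block)."""
    blocks = [core_demand[i:i + cores_per_block]
              for i in range(0, len(core_demand), cores_per_block)]
    return [max(b) for b in blocks]

demand = [0.4, 1.0, 0.5, 0.6, 0.9, 0.3, 0.2, 0.8]
print(block_frequencies(demand, cores_per_block=4))  # coarse: [1.0, 0.9]
print(block_frequencies(demand, cores_per_block=1))  # per-core: demand itself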
Dynamic Partition and Core Scaling. Seo et al. [2008] present two algorithms, dynamic repartitioning and dynamic core scaling, as solutions to efficiently reduce clock frequencies and to reduce leakage power by managing active cores, respectively. Dynamic repartitioning for multicore processors is described by the authors as producing partitioned task sets that achieve the most balanced utilization of the cores. Because the demand on the cores changes during runtime, some tasks must be migrated on-the-fly in order to maintain balanced performance demand and consistently low power consumption. To implement this, it is first determined whether a task can safely be migrated from one core to another by an analysis described by the authors. One of the safe temporal migration conditions is ensuring that a repartitioning won't violate any deadlines of the original schedule. If the conditions are met, the dynamic repartitioning algorithm migrates the task with the lowest required utilization from the core with the maximum load to the least loaded core. The migration continues until the difference in load between the most utilized and least utilized cores is less than the remaining dynamic utilization of the task currently running on the most utilized core. The task to be migrated is the currently scheduled task on the most utilized core. The dynamic core scaling algorithm decides the optimal number of active cores on-the-fly by taking advantage of most commercial multicore processors' ability to transition a core to a given ACPI processor state independently. Based on expectation functions, the power-optimal number of active cores can differ from the appropriate number of cores based on a system's parameters. This problem is NP-hard, which motivates their heuristic algorithm for finding a near-optimal solution. Their algorithm is run at the start of a new task period and is rerun with the updated dynamic utilization after a task completes. If the optimal number of cores is less than the currently active number, the core with the lowest dynamic utilization has its tasks migrated prior to deactivation. The core with the lowest dynamic utilization is chosen because it is likely to have the fewest tasks to migrate, and migrating these tasks will have the least impact on other cores. Also, a core may be activated if it is determined that fewer cores are active than the optimal number based on an increase in dynamic utilization. If activation is needed, dynamic repartitioning is performed to migrate tasks from the core with the highest load to the newly activated core.
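A minimal Python sketch of the repartitioning loop follows (our own simplification; it omits the deadline-safety analysis, which a real implementation must perform before every migration):

# Illustrative sketch: repeatedly migrate the lowest-utilization task
# from the most loaded core to the least loaded core while the migration
# still reduces the load imbalance. A real implementation would first
# check the safe-migration condition (no deadline is violated).

def rebalance(cores):
    """cores: list of lists of task utilizations (one list per core)."""
    while True:
        loads = [sum(tasks) for tasks in cores]
        hi = loads.index(max(loads))
        lo = loads.index(min(loads))
        if not cores[hi]:
            break
        task = min(cores[hi])                # lowest-utilization task
        if loads[hi] - loads[lo] <= task:    # migration would not help
            break
        cores[hi].remove(task)
        cores[lo].append(task)
    return cores

print(rebalance([[0.5, 0.3, 0.1], [0.2], [0.15]]))
# -> [[0.5], [0.2, 0.3], [0.15, 0.1]]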
Resource Utilization Bin Packing. Srikantaiah et al. [2008] propose dynamic consolidation of applications in a heterogeneous data center to minimize energy dissipation while satisfying performance requirements. The authors analyze workload consolidation using a metric of energy per transaction together with data on CPU and storage utilization. The results show that the consolidation status can impact the relationship between energy consumption and resource utilization. They found that energy consumption per transaction follows a U-shape: low utilization leads to a high fraction of servers in an idle state, so the energy-performance metric is high; on the contrary, high utilization increases scheduling conflicts, context switches, and cache misses, so performance is degraded and execution times are extended under high energy consumption. They observe that the optimal balance between resource utilization and the energy-performance metric resides around 50% CPU utilization and 70% storage utilization. They propose dynamic consolidation via a bin-packing algorithm to achieve optimal resource utilization of the servers. The algorithm abstracts the servers into bins with multiple dimensions, where the dimensions represent the resources of interest (CPU, memory, network, and storage). The length along each dimension is obtained by normalizing measurements of the corresponding resource to its optimal utilization level. The applications are represented as objects, and their estimated resource utilizations are represented as proportional lengths along each dimension of the object. The problem is thus abstracted as minimizing the number of bins needed to accommodate all objects, so that the total energy dissipation can be minimized by reducing the number of active nodes and turning off the idle ones. To solve this problem, a heuristic is introduced: when a new request (object) arrives, allocate it to the server (bin) that would have the minimum space left. From the heuristic it is apparent that an idle server is activated only if all the active servers are fully utilized. The authors find that experiments using the heuristic can achieve near-optimal energy conservation (5.4% more energy dissipation than the theoretical optimal solution).
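The placement heuristic can be sketched in Python as follows (our own illustration; the target utilization levels follow the 50%/70% observation above, and the "space left" measure is a simple sum over dimensions):

# Illustrative sketch: multidimensional best-fit bin packing for server
# consolidation. A request goes to the active server that would have the
# least space left; a new server is activated only when none fits.

CPU_TARGET, DISK_TARGET = 0.5, 0.7    # optimal utilization levels

def place(request, servers):
    """request: (cpu, disk) demand; servers: list of [cpu_used, disk_used]."""
    best, best_left = None, None
    for s in servers:
        cpu_left = CPU_TARGET - (s[0] + request[0])
        disk_left = DISK_TARGET - (s[1] + request[1])
        if cpu_left >= 0 and disk_left >= 0:  # the request fits
            left = cpu_left + disk_left       # remaining space measure
            if best is None or left < best_left:
                best, best_left = s, left
    if best is None:                          # activate a new server
        best = [0.0, 0.0]
        servers.append(best)
    best[0] += request[0]
    best[1] += request[1]
    return servers

servers = []
for req in [(0.2, 0.3), (0.25, 0.2), (0.2, 0.3)]:
    place(req, servers)
print(servers)   # -> roughly [[0.45, 0.5], [0.2, 0.3]]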
Combination of Core Consolidation and DVFS. The idea of minimizing the energy consumed in a chip multiprocessor by combining core consolidation with DVFS is explored in Ghasemazar et al. [2010]. A mixed integer program is formulated for minimizing energy under a given throughput constraint. A two-phase heuristic that tackles the problem in a hierarchical manner is proposed to efficiently solve the optimization problem. In order to minimize the additional static power, the proposed heuristic first aims to determine the optimal number of cores that should be in the ON state based on the throughput requirements; this is achieved by following a steepest-descent approach. It then adjusts the voltage setting of each core to optimize the energy consumption. The optimal voltage/frequency setting is obtained by using a simple PI controller. It is assumed that all cores run at the same frequency at any given time; hence, a single optimal voltage/frequency value is obtained at each decision step. Experiments are conducted by selecting applications from the SPEC2000 benchmark uniformly on a chip multiprocessor with each core similar to the Alpha 21264 processor. Results highlight a 17% improvement in energy savings over a dynamic power management approach that does not use core consolidation and uses an open-loop controller for DVFS.
Mixed Integer Programming for Virtualization Environments. In Petrucci et al. [2010], the authors address performance-constrained power optimization in virtualized server clusters. Their solution includes an optimization MIP model and a dynamic configuration strategy. In the static assignment stage, applications are at any time allocated to run on only one VM per physical server; the dynamic optimization mechanism then allows an application to use multiple VMs over distributed servers. The dynamic optimization control periodically selects and enforces the lowest-power-consumption configuration (derived from the MIP model) that maintains the required performance under a variable workload of multiple applications on the cluster. Thus the whole system is kept near the optimal operating point by the control loop. The authors also present a framework that provides several configuration mechanisms in the form of monitoring system execution, evaluating system requirement violations, and configuration adaptation.
From the preceding related work, dynamic scaling techniques like DVFS are widely used for PCEO problems. Because dynamic CMOS power grows quadratically with supply voltage (and roughly cubically with clock rate when voltage is scaled along with frequency), a tiny increase in execution time (or a small violation of the SLA) can achieve a substantial reduction in power dissipation. However, the efficiency of DVFS can be significantly affected by the granularity of the voltage and frequency levels available on the system. In most scenarios, DVFS can be utilized to solve PCEO problems via continuous convex optimization techniques, but the slack granularity directly determines the optimality of the solution. Another general idea for PCEO problems is resource consolidation, encompassing CPU utilization, server resources, and virtual machines, which can reduce overall power consumption by minimizing the number of active components in multicore or distributed environments. The consolidation problems can be cast as classical optimization problems such as bin packing, knapsack, and integer programming. Motivated by heuristics for these problems, many algorithms have been devised to find locally optimal solutions of PCEO problems, which are practical in most cases compared to the high overhead of finding a globally optimal or bounded approximate solution. However, most of these solutions are offline; hence online techniques, as well as relevant methods of statistics and prediction, are attracting state-of-the-art research interest in PCEO algorithms.

Table IV. Related Work for ECPO on Single Processor Platforms

[Alenawy and Aydin 2005]. Approach: Heuristic; Performance: (m,k)-firm deadline, Dynamic Failure Ratio (DFR); Task model: Real-time, independent; Power control: DVFS; Platform: Single processor; Components: CPU.
[Devadas et al. 2009]. Approach: Heuristic, theoretical analysis; Performance: Competitive factor / value metric; Task model: Real-time, independent; Power control: Online scheduling, DVFS; Platform: Single processor; Components: CPU.
[Ranvijay et al. 2010]. Approach: Heuristic; Performance: Acceptance ratio; Task model: Real-time, independent; Power control: Energy-aware scheduling, DVFS; Platform: Single processor; Components: CPU.
4.2.2. Energy-Constrained Performance Optimization (ECPO) Algorithms. The energy-constrained performance optimization problem is primarily motivated by the proliferation of portable and mobile devices. These devices rely on battery power and therefore face strict limits on power usage. Furthermore, ECPO problems are observed in many real-time systems, where the performance metric is represented as Quality-of-Service (QoS) or a combination of QoS and execution time. Table IV presents some related work on ECPO for single processor systems, while Table V outlines the related algorithms for multicore and distributed platforms.
ECPO on Single Processor Platforms.

Energy-Constrained Scheduling for Weakly Hard Real-Time Systems. Minimization of the dynamic failure ratio for weakly hard real-time systems subject to tight energy budgets, using frequency selection and slack reclamation, is proposed in Alenawy and Aydin [2005]. Under an (m,k)-firm deadline, the timing constraints of m out of every k consecutive instances of a particular task must be met, rather than the deadlines of all tasks. Two static frequency setting approaches, namely the greedy and the energy-density scheme, are designed to minimize the Dynamic Failure Ratio (DFR) while staying within the available energy budget. The DFR is the weighted ratio of the number of failures of a periodic task to the dynamic failures of all tasks during a specified interval of operation; the weight is assigned based on the relative importance of a task within the whole task set. First, it is proved that checking an (m,k)-firm deadline for a given periodic real-time task set under an imposed energy constraint is NP-hard. Then a conservative nominal speed for executing the task set, based on its computational requirement, is presented. However,
the conservative nominal speed (also called the utilization-based speed) does not take into account the energy constraint or the fact that only the (m,k)-firm deadline has to be satisfied. The greedy scheme improves the nominal speed/frequency statically by considering the processor demand required to complete m instances out of every k for each task. For this, the load requirements of only the mandatory instances (m per task) are taken into consideration. To select the m mandatory instances, a simple pattern called deeply-red is used, which executes the first m instances of each task and skips the next k-m. The greedy scheme with modified nominal speed selection is further improved by using a dynamic slack reclamation procedure (called DRA [Aydin et al. 2004]) that computes earliness after tasks complete and adjusts the nominal speed for subsequent tasks. Under a very strict energy constraint, such that it is not possible to meet the (m,k)-firm deadline for all tasks, the greedy scheme does not perform satisfactorily. Another static heuristic, called the Energy-Density Scheme (EDS), is proposed to provide better solutions under such conditions. EDS prioritizes tasks based on their energy-density value, which is the ratio of the energy requirements of the mandatory instances to the weighted maximum dynamic failures of each task. This scheme calculates the nominal speed using the load requirements of only the currently selected tasks, rather than all the mandatory tasks. EDS also allows dynamic slack reclamation, and the extra energy accumulated during execution can be used to accept additional mandatory tasks. Both schemes were evaluated, with and without slack reclamation, on randomly generated periodic real-time tasks with varying parameters under both worst-case and actual execution scenarios. Energy-density schemes with the nominal speed calculated from the processor demand of the selected tasks attain smaller or comparable dynamic failure ratios with respect to the greedy scheme and a modified distance-priority-based (DPB) algorithm.
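The deeply-red selection of mandatory instances is easy to state precisely; the following Python sketch (ours) marks instance j of a task as mandatory exactly when j mod k < m:

# Illustrative sketch: the "deeply-red" pattern for an (m,k)-firm task
# marks the first m instances in every window of k consecutive
# instances as mandatory and skips the remaining k - m.

def deeply_red_mandatory(m, k, n_instances):
    """Return the indices of mandatory instances among n_instances jobs."""
    return [i for i in range(n_instances) if i % k < m]

print(deeply_red_mandatory(m=2, k=5, n_instances=12))
# -> [0, 1, 5, 6, 10, 11]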
Competitive Analysis of Real-Time Scheduling Under Energy Constraints. A theoretical study evaluating the scheduling of real-time tasks under hard energy constraints is presented in Devadas et al. [2009]. Competitive analysis for online, semi-online, and DVFS-enabled semi-online scheduling algorithms is carried out using an adversary method. The article starts by discussing the performance of the regular EDF algorithm under a given energy budget. It is shown that if the energy budget is greater than the minimum energy required to execute a task set, then EDF is optimal. However, as the energy budget becomes constrained, EDF performs poorly, providing no guarantee of obtaining even a nonzero total value (the value of a job quantitatively represents the worth of executing it, and is obtained only if the job is completely executed). Hence the authors argue that EDF cannot provide a nonzero competitive factor under tight energy constraints, followed by a proof of an upper bound on the performance of any online real-time scheduling algorithm under underloaded conditions (in terms of energy). An online real-time scheduling algorithm called EC-EDF is then presented, which evaluates whether a task can be completed before admitting it for execution. It is proved that, within the given energy budget, EC-EDF always completes all admitted jobs and hence achieves a nonzero total value and competitive factor. A semi-online variant of EC-EDF, equipped with additional information about the largest task size, is presented to obtain a guarantee on the competitive factor. It is shown that this variant can achieve a constant competitive factor of 0.5 (the ideal being 1), and that no other semi-online algorithm can achieve a better competitive factor than 0.5. The aforesaid bounds were established for underloaded systems under the assumption that the value density of jobs is uniform. The algorithm is later evaluated under nonuniform value settings and is shown to have a competitive factor of 1/(2kmax), where kmax is the largest value associated with any task. EC-EDF is then extended to processors equipped with DVFS, so that tasks can be executed at different frequencies. This DVFS-enabled extension of EC-EDF can achieve a competitive factor of 1, given twice as much energy as the adversary. Important theoretical results are presented in this article regarding energy-constrained scheduling of real-time tasks, but no experimental/simulation results are presented.
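The admission idea behind EC-EDF can be illustrated with a short Python sketch (our own simplification, not the authors' algorithm: it reserves worst-case energy at admission time and dispatches admitted jobs in EDF order):

# Illustrative sketch: admit a job only if its worst-case energy demand
# fits in the remaining energy budget; admitted jobs are dispatched in
# EDF (earliest-deadline-first) order via a min-heap.

import heapq

class EcEdf:
    def __init__(self, energy_budget):
        self.budget = energy_budget
        self.ready = []                       # min-heap ordered by deadline

    def admit(self, deadline, energy_demand):
        if energy_demand > self.budget:
            return False                      # reject: cannot be completed
        self.budget -= energy_demand          # reserve energy up front
        heapq.heappush(self.ready, (deadline, energy_demand))
        return True

    def next_job(self):
        return heapq.heappop(self.ready) if self.ready else None

s = EcEdf(energy_budget=10.0)
print(s.admit(deadline=5.0, energy_demand=6.0))   # True
print(s.admit(deadline=3.0, energy_demand=6.0))   # False: budget exhausted
print(s.next_job())                               # (5.0, 6.0)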
Window-Based Lazy Scheduling for Real-Time Systems. A window-based scheduling approach for real-time systems was recently introduced in Ranvijay et al. [2010]. The authors argue that a simple greedy scheduling scheme may, in the presence of energy constraints, result in higher rejection ratios. Specifically, scheduling a task for execution immediately on its arrival can prevent another high-priority task from completing. Under scheduling with preemption, this can result in a poor acceptance ratio. The authors propose a window-based lazy scheduling algorithm in which the scheduling decision is governed by both the energy constraints and the deadline of the task. Based on the energy budget and deadline, a task can be deferred while other tasks with earlier deadlines are scheduled. A slack reclamation approach that saves energy by slowing down tasks whose total response time is less than the window size is also proposed. Simulations are conducted on aperiodic tasks generated randomly using an exponential distribution. Results are compared for the window-based lazy approach with and without DVFS, greedy EDF with and without DVFS, and a DVFS-based algorithm for sporadic tasks. Though window-based lazy scheduling reportedly achieves better energy consumption and acceptance ratios, a detailed analysis of the algorithms is not presented, which limits the broader applicability of the proposed idea.
ECPO on Multicore and Distributed Platforms.
We now present brief details of selected related work (Table V) for energy-constrained performance optimization in this section.

Table V. ECPO on Multicore and Distributed Platforms

[Felter et al. 2005]. Approach: Heuristic (workload aware); Performance: Execution time; Task model: Independent; Power control: DPM; Platform: Single processor (server); Components: CPU, memory.
[Li 2008]. Approach: Heuristic (profile based); Performance: Schedule length; Task model: Independent; Power control: DVS; Platform: Multiple processors; Components: CPU.
[Ahmad et al. 2009]. Approach: Heuristic; Performance: Schedule length; Task model: Dependent; Power control: DVS; Platform: Multicore; Components: CPU.
[Gandhi et al. 2009]. Approach: Queuing theory, static allocation; Performance: Response time; Task model: Independent; Power control: DVFS; Platform: Server farm, heterogeneous; Components: CPU.
[Lee and Zomaya 2009]. Approach: Heuristic, static scheduling with dynamic adjustments; Performance: Makespan; Task model: Dependent; Power control: DVS; Platform: HCS (heterogeneous computing systems); Components: CPU.
Power Shifting. Felter et al. [2005] discuss and demonstrate a power shifting method
for controlling component power consumption while minimizing the impact on performance in server systems. Power shifting distributes power among components (including memory and processing units) dynamically using workload-sensitive policies.
In turn, a minimal degradation in performance can reduce the power budgets for workloads or, alternately, enable the system to improve performance under a given power budget. Power shifting, the authors claim, involves dynamically managing the division of an entire system's power budget among the CPU and memory, citing that these are the components of a system that consume the most energy during operation. Two observations behind the idea of power shifting are that component and overall system activity vary with the workload, and that the components of a system are typically not fully utilized at the same time.
Energy-Constrained Combinatorial Optimization. Li [2008] provides a combinatorial
optimization approach for solving the problem of task scheduling on DVFS-enabled
multiprocessor systems. The author addresses two problems: energy-constrained performance (schedule length) optimization and performance-constrained (given schedule
length) energy minimization. These problems are approached as a sum of powers problem with scheduling tasks and power supply determination taken as two subproblems.
First, the study shows equivalence of minimizing schedule length and minimizing energy consumption with the sum of powers problem. Given task execution requirements
as the number of CPU cycles for each task, finding the minimal schedule length with
an energy consumption constraint involves finding power supplies for each task and
a nonpreemptive schedule for the tasks spread over multiple processors. Of these two
components of the problem, scheduling tasks involves partitioning the tasks into sets to be executed on each processor, while the determination of power supplies requires minimizing the schedule length under the given energy budget; the latter can be performed after the partitioning is completed. To begin, a uniprocessor computer with an energy constraint
is considered. Minimizing the schedule length in this case only involves finding power
supplies for those tasks that produce the minimal schedule while not violating the
energy constraint. Next, minimizing the schedule length is extended to multiprocessors. It is shown that when a task set is partitioned and divided among processors,
finding power supplies for each task that minimizes schedule length is equivalent to
finding, for each processor's assigned task set, the total energy consumption that minimizes the schedule length. Given task execution requirements, finding a schedule
that consumes minimal energy with a schedule length constraint on a multiprocessor
system involves finding power supplies and nonpreemptive schedule, as before. First
the author considers a uniprocessor system finding power supplies that minimize consumption while keeping execution length below the constraint. This idea is extended
to multiprocessor systems using a similar method as before. Second, Li shows the
NP-hardness of these optimization problems followed by lower bounds for the optimal schedule length and minimal power consumption. NP-hardness is demonstrated
by showing that the problems mentioned previously are equivalent to sum of powers
problems and using a reduction from the well-known partition problem. Lower bounds
for optimal schedule length and minimal power consumption are used to show that
optimizing the power-performance product is done by fixing one factor and minimizing
the other. These lower bounds are used to benchmark the heuristic algorithm. Next,
the author proposes variations of the list scheduling algorithm to determine partitions (schedules, or sums of powers) for each processor. Classic list scheduling approaches, including largest-requirement-first and smallest-requirement-first list scheduling, and how to apply them to this problem, are discussed. Finally, equal-speed algorithms are considered. An equal-speed algorithm supplies all tasks with the same power and speed and is a pre-power-determination algorithm (power supplies are determined before the schedule is found).
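The standard speed-power model underlying such analyses makes the lever explicit (the notation here is ours): a task of r CPU cycles run at speed s with power p = s^alpha takes time t and consumes energy e, where

\begin{equation*}
t = \frac{r}{s}, \qquad e = p\,t = s^{\alpha} \cdot \frac{r}{s} = r\, s^{\alpha - 1},
\end{equation*}

with $\alpha$ commonly taken to be about 3. Thus halving the speed doubles a task's execution time but cuts its energy by a factor of $2^{\alpha-1} \ge 4$, which is the trade-off all of these schemes exploit.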
Iterative Voltage Adjustments. Ahmad et al. [2009] propose an iterative static voltage adaptation (ISVA) algorithm that minimizes energy requirements while allocating voltages to the subtasks of an application represented by a DAG. An initial schedule is first generated using an existing efficient algorithm, without taking the energy constraint into consideration. Next, ISVA computes each task's relative importance and its corresponding energy burden. It then adjusts the schedule to achieve the best possible schedule under the given energy budget. The inputs to ISVA are a task graph, the number of processors, the DVFS levels of the processors, and an energy budget. Using an efficient list scheduling algorithm such as DCP [Ahmad and Kwok 1998], ISVA generates a schedule with all tasks allocated the lowest available voltage level. The energy consumption of the schedule is then recalculated, and if it does not exceed the energy budget, the algorithm proceeds. From the tasks that have not yet been allocated the maximum voltage level, ISVA selects tasks in turn and increases their voltage level. The task with the incremented voltage level is called the candidate task. The earliest-starting task scheduled to run at the lowest voltage level among all the processors is selected for adjustment when there is no candidate task. ISVA stops when an increase in the voltage level of a task would exceed the allocated energy budget.
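The flavor of the voltage-raising loop can be conveyed with a Python sketch (our own simplification: it visits tasks round-robin rather than by the importance metric ISVA actually computes, and the per-task energy table is hypothetical):

# Illustrative sketch: start every task at the lowest voltage level and
# raise levels one step at a time, task by task, as long as the total
# energy stays within the budget.

def raise_voltages(tasks, levels, energy, budget):
    """tasks: list of task ids; levels: dict task -> current level index;
    energy(task, level) -> energy of running the task at that level."""
    total = sum(energy(t, levels[t]) for t in tasks)
    changed = True
    while changed:
        changed = False
        for t in tasks:                       # visit tasks in turn
            if levels[t] + 1 >= N_LEVELS:
                continue                      # already at the highest level
            delta = energy(t, levels[t] + 1) - energy(t, levels[t])
            if total + delta <= budget:       # raise only if budget allows
                levels[t] += 1
                total += delta
                changed = True
    return levels

N_LEVELS = 3
# Hypothetical per-task energies at levels 0..2 (higher = faster, costlier).
table = {"t1": [1.0, 1.5, 2.2], "t2": [0.8, 1.3, 2.0]}
print(raise_voltages(["t1", "t2"], {"t1": 0, "t2": 0},
                     lambda t, l: table[t][l], budget=4.0))
# -> {'t1': 2, 't2': 1}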
Queuing Model for Optimal Power Allocation. In Gandhi et al. [2009], the authors investigate the optimization problem of minimizing the average response time of servers under a given total power budget on a heterogeneous high-performance server farm. They begin with an analysis of the relationship between CPU frequency scaling and dynamic power dissipation. They apply the same CPU-bound workloads to three different frequency scaling techniques: DVFS (ACPI P-states), DFS (ACPI T-states), and mixed DFS+DVFS (fine-grained T-states and coarse-grained P-states). The results show that both DVFS and DFS exhibit a linear relationship between dynamic power consumption and CPU frequency, while DFS+DVFS exhibits a cubic relationship. They then investigate the optimal power allocation, represented as the optimal CPU frequencies that minimize mean response time. The authors assume the system can be modeled with a queuing-theoretic model, so the mean response time can be predicted by taking into consideration factors such as the power-frequency relationship, arrival rate, and peak power budget. The model can then estimate the optimal power allocation under each possible combination of values of these factors. Under different workload intensities, the model yields different power allocations. The results show that the best performance under a power constraint may not come from running tasks on a small number of servers at maximum frequency; a large number of servers running at lower performance levels may be the optimal solution in some cases.
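A toy queuing model (ours, not the authors' solver) reproduces this qualitative behavior. Below, each of k servers is an M/M/1 queue whose service rate grows with the cube root of its power share, mirroring the cubic DFS+DVFS relationship; at high load, spreading the budget over more, slower servers wins:

# Illustrative toy model: k homogeneous servers, each an M/M/1 queue;
# server frequency (and hence service rate) grows with the cube root of
# its power share; jobs are split evenly across servers. The constant c
# is an arbitrary assumption.

def mean_response_time(k, power_budget, arrival_rate, c=2.0):
    per_server_power = power_budget / k
    mu = c * per_server_power ** (1.0 / 3.0)   # service rate of one server
    lam = arrival_rate / k                     # arrival rate at one server
    if lam >= mu:
        return float("inf")                    # queue is unstable
    return 1.0 / (mu - lam)                    # M/M/1 mean response time

for k in (2, 4, 8):
    print(k, round(mean_response_time(k, 100.0, 12.0), 3))
# -> 2 0.731, 4 0.351, 8 0.318: many slower servers win at this load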
Energy-Conscious Scheduling Heuristic with Makespan-Conservative Energy Reduction. Lee and Zomaya investigated the task scheduling problem on Heterogeneous Computing Systems (HCSs) [Lee and Zomaya 2009]. The application model is based on precedence-constrained tasks, so communication costs are accounted for in the model. They propose an Energy-Conscious Scheduling heuristic (ECS) for loosely coupled HCSs (e.g., grids and clouds) using advance reservation and multiple sets of frequency-voltage settings. ECS is devised with the incorporation of DVS to reduce energy consumption; this means that a trade-off exists between the quality of schedules (makespan) and energy consumption. The precedence-constrained tasks are modeled as a DAG, with nodes representing tasks and directed edges representing precedence relations between nodes. Computation costs are represented as weights on nodes, while communication costs are represented as weights on edges. The algorithm consists of two typical phases: a static scheduling phase and an energy reduction phase. In the first phase, ECS is executed repeatedly to formulate a balanced schedule addressing the trade-off between performance and energy; in this phase, task-to-processor (machine) mappings are also constructed. In the second phase, the Makespan-Conservative Energy Reduction technique (MCER) is incorporated into ECS. In this phase, the initial schedule generated in phase one is scrutinized to identify

whether any changes to the schedule further reduce energy consumption without an
increase in makespan.
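The second phase can be sketched as a simple accept-if-makespan-preserved loop (our own illustration of the MCER idea, with a toy makespan function; the real technique works on the full schedule):

# Illustrative sketch: try lowering each task's voltage/frequency level
# one step at a time; keep a change only if the recomputed makespan does
# not grow.

def mcer(levels, makespan_of):
    """levels: dict task -> current V/F level index (0 = lowest);
    makespan_of(levels) -> schedule length under the given assignment."""
    base = makespan_of(levels)
    improved = True
    while improved:
        improved = False
        for task in levels:
            if levels[task] == 0:
                continue                      # already at the lowest level
            levels[task] -= 1                 # tentatively slow this task
            if makespan_of(levels) <= base:   # makespan preserved: keep it
                improved = True
            else:
                levels[task] += 1             # otherwise roll it back
    return levels

# Toy model: makespan is the longest task duration; speeds per level.
speeds = [0.6, 0.8, 1.0]
work = {"a": 2.0, "b": 1.0}
makespan = lambda lv: max(work[t] / speeds[lv[t]] for t in lv)
print(mcer({"a": 2, "b": 2}, makespan))   # -> {'a': 2, 'b': 0}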
For ECPO problems, workload consolidation techniques are not as important as they are for PCEO problems. From the previously cited related work, it is observed that improving power estimation and prediction techniques for tasks and computer systems can also improve system performance. Compared to static scheduling methods, which consider only the energy constraint, dynamic scheduling approaches including slack reclamation may attract more attention for ECPO problems in the future. However, since ECPO can have a strict energy threshold whose violation cannot be afforded in many cases, the energy overhead of dynamic allocation and migration should be carefully evaluated. Furthermore, not only the cumulative energy constraint but also the peak power threshold, especially for power-sensitive computer systems, should be taken into account.
4.2.3. Dual Energy and Performance Optimization (DEPO) Algorithms. The dual optimization of both performance and energy consumption can be considered as multiobjective optimization with multiple constraints. Trade-offs are evaluated for different scenarios, and a summary of various characteristics of research efforts on DEPO targeted at single processor systems and at multicore and distributed systems is presented in Table VI and Table VII, respectively.

DEPO on Single Processor Platforms.

Table VI. DEPO on Single Processor Platforms

[Malik et al. 2000]. Approach: Heuristic; Performance: System utilization, hit/miss rate; Task model: Benchmarks (Powerstone); Power control: Cache tuning; Platform: M-CORE (single processor); Components: Cache.
[Kremer 2000]. Approach: Heuristic, profile based, compiler directed; Performance: Task completion; Task model: Dependent; Power control: Remote task execution; Platform: Single processor; Components: CPU.
[Azevedo et al. 2001]. Approach: Heuristic, profile based, compiler directed; Performance: Checkpoint timing constraints; Task model: Independent; Power control: Discrete DVFS; Platform: Single processor; Components: CPU.
[Zhang et al. 2005]. Approach: Heuristic, profile based; Performance: Hit/miss rate; Task model: Benchmarks (Powerstone, Mediabench, SPEC2000); Power control: Cache tuning; Platform: Single processor; Components: Cache.


Compiler-Driven Optimization. Kremer, Hicks, and Rehg [2000] make use of compile-time program analysis to offload tasks involved in face recognition and detection from embedded or other battery-driven devices to servers. They aim to minimize energy consumption while incurring only a small performance penalty. During program compilation, the phases of face detection are broken down into tasks and used to construct a DAG. A decision is then made about the benefit of potentially executing the tasks on a remote server rather than on the local device. Once the compiling phase is completed, the client can be kept informed about the progress of a task executed on the server. In case of a network disconnection, the client can continue execution on its own without the help of a server. Another optimization solution driven by compiler events is proposed by Azevedo et al. [2001]. They use a DVFS technique to dynamically manage power via compiler-driven strategies. Their COPPER project (compiler-controlled continuous power-performance) makes use of the GCC compiler, the Wattch simulator with an updated power profiler and power scheduler modules, and
SimpleScalar to enable analysis at the levels of code generation up to the simulated execution. Compiler-generated configuration code is embedded into an application to
produce different versions of the code to be selected at runtime. A power scheduler
makes choices based on the available power profile. Scheduling heuristics are used
to predict the dissipation of power by an application using the ahead-of-time power
profile. To effectively address both issues of power and performance control simultaneously, additional controls must be put in place. To achieve the additional goal of time
constraints, program checkpoints are introduced at specific locations in the code. Time
constraints are then set for the amount of time (acceptable upper and lower bounds) the
program should take between checkpoints. A heuristic approach involving two profiling
phases for time and a power scheduling phase is discussed to meet both goals.
Cache Tuning Methods. As caches contribute significantly to a system's power consumption, researchers have devised ways to make caches configurable for different applications. Zhang et al. [2005] introduce a configurable cache architecture in which the cache can be tuned in associativity, total size, and line size. The overall performance and size overhead of these configurable features is small. By incorporating simple configurable circuits, the power savings gained from configuration can reduce overall system power by as much as 40%. Malik et al. [2000] also present cache-tuning solutions for an M-CORE M3 architecture controlled via a cache control register (CACR), with cache-tuning features such as programmable modes of operation, write-mode selection (write-through and copy-back), way management control, and data caching policy adjustment. Following benchmark analysis, an optimal performance/power consumption profile can be generated using these features.
DEPO on Multicore and Distributed Platforms.

Table VII. DEPO on Multicore and Distributed Platforms

[Ahmad et al. 2008]. Approach: Pareto optimal (NBS); Performance: Makespan; Task model: Independent; Power control: DVFS; Platform: Multicore, heterogeneous grid; Components: CPU.
[Liu et al. 2008]. Approach: Heuristic; Performance: Timing constraint; Task model: Periodic DAGs; Power control: DVFS, task retiming; Platform: MPSoC; Components: CPU.
[Verma et al. 2008]. Approach: Bin-packing; Performance: SLA; Task model: Independent; Power control: VM consolidation; Platform: Virtualized cluster; Components: Server.
[Bao et al. 2009]. Approach: Heuristics for MIP; Performance: Throughput and execution time; Task model: Dependent; Power control: DVFS; Platform: Multicore; Components: CPU.
[Kusic et al. 2009]. Approach: Predictive control theory; Performance: SLA; Task model: Independent; Power control: VM consolidation; Platform: Virtualized cluster; Components: Server.
[Lammie et al. 2009]. Approach: Heuristics; Performance: Turnaround time; Task model: Independent (grid workload); Power control: Frequency and node scaling; Platform: Multicore cluster; Components: CPU.
Game-Theoretical Scheduling Algorithm. Ahmad et al. [2008] solve the Energy-Aware Task Allocation (EATA) problem for a form of the third scenario by designing an algorithm, named NBS-EATA, built on concepts from cooperative game theory using the Nash Bargaining Solution (NBS). A cooperative game is played among the cores, such that they compete with each other to grab tasks in order to maximize their profit (based on makespan and power reduction). The desired outcome of such a game is a task-to-core mapping that benefits the whole processing system. The problem with
the DEP-OPT function (assuming heterogeneous cores and also assuming tasks have a
certain affinity to specific cores) corresponds to a min-min-max optimization problem.
An algorithm adhering to the Nash Bargaining Solution is designed first; it yields Pareto-optimal solutions but has high complexity. The min-min-max problem is then transformed into a max-max-min optimization problem, which significantly lowers the complexity of the problem. The great advantage of this conversion, besides lower complexity, is the Pareto-optimality and fairness that come with the guarantee of always having a bargaining point for the max-max-min problem. There is a great match
between the problem scenario and game theory, with cores of a multicore processor competing to optimize multiple goals from multiple perspectives. The scheduler can easily
modify the rules to adjust the strategy of games to suit various scheduling scenarios
and policies. Game theory allows multiple objective functions for cores and memory
modules. These objectives can be dynamically adjusted and tuned.
Overhead-Aware Joint Energy and Performance Optimization on MPSoCs. A heuristic approach to jointly optimizing energy and performance while executing applications represented by periodic DAGs is explored in Liu et al. [2008]. The target is to explore the mutual trade-off between the two objectives by exploiting software pipelining as well as DVFS and DPM. The proposed approach has two phases, where in the
first step a given periodic DAG is transformed into a set of independent tasks and in
the second phase DVFS and DPM are applied on the scheduled DAG to obtain energy
savings. The task-level software pipelining approach uses a technique called retiming
to remove the intratask dependencies among the tasks. The result is a periodic DAG
that can be executed as a set of independent sets with only inter-set dependencies. The
paper elaborates the advantage of working with a pipelined version of DAG as opposed
to the original DAG by showing that the former can be scheduled with significantly
tighter timing constraints. In the second phase, a shrinking and extending approach
(SpringS) is used to adjust the initial schedule. The initial schedule is adjusted if the
timing constraint is violated and is afterwards extended in case the adjustment has
reduced the schedule length less than the constraint. The adjustments are made until
no further reduction in energy can be obtained or the timing constraint is violated. The
overhead considerations are applied in the form of communication overhead as well as
transitional overhead incurred due to switching among the voltage levels as well as the
off and on states of the cores. Experiments are performed for both randomly generated
and practical task-graphs by considering each core as equivalent to an AMD Mobile
Athlon processor. The proposed approach is compared with a non-DVFS list scheduling
approach as well as with a static DVFS approach for DAG scheduling that does not
perform software pipelining (DAGwoSP). The proposed technique is able to achieve
69.9% and 49.8% energy savings on average compared to the non-DVFS approach and DAGwoSP, respectively. Additionally, it is shown that SpringS can satisfy much tighter timing constraints than DAGwoSP for the same number of processors (cores). Trade-off surfaces between energy consumption and timing constraints are also presented for different numbers of processors.
Trade-Offs among Performance, Power and VM Migration Cost. Verma et al. [2008]
investigate the power-aware application placement problem in a virtualization environment. They propose an application placement controller, pMapper, which can dynamically place applications to achieve different trade-offs such as performance (SLA)
optimization and power efficiency optimization. PMapper categorizes all the allocation activities into three categories that are communicated and monitored by three
software components: performance manager covers performance-related activities including VM resizing and idling; power manager covers all power management activities
at hardware layer; and virtualization manager covers consolidation activities via live
ACM Journal on Emerging Technologies in Computing Systems, Vol. 8, No. 4, Article 32, Pub. date: October 2012.

32:30

H. F. Sheikh et al.

VM migration. All three managers report to an arbitrator, which analyzes all the system information and makes decisions from a global view. To handle the different objectives, the authors abstract the power-aware VM allocation problems into a bin-packing problem with variable bin sizes, in which object volumes represent VM parameters while bin sizes represent server parameters. Motivated by the first-fit solution for the classic bin-packing problem, the authors propose the min Power Parity (mPP) algorithm for power-aware optimization problems such as power minimization. The mPP algorithm rests on two assumptions: first, VMs are independent of servers, that is, no specific VM-server pairing incurs extra power; second, if one server exhibits a steeper power slope than another for a given VM, the same ordering holds for all VMs. Under these assumptions, mPP can be shown to be globally optimal. However, the mPP algorithm results in frequent VM migrations in some cases, so the migration-aware mPPH algorithm is proposed, which keeps track of previous allocation information but achieves only a suboptimal solution to the original optimization problem. A third algorithm, pMaP, is therefore proposed, which weighs the power change produced by each candidate migration against its migration cost; by selecting the lowest such ratio before deciding any migration, pMaP can be shown to be locally optimal. Their results show that pMaP is better than mPP on average, and that in the best case the gap between the locally optimal and the globally optimal solution is acceptable; both algorithms outperform load-balancing algorithms. The proposed method shows good applicability in both theoretical and practical domains. However, the assumptions may not hold in a heterogeneous environment, and the proposed algorithms are all offline, meaning that the arbitrator must know the required information about all applications, VMs, and servers in advance. The approach may therefore fail for real-time application/task models.
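The first-fit flavor of mPP can be illustrated with a short sketch: order the servers by an assumed marginal power slope (most power-efficient first) and place VMs first-fit in that order. The field names and the linear power model are placeholders for exposition, not pMapper's actual data structures.

# Sketch of an mPP-style first-fit placement (assumed fields/model).
def mpp_place(vms, servers):
    # vms: list of CPU demands. servers: list of dicts with 'name',
    # 'cap' (capacity), 'slope' (watts per unit load), 'used' (0.0).
    order = sorted(servers, key=lambda s: s['slope'])   # efficient first
    placement = []
    for demand in sorted(vms, reverse=True):            # largest VM first
        target = next((s for s in order
                       if s['used'] + demand <= s['cap']), None)
        if target is None:
            raise RuntimeError('no feasible server for demand %r' % demand)
        target['used'] += demand                        # first fit
        placement.append((demand, target['name']))
    return placement

Under the two assumptions above, filling the most power-efficient servers first minimizes the total marginal power, which is the intuition behind mPP's global optimality.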
Trade-Offs among Performance, Power and Temperature. The work in Bao et al. [2009] proposes a scheme to address the suboptimality and the computational overhead of applying DVFS for task allocation, using a task model of an MPEG-2 decoder on homogeneous cores. DVFS schemes normally used to optimize processor energy assume the maximum allowable temperature (Tmax) when calculating the leakage current and fmax, which results in suboptimal solutions. Additionally, many DVFS approaches assume that applications will complete in the worst number of cycles (WNC), which is not always true. The article shows that using the actual core temperature when calculating fmax, and accounting for the temperature dependence of the leakage current and of the frequency, can improve the overall efficiency of DVFS. To apply such a scheme at runtime, a two-phase approach is proposed in which an offline phase pre-computes the voltage/frequency settings for all tasks. The offline phase computes the actual temperature of an application/task iteratively, starting from Tmax and repeating the computation of the voltage/frequency settings (required for energy optimization) until convergence is achieved.
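The offline iteration can be viewed as a fixed-point computation: start from the pessimistic Tmax, choose a frequency under that temperature, recompute the steady-state temperature from the resulting (temperature-dependent) power, and repeat. The sketch below uses a toy linear thermal model (T = Tamb + Rth * P), a toy leakage model, and an assumed frequency policy; all constants are illustrative and not taken from Bao et al.

# Sketch of the temperature/frequency fixed-point iteration (toy models).
T_AMB, T_MAX = 45.0, 110.0        # ambient and max temperature (deg C)
R_TH = 0.8                        # thermal resistance (C/W), assumed

def dynamic_power(f): return 2.0 * f ** 3        # P_dyn ~ f^3 (toy)
def leakage_power(t): return 0.5 + 0.02 * t      # grows with T (toy)

def freq_for(t):
    # Placeholder policy: a hotter core is assigned a lower frequency.
    return max(0.4, 1.0 - 0.004 * (t - T_AMB))

def converge(tol=1e-3, max_iter=100):
    t = T_MAX                                    # pessimistic start
    for _ in range(max_iter):
        f = freq_for(t)
        p = dynamic_power(f) + leakage_power(t)
        t_new = T_AMB + R_TH * p                 # steady-state estimate
        if abs(t_new - t) < tol:
            return f, t_new
        t = t_new
    return freq_for(t), t

Because the converged steady-state temperature is usually well below Tmax, the resulting setting permits a higher frequency (or lower energy) than one computed under the worst-case temperature assumption.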
Prediction-Based Limited Look-Ahead Control. In Kusic et al. [2009], the authors
propose a dynamic workload allocation strategy for virtualized clustered servers. The optimization objective is the overall processing rate of the VMs, so as to achieve the best SLA satisfaction. A trade-off is also considered between power consumption (the number of active VMs, the number of active servers, and the CPU share given to each VM) and performance (SLA violations), under the rule that an SLA violation incurs a profit penalty while a satisfied SLA yields a reward. The strategy applies to homogeneous environments, that is, all VMs are identical, so the power consumption of any server can be easily calculated from its number of VMs. The proposed approach is offline, that is, when any task
comes in, the costs of all possible allocations are already known. Based on this idea, system states are organized by prediction into horizons of fixed length. Each workload input thus gives rise to state-prediction chains, one for every possible allocation placed at the head of a chain; the head of the optimal chain among all options is chosen as the result of that workload allocation, and the procedure is repeated when the next input arrives. This is called the limited look-ahead control strategy, and its semi-global optimal solution can attain more profit than the accumulation of simple locally optimal solutions at each step. To trade off between different SLA requirements, a risk variable is defined that decides how aggressively the optimization is pursued; if the variable is zero, prediction misses are ignored, which suits scenarios with a low penalty for SLA violations. However, the scalability of this approach is clearly not ideal, since a linear increase in the number of VMs or servers leads to an exponential increase in the number of possible state options. To address this problem, the authors give an alternative strategy for large clusters, in which a neural network is trained to record the historically chosen state chains of inputs. Once the neural network is well trained, it can provide an estimated state chain for any incoming workload instead of choosing from all the chain options, reducing the complexity to polynomial. Although this approach is simulated and the results outperform a simple greedy workload allocation algorithm, a comparison with other VM consolidation approaches is still lacking.
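A minimal sketch of one look-ahead step follows: enumerate control sequences over a short horizon, score each predicted state chain, and apply only the head of the best chain before re-planning. The action set and the transition and profit models below are toy placeholders, and the exhaustive enumeration makes the exponential growth in options explicit.

# Sketch of one limited look-ahead control step (toy models).
from itertools import product

ACTIONS = (0, +1, -1)              # keep / add / remove one active VM

def predict(state, action):        # assumed state-transition model
    return max(1, state + action)

def profit(state, demand):         # toy: reward served load, charge power
    return 5.0 * min(state, demand) - 2.0 * state

def llc_step(state, forecast):     # forecast: predicted demand per step
    best_head, best_score = 0, float('-inf')
    for chain in product(ACTIONS, repeat=len(forecast)):
        s, score = state, 0.0
        for action, demand in zip(chain, forecast):
            s = predict(s, action)
            score += profit(s, demand)
        if score > best_score:
            best_head, best_score = chain[0], score
    return best_head               # apply the head; re-plan at next input

A call such as llc_step(state=3, forecast=[4, 5, 3]) already explores 3**3 chains, which illustrates why the authors resort to a neural-network approximation for large clusters.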
Scheduling Parallel Workloads on Multicore Clusters for Joint Optimization. Several cluster management policies leveraging frequency scaling and node scaling to
improve the energy consumption and turnaround time of jobs submitted to a system
are proposed in Lammie et al. [2009]. The proposed approach is a multilayer joint optimization scheme that combines determining the optimal number of nodes to keep in the ON state, based on the parallel workload, with a DVFS and task allocation scheme. The target is to take advantage of the bursty nature of the workload. The paper compares cluster management schemes that do
not perform node scaling, perform node scaling but do not perform frequency scaling,
and those which perform both. In the second approach, a task in the queue is either assigned to an idle node, or a new node is turned on when no idle node is available. In the third approach, frequencies are first scaled up to determine whether the active nodes can complete the new task or the tasks in the queue; otherwise, a new machine is turned on to fulfill the task completion requirements. The third policy, called SWQ, is then combined with
three heuristics to improve its task assignment procedure. SWQI allocates the next
job in the queue to a core in the machine scheduled to run the longest. SWQN assigns
the pending job to the machine with the most recently submitted tasks, while SWQO maps the task to the machine with the oldest running job. Among these adaptive schemes,
SWQI and SWQN consume the least energy except for the case when the number of
nodes is very small. The experiments are conducted based on the workload data of four
different parallel systems. In terms of turnaround times, SWQI and SWQO achieve good results under variable load compared to the best round-trip time. The efficiency of DVFS-equipped machines, in terms of cycles per joule, was also investigated. The authors observe that machines at full capacity have a better cycles/joule value than machines running at a higher frequency, mainly due to the wasted cycles. The proposed approach is adaptive, yet its runtime behavior may need further investigation.
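Viewed as selection rules over the set of active machines, the three heuristics differ only in how they rank candidates. The sketch below uses assumed bookkeeping fields, not the data structures of Lammie et al.

# Sketch of the SWQ job-placement heuristics (assumed machine fields).
def pick_machine(machines, policy):
    # machines: list of dicts with 'remaining' (scheduled work left),
    # 'last_submit' (latest submission time), and 'oldest_start'
    # (start time of the oldest running job).
    if policy == 'SWQI':   # machine scheduled to run the longest
        return max(machines, key=lambda m: m['remaining'])
    if policy == 'SWQN':   # machine with the most recently submitted tasks
        return max(machines, key=lambda m: m['last_submit'])
    if policy == 'SWQO':   # machine with the oldest running job
        return min(machines, key=lambda m: m['oldest_start'])
    raise ValueError('unknown policy: %s' % policy)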
The combined optimization of performance and power consumption under different constraints can be approached in several ways. Furthermore, comprehensive trade-offs need to be evaluated for complex optimizations, such as those among power, temperature, and performance, under different objectives.


5. DISCUSSION

The discussion in Section 4 of the various scenarios for keeping energy in perspective for modern computational devices presents an overview of the current possibilities in that direction. At the same time, however, we observe that there are still limitations on the possibilities for energy management in computational resources. Many research efforts make significant assumptions about the workload, the nature of the system, or the type of environment in which their scheme can be used. Therefore, no single solution can be considered the principal one or a general approach to energy minimization. It can also be observed that as the size of a system grows (at the infrastructure level), more opportunities usually become available for optimizing energy. For example, in an HPC facility, the designers and managers have to select the best method for minimizing energy consumption by keeping in mind the various models and conditions applicable to their system.
6. CONCLUSION AND RESEARCH DIRECTIONS

A recent surge of research on energy-efficient computing techniques has given us new opportunities and challenges. This survey summarized previous research on energy-efficiency problems according to different objectives, such as PCEO, ECPO, and DOPO, and different target platforms, such as single processors, multicore processors, and parallel and distributed systems.
To extend existing work and foster innovation in energy-efficient techniques, especially in distributed systems, we next discuss several directions that are rarely or not completely investigated in the current literature.
Support resources that are heterogeneous and dynamic. A major advantage of a network-based high-performance computing environment such as grids and clouds is the heterogeneity of the computing machines, which requires assigning tasks to the machines best suited to their nature and requirements, but which also exacerbates the energy-aware scheduling problem [Foster and Kesselman 1997]. In such systems, resources also become available dynamically, not only compounding the complexity of scheduling but also requiring dynamic monitoring of the resources' energy usage. The same will happen as multicore systems become heterogeneous, with each core best suited to a certain type of load. Most scheduling algorithms assume a fixed availability of resources or a specific type of hardware, but in a distributed environment, resources may be added to or removed from the shared pool dynamically. Dynamic changes in available resources can significantly impact the energy and time requirements and should be carefully incorporated into scheduling. It is desirable to empower grids and clouds with fast, dynamic, scalable, and adaptive governing mechanisms instead of static, inflexible manual solutions. Following this direction, novel algorithms can be developed that exploit dependencies among tasks for slack allocation; a sketch of the underlying critical-path slack computation appears after this paragraph. Some nodes in the DAG (e.g., critical path nodes) may benefit more than others from slack allocation. Moreover, these priorities may change as the execution proceeds, making the problem more interesting because the profits and losses of the players vary in a game that changes dynamically. This affects the objective functions and gaming strategies, which can be captured by cooperative gaming algorithms [Khan and Ahmad 2006, 2007b]. Using bargaining and cooperative games, existing critical-path analysis methods [Kwok and Ahmad 1996] can be utilized, and new suitable methods can be developed, to take advantage of the precedence relationships between tasks for slack allocation and assignment. Moreover, dynamic algorithms will also take advantage of energy monitoring. To determine the energy consumption of different processes at runtime, time-driven statistical techniques are used [Chang et al. 2002; Gniady
et al. 2004]. Power monitoring tools, such as those in the Vista operating system [Microsoft 2007a, 2007b], can be exploited through integration with the resource management framework.
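As a starting point for such slack-allocation algorithms, the classical critical-path computation below derives per-task slack from earliest and latest start times on a precedence-only view of the DAG; tasks with zero slack lie on the critical path and cannot be slowed down without stretching the makespan. The graph representation is an illustrative choice, not a prescribed interface.

# Sketch: per-task slack on a DAG via earliest/latest start times.
def topo(tasks, succ):
    seen, post = set(), []
    def visit(u):
        if u in seen: return
        seen.add(u)
        for v in succ.get(u, []): visit(v)
        post.append(u)
    for u in tasks: visit(u)
    return post[::-1]                         # topological order

def slacks(tasks, succ):
    # tasks: {name: execution_time}; succ: {name: [successor names]}.
    order = topo(tasks, succ)
    pred = {u: [] for u in tasks}
    for u, vs in succ.items():
        for v in vs: pred[v].append(u)
    est = {u: 0.0 for u in tasks}             # earliest start times
    for u in order:
        for p in pred[u]:
            est[u] = max(est[u], est[p] + tasks[p])
    makespan = max(est[u] + tasks[u] for u in tasks)
    lst = {u: makespan - tasks[u] for u in tasks}
    for u in reversed(order):                 # latest start times
        for v in succ.get(u, []):
            lst[u] = min(lst[u], lst[v] - tasks[u])
    return {u: lst[u] - est[u] for u in tasks}  # 0 on the critical path

A DVFS scheme can then spend each task's slack on a lower frequency, while a game-theoretic scheme can treat the slack as the resource that tasks bargain over.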
Support a wide variety of trade-offs beyond energy and performance. In a distributed environment, one expects requests that represent different trade-offs between energy and performance. For example, one request may be time critical, so that minimizing time is the key goal; to meet this goal, the scheduler may have to sacrifice some energy. Another request may have loose timing goals, allowing energy consumption to remain within a given energy quota. The current literature investigates the trade-off between energy and performance in terms of execution time, SLA, QoS, etc. However, only a few studies investigate optimization with other objectives. For example, thermal management is closely related to energy management: whereas energy represents the accumulated power consumption, temperature is more closely related to the peak power. A possible requirement is to save energy under both performance and temperature constraints. Exploring the optimization problem involving energy, temperature, and performance together is both necessary and attractive.
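One generic way to state such a problem, consistent with the objective categories used in this survey but not tied to any particular paper, is the following constrained formulation, where the decision variables are the per-task voltage/frequency settings:

\begin{align*}
\min_{(v_i,\, f_i)} \quad & E \;=\; \sum_{i=1}^{n} P_i(v_i, f_i, T_i)\, t_i(f_i) \\
\text{subject to} \quad & \mathrm{makespan}(f_1, \ldots, f_n) \;\le\; D, \\
& T_i \;\le\; T_{\max} \quad \text{for all tasks } i,
\end{align*}

with D the deadline (the performance constraint), T_i the core temperature while task i executes, and P_i a power model that includes temperature-dependent leakage. The coupling between T_i and P_i is what distinguishes this problem from the energy-performance trade-offs surveyed earlier.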
REFERENCES
ABDELZAHER, T. AND LU, C. 2001. Schedulability analysis and utilization bounds for highly scalable real-time
services. In Proceedings of the IEEE Real-Time Technology and Applications Symposium.
ACPI. 1999. Advanced configuration and power interface specification revision 4.0a. http://www.acpi.info/DOWNLOADS/ACPIspec40a.pdf.
AEA. 2008. American electronics association report cybernation. http://www.aeanet.org.
AHMAD, I. AND KWOK, Y. 1998. On exploiting task duplication in parallel program scheduling. IEEE Trans. Parallel Distrib. Syst. 9, 9, 872–892.
AHMAD, I. AND LUO, J. 2006. On using game theory for perceptually tuned rate control algorithm for video coding. IEEE Trans. Circ. Syst. Video Technol. 16, 2, 202–208.
AHMAD, I., KHAN, S., AND RANKA, S. 2008. Using game theory for scheduling tasks on multi-core processors
for simultaneous optimization of performance and energy. In Proceedings of the Workshop on NSF Next
Generation Software Program in Conjunction with the International Parallel and Distributed Processing
Symposium.
AHMAD, I., ARORA, R., WHITE, D., METSIS, V., AND INGRAM, R. 2009. Energy-Constrained scheduling of DAGs on multiprocessors. In Proceedings of the 1st International Conference on Contemporary Computing.
ALBONESI, D. 2002. Selective cache ways: On demand cache resource allocation. J. Instruct.-Level Parall.
ALENAWY, T. AND AYDIN, H. 2005. Energy-Constrained scheduling for weakly-hard real-time systems. In Proceedings of the 26th IEEE International Real-Time Systems Symposium. 376–385.
AMD. 2008. AMD FireStream 9170 stream processor. http://ati.amd.com/technology/streamcomputing/product_firestream_9170.html.
ANDREAE, M. 1991. Biomass burning: Its history, use, and distribution and its impacts on the environmental quality and global change. In Global Biomass Burning: Atmospheric, Climatic, and Biosphere Implications, J. S. Levine, Ed., MIT Press, Cambridge, MA, 3–21.
ATLAS COLLABORATION. 1999. Atlas physics and detector performance. Tech. des. rep. LHCC.
AYDIN, H., MELHEM, R., MOSSE, D., AND MEJIA-ALVAREZ, P. 2001. Optimal reward-based scheduling for periodic real-time tasks. IEEE Trans. Comput. 50, 111–130.
AYDIN, H., MELHEM, R., MOSSE, D., AND MEJIA-ALVAREZ, P. 2004. Power-Aware scheduling for periodic real-time tasks. IEEE Trans. Comput. 53, 5, 584–600.
AZEVEDO, A., CORNEA, R., ISSENIN, I., GUPTA, R., DUTT, N., NICOLAU, A., AND VEIDENBAUM, A. 2001. Architectural
and compiler strategies for dynamic power management in the copper project. In Proceedings of the
International Workshop on Innovative Architecture.
BADER, D., LI, Y., LI, T., AND SACHDEVA, V. 2005. BioPerf: A benchmark suite to evaluate high-performance computer architecture on bioinformatics applications. In Proceedings of the IEEE International Symposium
on Workload Characterization.
BAEK, W. AND CHILIMBI, T. 2010. Green: A framework for supporting energy-conscious programming using
controlled approximation. In Proceedings of the ACM SIGPLAN Conference on Programming Language
Design and Implementation.


BAO, M., ANDREI, A., ELES, P., AND PENG, Z. 2009. On-Line thermal aware dynamic voltage scaling for energy optimization with frequency/temperature dependency consideration. In Proceedings of the 46th ACM/IEEE Design Automation Conference (DAC'09). 490–495.
BLAND, B. 2006. Leadership computing facility. Presented at The Fall Creek Falls Workshop.
BORKAR, S. 1999. Design challenges of technology scaling. IEEE Micro 19, 4, 23–29.
BROOK, B. AND RAJAMANI, K. 2003. Dynamic power management for embedded systems. In Proceedings of the
IEEE International Systems-on-Chip (SOC) Conference.
BROOKS, D., BOSE, P., SCHUSTER, S., JACOBSON, H., KUDVA, P., BUYUKTOSUNOGLU, A., WELLMAN, J., ZYBAN, V., GUPTA, M., AND COOK, P. 2000. Power aware microarchitecture: Design and modeling challenges for next-generation microprocessors. IEEE Micro 20, 6, 26–44.
BURD, T., PERING, T., STRATAKOS, A., AND BRODERSEN, R. 2000. Dynamic voltage scaled microprocessor system. IEEE J. Solid-State Circ. 35, 11, 1571–1580.
BUTTAZZO, G. 2005. Hard Real-Time Computing Systems: Predictable Scheduling Algorithms and Applications. Springer.
CAVIUM NETWORKS. 2008. Octeon Plus CN58XX multi-core MIPS64 based SoC processors. http://www.caviumnetworks.com/OCTEON-Plus_CN58XX.html.
CHANDRAKASAN, A., SHENG, S., AND BRODERSEN, R. 1992. Low-Power CMOS digital design. IEEE J. Solid-State Circ. 27, 4, 473–484.
CHANG, F., FARKAS, K., AND RANGANATHAN, P. 2002. Energy-Driven statistical profiling: Detecting software
hotspots. In Proceedings of the Workshop on Power Aware Computing Systems.
CHEN, M. AND MISHRA, P. 2009. Efficient techniques for directed test generation using incremental satisfiability. In Proceedings of the 22nd International Conference on VLSI Design. IEEE Computer Society, Los Alamitos, CA, 65–70.
CHUNG, E., BENINI, L., AND MICHELI, G. 1999. Dynamic power management using adaptive learning tree. In Proceedings of the International Conference on Computer-Aided Design. 274–279.
CMS COLLABORATION. 2012. Cms data grid system overview and requirements. CMS note 037.
DAREMA, F. 2005. Grid computing and beyond: The context of dynamic data driven applications systems. Proc. IEEE 93, 3, 692–697.
DATAQUEST. 1992. http://data1.cde.ca.gov/dataquest/.
DEVADAS, V., LI, L., AND AYDIN, H. 2009. Competitive analysis of energy-constrained real-time scheduling. In Proceedings of the 21st Euromicro Conference on Real-Time Systems. 217–226.
ELNOZAHY, E., KISTLER, M., AND RAJAMONY, R. 2002. Energy-Efficient server clusters. In Proceedings of the PACS Conference. 179–196.
FELTER, W., RAJAMANI, K., KELLER, T., AND RUSU, C. 2005. A performance-conserving approach for reducing peak power consumption in server systems. In Proceedings of the International Conference on Supercomputing. 293–302.
FENG, W. AND CAMERON, K. 2007. The Green500 list: Encouraging sustainable supercomputing. IEEE Comput. 40, 12, 50–55.
FLINN, J. AND SATYANARAYANAN, M. 2004. Managing battery lifetime with energy-aware adaptation. ACM Trans. Comput. Syst. 22, 2, 137–179.
FOSTER, I. AND KESSELMAN, C. 1997. Globus: A metacomputing infrastructure toolkit. Int. J. Supercomput. Appl. 11, 2, 115–128.
GANDHI, A., HARCHOL-BALTER, M., DAS, R., AND LEFURGY, C. 2009. Optimal power allocation in server farms. In Proceedings of the 11th International Joint Conference on Measurement and Modeling of Computer Systems. ACM, New York, 157–168.
GE, R., FENG, X., AND CAMERON, K. 2005. Performance-Constrained distributed DVS scheduling for scientific applications on power-aware clusters. In Proceedings of the 17th IEEE/ACM High Performance Computing, Networking and Storage Conference. 11.
GE, R., FENG, X., SONG, S., CHANG, H., LI, D., AND CAMERON, K. 2010. PowerPack: Energy profiling and analysis
of high-performance systems and applications. IEEE Trans. Parall. Distrib. Syst. 21.
GHASEMAZAR, M., PAKBAZNIA, E., AND PEDRAM, M. 2010. Minimizing energy consumption of a chip multiprocessor through simultaneous core consolidation and DVFS. In Proceedings of the IEEE International Symposium on Circuits and Systems. 49–52.
ABOUGHAZALEH, N., MOSSE, D., CHILDERS, B., MELHEM, R., AND CRAVEN, M. 2003. Collaborative operating system and compiler power management for real-time applications. In Proceedings of the 9th IEEE Real-Time and Embedded Technology and Applications Symposium.


GNIADY, C., HU, Y., AND LU, Y. 2004. Program counter based techniques for dynamic power management. In
Proceedings of the 10th International Symposium on High Performance Computer Architecture.
GONZALEZ, R. AND HOROWITZ, M. 1996. Energy dissipation in general-purpose microprocessors. IEEE J. Solid-State Circ. 31, 9, 1277–1284.
GREEN GRID. 2012. http://www.thegreengrid.org/home.
HUANG, Z. AND MALIK, S. 2001. Managing dynamic reconfiguration overhead in systems-on-a-chip design using reconfigurable datapaths and optimized interconnection networks. In Proceedings of the Design, Automation and Test in Europe Conference and Exhibition. 735–740.
HOFFMANN, H., SIDIROGLOU, S., CARBIN, M., MISAILOVIC, S., AGARWAL, A., AND RINARD, M. 2011. Dynamic knobs
for responsive power-aware computing. SIGPLAN Not. 46, 3.
JEJURIKAR, R., PEREIRA, C., AND GUPTA, R. 2004. Leakage aware dynamic voltage scaling for real-time embedded systems. In Proceedings of the Design Automation Conference. 275–280.
JEJURIKAR, R. AND GUPTA, R. 2005a. Dynamic slack reclamation with procrastination scheduling in real-time embedded systems. In Proceedings of the Design Automation Conference. 111–116.
JEJURIKAR, R. AND GUPTA, R. 2005b. Energy aware non-preemptive scheduling for hard real-time systems. In Proceedings of the 17th Euromicro Conference on Real-Time Systems. 21–30.
JERGER, N., VANTREASE, D., AND LIPASTI, M. 2007. An evaluation of server consolidation workloads for multi-core designs. In Proceedings of the IEEE 10th International Symposium on Workload Characterization. 47–56.
KAMIL, S., SHALF, J., AND STROHMAIER, E. 2008. Power efficiency in high performance computing. In Proceedings of the IEEE International Symposium on Distributed Processing (IPDPS'08). 1–8.
KANG, J. AND RANKA, S. 2008a. DVS based energy minimization algorithm for parallel machines. In Proceedings of the IEEE International Symposium on Distributed Processing (IPDPS'08). 1–12.
KANG, J. AND RANKA, S. 2008b. Dynamic algorithms for energy minimization on parallel machines. In Proceedings of the 16th Euromicro Conference on Parallel, Distributed and Network-Based Processing (PDP'08). 399–406.
KHAN, S. U. AND AHMAD, I. 2006. Non-Cooperative, semi-cooperative and cooperative games-based grid resource allocation. In Proceedings of the 20th International Parallel and Distributed Processing Symposium (IPDPS'06).
KHAN, S. U. AND AHMAD, I. 2007. A cooperative game theoretical replica placement technique. In Proceedings of the International Conference on Parallel and Distributed Systems. 1–8.
KHANNA, G., BEATY, K., KAR, G., AND KOCHUT, A. 2006. Application performance management in virtualized server environments. In Proceedings of the 10th IEEE/IFIP Network Operations and Management Symposium (NOMS'06). 373–381.
KIM, K. H., BUYYA, R., AND KIM, J. 2007. Power aware scheduling of bag-of-tasks applications with deadline constraints on DVS-enabled clusters. In Proceedings of the 7th IEEE International Symposium on Cluster Computing and the Grid (CCGrid'07). 541–548.
KREMER, U., HICKS, J., AND REHG, J. M. 2000. Compiler-Directed remote task execution for power management.
In Proceedings of the Workshop on Compilers and Operating Systems for Low Power.
KUSIC, D., KEPHART, J. O., HANSON, J. E., KANDASAMY, N., AND JIANG, G. 2009. Power and performance management of virtualized computing environments via lookahead control. Cluster Comput. 12, 1–15.
KWOK, Y. AND AHMAD, I. 1996. Dynamic critical-path scheduling: An effective technique for allocating task graphs to multiprocessors. IEEE Trans. Parall. Distrib. Syst. 7, 506–521.
LAMMIE, M., BRENNER, P., AND THAIN, D. 2009. Scheduling grid workloads on multicore clusters to minimize energy and maximize performance. In Proceedings of the 10th IEEE/ACM International Conference on Grid Computing. 145–152.
LEE, Y. C. AND ZOMAYA, A. Y. 2007. Practical scheduling of bag-of-tasks applications on grids with dynamic resilience. IEEE Trans. Comput. 56, 815–825.
LEE, Y. C. AND ZOMAYA, A. Y. 2009. Minimizing energy consumption for precedence-constrained applications using dynamic voltage scaling. In Proceedings of the 9th IEEE/ACM International Symposium on Cluster Computing and the Grid (CCGrid'09). 92–99.
LI, K. 2008. Performance analysis of power-aware task scheduling algorithms on multiprocessor computers with dynamic voltage and speed. IEEE Trans. Parall. Distrib. Syst. 19, 1484–1497.
LIANG, Y. AND AHMAD, I. 2006. Power and distortion optimization for ubiquitous video coding. In Proceedings of the International Conference on Image Processing (ICIP'06).
LIU, H., SHAO, Z., WANG, M., AND CHEN, P. 2008. Overhead-Aware system-level joint energy and performance optimization for streaming applications on multiprocessor systems-on-chip. In Proceedings of the Euromicro Conference on Real-Time Systems (ECRTS'08). 92–101.


LOVEDAY, J. 2002. The sloan digital sky survey. Contemp. Phys. 43.
LU, Y., BENINI, L., AND DE MICHELI, G. 2000. Low-Power task scheduling for multiple devices. In Proceedings of the 8th International Workshop on Hardware/Software Codesign. ACM, New York, 39–43.
LUO, J. AND JHA, N. K. 2000. Power-Conscious joint scheduling of periodic task graphs and aperiodic tasks in distributed real-time embedded systems. In Proceedings of the IEEE/ACM International Conference on Computer-Aided Design. 357–364.
MALIK, A., MOYER, B., AND CERMAK, D. 2000. A low power unified cache architecture providing power and performance flexibility (poster session). In Proceedings of the International Symposium on Low Power Electronics and Design. ACM, New York, 241–243.
MICROSOFT. 2007a. Microsoft whitepaper, application power management best practices for Windows Vista. http://www.microsoft.com/whdc/system/pnppwr/powermgmt/PM_apps.mspx.
MICROSOFT. 2007b. Microsoft whitepaper, processor power management in Windows Vista and Windows Server 2008. http://www.microsoft.com/whdc/system/pnppwr/powermgmt/ProcPowerMgmt.mspx.
MISHRA, R., RASTOGI, N., ZHU, D., MOSSE, D., AND MELHEM, R. 2003. Energy aware scheduling for distributed real-time systems. In Proceedings of the International Parallel and Distributed Processing Symposium.
MOCHOCKI, B., HU, X. S., AND QUAN, G. 2007. Transition-Overhead-Aware voltage scheduling for fixed-priority real-time systems. ACM Trans. Des. Autom. Electron. Syst. 12. http://doi.acm.org/10.1145/1230800.1230803.
MONTET, C. AND SERRA, D. 2003. Game Theory and Economics. Palgrave Macmillan.
MPI-FORUM. 2008. MPI: A message-passing interface standard. http://www.mpi-forum.org/docs/mpi-1.3/mpi-report-1.3-2008-05-30.pdf.
NASAES. 2012. Nasa earth science. http://science.nasa.gov/earth-science/.
NATHUJI, R., ISCI, C., GORBATOV, E., AND SCHWAN, K. 2008. Providing platform heterogeneity-awareness for data center power management. Cluster Comput. 11, 259–271.
NEWEGG. 2008. AMD Phenom 9850 specifications. http://www.newegg.com/Product/Product.aspx?Item=N82E16819103249.
OIKAWA, S. AND RAJKUMAR, R. 1999. Portable RK: A portable resource kernel for guaranteed and enforced timing behavior. In Proceedings of the 5th IEEE Real-Time Technology and Applications Symposium. 111–120.
ORNL. 2012. OLCF jaguar. http://www.olcf.ornl.gov/computing-resources/jaguar/.
PERING, T., BURD, T., AND BRODERSEN, R. 2000. Voltage scheduling in the lpARM microprocessor system. In Proceedings of the International Symposium on Low-Power Electronics and Design (ISLPED'00). 96–101.
PERING, T., AGARWAL, Y., GUPTA, R., AND WANT, R. 2006. CoolSpots: Reducing the power consumption of wireless mobile devices with multiple radio interfaces. In Proceedings of the 4th International Conference on Mobile Systems, Applications and Services. ACM, New York, 220–232.
PETRUCCI, V., LOQUES, O., AND MOSSE, D. 2010. Dynamic optimization of power and performance for virtualized server clusters. In Proceedings of the ACM Symposium on Applied Computing. ACM, New York, 263–264.
QI, X. AND ZHU, D. 2008. Power management for real-time embedded systems on block-partitioned multicore platforms. In Proceedings of the International Conference on Embedded Software and Systems (ICESS'08). 110–117.
RANVIJAY, YADAV, R. S., AND AGRAWAL, S. 2010. Efficient energy constrained scheduling approach for dynamic real-time system. In Proceedings of the 1st International Conference on Parallel Distributed and Grid Computing (PDGC'10). 284–289.
SCHMITZ, M. T. AND AL-HASHIMI, B. M. 2001. Considering power variations of DVS processing elements for energy minimisation in distributed systems. In Proceedings of the 14th International Symposium on System Synthesis. 250–255.
SELVAKUMAR, S. AND SIVARAMMURTHY, C. 1994. Scheduling precedence constrained task graphs with non-negligible intertask communication onto multiprocessors. IEEE Trans. Parall. Distrib. Syst. 5, 328–336.
SEO, E., JEONG, J., PARK, S., AND LEE, J. 2008. Energy efficient scheduling of real-time tasks on multicore processors. IEEE Trans. Parall. Distrib. Syst. 19, 1540–1552.
SHIN, Y. AND CHOI, K. 1999. Power conscious fixed priority scheduling for hard real-time systems. In Proceedings of the 36th Annual ACM/IEEE Design Automation Conference. ACM, New York, 134–139.
SRIKANTAIAH, S., KANSAL, A., AND ZHAO, F. 2008. Energy aware consolidation for cloud computing. In Proceedings
of the Conference on Power Aware Computing and Systems. USENIX Association, 10.
STOUT, Q. F. 2006. Minimizing peak energy on mesh connected systems. In Proceedings of the 18th Annual
ACM Symposium on Parallelism in Algorithms and Architectures. ACM, New York, 331.


SUBRATA, R., ZOMAYA, A. Y., AND LANDFELDT, B. 2008. A cooperative game framework for QoS guided job allocation schemes in grids. IEEE Trans. Comput. 57, 1413–1422.
SWAMINATHAN, V. AND CHAKRABARTY, K. 2005. Pruning-Based energy-optimal deterministic I/O device scheduling for hard real-time systems. ACM Trans. Embed. Comput. Syst. 4, 141–167.
TOMIYAMA, H., ISHIHARA, T., INOUE, A., AND YASUURA, H. 1998. Instruction scheduling for power reduction in processor-based system design. In Proceedings of the Conference on Design, Automation and Test in Europe. IEEE Computer Society, 855–860.
UPTIME. 2012. Uptime institute. http://uptimeinstitute.org/.
USAMI, K. AND HOROWITZ, M. 1995. Clustered voltage scaling technique for low-power design. In Proceedings of the International Symposium on Low Power Design. ACM, New York, 3–8.
USEPA. 2007. U.S. environmental protection agency report to congress on server and data center energy
efficiency public law 109-431, Energy Star Program.
VENKATACHALAM, V. AND FRANZ, M. 2005. Power reduction techniques for microprocessor systems. ACM Comput. Surv. 37, 195–237.
VERMA, A., AHUJA, P., AND NEOGI, A. 2008. Power-Aware dynamic placement of HPC applications. In Proceedings of the 22nd Annual International Conference on Supercomputing. ACM, New York, 175–184.
WANG, Y., LIU, H., LIU, D., QIN, Z., SHAO, Z., AND SHA, E. H. 2011. Overhead-Aware energy optimization for real-time streaming applications on multiprocessor system-on-chip. ACM Trans. Des. Autom. Electron. Syst. 16, 14:1–14:32.
YU, Y. AND PRASANNA, V. K. 2002. Power-Aware resource allocation for independent tasks in heterogeneous real-time systems. In Proceedings of the 9th International Conference on Parallel and Distributed Systems. 341–348.
ZHANG, Y., HU, X., AND CHEN, D. Z. 2002. Task scheduling and voltage selection for energy minimization. In Proceedings of the 39th Design Automation Conference. 183–188.
ZHANG, C., VAHID, F., AND NAJJAR, W. 2005. A highly configurable cache for low energy embedded systems. ACM Trans. Embed. Comput. Syst. 4, 363–387.
ZHANG, S. AND CHATHA, K. S. 2007. Approximation algorithm for the temperature-aware scheduling problem. In Proceedings of the IEEE/ACM International Conference on Computer-Aided Design (ICCAD'07). 281–288.
ZHANG, S., CHATHA, K. S., AND KONJEVOD, G. 2007. Approximation algorithms for power minimization of earliest deadline first and rate monotonic schedules. In Proceedings of the ACM/IEEE International Symposium on Low Power Electronics and Design (ISLPED'07). 225–230.
ZHU, D., MELHEM, R., AND CHILDERS, B. R. 2003. Scheduling with dynamic voltage/speed adjustment using slack reclamation in multiprocessor real-time systems. IEEE Trans. Parall. Distrib. Syst. 14, 686–700.
ZHU, Y. AND MUELLER, F. 2004. Feedback EDF scheduling exploiting dynamic voltage scaling. In Proceedings of the 10th IEEE Symposium on Embedded Technology and Applications (RTAS'04). 84–93.
ZHUO, J. AND CHAKRABARTI, C. 2005. An efficient dynamic task scheduling algorithm for battery-powered DVS systems. In Proceedings of the Asia and South Pacific Design Automation Conference (ASP-DAC'05). 846–849.
Received July 2011; revised October 2011; accepted October 2011
