
Internship report, Master d'Informatique Fondamentale, ENS Lyon

Energy-aware frameworks for high-performance data transport and computations in large-scale distributed systems

Anne-Cécile Orgerie
annececile.orgerie@ens-lyon.fr
Internship supervised by Laurent Lefèvre and Jean-Patrick Gelas

Abstract

Energy savings have long been a concern in mobile distributed systems and battery-constrained systems. However, for large-scale non-mobile distributed systems, which nowadays reach impressive sizes, the energy dimension (electrical consumption) is only starting to be taken into account. After studying energy-aware proposals in high-performance networks and in large-scale distributed systems, we analyze the usage of an experimental grid over a one-year period. Based on this analysis, we propose a resource reservation infrastructure which takes the energy issue into account. This infrastructure applies equally to two cases: a resource manager in a grid and a bandwidth manager on a high-capacity link. Finally, we validate our infrastructure on the experimental grid traces previously studied.

Contents

1 Introduction
2 Related works on energy efficiency in networking equipment and large-scale distributed systems
  2.1 Energy efficiency in large-scale distributed systems
    2.1.1 Thermal issues
    2.1.2 Server and data center power management
    2.1.3 Node optimizations
  2.2 Energy-awareness in networking equipment
3 Understanding large-scale distributed systems usage
  3.1 Presentation of the Grid5000 testbed
  3.2 A year in the life of an experimental Grid
  3.3 The site view in terms of usage
  3.4 The node view in terms of usage
4 Proposition for an energy-aware reservation infrastructure (EARI)
  4.1 Global architecture
  4.2 Energy monitoring
5 The resource managing algorithms of EARI
  5.1 Definitions of the EARI components
  5.2 Principle of the resource managing algorithm of EARI
  5.3 The resource allocation
  5.4 The resource release
  5.5 Global mechanisms: selection of specific resources and load balancing
6 Predictions
  6.1 Estimation of the next reservation
  6.2 Feedback on the next reservation estimation
  6.3 Energy consumption estimation of a reservation
  6.4 Slack periods
7 Experimental validation of EARI
  7.1 Energy-aware computing reservation
    7.1.1 The replay principle
    7.1.2 Without moving the reservations: validation of the prediction algorithm
    7.1.3 By moving the reservations: validation of our green policy
    7.1.4 How much we can save in kWh: the Lyon case study
  7.2 Energy-aware data transfer reservation
8 Conclusion and future works
Bibliography
A Example of a reservation log furnished by OAR

1 Introduction

High-performance computing aims to solve problems that require a lot of resources in terms of computing power and communication. A large-scale distributed system is a set of interconnected, geographically distributed nodes. Within a site, the nodes are organized in data centers which include several clusters. The nodes are often heterogeneous (different power, different architectures, different operating systems, etc.). They are designed for high-performance computing, so their electrical consumption is taken into account only when the power supply is specified. In this report, we call energy the electrical power consumption.

The common idea is indeed that grid resources should always be available, even when they are not reserved, and thus that they should always remain fully powered on. Large-scale distributed systems are sized to support reservation bursts. A previous work on operational Grids [14] shows that grids are not used at their full capacity: their usage lies between 60% and 80%. Between the bursts, some resources remain free, so energy can be saved. This is the first approach taken in this work: saving energy by shutting down nodes when they are not used. The same holds for high-performance data transport: high-speed links are not always fully used, and Ethernet cards and switch ports can be turned off to save energy.

Understanding the characteristic usage and workloads of large-scale distributed systems is a crucial step towards the design of new energy-aware distributed system frameworks. We have therefore studied the Grid5000¹ [2] platform over a one-year period. From these traces, we have found that the Lyon site consumed 172500 kWh over the whole year 2007. This value includes only the nodes, not the network equipment or the cooling system. It is the energy that a TGV² consumes to travel more than 11650 km.
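The comparison with a TGV can be checked with a short calculation. The sketch below only combines the two figures quoted in this report (172500 kWh for the Lyon nodes in 2007 and about 14.79 kWh per km for a TGV); the variable names are chosen for illustration.

```python
# Figures quoted in the report; the names are illustrative.
LYON_NODES_KWH_2007 = 172500   # consumption of the Lyon nodes over 2007
TGV_KWH_PER_KM = 14.79         # approximate TGV consumption per kilometre

# Distance a TGV could travel with the same amount of energy.
tgv_distance_km = LYON_NODES_KWH_2007 / TGV_KWH_PER_KM
print(f"Equivalent TGV distance: {tgv_distance_km:.0f} km")
```

The result is a little over 11650 km, which is consistent with the order of magnitude given above.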
The analysis of these usage traces led us to propose an energy-aware reservation infrastructure (EARI) which is able to shut down nodes when they are idle. For each reservation made by a user, this infrastructure proposes several energy-efficient solutions, that is, several possible and energy-efficient start times. The user can thus choose between these green solutions and the date he submitted, provided that this date is possible. The infrastructure also includes a prediction algorithm which anticipates the next reservation, in order to avoid shutting down nodes that would have to be restarted shortly afterwards.

Next, we have run a large number of experiments on the Grid5000 data previously collected to evaluate our infrastructure. We use several policies to simulate user behavior, varying the percentage of energy-aware users. Our infrastructure is also suited to high-performance data transfer reservations, where the user makes a bandwidth reservation from one point to another. If the network is not used at its full capacity, our energy-aware reservation infrastructure can turn off Ethernet cards, router ports and routers.

Section 2 presents the related works on energy efficiency in networking equipment and large-scale distributed systems. We then provide our analysis of Grid5000 usage in Section 3. Sections 4, 5 and 6 present our energy-aware reservation infrastructure with its global architecture and its resource management and prediction algorithms. Section 7 shows the evaluation of our infrastructure based on Grid5000 traces. Finally, Section 8 concludes and introduces our future works.

2 Related works on energy efficiency in networking equipment and large-scale distributed systems

Energy waste can be observed in many computing and networking devices (PCs, switches, routers, servers, etc.) because they remain fully powered on during idle periods. In a large-scale distributed system, different policies can be applied depending on the level at which we want to make savings: node level, cluster level or network level.

2.1 Energy efficiency in large-scale distributed systems

2.1.1 Thermal issues

Although energy has been a matter of concern for sensor networks and battery-constrained systems since their creation, energy issues are recent for plugged systems. Thermal issues were considered first because they are a consequence of the increasing number of transistors on processor chips. These issues are related to energy ones. Indeed, decreasing the heat production of a node by using it less also decreases its energy consumption. That is why many algorithms deal with both energy and thermal issues [18, 19, 17].

In [18], the authors take advantage of the different locations of the clusters in the Grid: they make workload assignment decisions based on the data thermal management infrastructure and on the seasonal and diurnal variations. They take the example of two sites which belong to the same Grid: one in New Delhi and one in Phoenix. In summer, the outside temperature in New Delhi can be very high at midday. At that time, it is night in Phoenix, so the temperature should be lower there. It is thus preferable to place the workload in Phoenix, where it requires less cooling than in New Delhi. This technique is called thermal load balancing [19]. It is a multi-system resource management technique with thermal and energy concerns at the granularity of data centers.

Some works [18, 19, 17] show that temperature issues are closely tied to energy issues and that both belong to the same loop. Indeed, decreasing the heat production of the nodes decreases their consumption (the fans work less and the cooling system is less used too). Conversely, to decrease the energy consumption, we should shut down the nodes or put them in slower modes which consume less.

¹ Some experiments of this report were performed on the Grid5000 platform (http://www.grid5000.fr).
² The power consumption of a TGV is about 14.79 kWh per km (http://fr.wikipedia.org/wiki/Deplacement_a_grande_vitesse).

2.1.2 Server and data center power management

Data centers are made up of large numbers of servers with high power requirements concentrated in a small area. They need huge power capacities, and one difficulty is to find out the consumption of all their components (network equipment, nodes, cooling system). In [5], the authors build a model of energy consumption based on the CPU activity. A different approach consists of deducing it from event monitoring counters [17], for example.

One way to save energy is to shut down idle nodes [4]. The problem is then how to wake them up afterwards. Wake On LAN is a mechanism implemented on Ethernet cards which allows a remote user to wake up a PC by sending it specific packets over the network [6]. However, this mechanism requires the Ethernet card to stay powered all the time.

We then need a scheduling algorithm to assign nodes to tasks. The main issue is to design an energy-aware scheduling algorithm under the current constraints (divisible tasks or not [4], synchronization [15], etc.). In [3], the authors want to minimize the consumed energy by minimizing the number of joules per operation. The resource manager maintains a set of awake nodes and tries to keep it as small as possible. When a task ends on a node, the manager tries to move the other tasks of this node to the other running nodes. When a new task arrives, it tries to place it on the awake nodes. The other nodes remain off. This algorithm includes no load balancing mechanism, so we can suppose that some nodes will wear out prematurely while others stay unused.

For all the algorithms presented, an unnecessary wake-up wastes energy twice: during the wake-up power spike and during the idle time spent powered on before going to sleep again. We should therefore be careful not to shut down nodes too quickly.

2.1.3 Node optimizations

Energy can also be saved at the node level, and with scale effects these savings become large. To reduce the wake-up power spike and the booting time, we can use suspend-to-disk (also called hibernation) techniques. When a node switches to that state, the whole content of main memory is saved to the hard drive in a hibernate file, which preserves the state of the operating system (all applications, open documents, etc.). All the node components are then off. At the next state switch, the node loads the hibernate file and restores the previous state.

Other improvements can be made in the CPU. Some algorithms include DVFS (Dynamic Voltage Frequency Scaling) techniques [15, 13, 5]: the CPU reduces its frequency and voltage when it is underused [13]. These techniques have been standard on laptops for several years. They greatly widen the range of possible energy savings, since energy can now be saved when the nodes are neither idle nor fully used. Although we are fully aware that such techniques will be available on all processors in the near future, our work does not include them in the first step presented here. Such techniques are indeed difficult to use in the presence of processor and user heterogeneity, especially if we want to design a centralized resource managing algorithm. Virtualization seems to be another promising track [12] but remains hard to deploy.

Figure 1 summarizes the node components on which we can act to save energy. It shows the components that we will be able to switch off or put in lower-power modes in the near future. We will see in the next section that the mechanisms used for saving energy in large-scale distributed systems can be linked with the ones used in networking equipment: both use sleeping and low-power modes.

2.2 Energy-awareness in networking equipment

The consumption of networking devices is rarely considered. Two research teams have mainly contributed to this domain: one in Portland [10, 9, 11, 8] and one in Florida [6, 7]. The Portland team presents an interesting approach [10]: they want to switch off network interfaces and router and switch components. They first analyze traces and check whether inactivity periods really exist [10, 9]. They then design algorithms which shut down resources based on periodic protocol behavior and traffic estimation [9]. By running their algorithm on utilization traces, they show that a lot of energy can be saved in this way, and they then propose to save energy even during under-utilization periods. To do so, they use the low-power modes available on most Ethernet interfaces [11] (that is, running Gigabit Ethernet cards at 10 Mb/s, 100 Mb/s or 1 Gb/s). Their results show that their algorithm does not affect communication performance in terms of either delay or packet loss.

Figure 1: Node components with different energy states

Their algorithms rely on predictions to take sleeping decisions. They use the buffer occupancy, the behavior of previous packet arrival times and a maximum bounded delay [8]. They assume that routers are able to store packets in their buffer even when they are asleep. When the buffer occupancy reaches a certain size, the whole router is woken up.

The real problem when shutting down networking devices is how to ensure network presence. Indeed, when a switch is asleep, it cannot answer requests (ARP requests or pings, for example). Moreover, when a link is re-established, an auto-negotiation protocol is normally run (to synchronize clocks, determine link type and link rate, etc.). This takes a few hundred milliseconds, which is too long on high-capacity links. The authors therefore modify the auto-negotiation protocol for their algorithms [8]: auto-negotiation is not run after a sleeping period, because sleeping periods are very short and no link state has changed in the meantime.

Another solution, given by the Florida team, is to use proxying techniques: the Ethernet card or the switch filters packets that require no response (like broadcasts), replies to packets that require a minimal response (like ping) and only wakes up the system for packets requiring a non-trivial response [6]. In [6], the authors give a complete analysis of the traffic received by an idle PC and explain that most of this traffic could be filtered out or trivially answered by a proxy. These authors have also proposed an algorithm called Adaptive Link Rate (ALR) which changes the link data rate based on an output buffer threshold policy [7]. This algorithm does not affect the mean packet delay.

To conclude, different techniques are available to improve energy efficiency in high-performance data transport and in large-scale distributed systems. However, to know which techniques we can use and which improvements we can expect, we need to study how these systems are used.

3 Understanding large-scale distributed systems usage

We take as an example the French experimental grid Grid5000 [2]. In order to better understand the stakes and the potential savings with scale effects, we need a comprehensive multi-level view: grid, cluster and node levels.

3.1 Presentation of the Grid5000 testbed

The Grid5000 [2] platform is an experimental testbed for research in grid computing. It has more than 3400 processors geographically distributed over 9 sites in France (Bordeaux, Grenoble, Lille, Lyon, Nancy, Orsay, Rennes, Sophia and Toulouse). This platform can be defined as a highly reconfigurable, controllable and monitorable experimental Grid equipment. In this context, what we call jobs are in fact resource reservations (see Appendix A for a reservation log example). The Grid5000 testbed is provided with several monitoring tools: Monika displays current and scheduled reservations, Ganglia³ provides resource usage metrics (memory, CPU, jobs...) for individual sites or the whole grid, and Nagios⁴ monitors servers and services and automatically reports incidents and failures.

The Grid5000 utilization is very specific. Each user can reserve some nodes in advance. During his reservation time, he is root on these nodes and he can deploy his own system images, collect data, reboot and so on. The node is entirely dedicated to the user during his reservation. Grid5000 thus differs from an operational Grid in some important aspects (exclusive usage, deployment, etc.), but the energy issue is the same and we can design a solution which fits both experimental and operational Grids.

3.2 A year in the life of an experimental Grid

We analyze the node reservation traces of Grid5000 for each site over a one-year period (the year 2007), during which users reserved nodes for given periods of time. We obtained these traces through the history consultation mechanism of the Grid5000 scheduler OAR⁵ [1]. We then wrote a parser in Perl⁶ to process the 1.2 GBytes of traces and compute the statistics reported below. Actually, we wrote two parsers because we had to deal with three different versions of OAR during our experiments.
Site      Reservations  Cores  Cores/reservation  Mean length (s)  Real work
Bordeaux         45775    650              55.50          5224.59     47.80%
Lille           330694    250               4.81          1446.13     36.44%
Lyon             33315    322              41.64          3246.15     46.38%
Nancy            63435    574              22.46         19480.49     56.41%
Orsay            26448    684              47.45          4322.54     18.88%
Rennes           36433    714              54.85          7973.39     49.87%
Sophia           35179    568              57.93          4890.28     51.43%
Toulouse         20832    434              12.89          7420.07     50.57%

Figure 2: Grid5000 usage over one-year period: 2007

We observe that the usage varies greatly from one site to another (Fig. 2) in terms of number of cores, average number of cores per reservation, mean length of a reservation and percentage of real work. The percentage of real work excludes the dead and absent time periods, during which the nodes do not consume any energy because they are unplugged. We do not present the Grenoble results because they do not reflect the real usage of this site, which is the testing site for the new Grid tools. This heterogeneous usage can be explained by geography (the most distant sites are interesting for communication experiments) and by hardware: each site has different nodes with different architectures (storage, network capabilities, etc.).

We define a grid reservation as a set of at least two reservations made by the same user on different sites which have at least five minutes in common during their execution time. In other words, the user has launched reservations on at least two different sites and they run simultaneously for at least five minutes.
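The definition of a grid reservation can be expressed as a short test on pairs of reservations. The following is a minimal sketch, assuming each reservation is reduced to a user, a site and start and end times in seconds; the `Job` class, the function names and the sample data are hypothetical and are not taken from the OAR logs or from our Perl parsers.

```python
from dataclasses import dataclass
from itertools import combinations

MIN_OVERLAP_S = 5 * 60  # five minutes, as in the definition above


@dataclass(frozen=True)
class Job:
    user: str
    site: str
    start: int  # seconds
    end: int    # seconds


def overlap_seconds(a, b):
    """Time during which both reservations are running."""
    return max(0, min(a.end, b.end) - max(a.start, b.start))


def grid_reservations(jobs):
    """Jobs that share >= 5 minutes with a job of the same user on another site."""
    result = set()
    for a, b in combinations(jobs, 2):
        if (a.user == b.user and a.site != b.site
                and overlap_seconds(a, b) >= MIN_OVERLAP_S):
            result.update((a, b))
    return result


jobs = [
    Job("alice", "Lyon", 0, 3600),
    Job("alice", "Sophia", 3000, 7200),  # shares 600 s with the first Lyon job
    Job("alice", "Lyon", 7000, 7100),    # shares only 100 s with the Sophia job
    Job("bob", "Rennes", 0, 3600),       # different user
]
grid_jobs = grid_reservations(jobs)
```

Only the first two jobs are kept here. The pairwise comparison is quadratic in the number of reservations, which is acceptable for a sketch; a trace of a whole year would rather be grouped by user first.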

Figure 3: Working hours of the grid reservations per week and per site
³ http://ganglia.info/
⁴ http://www.nagios.org/
⁵ OAR is a resource manager (or batch scheduler) for large clusters (http://oar.imag.fr/).
⁶ http://www.perl.org/

Figure 3 shows the number of working hours spent by the grid reservations per week and per site. The histogram gives the grid reservations per site in hours and the plain line gives them in number of reservations (jobs). The dotted line shows the total number of working hours per week for all the sites, which gives a global view of the grid. This figure shows that we cannot apply the same energy saving policy on each site because the usage varies from one site to another. However, the policy should be global at least for the placement of reservations (in terms of time and resources): as Figure 3 shows, numerous users need coordinated reservations on different sites.

3.3 The site view in terms of usage

Figure 4 shows the example of the Sophia site⁷, which had 368 cores at the beginning of the year and to which 200 cores were added in the middle of February. The plain line indicates the number of reservations per week. For each week, we represent the time during which cores are dead (they are down), suspected (they do not work properly), absent (they do not answer) and working (a reservation is running). For this site, the real percentage of work time is 51.43%. Figure 4 shows that the usage of the site is low during some weeks. However, the real concern for such a Grid is to support burst periods of work and communication, especially before well-known deadlines, and the figure shows that such periods exist.

Figure 4: Global weekly diagram for Sophia

3.4 The node view in terms of usage

The three diagrams of Figure 5 present the weekly distribution of three particular resources: the minimal, the median and the maximal resources in terms of working time. The minimal (respectively maximal) resource is the one with the smallest (respectively largest) work time among the resources which are present during the whole monitored period. The median resource is the one whose cumulative work over the experiment duration is the nearest to the median value. In terms of energy profitability, this figure shows a bad resource (the minimal one), a not-so-good resource (the median one) and a good resource (the maximal one). Indeed, the minimal one is always powered on but does little work, so it wastes a lot of energy in the idle state.

To conclude, the platform is used at about 40% on average over the 9 sites. During some burst periods, however, it is used at more than 95%. Between the bursts, the resources are idle for substantial periods, so we can save energy by turning them off.
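The choice of the three representative resources can be sketched as follows. This is only an illustration: the input dictionary (resource name to cumulative work time in seconds) is assumed to contain only the resources present during the whole monitored period, and the function name and sample values are hypothetical.

```python
from statistics import median


def representative_resources(work_seconds):
    """Return the (minimal, median, maximal) resources by cumulative work time."""
    by_work = sorted(work_seconds, key=work_seconds.get)
    target = median(work_seconds.values())
    # The median resource is the one whose work time is nearest to the median.
    median_resource = min(by_work, key=lambda r: abs(work_seconds[r] - target))
    return by_work[0], median_resource, by_work[-1]


# Hypothetical cumulative work times, in seconds.
work = {"n1": 100, "n2": 5000, "n3": 2500, "n4": 9000, "n5": 4000}
minimal, med, maximal = representative_resources(work)
```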

4 Proposition for an energy-aware reservation infrastructure (EARI)

4.1 Global architecture

In large-scale distributed systems, reducing the global energy consumption requires acting directly on the nodes, on the network devices and on the scheduler. This report presents an On/Off algorithm for the resources managed by the scheduler, which is also the resource manager. Figure 6 presents the global architecture of our infrastructure. Each user connects to a portal to submit a reservation. The scheduler processes the submission and validates it. It then manages the resources and, according to its agenda, gives the users access to the resources they have reserved. Energy sensors monitor energy parameters of the resources (which can be nodes, routers, etc.) and these data are collected by our infrastructure. They are used to compute green advice which is sent to the user in order to influence his reservation choice. Our infrastructure also computes the consumption diagrams of past reservations and sends them as feedback to the portal, where the users can see them. Last but not least, it decides which resources should be on and which should be off. This architecture can be centralized or distributed, by cluster for example.

Figure 5: Three resource examples on Sophia

⁷ An exhaustive study of the results for each site can be found at http://perso.ens-lyon.fr/annececile.orgerie/resultats_grid5000.pdf

Figure 6: Global architecture

4.2 Energy monitoring

Our objective is to measure the power consumption of the Grid nodes in Watts in order to model the link between electrical cost and applications or processes. To measure the real consumption of some machines, we use a watt-meter furnished by the SME Omegawatt⁸. This equipment works as a power strip: we plug six nodes into it and we obtain their consumption through a serial link (RS232). We have written scripts that send a request to the watt-meter over the serial link every second, convert the answer (the watt-meter accepts requests and sends results in a specific hexadecimal format) and send it to a remote collector which stores it (see Figure 7). This architecture can easily be adapted to other cluster infrastructures.

Figure 7: Our experimental architecture to measure nodes energy consumption

Figures 8 and 9 show our results concerning the energy consumption on the Lyon site. We have dynamically collected the consumption in Watts of six different nodes representing the three hardware architectures available on the Lyon site: two IBM eServer 325 (2.0 GHz, 2 CPUs per node), two Sun Fire v20z (2.4 GHz, 2 CPUs per node) and two HP Proliant 385 G2 (2.2 GHz, 2 dual-core CPUs per node). We can see that the nodes have a high idle power consumption, that some of them reach impressive powers during their boot and that they consume power even when they are turned off. Other experiments that we have made on the same nodes show that an Iperf⁹ experiment consumes between 10 and 12 watts more than the hdparm¹⁰ upper bound, and a cpuburn¹¹ experiment between 0 and 6 watts more than the Iperf experiment. These experiments represent the typical life of an experimental Grid node: down but plugged into the wall socket, booting, performing intensive disk accesses (hdparm), performing intensive high-performance network communications (Iperf) or under intensive CPU usage (cpuburn).

⁸ Omegawatt is an SME established in Lyon (http://www.omegawatt.fr/gb/index.php).
⁹ Iperf is a commonly used network tool to measure TCP and UDP bandwidth performance that can create TCP and UDP data streams (http://dast.nlanr.net/Projects/Iperf/).
¹⁰ hdparm is a command line utility for Linux to set and view IDE hard disk hardware parameters (http://sourceforge.net/projects/hdparm/). We have made a loop of hdparm instructions to simulate intensive disk accesses.
¹¹ cpuburn is a tool designed to heavily load CPU chips in order to test their reliability (http://pages.sbcglobal.net/redelm/).

Figure 8: Booting consumption of the nodes

Figure 9: Intensive communication and computing

The difference in consumption between the two HP Proliants can be explained by the fact that the one which consumes more embeds a powerful configurable Ethernet card that the other one does not have. The two Sun Fires, on the other hand, are identical in terms of components (same architecture, same features), and so are the two IBM eServers. Nevertheless, one Sun Fire consumes more than the other, and one IBM eServer consumes more than the other. In both cases, the node which consumes less was at the bottom of the rack, while the node which consumes more was at the top. We thus observe that the position in the rack influences the consumption. This is probably due to the position of the air cooling infrastructure: the better ventilated nodes consume less energy. These results show the impact of node utilization on energy usage. We use this analysis to design an energy-aware reservation infrastructure.

5 The resource managing algorithms of EARI

5.1 Definitions of the EARI components

We define a reservation $R$ as a tuple $(l, n, t_0)$ where $l$ is the length of the reservation in seconds, $n$ is the required number of resources and $t_0$ is the wished start time. $N$ denotes the total number of resources managed by the scheduler. A valid reservation must therefore always satisfy $n \le N$, $t_0 \ge t$ where $t$ is the current time, and $l \ge 1$. For example, in the case of a large-scale distributed system, a reservation is a reservation of $n$ nodes during $l$ seconds starting at $t_0$. In the case of a high-performance data transfer, a reservation is a bandwidth portion of size $n$ ($n$ can be in MBytes for example) during $l$ seconds starting at $t_0$.

When the scheduler accepts a reservation, it writes it down in the corresponding agenda. Each site has its own agenda. The agenda contains all the future reservations. The history contains all the past and current reservations. So when a reservation starts, it is deleted from the agenda and added to the history.
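These definitions can be illustrated with a small data structure. The sketch below is only an illustration under our own naming (`Reservation`, `Site`, `accept`, `tick` are hypothetical); it checks the three validity conditions only, not the availability of the resources, which is the subject of Section 5.2.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Reservation:
    length: int     # l, in seconds
    resources: int  # n, number of resources
    start: int      # t0, wished start time


def is_valid(r, total_resources, now):
    """n <= N, t0 >= t and l >= 1."""
    return r.resources <= total_resources and r.start >= now and r.length >= 1


@dataclass
class Site:
    total_resources: int                         # N
    agenda: list = field(default_factory=list)   # future reservations
    history: list = field(default_factory=list)  # past and current reservations

    def accept(self, r, now):
        """Write a valid reservation down in the agenda."""
        if not is_valid(r, self.total_resources, now):
            return False
        self.agenda.append(r)
        return True

    def tick(self, now):
        """Move the reservations that have started from the agenda to the history."""
        self.history.extend(r for r in self.agenda if r.start <= now)
        self.agenda = [r for r in self.agenda if r.start > now]


site = Site(total_resources=10)
accepted = site.accept(Reservation(3600, 4, 100), now=0)
rejected = site.accept(Reservation(3600, 11, 100), now=0)  # n > N
site.tick(now=100)  # the accepted reservation starts and moves to the history
```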

Figure 10: Booting and shutting down of a resource

$P_I$ refers to the power consumption in Watts of a given resource when it is idle (it can vary from one resource to another). $P_{OFF}$ refers to the power consumption in Watts of a given resource when it is off ($P_{OFF} < P_I$). $E_{ON \to OFF}$ (respectively $E_{OFF \to ON}$) refers to the energy in Joules needed by a given resource to switch from On to Off mode (respectively from Off to On mode). Figure 10 illustrates these definitions.


So, we can roughly estimate the energy consumption in Joules of a given reservation $R = (l, n, t_0)$:

$$E_m(R) = l \times \sum_{i=1}^{n} P_m(i)$$

where $P_m(i)$ is the mean power consumption of resource $i$.
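As a rough illustration of this estimate, the following Python sketch multiplies the reservation length by the sum of the mean powers of the allocated resources. The function name and the example power values are hypothetical.

```python
def estimate_energy(length_s: float, mean_powers_w: list) -> float:
    """E_m(R) = l * sum of P_m(i) over the n allocated resources, in Joules."""
    return length_s * sum(mean_powers_w)


# Example: a one-hour reservation on two resources whose mean powers
# are assumed to be 200 W and 216 W.
energy_j = estimate_energy(3600, [200.0, 216.0])
energy_kwh = energy_j / 3.6e6  # 1 kWh = 3.6e6 J
```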

5.2 Principle of the resource managing algorithm of EARI

We split our algorithm into two parts: when a reservation is submitted (see Fig. 11) and when a reservation ends (see Fig. 13). Figure 11 shows the algorithm used for a reservation arrival $R = (l, n_0, t_0)$. At $t_0$, we know that there will be at least $n$ busy resources (because of previously arrived reservations). So, first of all, we check whether this reservation is acceptable, i.e. $n_0 \le N - n$. If it is not acceptable, we compute the earliest possible start time after $t_0$ (taking into account the reservations already written down in the agenda), which is called $t_1$.

Then, we estimate the energy consumed by $R$ if it starts:
- at $t_0$ (or at $t_1$, the next possible start time, if $t_0$ was not possible);
- just after the next possible end time of a reservation, which is called $t_{end}$;
- $l$ seconds before the next possible start time, which is called $t_{start}$;
- during a slack period (a period of at least 2 hours with usage under 50%, see section 6.4), at $t_{slack}$.

The prediction algorithms are presented in the next sections. To make these estimations, we need to compute $t_1$ (done previously), $t_{end}$ and $t_{start}$, and to estimate $t_{slack}$. Our goal is to glue the reservations together in order to avoid booting and shutting down resources, which consumes energy. Our infrastructure does not impose any solution: it offers several of them and the user chooses.

5.3 The resource allocation

In order to calculate $t_{end}$, we look for the next reservation end in the agenda and we verify whether it is possible to start $R$ at that time (enough resources for the total length). If it is not possible, we look for the next one in the agenda, and so on. $t_{end}$ is then defined as the end time of this reservation. In the same way, we calculate $t_{start}$ by looking for the next reservation start time in the agenda and checking whether it is possible to place $R$ before it (this start time should then be at least $t + l$, where $t$ is the current time and $l$ the duration of $R$). If the start time found does not fit, we try the next one, and so on. An enhancement consists in finding several possible reservation end times and start times. We can then take the ones which minimize the energy consumption, that is, the ones which require turning on and off the fewest resources.

Finally, we give all these estimations to the user (energy estimations and corresponding start times), who selects his favorite solution. The reservation is then written down in the agenda and the reservation number is given to the user. With this approach, the user can still make his reservation exactly when he wants to, but he can also delay it in order to save energy. This should raise user awareness of energy savings.

The scheduler allocates resources by choosing those with the biggest power coefficient. This coefficient is calculated from the mean power consumption of the resource (measured during reservations over a long period of time). Thus a resource which consumes little energy has a big power coefficient and is chosen in priority by the scheduler. Indeed, resources are not identical (different architectures, etc.), so they do not have the same consumption. This allocation policy is used when we allocate resources to a reservation without constraints. When the scheduler places a reservation just after another one (by using $t_{end}$ or not) or just before another one, it instead allocates the resources which are already awake (and in priority those which have the biggest power coefficient). Moreover, the user can explicitly choose certain resources; in that case, this policy is not applicable. The power coefficient is calculated when the resource is added to the platform and does not change afterwards.
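The search for the two "glued" start times described in this section can be sketched as follows. This is only an illustration under stated assumptions: the agenda is modelled as a list of `(start, end, n)` tuples, and the function names `max_busy`, `fits`, `start_after_next_end` and `start_before_next_start` are hypothetical.

```python
def max_busy(agenda, start, end):
    """Peak number of busy resources during [start, end).

    The busy count only increases at reservation start times, so it is
    enough to sample it at `start` and at every start inside the window.
    """
    points = {start} | {s for s, e, n in agenda if start < s < end}
    return max(sum(n for s, e, n in agenda if s <= p < e) for p in points)


def fits(agenda, total, n0, start, length):
    """True if n0 extra resources are free during [start, start + length)."""
    return max_busy(agenda, start, start + length) + n0 <= total


def start_after_next_end(agenda, total, n0, length, t0):
    """First reservation end (>= t0) at which R can start: this is t_end."""
    for end in sorted(e for s, e, n in agenda if e >= t0):
        if fits(agenda, total, n0, end, length):
            return end
    return None


def start_before_next_start(agenda, total, n0, length, now):
    """Start R exactly `length` seconds before the first suitable t_start."""
    for s in sorted(s for s, e, n in agenda if s >= now + length):
        if fits(agenda, total, n0, s - length, length):
            return s - length
    return None
```

A scheduler could evaluate the energy estimate for each candidate returned by these functions and present the results to the user, as described above.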

5.4 The resource release

Figure 13 presents the algorithm used for a reservation end: $M$ resources are freed. First of all, we compute the total real consumption of this reservation. We give this information to the user and we store it in the history for the prediction algorithms. Moreover, we compute the error we made when estimating the consumption of this reservation for the corresponding start time: this is the difference between the true value and the predicted one. We will use it in the next section to compute a feedback error in order to improve our estimation algorithms.


Figure 11: Algorithm for a reservation arrival

We need to define an imminent reservation: it is a reservation that will start in less than $T_s$ seconds from the present time. The idea is to compute $T_s$ as the minimum time which ensures an energy saving if we turn off the resource during this time. More precisely, we define $T_s$ so that if we turn off a resource during $T_s$ seconds, we save $E_s$ Joules. $E_s$ is a fixed energy: it is the minimum amount of energy that we consider worth saving. To this definition, we add a special time, denoted by $T_r$, which is related to the resource type. For example, frequently turning an Ethernet card on and off does not have the same consequences, in terms of hardware wear, as doing the same with a hard disk. The hard disk is indeed mechanical and supports only a limited number of start-ups. Thus we should not turn it off too often or too quickly. $T_r$ reflects this concern and differs from one resource to another. So, if we denote by $\delta_{tot} = \delta_{ON \to OFF} + \delta_{OFF \to ON}$ the total duration of the two transitions, $T_s$ is defined by:

$$T_s = \frac{E_s - P_{OFF}\,\delta_{tot} + E_{ON \to OFF} + E_{OFF \to ON}}{P_I - P_{OFF}} + T_r$$

As we can see, $T_s$ varies from one resource to another because it depends on $P_I$, $P_{OFF}$, $\delta_{ON \to OFF}$, $\delta_{OFF \to ON}$, $E_{ON \to OFF}$, $E_{OFF \to ON}$ and $T_r$, which all depend on the resource. We can fix $E_s = 10$ Joules, for example. Note that we should have $T_s - \delta_{tot} \ge 0$: we want to have at least enough time to turn the resource off and turn it on again. Now, we look for the freed resources that have an imminent reservation. We can see this in the agenda, as depicted in Figure 12. These resources are considered as busy and are left turned on (top part of Fig. 12). During this active watch, we lose less than $E_s$ Joules per resource, and then they are used again.
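The threshold $T_s$ can be computed with a short Python sketch. The function name `compute_ts` is hypothetical, and the transition energies used in the example (2000 J and 20000 J) are illustrative assumptions, not measured values; the other values ($E_s = 10$ J, $P_I = 190$ W, $P_{OFF} = 10$ W, $\delta_{tot} = 110$ s) are the ones quoted in this report.

```python
def compute_ts(e_s, p_idle, p_off, e_on_off, e_off_on, delta_tot, t_r=0.0):
    """Minimum off-time (seconds) for which switching off saves e_s Joules.

    Staying idle for T seconds costs p_idle * T. Switching off and on again
    costs e_on_off + e_off_on + p_off * (T - delta_tot). Setting the
    difference to e_s and solving for T gives the first term; t_r is the
    resource-specific safety time.
    """
    if p_idle <= p_off:
        raise ValueError("idle power must exceed off power")
    return (e_s - p_off * delta_tot + e_on_off + e_off_on) / (p_idle - p_off) + t_r


ts = compute_ts(e_s=10, p_idle=190, p_off=10,
                e_on_off=2000, e_off_on=20000, delta_tot=110)
assert ts - 110 >= 0  # there must be time to switch off and on again
```

With these assumed values, `ts` is a little over 116 seconds, so the constraint $T_s - \delta_{tot} \ge 0$ is satisfied.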

Figure 12: An agenda example which shows the role of Ts

We look for other awake resources: resources which are waiting for a previously estimated imminent reservation. For these $m$ awake resources ($M$ minus the previous busy ones, plus the other awake resources), we need to estimate when the next reservation will occur and how many resources it will take. We call this reservation $R_e = (l_e, n_e, t_e)$. We can now check whether $R_e$ is imminent. If it is not, all the remaining resources are turned off (as in the bottom part of Figure 12). If $R_e$ is imminent, we look for at most $\min(m, n_e)$ resources that can accept this potential reservation: they must be free at $t_e$ for at least $l_e$ seconds. We keep these resources awake during $T_s + T_c$ seconds and we turn off the other ones. $T_c$ is a fixed value that corresponds to the mean computation time of a reservation for the scheduler. It is the mean time between the user request and the acceptance of the reservation by the scheduler (it includes, among other things, the time to compute the energy estimations and a minimum time for the user to answer).

At the next reservation arrival, we compute the estimation errors we have made and we use them as feedback in our prediction algorithms. Moreover, if there are idle resources (which are turned on without any reservation) and if the reservation which has just arrived is not imminent, we turn off the idle resources.
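The release decision described above can be sketched in a few lines of Python. This is a simplification: it only counts resources and assumes that every awake resource is free at $t_e$ for $l_e$ seconds, which a real scheduler would have to check in the agenda. The function names are hypothetical.

```python
def is_imminent(start, now, ts):
    """A reservation is imminent if it starts in less than ts seconds."""
    return 0 <= start - now < ts


def resources_to_keep_awake(m, predicted_n, predicted_start, now, ts):
    """Number of the m awake resources kept on for the predicted R_e.

    If R_e is not imminent, everything is switched off (0 kept awake);
    otherwise at most min(m, n_e) resources stay on.
    """
    if not is_imminent(predicted_start, now, ts):
        return 0
    return min(m, predicted_n)


def keep_awake_duration(ts, tc):
    """Resources kept for R_e stay awake during T_s + T_c seconds."""
    return ts + tc
```

For example, with $T_s = 240$ s, five awake resources and a predicted reservation of three resources starting 100 s from now, three resources stay awake and two are switched off.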

Figure 13: Algorithm for a reservation end

5.5 Global mechanisms: selection of specific resources and load balancing

To summarize, we can decompose our approach into two steps:
1. during a submission: we compute the consumption predictions and we propose different solutions for the given reservation; then, when the user has made his choice, we register this reservation in the agenda;
2. after a reservation end: we fill the history, compute the feedback on the predictions that we have made and measure the real power consumption of this reservation (with the sensors).

Before a reservation which is written down in the agenda, if some resources are not turned on, the scheduler turns them on at least $\delta_{OFF \to ON}$ seconds before the beginning of the reservation. In this general infrastructure, we have not distinguished between the different resources for the sake of clarity: in fact, different resources have different associated consumptions and different components (Ethernet cards, hard disks, etc.). Moreover, instead of giving just a number of resources, the user can give a list of wished resources with the wanted characteristics.

We also add to our infrastructure a load balancing system which ensures that the scheduler will not always use the same resources. This load balancing system includes a topology model (the geographic position of the different resources). This mechanism allows us to distribute the reservations geographically: the reserved resources are not packed together, so less heat is produced. Thus, when a reservation is submitted, the scheduler does not allocate resources to it at random but by following this load balancing policy.

6 Predictions

The efficiency of EARI, compared to a simple algorithm where the resources are put into a sleep state as soon as they have nothing to do, resides in our ability to make accurate predictions: the estimation of the next reservation (length, number of resources and start time), the estimation of the energy consumed by a given reservation and the estimation of a slack period. But our prediction algorithm should remain sufficiently simple in terms of computation in order to be efficient and applicable during reservation scheduler run time.

6.1 Estimation of the next reservation

First of all, we consider the estimation of the next reservation $R_e = (l_e, n_e, t_e)$ for a given site. To estimate its start time, we take into account the day of the week, the hour of the day and the previous values of arrival times. This method, called method 1, assumes that the reservation start times for each day are similar to those of the two previous days and to those of the same day of the previous week. It is based on the similarity of the day load and on the daily cycle (daytime and night) per site. At a given time $t$, our estimated start time is the average of the start times of the reservations which come just after $t$ on the two previous days and on the same day one week before on this site, plus the feedback (defined in section 6.2):

$$t_e = \frac{1}{3}\left[t_{t,j-1} + t_{t,j-2} + t_{t,j-7}\right] + t_{feedback}$$

where $t_{t,j-i}$ is the start time of the reservation just after $t$ on this site for the day $j-i$, with $j$ standing for today. The estimations of the length and of the number of resources required by the next reservation are done in a similar way. We keep the three reservations used in the previous calculation and call them $R_a = (l_a, n_a, t_{t,j-1})$, $R_b = (l_b, n_b, t_{t,j-2})$ and $R_c = (l_c, n_c, t_{t,j-7})$. So we have:

$$n_e = \frac{1}{3}\left[n_a + n_b + n_c\right] + n_{feedback}$$

$$l_e = \frac{1}{3}\left[l_a + l_b + l_c\right] + l_{feedback}$$

If we do not observe this day similarity, we use method 2. This method is based on the similarity, in terms of start time, between close reservations on a site. The basic idea is to calculate the average of the characteristics of the five previous reservations on this site. At a given time $t$, we denote by $R_0, \ldots, R_5$ the six previous reservations on this site (with $R_i = (l_i, n_i, t_i)$). They are the six reservations on this site whose start times are the nearest to $t$ (but not necessarily before $t$: scheduled reservations can be taken into account). These reservations are sorted by increasing start time ($R_0$ is the oldest). The start time is then estimated by calculating the average of the five intervals between the six previous start times. This average is added to $t$, together with the feedback, to obtain the estimation:

$$t_e = t + \frac{1}{5}\left[t_5 - t_0\right] + t_{feedback}$$

We define the estimations of the length and of the number of resources required by the next reservation similarly. In both methods, if we obtain $t_e < t$ (because of the feedback), we set $t_e = t + 1$.

The choice between method 1 and method 2 should be made according to the site usage: we should compare their performance on a given site to say which one is better for this platform. The accuracy of the next reservation prediction is crucial for our power management. If we make too many wrong estimations, either resources wait for imminent reservations that do not arrive and so waste energy, or they are turned off just before an imminent reservation arrives and so waste the energy of one shutdown plus one reboot per resource.
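The two estimation methods can be sketched in Python as follows. Reservations are represented as `(l, n, t)` tuples, and the function names are hypothetical. For method 1, the sketch assumes that the three start times have already been mapped onto today's time axis (for instance as a time of day added to today's midnight), a detail the text leaves open. For method 2, only the start time is sketched.

```python
def clamp(te, now):
    """If the feedback pushes the estimate before now, use now + 1."""
    return te if te >= now else now + 1


def method1(ra, rb, rc, now, feedback=(0.0, 0.0, 0.0)):
    """Average the reservations seen 1, 2 and 7 days ago; returns (le, ne, te)."""
    l_fb, n_fb, t_fb = feedback
    le = (ra[0] + rb[0] + rc[0]) / 3 + l_fb
    ne = (ra[1] + rb[1] + rc[1]) / 3 + n_fb
    te = (ra[2] + rb[2] + rc[2]) / 3 + t_fb
    return le, ne, clamp(te, now)


def method2_start(now, starts, t_feedback=0.0):
    """Add the mean of the five intervals between six start times to now."""
    if len(starts) != 6:
        raise ValueError("method 2 uses the six nearest reservations")
    ordered = sorted(starts)
    te = now + (ordered[5] - ordered[0]) / 5 + t_feedback
    return clamp(te, now)
```

Note that the mean of the five consecutive intervals telescopes to $(t_5 - t_0)/5$, which is why only the oldest and the newest start times appear in the formula.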

6.2 Feedback on the next reservation estimation

The feedback is used to improve the energy efficiency of our approach. As we have seen before, estimation errors are really penalizing in terms of consumed energy, so we need accurate predictions. Moreover, we have observed in the Grid'5000 traces that there are fewer reservations during the night and more during the morning. So, early in the morning for example, method 2 will certainly have some difficulties in predicting the next reservation start time. To limit the effects of such errors, we use a feedback mechanism. The feedback is a corrective factor calculated from the three previous errors (more can be used).

At each reservation arrival, we compute the estimation errors we have made. More precisely, at a given time $t$, the reservation $R = (l_0, n_0, t_0)$ arrives, and $R_e = (l_e, n_e, t_e)$ is the last reservation that we have predicted. We denote by $Err_l = l_0 - l_e$ the error made when estimating the length of the next reservation, and by $Err_n = n_0 - n_e$ and $Err_t = t_0 - t_e$ the errors made when estimating the number of resources and the start time of the next reservation, respectively. Basically, if we predict the reservation too early, then we have $Err_t > 0$. So, if we add $Err_t$ to the next predicted start time, we delay the predicted start time by $Err_t$ seconds, and that is exactly what we want to do. We denote by $Err_l(a)$, $Err_l(b)$ and $Err_l(c)$ the last three errors for the length of a reservation. $n_{feedback}$ and $t_{feedback}$ are defined similarly to $l_{feedback}$:

$$l_{feedback} = \frac{1}{3}\left[Err_l(a) + Err_l(b) + Err_l(c)\right]$$
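A possible way to maintain this corrective factor is a sliding window over the last three errors, as in the Python sketch below. The class name `Feedback` is hypothetical. One assumption goes beyond the text: while fewer than three errors are available, the sketch averages those that exist (and returns 0 when there are none).

```python
from collections import deque


class Feedback:
    """Mean of the last `window` errors (actual value minus predicted value)."""

    def __init__(self, window=3):
        self.errors = deque(maxlen=window)  # oldest error is dropped automatically

    def record(self, actual, predicted):
        self.errors.append(actual - predicted)

    def value(self):
        # Assumption: average the available errors when fewer than `window`.
        return sum(self.errors) / len(self.errors) if self.errors else 0.0


# One instance per estimated quantity: length, number of resources, start time.
l_feedback = Feedback()
l_feedback.record(actual=110, predicted=100)  # Err_l = +10
l_feedback.record(actual=90, predicted=100)   # Err_l = -10
l_feedback.record(actual=130, predicted=100)  # Err_l = +30
```

After these three arrivals, `l_feedback.value()` is the mean of the three errors, which is then added to the next length estimate.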


6.3 Energy consumption estimation of a reservation

This estimation takes into account the user, the resource type and all the characteristics of the reservation $R = (l, n, t)$. The assumption made here is that a given user uses the resources in almost the same way from one reservation to the next. What we actually estimate is the average power during working time per resource, for each type of resource (different architectures, for example). The idea is to use the last three reservations made by the user concerned to compute these average values. The estimation is then obtained by multiplying the number of resources of each type by the average power just calculated for that type, and summing these values over all the types. Finally, we add the booting and shutdown energy required if some nodes have to be turned on and off.

We can now see why the estimations made in the algorithm of Figure 11 differ: they differ in the number of resources to turn on and off, and also in the resources used. Indeed, at different times, different resources of different types (different node architectures, for example) are available, so even the working consumptions are different. The feedback is, as defined previously, the mean error made per resource in the consumption estimations of the last three reservations.

This estimation algorithm could be improved by adding parameters such as the number of resources working during this reservation or the temperature of the resources. But these improvements are not easy, because the effects of such factors (like the temperature) on the consumption of a given resource are neither direct nor well defined. One possible improvement could come from the user: when submitting a reservation, he could specify what type of tasks he will predominantly execute (network communications, parallel computations, benchmark launching, etc.). One consumption model could then be designed per application type (and per resource type) and used together with the other parameters (the user, the resource type and the reservation characteristics) to estimate the energy consumption. This estimation relies on the consumption measurements: with such an approach, we need to know the average consumption of each reservation on each type of resource.
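The per-type estimation described in this section can be sketched in Python as below. The function name, the dictionary layout and the numerical values are hypothetical. Two points are assumptions of this sketch: the per-type average powers are multiplied by the reservation length to obtain Joules, and the feedback is applied as an energy correction per resource.

```python
def estimate_reservation_energy(length_s, count_by_type, avg_power_by_type,
                                n_boot=0, n_shutdown=0,
                                e_off_on=0.0, e_on_off=0.0,
                                feedback_per_resource=0.0):
    """Estimated energy in Joules of a reservation for one candidate start time.

    count_by_type:     resources of each type allocated to the reservation
    avg_power_by_type: average working power (W) per type, computed from the
                       user's last three reservations
    n_boot/n_shutdown: resources that must be switched on / off for this start
    """
    working = length_s * sum(count * avg_power_by_type[kind]
                             for kind, count in count_by_type.items())
    transitions = n_boot * e_off_on + n_shutdown * e_on_off
    correction = feedback_per_resource * sum(count_by_type.values())
    return working + transitions + correction


energy = estimate_reservation_energy(
    1000, {"type_a": 2, "type_b": 1}, {"type_a": 200.0, "type_b": 100.0},
    n_boot=1, n_shutdown=1, e_off_on=20000.0, e_on_off=2000.0)
```

Two candidate start times for the same reservation differ only in the allocated types and in `n_boot` and `n_shutdown`, which is what makes the "glued" placements cheaper.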

6.4 Slack periods

A slack period is a period longer than two hours during which the usage percentage of the platform is below 50%. Typically, such periods happen during the night. We take two hours because it is just a bit longer than the average length of a reservation on Grid'5000 (see section 3.2), so many reservations are short enough to take place during such a period. To estimate when the next slack period will occur, we use the values of the three previous days (the real values are known at that time). If there was no slack period during the three previous days, we estimate that there will be no slack period that day.

To be really complete, EARI should include the energy consumption of the cooling infrastructure, distributed proportionally over the resources. In fact, $P_{real}(type_k, R_a)$ (the real average power for the reservation $R_a$ for a resource of type $type_k$) would include a fraction of the average power consumed by the cooling infrastructure during the reservation $R_a$, proportional to the heat production of the resource. However, such a fraction is really hard to estimate, which is why most power management systems do not take the cooling infrastructure into account.
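The slack-period definition given in this section can be turned into a small detector over a usage trace. The Python sketch below assumes that usage is sampled at a regular interval (`step_s` seconds) as a percentage of the platform; the function name and the sample values are hypothetical. It accepts periods of at least two hours, in line with the condition used in section 5.2.

```python
def slack_periods(samples, step_s, min_len_s=2 * 3600, threshold=50.0):
    """Return (start, end) offsets in seconds of the slack periods.

    samples: platform usage in percent, one value every step_s seconds.
    """
    periods, run_start = [], None
    # A sentinel at the threshold closes a run that reaches the end of the trace.
    for i, usage in enumerate(list(samples) + [threshold]):
        if usage < threshold and run_start is None:
            run_start = i
        elif usage >= threshold and run_start is not None:
            if (i - run_start) * step_s >= min_len_s:
                periods.append((run_start * step_s, i * step_s))
            run_start = None
    return periods


# Half-hourly usage samples: one two-hour run under 50%, then a one-hour run.
usage = [80, 40, 30, 20, 45, 90, 10, 10, 60]
found = slack_periods(usage, step_s=1800)
```

Applying this detector to the traces of the three previous days gives the values from which the next slack period is estimated; how those three days are combined is not specified here.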

7 Experimental validation of EARI

7.1 Energy-aware computing reservation

7.1.1 The replay principle

To evaluate EARI, we conduct experiments based on a replay of the 2007 traces previously analyzed (see section 3). In a first step, we do not move the reservations: we always respect the reservation characteristics given by the user, so that we can fully test our prediction algorithm. In a second step, we move the reservations in time according to several policies. The results show that EARI saves energy. In the presented results, we have only implemented method 2 for the estimations.

7.1.2 Without moving the reservations: validation of the prediction algorithm

Here, we take the example of Bordeaux for the year 2007 (Fig. 14 and Fig. 15). Figure 14 shows the percentages of energy consumption of EARI with prediction (we use method 2 described in section 6.1) and of EARI without prediction, where $T_s$ varies from 120 seconds to 420 seconds and $P_{idle}$ (the power consumed by a single resource when it is idle: on but not working) is 100, 145 or 190 Watts. These percentages are given relative to the energy consumption of EARI when the future is known: in that ideal and unreachable case, we do not need any prediction algorithm because we always know when to turn resources on and off. This is the theoretical lower bound. Based on the results of Fig. 8, we set $P_{work} = 216$ Watts, $P_{OFF} = 10$ Watts and $\delta_{ON \to OFF} + \delta_{OFF \to ON} = 110$ seconds. These values are slightly higher than the mean values shown in Figure 8; this is the worst case for EARI.


Figure 14: Percentage of energy consumption by using EARI in relation to the energy consumed by knowing the future

Based on the results of Fig. 8, we can set $P_{idle}$ to 190 Watts, but we make $P_{idle}$ vary to simulate the future capacity of EARI to shut down resource components (like a core or a disk of a node, for example). In the same way, we make $T_s$ vary to simulate the future possibility of using hibernate modes ($T_s$ is at least equal to the time needed to boot a resource plus the time to shut it down, so using suspend-to-disk or suspend-to-RAM mechanisms will decrease $T_s$). These percentages are relative to the target energy: the energy we would consume if we knew the future (as if our prediction algorithm always made perfect predictions), so it is the lower limit we try to approach.

We see that EARI with prediction is better than without prediction in all cases. However, there is still room to improve our prediction algorithm in order to get closer to the target case. Figure 15 shows the impact of surprise reservations for EARI with and without prediction. Surprise reservations are reservations that arrive less than $T_s$ seconds after we have turned off some resources (which could instead have been used for these arriving reservations). This is the reason why we are not closer to the known-future case in Figure 14. As expected, EARI with prediction has better results because it tries to predict such reservations, but it is not possible to achieve this goal perfectly (the future is not known!).

Figure 15: Percentage of surprise reservations in relation to total reservation number

7.1.3 By moving the reservations: validation of our green policy

Now, we evaluate the complete EARI (still with method 2 for predictions) by allowing our simulator to move the jobs to a better date. We design six policies to conduct our experiments:
- user: we always select the solution that best fits the user's demand (we select the date asked by the user or the nearest possible date);
- green: we always select the solution that saves the most energy (where we need to boot and shut down the smallest number of resources);
- green-percentage-25: we treat 25% of the submissions, taken at random, with the green policy and the remaining ones with the user policy;
- green-percentage-50: we treat 50% of the submissions, taken at random, with the green policy and the others with the user policy;
- green-percentage-75: we treat 75% of the submissions, taken at random, with the green policy and the others with the user policy;
- deadlined: we use the green policy if it does not delay the reservation by more than 24 hours with respect to the user's initial demand; otherwise we use the user policy.

These policies simulate the behavior of real users: there is a percentage of green users who follow the advice given by EARI. Some of them may not want to delay their reservation for too long, as in the deadlined policy. And some users do not want to move their reservation even if they could save energy by doing so: that is the user policy. The green policy illustrates the case of an administrator decision: the administrator always chooses the most energy-efficient option.

Figure 16 shows the consumption with EARI for the different policies on the Bordeaux traces. We set $T_s = 240$ seconds, and the energy is expressed as a percentage of the consumption when all the nodes are always powered on (which is the present situation). We take the example of $T_s = 240$ seconds, but we have also run these experiments for $T_s$ values of 120, 180, 300, 360 and 420 seconds.
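The six policies listed above can be sketched as a single selection function in Python. Each candidate solution is represented as a `(start_time, estimated_energy)` pair; this representation, the function name `choose` and the use of the estimated energy to rank the green option are assumptions of this sketch.

```python
import random


def choose(policy, requested_start, options, rng=random):
    """Pick one (start_time, estimated_energy) option according to a policy."""
    user = min(options, key=lambda o: abs(o[0] - requested_start))
    green = min(options, key=lambda o: o[1])
    if policy == "user":
        return user
    if policy == "green":
        return green
    if policy == "deadlined":
        # Follow the green advice only if the delay stays within 24 hours.
        return green if green[0] - requested_start <= 24 * 3600 else user
    if policy.startswith("green-percentage-"):
        share = int(policy.rsplit("-", 1)[1]) / 100
        return green if rng.random() < share else user
    raise ValueError(f"unknown policy: {policy}")


# Three candidate solutions for a reservation requested at time 0.
options = [(0, 500.0), (3600, 450.0), (200000, 300.0)]
```

With these candidates, the green policy picks the start at 200000 s, while the deadlined policy falls back to the requested date because the green option is more than 24 hours away.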

Figure 16: Consumption for Bordeaux with Ts = 240 s. compared with the consumption when all nodes are always on.

This figure shows that our green policy is more energy-efficient than the user one by about 5 percent, which represents about one tenth of the energy used by the nodes when they are working. The green policy shown in Figure 16 has moved about 95% of the reservations by an average time of 200 hours per reservation, so it is not really well adapted to the users' demands. The green-percentage-50 policy has moved about 60% of the reservations by an average time of 25 hours, so it is close to the deadlined policy. The all-glued line shows the theoretical lower bound: the case where we could glue all the reservations together without idle periods between them.

7.1.4 How much we can save in kWh: the Lyon case study

We come back here to the example of Lyon (see Fig. 17). This is indeed the site where we have conducted the energy consumption measurements (see Section 4.2), so this is the one where we can calculate the energy gains as if they were measured by sensors (each site has indeed different node architectures which lead to different energy consumptions per node).

Figure 17: Consumption for Lyon with Ts = 240 s. compared with the consumption when all the nodes are always on.


Figure 18 shows the energy consumption of the Lyon site in kilowatt hours (kWh) for the whole year 2007 in the actual case (without energy saving policy), with our user policy, with our green-percentage-50 policy, with our green policy, and in the all-glued case. These values are those represented in percentage in Fig. 17.

Pidle (W)   present state   user     50% of green   fully green   all glued
100         135500          101500   100000         98300         97300
145         154000          101700   100300         98500         97300
190         172500          102000   100800         98700         97300

Figure 18: Energy consumption in kWh for Lyon with Ts = 240 s. for several policies for 2007

We can notice that the last value is the same on the three lines. This is normal, because the all-glued consumption represents the case where all the reservations are packed together without any idle period between them. This is the most energy-efficient reservation distribution, but it is not really achievable. Furthermore, we notice that our green policy is near the all-glued case (which is the optimal solution in terms of energy) and that the three values for this policy are close to each other: this is normal because we try to reduce the useless idle periods as much as possible. Compared with the present situation (where all the nodes are always powered on and consume 190 Watts when they are idle), we could save 73800 kWh for Lyon alone; this represents the consumption of a TGV travelling about 5000 km (footnote 12).

During these experiments, we have reserved Grid'5000 nodes for 1391 days and 9 hours (footnote 13), and fully used them during these hours, to analyze the traces and to conduct the replay simulations. We have written about 7800 lines of code (Perl, C and bash scripts, including the code to interact with the watt-meter and the two parsers we had to write for the trace analysis because OAR switched from version 1.6 to 2.0 and finally to 2.2 during these experiments, with modifications of the trace format).

We have not yet taken into account the problem of network presence. Indeed, we turn off nodes that we can wake up through their Ethernet cards, but the system monitoring tools (like Monika, Ganglia or Nagios for Grid'5000) need periodic answers from these nodes. So if we turn them off, these tools will believe that the nodes are dead although this is not the case. This problem can be solved by proxying techniques such as those seen in Section 2.2: the Ethernet cards can be configured to answer such basic requests.
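The figures quoted in this case study can be cross-checked with simple arithmetic, using the values of Figure 18 for $P_{idle} = 190$ W, the TGV consumption of 14.79 kWh per km and $P_{work} = 216$ W given in the footnotes. The symbols $E_{saved}$, $d_{TGV}$ and $E_{exp}$ are introduced here for this check only.

$$
\begin{aligned}
E_{saved} &= 172500 - 98700 = 73800\ \text{kWh} \\
d_{TGV} &= \frac{73800\ \text{kWh}}{14.79\ \text{kWh/km}} \approx 4990\ \text{km} \approx 5000\ \text{km} \\
E_{exp} &= (1391 \times 24 + 9)\ \text{h} \times 0.216\ \text{kW} = 33393 \times 0.216 \approx 7213\ \text{kWh} \approx 7200\ \text{kWh}
\end{aligned}
$$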

7.2 Energy-aware data transfer reservation

We now focus on the data transfer reservation issue. As shown in Figure 21, the goal is to send data between the users, but they all share the same links and routers. The users submit reservations with a deadline to the data transfer service, which allocates to them a portion of the bandwidth during a certain period of time for each link they should use (see Fig. 19 and Fig. 20 for two examples of a link agenda).

Figure 19: Traffic on a link scheduled by jBDTS

Figure 20: Traffic on a link scheduled by EARI: a link agenda

The data transfer service should be able to guarantee that each reservation is respected in terms of bandwidth and time, and that the reservations are well coordinated between the links for a given data transfer. In this context, the resources that we can switch on and off are the links, the router ports and the routers themselves. For example, in Figure 21, we can shut down the router at the bottom if it is not used.
12 The consumption of a TGV is about 14.79 kWh per km (http://fr.wikipedia.org/wiki/Deplacement_a_grande_vitesse).
13 This time is computed by Kaspied, a Grid'5000 analysis tool for the user utilization. This represents an energy consumption of about 7200 kWh if we take $P_{work} = 216$ W.


We adapt an existing framework called BDTS (footnote 14) to conduct our experiments. BDTS controls the coordination of resource allocation among the end points for data transfers. The transfer requests are modeled as transfer jobs. They must first be accepted by the admission control of BDTS; they are then scheduled, and BDTS controls the transfers transparently. To each transfer it associates a profile: a step function that expresses a variable bandwidth over time. In this work, the authors use a strategy which leads to a staggering of the data: at each time, they want to leave as much bandwidth free as possible so that future transfers can be scheduled at the same time (see Figure 19).

We deploy a different link usage strategy: at each time, our green policy wants to use as much bandwidth as possible in order to lengthen the idle periods (see Figure 20), during which we are able to shut down the resources (Ethernet cards, ports and routers) that are not used. We only leave a safety margin of the bandwidth free for the acknowledgments and other control packets. This represents our green policy in that case. Another policy could take more into account the profile given by the user.

Figure 21: Architecture of EARI on an example of network

Our strategy provides a different quality of service from the current BDTS strategy. Indeed, the BDTS strategy leaves space for the other reservations in a different way from our own, so we do not accept the same reservations. Figure 21 shows the architecture of our infrastructure in this context. We are currently adapting EARI to jBDTS.

8 Conclusion and future work

This report presents a first step of our work, whose goal is to better understand the usage of large-scale distributed systems and to propose methods and energy-aware tools to reduce the energy consumption of such systems. Our analysis has provided instructive results about the utilization of an experimental grid, using the example of Grid'5000 [16]. Next, we have proposed an energy-aware reservation infrastructure to reduce the global consumption of large-scale distributed systems and high-performance data transport. This infrastructure is efficient and can be implemented and deployed easily.

We have presented our first results, which validate our energy-aware reservation infrastructure. We have seen that with EARI (Energy-Aware Reservation Infrastructure) and our green policy, the Lyon site could have consumed 98700 kWh during the year 2007, without including the air cooling and the network equipment. We could have saved the energy consumption of 5000 km of TGV travel (footnote 15). Based on these figures, we have made a rough estimation for the whole of Grid'5000: our green policy could have saved 1209159 kWh for 2007, which corresponds to about twice around the world by TGV.

We are currently working on tools, portals and frameworks that present these results in real time to the users and to the grid middleware. We are working on such a tool, which we will integrate into the Grid'5000 website. We plan to make further experiments to fully validate our infrastructure and to enhance our prediction algorithm; this work will include the implementation of method 1. We also plan to run the same experiments with the whole grid traces, including the grid reservation constraints (as defined in Section 3.2, on several sites at the same time). We will study the possibility of moving reservations from one site to another (according to external temperature parameters, for example).

Our long-term goal is to incorporate virtualization and DVFS techniques in our infrastructure, with the objective of saving more energy without impacting performance. Virtualization could also solve the problem of ensuring network presence and answering basic requests from the monitoring tools of large-scale distributed systems, or of dealing with high-performance data transport systems.
14 jBDTS is a service for Bulk Data Transfer Scheduling developed in Java by the RESO team (http://perso.ens-lyon.fr/marcelo.pasin/RESO/jBDTS.html).
15 The consumption of a TGV is about 14.79 kWh per km (http://fr.wikipedia.org/wiki/Deplacement_a_grande_vitesse).


References
[1] Nicolas Capit, Georges Da Costa, Yiannis Georgiou, Guillaume Huard, Cyrille Martin, Grégory Mounié, Pierre Neyron, and Olivier Richard. A batch scheduler with high level components. In Cluster computing and Grid 2005 (CCGrid05), 2005.
[2] F. Cappello et al. Grid5000: A Large Scale, Reconfigurable, Controlable and Monitorable Grid Platform. In 6th IEEE/ACM International Workshop on Grid Computing, Grid2005, Seattle, Washington, USA, Nov. 2005.
[3] Jeff Chase and Ron Doyle. Balance of Power: Energy Management for Server Clusters. In Proceedings of the Eighth Workshop on Hot Topics in Operating Systems (HotOS01), May 2001.
[4] Jeffrey S. Chase, Darrell C. Anderson, Prachi N. Thakar, Amin M. Vahdat, and Ronald P. Doyle. Managing energy and server resources in hosting centers. In SOSP '01: Proceedings of the eighteenth ACM symposium on Operating systems principles, pages 103-116, New York, NY, USA, 2001. ACM.
[5] Xiaobo Fan, Wolf-Dietrich Weber, and Luiz Andre Barroso. Power provisioning for a warehouse-sized computer. In ISCA '07: Proceedings of the 34th annual international symposium on Computer architecture, pages 13-23, New York, NY, USA, 2007. ACM.
[6] Chamara Gunaratne, Ken Christensen, and Bruce Nordman. Managing energy consumption costs in desktop PCs and LAN switches with proxying, split TCP connections, and scaling of link speed. International Journal of Network Management, 15(5):297-310, 2005.
[7] Chamara Gunaratne and Stephen Suen. Ethernet Adaptive Link Rate (ALR): Analysis of a Buffer Threshold Policy. In Global Telecommunications Conference, 2006. GLOBECOM '06, New York, NY, USA, 2006. ACM.
[8] M. Gupta and S. Singh. Dynamic Ethernet Link Shutdown for Energy Conservation on Ethernet Links. Communications, 2007. ICC '07. IEEE International Conference on, pages 6156-6161, June 2007.
[9] Maruti Gupta, Satyajit Grover, and Suresh Singh. A Feasibility Study for Power Management in LAN Switches. In ICNP '04: Proceedings of the Network Protocols, 12th IEEE International Conference, pages 361-371, Washington, DC, USA, 2004. IEEE Computer Society.
[10] Maruti Gupta and Suresh Singh. Greening of the Internet. In SIGCOMM '03: Proceedings of the 2003 conference on Applications, technologies, architectures, and protocols for computer communications, pages 19-26, New York, NY, USA, 2003. ACM.
[11] Maruti Gupta and Suresh Singh. Using Low-Power Modes for Energy Conservation in Ethernet LANs. In INFOCOM 2007. 26th IEEE International Conference on Computer Communications. IEEE, pages 2451-2455, Washington, DC, USA, May 2007. IEEE Computer Society.
[12] Fabien Hermenier, Nicolas Loriant, and Jean-Marc Menaud. Power Management in Grid Computing with Xen. In Proceedings of 2006 on XEN in HPC Cluster and Grid Computing Environments (XHPC06), number 4331 in LNCS, pages 407-416, Sorrento, Italy, 2006. Springer Verlag.
[13] Y. Hotta, M. Sato, H. Kimura, S. Matsuoka, T. Boku, and D. Takahashi. Profile-based optimization of power performance by using dynamic voltage scaling on a PC cluster. In Parallel and Distributed Processing Symposium, 2006. IPDPS 2006. 20th International, April 2006.
[14] A. Iosup, C. Dumitrescu, D. Epema, Hui Li, and L. Wolters. How are Real Grids Used? The Analysis of Four Grid Traces and Its Implications. In 7th IEEE/ACM International Conference on Grid Computing, September 2006.
[15] Ravindra Jejurikar and Rajesh Gupta. Energy aware task scheduling with task synchronization for embedded realtime systems. In Computer-Aided Design of Integrated Circuits and Systems, IEEE Transactions on, pages 1024-1037. IEEE, June 2006.
[16] Laurent Lefèvre, Jean-Patrick Gelas, and Anne-Cécile Orgerie. How an Experimental Grid is Used: the Grid5000 Case and its Impact on Energy Usage, 2008. Poster in Cluster computing and Grid 2008 (CCGrid08), available at http://www.ens-lyon.fr/LIP/RESO/energy_grid/.
[17] Andreas Merkel and Frank Bellosa. Balancing power consumption in multiprocessor systems. SIGOPS Operating Systems Review, 40(4):403-414, 2006.
[18] C. Patel, R. Sharma, C. Bash, and S. Graupner. Energy Aware Grid: Global Workload Placement based on Energy Efficiency. Technical report, HP Laboratories, November 2002.
[19] Ratnesh K. Sharma, Cullen E. Bash, Chandrakant D. Patel, Richard J. Friedrich, and Jeffrey S. Chase. Balance of Power: Dynamic Thermal Management for Internet Data Centers. IEEE Internet Computing, 9(1):42-49, 2005.


Example of a reservation log provided by OAR

Here is an example of a reservation log provided by OAR:

    <item key="30644">
      <hashref memory_address="0x116c8b0">
        <item key="command"></item>
        <item key="job_type">INTERACTIVE</item>
        <item key="launching_directory">/home/lyon/aorgerie</item>
        <item key="limit_stop_time">1167734239</item>
        <item key="properties"></item>
        <item key="queue_name">default</item>
        <item key="resources">
          <arrayref memory_address="0x116c880">
            <item key="0">128</item>
            <item key="1">129</item>
          </arrayref>
        </item>
        <item key="start_time">1167730639</item>
        <item key="state">Terminated</item>
        <item key="stop_time">1167730797</item>
        <item key="submission_time">1167730638</item>
        <item key="user">aorgerie</item>
        <item key="walltime">3600</item>
      </hashref>
    </item>

In this example, the user aorgerie submits a reservation at time 1167730638, which starts at 1167730639 and finishes at 1167730797 (all dates are given in seconds since 1 January 1970). The reservation covers 2 resources, numbers 128 and 129.

An exhaustive presentation of these experiments can be found at http://perso.ens-lyon.fr/annececile.orgerie/resultats_grid5000.pdf.
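As an illustration of how such a log entry can be exploited, the following sketch parses the example above with Python's standard xml.etree.ElementTree module and derives how long the reservation actually ran compared with the requested walltime. This is not the tooling used for the analysis in this report; the function name parse_reservation and the layout of the returned dictionary are hypothetical choices made for this example.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone

# The reservation log entry shown above, copied verbatim.
LOG = """
<item key="30644"> <hashref memory_address="0x116c8b0">
<item key="command"></item> <item key="job_type">INTERACTIVE</item>
<item key="launching_directory">/home/lyon/aorgerie</item>
<item key="limit_stop_time">1167734239</item>
<item key="properties"></item> <item key="queue_name">default</item>
<item key="resources"> <arrayref memory_address="0x116c880">
<item key="0">128</item> <item key="1">129</item> </arrayref> </item>
<item key="start_time">1167730639</item>
<item key="state">Terminated</item>
<item key="stop_time">1167730797</item>
<item key="submission_time">1167730638</item>
<item key="user">aorgerie</item> <item key="walltime">3600</item>
</hashref> </item>
"""


def parse_reservation(xml_text: str) -> dict:
    """Hypothetical helper: turn one OAR log entry into a flat dictionary."""
    root = ET.fromstring(xml_text.strip())
    fields = {}
    resources = []
    for item in root.find("hashref").findall("item"):
        array = item.find("arrayref")
        if array is not None:
            # The resource list is a nested array of resource numbers.
            resources = [int(r.text) for r in array.findall("item")]
        else:
            # Empty elements (command, properties) become empty strings.
            fields[item.get("key")] = item.text or ""
    start = int(fields["start_time"])
    stop = int(fields["stop_time"])
    return {
        "job_id": int(root.get("key")),
        "user": fields["user"],
        "state": fields["state"],
        "resources": resources,
        "start": datetime.fromtimestamp(start, tz=timezone.utc),
        "used_seconds": stop - start,
        "walltime_seconds": int(fields["walltime"]),
    }


reservation = parse_reservation(LOG)
print(reservation["user"], reservation["resources"])
print(reservation["used_seconds"], "s used of",
      reservation["walltime_seconds"], "s reserved")
```

For this entry, the reservation ran for 158 seconds of the 3600 seconds requested. The gap between the time reserved and the time actually used is the kind of information that an analysis of the usage traces relies on.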
