
JOURNAL OF COMPUTING, VOLUME 3, ISSUE 4, APRIL 2011, ISSN 2151-9617
HTTPS://SITES.GOOGLE.COM/SITE/JOURNALOFCOMPUTING/
WWW.JOURNALOFCOMPUTING.ORG

Strategies of Domain Decomposition to Partition Mesh-Based Applications onto Computational Grids

Beatriz Otero and Marisa Gil

Abstract— In this paper, we evaluate domain decomposition strategies for solving mesh-based applications in Grid environments. We compare the balanced distribution strategy with unbalanced distribution strategies. While the former is a common strategy in homogeneous computing environments (e.g. parallel computers), it presents some problems due to communication latency in Grid environments. Unbalanced decomposition strategies consist of assigning less workload to the processors responsible for sending updates outside the host. The results obtained in Grid environments show that unbalanced distribution strategies improve the expected execution time of mesh-based applications by up to 53%. However, this is not true when the number of processors devoted to communication exceeds the number of processors devoted to calculation in the host. To solve this problem we propose a new unbalanced distribution strategy that improves the expected execution time by up to 43%. We analyze the influence of the communication patterns on execution times using the Dimemas simulator.

Index Terms— Domain decomposition methods, load balancing algorithms, parallelism and concurrency, simulation.

——————————  ——————————

1 INTRODUCTION

A domain decomposition strategy is used for the efficient parallel execution of mesh-based applications. These applications are widely used in disciplines such as engineering, structural mechanics and fluid dynamics, and they require high computational capabilities [1]. Computational Grids are emerging as a new infrastructure for high performance computing. Clusters of workstations belonging to multiple institutions can be used to solve PDEs efficiently in parallel, provided that the problem size and the number of processors are chosen to maintain sufficiently coarse-grained parallelism. A workstation can be a single computer or a group of computers; hereafter we refer to both as a host.

Our focus is on simulations that use finite element analysis to solve the problems that arise from the discretization of PDEs on meshes. The general algorithmic structure of these explicit simulations is composed of two nested loops. The outer loop advances the discretization of the PDE in simulation time. The inner loop applies this discretization to all finite elements of the mesh and performs a matrix-vector product. This numerical operation represents between 80% and 90% of the total iteration time, so an efficient parallelization of this calculation can significantly improve the total simulation time. To this end, we use domain decomposition techniques, where the matrix and the vector are decomposed into sub-domains of data, each mapped onto one processor. Each sub-domain interacts with the others through the boundary elements.

In order to obtain optimal performance of mesh-based applications in Grid environments, a suitable partitioning method should take into account several features, such as the characteristics of the processors, the amount of traffic in the network, and the latency and bandwidth between processors, both inside a host and between hosts. Most partitioners do not have this capacity, and therefore they do not produce good results when the network and the processors in the Grid are heterogeneous.

In this paper, we evaluate mesh-based applications in Grid environments using a domain decomposition technique. The main objective of this study is to improve the execution time of mesh-based applications in Grid environments by overlapping remote communications with useful computation. To achieve this, we propose new domain decomposition strategies for partitioning mesh-based applications on computational Grids, where the workload can vary depending on the characteristics of the processors and of the network. The successful deployment of parallel mesh-based applications in a Grid environment requires efficient mesh partitioning. We use the Dimemas tool to simulate the behavior of the distributed applications in Grid environments [2].

This work is organized as follows. Section 2 briefly discusses the work related to this study. Section 3 describes the applications and the algorithmic structure of the explicit simulations. Section 4 defines the Grid topologies considered and describes the tool used to simulate the Grid environment. Section 5 deals with the mesh-based applications studied and the workload assignment patterns. Section 6 shows the results obtained in the specified environments for the three data distribution patterns. Section 7 presents the new unbalanced distribution, which solves the problems of the unbalanced distributions proposed before. Finally, the conclusions of the work are presented in Section 8.

————————————————
• B. Otero is with the Departament d'Arquitectura de Computadors, Universitat Politècnica de Catalunya, 08034, Spain.
• M. Gil is with the Departament d'Arquitectura de Computadors, Universitat Politècnica de Catalunya, 08034, Spain.

2 RELATED WORK

We can distinguish between two types of related work: one based on partitioning methods and the other based on load balancing. As mentioned above, the success of parallel mesh-based applications in Grid environments depends on efficient mesh partitioning. Several works have already proposed partitioners for computational Grids; PART, JOSTLE, SCOTCH, MinEX, PaGrid and METIS are some examples.

PART [3] uses computationally intensive algorithms and simulated annealing, so it requires a parallel implementation to obtain good performance. JOSTLE [4] produces data partitions without taking into account the communication cost of each processor. SCOTCH [5] has the same limitation as JOSTLE, because it generates partitions assuming a homogeneous interprocessor communication cost. MinEX [6] produces partitions without taking the application granularity into account. PaGrid [7] uses techniques already applied by other partitioners but adds a stage that balances the estimated execution time; it produces partitions comparable to those of JOSTLE and attempts some improvement by minimizing the estimated execution time. Finally, METIS [8] is based on a multilevel recursive bisection algorithm.

All of these approaches consider the estimated execution time rather than the communication cost to measure the performance of a mesh partitioner. However, minimizing the communication between hosts is fundamental in computational Grids to reduce the execution time.

As regards workload, there are some works dealing with the relationship between architecture and domain decomposition algorithms [9]. There are several studies on latency, bandwidth and optimum workload to take full advantage of the available resources [10, 11]. There are also analyses of the behavior of MPI applications in Grid environments [12, 13]. In all of these cases, the same workload is assigned to all processors.

Li et al. [14] provide a survey of the existing solutions for load balancing, as well as new efforts to deal with it in the face of the new challenges of Grid computing. They describe and classify different load balancing schemes for Grid computing, but none of them is fully adaptive to the characteristics of the Grid.

In previous works we proposed two unbalanced distribution strategies, called singleB-domain and multipleB-domain, to execute mesh-based applications in Grid environments [15, 16, 17, 18]. Both use an unbalanced data distribution and take into account the execution platform and the processor characteristics. Both strategies minimize the communication between processors and reduce the expected execution time by up to 53% when compared with a balanced distribution strategy. In this paper we present the details of these two unbalanced distributions, describe the characteristics of the applications executed and the schemes of the explicit simulations, and propose a new unbalanced distribution, called the multipleCB-domain distribution, which combines the two previous unbalanced distributions and allows a more efficient processor utilization.

3 APPLICATIONS AND SIMULATIONS

In this section we describe the features of the mesh-based applications and the general algorithmic structure of the simulation schemes.

3.1 Mesh-based Applications

Finite element methods are fundamental techniques for solving engineering problems modeled by PDEs. These methods involve three basic steps:

1. Step 1: The physical problem is written in variational form (also called weighted residual form).
2. Step 2: The problem's domain is discretized by shapes called elements. This is called meshing.
3. Step 3: The variational form is discretized using quadrature rules, leading to a system of equations. The solution of this system represents a discrete approximation of the solution of the original continuum problem.
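A generic illustration of step 1 (our own example, not taken from the applications of this paper): for the Poisson model problem -∇²u = f on Ω with u = 0 on the boundary, multiplying by a test function v and integrating by parts gives the weak form, written here in LaTeX notation:

    \int_{\Omega} \nabla u \cdot \nabla v \, d\Omega \;=\; \int_{\Omega} f \, v \, d\Omega
    \qquad \forall v \in V ,

and steps 2 and 3 turn this integral statement into the algebraic system that is actually solved.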
some improvement by minimizing the estimated execution Applications that involve a meshing procedure are re-
time. Finally, METIS [8] is based on a multilevel recursive ferred to as mesh-based applications (step 2). Mesh-based
bisection algorithm. applications are naturally suited for parallel or distri-
All of these approaches consider estimated execution buted systems because these applications require large
time rather than communication cost to measure the perfor- amounts of processing time. Furthermore, mesh-based
mance of a mesh partitioner. However, minimizing the applications can be partitioned to execute concurrently on
communication between hosts is fundamental in computa- heterogeneous computers in a Grid. Implementing the
tional Grids to reduce the execution time. finite element method in parallel involves partitioning the
As regards workload, there are some works dealing nodes global domain into nprocs processors. Our example
with the relationship between architecture and domain applications use explicit finite element analysis for prob-
decomposition algorithms [9]. There are several studies lems involving sheet stamping and car crashes [19]. We
on latency, bandwidth and optimum workload to take describe each of these below.
full advantage of the available resources [10, 11]. There Sheet stamping problems. Performance prediction of
are also analyses of the behavior of MPI applications in sheet stamping dies during the die-design process. As
Grid environments [12, 13]. In all of these cases, the same well as market pressure for faster delivery of dies and
workload for all the processors is considered. cost reduction, large car manufacturers increasingly tend
Li et al. [14] provide a survey of the existing solutions to offload design responsibility onto their suppliers. Typi-
in load balancing as well as new efforts to deal with it in cally, dies are designed by highly experienced people
the face of the new challenges in Grid computing. In this who know what sort of die is needed to produce a part of
work they describe and classify different schemes of load a given shape. On the basis of the design, fine-tuning is
balancing for grid computing, but there is no solution performed by actually using dies to produce parts, ob-
which would be fully adaptive to the characteristics of the serving the result and manually milling the die until the
Grid. sheet comes out as specified by the contractor. In complex
In previous works we suggested two unbalanced dis- cases, it is very difficult to produce a good die design by
tribution strategies, called singleB-domain and multipleB- intuition. In addition to the associated costs, failure to
domain, to execute mesh-based applications in Grid envi- meet a delivery date damages a company’s image and has
ronments [15, 16, 17, 18]. All of these use unbalanced data a negative impact on future business.
distribution and they take into account the execution plat- Numerical simulations could provide the quantitative
form and the processor characteristics. Both strategies information needed to minimize modifications during the
minimize the communication between the processors and manufacturing process [19]. Simulations with serial codes
reduce the expected execution time by up to 53% when take as long as 40 to 60 processor hours and usually re-
compared with a balanced distribution strategy. In this quire high end workstations. Parallel stamping simula-
paper we present the details of the two unbalanced dis- tions enable die manufacturing companies to alter their
tributions proposed above. We describe the characteris- die design procedures: instead of an iterative approach,
tics of the applications executed and the schemes to expli- they can run more analyses before the first die prototypes
cit simulations, and we propose a new unbalanced distri- are made. This reduces overall die design and manufac-
bution, called multipleCB-domain distribution, which com- turing time, which is vital for the industry.

This problem has high computational requirements. The size of the problem is very large because of the high complexity of the design model. In this case, Grid computing is necessary: a complex model is required to solve the problem, and supplier companies currently do not have enough computational power available to perform such simulations, so they are forced to use the services of remote computers that belong to other companies.

Car crash problems. Both the car body design and the passengers' safety must be considered. Before performing a real crash test, hundreds of crashworthiness simulations are computed and analyzed [20]. Car crash simulations are required to predict the effects of various collisions, such as two cars colliding, on new advanced materials. As in the previous case, the car crash problem has high computational requirements, and a Grid platform is a good option for performing the simulations at a lower cost.

3.2 Simulations of Mesh-based Applications

For the structural analysis of mesh-based applications such as car crash and sheet stamping we use the displacement equation [21]. This equation determines the numerical solution for our applications. The discretization of the displacement equation using the finite element method yields the following equation:

    [M]{ü} + [K]{u} = {Fa}      (1)

where:
[M] is the mass matrix,
[K] is the stiffness (tensor) matrix,
{ü} is the acceleration vector,
{u} is the displacement vector, and
{Fa} is the force vector.

To obtain the numerical solution of (1), we use the central difference method and numerical integration in time, obtaining the following equation:

    {un+1} = [A]{un}      (2)

Equation (2) defines the explicit method used to determine the numerical solution of our applications. Details of the characteristics of the matrix [A] and the vector {un} can be found in [21].
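As an illustration only (this is our own sketch, not the exact construction used in [21]), assume a constant time step Δt and approximate the acceleration in (1) by central differences; the explicit update then reads, in LaTeX notation:

    \{\ddot{u}_n\} \approx \frac{\{u_{n+1}\} - 2\{u_n\} + \{u_{n-1}\}}{\Delta t^{2}}
    \quad\Longrightarrow\quad
    \{u_{n+1}\} = 2\{u_n\} - \{u_{n-1}\} + \Delta t^{2}\,[M]^{-1}\left(\{F_a\} - [K]\{u_n\}\right)

Written on the augmented state ({un}, {un-1}), this update is a single matrix-vector product per time step, which is the form of (2).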
3.3 Explicit Simulations

In the previous subsection, we described the equations that determine the numerical solution for our simulations. The simulations follow two different schemes, called with and without matrix assembling, depending on whether or not the global matrix is assembled inside the inner loop.

The scheme without matrix assembling uses an element-by-element calculation algorithm, so it is not necessary to assemble a global equation system. The algorithm structure of this scheme is the following:

    for (all time steps of the simulation)
        for (all finite elements)
            Obtain the global vector associated with the finite element ({un^e})
            Solve the system: {un+1^e} = [A^e]{un^e}
            Scatter {un+1^e} to the global vector ({un+1})
        Calculate the residual vector
        Update the boundary conditions
        Calculate the next time step

As we can see in the above algorithm, the scheme without matrix assembling performs one matrix-vector product per element, using the part of the global vector associated with that element ({un^e}). The result ({un+1^e}) is scattered to the global vector ({un+1}). After this, the residual vector is calculated and the boundary conditions are updated if necessary. This scheme has the advantage that the global matrix [A] does not need to be formed in the inner loop, which leads to considerable savings in memory and allows large problems to be solved on PCs with relatively little memory.

In contrast, the second scheme needs an initial assembly of the global matrix. After this, a matrix-vector product operation is performed as in the first scheme. The algorithm for this scheme is the following:

    for (all time steps of the simulation)
        for (all finite elements)
            Gather [A^e]: [A] = [A] + [A^e]
        Solve the system: {un+1} = [A]{un}
        Calculate the residual vector
        Update the boundary conditions
        Calculate the next time step

Our work follows the scheme without matrix assembling to produce the explicit simulations, as this saves both memory and time.
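The following C fragment is a minimal sketch of the scheme without matrix assembling, under assumed data structures (an element-to-global connectivity table conn, per-element matrices stored contiguously in Ae, and at most 8 nodes per element); the names are illustrative and this is not the original application code.

    /* Minimal sketch of the element-by-element (no global assembly) update.
     * Assumed layout (illustrative only):
     *   conn[e][a] - global node index of local node a of element e (nen <= 8)
     *   Ae         - per-element nen x nen matrices [A^e], stored contiguously
     *   u, u_next  - global vectors {un} and {un+1}                            */
    void explicit_step(int n_elements, int nen, const int (*conn)[8],
                       const double *Ae, const double *u, double *u_next,
                       int n_nodes)
    {
        for (int i = 0; i < n_nodes; i++)          /* clear the global result */
            u_next[i] = 0.0;

        for (int e = 0; e < n_elements; e++) {
            double ue[8], ue_next[8];
            for (int a = 0; a < nen; a++)          /* gather {un^e} */
                ue[a] = u[conn[e][a]];
            for (int a = 0; a < nen; a++) {        /* {un+1^e} = [A^e]{un^e} */
                ue_next[a] = 0.0;
                for (int b = 0; b < nen; b++)
                    ue_next[a] += Ae[(e * nen + a) * nen + b] * ue[b];
            }
            for (int a = 0; a < nen; a++)          /* scatter into {un+1} */
                u_next[conn[e][a]] += ue_next[a];
        }
        /* residual evaluation, boundary conditions and the next time step
         * follow here, as in the algorithm above                           */
    }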
4 DIMEMAS AND GRID ENVIRONMENT

We use a performance simulator called Dimemas. Dimemas is a tool developed by CEPBA (the European Center for Parallelism of Barcelona) for simulating parallel environments [2, 12, 13]. The tool is fed with a trace file and a configuration file. In order to obtain the trace file, the parallel application is executed using an instrumented version of MPI [22]. It is important to note that this execution can be done on any kind of computer. The configuration file contains details of the simulated architecture, such as the number of nodes and the latency and bandwidth between nodes. Dimemas generates an output file that contains the execution times of the simulated application for the parameters specified in the configuration file. Furthermore, it is possible to obtain a graphical representation of the parallel execution. Figure 1 shows the sequence of steps to obtain the output file.

Fig. 1. The Dimemas tool (figure: the application code, linked with the instrumented MPI trace library, runs on a sequential machine and produces a trace file; the trace file and the Grid environment configuration files feed the Dimemas simulation, whose output trace can be examined with the PARAVER visualization tool).

The Dimemas simulator considers a simple model for point-to-point communications. This model breaks down the communication time into five components:

1. The latency time is a fixed time to start the communication.
2. The resource contention time depends on the global load in the local host [23].
3. The transfer time depends on the message size. We model this time with a bandwidth parameter.
4. The WAN contention time depends on the global traffic in the WAN [24].
5. The flight time is the time spent in the transmission of the message, during which no CPU time is used [23]. It depends on the distance between hosts. We consider hosts distributed at equal distances, since our environment is homogeneous.

We consider an ideal environment to be one where the resource contention time is negligible: there are an infinite number of buses in the interconnection network and as many links as the number of different remote communications the host has with other hosts. For the WAN contention time, we use a linear model to estimate the traffic in the external network, with a traffic function weighting internal traffic at 1% and external traffic at 99%. Thus, we model the communications with just three parameters: latency, bandwidth and flight time. These parameters are set according to what is commonly found in present networks; we have studied different works to determine them [24, 25]. Table 1 shows the values of these parameters for the internal and external host communications. The internal column gives the latency and bandwidth between processors inside a host, and the external column gives the values between hosts. The communications inside a host are fast (latency of 25 µs, bandwidth of 100 Mbps), and the communications between hosts are slow (latency of 10 ms or 100 ms, bandwidth of 64 Kbps, 300 Kbps or 2 Mbps, flight time of 1 ms or 100 ms).

TABLE 1
LATENCY, BANDWIDTH AND FLIGHT TIME VALUES

    Parameters    | Internal   | External
    Latency       | 25 µs      | 10 ms and 100 ms
    Bandwidth     | 100 Mbps   | 64 Kbps, 300 Kbps and 2 Mbps
    Flight time   | -          | 1 ms and 100 ms

We model a Grid environment using a set of hosts. Each host is a network of Symmetric Multiprocessors (SMPs). The Grid environment is formed by a set of connected hosts, and each host has a direct full-duplex connection with every other host. We do this because we think that some of the most interesting Grids for scientists involve nodes that are themselves high-performance parallel machines or clusters. We consider different topologies in this study: two, four and eight hosts.
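The resulting three-parameter cost of a single remote message can be summarized in a few lines of C. This is our own reading of the model, not Dimemas code; in the ideal environment the resource contention term is dropped, and the linear WAN contention term (1% internal, 99% external traffic) is omitted here for brevity.

    #include <stddef.h>

    /* Simplified point-to-point cost between hosts: a fixed start-up latency,
     * a bandwidth-dependent transfer time and a distance-dependent flight
     * time. WAN contention would scale the transfer term by a linear function
     * of the traffic, which this sketch leaves out.                          */
    double remote_comm_time(size_t msg_bytes,
                            double latency_s,      /* 10e-3 or 100e-3 s        */
                            double bandwidth_bps,  /* 64e3, 300e3 or 2e6 bit/s */
                            double flight_time_s)  /* 1e-3 or 100e-3 s         */
    {
        double transfer_s = (8.0 * (double)msg_bytes) / bandwidth_bps;
        return latency_s + transfer_s + flight_time_s;
    }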

5 DATA DISTRIBUTION

This work is based on distributed applications that solve sparse linear systems using iterative methods. These systems arise from the discretization of partial differential equations, especially when explicit methods are used. The algorithms are parallelized using domain decomposition for the data distribution, and each domain has a parallel process associated with it.

A matrix-vector product operation is carried out in each iteration of the iterative method. The matrix-vector product is performed, following the domain decomposition algorithm, as a set of independent computations and a final set of communications. The communications in this context are associated with the domain boundaries: each process exchanges the boundary values with all its neighbors, so each process has as many communication exchanges as neighboring domains [26, 27]. For each communication exchange, the size of the message is the length of the boundary between the two domains involved. We use METIS to perform the domain decomposition of the initial mesh [28, 29, 30].
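The per-iteration structure just described (independent local computation followed by one boundary exchange per neighboring domain) can be sketched with MPI as follows. This is a generic skeleton under assumed buffer and neighbor arrays, not the application code; the overlapped variants are discussed in Section 7.

    #include <mpi.h>

    /* One exchange per neighboring domain; the message length is the number
     * of boundary values shared with that neighbor. MPI_Sendrecv is used
     * here for simplicity.                                                  */
    void exchange_boundaries(int n_neigh, const int *neigh_rank,
                             const int *boundary_len,
                             double **send_buf, double **recv_buf,
                             MPI_Comm comm)
    {
        for (int i = 0; i < n_neigh; i++) {
            MPI_Sendrecv(send_buf[i], boundary_len[i], MPI_DOUBLE,
                         neigh_rank[i], 0,
                         recv_buf[i], boundary_len[i], MPI_DOUBLE,
                         neigh_rank[i], 0,
                         comm, MPI_STATUS_IGNORE);
        }
    }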

Fig. 2. Boundary nodes per host (figure: a 16x16 node mesh, 256 dofs, divided into four balanced host domains, with the nodes on each inter-domain boundary highlighted).

Balanced distribution pattern. This is the usual strategy for domain decomposition algorithms. It generates as many domains as there are processors in the Grid, and the computational load is perfectly balanced between domains. This balanced strategy is suitable for homogeneous parallel computing, where all communications have the same cost. Figure 2 shows an example of a finite element mesh with 256 degrees of freedom (dofs) and the boundary nodes of each balanced partition. We consider a Grid with 4 hosts and 8 processors per host, and we solve an initial decomposition into four balanced domains. Figure 3 shows the balanced domain decomposition.
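A balanced decomposition into as many domains as processors can be obtained with a graph partitioner. The sketch below assumes the METIS 5.x C interface (METIS_PartGraphKway) and a mesh already converted to a CSR adjacency graph; it only illustrates the idea and is not necessarily the exact invocation used in this work.

    #include <metis.h>

    /* Balanced k-way partition of the mesh graph: one domain per processor.
     * xadj/adjncy is the CSR adjacency of the mesh graph.                   */
    int balanced_partition(idx_t nvtxs, idx_t *xadj, idx_t *adjncy,
                           idx_t nparts, idx_t *part /* out: domain per node */)
    {
        idx_t ncon = 1;     /* one balance constraint: equal node counts */
        idx_t objval;       /* resulting edge-cut                        */
        return METIS_PartGraphKway(&nvtxs, &ncon, xadj, adjncy,
                                   NULL, NULL, NULL,     /* unit weights    */
                                   &nparts, NULL, NULL,  /* default targets */
                                   NULL, &objval, part); /* default options */
    }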

Fig. 3. Balanced distribution (figure: the 32 domains D0-D31 of the balanced decomposition, one per CPU, laid out over the mesh of Fig. 2).

Fig. 4. SingleB-domain distribution (figure: for each host, the layer of boundary nodes forms a single B-domain and the remaining nodes form seven computational domains).

Fig. 5. MultipleB-domain distribution (figure: for each host, one B-domain is built per remote communication from the layer of boundary nodes, and the remaining nodes form five computational domains).

Unbalanced distribution pattern. Our proposal is to build some domains with a negligible computational load and to devote them only to managing the slow communications. In order to do this, we divide the domain decomposition into two phases. First, a balanced domain decomposition is done between the hosts; this guarantees that the computational load is balanced between hosts. Second, an unbalanced domain decomposition is done inside each host. This second decomposition involves splitting off the boundary nodes of the host sub-graph. We create as many special domains as remote communications. Note that these domains contain only boundary nodes, so they have a negligible computational load. We call these special domains B-domains (Boundary domains). The remainder of the host sub-graph is decomposed into nproc-b domains, where nproc is the number of processors in the host and b is the number of B-domains. We call these domains C-domains (Computational domains). As a first approximation we assign one CPU to each domain. The CPUs assigned to B-domains remain inactive most of the time; we use this policy to obtain the worst case for our decomposition algorithm. This inefficiency could be avoided by assigning all the B-domains in a host to the same CPU. We consider two unbalanced decompositions of the mesh shown in figure 2. First, we create a sub-domain with the layer of boundary nodes of each initial domain (singleB-domain), which leaves seven computational domains per host (Fig. 4). Second, we create one domain per remote communication for each initial partition using the layer of boundary nodes (multipleB-domain); the remainder of the mesh is then decomposed into five computational domains (Fig. 5).
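The second phase described above can be sketched as follows; the helper names (host_of, is_boundary_to, domain_of) are illustrative assumptions, not the original code.

    /* Second phase of the unbalanced decomposition inside one host (sketch).
     *   host_of[n]       - host owning global node n (from the balanced phase)
     *   n_remote_neigh   - number of remote communications of this host
     *   is_boundary_to() - nonzero if node n has a neighbor node in remote host h
     *   domain_of[n]     - output: B-domain index 0..b-1 for boundary nodes,
     *                      b for interior nodes, -1 for nodes of other hosts
     * Returns b, the number of B-domains. The interior nodes (marked b) are
     * afterwards re-partitioned into nproc - b balanced C-domains, e.g. with
     * the k-way call sketched in Section 5, one domain per remaining CPU.    */
    int create_b_domains(int n_nodes, const int *host_of, int my_host,
                         int n_remote_neigh,
                         int (*is_boundary_to)(int node, int neigh_host),
                         int *domain_of)
    {
        int b = n_remote_neigh;          /* one B-domain per remote communication */
        for (int n = 0; n < n_nodes; n++) {
            if (host_of[n] != my_host) { domain_of[n] = -1; continue; }
            domain_of[n] = b;            /* provisionally interior */
            for (int h = 0; h < n_remote_neigh; h++) {
                if (is_boundary_to(n, h)) {   /* boundary layer facing host h */
                    domain_of[n] = h;
                    break;
                }
            }
        }
        return b;
    }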


We must remark that the communication patterns of the balanced and the unbalanced domain decompositions may be different, since the number of neighbors of each domain may also differ. Figure 6 illustrates the communication pattern of the balanced and unbalanced distributions for this example. The arrows in the diagrams represent processors exchanging data: the beginning of an arrow identifies the sender and its end identifies the receiver. Short arrows represent local communications inside a host, whereas long arrows represent remote communications between hosts. In Fig. 6.a, all the processors are busy and the remote communications are done at the end of each iteration; the bars in the diagram represent the computational time of each processor. In Figs. 6.b and 6.c, the remote communication is overlapped with the computation: in Fig. 6.b it is overlapped only with the first remote computation, whereas in Fig. 6.c all remote communications in the same host are overlapped.

Fig. 6.a. Communication diagram for a computational iteration of the balanced distribution.

Fig. 6.b. Communication diagram for a computational iteration of the singleB-domain distribution.

Fig. 6.c. Communication diagram for a computational iteration of the multipleB-domain distribution.

6 RESULTS

In this section we show the results obtained using Dimemas. We simulate machines of up to 128 processors using the following Grid environments: the number of hosts is 2, 4 or 8 and the number of CPUs per host is 4, 8, 16, 32 or 64, so we have from 8 to 128 total CPUs. The simulations were done considering linear network traffic models. We consider three significant parameters to analyze the execution time behavior: the communication latency between hosts, the bandwidth of the external network and the flight time.

As data set, we consider a finite element mesh with 1,000,000 dofs. This size is usual for car crash or sheet stamping models. We consider two kinds of meshes, which cover most of the typical cases. The first one, called the stick mesh, can be completely decomposed into strips, so there are at most two neighbors per domain. The second one, called the box mesh, cannot be decomposed into strips, so the number of neighbors per domain can be greater than two. The size of the stick mesh is 10^4 x 10 x 10 nodes. The size of the box mesh is 10^2 x 10^2 x 10^2 nodes.

Figures 7.a, 7.b, 8.a and 8.b show the execution time reduction percentages compared with the balanced distribution for each Grid configuration on the stick mesh, as a function of the bandwidth. The unbalanced decomposition reduces the execution time expected for the balanced distribution in most cases.

For a Grid with 2 hosts and 4 processors per host, the predicted execution time of the balanced distribution is better than that of the other distributions because the number of remote communications is two. In this case, the multipleB-domain unbalanced distribution has only one or two processors per host devoted to computation.

The results are similar when we consider an external latency of 100 ms (Figs. 9.a, 9.b, 10.a and 10.b). Therefore, the value of this parameter has no significant impact on the results for this topology. In the other cases, the benefit of the unbalanced distributions ranges from 1% to 53% of time reduction, and the execution time reduction reaches 82% for other topologies and configurations. For 4 and 8 hosts, the singleB-domain unbalanced distribution behaves similarly to the balanced distribution, since the remote communications cannot be overlapped and have to be done sequentially. In this case, topologies having few processors devoted to computation are not appropriate. The unbalanced distribution reduces the execution time by up to 32%.

Fig. 7.a. Execution time reduction for the stick mesh with external latency of 10 ms and flight time of 1 ms (2 hosts).

Fig. 7.b. Execution time reduction for the stick mesh with external latency of 10 ms and flight time of 1 ms (4 and 8 hosts).

Fig. 8.a. Execution time reduction for the stick mesh with external latency of 10 ms and flight time of 100 ms (2 hosts).

Fig. 8.b. Execution time reduction for the stick mesh with external latency of 10 ms and flight time of 100 ms (4 and 8 hosts).
Fig. 9.a. Execution time reduction for the stick mesh with external latency of 100 ms and flight time of 1 ms (2 hosts).

Fig. 9.b. Execution time reduction for the stick mesh with external latency of 100 ms and flight time of 1 ms (4 and 8 hosts).

Fig. 10.a. Execution time reduction for the stick mesh with external latency and flight time of 100 ms (2 hosts).

Fig. 10.b. Execution time reduction for the stick mesh with external latency and flight time of 100 ms (4 and 8 hosts).

Figures 11.a, 11.b, 12.a and 12.b show the reduction of the expected execution time obtained for each Grid configuration on the box mesh when varying the flight time, the external latency and the bandwidth. For the 2-host configurations on the box mesh, the behavior of the singleB-domain and multipleB-domain unbalanced distributions is similar, since the number of remote communications is the same. Variations of the flight time and the external latency improve the results by up to 85%.

Figures 11.b and 12.b show the reduction of the expected execution time obtained for 4 and 8 hosts. The influence of the external latency on the application performance on the box mesh increases the percentage of execution time reduction by up to 4%. We assume that the distance between hosts is the same; however, if we consider hosts distributed at different distances, we obtain similar benefits for the different distributions. Moreover, if the calculation capacity of each processor in a host is different, the initial data partition should take this into account: the amount of data in each processor will then not be the same, but the computational load will be balanced between processors.

Fig. 11.a. Execution time reduction for the box mesh with external latency of 10 ms and flight time of 1 ms (2 hosts).

Fig. 11.b. Execution time reduction for the box mesh with external latency of 10 ms and flight time of 1 ms (4 and 8 hosts).

The number of remote and local communications varies depending on the partition and the dimensions of the data meshes. Table 2 shows the maximum number of communications for a computational iteration. The number of remote communications is higher for a box mesh than for a stick mesh, so the box mesh suffers from a higher overhead.

We propose the use of unbalanced distribution patterns to reduce the number of remote communications required. Our approach proves to be very effective, especially for box meshes. We observe that the multipleB-domain unbalanced distribution is not sensitive to the latency increase until the latency is larger than the computational time, whereas the execution time of the balanced distribution increases with the latency.

TABLE 2
MAXIMUM NUMBER OF COMMUNICATIONS FOR A COMPUTATIONAL ITERATION

    Hosts x CPUs | Balanced        | singleB-domain  | multipleB-domain
                 | Remote / Local  | Remote / Local  | Remote / Local

    STICK MESH
    2x4          | 1 / 1           | 1 / 1           | 1 / 1
    2x8          | 1 / 1           | 1 / 1           | 1 / 1
    2x16         | 1 / 1           | 1 / 1           | 1 / 1
    2x32         | 1 / 1           | 1 / 1           | 1 / 1
    2x64         | 1 / 1           | 1 / 1           | 1 / 1
    4x4          | 1 / 1           | 2 / 2           | 1 / 3
    4x8          | 1 / 1           | 2 / 2           | 1 / 3
    4x16         | 1 / 1           | 2 / 2           | 1 / 3
    4x32         | 1 / 1           | 2 / 2           | 1 / 3
    8x8          | 1 / 1           | 2 / 2           | 1 / 3
    8x16         | 1 / 1           | 2 / 2           | 1 / 3

    BOX MESH
    2x4          | 2 / 3           | 1 / 3           | 1 / 3
    2x8          | 4 / 5           | 1 / 6           | 1 / 6
    2x16         | 5 / 8           | 1 / 7           | 1 / 8
    2x32         | 6 / 7           | 1 / 15          | 1 / 14
    2x64         | 7 / 8           | 1 / 25          | 1 / 24
    4x8          | 7 / 5           | 3 / 6           | 4 / 6
    4x16         | 10 / 9          | 3 / 11          | 4 / 9
    4x32         | 9 / 8           | 3 / 22          | 4 / 14
    8x8          | 13 / 5          | 6 / 7           | 13 / 7
    8x16         | 13 / 4          | 6 / 13          | 13 / 11

Fig. 12.a. Execution time reduction for the box mesh with external latency of 10 ms and flight time of 100 ms (2 hosts).

Fig. 12.b. Execution time reduction for the box mesh with external latency of 10 ms and flight time of 100 ms (4 and 8 hosts).
7 MULTIPLECB-DOMAIN DISTRIBUTION

The multipleB-domain unbalanced distribution creates as many special domains per host as external communications. The scalability of this unbalanced distribution is therefore limited, because a processor is devoted solely to managing communications for every special domain. The optimum domain decomposition is problem dependent, but a simple model can be built to approximate the optimum.

In addition, to reduce the number of processors performing remote communications in the multipleB-domain distribution, we propose to assign all the B-domains in a host to a single CPU, which concurrently manages all the communications. We call this unbalanced distribution multipleCB-domain. Figures 13 and 14 illustrate the domain decomposition and the communication pattern of the multipleCB-domain distribution for the example described in Section 5.

Fig. 13. MultipleCB-domain distribution (figure: domain decomposition in which the B-domains of each host are mapped onto a single CPU).

Fig. 14. Communication diagram for a computational iteration of the multipleCB-domain distribution.

The main difference between the multipleB-domain and the multipleCB-domain distributions is the number of domains per host, because in the second case all communications are assigned to the same CPU inside a host. In Figure 5, the multipleB-domain distribution has 8 data domains per host; the multipleCB-domain distribution has 10 data domains per host, three of which (the special domains) are assigned to one CPU of the host, while the remaining data domains are assigned to the rest of the CPUs. We then have a single CPU that manages the remote communications and a larger number of CPUs performing computation. This kind of distribution allows us to minimize the number of idle CPUs in a host devoted only to remote communications.
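A minimal sketch of the CPU assignment that distinguishes the multipleCB-domain distribution from the multipleB-domain one: all B-domains of a host share one CPU, while each C-domain keeps its own CPU. The domain numbering and the cpu_of array are illustrative assumptions.

    /* CPU assignment inside one host (sketch).
     * Domains 0 .. b-1 are the B-domains (one per remote communication) and
     * domains b .. ndom-1 are the C-domains.
     *   multipleB-domain  : one CPU per domain (B-domain CPUs mostly idle)
     *   multipleCB-domain : every B-domain mapped to CPU 0, so CPUs
     *                       1 .. ndom-b all perform computation             */
    void assign_cpus_multipleCB(int ndom, int b, int *cpu_of /* per domain */)
    {
        for (int d = 0; d < ndom; d++) {
            if (d < b)
                cpu_of[d] = 0;            /* all communication domains share CPU 0 */
            else
                cpu_of[d] = 1 + (d - b);  /* one computational domain per CPU */
        }
    }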
For a Grid with 2 hosts, the predicted execution time is the same as that of the multipleB-domain distribution, because the number of remote communications is only one. However, when considering 4 or 8 hosts, the multipleCB-domain distribution reduces the execution time by up to 43% compared to the balanced distribution, while the multipleB-domain distribution achieves a reduction of up to 53%. In general, the multipleCB-domain distribution is about 10% worse than the multipleB-domain distribution, mainly due to the problems of managing concurrent remote communications in the simulator.

It is also important to look at the MPI implementation [31]. The ability to overlap communication and computation depends on this implementation. A multithreaded MPI implementation could overlap communication and computation, but problems with context switching between threads and interference between processes could appear.

In a single-threaded MPI implementation we can use non-blocking send/receive calls with a wait-all routine. However, we have observed some problems with this approach, associated with the internal ordering of the non-blocking MPI send and receive operations. In our experiments this could be solved by explicitly programming the proper order of the communications, but the problem remains in the general case. We conclude that it is very important to have non-blocking MPI primitives that actually exploit the full-duplex channel capability. As future work, we will consider other MPI implementations that optimize the collective operations [32, 33].
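With a single-threaded MPI library, the overlap discussed above relies on posting the non-blocking sends and receives for the boundary data, computing on data that does not depend on remote values, and completing the exchange with a wait-all call. The skeleton below is a generic illustration of that pattern under assumed buffer names; it is not the application code.

    #include <mpi.h>

    /* Overlap of remote communication and computation with non-blocking MPI:
     * post the boundary exchange, update the interior while the messages are
     * in flight, then wait and finish the boundary contribution.
     * compute_interior()/compute_boundary() stand for the element loops of
     * Section 3.                                                            */
    void overlapped_iteration(int n_neigh, const int *neigh_rank,
                              const int *boundary_len,
                              double **send_buf, double **recv_buf,
                              void (*compute_interior)(void),
                              void (*compute_boundary)(double **recv_buf),
                              MPI_Comm comm)
    {
        MPI_Request reqs[2 * 64];                  /* assumes <= 64 neighbors */

        for (int i = 0; i < n_neigh; i++) {
            MPI_Irecv(recv_buf[i], boundary_len[i], MPI_DOUBLE,
                      neigh_rank[i], 0, comm, &reqs[2 * i]);
            MPI_Isend(send_buf[i], boundary_len[i], MPI_DOUBLE,
                      neigh_rank[i], 0, comm, &reqs[2 * i + 1]);
        }

        compute_interior();                        /* work not needing remote data */

        MPI_Waitall(2 * n_neigh, reqs, MPI_STATUSES_IGNORE);

        compute_boundary(recv_buf);                /* work that needed the halo */
    }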
8 CONCLUSIONS

In this paper, we present an unbalanced domain decomposition strategy for solving problems that arise from the discretization of partial differential equations on meshes. Applying the unbalanced distribution on different platforms is simple, because the data partition is easy to obtain. We compare the results obtained with the classical balanced strategy. We show that the unbalanced distribution pattern improves the execution time of domain decomposition applications in Grid environments. We considered two kinds of meshes, which cover the most typical cases, and we show that the expected execution time can be reduced by up to 53%.

The unbalanced distribution pattern reduces the number of remote communications required per host compared with the balanced distribution, especially for box meshes. However, the unbalanced distribution can be inappropriate if the total number of processors is less than the total number of remote communications. The optimal case is when the number of processors performing calculation in a host is twice the number of processors managing remote communications. Otherwise, if the number of processors performing calculation is small, the unbalanced distribution will be less efficient than the balanced distribution. In this case, we propose the use of the multipleCB-domain distribution, in which all remote communications in a host are concurrently managed by the same CPU. This distribution has an execution time around 10% worse than the other unbalanced distributions.

In general, to obtain good performance with the strategies presented in this paper, the number of processors per host needs to be equal to or higher than 8. Otherwise, the number of processors performing computation is not enough to overlap the remote communications.

ACKNOWLEDGMENTS

This work was supported by the Ministry of Science and Technology of Spain under contract TIN2007-60625, the HiPEAC European Network of Excellence and the Barcelona Supercomputing Center (BSC).
REFERENCES

[1] G. Allen et al., "Classifying and enabling grid applications", Concurrency and Computation: Practice and Experience, vol. 0, pp. 1-13, 2000.
[2] Dimemas, http://www.cepba.upc.es/dimemas/, 2000.
[3] J. Chen and V. E. Taylor, "Mesh partitioning for efficient use of distributed systems", IEEE Trans. Parallel and Distributed Systems, vol. 13, no. 1, pp. 67-79, 2002.
[4] C. Walshaw and M. Cross, "Multilevel mesh partitioning for heterogeneous communications networks", Future Generation Computer Systems, vol. 17, no. 5, pp. 601-623, 2001.
[5] F. Pellegrini and J. Roman, "A software package for static mapping by dual recursive bipartitioning of process and architecture graphs", Proc. High Performance Computing and Networking, pp. 493-498, 1996.
[6] S. K. Das, D. J. Harvey and R. Biswas, "MinEX: a latency-tolerant dynamic partitioner for grid computing applications", Future Generation Computer Systems, vol. 18, no. 4, pp. 477-489, 2002.
[7] S. Huang, E. Aubanel and V. Bhavsar, "Mesh partitioners for computational grids: a comparison", Computational Science and Its Applications, LNCS 2269, pp. 60-68, 2003.
[8] S. Kumar, S. Das and R. Biswas, "Graph partitioning for parallel applications in heterogeneous grid environments", Proc. Sixteenth International Parallel and Distributed Processing Symposium, 2002, doi.ieeecomputersociety.org/10.1109/IPDPS.2002.1015564.
[9] W. D. Gropp and D. E. Keyes, "Complexity of Parallel Implementation of Domain Decomposition Techniques for Elliptic Partial Differential Equations", SIAM Journal on Scientific and Statistical Computing, vol. 9, no. 2, pp. 312-326, 1988.
[10] D. K. Kaushik, D. E. Keyes and B. F. Smith, "On the Interaction of Architecture and Algorithm in the Domain-based Parallelization of an Unstructured Grid Incompressible Flow Code", Proc. Tenth International Conference on Domain Decomposition Methods, pp. 311-319, 1997.
[11] W. Gropp et al., "Latency, Bandwidth, and Concurrent Issue Limitations in High-Performance CFD", Proc. First MIT Conference on Computational Fluid and Solid Mechanics, pp. 830-841, 2001.
[12] R. M. Badia et al., "Dimemas: Predicting MPI Applications Behavior in Grid Environments", Proc. Workshop on Grid Applications and Programming Tools (GGF8), 2003.
[13] R. M. Badia et al., "Performance Prediction in a Grid Environment", Proc. First European Across Grids Conference, 2003.
[14] Y. Li and Z. Lan, "A Survey of Load Balancing in Grid Computing", Proc. First International Symposium on Computational and Information Science (CIS'04), pp. 280-285, 2004.
[15] B. Otero et al., "A Domain Decomposition Strategy for GRID Environments", Proc. Eleventh European PVM/MPI Users' Group Meeting, pp. 353-361, 2004.
[16] B. Otero and J. M. Cela, "A workload distribution pattern for grid environments", Proc. 2007 International Conference on Grid Computing and Applications, pp. 56-62, 2007.
[17] B. Otero et al., "Performance Analysis of Domain Decomposition", Proc. Fourth International Conference on Grid and Cooperative Computing, pp. 1031-1042, 2005.
[18] B. Otero et al., "Data Distribution Strategies for Domain Decomposition Applications in Grid Environments", Proc. Sixth International Conference on Algorithms and Architectures for Parallel Processing, pp. 214-224, 2005.
[19] W. Sosnowski, "Flow Approach-Finite Element Model for Stamping Processes versus Experiment", Computer Assisted Mechanics and Engineering Sciences, vol. 1, pp. 49-75, 1994.
[20] N. Frisch et al., "Visualization and Pre-processing of Independent Finite Element Meshes for Car Crash Simulations", The Visual Computer, vol. 18, no. 4, pp. 236-249, 2002.
[21] Z. H. Zhong, Finite Element Procedures for Contact-Impact Problems, Oxford University Press, pp. 1-372, 1993.
[22] Paraver, http://www.cepba.upc.es/dimemas, 2002.
[23] R. M. Badia et al., "DAMIEN: Distributed Applications and Middleware for Industrial Use of European Networks", D5.3/CEPBA, IST-2000-25406, unpublished.
[24] R. M. Badia et al., "DAMIEN: Distributed Applications and Middleware for Industrial Use of European Networks", D5.2/CEPBA, IST-2000-25406, unpublished.
[25] B. Otero and J. M. Cela, "Latencia y ancho de banda para simular ambientes Grid", Technical Report TR-UPC-DAC-2004-33, UPC, Spain, 2004. http://www.ac.upc.es/recerca/reports/DAC/2004/index,ca.html.
[26] D. E. Keyes, "Domain Decomposition Methods in the Mainstream of Computational Science", Proc. Fourteenth International Conference on Domain Decomposition Methods, pp. 79-93, 2003.
[27] X. C. Cai, "Some Domain Decomposition Algorithms for Nonselfadjoint Elliptic and Parabolic Partial Differential Equations", Technical Report TR-461, Courant Institute, NY, 1989.
[28] G. Karypis and V. Kumar, "Parallel multilevel k-way partitioning scheme for irregular graphs", SIAM Review, vol. 41, no. 2, pp. 278-300, 1999.
[29] G. Karypis and V. Kumar, "A fast and high quality multilevel scheme for partitioning irregular graphs", SIAM J. Sci. Comput., vol. 20, no. 1, pp. 359-392, 1998.
[30] METIS, http://glaros.dtc.umn.edu/gkhome/views/metis, 2011.
[31] Message Passing Interface Forum, MPI-2: Extensions to the Message-Passing Interface, 2003. http://scc.ustc.edu.cn/zlsc/cxyy/200910/W020100308601028317962.pdf
[32] N. Karonis, B. Toonen and I. Foster, "MPICH-G2: A Grid-enabled Implementation of the Message Passing Interface", Journal of Parallel and Distributed Computing, vol. 63, no. 5, pp. 551-563, 2003.
[33] I. Foster and N. T. Karonis, "A Grid-enabled MPI: Message Passing in Heterogeneous Distributed Computing Systems", Proc. ACM/IEEE Supercomputing, 1998.

B. Otero received her M.Sc. and her first Ph.D. degree in Computer Science from the Central University of Venezuela in 1999 and 2006, respectively. After that, she received her second Ph.D. in Computer Architecture and Technology in 2007 from the Polytechnic University of Catalonia (UPC). Currently, she is an Assistant Professor at the Computer Architecture Department at UPC. Her research interests include parallel programming, load balancing, cluster computing, and autonomic communications. She is a member of the HiPEAC Network of Excellence.

M. Gil is an Associate Professor at the Universitat Politècnica de Catalunya (UPC). She received her Ph.D. in computer science from the UPC in 1994. Her research is primarily concerned with the design and implementation of system software for parallel computing, to improve resource management. Her work focuses mainly on OS, middleware and runtime support for multicore architectures. She is a member of the HiPEAC Network of Excellence and of the SARC European project.