Abstract— In this paper, we evaluate domain decomposition strategies for solving mesh-based applications in Grid environments. We compare the balanced distribution strategy with unbalanced distribution strategies. While the former is the common strategy in homogeneous computing environments (e.g. parallel computers), it suffers from communication latency in Grid environments. Unbalanced decomposition strategies assign less workload to the processors responsible for sending updates outside the host. The results obtained in Grid environments show that unbalanced distribution strategies improve the expected execution time of mesh-based applications by up to 53%. However, this is not true when the number of processors devoted to communication exceeds the number of processors devoted to calculation in the host. To solve this problem we propose a new unbalanced distribution strategy that improves the expected execution time by up to 43%. We analyze the influence of the communication patterns on execution times using the Dimemas simulator.
Index Terms— Domain decomposition methods, load balancing algorithms, parallelism and concurrency, simulation.
1 INTRODUCTION
[…] Conclusions and future work are presented in Section 8.

2 RELATED WORK
We can distinguish between two types of related work: one based on the partitioning method and the other based on load balancing. As mentioned above, the success of parallel mesh-based applications in Grid environments depends on efficient mesh partitioning. Several works have already proposed partitioners for computational Grids; PART, JOSTLE, SCOTCH, MinEX, PaGrid and METIS are some examples.

PART [3] uses compute-intensive algorithms and simulated annealing, so it requires a parallel implementation to obtain good performance. JOSTLE [4] produces data partitions without taking into account the communication cost of each processor. SCOTCH [5] has the same limitation as JOSTLE, because it generates partitions assuming a homogeneous interprocessor communication cost. MinEX [6] makes partitions without taking the application granularity into account. PaGrid [7] uses some techniques already applied by other partitioners, but adds a stage that load-balances the estimated execution time; it produces partitions comparable to those of JOSTLE and attempts some improvement by minimizing the estimated execution time. Finally, METIS [8] is based on a multilevel recursive bisection algorithm.

All of these approaches consider estimated execution time rather than communication cost to measure the performance of a mesh partitioner. However, minimizing the communication between hosts is fundamental in computational Grids to reduce the execution time.

As regards workload, there are some works dealing with the relationship between the architecture and the domain decomposition algorithm [9]. There are several studies on latency, bandwidth and the optimum workload needed to take full advantage of the available resources [10, 11]. There are also analyses of the behavior of MPI applications in Grid environments [12, 13]. In all of these cases, the same workload is considered for all the processors.

Li et al. [14] provide a survey of the existing solutions in load balancing, as well as of new efforts to deal with it in the face of the new challenges in Grid computing. They describe and classify different load balancing schemes for Grid computing, but none of the solutions is fully adaptive to the characteristics of the Grid.

In previous works we proposed two unbalanced distribution strategies, called singleB-domain and multipleB-domain, for executing mesh-based applications in Grid environments [15, 16, 17, 18]. Both use an unbalanced data distribution and take into account the execution platform and the processor characteristics. Both strategies minimize the communication between processors and reduce the expected execution time by up to 53% compared with a balanced distribution strategy. In this paper we present the details of these two unbalanced distributions, describe the characteristics of the applications executed and the explicit simulation schemes, and propose a new unbalanced distribution, called multipleCB-domain, which combines the two previous unbalanced distributions and allows more efficient processor utilization.
3 APPLICATIONS AND SIMULATIONS
In this section we describe the features of the mesh-based applications and the general algorithmic structure of the simulation schemes.

3.1 Mesh-based Applications
Finite element methods have been fundamental techniques in the solution of engineering problems modeled by PDEs. These methods comprise three basic steps (a worked example of the first step follows the list):
1. The physical problem is written in variational form (also called weighted residual form).
2. The problem's domain is discretized by complex shapes called elements. This is called meshing.
3. The variational form is discretized using quadrature rules, leading to a system of equations. The solution of this system represents a discrete approximation of the solution of the original continuum problem.
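For concreteness, here is a textbook instance of step 1 (our illustration; the paper does not spell one out). For the Poisson problem $-\nabla^2 u = f$ on a domain $\Omega$ with $u = 0$ on its boundary, multiplying by a test function $v$ and integrating by parts gives the variational form

    \int_{\Omega} \nabla u \cdot \nabla v \, d\Omega \;=\; \int_{\Omega} f\, v \, d\Omega \qquad \forall v \in H^1_0(\Omega),

which steps 2 and 3 then reduce to a linear system of equations over the mesh nodes.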
Applications that involve a meshing procedure are referred to as mesh-based applications (step 2). Mesh-based applications are naturally suited for parallel or distributed systems, because they require large amounts of processing time. Furthermore, mesh-based applications can be partitioned to execute concurrently on the heterogeneous computers of a Grid. Implementing the finite element method in parallel involves partitioning the nodes of the global domain among nprocs processors (a minimal sketch of such a partition appears below). Our example applications use explicit finite element analysis for problems involving sheet stamping and car crashes [19]. We describe each of these below.
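As a minimal illustration of this partitioning step (our sketch, not the authors' code; the node count is an assumption based on the mesh sizes given later), the following MPI program computes a balanced block distribution of global node indices among nprocs ranks, i.e., the balanced baseline that the unbalanced strategies later modify:

    /* Balanced block partition of global mesh nodes among nprocs MPI ranks. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);

        int rank, nprocs;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        const long n_nodes = 1000000; /* assumed: both example meshes have about 10^6 nodes */
        long chunk = n_nodes / nprocs;  /* equal share per rank */
        long rem   = n_nodes % nprocs;  /* first 'rem' ranks take one extra node */

        long first = rank * chunk + (rank < rem ? rank : rem);
        long count = chunk + (rank < rem ? 1 : 0);

        printf("rank %d owns nodes [%ld, %ld)\n", rank, first, first + count);

        MPI_Finalize();
        return 0;
    }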
reduce the expected execution time by up to 53% when take as long as 40 to 60 processor hours and usually re-
compared with a balanced distribution strategy. In this quire high end workstations. Parallel stamping simula-
paper we present the details of the two unbalanced dis- tions enable die manufacturing companies to alter their
tributions proposed above. We describe the characteris- die design procedures: instead of an iterative approach,
tics of the applications executed and the schemes to expli- they can run more analyses before the first die prototypes
cit simulations, and we propose a new unbalanced distri- are made. This reduces overall die design and manufac-
bution, called multipleCB-domain distribution, which com- turing time, which is vital for the industry.
[Fig. 3. Balanced distribution.]

[Fig. 5. MultipleB-domain distribution.]

TABLE 1
LATENCY, BANDWIDTH AND FLIGHT TIME VALUES
Bandwidth: 100 Mbps (internal); 64 Kbps, 300 Kbps and 2 Mbps (external)
In this section we show the results obtained using Dimemas. We simulate a 128-processor machine using the following Grid environment: the number of hosts is 2, 4 or 8; the […] greater than two. The size of the stick mesh is 10^4 x 10 x 10 nodes; the size of the box mesh is 10^2 x 10^2 x 10^2 nodes.

Figures 7.a, 7.b, 8.a and 8.b show the time reduction […]. For a Grid with 2 hosts and 4 processors per host, the […] external latency is equal to 100 ms (figs. 9.a, 9.b, 10.a and 10.b) […] significant impact on the results for this topology. In the other cases, the benefit of the unbalanced distributions […]

[Fig. 7.a. Execution time reduction for the stick mesh with external latency of 10 ms and flight time of 1 ms (2 hosts).]
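To put these parameters in perspective, the sketch below evaluates a linear point-to-point cost model, which is our assumption in the spirit of Dimemas, using the external bandwidths from Table 1 together with one of the latency and flight-time settings above; the message size is invented for illustration:

    /* Linear transfer-time model: latency + size/bandwidth + flight time. */
    #include <stdio.h>

    static double transfer_seconds(double bytes, double latency_s,
                                   double bandwidth_bps, double flight_s) {
        return latency_s + (8.0 * bytes) / bandwidth_bps + flight_s;
    }

    int main(void) {
        double msg = 1.0e6;                 /* assumed 1 MB boundary update */
        double links[] = {64e3, 300e3, 2e6}; /* external links from Table 1 */
        for (int i = 0; i < 3; i++)
            printf("external %g bps: %.2f s\n", links[i],
                   transfer_seconds(msg, 0.010, links[i], 0.001)); /* 10 ms latency, 1 ms flight time */
        return 0;
    }

On the slowest link the same boundary update costs two orders of magnitude more than on the fastest one, which is why concentrating remote communications on few processors pays off.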
[Fig. 9.a. Execution time reduction for the stick mesh with external latency of 100 ms and flight time of 1 ms (2 hosts).]

[Fig. 9.b. Execution time reduction for the stick mesh with external latency of 100 ms and flight time of 1 ms (4 and 8 hosts).]

[Fig. 10.b. Execution time reduction for the stick mesh with external latency and flight time of 100 ms (4 and 8 hosts).]

[…] The influence of the external latency on the application performance in a box mesh increases the percentage of reduction of the execution time by up to 4%. We suppose that the distance between hosts is the same; however, if we consider hosts distributed at different distances, we obtain similar benefits for the different distributions. Moreover, […]

Figures 11.a, 11.b, 12.a and 12.b show the reduction of […] configuration, varying the flight time, the external latency and […]
[Fig. 12.a. Execution time reduction for the box mesh with external latency of 10 ms and flight time of 100 ms (2 hosts).]

[Fig. 12.b. Execution time reduction for the box mesh with external latency of 10 ms and flight time of 100 ms (4 and 8 hosts).]

[Fig. 14. Communication diagram for a computational iteration of the multipleCB-domain distribution.]

TABLE 2
MAXIMUM NUMBER OF COMMUNICATIONS FOR A COMPUTATIONAL ITERATION

Hosts x CPUs   Balanced          singleB-domain    multipleB-domain
               Remote / Local    Remote / Local    Remote / Local

STICK MESH
2x4            1 / 1             1 / 1             1 / 1
2x8            1 / 1             1 / 1             1 / 1
2x16           1 / 1             1 / 1             1 / 1
2x32           1 / 1             1 / 1             1 / 1
2x64           1 / 1             1 / 1             1 / 1
4x4            1 / 1             2 / 2             1 / 3
4x8            1 / 1             2 / 2             1 / 3
4x16           1 / 1             2 / 2             1 / 3

BOX MESH
2x4            2 / 3             1 / 3             1 / 3
2x8            4 / 5             1 / 6             1 / 6
2x16           5 / 8             1 / 7             1 / 8
2x32           6 / 7             1 / 15            1 / 14
2x64           7 / 8             1 / 25            1 / 24
4x8            7 / 5             3 / 6             4 / 6
4x16           10 / 9            3 / 11            4 / 9
4x32           9 / 8             3 / 22            4 / 14
8x8            13 / 5            6 / 7             13 / 7
8x16           13 / 4            6 / 13            13 / 11
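The stick-mesh rows of Table 2 match the one-dimensional geometry of that mesh: each host has at most two neighbor hosts along the stick, while a box mesh cut along all three axes produces many more boundary exchanges per host. A toy count under these assumptions (ours, not the paper's model):

    /* Remote neighbor hosts for a 1-D chain decomposition (stick mesh). */
    #include <stdio.h>

    static int chain_neighbors(int h, int H) {
        return (h > 0) + (h < H - 1);  /* end hosts have 1 neighbor, interior hosts 2 */
    }

    int main(void) {
        int H = 4;
        for (int h = 0; h < H; h++)
            printf("stick, host %d/%d: %d remote neighbor(s)\n",
                   h, H, chain_neighbors(h, H));
        /* A 3-D box decomposition exposes several faces per host, and each
         * face can require several exchanges, which is consistent with the
         * growing remote counts in the BOX MESH rows of Table 2. */
        return 0;
    }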
7 MULTIPLECB-DOMAIN DISTRIBUTION
[…] the scalability of the unbalanced distribution is moderated, because a processor is devoted just to communication with each remote host. In the multipleCB-domain distribution, all the special domains of a host are assigned to a single CPU, and the remainder of the data domains are assigned to the rest of the CPUs. We thus have only one CPU managing remote communications and a larger number of CPUs performing computation. This kind of distribution allows us to minimize the number of CPUs in a host devoted exclusively to remote communications (a sketch of the assignment rule follows below).

For a Grid with 2 hosts, the predicted execution time is the same as for the multipleB-domain distribution, because there is only one remote communication. However, with 4 or 8 hosts, the multipleCB-domain distribution reduces the execution time by up to 43% compared to the balanced distribution, while the multipleB-domain distribution achieves reductions of up to 53%. In general, the multipleCB-domain distribution is about 10% worse than the multipleB-domain distribution, mainly due to the problems of managing concurrent remote communications in the simulator.
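The following sketch (our reading of the rule above, not the authors' code) shows the multipleCB-domain assignment: all special domains of a host go to one CPU, and the rest are spread over the remaining CPUs:

    /* Assign special (boundary) domains to CPU 0; distribute the others
     * round-robin over CPUs 1..NCPUS-1. Domain counts are assumed. */
    #include <stdio.h>

    #define NDOMAINS 16
    #define NCPUS    8

    int main(void) {
        /* is_special[d] = 1 if domain d sends updates outside the host (assumed input) */
        int is_special[NDOMAINS] = {1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0};
        int cpu_of[NDOMAINS];

        int next = 1;
        for (int d = 0; d < NDOMAINS; d++) {
            if (is_special[d]) {
                cpu_of[d] = 0;                   /* all special domains on one CPU */
            } else {
                cpu_of[d] = next;
                next = (next % (NCPUS - 1)) + 1; /* cycle through CPUs 1..NCPUS-1 */
            }
        }

        for (int d = 0; d < NDOMAINS; d++)
            printf("domain %d -> CPU %d\n", d, cpu_of[d]);
        return 0;
    }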
It is also important to look at the MPI implementation [31]. The ability to overlap communications and computation depends on this implementation. A multithreaded MPI implementation could overlap communication and computation, but problems with context switching between threads and interference between processes could appear.

In a single-threaded MPI implementation we can use non-blocking send/receive with a wait-all routine. However, we have observed some problems with this approach, associated with the internal ordering of the non-blocking MPI send and receive operations. In our experiments this could be solved by explicitly programming the proper order of the communications, but the problem remains in the general case. We conclude that it is very important to have non-blocking MPI primitives that actually exploit the full-duplex channel capability. As future work, we will consider other MPI implementations that optimize the collective operations [32, 33]. We sketch the single-threaded non-blocking scheme below.
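A minimal sketch of this scheme, posting non-blocking sends and receives and completing them with a wait-all call (our illustration; the ring neighbors and buffer sizes are assumed):

    /* Non-blocking boundary exchange completed by MPI_Waitall. */
    #include <mpi.h>
    #include <stdlib.h>

    #define HALO 1024  /* assumed boundary size, for illustration */

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);

        int rank, nprocs;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        int left  = (rank - 1 + nprocs) % nprocs;  /* ring neighbors, for simplicity */
        int right = (rank + 1) % nprocs;

        double *send_l = calloc(HALO, sizeof(double)), *send_r = calloc(HALO, sizeof(double));
        double *recv_l = calloc(HALO, sizeof(double)), *recv_r = calloc(HALO, sizeof(double));

        MPI_Request reqs[4];
        /* Post receives first so matching sends can complete eagerly. */
        MPI_Irecv(recv_l, HALO, MPI_DOUBLE, left,  0, MPI_COMM_WORLD, &reqs[0]);
        MPI_Irecv(recv_r, HALO, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &reqs[1]);
        MPI_Isend(send_l, HALO, MPI_DOUBLE, left,  0, MPI_COMM_WORLD, &reqs[2]);
        MPI_Isend(send_r, HALO, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &reqs[3]);

        /* ... local computation on interior domains would overlap here ... */

        MPI_Waitall(4, reqs, MPI_STATUSES_IGNORE);  /* the wait-all step */

        free(send_l); free(send_r); free(recv_l); free(recv_r);
        MPI_Finalize();
        return 0;
    }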
8 CONCLUSIONS
In this paper we present an unbalanced domain decomposition strategy for solving problems that arise from the discretization of partial differential equations on meshes. Applying the unbalanced distribution on different platforms is simple, because the data partition is easy to obtain. We compare the results obtained with the classical balanced strategy. We show that the unbalanced distribution pattern improves the execution time of domain decomposition applications in Grid environments. We considered two kinds of meshes, which cover the most typical cases, and show that the expected execution time can be reduced by up to 53%.

The unbalanced distribution pattern reduces the number of remote communications required per host compared with the balanced distribution, especially for box meshes. However, the unbalanced distribution can be inappropriate if the total number of processors is less than the total number of remote communications. The optimal case is when the number of processors performing calculation in a host is twice the number of processors managing remote communications. Otherwise, if the number of processors performing calculation is small, the unbalanced distribution will be less efficient than the balanced distribution. In this case, we propose the use of the multipleCB-domain distribution, in which all remote communications in a host are concurrently managed by the same CPU. This distribution has around a 10% worse execution time than the other unbalanced distributions.

In general, to obtain good performance with the strategies presented in this paper, the number of processors per host needs to be equal to or higher than 8. Otherwise, the number of processors performing computation is not enough to overlap the remote communications.

ACKNOWLEDGMENTS
This work was supported by the Ministry of Science and Technology of Spain under contract TIN2007-60625, the HiPEAC European Network of Excellence and the Barcelona Supercomputing Center (BSC).

REFERENCES
[1] G. Allen et al. "Classifying and enabling grid applications", Concurrency and Computation: Practice and Experience, vol. 0, pp. 1-13, 2000. (Journal citation)
[2] Dimemas, Internet, http://www.cepba.upc.es/dimemas/, 2000.
[3] J. Chen and V. E. Taylor. "Mesh partitioning for efficient use of distributed systems", IEEE Trans. Parallel and Distributed Systems, vol. 13, no. 1, pp. 67-79, 2002. (IEEE Transactions)
[4] C. Walshaw and M. Cross. "Multilevel mesh partitioning for heterogeneous communications networks", Future Generation Computer Systems, vol. 17, no. 5, pp. 601-623, 2001. (Journal citation)
[5] F. Pellegrini and J. Roman. "A software package for static mapping by dual recursive bipartitioning of process and architecture graphs", Proc. High Performance Computing and Networking, pp. 493-498, 1996. (Conference proceedings)
[6] S. K. Das, D. J. Harvey and R. Biswas. "MinEX: a latency-tolerant dynamic partitioner for grid computing applications", Future Generation Computer Systems, vol. 18, no. 4, pp. 477-489, 2002. (Journal citation)
[7] S. Huang, E. Aubanel and V. Bhavsar. "Mesh partitioners for computational grids: a comparison", Computational Science and Its Applications, LNCS 2269, pp. 60-68, 2003. (Journal or magazine citation)
[8] S. Kumar, S. Das and R. Biswas. "Graph partitioning for parallel applications in heterogeneous grid environments", Proc. Sixteenth International Parallel and Distributed Processing Symposium, 2002, doi.ieeecomputersociety.org/10.1109/IPDPS.2002.1015564. (Conference proceedings)
[9] W. D. Gropp and D. E. Keyes. "Complexity of Parallel Implementation of Domain Decomposition Techniques for Elliptic Partial Differential Equations", SIAM Journal on Scientific and Statistical Computing, vol. 9, no. 2, pp. 312-326, 1988. (Journal citation)
[10] D. K. Kaushik, D. E. Keyes and B. F. Smith. "On the Interaction of Architecture and Algorithm in the Domain-based Parallelization of an Unstructured Grid Incompressible Flow Code", Proc. Tenth International Conference on Domain Decomposition Methods, pp. 311-319, 1997. (Conference proceedings)
[11] W. Gropp et al. "Latency, Bandwidth, and Concurrent Issue Limitations in High-Performance CFD", Proc. First MIT Conference on Computational Fluid and Solid Mechanics, pp. 830-841, 2001. (Conference proceedings)
[12] R. M. Badia et al. "Dimemas: Predicting MPI Applications Behavior in Grid Environments", Proc. Workshop on Grid Applications and Programming Tools (GGF8), 2003. (Conference proceedings)
[13] R. M. Badia et al. "Performance Prediction in a Grid Environment", Proc. First European Across Grids Conference, 2003. (Conference proceedings)
[14] Y. Li and Z. Lan. "A Survey of Load Balancing in Grid Computing", Computational and Information Science: Proc. First International Symposium (CIS'04), pp. 280-285, 2004. (Conference proceedings)
[15] B. Otero et al. "A Domain Decomposition Strategy for GRID Environments", Proc. Eleventh European PVM/MPI Users' Group Meeting, pp. 353-361, 2004. (Conference proceedings)
[16] B. Otero and J. M. Cela. "A workload distribution pattern for grid environments", Proc. 2007 International Conference on Grid Computing and Applications, pp. 56-62, 2007. (Conference proceedings)
[17] B. Otero et al. "Performance Analysis of Domain Decomposition", Proc. Fourth International Conference on Grid and Cooperative Computing, pp. 1031-1042, 2005. (Conference proceedings)
[18] B. Otero et al. "Data Distribution Strategies for Domain Decomposition Applications in Grid Environments", Proc. Sixth International Conference on Algorithms and Architectures for Parallel Processing, pp. 214-224, 2005. (Conference proceedings)
[19] W. Sosnowski. "Flow Approach-Finite Element Model for Stamping Processes versus Experiment", Computer Assisted Mechanics and Engineering Sciences, vol. 1, pp. 49-75, 1994. (Journal citation)
[20] N. Frisch et al. “Visualization and Pre-processing of Independent Finite
Element Meshes for Car Crash Simulations”, The Visual Computer, vol.
18, no. 4, pp. 236-249, 2002. (Journal citation)
[21] Z. H. Zhong. Finite Element Procedures for Contact-Impact Problems. Oxford University Press, pp. 1-372, 1993. (Book style)
[22] Paraver. http://www.cepba.upc.es/dimemas. 2002.
[23] R. M. Badía et al. “DAMIEN: Distributed Applications and Middleware
for Industrial Use of European Networks”. D5.3/CEPBA. IST-2000-
25406, unpublished. (Unpublished manuscript)
[24] R. M. Badía et al. “DAMIEN: Distributed Applications and Middleware
for Industrial Use of European Networks”. D5.2/CEPBA. IST-2000-
25406, unpublished. (Unpublished manuscript)
[25] B. Otero and J. M. Cela. “Latencia y ancho de banda para simular am-
bientes Grid”, Technical Report TR-UPC-DAC-2004-33, UPC. España,
2004. (Technical report with report number)
http://www.ac.upc.es/recerca/reports/DAC/2004/index,ca.html.
[26] D. E. Keyes. “Domain Decomposition Methods in the Mainstream of
Computational Science”, Proc. Fourteenth International Conference on Do-
main Decomposition Methods, pp. 79-93, 2003. (Conference proceedings)
[27] X. C. Cai. “Some Domain Decomposition Algorithms for Nonselfad-
joint Elliptic and Parabolic Partial Differential Equations”, Technical
Report TR-461, Courant Institute, NY, 1989. (Technical report with re-
port number)
[28] G. Karypis and V. Kumar. "Parallel multilevel k-way partitioning scheme for irregular graphs", SIAM Rev., vol. 41, no. 2, pp. 278-300, 1999. (Journal citation)
[29] G. Karypis and V. Kumar. "A fast and high quality multilevel scheme for partitioning irregular graphs", SIAM J. Sci. Comput., vol. 20, no. 1, pp. 359-392, 1998. (Journal citation)
[30] Metis, Internet, http://glaros.dtc.umn.edu/gkhome/views/metis.
2011.
[31] Message Passing Interface Forum, MPI-2: Extensions to the MPI, 2003.
http://scc.ustc.edu.cn/zlsc/cxyy/200910/W020100308601028317962.p
df (2011)
[32] N. Karonis, B. Toonen and I. Foster. "MPICH-G2: A Grid-enabled Implementation of the Message Passing Interface", Journal of Parallel and Distributed Computing, vol. 63, no. 5, pp. 551-563, 2003. (Journal citation)
[33] I. Foster and N. T. Karonis. “A Grid-enabled MPI: Message Passing in
Heterogeneous Distributed Computing Systems”, Proc. of the
ACM/IEEE Supercomputing, 1998. (Conference proceedings)
B. Otero received her M.Sc. and her first Ph.D. degree in Computer Science at the Central University of Venezuela in 1999 and 2006, respectively. After that, she received her second Ph.D. in Computer Architecture and Technology in 2007 at the Polytechnic University of Catalonia (UPC). Currently, she is an Assistant Professor at the Computer Architecture Department at UPC. Her research interests include parallel programming, load balancing, cluster computing, and autonomic communications. She is a member of the HiPEAC Network of Excellence.

M. Gil is an Associate Professor at the Universitat Politècnica de Catalunya (UPC). She received her Ph.D. in Computer Science from the UPC in 1994. Her research is primarily concerned with the design and implementation of system software for parallel computing, to improve resource management. Her work focuses mainly on operating systems, middleware and runtime support for multicore architectures. She is a member of the HiPEAC Network of Excellence and of the SARC European project.