The IMA Volumes
in Mathematics
and its Applications
Volume 56
Series Editors
Avner Friedman Willard Miller, Jr.
Institute for Mathematics and
its Applications
IMA
The Institute for Mathematics and its Applications was established by a grant from the
National Science Foundation to the University of Minnesota in 1982. The IMA seeks to encourage
the development and study of fresh mathematical concepts and questions of concern to the other
sciences by bringing together mathematicians and scientists from diverse fields in an atmosphere
that will stimulate discussion and collaboration.
The IMA Volumes are intended to involve the broader scientific community in this process.
Avner Friedman, Director
Willard Miller, Jr., Associate Director
**********
IMA ANNUAL PROGRAMS
1982-1983 Statistical and Continuum Approaches to Phase Transition
1983-1984 Mathematical Models for the Economics of Decentralized
Resource Allocation
1984-1985 Continuum Physics and Partial Differential Equations
1985-1986 Stochastic Differential Equations and Their Applications
1986-1987 Scientific Computation
1987-1988 Applied Combinatorics
1988-1989 Nonlinear Waves
1989-1990 Dynamical Systems and Their Applications
1990-1991 Phase Transitions and Free Boundaries
1991-1992 Applied Linear Algebra
1992-1993 Control Theory and its Applications
1993-1994 Emerging Applications of Probability
IMA SUMMER PROGRAMS
1987 Robotics
1988 Signal Processing
1989 Robustness, Diagnostics, Computing and Graphics in Statistics
1990 Radar and Sonar
1990 Time Series
1991 Semiconductors
1992 Environmental Studies: Mathematical, Computational, and Statistical Analysis
**********
SPRINGER LECTURE NOTES FROM THE IMA:
The Mathematics and Physics of Disordered Media
Editors: Barry Hughes and Barry Ninham
(Lecture Notes in Math., Volume 1035, 1983)
Orienting Polymers
Editor: J.L. Ericksen
(Lecture Notes in Math., Volume 1063, 1984)
New Perspectives in Thermodynamics
Editor: James Serrin
(Springer-Verlag, 1986)
Models of Economic Dynamics
Editor: Hugo Sonnenschein
(Lecture Notes in Econ., Volume 264, 1986)
Alan George
John R. Gilbert
Joseph W.H. Liu
Editors

Graph Theory and Sparse Matrix Computation
Springer-Verlag
New York Berlin Heidelberg London Paris
Tokyo Hong Kong Barcelona Budapest
Alan George
University of Waterloo
Needles Hall
Waterloo, Ontario N2L 3G1
Canada

John R. Gilbert
Xerox Palo Alto Research Center
3333 Coyote Hill Road
Palo Alto, CA 94304-1314 USA

Joseph W.H. Liu
Department of Computer Science
York University
North York, Ontario M3J 1P3
Canada

Series Editors:
Avner Friedman
Willard Miller, Jr.
Institute for Mathematics and its Applications
University of Minnesota
Minneapolis, MN 55455 USA
Mathematics Subject Classifications (1991): 05C50, 65F50, 05C05, 05C70, 05C20,
15A23, 15A06, 65F05, 65F10, 65F20, 65F25, 68R10
All rights reserved. This work may not be translated or copied in whole or in part without the
written permission of the publisher (Springer-Verlag New York, Inc., 175 Fifth Avenue, New
York, NY 10010, USA), except for brief excerpts in connection with reviews or scholarly
analysis. Use in connection with any form of information storage and retrieval, electronic
adaptation, computer software, or by similar or dissimilar methodology now known or hereafter
developed is forbidden.
The use of general descriptive names, trade names, trademarks, etc., in this publication, even
if the former are not especially identified, is not to be taken as a sign that such names, as
understood by the Trade Marks and Merchandise Marks Act, may accordingly be used freely
by anyone.
Permission to photocopy for internal or personal use, or the internal or personal use of specific
clients, is granted by Springer-Verlag, Inc., for libraries registered with the Copyright Clearance
Center (CCC), provided that the base fee of $5.00 per copy, plus $0.20 per page, is paid directly
to CCC, 21 Congress St., Salem, MA 01970, USA. Special requests should be addressed directly
to Springer-Verlag New York, 175 Fifth Avenue, New York, NY 10010, USA.
ISBN-13: 978-1-4613-8371-0/1993 $5.00 + 0.20
Production managed by Hal Henglein; manufacturing supervised by Jacqui Ashri.
Camera-ready copy prepared by the IMA.
9 8 7 6 5 4 3 2 1
Current Volumes:
Volume 1: Homogenization and Effective Moduli of Materials and Media
Editors: Jerry Ericksen, David Kinderlehrer, Robert Kohn, J.-L. Lions
Volume 10: Stochastic Differential Systems, Stochastic Control Theory and Applications
Editors: Wendell Fleming and Pierre-Louis Lions
Volume 20: Coding Theory and Design Theory Part I: Coding Theory
Editor: Dijen Ray-Chaudhuri
Volume 21: Coding Theory and Design Theory Part II: Design Theory
Editor: Dijen Ray-Chaudhuri
Volume 42: Partial Differential Equations with Minimal Smoothness and Applications
Editors: B. Dahlberg, E. Fabes, R. Fefferman, D. Jerison, C. Kenig, and J. Pipher
Volume 52: Shock Induced Transitions and Phase Structures in General Media
Editors: J.E. Dunn, Roger Fosdick, and Marshall Slemrod
Volume 53: Variational Problems
Editors: Avner Friedman and Joel Spruck
Forthcoming Volumes:
Control Theory
Robust Control Theory
Control Design for Advanced Engineering Systems: Complexity, Uncertainty, In-
formation and Organization
Control and Optimal Design of Distributed Parameter Systems
Flow Control
Robotics
Nonsmooth Analysis & Geometric Methods in Deterministic Optimal Control
Systems & Control Theory for Power Systems
Adaptive Control, Filtering and Signal Processing
FOREWORD
This volume is based on the proceedings of a workshop that was an integral part of the 1991-
92 IMA program on "Applied Linear Algebra." The purpose of the workshop was
to bring together people who work in sparse matrix computation with those who
conduct research in applied graph theory and graph algorithms, in order to foster
active cross-fertilization. We are grateful to Richard Brualdi, George Cybenko,
Alan George, Gene Golub, Mitchell Luskin, and Paul Van Dooren for planning and
implementing the year-long program.
We especially thank Alan George, John R. Gilbert, and Joseph W.H. Liu for
organizing this workshop and editing the proceedings.
The financial support of the National Science Foundation made the workshop
possible.

Avner Friedman
Willard Miller, Jr.
PREFACE
When reality is modeled by computation, linear algebra is often the connection
between the continuous physical world and the finite algorithmic one. Usually,
the more detailed the model, the bigger the matrix, the better the answer. Efficiency
demands that every possible advantage be exploited: sparse structure, advanced
computer architectures, efficient algorithms. Therefore sparse matrix computation knits
together threads from linear algebra, parallel computing, data structures, geometry,
and both numerical and discrete algorithms.
Graph theory has been ubiquitous in sparse matrix computation ever since Seymour
Parter used undirected graphs to model symmetric Gaussian elimination more
than 30 years ago. Three of the reasons are paths, locality, and data structures. Paths
in the graph of a matrix are important in many contexts: fill paths in Gaussian
elimination, strongly connected components in irreducibility, bipartite matching, and
alternating paths in linear dependence and structural singularity. Graphs are the
right setting to discuss the kinds of locality in a sparse matrix that allow a parallel
algorithm to work on different parts of a problem more or less independently. And
the active field of graph algorithms is a rich source of data structures and efficient
techniques for manipulating sparse matrices by computer.
The Institute for Mathematics and Its Applications held a workshop on "Sparse
Matrix Computations: Graph Theory Issues and Algorithms," organized by the
editors of this volume, from October 14 to 18, 1991. The workshop included fourteen
invited and several contributed talks, software demonstrations, an open problem
session, and a great deal of stimulating discussion between mathematicians, numerical
analysts, and theoretical computer scientists. After the workshop we invited some of
the participants to submit papers for this collection. We intend the result to be a
resource for the researcher or advanced student of either graphs or sparse matrices
who wants to explore their connections. Therefore, we asked the authors to undertake
the challenging task of making current research accessible to both communities.
The order of papers in the volume reflects a rough grouping into three categories.
• First, graph models of symmetric matrices and factorizations: Blair and
Peyton on chordal graphs and clique trees; Agrawal and Klein on provably
good nested dissection orderings; and Miller, Teng, Thurston, and Vavasis
on separators for geometric graphs.
• Second, graph models of algorithms on nonsymmetric matrices: Eisenstat
and Liu on Schur complements; Johnson and Xenophontos on Perron
complements; Gilbert and Ng on QR factorization and partial pivoting; and
Alvarado, Pothen, and Schreiber on partitioned inverses of triangular matrices.
• Third, parallel sparse matrix algorithms: Ashcraft on distributed-memory
sparse Cholesky factorization; Schreiber on scalability and its limits; Kratzer
and Cleary on massively parallel LU and QR; and Jones and Plassman on a
parallel iterative method.
Of course, the categories overlap and interrelate. Separators (Agrawal, Miller) are
useful in parallel matrix computation, for both direct and iterative methods. So are
partitioned inverses (Alvarado). Nonsymmetric analyses (Gilbert) gain leverage from
symmetric models, with intersection graphs as the fulcrum. Another view might try
to group the papers by general subject, by matrix algorithm, or by graph-theoretic
model:
Subjects: Reorderings for efficient factorization (Agrawal, Miller, Ashcraft),
nonzero structure prediction (Eisenstat, Johnson, Gilbert), partitioning (Agrawal,
Miller, Alvarado, Jones), parallelism (Miller, Alvarado, Ashcraft, Schreiber, Kratzer,
Jones).
Matrix algorithms: Cholesky factorization (Blair, Agrawal, Miller, Ashcraft,
Schreiber), nonsymmetric factorization (Eisenstat, Gilbert, Kratzer), matrix-vector
multiplication (Miller, Jones), triangular solution (Alvarado), Schur complement
(Eisenstat, Johnson).
Graph models: Chordal graphs (Blair, Agrawal, Alvarado, Schreiber), various
trees (Blair, Agrawal, Alvarado, Schreiber), directed graphs (Eisenstat, Johnson,
Gilbert, Alvarado), bipartite graphs (Gilbert), other undirected graphs (Agrawal, Miller,
Jones).
The astute reader will recognize this as an adjacency-list representation of a
sparse matrix; its nonzero structure and one of its graphs are displayed below.
Anyone who has spent time at the IMA knows that Avner Friedman and his staff
nurture an amazing environment of mathematical stimulation and interdisciplinary
excitement. The IMA special year on applied linear algebra was blessed further
by having Richard Brualdi as organizer and intellectual shepherd. We express our
deepest thanks to them, to the workshop participants, and most of all to the authors
of these papers.
Alan George, Waterloo
John R. Gilbert, Palo Alto
Joseph W. H. Liu, York
March 1993
[Figure: the nonzero structure of the matrix (nz = 45) and one of its graphs.]
Foreword .......................................................... xi
The efficient parallel iterative solution of large sparse linear systems ...... 229
Mark T. Jones and Paul E. Plassmann
AN INTRODUCTION TO CHORDAL GRAPHS AND
CLIQUE TREES*
JEAN R. S. BLAIR† AND BARRY W. PEYTON‡
Clique trees and chordal graphs have carved out a niche for themselves in recent work on sparse
matrix algorithms, due primarily to research questions associated with advanced computer
architectures. This paper is a unified and elementary introduction to the standard characterizations
of chordal graphs and clique trees. The pace is leisurely, as detailed proofs of all results are
included. We also briefly discuss applications of chordal graphs and clique trees in sparse matrix
computations.
Key Words. chordal graphs, clique trees, acyclic hypergraphs, minimum spanning tree, Prim's
algorithm, maximum cardinality search, sparse linear systems, Cholesky factorization
1. Introduction. It is well known that chordal graphs model the sparsity structure
of the Cholesky factor of a sparse positive definite matrix [40]. Of the many ways
to represent a chordal graph, a particularly useful and compact representation is
provided by clique trees [24, 46]. Until recently, explicit use of the properties of chordal
graphs or clique trees in sparse matrix computations was rarely needed. For example,
chordal graphs are mentioned in a single exercise in George and Liu [16]. However,
chordal graphs and clique trees have found a niche in more recent work in this area,
primarily due to various research questions associated with advanced computer
architectures. For instance, the multifrontal method [7], which was developed to obtain
good performance on vector supercomputers, can be expressed very succinctly in
terms of a clique tree representation of the underlying chordal graph [34, 38].
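The connection between Cholesky factors and chordal graphs can be made concrete with a small sketch (ours, not from the paper): symbolic Gaussian elimination on the graph of a matrix adds, at each elimination step, the edges needed to make the eliminated vertex's remaining higher-numbered neighbors pairwise adjacent. The resulting filled graph is the graph of the Cholesky factor, and it is always chordal. The function name and representation below are our own choices.

```python
import itertools

def filled_graph(n, edges, order):
    """Symbolic elimination: process vertices in 'order', connecting the
    later-ordered neighbors of each eliminated vertex.  The returned edge
    set is that of the Cholesky factor's graph, which is chordal."""
    adj = {v: set() for v in range(n)}
    for u, v in edges:
        adj[u].add(v); adj[v].add(u)
    pos = {v: i for i, v in enumerate(order)}
    fill = {frozenset(e) for e in edges}
    for v in order:
        later = [u for u in adj[v] if pos[u] > pos[v]]
        for u, w in itertools.combinations(later, 2):
            if w not in adj[u]:          # a fill edge is created here
                adj[u].add(w); adj[w].add(u)
                fill.add(frozenset((u, w)))
    return fill
```

For example, eliminating a 4-cycle 0-1-2-3-0 in the order 0, 1, 2, 3 creates the single fill edge {1, 3}, and the filled graph (a 4-cycle plus one chord) is chordal.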
This paper is intended as an update to the graph theoretical results presented and
proved in Rose [40], which predated the introduction of clique trees. Our goal is to
provide a unified introduction to chordal graphs and clique trees for those interested
in sparse matrix computations, though we hope it will be of use to those in other
application areas in which these graphs play a major role. We have striven to write
a primer, not a survey article: we present a limited number of well known results
of fundamental importance, and prove all the results in the paper. The pacing is
intended to be leisurely, and the organization is intended to enable the reader to read
selected topics of interest in detail.
The paper is organized as follows. Section 2 contains the standard well known
* Work was supported in part by the Applied Mathematical Sciences Research Program, Office
of Energy Research, U.S. Department of Energy under contract DE-AC05-840R21400 with Martin
Marietta Energy Systems, Incorporated, and in part by the Institute for Mathematics and Its
Applications with funds provided by the National Science Foundation.
† Department of Computer Science, University of Tennessee, Knoxville, TN 37996-1301.
‡ Mathematical Sciences Section, Oak Ridge National Laboratory, P.O. Box 2008, Bldg. 6012,
Oak Ridge, TN 37831-6367.
All technical terms used in this section are defined later in the paper.
We let G(S) denote the subgraph of G induced by S, namely the subgraph (S, E(S)).
At times it will be convenient to consider the induced subgraph of G obtained by
removing a set of vertices S ⊆ V from the graph; hence we define G \ S by
G \ S := G(V − S).
Two vertices u, v ∈ V are said to be adjacent if (u, v) ∈ E. Also, the edge
(u, v) ∈ E is said to be incident with both vertices u and v. The set of vertices
adjacent to v in G is denoted by adj_G(v). Similarly, the set of vertices adjacent
to S ⊆ V in G is given by
adj_G(S) := {v ∈ V − S | (u, v) ∈ E for some vertex u ∈ S}.
(The subscript G often will be suppressed when the graph is known by context.) An
induced subgraph G(S) is complete if the vertices in S are pairwise adjacent in G. In
this case we also say that S is complete in G.
We let [v_0, v_1, ..., v_k] denote a simple path of length k from v_0 to v_k in G, i.e.,
v_i ≠ v_j for i ≠ j and (v_i, v_{i+1}) ∈ E for 0 ≤ i ≤ k − 1. Similarly, [v_0, v_1, ..., v_k, v_0]
denotes a simple cycle of length k + 1 in G. Finally, a chord of a path (cycle) is any
edge joining two nonconsecutive vertices of the path (cycle).
DEFINITION 1. An undirected graph G = (V, E) is chordal (triangulated, rigid
circuit) if every cycle of length greater than three has a chord.
Clearly, any induced subgraph of a chordal graph is also chordal, a fact that is
useful in several of the proofs that follow.
graphs in Theorem 2.1, which is due to Dirac [6]. The proof is taken from Peyton [34],
which, in turn, closely follows the proof given by Golumbic [20].
THEOREM 2.1 (DIRAC [6]). A graph G is chordal if and only if every minimal
vertex separator of G is complete in G.
[Figure: illustration for the proof of Theorem 2.1, showing the separator S and the components G(A) and G(B).]
they are of the smallest possible length greater than one, and combine them to form
the cycle μ = [x, a_1, ..., a_r, y, b_1, ..., b_t, x]. Since G is chordal and μ is a cycle of
length greater than three, μ must have a chord. Any chord of μ incident with a_i,
1 ≤ i ≤ r, would either join a_i to another vertex in A, contrary to the minimality of r,
or would join a_i to a vertex in B, which is impossible because S separates A from B
in G. Consequently, no chord of μ is incident with a vertex a_i, 1 ≤ i ≤ r, and by
the same argument no chord of the cycle is incident with a vertex b_j, 1 ≤ j ≤ t. It
follows that the only possible chord is (x, y). □
Remark. In reality, r = t = 1, for otherwise [x, a_1, ..., a_r, y, x] or [y, b_1, ..., b_t, x, y] is
a chordless cycle of length greater than three.
Again, the subscript G often will be suppressed where the graph is known by context.
A vertex v is simplicial if adj(v) induces a complete subgraph of G. The ordering α
is a perfect elimination ordering (PEO) if for 1 ≤ i ≤ n, the vertex v_i is simplicial in
the graph G(C_i). As shown below in Lemma 2.2, every nontrivial chordal graph has
a simplicial vertex (actually, at least two). Theorem 2.3, which states that chordal
graphs are characterized by the possession of a PEO, follows easily from Lemma 2.2.
The proofs are again taken from Peyton [34], which, in turn, closely follow arguments
found in Golumbic [20].
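The PEO definition translates directly into a checking procedure. The following sketch is our own illustration (the name is_peo and the adjacency-dictionary representation are not from the paper): an ordering is a perfect elimination ordering exactly when each vertex's neighbors that appear later in the ordering (its monotone adjacency set madj) are pairwise adjacent.

```python
from itertools import combinations

def is_peo(adj, order):
    """Test whether 'order' (order[0] is eliminated first) is a perfect
    elimination ordering of the graph given by the neighbor-set dict 'adj'."""
    pos = {v: i for i, v in enumerate(order)}
    for v in order:
        madj = [u for u in adj[v] if pos[u] > pos[v]]
        for u, w in combinations(madj, 2):
            if w not in adj[u]:      # two later neighbors are nonadjacent
                return False
    return True
```

On a triangle with a pendant vertex any elimination order starting at the pendant or a degree-2 triangle vertex passes, while no ordering of a chordless 4-cycle does.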
LEMMA 2.2 (DIRAC [6]). Every chordal graph G has a simplicial vertex. If G is
not complete, then it has two nonadjacent simplicial vertices.
Proof. The lemma is trivial if G is complete. For the case where G is not complete
we proceed by induction on the number of vertices n. Let G be a chordal graph with
n ≥ 2 vertices, including two nonadjacent vertices a and b. If n = 2, both vertices of
the graph are simplicial since both are isolated (i.e., adj(a) = adj(b) = ∅). Suppose
n > 2 and assume that the lemma holds for all such graphs with fewer than n
vertices. Since a and b are nonadjacent, there exists an ab-separator (e.g., the set
V − {a, b}). Suppose S is a minimal ab-separator of G, and let G(A) and G(B) be
the connected components of G \ S containing a and b, respectively. The induced
subgraph G(A ∪ S) is a chordal graph having fewer vertices than G; hence, by the
induction hypothesis one of the following must hold: Either G(A ∪ S) is complete and
every vertex of A is a simplicial vertex of G(A ∪ S), or G(A ∪ S) has two nonadjacent
simplicial vertices, one of which must be in A since, by Theorem 2.1, S is complete
in G. Because adj_G(A) ⊆ A ∪ S, every simplicial vertex of G(A ∪ S) in A is also a
simplicial vertex of G. By the same argument, B also contains a simplicial vertex
of G, thereby completing the proof. □
THEOREM 2.3 (FULKERSON AND GROSS [10]). A graph G is chordal if and only
if G has a perfect elimination ordering.
Proof. Suppose G is chordal. We proceed by induction on the number of vertices n
to show the existence of a PEO of G. The case n = 1 is trivial. Suppose n > 1 and
every chordal graph with fewer vertices has a PEO. By Lemma 2.2, G has a simplicial
vertex, say v. Now G \ {v} is a chordal graph with fewer vertices than G; hence, by
induction it has a PEO, say β. If α orders the vertex v first, followed by the remaining
vertices of G in the order determined by β, then α is a PEO of G.
Conversely, suppose G has a PEO, say α, given by v_1, v_2, ..., v_n. We seek a chord
of an arbitrary cycle μ in G of length greater than three. Let v_i be the vertex on μ
whose label i is smaller than that of any other vertex on μ. Since α is a PEO,
madj(v_i) is complete; whence μ has at least one chord: namely, the edge joining the
two neighboring vertices of v_i in μ. □
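The "only if" direction of this proof is constructive: order a simplicial vertex first and recurse on the rest of the graph. A sketch of that construction (our code; the greedy search and the names are our own, and the naive simplicial-vertex scan is far from the linear-time algorithms discussed below) either produces a PEO or reports that none exists:

```python
from itertools import combinations

def perfect_elimination_ordering(adj):
    """Build a PEO by repeatedly removing a simplicial vertex, following
    the induction in Theorem 2.3.  Returns None when no simplicial vertex
    remains, which by Lemma 2.2 means the graph is not chordal."""
    adj = {v: set(nbrs) for v, nbrs in adj.items()}   # work on a copy
    order = []
    while adj:
        v = next((v for v in adj
                  if all(w in adj[u] for u, w in combinations(adj[v], 2))),
                 None)
        if v is None:                 # no simplicial vertex: not chordal
            return None
        order.append(v)
        for u in adj[v]:              # delete v from the graph
            adj[u].discard(v)
        del adj[v]
    return order
```

The same triangle-plus-pendant graph yields an ordering, while the chordless 4-cycle is correctly rejected.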
2.4. Maximum cardinality search. Rose, Tarjan, and Lueker [41] introduced
the first linear-time algorithm for producing a PEO, known as the lexicographic
breadth first search algorithm. In a set of unpublished lecture notes, Tarjan [44]
introduced a simpler algorithm known as the maximum cardinality search (MCS)
algorithm. Tarjan and Yannakakis [46] later described MCS algorithms for both chordal
graphs and acyclic hypergraphs. The MCS algorithm for chordal graphs orders the
vertices in reverse order beginning with an arbitrary vertex v ∈ V for which it sets
α(v) = n. At each step the algorithm selects as the next vertex to label an unlabeled
vertex adjacent to the largest number of labeled vertices, with ties broken arbitrarily.
C_{n+1} ← ∅;
for i ← n to 1 step −1 do
    choose an unlabeled vertex v adjacent to the largest number of labeled vertices;
    α(v) ← i;
    C_i ← C_{i+1} ∪ {v};
end for
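The MCS procedure just described might be sketched in Python as follows (our illustration; the dictionary representation and the name mcs_ordering are our own choices, and this simple version does not attain the linear running time of careful implementations):

```python
def mcs_ordering(adj):
    """Maximum cardinality search: label vertices n, n-1, ..., 1, always
    choosing an unlabeled vertex adjacent to the most labeled vertices,
    ties broken arbitrarily.  Returns vertices in elimination order
    (position 0 corresponds to label 1)."""
    weight = {v: 0 for v in adj}       # number of labeled neighbors
    labeled = set()
    order = []                          # built in decreasing label order
    for _ in range(len(adj)):
        v = max((u for u in adj if u not in labeled), key=weight.get)
        labeled.add(v)
        order.append(v)
        for u in adj[v]:
            if u not in labeled:
                weight[u] += 1
    order.reverse()                     # order[i] carries label i + 1
    return order
```

By Theorem 2.5 below, on a chordal graph the returned ordering is a PEO.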
The following lemma and theorem prove that the MCS algorithm produces a PEO.
The lemma provides a useful characterization of the orderings of a chordal graph that
are not perfect elimination orderings. Edelman, Jamison, and Shier [9, 43] prove
similar results while studying the notion of convexity in chordal graphs. Theorem 2.5
is then proved by showing that every ordering that is not a PEO is also not an MCS
ordering. The proof is taken from Peyton [34]. Later, in Section 4.2, we will provide
a more intuitive view of how the MCS algorithm works: it can be viewed as a special
implementation of Prim's algorithm applied to the weighted clique intersection graph
of G (defined in Section 3.4).
LEMMA 2.4. An ordering α of the vertices in a graph G is not a perfect elimination
ordering if and only if for some vertex v, there exists a chordless path of length greater
than one from v = α^{-1}(i) to some vertex in C_{i+1} through vertices in V − C_i.
Proof. Suppose α is not a PEO. There exists then a vertex u ∈ V
for which madj(u) is not complete in G; hence, there exist two vertices v, w ∈ madj(u)
joined by no edge in E. Without loss of generality assume that i = α(v) < α(w).
Then [v, u, w] is a chordless path of length two from v = α^{-1}(i) to w ∈ C_{i+1} through
u ∈ V − C_i.
Conversely, suppose there exists a chordless path μ = [u_0, u_1, ..., u_r] of length
r ≥ 2 from u_0 = α^{-1}(i) to u_r ∈ C_{i+1} through vertices u_j ∈ V − C_i, 1 ≤ j ≤ r − 1.
Let u_k, where 1 ≤ k ≤ r − 1, be the internal vertex in μ whose label α(u_k) is smaller
than that of any other internal vertex in μ. Then madj(u_k) includes two nonadjacent
vertices: namely, the two neighboring vertices of u_k in μ. It follows that α is not a
PEO. □
THEOREM 2.5 (TARJAN [44], TARJAN AND YANNAKAKIS [46]). Every maximum
cardinality search ordering of a chordal graph G is a perfect elimination ordering.
Proof. Let α be any ordering of a chordal graph G that is not a PEO. We will
show that the ordering α cannot be generated by the MCS algorithm.
By Lemma 2.4, for some vertex u_0 there exists a chordless path μ = [u_0, u_1, ..., u_{r-1},
u_r] of length r ≥ 2 from u_0 = α^{-1}(i) to u_r ∈ C_{i+1} through vertices u_j ∈ V − C_i,
1 ≤ j ≤ r − 1. (See Figure 4.) Choose u_0 so that the label i = α(u_0) is maximum
among all the vertices of G for which such a chordless path exists.
To show that α is not an MCS ordering it suffices to show that there exists some
vertex w ∈ V − C_{i+1} for which |adj(w) ∩ C_{i+1}| exceeds |adj(u_0) ∩ C_{i+1}|. We will
show that the vertex u_{r-1} ∈ μ is indeed such a vertex. Note that adj(u_0) ∩ C_{i+1} and
madj(u_0) are by definition identical, and thus it suffices to show that
(1)    |adj(u_{r-1}) ∩ C_{i+1}| > |madj(u_0)|.
For the trivial case madj(u_0) = ∅, the theorem holds since u_{r-1} is adjacent to
u_r ∈ C_{i+1}. Assume instead that madj(u_0) ≠ ∅, and choose a vertex x ∈ madj(u_0). To
see that x is also adjacent to u_{r-1}, consider the path γ = [x, u_0, ..., u_{r-1}, u_r] pictured
in Figure 4. The maximality of i implies that every path of length greater than one
having the following two properties will have a chord: a) the endpoints of the path are
both numbered greater than i, and b) the interior vertices are numbered less than the
minimum of the endpoints.
FIG. 4. Illustration for the proof of Theorem 2.5. The dark solid edges exist by hypothesis; existence
of the lighter broken edges is argued in the proof and the remark that follows it.
The path γ satisfies these two properties and hence has a
chord. Moreover, since μ = [u_0, u_1, ..., u_r] has no chords, every chord of γ is incident
with x. Let u_k be the vertex in γ adjacent to x which has the largest subscript. If
k ≠ r then [x, u_k, ..., u_r] is a chordless path, again contrary to the maximality of i;
hence (x, u_r) ∈ E.
It follows that σ = [x, u_0, ..., u_{r-1}, u_r, x] is a cycle of length greater than three in G
(recall that r ≥ 2). Since G is chordal, σ must have a chord, and, as argued above,
any such chord must be incident with x. Let u_t be the vertex in σ with the highest
subscript other than r, for which (x, u_t) ∈ E. If t ≠ r − 1, then [x, u_t, ..., u_r, x]
is a chordless cycle of length greater than 3, contrary to the chordality of G. In
consequence, (x, u_{r-1}) ∈ E for all x ∈ madj(u_0). But u_{r-1} is also adjacent to u_r ∈
C_{i+1} − madj(u_0), whence (1) holds, completing the proof. □
Remark. In the preceding proof the argument leading to the inclusion of (x, u_{r-1})
in E can be repeated for every edge (x, u_j), 1 ≤ j ≤ r − 2. In consequence we have
(2)    madj(u_0) ⊆ adj(u_j) ∩ C_{i+1} for 1 ≤ j ≤ r − 2.
Statement (1) implies that if the MCS algorithm "tried" to generate α, then as the
vertex to be labeled with i is chosen, the priority of u_{r-1} would be greater than that
of u_0. Similarly, (2) implies that the priority of each vertex u_j (1 ≤ j ≤ r − 2) would
be at least as great as that of u_0.
The reader may verify that the graph in Figure 5 is a chordal graph with four
cliques, each of size three. The graph in Figure 5 will be used throughout this section
to illustrate results and key points. For convenience we shall refer to the vertices of
this graph as v_1, v_2, ..., v_7; e.g., the vertex labeled "6" will be referred to as v_6. Note
that the labeling of the vertices is a PEO of the graph.
For any chordal graph G there exists a subset of the set of trees on K_G known as
clique trees. Any one of these clique trees can be used to represent the graph, often in
a very compact and efficient manner [24, 46], as we shall see in Section 4. This section
contains a unified and elementary presentation of several key properties of clique trees,
each of which has been shown, somewhere in the literature, to characterize the set of
clique trees associated with a chordal graph.
The notion of clique trees was introduced independently by Buneman [5], Gavril [12],
and Walter [47]. The property we use to introduce and define clique trees in
Section 3.1 is a simple variant of one of the key properties introduced in their work.
We use this variant because, in our experience, it is more readily apprehended by
those who are studying this material for the first time. Section 3.2 presents the short
argument needed to show that the more recent variant is equivalent to the original.
Clique trees have found application in relational databases, where they can be
viewed as a subclass of acyclic hypergraphs, which are heavily used in that area.
Open problems in relational database theory motivated the pioneering work of
Bernstein and Goodman [2], Beeri, Fagin, Maier, and Yannakakis [1], and Tarjan and
Yannakakis [46]. Our two final characterizations of clique trees, presented in
Sections 3.3 and 3.4, are based on results from these papers. Section 3.5 summarizes
these results, and also illustrates these results in negative form using the example in
Figure 5.
For every pair of distinct cliques K, K′ ∈ K_G, the set K ∩ K′ is
contained in every clique on the path connecting K and K′ in the
tree.
FIG. 6. A tree on the cliques of the chordal graph in Figure 5, which satisfies the clique-intersection
property.
In Figure 6,
for example, the set K_4 ∩ K_2 = {v_7} is contained in K_1, which is the only clique on the
path from K_4 to K_2 in the tree. The reader may also verify that the only other tree
on {K_1, K_2, K_3, K_4} that satisfies the clique-intersection property is obtained from
the tree in Figure 6 by replacing the edge (K_3, K_2) with (K_3, K_1).
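The clique-intersection property is easy to verify mechanically for a candidate tree. The following sketch is our own illustration (the function name, the clique names in the example, and the representation are all ours, not the paper's):

```python
from itertools import combinations

def satisfies_cip(cliques, tree_edges):
    """Check the clique-intersection property: for every pair of cliques
    K, K', the set K ∩ K' is contained in every clique on the tree path
    between them.  cliques: dict name -> frozenset of vertices;
    tree_edges: list of (name, name) pairs forming a tree."""
    nbrs = {k: set() for k in cliques}
    for a, b in tree_edges:
        nbrs[a].add(b); nbrs[b].add(a)

    def path(src, dst):                 # the unique tree path, by DFS
        stack = [(src, [src])]
        while stack:
            node, p = stack.pop()
            if node == dst:
                return p
            stack.extend((n, p + [n]) for n in nbrs[node] if n not in p)

    for a, b in combinations(cliques, 2):
        inter = cliques[a] & cliques[b]
        if not all(inter <= cliques[k] for k in path(a, b)):
            return False
    return True
```

For instance, with cliques {1,2}, {2,3}, {3,4} of a path graph, the tree K1-K2-K3 satisfies the property, while the tree with edges K1-K3 and K3-K2 does not, since K1 ∩ K2 = {2} is not contained in K3.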
We will show in Theorem 3.2 below that G is chordal if and only if there exists
a tree on K_G that satisfies the clique-intersection property. For any given chordal
graph G, we shall let T_G denote the nonempty set of trees T = (K_G, E_T) that satisfy
the clique-intersection property, and we shall refer to any member of T_G as a clique
tree of the underlying chordal graph G. In Section 3.2, we prove the original version
of this result, which was introduced independently by Buneman [5], Gavril [12], and
Walter [47].
To prove the main result of this subsection, we require two more definitions and a
simple lemma. A vertex K in a tree T is a leaf if it has precisely one neighbor in T (i.e.,
|adj_T(K)| = 1). We let K_G(v) ⊆ K_G denote the set of cliques containing the vertex v.
The following simple characterization of simplicial vertices has been useful in various
applications. This result has been used widely in the literature [8, 19, 23, 24, 46], and
has been formally stated and proven in at least two places [23, 24].
LEMMA 3.1. A vertex is simplicial if and only if it belongs to precisely one clique.
Proof. Suppose a vertex v belongs to two cliques K, K′ ∈ K_G. Maximality of the
cliques implies the existence of two distinct nonadjacent vertices u ∈ K − K′ and
u′ ∈ K′ − K. Since both u and u′ are adjacent to v, it follows that v is not simplicial.
Assume now that the vertex v belongs to one and only one clique K ∈ K_G. Note
that v is adjacent to a vertex u ≠ v if and only if there exists a clique of G to which
both u and v belong. Consequently adj(v) = K − {v}, whence v is simplicial. □
The first part of the following proof closely resembles the argument given by
Gavril [12] to prove a result that shall be presented in the next section. The second
half was improvised for this paper, and resembles the first half in many of its features.
THEOREM 3.2. A connected graph G is chordal if and only if there exists a tree
T = (K_G, E_T) for which the clique-intersection property holds.
Proof. We proceed by induction on the number of vertices n to show the "only if"
part. The base step n = 1 is obvious. For the induction step, let G be a chordal graph
with n ≥ 2 vertices and assume the result is true for all chordal graphs having fewer
than n vertices. By Lemma 2.2, G has a simplicial vertex, say v. Let K be the single
clique of G that contains v (see Lemma 3.1), and consider the induced subgraph
G′ = G \ {v}. Since G′ is a chordal graph with n − 1 vertices, by the induction
hypothesis there exists a tree T′ = (K_G′, E_T′) that satisfies the clique-intersection
property.
To complete the proof of the "only if" part, there are two cases to consider. First,
suppose K′ = K − {v} remains maximal in G′ (i.e., K′ ∈ K_G′). It is trivial to show
that K_G′ = K_G ∪ {K′} − {K}, and we leave it for the reader to verify this. It follows
that the only difference between the cliques of G and G′ is the presence in G of the
simplicial vertex v in K and the absence of v from the corresponding clique K′ of
G′. In consequence, the intersection of any pair of cliques in G is identical to the
intersection of the corresponding pair in G′. Let T be the tree on K_G obtained from
T′ by replacing K′ with K. Since T′ has the clique-intersection property, it follows
that T has this property as well, thereby completing the argument for the first case.
Now, suppose S′ = K − {v} is not a maximal clique in G′ (i.e., S′ ∉ K_G′). Since
n ≥ 2 and G is connected, v is not an isolated vertex, and we have
S′ = K − {v} = adj(v) ≠ ∅.
Since S′ is complete in G′, there exists a clique P ∈ K_G′ = K_G − {K} for which
S′ ⊂ P. (As before, we leave it for the reader to verify that K_G′ = K_G − {K}.) Let
T be the tree on K_G obtained by adding the clique K and the edge (K, P) to T′. We
now verify that T satisfies the clique-intersection property. Because T′ satisfies the
clique-intersection property, the set K_1 ∩ K_2 is contained in every clique on the path
from K_1 to K_2 in T whenever neither K_1 nor K_2 is K. Consider now the set K ∩ K″,
where K″ ∈ K_G − {K} = K_G′. Since K − {v} ⊂ P and v belongs to no clique in
K_G − {K}, it follows that K″ ∩ K ⊂ P. Because T′ satisfies the clique-intersection
property, the set K ∩ K″ ⊆ P ∩ K″ is contained in every clique on the path from K
to K″ in T, and T therefore satisfies the clique-intersection property as well.
To prove the "if" part, let G = (V, E) be a graph and suppose there exists a tree
T = (K_G, E_T) that satisfies the clique-intersection property. Again we proceed by
induction on n to show that G is chordal. The base step n = 1 is obvious. For the
induction step, let G be a graph with n ≥ 2 vertices and assume the result is true for
all graphs having fewer than n vertices.
Let K and P be, respectively, a leaf of T and its sole neighbor (i.e., "parent")
in T. By maximality of the cliques there exists a vertex v ∈ K − P. The vertex v
moreover cannot belong to any clique K′ ∈ K_G − {K, P}, for were it otherwise the
clique P, which is on the path from K to K′ in T, would not contain the set K ∩ K′.
Consequently v belongs to no clique other than K, whence by Lemma 3.1 it is a
simplicial vertex of G.
Consider the reduced graph G′ = G \ {v} and let K′ = K − {v}. If K′ ⊄ P,
then the "reduced" tree T′ for G′ is obtained simply by replacing K with K′ in T; if
K′ ⊆ P, then T′ is obtained by removing from T the vertex K and the single edge
(K, P) incident with it in T. As before, in the first case, K_G′ = K_G ∪ {K′} − {K}; in
the second case, K_G′ = K_G − {K}. In either case, it is trivial to verify that the tree
T′ satisfies the clique-intersection property. From the induction hypothesis it follows
that G′ is chordal. Let β be any PEO of G′. A PEO of G can then be obtained by
ordering v first, followed by the remaining vertices of G in the order determined by β.
Thus by Theorem 2.3, G is also chordal, giving us the result. □
We thus have the following well-known result from the literature.
THEOREM 3.4 (BUNEMAN [5], GAVRIL [12], WALTER [47]). A connected graph
G is chordal if and only if there exists a tree T = (K_G, E_T) for which the induced-
subtree property holds.
Proof. The result follows immediately from Theorems 3.2 and 3.3. □
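The clique-intersection property of Theorems 3.2 and 3.3 can be checked directly from its definition. The following sketch (Python; the function name and encoding are ours, not from the text) tests whether a given tree on the cliques has the property.

```python
from itertools import combinations

def has_clique_intersection_property(cliques, tree_edges):
    """Check whether the tree on `cliques` (indexed 0..m-1) given by
    `tree_edges` satisfies the clique-intersection property: for every
    pair K, K' of cliques, K & K' is contained in every clique on the
    unique tree path joining K and K'."""
    adj = {i: set() for i in range(len(cliques))}
    for a, b in tree_edges:
        adj[a].add(b)
        adj[b].add(a)

    def path(a, b):
        # Unique tree path from a to b, found by DFS and parent pointers.
        stack, parent = [a], {a: None}
        while stack:
            u = stack.pop()
            if u == b:
                break
            for w in adj[u]:
                if w not in parent:
                    parent[w] = u
                    stack.append(w)
        p, u = [], b
        while u is not None:
            p.append(u)
            u = parent[u]
        return p

    for i, j in combinations(range(len(cliques)), 2):
        common = cliques[i] & cliques[j]
        if not all(common <= cliques[k] for k in path(i, j)):
            return False
    return True
```

For the chordal graph with cliques {1,2,3}, {2,3,4}, {4,5}, the path attaching {2,3,4} between the other two is a clique tree, whereas hanging both {1,2,3} and {2,3,4} off {4,5} is not.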
An ordering K_1, K_2, …, K_m of the cliques is an RIP (running intersection property)
ordering if for each j, 2 ≤ j ≤ m, there exists an i, 1 ≤ i ≤ j − 1, such that

(3)    K_j ∩ (K_1 ∪ K_2 ∪ ⋯ ∪ K_{j−1}) ⊆ K_i.
For any RIP ordering of the cliques, we construct a tree T_rip on K_G by making each
clique K_j adjacent to a "parent" clique K_i identified by (3). (Since more than one
clique K_i, 1 ≤ i ≤ j − 1, may satisfy (3), the parent may not be uniquely determined.)
We let T_G^rip be the set containing every tree on K_G that can be constructed from an
RIP ordering in this manner. We define a reverse topological ordering of any rooted
tree as an ordering that numbers each parent before any of its children. Finally, note
that any RIP ordering is a reverse topological ordering of a rooted tree constructed
from the ordering in the manner specified above.
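The parent construction just described can be sketched in a few lines (Python; the function name is ours). Each K_j, j ≥ 2, is attached to any earlier clique containing K_j ∩ (K_1 ∪ ⋯ ∪ K_{j−1}), as in (3); the sketch assumes the input really is an RIP ordering.

```python
def rip_tree_parents(cliques):
    """Given cliques (as sets) listed in an RIP ordering, attach each
    clique to a parent among its predecessors, as in (3).  Returns a
    parent map over 0-based clique indices; raises StopIteration if the
    ordering is not an RIP ordering."""
    parent = {0: None}
    seen = set(cliques[0])                 # union of the earlier cliques
    for j in range(1, len(cliques)):
        s = cliques[j] & seen
        # Any earlier clique containing s may serve as the parent,
        # so the resulting tree need not be unique.
        parent[j] = next(i for i in range(j) if s <= cliques[i])
        seen |= cliques[j]
    return parent
```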
The ordering K_1, K_2, K_3, K_4 of the cliques shown in Figure 5 is an RIP ordering;
a corresponding RIP-induced parent function is displayed in Figure 7. Note that the
parent function specifies precisely the edges of the clique tree in Figure 6. Indeed, we
can show that for any connected graph G, we have T_G^rip = T_G^ct.
FIG. 7. The clique tree in Figure 6 is an RIP tree. Arrows point from child to parent.
THEOREM 3.5 (BEERI, FAGIN, MAIER, YANNAKAKIS [1]). For any connected
graph G, we have T_G^rip = T_G^ct.
Proof. We first show that T_G^ct ⊆ T_G^rip. Let T_ct ∈ T_G^ct, choose R ∈ K_G, and
root T_ct at R. Consider any reverse topological ordering R = K_1, K_2, …, K_m of
the rooted tree T_ct. For any clique K_j, 2 ≤ j ≤ m, let K_p be its parent clique in
the rooted tree (whence 1 ≤ p ≤ j − 1). Now, for 1 ≤ i ≤ j − 1, the clique K_i
To see that T_G^rip ⊆ T_G^ct, consider a tree T = (K_G, E) ∉ T_G^ct. We will show that
T ∉ T_G^rip.
Since T ∉ T_G^ct, there exists a pair of distinct cliques K, K′ ∈ K_G such
that the set K ∩ K′ is not contained in at least one clique on the path connecting K
and K′ in the tree. Choose two such cliques K, K′ ∈ K_G that minimize the length of
the path from K to K′ in T. The key observation on which our argument depends
is that the set K ∩ K′ belongs to no clique on the path connecting K and K′ in the
tree, except K and K′. Let K_1, K_2, …, K_m be any reverse topological ordering of
T for arbitrary root K_1 ∈ K_G. It suffices to show that (3) does not hold for some
parent-child pair in T.
Consider the path μ = [K = K_{i_0}, K_{i_1}, …, K_{i_s} = K′] in T. Let K_{i_t} be the clique
with lowest index among the cliques in μ, and without loss of generality assume that
i_0 > i_s. Since under the given reverse topological ordering K_{i_0} is a proper descendant
of K_{i_t} ∈ μ, the clique K_{i_1} is necessarily the parent of K_{i_0} in the rooted tree, and hence
i_0 > i_1. Our choice of K (= K_{i_0}) and K′ (= K_{i_s}) implies that (a) s ≥ 2, and (b)
K_{i_0} ∩ K_{i_s} ⊄ K_{i_r} for each r, 1 ≤ r ≤ s − 1. In consequence, we have K_{i_0} ∩ K_{i_s} ⊄ K_{i_1},
whence (3) does not hold for the parent-child pair K_{i_1} and K_{i_0}, which completes the
proof. □
Remark. In the preceding proof, the argument that T_G^ct ⊆ T_G^rip verifies that any
reverse topological ordering of a clique tree T_ct ∈ T_G^ct is an RIP ordering of the cliques.
Our argument requires two ideas commonly used in the study of maximum-weight
(minimum-weight) spanning tree algorithms. First, let T = (K_G, E_T) be a spanning
tree of W_G. It is well known that T is a maximum-weight spanning tree if and
only if for every pair of cliques K, K′ ∈ K_G for which (K, K′) ∉ E_T, the weight of
every edge on the path joining K and K′ in T is no smaller than |K ∩ K′| (see, for
example, Tarjan [45, pp. 71-72]). Second, given an edge (K, K′) in a tree, we define
the fundamental cut set (see Gibbons [18, p. 58]) associated with the edge as follows.
The removal of (K, K′) from the tree partitions the vertices of T into precisely two
sets, say 𝒦_1 and 𝒦_2. The fundamental cut set associated with (K, K′) consists of
every edge with one vertex in 𝒦_1 and the other in 𝒦_2, including (K, K′) itself.
FIG. 8. Weighted clique intersection graph for the graph in Figure 5. Bold edges belong to the clique
tree in Figure 6. Also shown are the intersection sets upon which the weights are based.
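In code, a weighted clique intersection graph such as the one in Figure 8 can be formed directly from the cliques (a Python sketch; the function name is ours):

```python
from itertools import combinations

def weighted_clique_intersection_graph(cliques):
    """Return the edges of W_G as a dict mapping a pair of 0-based clique
    indices (i, j), i < j, to the weight |K_i & K_j|; pairs of cliques
    with empty intersection get no edge."""
    return {(i, j): len(cliques[i] & cliques[j])
            for i, j in combinations(range(len(cliques)), 2)
            if cliques[i] & cliques[j]}
```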
THEOREM 3.6 (BERNSTEIN AND GOODMAN [2]). For any connected chordal
graph G, T_G^mst = T_G^ct.
Proof. We first show that T_G^ct ⊆ T_G^mst. Let T_ct ∈ T_G^ct and choose two cliques K
and K′ that are not connected by an edge in T_ct. Consider the cycle formed by adding
the edge (K, K′) to T_ct. By Theorem 3.2 every edge along this cycle has weight no
smaller than |K ∩ K′|, whence T_ct is a maximum-weight spanning tree of W_G.
To see that T_G^mst ⊆ T_G^ct, choose T_mst ∈ T_G^mst. By Theorem 3.2, T_G^ct ≠ ∅. Choose
T_ct ∈ T_G^ct that has a maximum number of edges in common with T_mst. Assume for
the purpose of contradiction that there is an edge (K_1, K_2) of T_mst that is not an edge
of T_ct. Consider the fundamental cut set (in W_G) associated with the edge (K_1, K_2)
of T_mst and also the cycle (in T_ct) obtained by adding the edge (K_1, K_2) to T_ct. Any
cycle containing one edge from the cut set must contain another edge from the cut
set as well. Select from the cycle in T_ct one of the edges (K_3, K_4) ≠ (K_1, K_2) that
belongs to the cut set.
Note that the edge (K_3, K_4) is an edge of T_ct, but it is not an edge of T_mst. Since
T_ct is a clique tree, it follows from Theorem 3.2 that K_1 ∩ K_2 ⊆ K_3 ∩ K_4. However, if
K_1 ∩ K_2 were a proper subset of K_3 ∩ K_4, then replacing (K_1, K_2) in T_mst with (K_3, K_4)
would result in a spanning tree of greater weight, contrary to the maximality of T_mst's
weight. Hence, K_1 ∩ K_2 = K_3 ∩ K_4. Consider the tree obtained by replacing (K_3, K_4)
in T_ct with the edge (K_1, K_2). The reader can easily verify that the resulting tree is
a clique tree. The new clique tree moreover has one more edge in common with T_mst
than originally possessed by T_ct, giving us the contradiction we seek. Consequently,
T_mst = T_ct, and the result holds. □
3.5. Summary. The following corollary summarizes the results presented in this
section.
COROLLARY 3.7. For every connected graph G, we have T_G^ct = T_G^ist.
Furthermore, G is chordal if and only if this set is nonempty, in which case we have

    T_G^ct = T_G^ist = T_G^rip = T_G^mst.
Based on Corollary 3.7, we henceforth drop the superscripts from our notation
and shall use T_G to denote the set of clique trees of G. Finally, Figure 9 illustrates
Corollary 3.7 in negative form.
4. Clique trees, separators, and MCS revisited. This section ties together
some of the results and concepts presented separately in Sections 2 and 3. Sec-
tion 4.1 presents results that link the edges in a clique tree with the minimal vertex
separators of the underlying chordal graph. Section 4.2 presents an efficient algo-
rithm for computing a clique tree. This algorithm, which is a simple extension of the
MCS algorithm, is shown to be an implementation of Prim's algorithm for finding a
maximum-weight spanning tree of the weighted clique intersection graph W_G. New
definitions and notation will be introduced as needed, and appropriate references to
the literature will be given in each subsection. As in the previous section, we assume
without loss of generality that G is connected.
4.1. Clique tree edges and minimal vertex separators. Choose a clique
tree T ∈ T_G and let S = K_i ∩ K_j for some edge (K_i, K_j) ∈ E_T. Let T_i = (𝒦_i, E_i) and
T_j = (𝒦_j, E_j) denote the two subtrees obtained by removing the edge (K_i, K_j) from
T, where K_i ∈ 𝒦_i and K_j ∈ 𝒦_j, and define

    V_i := ( ⋃_{K ∈ 𝒦_i} K ) − S   and   V_j := ( ⋃_{K ∈ 𝒦_j} K ) − S.
We first prove two technical lemmas, the second of which shows that the set
S = K_i ∩ K_j separates V_i from V_j in G. These two results are then used in the proof
of Theorem 4.3 to show that for any clique tree T ∈ T_G the set S′ ⊂ V is a minimal
vertex separator if and only if S′ = K ∩ K′ for some edge (K, K′) ∈ E_T. The results
in this section have appeared in both Ho and Lee [21] and Lundquist [33]. The proofs
of Lemma 4.2 and Theorem 4.3 are similar to arguments given by Lundquist [33].
LEMMA 4.1. The sets V_i, V_j, and S form a partition of V.
Proof. Let T, S, K_i, K_j, 𝒦_i, 𝒦_j, V_i, and V_j be as defined in the first paragraph
of the subsection. Clearly, V = V_i ∪ V_j ∪ S, and S is disjoint from both V_i and V_j.
Hence it suffices to show that V_i ∩ V_j = ∅. By way of contradiction assume that there
exists a vertex v ∈ V_i ∩ V_j. It follows that v belongs to some clique K ∈ 𝒦_i and also
belongs to some clique K′ ∈ 𝒦_j. Since T ∈ T_G, the vertex v belongs to every clique
along the path joining K and K′ in T, which necessarily includes both K_i and K_j. In
consequence, v ∈ S = K_i ∩ K_j, which is impossible since both V_i and V_j are disjoint
from S, whence the result follows. □
To prove the "only if" part, choose T ∈ T_G and let S be a minimal vw-separator
of G. Since (v, w) ∉ E, the sets K_G(v) and K_G(w) induce disjoint subtrees of T.
Choose K ∈ K_G(v) and K′ ∈ K_G(w) to minimize the distance in T between K and
K′. Consider the path μ = [K = K_0, K_1, …, K_{r−1}, K_r = K′] in T, where r ≥ 1.
That this multiset is the same for all clique trees T ∈ T_G is an immediate consequence
of a result by Ho and Lee [21]; the result was also proven by Lundquist [33]. The
proof is taken directly from Blair and Peyton [4].
THEOREM 4.4 (HO AND LEE [21], LUNDQUIST [33]). The multiset of separators
is the same for every clique tree T ∈ T_G.
Proof. For the purpose of contradiction, suppose there exist two distinct clique
trees T, T′ ∈ T_G for which M_T ≠ M_T′. From among the clique trees T′ ∈ T_G for
which M_T′ ≠ M_T, choose T′ so that it shares as many edges as possible with T.
(Note that T and T′ cannot share the same edge set, for then they also would share
the same multiset of separators.)
Let (K_1, K_2) be an edge of T that does not belong to T′. As in the proof of
Theorem 3.6, consider the fundamental cut set (in W_G) associated with the edge
(K_1, K_2) of T and also the cycle (in T′) obtained by adding the edge (K_1, K_2) to T′.
Recall that any cycle containing one edge from the cut set must contain another edge
from the cut set as well. Select from the cycle in T′ one of the edges (K_3, K_4) ≠
(K_1, K_2) that belongs to the cut set. Note that the edge (K_3, K_4) is an edge of T′
but not an edge of T.
Since T ∈ T_G, it follows by Theorem 3.2 that K_3 ∩ K_4 ⊆ K_1 ∩ K_2; similarly, since
T′ ∈ T_G, it follows by Theorem 3.2 that K_1 ∩ K_2 ⊆ K_3 ∩ K_4; hence K_3 ∩ K_4 =
K_1 ∩ K_2. By Theorem 3.6, the replacement of (K_3, K_4) in T′ with (K_1, K_2) results
in a clique tree, which, moreover, clearly has the same multiset of separators that T′
has. Contrary to our assumption about T′, the modified tree shares one more edge
with T, and thus the result follows. □
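Theorem 4.4 is easy to exercise computationally. The sketch below (Python; names ours) collects the multiset M_T of separators K_i ∩ K_j over the edges of a clique tree; by the theorem, any two clique trees of the same chordal graph yield equal multisets.

```python
from collections import Counter

def separator_multiset(cliques, tree_edges):
    """Multiset of edge separators of a clique tree, one per tree edge:
    Counter mapping each separator (a frozenset) to its multiplicity."""
    return Counter(frozenset(cliques[i] & cliques[j]) for i, j in tree_edges)
```

For instance, the star K_{1,3} (cliques {1,2}, {1,3}, {1,4}) has several distinct clique trees, and all of them have separator multiset {{1}, {1}}.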
4.2. MCS and Prim's algorithm. Prim's algorithm [39] is an efficient method
for computing a maximum-weight (minimum-weight) spanning tree of a weighted
graph. Thus, by Theorem 3.6, Prim's algorithm applied to the weighted clique in-
tersection graph W_G computes a clique tree T ∈ T_G. At any point the algorithm
has constructed a subtree of the eventual maximum-weight spanning tree T, and at
each step it adds one more clique and edge to this subtree. Let 𝒦 ⊆ K_G be the
cliques in the subtree constructed thus far. As the next edge to be added, the algo-
rithm chooses the heaviest edge that joins 𝒦 to K_G − 𝒦. For a proof that Prim's
algorithm correctly computes a maximum-weight spanning tree, we refer the reader
to Tarjan [45, pp. 73-75] or Gibbons [18, pp. 40-42]. A version of Prim's algorithm
formulated specifically for our problem is given in Figure 10.
E_T ← ∅;
Choose K ∈ K_G;
𝒦 ← {K};
for r ← 2 to m do
    Choose cliques K ∈ 𝒦 and K′ ∈ K_G − 𝒦
        for which |K ∩ K′| is maximum;
    E_T ← E_T ∪ {(K, K′)};
    𝒦 ← 𝒦 ∪ {K′};
end for

FIG. 10. Prim's algorithm for finding a maximum-weight spanning tree of the weighted clique inter-
section graph W_G.
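A direct Python transcription of Figure 10 follows (a quadratic-time sketch, not an efficient implementation; the function name is ours). For a connected chordal graph, Theorem 3.6 guarantees that the tree returned is a clique tree.

```python
def prim_clique_tree(cliques):
    """Prim's algorithm on the weighted clique intersection graph W_G:
    repeatedly add a heaviest edge joining a searched clique to an
    unsearched one.  Returns m - 1 edges as pairs of clique indices."""
    m = len(cliques)
    searched = {0}                          # start from K_1 (index 0)
    edges = []
    for _ in range(m - 1):
        i, j = max(((a, b) for a in searched for b in range(m)
                    if b not in searched),
                   key=lambda e: len(cliques[e[0]] & cliques[e[1]]))
        edges.append((i, j))
        searched.add(j)
    return edges
```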
In this section we will show that the MCS algorithm applied to a chordal graph
G can be viewed as an implementation of Prim's algorithm applied to W_G. In Sec-
tion 4.2.1 we show that since the MCS algorithm generates a PEO, it can easily detect
the cliques in K_G during the course of the computation. Section 4.2.2 shows that 1)
the MCS algorithm can be viewed as a block algorithm that "searches" the cliques in
K_G one after the other, and 2) the order in which the cliques are searched is precisely
the order in which the cliques are searched by Prim's algorithm in Figure 10. Using
the results in Sections 4.2.1 and 4.2.2, we also show how to supplement the MCS
algorithm with a few additional statements so that it detects the cliques and a set
of clique tree edges as it generates a PEO. A detailed statement of this algorithm
appears at the end of Section 4.2.2.
The close connection between the MCS algorithm and Prim's algorithm was, to
our knowledge, first presented by Blair, England, and Thomason [3]. Several of the
proofs in this section are similar to arguments given by Lewis et al. [24]. Though the
techniques discussed in this section can be implemented to run quite efficiently, there
are more efficient ways to compute a clique tree when certain data structures that
arise in sparse matrix computations are available. The reader should consult Lewis
et al. [24] for details on how to compute a clique tree in the course of solving a sparse
positive definite linear system.
4.2.1. Detecting the cliques. In this subsection we show that the MCS algo-
rithm can easily and efficiently detect the cliques in K_G. To do so we exploit the fact
that MCS computes a PEO. We shall use the following result from Fulkerson and
Gross [10].
LEMMA 4.5 (FULKERSON AND GROSS [10]). Let v_1, v_2, …, v_n be a perfect elim-
ination ordering of G. The set of maximal cliques K_G contains precisely the sets
{v_i} ∪ madj(v_i) for which there exists no vertex v_j, j < i, such that

(4)    {v_i} ∪ madj(v_i) ⊂ {v_j} ∪ madj(v_j).
Proof. Choose K ∈ K_G and let v_i ∈ K be the vertex whose label i assigned
by the PEO is lowest among the labels assigned to a vertex of K. Consider the
vertex set {v_i} ∪ madj(v_i). Since K consists of v_i and neighbors of v_i with labels
larger than i, clearly K ⊆ {v_i} ∪ madj(v_i). Because the ordering is a PEO, the set
{v_i} ∪ madj(v_i) must be complete in G. Thus by maximality of the clique K we have
K = {v_i} ∪ madj(v_i), and moreover it follows that (4) holds for no vertex v_j, j < i.
Now, let K = {v_i} ∪ madj(v_i) and suppose that (4) holds for no vertex v_j, j < i.
Since the ordering is a PEO, clearly K is complete in G. If K is submaximal, then
there exists a vertex v_j ∈ V − K that is adjacent to every vertex of K. But the
existence of such a vertex v_j is impossible: if j > i then v_j ∈ madj(v_i), contrary to
v_j ∈ V − K; if j < i then (4) holds for v_j, contrary to our assumption. In consequence,
no such vertex v_j exists, and the result follows. □
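Lemma 4.5 translates directly into a quadratic-time sketch (Python; the encoding is ours): compute each candidate set {v_i} ∪ madj(v_i) and discard those properly contained in a candidate with a smaller label.

```python
def maximal_cliques_from_peo(order, adj):
    """Fulkerson-Gross detection of the maximal cliques: `order` lists the
    vertices v_1, ..., v_n of a PEO, and adj[v] is the neighbor set of v.
    Returns the cliques of K_G in order of their representative labels."""
    label = {v: i for i, v in enumerate(order)}
    cand = [frozenset({v}) | frozenset(w for w in adj[v] if label[w] > label[v])
            for v in order]
    # Keep {v_i} U madj(v_i) unless it sits inside an earlier candidate (4).
    return [c for i, c in enumerate(cand)
            if not any(c < cand[j] for j in range(i))]
```

On the chordal graph with cliques {1,2,3}, {2,3,4}, {4,5} and the PEO 1, 2, 3, 4, 5, the candidates {3,4} and {5} are absorbed and exactly the three maximal cliques survive.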
Throughout the remainder of the paper we let v_1, v_2, …, v_n be a PEO obtained
by applying the MCS algorithm to a connected chordal graph G. We shall call v_{i_r}
the representative vertex of K_r whenever K_r = {v_{i_r}} ∪ madj(v_{i_r}); that is, we let
v_{i_1}, v_{i_2}, …, v_{i_m} be the representative vertices of the cliques K_1, K_2, …, K_m, respec-
tively, where i_1 > i_2 > … > i_m. Thus the ordering K_1, K_2, …, K_m specifies the order
in which the cliques are searched by the MCS algorithm.
As the MCS algorithm generates a PEO it can easily detect the representative
vertices and hence can easily collect the cliques in K_G. Condition 2 in the next
lemma provides a test for determining when a vertex in an MCS ordering is not
a representative vertex. Lemma 4.7 then provides a simple test for detecting the
representative vertices.
LEMMA 4.6. Let v_1, v_2, …, v_n be a perfect elimination ordering obtained by ap-
plying the maximum cardinality search algorithm to a connected chordal graph G. For
each vertex label i, 1 ≤ i ≤ n − 1, the following are equivalent:
1. {v_{i+1}} ∪ madj(v_{i+1}) ∉ K_G;
2. |adj(v_i) ∩ C_{i+1}| > |adj(v_{i+1}) ∩ C_{i+2}|;
3. madj(v_i) = {v_{i+1}} ∪ madj(v_{i+1}).
Assume that the first condition in the statement of the lemma holds for v_{i+1}, and
consider the vertex v_i selected by the MCS algorithm at the next step. When the
algorithm selects v_i there exists (by Lemma 4.5) a vertex u ∈ V − C_{i+1} that is adjacent
to every vertex in {v_{i+1}} ∪ madj(v_{i+1}). In light of (6), the existence of such a vertex
u ensures that the vertex v_i chosen by the MCS algorithm (perhaps v_i = u) satisfies
the second condition.
Assume now that the second condition in the statement of the lemma holds for
the two vertices v_i and v_{i+1}. It immediately follows that

    |madj(v_i)| = |{v_{i+1}} ∪ madj(v_{i+1})|.

Consequently, to prove that the third condition holds true it suffices to show that
madj(v_i) ⊆ {v_{i+1}} ∪ madj(v_{i+1}). Now if it were the case that v_{i+1} ∉ adj(v_i), then
from (5) and the fact that C_{i+1} = C_{i+2} ∪ {v_{i+1}} we would have

    |adj(v_i) ∩ C_{i+1}| ≤ |adj(v_{i+1}) ∩ C_{i+2}|,

contrary to our assumption that condition 2 holds true. It follows then that v_{i+1} is
adjacent to v_i in G. Now choose v_k ∈ madj(v_i) − {v_{i+1}}. Clearly k ≥ i + 2; moreover,
since {v_i} ∪ madj(v_i) is complete in G, v_k is necessarily adjacent to v_{i+1} ∈ madj(v_i);
whence v_k ∈ madj(v_{i+1}), giving us condition 3.
Finally, by Lemma 4.5 the first condition follows immediately from the third,
which completes the proof. □
Further extending the result in Lemma 4.6, we obtain the following technique for
detecting the representative vertices of K_G while generating the MCS ordering.
LEMMA 4.7. Let v_1, v_2, …, v_n be a perfect elimination ordering obtained by apply-
ing the maximum cardinality search algorithm to a connected chordal graph G. Then
K_G contains precisely the following sets: {v_1} ∪ madj(v_1) and {v_{i+1}} ∪ madj(v_{i+1}),
1 ≤ i ≤ n − 1, for which

(7)    |adj(v_i) ∩ C_{i+1}| ≤ |adj(v_{i+1}) ∩ C_{i+2}|.

Proof. From Lemma 4.5 it follows that {v_1} ∪ madj(v_1) ∈ K_G. Consider the set
{v_{i+1}} ∪ madj(v_{i+1}) where 1 ≤ i ≤ n − 1. It follows from (6) and the equivalence of
conditions 1 and 2 in Lemma 4.6 that {v_{i+1}} ∪ madj(v_{i+1}) is a member of K_G if and
only if (7) holds. This concludes the proof. □
4.2.2. MCS as a block algorithm. Clearly, the MCS algorithm can detect
the cliques in K_G by determining at each step whether or not (7) holds. With the
next lemma we show that the MCS algorithm can be viewed as a block algorithm
that searches the cliques of K_G one after the other.
LEMMA 4.8. Let v_1, v_2, …, v_n be a perfect elimination ordering obtained by ap-
plying the maximum cardinality search algorithm to a connected chordal graph G,
and let v_{i_1}, v_{i_2}, …, v_{i_m} be the representative vertices of the cliques K_1, K_2, …, K_m,
respectively, where i_1 > i_2 > … > i_m. Then

(8)    C_{i_r} = K_1 ∪ K_2 ∪ ⋯ ∪ K_r

for each r, 1 ≤ r ≤ m.
Proof. Choose r, 1 ≤ r ≤ m, and assume v_j ∉ C_{i_r}, i.e., j < i_r. Since clearly
v_j ∉ {v_{i_s}} ∪ madj(v_{i_s}) for each s, 1 ≤ s ≤ r, it follows by Lemma 4.5 that
v_j ∉ ⋃_{s=1}^{r} K_s.
Now assume v_j ∈ C_{i_r} and for convenience of notation define i_0 := n + 1. Choose s,
1 ≤ s ≤ r, for which i_s ≤ j < i_{s−1}. If j = i_s, then clearly v_j ∈ K_s = {v_j} ∪ madj(v_j).
If i_s < j, then by repeated application of condition 3 of Lemma 4.6, we have

    K_s = {v_{i_s}} ∪ madj(v_{i_s})
        = {v_{i_s}, v_{i_s+1}} ∪ madj(v_{i_s+1})
        = ⋯ = {v_{i_s}, v_{i_s+1}, …, v_j} ∪ madj(v_j),

whence v_j ∈ K_s, and the result follows. □

From Lemma 4.8 it follows that the vertices are partitioned among the cliques, in
the order searched, as follows:

    {v_{i_1}, v_{i_1+1}, …, v_n = v_{i_0−1}} = K_1,
    {v_{i_2}, v_{i_2+1}, …, v_{i_1−1}} = K_2 − K_1,
    {v_{i_3}, v_{i_3+1}, …, v_{i_2−1}} = K_3 − ⋃_{s=1}^{2} K_s,
    ⋮
    {v_{i_m}, v_{i_m+1}, …, v_{i_{m−1}−1}} = K_m − ⋃_{s=1}^{m−1} K_s.
For convenience we define the function clique : V → {1, …, m} by clique(v_j) := r,
where i_0 := n + 1 and v_j ∈ {v_{i_r}, v_{i_r+1}, …, v_{i_{r−1}−1}} (i.e., i_r ≤ j < i_{r−1}). Clearly
clique(v_j) is the lowest index of a clique that contains v_j; that is,

    clique(v_j) = min{ r | v_j ∈ K_r }.
LEMMA 4.9. Let v_1, v_2, …, v_n be a perfect elimination ordering obtained by ap-
plying the maximum cardinality search algorithm to a connected chordal graph G, and
let v_{i_1}, v_{i_2}, …, v_{i_m} be the representative vertices of the cliques K_1, K_2, …, K_m, respec-
tively, where i_1 > i_2 > … > i_m. For any integer r, 1 ≤ r ≤ m − 1, there exists an
integer p, 1 ≤ p ≤ r, such that

(9)    K_{r+1} ∩ (K_1 ∪ K_2 ∪ ⋯ ∪ K_r) ⊆ K_p.
Proof. Let 1 ≤ r ≤ m − 1. From Lemma 4.8 it follows that

    K_{r+1} ∩ (K_1 ∪ K_2 ∪ ⋯ ∪ K_r) = K_{r+1} ∩ C_{i_r}.

To prove the result it suffices to show that K_{r+1} ∩ C_{i_r} ⊆ K_p. Now consider the set
K_{r+1} ∩ C_{i_r}, choose v_j ∈ K_{r+1} ∩ C_{i_r} with smallest label j, and let p = clique(v_j).
Clearly K_{r+1} ∩ C_{i_r} is complete in G, and moreover

(10)    K_{r+1} ∩ C_{i_r} ⊆ {v_j} ∪ madj(v_j)

and

(11)    {v_j} ∪ madj(v_j) ⊆ K_p.

Combining (10) and (11), we obtain the result. □
From Lemmas 4.8 and 4.9 it follows that any MCS clique ordering is also an RIP
ordering. Furthermore, Lemma 4.9 shows specifically how to use the clique function
to obtain the edges of a clique tree in an efficient manner. (This technique for deter-
mining a clique tree parent function was introduced by Tarjan and Yannakakis [46]
and also appears in Lewis et al. [24].) It follows that the MCS algorithm can generate
a clique tree by 1) detecting the cliques via representative vertices (Lemma 4.7) and
2) choosing as the parent of K_{r+1} the clique K_p for which p = clique(v_j), where j
is the smallest label in K_{r+1} ∩ C_{i_r}. The following result shows that any clique tree
generated in this fashion could also be generated by Prim's algorithm applied to W_G.
THEOREM 4.10. Any order in which the cliques are searched by the maximum
cardinality search algorithm is also an order in which the cliques are searched by
Prim's algorithm applied to W_G.
Proof. Let K_1, K_2, …, K_m be the order in which the cliques are searched by the
MCS algorithm, and for 1 ≤ r ≤ m − 1 let K_p be the parent of K_{r+1} chosen as
described above. It suffices to show that the edge (K_p, K_{r+1}) is a heaviest edge
joining {K_1, …, K_r} to K_G − {K_1, …, K_r} in W_G; that is,

(12)    |K_{r+1} ∩ K_p| ≥ |K_t ∩ K_s|.
To prove that (12) holds, choose any s and t for which 1 ≤ s ≤ r < t ≤ m. Consider
the vertex v_j ∈ K_{r+1} ∩ C_{i_r} for which j is minimum, and let p = clique(v_j). By
Lemma 4.9, we can write
(13)
Lemma 4.8 and the discussion following that result imply that v_{i_r−1} is the vertex
from K_{r+1} − C_{i_r} whose label is maximum. By repeated application of condition 3 of
Lemma 4.6 (as needed) we obtain the following:
In consequence we have
(14)
Now, if
contrary to the maximum cardinality search criterion by which the vertices were
labeled. It follows then that
(15)
(16)
Combining (13), (14), (15), and (16) shows that (12) holds, giving us the result. □
From the results in this subsection, we obtain an expanded version of the MCS
algorithm, which computes a clique tree in addition to a PEO. The MCS algorithm
is shown in Figure 3, and the expanded algorithm is shown in Figure 11. We
prev_card ← 0;
C_{n+1} ← ∅;
s ← 0;
E_T ← ∅;
for i ← n to 1 step −1 do
    Choose a vertex v ∈ V − C_{i+1} for which
        |adj(v) ∩ C_{i+1}| is maximum;
    α(v) ← i; [v becomes v_i]
    new_card ← |adj(v_i) ∩ C_{i+1}|;
    if new_card ≤ prev_card then [begin new clique]
        s ← s + 1;
        K_s ← adj(v_i) ∩ C_{i+1}; [= madj(v_i)]
        if new_card ≠ 0 then [get edge to parent]
            k ← min{ j | v_j ∈ K_s };
            p ← clique(v_k);
            E_T ← E_T ∪ {(K_s, K_p)};
        end if
    end if
    clique(v_i) ← s;
    K_s ← K_s ∪ {v_i};
    C_i ← C_{i+1} ∪ {v_i};
    prev_card ← new_card;
end for

FIG. 11. An expanded version of MCS, which implements Prim's algorithm in Figure 10.
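The expanded MCS of Figure 11 can be sketched in Python as follows (encoding and names ours; ties in the search are broken arbitrarily, so the PEO and the parent choices need not be unique):

```python
def mcs_clique_tree(adj):
    """Expanded maximum cardinality search: returns the labels alpha(v),
    the cliques K_1, ..., K_m (in MCS search order), and clique tree
    edges as pairs of 0-based clique indices (child, parent)."""
    n = len(adj)
    label = {}                         # alpha(v); labels run n down to 1
    clique_of = {}                     # clique(v): lowest clique containing v
    weight = {v: 0 for v in adj}       # |adj(v) & C_{i+1}|
    cliques, edges = [], []
    prev_card = 0
    for i in range(n, 0, -1):
        v = max((u for u in adj if u not in label), key=lambda u: weight[u])
        new_card = weight[v]
        if new_card <= prev_card:                    # begin new clique
            K = {w for w in adj[v] if w in label}    # = madj(v_i)
            cliques.append(K)
            if new_card != 0:                        # get edge to parent
                vk = min(K, key=lambda w: label[w])  # smallest PEO label in K
                edges.append((len(cliques) - 1, clique_of[vk]))
        label[v] = i
        clique_of[v] = len(cliques) - 1
        cliques[-1].add(v)
        for w in adj[v]:
            if w not in label:
                weight[w] += 1
        prev_card = new_card
    return label, cliques, edges
```

On the chordal graph with cliques {1,2,3}, {2,3,4}, {4,5}, the sketch labels the vertices 1, 2, 3, 4, 5 with 5, 4, 3, 2, 1, detects the three cliques in search order, and attaches each clique to its predecessor.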
emphasize that the primary purpose of this section is to establish the connection be-
tween the MCS algorithm and Prim's algorithm (applied to W_G), and Theorem 4.10
demonstrates that the detailed algorithm in Figure 11 can be viewed as a special im-
plementation of Prim's algorithm shown in Figure 10. Some of the details necessary
to represent a chordal graph as a clique tree have been discussed here; for a complete
discussion of this topic the reader should consult the papers [24, 46]. It is worth
noting that a clique tree is often a much more compact and more computationally
efficient data structure than the adjacency lists usually used to represent G.
5.2. Elimination trees. More commonly used than the clique tree, the elim-
ination tree associated with the ordered graph G_A has proven very useful in sparse
matrix computations. The elimination tree T_A = (V, E_T) for an irreducible graph
G_A is a rooted tree defined by a parent function as follows: for each vertex v_j,
1 ≤ j ≤ n − 1, the parent of v_j is v_i, where the first off-diagonal nonzero entry in
column j of L occurs in row i > j. If G_A is reducible, one obtains a forest rather than
a tree. A topological ordering of T_A is any ordering of the vertices that numbers each
parent with a label larger than that of any of its children. The order in which the
unknowns are eliminated, for example, is a topological ordering of the tree T_A, and, in
fact, any topological ordering of the tree is a PEO of G_F. Elimination trees evidently
were introduced by Schreiber [42], though they had earlier been used implicitly in a
number of algorithms and applications. Liu [30] has provided a survey of the many
uses of elimination trees in sparse matrix computations.
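Given only the nonzero structure of L, the parent function just described is immediate (a Python sketch; the input encoding is ours, and in practice the elimination tree is computed without forming L, as discussed in Liu's survey [30]):

```python
def elimination_tree_parents(struct):
    """struct[j] is the set of row indices i > j for which L[i][j] != 0
    (0-based columns).  The parent of column j is the smallest such i;
    a column with no off-diagonal nonzeros (e.g., the root) gets None."""
    return [min(rows) if rows else None for rows in struct]
```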
Liu has also discovered an interesting connection between clique trees and elim-
ination trees. To facilitate our discussion of this connection we need to introduce
the following concepts and results. If ℱ is a finite family of nonempty sets, then the
intersection graph of ℱ is obtained by representing each set in ℱ by a vertex and
connecting two vertices by an edge if and only if the intersection of the corresponding
sets is nonempty. A subtree graph is an intersection graph where ℱ is a family of
subtrees of a specific tree. Buneman [5], Gavril [12], and Walter [47] independently
discovered that the set of chordal graphs coincides with the set of subtree graphs in
a result that further extends Theorem 3.4.
Theorem 3.3 provides an obvious way to represent a chordal graph G := G_F as a
subtree graph. Choose any clique tree T_ct ∈ T_G, and consider the family of subtrees
of T_ct given by

    ℱ = { K_G(v) | v ∈ V }.

Since two vertices are adjacent to one another in G if and only if there exists a
clique K ∈ K_G to which both vertices belong, it follows that for each pair of vertices
u, v ∈ V, we have (u, v) ∈ E if and only if the subtree induced by K_G(u) intersects
the subtree induced by K_G(v). In consequence, G is a subtree graph for the family
of subtrees ℱ in any clique tree T_ct ∈ T_G.
Liu has shown how elimination trees provide another way to view chordal graphs
as subtree graphs. Let the row vertex set, denoted Struct(L_{j*}), be defined by
5.3. Equivalent orderings. The fill added to G_A contains precisely the edges
needed to make the order in which the unknowns of the linear system are eliminated
a PEO of the filled graph G_F [40]. Usually, the primary objective in reordering the
linear system is to reduce the storage (i.e., fill) and work required by the factoriza-
tion. Every PEO of G_F results in precisely the same factorization storage and work
requirement [28]. It is common practice in this setting to define all perfect elimination
orderings of G_F as equivalent orderings.
Before advanced machine architectures entered the marketplace, there was lit-
tle reason to consider choosing one PEO of G_F over another. Generally, whatever
ordering was produced by the fill-reducing ordering algorithm (e.g., nested dissec-
tion [14, 15] or minimum degree [17, 25]) was accepted without modification. But
this situation has changed to some extent with the advent of vector supercomputers,
powerful RISC-based workstations, and a wide variety of parallel architectures. Al-
gorithms designed for such machines may benefit by choosing one PEO of G_F over
the others in order to optimize some secondary objective function. (There is still
the underlying assumption that a good fill-reducing ordering is desired, though this
assumption is subject to question more than it once was and deserves further study.)
The following summarizes a few algorithms designed to produce an equivalent order-
ing that optimizes some secondary objective function.
Reordering for stack storage reduction. One of the first algorithms of this type
was a simple algorithm due to Liu [27] for finding, among all topological orderings
of the elimination tree, an ordering that minimizes the auxiliary storage required by
the multifrontal factorization algorithm. In addition, Liu [28] gives a heuristic for
finding an equivalent ordering that further reduces auxiliary storage for multifrontal
factorization. Finding an optimal equivalent ordering for this problem is still an open
question.
Jess and Kees reordering. Short elimination trees can be useful when the fac-
torization is to be performed in parallel. Jess and Kees [22] introduced a sim-
ple greedy heuristic for finding an equivalent ordering that reduces elimination tree
height. Liu [29] has shown that the Jess and Kees ordering scheme minimizes elimi-
nation tree height among all equivalent orderings. Liu and Mirzaian [32] introduced
an O(n + |E_F|) implementation of the Jess and Kees scheme. Lewis, Peyton, and
Pothen [24] used a clique tree of G_F to obtain an O(n + q)-time implementation of
the Jess and Kees algorithm, where q = Σ_{i=1}^{m} |K_i|, which in practice is substantially
smaller than |E_F|. Because a PEO of G_F is known a priori, a clique tree of G_F can
be obtained in O(n) time using output from the symbolic factorization step of the
solution process [24].
A block Jess and Kees reordering. Blair and Peyton [4] have studied a block
form of the Jess and Kees algorithm that generates a clique tree T ∈ T_G of minimum
diameter. The primary motivation for this algorithm is to minimize the number
of expensive communication calls to the general router on a fine-grained parallel
machine [19]. The time complexity of their algorithm is also O(n + q) in the sparse
matrix setting, where a PEO is known a priori. A similar algorithm motivated by
the same application was given by Gilbert and Schreiber [19].
5.4. Clique trees and the multifrontal method. Block algorithms have be-
come increasingly important on advanced machine architectures, both in dense and
sparse matrix computations [11]. The multifrontal factorization algorithm [7, 31]
is perhaps the canonical example in sparse matrix computation. That clique trees,
which represent chordal graphs in block form, might be a useful tool in explaining
the multifrontal method is not at all surprising.
Clique trees provide the framework for presenting the multifrontal algorithm in
Peyton, Pothen, and Sun [34, 38]. The clique tree is rooted and ordered by a pos-
tordering of the tree, and each clique K has associated with it a frontal matrix F(K).
Let K and P be respectively a clique and its parent in the clique tree. The columns
of F(K) are partitioned into two sets: the factor columns of F(K) correspond to the
vertices in K \ P, and the update columns of F(K) correspond to the vertices in
K n P. For further details consult the two references given above.
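The column partition just described reduces to two set operations. A hypothetical helper (K and P are plain vertex sets; the references above give the full method):

```python
def partition_frontal_columns(K, P):
    """Split the columns of the frontal matrix F(K) for clique K with
    parent clique P: factor columns are eliminated at K, update columns
    are passed up into the parent's frontal matrix."""
    factor = sorted(K - P)   # vertices in K \ P
    update = sorted(K & P)   # vertices in K n P
    return factor, update
```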
Due to its simplicity, the supernodal elimination tree is more commonly used in
descriptions of the multifrontal algorithm. Liu's survey article [31], for example, uses
the supernodal elimination tree to describe the block version of the algorithm.
REFERENCES
[1] C. BEERI, R. FAGIN, D. MAIER, AND M. YANNAKAKIS, On the desirability of acyclic database
systems, J. Assoc. Comput. Mach., 30 (1983), pp. 479-513.
[2] P. A. BERNSTEIN AND N. GOODMAN, Power of natural semijoins, SIAM J. Comput., 10
(1981), pp. 751-771.
[3] J. BLAIR, R. ENGLAND, AND M. THOMASON, Cliques and their separators in triangulated
graphs, Tech. Rep. CS-78-88, Department of Computer Science, The University of Ten-
nessee, Knoxville, Tennessee, 1988.
[4] J. BLAIR AND B. PEYTON, On finding minimum-diameter clique trees, Tech. Rep. ORNL/TM-
11850, Oak Ridge National Laboratory, Oak Ridge, TN, 1991.
[5] P. BUNEMAN, A characterization of rigid circuit graphs, Discrete Math., 9 (1974), pp. 205-212.
[6] G. A. DIRAC, On rigid circuit graphs, Abh. Math. Sem. Univ. Hamburg, 25 (1961), pp. 71-76.
[7] I. DUFF AND J. REID, The multifrontal solution of indefinite sparse symmetric linear equations,
ACM Trans. Math. Software, 9 (1983), pp. 302-325.
[8] I. S. DUFF AND J. K. REID, A note on the work involved in no-fill sparse matrix factorization,
IMA J. Numer. Anal., 3 (1983), pp. 37-40.
[9] P. EDELMAN AND R. JAMISON, The theory of convex geometries, Geometriae Dedicata, 19
(1985), pp. 247-270.
[10] D. FULKERSON AND O. GROSS, Incidence matrices and interval graphs, Pacific J. Math., 15
(1965), pp. 835-855.
[11] K. GALLIVAN, M. HEATH, E. NG, J. ORTEGA, B. PEYTON, R. PLEMMONS, C. ROMINE,
A. SAMEH, AND R. VOIGT, Parallel Algorithms for Matrix Computations, SIAM, Philadel-
phia, 1990.
[12] F. GAVRIL, The intersection graphs of subtrees in trees are exactly the chordal graphs, J.
Combin. Theory Ser. B, 16 (1974), pp. 47-56.
[13] ——, Generating the maximum spanning trees of a weighted graph, J. Algorithms, 8 (1987),
pp. 592-597.
[14] A. GEORGE, Nested dissection of a regular finite element mesh, SIAM J. Numer. Anal., 10
(1973), pp. 345-363.
[15] A. GEORGE AND J.-H. LIU, An automatic nested dissection algorithm for irregular finite
element problems, SIAM J. Numer. Anal., 15 (1978), pp. 1053-1069.
[16] ——, Computer Solution of Large Sparse Positive Definite Systems, Prentice-Hall Inc., En-
glewood Cliffs, New Jersey, 1981.
[17] ——, The evolution of the minimum degree ordering algorithm, SIAM Review, 31 (1989),
pp. 1-19.
[18] A. GIBBONS, Algorithmic Graph Theory, Cambridge University Press, Cambridge, 1985.
[19] J. GILBERT AND R. SCHREIBER, Highly parallel sparse Cholesky factorization, SIAM J. Sci.
Stat. Comput., 13 (1992), pp. 1151-1172.
[20] M. GOLUMBIC, Algorithmic Graph Theory and Perfect Graphs, Academic Press, New York,
1980.
[21] C.-W. HO AND R. C. T. LEE, Counting clique trees and computing perfect elimination
schemes in parallel, Inform. Process. Lett., 31 (1989), pp. 61-68.
[22] J. JESS AND H. KEES, A data structure for parallel L/U decomposition, IEEE Trans. Comput.,
C-31 (1982), pp. 231-239.
[23] E. KIRSCH, Practical parallel algorithms for chordal graphs, Master's thesis, Dept. of Computer
Science, The University of Tennessee, 1989.
[24] J. LEWIS, B. PEYTON, AND A. POTHEN, A fast algorithm for reordering sparse matrices for
parallel factorization, SIAM J. Sci. Stat. Comput., 10 (1989), pp. 1156-1173.
[25] J.-H. LIU, Modification of the minimum degree algorithm by multiple elimination, ACM Trans.
Math. Software, 11 (1985), pp. 141-153.
[26] ——, A compact row storage scheme for Cholesky factors using elimination trees, ACM Trans.
Math. Software, 12 (1986), pp. 127-148.
[27] ——, On the storage requirement in the out-of-core multifrontal method for sparse factoriza-
tion, ACM Trans. Math. Software, 12 (1986), pp. 249-264.
[28] ——, Equivalent sparse matrix reordering by elimination tree rotations, SIAM J. Sci. Stat.
Comput., 9 (1988), pp. 424-444.
[29] ——, Reordering sparse matrices for parallel elimination, Parallel Computing, 11 (1989),
pp. 73-91.
[30] ——, The role of elimination trees in sparse factorization, SIAM J. Matrix Anal. Appl., 11
(1990), pp. 134-172.
[31] ——, The multifrontal method for sparse matrix solution: theory and practice, SIAM Review,
34 (1992), pp. 82-109.
[32] J. W.-H. LIU AND A. MIRZAIAN, A linear reordering algorithm for parallel pivoting of chordal
graphs, SIAM J. Disc. Math., 2 (1989), pp. 100-107.
[33] M. LUNDQUIST, Zero patterns, chordal graphs and matrix completions, PhD thesis, Dept. of
Mathematical Sciences, Clemson University, 1990.
[34] B. PEYTON, Some applications of clique trees to the solution of sparse linear systems, PhD
thesis, Dept. of Mathematical Sciences, Clemson University, 1986.
[35] B. PEYTON, A. POTHEN, AND X. YUAN, A clique tree algorithm for partitioning chordal
graphs for parallel sparse triangular solution. In preparation.
[36] ——, Partitioning a chordal graph into transitive subgraphs for parallel sparse triangular so-
lution. In preparation.
[37] A. POTHEN AND F. ALVARADO, A fast reordering algorithm for parallel sparse triangular
solution, SIAM J. Sci. Stat. Comput., 13 (1992), pp. 645-653.
[38] A. POTHEN AND C. SUN, A distributed multifrontal algorithm using clique trees, Tech. Rep.
CS-91-24, Department of Computer Science, The Pennsylvania State University, University
Park, PA, 1991.
[39] R. PRIM, Shortest connection networks and some generalizations, Bell System Technical Jour-
nal, (1957), pp. 1389-1401.
[40] D. ROSE, A graph-theoretic study of the numerical solution of sparse positive definite systems
of linear equations, in Graph Theory and Computing, R. C. Read, ed., Academic Press,
1972, pp. 183-217.
[41] D. ROSE, R. TARJAN, AND G. LUEKER, Algorithmic aspects of vertex elimination on graphs,
SIAM J. Comput., 5 (1976), pp. 266-283.
[42] R. SCHREIBER, A new implementation of sparse Gaussian elimination, ACM Trans. Math.
Software, 8 (1982), pp. 256-276.
[43] D. SHIER, Some aspects of perfect elimination orderings in chordal graphs, Discr. Appl. Math.,
7 (1984), pp. 325-331.
[44] R. TARJAN, Maximum cardinality search and chordal graphs. Unpublished Lecture Notes
CS 259, 1976.
[45] ——, Data Structures and Network Algorithms, SIAM, Philadelphia, 1983.
[46] R. TARJAN AND M. YANNAKAKIS, Simple linear-time algorithms to test chordality of graphs,
test acyclicity of hypergraphs, and selectively reduce acyclic hypergraphs, SIAM J. Comput.,
13 (1984), pp. 566-579.
[47] J. WALTER, Representations of rigid cycle graphs, PhD thesis, Wayne State University, 1972.
[48] M. YANNAKAKIS, Computing the minimum fill-in is NP-complete, SIAM J. Alg. Disc. Meth.,
2 (1981), pp. 77-79.
CUTTING DOWN ON FILL USING NESTED DISSECTION:
PROVABLY GOOD ELIMINATION ORDERINGS*
Abstract. In the last two decades, many heuristics have been developed for finding good
elimination orderings for sparse Cholesky factorization. These heuristics aim to find elimination
orderings with either low fill, low operation count, or low elimination height. Though many heuristics
seem to perform well in practice, there has been a marked absence of much theoretical analysis to
back these heuristics. Indeed, few heuristics are known to provide any guarantee on the quality of
the elimination ordering produced for arbitrary matrices.
In this work, we present the first polynomial-time ordering algorithm that guarantees approxi-
mately optimal fill. Our algorithm is a variant of the well-known nested dissection algorithm. Our
ordering performs particularly well when the number of elements in each row (and hence each col-
umn) of the coefficient matrix is small. Fortunately, many problems in practice, especially those
arising from finite-element methods, have such a property due to the physical constraints of the
problems being modeled.
Our ordering heuristic guarantees not only low fill, but also approximately optimal operation
count, and approximately optimal elimination height. Elimination orderings with small height and
low fill are of much interest when performing factorization on parallel machines. No previous ordering
heuristic guaranteed even small elimination height.
We will describe our ordering algorithm and prove its performance bounds. We shall also present
some experimental results comparing the quality of the orderings produced by our heuristic to those
produced by two other well-known heuristics.
* Some of the work reported in this paper first appeared in an extended abstract in the Pro-
ceedings of the 31st Annual IEEE Conference on the Foundations of Computer Science, 1990 [33].
† Digital Equipment Corp., Massively Parallel Systems Group, 146 Main Street, Maynard, MA
01574.
‡ Brown University, Providence, RI 02912. Research supported by NSF grant CCR-9012357 and
an NSF PYI award, together with PYI matching funds from Thinking Machines Corporation and
Xerox Corporation. Additional support provided by ONR and DARPA contract N00014-83-K-0146
and ARPA Order No. 6320, Amendment 1.
fill-in. Different orders of eliminating the variables may yield very different fill-in.
It is thus of prime importance to be able to choose an elimination ordering of the
variables that results in small fill-in. The choice of an elimination ordering that
results in small fill-in often conflicts with the requirement of an ordering that ensures
numerical stability of the solution process. Fortunately, many of the systems of
equations that arise in practice are positive definite, in which case numerical stability
is not a problem [17]. In solving such linear equations, we are free to choose an ordering
of variables entirely based on our desire to preserve the sparsity of the matrix during
the elimination process.
A matrix A is called symmetric if A_ij equals A_ji for every i, j. Positive defi-
nite matrices that are also symmetric frequently arise in structural analysis, signal
processing, economics, VLSI simulation, solution of linear programs, and solution of
partial differential equations, to name a few. In this work, we study the problem of
finding a good elimination ordering for such matrices.
Henceforth, when we refer to a linear system of equations Ax = b, we assume that
A is a symmetric positive definite matrix.
Minimizing fill. We shall define the fill for an ordering as the sum of the number
of non-zero elements in the matrix, and the fill-in introduced by the ordering. The
fill for an ordering measures the amount of storage required, and also has bearing on
the total time required for the elimination process. It is thus of interest to find an
elimination ordering that minimizes fill. For the purposes of analyzing fill, we shall
count each pair of symmetric elements only once.
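This definition can be evaluated directly by playing the elimination game on the graph of the matrix: eliminating a vertex connects all of its remaining neighbors. A straightforward (quadratic, not production-grade) sketch, with a hypothetical adjacency-set representation:

```python
def fill(adj, order):
    """Edges of A plus fill-in created by eliminating vertices in `order`;
    each symmetric pair is counted once.  adj: vertex -> set of neighbors."""
    g = {v: set(nbrs) for v, nbrs in adj.items()}   # working copy of the graph
    total = sum(len(n) for n in g.values()) // 2    # non-zeros of A, as pairs
    for v in order:
        nbrs = list(g[v])
        for i, a in enumerate(nbrs):                # make v's neighbors a clique
            for b in nbrs[i + 1:]:
                if b not in g[a]:
                    g[a].add(b)
                    g[b].add(a)
                    total += 1                      # one new fill edge
        for a in nbrs:                              # delete v from the graph
            g[a].discard(v)
        del g[v]
    return total
```

On a 4-cycle, for instance, eliminating any vertex first adds one chord, so every ordering has fill 5; on a path eliminated from one end, no fill is created at all.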
Finding such an ordering is NP-complete [58] and hence is unlikely to have a
polynomial-time solution. Nevertheless, this problem is of fundamental importance,
and a large set of ordering heuristics have been developed [3, 55, 14, 20, 49, 16, 15, 43,
18, 17, 9, 10]. However, none of them are known to give any performance guarantee
on the size of the fill for arbitrary symmetric matrices. In this work, we present
the first polynomial-time algorithm that guarantees approximately optimal fill. Our
algorithm performs particularly well when the number of elements in each row (and
hence each column) of the matrix is small. Fortunately, many problems in practice,
especially those arising from finite-element methods, have such a property due to the
physical constraints of the problems being modeled. As stated earlier, all our results
are for the class of symmetric positive definite systems of equations.
THEOREM 1.1. There is a polynomial-time algorithm that finds an elimination
ordering yielding approximately optimal fill. The fill for the ordering is within a
factor of O(√d log^4 n) of the optimum, where n is the number of variables and d is
the maximum number of non-zero entries in any row or column of the coefficient
matrix.
Our algorithm is a variant of the well-known nested dissection algorithm [14]. It
treats the input matrix as the adjacency matrix of a graph. The algorithm is based on
finding a recursive decomposition of the graph associated with the coefficient matrix.
The use of graphs in the study of elimination ordering is not new [51, 54]. There is an
obvious way of associating a graph with a given symmetric matrix; the variables of
the matrix associate with the nodes of the graph, and there is an edge between nodes
i and j iff the element (i, j) of the matrix is non-zero. The values of the non-zero
elements in the coefficient matrix are not relevant for the ordering problem in the
case of symmetric positive definite systems.
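A minimal sketch of this association, with the matrix given as a list of non-zero positions (a hypothetical representation; only the pattern matters, not the values):

```python
def matrix_to_graph(n, nonzeros):
    """Adjacency structure of the graph of an n x n symmetric matrix.
    nonzeros: iterable of (i, j) positions with A[i][j] != 0."""
    adj = {i: set() for i in range(n)}
    for i, j in nonzeros:
        if i != j:                 # diagonal entries contribute no edge
            adj[i].add(j)
            adj[j].add(i)          # symmetry: (i, j) and (j, i) are one edge
    return adj
```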
Parter [51] first showed how to interpret the elimination process as a graph-
theoretical process. Rose [55] characterized the class of graphs for which there is
an elimination order with no fill-in, and showed how to find such an ordering. George
[14] first proposed a nested dissection approach, designed specifically for grid graphs.
The algorithm was shown to be optimal (within constant factors) in terms of opera-
tion count for the case of a regular finite-element mesh [29]. George's approach was
first generalized to arbitrary graphs by George and Liu [16]. However, they used a
simple heuristic for graph separators and hence could not prove any bounds on its
performance. Lipton, Rose, and Tarjan [41] proved a performance bound for their
version of the nested dissection algorithm for graphs with small (O(√n) in size) sepa-
rators. Gilbert and Tarjan [26] later showed that by using small separators, George
and Liu's nested dissection algorithm also gives close to optimal fill for planar graphs,
but does not generalize to the class of graphs with O(√n) size separators. A good
overview of these algorithms can be found in the books of George and Liu [17] and
Duff and Reid [9].
Node separators are fundamental to the nested dissection algorithm. A node
separator consists of a set of nodes whose removal breaks the graph up into pieces.
For our purposes, every subset of the nodes is a separator. A separator is f-balanced,
for some f < 1, if no piece remaining after its removal has more than fn nodes, where
n is the number of nodes in the graph.
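The definition can be checked mechanically: delete the separator, then measure each connected component. A minimal sketch (hypothetical helper, using breadth-first search on an adjacency-dict graph):

```python
from collections import deque

def is_balanced_separator(adj, sep, f):
    """True if removing `sep` leaves no connected component of
    more than f * n vertices, where n = |V|."""
    n = len(adj)
    seen = set(sep)
    for s in adj:
        if s in seen:
            continue
        size, q = 0, deque([s])        # BFS one component of G - sep
        seen.add(s)
        while q:
            u = q.popleft()
            size += 1
            for w in adj[u]:
                if w not in seen:
                    seen.add(w)
                    q.append(w)
        if size > f * n:
            return False
    return True
```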
The fill resulting from applying the approach of Lipton, Rose, and Tarjan depends
upon the size of the separators for the class of graphs being dealt with. For example,
their approach yields O(n log n) fill for any planar graph, based on the fact that planar
graphs have 2/3-balanced separators of size O(√n).
There are three main differences between our work and the work of Lipton, Rose,
and Tarjan. One, we do not assume the existence of any special separator structure
in the graphs. Two, our analysis is more strict; we are able to analyze the quality
of our result with respect to the minimum fill achievable over all orderings. Three,
our variation of nested dissection is similar to that of George and Liu, and somewhat
simpler than that of Lipton, Rose, and Tarjan.
Gilbert [22] showed that for a matrix with at most d non-zero elements in each row
(and column), there exists a nested dissection algorithm whose fill is within O(d log n)
of minimum. His nested dissection algorithm, however, is inherently non-constructive;
the choice of separators in his algorithm depends crucially on the optimally filled
matrix.
Our nested dissection algorithm also uses balanced node separators. No polynomial-
time algorithms are known for finding a minimum-size balanced node separator in a
graph. However, we show that choosing near-optimal balanced node separators is
sufficient to achieve near-optimal fill. All our proofs are independent of the method
by which the near-optimal separators are found. To our knowledge, Leighton and Rao
[36, 37] have provided the first and the only polynomial-time algorithm to find an
approximately balanced separator in an arbitrary graph. We hence use their method
Note that the performance guarantee for fill from the above theorem is never worse
than O(m^{1/4} log^{3.5} n), since the size of the minimum fill F* must be at least as much as
the number of edges m in the graph.
THEOREM 1.4. The elimination ordering produced by our algorithm has height
within a factor of O(log^2 n) of optimal.
Note that our algorithm itself is not parallel, but is to be used to generate an
ordering with small height. Such an approach is suitable for problems where the
sparsity structure of the matrix is fixed, and the linear system has to be solved for
many different coefficient values and/or right-hand sides. A good elimination ordering
can hence be found sequentially as a preprocessing step. Our work is thus different
TABLE 1
Performance guarantees for the elimination ordering produced by our algorithm. In the above table,
n, m and d respectively denote the number of nodes, edges, and the maximum degree of the graph
associated with the coefficient matrix.
from other work addressing the issue of generating the ordering itself in parallel
[43, 52, 39, 27, 25, 5].
Along with minimizing the height, it is desirable to keep both the fill and the
operation count small, since they determine the total space and the total work done
by the algorithm. Our algorithm is the first one known that approximately minimizes
all three quantities simultaneously: fill, operation count, and height. By putting
Theorems 1.1, 1.3 and 1.4 together, we get the following result.
THEOREM 1.5. The elimination ordering produced by our algorithm simultane-
ously minimizes height to within a O(log^2 n) factor, fill to within a O(√d log^4 n) factor,
and the operation count to within a O(d log^6 n) factor of the respective optimum quan-
tities. In the guarantee, d denotes the maximum number of non-zero elements in any
row or column of the n x n coefficient matrix.
Bodlaender et al. [2] have independently presented essentially the same algorithm
as ours for finding an elimination ordering of approximately minimum height. How-
ever, they do not analyze the fill and operation count for their ordering.
The performance guarantees for the elimination ordering obtained by our algo-
rithm are given in Table 1.
Gilbert [22] has conjectured that there is an ordering that simultaneously min-
imizes height and approximately minimizes fill (to within a constant factor). Our
analysis here represents progress towards proving this conjecture. Our algorithm is a
polynomial-time algorithm since we utilize near-optimal node separators in our nested
dissection algorithm. By utilizing a minimum node separator (non-polynomial) algo-
rithm, we can prove the following:
THEOREM 1.6. There exists a nested dissection ordering that simultaneously
minimizes height to within a O(log n) factor, fill to within a O(√d log^2 n) factor, and
operation count to within a O(d log^4 n) factor of the respective optimum quantities.
In the guarantee, d denotes the maximum number of non-zero elements in any row
or column of the n x n coefficient matrix.
chordal graph containing the graph associated with the input matrix. A chordal
graph is a graph in which every cycle of length at least four has a chord, i.e. there
is an edge between two non-consecutive nodes in the cycle. Chordal graphs are also
sometimes referred to as triangulated graphs.
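Chordality can be tested without enumerating cycles: maximum cardinality search produces a candidate perfect elimination ordering, which is then verified, in the style of the linear-time test of Tarjan and Yannakakis. The sketch below is a simplified, non-linear-time version of that idea:

```python
def is_chordal(adj):
    """Chordality test: maximum cardinality search (MCS) yields a
    perfect elimination ordering iff the graph is chordal; verify it."""
    # --- MCS: repeatedly pick the vertex with the most numbered neighbors ---
    weight = {v: 0 for v in adj}
    order = []
    while weight:
        v = max(weight, key=weight.get)
        order.append(v)
        del weight[v]
        for w in adj[v]:
            if w in weight:
                weight[w] += 1
    order.reverse()                          # MCS order reversed = elimination order
    pos = {v: i for i, v in enumerate(order)}
    # --- verify: the higher neighbors of v, minus v's lowest higher
    #     neighbor p, must all be adjacent to p (standard PEO check) ---
    for v in order:
        higher = [w for w in adj[v] if pos[w] > pos[v]]
        if not higher:
            continue
        p = min(higher, key=pos.get)
        if any(w != p and w not in adj[p] for w in higher):
            return False
    return True
```

A 4-cycle fails the check (its chordless cycle is detected), while adding any chord makes it pass.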
We exploit the characterization of the elimination process given by Rose. We
prove that our nested dissection ordering yields a chordal graph of small size with
respect to the optimal.
4. Our algorithm
Graph separators. Before we give the ordering algorithm, we need some back-
ground on an essential ingredient of our algorithm, namely balanced node separators.
Recall that a set of nodes X in a graph G = (V, E) is called an f-balanced node
separator for some fraction f < 1, if no connected component of G - X is of size more
than the fraction f of |V|. No polynomial-time algorithms are known for finding an f-
balanced node separator of minimum size for a non-trivial constant f. However, using
the technique of Leighton and Rao [36], one can find an approximately minimum-sized
balanced node separator. This was also shown by Makedon and Tragoudas [50].
LEMMA 4.1 ([36, 50]). There exists a polynomial-time algorithm to find a 2/3-
balanced node separator in a graph of size within an O(log n) factor of the optimal
2/3-balanced node separator.
5. Separator tree. In this section, we will establish a lower bound on the size
of the optimum chordal extension of a graph in terms of the separator sizes found
by the nested dissection algorithm. In our nested dissection ordering, we employ an
approximation algorithm for finding balanced node separators. However, for ease of
exposition below, we shall assume that we have a separator algorithm that finds the
best balanced node separator in a graph. We shall later forego this assumption.
Consider the following tree, called the separator tree, representing the nested dis-
section process on a graph; the separator vertices form the root of the tree, and the
trees of each of the pieces are built recursively. To distinguish from the nodes of the
tree, we shall refer to the nodes of the graph as vertices. A tree node hence stands
for a separator that may consist of several vertices. Our algorithm thus defines an
elimination ordering of the vertices of the original graph that is consistent with a
postorder traversal of the nodes of the separator tree. However, the algorithm orders
the vertices within a tree node arbitrarily.
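The construction above can be sketched compactly: recursively split off a separator, number the pieces first, then the separator, which is exactly a postorder numbering of the separator tree. The `find_separator` argument below is a hypothetical stand-in for the near-optimal balanced-separator routine of Lemma 4.1:

```python
from collections import deque

def nested_dissection_order(adj, find_separator):
    """Elimination order consistent with a postorder traversal of the
    separator tree.  `find_separator(sub_adj)` returns a vertex set;
    it stands in for a near-optimal balanced-separator algorithm."""
    order = []

    def components(g):
        """Connected components of the adjacency-dict graph g."""
        seen = set()
        for s in g:
            if s in seen:
                continue
            comp, q = set(), deque([s])
            seen.add(s)
            while q:
                u = q.popleft()
                comp.add(u)
                for w in g[u]:
                    if w not in seen:
                        seen.add(w)
                        q.append(w)
            yield comp

    def recurse(verts):
        if len(verts) <= 1:
            order.extend(verts)
            return
        sub = {v: adj[v] & verts for v in verts}
        # guard so the recursion always shrinks, even on a degenerate separator
        sep = set(find_separator(sub)) or {next(iter(verts))}
        rest = {v: sub[v] - sep for v in verts - sep}
        for comp in components(rest):
            recurse(comp)          # number each piece first ...
        order.extend(sep)          # ... then its separator: postorder
    recurse(set(adj))
    return order
```

On a path, with a stand-in separator that picks the median-labeled vertex, the middle vertex is numbered last, as expected of the root of the separator tree.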
We shall derive a lower bound on the size of the optimal chordal extension G* in
terms of the sizes of the separators at any level of the separator tree. By a level of the
tree we mean all the nodes in the tree that are at the same distance from the root.
Let X_1, ..., X_p be the tree nodes at some level of the separator tree. Since V_{X_1}, ..., V_{X_p}
are disjoint, it follows that the graphs G*_{X_1}, ..., G*_{X_p} induced by them in G* are also
disjoint. Thus we have

(1)    Σ_{i=1}^{p} |G*_{X_i}| ≤ |G*|.
THEOREM 5.1 ([24]). Every chordal graph has a 1/2-balanced clique separator, and
hence has a 1/2-balanced node separator of size at most √(2|E|), where |E| is the number
of edges in the chordal graph.
By Fact 5.1, each of the graphs G*_{X_i} is chordal. Hence by Theorem 5.1, we can
write

(2)    |X_i| ≤ √(2 |G*_{X_i}|).
LEMMA 5.2. The size of the optimal chordal extension is at least one-half the
largest sum of the squares of the sizes of the separators at any level of the nested
dissection separator tree.
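In symbols, with X_1, ..., X_p the separators at the level achieving the maximum, the disjointness bound and the clique-separator bound chain together as follows (a restatement, under the optimal-separator assumption made at the start of this section):

```latex
|X_i| \le \sqrt{2\,\bigl|G^*_{X_i}\bigr|}
\quad\Longrightarrow\quad
\sum_{i=1}^{p} |X_i|^2 \;\le\; 2\sum_{i=1}^{p} \bigl|G^*_{X_i}\bigr| \;\le\; 2\,|G^*|,
\qquad\text{i.e.}\qquad
|G^*| \;\ge\; \frac{1}{2}\sum_{i=1}^{p} |X_i|^2 .
```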
One of the main results of this work is to show that the nested dissection algorithm
in fact yields a chordal graph whose size is close to the lower bound given above (see
Section 6.3 for proof).
THEOREM 5.3. For any level l of the nested dissection separator tree, let S_l be
the sum of the squares of the sizes of the separators at this level. Then the size of the
optimal chordal extension of G is at least (1/2) max_l S_l, and at most O(√d log^4 n) · max_l S_l.
In employing an approximation algorithm for finding balanced node separators
that has a factor of f performance guarantee, we prove that the size of the chordal
graph thus obtained is no more than O(f^2) times the size of that obtained by using
the optimal balanced node separators. We employ a separator algorithm with an
O(log n)-factor performance guarantee, and obtain the following result (see Section
6.3 for proof).
THEOREM 5.4. There is a polynomial-time algorithm that generates a nearly
optimal chordal extension of an input graph. The size of the chordal graph is
O(min(|G*| √d log^4 n, |G*|^{3/4} √m log^{3.5} n)), where |G*| is the size of the optimal chordal
extension of an input graph of n nodes, m edges, and maximum degree d.
6.1. A lower bound. We shall first establish a lower bound on the number of
edges in the optimally filled graph G*. In Lemma 5.2, we showed a lower bound
for |G*| in terms of the sizes of the separators at any level of the separator tree.
However, we had assumed that we had an optimal separator algorithm. We now
relax that restriction, and derive a similar result using the nested dissection tree
built with the O(log n) separator approximation algorithm of Leighton and Rao (see
Lemma 4.1).
We first state the following simple observation.
PROPOSITION 6.1. Let X_1, ..., X_p be the separators at some level of the separator
tree. The vertex sets T_{X_1}, ..., T_{X_p} of the subtrees rooted at these separators are disjoint.
LEMMA 6.2. Let X_1, ..., X_p be the separators at any level of the nested dissection
separator tree. The size of the optimal chordal extension is Ω(Σ_{i=1}^{p} |X_i|^2 / log^2 n).
Proof. Let G*_i be the subgraph induced in G* by the vertices belonging to the
subtree rooted at X_i. By Theorem 5.1, G*_i has a 1/2-balanced separator of size at most
√(2|G*_i|). Let this separator be X̂_i. Then we have

    |X_i| = O(log n) |X̂_i|
[FIG. 1. The associated tree for a vertex v, rooted at the tree node containing v.]
Let us define the level of a node v in the tree as the distance of v from the root,
and denote it by level(v). By a level l in the tree, we refer to all the nodes at level l.
By the level of a vertex we shall refer to the level in the separator tree of the node it
belongs to. The depth of a tree refers to the maximum level of any node in the tree.
We claim that the depth of the tree is small.
LEMMA 6.5. The depth of the separator tree is at most O(log n).
Proof. On removing a balanced separator from a graph with n vertices, each of
the pieces has at most (2/3)n vertices. Hence the graph size decreases exponentially with
the increase in recursion depth of the nested dissection algorithm. The depth of the
separator tree is then at most log_{3/2} n. □
We shall now count the number of edges to a vertex v from any of the vertices
numbered smaller than v. For that, we define the notion of an associated tree for each
vertex. The associated tree for a vertex v belonging to a separator X is constructed
as follows. Let v_1, ..., v_k be the neighbors of v such that level(v_i) ≥ level(v), for
1 ≤ i ≤ k. Let X_i be the separator containing v_i. The associated tree for v is the
smallest subtree rooted at X containing each of the separators X_1, ..., X_k (see Figure
1).
Lemma 6.4 implies that for every edge (w, v) ∈ G_α where α(v) > α(w), w must
belong to the associated tree of v. Thus the total number of edges to v from vertices
    In Liu's terminology [47], the associated tree for a vertex is exactly the part of the separator
tree that contains its "row subtree".
numbered lower than v in the ordering is at most the number of vertices belonging
to all the separators in the associated tree of v. We shall refer to this number for v
as the cost of v. Thus the total number of edges in our chordal extension is at most
the sum of the costs of the vertices.
THEOREM 6.6. The total number of edges in the chordal extension obtained by
our nested dissection ordering is at most O(√d log^4 n) times optimal, where d is
the maximum degree of the graph.
Proof. Let us estimate the sum of the costs of all vertices at a given level l_1 in the
tree. Let this level consist of separators X_1, ..., X_p. For i = 1, ..., p, consider the
highest-cost vertex of X_i, and let A_i be the associated subtree for this vertex. For
each level l ≥ l_1, let W_l(A_i) be the number of vertices in A_i at level l. Then the sum
of the costs of vertices at level l_1 is no more than the sum, over all levels l greater
than l_1, of the value

(6)    Σ_{i=1}^{p} |X_i| · W_l(A_i).
Let A_i have q_i separators X_{i,1}, ..., X_{i,q_i} at level l. Since each vertex has a maximum
degree of d, it follows that the associated tree of a vertex has at most d leaves. This
implies that each level of the tree has at most d nodes, and hence q_i is at most d.
Substituting into (6), we get

(7)    Σ_{i=1}^{p} |X_i| Σ_{j=1}^{q_i} |X_{i,j}|
         ≤ √(Σ_{i=1}^{p} |X_i|^2) · √(Σ_{i=1}^{p} (Σ_{j=1}^{q_i} |X_{i,j}|)^2)
         ≤ √d · √(Σ_{i=1}^{p} |X_i|^2) · √(Σ_{i=1}^{p} Σ_{j=1}^{q_i} |X_{i,j}|^2),

where the first inequality follows from the Cauchy-Schwarz inequality, and the second
from the fact that q_i ≤ d. By Lemma 6.2 it follows that
    Σ_{i=1}^{p} |X_i|^2 = O(|G*| log^2 n),

and similarly

    Σ_{i=1}^{p} Σ_{j=1}^{q_i} |X_{i,j}|^2 = O(|G*| log^2 n).
Our elimination ordering hence yields a chordal graph which has only a polylog
factor more edges than the optimal if the maximum degree of the graph is at most
polylog in the number of nodes. This also proves that the fill for such graphs is
provably small. Moreover, many problems in practice, for example finite element
problems, have small degree, and thus for these problems our nested dissection ordering
is guaranteed to produce near-optimal fill.
6.4. An upper bound: Large degree graphs. While the performance bound
is polylog for small degree graphs, we cannot claim the same for unbounded degree
graphs. We can, however, claim a non-trivial performance bound which is no worse
than a factor of m^{1/4} log^{3.5} n times the optimal, where m is the number of edges in the
graph. We omit the proof for brevity. The details can be found elsewhere [1].
THEOREM 6.7. For an unbounded degree graph G with n vertices and m edges,
the total number of edges in G_α is O(|G*|^{3/4} √m log^{3.5} n).
7.2. A lower bound. Consider the case when a chordal graph has a clique of
size p. Then for any ordering of variables in the clique, the node numbered i within
the clique has an associated clique of size p - i, for every i from 1 to p. Thus the
total number of multiplications required to eliminate all the variables in this clique
is Σ_{i=1}^{p} (p - i)^2, which is Ω(p^3). By Theorem 5.1, since every chordal graph has a
1/2-balanced clique separator, the following lemma easily follows.
LEMMA 7.1. For any chordal graph G*, if p is the size of its clique separator, then
Ω(p^3) is a lower bound on the number of multiplications required for any elimination
ordering.
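The Ω(p^3) growth is easy to confirm against the closed form Σ_{i=1}^{p} (p - i)^2 = (p-1)p(2p-1)/6. A hypothetical helper following the cost model stated above:

```python
def clique_multiplications(p):
    """Multiplications to eliminate a clique of size p, where the
    i-th eliminated variable costs (p - i)^2 multiplications."""
    return sum((p - i) ** 2 for i in range(1, p + 1))

def closed_form(p):
    """(p-1)p(2p-1)/6, the closed form of the sum, which is Theta(p^3)."""
    return (p - 1) * p * (2 * p - 1) // 6
```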
Let M* be the least multiplication count for any elimination ordering of G. We
shall extend Lemma 7.1 a step further to relate M* to the sizes of the separators at
any level of the separator tree.
LEMMA 7.2. Let a given level in the separator tree obtained by our algorithm have
p separators X_1, ..., X_p. Then Ω(Σ_{i=1}^{p} |X_i|^3 / log^3 n) is a lower bound on M*.
Proof. Let T_i be the subtree rooted at X_i and G*_i be the subgraph induced by the
vertices of T_i in G*. Since G*_i is chordal by Fact 5.1, it has a clique separator. Since
our separator approximation has a guarantee of O(log n), the optimal clique separator
must have size Ω(|X_i| / log n). By Lemma 7.1, G*_i must then require Ω(|X_i|^3 / log^3 n) multiplica-
tions. Since the subgraphs G*_1, ..., G*_p are disjoint, it follows that any ordering in G*
FIG. 2. The only vertices z which can contribute to the edge (u, v) must belong both to the associated
tree of v and to the subtree rooted at u.
must require Ω(Σ_{i=1}^{p} |X_i|^3 / log^3 n) multiplications. □
7.3. An upper bound. We shall now derive an upper bound on the number
of multiplications required. Let M be the number of multiplications required for the
elimination ordering defined by the algorithm. M is given by the sum over all nodes v
of the number of edges in v's associated clique. Thus we can write M as Σ_v Σ_{e ∈ C_v} 1,
which is the same as Σ_e Σ_{v: C_v ∋ e} 1. The contribution of an edge to this sum is the
number of vertices containing the edge in their associated cliques. We shall refer to
this quantity as the contribution of the edge. M is hence the sum of the contributions
of the edges in G_α. We shall use this characterization along with Lemma 7.2 to relate
M to M*.
Proof. The contribution of an edge (u, v) is 1 for each vertex z such that C_z
contains the edge (u, v). Without loss of generality, let us assume that α(v) > α(u).
Since (u, v) ∈ G_α, by Lemma 6.4, u must belong to the associated tree of v. Since u,
v and z belong to the clique C_z, C_z must contain the edges (z, v) and (z, u). Since
α(z) < α(v), the presence of the edge (z, v) in C_z implies that z must also belong
to the associated tree of v (see Figure 2). Similarly, the fact that (z, u) is in C_z
implies that z must belong to the subtree rooted at u. Thus the only vertices that
can contribute to the edge (u, v) are those which belong to the associated tree of v
and also belong to the subtree rooted at u. Note that the latter implies that the level
of such a vertex is at least as high as that of u.
So let us consider three level s in the separator tree II, i2 , and i3 such that i3 ;:::
lz ;:::
il' Our aim is to count for each edge (u, v) between a vertex v in levelil and a
vertex u in level i2 , the total number of vertices in the level 13 that contain (u, v) in
their associated clique. Let this quantity be called M'. M' can be written as

(8)    M' = \sum_{v \in \text{level } l_1} \; \sum_{u \in \text{level } l_2,\, (u,v) \in G^*} \; \sum_{z \in \text{level } l_3,\, C_z \ni (u,v)} 1.

Let M_v = \sum_{u \in \text{level } l_2,\, (u,v) \in G^*} \sum_{z \in \text{level } l_3,\, C_z \ni (u,v)} 1
for a vertex v. Let X_1, ..., X_q be the separators at level l_1, and let v_i denote the
vertex v in X_i for which M_v is maximum. Then we can rewrite (8) as
(9)    M' = \sum_{i=1}^{q} \sum_{v \in X_i} M_v

(10)       \leq \sum_{i=1}^{q} \sum_{v \in X_i} M_{v_i}

(11)       = \sum_{i=1}^{q} |X_i| \, M_{v_i}.
Let us now estimate the value of M_{v_i}. Let A_{v_i} denote the associated tree of v_i. Let
the separators in A_{v_i} at level l_2 be X_{i1}, ..., X_{iq_i}. Each of the edges of v_i to level l_2
must have a vertex in A_{v_i} as its endpoint. Consider all the edges between v_i and the
vertices of the separator X_{ij}. There are a maximum of |X_{ij}| such edges. By the above
discussion, any vertex that has any of these edges in its associated clique must belong
to the subtree of A_{v_i} rooted at X_{ij}. All such vertices at level l_3 must then belong to
one of the separators in the subtree of A_{v_i} rooted at X_{ij}. Let the separators in A_{v_i} at
level l_3 be X_{ij1}, ..., X_{ijq_{ij}}. Then the maximum number of vertices whose associated
cliques can contain an edge between v_i and a vertex in X_{ij} is given by \sum_{k=1}^{q_{ij}} |X_{ijk}|,
and there can be at most |X_{ij}| such edges. Summing over all the separators in A_{v_i} at
level l_2, we get
M_{v_i} \leq \sum_{j=1}^{q_i} |X_{ij}| \left( \sum_{k=1}^{q_{ij}} |X_{ijk}| \right).
Note that each of the terms on the right hand side of (15) is a sum over the (disjoint)
separators at a single level, and hence we can apply Lemma 7.2. We get
As mentioned before, the total number of multiplications is the sum of M' over all
the possible choices of l_1, l_2, and l_3. There being O(\log^3 n) such possible choices, the
theorem follows. □
The theorem above shows that the performance guarantee of our nested dissection
algorithm is a polylog factor if the degree of the graph is small. As mentioned earlier,
low-degree graphs account for many of the matrices arising in practice.
Finding an ordering that minimizes height itself is NP-hard [53]. Hence we have
to be content with finding an ordering that approximately minimizes height. It turns
out that our nested dissection elimination ordering also approximately minimizes
height, and thus we obtain an algorithm that simultaneously gives low fill, low
multiplication count, and low height. Contrary to our performance bounds for the fill and
multiplication count, the guarantee for the height is independent of the degree of the
input graph, and is always within an O(\log^2 n) factor of the optimal. We prove this result in
this section.
Bodlaender et al. [2] have independently proposed an ordering scheme similar
to ours that achieves approximately minimum height. The problem of finding an
ordering with small height has been studied by many researchers in the past and an
excellent survey can be found in the article by Heath, Ng, and Peyton (in [10]).
However, we do not address that issue here. Our implementation of the algorithm
at present is sequential. We use the technique of Leighton and Rao [36] for finding
small balanced separators in a graph, and no efficient parallel implementations are
known for it. Some work has been done [32] on parallelizing the technique, but the
resulting method is still not competitive. We suspect that the algorithm of Leighton
and Rao cannot be parallelized efficiently. However, we hope that other techniques
for finding small graph separators will be developed, which will be more amenable to
parallel implementations. The issue of generating the elimination ordering itself in
parallel has been studied by other researchers [10]. However, none of the previously
proposed algorithms have yielded any performance guarantees.
8.1. A lower bound. From the discussion on the height of an elimination or-
dering, it follows that the height of any elimination ordering for a clique of size m is
m. That gives us the following simple lemma.
LEMMA 8.1. For any chordal graph G*, if m is the size of its clique separator,
then the height of any elimination ordering must be \Omega(m).
We can build on the above lemma to get the following result.
LEMMA 8.2. Let the largest separator in the separator tree obtained by our algo-
rithm for a graph G be X. Then any elimination ordering for G must have height
\Omega(|X| / \log n).
Proof. Let V_X be the set of vertices in the subtree of the separator X and G*
be the chordal graph with minimum height over all elimination orders. By Theorem
5.1, and the performance guarantee of our separator algorithm (see Theorem 4.1),
the graph induced by V_X in G* has a clique separator of size \Omega(|X| / \log n). This clique
size is a lower bound on the height of any elimination ordering by Lemma 8.1, and
hence the lemma follows. □
8.2. An upper bound. We shall now show that the height generated by our
nested dissection ordering is not too much more than the optimal height.
Consider the separator tree. Let X be the largest separator in the tree. Consider
all the separators at each level. One variable from each of the separators can be
eliminated simultaneously as there are no direct edges between the variables of dif-
ferent separators. Hence the number of parallel elimination steps for eliminating all
the variables at a level is no more than the size of the largest separator at the level.
This size is no more than |X| by assumption. Since the number of levels is O(log n),
the height of the ordering is at most O(|X| log n). By Lemma 8.2, the value of |X|
is at most O(log n) times the minimum height of any elimination ordering. It then
follows that the height of our ordering is at most O(log^2 n) times the minimum height
over all orderings. We have thus proved our claim of this performance guarantee in
Theorem 1.4.
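The level-by-level argument above can be stated as a one-line bound. In this sketch (our own illustration; the list representation of the separator tree is an assumption), level_sizes[i] lists the sizes of the separators at depth i of the separator tree; since separators at the same depth share no direct edges, one variable from each can be eliminated per parallel step, so the height is bounded by the sum over levels of the largest separator size at that level.

```python
def parallel_height(level_sizes):
    """Upper bound on elimination height for a separator-tree ordering:
    at each depth, eliminate one variable from every separator per step,
    so a depth costs at most its largest separator size."""
    return sum(max(sizes) for sizes in level_sizes if sizes)
```

With O(log n) levels and every separator of size at most |X|, this immediately gives the O(|X| log n) height bound used above.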
The second code is the nested dissection heuristic that is implemented in SPARSPAK
[19].
The minimum-degree heuristic is by far the most commonly used and acknowl-
edged as the most effective heuristic known for finding good elimination orderings. It
has a rich history. It originated from the work of Markowitz in 1957, has undergone
many enhancements over the last fifteen years, and has been incorporated in many
publicly available codes like MA28, YALESMP, and SPARSPAK. Many statistics re-
garding the performance of this heuristic and all the enhancements are also available
in the literature. George and Liu [18] present an excellent survey of the developments
and enhancements in the minimum-degree heuristic. They suggest that a minimum-
degree heuristic with certain enhancements [43] outperforms other variations of this
heuristic. We obtained the latest version of the code implementing this heuristic from
Joseph Liu in July 1991, and that is what we shall refer to as the minimum-degree
code for the purposes of the comparison. We also wanted to compare our nested dis-
section ordering against an already existing one. The SPARSPAK nested dissection
was an ideal choice because of its popularity.
We compared the fill, the total number of multiplications, and the height of our
ordering with those obtained by the other two codes for a variety of matrices. These
matrices were obtained from the Harwell-Boeing test set of sparse matrices [7, 6].
They are symmetric positive definite matrices that are derived from real applications
in industry. They have also been extensively used as a test suite by many re-
searchers [8, 45, 40, 46, 47]. Many of the matrices that we used came from structural
engineering and finite-element analysis problems.
We report our results below. We give the names of the matrices from the Harwell-
Boeing collection, and the actual values of the three quantities of interest for the
orderings. For the other two codes, we also compute the percentage difference in the
values of the three quantities as compared to the values for our ordering. The number
of non-zero elements in the original matrix is given in the table for reference.
Our fill is usually within ±11% of the minimum-degree ordering. The height
of our ordering is generally better than that of the minimum-degree ordering. The
latter, however, has better performance in terms of the number of multiplications.
Compared to the SPARSPAK nested dissection ordering, our ordering seems to fare
well in all three criteria.
Though our nested dissection algorithm seems to provide competitive results, its
practical use is limited due to the computationally intensive algorithm for finding the
TABLE 2
Comparison of fill: fill is the total number of elements in the matrix that were either non-zero or
became non-zero during the course of elimination
approximate separators. Our algorithm may run for hours while the minimum degree
heuristic algorithm or the SPARSPAK nested dissection algorithm might terminate
in minutes or even seconds.
10. Conclusions and open issues. Our study suggests some new directions
for further research and many open issues. We list them here.
• Improving the performance bounds for the ordering problems: The perfor-
mance guarantees for the fill and the operation counts for our nested dissec-
tion ordering depend upon the maximum degree of the graph associated with
the coefficient matrix. It is a challenging problem to find a polynomial-time
ordering algorithm whose performance guarantees are independent of the de-
gree of the input graph. A simpler problem might be to obtain an ordering
algorithm whose performance guarantees are proportional to the average de-
gree of the input graph. Such a result will be interesting even for the cases
where the graph has excluded minors.
• Experiments with variants of our nested dissection algorithm: While our
nested dissection algorithm seems to perform well in practice, we have not
yet experimented with variants of our algorithm. We think that further
experience with this algorithm might suggest practical enhancements to the
elimination orderings produced by the algorithm. We point out again that
the minimum-degree code against which we compare our heuristic has been
tuned and adjusted over many years.
• Finding in parallel an elimination ordering of small height: Our nested dissec-
tion ordering is a good ordering for solving sparse linear systems in parallel.
However, our algorithm for finding the elimination ordering itself is inherently
sequential at present. That is because no parallel approximation algorithms
are yet known for finding balanced separators in a graph. It is of interest to
find a parallel algorithm that produces an ordering that has provably small
height.
TABLE 3
Comparison of multiplication count
TABLE 4
Comparison of height
[Figure: nonzero patterns of a sample matrix before and after elimination; nz = 1777 originally; nz = 5502, nops = 104839, h = 48 for the ordering.]
REFERENCES
[1] A. Agrawal, "Network Design and Network Cut Dualities: Approximation Algorithms and
Applications," Ph.D. thesis, Technical Report CS-91-60, Brown University (1991).
[2] H. L. Bodlaender, J. R. Gilbert, H. Hafsteinsson and T. Kloks, "Approximating treewidth,
pathwidth, and minimum elimination tree height," Technical Report CSL-90-01, Xerox
Corporation, Palo Alto Research Center (1990).
[3] E. Cuthill, and J. McKee, "Reducing the bandwidth of sparse symmetric matrices," Proceedings
of the 24th National Conference of the ACM (1969), pp. 157-172.
[4] I. S. Duff, A. M. Erisman, and J. K. Reid, "On George's nested dissection method," SIAM
Journal on Numerical Analysis, vol. 13 (1976), pp. 686-695.
[5] I. Duff, N. Gould, M. Lescrenier, and J. K. Reid, "The multifrontal method in a parallel
environment," in Advances in Numerical Computation, M. Cox and S. Hammarling, eds.,
Oxford University Press (1990).
[6] I. Duff, R. Grimes, and J. G. Lewis, "Users' guide for the Harwell-Boeing sparse matrix col-
lection," Manuscript (1988).
[7] I. Duff, R. Grimes, and J. G. Lewis, "Sparse matrix test problems," ACM Transactions on
Mathematical Software, vol. 15 (1989), pp. 1-14.
[8] I. Duff, and J. K. Reid, "The multifrontal solution of indefinite sparse symmetric linear equa-
tions," ACM Transactions on Mathematical Software, vol. 9 (1983), pp. 302-325.
[9] I. Duff, and J. K. Reid, Direct Methods for Sparse Matrices, Oxford University Press (1986).
[10] K. A. Gallivan et al., Parallel Algorithms for Matrix Computations, SIAM (1990).
[11] M. R. Garey and D. S. Johnson, Computers and Intractability: A Guide to the Theory of NP-
completeness, W. H. Freeman, San Francisco (1979).
[12] George, J. A., "Computer implementation of a finite element method," Tech. Report STAN-
CS-208, Stanford University (1971).
[13] George, J. A., "Block elimination of finite element systems of equations," in Sparse Matrices
and Their Applications, D. J. Rose and R. A. Willoughby, eds., Plenum Press (1972).
[14] George, J. A., "Nested dissection of a regular finite element mesh," SIAM Journal on Numer-
ical Analysis 10 (1973), pp. 345-367.
[15] George, J. A., "An automatic one-way dissection algorithm for irregular finite-element prob-
lems," SIAM Journal on Numerical Analysis, vol. 17 (1980), pp. 740-751.
[16] George, J. A., and J. W. Liu, "An automatic nested dissection algorithm for irregular finite-
element problems," SIAM Journal on Numerical Analysis, vol. 15 (1978), pp. 1053-1069.
[17] George, J. A., and J. W. Liu, Computer Solution of Large Sparse Positive Definite Systems,
Prentice-Hall Inc. (1981).
[18] George, J. A., and J. W. Liu, "The evolution of the minimum degree ordering algorithm,"
SIAM Review, vol. 31 (1989), pp. 1-19.
[19] George, J. A., J. W. Liu, and E. G. Ng, "User's guide for SPARSPAK: Waterloo sparse linear
equations package," Tech. Rep. CS78-30 (revised), Dept. of Computer Science, Univ. of
Waterloo, Waterloo, Ontario, Canada (1980).
[20] N. E. Gibbs, W. G. Poole Jr., and P. K. Stockmeyer, "An algorithm for reducing the bandwidth
and profile of a sparse matrix," SIAM Journal on Numerical Analysis, vol. 13 (1976), pp.
236-250.
[21] J. R. Gilbert, "Some nested dissection order is nearly optimal," Information Processing
Letters 26 (1987/88), pp. 325-328.
[22] J. R. Gilbert, personal communication (1989).
[23] J. R. Gilbert and H. Hafsteinsson, "Approximating treewidth, minimum front size, and mini-
mum elimination tree height," manuscript, 1989.
[24] J. R. Gilbert, D. J. Rose and A. Edenbrandt, "A separator theorem for chordal graphs," SIAM
J. Alg. Disc. Meth. 5 (1984), pp. 306-313.
[25] J. R. Gilbert, and R. Schreiber, "Highly parallel sparse Cholesky factorization," Tech. Report
CSL-90-7, Xerox Palo Alto Research Center, 1990.
[26] J. R. Gilbert, and R. E. Tarjan, "The analysis of a nested dissection algorithm," Numerische
Mathematik, vol. 50 (1987), pp. 377-404.
[27] J. R. Gilbert, and E. Zmijewski, "A parallel graph partitioning algorithm for a message-passing
multiprocessor," International Journal of Parallel Programming, vol. 16 (1987), pp. 427-
449.
[28] M. C. Golumbic, Algorithmic Graph Theory and Perfect Graphs, Academic Press, New York
(1980).
[29] A. J. Hoffman, M. S. Martin, and D. J. Rose, "Complexity bounds for regular finite difference
and finite element grids," SIAM Journal on Numerical Analysis, vol. 10 (1973), pp. 364-
369.
[30] J. Jess, and H. Kees, "A data structure for parallel L/U decomposition," IEEE Transactions
on Computers, vol. 31 (1982), pp. 231-239.
[31] U. Kjærulff, "Triangulation of graphs - algorithms giving small total state space," R 90-
09, Institute for Electronic Systems, Department of Mathematics and Computer Science,
University of Aalborg (1990).
[32] P. N. Klein, "A parallel randomized approximation scheme for shortest paths," Technical
Report CS-91-56, Brown University (1991).
[33] P. N. Klein, A. Agrawal, R. Ravi and S. Rao, "Approximation through multicommodity flow,"
Proceedings of the 31st Annual IEEE Conference on Foundations of Computer Science,
(1990), pp. 726-737.
[34] P. N. Klein, and S. Kang, "Approximating concurrent flow with uniform demands and capac-
ities: an implementation," Technical Report CS-91-58, Brown University (1991).
[35] P. Klein, C. Stein and E. Tardos, "Leighton-Rao might be practical: faster approximation
algorithms for concurrent flow with uniform capacities," Proceedings of the 22nd ACM
Symposium on Theory of Computing (1990), pp. 310-321.
[36] F. T. Leighton and S. Rao, "An approximate max-flow min-cut theorem for uniform multicom-
modity flow problems with application to approximation algorithms," Proceedings of the
29th Annual IEEE Conference on Foundations of Computer Science (1988), pp. 422-431.
[37] F. T. Leighton, F. Makedon and S. Tragoudas, personal communication, 1990.
[38] C. Leiserson, and J. Lewis, "Orderings for parallel sparse symmetric factorization," in Parallel
Processing for Scientific Computing, G. Rodrigue, ed., Philadelphia, PA, 1987, SIAM, pp.
27-32.
[39] M. Leuze, "Independent set orderings for parallel matrix factorization by Gaussian elimina-
tion," Parallel Computing, vol. 10 (1989), pp. 177-191.
[40] J. Lewis, B. Peyton, and A. Pothen, "A fast algorithm for reordering sparse matrices for parallel
factorization," SIAM Journal on Scientific and Statistical Computing, vol. 10 (1989), pp.
1156-1173.
[41] R. J. Lipton, D. J. Rose and R. E. Tarjan, "Generalized nested dissection," SIAM Journal on
Numerical Analysis 16 (1979), pp. 346-358.
[42] R. J. Lipton and R. E. Tarjan, "Applications of a planar separator theorem," SIAM Journal on
Computing 9 (1980), pp. 615-627.
[43] J. W. Liu, "Modification of the minimum degree algorithm by multiple elimination," ACM
Transactions on Mathematical Software, vol. 11 (1985), pp. 141-153.
[44] J. W. Liu, "Reordering sparse matrices for parallel elimination," Parallel Computing, vol. 11
(1989), pp. 73-91.
[45] J. W. Liu, "The minimum degree ordering with constraints," SIAM Journal on Scientific and
Statistical Computing, vol. 10 (1989), pp. 1136-1145.
[46] J. W. Liu, "A graph partitioning algorithm by node separators," ACM Transactions on Math-
ematical Software, vol. 15 (1989), pp. 198-219.
[47] J. W. Liu, "The role of elimination trees in sparse factorization," SIAM Journal on Matrix
Analysis and Applications, vol. 11 (1990), pp. 134-172.
[48] J. W. Liu, and A. Mirzaian, "A linear reordering algorithm for parallel pivoting of chordal
graphs," SIAM Journal on Discrete Mathematics, vol. 2 (1989), pp. 100-107.
[49] J. W. Liu, and A. H. Sherman, "Comparative analysis of the Cuthill-McKee and the reverse
Cuthill-McKee ordering algorithms for sparse matrices," SIAM Journal on Numerical Anal-
ysis, vol. 13 (1976), pp. 198-213.
[50] F. Makedon, and S. Tragoudas, "Approximating the minimum net expansion: near optimal
solutions to circuit partitioning problems," Manuscript (1991).
[51] S. Parter, "The use of linear graphs in Gaussian elimination," SIAM Review, vol. 3 (1961), pp.
364-369.
[52] F. Peters, "Parallel pivoting algorithms for sparse symmetric matrices," Parallel Computing,
vol. 1 (1984), pp. 99-110.
[53] A. Pothen, "The complexity of optimal elimination trees," Tech. Report CS-88-13, Department
of Computer Science, The Pennsylvania State University, University Park, PA, 1988.
[54] D. J. Rose, "Triangulated graphs and the elimination process," Journal of Math. Anal. Appl.
32 (1970), pp. 597-609.
[55] D. J. Rose, "A graph-theoretic study of the numerical solution of sparse positive definite
systems of linear equations," in Graph Theory and Computing, R. C. Read, ed., Academic
Press (1972), pp. 183-217.
[56] D. J. Rose, R. E. Tarjan and G. S. Lueker, "Algorithmic aspects of vertex elimination on
graphs," SIAM J. Comp. 5 (1976), pp. 266-283.
[57] R. Schreiber, "A new implementation of sparse Gaussian elimination," ACM Trans. on Math-
ematical Software 8:3 (1982), pp. 256-276.
[58] M. Yannakakis, "Computing the minimum fill-in is NP-complete," SIAM J. Algebraic and
Discrete Methods 2 (1981), pp. 77-79.
AUTOMATIC MESH PARTITIONING
Abstract. This paper describes an efficient approach to partitioning unstructured meshes that
occur naturally in the finite element and finite difference methods. This approach makes use of the
underlying geometric structure of a given mesh and finds a provably good partition in random O(n)
time. It applies to meshes in both two and three dimensions. The new method has applications in
efficient sequential and parallel algorithms for large-scale problems in scientific computing. This is
an overview paper written with emphasis on the algorithmic aspects of the approach. Many detailed
proofs can be found in companion papers.
Keywords: center points, domain decomposition, finite element and finite difference meshes, ge-
ometric sampling, mesh partitioning, nested dissection, Radon points, overlap graphs, separators,
stereographic projections.
which use good partitioning to either decrease the number of iterations used or the
time used by direct methods.
Several numerical techniques have been developed using the partitioning method
to solve problems on a parallel system. Examples include domain decomposition and
nested dissection. Domain decomposition divides the nodes among processors of a
parallel computer. An iterative method is formulated that allows each processor to
operate independently. See Bramble, Pasciak and Schatz [11], Chan and Resasco
[13], and Bjørstad and Widlund [9]. Nested dissection is a divide-and-conquer node
ordering for sparse Gaussian elimination, proposed by George [34] and generalized
by George and Liu [36] and Lipton, Rose and Tarjan [49]. Nested dissection was
originally a sequential algorithm, pivoting on a single element at a time, but it is
an attractive parallel ordering as well because it produces blocks of pivots that can
be eliminated independently in parallel. Parallel nested dissection was suggested by
Birkhoff and George [8] and has been implemented in several settings [12, 21, 35, 84];
its complexity was analyzed by Liu [52] (for the regular square grid) and Pan and
Reif [63] (in the general case).
Vaidya has produced results which indicate that the quality of good precondition-
ers may also be linked to the existence of good partitions [78].
Therefore, one of the key problems in solving large-scale computational problems
on a parallel machine is the question of how to partition the underlying meshes in
order to reduce the total communication cost and to achieve load balance.
If a mesh has a sufficiently regular structure, then it is easy to decide in advance
how to distribute it among the processors of a parallel machine. However, meshes
of many applications are irregular and unstructured, making the partition problem
much more difficult. In general, there are meshes in three dimensions which have no
small partition [59]. These examples are not the type that would naturally arise in the
finite element methods, but they are meshes. One important goal is to understand
which meshes do and which do not have small partitions.
Various heuristics have been developed and implemented [65, 68, 82]. However,
none of the prior mesh partitioning algorithms is both efficient in practice and prov-
ably good, especially for meshes from three-dimensional problems. Leighton and Rao
[46] have designed a partitioning algorithm based on multicommodity flow problems,
which finds a separator that is optimal within logarithmic factors. But their algo-
rithm runs in superlinear time and it remains to be seen if it could be used in practice
for large-scale problems.
1.1. A new method. In a series of papers, the authors (Vavasis [81]; Miller
and Thurston [59]; Miller and Vavasis [60]; Miller and Teng [55]; Miller, Teng, and
Vavasis [56]) have developed an efficient and provably good mesh partitioning method.
This overview paper describes this new approach. It is written with emphasis on the
algorithmic aspects of the approach. Many detailed proofs can be found in companion
papers [57, 58].
This method applies to meshes in both two and three dimensions. It is based on
the following important observation: graphs from large-scale problems in scientific
computing are often defined geometrically. They are meshes of elements in a fixed
dimension (typically two and three dimensions), that are well shaped in some sense,
such as having elements of bounded aspect ratio or having elements with angles
that are not too small. In other words, they are graphs embedded in two or three
dimensions that come with natural geometric coordinates and with structures.
Our approach makes use of the underlying geometric structure of a given mesh
and finds a provably good partition efficiently. The main ingredient of this approach
is a novel geometrical characterization of graphs embedded in a fixed dimension that
have a small separator, which is a relatively small subset of vertices whose removal
divides the rest of the graph into two pieces of approximately equal size. By taking
advantage of the underlying geometric structure, we also develop an efficient algorithm
for finding such a small separator.
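The notion of a balanced separator can be made concrete with a small checker. The sketch below is our own illustration (the default balance constant δ = 2/3 is a common convention, not necessarily the paper's): S is a δ-balanced separator of an n-vertex graph if every connected component left after removing S has at most δn vertices.

```python
from collections import defaultdict, deque

def is_balanced_separator(n, edges, sep, delta=2/3):
    """Check that removing sep leaves connected components of
    size at most delta * n each (BFS over the residual graph)."""
    sep = set(sep)
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v); adj[v].add(u)
    seen = set(sep)                      # separator vertices are skipped
    for start in range(n):
        if start in seen:
            continue
        size, queue = 0, deque([start])  # explore one residual component
        seen.add(start)
        while queue:
            u = queue.popleft()
            size += 1
            for w in adj[u]:
                if w not in seen:
                    seen.add(w); queue.append(w)
        if size > delta * n:
            return False
    return True
```

On a 5-vertex path, the middle vertex alone is a balanced separator, while the empty set is not.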
In contrast, all previous separator results (see Section 1.2) are combinatorial in
nature. They not only characterize the small separator property combinatorially, but
also find a small separator based only on the combinatorial structure of the given
graph. When applied to unstructured geometric meshes, they simply discard the
valuable geometric information. The result has been that they are either too costly
to use or they do not find a separator as good as it should be. Worst of all, none of
the earlier separator results is useful for graphs in three dimensions.
Separator results for families of graphs closed under the subgraph operation im-
mediately lead to divide-and-conquer recursive algorithms for many applications. In
general, the efficiency of such algorithms depends on \delta being bounded away from 1
and f(n) being a slowly-growing function.
Perhaps the most classical application of small separator results is nested dissec-
tion, a widely used technique for solving a large class of sparse linear systems. This
approach was pioneered by George [34], who designed the first O(n^{1.5})-time nested dis-
section algorithm for linear systems on regular grids using the fact that the \sqrt{n} \times \sqrt{n}
grid has a \sqrt{n}-separator. His result was extended to planar linear systems by Lip-
ton, Rose, and Tarjan [49]. Gilbert and Tarjan [40] examined several variants of the
nested dissection algorithms. It has been demonstrated, in theory and in practice,
that nested dissection can be implemented efficiently in parallel.
In the analysis of sparse matrix algorithms, a priori upper bounds on operation
counts are rare in the literature (aside from the trivial dense-matrix upper bounds).
The major exception is nested dissection. The a priori bounds attained by nested
dissection, which in many cases are asymptotically the best possible, always depend
on the associated bounds of the underlying graph-separator algorithm. This means
that a careful analysis of separator sizes is an important aspect of nested dissection.
Small separator results have found fruitful applications in VLSI design (Leiser-
son [47]; Leighton [45]; Valiant [79]) and efficient message routing (Fredrickson and
Janardan [28]). They have also been used in proving several complexity-theoretic
results (Paterson [62]; Lipton and Tarjan [51]), and have been used to design efficient
graph algorithms such as parallel construction of breadth-first-search trees (Pan and
Reif [63]), testing graph isomorphism (Gazit [33]), and approximating NP-complete
problems (Lipton and Tarjan [51]).
1.3. Outline of the paper. Section 2 defines a new class of geometric graphs,
the overlap graphs, and describes our main separator theorem. This class has a simple
definition and contains many important classes of graphs as special cases. Section 3
studies meshes from the finite element and finite difference methods. We show that
overlap graphs include "well shaped" meshes. We also show that planar graphs are
a special case of overlap graphs in two dimensions. Section 4 presents a partitioning
algorithm for overlap graphs. The algorithm first uses the geometric information of
the input graph to find a "continuous" separator, then uses the combinatorial struc-
ture to compute a "discrete" separator from its continuous counterpart. The central
step of the algorithm is to find a center point of a point set in a fixed dimension,
where a center point is a point such that every hyperplane passing through it about
evenly divides the point set. We show that center points always exist and can be
computed in polynomial time using linear programming. We also show that the step
of computing a "discrete" separator from a continuous one can be performed in linear
time. Section 5 introduces geometric sampling, a technique that reduces the prob-
lem size and simultaneously guarantees a provably good approximation of the larger
problem. Using geometric sampling, we can compute an approximate center point in
random constant time and find a "good" separator of an overlap graph in random
linear time. We further give a practical heuristic for approximating a center point.
We then extend the partitioning algorithm to unstructured meshes. Section 6 gives
the proof outline of the main separator theorem. We demonstrate how to use geomet-
ric arguments to prove separator properties for graphs embedded in fixed dimension.
Section 7 summarizes the paper and gives open questions.
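A cheap way to get a feel for approximate center points is the coordinate-wise median of a random sample. This is our own illustrative heuristic, not the paper's method (the paper computes center points by linear programming and approximates them via geometric sampling); the coordinate-wise median is not a true center point in general, but it is easy to compute and often serves as a reasonable splitting point.

```python
import random
import statistics

def approx_center_point(points, sample_size=1000):
    """Heuristic stand-in for a center point: the coordinate-wise
    median of a random sample of the input points."""
    sample = random.sample(points, min(len(points), sample_size))
    dim = len(sample[0])
    return tuple(statistics.median(p[i] for p in sample) for i in range(dim))
```

On a symmetric point set the heuristic recovers the obvious center; for a true center point every hyperplane through it must leave at least n/(d+1) points on each side.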
2.1. Neighborhood systems. DEFINITION 2.1. Let P = {p_1, ..., p_n} be points
in \mathbb{R}^d. A k-ply neighborhood system for P is a set, {B_1, ..., B_n}, of closed balls
such that (1) B_i is centered at p_i and (2) no point p \in \mathbb{R}^d is strictly interior to more
than k balls from B.
A 3-ply neighborhood system in two dimensions is illustrated in Figure 1.
The following notation will be used throughout this paper. For each positive real
\alpha, if B is a ball of radius r in \mathbb{R}^d, then \alpha \cdot B denotes the ball with the same center
as B but radius \alpha r.
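Definition 2.1 can be probed numerically. The helper below (our own illustration) counts how many balls' strict interiors contain a candidate point; the ply of the system is the maximum of this count over all of \mathbb{R}^d, which in practice one can only estimate by probing candidate locations such as ball centers or pairwise midpoints.

```python
import math

def ply_at(point, centers, radii):
    """Number of balls whose strict interior contains the given point.
    The ply of the neighborhood system is the max over all points."""
    return sum(math.dist(point, c) < r for c, r in zip(centers, radii))
```

For two unit-spaced centers with radius 2, the midpoint lies strictly inside both balls, so this system has ply at least 2.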
We now state an important property of neighborhood systems [58].
LEMMA 2.2 (BALL INTERSECTION). Suppose {B_1, ..., B_n} is a k-ply neighbor-
hood system in \mathbb{R}^d. For each d-dimensional ball B with radius r, for all constant
\beta : 0 < \beta \leq 1,
2.2. Overlap graphs. DEFINITION 2.3. Let \alpha \geq 1 and let {B_1, ..., B_n} be
a k-ply neighborhood system for P = {p_1, ..., p_n}. The (\alpha, k)-overlap graph for
the k-ply neighborhood system {B_1, ..., B_n} is the undirected graph with vertices V =
{1, ..., n} and edges
For simplicity, we call a (1, k)-overlap graph a k-intersection graph. In the case
that \alpha = 1 and k = 1, and no two balls in the neighborhood system have a common
point in their interior, we have the family of graphs known as sphere-packings; this
interesting class of graphs will be discussed in the next section.
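The edge set in Definition 2.3 was lost in reproduction; under our reading (following the companion papers), (i, j) is an edge when B_i meets \alpha \cdot B_j and \alpha \cdot B_i meets B_j, i.e. when dist(p_i, p_j) \leq r_i + \alpha r_j and dist(p_i, p_j) \leq \alpha r_i + r_j. The sketch below builds the edge list under that assumed rule; with \alpha = 1 it reduces to "balls intersect," the k-intersection graph.

```python
import itertools
import math

def overlap_edges(centers, radii, alpha=1.0):
    """Edge list of the (alpha, k)-overlap graph under the assumed rule:
    i ~ j iff B_i meets alpha*B_j and alpha*B_i meets B_j."""
    edges = []
    for i, j in itertools.combinations(range(len(centers)), 2):
        d = math.dist(centers[i], centers[j])
        if d <= radii[i] + alpha * radii[j] and d <= alpha * radii[i] + radii[j]:
            edges.append((i, j))
    return edges
```

With unit spacing and radius 0.6, only the adjacent pair of balls touches, so only that edge appears.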
an O\left(\alpha \, k^{1/d} \, n^{(d-1)/d} + q(\alpha, k, d)\right)-separator
that (d+1)/(d+2)-splits. Furthermore, such a separator that ((d+1+\epsilon)/(d+2))-splits
can be computed in random linear time sequentially and in random constant time,
using n processors, for any 1/n^{1/(2d)} < \epsilon < 1.
FIG. 3. US map
3. Finite element and finite difference meshes. One important aspect that
distinguishes a finite element or finite difference mesh from a regular graph is that
it has two structures: the combinatorial structure and the geometric structure. In
general, it can be represented by a pair (G, xyz) where G describes the combinatorial
structure of the mesh and xyz gives the geometric information.
3.1. Meshes from the finite element method. The finite element method
is a collection of numerical techniques for approximating a continuous problem by a
finite structure [69]. To approximate a continuous function, the finite element method
subdivides the domain (a subset of R^d) into a mesh of polyhedral elements (Figures
2 and 3), and then approximates the continuous function by a piecewise polynomial
on the elements.
A common choice for an element in the finite element method is a d-dimensional
simplex, which is the convex hull of (d + 1) affinely independent points in R^d, e.g., a
triangle in two dimensions and a tetrahedron in three dimensions. A d-dimensional
simplicial complex is defined to be a collection of d-dimensional simplices that meet
only at shared faces [6, 7, 59]. So a 2-dimensional simplicial complex is a collection
of triangles that intersect only at shared edges and vertices.
For most applications, a mesh is given as a list of its elements, where each element
is given by the information describing the hierarchical structure of the element: its
lower dimensional structures such as its faces, edges, and vertices. Moreover, each
vertex has geometric coordinates in two or three dimensions.
Associated with each simplicial complex is a natural graph, its 1-skeleton. For ex-
ample, the 1-skeleton of a 2-dimensional simplicial complex is a planar graph. Con-
versely, every planar graph can be embedded in the plane such that each edge is
mapped to a straight line segment (Fary [25]; Tutte [74, 75]; Thomassen [71]; Fraysseix,
Pach, and Pollack [27]).
In the finite element method, a linear system is defined over a mesh, with variables
representing physical quantities at the nodes. Let finite element graph refer to the
nonzero structure of the coefficient matrix of such a linear system. In the case of
linear finite elements based on a triangulation, such as in Figures 2 and 3, the nodes
of the finite element graph are exactly the nodes of the mesh, and hence the finite
element graph is the same as the 1-skeleton of the simplicial complex. In the case of
higher-order elements, the finite element graph usually contains the 1-skeleton as a
proper subset. It can be obtained from the finite element mesh as follows: Identify
certain points (vertices, points on edges, points in faces, and points in elements) as
"nodes." Add edges between every pair of nodes that share an element.
To properly approximate a continuous function, in addition to the conditions
that a mesh must conform to the boundaries of the region and be fine enough, each
individual element of the mesh must be well shaped. A common shape criterion for
elements is the condition that the angles of each element are not too small, or that the
aspect ratio of each element is bounded [6, 29].
Several definitions of the aspect ratio have been used in the literature. We list some
of them.
1. The ratio of the longest dimension to the shortest dimension of the simplex
S, denoted by A_1(S). For a triangle in R^2, it is the ratio of the longest side
to the altitude from the longest side.
2. The ratio of the radius of the smallest containing sphere to the radius of the
inscribed sphere of S, denoted by A_2(S).
3. The ratio of the radius of the circumscribing sphere to the radius of the
inscribed sphere of S, denoted by A_3(S).
Therefore, if one of the above parameters is bounded by a constant, then all of them
are bounded.
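For a triangle in R^2, the first and third measures can be computed in closed form. The sketch below uses the standard area, circumradius (abc/4A), and inradius (A/s) formulas; the function names aspect_A1 and aspect_A3 are ours, not the paper's.

```python
# Two aspect-ratio measures for a triangle in R^2, following the list above.
# Standard formulas; names aspect_A1 / aspect_A3 are illustrative only.
import math

def side_lengths(p, q, r):
    dist = lambda a, b: math.hypot(a[0] - b[0], a[1] - b[1])
    return dist(p, q), dist(q, r), dist(r, p)

def area(p, q, r):
    # twice the signed area, made positive, halved
    return abs((q[0] - p[0]) * (r[1] - p[1]) - (r[0] - p[0]) * (q[1] - p[1])) / 2.0

def aspect_A1(p, q, r):
    """Longest side divided by the altitude from that side."""
    a, b, c = side_lengths(p, q, r)
    longest = max(a, b, c)
    altitude = 2.0 * area(p, q, r) / longest
    return longest / altitude

def aspect_A3(p, q, r):
    """Circumradius divided by inradius."""
    a, b, c = side_lengths(p, q, r)
    A = area(p, q, r)
    s = (a + b + c) / 2.0
    R = a * b * c / (4.0 * A)   # circumradius
    rho = A / s                  # inradius
    return R / rho
```

For an equilateral triangle, A_3 = 2, the minimum possible value, consistent with the remark that a bound on one parameter bounds the others.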
3.2. Graphs from the finite difference method. The finite difference method
is another useful technique for solving computational problems in scientific computing.
It also uses a finite and discrete structure, a finite difference mesh, to approximate a
continuous problem.
Finite difference meshes are often produced by inserting a uniform grid of R^2 or
R^3 into the domain via a boundary-matching conformal mapping. In general, the
derivative of the conformal transformation must be slowly varying with respect to the
mesh size in order to produce good results. See, for example, [72]. This means that
the mesh will probably satisfy a density condition [5, 60].
Let G be an undirected graph and let π be an embedding of its nodes in R^d. We
say π is an embedding of density α if the following inequality holds for all vertices
v in G. Let u be the closest node to v. Let w be the farthest node from v that is
connected to v by an edge of G. Then we require |π(v) − π(w)| ≤ α |π(v) − π(u)|.
3.3. Overlap graphs and well shaped meshes. One of the most valuable
aspects of the class of overlap graphs is that it enables us to give a unified geomet-
ric characterization of graphs with the small separator property. The set of overlap
graphs in R^d contains all finite subgraphs of infinite grids, planar graphs, and sphere-
packing graphs. Moreover, overlap graphs include graphs associated with finite ele-
ment and finite difference methods as special cases. The parameter α, in a strong
sense, measures the degree to which the mesh is well shaped.
We now show that for each well-shaped mesh, there is an overlap graph with a pair
α and k that contains the graph defined by (G, xyz) as a subgraph. We say a graph
G_1 is a spanning subgraph of another graph G_2 if G_1 can be obtained from G_2 by
deleting edges. A graph G is (α, k)-embeddable in R^d if it is a spanning subgraph of
an (α, k)-overlap graph in R^d. Notice that the small separator property is preserved
under spanning subgraphs.
LEMMA 3.1. If G is an α-density graph in R^d, then G is (2α, 1)-embeddable.
Proof: Let π be an embedding of G with density α in R^d. Without loss of generality,
assume that G has vertex set V = {1, 2, ..., n}. Let P = {π(1), π(2), ..., π(n)}. For
each p ∈ P, let c(p) denote the point of P − {p} closest to p. Let Γ = {B_1, ..., B_n},
So π(v) ∈ (2α) · B_u, and therefore (π(u), π(v)) is an edge of G', completing the proof.
□
O(n^{(d−1)/d} + q)-separator.

O_α(n^{(d−1)/d})-separator.
3.4. Overlap graphs include all planar graphs. The proof that planar graphs
are a special case of overlap graphs relies on the following theorem of Andreev and
Thurston [3, 4, 73], characterizing all planar graphs in a novel geometric fashion.
the standard stereographic projection mapping. This mapping can be described as
follows. Assume R^d is embedded in R^{d+1} as the x_{d+1} = 0 coordinate plane, and
assume U_d is also embedded in R^{d+1}, centered at the origin. Given a point p in
R^d, construct the line L in R^{d+1} passing through p and through the north pole
of U_d (that is, the point (0, ..., 0, 1)). Line L must pass through one other point
q of U_d; we define ST(p) to be q. For a set P = {p_1, ..., p_n} in R^d, we denote
{ST(p_1), ST(p_2), ..., ST(p_n)} by ST(P). Recall that the center point of a point set is
one such that every hyperplane passing through it about evenly divides the point
set. We will define center point formally in Section 4.2.
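The map ST and its inverse have simple closed forms. The following sketch (function names are ours) sends p ∈ R^d to the unit sphere U_d in R^{d+1} along the line through the north pole, and projects back.

```python
# Stereographic projection as described above: R^d is the x_{d+1} = 0
# plane in R^{d+1}, and U_d is the unit sphere centered at the origin.
# Names ST / ST_inv are illustrative, not from the paper.

def ST(p):
    """Map p in R^d to the unit sphere U_d in R^(d+1).

    The line through (p, 0) and the north pole (0, ..., 0, 1) meets the
    sphere in exactly one other point q; return q.
    """
    s = sum(x * x for x in p)
    scale = 2.0 / (s + 1.0)
    return tuple(scale * x for x in p) + ((s - 1.0) / (s + 1.0),)

def ST_inv(q):
    """Inverse map: project a sphere point q (not the north pole) to R^d."""
    *head, last = q
    return tuple(x / (1.0 - last) for x in head)
```

The origin of R^d maps to the south pole, and points far from the origin map near the north pole, whose pre-image is the point at infinity mentioned below.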
Algorithm 1 defines some point sets that are not explicitly computed. We intro-
duce them below only for the purpose of explaining the algorithm.
• Let Q_1 = π_1(Q) in Step 3 above;
• Let P_1 = ST^{-1}(Q_1), the pre-image of Q_1 in R^d ∪ {∞}. The pre-image of the
north pole is defined to be a point at infinity;
• Let P_2 = π_2(P_1) in Step 4;
• Let Q_2 = ST(P_2). Note that the origin (0, 0, ..., 0) is a center point of Q_2.
See further comments below.
For each 0 < δ < 1, a point c ∈ R^d is a δ-center point of P if every hyperplane
containing c δ-splits P. Each d/(d + 1)-center point is called a center point of P,
and the set of all center points is denoted by Center(P). The balanced separation
property of a center point makes it very useful for designing efficient divide and
conquer algorithms [16, 30, 55, 83].
Given a set of points P ⊆ R^d, the question of whether P has a center point is
always answered affirmatively. This follows from Helly's Theorem [18].
THEOREM 4.1 (HELLY). Suppose K is a family of at least d + 1 convex sets in
R^d, and K is finite or each member of K is compact. Then if every d + 1 members of
K have a common point, there is a point common to all members of K.
LEMMA 4.2 (CENTER POINTS). For each set P ⊆ R^d, Center(P) ≠ ∅.
Proof:¹ We prove the lemma by induction on d. When d = 1, the lemma is clearly
true. We now assume that the lemma holds for all d' < d. If all points of P lie in a
(d − 1)-dimensional affine space, then we can reduce the dimension by one and apply
the induction hypothesis to prove that a better center point exists.
So without loss of generality, assume that P does not lie in a (d − 1)-dimensional
affine space. Notice that P induces an equivalence relation on the set of closed
halfspaces in R^d: those halfspaces which contain the same subset of points from
P are equivalent. Each equivalence class can be identified with a halfspace whose
supporting hyperplane passes through d affinely independent points from P.
Let ℋ be the set of all closed halfspaces with supporting hyperplane passing
through d affinely independent points of P that contain more than ⌊d|P|/(d + 1)⌋
points of P. We want to show that

Center(P) = ∩_{H ∈ ℋ} H ≠ ∅.
We first show that ∩_{H ∈ ℋ} H ≠ ∅. Clearly, each element from ℋ is convex and ℋ
is finite. By Helly's theorem, it is sufficient to show that for each H_1, ..., H_{d+1} ∈ ℋ,
∩_{i=1}^{d+1} H_i ≠ ∅.
Note that

∩_{i=1}^{d+1} H_i ⊇ P − ∪_{i=1}^{d+1} (R^d − H_i).
¹ We present this proof to indicate that there is an O(n^d) time algorithm for computing a center
point. Similar proofs can be found in many previous works, e.g., [18].
Note also

|∪_{i=1}^{d+1} ((R^d − H_i) ∩ P)| ≤ Σ_{i=1}^{d+1} |(R^d − H_i) ∩ P| < (d + 1)⌊|P|/(d + 1)⌋ ≤ |P|.

Hence, P − ∪_{i=1}^{d+1} (R^d − H_i) ≠ ∅.
We now show that each point c in ∩_{H ∈ ℋ} H is a center point of P. Suppose c is
not a center point of P. Then there is a hyperplane h passing through c defining a
halfspace H such that the interior of H contains at least ⌈d|P|/(d + 1)⌉ points of P.
Thus, there is a closed halfspace H' contained in the interior of H that has at least
⌈d|P|/(d + 1)⌉ points of P, contradicting the assumption that c ∈ H'. Therefore,
every point in ∩_{H ∈ ℋ} H is a center point of P. Similarly, we can show that every
center point of P is in ∩_{H ∈ ℋ} H. □
Immediately following from the above proof is an O(n^d) time algorithm for com-
puting a center point of a set P. This algorithm uses linear programming. It forms
a collection of O(n^d) linear inequalities by considering the set of hyperplanes passing
through d affinely independent points of P, and finds the common intersection of
the halfspaces that contain at least dn/(d + 1) points from P. The intersection of
the O(n^d) halfspaces can be found in O(n^d) time using Megiddo's linear program-
ming algorithm [22, 53]. We will refer to this algorithm as the LP algorithm. Of course,
this algorithm is too slow for applications in practice. An efficient algorithm will be
presented in Section 6.
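The characterization in the proof can at least be turned into a (slow) membership test. The sketch below, with names of our own choosing, checks whether a candidate c is a center point of a planar point set by enumerating the closed halfplanes bounded by lines through two input points, exactly in the spirit of the O(n^d) description; it is a checker, not the LP algorithm itself.

```python
# d = 2 center-point checker based on the proof of Lemma 4.2: c is a
# center point iff it lies in every closed halfplane, bounded by a line
# through two points of P, that contains more than floor(2|P|/3) points.
# Illustrative sketch; brute force, O(n^3) per query.
import itertools

def halfplane_contains(p, q, sign, x, eps=1e-12):
    # closed halfplane bounded by the line through p and q, side chosen by sign
    cross = (q[0] - p[0]) * (x[1] - p[1]) - (q[1] - p[1]) * (x[0] - p[0])
    return sign * cross >= -eps

def is_center_point(c, P):
    n = len(P)
    for p, q in itertools.combinations(P, 2):
        for sign in (1.0, -1.0):
            inside = sum(halfplane_contains(p, q, sign, x) for x in P)
            if inside > (2 * n) // 3 and not halfplane_contains(p, q, sign, c):
                return False
    return True
```

On a 3×3 grid of points, the middle point passes this test while a corner fails, matching the intuition that center points sit "deep" inside the set.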
If c in Algorithm 1 is a δ-center point of Q, then the origin 0 is also a δ-center
point of Q_2 [58]. First of all, the point c_1 is a δ-center point of Q_1. Now intuitively, a
dilation of R^d moves a center point on the diameter between the south and the north
poles along this diameter, either up or down depending on the dilation factor. We
will prove in our companion paper [58] that the dilation by factor √((1 − r)/(1 + r))
indeed makes 0 a δ-center point of Q_2. So, any hyperplane passing through 0 δ-
splits Q_2, and hence GC δ-splits Q_2. Because all transformations used in the above
partitioning algorithm preserve the splitting ratio of spheres, S also δ-splits P.
4.3. Separating spheres. We now explain how to choose a random great circle
in Algorithm 1.
More specifically, for a set of points P = {p_1, ..., p_n} in R^d and a constant
0 < δ < 1, we say that S δ-splits P if both |int(S) ∩ P| ≤ δn and |ext(S) ∩ P| ≤ δn.
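The definition can be checked directly on a point set by counting the points strictly inside and strictly outside S. A minimal sketch (names are ours):

```python
# Direct check of the delta-split definition for a sphere S with center c
# and radius r: at most delta*n points strictly inside, and at most
# delta*n strictly outside. Illustrative helper, not from the paper.
import math

def delta_splits(c, r, P, delta):
    inside = sum(1 for p in P if math.dist(p, c) < r)
    outside = sum(1 for p in P if math.dist(p, c) > r)
    return inside <= delta * len(P) and outside <= delta * len(P)
```

Points lying exactly on S are counted on neither side, which is why the overlap-neighbor machinery of the next subsection handles them separately.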
A ball B_i of radius r_i is an overlap neighbor of a sphere S of radius r if either
1. B_i ∩ S ≠ ∅; or
2. α · B_i ∩ S ≠ ∅ and r_i ≤ r.
The number of overlap neighbors of S is called the overlap number of S. The
set of overlap neighbors of a sphere can be computed in O(n) time directly from the
neighborhood system.
LEMMA 4.4. The set of all overlap neighbors of S is a vertex cover of G_S.
Proof: From the discussion above, D is a vertex cover of G_S. We want to establish
that each ball from D is an overlap neighbor of S to prove the lemma.
We partition the set D into two subsets D_1 and D_2, with

D_1 = {B_i ∈ D : B_i ∩ S ≠ ∅}
D_2 = D − D_1.
Clearly, each ball from D_1 is an overlap neighbor. Now we need to show that for
all B_i ∈ D_2, r_i ≤ r.
If B_i ∈ D_2, then B_i ∩ S = ∅. There are two possible cases:
• Case 1: If p_i ∈ int(S), then it simply follows from B_i ∩ S = ∅ that r_i ≤ r;
• Case 2: If p_i ∈ ext(S), then from B_i ∈ D_2, it follows that α · B_i ∩ S ≠ ∅,
and there is a ball B_j in the neighborhood system such that (1) p_j ∈ int(S);
(2) B_j ∩ S = ∅; (3) r_i ≤ r_j; and (4) α · B_i ∩ B_j ≠ ∅. Because q_{i,j} is not in B_j
(otherwise B_i would not be in D) there is no intersection between B_j and S,
and hence condition (2) holds. Hence r_j ≤ r, and r_i ≤ r_j ≤ r.
Thus, in each case, we have r_i ≤ r, i.e., B_i is an overlap neighbor, completing the
proof of the lemma. □
So, our second method is to remove the set of all overlap neighbors. The method
is efficient when the overlap graph G is not given. Section 6 shows that the expected
number of overlap neighbors generated by the above algorithm is small.
Notice that the definition of overlap neighbors implicitly removes the assumption
that no point of P lies exactly on S. If the center of B_i lies exactly on S, then
B_i ∩ S ≠ ∅. Hence B_i is an overlap neighbor and is placed in the vertex separator.
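The O(n) scan over the neighborhood system can be sketched as follows. We use the fact that a ball of radius r_i whose center lies at distance `dist` from the center of S meets the sphere S exactly when |dist − r| ≤ r_i; the name `overlap_neighbors` is ours.

```python
# O(n) computation of the overlap neighbors of a (d-1)-sphere S, directly
# from conditions 1 and 2 above. A ball (center ci, radius ri) meets S
# (center c, radius r) iff the distance from ci to the nearest point of S,
# namely |dist(ci, c) - r|, is at most ri. Illustrative sketch.
import math

def overlap_neighbors(balls, c, r, alpha):
    """balls: list of (center, radius) pairs; returns indices of overlap neighbors."""
    out = []
    for i, (ci, ri) in enumerate(balls):
        gap = abs(math.dist(ci, c) - r)
        meets = gap <= ri                            # condition 1
        meets_dilated = gap <= alpha * ri and ri <= r  # condition 2
        if meets or meets_dilated:
            out.append(i)
    return out
```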
5. Making the method practical. The run time of the algorithm above cru-
cially depends on the time needed to compute a center point in (d + 1)-space. All
other steps of the algorithm can be performed in O(n) time, and in constant parallel
time using O(n) processors.
Unfortunately, no linear-time algorithm is known for computing center points. As
shown in Section 4.2, there is a method that requires solving a set of O(n^d) linear
inequalities. The only improved result, due to Cole, Sharir, and Yap [16], is that a
center point in two dimensions can be computed in O(n log^5 n) time, and in three
dimensions in O(n^2 log^7 n) time. No subquadratic algorithm is known that always
returns even an approximate center point.
In this section, we show that an approximate center point can be found efficiently
using geometric sampling [15, 42, 66], which is an important algorithmic technique
for designing efficient geometric algorithms.
5.1. Geometric sampling for efficiency. To illustrate the idea, we first show
how to use random sampling to compute an approximate center point in one dimen-
sion. In this case, the input is a set of 2n integers P = {p_1, ..., p_{2n}}. If p_i < p_j for
all i < j, then Center(P) = [p_n, p_{n+1}].
Now suppose we randomly select an element from P, say p. The probability that
p ∈ {p_n, p_{n+1}} is 1/n, while the probability that p_{⌈n/2⌉} ≤ p ≤ p_{⌈3n/2⌉} is 0.5. So, with
probability 0.5, a randomly selected element from P is an ε = 3/4 center point.
We can improve ε using larger samples. Suppose l random elements S = {r_1, ..., r_l}
are selected and their median r is the output. Letting I(r) be the rank of r in P,
it follows from a simple analysis that E[I(r)] = n, and V[I(r)] = (2n + 1)(2n − 1 −
2l)/(8l + 6). By Chebyshev's inequality,

P(|I(r) − n| ≥ t) ≤ n²/(2lt²).

Thus, with probability at least 0.5, |I(r) − n| ≤ n/√l, i.e., r is a (1/2 + 1/(2√l))-center
point of P. A (1/2 + 1/(2√l))-center point of P can thus be computed in O(l) time. A similar
sampling idea was used by Floyd and Rivest [26] in their fast selection algorithm.
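The one-dimensional claim is easy to check empirically. The sketch below (our own code, not Algorithm 2) draws l samples, returns their median, and counts how often the result lands within n/10 of the true median rank.

```python
# Empirical check of the 1-D sampling idea: the median of l random samples
# from P has rank concentrated near |P|/2. Illustrative experiment only.
import random

def approx_center_1d(P, l, rng):
    sample = sorted(rng.sample(P, l))
    return sample[l // 2]

rng = random.Random(0)
P = list(range(10000))   # sorted integers, so the rank of x is x itself
trials = 200
l = 101
good = 0
for _ in range(trials):
    r = approx_center_1d(P, l, rng)
    if abs(r - len(P) // 2) <= len(P) // 10:   # within n/10 of the median
        good += 1
```

With l = 101, Chebyshev's bound already guarantees a constant success probability; in practice nearly all trials succeed, consistent with the analysis above.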
The algorithm can be generalized to higher dimensions. In d dimensions, the
randomized δ-center point algorithm has the following form.
We now introduce a notation which will be very useful in quantifying the quality of
the center points computed by Algorithm 2. Recall that φ_h(P) is the ratio in which hyperplane
h splits a point set P.
The following lemma shows the importance of the ε-good sample in approximating
center points. Its proof is straightforward.
LEMMA 5.2. For each P ⊆ R^d, if S ⊆ P is an ε-good sample, then each δ-center
point of S is a (δ + ε)-center point of P.
Now the question becomes: how often does a set of l randomly chosen points form
an ε-good sample? This is not a trivial question, but it was in fact answered by Vapnik
and Chervonenkis [80] (see [70] for a detailed proof).
THEOREM 5.3 (VAPNIK AND CHERVONENKIS). There is a constant c_d depending
only on d such that, for each 0 < ε ≤ 1 and l ≥ 2/ε², if S is a set of l randomly chosen
points from P, then
THEOREM 5.4. For all P ⊆ R^d, Algorithm 2 computes a (λ_d + ε)-center point of
P in
According to our experiments, about 800 points work very well for meshes in two
dimensions, and 1100 points work very well for meshes in three dimensions.
THEOREM 5.5. Algorithm 3 computes S in random constant time, and a vertex
separator of an overlap graph in random O(n) time. Using p processors, the time can
be reduced to O(n/p).
Algorithm 3 demonstrates the usefulness of geometry, sampling, and randomiza-
tion in mesh partitioning. The random sampling in the above algorithm reduces the
problem size and simultaneously guarantees a provably good approximation of the
larger problem. It is the underlying geometric structure that ensures the quality of
the partition.
5.3. A fast heuristic for center points. Although the sampling algorithm
(Algorithm 2) for center points (in fixed dimensions) is efficient from a theoretical view-
point, it uses linear programming to solve the center point problem on a smaller
sample point set. The use of linear programming becomes a serious concern in prac-
tical implementation. For example, the experimental results show that the sampling
algorithm needs to choose a sample of about five hundred to eight hundred points in
two dimensions. The sampling algorithm thus needs to solve (500 choose 3) ≈ 20 million
linear inequalities! Worse, the state-of-the-art linear programming algorithms (for fixed
dimensions) have a large constant. The sample size would be larger for higher di-
mensions. The seemingly efficient sampling algorithm is too expensive for practical
applications.
To overcome this difficulty, we have developed a heuristic for approximating center
points [55]. The heuristic uses randomization and runs in linear time in the number of
sample points. Most importantly, it does not use linear programming. Our algorithm
is based on the notion of a Radon point. Let P be a set of points in R^d. A point
q ∈ R^d is a Radon point [18] if P can be partitioned into two disjoint subsets P_1 and
P_2 such that q is a common point of the convex hull of P_1 and the convex hull of P_2.
Such a partition is called a Radon partition.
FIG. 6. The Radon point of four points in R^2. When no point is in the convex hull of the other
three (the left figure), the Radon point is the unique crossing of two line segments. Otherwise
(the right figure), the point that is in the convex hull of the other three is a Radon point.
FIG. 7. The Radon point of five points in R^3. Two cases are similar to those in two dimensions.
The following theorem shows that if |P| ≥ d + 2, then a Radon point always exists.
Moreover, it can be computed efficiently.
THEOREM 5.6 (RADON [18]). Let P be a set of points in R^d. If |P| ≥ d + 2,
then there is a partition (P_1, P_2) of P such that the convex hull of P_1 has a point in
common with the convex hull of P_2.
where p_i = (p_i^1, ..., p_i^d) are the usual coordinates of p_i in R^d. Since n ≥ d + 2, the system
has a nontrivial solution (a_1, ..., a_n). Let U be the set of all i for which a_i > 0,
and V the set for which a_i ≤ 0, and c = Σ_{i∈U} a_i > 0. Then (U, V) is a partition
of P, and Σ_{i∈V} a_i = −c and Σ_{i∈U} (a_i/c) p_i = −Σ_{i∈V} (a_i/c) p_i. Let q = Σ_{i∈U} (a_i/c) p_i =
−Σ_{i∈V} (a_i/c) p_i. The point q is simultaneously written as a convex combination of
points in U and a convex combination of points in V. Hence, q is in the convex hull
of U and the convex hull of V, completing the proof. □
To compute a Radon point of P, we need only compute a Radon point for the
first d + 2 points. It follows from the proof above that a Radon point can be computed
in O(d³) time.
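The construction in the proof can be coded directly. The sketch below handles d = 2 with four points, assuming the first three are affinely independent: it fixes a_4 = 1, solves Σ a_i = 0 and Σ a_i p_i = 0 by Cramer's rule, and returns the common point of the two convex hulls; `radon_point` is our name, not the paper's.

```python
# d = 2 sketch of Radon's construction from the proof: given 4 points,
# solve sum a_i = 0, sum a_i * p_i = 0 with a_4 = 1 (assuming p1, p2, p3
# affinely independent), then split the indices by the sign of a_i.

def radon_point(p1, p2, p3, p4):
    def det3(m):
        return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
              - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
              + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))

    # Columns are (1, x_i, y_i) for p1, p2, p3; right-hand side moves p4 over.
    M = [[1.0, 1.0, 1.0], [p1[0], p2[0], p3[0]], [p1[1], p2[1], p3[1]]]
    b = [-1.0, -p4[0], -p4[1]]
    D = det3(M)
    a = []
    for col in range(3):          # Cramer's rule for a1, a2, a3
        Mc = [row[:] for row in M]
        for r in range(3):
            Mc[r][col] = b[r]
        a.append(det3(Mc) / D)
    a.append(1.0)                 # a4

    pts = [p1, p2, p3, p4]
    pos = [i for i in range(4) if a[i] > 0]
    c = sum(a[i] for i in pos)
    # q = sum over the positive class of (a_i / c) * p_i, as in the proof
    qx = sum(a[i] / c * pts[i][0] for i in pos)
    qy = sum(a[i] / c * pts[i][1] for i in pos)
    return (qx, qy)
```

For the four corners of a square this returns the crossing of the diagonals, matching the left case of Figure 6; when one point lies in the triangle of the other three, it returns that point, matching the right case.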
We now describe our heuristic for approximating center points.
6. Proof outline for the main separator theorem. We now outline the
proof of the Main Theorem 2.4, to show how to use geometric arguments to prove
separator properties for graphs embedded in fixed dimensions. The detailed proofs
are presented in the companion paper [58]. In Section 6.1, we present a continuous
separator theorem, based on which we give a geometric method for proving a small
separator theorem in Section 6.2. We then apply this method to prove Theorem 2.4
in the remainder of this section.
Area(f, S) = ∫_{v ∈ S} (f(v))^{d−1} (dv)^{d−1}
Let π denote a map from R^d to U_d, which is formed by a stereographic projection,
followed by a rotation, followed by an inverse stereographic projection, followed by
a dilation, followed by a stereographic projection. Such a map is called an H-map.
Our partitioning algorithms compute an H-map, choose a random great circle, and
use the inverse of the H-map to transform the great circle to a sphere in R^d.
For each great circle GC of U_d, let S_GC be the sphere defined by π^{-1}(GC). Let
Cost(GC) = Area(f, S_GC). Let Avg(f) be the average cost of all great circles of U_d.
We will use the following lemma, whose proof can be found in the companion paper
[58].
LEMMA 6.1. Suppose f is a cost function in R^d. Then

Avg(f) = O((Total-Volume(f))^{(d−1)/d}).
The splitting ratio of the separator in the theorem above is (d + 1)/(d + 2) instead
of d/(d + 1). This is because points are mapped from R^d to the unit sphere in R^{d+1},
and the center of the unit sphere is a (d + 1)-dimensional center point of the image
rather than a d-dimensional center point.
In order to deduce a vertex separator from its continuous counterpart, the function
f must be faithful, in the sense that the cost of a continuous separator models the
size of a vertex separator of the underlying graph. In other words, the continuous
function f faithfully encodes some combinatorial properties related to separators of
the underlying graph.
We will follow the basic steps above to prove Theorem 2.4. To this end, for each α-
overlap graph of a k-ply neighborhood system Γ = {B_1, ..., B_n} in R^d, we construct
a real-valued function f based on Γ and prove a bound on its total volume;
then we show that from each separating sphere S we can deduce a vertex separator
of size linearly bounded by Area(f, S).
6.3. Local cost functions. Just as each overlap graph is defined from its neigh-
borhood system, B_1, ..., B_n, the cost function f itself is defined from the local cost
functions, f_1, ..., f_n, with f_i based on B_i.
Let P = {p_1, ..., p_n} be the set of centers of {B_1, ..., B_n}, and suppose that the
radius of B_i is r_i. We define f_i as
6.4. Putting local cost functions together. Now we need to put the local
cost functions together into a global cost function. Perhaps the simplest way is to
take the sum, f = Σ_i f_i. But this is not the best choice. To see this, just check the
extreme case where the neighborhood system is a collection of n identical balls and α = 1. In
this case k = n. The total volume of the sum is n^d, while we need a cost function
of total volume O(n^{d/(d−1)}) to establish Theorem 2.4.
To achieve a tight bound, we make use of the "slight" difference between the
various p-norms when applied to high-dimensional vectors. This is a technique that
appears to be new and is interesting in its own right. Recently, Mitchell and Vavasis
[61] used a cost function similar to ours to analyze their three-dimensional mesh
generation algorithm.
Suppose a_1, ..., a_n are reals. For each positive integer p, the L_p norm of a_1, ..., a_n,
denoted L_p(a_1, ..., a_n), is (Σ_{i=1}^n |a_i|^p)^{1/p}.
LEMMA 6.3. Let a_1, ..., a_n be reals. If p ≤ q, then L_p(a_1, ..., a_n) ≥ L_q(a_1, ..., a_n).
Proof: See Hardy, Littlewood and Pólya [41] (pages 26 and 144). □
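Lemma 6.3 is easy to sanity-check numerically; `lp_norm` below is our own helper. For n equal values a, note that L_p = n^{1/p} a, which is exactly the "slight" p-norm difference exploited in the construction below.

```python
# Numerical sanity check of Lemma 6.3: for p <= q, the L_p norm dominates
# the L_q norm. Illustrative helper only.

def lp_norm(xs, p):
    return sum(abs(x) ** p for x in xs) ** (1.0 / p)
```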
We define the global cost function of the overlap graph to be the L_{d−1} norm of
f_1, ..., f_n, i.e., f = L_{d−1}(f_1, ..., f_n).
Notice that the L_d norm of the f_i is not a good choice, because its total volume is nV_d
for all neighborhood systems. The following lemma, proved in [58], bounds the total
volume of the function f.
LEMMA 6.4. Let Γ = {B_1, ..., B_n} be a k-ply neighborhood system in R^d. If
f_1, ..., f_n are the local cost functions of Γ and f is its global cost function, then

Total-Volume(f) = O(α^{d/(d−1)} k^{1/(d−1)} n).
Recall that a ball B_i of radius r_i is an overlap neighbor of a sphere S of radius r if either
1. B_i ∩ S ≠ ∅; or
2. α · B_i ∩ S ≠ ∅ and r_i ≤ r.
The number of overlap neighbors of S is called the overlap number of S, denoted
ϑ_Γ(S). The following lemma bounds the overlap number ϑ_Γ(S) of a k-ply neighbor-
hood system in R^d.
LEMMA 6.6. Suppose Γ = {B_1, ..., B_n} is a k-ply neighborhood system in R^d.
Let f_1, ..., f_n be local cost functions defined for the α-overlap graph of Γ and let
f = L_{d−1}(f_1, ..., f_n). For each (d − 1)-sphere S,

ϑ_Γ(S) = O(α^d k) + O(Area(f, S)).

The constant in the big-O notation depends only on d.
Theorem 2.4 follows from Lemma 6.4, Theorem 6.2, and Lemma 6.6.
Recently, Eppstein, Miller and Teng [24] have shown that a small separator for the
intersection graph of a k-ply neighborhood system can in fact be found in deterministic
linear time.
REFERENCES
[1] A. Agrawal and P. Klein. Cutting down on fill using nested dissection: Provably good elimi-
nation orderings. In Sparse Matrix Computations: Graph Theory Issues and Algorithms,
IMA Volumes in Mathematics and its Applications, (this book), A. George, J. Gilbert and
J. Liu, Springer-Verlag, New York, 1992.
[2] N. Alon, P. Seymour, and R. Thomas. A separator theorem for non-planar graphs. In Proceed-
ings of the 22nd Annual ACM Symposium on Theory of Computing, Maryland, May 1990.
ACM.
[3] E. M. Andreev. On convex polyhedra in Lobacevskii space. Math. USSR Sbornik, 10(3):413-440,
1970.
[4] E. M. Andreev. On convex polyhedra of finite volume in Lobacevskii space. Math. USSR Sbornik,
12(2):255-259, 1970.
[5] M. J. Berger and S. Bokhari. A partitioning strategy for nonuniform problems on multipro-
cessors. IEEE Trans. Comp., C-36:570-580, 1987.
[6] M. Bern, D. Eppstein and J. R. Gilbert. Provably good mesh generation. In 31st Annual
Symposium on Foundations of Computer Science, IEEE, 231-241, 1990, (to appear JCSS).
[7] M. Bern and D. Eppstein. Mesh generation and optimal triangulation. In Computing in
Euclidean Geometry, F. K. Hwang and D.-Z. Du, editors, World Scientific, 1992.
[8] G. Birkhoff and A. George. Elimination by nested dissection. Complexity of Sequential and
Parallel Numerical Algorithms, J. F. Traub, Academic Press, 1973.
[9] P. E. Bjørstad and O. B. Widlund. Iterative methods for the solution of elliptic problems on
regions partitioned into substructures. SIAM J. Numer. Anal., 23:1097-1120, 1986.
[10] G. E. Blelloch. Vector Models for Data-Parallel Computing. MIT Press, Cambridge, MA, 1990.
[11] J. H. Bramble, J. E. Pasciak, and A. H. Schatz. An iterative method for elliptic problems on
regions partitioned into substructures. Math. Comp., 46:361-9, 1986.
[12] D. Calahan. Parallel solution of sparse simultaneous linear equations. In Proceedings of the
11th Annual Allerton Conference on Circuits and Systems Theory, 729-735, 1973.
[13] T. F. Chan and D. C. Resasco. A framework for the analysis and construction of domain
decomposition preconditioners. UCLA-CAM-87-09, 1987.
[14] L. P. Chew. Guaranteed quality triangular meshes. Department of Computer Science, Cornell
University, TR 89-893, 1989.
[15] K. Clarkson. Fast algorithm for the all-nearest-neighbors problem. In the 24th Annual Sym-
posium on Foundations of Computer Science, 226-232, 1983.
[16] R. Cole, M. Sharir and C. K. Yap. On k-hulls and related problems. SIAM J. Computing, 61,
1987.
[17] J. H. Conway and N. J. A. Sloane. Sphere Packings, Lattices and Groups. Springer-Verlag,
1988.
[18] L. Danzer, B. Grünbaum and V. Klee. Helly's theorem and its relatives. Proceedings of Symposia
in Pure Mathematics, American Mathematical Society, 7:101-180, 1963.
[19] H. N. Djidjev. On the problem of partitioning planar graphs. SIAM J. Alg. Disc. Math.,
3(2):229-240, June 1982.
[20] A. L. Dulmage and N. S. Mendelsohn. Coverings of bipartite graphs. Canadian J. Math., 10,
pp. 517-534, 1958.
[21] I. S. Duff. Parallel implementation of multifrontal schemes. Parallel Computing, 3, 193-204,
1986.
[22] M. E. Dyer. On a multidimensional search procedure and its application to the Euclidean
one-centre problem. SIAM Journal on Computing, 13, pp. 31-45, 1984.
[23] D. Eppstein, G. L. Miller, C. Sturtivant and S.-H. Teng. Approximating center points with and
without linear programming. Manuscript, Massachusetts Institute of Technology, 1992.
[24] D. Eppstein, G. L. Miller and S.-H. Teng. A deterministic linear time algorithm for geometric
separators and its applications. Manuscript, Xerox Palo Alto Research Center, 1991.
[25] I. Fary. On straight line representing of planar graphs. Acta. Sci. Math., 24:229-233, 1948.
[26] R. W. Floyd and R. L. Rivest. Expected time bounds for selection. CACM, 18(3):165-173,
March 1975.
[27] H. de Fraysseix, J. Pach, and R. Pollack. Small sets supporting Fary embeddings of planar
graphs. In Proceedings of the 20th Annual ACM Symposium on Theory of Computing,
426-433, 1988.
[28] G. N. Fredrickson and R. Janardan. Separator-based strategies for efficient message routing.
In 27th Annual Symposium on Foundations of Computer Science, IEEE, 428-437, 1986.
[29] I. Fried. Condition of finite element matrices generated from nonuniform meshes. AIAA J., 10,
pp. 219-221, 1972.
[30] A. M. Frieze, G. L. Miller and S.-H. Teng. Separator based divide and conquer in compu-
tational geometry. Proceedings of the 1992 ACM Symposium on Parallel Algorithms and
Architectures, 1992.
[31] H. Gazit. An improved algorithm for separating a planar graph. Manuscript, Department of
Computer Science, University of Southern California, 1986.
[32] H. Gazit and G. L. Miller. A parallel algorithm for finding a separator in planar graphs. In 28th
Annual Symposium on Foundations of Computer Science, IEEE, 238-248, Los Angeles,
October 1987.
[33] H. Gazit. A deterministic parallel algorithm for planar graph isomorphism. In 32nd Annual
Symposium on Foundations of Computer Science, IEEE, to appear, 1991.
[34] J. A. George. Nested dissection of a regular finite element mesh. SIAM J. Numerical Analysis,
10:345-363, 1973.
[35] A. George, M. T. Heath, J. Liu, E. Ng. Sparse Cholesky factorization on a local-memory
multiprocessor. SIAM J. on Scientific and Statistical Computing, 9, 327-340, 1988.
[36] J. A. George and J. W. H. Liu. An automatic nested dissection algorithm for irregular finite
element problems. SIAM J. on Numerical Analysis, 15, 1053-1069, 1978.
[37] J. A. George and J. W. H. Liu. Computer Solution of Large Sparse Positive Definite Systems.
Prentice-Hall, 1981.
[38] J. R. Gilbert, J. P. Hutchinson, and R. E. Tarjan. A separator theorem for graphs of bounded
genus. J. Algorithms, 5, pp. 391-407, 1984.
[39] J. R. Gilbert, G. L. Miller, and S.-H. Teng. Geometric mesh partitioning: Implementation and
experiments. Technical Report, Xerox Palo Alto Research Center, to appear, 1992.
[40] J. R. Gilbert and R. E. Tarjan. The analysis of a nested dissection algorithm. Numerische
Mathematik, 50(4):377-404, 1987.
[41] G. Hardy, J. E. Littlewood and G. Pólya. Inequalities. Second edition, Cambridge University
Press, 1952.
[42] D. Haussler and E. Welzl. ε-nets and simplex range queries. Discrete & Computational Geometry,
2:127-151, 1987.
[43] J. P. Hutchinson and G. L. Miller. On deleting vertices to make a graph of positive genus
planar. In Discrete Algorithms and Complexity Theory - Proceedings of the Japan-US Joint
Seminar, Kyoto, Japan, pages 81-98, Boston, 1986. Academic Press.
[44] C. Jordan. Sur les assemblages de lignes. Journal Reine Angew. Math., 70:185-190, 1869.
[45] F. T. Leighton. Complexity Issues in VLSI. Foundations of Computing. MIT Press, Cambridge,
MA, 1983.
[46] F. T. Leighton and S. Rao. An approximate max-flow min-cut theorem for uniform multi-
commodity flow problems with applications to approximation algorithms. In 29th Annual
Symposium on Foundations of Computer Science, pp. 422-431, 1988.
[47] C. E. Leiserson. Area Efficient VLSI Computation. Foundations of Computing. MIT Press,
Cambridge, MA, 1983.
[48] C. E. Leiserson and J. G. Lewis. Orderings for parallel sparse symmetric factorization. In 3rd
SIAM Conference on Parallel Processing for Scientific Computing, 1987.
[49] R. J. Lipton, D. J. Rose, and R. E. Tarjan. Generalized nested dissection. SIAM J. on
Numerical Analysis, 16:346-358, 1979.
[50] R. J. Lipton and R. E. Tarjan. A separator theorem for planar graphs. SIAM J. of Appl.
Math., 36:177-189, April 1979.
[51] R. J. Lipton and R. E. Tarjan. Applications of a planar separator theorem. SIAM J. Comput.,
9(3):615-627, August 1981.
[52] J. W. H. Liu. The solution of mesh equations on a parallel computer. In 2nd Langley Conference
on Scientific Computing, 1974.
[53] N. Megiddo. Linear programming in linear time when the dimension is fixed. SIAM Journal
on Computing, 12, pp. 759-776, 1983.
[54] G. L. Miller. Finding small simple cycle separators for 2-connected planar graphs. Journal of
Computer and System Sciences, 32(3):265-279, June 1986.
[55] G. L. Miller and S.-H. Teng. Centerpoints and point divisions. Manuscript, School of Computer
Science, Carnegie Mellon University, 1990.
[56] G. L. Miller, S.-H. Teng, and S. A. Vavasis. A unified geometric approach to graph separators.
In 32nd Annual Symposium on Foundations of Computer Science, IEEE, pp. 538-547, 1991.
[57] G. L. Miller, S.-H. Teng, W. Thurston and S. A. Vavasis. Separators for sphere-packings and
nearest neighborhood graphs. In progress, 1992.
[58] G. L. Miller, S.-H. Teng, W. Thurston and S. A. Vavasis. Finite element meshes and geometric
separators. In progress, 1992.
[59] G. L. Miller and W. Thurston. Separators in two and three dimensions. In Proceedings of the
22nd Annual ACM Symposium on Theory of Computing, pages 300-309, Maryland, May
1990. ACM.
[60] G. L. Miller and S. A. Vavasis. Density graphs and separators. In Second Annual ACM-
SIAM Symposium on Discrete Algorithms, pages 331-336, San Francisco, January 1991.
ACM-SIAM.
[61] S. A. Mitchell and S. A. Vavasis. Quality mesh generation in three dimensions. Proc. ACM
Symposium on Computational Geornetry, pp 212-221, 1992.
84
[62] M. S. Paterson. Tape bounds for time-bounded Turing machines. J. Comp. Sgst. Sci., 6:116-
124,1972.
[63] V. Pan and J. Reif. Efficient parallel solution of linear systems. In Proceedings of the 17th
Annual ACM Symposium on Theorg of Computing, pages 143-152, Providenee, Rl, May
1985. ACM.
[64] A. Pothen and C.-J. Fan. Computing the block triangular form of a sparse matrix. ACM
Transactions on Mathematical Software 16 (4), pp 303-324, 1990.
[65] A. Pothen, H. D. Simon, K.-P. Liou. Partitioning sparse matrices with eigenvectors of graphs.
SIAM J. Matrix Anal. Appi. 11 (3), pp 430-452, July, 1990.
[66] J. H. Reif and S. Sen. Polling: A new randomized sampling technique for computational
geometry. In Proceedings of the 21st annual ACM Symposium on Theorg of Compudng.
394-404,1989.
[67] E. Schwabe, G. Blelloch, A. Feldmann, O. Ghattas, J. Gilbert, G. Miller, D. O'Hallaran, J.
Schewchuk and S.-H. Teng. A separator-based framework for automated partitioning and
mapping ofparallel algorithms in scientific computingo In First Annual Dartmouth Summer
Institute on Issues and Obstacles in the Practical Implementation of Parallei Algorithms
and the use of Parallei Machines, 1992.
[68] H. D. Simon. Partitioning of unstructured problems for paralleI processingo Computing Systems
in Engineering 2:(2/3), ppI35-148.
[69] G. Strang and G. J. Fix. An Analysis of the Finite Element Method, Prentice-Hall, 1973.
[70] S.-H. Teng. Points, Spheres, and Separators: A Unified Geometric Approach to Graph Parti-
tioning. PhD thesis, School of Computer Science, Carnegie Mellon University, Pittsburgh,
1991. CMU-C5-91-184.
[71] C. Thomassen. Planarity and duality of finite and infinite graphs. Journal of Combinatorial
Theorg, Series B, 29: 244-271, 1980.
[72] J. F. Thompson, Z. U. A. Warsi and C. W. Mastin. Numerieal Grid Generation: Foundations
and Applications. New York, North Holland, 1985.
[73] W. P. Thurston. The geometrg and topology of 3-manifolds. Princeton University Notes, 1988.
[74] W. T. Tutte. Convex representations of graphs. Proc. London Math. Soe. 10(3): 304-320,
1960.
[75] W. T. Tutte. How to draw a graph. Proc. London Math. Soe. 13(3): 743-768, 1963.
[76] J. D. Ullman. Computational Aspeefs of VLSI. Computer Science Press, Rockville MD, 1984.
[77] P. Ungar. A theorem on planar graphs. Journal London Math Soe. 26: 256-262, 1951.
[78] P. M. Vaidya. Constructing provably good cheap preconditioners for certain symmetric positive
definite matrices. IMA Workshop on Sparse Matrix Computation: Graph Theorg Issues
and Algorithms, Minneapolis, Minnesota, October 1991.
[79] L. G. Valiant. Universality consideration in VLSI circuits. IEEE Transaction on Computers,
30(2): 135-140, February, 1981.
[80] V. N. Vapnik and A. Ya. Chervonenkis. On the uniform convergence of relative frequencies of
events to their probabiJities. Theorg Probab. Appi., 16: 264-280, 1971.
[81] S. A. Vavasis. Automatic domain partitioning in three dimensions. SIAM J. Sci. Stat. Comp.,
12 (1991) 950-970.
[82] R. D. Williams. Performance of dynamic load balancing algorithms for unstructured mesh
calculations. Technical Report, California Institute of Technology, 1990.
[83] F.-F. Yao. A 3-space partition and its application. In Proceedings of the 15th Annual ACM
Symposium on Theorg of Computing, ACM, 258-263, 1983.
[84] E. E. Zmijewski. Sparse Cho/esky Factorization on a Multiprocessor. PhD thesis, Department
of Computer Science, Comell University, 1987.
STRUCTURAL REPRESENTATIONS OF SCHUR COMPLEMENTS
IN SPARSE MATRICES
Abstract. This paper considers effective implicit representations for the nonzero structure of
a Schur complement in a sparse matrix. Each is based on a characterization of the structure in
terms of paths in the graph of the matrix and/or its triangular factors. Three path-preserving
transformations - quotient graphs, edge pruning, and monotone transitive reduction - are used
to further reduce the size/cost.
where the leading (k−1) × (k−1) principal submatrix A_{KK} is nonsingular and has
an LU factorization. Assume that k − 1 steps of Gaussian elimination have been
performed on A, eliminating the rows/columns associated with A_{KK}. In other words,
the matrix A has been factored as

    A = ( L_{KK}   0 ) ( U_{KK}  U_{KK̄} )
        ( L_{K̄K}   I ) (   0     R_{K̄K̄} )

where R_{K̄K̄} = A_{K̄K̄} − A_{K̄K} A_{KK}^{-1} A_{KK̄} is the Schur complement of A_{KK} in A.
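This factorization can be checked numerically. A minimal sketch, assuming NumPy and a hypothetical 4-by-4 matrix (not the paper's example): the trailing block left by k − 1 steps of Gaussian elimination equals the Schur complement R_{K̄K̄} = A_{K̄K̄} − A_{K̄K} A_{KK}^{-1} A_{KK̄}.

```python
import numpy as np

# Sketch: the trailing block after k-1 elimination steps equals the Schur
# complement. The 4-by-4 matrix is hypothetical, not the paper's Figure 1.
A = np.array([[4.0, 1.0, 2.0, 0.0],
              [1.0, 3.0, 0.0, 1.0],
              [2.0, 0.0, 5.0, 1.0],
              [0.0, 1.0, 1.0, 4.0]])
n, k = 4, 3
K = slice(0, k - 1)       # K     = {1, ..., k-1} (0-based indices 0..k-2)
Kb = slice(k - 1, None)   # K-bar = {k, ..., n}

# R_{K-bar,K-bar} = A_{K-bar,K-bar} - A_{K-bar,K} A_{KK}^{-1} A_{K,K-bar}
R = A[Kb, Kb] - A[Kb, K] @ np.linalg.solve(A[K, K], A[K, Kb])

G = A.copy()              # run k-1 steps of Gaussian elimination, no pivoting
for j in range(k - 1):
    for i in range(j + 1, n):
        G[i, j + 1:] -= (G[i, j] / G[j, j]) * G[j, j + 1:]
        G[i, j] = 0.0
```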
In this paper, we study effective ways to represent the nonzero structure of a Schur
complement in a sparse matrix. Such structural representations are helpful in two
important contexts. First, they can be used as data structures when designing effi-
cient symbolic factorization schemes. Second, they can be used to provide structural
information on the uneliminated portion of the matrix when determining good sparse
matrix reorderings. Indeed, even in numerical factorization they can be helpful in the
selection of pivots (without having to compute the numeric values of the entire Schur
complement).
It is well known that the sparse factorization process can be modeled using a
sequence of graph structures, called elimination graphs. Such sequences were de-
scribed by Parter [9] for the symmetric case, and by Haskins and Rose [5] for the
unsymmetric case. But the graph associated with the submatrix R_{K̄K̄} is simply the
* Department of Computer Science and Research Center for Scientific Computation, Yale Uni-
versity, New Haven, Connecticut 06520. The research of this author was supported in part by U.S.
Army Research Office contract DAAL03-91-G-0032.
† Department of Computer Science, York University, North York, Ontario, Canada M3J 1P3.
The research of this author was supported in part by Natural Sciences and Engineering Research
Council of Canada grant A5509, and in part by the Institute for Mathematics and its Applications
with funds provided by the National Science Foundation.
elimination graph after all of the nodes corresponding to A_{KK} have been eliminated.
Interpreted in these terms, the purpose of this paper is to study effective ways to
represent (directed and undirected) elimination graphs.
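One step of the elimination-graph model can be sketched as follows; the 4-vertex digraph and the `eliminate` helper are illustrative assumptions, not code from the paper. Eliminating a vertex v connects every remaining predecessor of v to every remaining successor of v.

```python
# Sketch of one step of the directed elimination-graph model: eliminating a
# vertex v joins each remaining predecessor of v to each remaining successor
# of v (self-loops suppressed). The 4-vertex digraph is hypothetical.
def eliminate(edges, v):
    preds = {i for (i, j) in edges if j == v and i != v}
    succs = {j for (i, j) in edges if i == v and j != v}
    kept = {(i, j) for (i, j) in edges if v not in (i, j)}
    return kept | {(i, j) for i in preds for j in succs if i != j}

g = {(2, 1), (1, 3), (3, 4)}
g1 = eliminate(g, 1)       # the path 2 -> 1 -> 3 creates the fill edge (2, 3)
```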
In §2, we introduce the matrix and graph notation used throughout the paper. In
particular, we define notation for edges and paths in the directed graph associated
with a matrix.
In §3, we give a number of characterizations of the Schur complement structure
in terms of paths in the graph of the matrix and/or its triangular factors. These
equivalent conditions are the basis for the structural representations described in
later sections. We also provide an interesting interpretation of these conditions in
terms of the leaves of certain fill trees.
In §4, we describe a class of structural Schur representations. In §5, we describe
three path-preserving transformations - quotient graphs, edge pruning, and mono-
tone transitive reductions - that can be used to further reduce the size/cost of these
representations.
In §6, we describe several structural Schur representations using paths in the
original graph. We review the use of quotients of strongly connected components by
Pagallo and Maulino [8], and give new representations based on edge prunings of the
original graph and the quotient graph. In §7, we describe similar representations using
paths in the filled graph. In §8, we present several representations using paths in the
graphs of the lower and upper triangular factors. We review transitive reductions,
symmetric reductions, and path-symmetric reductions, and introduce the quotients
associated with the latter two.
2. Notation
Note that the Schur complement R_{K̄K̄} is formed by an update to the original
block submatrix A_{K̄K̄}. Let S_{K̄K̄} = −A_{K̄K} A_{KK}^{-1} A_{KK̄} be this Schur complement update.
Thus, to compute the structure of R_{K̄K̄} = A_{K̄K̄} + S_{K̄K̄}, it is sufficient to obtain the
structure of S_{K̄K̄}, which is just the Schur complement of A_{KK} in

    A_0 = ( A_{KK}   A_{KK̄} )
          ( A_{K̄K}     0    ).

To allow for uniform indexing into R_{K̄K̄} and S_{K̄K̄}, we introduce the matrices

    R = ( 0     0     )        S = ( 0     0     )
        ( 0  R_{K̄K̄} ),            ( 0  S_{K̄K̄} ).
FIG. 1. The nonzero structure of a 10-by-10 example matrix A.
Henceforth, we shall also refer to R as the Schur complement and to S as the Schur
complement update.
If A has an LU factorization, then we let F denote the filled matrix L + U. The
corresponding blocks of F are defined accordingly; for example, we have F_{KK} =
L_{KK} + U_{KK}, and F_{K*} = ( F_{KK}  U_{KK̄} ). We also introduce

    F_0 = ( L_{KK}   0 ) + ( U_{KK}  U_{KK̄} ) = ( F_{KK}  U_{KK̄} ) = ( F_{KK}  F_{KK̄} )
          ( L_{K̄K}   0 )   (   0       0    )   ( L_{K̄K}    0    )   ( F_{K̄K}    0    ).
[Figure: the filled structure of the matrix A in Figure 1, with original nonzeros shown as • and fill entries as ∘.]
This result can be used to prove the equivalence of a set of path characterizations
of the structure of G(S).
THEOREM 3.2. Let r and c be the row and column subscripts, respectively, with
r ≥ k and c ≥ k. The following conditions are equivalent:
(1) r → c is an edge of G(S) (an S-edge);
(2) there is a path r → i_1 → ⋯ → i_t → c in G(A_0), t ≥ 1, whose intermediate
vertices i_1, …, i_t all lie in K (an A_0-path);
(3) there is such a path in G(F_0) (an F_0-path);
(4) ℓ_{rj} ≠ 0 and u_{jc} ≠ 0 for some j ∈ K (an L-edge followed by a U-edge);
(5) there is a path from r to j in G(L_{*K}) and a path from j to c in G(U_{K*}), for
some j ∈ K (an L-path followed by a U-path);
(6) a_{rj} ≠ 0 for some j ∈ K, together with a path from j to c in G(U_{K*}) (an
A_0-edge followed by a U-path);
(7) there is a path from r to j in G(L_{*K}) with a_{jc} ≠ 0, for some j ∈ K (an L-path
followed by an A_0-edge).
The nonzeros s_{7,10}, s_{8,9}, and s_{10,9} in the Schur complement update correspond
to paths in G(A_0) through eliminated vertices. These paths are not unique; for
example, 8 → 1 → 3 → 4 → 9 is another path from 8 to 9, as is 10 → 5 → 4 → 9
from 10 to 9. Only s_{8,9} and s_{10,9} are fills in the Schur complement, since the
entry a_{7,10} is already nonzero.
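Condition (2) of Theorem 3.2 translates directly into a search. A sketch under a no-cancellation assumption: an edge r → c of G(S) exists exactly when some path from r to c in G(A_0) has all intermediate vertices in K = {1, …, k−1}. The 5-vertex adjacency structure and the helper name `schur_update_struct` are hypothetical, not the Figure 1 example.

```python
# Sketch of condition (2) of Theorem 3.2, assuming no numerical cancellation:
# r -> c is an S-edge iff a path r -> i_1 -> ... -> i_t -> c in G(A_0) has
# all intermediate vertices in K = {1, ..., k-1}. Hypothetical 5-vertex graph.
def schur_update_struct(adj, n, k):
    edges = set()
    for r in range(k, n + 1):
        stack = [v for v in adj.get(r, ()) if v < k]   # step into K first
        seen = set(stack)
        while stack:
            v = stack.pop()
            for w in adj.get(v, ()):
                if w >= k:
                    edges.add((r, w))                  # left K: an S-edge
                elif w not in seen:
                    seen.add(w)
                    stack.append(w)
    return edges

adj = {3: [1], 1: [2], 2: [4], 5: [1]}
S_edges = schur_update_struct(adj, 5, 3)               # k = 3, K = {1, 2}
```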
[Figure: the structures of F_0 and S for the matrix in Figure 1, with fill entries shown as ∘.]
FIG. 4. Two fill trees for the nonzero s_{8,9}.
Consider a nonzero ℓ_{rc} in the lower triangular factor L. Since it is nonzero, either
a_{rc} is nonzero or there exists a j < min{r,c} such that ℓ_{rj} and u_{jc} are both nonzero.
In the latter case, take ℓ_{rj} to be the left son of ℓ_{rc} and u_{jc} to be the right son.
Repeating this recursively, we get a binary tree whose root is ℓ_{rc} and whose leaves
are entries in A. We call this tree a fill tree for the nonzero entry ℓ_{rc}. Similar trees
can be defined for nonzero entries in the upper triangular factor U and the Schur
complement update S. Figure 4 gives two different fill trees for the nonzero s_{8,9}.
The leaves in the tree correspond to nonzero entries in A. Indeed, if the leaf set
is listed from left to right, then we obtain an A_0-path as guaranteed by Theorem 3.2.
For example, for the nonzero s_{8,9}, the two fill trees lead to two different A_0-paths.
Note that the set of leaves in any subtree of a fill tree corresponds to a path
in the filled matrix F. Judicious choices of such subtrees lead to the various path
conditions in Theorem 3.2.
FIG. 5. Fill subtrees for s_{8,9} to illustrate the path conditions.
FIG. 6. Generic forms of fill subtrees associated with the path conditions.
Generic forms of the fill subtrees associated with the path conditions in Theo-
rem 3.2 are shown in Figure 6.
4. Structural Schur representations. Consider any matrix X_0
that preserves the structure of the Schur complement update S; that is, one for which
the nonzero structure of the Schur complement of X_{KK} in X_0 is the same as that of
A_{KK} in A_0. We shall refer to X_0 as a structural Schur representation of A_0 or of S.
Condition (2) in Theorem 3.2 implies that the structure of S can be represented
implicitly using paths in G(A_0); that is, we can choose X_0 to be A_0. No additional
storage is required, and there is no cost associated with the construction of this
representation.
Another obvious choice is the filled matrix F_0. There may be more nonzeros in F_0
than in A_0 due to fill-in, but the locations of the nonzeros in its Schur complement
update will be identical to those of A_0.
Yet another possibility is the use of condition (4) in Theorem 3.2. Each nonzero in
S corresponds to a path of length exactly two, with one edge from L_{K̄K} and another
from U_{KK̄}. Therefore,
This path-preserving transformation will reduce the number of nodes, yet each
path r ⇒ c will correspond to a (possibly shorter) path in the quotient graph.
It should be emphasized that we do not impose any maximality property on the
partition. Indeed, the trivial partition consisting of single nodes satisfies the cycle
condition. The amount of node reduction depends on how good the partition is. This
will become clear from the examples in the following sections.
Let Q(A) be the matrix obtained from A by coalescing each strongly connected
component of G(A_{KK}). The matrix Q(A_0) is similarly defined. The following result
follows directly from Theorem 5.1.
THEOREM 6.1. [8] The matrix Q(A_0) is a structural Schur representation of A_0.
FIG. 8. The quotient matrices Q(A) and Q(A_0) for the matrix in Figure 1.
In Figure 8, we display the corresponding Q(A) and Q(A_0) for the matrix in
Figure 1; the strongly connected components {1,3,4} and {2,5,6} are each coalesced
into a single node.
Pagallo and Maulino [8] observe that the structure of Q(A) can be represented
in-place using the nonzero structure of A. This is useful if we want a structural Schur
representation that requires no more storage than required for the original matrix.
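The strong quotient can be formed with any strongly-connected-components algorithm. A sketch using Kosaraju's two-pass method follows; the edge set and the helper name `sccs` are hypothetical, chosen only so that the components match the {1,3,4} and {2,5,6} of the text (the true pattern of Figure 1 is not reproduced here).

```python
# Sketch: coalesce strongly connected components (Kosaraju's two passes).
# The edge set is hypothetical; its components match the text's example.
def sccs(nodes, edges):
    adj = {v: [] for v in nodes}
    radj = {v: [] for v in nodes}
    for i, j in edges:
        adj[i].append(j)
        radj[j].append(i)
    seen, order = set(), []

    def dfs(v, g, out):
        seen.add(v)
        stack = [(v, iter(g[v]))]
        while stack:
            u, it = stack[-1]
            for w in it:
                if w not in seen:
                    seen.add(w)
                    stack.append((w, iter(g[w])))
                    break
            else:
                stack.pop()
                out.append(u)           # postorder finish

    for v in nodes:
        if v not in seen:
            dfs(v, adj, order)          # pass 1: finish order
    seen.clear()
    comps = []
    for v in reversed(order):           # pass 2: reverse graph
        if v not in seen:
            comp = []
            dfs(v, radj, comp)
            comps.append(frozenset(comp))
    return comps

comps = sccs(range(1, 7), {(1, 3), (3, 4), (4, 1), (2, 5), (5, 6), (6, 2)})
```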
For i ≠ j,

    a^-_{ij} = 0 if there is a path from i to j in G(A) of length at least two
through vertices smaller than min{i,j}, and a^-_{ij} = a_{ij} otherwise.

Thus G(A^-) is the smallest subgraph of G(A) that preserves the filled graph of G(A).
We call G(A^-) the skeleton graph of G(A).
THEOREM 6.2. The skeleton matrix A_0^- is a structural Schur representation of
A_0.
In Figure 9, we display A_0^- for the matrix in Figure 1. The symbol "." is used to
represent a nonzero that has been pruned. For example, the nonzero a_{6,4} is removed
due to the path 6 → 3 → 4 in G(F) or the path 6 → 2 → 1 → 3 → 4 in
G(A). Every nonzero in S can be generated by a path in G(A_0^-). For example, for
the S-edge 8 → 9, we have the path 8 → 2 → 5 → 4 → 9 in G(A_0^-).
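Skeleton pruning can be sketched as follows; the adjacency structure and the helper name `skeleton` are hypothetical, not the Figure 1 example. An edge (i, j) is dropped when some other path from i to j runs entirely through vertices numbered below min{i, j}, since then the filled graph is unchanged.

```python
# Sketch of skeleton pruning on a hypothetical adjacency structure: drop the
# edge (i, j) when another path from i to j passes only through vertices
# numbered below min(i, j), leaving the filled graph unchanged.
def skeleton(adj):
    def redundant(i, j):
        lim = min(i, j)
        stack = [v for v in adj[i] if v < lim]
        seen = set(stack)
        while stack:
            v = stack.pop()
            for w in adj[v]:
                if w == j:
                    return True
                if w < lim and w not in seen:
                    seen.add(w)
                    stack.append(w)
        return False
    return {i: [j for j in adj[i] if not redundant(i, j)] for i in adj}

sk = skeleton({3: [4], 4: [], 6: [3, 4]})   # 6 -> 3 -> 4 makes (6, 4) redundant
```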
6.3. Path-preserving edge reductions. Since the skeleton graph G(A_0^-) pre-
serves the filled graph of G(A_0), it must also preserve the set of paths in G(A_0). But
we are only interested in the structure of G(S), so it is not necessary for the repre-
sentation to preserve all of the filled graph of G(A_0). This suggests that we explore
other possible edge prunings.
FIG. 9. The skeleton matrix A_0^- for the matrix in Figure 1, with pruned nonzeros indicated.
FIG. 10. An example showing the original matrix, the skeleton matrix, a minimal equivalent digraph, a minimum equivalent digraph, and the transitive reduction.
Three types of reduced graphs were discussed in §5.2: minimum equivalent di-
graphs, minimal equivalent digraphs, and transitive reductions. In general, they are
all candidates for a structural Schur representation and are all different (see Fig-
ure 10).
It is interesting to note that Moyles and Thompson [7] compute a minimum equiv-
alent digraph by first computing the transitive reduction of the quotient digraph based
on strongly connected components, and then finding a minimum equivalent digraph
[Figure: the quotient Q(A_0), the skeleton matrix, and the minimum equivalent digraph / transitive reduction for the matrix in Figure 1.]
for each strongly connected component. For our purpose here, we do not need the
time-consuming second stage.
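The first stage alone, transitive reduction of an acyclic quotient digraph, can be sketched by pruning every edge that is implied by a longer path. The DAG and the helper name `transitive_reduction` are hypothetical.

```python
# Sketch: transitive reduction of an acyclic digraph by pruning each edge
# (u, v) implied by a longer path from u to v. The DAG is hypothetical.
def transitive_reduction(adj):
    def reachable(u, banned, target):
        stack, seen = [u], {u}
        while stack:
            v = stack.pop()
            for w in adj.get(v, ()):
                if (v, w) == banned:
                    continue
                if w == target:
                    return True
                if w not in seen:
                    seen.add(w)
                    stack.append(w)
        return False
    return {u: [v for v in vs if not reachable(u, (u, v), v)]
            for u, vs in adj.items()}

tr = transitive_reduction({1: [2, 3], 2: [3], 3: []})   # prunes (1, 3)
```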
FIG. 13. The transitive reductions of the lower and upper triangular factors of the matrix in Figure 1, with pruned nonzeros indicated.
Another difference is in the minimum equivalent digraphs of A_0 and F_0. For the
example of Figure 10, it is easy to see that the transitive reduction of A_0 is a minimum
equivalent digraph of F_0, and is slightly smaller than any minimum equivalent digraph
of A_0.
The transitive reduction L'_{*K} of L_{*K} can be defined formally as follows. For j < k,

    ℓ'_{ij} = 0 if there is a path from i to j in G(L_{*K}) of length at least two,
and ℓ'_{ij} = ℓ_{ij} otherwise.

Thus, G(L'_{*K}) is the smallest subgraph that preserves the set of paths in G(L_{*K}).
The transitive reduction U'_{K*} can be defined in a similar way.
In Figure 13, we show the transitive reductions of the lower and upper triangular
factors of the matrix in Figure 1. Each nonzero pruned is indicated by a ".". For
example, ℓ_{8,1} is pruned because of the path 8 → 6 → 2 → 1 in G(L_{*K}), and
u_{5,9} because of the path 5 → 6 → 9 in G(U_{K*}).
    ℓ^s_{ij} = u^s_{ji} = 0 if ℓ_{sj} ≠ 0 and u_{js} ≠ 0 for some j < s < min{i,k};
otherwise ℓ^s_{ij} = ℓ_{ij} and u^s_{ji} = u_{ji}.
The edges pruned to form G(F_0^s) are a subset of the edges removed to form G(L'_{*K})
and G(U'_{K*}) [2]. Therefore, we have the following result.
THEOREM 8.2. The symmetrically-reduced matrix F_0^s is a structural Schur repre-
sentation of A_0.
Symmetric reduction prunes edges from the two factor matrices based on their
symmetric nonzeros, but it can also be used to reduce nodes by path-preserving
quotients. We first define an equivalence relation. For i < j < k, i and j are
said to be symmetrically-related if and only if i → i_1 → ⋯ → i_t → j and
j → i_t → ⋯ → i_1 → i for some nodes i_1 < ⋯ < i_t. In other words, there is
a symmetric (undirected) path from i to j in G(F_{KK}) through nodes in increasing
order.
It is easy to verify that this relation is reflexive, symmetric, and transitive, and
hence is an equivalence relation. We refer to the quotient defined by this relation
as the symmetric quotient digraph/matrix, and let Q'(A_0) and Q'(F_0) denote the
symmetric quotient matrices of A_0 and F_0 respectively. By definition, the partitions
in the symmetric quotient matrix are finer than those in the strong quotient matrix.
We therefore have the following result.
THEOREM 8.3. The symmetric quotient matrices Q'(A_0) and Q'(F_0) are structural
Schur representations of A_0.
For the example of Figure 1, the symmetric quotient has three partitions:
{1,3,4}, {2}, and {5,6}. Figure 14 shows the symmetrically-reduced
matrix F_0^s and its symmetric quotient matrices.
FIG. 14. F_0^s, Q'(A_0), and Q'(F_0) of the matrix in Figure 1.
    ℓ''_{ij} = u''_{ji} = 0 if s ⇒ j in G(L_{*K}) and j ⇒ s in G(U_{K*}) for some
j < s < min{i,k}; otherwise ℓ''_{ij} = ℓ_{ij} and u''_{ji} = u_{ji}.
The edges pruned in G(F_0'') are again a subset of the edges removed to form G(L'_{*K})
and G(U'_{K*}) [2]. Therefore, we have the following result.
THEOREM 8.4. The path-symmetrically-reduced matrix F_0'' is a structural Schur
representation of A_0.
We can define a quotient based on the idea of path-symmetric reduction. For i <
j < k, i and j are path-symmetrically-related if and only if s ⇒ i ⇒ s ⇒ j ⇒ s
for some node s < k. It can be shown that this is an equivalence relation, and we use
Q''(A_0) to denote the resulting quotient matrix of A_0.
THEOREM 8.5. Q''(A_0) is the same as the matrix Q(A_0) corresponding to the
strong quotient digraph.
FIG. 15. F_0'', Q''(A_0), and Q''(F_0) for the matrix in Figure 1.
For example, in a sparse matrix reordering context, some form of quotient (or its
reduction) on the original matrix A_0 will be appropriate, especially since it can be
stored in place within the structure of A_0. (Note that the quotient used can be the
strong quotient digraph [8] or the symmetric quotient digraph in §8.2.) On the other
hand, in a numerical factorization code, where we can make use of the computed
structure of F_0 from the numerical phase, quotients and reductions on F_0 should be
used.
REFERENCES
[1] A. V. Aho, M. R. Garey, and J. D. Ullman. The transitive reduction of a directed graph. SIAM
J. Comput., 1:131-137, 1972.
[2] S. C. Eisenstat and J. W. H. Liu. Exploiting structural symmetry in sparse unsymmetric
symbolic factorization. SIAM J. Matrix Anal. Appl., 13:202-211, 1992.
[3] J. R. Gilbert and J. W. H. Liu. Elimination structures for unsymmetric sparse LU factors.
Technical Report CS-90-11, Department of Computer Science, York University, North York,
Ontario, Canada, 1990.
[4] G. H. Golub and C. Van Loan. Matrix Computations. Johns Hopkins University Press, Balti-
more, MD, 1983.
[5] L. Haskins and D. J. Rose. Toward a characterization of perfect elimination digraphs. SIAM
J. Comput., 2:217-224, 1973.
[6] J. W. H. Liu. A compact row storage scheme for Cholesky factors using elimination trees.
ACM Trans. Math. Software, 12:127-148, 1986.
[7] D. M. Moyles and G. L. Thompson. Finding a minimum equivalent graph of a digraph. Journal
of the ACM, 16:455-460, 1969.
[8] G. Pagallo and C. Maulino. A bipartite quotient graph model for unsymmetric matrices. In
V. Pereyra and A. Reinoza, editors, Numerical Methods, volume 1005 of Lecture Notes
in Mathematics, pages 227-239. Springer-Verlag, 1983. Proceedings of the International
Workshop held at Caracas, June 14-18, 1982.
[9] S. V. Parter. The use of linear graphs in Gauss elimination. SIAM Review, 3:119-130, 1961.
[10] D. J. Rose. A graph-theoretic study of the numerical solution of sparse positive definite systems
of linear equations. In R. Read, editor, Graph Theory and Computing, pages 183-217.
Academic Press, New York, 1972.
[11] D. J. Rose and R. E. Tarjan. Algorithmic aspects of vertex elimination of directed graphs.
SIAM J. Appl. Math., 34:176-197, 1978.
IRREDUCIBILITY AND PRIMITIVITY OF PERRON
COMPLEMENTS: APPLICATION OF THE COMPRESSED
DIRECTED GRAPH*
Abstract. The Perron complement is a smaller matrix derived in a natural way from a square
nonnegative matrix. By compressing the directed graph of a nonnegative matrix in a certain way,
we analyze fully the connected components, irreducibility and primitivity of its Perron complement
with respect to a given subset of indices.
1. The compressed directed graph. Let N = {1,2,…,n}; we denote the
complement with respect to N of a subset α ⊆ N by α^c. For i_1,…,i_k ∈ N (repeats
allowed) and a directed graph D on vertex set N, we denote a path consisting of
edges (i_1,i_2), (i_2,i_3), …, (i_{k−1},i_k) in D by p = p(i_1,i_2,…,i_k); a path p(i,j) in D
is simply an edge of D. A path whose initial and terminal vertices are the same is
called a circuit; we denote the circuit consisting of edges (i_1,i_2), …, (i_{k−1},i_k), (i_k,i_1)
by c = c(i_1,i_2,…,i_k). Thus c(i_1,i_2,…,i_k) is the same as p(i_1,…,i_k,i_1). The length
of a path or circuit in D is measured in terms of the number of edges and denoted
by ℓ(·). Thus, ℓ(p(i_1,i_2,…,i_k)) = k − 1, and ℓ(c(i_1,i_2,…,i_k)) = k. If we wish to
emphasize the graph in which the path occurs, we use ℓ_D(·). We shall also need a
special measure of length for a circuit in D relative to a subset β ⊆ N. We use ℓ_β(c)
to denote the number of appearances of a vertex of β in the circuit c. Again, ℓ_{D,β}(c)
emphasizes the graph. We note that repetitions are allowed and counted, and that a
directed graph D may have loops, edges of the form (i,i), i ∈ N, so that a path or
circuit may have a vertex repeated consecutively an arbitrary number of times. The
greatest common divisor of an arbitrary, possibly infinite, list L of positive integers
is denoted gcd(L).
For a directed graph D on vertices N and a subset α ⊆ N, we define the compressed
* This work supported in part by National Science Foundation grant DMS 92-00899 and by
Office of Naval Research contract N00014-90-J-1739.
† Department of Mathematics, The College of William and Mary, Williamsburg, Virginia 23187-
8795.
directed graph relative to α, denoted CD[α], as follows. The vertex set of CD[α] is α,
and, for i,j ∈ α, there is a directed edge from i to j in CD[α] if and only if there is a
(directed) path in D from i to j, all of whose intermediate vertices, if any, are in α^c.
For example, consider a digraph D on N = {1,2,3,4} that contains the edges (2,3),
(3,1), (1,2), and (1,3). Then CD[{2,3}] has the edges (2,3), (3,2), and the loop (3,3).
The edge (2,3) already occurs in D, and the other edges result from paths via {1,4}
in D: (3,2) from the path p(3,1,2) and (3,3) from the circuit c(3,1). Of course,
several different paths of the allowed type could result in the same edge in CD[α].
It is important to note that a compressed directed graph CD[α] inherits exactly
the connectivity among its vertices that occurs in the graph D.
We note that the "v-elimination graph" G_v of [RT] is the special case of our com-
pressed directed graph CD[α] in which α is the complement of a single vertex v. Fur-
thermore, our CD[α] may be obtained as (((G_{v_1})_{v_2})⋯)_{v_k}, in which α^c = {v_1,…,v_k};
the order does not matter. The purpose in [RT] is to study "fill" in a symbolic version
(no cancellation assumption) of Gaussian elimination, and nonzero diagonal entries
are assumed (and self-edges in their directed graphs are suppressed). Our purpose is
to study irreducibility and primitivity in a context in which cancellation cannot oc-
cur. No assumption need be made about diagonal entries, and we need keep track of
self-edges. Because of the formal similarity between Perron and Schur complements,
it is not surprising that the compressed directed graph arises naturally in both con-
texts, and we suspect that the compressed directed graph will have further use in the
study of elimination/complements.
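The definition of CD[α] translates directly into a search: from each i ∈ α, walk only through vertices of α^c and record every α-vertex reached. A sketch; the digraph and the helper name `compress` are hypothetical, chosen to reproduce the edges described above for CD[{2,3}].

```python
# Sketch of the compressed digraph CD[alpha]: walk from each i in alpha
# through complement vertices only, recording every alpha-vertex reached.
# The digraph is hypothetical (vertex 4 isolated).
def compress(edges, alpha):
    adj = {}
    for i, j in edges:
        adj.setdefault(i, []).append(j)
    out = set()
    for i in alpha:
        stack, seen = [i], set()
        while stack:
            v = stack.pop()
            for w in adj.get(v, ()):
                if w in alpha:
                    out.add((i, w))       # loops (i, i) arise from circuits
                elif w not in seen:
                    seen.add(w)
                    stack.append(w)
    return out

D = {(2, 3), (3, 1), (1, 2), (1, 3)}      # a digraph on {1, 2, 3, 4}
CD = compress(D, {2, 3})
```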
Recall from [M] that the Perron complement of A[β^c] in A is given by the formula
P_A(β) = A[β] + A[β,β^c](ρ(A)I − A[β^c])^{-1}A[β^c,β]. We note that:
(i) P_A(β) is well defined by the above formula if and only if ρ(A[β^c]) < ρ(A);
(ii) in this event P_A(β) is |β|-by-|β| and nonnegative;
and
(iii) for any t > 0 and any β for which ρ(A[β^c]) < ρ(A), (1/t) P_{tA}(β) = P_A(β).
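Observations (i)-(iii) can be checked numerically. A sketch with a hypothetical 4-by-4 nonnegative matrix and index set; `perron_complement` is an assumed helper name implementing the formula above.

```python
import numpy as np

# Sketch of Meyer's Perron complement and the homogeneity observation (iii):
# P_{tA}(beta) = t * P_A(beta). Matrix and index set are hypothetical.
def perron_complement(A, beta):
    n = A.shape[0]
    comp = [i for i in range(n) if i not in beta]
    rho = float(np.max(np.abs(np.linalg.eigvals(A))))   # Perron root
    B = A[np.ix_(beta, beta)]
    C = A[np.ix_(beta, comp)]
    Dm = A[np.ix_(comp, beta)]
    E = A[np.ix_(comp, comp)]
    return B + C @ np.linalg.solve(rho * np.eye(len(comp)) - E, Dm)

A = np.array([[0.0, 1, 1, 0],
              [1, 0, 0, 1],
              [1, 0, 0, 1],
              [0, 1, 1, 0.0]])
P = perron_complement(A, [0, 3])
P2 = perron_complement(2 * A, [0, 3])      # homogeneity: P2 == 2 * P
```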
The directed graph of an n-by-n matrix A = (a_{ij}) is, as usual, the graph on vertex
set N with a directed edge from i to j if and only if a_{ij} ≠ 0 (including loops in case
j = i). We denote it by D(A). For convenience, we often identify notions associated
with the matrix A and the graph D(A), for example indices that identify rows or
columns of A and vertices in the graph D(A).
Important to all our analysis of irreducibility and primitivity of P_A(β) is the fact
that the directed graph of P_A(β) is exactly the compression of the directed graph
D(A) relative to the vertex set β ⊆ N:

    D(P_A(β)) = CD(A)[β].
Proof. Let i,j ∈ β. We must show that the i,j entry of P_A(β) is positive if and
only if there is a path from i to j in D(A), all of whose intermediate vertices (if any)
lie in β^c. As ρ(A) > 0 by hypothesis, we assume, without loss of generality, that
ρ(A) = 1. As ρ(A[β^c]) < 1, then (I − A[β^c])^{-1} = I + A[β^c] + A[β^c]^2 + ⋯. Thus,
    ( A_11  A_12  ⋯  A_1k )
    (  0    A_22       ⋮  )
    (  ⋮          ⋱    ⋮  )
    (  0     ⋯    0  A_kk )
Our main result in this section is that the irreducible components of a Perron
complement P_A(β) are naturally related to those of A. This is actually just a fact
about strongly connected components of a compressed directed graph.
Proof. It is clear that the nonempty sets among β ∩ α_1, …, β ∩ α_k form a partition
of β, which indexes the rows and columns of P_A(β). It suffices to show that
(1) if |β ∩ α_i| ≥ 2, then there is a path in D(P_A(β)) connecting any two vertices in
β ∩ α_i; and
(2) if p ∈ β ∩ α_{j1} and q ∈ β ∩ α_{j2}, j1 ≠ j2, then there is not both a path from p to q
and from q to p in D(P_A(β)).
But, since D(P_A(β)) = CD(A)[β] by Lemma 2, and since any two vertices in β
are connected in CD(A)[β] if and only if they are in D(A) by Lemma 1, requirement
(1) is met, as α_i is a connected component of A. On the other hand, again since
D(P_A(β)) = CD(A)[β] and connectivity of β vertices in CD(A)[β] is equivalent
to connectivity in D(A), requirement (2) is met because α_{j1} and α_{j2} are different
connected components. □
Two corollaries of interest follow immediately from Theorem 3. It was a main
result of [M] that P_A(β) is (either 1-by-1 or) irreducible whenever the nonnegative
matrix A is. Of course, P_A(β) may be irreducible when A is not, which was not
addressed in [M].
    S_{A,β}(i,j) = Σ_p Π_A p,

where the sum is over all paths p = p(i_1,…,i_k) in D(A) with i_1 = i, i_k = j, and
i_2,…,i_{k−1} ∈ β^c, and Π_A p denotes the product of the entries of A along the edges
of p; that is, the sum of all path products from A whose initial vertices are i, whose
terminal vertices are j, and all of whose intermediate vertices (if any) are from β^c.
Analogous to the fact that edges in D(P_A(β)) correspond to special paths in
D(A), we may give a path product formula for P_A(β) when A is normalized so that
ρ(A) = 1. Because of the homogeneity of P_A(β) (observation (iii) in Section 2), the
latter is no restriction.
THEOREM 8. Let β ⊆ N. If A is an n-by-n nonnegative matrix such that
ρ(A[β^c]) < ρ(A) = 1, then for i,j ∈ β, the i,j entry of P_A(β) is S_{A,β}(i,j).
Since the i,j entry of A[β] is Π_A p(i,j), summing results in the entry formula of
Theorem 8. □
REFERENCES
[HJ] R. Horn and C. R. Johnson, Matrix Analysis, Cambridge University Press, New York, 1985.
[M] C. Meyer, Uncoupling the Perron Eigenvector Problem, Linear Algebra and its Applications
114/115 (1989), 69-94.
[RT] D. Rose and R. Tarjan, Algorithmic Aspects of Vertex Elimination on Directed Graphs, SIAM
J. Appl. Math. 34 (1978), 176-197.
PREDICTING STRUCTURE IN
NONSYMMETRIC SPARSE MATRIX FACTORIZATIONS
Abstract. Many computations on sparse matrices have a phase that predicts the nonzero struc-
ture of the output, followed by a phase that actually performs the numerical computation. We study
structure prediction for computations that involve nonsymmetric row and column permutations and
nonsymmetric or non-square matrices. Our tools are bipartite graphs, matchings, and alternating
paths.
Our main new result concerns LU factorization with partial pivoting. We show that if a square
matrix A has the strong Hall property (i.e., is fully indecomposable) then an upper bound due to
George and Ng on the nonzero structure of L + U is as tight as possible. To show this, we prove a
crucial result about alternating paths in strong Hall graphs. The alternating-paths theorem seems
to be of independent interest: it can also be used to prove related results about structure prediction
for QR factorization that are due to Coleman, Edenbrandt, Gilbert, Hare, Johnson, Olesky, Pothen,
and van den Driessche.
This paper discusses structure prediction for orthogonal factorization and for Gaussian elimination with partial pivoting. These algorithms permute the rows and columns of an input matrix nonsymmetrically: starting with a linear system (or least-squares system) of the form Ax = b, they instead solve a system (P_r A P_c)((P_c)^T x) = (P_r b). Here P_r and P_c are permutation matrices; P_r reorders the rows of A (the equations), often for numerical stability or for efficiency, and P_c reorders the columns of A (the variables), often for sparsity. We are most interested in the case where P_c has already been chosen on grounds of sparsity.
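As a sanity check on this identity, the following numerical sketch (ours, not from the paper) verifies that solving the permuted system (P_r A P_c)((P_c)^T x) = P_r b recovers the solution of Ax = b:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5
A = rng.standard_normal((n, n)) + n * np.eye(n)  # well-conditioned test matrix
b = rng.standard_normal(n)

# Random row and column permutation matrices P_r and P_c.
Pr = np.eye(n)[rng.permutation(n)]
Pc = np.eye(n)[:, rng.permutation(n)]

x = np.linalg.solve(A, b)

# Solve the permuted system (P_r A P_c) y = P_r b, where y = P_c^T x.
y = np.linalg.solve(Pr @ A @ Pc, Pr @ b)
x_rec = Pc @ y  # undo the column permutation

assert np.allclose(x, x_rec)
```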
* Xerox Palo Alto Research Center, 3333 Coyote Hill Road, Palo Alto, California 94304-1314 (gilbert@parc.xerox.com). This work was supported in part by the Christian Michelsen Institute, Bergen, Norway, and by the Institute for Mathematics and Its Applications with funds provided by the National Science Foundation. Copyright © 1992 by Xerox Corporation. All rights reserved.
† Mathematical Sciences Section, Oak Ridge National Laboratory, P.O. Box 2008, Oak Ridge, Tennessee 37831-6367 (ngeg@ornl.gov). This author's work was supported by the Applied Mathematical Sciences Research Program of the Office of Energy Research, U.S. Department of Energy, under contract DE-AC05-840R21400 and by the Institute for Mathematics and Its Applications with funds provided by the National Science Foundation.
Our main tools are bipartite graphs, matchings, and alternating paths. A matching corresponds to a choice of nonzero diagonal elements. Paths in graphs are important in many sparse matrix settings; the notion of alternating paths links matchings, connectivity, and irreducibility. In this paper we highlight a particular sort of irreducibility called the strong Hall property: this generalizes the notion of strong connectivity (or irreducibility under symmetric permutations) to nonsymmetric permutations and nonsquare matrices. It turns out that accurate structure prediction is easier for strong Hall matrices than for general matrices. Fortunately, a non-strong-Hall linear system is often most efficiently solved by decomposing it into a sequence of strong Hall systems.
The next section gives definitions and background results, beginning with a definition of exactly what we mean by structure prediction. Section 3 discusses QR factorization. Most of this section reviews earlier work, placing it in a framework that can be used to study LU factorization as well. Section 3 also contains a new tight symbolic result on columnwise orthogonal factorization. Section 4 applies the framework from Section 3 to LU factorization. It contains the main results of the paper, which are tight upper and lower bounds on where fill can occur during LU factorization with partial pivoting. Both Sections 3 and 4 conclude with remarks and open problems; Section 5 makes some final remarks.
reasons that have nothing to do with numerical values. For example, consider an algorithm that solves a nonsymmetric linear system Ax = b by forming the normal equations A^T Ax = A^T b and factoring the matrix A^T A. If A has the structure

    ( x  x  x )
    (    x    )
    (       x )

then the symbolic approach will predict (correctly) that A^T A is full, and then (incorrectly) that the factor of this full matrix is a full triangular matrix.
Even though the no-cancellation assumption may not be strictly correct, there are situations in which symbolic structure prediction is the most useful kind. For example, an algorithm may produce intermediate fill, or elements that are nonzero at some point in the computation but zero in the final result. (Using the normal equations on the triangular matrix above is an example.) A symbolic prediction can be used to identify all possible intermediate fill locations, and thus to set up a static data structure in which to carry out the entire algorithm. Also, even if an element can be proved to be zero in exact arithmetic, it may not be computed as zero in floating-point arithmetic; we may wish to use symbolic structure prediction to avoid having to decide when such an element should really be considered to be zero.
Exact structure prediction, on the other hand, predicts the nonzero structure of f(A) from that of A without regard to the algorithm that computes f(A). For each input structure S, it yields the set of output positions that are nonzero for some choice of input A having structure S. Thus the output of an exact structure prediction is a tight bound on the structure of the result. An exact structure prediction for the normal equations algorithm on the triangular input above is that the output has the same structure as the input.
If T is the exactly predicted structure of f on input structure S, then for each nonzero position (i,j) of T there is some A (depending on i, j, and S) for which [f(A)]_ij is nonzero. (We use [f(A)]_ij to denote the (i,j) element of f(A).) This is what we call a one-at-a-time result: it promises that every position in the predicted structure can be made nonzero, but not necessarily all for the same input A. A stronger result is an all-at-once result, saying that there is some single A depending only on S for which f(A) has the structure T. Some functions f admit all-at-once exact structure predictions and some do not. For example, we will see that if f(A) is the upper triangular factor in QR factorization of a strong Hall matrix, then there is an all-at-once exact prediction; but if f(A) is the upper triangular factor in LU factorization with partial pivoting of a strong Hall matrix, then the tightest possible exact prediction is only one-at-a-time.
FIG. 1. A matrix A and its bipartite graph H(A).
For example, the symbolic prediction for the factorization

    ( 1 1 1 )   ( 1 0 0 ) ( 1 1 1 )
    ( 1 2 1 ) = ( 1 1 0 ) ( 0 1 0 )
    ( 1 1 2 )   ( 1 0 1 ) ( 0 0 1 )

is that it is full, though in fact its (2,3) element is zero (for the particular choice of numerical values).
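This cancellation is easy to observe numerically. The sketch below (our illustration, not part of the paper) runs Gaussian elimination without pivoting on the 3-by-3 matrix above; the no-cancellation prediction marks every upper triangular position as nonzero, yet the computed (2,3) entry of U is exactly zero:

```python
import numpy as np

def lu_nopivot(A):
    """Doolittle LU factorization without pivoting: A = L U."""
    A = A.astype(float).copy()
    n = A.shape[0]
    L = np.eye(n)
    for k in range(n - 1):
        L[k+1:, k] = A[k+1:, k] / A[k, k]
        A[k+1:, k:] -= np.outer(L[k+1:, k], A[k, k:])
    return L, np.triu(A)

A = np.array([[1., 1., 1.],
              [1., 2., 1.],
              [1., 1., 2.]])
L, U = lu_nopivot(A)
assert np.allclose(L @ U, A)
assert U[1, 2] == 0.0   # exact cancellation in a symbolically "nonzero" slot
```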
A symbolic upper bound on structure is an exact upper bound, but not vice versa. In each of Sections 3 and 4, we prove that an exact lower bound is equal to a symbolic upper bound; it follows that the bound is tight both symbolically and exactly.
1 When the graph G is bipartite or undirected, P[x_i : x_j] = ((x_i, x_{i+1}), . . . , (x_{j−1}, x_j)) if i ≤ j, and P[x_i : x_j] = ((x_i, x_{i−1}), . . . , (x_{j+1}, x_j)) if i ≥ j.
FIG. 2. A nonsymmetric matrix A and its directed graph G(A).
FIG. 3. A symmetric matrix A and its undirected graph G(A).
TABLE 1
Graphs associated with the matrix A.
FIG. 4. A matrix A, its column intersection graph G_∩(A), and its filled column intersection graph G_∩^+(A).
flavors: an r-alternating path is one that follows matching edges from columns to rows and non-matching edges from rows to columns; a c-alternating path is one that follows matching edges from rows to columns and non-matching edges from columns to rows. The reverse of an r-alternating path or walk is a c-alternating path or walk. Suppose the last vertex of one alternating walk is the first vertex of another. If the alternating walks are of the same flavor, their concatenation is an alternating walk of that flavor; if the walks are of opposite flavors, their concatenation is not an alternating walk.
Suppose that P is an alternating path (of either flavor) from an unmatched vertex v to a different vertex w. If the last vertex w on P is unmatched, or the last edge on P belongs to M, then the set of edges M′ = M ⊕ P = (M ∪ P) − (M ∩ P) is another matching; we say that M′ is obtained from M by alternating along path P. If w is matched in M, then v is matched and w is unmatched in M′, and |M′| = |M|. If w is unmatched in M, then both v and w are matched in M′, and |M′| = |M| + 1. In the latter case we also call P an augmenting path (with respect to M). A classical result of matching theory is that a maximum-size matching can be constructed by greedily finding augmenting paths and alternating along them.
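The greedy augmenting-path construction can be sketched as follows (our illustration; the graph representation and names are assumptions, not the paper's):

```python
def maximum_matching(adj):
    """Maximum bipartite matching by repeatedly finding augmenting
    paths (depth-first search) and alternating along them.
    adj: dict mapping each column to an iterable of adjacent rows."""
    match_row = {}                    # row -> column currently matched to it

    def augment(col, seen):
        # Look for an augmenting path starting at column `col`.
        for row in adj.get(col, ()):
            if row in seen:
                continue
            seen.add(row)
            # Row is free, or its current partner can be rematched elsewhere:
            if row not in match_row or augment(match_row[row], seen):
                match_row[row] = col
                return True
        return False

    size = sum(augment(c, set()) for c in adj)
    return size, {c: r for r, c in match_row.items()}

# A 3-by-3 structure that has a column-complete (here, perfect) matching.
adj = {0: [0, 1], 1: [0], 2: [1, 2]}
size, match_col = maximum_matching(adj)
assert size == 3
```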
LEMMA 2.2. Suppose A has a nonzero diagonal. The directed graph G(A) has a path from vertex r to vertex c if and only if the bipartite graph H(A) has a path from row r′ to column c that is r-alternating with respect to the matching of diagonal edges (i′, i).
Proof. Immediate. □
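Lemma 2.2 can be checked by brute force: under the diagonal matching, following a non-matching edge from a row to a column and then the matching edge back to that column's row is exactly following one edge of G(A). The sketch below (ours) compares the two reachability computations on random structures:

```python
import numpy as np

def digraph_reach(A, r):
    """Vertices j with a directed path r -> ... -> j in G(A),
    where G(A) has an edge i -> j iff A[i, j] != 0 and i != j."""
    n = len(A)
    reach, expanded, stack = set(), {r}, [r]
    while stack:
        i = stack.pop()
        for j in range(n):
            if j != i and A[i, j]:
                reach.add(j)
                if j not in expanded:
                    expanded.add(j)
                    stack.append(j)
    return reach

def r_alternating_reach(A, r):
    """Columns reachable from row r' in H(A) by r-alternating paths with
    respect to the diagonal matching (i', i): non-matching edges are
    followed from rows to columns, matching edges from columns to rows."""
    n = len(A)
    cols, seen_rows, stack = set(), {r}, [r]
    while stack:
        i = stack.pop()
        for j in range(n):
            if j != i and A[i, j]:          # non-matching edge (i', j)
                cols.add(j)
                if j not in seen_rows:      # matching edge (j, j') continues
                    seen_rows.add(j)
                    stack.append(j)
    return cols

rng = np.random.default_rng(1)
for _ in range(100):
    A = (rng.random((6, 6)) < 0.25).astype(int)
    np.fill_diagonal(A, 1)                  # nonzero diagonal, as in the lemma
    for r in range(6):
        assert digraph_reach(A, r) == r_alternating_reach(A, r)
```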
2.4. Hall and strong Hall bipartite graphs. A bipartite graph with m rows and n columns has the Hall property if every set of k column vertices is adjacent to at least k row vertices, for all 0 ≤ k ≤ n. Clearly a Hall graph must have m ≥ n. If a graph is not Hall, it cannot have a column-complete matching, because a set of columns that is adjacent only to a smaller set of rows cannot all be matched. The converse is a classical fact about bipartite matching.
COROLLARY 2.4. If a matrix A has full column rank, then H(A) is Hall. Conversely, if H is Hall then almost all matrices A with H = H(A) have full column rank.
Proof. If H(A) is not Hall, then it has a set of columns with nonzeros in a smaller number of rows; those columns must be linearly dependent. For the converse, let M be a column-complete matching on H and let R be the set of rows that are matched by M. Consider any matrix A with H(A) = H. The submatrix of A consisting of rows R and all columns is square. Its determinant is a polynomial in the nonzero values of A. We claim that this polynomial is not identically zero: if the entries corresponding to edges of M have the value one and all other entries are zero, the submatrix is a permuted identity matrix and the determinant is ±1. The set of zeros of a k-variable polynomial has measure zero in R^k, unless the polynomial is identically zero. Thus the set of ways to fill in the values of A to make this submatrix singular has measure zero. If the submatrix is nonsingular, then all the columns of A are linearly independent and A has full column rank. □
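The measure-zero argument has a simple numerical counterpart (our illustration, not from the paper): filling a Hall structure with random values gives full column rank with probability one:

```python
import numpy as np

rng = np.random.default_rng(2)

# A 4x3 Hall structure: every set of k columns touches at least k rows.
H = np.array([[1, 1, 0],
              [1, 0, 0],
              [0, 1, 1],
              [0, 0, 1]])

for _ in range(50):
    # Random values in the nonzero positions, exact zeros elsewhere.
    A = np.where(H, rng.standard_normal(H.shape), 0.0)
    assert np.linalg.matrix_rank(A) == 3   # full column rank, almost surely
```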
A bipartite graph with m rows and n columns has the strong Hall property if every set of k column vertices is adjacent to at least k + 1 row vertices, for all 1 ≤ k < n.² It is easy to see that the strong Hall property implies the Hall property.
If the Hall property is a linear independence condition, the strong Hall property is an irreducibility condition: any matrix that is not strong Hall can be permuted to a block upper triangular form called the Dulmage-Mendelsohn decomposition [3,24,29], in which each diagonal block is strong Hall.³ Linear equation systems and least-squares problems whose matrices are not strong Hall can be solved by performing first a Dulmage-Mendelsohn decomposition, and then a block backsubstitution that solves a system with each strong Hall diagonal block. Strong Hall matrices are therefore of particular interest in sparse Gaussian elimination and least squares problems.
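For small graphs the strong Hall property can be checked directly from the definition. A brute-force sketch (ours; exponential in the number of columns, for illustration only):

```python
from itertools import combinations

def is_strong_hall(H):
    """H: list of rows, each a list of 0/1 entries. Checks that every set
    of k column vertices, 1 <= k < n, is adjacent to at least k+1 rows."""
    m, n = len(H), len(H[0])
    for k in range(1, n):
        for cols in combinations(range(n), k):
            rows = {i for i in range(m) for j in cols if H[i][j]}
            if len(rows) < k + 1:
                return False
    return True

# Fully indecomposable (square strong Hall) example: a cycle plus diagonal.
H1 = [[1, 1, 0],
      [0, 1, 1],
      [1, 0, 1]]
# Upper triangular: Hall but not strong Hall (column 1 touches only one row).
H2 = [[1, 1, 1],
      [0, 1, 1],
      [0, 0, 1]]
assert is_strong_hall(H1)
assert not is_strong_hall(H2)
```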
Brualdi and Shader [4] and Coleman, Edenbrandt, and Gilbert [5] discuss properties of strong Hall matrices. In the following result, an independent set is a set of vertices no two of which are adjacent; an independent set in a bipartite graph corresponds to the rows and columns of a zero submatrix.
THEOREM 2.5 (BRUALDI AND SHADER [4]). A bipartite graph having m rows and n ≤ m columns is Hall if and only if it has no independent set of more than m vertices, and strong Hall if and only if it has no independent set of at least m vertices that includes at least one vertex from each part. □
A square strong Hall matrix is often called fully indecomposable, meaning that there is no way to permute its rows and columns into a block triangular form with more than one block [3]. This gives the following (standard) result.
THEOREM 2.6. Let H = H(A) be a square strong Hall graph. Then for all row and column permutations P_r and P_c, the directed graph G(P_r A P_c) is strongly connected. □
We conclude this subsection by proving a theorem (Theorem 2.9) about strong Hall matrices that is useful in several structure prediction results. The theorem first appeared in a technical report by Gilbert [15]; other proofs have been given by Hare, Johnson, Olesky, and van den Driessche [21] and Brualdi and Shader [4]. First we need two technical lemmas.
LEMMA 2.7. Let H be a strong Hall graph and let (r′, c) be an edge of H. Then there is a column-complete matching that includes (r′, c), and unless (r′, c) is the only edge of H there is a column-complete matching that excludes (r′, c).
² This definition is from Coleman et al. [5]. Another definition that is sometimes used replaces the bounds on k by 1 ≤ k < m; the only difference is that an m by n matrix with m > n and m − n zero rows that is strong Hall by our definition is not strong Hall by the other definition. All the results in Section 3 and Section 4 hold no matter which definition is used.
³ This assumes m ≥ n. More generally, for any m and n, an m × n matrix can be permuted to a block upper triangular form in which each diagonal block is strong Hall or has a strong Hall transpose.
Proof. First, let H̄ be H without vertices r′ and c and their incident edges. We show that H̄ is Hall. Every nonempty set C of columns of H̄ is a nonempty proper subset of columns of H, and hence is adjacent to at least |C| + 1 rows of H. This includes at least |C| rows of H̄. Therefore H̄ is Hall and has a column-complete matching. That matching plus edge (r′, c) is a column-complete matching on H.
Now assume that H has more than one edge, and let Ĥ be H without the single edge (r′, c). We show that Ĥ is Hall. Any nonempty proper subset C of columns is adjacent to at least |C| + 1 rows in H, hence to at least |C| rows in Ĥ. The same argument works if C is the set of all columns and H has at least |C| + 1 nonzero rows. If C is the set of all columns and H has exactly |C| nonzero rows R, we argue as follows: If r′ were adjacent only to c in H, then C − c would be adjacent in H only to the |C − c| rows R − r′,⁴ contradicting the fact that H is strong Hall. Thus C must be adjacent in Ĥ to all |C| rows.
Whether or not H is square, then, we conclude that Ĥ is Hall. Thus Ĥ has a column-complete matching, which is a column-complete matching on H that excludes (r′, c). □
LEMMA 2.8. If H is strong Hall and has more nonzero rows than columns, and M is any column-complete matching on H, then from every row or column vertex w of H there is a c-alternating path to some unmatched row vertex r′ (which depends on w and M).
Proof. This is a standard result on Dulmage-Mendelsohn decomposition; we include a proof here only to be self-contained. If w is an unmatched row there is nothing to prove. Otherwise, let C be the set of columns reachable by c-alternating paths from w. Then C is nonempty. Let R be the set of row vertices adjacent to vertices of C. Since H is strong Hall and has more nonzero rows than columns, |R| is larger than |C|. Thus there is some vertex r′ in R that is not matched to a vertex in C. Suppose r′ is adjacent to c ∈ C. The c-alternating path from w to c can be extended by edge (c, r′) to r′. Now if r′ were matched, it would be matched to a vertex v not in C; but then there would be a c-alternating path from w to v, contrary to the definition of C. Therefore r′ is the desired unmatched row vertex. □
Finally we prove the main result about alternating paths in strong Hall graphs.
FIG. 5. Case 1 of Theorem 2.9. The dashed edges are the matching M. P is the horizontal path from v to w. The light dotted line shows path R from u to r. Path P[v : u]R[u : x] is c-alternating with respect to M.
Suppose that H has more nonzero rows than columns. If v = w there is nothing to prove. Otherwise, by hypothesis there is at least one path from v to w. By Lemma 2.7 there is a column-complete matching that omits the first edge on that path. (Note that this edge is not the only edge of H since H has more nonzero rows than columns.)
If P is a path from v to w and M is a column-complete matching that omits the first edge on P, let u (dependent on P and M) be the last vertex on P such that P[v : u] is alternating. Then P[v : u] is c-alternating. Among all such paths and column-complete matchings, choose P and M such that the length of P[u : w] is minimum.
Because P[v : u] is c-alternating and begins and ends with non-matching edges, u is a row vertex and hence t, the vertex following u on P, is a column vertex. Let s be the vertex matched to t in M, which may or may not be on P.
Lemma 2.8 implies that there is an unmatched row vertex r and a c-alternating path R from u to r (possibly u = r). Now t is on path R if and only if s is. There are two cases.
Case 1. Both t and s are on R. In this case P[v : u]R[u : t] is a c-alternating walk from v to t. Therefore there is a c-alternating path Q from v to t. Let x be the last vertex on P that is also on Q (so x is on P[t : w]). Then P̄ = Q[v : x]P[x : w] is a path from v to w. But this is a contradiction: P̄ is c-alternating from v at least to x, and P̄[x : w] is shorter than P[u : w]. This contradicts the choice of P. Figure 5 illustrates this case.
Let x be the first vertex on P that is also on R (so x is on P[v : u]), and let y be the last vertex on P that is also on R (so y is on P[t : w]). Then P̄ = P[v : x]R[x : y]P[y : w] is a path from v to w. The path P[v : x] is c-alternating with respect to both M and M ⊕ R, because M and M ⊕ R agree on P[v : x]. Depending on whether x precedes or
FIG. 6. Case 2 of Theorem 2.9. The most complicated version is shown. The dashed edges are the matching M. P is the horizontal path from v to w. The light dotted line shows path R from s to r. Path P[v : z]R[z : y] is c-alternating with respect to M ⊕ R. A simpler version, not shown, is if R does not intersect P after u. Then z = u, y = t, and P[v : u](u, t) is c-alternating with respect to M ⊕ R.
The deficiency of v corresponds to the fill that occurs in A when the (v, v) element is used as a pivot in Gaussian elimination. Therefore we can define a sequence of elimination graphs G_0, G_1, . . . , G_n, where G_0 = G(A) and G_i is obtained from G_{i−1} by adding the deficiency of vertex i (in G_{i−1}) and deleting vertex i and its incident edges. Then G_i is the graph of the (n − i) × (n − i) Schur complement that remains after eliminating the first i vertices of A. This is in the symbolic sense; that is, it ignores possible numeric cancellation. We define the filled graph of A, which we write G+(A), as the n-vertex graph containing all the edges of all the G_i's. Thus we have the following result.
THEOREM 2.10. Suppose the square matrix A can be factored as A = LU without row or column interchanges. Then G(L + U) ⊆ G+(A), with equality unless there is cancellation in the factorization. In other words, the filled graph contains edges for all the nonzeros of L and U. □
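The elimination-graph recipe above translates directly into code. The following sketch (ours, not from the paper) computes the filled graph G+(A) of a directed structure by repeatedly adding each vertex's deficiency and discarding the vertex; the deficiency is taken in the usual directed sense, an edge (u, w) for every u -> v -> w among uneliminated vertices:

```python
def filled_graph(succ):
    """succ: dict v -> set of out-neighbors in the directed graph G(A)
    (vertices 0..n-1, pivots taken in natural order). Returns the edge
    set of the filled graph G+(A), in the symbolic (no-cancellation) sense."""
    succ = {v: set(s) for v, s in succ.items()}
    pred = {v: set() for v in succ}
    for u in succ:
        for w in succ[u]:
            pred[w].add(u)
    edges = {(u, w) for u in succ for w in succ[u]}
    for v in sorted(succ):
        # Deficiency of v: an edge (u, w) for every u -> v -> w with u, w > v.
        for u in (p for p in pred[v] if p > v):
            for w in (s for s in succ[v] if s > v and s != u):
                if w not in succ[u]:
                    succ[u].add(w)
                    pred[w].add(u)
                edges.add((u, w))
    return edges

# Directed "arrow" example: dense first row and first column.
# Eliminating vertex 0 fills in all of the remaining rows and columns.
g = filled_graph({0: {1, 2, 3}, 1: {0}, 2: {0}, 3: {0}})
assert (1, 2) in g and (3, 1) in g
```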
Proof. Suppose (r, c) is an edge of G+(A). Then there is a fill path P from r to c whose intermediate vertices are less than s = min(r, c).
Let K be the submatrix of A mentioned in Lemma 2.12, consisting of rows 1 through s − 1 and r, and columns 1 through s − 1 and c. For convenience, call the last row and column in K number r and c respectively instead of number s. Then path P corresponds to a path in H(K) from row vertex r′ to column vertex c, which is r-alternating with respect to the matching M of edges (i′, i).
Now M is one edge short of being a perfect matching on K, because column c and row r′ are not matched. However P is an augmenting path with respect to M, and therefore M ⊕ P is a perfect matching on K. Since K has a perfect matching, it is Hall; thus its determinant is nonzero by hypothesis, and [L + U]_rc is nonzero by Lemma 2.12. □
The hypothesis that A has a nonzero diagonal in Lemma 2.13 is crucial. Brayton, Gustavson, and Willoughby [2] gave a counterexample in the case when this hypothesis is not included: for their matrix A, whose diagonal is not entirely nonzero, the (4,3) entry in G+(A) is nonzero, but [L]_43 = 0 regardless of the nonzero values of A.
LEMMA 2.14. Suppose bipartite graph H has a perfect matching M. Let A be a matrix with H(A) = H, such that [A]_rc > n for (r′, c) ∈ M and 0 < [A]_rc < 1 for (r′, c) ∉ M. If A is factored by Gaussian elimination with partial pivoting, then the edges of M will be the pivots.
Proof. When the rows of matrix A are permuted so that the edges of M are the diagonal elements, the values chosen make the permuted matrix strongly diagonally dominant. □
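Lemma 2.14's construction can be exercised numerically. In the sketch below (ours, not from the paper), values larger than n are placed on a chosen perfect matching and values in (0, 1) elsewhere; partial pivoting then selects exactly the matching edges as pivots:

```python
import numpy as np

def pivot_rows(A):
    """Run LU with partial pivoting, returning, for each column i, the
    original row index chosen as the i-th pivot."""
    A = A.astype(float).copy()
    n = A.shape[0]
    where = list(range(n))                   # current position -> original row
    chosen = []
    for i in range(n):
        p = i + np.argmax(np.abs(A[i:, i]))  # partial pivoting
        A[[i, p]] = A[[p, i]]
        where[i], where[p] = where[p], where[i]
        chosen.append(where[i])
        A[i+1:, i:] -= np.outer(A[i+1:, i] / A[i, i], A[i, i:])
    return chosen

n = 4
M = {(2, 0), (0, 1), (3, 2), (1, 3)}         # a perfect matching (row, col)
rng = np.random.default_rng(3)
A = rng.uniform(0.1, 0.9, (n, n))            # entries in (0, 1) off the matching
for (r, c) in M:
    A[r, c] = n + 1.0                        # entries > n on the matching
assert set(zip(pivot_rows(A), range(n))) == M
```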
structure of R? The answer to the first question depends on the algorithm we use to compute the factorization; the answer to the second does not.
In Section 3.1 below, we review work of George, Liu, and Ng on intermediate fill during column QR factorization. We then give a new tight symbolic result on column QR factorization. In Section 3.2, we survey several authors' work on predicting the structure of R; in Section 3.3, we re-prove a result of Coleman, Edenbrandt, and Gilbert in a framework that relates it to the new results on LU factorization in Section 4. Finally, in Section 3.4, we briefly survey some related work.
except for the (k, i) element, it replaces both their nonzero structures with the union of their nonzero structures. Thus the structure of row k of A_i is the union of the structures of those rows j of A_{i−1} for which i ≤ j ≤ k and [A_{i−1}]_ji ≠ 0. Moreover, at the end of step i, the structure of row i of A_i is the union of the structures of those rows j of A_{i−1} for which i ≤ j ≤ m and [A_{i−1}]_ji ≠ 0.
Now consider (the row-oriented version of) Householder reflections. The Householder reflection that annihilates the subdiagonal nonzeros of column i of A_{i−1} replaces all the rows containing those nonzeros with linear combinations of their old values. Symbolically, every row with a nonzero in column i of A_{i−1} has the same structure in A_i, namely the union of their original structures in A_{i−1}.
In terms of structures, the fundamental difference between Givens rotations and Householder reflections is the number of rows participating in one reduction operation. In one Householder reduction, all rows that have a nonzero in column i of A_{i−1} participate in a reduction step, whereas in a Givens reduction, only a subset of those rows are involved.
We now describe a bipartite graph model that George, Liu, and Ng [12] developed to analyze the reduction process using Givens rotations. Their model associates a bipartite graph H_i with the matrix A_i. We number the m − i row vertices of H_i from i + 1 to m, and the n − i column vertices from i + 1 to n. The changes in the structure of A_i due to the reduction process are described in terms of transformations on the graph H_i. Because of the similarity between Givens reductions and Householder reflections, this model can be extended to cover both cases. We summarize these results below; proofs can be found in the paper [12]. All these results are symbolic; they assume that zeros are introduced only by explicit annihilation, not by cancellation.
The following results contain a parameter p, which we introduce to cover both of the column algorithms. We define p = r for Givens rotations, and p = m for Householder reflections.
We begin by formalizing the symbolic effect of annihilating one column, that is, the relationship between H_{i−1} and H_i. The statements in the lemma below are easily seen to be equivalent.
LEMMA 3.1.
• For r > i, Adj_{H_i}(r′) = ReachCol_{H_{i−1}}(r′, {i, i′, (i + 1)′, . . . , p′}).
• For r > i, c ∈ Adj_{H_i}(r′) if and only if there exists a path of length 1 or 3 from r′ to c through {i, i′, (i + 1)′, . . . , p′} in H_{i−1}.
• For r > i and c > i, c ∈ Adj_{H_i}(r′) if and only if either c ∈ Adj_{H_0}(r′) or for some k ≤ i, there is a path (r′, k, s′, c) in H_{k−1} with k ≤ s ≤ p.
We wish to characterize fill in terms of the structure of the original matrix. George,
Liu and Ng [12] provided upper and lower bounds on the structure of H_i, but neither bound is tight. Their upper bound is as follows.
THEOREM 3.2. For r > i, Adj_{H_i}(r′) ⊆ ReachCol_{H_0}(r′, {1, . . . , i, 1′, . . . , p′}). □
Note that Theorem 3.2 provides only a necessary condition for a fill element to occur during the annihilation process. Figure 7 (from [12]) is an example showing that Theorem 3.2 is not tight. There is a path (4′, 2, 1′, 3) in the graph H_0, but it is easy to verify that no zero element in A becomes nonzero in reducing A to upper triangular form by Givens rotations or Householder reflections.
The George, Liu, and Ng lower bound is as follows.
THEOREM 3.3. Suppose that H_0 contains a path (r′, c_1, r′_1, c_2, r′_2, . . . , c_t, r′_t, c) whose intermediate vertices are all in {1, . . . , i, 1′, . . . , p′}. If c_k ≤ r_k for k ≤ t and c_{k+1} ≤ r_k for k < t, then c ∈ Adj_{H_i}(r′). □
By this definition, all edges in H_0 are also fill paths. The main new result of this section is the following, which generalizes the last statement of Lemma 3.1. It gives a necessary and sufficient condition for a zero element of A to become nonzero at some stage of the annihilation process, in the symbolic sense. The proof of the result is an easy induction, and is omitted.
THEOREM 3.4. For r, c > i, c ∈ Adj_{H_i}(r′) if and only if there is a fill path joining r′ and c in H_0. □
Consider the path (4′, 2, 1′, 3) in H_0 in Figure 7. Since it does not satisfy condition (2), the (4,3) element of A will remain zero throughout the computation, which is indeed the case for either Givens or Householder. Also consider the example in Figure 8. Although the path (5′, 2, 1′, 1, 4′, 3) does not satisfy the condition in Theorem 3.3, it does satisfy condition (2) above. Hence, the (5,3) element of A will become nonzero at some point during the computation, assuming exact numerical cancellation does not occur.
Unfortunately, unlike the case of sparse Gaussian elimination without pivoting, there does not appear to be a simple and non-recursive way to express the fill property.
Finally, we define a graph whose structure captures all of the H_i for the case of Householder reflections. The (bipartite) row merge graph of a matrix A whose diagonal is nonzero, which we write H^×(A), is the union of H_i (by the Householder interpretation) for 1 ≤ i ≤ n. Thus H^×(A) has m row vertices and n column vertices, and is constructed by the following process. Begin with the bipartite graph H(A), which includes all edges of the form (i′, i) because A has nonzero diagonal. For each k from 1 to n, add an edge from each row r′ ≥ k adjacent to column k to each column c ≥ k adjacent to any such row. (In other words, take those rows at or below row k with nonzeros in column k, and merge the parts of their nonzero structures at or to the right of column k.)
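The merging process just described can be sketched as follows (our illustration, with 0-based indices and row structures kept as Python sets of column indices):

```python
def row_merge_structures(rows, n):
    """rows: list of sets, rows[r] = column structure of row r of A,
    assumed to contain r for r < n (nonzero diagonal). Returns the
    accumulated row structures, i.e. the edges of the row merge graph."""
    work = [set(s) for s in rows]
    merged = [set(s) for s in rows]          # edges seen so far, per row
    for k in range(n):
        # Rows at or below row k with a nonzero in column k.
        group = [r for r in range(k, len(work)) if k in work[r]]
        # Merge the parts of their structures at or right of column k.
        union = set().union(*(work[r] for r in group))
        union = {c for c in union if c >= k}
        for r in group:
            work[r] = {c for c in work[r] if c < k} | union
            merged[r] |= union
    return merged

# Small example: four rows, three columns, nonzero diagonal.
hx = row_merge_structures([{0, 1}, {1}, {0, 2}, {2}], 3)
# Row 2 meets row 0 in column 0, so it picks up column 1.
assert 1 in hx[2]
```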
We also define a directed version of the row merge graph. The bipartite row merge graph H^×(A) is a bipartite graph with m rows, n ≤ m columns, and a column-complete matching of edges (i′, i). The (directed) row merge graph, which we write G^×(A), is the n-vertex directed graph whose adjacency matrix has the structure of the first n rows of H^×(A).
Theorems 3.2, 3.3, and 3.4 can be translated into statements about H^×(A). We will need one of these later.
FIG. 9. Example for Theorem 3.8. Graph H is shown in Figure 1. Its column intersection graph and filled column intersection graph are shown in Figure 4. This figure shows the construction that makes entry [R]_35 nonzero. At left, graph H̄ is the subgraph of H induced by column vertices 1 through r = 3 and c = 5, and all the row vertices. The dashed edges are a column-complete matching M with respect to which there is a c-alternating path Q = (5, 5′, 2, 1′, 1, 3′, 3) from c to r. At center, A is chosen to have ones in positions M and Q and zeros elsewhere. At right, K is the submatrix of A^T A consisting of rows and columns 1 through r − 1 = 2, as well as row r = 3 and column c = 5. Matrix K is a permutation of a triangular matrix with nonzero diagonal and hence cannot be singular.
Choose r and c with r < c ≤ n. Take an arbitrary m × n matrix A with factorization QR, such that the first r columns of A are linearly independent. Now let K be the submatrix of A^T A consisting of columns 1 through r − 1 and c, and rows 1 through r. Lemma 2.12 applies to A^T A (because A^T A is positive definite), and says that K is singular if and only if [R]_rc, the entry in the (r, c) position of R, is zero. Thus [R]_rc is zero if and only if a certain polynomial p_rc in the nonzero entries of A (namely the determinant of K) is zero.
We now show that if A is a matrix with H(A) = H and (r, c) is an edge of G_∩^+(H), then the polynomial p_rc is not identically zero. (Note that p_rc has a variable for each edge of H.) Let H̄ be the subgraph of H induced by all the row vertices and the column vertices 1, 2, . . . , r, and c. Lemma 2.11 says that there is a path P from c to r in the undirected graph G_∩(H) whose intermediate vertices are all smaller than r. Thus P is also a path in G_∩(H̄). By Lemma 2.1, there is a path in H̄ from column vertex c to column vertex r.
Now H̄ is strong Hall because H is. Therefore the alternating-paths theorem (Theorem 2.9) applies, and says that there is a column-complete matching M for H̄ and a path Q from c to r that is c-alternating with respect to M.
Choose the values of those nonzeros of A corresponding to edges of M ∪ Q to be 1, and choose the values of the other "nonzeros" to be 0. Let us examine the r × r submatrix K of A^T A defined above. (For simplicity, we will call the last column of K number c rather than number r; the last row of K is number r.) We claim that the bipartite graph H(K) has exactly one perfect matching (or, equivalently, that K can be permuted to a triangular matrix with nonzero diagonal). To prove this, we match rows of K greedily to columns of K. Take a column j of K. If j is a vertex that is not on path Q, then the only nonzero in column j of K is [K]_jj, and we match column j to row j′. If j is on Q, i′ is the vertex following j on Q, and k is the vertex following i′ on Q, then [K]_kj is nonzero and we match column j to row k′. (The last vertex on Q is column r, which is not a column of K.) This is a perfect matching on H(K). Its uniqueness follows by induction on the length of Q, the induction step being the fact that column c of K has only one nonzero (because row c′ is not a row of K).
This proves the claim that H(K) has exactly one perfect matching. Thus the determinant of K is just the product of the nonzero values corresponding to elements of that matching, and is itself nonzero. This shows that the polynomial p_rc is nonzero for at least one point, that is, for at least one choice of values for A.
Now the set of zeros of a k-variable polynomial has measure zero in R^k, unless the polynomial is identically zero. Thus not only do values for the nonzero entries of A exist that make p_rc and hence [R]_rc nonzero, but almost all choices of values (in the measure-theoretic sense) work. Therefore, almost all choices of values for A make every [R]_rc nonzero simultaneously. Furthermore, almost all of those choices include no zero values; that is, for almost all such choices, H(A) = H as desired. Finally, we observe that we can choose A to have full rank n: for some n × n submatrix of A there is a choice of values that gives nonzero determinant (namely, ones for the elements of a column-complete matching of H and zeros elsewhere), and hence almost all choices of values make that submatrix nonsingular. □
COROLLARY 3.9. If H is strong Hall and has nonzero diagonal, then the upper triangular parts of G^×(H) and G_∩^+(H) are equal.
Proof. By Theorem 3.6 and its corollary we have G(R) ⊆ G^×(H) ⊆ G_∩^+(H) for any A = QR with H(A) = H. If we choose A as in Theorem 3.8, the first and third graphs are equal, and hence the second and third are also equal. □
COROLLARY 3.10. If H is strong Hall and has nonzero diagonal, then there is a matrix A with full column rank and with H(A) = H, such that the orthogonal factorization A = QR satisfies G(R) = G^×(H). □
model of Gaussian elimination with row and column interchanges, and we prove some
results on the structure of the matrix during elimination. These results are symbolic;
that is, they assume that zeros are introduced only by explicit elimination, not by
cancellation. In Section 4.2 we give upper bounds on the structure of the factors L
and U obtained by Gaussian elimination with row interchanges. In Section 4.3, we
give an exact lower bound on L and U. This result is tight, that is, best possible,
and is the main new result of this paper. We conclude the section with remarks and
open problems.
We define L as the n x n matrix whose i-th column is the i-th column of Li, so
that L - I = ∑i (Li - I). Note a subtle point about L: we can also think of Gaussian
elimination as computing a factorization Pr A Pc = LO U, but this LO is not the same
as L. The two matrices are both unit lower triangular, and they contain the same
nonzero values, but in different positions; LO has its rows in the order described by
the entire row pivoting permutation, while L has the rows of its i-th column in the
order described by only the first i interchanges. The matrix L is essentially a data
structure for storing LO; either can be used in solving systems of equations. The
structure prediction results in Sections 4.2 and 4.3 below will be about L, not LO.
Note also that our notation is slightly different than in the previous section: now
Ai is always n x n, not (n - i) x (n - i).
incident on them, then add the edges in the deficiency of (r', c). The edges in
the deficiency of (r', c) correspond to the zero elements of A0 that become nonzero
when [A0]r'c is eliminated. (Note that the labelling of the vertices of H1 refers to
the labelling in the original matrix A0.) Thus, given a sequence of pivot elements
(r'1, c1), (r'2, c2), . . . , (r'_{n-1}, c_{n-1}) (some of which may be fill edges), we can follow the
recipe above to construct a sequence of bipartite graphs H0, H1, . . . , Hn, where Hi
describes the structure of the (n - i) x (n - i) Schur complement remaining after
step i.
It is possible to prove bipartite versions of several of the results from Section 2.5.
We will use the following lemma in the exact lower bound proof later in this section.
LEMMA 4.1. Let A be a square matrix, and let M be a perfect matching on H(A).
Let H0, . . . , Hn be the sequence of bipartite elimination graphs described above, when
elimination is carried out by pivoting on the edges of M. If (r', c) is a non-matching
edge of Hi, then there is a path from r' to c in H(A) that is r-alternating with respect
to M, and whose intermediate vertices are all endpoints of edges of M eliminated at
or before step i.
Proof. Recall Theorem 2.5, which says that an m x n bipartite graph is Hall if and
only if it has no independent set of more than m vertices, and strong Hall if and only
if it has no independent set of exactly m vertices that includes at least one vertex
from each part.
Let R1 and C1 be the row and column vertices in a largest independent set in H1.
It is not possible that both r' ∈ Adj_H0(C1) and c ∈ Adj_H0(R1), for that would imply
an edge between R1 and C1 in H1. Therefore either R1 ∪ C1 ∪ {r'} or R1 ∪ C1 ∪ {c}
is an independent set in H0. If H0 is Hall, that set has size at most m, and hence
R1 ∪ C1 has size at most m - 1, so H1 is also Hall. The strong Hall case follows by
the same argument, considering only independent sets that include both rows and
columns. □
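The independent-set characterization of Theorem 2.5 can be checked directly by brute force on tiny examples. The Python sketch below is ours, not from the paper; it is a literal transcription of the quoted characterization, enumerating all vertex subsets of an m x n bipartite structure, so it is only practical for very small graphs.

```python
from itertools import combinations

def is_independent(rows, cols, edges):
    """No edge joins a chosen row to a chosen column."""
    return all((r, c) not in edges for r in rows for c in cols)

def hall_properties(m, n, edges):
    """Brute-force check of the independent-set characterization of an
    m x n bipartite graph (Theorem 2.5 as quoted in the text).
    edges: set of (row, col) pairs, rows in 0..m-1, cols in 0..n-1."""
    hall, strong_hall = True, True
    verts = [('r', i) for i in range(m)] + [('c', j) for j in range(n)]
    for k in range(1, len(verts) + 1):
        for subset in combinations(verts, k):
            rows = [v[1] for v in subset if v[0] == 'r']
            cols = [v[1] for v in subset if v[0] == 'c']
            if not is_independent(rows, cols, edges):
                continue
            if k > m:
                hall = False
            if k == m and rows and cols:
                strong_hall = False
    return hall, strong_hall

# 2 x 2 identity structure: Hall but not strong Hall, because the
# independent set {row 0, column 1} has size m = 2 and meets both parts.
print(hall_properties(2, 2, {(0, 0), (1, 1)}))                # (True, False)
# 2 x 2 full structure: strong Hall.
print(hall_properties(2, 2, {(0, 0), (0, 1), (1, 0), (1, 1)}))  # (True, True)
```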
4.2. Upper bounds on L and U with partial pivoting. For the remainder
of this section, we restrict our attention to the case in which only row interchanges
are performed during Gaussian elimination, so the column ordering is fixed initially.
This subsection proves symbolic upper bounds on the structures of L and U, making
no assumptions on the row pivoting strategy. For the case where A is strong Hall
and rows are ordered by partial pivoting, the next subsection proves matching exact
lower bounds. Therefore the symbolic upper bound is in fact a tight exact bound in
this case. As we will see, the tight exact bound is a one-at-a-time result; there is no
tight all-at-once bound on L and U in general.
In the rest of this section we require A to have a nonzero diagonal. The rows of
any nonsingular square matrix can be permuted to put nonzeros on the diagonal (by
Theorem 2.3 and Corollary 2.4). In fact, only the bounds on L below depend on a
nonzero diagonal; the bounds on U hold for arbitrary nonsingular A.
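The permutation promised by Theorem 2.3 and Corollary 2.4 is a perfect matching of rows to columns in H(A), which a simple augmenting-path search finds. The following is our own illustrative sketch, not the paper's algorithm:

```python
def diagonal_matching(pattern):
    """Find a row permutation placing nonzeros on the diagonal of a
    square sparse pattern, via augmenting-path bipartite matching.
    pattern[i] is the set of column indices with nonzeros in row i.
    Returns perm with perm[j] = row moved to position j, or None if
    the pattern is structurally singular."""
    n = len(pattern)
    match_col = [None] * n          # match_col[j] = row matched to column j

    def augment(row, seen):
        for j in pattern[row]:
            if j in seen:
                continue
            seen.add(j)
            if match_col[j] is None or augment(match_col[j], seen):
                match_col[j] = row
                return True
        return False

    for i in range(n):
        if not augment(i, set()):
            return None
    return match_col

# A matrix whose (0, 0) entry is zero but which is structurally nonsingular:
pattern = [{1}, {0, 1}, {2}]
perm = diagonal_matching(pattern)
assert all(j in pattern[perm[j]] for j in range(3))
print(perm)   # [1, 0, 2]
```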
Since the row interchanges Pi depend on the numerical values, it is in general im-
possible to determine where fill will occur in L and U from the structure of A. George
and Ng [13] suggested a way to get an upper bound on possible fill locations. At step
i of Gaussian elimination with row interchanges, call the rows that have nonzeros
in column i below the diagonal candidate pivot rows. George and Ng observed that
fill can only occur in candidate pivot rows, and only in columns that are nonzero in
some candidate pivot row. Thus the structure that results from the elimination step
is bounded by replacing each candidate pivot row by the union of all the candidate
pivot rows (to the right of column i). We need the fact that the diagonal of A is
nonzero to argue that this models the effect of row interchanges correctly: row i is
itself a candidate pivot row at step i, and therefore interchanging row i with another
candidate pivot row does not affect the structure of the bound.
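The George-Ng bounding step is easy to state operationally. The Python sketch below is our illustration, not their implementation: it applies the candidate-row union at each step to a pattern held as sets of column indices.

```python
def row_merge_bound(pattern):
    """Symbolic upper bound on fill for Gaussian elimination with row
    interchanges, in the style of the George-Ng procedure described in
    the text (a sketch; the input is assumed square with nonzero diagonal).
    pattern: list of sets; pattern[i] = columns with nonzeros in row i."""
    n = len(pattern)
    rows = [set(r) for r in pattern]
    for i in range(n):
        # Candidate pivot rows: nonzero in column i, on or below the diagonal.
        cand = [k for k in range(i, n) if i in rows[k]]
        merged = set()
        for k in cand:
            merged |= {j for j in rows[k] if j >= i}
        # Replace each candidate row by the union, to the right of column i.
        for k in cand:
            rows[k] |= merged
    return rows

# Tridiagonal 4 x 4 example.
A = [{0, 1}, {0, 1, 2}, {1, 2, 3}, {2, 3}]
for r in row_merge_bound(A):
    print(sorted(r))
# [0, 1, 2] / [0, 1, 2, 3] / [1, 2, 3] / [2, 3]
```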
This procedure for bounding the structures of L and U is precisely the construction
of the row merge graph from Section 3. Therefore we have the following theorem.
(Note that G×(A) = H×(A) since A is square.)
THEOREM 4.3 (GEORGE AND NG [13]). Let A be a nonsingular square ma-
trix with nonzero diagonal. Suppose A is factored by Gaussian elimination with row
interchanges as P_{n-1} · · · P1 A = LU. Then

G(L + U) ⊆ G×(A),

that is, the structures of L and U are subsets of the lower and upper triangles of the
row merge graph of A. □
G(L + U) ⊆ G+∩(A),

that is, the structures of L and U are subsets of the lower and upper triangles of the
(symmetric) filled column intersection graph of A. □
George, Liu, and Ng [10, 13] gave an algorithm for Gaussian elimination with
partial pivoting that uses G×(A) to build a data structure to hold the factors of A as
elimination progresses. The structure may be overgenerous in the sense that it stores
[6 x 6 spy diagrams of A (left) and of the bound G×(A) (right); the alignment of
the x-patterns was lost in extraction.]
FIG. 10. Example for Theorem 4.5. On the left is a matrix A. On the right is the bound G×(A) on
the structures of L and U. In the case r < c, Figure 11 shows how to make [U]35 nonzero. In the
case r > c, Figure 12 shows how to make [L]54 nonzero.
some zeros, but it has the advantage that it is static; the structure does not change
as pivoting choices are made. George, Liu, and Ng's numerical experiments indicated
that (with a judicious choice of a column reordering for sparsity) the total storage
and execution time required to compute the LU decomposition using the static data
structure were quite competitive with other approaches.
[Four-panel diagram for Figure 12 (the graph, the matrix A, the submatrix Ac, and
the directed graph G(Ac)); not recoverable from the extraction.]
FIG. 12. Example for Case 2 of Theorem 4.5, showing the construction that makes [L]54 nonzero
in the structure from Figure 10. At top left, the graph H̄ is the subgraph of H induced by column
vertices 1 through c = 4, and all the row vertices. Then d = 2 is the first column vertex on some path
from r' to c. The dashed edges are a column-complete matching M with respect to which there is a
c-alternating path Q = (4, 4', 3, 3', 1, 1', 2) from c to d. At top right, A is chosen to have large values
in positions M and small values elsewhere. At bottom left, Ac is the submatrix of PA with columns
1 through c and r and the rows in the corresponding positions after the first 4 pivot steps. The element
[L]54 is in position * of the factor of Ac. The fifth and last row of Ac is 5', the fifth row of A, because 5'
was not involved in a pivoting swap during the first 4 steps; therefore s' = r' = 5' and the argument
about an alternating path from r' to s' is not needed in this example. At bottom right, the directed
graph G(Ac) has a path (5, 2, 1, 4); therefore (5, 4) fills in. In the original A, the first pivot step fills
position (1', 4), and the second pivot step fills position (5', 4).
alternating-paths theorem (Theorem 2.9) applies, and says that there is a column-
complete matching M for H and a path Q from c to r that is c-alternating with
respect to M.
Choose the values of those nonzeros of A corresponding to edges of M to be
larger than n, and the values of the other nonzeros of A to be between 0 and 1.
Further, choose the values so as to make every square submatrix of A that is Hall,
including A itself, nonsingular. (Such a choice is possible by an argument like that
in Theorem 3.8: the determinant of a Hall submatrix is a polynomial in its nonzero
values, not identically zero because the Hall property implies a perfect matching.
Therefore the set of values that make any Hall submatrix singular has measure zero,
and can be avoided.)
Now we prove that this choice of values makes [U]rc nonzero. In the first r steps
of elimination of A, the pivot elements are nonzeros corresponding to edges of M.
Let P be the permutation matrix that describes the first r row interchanges (that is,
P = Pr P_{r-1} · · · P1 in Theorem 4.3). Let Ar be the (r + 1) x (r + 1) principal submatrix
of PA that includes the first r columns and column c, and the corresponding rows.
Thus the columns of Ar are those numbered 1 through r and c in H; the first r rows
of Ar are those matched to columns 1 through r of H by M; and it does not matter
which row of H the last row of Ar is. We will consider the rows and columns of the
bipartite graph H(Ar) to have the same numbers that they did in H; thus the column
vertex numbers are 1 through r and c, and the row numbers may be anything. In the
directed graph G(Ar), we will also number the vertices 1 through r and c, but bear
in mind that the row of Ar corresponding to a vertex v was not necessarily row v'
in H.
Now the first r diagonal elements of Ar are nonzero, and dominant. Let Lr and Ur
be the triangular factors of Ar without pivoting, Ar = Lr Ur. Then the element [U]rc
mentioned in the statement of the theorem is in fact [Ur]rc, the element in the last
column and next-to-last row of Ur. We proceed to show that [Ur]rc ≠ 0.
All square Hall submatrices of Ar are nonsingular; thus, by Lemma 2.13, G+(Ar)
is exactly the structure of [Lr + Ur]. Therefore [U]rc is nonzero if and only if G(Ar)
contains a directed path from vertex r to vertex c, through vertices numbered less
than r.
Recall the path Q, which is a path in H from c to r that is c-alternating with
respect to M. The matching M consists of exactly the edges on the diagonal of Ar
(except for the one in the last column, which cannot be an edge of Q because Q is
c-alternating). Therefore Q corresponds to a directed path from r to c in G(Ar).
Every vertex of G(Ar) except r and c is numbered less than r, so this is the desired
directed fill path and the proof of this case is complete.
Note that the proof never explicitly identified the row of H that ended up in
position (r, c) of U; it is the row matched to column r by M, and is the second last
vertex on the path Q.
Case r > c (structure of L). Figure 12 illustrates this case. The proof for this
case is much like that for U, but it needs to do some extra work to identify the row
of H that ends up in position (r, c) of L, because that row has not yet been matched
(pivoted on) when Lrc is computed.
Again by Corollary 3.5, there is a path T in H from row vertex r' to column
vertex c whose intermediate column vertices are all at most c. Let d be the first
column vertex on T (this is the vertex after r' on T; possibly d = c).
Let H̄ be the subgraph of H induced by all the row vertices and the column
vertices 1 through c. (This has one less column than in the proof for U.) Then T[d : c]
is a path (possibly of length 0) in H̄ from column vertex d to column vertex c.
Again, therefore, there is a column-complete matching M for H̄ and a path Q from
c to d that is c-alternating with respect to M.
Again we choose A so that edges of M have values larger than n, other edges have
values between 0 and 1, and every square Hall submatrix of A is nonsingular.
The first c steps of elimination of A pivot on nonzeros corresponding to edges of M.
Let P be the permutation matrix that describes the first c row interchanges (that is,
P = Pc P_{c-1} · · · P1 in Theorem 4.3). Let Ac be the (c + 1) x (c + 1) principal submatrix
of PA that includes the first c columns and column r, and the rows in corresponding
positions of PA. Thus the columns of Ac are those numbered 1 through c and r in
H; the first c rows of Ac are those matched to columns 1 through c of H by M. The
last row of Ac is some row number s' in H that is not matched by M. (Row s' may
or may not be matched to column r in the final factorization of A.)
Again, we give the rows and columns of the bipartite graph H(Ac) the same
numbers they had in H; the column vertex numbers are 1 through c and r, and the
row numbers may be anything (but the last row is s'). In the directed graph G(Ac),
we will also number the vertices 1 through c and r; again, bear in mind that the row
of Ac corresponding to a vertex v was not necessarily row v' in H, and in particular
the row corresponding to vertex r of G(Ac) is row s' of H.
Now the first c diagonal elements of Ac are nonzero, and dominant. Let Lc and
Uc be the triangular factors of Ac without pivoting, Ac = Lc Uc. The element [L]rc
mentioned in the statement of the theorem is in fact [Lc]rc, the element in the last
row and next-to-last column of Lc.
As before, we show that [Lc]rc ≠ 0 by exhibiting a directed path from vertex r
to vertex c of G(Ac), based on a c-alternating path in H. However there is not
necessarily an edge between column vertex r and row vertex s' in H; thus we must
find a c-alternating path that ends at s', not r. The details of how to do that will
complete the proof.
We now trace the pivoting process to discover where row s' came from. If row r'
of H was not used as one of the first c pivots, then it has not moved and s' = r'.
If row r' was used as a pivot, suppose it was in column c1 ≤ c, and that the row
interchanged with r' at step c1 was row r'1. (Recall that all row and column numbers
are vertex numbers of H.) Again, either r'1 = s' or else r'1 was later used as a pivot
in some column c2 > c1, when it was interchanged with some row r'2. Continuing
inductively, we eventually arrive at a row r'k which is equal to s', which was not used
as a pivot in the first c steps.
The sequence of nonzeros we followed while tracing the pivoting process was

(c1, r'1), (c2, r'2), . . . , (ck, r'k).

Each (ci, r'i) is an edge of one of the bipartite elimination graphs H0, H1, . . . , Hc
corresponding to the first c steps of symbolic Gaussian elimination of H. Therefore,
by Lemma 4.1, there is a c-alternating path in H from ci to r'i for each i. Furthermore
each (r'_{i-1}, ci) is an edge of M, and is thus a one-edge c-alternating path from r'_{i-1}
to ci. Concatenating these paths yields a c-alternating walk W (which may repeat
vertices or edges) from r' to s' in H.
Now if edge (d, r') is not an edge of M, then Q followed by (d, r') followed by W
is a c-alternating walk from column c to row s'. Alternatively, if (d, r') is an edge
of M, then d = c1, and Q followed by W[d : s'] is a c-alternating walk from column c
to row s'. Either way, we have a walk in H from c to s' that is c-alternating with
respect to M. This walk corresponds to a directed walk from vertex r to vertex c
of G(Ac). Thus there is a directed path from vertex r to vertex c of G(Ac). The
intermediate vertices on this path are less than both r and c, because r and c are the
last two vertices of G(Ac). Therefore (r, c) is an edge of G+(Ac). Since all square Hall
submatrices of Ac are nonsingular, therefore, [Lc]rc is nonzero. Thus [L]rc is nonzero
and the proof is complete. □
possible for LU factorization with partial pivoting. To see this, consider a matrix
that is tridiagonal plus a full first column,

x x . . .
x x x . .
x x x x .
x . x x x
x . . x x

The graph H(A) is strong Hall. The row merge graph G×(A) is full. As Theorem 4.5
says, any single position in L or U can be made nonzero by an appropriate choice of
pivots. But the first row of U will have the same structure as some row of A, so it is
impossible for U to be full.
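This observation is easy to check numerically. The sketch below is ours (plain LU with partial pivoting, with random values on the stated pattern); it confirms that the first row of U keeps the sparsity of whichever row of A is pivoted to the top, so U cannot be full.

```python
import random

def lu_partial_pivoting(A):
    """Plain LU with partial pivoting (a sketch).  Returns L and U;
    row 0 of U is the pivot row of step 0 and is never touched again."""
    n = len(A)
    U = [row[:] for row in A]
    L = [[float(i == j) for j in range(n)] for i in range(n)]
    for i in range(n):
        p = max(range(i, n), key=lambda k: abs(U[k][i]))
        U[i], U[p] = U[p], U[i]                # row interchange
        for k in range(i + 1, n):
            m = U[k][i] / U[i][i]
            L[k][i] = m
            for j in range(i, n):
                U[k][j] -= m * U[i][j]
    return L, U

random.seed(1)
n = 5
# Tridiagonal plus a full first column, with random nonzero values.
A = [[random.uniform(1, 2) if (j == 0 or abs(i - j) <= 1) else 0.0
      for j in range(n)] for i in range(n)]
L, U = lu_partial_pivoting(A)
# The first row of U inherits the structure of the pivoted row of A,
# which has at most 4 nonzeros here, so U is not full.
print([j for j in range(n) if abs(U[0][j]) > 1e-12])
```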
One application of structure prediction for partial pivoting is to predict which
columns of A will update which other columns if the factorization is done with a
column-by-column algorithm. For example, Gilbert [15] gave a parallel implementa-
tion of LU factorization with partial pivoting in which tasks (columns of the factor-
ization) were scheduled dynamically to processors, based on a precedence relationship
determined by precomputing the elimination tree [23] of G∩(A). Since [U]ij is nonzero
if and only if column i updates column j during the factorization, a corollary of The-
orem 4.5 is that, for strong Hall A, this is the tightest prediction possible from the
structure of A alone.
COROLLARY 4.7 (GILBERT [15]). Let a strong Hall structure for the square
matrix A be given. If k is the parent of j in the elimination tree of G∩(A), then there
exists a choice of nonzero values of A that will make column j update column k during
factorization with partial pivoting. □
This corollary is a one-at-a-time result. However, if we restrict our attention to the
edges of the elimination tree of G∩(A) instead of all of G×(A), it may be possible to
prove an all-at-once result. We conjecture that for every square strong Hall matrix H,
there exists a single matrix A with H(A) = H such that every edge of the elimination
tree of G∩(A) corresponds to a nonzero in the upper triangular factor U of A with
partial pivoting.
Little if anything is known about the case when H(A) is not strong Hall. Hare
et al. [21] gave a complete exact result for QR factorization assuming only the Hall
property; is a similar analysis possible for partial pivoting? In particular, since the
upper triangles of G×(A) and G+∩(A) can differ in the non-strong Hall case, how tight
is the former for partial pivoting? There are non-strong Hall structures for which
G×(A) is tight but G+∩(A) is not; an example is a matrix whose only nonzeros are the
diagonal and the first row.
Gaussian elimination takes over to handle numeric relationships; the tight exact (i.e.
numeric) lower bounds in this paper say that Dulmage-Mendelsohn decomposition is
doing its job.
Predicting structure in algorithms that combine numerical and structural infor-
mation is an interesting challenge. Murota et al. [25] have studied block triangular
decompositions that take some but not all of the numerical values into account.
We point out once more that Hare, Johnson, Olesky, van den Driessche, and
Pothen [21, 28] have recently obtained tight exact bounds on both Q and R in the
general Hall case, thus extending the work of Coleman, Edenbrandt, and Gilbert that
we reviewed in Section 3. It would be interesting to see whether our bounds on L
and U for partial pivoting, in Section 4, could be similarly extended.
We conclude by mentioning three open problem areas for nonsymmetric structure
prediction.
First, it would be interesting to understand the relationship between the structure
of L and the structure of LO, both of which are different ways of storing the lower
triangular factor in Gaussian elimination with partial pivoting. Can the techniques
discussed in this paper be used to obtain bounds on the structure of LO?
Second, it would be useful to achieve a complete structural understanding of the
Bunch-Kaufman symmetric indefinite factorization [18, Chapter 4.4]. Here a sym-
metric indefinite matrix is factored symmetrically by choosing pivots from the di-
agonal, but each pivot may be either an element or a 2 x 2 submatrix. Thus the
factorization is PAP^T = LDL^T, where P is a permutation, L is lower triangular, and
D is block diagonal with 1 x 1 and 2 x 2 blocks. This factorization is particularly
useful for solving "augmented systems" of the form

[ K    A ] [ x ]   [ b ]
[ A^T  0 ] [ y ] = [ c ],

where A is rectangular and K is symmetric and (perhaps) positive definite [1]. Even
the common case K = I is not well understood.
Third, it would be interesting to understand the structural issues in the incomplete
LU factorizations sometimes used to precondition iterative methods for solving linear
systems [7].
REFERENCES
[1] Åke Björck. A note on scaling in the augmented system methods, 1991. Unpublished
manuscript.
[2] Robert K. Brayton, Fred G. Gustavson, and Ralph A. Willoughby. Some results on sparse
matrices. Mathematics of Computation, 24:937-954, 1970.
[3] Richard A. Brualdi and Herbert J. Ryser. Combinatorial Matrix Theory. Cambridge University
Press, 1991.
[4] Richard A. Brualdi and Bryan L. Shader. Strong Hall matrices. IMA Preprint Series #909,
Institute for Mathematics and Its Applications, University of Minnesota, December 1991.
[5] Thomas F. Coleman, Anders Edenbrandt, and John R. Gilbert. Predicting fill for sparse
orthogonal factorization. Journal of the Association for Computing Machinery, 33:517-
532, 1986.
[6] I. S. Duff and J. K. Reid. Some design features of a sparse matrix code. ACM Transactions
on Mathematical Software, 5:18-35, 1979.
[7] Howard Elman. A stability analysis of incomplete LU factorization. Mathematics of Compu-
tation, 47:191-218, 1986.
[8] Alan George and Michael T. Heath. Solution of sparse linear least squares problems using
Givens rotations. Linear Algebra and its Applications, 34:69-83, 1980.
[9] Alan George and Joseph Liu. Householder reflections versus Givens rotations in sparse orthog-
onal decomposition. Linear Algebra and its Applications, 88:223-238, 1987.
[10] Alan George, Joseph Liu, and Esmond Ng. A data structure for sparse QR and LU factoriza-
tions. SIAM Journal on Scientific and Statistical Computing, 9:100-121, 1988.
[11] Alan George and Joseph W. H. Liu. Computer Solution of Large Sparse Positive Definite
Systems. Prentice-Hall, 1981.
[12] Alan George, Joseph W. H. Liu, and Esmond Ng. Row ordering schemes for sparse Givens
transformations I. Bipartite graph model. Linear Algebra and its Applications, 61:55-81,
1984.
[13] Alan George and Esmond Ng. Symbolic factorization for sparse Gaussian elimination with
partial pivoting. SIAM Journal on Scientific and Statistical Computing, 8:877-898, 1987.
[14] John R. Gilbert. Predicting structure in sparse matrix computations. Technical Report 86-750,
Cornell University, 1986. To appear in SIAM Journal on Matrix Analysis and Applications.
[15] John R. Gilbert. An efficient parallel sparse partial pivoting algorithm. Technical Report
88/45052-1, Christian Michelsen Institute, 1988.
[16] John R. Gilbert and Tim Peierls. Sparse partial pivoting in time proportional to arithmetic
operations. SIAM Journal on Scientific and Statistical Computing, 9:862-874, 1988.
[17] John Russell Gilbert. Graph Separator Theorems and Sparse Gaussian Elimination. PhD
thesis, Stanford University, 1980.
[18] Gene H. Golub and Charles F. Van Loan. Matrix Computations. The Johns Hopkins University
Press, second edition, 1989.
[19] Martin Charles Golumbic. Algorithmic Graph Theory and Perfect Graphs. Academic Press,
1980.
[20] Frank Harary. Graph Theory. Addison-Wesley Publishing Company, 1969.
[21] Donovan R. Hare, Charles R. Johnson, D. D. Olesky, and P. van den Driessche. Sparsity
analysis of the QR factorization, 1991. To appear in SIAM Journal on Matrix Analysis
and Applications.
[22] Michael T. Heath. Numerical methods for large sparse linear least squares problems. SIAM
Journal on Scientific and Statistical Computing, 5:497-513, 1984.
[23] Joseph W. H. Liu. The role of elimination trees in sparse factorization. SIAM Journal on
Matrix Analysis and Applications, 11:134-172, 1990.
[24] L. Lovász and M. D. Plummer. Matching Theory. North Holland, 1986.
[25] Kazuo Murota, Masao Iri, and Masataka Nakamura. Combinatorial canonical form of layered
mixed matrices and its application to block-triangularization of systems of linear/nonlinear
equations. SIAM Journal on Algebraic and Discrete Methods, 8:123-149, 1987.
[26] Esmond G. Ng and Barry W. Peyton. A tight and explicit representation of Q in sparse QR
factorization. Technical Report ORNL/TM-12059, Oak Ridge National Laboratory, 1992.
[27] S. Parter. The use of linear graphs in Gauss elimination. SIAM Review, 3:119-130, 1961.
[28] Alex Pothen. Predicting the structure of sparse orthogonal factors. Manuscript, 1991.
[29] Alex Pothen and Chin-Ju Fan. Computing the block triangular form of a sparse matrix. ACM
Transactions on Mathematical Software, 16:303-324, 1990.
[30] Donald J. Rose. Triangulated graphs and the elimination process. Journal of Mathematical
Analysis and Applications, 32:597-609, 1970.
[31] Donald J. Rose and Robert Endre Tarjan. Algorithmic aspects of vertex elimination on directed
graphs. SIAM Journal on Applied Mathematics, 34:176-197, 1978.
[32] Donald J. Rose, Robert Endre Tarjan, and George S. Lueker. Algorithmic aspects of vertex
elimination on graphs. SIAM Journal on Computing, 5:266-283, 1976.
[33] H. R. Schwartz. Tridiagonalization of a symmetric band matrix. Numerische Mathematik,
12:231-241, 1968.
HIGHLY PARALLEL SPARSE TRIANGULAR SOLUTION*
Abstract. In this paper we survey a recent approach for solving sparse triangular systems
of equations on highly parallel computers. This approach employs a partitioned representation
of the inverse of the triangular matrix so that the solution can be computed by matrix-vector
multiplication. The number of factors in the partitioned inverse is proportional to the number of
general communication steps (router steps on a CM-2) required in a highly parallel algorithm. We
describe partitioning algorithms that minimize the number of factors in the partitioned inverse over
all symmetric permutations of the triangular matrix such that the permuted matrix continues to
be triangular. For a Cholesky factor we describe an O(n) time and space algorithm to solve the
partitioning problem above, where n is the order of the matrix. Our computational results on a CM-
2 demonstrate the potential superiority of the partitioned inverse approach over the conventional
substitution algorithm for highly parallel sparse triangular solution. Finally we describe current and
future extensions of these results.
AMS(MOS) subject classifications: primary 65F50, 65F25, 68R10.
Keywords. chordal graph, directed acyclic graph, elimination tree, graph partitioning, massively
parallel computers, partitioned inverse, sparse triangular systems, transitive closure.
* A part of this work was done while the authors were visiting the Institute for Mathematics
and its Applications (IMA) at the University of Minnesota. We thank the IMA for its support.
† Electrical and Computer Engineering Department, 1425 Johnson Drive, The University of
Wisconsin, Madison, WI 53706 (alvarado@ece.wisc.edu). This author was supported under NSF
Contracts ECS-8822654 and ECS-8907391.
‡ Department of Computer Science, University of Waterloo, Waterloo, Ontario, Canada N2L
3G1 (apothen@narnia.uwaterloo.ca, na.pothen@na-net.ornl.gov). This author was supported by
NSF grant CCR-9024954 and by U. S. Department of Energy grant DE-FG02-91ER25095 at the
Pennsylvania State University and by the Canadian Natural Sciences and Engineering Research
Council under grant OGP0008111 at the University of Waterloo.
§ RIACS, MS T045-1, NASA Ames Research Center, Moffett Field, CA 94035
(schreiber@riacs.edu). This author was supported by the NAS Systems Division under Coopera-
tive Agreement NCC 2-387 between NASA and the University Space Research Association (USRA).
the description of the algorithms (especially Algorithm RP2 in section 2), to illustrate
the differences between the algorithms by means of examples, and to correct minor
errors.
2.1. Graph models. A formal statement of the best no-fill partitioning problem
is as follows:
(Pr1) Given a unit lower triangular matrix L = L1 L2 · · · Ln, find a partition into factors
L = P1 P2 · · · Pm, where
1. each Pi = L_{e_i} · · · L_{e_{i+1} - 1}, with e1 = 1 < e2 < · · · < em < e_{m+1} = n + 1,
2. each Pi inverts in place, and
3. m is minimum over all partitions satisfying the given conditions.
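The phrase "inverts in place" can be made concrete on a tiny example: a unit lower triangular factor whose DAG is transitively closed has an inverse with exactly the same nonzero structure, so the inverse can overwrite the factor. A small pure-Python sketch (ours, not from the paper):

```python
def unit_lower_inverse(L):
    """Invert a unit lower triangular matrix by forward substitution
    on each column of the identity."""
    n = len(L)
    X = [[float(i == j) for j in range(n)] for i in range(n)]
    for j in range(n):
        for i in range(n):
            X[i][j] -= sum(L[i][k] * X[k][j] for k in range(i))
    return X

def structure(M):
    """Set of positions holding nonzero values."""
    return {(i, j) for i in range(len(M)) for j in range(len(M)) if M[i][j]}

# Transitively closed DAG (edges 1->2, 1->3, 2->3): the inverse has the
# same structure, so this factor "inverts in place".
L1 = [[1.0, 0.0, 0.0], [2.0, 1.0, 0.0], [3.0, 4.0, 1.0]]
print(structure(unit_lower_inverse(L1)) == structure(L1))   # True
# Not transitively closed (edges 1->2, 2->3 but no 1->3): position (3, 1)
# of the inverse fills in.
L2 = [[1.0, 0.0, 0.0], [2.0, 1.0, 0.0], [0.0, 3.0, 1.0]]
print(structure(unit_lower_inverse(L2)) == structure(L2))   # False
```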
[Spy diagrams of the lower triangular matrix, its DAG, and the two partitions; not
recoverable from the extraction.]
FIG. 1. A lower triangular matrix, its DAG, and its partitions. The original ordering and the
partition found by Algorithm P1 are shown on the left, and the ordering and partition found by
Algorithms RP1 or RP2 are shown on the right.
It is helpful to consider a graph model of (Pr1) and the other partitioning problems.
Let G(L) denote a directed graph with vertices V = {1, . . . , n} corresponding to the
columns of L and edges E = {(j, i) : i > j and lij ≠ 0}. The edge (j, i) is directed
from the lower-numbered vertex j to the higher-numbered vertex i. It follows that
G(L) is a directed acyclic graph (DAG). If there is a directed path from a vertex j to
a vertex i in G(L), we will say that j is a predecessor of i, and that i is a successor
of j. In particular, if (j, i) ∈ E, then j is a predecessor of i and i is a successor of j.
Given a subset P of the columns of L, the column subgraph of G(L) induced by P
is the graph whose edge set is the subset of edges in E that are directed from vertices
in P to all vertices in V, and whose vertex set is the subset of vertices which are the
endpoints of such edges. Thus the column subgraph of P is the subgraph induced by
the edges that correspond to nonzeros in the column set P.
L = (L1 · · · L6)(L7 · · · L12),
which has only two factors. The matrix on the right in Fig. 1 corresponds to the
reordered matrix, and its partition into two factors is also shown.
A formal statement of the best reordered partitioning problem is as follows:
(Pr2) Given a unit lower triangular matrix L = L1 L2 · · · Ln, find an admissible permu-
tation Q and a partition LQ ≡ QLQ^T = P1 P2 · · · Pm, where
1. each Pi = L_{e_i} · · · L_{e_{i+1} - 1}, with e1 = 1 < e2 < · · · < em < e_{m+1} = n + 1,
2. each Pi is invertible in place, and
3. m is minimum over all permutations Q such that LQ is lower triangular.
As noted above, the action of the permutation Q on L is to reorder the elementary
matrices whose product is L; however, these elementary matrices cannot be arbitrarily
reordered, since we require the resulting matrix LQ to be lower triangular. From the
equation Li = I + mi ei^T it can be verified that the elementary matrices Li and L_{i+1}
can be permuted if and only if l_{i+1,i} = 0. These precedence constraints on the order
in which the elementary matrices may appear are nicely captured in a graph model of
(Pr2).
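This commutation criterion is easy to verify numerically. The following sketch (ours) builds elementary matrices Li = I + mi ei^T in pure Python and compares the two products:

```python
def matmul(A, B):
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def elementary(n, i, m):
    """L_i = I + m_i e_i^T: identity plus multipliers m (a dict mapping
    row index below i to its value) in column i."""
    E = [[float(r == c) for c in range(n)] for r in range(n)]
    for r, v in m.items():
        E[r][i] = v
    return E

L2 = elementary(3, 1, {2: 7.0})          # multiplier in column 2 (1-based)
# l_{2,1} = 0: L1 and L2 commute.
L1 = elementary(3, 0, {2: 5.0})          # no entry just below the diagonal
print(matmul(L1, L2) == matmul(L2, L1))  # True
# l_{2,1} != 0: they do not commute.
L1 = elementary(3, 0, {1: 5.0})          # nonzero entry just below diagonal
print(matmul(L1, L2) == matmul(L2, L1))  # False
```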
A topological ordering of G(L) is an ordering of its vertices in which predeces-
sors are numbered lower than successors; i.e., for every edge (j, i) ∈ E, i > j. By
construction, the original vertex numbering of G(L) is a topological ordering. A per-
mutation Q that leaves LQ lower triangular corresponds to a topological reordering
of the vertices of G(L).
The graph model of (Pr2) is:
(Pr2') Find an ordered partition P1 ≺ P2 ≺ · · · ≺ Pm of the vertices of G(L)
numbered in a topological ordering such that
1. for every v ∈ V, if v ∈ Pi then all predecessors of v belong to P1, . . . ,
Pi,
2. the column subgraph of each Pi is transitively closed, and
3. m is minimum subject to these conditions.
Input: A unit lower triangular matrix L = L1 L2 · · · Ln and its DAG G(L).
Output: A best no-fill partition of L.
i ← 1; {Li is the lowest-numbered elementary matrix not included in a factor yet}
k ← 1; {Pk is the factor being computed}
while (i ≤ n) do
  {Find the largest integer r ≥ i such that Li · · · Lr is invertible in place}
  r ← i;
  while r < n and in G(L) every successor of the vertex r is a successor
    of all predecessors v of r such that i ≤ v < r do r ← r + 1; od
  Pk ← {i, . . . , r}; k ← k + 1; i ← r + 1;
od
Best no-fill partitions. Algorithm P1, shown in Fig. 2, was proposed by Alvarado, Yu and Betancourt [3]. This algorithm greedily tries to include as many elementary matrices in the current factor as possible, while maintaining the two properties that a factor should invert in place, and that the 'left-to-right' precedence constraint in problem (Pr1) should be obeyed. The condition that in the graph G(L) every successor of a vertex r is also a successor of every predecessor of r ensures that inclusion of the vertex r in the current factor P_k will continue to make G(P_k) transitively closed, and thus P_k will be invertible in place. Alvarado, Yu, and Betancourt
did not consider the issue of optimality, but later Alvarado and Schreiber [2] proved that Algorithm P1 solves problem (Pr1).
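The greedy strategy of Algorithm P1 can be sketched in a few lines of Python. The input format (successor sets indexed by vertex, 1-based, in the topological original ordering) and the handling of the candidate vertex r+1 are our own choices for this illustration, not the authors' implementation.

```python
def p1_partition(n, succ):
    """Greedy no-fill partition in the spirit of Algorithm P1.
    succ[v] = set of successors of vertex v in G(L), vertices 1..n.
    Returns the list of factors, each a list of consecutive vertices."""
    pred = {v: set() for v in range(1, n + 1)}
    for v in range(1, n + 1):
        for w in succ[v]:
            pred[w].add(v)

    factors, i = [], 1
    while i <= n:
        r = i
        # Extend the factor while the next vertex keeps the column
        # subgraph transitively closed: every successor of r+1 must be
        # a successor of each in-factor predecessor of r+1.
        while r < n and all(succ[r + 1] <= succ[v]
                            for v in pred[r + 1] if i <= v <= r):
            r += 1
        factors.append(list(range(i, r + 1)))
        i = r + 1
    return factors
```

For a dense 3 x 3 lower triangular pattern this yields a single factor, while a bidiagonal pattern (edges 1→2 and 2→3 only) splits into the two factors {1} and {2, 3}.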
Best reordered partitions. Now we describe Algorithm RP1, which solves the reordered partitioning problem (Pr2). A vertex v in the DAG G(L) is a source if there are no edges directed into v; i.e., there are no edges (u, v). The level of a vertex v is the length of a longest directed path into v. It follows that if v is a source, then level(v) = 0; furthermore, if v is not a source, then level(v) is the length of a longest path from a source to v. The level values of all the vertices of G(L) can be computed in O(e) time. We define the set hadj(v) to be the set of all vertices adjacent to v and numbered higher than v.
Algorithm RP1, shown in Fig. 3, renumbers the elementary matrices during the course of its execution, since it computes an appropriate symmetric permutation Q to minimize the number of factors. Conditions 1a and 1b in the algorithm ensure that the first condition of problem (Pr2) is satisfied; similarly, condition 2 ensures that the column subgraphs of the factors are transitively closed.
Alvarado and Schreiber [2] proved that Algorithm RP1 finds a best reordered partition. The time complexity of the algorithm is dominated by the checking of condition 2: in the worst case, this cost is ∑_{v∈V} d_I(v) d_O(v), where d_I(v) is the indegree and d_O(v) is the outdegree of v. Since d_I(v) ≤ n−1, and ∑_{v∈V} d_O(v) = e, the time
complexity of the algorithm is O(ne). If we assume that the indegrees and outdegrees are bounded by d, then the complexity is O(d²n). The space complexity is O(e).
At the expense of additional space, in most cases we can reduce the running time required by Algorithm RP1 by incorporating two enhancements.
The first improvement is that a vertex need not be tested for inclusion into a factor P_k until all of its predecessors have been numbered. To accomplish this, in count(v) we count the number of unnumbered predecessors of each vertex v; initially, this is its indegree. When this count becomes zero, we include v in a set E of vertices eligible to be tested for inclusion in a factor P_k.
If an eligible vertex v satisfies condition 2', then it is deleted from E and included in the factor P_k. Otherwise, it is included in E+, the set of vertices eligible to be tested for inclusion in the next factor P_{k+1}. Further, since newly eligible vertices are adjacent to currently eligible vertices, we need to maintain only the sets E and E+ in the algorithm. Thus we can dispense with the processing of vertices by level values.
The second improvement is to reduce the cost of checking condition 2 in Algorithm RP1. If u and v are both numbered vertices which have been included in the current
FIG. 5. A Cholesky factor L, its DAG, and its partitions. The original ordering and the partition found by Algorithm P1 are shown on the left, and the ordering and partition found by Algorithm RPtree are shown on the right.
i. (The edge is directed from j to i.) The vertex i is the parent of j, and j is a child of i. If (j, i) is an edge in the elimination tree, the lowest-numbered vertex in hadj(j) is i. The elimination tree of the graph G(L) in Fig. 5 is shown in Fig. 6. (The vertex numbering corresponds to the original ordering shown on the left in Fig. 5.) A comprehensive survey of the role of elimination trees in sparse Cholesky factorization has been provided by Liu [13].
Our partitioning algorithm will require as input the elimination tree with vertices numbered in a topological ordering. It also requires the subdiagonal nonzero counts of each column v of L, stored in an array hd(v) (the higher degree of v). The algorithm uses a variable member to partition the vertices; member(v) = i implies that v belongs to the set P_i.
Unlike Algorithms RP1 and RP2, which compute the factors P_1, ..., P_m in that sequence, Algorithm RPtree examines the vertices of the elimination tree in increasing order of their numbers. If a vertex v is a leaf of the tree, then it is included in the first member (the vertices in P_1). Otherwise, it divides the children of v into two sets: G_1 is the subset of the children u such that the column subgraph of G(L) induced by u and v is transitively closed, and G_2 denotes the subset of the remaining children. Let m_1 denote the maximum member value of a child in G_1 and m_2 denote the maximum member value of a child in G_2. Set m_i = 0 if G_i = ∅. If G_1 is empty, or if m_1 ≤ m_2, then we will show that v cannot be included in the same member as any of its children, and hence v begins a new member (m_2 + 1). Otherwise, m_1 > m_2, and v can be included together with some child u ∈ G_1 such that member(u) = m_1.
We now describe the details of an implementation. The vertices of the elimination tree are numbered in a topological ordering from 1 to n. The descendant relationships in the elimination tree are represented by two arrays of length n, child and sibling. The entry child(v) is the first child of v, and sibling(v) is the right sibling of v, where the children of each vertex are ordered arbitrarily. If child(v) = 0, then v has no child and is a leaf of the elimination tree; if sibling(v) = 0, then v has no right sibling. Algorithm RPtree is shown in Fig. 7.
The reader can verify that P_1 = {1, 3, 4, 7, 8, 9}, P_2 = {2, 5, 10}, and P_3 = {6, 11, 12} for the graph in Fig. 5. The time and space complexities of the algorithm are easily shown to be O(n). We turn to a discussion of the correctness of the algorithm.
Input: The elimination tree of a DAG G(L) and the higher degrees of the vertices.
Output: A mapping of the vertices such that member(v) = ℓ implies that v ∈ P_ℓ.
for v := 1 to n do
    if child(v) = 0 then {v is a leaf}
        member(v) := 1;
    else {v is not a leaf}
        u := child(v); m_1 := 0; m_2 := 0;
        while u ≠ 0 do
            if hd(u) = 1 + hd(v) then
                m_1 := max{m_1, member(u)};
            else {hd(u) < 1 + hd(v)}
                m_2 := max{m_2, member(u)};
            fi
            u := sibling(u);
        od
        if m_1 ≤ m_2 then {v begins a new factor}
            member(v) := m_2 + 1;
        else {m_1 > m_2, v can be included in a factor which includes a child}
            member(v) := m_1;
        fi
    fi
rof
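The pseudocode of Fig. 7 translates almost line for line into Python. In the sketch below (ours, not the authors' code) the child/sibling arrays use index 0 for "none", as in the text, and the higher degrees hd(v) are assumed to be supplied by the caller.

```python
def rptree_member(n, child, sibling, hd):
    """Algorithm RPtree (sketch): vertices 1..n in a topological ordering
    of the elimination tree; child[v]/sibling[v] use 0 for 'none'; hd[v]
    is the subdiagonal nonzero count of column v. Returns member[1..n]."""
    member = [0] * (n + 1)
    for v in range(1, n + 1):
        if child[v] == 0:            # v is a leaf
            member[v] = 1
            continue
        m1 = m2 = 0                  # max member over G_1 and G_2 children
        u = child[v]
        while u != 0:
            if hd[u] == 1 + hd[v]:   # u and v can share a closed subgraph
                m1 = max(m1, member[u])
            else:                    # hd[u] < 1 + hd[v]
                m2 = max(m2, member[u])
            u = sibling[u]
        member[v] = m2 + 1 if m1 <= m2 else m1
    return member[1:]
```

On a three-vertex chain (a bidiagonal factor) this produces the members [1, 2, 2], matching the two-factor partition {1}, {2, 3} discussed earlier.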
Condition 1 of problem (Pr2) requires that if a vertex v belongs to P_ℓ, then all predecessors of v must belong to P_1, ..., P_ℓ. The elimination tree T, being the transitive reduction of the DAG G(L), preserves path structure: i.e., there exists a directed path from v to w in G(L) if and only if there is a (possibly different) directed path from v to w in the elimination tree T. Hence the predecessors of a vertex in G(L) remain its predecessors in the elimination tree. Further, since we assign member values in a topological ordering of the vertices in the elimination tree, to satisfy Condition 1 we need consider only the children of a vertex v among its predecessors. Now since Algorithm RPtree assigns member values such that member(v) is greater than or equal to member(u) for any child u, the condition is satisfied.
Condition 2 requires that each factor P_ℓ be transitively closed. An important property of the elimination tree [13] is that if v is the parent of a vertex u in the elimination tree, then hadj(u) ⊆ {v} ∪ hadj(v). Hence hd(u) ≤ 1 + hd(v). On the other hand, if u and v can be included in the same transitively closed column subgraph, then hadj(u) ⊇ {v} ∪ hadj(v). It then follows that u and v can possibly be included in the same column subgraph only if hadj(u) = {v} ∪ hadj(v), or equivalently, hd(u) = 1 + hd(v). Furthermore, if v has a child u not satisfying the degree condition, then v but not u is adjacent to some higher-numbered vertex x, and hence v cannot belong to the same member as u. Thus we partition the children of v into two subsets: G_1 consists of children u such that u and v can be included in the same column subgraph; G_2 includes the rest of the children. It follows that if m_i is the
maximum member value among vertices in G_i, then the inclusion of v into a column subgraph containing a child preserves transitivity only if m_1 > m_2.
It can be established by induction that Algorithm RPtree solves (Pr2) by partitioning G(L) into the minimum number of factors over all topological orderings [16].
TABLE 1
Comparison of execution times on a Sun SPARCstation IPC for three secondary reordering schemes with the MMD primary ordering. The parameters τ(A) and τ(L) have been scaled by a thousand for convenience.
the height of the elimination tree is 3k + Θ(1), while the number of factors (in (Pr1) and (Pr2)) is 2 log₂ k + Θ(1). The results in Table 1 show that the number of factors for these irregular problems is only weakly dependent on the order of A, compatible with logarithmic growth.
The RPtree algorithm has O(n) time complexity, while RP1 and RP2 are both O(nτ(L)) algorithms (recall that τ(L) = e). This is confirmed by the experiments: on the average problem in this test set, RPtree is more than a hundred times faster than RP1 or RP2, and the advantage increases with increasing problem size. From a practical perspective, the time needed by the RPtree algorithm is quite small when compared to the cost of computing the initial MMD ordering. An equally important advantage of the RPtree algorithm is that it requires only O(n) additional space, whereas both RP1 and RP2 require O(τ(L)) additional space. However, Algorithms RP1 and RP2 can be used to partition triangular factors arising from approximate or incomplete Cholesky factorizations as well as unsymmetric and symmetric indefinite factorizations.
We have also experimented with a variant of the Minimum-Length-Minimum-Degree (MLMD) ordering [5] as the primary ordering, but we do not report detailed results here. The MLMD ordering incurs a great deal more fill in L than the MMD algorithm, and its current, fairly straightforward implementation is quite slow compared to the MMD algorithm. We believe an implementation comparable in sophistication to the MMD algorithm should not be significantly slower than MMD, and may also reduce fill. In spite of the greater fill, the MLMD ordering is more effective than MMD in almost all cases in reducing the number of factors in the partition of both L and L_Q. In some cases, the initial number of factors obtained when MLMD is used as the primary ordering is lower than the final number of levels obtained with MMD after the secondary reordering (Q in problem (Pr2)). However, because of the increased fill, choosing between MMD and MLMD as the primary ordering is not straightforward.
TABLE 2
CM-2 times (seconds) for full matrix substitution and partitioned solution.
TABLE 3
CM-2 times (seconds) for sparse triangular substitution and partitioned solution.
wise' execution model, which treats each Weitek chip of the CM-2 as a processor. For this experiment we used 256 Weitek processors on the Connection Machine at NASA Ames. Results are given in Table 2. Clearly, the partitioned method is superior, by a factor roughly equal to the ratio of the number of levels in G(L) (which is n in this case) to the number of factors in the partition of L (one).
In Table 3 we give the size of these factors, the number of levels (this is proportional to the time required for our parallel substitution algorithm), and the number of factors, which is in practice proportional to the time required by the partitioned inverse approach.
These results confirm that the time required to solve a triangular system by partitioning of the inverse is quite well predicted by the number of factors in the partition. They also show that the number of levels in G(L) is a good predictor of the time required for solution by substitution methods. We see that when L has a fairly rich structure the partitioned inverse approach is much better than the substitution method, but when L is very sparse there is little to be gained. The use of an MLMD primary ordering improves both substitution and partitioned methods. However, with the introduction of the additional fill in the exact factor L_3 (compared with L_2), the number of levels in G(L) increases sharply (as does the time for substitution) while the number of factors in the best reordered partition drops dramatically. The difference in the solution time, even for this problem of modest size, is about a factor of twenty. Thus we conclude that the method can be quite useful in highly parallel machines when
the matrix L has a rich enough structure, as happens when it is an exact triangular
factor.
FIG. 8. Partitioning over the class of perfect elimination orderings of the filled graph G(F). The original ordering is shown on the left, and an ordering and a partition that solve problem (Pr3) are shown on the right. The figure illustrates that the upper and lower triangles of the matrix on the left are not preserved under the permutation, but the structure of F is preserved.
Problem (Pr3) turns out to be much harder than (Pr2), but can be solved by developing several results concerning transitive perfect elimination orderings: i.e., perfect elimination orderings of subgraphs of chordal graphs which make them transitively closed subgraphs as well. Algorithms to solve (Pr3) will be reported in [14, 15].
The first issue concerns the uniqueness of the minimum-cardinality partition. The algorithms that we have designed for computing the partitions in all of the above problems belong to the class of greedy algorithms. Further, for each i, they include as many elementary matrices as possible in a factor P_i subject to the condition that the factors P_1, ..., P_{i−1} have been obtained similarly. It is easily seen, however, that minimum-cardinality partitions need not be unique: for instance, for problem (Pr2-Cholesky), it is possible to design an algorithm that processes the vertices of the elimination tree from n down to 1 to compute the partition in the order P_m, ..., P_1. Such a partition would put, for each i, as many elementary matrices as possible into P_i subject to the condition that factors P_m, ..., P_{i+1} have been obtained similarly.
In the context of highly parallel triangular solution, it may be preferable to have each factor contain roughly the same number of nonzeros, since the matrix-vector multiplication involves as many virtual processors as the number of nonzeros.
There is some flexibility in assigning elementary matrices to factors, and this may
be exploited to reduce the disparity between the number of nonzeros in the different
REFERENCES
CLEVE ASHCRAFT†
Abstract. There are two classic column-based Cholesky factorization methods: the fan-out method, which communicates factor columns among processors, and the fan-in method, which communicates aggregate update columns among processors. In this paper we show that these two very different methods are members of a "fan-both" algorithm family.
To each member of this algorithm family is associated a pair of integers (q1, q2), where q1 q2 = p, the number of processors. The fan-out method is a (p, 1) method, while the fan-in method is a (1, p) method. Methods with 1 < q1, q2 < p have characteristics of both fan-out and fan-in, and thus give the family of methods its name.
The fan-out and fan-in methods have upper bounds on message counts of (p−1)|V| and message volume of (p−1)|E_L|, where |V| is the size of the matrix and |E_L| is the number of nonzero entries in the Cholesky factor. In general these bounds are (q1 + q2 − 2)|V| and (q1 + q2 − 2)|E_L|, and a (√p, √p) method has bounds 2(√p − 1)|V| and 2(√p − 1)|E_L|.
∗ This work has been partially supported by the University of Minnesota IMA Workshop on Graph Theory and Sparse Matrices.
† Boeing Computer Services, P. O. Box 24346, MS 7L-21, Seattle, Washington 98124, cleve@espresso.boeing.com
There are two key ideas inherent in the fan-both family of algorithms.
1. Both factor entries and aggregate update entries can be exchanged among processors.
2. The sets of processors that exchange each type of information can be restricted, and this can significantly reduce the communication.
These two ideas can be used in submatrix factorizations, where the fundamental data structure is a submatrix, and matrix-matrix computations form the fundamental computational tasks. An algorithm that sends entries of L, entries of L^T, and
aggregate update entries ∑_i l_{k,i} l_{j,i} can employ three dimensions of communication, with a resulting O(p^{1/3}|E_L|) communication volume. Section 6 outlines this extension and presents some concluding remarks.
2. Notation. In this section we introduce the notation, the data structures and fundamental computational tasks that are performed during a column-based sparse Cholesky factorization. The matrix A is n × n and symmetric. The rows and columns of A are numbered 0, 1, ..., n−1. The nonzero structure of A is represented by a graph G_A = (V, E_A), where a_{k,j} ≠ 0 if and only if (k,j) ∈ E_A. The nonzero structure of the Cholesky factor L is represented in a similar manner by the graph G_L = (V, E_L), where l_{k,j} ≠ 0 if and only if (k,j) ∈ E_L. Throughout the paper we attempt to follow this convention: if i, j and/or k appear as indices, then i ≤ j ≤ k.
(1)
\begin{bmatrix} l_{j,j} \\ \vdots \\ l_{n-1,j} \end{bmatrix} l_{j,j}
= \begin{bmatrix} a_{j,j} \\ \vdots \\ a_{n-1,j} \end{bmatrix}
- \sum_{i=0}^{j-1} \begin{bmatrix} l_{j,i} \\ \vdots \\ l_{n-1,i} \end{bmatrix} l_{j,i}

or more compactly, l_{*,j} l_{j,j} = a_{*,j} − ∑_{i=0}^{j−1} l_{*,i} l_{j,i}. If A is sparse, then most of the a_{i,j} and l_{i,j} entries will be zero. The sets of indices for nonzero entries in row j and column j are given as follows:
R_j = {i | l_{j,i} ≠ 0}   and   C_j = {k | l_{k,j} ≠ 0}.
Equation (1) can now be written

(2)    l_{*,j} l_{j,j} = a_{*,j} − ∑_{i ∈ R_j \ {j}} l_{*,i} l_{j,i}
In this equation the '*' in l_{*,j} is short for C_j, the row indices of the nonzero entries of column j in L. The '*' in a_{*,j} denotes the indices (k,j) ∈ E_A with k ≥ j. The '*' in l_{*,i} is short for C_i ∩ C_j, the row indices of column i that will update column j.
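As a small illustration (our own, not from the paper), both index sets can be built in one pass over the nonzero index pairs of L:

```python
def row_col_structures(n, nonzeros):
    """R_j = {i : l_{j,i} != 0} and C_j = {k : l_{k,j} != 0}, built from
    the (row, col) pairs of nonzero entries of L (diagonal included,
    row >= col); columns are numbered 0..n-1 as in the paper."""
    R = {j: set() for j in range(n)}
    C = {j: set() for j in range(n)}
    for k, j in nonzeros:    # l_{k,j} != 0, with k >= j
        R[k].add(j)
        C[j].add(k)
    return R, C
```

Note that ∑_j |C_j| is exactly the number of stored entries of L, which is why |E_L| appears in the communication-volume bounds later in the paper.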
2.2. The fundamental data structures. The right-hand side of equation (2) is composed of a number of columns, one for each index in R_j. The vector associated with i in the row structure of j is defined below.

t^i_{*,j} = \begin{cases} a_{*,j} & \text{if } i = j \\ -l_{*,i}\, l_{j,i} & \text{if } i \ne j \end{cases}
The '*' in t^i_{*,j} is short for C_i ∩ C_j. The subscript j means this vector will contribute to l_{*,j}. The superscript i means that the update is from column i. The superscript index can be extended to sets. For example,

t^{{h,i}}_{*,j} = t^h_{*,j} + t^i_{*,j}

is the sum of two component columns. The '*' subscript in t^{{h,i}}_{*,j} is the set C_j ∩ (C_h ∪ C_i). In general, for a set α we have:

t^α_{*,j} = ∑_{i∈α} t^i_{*,j}

With this notation, equation (2) becomes

(3)    l_{*,j} l_{j,j} = t^{R_j}_{*,j}

The best way to interpret this equation is to view the temporary vector t^{R_j}_{*,j} as a vector that will accumulate the original column a_{*,j} and updates from preceding factor columns.
for i ∈ R_j \ {j}
    update t^{R_j}_{*,j} := t^{R_j}_{*,j} − l_{*,i} l_{j,i}
scale l_{*,j} l_{j,j} = t^{R_j}_{*,j}
There are three types of computations performed in equation (3). The first type is to load the original entries in column j into the accumulation vector. The second type is to perform the update from column i to column j. The third type is to compute the factor column by scaling the accumulated column. The update task is the cmod(j,i) task well known in the literature, while the scale task is the cdiv(j) task [13].
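The three task types can be seen together in a serial, dense, left-looking column Cholesky sketch. This is only illustrative (the paper's setting is sparse and distributed), but the loop structure below mirrors the load / cmod / cdiv organization of equation (3).

```python
import math

def column_cholesky(A):
    """Left-looking column Cholesky organized as the load / update (cmod)
    / scale (cdiv) tasks of equation (3). A is a dense symmetric positive
    definite matrix as a list of lists; returns the lower triangular L."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for j in range(n):
        # load: t accumulates a_{*,j} (rows j..n-1 of column j)
        t = [A[k][j] for k in range(j, n)]
        # update: cmod(j, i) for each preceding column with l_{j,i} != 0
        for i in range(j):
            if L[j][i] != 0.0:
                for k in range(j, n):
                    t[k - j] -= L[k][i] * L[j][i]
        # scale: cdiv(j), i.e. l_{*,j} l_{j,j} = t
        d = math.sqrt(t[0])
        for k in range(j, n):
            L[k][j] = t[k - j] / d
    return L
```

For example, column_cholesky([[4.0, 2.0], [2.0, 5.0]]) returns [[2.0, 0.0], [1.0, 2.0]]. In a sparse code the inner loops would run only over the index sets C_j and C_i ∩ C_j defined above.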
Subsection 3.2 describes two ways by which an l_{*,j} factor column can be computed. At step j, the entries a_{*,j} will be loaded, and the column will be scaled. However, there is considerable freedom in specifying which update operations are performed. At step j, a backward-looking method executes updates to column j from preceding
Subsection 3.3 analyzes the communication that will be performed during a distributed factorization. We measure two simple properties, the number of messages and the number of floating-point numbers communicated. These two characteristics are precisely determined by the computation map.
In this section we do not impose any special format on the computation map. Data structures, algorithms and communication analysis are described for a general computation map. In Section 4 we will discuss and analyze a family of algorithms defined by a specific type of computation map.
3.1. The computation map. The responsibility for computing and accumulating the t^i_{*,j} temporary vectors will be distributed among p processors in some fashion. The manner in which the computations are distributed is defined by the computation map from the edge set of L to the processor ids:
For i < j and (j,i) ∈ E_L, processor m(j,i) is responsible for computing t^i_{*,j} = −l_{*,i} l_{j,i}. Processor m(j,j) performs the load and scale tasks for column j.
The structure of row j in L can be partitioned into a number of disjoint sets, based on which processor owns the t^i_{*,j} temporary vector. Let R_j^q = {i ∈ R_j | m(j,i) = q} denote the indices owned by processor q. For column j, the right-hand side of equation (3) can then be expressed as the sum

t^{R_j}_{*,j} = ∑_{q=0}^{p−1} t^{R_j^q}_{*,j}

To simplify the notation, we write t^{R_j^q}_{*,j} ≡ t^q_{*,j}. Note the difference between t^i_{*,j} and t^q_{*,j}: the first has a column id as a superscript, the second a processor id.
The total accumulated vector t^{R_j}_{*,j} we write as t^*_{*,j}. This column is the sum of at most p nonzero partial accumulated vectors.

t^*_{*,j} = t^{R_j}_{*,j} = ∑_{q=0}^{p−1} t^q_{*,j} = t^0_{*,j} + t^1_{*,j} + ... + t^{p−1}_{*,j}
The partial vector t^q_{*,j} will be nonzero only if the index set R_j^q is nonempty.
It will be useful to specify the sets of processors that interact with each column j, either performing updates to j from preceding columns, or using l_{*,j} to update succeeding columns. We define the subsets of {0, 1, ..., p−1}

β_j = { m(j,i) | i ∈ R_j }   and   α_j = { m(k,j) | k ∈ C_j }.
An edge (j,i) ∈ E_L is identified with each update and scale computation task. At step j, the processors in α_j perform updates in forward Cholesky, while the processors in β_j perform updates in backward Cholesky. The assembly tasks are present in the distributed Cholesky algorithms, but are not found in the serial general sparse method. The additions in the assembly tasks can be considered overhead, but through careful coding the extra additions can be avoided. The major source of overhead in the distributed factorizations is the communication of factor and/or update columns among the processors. The number of messages and their volume are analyzed in the following subsection.
if q = m(j,j) then
    // processor owns the load and scaling tasks
    // load partial accumulations from self
    load t^*_{*,j} := t^q_{*,j}
    // receive and assemble partial accumulations from other processors
    for r ∈ β_j \ {q}
        receive t^r_{*,j} from processor r
        assemble t^*_{*,j} := t^*_{*,j} + t^r_{*,j}
    // scale the column
    scale l_{*,j} l_{j,j} = t^*_{*,j}
    // send the factor column to all processors that need it
    for r ∈ α_j \ {q}
        send l_{*,j} to processor r
    // use the factor column in owned update tasks
    for k ∈ C_j^q \ {j}
        update t^q_{*,k} := t^q_{*,k} − l_{*,j} l_{k,j}
else
    if q ∈ β_j then
        // send partial accumulation to the owning processor
        send t^q_{*,j} to processor m(j,j)
    if q ∈ α_j then
        // receive the factor column and use in update tasks
        receive l_{*,j} from processor m(j,j)
        for k ∈ C_j^q
            update t^q_{*,k} := t^q_{*,k} − l_{*,j} l_{k,j}
If one knows when a t_{*,k} column is to receive its first update or assembly, the operation can be performed without the additions, as update t^q_{*,k} := −l_{*,j} l_{k,j} or assemble t^*_{*,k} := t^r_{*,k}.
if q = m(j,j) then
    // processor owns the load and scaling tasks
    // compute owned updates to column j
    for i ∈ R_j^q \ {j}
        update t^q_{*,j} := t^q_{*,j} − l_{*,i} l_{j,i}
    // receive and assemble partial accumulations from other processors
    for r ∈ β_j \ {q}
        receive t^r_{*,j} from processor r
        assemble t^q_{*,j} := t^q_{*,j} + t^r_{*,j}
    // scale the column
    scale l_{*,j} l_{j,j} = t^q_{*,j}
    // send the factor column to all processors that need it
    for r ∈ α_j \ {q}
        send l_{*,j} to processor r
else
    if q ∈ β_j then
        // compute owned updates to column j
        for i ∈ R_j^q
            update t^q_{*,j} := t^q_{*,j} − l_{*,i} l_{j,i}
        // send partial accumulation to the owning processor
        send t^q_{*,j} to processor m(j,j)
    if q ∈ α_j then
        // receive the factor column
        receive l_{*,j} from processor m(j,j)
Rather loose upper bounds can be obtained by noting that |α_j| ≤ p and |β_j| ≤ p, and that |V| is n, the number of vertices, and |E_L| is the number of edges in the graph of L.

traffic ≤ ∑_{j=0}^{n−1} (p + p − 2) = 2(p − 1)|V|

volume ≤ ∑_{j=0}^{n−1} (p + p − 2)|C_j| = 2(p − 1) ∑_{j=0}^{n−1} |C_j| = 2(p − 1)|E_L|
For each j, we have |α_j| ≤ |C_j| and |β_j| ≤ |R_j|. For large p there will be many columns where |R_j| < p and/or |C_j| < p, and the above bounds will not be achievable. A more realistic upper bound on communication is given below.

traffic ≤ ∑_{j=0}^{n−1} (min(p, |C_j|) + min(p, |R_j|) − 2)

volume ≤ ∑_{j=0}^{n−1} (min(p, |C_j|) + min(p, |R_j|) − 2)|C_j|
In a similar way we do not specify the manner in which update columns are sent among processors. For example, processors r and s may both have nonzero t^r_{*,j} and t^s_{*,j} partial accumulated columns that need to be accumulated on processor q = m(j,j). We assume for simplicity that each of the processors r and s sends its partial column to q, which performs the assembly task. It could be more efficient for processor r to send t^r_{*,j} to processor s, where it is accumulated into t^s_{*,j}, which is then sent to processor q. In general, the partial accumulated columns could be accumulated in a spanning tree with processor q at the root. This could prove more efficient than our simple model, for the total number of hops may be fewer. In any case, there will be at least |β_j| − 1 sends of a partial accumulated column for j and (|β_j| − 1)|C_j| entries in those messages.
where p = q1 q2. For example, six processors can be arranged in four different ways, where the processors are numbered column major in the mesh:

1 × 6: [ p0 p1 p2 p3 p4 p5 ]
2 × 3: [ p0 p2 p4 ; p1 p3 p5 ]
3 × 2: [ p0 p3 ; p1 p4 ; p2 p5 ]
6 × 1: [ p0 p1 p2 p3 p4 p5 ]^T
When p is prime, there are only two possible configurations, a 1 × p mesh and a p × 1 mesh. When p = 2^d, as for hypercubes, there are d + 1 possible configurations.
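The admissible mesh shapes for a given p are just its divisor pairs. The short sketch below (a hypothetical helper, not from the paper) enumerates them with column-major processor numbering, matching the six-processor example above.

```python
def mesh_configurations(p):
    """All q1 x q2 processor meshes with q1*q2 = p, processors numbered
    column major: mesh[r][c] = c*q1 + r."""
    configs = []
    for q1 in range(1, p + 1):
        if p % q1 == 0:
            q2 = p // q1
            mesh = [[c * q1 + r for c in range(q2)] for r in range(q1)]
            configs.append(((q1, q2), mesh))
    return configs
```

For p = 6 this produces four configurations; for the 2 × 3 mesh the rows are [p0 p2 p4] and [p1 p3 p5], as in the display above. For a prime p only the 1 × p and p × 1 shapes appear, and for p = 2^d there are d + 1 shapes.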
The column map takes its name from the following property. If (j_1, i) and (j_2, i) are both in E_L, then the owning processors m(j_1, i) and m(j_2, i) are both found in column c(i) of the processor mesh, for the updates to j_1 and j_2 are both from column i. In general, the set α_i of processors that will execute update tasks from column i to succeeding columns is found in column c(i) of the processor mesh.
In a similar way, r(j) is the row map. If column j receives updates from columns i_1 and i_2, the processors m(j, i_1) and m(j, i_2) owning these update tasks are both found in row r(j) of the processor mesh. For any column j, the set β_j of processors that will execute updates to column j is found in row r(j) of the processor mesh.
It is clear how to bound the sizes of the α_j and β_j sets. Each set α_j is found in a column of the processor mesh, and so |α_j| ≤ q1. Each set β_j is found in a row of the processor mesh, and so |β_j| ≤ q2. We can easily obtain loose upper bounds on the communication statistics.
the communication statisties.
n-I
(4) traffie = L: (Iajl + l(3jl- 2) ::; (ql + q2 - 2)1V1
j=O
n-I
(5) volume = L: (Iajl + l(3jl- 2) ICjl ::; (q1 + q2 - 2)IELI
j=O
It is simple to minimize these upper bounds with respect to the q1 and q2 mesh dimensions, subject to the constraint p = q1 q2. The √p × √p processor mesh configuration minimizes these bounds, giving 2(√p − 1)|V| and 2(√p − 1)|E_L|, which are one half the bounds for a general computation map.
In many cases these bounds may overestimate the message traffic and volume, even when the map is chosen to balance the computations over the processors. By noting that |α_j| ≤ |C_j| and |β_j| ≤ |R_j|, we obtain the following equations:

(6)    traffic ≤ ∑_{j=0}^{n−1} (min(q1, |C_j|) + min(q2, |R_j|) − 2)

(7)    volume ≤ ∑_{j=0}^{n−1} (min(q1, |C_j|) + min(q2, |R_j|) − 2)|C_j|
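Bounds (6) and (7) are easy to evaluate once the structures R_j and C_j are known. The following sketch (our own helper, not from the paper) computes both bounds for a given mesh shape:

```python
def fanboth_bounds(R, C, q1, q2):
    """Upper bounds (6) and (7) on message traffic and volume for a
    fan-both method on a q1 x q2 mesh, given the row structures R[j]
    and column structures C[j] of the Cholesky factor."""
    traffic = sum(min(q1, len(C[j])) + min(q2, len(R[j])) - 2 for j in R)
    volume = sum((min(q1, len(C[j])) + min(q2, len(R[j])) - 2) * len(C[j])
                 for j in R)
    return traffic, volume
```

For a dense 3 x 3 factor on a 2 x 2 mesh this gives a traffic bound of 4 messages and a volume bound of 8 entries; setting (q1, q2) to (p, 1) or (1, p) recovers the fan-out and fan-in bounds given later in the paper.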
There are two special members of this algorithm family that have appeared in the literature. The fan-out method [6] was the first column-based distributed Cholesky method to appear, and is forward-looking with a p × 1 processor configuration. The fan-in method [2] is backward-looking with a 1 × p processor configuration. Because of the historical importance and the simplicity of the algorithms, we present these methods in more detail in the following subsections.
4.1. The fan-out factorization algorithm. The fan-out method was not designed from the perspective of mapping computations to processors. Instead, the columns of the matrix were mapped to processors, and a very specific rule was used to specify the processor to perform the update from one column to another column.
The processor that "owns" column j will perform all updates from preceding columns i ∈ R_j. This means that all factor columns l_{*,i} that update column j must be resident on the processor owning column j, at least temporarily. Once column j has been computed, it must be sent to all processors that require it to update following columns k in C_j \ {j}. It is this "fanning-out" of a factor column to the processors that is the origin of the fan-out name. The processor mesh configuration is p × 1, and therefore c(j) = 0 for all columns j. The column processor set α_j is a subset of {0, 1, ..., p − 1} as usual, but the row processor set β_j is the singleton {r(j)}.
Since the row processor set β_j is {r(j)}, each partial accumulated column t^q_{*,j} is identically zero for q ≠ r(j). Therefore no partial accumulation columns are sent, and the storage for t^*_{*,j} can be just the storage for l_{*,j} on processor r(j). The factorization can be performed using one temporary vector to hold incoming factor columns from other processors. Once a factor column is received, it is immediately used to update all succeeding owned columns.
Equations (6) and (7) reduce to the following:

traffic ≤ ∑_{j=0}^{n−1} (min(p, |C_j|) − 1)   and   volume ≤ ∑_{j=0}^{n−1} (min(p, |C_j|) − 1)|C_j|
if q = m(j,j) then
    // processor owns the column
    load t^*_{*,j} := t^q_{*,j}
    // scale the column
    scale l_{*,j} l_{j,j} = t^*_{*,j}
    // send the factor column to all processors that need it
    for r ∈ α_j \ {q}
        send l_{*,j} to processor r
    // use the factor column in update tasks
    for k ∈ C_j^q \ {j}
        update t^q_{*,k} := t^q_{*,k} − l_{*,j} l_{k,j}
else if q ∈ α_j then
    // receive the factor column and use in update tasks
    receive l_{*,j} from processor m(j,j)
    for k ∈ C_j^q
        update t^q_{*,k} := t^q_{*,k} − l_{*,j} l_{k,j}
4.2. The fan-in factorization algorithm. Like the fan-out method, the fan-in method was not designed from the perspective of mapping computations to processors. The columns of the matrix were mapped to processors, and a second specific rule was used to specify the processor to perform the update from one column to another column.
The processor that "owns" column j will perform all updates to succeeding columns k ∈ C_j. This means that a temporary data structure t^q_{*,k} will exist for each processor q that owns a column l_{*,j} that will update column k. To complete the accumulation of t^*_{*,k}, the partial accumulation columns t^q_{*,k} must "fan-in" from the other processors and be assembled. The processor mesh configuration is 1 × p, and therefore r(j) = 0 for all columns j. The row processor set β_j is a subset of {0, 1, ..., p − 1}, and the column processor set α_j is simply {c(j)}.
The fan-in factorization algorithm is also simplified when compared to the general formulation. Figure 4.2 presents a variant of the fan-in algorithm where only one temporary vector is needed to accumulate the partial column t^q_{*,j}, where R_j^q ≠ ∅. This column will either be sent to the owner of column j or will be factored, but it does not need to be computed before step j of the factorization. This is because all l_{*,i} factor columns are always resident on the processor that needs to use them.
Equations (6) and (7) reduce to the following:

    traffic <= sum_{j=0}^{n-1} (min(p, |η_j|) - 1)   and   volume <= sum_{j=0}^{n-1} (min(p, |η_j|) - 1) |C_j|
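These specialized bounds can be evaluated directly from the symbolic structure of the factor. The following sketch (a hypothetical helper, not from the paper) takes the column structures C_j, the row structures η_j, and the processor count p:

```python
def tight_bounds(C, eta, p):
    """Evaluate the specialized message-traffic and volume bounds.

    C[j]   -- set of columns k > j that column j updates (so |C_j|)
    eta[j] -- set of columns i < j that update column j (so |eta_j|)
    p      -- number of processors

    Returns (fan-out traffic, fan-out volume, fan-in traffic, fan-in volume).
    """
    n = len(C)
    fo_traffic = sum(max(0, min(p, len(C[j])) - 1) for j in range(n))
    fo_volume = sum(max(0, min(p, len(C[j])) - 1) * len(C[j]) for j in range(n))
    fi_traffic = sum(max(0, min(p, len(eta[j])) - 1) for j in range(n))
    fi_volume = sum(max(0, min(p, len(eta[j])) - 1) * len(C[j]) for j in range(n))
    return fo_traffic, fo_volume, fi_traffic, fi_volume
```

The max(0, ...) guard merely protects against empty structures (e.g. the last column, which updates nothing).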
5. Empirical studies. The fan-out and fan-in methods have been implemented
in several studies [2], [3], [6], [8], [9]. To our knowledge, fan-both methods have yet
to be implemented on any distributed architecture. In this section we examine the
fan-both family of algorithms on four test matrices, two from the Harwell-Boeing
matrix collection [4]. We are interested in the following questions:
if q = m(j,j) then
    // processor owns the column
    // load the original entries
    load t_{*,j} := a_{*,j}
    // compute updates from previously owned columns
    for i in η^q_j \ {j}
        update t_{*,j} := t_{*,j} - l_{*,i} l_{j,i}
    // receive and assemble partial columns from other processors
    for r in β_j \ {q}
        receive t^r_{*,j} from processor r
        assemble t_{*,j} := t_{*,j} + t^r_{*,j}
    // scale the column
    scale l_{*,j} l_{j,j} = t_{*,j}
else if q in β_j then
    // compute updates from previously owned columns
    // and send the partial column to processor m(j,j)
    for i in η^q_j
        update t^q_{*,j} := t^q_{*,j} - l_{*,i} l_{j,i}
    send t^q_{*,j} to processor m(j,j)
TABLE 1
Dimensions and Statistics of Four Test Matrices
The matrix BCSPWR10 models the electric grid of the eastern United States.
Power matrices are usually very sparse, and the Cholesky factor has relatively few fill
entries.
Matrix BCSSTK16 is a finite element model of a Corps of Engineers dam. The
finite elements used in this model appear to be linear hexahedral elements with three
degrees of freedom at each grid point. The original element structure for this problem
is no longer available, but we have recovered an element structure from the adjacency
structure of the matrix. The generated element mesh is compact, having a small
diameter of elements in any direction.
With the exception of the grid problems, each matrix was ordered by the multiple
minimum degree algorithm [12] and post-processed using the Jess and Kees algorithm
[11]. One hundred runs of the ordering were performed, and the best ordering with
respect to factor operations was chosen. The final ordering is a post-order traversal
of the resulting elimination tree.
Diagrams of the elimination trees for these matrices are given in Figures 5, 6, 7
and 8. The separator nodes are visible for GRD6363ND and BCSPWR10 as small
filled circles. In each figure, the nodes are equally spaced with respect to height, and
their horizontal displacement is determined as follows. The nodes are ordered in a
post-order traversal of the elimination tree that minimizes the stack storage for the
multifrontal method. The leaf nodes are equally spaced in the horizontal direction,
and the position of an interior node is the average of the positions of its children.
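This layout rule is easy to state programmatically. A minimal sketch, assuming a hypothetical representation with children lists (already in the desired order) and a list of roots:

```python
def tree_layout(children, roots):
    """Horizontal positions for drawing an elimination tree:
    leaves are equally spaced, and each interior node sits at the
    average of its children's positions."""
    pos = {}
    next_leaf = [0]  # mutable counter for the next free leaf slot

    def visit(v):
        if not children[v]:          # leaf: take the next slot
            pos[v] = float(next_leaf[0])
            next_leaf[0] += 1
        else:                        # interior: average of children
            for c in children[v]:
                visit(c)
            pos[v] = sum(pos[c] for c in children[v]) / len(children[v])

    for r in roots:
        visit(r)
    return pos
```

A production version would use an explicit stack rather than recursion, since elimination trees can be deep.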
[Figures 5-8: elimination trees for GRD6363ND, GRD151515ND, BCSPWR10 and BCSSTK16.]
    T_L = max_{j a root} T_j.

An upper bound on speedup for the factorization is simply the number of factor
operations divided by T_L.
For each of the four test matrices, the upper bounds on speedup in Table 1 are
quite modest with respect to the amount of computation. We will consider the cases
where the number of processors used to compute the factorization is the closest square
number to the maximum speedup. The average number of degrees of freedom, entries
and factor operations per processor is given in Table 2.
TABLE 2
Storage and Work per Processor
The power network matrix is definitely an outlier, where the amount of parallelism
for the computation is large. In some sense this can be inferred from the elimination
tree, which has short height and is very bushy. The other three matrices show a much
larger number of entries and operations per processor. As the matrices grow larger
in terms of storage and factor operations, the available parallelism does increase, but
so do the storage and work per processor. As the matrix size increases, machine
architectures must have more memory per processor and faster CPU's to maintain
relative performance.
5.2. The balanced mesh computation map. Some thought must be given
to designing a suitable computation map for each matrix and (q1, q2) processor config-
uration. After some experimentation we chose the following procedure to define the
row and column maps. It is a greedy procedure that attempts to balance computation
among rows and columns of the processor mesh.
To each node j are associated two weights in terms of floating point operations: a
forward weight f_j and a backward weight b_j.
The forward weight f_j includes the scale task for column j and all updates from j to
succeeding columns. The backward weight b_j includes all updates to j from preceding
columns and the scale task for column j. The forward computations for node j are
performed by the processors in column c(j) of the processor mesh, while the backward
computations are performed by the processors in row r(j) of the processor mesh.
The basic idea behind the balanced mesh map is to assign node j to the row
of processors with the least accumulated backward weight and to the column of
processors with the least accumulated forward weight.
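A minimal sketch of this greedy assignment, assuming the forward and backward weights f_j and b_j have already been computed and the nodes arrive in a chosen processing order (all names here are hypothetical, not the paper's code):

```python
def balanced_mesh_map(nodes, f, b, q1, q2):
    """Greedy balanced-mesh map: assign each node j to the mesh row with
    least accumulated backward weight and the mesh column with least
    accumulated forward weight.

    nodes  -- node indices in the order they are to be processed
    f[j]   -- forward weight of node j (scale task plus updates from j)
    b[j]   -- backward weight of node j (updates to j plus scale task)
    q1, q2 -- processor mesh dimensions
    """
    row_load = [0.0] * q1
    col_load = [0.0] * q2
    r, c = {}, {}
    for j in nodes:
        # backward computations go to the least-loaded mesh row
        r[j] = min(range(q1), key=lambda i: row_load[i])
        row_load[r[j]] += b[j]
        # forward computations go to the least-loaded mesh column
        c[j] = min(range(q2), key=lambda i: col_load[i])
        col_load[c[j]] += f[j]
    return r, c
```

For q2 = 1 (fan-out) the column loop is trivial and only b_j matters; for q1 = 1 (fan-in) only f_j matters, matching the special cases noted in the text.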
Note two special cases. The (p, 1) fan-out method has q2 = 1, so all processors lie
in one column of the mesh, and the forward weights f_j do not contribute to the map.
Likewise, the (1, p) fan-in method does not use the backward weights b_j to define the
map.
One crucial aspect of the map algorithm needs to be defined, namely the order
in which the nodes are processed to distribute the forward and backward weights.
We shortly will evaluate the fan-in, fan-both and fan-out methods with respect to
load balance, working storage and communication. We do this with respect to one
particular way of processing the nodes to obtain the computation map, that of the
post-order traversal of the elimination tree. We choose this in hopes of minimizing
the working storage of the fan-both method. Other sequences of nodes to define the
map could be considered.
1. A breadth-first traversal of the elimination tree could be useful if it is desired
that each processor start computation as soon as possible. On the other
hand, if a post-order traversal of the tree is used to generate the map, it
is still likely that the leaf nodes will be distributed fairly evenly over the
processors.
2. Ordering the nodes by descending order of the forward weights f_j should
balance operations for the fan-in method.
3. Ordering the nodes by descending order of the backward weights b_j should
balance operations for the fan-out method.
4. Ordering the nodes by descending order of the sums of the forward and
backward weights (b_j + f_j) should balance operations for the fan-both method.
Why do we not consider nested maps, e.g., a subtree-subcube map [8]? There are
two reasons.
1. Most of the elimination trees we have generated for the symmetric Harwell-
Boeing test matrices (using multiple minimum degree and post-processed
by the Jess and Kees algorithm) resemble that of BCSSTK16 in Figure 8.
There are two main trunks with many small subtrees hanging off the trunks.
For this matrix, 90% of the factor operations are performed in updates from
nodes in the trunks to nodes in the trunks. A subtree-subcube map would
allow at most a factor of two decrease in the communication in each of the
two large subtrees, and not decrease at all the communication involving the
large top separator.
2. For large numbers of processors, close to the maximum speedup for a matrix,
a subtree-subcube map will show little to moderate improvement. This can
be seen by examining equations (6) and (7), replacing q1 and q2 by functions
q1(j) <= q1 and q2(j) <= q2. Most of the nodes where |C_j| and |η_j| are large
occur near the top of the elimination tree, where q1(j) ≈ q1 and q2(j) ≈ q2.
Those nodes where q1(j) < q1 and q2(j) < q2 lie at the lower levels of the
elimination tree, and have small |C_j| and |η_j| values.
[Figure 9: load balance profiles for the four test matrices; panels labeled by processor counts (e.g. 40 processors).]
5.4. Working storage. One advantage of the classic fan-in and fan-out meth-
ods is that they can be implemented using only one temporary storage vector for the
message buffer, whose size is that of the largest column in the factor matrix. They
are efficient with regard to working storage, an important property for architectures
with limited amounts of storage on each processor.
The fan-both method has a serious drawback in the amount of working storage
that is required. The worst scenario is to factor a symmetric dense matrix, where a
(√p, √p) method using a balanced mesh map requires as much as n²/(2√p) storage for
a processor, as opposed to n²/(2p) for a fan-in or fan-out method.
The working storage for a fan-both method is largely a function of the shape of
the elimination tree and the order in which the columns are processed. The load
balance study in Section 5.3 and the communication study in Section 5.5 are
independent of the particular order in which the columns are processed. Working storage
for a fan-both method is not. In our experiments we process the columns in the
post-order traversal of the elimination tree, as presented in Figures 5, 6, 7 and 8.
Figure 10 presents the working storage profiles for the four matrices and three
methods. We plot the profiles scaled by the average number of factor entries per
processor, and we do not count the one message buffer vector. Since fan-in and fan-
out use no extra working storage, their profiles are clustered near one on the y-axis.
There are two curves for the (√p, √p) fan-both method. The upper curve is for
a backward-looking method, where factor columns are held until all their updates
are completed. The lower curve is for a forward-looking method, where an update
column is created when the first update to it is performed, and is released when the
last has been computed.
(Note that the classic fan-in method has a (1, p) computation map implemented
as a backward-looking method, and the classic fan-out method is the dual, a (p, 1)
computation map with a forward-looking method. Their converses, a (1, p) forward
method and a (p, 1) backward method, require much more working storage than either
a forward or backward fan-both method, and are not practical.)
We see from these plots that the working storage for the (√p, √p) backward fan-
both method is always more than that of the corresponding forward method. The working
storage for the forward method is fairly small, never more than six times the average
number of factor entries per processor. (Note, here the working storage includes the
factor entries in columns owned by the processor, i.e., the overhead storage is the
profile minus one.) If storage per processor is tight, the fan-both method is at a
disadvantage.
5.5. Communication studies. The points we make are relevant when commu-
nication is the largest source of overhead for a distributed factorization. The fan-both
method has great potential to reduce the communication if the loose bounds of equa-
tions (4) and (5) are to be believed. However, when the number of processors is large,
we expect the loose bounds to overestimate the communication, particularly for the
fan-in and fan-out methods, and the tight bounds from equations (6) and (7) to be
more accurate.
[Figure 10: working storage profiles for the four matrices and three methods; panels include the GRD6363ND and GRD151515ND working storage profiles (e.g. 161 processors).]
Table 3 presents some statistics for the four test matrices. As in the preceding
load balance and working storage studies, the number of processors chosen for each
matrix is the square number closest to the maximum speedup. The balanced mesh
computation map is used to assign computations to processors.
TABLE 3
Communication Statistics, Tight Bounds, Loose Bounds and Observed
TABLE 4
Average Size of Communicated Columns
The statistics in Table 3 are the sums over all processors of the messages and
matrix entries, scaled by the matrix size and the number of factor matrix entries,
respectively. For each matrix and method, the loose bounds of equations (4) and
(5), the tight bounds of equations (6) and (7), and the observed communication are
presented.
Note that the loose bounds always overestimate the tight bounds and observed
communication, many times by quite a lot, particularly for the fan-in and fan-out
methods. The tight bounds do a much better job of predicting the expected commu-
nication.
We can try to understand where the majority of communication is performed for
each of the three methods. If a level of a node is measured by the maximum distance
from it to a leaf in the elimination tree, the nodes at high levels have larger column
sizes than nodes at the lower levels, with the exception of nodes near the root, where
the column size decreases to one at the root.
In Table 4 we present the average size of a factor column for each matrix, along
with the average size of a communicated column for each of the three methods. Note
that in all cases the average column that is communicated is larger than the average
column of the matrix. The small columns at the lower levels of the elimination tree
are communicated many fewer times than columns of larger sizes at the higher levels.
1. The fan-in method has the largest average column size, since the nodes at
the highest levels of the tree are most likely to receive the largest numbers of
aggregate update columns, and these nodes have the largest column sizes in
the matrix.
2. The fan-out method has the second largest average column size of the three
methods. This can be explained by realizing that the nodes within the first i
levels of the root cannot send more than i - 1 messages, for i <= p. Therefore,
it is columns on the "shoulders" of the elimination tree, with moderate to
large size, that are sent the most times in the fan-out method.
3. The fan-both method shows a more regular communication pattern for the
columns in different levels of the elimination tree. Each node can receive
as many as q2 - 1 aggregate update columns and send off at most q1 - 1
factor columns. For q1 = q2 = √p, it is likely that columns in many levels
of the tree, from the root down to near the leaves, can receive close to the
√p - 1 update columns and send off close to the √p - 1 factor columns. This
explains both the close correlations between the loose bounds, tight bounds
and observed communication for this method, as well as why the average
communicated column size is fairly close to the average column size of the
factor matrix.
Figures 11, 12, 13 and 14 present the profiles of the number of messages sent
and received and the number of matrix entries sent and received by the processors.
For the two grid matrices and the dam matrix, the fan-both method always has less
communication than the fan-in and fan-out methods, and the profiles are more flat,
meaning that the communication is more evenly distributed among the processors.
The average slopes of the profiles are largest for the fan-out method and smallest
for the fan-both method. The slope of the fan-in method lies somewhere between.
There are sizable spikes at the right and/or left ends of the profiles. This probably
means that there is a reasonably small number of nodes that either send or receive
the full number of columns possible, and that the processors owning these columns
have their statistics increased accordingly.
The power matrix has different profiles. For once the fan-in method generates the
fewest messages, though the fan-both method has the least volume.
[Figures 11-14: send/receive message and volume profiles for the test matrices; panels include the GRD151515ND send and receive volume profiles (e.g. 49 processors, 118 processors).]
the number of matrix entries communicated during the factorization is bounded above
by 3√p |E_L|, a lower complexity than the column-based methods in this paper. This
algorithm is presented in [1] for the dense LU factorization.
The fan-both method has great potential to outperform the classic fan-in and fan-
out methods, particularly on architectures where communication cost is relatively
high or where large numbers of processors are used. Its one drawback is non-trivial:
the amount of working storage required to hold external factor and aggregate update
columns. We do have prototype codes running on an i860 hypercube, and the fan-both
method does compute the factorization faster than the fan-in and fan-out methods.
We refrain from presenting incomplete results at this time. Valid comparisons must
include the multifrontal method, as well as computation maps designed specifically
for the fan-in and fan-out methods.
REFERENCES
ROBERT SCHREIBER
Abstract. We shall say that a scalable algorithm achieves efficiency that is bounded away from
zero as the number of processors and the problem size increase in such a way that the size of the
data structures increases linearly with the number of processors. In this paper we show that the
column-oriented approach to sparse Cholesky for distributed-memory machines is not scalable. By
considering message volume, node contention, and bisection width, one may obtain lower bounds
on the time required for communication in a distributed algorithm. Applying this technique to
distributed, column-oriented, dense Cholesky leads to the conclusion that N (the order of the matrix)
must scale with P (the number of processors) so that storage grows like P². So the algorithm is not
scalable. Identical conclusions have previously been obtained by consideration of communication and
computation latency on the critical path in the algorithm; these results complement and reinforce
that conclusion.
For the sparse case, both theory and some new experimental measurements, reported here, make
the same point: for column-oriented distributed methods, the number of gridpoints (which is O(N))
must grow as P² in order to maintain parallel efficiency bounded above zero. Our sparse matrix
results employ the "fan-in" distributed scheme, implemented on machines with either a grid or a
fat-tree interconnect using a subtree-to-submachine mapping of the columns.
The alternative of distributing the rows and columns of the matrix to the rows and columns of
a grid of processors is shown to be scalable for the dense case. Its scalability for the sparse case
has been established previously [10]. To date, however, none of these methods has achieved high
efficiency on a highly parallel machine.
Finally, open problems and other approaches that may be more fruitful are discussed.
Two lines of attack have been taken up to now. The MIMD, message-passing-
machine community has tended to concentrate on methods that are column ori-
ented [2, 3, 4, 9, 14, 18, 19, 30]. In these methods, columns of the matrix A and
its Cholesky factor L are assigned to processors in some way - column j is held by
processor map(j) and map is determined as part of the method. Furthermore, the
methods organize the computation as a collection of column-oriented tasks: sparse
column scaling and sparse DAXPY. This class of methods has also been proposed
and used for the dense problem on message-passing machines [1]. When scalability is
• Research Institute for Advanced Computer Science, MS T045-1 NASA Ames Research Center,
Moffett Field, CA 94035. This author's work was supported by the NAS Systems Division via
Cooperative Agreement NCC 2-387 between NASA and the University Space Research Association
(USRA).
not a primary issue, these methods may be entirely appropriate. They are likely to
be very useful on moderately parallel, shared memory machines.
A second approach is to map the data in two dimensions. This approach is favored
by Gilbert and Schreiber [10], Kratzer [15], and Venugopal and Naik [29]. Recently,
Dongarra, Van de Geijn, and Walker [7] have shown the value of this approach for
the dense problem on MIMD message passing machines; the author has also used it
successfully for the dense problem on the Maspar MP-1, a massively parallel SIMD
machine.
In this paper we investigate the scalability of these classes of methods for dis-
tributed sparse Cholesky factorization. By a scalable algorithm for this problem, we
mean one that maintains efficiency bounded away from zero as the number P of pro-
cessors grows and the size of the data structures grows roughly linearly in P. We
concentrate on the model problem arising from the 5-point, finite difference stencil
on an Ng x Ng grid. We will show that the column-oriented methods cannot work
well when the number of gridpoints (N = Ng²) grows like O(P) or even O(P log P).
We show that communication will make any column-oriented, distributed algorithm
useless, no matter what the mapping of columns to processors. This is true because
column-oriented distribution is very bad for dense problems of order N when N is
not large compared with P.
It is reasonable to ask why one should be concerned with machines having thou-
sands of processors. Figure 1 should illustrate the reasons for believing that supercom-
puter architecture is now making an inevitable and probably permanent transition
from the modestly parallel to the highly parallel (257 - 4,096 processors) or massively
parallel (4,097 - 65,536 processors). The following estimation of supercomputer ar-
chitecture during the coming decade helps motivate the work presented here.
[Figure 1: performance trends of uniprocessors and desktop processors, 1984-1993.]
efforts for dense problems include those of Li and Coleman [17] for dense triangu-
lar systems, and Saad and Schultz [28]; Ostrouchov, et al. [22], and George, Liu,
and Ng [8] have made some analyses for the sparse, column-mapped algorithms.
An interesting analysis of the effect of a memory hierarchy on sparse Cholesky has
been provided by Rothberg and Gupta [26]. These investigators, working with a
nonuniform-access shared-memory system, have recently come to conclusions similar
to ours [27].
In Section 2 we introduce distributed implementations of Cholesky factorization;
Section 3 develops some lower bounds on communication time; in Section 4 we com-
pute these bounds for the dense case and use them to illustrate the problem with
column mapping; Section 5 extends this work through an experiment for the sparse
case; in Section 6 we consider the problems that are still unresolved.
cholesky(A, N)
    for k = 1 to N do
        cdiv(k);
        for j = k + 1 to N do
            cmod(j, k);
        od
    od
Procedure cdiv(k) computes the square root of the diagonal element A_kk and scales
the kth column of A by 1/√A_kk to produce the kth factor column L_{*k}; procedure
cmod(j, k) subtracts L_jk times the kth column from the jth column.
The execution order of this program is not the only one possible. The true depen-
dences require only that cmod(j, k) must follow cdiv(k) and cdiv(k) must follow all
the cmod(k, l) for l < k and L_kl ≠ 0. A second form of Cholesky is this:
cholesky(A, N)
    for k = 1 to N do
        for l = 1 to k - 1 do
            cmod(k, l);
        od
        cdiv(k);
    od
The first form is sometimes called "submatrix" Cholesky and sometimes called
a "right-looking" method. The second form goes by the names "column" or "left-
looking".
In the sparse case, sparsity is exploited within the vector cdiv and cmod operations.
Furthermore, most cmod operations are omitted altogether because the multiplying
scalar L_jk is zero.
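As a concrete illustration of the cdiv/cmod formulation, here is a dense, unoptimized sketch (column operations on the lower triangle of a full symmetric matrix; not the paper's distributed code):

```python
import math

def cdiv(A, k):
    """Take the square root of the diagonal and scale column k below it."""
    A[k][k] = math.sqrt(A[k][k])
    for i in range(k + 1, len(A)):
        A[i][k] /= A[k][k]

def cmod(A, j, k):
    """Subtract L[j][k] times column k from column j (rows j..n-1)."""
    for i in range(j, len(A)):
        A[i][j] -= A[j][k] * A[i][k]

def cholesky(A):
    """Left-looking ("column") Cholesky: all cmods into k, then cdiv(k)."""
    for k in range(len(A)):
        for l in range(k):
            cmod(A, k, l)
        cdiv(A, k)
    return A  # lower triangle now holds the factor L
```

Exchanging the loop order (cdiv(k) first, then cmod(j, k) for all j > k) gives the submatrix, right-looking form; both orders respect the true dependences stated above.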
The column-oriented distributed methods map columns to processors; column k
is stored at processor map[k]. The operation cdiv(k) is performed at map[k]. The
operation cmod(j, k) may be performed at map[j], in which case the column L_{*k}
must be sent out from map[k] after the cdiv(k) is performed. This approach is known
as a "fan-out" implementation. Alternatively, the cmod(j, k) may be performed at
map[k], as follows. Consider the set of updates to column j. There is one update
(cmod(j, k)) for each k < j such that L_jk ≠ 0. Processor π can compute the scaled
column L_jk L_{*k} for each such k for which map[k] = π. It then adds these scaled
columns together to form an "aggregate update" vector u[j, π] and sends this vector
to processor map[j]. All communication is in the form of these aggregate updates.
Whenever one arrives at a processor, it is subtracted from the updated column. This
method is known as the "fan-in" distributed algorithm.
The node code for a fan-in method is shown in Figure 2. The data structure at
processor π is the integer N-vector map, the columns of A and L mapped to π, the
set mycols = {k | map[k] = π}, and the sets row[j, π] = {k | map[k] = π and L_jk ≠
0}, 1 <= j <= N.
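A serial sketch of how the aggregate updates u[j, π] are formed (a simulation of the communication pattern, not the Figure 2 node code; the names and dense column representation are hypothetical, with L_cols[k][i] holding L_ik):

```python
def fan_in_aggregates(L_cols, map_, P):
    """For each destination column j and each processor pi, sum the scaled
    columns L[j][k] * L[:,k] over pi's owned columns k with L[j][k] != 0.
    Each nonempty sum for pi != map_[j] is one aggregate-update message
    sent to processor map_[j]."""
    n = len(L_cols)
    aggregates = {}
    for j in range(n):
        for pi in range(P):
            u = None
            for k in range(j):
                if map_[k] == pi and L_cols[k][j] != 0:
                    scaled = [L_cols[k][j] * L_cols[k][i] for i in range(n)]
                    u = scaled if u is None else [a + b for a, b in zip(u, scaled)]
            if u is not None and pi != map_[j]:
                aggregates[(j, pi)] = u
    return aggregates
```

The point of the aggregation is visible in the data structure: no matter how many owned columns update j, processor π sends at most one vector to map_[j].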
As befits an MIMD code, the schedule of computation is not that presented in
either of the sequential methods above. Instead computations occur at times that
depend on the sequence of arriving data, and are not determined in advance.
It is clear from this code that running time has an O(N) term, because of the
outer, sequential for loop; this alone can be shown to imply nonscalability. Actual,
efficient implementations, however, avoid this sequential loop.
Let V be the set of all processors and L be the set of all communication
links.
We assume identical links. Let β be the inverse bandwidth (slowness) of a link in
seconds per word. (We ignore start-up costs in this model.)
We assume that processors are identical. Let φ be the inverse computation rate
of a processor in seconds per floating-point operation. Let β0 be the rate at which a
processor can send or receive data, in seconds per word. We expect that β0 and β
will be roughly the same.
A distributed-memory computation consists of a set of processes that exchange
information by sending and receiving messages. Let M be the set of all messages
communicated. For m ∈ M, |m| denotes the number of words in m. Each message
m has a source processor src(m) and a destination processor dest(m), both elements
of V.
For m ∈ M, let d(m) denote the length of the shortest machine path from the
source of the message m to its destination. We assume that each message takes
a certain path of links from its source to its destination processor. Let p(m) =
(l_1, l_2, ..., l_{d(m)}) be the path taken by message m. For any link l ∈ L, let the set of
messages whose paths utilize l, {m ∈ M | l ∈ p(m)}, be denoted M(l).
The following are obviously lower bounds on the completion time of the compu-
tation. The first three bounds are computable from the set of messages M, each of
which is characterized by its size and its endpoints. The last depends on knowledge
of the paths p(M) taken by the messages.
1. (Average flux)
       ( Σ_{m ∈ M} |m| d(m) ) β / |L|.
2. (Bisection width) Given a partition of V into disjoint sets V0 and V1, let
   sep(V0, V1) be the number of links joining V0 and V1, and let flux(V0, V1)
   be the number of words that must pass between V0 and V1. The bound is
       flux(V0, V1) β / sep(V0, V1).
3. (Arrivals and departures)
       max_{v ∈ V} Σ_{dest(m) = v} |m| β0   and   max_{v ∈ V} Σ_{src(m) = v} |m| β0.
4. (Edge contention)
       max_{l ∈ L} Σ_{m ∈ M(l)} |m| β.
Of course, the actual communication time may be greater than any of the bounds.
In particular, the communication resources (the wires in the machine) need to be
scheduled. This can be done dynamically or, when the set of messages is known in
advance, statically. With detailed knowledge of the schedule of use of the wires, better
bounds can be obtained. For the purposes of analysis of algorithms and assignment
of tasks to processors, however, we have found this more realistic approach to be
unnecessarily cumbersome. We prefer to use the four bounds above, which depend
only on the integrated (i.e. time-independent) information M and, in the case of the
edge-contention bound, the paths p(M).
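Once the message set is known, the bounds are simple to evaluate; a sketch (the message representation here is hypothetical):

```python
from collections import defaultdict

def communication_lower_bound(messages, num_links, beta, beta0):
    """Each message m is a dict: 'size' in words, 'src', 'dest',
    'dist' (shortest-path length d(m)), 'path' (list of link ids).
    Returns the largest of the computable lower bounds on communication
    time; the bisection-width bound needs a machine partition and is
    omitted here."""
    # 1. average flux: word-distance product over total bandwidth |L|/beta
    avg_flux = sum(m['size'] * m['dist'] for m in messages) * beta / num_links
    # 3. arrivals and departures: busiest receiver and busiest sender
    recv, send = defaultdict(int), defaultdict(int)
    for m in messages:
        recv[m['dest']] += m['size']
        send[m['src']] += m['size']
    arrivals = max(recv.values(), default=0) * beta0
    departures = max(send.values(), default=0) * beta0
    # 4. edge contention: busiest single link (needs the paths p(m))
    load = defaultdict(int)
    for m in messages:
        for link in m['path']:
            load[link] += m['size']
    contention = max(load.values(), default=0) * beta
    return max(avg_flux, arrivals, departures, contention)
```

Since each quantity individually bounds the completion time from below, so does their maximum.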
TABLE 1
Average interprocessor distances.

          Grid          Torus
   2D    (2/3)√P       (1/2)√P
   3D    P^(1/3)       (3/4)P^(1/3)
the path cdiv(1), cmod(2, 1), cdiv(2), cmod(3, 2), .... By making column operations
an atomic unit of computation, we have lengthened the critical path from O(N) to
O(N²) operations. Therefore, at most O(N) processors can be used efficiently.
Next, consider communication costs on two-dimensional grid or toroidal machines.
Suppose that P is a perfect square and that the machine is a √P x √P grid. (This
assumption is not necessary for our conclusions, but it simplifies things.)
Consider a mapping of the computation in which the operation cmod(j, k) is per-
formed by processor map(j) (a fan-out method). After performing the operation
cdiv(k), processor map(k) must send column k to all processors {map(j) | j > k}.
Two possibilities present themselves. These sends may be done separately and
sequentially by processor map(k), with separate messages each taking its own path
to the several destinations; or they may be sent through a spanning tree of the proces-
sor graph, whose root is processor map(k) and whose nodes include the destination
processors.
In order to compute average flux, we need to know the average path length tra-
versed by the messages. Let us assume that N is greater than P. (Otherwise we
clearly have idle processors.) We first assume that the average message distance is
just the average distance between two randomly chosen processors in the machine.
For a mesh in 2D this is (2/3)√P; for a 2D torus it is (1/2)√P. In 3D the square
roots become cube roots and the constants change to 1 for grids and (3/4) for tori
(Table 1). Even if we are clever about assigning data to processors, and we place
the early, large columns in the middle of the grid, we can at best reduce the average
distances by a modest constant factor. So we will stick to the estimate based on
random positions for source and destination.
Let us fix our attention on 2D grids. If separate messages are sent, the total flux
is (1/3)N2 p3/2. There are a total of ILI = 2P links; the total machine bandwidth is
roughly 2P/f3 and the flux-per-link bound is (1/6)N 2.../Pf3 seconds.
With spanning tree multicast, the "average distance" computation changes. Most of the sends will use a tree of total length P reaching all the processors. Every matrix element will therefore travel over P links, so the total information flux is (1/2)N²P and the average flux bound is (1/4)N²β seconds.
With multicast, only O(N²/P) words leave any processor. If N ≫ P, processors see almost the whole (1/2)N² words of the matrix as arriving factor columns. The bandwidth per processor is β₀, so the arrivals bound is (1/2)N²β₀ seconds. If N ≈ P the bound drops to half that, (1/4)N²β₀ seconds.
TABLE 2
Communication Costs for Column-Mapped Full Cholesky.
Consider a bisection of the machine through its vertical midline. Since most sends must arrive at all processors, we may approximate the flux across the line by assuming that every factor column crosses. With individual messages, it crosses (1/2)P times, for a total flux of (1/4)N²P words. With spanning tree multicast, the shape of the tree plays a role. The number of crossings is at least one. This observation leads to a weak bound. Instead, we will use a more realistic estimate that is not in fact a bound. A realistic assumption is that on average the tree intersects the cut in (1/2)√P edges, since the tree uses half of all edges and there are √P of them in the cut. Thus the flux is (1/4)N²√P words. The resulting lower bounds are (1/4)N²√P β seconds with separate messages and (1/4)N²β seconds with tree multicast.
We summarize these bounds for 2D grids in Table 2.
From the critical path, average work per processor, and the bisection width bounds, we have that the completion time is roughly max(N²φ, N³φ/(3P), (1/4)N²β) with tree multicast and max(N²φ, N³φ/(3P), (1/4)N²√P β) with separate messages. Contours of efficiency (in the case P = 1,024) are shown in Figures 3 and 4.
We can immediately conclude that without spanning tree multicast, this is a nonscalable distributed algorithm. We suffer a loss of efficiency as P is increased, with speedup limited to O(√N). Even with spanning tree multicast, we may not take P > Nφ/β and still achieve high efficiency. For example, with β = 10φ and P = 1,000, we require N > 12,000 (72,000 matrix elements per processor) in order to achieve 50% efficiency. This is excessive for dense problems and will prove to be excessive in the sparse case, too.
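The 50% figure can be reproduced by a rough model. This is an illustration only: it assumes the completion time is the sum (rather than the max) of a critical-path term N²φ, a work term N³φ/(3P), and a bisection term N²β/4, with the constants as given here.

```python
# A rough efficiency model for column-mapped dense Cholesky with tree
# multicast. Summing the compute and communication terms, and the exact
# constants used, are assumptions made for illustration.

def efficiency(N, P, phi=1.0, beta=10.0):
    serial = N**3 * phi / 3                  # sequential flop time
    t_path = N**2 * phi                      # critical-path term
    t_work = N**3 * phi / (3 * P)            # average work per processor
    t_comm = N**2 * beta / 4                 # bisection bound, tree multicast
    return serial / (P * (t_path + t_work + t_comm))

# With beta = 10*phi and P = 1,000, efficiency crosses 50% between
# N = 6,000 and N = 12,000, consistent with the estimate in the text.
print(efficiency(12_000, 1_000))   # ~0.53
print(efficiency(6_000, 1_000))    # ~0.36
```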
4.2. Mapping blocks. Dongarra, Van de Geijn, and Walker have already shown that on the Intel Touchstone Delta machine (P = 528), mapping blocks is better than mapping columns. In such a mapping, we view the machine as a Pr × Pc grid and we map elements A(i,j) and L(i,j) to processor (mapr(i), mapc(j)). We assume a cyclic mapping here: mapr(i) = i mod Pr, and similarly for mapc. In a right-looking method, two portions of column k are needed to update the block A(rows, cols): L(rows, k) and L(cols, k)
FIG. 3. Iso-efficiency lines for dense Cholesky with column cyclic mapping; separate messages.
FIG. 4. Iso-efficiency lines for dense Cholesky with column cyclic mapping, P = 1,024; tree multicast.
TABLE 3
Communication Costs for Torus-Mapped Full Cholesky.
(rows and cols are integer vectors here). Again, we may send the data in the form of individual messages from the Pr processors holding the data to those processors that need it, or we may use multicast.
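The 2D cyclic ("torus wrap") mapping just described can be sketched in a few lines (an illustration, not from the paper); the point is that it spreads even the triangular work pattern of Cholesky nearly evenly over the processor grid:

```python
# 2D cyclic mapping: element (i, j) lives on processor (i mod Pr, j mod Pc).
from collections import Counter

def owner(i, j, Pr, Pc):
    return (i % Pr, j % Pc)

# Count how many lower-triangle elements of an N x N matrix land on each
# processor; the cyclic mapping keeps the counts nearly equal.
N, Pr, Pc = 100, 4, 4
counts = Counter(owner(i, j, Pr, Pc) for i in range(N) for j in range(i + 1))
print(max(counts.values()) / min(counts.values()))   # close to 1
```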
The analysis of the preceding section may now be done for this mapping. Now the compute time must be at least N²φ max(1/(2Pr), N/(3P)); the longest path in the task graph has N²/(2Pr) multiplies and multiply-adds. For the multicast approach, the spanning trees are linear connections of the processors in machine rows and columns. With this information about the paths p(m) taken by messages, we may compute the use of the most heavily loaded edge. This bound dominates the average flux and bisection width. Results are summarized in Table 3. With Pr and Pc both O(√P), the communication time drops like O(P^{-1/2}). With this mapping and with efficient multicast, the algorithm is scalable even when β > φ. Note that P = O(N²) so that storage per processor is O(1). (In fact, this scalable algorithm for distributed Cholesky is due to O'Leary and Stewart in 1985 [21].)
Contours of efficiency for P = 1,024 and Pr = Pc = 32 are shown in Figures 5
and 6.
FIG. 5. Iso-efficiency lines for dense Cholesky with 2D cyclic mapping; separate messages.
FIG. 6. Iso-efficiency lines for dense Cholesky with 2D cyclic mapping; tree multicast.
[Plots of AvgOps and Max Arrivals against Grid Size Ng and Processor Grid Size Pr = Pc.]
do not, and efficiency is very poor when P is not much smaller than Ng.
Figures 9 and 10 show two measures of efficiency over a range of values of Ng and P, with the ratio fixed at one half and at two. The first is a measure of load imbalance, the number of operations done by the most heavily loaded processor (MaxOps) scaled by the average load (AvgOps). The other is a measure of relative communication overhead, the average computation load per processor (AvgOps) scaled by the average number of words transferred per communication link (AvgFlux).
The results for the dense case lead us to suspect that efficiency will be roughly constant if the ratio Ng/P is fixed.
The load-imbalance curve is practically flat, showing that load imbalance is not a concern with this scaling. We cannot conclude from this data that load imbalance would ruin efficiency with P = O(Ng). (We also cannot conclude that it would not.) But the communication-overhead curve is still dropping as P and Ng increase.
This confirms the main result of this work: one must scale the number of gridpoints at least as the square of the number of processors in order to have efficiency bounded above zero as P is increased. (For this implementation, using subtree-to-subgrid mapping to a grid, even that may not be fast enough.) Thus, the method is not scalable by our earlier definition.
Recently, Thinking Machines Corporation has introduced a highly parallel machine with a "fat-tree" interconnect scheme. A fat tree is a binary tree of nodes. Leaves are processors and internal nodes are switches. The link bandwidth increases geometrically with increasing distance from the leaves. These could potentially work significantly better than meshes, since average interprocessor distance is now O(log P)
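The O(log P) average distance in a tree interconnect is easy to verify by brute force (an illustration, modeling the network as a complete binary tree on P = 2^h leaves, ignoring the fat tree's extra bandwidth):

```python
# Average leaf-to-leaf distance in a complete binary tree with 2**h
# leaves. d(i, j) is twice the number of levels up to the lowest common
# ancestor, so the average grows like 2*(h - 1), i.e. O(log P).

def tree_dist(i, j):
    # Walk both leaves up toward the root until their paths merge.
    d = 0
    while i != j:
        i //= 2
        j //= 2
        d += 2
    return d

def avg_dist(h):
    P = 2 ** h
    return sum(tree_dist(i, j) for i in range(P) for j in range(P)) / P**2

for h in (4, 6, 8):
    print(h, avg_dist(h))   # approaches 2*(h - 1) as h grows
```

The exact value is 2(h − 1 + 2^{−h}), since a random pair of leaves shares 1 − 2^{−h} prefix bits on average.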
[Figures 9 and 10: MaxOps/AvgOps for Ng = (1/2)P and Ng = 2P.]
FIG. 11. Scaled communication and load balance for fat trees, with Ng ∝ P (curves for Ng = Np, Np/2, Np/4).
• The computation DAG is quite large. Methods that work with an uncompressed representation of this DAG suffer from excessive storage costs. (This idea is quite like the very old one of generating straight-line code for sparse Cholesky, in which the size of the program is proportional to the number of flops, and hence is larger than the matrix and its factor.)
Of course, Cholesky DAGs have underlying regularity that allows for compressed representations. One such representation is the structure of L. Others, smaller still, have been derived from the supernodal structure of L and are usually only as large as a constant multiple of the size of A.
• All approaches to the problem to date have employed an assignment of computation to processors that is derived from the structure of L rather than from the computation DAG. None has succeeded. It is not known, however, if this failure is due to a poor choice of assignment, or alternatively if any assignment based only on the structure of L must in some way fail, or indeed whether there is any assignment for sparse Cholesky computation DAGs that will succeed. These issues require some investigation.
• In these proceedings, Ashcraft proposes a new class of column-oriented methods in which the assignment of work to processors differs from the assignment used in the algorithms we have investigated. His approach may make for a substantial reduction in the average flux and bisection width requirements of the method, and so it should be investigated further. We note, however, that it will not reduce the length of the critical path, since it is based on the same task graph as all column-oriented methods.
• It appears that the scalable implementation of iterative methods is much easier than it is for sparse Cholesky. Indeed, even a naive distributed implementation of attractive iterative methods is quite efficient. For example, with a regular grid, simple mappings of gridpoints to processors allow fast calculation of matrix-vector products. Total flux is kept to a small fraction of the operation count by mapping compact subgrids to processors, so that most edges of the grid connect gridpoints that reside on the same processor. Recent work of Hammond [12], Pommerell, Annaratone, and Fichtner [24], and Pothen, Simon, and Wang [25] makes it clear that this can be done, at some noticeable but supportable preprocessing cost, even for irregular grids. When Krylov subspace methods are used, dot products may be annoying; but all that we require to make them tolerable is, at worst, that the number of gridpoints grow like P log P, not P². Useful, fully parallel preconditioners have also been developed. Finally, domain decomposition methods (which can be viewed as the class of preconditioned Krylov subspace methods designed to take advantage of spatial locality) are even more suitable in the distributed-memory environment. A good example of the power of parallel domain decomposition methods has recently been provided by Bjørstad and Skogen [6], who found that P = 16,384 was no impediment to the efficient solution of finite difference equations with Ng equal to only 640.
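The subgrid-mapping argument in the bullet above can be checked with a small count (an illustration, not from the paper): for an n × n five-point grid partitioned into p × p square subgrids, only the edges that cross subgrid boundaries generate interprocessor traffic.

```python
# Fraction of grid edges whose endpoints lie on different processors
# when an n x n 5-point grid is split into p x p square subgrids.
# The fraction shrinks like p/n, so most edges stay on-processor.

def cut_fraction(n, p):
    assert n % p == 0
    b = n // p                              # subgrid side length
    owner = lambda i, j: (i // b, j // b)   # which subgrid holds (i, j)
    total = cut = 0
    for i in range(n):
        for j in range(n):
            for di, dj in ((1, 0), (0, 1)):   # east and south neighbors
                ni, nj = i + di, j + dj
                if ni < n and nj < n:
                    total += 1
                    if owner(i, j) != owner(ni, nj):
                        cut += 1
    return cut / total

print(cut_fraction(64, 4))   # ~0.048: under 5% of edges cross processors
```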
• We conclude by admitting that it is not yet clear whether sparse direct solvers can be made competitive at all for highly (P > 256) and massively (P > 4,096) parallel machines.
Abstract. Massively parallel SIMD computers, in principle, should be good platforms for performing direct factorization of large, sparse matrices. However, the high arithmetic speed of these machines can easily be overcome by overhead in intra- and inter-processor data motion. Furthermore, load balancing is difficult for an "unstructured" sparsity pattern that cannot be dissected conveniently into equal-size domains. Nevertheless, some progress has been made recently in LU and QR factorization of unstructured sparse matrices, using some familiar concepts from vector-supercomputer implementations (elimination trees, supernodes, etc.) and some new ideas for distributing the computations across many processors. This paper describes programs based on the standard data-parallel computing model, as well as those using a SIMD machine to implement a dataflow paradigm.
1. INTRODUCTION
Computations involving large, sparse matrices have a great deal of inherent concurrency. Therefore, one would expect fine-grained, "massively parallel" computers to provide good performance in these applications. In fact, several workers have reported excellent throughput figures for iterative algorithms, such as conjugate gradients and relaxation, running on these machines [1]. However, in some applications, the structural and numerical properties of the problem make direct solution (by factorization, forward and backward substitution) more appropriate. Development of massively parallel approaches for sparse matrix factorization has been slow because, even though the factorization problem may contain enough concurrency to occupy a large number of processors, the machine's high arithmetic throughput can be overwhelmed easily by overhead in data motion.
Early work on sparse matrix factorization emphasized reduction in the storage and operation counts, since memory was at a premium and software-controlled floating-point arithmetic was much slower than other operations. Recent work, e.g. [2], has been driven by the performance characteristics of modern computer architectures, which often include high-speed floating-point hardware. Very high performance has been obtained from pipelined supercomputers [3], while cost-effective processing has been demonstrated on workstations [4]. Sparse-matrix codes have been run successfully on MIMD (multiple instruction/multiple data stream) multiprocessors, including shared-memory [5] and distributed-memory architectures [6]. Research on SIMD (single instruction/multiple data stream) implementation of sparse matrix factorization is less mature, and the current status of this work is the subject of this paper.
In some applications, the sparse matrix has a special structure that allows it to be mapped efficiently to an array of processors. For example, when nested dissection is used on a uniform grid problem, the factorization decomposes into equal-sized subproblems [7]. However, there are many "unstructured" problems where such simplifications are not appropriate, and the original matrix must be treated with general sparse-matrix techniques. This paper is concerned with LU and QR factorizations of these unstructured matrices on SIMD computers. For LU factorization, we assume that the nonzero structure of A is symmetric, although the numerical values may be nonsymmetric. For most applications, wherever a(i,j) ≠ 0 and a(j,i) = 0, we can treat a(j,i) as a nonzero with very little additional cost. This amounts to using the (symmetric) structure of A + Aᵀ in place of that of A, and greatly simplifies both the symbolic preprocessing and the numerical factorization.
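The structural symmetrization just described amounts to a union of the pattern with its transpose. A minimal sketch (not the paper's code), representing the pattern as a set of index pairs:

```python
# Symmetrize a sparsity pattern: wherever a(i,j) != 0 and a(j,i) == 0,
# treat a(j,i) as an explicit (numerically zero) nonzero, so the
# structure becomes that of A + A^T.

def symmetrize_pattern(pattern):
    """pattern: set of (i, j) index pairs with a(i,j) != 0."""
    return pattern | {(j, i) for (i, j) in pattern}

A = {(0, 0), (1, 1), (2, 2), (0, 2), (1, 0)}   # structurally nonsymmetric
S = symmetrize_pattern(A)
print(sorted(S - A))   # entries added: [(0, 1), (2, 0)]
```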
2. SIMD ARCHITECTURES
Several parallel computers have been based on the SIMD computing model, including ILLIAC-IV, MPP, DAP, Connection Machine CM-2 and MasPar MP-1. The last two designs will be summarized briefly here, since they are the focus of most of the current research on numerical applications. In any SIMD architecture, the processors all receive the same instruction from a sequencer, and each processor may be disabled by a conditional test based on local data.
The MasPar MP-1 has a rectangular array of processors, each of which can communicate directly with its eight nearest neighbors. Grid-based communications operations include nearest-neighbor transfer and broadcast of a specified element in each row (column) across the entire row (column). A router provides general point-to-point communications, but they are much slower than grid-based transfers. Each processor has a local memory that can be addressed indirectly, to support parallel array access with a processor-dependent index.
The Connection Machine CM-2 consists of a hypercube communications network where each node (called a "sprint node") contains 32 bit-serial processors, a floating-point processor, and a memory unit. The hypercube can be configured via software and firmware to emulate a grid with any reasonable number of dimensions. As with the MasPar, the processors have indirect-addressing facilities and a router for arbitrary communications patterns.
3. ALGORITHMS
For i = 1 to n {
    For k = 1 to i-1 {
        L(i,k) = A(i,k)/U(k,k)
        For j = k+1 to n
            A(i,j) = A(i,j) - L(i,k)*U(k,j)
    }
    For j = i to n
        U(i,j) = A(i,j)
}
For j = 1 to n {
    For k = 1 to j-1 {
        A(j:n,j) = A(j:n,j) - L(j:n,k)*U(k,j)
    }
    L(j+1:n,j) = A(j+1:n,j)/a(j,j)
    U(j,j:n) = A(j,j:n)
}
For j = 1 to n {
    /* compute row j of U */
    U(j,j:n) = A(j,j:n)
    /* compute column j of L */
    L(j,j) = 1
    L(j+1:n,j) = A(j+1:n,j)/U(j,j)
    /* outer product to update Schur complement */
    A(j+1:n,j+1:n) = A(j+1:n,j+1:n) - L(j+1:n,j)*U(j,j+1:n)
}
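The right-looking outer-product algorithm above can be transcribed directly into NumPy (a runnable illustration, without pivoting, which requires that all leading pivots be nonzero):

```python
# Right-looking (outer-product) LU factorization, transcribed from the
# pseudocode above. No pivoting: assumes nonzero pivots throughout.
import numpy as np

def lu_outer(A):
    A = A.astype(float).copy()
    n = A.shape[0]
    L = np.eye(n)
    U = np.zeros((n, n))
    for j in range(n):
        U[j, j:] = A[j, j:]                  # row j of U
        L[j+1:, j] = A[j+1:, j] / U[j, j]    # column j of L
        # outer-product update of the Schur complement
        A[j+1:, j+1:] -= np.outer(L[j+1:, j], U[j, j+1:])
    return L, U

A = np.array([[4., 2., 1.], [2., 5., 3.], [1., 3., 6.]])
L, U = lu_outer(A)
print(np.allclose(L @ U, A))   # True
```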
factor(j) {
    /* factor the submatrix for the subtree rooted at node j */
    for each child j' of j
        { factor(j'); }
    gather front(j) = [a(i,k) : l(i,j) ≠ 0 and l(k,j) ≠ 0]
    eliminate x(j) from front(j)
    save U(j,*) and L(*,j)
    scatter front(j) back to memory
}
4. DATA-PARALLEL METHODS
Level(k) is the level of node k in the tree, defined as the distance of this node from the root. All contributions to the final value of u(i,j) or l(i,j) are computed in processor (φ(i), φ(j)), so the router is not needed. Grid-based communications (accessed via the FORTRAN-90 "spread" primitive) are used to spread pivot rows and columns across the processor array during elimination. Many massively parallel computers are designed to give higher bandwidth for such grid-based transfers than for arbitrary communication patterns.
To illustrate this mapping, we show in Figure 4-1 a matrix structure that typically arises from dissection orderings.¹ The sparsity structure, the elimination supernode tree and the mapping are included in Figures 4-1a, b and c, respectively. Figure 4-1c applies when the processor grid dimension P is at least as large as the tree height H. If P < H, as is usually the case, then the mapping causes the front matrices to be folded onto the grid.
¹ Dissection is mentioned here only for illustrative purposes; any reordering algorithm (such as minimum degree) can be used with this parallel factorization method.
For a typical sparse matrix, this mapping is not one-to-one; many nonzeros will be mapped to each processor. For example, if level(j) = level(j') + kP for some integer k, then a(j,j) and a(j',j') are mapped into the same processor. The lower- and upper-triangular parts of the original matrix A, as well as the final L and U factors, are stored in a three-dimensional array called MATRIX, which can be viewed as a one-dimensional array within each processor. Space is allocated in MATRIX by a simple bin sorting procedure during the preprocessing phase. To determine the amount of storage used by L(*,j) in processors {(1, φ(j)), ..., (P, φ(j))}, we set up P empty bins and then assign each nonzero l(i,j) to bin φ(i). ("Assigning" a nonzero simply means incrementing the bin's counter c(φ(i)) by 1.) L(*,j) will occupy mj words of storage in each of the aforementioned processors, where
Before performing the partial factorization of a front matrix front(j), the elements of front(j) are gathered from MATRIX into a four-dimensional array FRONT of size (mj, mj, P, P), where mj was computed by the above-mentioned storage allocation process. Using compiler directives, the last two dimensions of FRONT are declared to be "physical," meaning that they correspond to the two dimensions of
the P × P processor grid (or, in the case of the CM-2 or the CM-5, the embedding of this grid in the true physical network). The first two dimensions of FRONT are "virtual" and are mapped along the local memory space within each processor. Thus, each processor contains an mj × mj submatrix of front(j).
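The bin-sort storage count described above can be sketched in a few lines. This is an illustration, not the paper's code; it assumes mj is the largest bin count, which is consistent with each processor holding an mj × mj block of front(j).

```python
# Bin-sort storage count for one column j: count the nonzeros l(i,j)
# falling in each bin phi(i), and take m_j as the largest bin count
# (an assumption made here; the defining equation is elided in the text).

def storage_per_processor(col_nonzero_rows, P, phi):
    """col_nonzero_rows: row indices i with l(i,j) != 0 for one column j."""
    bins = [0] * P
    for i in col_nonzero_rows:
        bins[phi(i)] += 1            # "assigning" l(i,j) to bin phi(i)
    return max(bins)                 # m_j: words needed in every processor

# Hypothetical example: 10 nonzeros cyclically mapped onto P = 4 bins.
phi = lambda i: i % 4
print(storage_per_processor(range(10), 4, phi))   # 3
```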
Using the outer-product algorithm of Figure 3-3, the number of iterations (i.e., the number of parallel multiply-add operations) required to eliminate node j is mj². The value of mj depends on the number of nonzeros in front(j) as well as the distribution of the mapping coordinates (φ(i), φ(j)) for these nonzeros in the space [1, P] × [1, P]. If this distribution is very non-uniform, meaning that some processors have many more elements of front(j) than do others, then mj will be larger than necessary and many elements of FRONT will be unused; although each parallel arithmetic instruction performs P² scalar operations, only a fraction of these operations are actually used. To measure this phenomenon, we define the processor efficiency η by
where the serial work (in flops) to factor A is Ws = 2 Σj nj², and nj is the number of nonzeros in L(*,j). For an n × n dense matrix such that P divides n evenly, η = 1.
Note that η measures a structural quantity, namely the uniformity with which nonzeros have been mapped onto the processor grid by the function φ. η does not measure the relative amounts of time spent on computation, communication, gather/scatter, and other overheads. Certainly these overhead costs are important, and they depend on the details of software implementation and compiler performance.
In Figure 4-2, η is plotted against P (the processor grid dimension) for matrices from finite element grids of various sizes. The highest efficiency is obtained when a large matrix is stored on a small processor grid. Table 1 shows the efficiency for several matrices from the Harwell-Boeing collection, demonstrating that efficiencies above 30% are obtained for many practical problems on arrays of 4096 processors. While most CM-2 and MP-1 systems contain at least 4096 processors, some new architectures such as the CM-5 and iWarp will pack more processing throughput into a smaller number of processors. This approach will provide even better efficiency for these new systems.
Figure 4-2. Processor efficiency η vs. P for sparse matrices from finite element grids (150×150, 100×100, 50×50), with 9-point operator and minimum-degree ordering.
Although the tree-based mapping eliminates the need for the router, indirect addressing (indexing) is required before and after each supernode elimination in order to implement the gather and scatter operations in Figure 3-4. (The elements of a given front are not stored at the same memory address in every processor.) The CM-2 or MP-1 hardware performs indexing at speeds comparable with that of floating-point computation, but we must also be concerned with the complexity of computing the indices of the values being accessed. In principle, one could precompute and store all of the indices, but for large matrices the indices would use far more storage than the numerical values, and this would severely limit the applicability of the program. The method used in this work involves storing one integer (derived from the elimination tree) for each nonzero, and performing grid-based communications and an integer addition in order to compute the indices. Details are provided in [14].
Figure 4-3. Relative time usage during LU factorization of BCSSTK30 matrix on MP-1 with 4K processors (gather/scatter: 52.68%; remainder split between communications and arithmetic).
5. MESSAGE-PASSING METHODS
sprint node (each cluster of 32 processors; see Section 2) works on a different tree node. Grid-based communications are used to carry messages between sprint nodes; each message includes floating-point data and control tokens.
QR factorization by Givens rotations can be mapped efficiently to a ring of P "stages" (a stage will be defined later) with nearest-neighbor connectivity. All rotations that eliminate nonzeroes in column j are mapped to stage φ(j), where the stage applies

    [ c  s ] [ A(i,*) ]
    [ -s c ] [ R(i,*) ]

then returns the modified value of R(i,*) to local memory and delivers the new value of A(i,*) to the next stage. The tag accompanying the latter value is found from a lookup table in the stage, and differs from the one carried by A(i,*) as it entered this stage.
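One stage's arithmetic can be illustrated in NumPy (a sketch, not the paper's C/Paris code): the rotation is chosen so that the leading entry of the arriving row A(i,*) is annihilated against the resident row R(i,*).

```python
# One ring stage of Givens-rotation QR: rotate the resident row R and
# the arriving row A so that A's leading entry becomes zero.
import numpy as np

def apply_stage(R, A):
    """Returns the updated (R, A) pair after one Givens rotation."""
    r = np.hypot(R[0], A[0])       # length of the (R[0], A[0]) vector
    c, s = R[0] / r, A[0] / r
    R_new = c * R + s * A          # stays resident in the stage
    A_new = -s * R + c * A         # passed on to the next stage
    return R_new, A_new

R = np.array([3., 1., 2.])         # row already resident in the stage
A = np.array([4., 5., 6.])         # row arriving from the previous stage
R2, A2 = apply_stage(R, A)
print(abs(A2[0]) < 1e-12)          # leading nonzero eliminated: True
```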
Figure 5-1. Mapping of tree nodes for sparse QR factorization to stages of ring (tree levels 0, P, 2P, ... share one stage; levels 1, P+1, 2P+1, ... the next; and so on).
In order for this approach to give reasonable processor efficiency, the number of stages P must be less than the tree height H, so that the mapping "wraps around" from stage P to stage 1. If we let each sprint node be one stage, then a CM-2 with 16,384 processors has P = 512 stages; many applications give rise to sparse matrices with tree heights exceeding this number. Even so, one might expect the load balance to be poor because in a typical tree, each level in the "canopy" (farthest from the root) has more nodes than each level near the root. However, the amount of work to be done at a given level is not simply proportional to the number of nodes at this level. Each canopy node is associated with only a few rows from A, and each of these rows is very sparse. Closer to the root, many A rows arrive to be processed at each node, and the rows are denser because they have suffered fill-in from previous rotations. These effects combine to yield processing efficiency (the average fraction of processors that are doing useful work) around 50% for most matrices from regular and irregular finite element grids, as long as the tree height is greater than P.
The CM-2 is used here as a ring-connected pipeline of stages; each stage (each sprint node) functions as a vector processor, with a vector length of 32. A row with more than 32 nonzeroes is broken up into segments, where each segment contains up to 32 nonzeroes and a control tag. Aside from the aforementioned "node index," this tag also includes bit fields to mark the first and last segments of each row. If a row A(i,*) is several segments long, then it can be pipelined through several stages, with each stage applying a different rotation to a different segment of the row at any given moment.
As a row from A passes through a stage, the rotation applied to it may contribute some fill, lengthening the row because only the nonzeroes are transmitted between stages. Even though the stages operate synchronously, they accept and deliver data at differing rates because they are operating on different portions of the matrix, with different amounts of fill-in occurring. Software-managed queues are placed between the stages to absorb fluctuations in data rates, thereby increasing processor utilization.
Figure 5-2 shows a typical profile of processor utilization versus time step for this QR factorization method. The fluctuations in Figure 5-2 are due to the complex interactions among the P coupled queuing processes. Average utilization was approximately 50% for this run; the tree height for this example was 1050. Note that "utilization" in this context refers to the fraction of virtual processors (out of 16384) that are busy during each iteration. Therefore, this measure is similar to the quantity η that was defined for sparse LU factorization in Section 4.2; both are structural measures of load balance that do not address the question of overheads (for communications, gather/scatter, etc.). For this CM-2 C/Paris code, those processors that are not idle (due to load imbalance) spend about 25% of their time on floating-point arithmetic, and the remainder on overhead. Even some of this 25% is wasted because most of the processors are disabled during the computation of the Givens rotation parameters; this is an unfortunate consequence of the SIMD computation model. For this reason, more extensive performance measurements were not conducted with this code. Perhaps better performance could be obtained by implementing this approach on a MIMD machine, where some processors can compute rotation parameters while others are performing other tasks.
There are several choices for the type of columns to be used in communication operations. One possibility that was ruled out is to pass partially-computed columns of L among processors; each time a column arrives at a processor, it is updated by one or more columns of L that are stored in that processor. This scheme is inappropriate because there is no way to guarantee that, when a column arrives at a processor to be updated by another column, the updating column is completed. Thus, the two possibilities that we have considered are using completed columns of the Cholesky factor (the "fan-out" method) or using partial updates (the "fan-in" method) [16]. On MIMD machines it has been demonstrated that using partial updates leads to a reduced communication volume [17] and to more efficient algorithms (see [18] for a comparison on the Ncube/2). However, communication volume by itself is not as important a factor for SIMD machines, as the number of idle processors must be considered also. For the present we have chosen to use completed columns, mostly because the control information needed to route these is simpler than for partial updates. In the future we plan to further consider the possibility of using partial updates.
Thus, the major action during each step of Cholesky factorization is the subtraction of a multiple of a segment of a completed column of L from that of an uncompleted column. When an uncompleted column has had its last modification performed on it, it is post-processed and then sent out to other processors that need it to modify columns. More than one copy of the completed column may exist in the processor grid at one time, allowing it to modify more than one column at a time.
The QR method of Section 5.1 does not exploit the independence of sibling nodes, even though this is an easily exploited source of large-grained parallelism. This is a consequence of the level-to-processor mapping necessitated by the use of the one-dimensional ring. To allow us to exploit this parallelism, we have expanded the ring to a two-dimensional torus. One reason for this change is that in Cholesky factorization, work can begin only at the leaves of the tree, and the leaves are often not well distributed across the height of the tree. This would lead to poor load balance, especially for problems with short trees. By changing to a two-dimensional grid, we have essentially reduced the tree height that is required for good load balance, by allowing much greater flexibility in how we map the tree to the processor grid.
The mapping of matrix columns to processors affects both communication overhead and load balance. Communication overhead depends on the total number of processors that must eventually receive a column, and the number of processors that a column must pass through without being used. An ideal scheme would result in many modifying columns being sent at once, but each going to only a few processors, so that communication is reduced and yet each processor still has useful work to do. We have chosen to use a mapping which increases the load balance at the possible expense of increased communication, based on the fact that the CM has a high communication bandwidth and will most likely be able to handle the extra communication. All grid communications take place in the south or west grid directions. The mapping assigns column j to the north or east neighbor of the sprint node handling column parent(j). The sprint node chosen is the one which has currently been assigned fewer columns, helping to maintain the total load balance. A CM-2 implementation of this approach is currently being tested, and results will be reported in a future publication.
6. CONCLUSIONS
Massively-parallel SIMD computers offer very high arithmetic speeds for programs that can exploit the hardware efficiently. This exploitation requires careful attention to several issues, including granularity, load balancing, and overhead. Each of these issues also arises in MIMD implementations, but the criteria are different. Massively-parallel SIMD architectures generally call for a relatively fine granularity of decomposition, such as one processor for each nonzero (rather than one per column, subtree or submatrix) during any given step.
Load balancing for a SIMD program means equalizing not only the amount of work for all processors, but also the type of work. For example, some processors cannot be inverting pivot elements (division) while others are updating Schur complements (multiplication and addition). Even if the programming language allows constructs such as if-then-else or loops with processor-dependent iteration counts, these are implemented by conditionally disabling some processors, leading to efficiency loss. This issue is especially important when dealing with "unstructured" sparse matrices. One way to deal with this problem is to break the data into equal-sized blocks, or segments, whose size is related to the processor array dimensions.
REFERENCES
[1] O. MCBRYAN, The Connection Machine: PDE Solution on 65,536 Processors, Thinking Machines Corp. Technical Report CS86-1, 1986.
[2] A. DAVE AND I. DUFF, Sparse Matrix Calculations on the Cray-2, Parallel Comput., 5 (1987), pp. 55-64.
[3] C. YANG, A Vector/Parallel Implementation of the Multifrontal Method for Sparse Symmetric Positive Definite Linear Systems on the Cray Y/MP, Cray Research Inc. Technical Report, 1990.
[4] E. ROTHBERG AND A. GUPTA, Techniques for Improving the Performance of Sparse Matrix Factorization on Multiprocessor Workstations, Stanford Univ. Report CSL-TR-90-430, 1990.
[5] A. GEORGE, M. HEATH AND J. LIU, Parallel Cholesky Factorization on a Shared-Memory Multiprocessor, Lin. Alg. Appl., 77 (1986), pp. 165-187.
[6] R. LUCAS, W. BLANK AND J. TIEMAN, A Parallel Solution Method for Large Sparse Systems of Equations, IEEE Trans. Computer-Aided Design, CAD-6 (1987), pp. 981-991.
[7] P. WORLEY AND R. SCHREIBER, Nested Dissection on a Mesh-Connected Processor Array, in New Computing Environments: Parallel, Vector and Systolic, ed. by A. Wouk, SIAM, 1986.
[8] J. LIU, The Role of Elimination Trees in Sparse Factorization, SIAM J. Matrix Anal. Appl., 11 (1990), pp. 134-172.
[9] R. SCHREIBER, A New Implementation of Sparse Gaussian Elimination, ACM Trans. Math. Software, 8 (1982), pp. 256-276.
[10] A. GEORGE AND M. HEATH, Solution of Sparse Linear Least Squares Problems Using Givens Rotations, Lin. Alg. Appl., 34 (1980), pp. 69-83.
[11] J. LIU, On General Row Merging Schemes for Sparse Givens Transformations, SIAM J. Sci. Stat. Comp., 7 (1986), pp. 1190-1211.
[12] A. GEORGE AND J. LIU, Householder Reflections versus Givens Rotations in Sparse Orthogonal Decomposition, Lin. Alg. Appl., 88 (1987), pp. 223-238.
[13] J. GILBERT AND R. SCHREIBER, Highly Parallel Sparse Cholesky Factorization, SIAM J. Scientific and Statistical Computing, 13 (1992), pp. 1151-1172.
[14] S. KRATZER, Sparse LU Factorization on Massively Parallel SIMD Computers, Technical Report SRC-TR-92-072, Supercomputing Research Center, April 1992.
[15] S. KRATZER, Massively Parallel Sparse Matrix Computations, Technical Report
SRC-TR-90-008, Supercomputing Research Center, February, 1990.
[16] M. HEATH, E. NG AND B. PEYTON, Paralle! A!gorithms for Sparse Linear Systems, SIAM
Review, 33 (1991), pp. 420-460.
[17] C. ASHCRAFT, S. EISENSTAT, J. LJU, AND A. SHERMAN, A Comparison of Three Co!-
umn-based Distributed Sparse Factorization Schemes, Technica! Report, Dept. of Computer
Science, York Univ., 1990.
[18] A. CLEARY, A Camparisan of AJgorithms for Choiesky Factorization on a Massiveiy Parallei
MIMD Computer, Proc. 5th SIAM Conf. on ParalleI Processing, March, 1991.
THE EFFICIENT PARALLEL ITERATIVE SOLUTION OF LARGE
SPARSE LINEAR SYSTEMS*
Abstract. The development of efficient, general-purpose software for the iterative solution of
sparse linear systems on parallel MIMD computers depends on recent results from a wide variety of
research areas. Parallel graph heuristics, convergence analysis, and basic linear algebra
implementation issues must all be considered.
In this paper, we discuss how we have incorporated these results into a general-purpose
iterative solver. We present two recently developed asynchronous graph coloring heuristics. Several
graph reduction heuristics are described that are used in our implementation to improve individual
processor performance. The effect of these various graph reduction schemes on the solution of
sparse triangular systems is categorized. Finally, we report on the performance of this solver on
two large-scale applications: a piezoelectric crystal finite-element modeling problem, and a nonlinear
optimization problem to determine the minimum energy configuration of a three-dimensional
superconductor model.
Key words: graph coloring heuristics, iterative methods, parallel algorithms, preconditioned conjugate
gradients, sparse matrices
AMS(MOS) subject classifications: 65F10, 65F50, 65Y05, 68R10
* This paper is based on a talk presented by the second author at the IMA Workshop on Sparse
Matrix Computations: Graph Theory Issues and Algorithms, October 14-18, 1991. This work was
supported by the Applied Mathematical Sciences subprogram of the Office of Energy Research,
U.S. Department of Energy, under Contract W-31-109-Eng-38.
† Mathematics and Computer Science Division, Argonne National Laboratory, 9700 South Cass
Ave., Argonne, Illinois 60439
1 For example, consider a three-dimensional problem discretized on an O(k × k × k) grid and
ordered by nested dissection. We assume that we must solve a dense system of size O(k^2), the size
of the largest separator. This task requires O(k^6) work and O(k^4) space. By contrast, for an iterative
scheme we assume that the number of iterations required is at worst O(k) (i.e., proportional to the
relative refinement of the mesh). The work per iteration is proportional to the size of the linear
system, or O(k^3). Thus, the total work required by the iterative method would be O(k^4) and the
space required O(k^3).
In this paper we present an approach to solving such systems that satisfies the
requirements above. Central to our method is a reordering of the matrix based on a
coloring of the symmetric graph corresponding to the nonzero structure of the matrix,
or a related graph. To determine this ordering, we use a recently developed parallel
heuristic. However, if many colors are used, a straightforward parallel implementation,
as described in [10], suffers from poor processor performance on a high-performance
processor such as the Intel i860. In this paper we present several possible graph
reductions that can be employed to greatly improve the performance of an implementation
on high-performance RISC processors.
Consider an implementation of any of the standard general-purpose iterative meth-
ods [7, 16]: consistently ordered SOR, SSOR accelerated by conjugate gradients (CG),
or CG preconditioned with an incomplete matrix factorization. It is evident that the
major obstacle to a scalable implementation [6] is the inversion of sparse triangular
systems with a structure based on the structure of the linear system. For example,
the parallelism inherent in computing and applying an incomplete Cholesky precondi-
tioner is limited by the solution of the triangular systems generated by the incomplete
Cholesky factors [21]. It was noted by Schreiber and Tang [20] that if the nonzero
structure of the triangular factors is identical to that of the original matrix, the mini-
mum number of major parallel steps possible in the solution of the triangular system
is given by the chromatic number of the symmetric adjacency graph representing
those nonzeros. Thus, given the nonzero structure of a matrix A, one can generate
greater parallelism by computing a permutation matrix, P, based on a coloring of the
symmetric graph G(A). The incomplete Cholesky factor L of the permuted matrix
P ApT is computed, instead of the factor based on the original matrix A.
In this permutation, vertices of the same color are grouped and ordered con-
secutively. As a consequence, during the triangular system solves, the unknowns
corresponding to vertices of the same color can be solved for in parallel, after the
updates from previous color groups have been performed. The result of Schreiber
and Tang states that the minimum number of inherently sequential computational
steps required to solve either of the triangular systems, Ly = b or L^T x = y, is given
by the minimum possible number of colors, or chromatic number, of the graph.
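As an illustration of why such a permutation helps, the following sketch (dense numpy for clarity, and a helper name of our own; this is not the authors' implementation) performs a forward solve one color class at a time. Because no two unknowns of the same color are coupled, each class is solved in a single vectorized step.

```python
import numpy as np

def forward_solve_by_color(L, b, color):
    """Solve L y = b when unknowns are ordered by color class and no two
    unknowns of the same color are coupled, so the diagonal block of each
    class is diagonal.  Each color class is one vectorizable step."""
    n = L.shape[0]
    y = np.zeros(n)
    for c in sorted(set(color)):
        idx = [i for i in range(n) if color[i] == c]
        # subtract updates from previously solved color classes; y is
        # still zero on idx, so the diagonal terms contribute nothing
        rhs = b[idx] - L[idx] @ y
        # divide by the (diagonal) pivots of this class
        y[idx] = rhs / L[idx, idx]
    return y
```

With χ colors, the loop makes χ sequential steps, matching the chromatic-number bound discussed above.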
We note that this bound on the number of communication steps assumes that
only vector operations are performed during the triangular system solves. This
assumption is equivalent to restricting oneself to a fine-grained parallel computational
model, where we assign each unknown to a different processor. When many unknowns
are assigned to a single processor, it is possible to reduce the number of communication
steps by solving non-diagonal submatrices of L on individual processors at each step.
In this case, the minimum number of communication steps is given by a coloring of a
quotient graph obtained from a partitioning of unknowns to processors.
The remainder of the paper is organized as follows. In §3 we present several
possible graph reductions, including the clique partitions that allow for the use of
higher-level Basic Linear Algebra Subprograms (BLAS) in the software. We consider
2 That is, we are interested in a solver where, for fixed problem size per processor, the performance
per processor is essentially independent of the number of processors used.
a general framework that can incorporate these ideas into efficient triangular system
solvers in §4. Finally, in §5 we present experimental results obtained for our software
implementation on the Intel DELTA for problems arising in two different applications,
and in §6 we discuss our conclusions.
In the Monte Carlo algorithm described by Luby [15], this initial independent set
is augmented to obtain a maximal independent set. The approach is the following.
After the initial independent set is found, the set of vertices adjacent to a vertex in
I, the neighbor set N(I), is determined. The union of these two sets is deleted from
V, the subgraph induced by this smaller set is constructed, and the Monte Carlo step
is used to choose an augmenting independent set. This process is repeated until the
candidate vertex set is empty and a maximal independent set (MIS) is obtained. The
complete Monte Carlo algorithm suggested by Luby for generating an MIS is shown
in Fig. 1. In this figure we denote by G(V') the subgraph of G induced by the vertex
set V'. Luby shows that an upper bound for the expected time to compute an MIS
by this algorithm on a CRCW P-RAM is EO(log(n)). The algorithm can be adapted
to a graph coloring heuristic by using it to determine a sequence of distinct maximal
independent sets and by coloring each MIS a different color. Thus, this approach will
solve the (Δ + 1) vertex coloring problem, where Δ is the maximum degree of G, in
expected time EO((Δ + 1) log(n)).
I ← ∅;
V' ← V;
While G(V') ≠ ∅ do
    Choose an independent set I' in G(V');
    I ← I ∪ I';
    V' ← V' \ (I' ∪ N(I'));
enddo
FIG. 1. Luby's Monte Carlo algorithm for determining a maximal independent set
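A sequential sketch of this algorithm can be written as follows. The Monte Carlo step is realized here by drawing a random priority for each remaining vertex and selecting the local minima, which necessarily form an independent set; this particular realization of the step is our assumption, not a transcription of Luby's paper.

```python
import random

def luby_mis(adj, seed=0):
    """Maximal independent set by Luby's Monte Carlo scheme: repeatedly
    choose a random independent set in the remaining subgraph, add it
    to I, and delete it together with its neighbor set.
    adj maps each vertex to the set of its neighbors."""
    rng = random.Random(seed)
    I, V = set(), set(adj)
    while V:
        # Monte Carlo step: local minima of random priorities form an
        # independent set (two adjacent vertices cannot both be minima)
        p = {v: rng.random() for v in V}
        new = {v for v in V if all(p[v] < p[w] for w in adj[v] & V)}
        I |= new
        # delete the chosen set and its neighbor set N(new) from V
        V -= new
        V -= {w for v in new for w in adj[v]}
    return I
```

The global minimum of the priorities is always selected, so each round makes progress, and every deleted vertex is either in I or adjacent to it, which gives maximality.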
Choose ρ(v);
n-wait = 0;
send-queue = ∅;
For each w ∈ adj(v) do
    Send ρ(v) to processor responsible for w;
    Receive ρ(w);
    if (ρ(w) > ρ(v)) then n-wait = n-wait + 1;
    else send-queue ← send-queue ∪ {w};
enddo
n-recv = 0;
While (n-recv < n-wait) do
    Receive σ(w);
    n-recv = n-recv + 1;
enddo
σ(v) = smallest available color consistent with the
    previously colored neighbors of v;
For each w ∈ send-queue do
    Send σ(v) to processor responsible for w;
enddo
FIG. 2. The asynchronous parallel coloring heuristic (the code executed for each vertex v)
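Executed serially, this heuristic is equivalent to greedy coloring in order of decreasing priority ρ(v): a vertex is colored as soon as all of its higher-priority neighbors are colored. The sketch below simulates that serial equivalent (it is not the message-passing code itself, and the function name is ours):

```python
import random

def async_coloring(adj, seed=0):
    """Serial simulation of the asynchronous coloring heuristic: draw a
    random priority rho(v) per vertex, then color vertices in decreasing
    priority order, giving each the smallest color not used by an
    already-colored neighbor."""
    rng = random.Random(seed)
    rho = {v: rng.random() for v in adj}
    sigma = {}
    for v in sorted(adj, key=lambda u: -rho[u]):
        used = {sigma[w] for w in adj[v] if w in sigma}
        # smallest available color consistent with colored neighbors
        sigma[v] = next(c for c in range(len(adj) + 1) if c not in used)
    return sigma
```

As in any greedy scheme, at most Δ + 1 colors are used, where Δ is the maximum degree.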
An upper bound of EO(log(n)/ log log(n)) for the expected running time of a
synchronous version of this algorithm can be obtained for graphs of bounded degree [13].
The central idea of the proof of this bound is the observation that the running time
of the heuristic is proportional to the length of the longest monotonic path in G. A
monotonic path of length t is defined to be a path of t vertices {v_1, v_2, ..., v_t} in G
such that ρ(v_1) > ρ(v_2) > ... > ρ(v_t).
We now show that Luby's MIS algorithm can be modified to obtain the same
bound. Consider the following modification to the asynchronous coloring heuristic
given in Fig. 2. Let the function γ(v) equal one if v is in the independent set I, two if
v is in N(I), and let it be undefined otherwise. In Fig. 3 we present an asynchronous
algorithm to determine an MIS.
The following lemma proves the correctness of the asynchronous algorithm.
LEMMA 2.1. At the termination of the algorithm given in Fig. 3, the function
γ(v), v ∈ V, defines a maximal independent set.
Proof: At the completion of the algorithm in Fig. 3, γ(v) is defined for each v ∈ V.
Thus, each vertex v ∈ V satisfies one of the following, based on the definition of γ:
1. v ∈ I, or
2. v ∈ N(I).
It is clear that the set I is independent, and each member of N(I) must be adjacent
to a member of I. Thus, the above two conditions imply that the independent set I
is maximal. □
Choose ρ(v);
n-wait = 0;
send-queue = ∅;
For each w ∈ adj(v) do
    Send ρ(v) to processor responsible for w;
    Receive ρ(w);
    if (ρ(w) > ρ(v)) then n-wait = n-wait + 1;
    else send-queue ← send-queue ∪ {w};
enddo
n-recv = 0;
While (n-recv < n-wait) do
    Receive γ(w);
    n-recv = n-recv + 1;
enddo
if (all the previously assigned neighbors w of v
    have γ(w) = 2) then γ(v) = 1;
else γ(v) = 2;
endif
For each w ∈ send-queue do
    Send γ(v) to processor responsible for w;
enddo
FIG. 3. An asynchronous algorithm for determining a maximal independent set
Based on Theorem 3.3 and Corollary 3.5 given in [13], we have the following
corollary.
COROLLARY 2.2. For graphs of bounded degree Δ, the expected running time is
EO(log(n)/ log log(n)) for the maximal independent set algorithm given in Fig. 3.
Proof: As for the bound for the asynchronous parallel coloring heuristic, the expected
running time for the asynchronous maximal independent set algorithm is proportional
to the expected length of the longest monotonic path. By Theorem 3.3 and
Corollary 3.5 in [13] this length is bounded by EO(log(n)/ log log(n)). □
Finally, we note that this maximal independent set algorithm can be used in place
of Luby's MIS algorithm to generate a sequence of maximal independent sets, each
of which can be colored a different color. The running time of this coloring heuristic
would again be bounded by EO(log(n)/ log log(n)) because the maximum number of
colors used is bounded by Δ + 1, and we have assumed the maximum degree Δ of
the graph is bounded.
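The resulting coloring heuristic can be sketched as follows, with a simple deterministic greedy MIS standing in for the asynchronous algorithm of Fig. 3 (the stand-in and the names are our simplification, not the paper's algorithm): each extracted MIS receives a fresh color.

```python
def greedy_mis(adj, avoid):
    """A maximal independent set of the subgraph induced on the
    vertices not in `avoid` (a stand-in for the MIS step of Fig. 3)."""
    I = set()
    for v in adj:
        if v not in avoid and not (adj[v] & I):
            I.add(v)
    return I

def color_by_mis(adj):
    """Coloring by repeatedly extracting a maximal independent set
    from the uncolored vertices and giving it a new color; at most
    Delta + 1 colors are used for maximum degree Delta."""
    colored, sigma, c = set(), {}, 0
    while len(colored) < len(adj):
        I = greedy_mis(adj, colored)
        for v in I:
            sigma[v] = c
        colored |= I
        c += 1
    return sigma
```

Each uncolored vertex left out of a round's MIS must have a neighbor in it, so a vertex of degree d is colored within d + 1 rounds, which gives the Δ + 1 bound on the number of colors.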
It is often observed that the sparse systems arising in many applications have a
great deal of special local structure, even if the systems are described as "unstructured."
We have attempted to illustrate some of this local structure, and how it can
be identified, in the following sequence of figures.
In Fig. 4 we depict a subsection of a graph that would arise from a two-dimensional,
linear, multicomponent finite-element model with three degrees of freedom per node
point. We illustrate the three degrees of freedom by the three dots at each node
point; the linear elements imply that the twelve degrees of freedom sharing the four
node points of each face are completely connected. In the figure we show edges
only between the nodes; these edges represent the complete interconnection of all the
vertices on each element or face.
FIG. 4. A subgraph generated by a two-dimensional, linear finite element model with three degrees
of freedom per node point. The geometric partition shown by the dotted lines yields an assignment
of the vertices in the enclosed subregion to one processor.
The dashed lines in the figure represent a geometric partitioning of the grid; we
assume that the vertices in the central region are all assigned to one processor. We
make several observations about the local structure of this subgraph. First, we note
that the adjacency structures of the vertices at the same geometric node (i.e., the
nonzero structures of the associated variables) are identical, and we call such vertices
identical vertices. It was noted by Schreiber and Tang [20] that a coloring of the graph
corresponding to the geometric nodes results in a system with small dense blocks, of
order the number of degrees of freedom per node, along the diagonal. We note that
this observation can also be used to decrease the storage required for indirect indexing
of the matrix rows, since the structures are identical.
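Identical vertices can be found cheaply by hashing each vertex's closed neighborhood (the vertex together with its adjacency set); the following is an illustrative sketch with names of our own choosing, not code from the paper:

```python
from collections import defaultdict

def identical_vertices(adj):
    """Group vertices with identical adjacency structure.  Two vertices
    are 'identical' when their closed neighborhoods agree, i.e.
    adj(v) | {v} == adj(w) | {w}.  Such groups (e.g. the degrees of
    freedom at one finite-element node) can share one copy of the index
    structure and form a dense diagonal block."""
    groups = defaultdict(list)
    for v, nbrs in adj.items():
        key = frozenset(nbrs | {v})  # closed neighborhood as hash key
        groups[key].append(v)
    return list(groups.values())
```

For the model of Fig. 4, the three degrees of freedom at each geometric node fall into one group, yielding the dense diagonal blocks noted above.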
We also consider another graph reduction based on the local clique structure of
the graph. In Fig. 5 the dotted lines show one possible way the vertices assigned to
the shown partition and its neighbors can be partitioned into cliques. Denote such a
partition by Q. If we associate a super vertex with each clique, the quotient graph
G/Q can be constructed based on the rule that there exists an edge between two
super vertices v and w if and only if there exists an edge between two vertices of their
respective partitions in G. The quotient graph constructed by the clique partition
shown in Fig. 5 is shown in Fig. 6.
FIG. 6. The quotient graph given the clique partition shown in Fig. 5
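The construction rule for G/Q can be sketched directly (hypothetical names, not the paper's code): super vertices are adjacent exactly when some edge of G joins their partitions.

```python
def quotient_graph(adj, part):
    """Build G/Q.  `part` maps each vertex of G to its super vertex
    (e.g. its clique); super vertices p and q are adjacent iff some
    edge of G joins a vertex of p to a vertex of q."""
    qadj = {p: set() for p in set(part.values())}
    for v, nbrs in adj.items():
        for w in nbrs:
            if part[v] != part[w]:
                qadj[part[v]].add(part[w])
                qadj[part[w]].add(part[v])
    return qadj
```

A coloring of this (smaller) graph then bounds the number of communication steps for the partitioned triangular solves, as discussed in §1.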
Of course the quotient graph reduction is not limited to the choice of a maximal
clique partition; any local partition of the subgraph assigned to a processor can be
used to generate the reduced graph. We use a clique decomposition because the
submatrix associated with the clique is dense, thus allowing for the use of higher-level
dense linear algebra operations (BLAS) in an implementation. This aspect of
the graph reduction is discussed in more detail in §4. Finally, we note that the
For i = 1, ..., χ do
    1. Local Solve (requires no interprocessor communication):
           L_{i,i} y_i = b_i
    2. Update (communication without interdependencies):
           b_{J_i} ← b_{J_i} − L_{J_i,K_i} y_{K_i}
enddo
FIG. 7. A general framework for the parallel forward elimination of the lower triangular system
Ly = b
Consider the lower triangular matrix L decomposed into the following block structure:

(4.1)        [ L_{1,1}                          ]
         L = [ L_{2,1}  L_{2,2}                 ]
             [   ...      ...     ...           ]
             [ L_{χ,1}  L_{χ,2}   ...   L_{χ,χ} ]
Pointwise colorings - Given a coloring of the graph G(A) for the incomplete
    factorization matrix A, we order unknowns corresponding to same-colored
    vertices consecutively. An implementation based on this approach and
    computational results are given in [10].
Partitioned inverse - One can determine a product decomposition of L; for
    example,

    (4.2)        L = ∏_{i=1}^{χ} L_i.

    In the two-color case,

    (4.3)        L = [ D_{1,1}     0     ]
                     [ L_{2,1}   D_{2,2} ],

    where D_{1,1} and D_{2,2} are diagonal. Schreiber makes the following observation:

    (4.4)        L^{-1} = [ D_{1,1}^{-1}                          0        ]
                          [ -D_{2,2}^{-1} L_{2,1} D_{1,1}^{-1}  D_{2,2}^{-1} ],

    where the structures of L and L^{-1} are identical. Thus, one can group pairs
    of colors together and form the inverse of the combined diagonal block by a
    simple rescaling of the off-diagonal part.
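Schreiber's observation can be checked numerically. The sketch below (a helper of our own, dense numpy for clarity, not the paper's code) forms L^{-1} by exactly the rescaling of Equation 4.4: the off-diagonal block is scaled row-wise by D_{2,2}^{-1} and column-wise by D_{1,1}^{-1} and negated.

```python
import numpy as np

def two_color_inverse(L, n1):
    """Closed-form inverse of a two-color block lower triangular matrix
    L = [[D11, 0], [L21, D22]] with D11 (order n1) and D22 diagonal:
    L^{-1} = [[D11^{-1}, 0], [-D22^{-1} L21 D11^{-1}, D22^{-1}]],
    which has the same nonzero structure as L."""
    D1 = L[:n1, :n1].diagonal()
    D2 = L[n1:, n1:].diagonal()
    L21 = L[n1:, :n1]
    inv = np.zeros_like(L)
    inv[:n1, :n1] = np.diag(1.0 / D1)
    inv[n1:, n1:] = np.diag(1.0 / D2)
    # rescale the off-diagonal block: rows by 1/D2, columns by 1/D1
    inv[n1:, :n1] = -(L21 / D2[:, None]) / D1[None, :]
    return inv
```

Because only a rescaling is involved, no fill is created, which is why the structures of L and L^{-1} coincide.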
Nodewise colorings - Identify adjacent vertices with identical structure. As
    described in §3, such vertices often arise in finite element models for
    independent degrees of freedom defined at the same geometric node. Let the set
    I identify identical nodes. A matrix ordering based on a coloring of G/I,
    where identically colored nodes are ordered consecutively, yields a system
    where L_{i,i} is block diagonal, with dense blocks the size of the number of
    identical nodes at each node point. Given a geometric partition of the nodes,
    these dense blocks are local to a processor. In addition, the observation
    made by Schreiber and illustrated in Equation 4.4 can be used to decrease
    the number of major communication steps by a factor of two for a nodewise
    coloring. The inverse formula given in Equation 4.4 with D_{1,1} and D_{2,2} block
    diagonal will still preserve the nonzero structure of L, because the nonzero
    structures of the columns in each dense block are identical.
Quotient graph colorings derived from a local clique partition - This
    approach is used in our implementation. The local cliques correspond to local
    dense diagonal blocks in L_{i,i}. The inverses of these blocks are computed.
    Thus the local solve, step 1 in Fig. 7, can be implemented using Level-2
    BLAS. Usually the number of colors required to color the quotient graph will
    be smaller than the number of colors required for the original graph. However,
    if fewer colors are used, recent theoretical results [11] indicate that the
    convergence of the iterative algorithm could suffer. This aspect is discussed
    more fully in §5.
Quotient graph colorings derived from general local systems - Any local
    structure can be chosen for the diagonal systems L_{i,i}. However, if general sparse
    systems are used, the processor performance is not necessarily improved over
    a pointwise coloring. In addition, load balancing becomes more difficult as
    larger partitions are chosen.
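The framework of Fig. 7, specialized to the clique-partition approach, can be sketched as follows (a dense-numpy illustration with names of our own; a real implementation would store the sparse off-diagonal blocks and call Level-2 BLAS such as DGEMV directly):

```python
import numpy as np

def block_forward_solve(L, blocks, b):
    """Forward elimination in the style of Fig. 7.  `blocks` lists the
    index sets of the dense diagonal (clique) blocks, ordered so that L
    is block lower triangular with respect to them.  Each local solve
    applies a precomputed dense inverse, i.e. one matrix-vector
    product (Level-2 BLAS) per block."""
    y = np.zeros_like(b)
    # precompute the inverses of the dense clique diagonal blocks
    invs = [np.linalg.inv(L[np.ix_(idx, idx)]) for idx in blocks]
    for idx, inv in zip(blocks, invs):
        # y is still zero on idx, so L[idx] @ y collects only the
        # updates from blocks solved earlier (step 2 of Fig. 7)
        y[idx] = inv @ (b[idx] - L[idx] @ y)
    return y
```

Blocks of the same color carry no mutual dependencies, so in the parallel setting all local solves of one color proceed simultaneously between communication steps.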
5.1. The piezoelectric crystal modeling problem. The first set of sparse
systems that we consider arises from a second-order finite-element model of a piezoelectric
crystal strip oscillator. These crystals are thin strips of quartz that vibrate at a fixed
frequency when an electric forcing field is applied to the crystal. A diagram of a strip
oscillator affixed to an aluminum substrate with epoxy is shown in Fig. 8.
Second-order, 27-node finite elements are used to model the crystal. Higher-order
elements are required to accurately model high-frequency vibrational modes of
the crystal. There are four degrees of freedom at each geometric node point: three
mechanical displacements and an electric field potential. The solution phase has two
steps. First, the deformation of the crystal caused by thermal displacement is found.
For example, if the crystal is mounted on aluminum at 25°C, it will deform when
the temperature is raised to 35°C. This requires solving a nonlinear static thermal
stress problem. Second, to find the vibrational modes of interest for the deformed
crystal, we solve a linear vibration problem - a generalized eigenproblem.
To solve the nonlinear static thermal stress problem, a series of linear systems
of the form Ku = f must be solved, where K represents the stiffness matrix, u
represents the displacements, and f represents the forces due to thermal loads and
displacement constraints. The major task here, of course, is the solution of very large,
sparse systems of equations.
To solve the linear vibration problem, we must solve a generalized eigenproblem
of the form Kx = ω²Mx, where K represents the stiffness matrix, M represents
the mass matrix, x is a vibrational mode shape, and ω is a vibrational frequency. We
use a shifted, inverted variant of the Lanczos algorithm to solve this eigenproblem
[17]. This method has been shown to be very efficient for the parallel solution of the
vibration problem [9]. Again, the major computational task is the solution of large
sparse systems of linear equations.
The three-dimensional finite element grid needed to model the crystals is much
more refined in the length and width directions than it is in the thickness direction.
We can take advantage of this fact and partition the grid among the processors in
only the length and width directions. This approach reduces communication and
maps nicely onto the DELTA architecture. Each processor is assigned a rectangular
solid corresponding to a portion of the three-dimensional grid. Each processor is
responsible for evaluating the finite elements in its partition and for maintaining all
relevant geometric and solution data for its partition.
TABLE 1
Average megaflop rates per processor for the triangular system solution as a function of the number
of processors used. The problem size per processor is kept approximately constant. Shown are the
number of processors used (p), the problem sizes (n), and the number of nonzeros in the lower
triangular systems (nnz). Also shown are the sizes of the reduced systems once identical nodes
(n_i-node) and local cliques (n_clique) are identified.
In Tables 1 and 2 we present results obtained on the Intel DELTA for solving
linear systems generated for the piezoelectric crystal modeling problem. The average
megaflop rates given in Table 1 demonstrate the scalable performance of the solver;
for fixed problem size per processor, the performance per processor is essentially
independent of the number of processors used.
In Table 2 we show the times required for the symbolic manipulations and for the
TABLE 2
Times (in seconds) to find the identical nodes (t_i-node), local cliques (t_clique), and colorings (t_color)
for the piezoelectric crystal problem. The time to reduce the graph (i.e., compute the quotient graph)
is included in the times t_i-node and t_clique. Also given is the time (in seconds) for one back solve,
t_BS, and one forward solve, t_FS. The asynchronous parallel coloring heuristic given in Fig. 2 was
used to compute the coloring for the reduced graph. Also given are the number of colors, χ, used by
the parallel coloring heuristic.
solution of the triangular systems. Note that the symbolic manipulation is done only
once - to initialize the conjugate gradient iteration. In fact, since the structure of
the sparse system is constant, these symbolic data structures remain the same for the
linear system to be solved at each nonlinear iteration. The implementation of the
matrix multiplication is done in essentially the same manner as the forward and back
solves. Thus, the time for one conjugate gradient iteration is roughly 2(t_FS + t_BS).
From the results given in Table 2, the total time required to determine the identical
nodes, local cliques, and coloring, and to set up the required data structures, corresponds
to roughly 10 to 12 conjugate gradient iterations. Note that these times include
all the required symbolic work; we have included the time to compute the quotient
graphs in these times. Since the number of conjugate gradient iterations required
for these problems is typically several hundred, and considering the time required to
integrate and assemble the stiffness matrix, the time required for the symbolic work
is relatively inexpensive.
The two Level-2 BLAS routines that are involved in the triangular system solves
are DGEMV and DTRMV, the matrix-vector multiplication routines for general and
triangular matrices, respectively. By comparing the results presented in Table 2 with
the performance of these routines on one processor, we can get some idea of the relative
efficiency of our implementation. We have used the assembler implementation of the
BLAS routines developed by Kuck & Associates [14]. For matrix sizes of 20 and 50
they achieve performances of 2.39 and 5.35 megaflops for the DTRMV routine on a
single i860 processor. Likewise, for the DGEMV routine they achieve performances
of 6.73 and 15.90 megaflops for matrix sizes of 20 and 50. Since the average clique
size for the problems presented in Table 1 is approximately 40, the measured per-processor
performance for the parallel implementation appears to be quite good.
FIG. 9. The layered superconductor model: the Y-Z domain is shown partitioned among 9 processors.
Shown in the figure are alternating layers of superconducting and insulating material.
The independent variables are two vector fields, one defined in the superconducting
sheets, and the other in the insulating layer. The two fields are coupled in the
free energy formulation. When the model is discretized, a finer degree of resolution
is generally given to the insulating layers. For the problems of interest, the number
of grid points necessary to represent the model in the direction perpendicular to the
layers (the X-axis in Fig. 9) is smaller than the number of points required in the two
directions parallel to the layers (the Y-axis and Z-axis in Fig. 9). We make use of
this property and partition the grid in the Y and Z directions. For example, in Fig. 9
the Y-Z domain is shown partitioned among 9 processors.
We denote the discretization in the X, Y, and Z directions by NX, NY, and NZ,
respectively. As the discretization within an insulating layer, NK, varies, the size of
the local cliques changes, and therefore so does the individual processor performance.
In Table 3 we note the effect of varying the layer discretization on the i860 processor
performance during the solution of the linear systems. For these numbers we have
used 128 processors and fixed the local problem size to be roughly equivalent. The
second column shows the average size of the identical nodes found in the graph by
the solver; the third column shows the average clique size found. The final column
shows the average computational rate per processor during the solution of the linear
systems.
TABLE 3
The effect of varying the layer discretization on the processor performance in solving the linear
systems
In Table 4 we present results for the linear solver on three problems with differing
geometric configurations on 512 processors.
TABLE 4
Computational results obtained for three different problem configurations on 512 processors
In the solution of both of these systems, the diagonal of the matrix was scaled
to be one. If the incomplete factorization fails (a negative diagonal element is created
during the factorization), a small multiple of the identity is added to the diagonal, and
the factorization is restarted. This process is repeated until a successful factorization is
obtained [16]. The average number of conjugate gradient iterations required to solve
one nonlinear iteration of the thermal equilibrium problem for the crystal model to
a relative accuracy of 10^-7 is approximately 700. The average number of conjugate
gradient iterations required per nonlinear iteration for the superconductivity problem
is approximately 250. The linear systems arising in the superconductivity problem
are solved to a relative accuracy of 5.0 × 10^-4. However, it should be noted that
these are special linear systems: they are highly singular (more than one-fifth of the
eigenvalues are zero, because of physical symmetries). However, they are consistent
near a local minimizer, because the projection of the right-hand side (the gradient of
the free energy function) onto the null space of the matrix is zero near the minimizer.
of required colors and the size of the quotient graph. The implementation allows
the user to specify the maximum clique size and the maximum number of cliques
per color, in case load-balancing or convergence problems arise. In the experimental
results section we demonstrate the improvement in processor performance for larger
clique sizes for the superconductivity problem. In addition, the concentration of the
basic computation in the BLAS allows for an efficient, portable implementation.
Finally, we note that recent theoretical results have shown that, for a model
problem, the convergence rate improves as the number of colors is increased [11]. This
possibility was investigated for the piezoelectric crystal problem, and a definite, but
moderate, decrease in the convergence rate was found in going from a pointwise coloring
(≈ 108 colors) to a clique coloring (≈ 10 colors). However, the increase in
efficiency of the implementation for the clique coloring more than offset the convergence
differences.
Overall, we feel that this approach is an effective one for efficiently
solving large, sparse linear systems on massively parallel machines. We have demonstrated
that our implementation is able to solve general sparse systems from two
different applications, achieving both good processor performance and convergence
properties.
REFERENCES
[1] F. L. ALVARADO, A. POTHEN, AND R. SCHREIBER, Highly parallel sparse triangular solution,
Tech. Rep. CS-92-09, The Pennsylvania State University, May 1992.
[2] F. L. ALVARADO AND R. SCHREIBER, Optimal parallel solution of sparse triangular systems,
SIAM Journal on Scientific and Statistical Computing, (to appear).
[3] D. BRÉLAZ, New methods to color the vertices of a graph, Comm. ACM, 22 (1979), pp. 251-256.
[4] T. F. COLEMAN AND J. J. MORÉ, Estimation of sparse Jacobian matrices and graph coloring
problems, SIAM Journal on Numerical Analysis, 20 (1983), pp. 187-209.
[5] M. R. GAREY AND D. S. JOHNSON, Computers and Intractability, W. H. Freeman, New York,
1979.
[6] J. L. GUSTAFSON, G. R. MONTRY, AND R. E. BENNER, Development of parallel methods
for a 1024-processor hypercube, SIAM Journal on Scientific and Statistical Computing, 9
(1988), pp. 609-638.
[7] L. A. HAGEMAN AND D. M. YOUNG, Applied Iterative Methods, Academic Press, New York,
1981.
[8] D. S. JOHNSON, Worst case behavior of graph coloring algorithms, in Proceedings 5th Southeastern
Conference on Combinatorics, Graph Theory, and Computing, Utilitas Mathematica
Publishing, Winnipeg, 1974, pp. 513-527.
[9] M. T. JONES AND M. L. PATRICK, The Lanczos algorithm for the generalized symmetric
eigenproblem on shared-memory architectures, Preprint MCS-P182-1090, Mathematics and
Computer Science Division, Argonne National Laboratory, Argonne, Ill., 1990.
[10] M. T. JONES AND P. E. PLASSMANN, Scalable iterative solution of sparse linear systems,
Preprint MCS-P277-1191, Mathematics and Computer Science Division, Argonne National
Laboratory, Argonne, Ill., 1991.
[11] ——, The effect of many-color orderings on the convergence of iterative methods, in Proceedings
of the Copper Mountain Conference on Iterative Methods, SIAM LA-SIG, 1992.
[12] ——, Solution of large, sparse systems of linear equations in massively parallel applications,
Preprint MCS-P313-0692, Mathematics and Computer Science Division, Argonne National
Laboratory, Argonne, Ill., 1992.
[13] ——, A parallel graph coloring heuristic, SIAM Journal on Scientific and Statistical Computing,
14 (1993).
[14] KUCK & ASSOCIATES, CLASSPACK Basic Math Library User's Guide (Release 1.1), Kuck &
Associates, Inc., Champaign, IL, 1990.
[15] M. LUBY, A simple parallel algorithm for the maximal independent set problem, SIAM Journal
on Computing, 15 (1986), pp. 1036-1053.
[16] T. A. MANTEUFFEL, An incomplete factorization technique for positive definite linear systems,
Mathematics of Computation, 34 (1980), pp. 473-497.
[17] B. NOUR-OMID, B. N. PARLETT, T. ERICSSON, AND P. S. JENSEN, How to implement the
spectral transformation, Mathematics of Computation, 48 (1987), pp. 663-673.
[18] A. POTHEN, H. SIMON, AND K.-P. LIOU, Partitioning sparse matrices with eigenvectors of
graphs, SIAM Journal on Matrix Analysis, 11 (1990), pp. 430-452.
[19] R. SCHREIBER, Private communication, 1991.
[20] R. SCHREIBER AND W.-P. TANG, Vectorizing the conjugate gradient method, unpublished
manuscript, Department of Computer Science, Stanford University, 1982.
[21] H. A. VAN DER VORST, High performance preconditioning, SIAM Journal on Scientific and
Statistical Computing, 10 (1989), pp. 1174-1185.
[22] S. VAVASIS, Automatic domain partitioning in three dimensions, SIAM Journal on Scientific
and Statistical Computing, 12 (1991), pp. 950-970.