
The IMA Volumes

in Mathematics
and its Applications
Volume 56

Series Editors
Avner Friedman Willard Miller, Jr.
Institute for Mathematics and
its Applications
IMA
The Institute for Mathematics and its Applications was established by a grant from the
National Science Foundation to the University of Minnesota in 1982. The IMA seeks to encourage
the development and study of fresh mathematical concepts and questions of concern to the other
sciences by bringing together mathematicians and scientists from diverse fields in an atmosphere
that will stimulate discussion and collaboration.
The IMA Volumes are intended to involve the broader scientific community in this process.
Avner Friedman, Director
Willard Miller, Jr., Associate Director
**********
IMA ANNUAL PROGRAMS
1982-1983 Statistical and Continuum Approaches to Phase Transition
1983-1984 Mathematical Models for the Economics of Decentralized
Resource Allocation
1984-1985 Continuum Physics and Partial Differential Equations
1985-1986 Stochastic Differential Equations and Their Applications
1986-1987 Scientific Computation
1987-1988 Applied Combinatorics
1988-1989 Nonlinear Waves
1989-1990 Dynamical Systems and Their Applications
1990-1991 Phase Transitions and Free Boundaries
1991-1992 Applied Linear Algebra
1992-1993 Control Theory and its Applications
1993-1994 Emerging Applications of Probability
IMA SUMMER PROGRAMS
1987 Robotics
1988 Signal Processing
1989 Robustness, Diagnostics, Computing and Graphics in Statistics
1990 Radar and Sonar
1990 Time Series
1991 Semiconductors
1992 Environmental Studies: Mathematical, Computational, and Statistical Analysis
**********
SPRINGER LECTURE NOTES FROM THE IMA:
The Mathematics and Physics of Disordered Media
Editors: Barry Hughes and Barry Ninham
(Lecture Notes in Math., Volume 1035, 1983)
Orienting Polymers
Editor: J.L. Ericksen
(Lecture Notes in Math., Volume 1063, 1984)
New Perspectives in Thermodynamics
Editor: James Serrin
(Springer-Verlag, 1986)
Models of Economic Dynamics
Editor: Hugo Sonnenschein
(Lecture Notes in Econ., Volume 264, 1986)
Alan George John R. Gilbert
Joseph W.R. Liu
Editors

Graph Theory and


Sparse Matrix Computation

With 102 Illustrations

Springer-Verlag
New York Berlin Heidelberg London Paris
Tokyo Hong Kong Barcelona Budapest
Alan George
University of Waterloo
Needles Hall
Waterloo, Ontario N2L 3G1
Canada

John R. Gilbert
Xerox Palo Alto Research Center
3333 Coyote Hill Road
Palo Alto, CA 94304-1314 USA

Joseph W.H. Liu
Department of Computer Science
York University
North York, Ontario M3J 1P3
Canada

Series Editors:
Avner Friedman
Willard Miller, Jr.
Institute for Mathematics and its Applications
University of Minnesota
Minneapolis, MN 55455 USA
Mathematics Subject Classifications (1991): 05C50, 65F50, 05C05, 05C70, 05C20,
15A23, 15A06, 65F05, 65F10, 65F20, 65F25, 68R10

Library of Congress Cataloging-in-Publication Data


Graph theory and sparse matrix computation / Alan George, John R.
Gilbert, Joseph W.H. Liu, editors
p. cm. - (The IMA volumes in mathematics and its
applications ; v. 56)
Includes bibliographical references.
ISBN-13:978-1-4613-8371-0 (alk. paper)
1. Graph theory-Congresses. 2. Sparse matrices-Congresses.
I. George, Alan. II. Gilbert, J.R. (John R.), 1953- . III. Liu,
Joseph W.H. IV. Series.
QA166.G7315 1993
511'.5-dc20 93-26146

Printed on acid-free paper.


© 1993 Springer-Verlag New York, Inc.
Softcover reprint of the hardcover 1st edition 1993

All rights reserved. This work may not be translated or copied in whole or in part without the
written permission of the publisher (Springer-Verlag New York, Inc., 175 Fifth Avenue, New
York, NY 10010, USA), except for brief excerpts in connection with reviews or scholarly
analysis. Use in connection with any form of information storage and retrieval, electronic
adaptation, computer software, or by similar or dissimilar methodology now known or hereaf-
ter developed is forbidden.
The use of general descriptive names, trade names, trademarks, etc., in this publication, even
if the former are not especially identified, is not to be taken as a sign that such names, as
understood by the Trade Marks and Merchandise Marks Act, may accordingly be used freely
by anyone.
Permission to photocopy for internal or personal use, or the internal or personal use of specific
clients, is granted by Springer-Verlag, Inc., for libraries registered with the Copyright Clearance
Center (CCC), provided that the base fee of $5.00 per copy, plus $0.20 per page, is paid directly
to CCC, 21 Congress St., Salem, MA 01970, USA. Special requests should be addressed directly
to Springer-Verlag New York, 175 Fifth Avenue, New York, NY 10010, USA.
ISBN-13: 978-1-4613-8371-0/1993 $5.00 + 0.20
Production managed by Hal Henglein; manufacturing supervised by Jacqui Ashri.
Camera-ready copy prepared by the IMA.

ISBN-13: 978-1-4613-8371-0    e-ISBN-13: 978-1-4613-8369-7


DOI: 10.1007/978-1-4613-8369-7
The IMA Volumes
in Mathematics and its Applications

Current Volumes:
Volume 1: Homogenization and Effective Moduli of Materials and Media
Editors: Jerry Ericksen, David Kinderlehrer, Robert Kohn, J.-L. Lions

Volume 2: Oscillation Theory, Computation, and Methods of Compensated Compactness
Editors: Constantine Dafermos, Jerry Ericksen, David Kinderlehrer, Marshall Slemrod

Volume 3: Metastability and Incompletely Posed Problems
Editors: Stuart Antman, Jerry Ericksen, David Kinderlehrer, Ingo Müller

Volume 4: Dynamical Problems in Continuum Physics
Editors: Jerry Bona, Constantine Dafermos, Jerry Ericksen, David Kinderlehrer

Volume 5: Theory and Applications of Liquid Crystals
Editors: Jerry Ericksen and David Kinderlehrer

Volume 6: Amorphous Polymers and Non-Newtonian Fluids
Editors: Constantine Dafermos, Jerry Ericksen, David Kinderlehrer

Volume 7: Random Media
Editor: George Papanicolaou

Volume 8: Percolation Theory and Ergodic Theory of Infinite Particle Systems
Editor: Harry Kesten

Volume 9: Hydrodynamic Behavior and Interacting Particle Systems
Editor: George Papanicolaou

Volume 10: Stochastic Differential Systems, Stochastic Control Theory and Applications
Editors: Wendell Fleming and Pierre-Louis Lions

Volume 11: Numerical Simulation in Oil Recovery
Editor: Mary Fanett Wheeler

Volume 12: Computational Fluid Dynamics and Reacting Gas Flows
Editors: Bjorn Engquist, M. Luskin, Andrew Majda
Volume 13: Numerical Algorithms for Parallel Computer Architectures
Editor: Martin H. Schultz

Volume 14: Mathematical Aspects of Scientific Software
Editor: J.R. Rice

Volume 15: Mathematical Frontiers in Computational Chemical Physics
Editor: D. Truhlar

Volume 16: Mathematics in Industrial Problems
by Avner Friedman

Volume 17: Applications of Combinatorics and Graph Theory to the Biological and Social Sciences
Editor: Fred Roberts

Volume 18: q-Series and Partitions
Editor: Dennis Stanton

Volume 19: Invariant Theory and Tableaux
Editor: Dennis Stanton

Volume 20: Coding Theory and Design Theory Part I: Coding Theory
Editor: Dijen Ray-Chaudhuri

Volume 21: Coding Theory and Design Theory Part II: Design Theory
Editor: Dijen Ray-Chaudhuri

Volume 22: Signal Processing: Part I - Signal Processing Theory
Editors: L. Auslander, F.A. Grünbaum, J.W. Helton, T. Kailath, P. Khargonekar and S. Mitter

Volume 23: Signal Processing: Part II - Control Theory and Applications of Signal Processing
Editors: L. Auslander, F.A. Grünbaum, J.W. Helton, T. Kailath, P. Khargonekar and S. Mitter

Volume 24: Mathematics in Industrial Problems, Part 2
by Avner Friedman

Volume 25: Solitons in Physics, Mathematics, and Nonlinear Optics
Editors: Peter J. Olver and David H. Sattinger

Volume 26: Two Phase Flows and Waves
Editors: Daniel D. Joseph and David G. Schaeffer

Volume 27: Nonlinear Evolution Equations that Change Type
Editors: Barbara Lee Keyfitz and Michael Shearer

Volume 28: Computer Aided Proofs in Analysis
Editors: Kenneth Meyer and Dieter Schmidt

Volume 29: Multidimensional Hyperbolic Problems and Computations
Editors: Andrew Majda and Jim Glimm

Volume 30: Microlocal Analysis and Nonlinear Waves
Editors: Michael Beals, R. Melrose and J. Rauch

Volume 31: Mathematics in Industrial Problems, Part 3
by Avner Friedman

Volume 32: Radar and Sonar, Part I
by Richard Blahut, Willard Miller, Jr. and Calvin Wilcox

Volume 33: Directions in Robust Statistics and Diagnostics: Part I
Editors: Werner A. Stahel and Sanford Weisberg

Volume 34: Directions in Robust Statistics and Diagnostics: Part II
Editors: Werner A. Stahel and Sanford Weisberg

Volume 35: Dynamical Issues in Combustion Theory
Editors: P. Fife, A. Liñán and F.A. Williams

Volume 36: Computing and Graphics in Statistics
Editors: Andreas Buja and Paul Tukey

Volume 37: Patterns and Dynamics in Reactive Media
Editors: Harry Swinney, Gus Aris and Don Aronson

Volume 38: Mathematics in Industrial Problems, Part 4
by Avner Friedman

Volume 39: Radar and Sonar, Part II
Editors: F. Alberto Grünbaum, Marvin Bernfeld and Richard E. Blahut
Volume 40: Nonlinear Phenomena in Atmospheric and Oceanic Sciences
Editors: George F. Carnevale and Raymond T. Pierrehumbert

Volume 41: Chaotic Processes in the Geological Sciences
Editor: David A. Yuen

Volume 42: Partial Differential Equations with Minimal Smoothness and Applications
Editors: B. Dahlberg, E. Fabes, R. Fefferman, D. Jerison, C. Kenig, and J. Pipher

Volume 43: On the Evolution of Phase Boundaries
Editors: Morton E. Gurtin and Geoffrey B. McFadden

Volume 44: Twist Mappings and Their Applications
Editors: Richard McGehee and Kenneth R. Meyer

Volume 45: New Directions in Time Series Analysis, Part I
Editors: David Brillinger, Peter Caines, John Geweke, Emanuel Parzen, Murray Rosenblatt, and Murad S. Taqqu

Volume 46: New Directions in Time Series Analysis, Part II
Editors: David Brillinger, Peter Caines, John Geweke, Emanuel Parzen, Murray Rosenblatt, and Murad S. Taqqu

Volume 47: Degenerate Diffusions
Editors: Wei-Ming Ni, L.A. Peletier, and J.-L. Vazquez

Volume 48: Linear Algebra, Markov Chains and Queueing Models
Editors: Carl D. Meyer and Robert J. Plemmons

Volume 49: Mathematics in Industrial Problems, Part 5
by Avner Friedman

Volume 50: Combinatorial and Graph-Theoretical Problems in Linear Algebra
Editors: Richard A. Brualdi, Shmuel Friedland, and Victor Klee

Volume 51: Statistical Thermodynamics and Differential Geometry of Microstructured Materials
Editors: H. Ted Davis and Johannes C.C. Nitsche

Volume 52: Shock Induced Transitions and Phase Structures in General Media
Editors: J.E. Dunn, Roger Fosdick, and Marshall Slemrod

Volume 53: Variational Problems
Editors: Avner Friedman and Joel Spruck

Volume 54: Microstructure and Phase Transitions
Editors: D. Kinderlehrer, R. James, and M. Luskin

Volume 55: Turbulence in Fluid Flows: A Dynamical Systems Approach
Editors: C. Foias, G.R. Sell, and R. Temam

Volume 56: Graph Theory and Sparse Matrix Computation
Editors: Alan George, John R. Gilbert, and Joseph W.H. Liu

Forthcoming Volumes:

Phase Transitions and Free Boundaries
Free Boundaries in Viscous Flows

Summer Program Semiconductors
Semiconductors (2 volumes)

Applied Linear Algebra
Iterative Methods for Sparse and Structured Problems
Linear Algebra for Signal Processing
Linear Algebra for Control Theory

Summer Program Environmental Studies
Environmental Studies

Control Theory
Robust Control Theory
Control Design for Advanced Engineering Systems: Complexity, Uncertainty, In-
formation and Organization
Control and Optimal Design of Distributed Parameter Systems
Flow Control
Robotics
Nonsmooth Analysis & Geometric Methods in Deterministic Optimal Control
Systems & Control Theory for Power Systems
Adaptive Control, Filtering and Signal Processing
FOREWORD

This IMA Volume in Mathematics and its Applications

GRAPH THEORY AND SPARSE MATRIX COMPUTATION

is based on the proceedings of a workshop that was an integral part of the 1991-
92 IMA program on "Applied Linear Algebra." The purpose of the workshop was
to bring together people who work in sparse matrix computation with those who
conduct research in applied graph theory and graph algorithms, in order to foster
active cross-fertilization. We are grateful to Richard Brualdi, George Cybenko,
Alan George, Gene Golub, Mitchell Luskin, and Paul Van Dooren for planning and
implementing the year-long program.
We especially thank Alan George, John R. Gilbert, and Joseph W.H. Liu for
organizing this workshop and editing the proceedings.
The financial support of the National Science Foundation made the workshop
possible.

Avner Friedman
Willard Miller, Jr.
PREFACE

When reality is modeled by computation, linear algebra is often the connection
between the continuous physical world and the finite algorithmic one. Usually,
the more detailed the model, the bigger the matrix, the better the answer. Efficiency
demands that every possible advantage be exploited: sparse structure, advanced com-
puter architectures, efficient algorithms. Therefore sparse matrix computation knits
together threads from linear algebra, parallel computing, data structures, geometry,
and both numerical and discrete algorithms.
Graph theory has been ubiquitous in sparse matrix computation ever since Sey-
mour Parter used undirected graphs to model symmetric Gaussian elimination more
than 30 years ago. Three of the reasons are paths, locality, and data structures. Paths
in the graph of a matrix are important in many contexts: fill paths in Gaussian
elimination, strongly connected components in irreducibility, bipartite matching, and
alternating paths in linear dependence and structural singularity. Graphs are the
right setting to discuss the kinds of locality in a sparse matrix that allow a parallel
algorithm to work on different parts of a problem more or less independently. And
the active field of graph algorithms is a rich source of data structures and efficient
techniques for manipulating sparse matrices by computer.
The Institute for Mathematics and Its Applications held a workshop on "Sparse
Matrix Computations: Graph Theory Issues and Algorithms," organized by the ed-
itors of this volume, from October 14 to 18, 1991. The workshop included fourteen
invited and several contributed talks, software demonstrations, an open problem ses-
sion, and a great deal of stimulating discussion between mathematicians, numerical
analysts, and theoretical computer scientists. After the workshop we invited some of
the participants to submit papers for this collection. We intend the result to be a
resource for the researcher or advanced student of either graphs or sparse matrices
who wants to explore their connections. Therefore, we asked the authors to undertake
the challenging task of making current research accessible to both communities.
The order of papers in the volume reflects a rough grouping into three categories.
• First, graph models of symmetric matrices and factorizations: Blair and
Peyton on chordal graphs and clique trees; Agrawal and Klein on provably
good nested dissection orderings; and Miller, Teng, Thurston, and Vavasis
on separators for geometric graphs.
• Second, graph models of algorithms on nonsymmetric matrices: Eisenstat
and Liu on Schur complements; Johnson and Xenophontos on Perron com-
plements; Gilbert and Ng on QR factorization and partial pivoting; and
Alvarado, Pothen, and Schreiber on partitioned inverses of triangular matri-
ces.
• Third, parallel sparse matrix algorithms: Ashcraft on distributed-memory
sparse Cholesky factorization; Schreiber on scalability and its limits; Kratzer
and Cleary on massively parallel LU and QR; and Jones and Plassmann on a
parallel iterative method.
Of course, the categories overlap and interrelate. Separators (Agrawal, Miller) are
useful in parallel matrix computation, for both direct and iterative methods. So are
partitioned inverses (Alvarado). Nonsymmetric analyses (Gilbert) gain leverage from
symmetric models, with intersection graphs as the fulcrum. Another view might try
to group the papers by general subject, by matrix algorithm, or by graph-theoretic
model:
Subjects: Reorderings for efficient factorization (Agrawal, Miller, Ashcraft),
nonzero structure prediction (Eisenstat, Johnson, Gilbert), partitioning (Agrawal,
Miller, Alvarado, Jones), parallelism (Miller, Alvarado, Ashcraft, Schreiber, Kratzer,
Jones).
Matrix algorithms: Cholesky factorization (Blair, Agrawal, Miller, Ashcraft,
Schreiber), nonsymmetric factorization (Eisenstat, Gilbert, Kratzer), matrix-vector
multiplication (Miller, Jones), triangular solution (Alvarado), Schur complement
(Eisenstat, Johnson).
Graph models: Chordal graphs (Blair, Agrawal, Alvarado, Schreiber), various
trees (Blair, Agrawal, Alvarado, Schreiber), directed graphs (Eisenstat, Johnson, Gil-
bert, Alvarado), bipartite graphs (Gilbert), other undirected graphs (Agrawal, Miller,
Jones).
The astute reader will recognize this as an adjacency-list representation of a
sparse matrix; its nonzero structure and one of its graphs are displayed below.
Anyone who has spent time at the IMA knows that Avner Friedman and his staff
nurture an amazing environment of mathematical stimulation and interdisciplinary
excitement. The IMA special year on applied linear algebra was blessed further
by having Richard Brualdi as organizer and intellectual shepherd. We express our
deepest thanks to them, to the workshop participants, and most of all to the authors
of these papers.
Alan George, Waterloo
John R. Gilbert, Palo Alto
Joseph W. H. Liu, York
March 1993

FIG. 1. A sparse matrix (nz = 45) and its column intersection graph.


CONTENTS

Foreword ................................................................. xi

Preface .................................................................. xiii

An introduction to chordal graphs and clique trees ....................... 1
Jean R.S. Blair and Barry Peyton

Cutting down on fill using nested dissection:
Provably good elimination orderings ...................................... 31
Ajit Agrawal, Philip Klein and R. Ravi

Automatic Mesh Partitioning .............................................. 57
Gary L. Miller, Shang-Hua Teng, William Thurston and Stephen A. Vavasis

Structural representations of Schur complements in sparse matrices ...... 85
Stanley C. Eisenstat and Joseph W.H. Liu

Irreducibility and primitivity of Perron complements:
Application of the compressed directed graph ............................. 101
Charles R. Johnson and Christos Xenophontos

Predicting structure in nonsymmetric sparse matrix factorizations ....... 107
John R. Gilbert and Esmond G. Ng

Highly parallel sparse triangular solution ............................... 141
Fernando L. Alvarado, Alex Pothen, and Robert Schreiber

The fan-both family of column-based distributed
Cholesky factorization algorithms ........................................ 159
Cleve Ashcraft

Scalability of sparse direct solvers ..................................... 191
Robert Schreiber

Sparse matrix factorization on SIMD parallel computers .................. 211
Steven G. Kratzer and Andrew J. Cleary

The efficient parallel iterative solution of large sparse linear systems .. 229
Mark T. Jones and Paul E. Plassmann
AN INTRODUCTION TO CHORDAL GRAPHS AND
CLIQUE TREES*

JEAN R. S. BLAIR† AND BARRY PEYTON‡

Clique trees and chordal graphs have carved out a niche for themselves in recent work on sparse
matrix algorithms, due primarily to research questions associated with advanced computer archi-
tectures. This paper is a unified and elementary introduction to the standard characterizations
of chordal graphs and clique trees. The pace is leisurely, as detailed proofs of all results are in-
cluded. We also briefly discuss applications of chordal graphs and clique trees in sparse matrix
computations.

Key Words. chordal graphs, clique trees, acyclic hypergraphs, minimum spanning tree, Prim's
algorithm, maximum cardinality search, sparse linear systems, Cholesky factorization

AMS(MOS) subject classifications. 68R10, 05C50, 65F50, 68Q25

1. Introduction. It is well known that chordal graphs model the sparsity struc-
ture of the Cholesky factor of a sparse positive definite matrix [40]. Of the many ways
to represent a chordal graph, a particularly useful and compact representation is pro-
vided by clique trees [24, 46]. Until recently, explicit use of the properties of chordal
graphs or clique trees in sparse matrix computations was rarely needed. For example,
chordal graphs are mentioned in a single exercise in George and Liu [16]. However,
chordal graphs and clique trees have found a niche in more recent work in this area,
primarily due to various research questions associated with advanced computer ar-
chitectures. For instance, the multifrontal method [7], which was developed to obtain
good performance on vector supercomputers, can be expressed very succinctly in
terms of a clique tree representation of the underlying chordal graph [34, 38].

This paper is intended as an update to the graph theoretical results presented and
proved in Rose [40], which predated the introduction of clique trees. Our goal is to
provide a unified introduction to chordal graphs and clique trees for those interested
in sparse matrix computations, though we hope it will be of use to those in other
application areas in which these graphs play a major role. We have striven to write
a primer, not a survey article: we present a limited number of well known results
of fundamental importance, and prove all the results in the paper. The pacing is
intended to be leisurely, and the organization is intended to enable the reader to read
selected topics of interest in detail.

The paper is organized as follows. Section 2 contains the standard well known
characterizations of chordal graphs and presents the maximum cardinality search
algorithm for computing a perfect elimination ordering. Section 3 presents several
characterizations of the clique trees of a chordal graph, including a maximum spanning
tree property that is probably not as widely known as the others are. Section 4 ties
together certain concepts and results from the previous two sections: it identifies
the minimal vertex separators in a chordal graph with edges in any one of its clique
trees, and it also shows that the maximum cardinality search algorithm is just Prim's
algorithm in disguise. Finally, Section 5 briefly discusses recent applications of chordal
graphs and clique trees to specific questions arising in sparse matrix computations.
All technical terms used in this section are defined later in the paper.

* Work was supported in part by the Applied Mathematical Sciences Research Program, Office
of Energy Research, U.S. Department of Energy under contract DE-AC05-84OR21400 with Mar-
tin Marietta Energy Systems, Incorporated, and in part by the Institute for Mathematics and Its
Applications with funds provided by the National Science Foundation.
† Department of Computer Science, University of Tennessee, Knoxville, TN 37996-1301.
‡ Mathematical Sciences Section, Oak Ridge National Laboratory, P.O. Box 2008, Bldg. 6012,
Oak Ridge, TN 37831-6367.

2. Chordal graphs. An undirected graph is chordal (triangulated, rigid circuit)
if every cycle of length greater than three has a chord: namely, an edge connecting
two nonconsecutive vertices on the cycle. After introducing graph notation and ter-
minology in Section 2.1, we present two standard characterizations of chordal graphs
in Sections 2.2 and 2.3. The latter of these two sections shows that chordal graphs
are characterized by possession of a perfect elimination ordering of the vertices. The
maximum cardinality search algorithm is a linear-time procedure for generating a
perfect elimination ordering. Section 2.4 describes this algorithm and proves it cor-
rect. The necessary definitions and references for each of these results are given in
the appropriate subsection.

2.1. Graph terminology. We assume familiarity with elementary concepts and
definitions from graph theory, such as tree, edge, undirected graph, connected compo-
nent, etc. Golumbic [20] provides a good review of this material. Here we introduce
some of the graph notation and terminology that will be used throughout the paper.
Other concepts from graph theory will be introduced as needed in later sections of
the paper.
We let G = (V, E) denote an undirected graph with vertex set V and edge set E.
The number of vertices is denoted by n = |V| and the number of edges by e = |E|.
For any vertex set S ⊆ V, consider the edge set E(S) ⊆ E given by

E(S) := {(u, v) ∈ E | u, v ∈ S}.

We let G(S) denote the subgraph of G induced by S, namely the subgraph (S, E(S)).
At times it will be convenient to consider the induced subgraph of G obtained by
removing a set of vertices S ⊆ V from the graph; hence we define G \ S by

G \ S := G(V − S).

Two vertices u, v ∈ V are said to be adjacent if (u, v) ∈ E. Also, the edge
(u, v) ∈ E is said to be incident with both vertices u and v. The set of vertices
adjacent to v in G is denoted by adj_G(v). Similarly, the set of vertices adjacent
to S ⊆ V in G is given by

adj_G(S) := {v ∈ V | v ∉ S and (u, v) ∈ E for some vertex u ∈ S}.

(The subscript G often will be suppressed when the graph is known by context.) An
induced subgraph G(S) is complete if the vertices in S are pairwise adjacent in G. In
this case we also say that S is complete in G.
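These set operations translate directly into code. The following minimal sketch is ours, not from the paper: it represents an undirected graph as a Python dict mapping each vertex to its adjacency set adj_G(v), and implements G(S), G \ S, adj_G(S), and the completeness test; all function names are our own.

def induced_subgraph(G, S):
    """G(S): the subgraph with vertex set S and edge set E(S)."""
    return {v: G[v] & S for v in S}

def remove_vertices(G, S):
    """G \\ S: the subgraph induced by V - S."""
    return induced_subgraph(G, set(G) - S)

def adj_set(G, S):
    """adj_G(S): vertices outside S adjacent to some vertex in S."""
    return set().union(*(G[v] for v in S)) - S if S else set()

def is_complete(G, S):
    """Is S complete in G, i.e., are the vertices of S pairwise adjacent?"""
    return all(u in G[v] for u in S for v in S if u != v)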
We let [v_0, v_1, ..., v_k] denote a simple path of length k from v_0 to v_k in G, i.e.,
v_i ≠ v_j for i ≠ j and (v_i, v_{i+1}) ∈ E for 0 ≤ i ≤ k − 1. Similarly, [v_0, v_1, ..., v_k, v_0]
denotes a simple cycle of length k + 1 in G. Finally, a chord of a path (cycle) is any
edge joining two nonconsecutive vertices of the path (cycle).
DEFINITION 1. An undirected graph G = (V, E) is chordal (triangulated, rigid
circuit) if every cycle of length greater than three has a chord.

Clearly, any induced subgraph of a chordal graph is also chordal, a fact that is
useful in several of the proofs that follow.

2.2. Minimal vertex separators. A subset S ⊂ V is a separator of G if
two vertices in the same connected component of G are in two distinct connected
components of G \ S. If a and b are two vertices separated by S then S is said to
be an ab-separator. The set S is a minimal separator of G if S is a separator and no
proper subset of S separates the graph; likewise S is a minimal ab-separator if S is
an ab-separator and no proper subset of S separates a and b into distinct connected
components. When the pair of vertices remains unspecified, we refer to S as a minimal
vertex separator. It does not necessarily follow that a minimal vertex separator is also
a minimal separator of the graph. For instance, in Figure 1 the set S = {b, e} is a
minimal dc-separator; nevertheless, S is not a minimal separator of G since {e} ⊂ S
is also a separator of G. Minimal vertex separators are used to characterize chordal
graphs in Theorem 2.1, which is due to Dirac [6]. The proof is taken from Peyton [34],
which, in turn, closely follows the proof given by Golumbic [20].

FIG. 1. Minimal dc-separator {b, e} is not a minimal separator of G.
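To make these distinctions concrete in code, the following brute-force sketch (ours, reusing the dict representation and remove_vertices from the sketch in Section 2.1) tests the separator definitions directly; it is exponential in |S| and is meant only as an executable restatement of the definitions, assuming a and b do not belong to S.

from itertools import combinations

def same_component(G, a, b):
    """Breadth-first search: are a and b in the same connected component?"""
    frontier, seen = [a], {a}
    while frontier:
        v = frontier.pop()
        if v == b:
            return True
        for w in G[v] - seen:
            seen.add(w)
            frontier.append(w)
    return False

def is_ab_separator(G, S, a, b):
    """S separates a and b if they lie in distinct components of G \\ S."""
    return not same_component(remove_vertices(G, S), a, b)

def is_minimal_ab_separator(G, S, a, b):
    """S is an ab-separator and no proper subset of S separates a and b."""
    return (is_ab_separator(G, S, a, b) and
            not any(is_ab_separator(G, set(T), a, b)
                    for r in range(len(S))
                    for T in combinations(S, r)))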
THEOREM 2.1 (DIRAC [6]). A graph G is chordal if and only if every minimal
vertex separator of G is complete in G.

Proof. Assume every minimal vertex separator of G is complete in G, and let
μ = [v_0, ..., v_k, v_0] be any cycle of length greater than three in G (i.e., k ≥ 3). If
(v_0, v_2) ∈ E, then μ has a chord. If not, then there exists a v_0v_2-separator S (e.g.,
S = V − {v_0, v_2}); furthermore, any such separator must contain v_1 and v_i for some i,
3 ≤ i ≤ k. Choose S to be a minimal v_0v_2-separator so that S, by assumption, is
complete in G. It follows that (v_1, v_i) is a chord of μ, which proves the "if" part of
the result.

Now assume G is chordal and let S be a minimal ab-separator of G. Let G(A)
and G(B) be the connected components of G \ S containing a and b, respectively.
It suffices to show that for any two distinct vertices in S, say x and y, we have
(x, y) ∈ E. Since S is minimal, each vertex v ∈ S is adjacent to some vertex in A
and some vertex in B; otherwise, S − {v} would be an ab-separator contrary to the
minimality of S. Thus, there exist paths μ = [x, a_1, ..., a_r, y] and ν = [y, b_1, ..., b_t, x]
where each a_i ∈ A and each b_i ∈ B (see Figure 2). Further, choose μ and ν so that
they are of the smallest possible length greater than one, and combine them to form
the cycle σ = [x, a_1, ..., a_r, y, b_1, ..., b_t, x]. Since G is chordal and σ is a cycle of
length greater than three, σ must have a chord. Any chord of σ incident with a_i,
1 ≤ i ≤ r, would either join a_i to another vertex in μ contrary to the minimality of
the length of μ, or would join a_i to a vertex in B, which is impossible because S
separates A from B in G. Consequently, no chord of σ is incident with a vertex a_i,
1 ≤ i ≤ r, and by the same argument no chord of the cycle is incident with a vertex
b_j, 1 ≤ j ≤ t. It follows that the only possible chord is (x, y). □

FIG. 2. Cycle in the proof of Theorem 2.1 that induces chord (x, y).

Remark. In reality, r = t = 1, otherwise [x, a_1, ..., a_r, y, x] or [y, b_1, ..., b_t, x, y] is
a chordless cycle of length greater than three.

2.3. Perfect elimination orderings. We need the following terminology be-
fore we can state and prove the main result in this section. An ordering α of G is
a bijection α : V → {1, 2, ..., n}. Often it will be convenient to denote an ordering
by using it to index the vertex set, so that α(v_i) = i for 1 ≤ i ≤ n, where i will be
referred to as the label of v_i. Let v_1, v_2, ..., v_n be an ordering of V. For 1 ≤ i ≤ n,
we define L_i to be the set of vertices with labels greater than i − 1:

L_i := {v_i, v_{i+1}, ..., v_n}.

The monotone adjacency set of v_i, denoted madj_G(v_i), is given by

madj_G(v_i) := adj_G(v_i) ∩ L_{i+1}.

Again, the subscript G often will be suppressed where the graph is known by context.
A vertex v is simplicial if adj(v) induces a complete subgraph of G. The ordering α
is a perfect elimination ordering (PEO) if for 1 ≤ i ≤ n, the vertex v_i is simplicial in
the graph G(L_i). As shown below in Lemma 2.2, every nontrivial chordal graph has
a simplicial vertex (actually, at least two). Theorem 2.3, which states that chordal
graphs are characterized by the possession of a PEO, follows easily from Lemma 2.2.
The proofs are again taken from Peyton [34], which, in turn, closely follow arguments
found in Golumbic [20].
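The definition yields an immediate quadratic-time PEO test. The sketch below is ours (it is not the linear-time test from the literature): it simply checks that madj(v_i) is complete in G for every i, reusing is_complete from the Section 2.1 sketch.

def is_peo(G, order):
    """order lists the vertices as v_1, ..., v_n, so alpha(v_i) = i.
    alpha is a PEO iff every monotone adjacency set madj(v_i) is complete."""
    label = {v: i for i, v in enumerate(order)}
    for v in order:
        madj = {w for w in G[v] if label[w] > label[v]}
        if not is_complete(G, madj):
            return False
    return True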
LEMMA 2.2 (DIRAC [6]). Every chordal graph G has a simplicial vertex. If G is
not complete, then it has two nonadjacent simplicial vertices.
Proof. The lemma is trivial if G is complete. For the case where G is not complete
we proceed by induction on the number of vertices n. Let G be a chordal graph with
n ≥ 2 vertices, including two nonadjacent vertices a and b. If n = 2, both vertices of
the graph are simplicial since both are isolated (i.e., adj(a) = adj(b) = ∅). Suppose
n > 2 and assume that the lemma holds for all such graphs with fewer than n
vertices. Since a and b are nonadjacent, there exists an ab-separator (e.g., the set
V − {a, b}). Suppose S is a minimal ab-separator of G, and let G(A) and G(B) be
the connected components of G \ S containing a and b, respectively. The induced
subgraph G(A ∪ S) is a chordal graph having fewer vertices than G; hence, by the
induction hypothesis one of the following must hold: Either G(A ∪ S) is complete and
every vertex of A is a simplicial vertex of G(A ∪ S), or G(A ∪ S) has two nonadjacent
simplicial vertices, one of which must be in A since, by Theorem 2.1, S is complete
in G. Because adj_G(A) ⊆ A ∪ S, every simplicial vertex of G(A ∪ S) in A is also a
simplicial vertex of G. By the same argument, B also contains a simplicial vertex
of G, thereby completing the proof. □
THEOREM 2.3 (FULKERSON AND GROSS [10]). A graph G is chordal if and only
if G has a perfect elimination ordering.
Proof. Suppose G is chordal. We proceed by induction on the number of vertices n
to show the existence of a PEO of G. The case n = 1 is trivial. Suppose n > 1 and
every chordal graph with fewer vertices has a PEO. By Lemma 2.2, G has a simplicial
vertex, say v. Now G \ {v} is a chordal graph with fewer vertices than G; hence, by
induction it has a PEO, say β. If α orders the vertex v first, followed by the remaining
vertices of G in the order determined by β, then α is a PEO of G.
Conversely, suppose G has a PEO, say α, given by v_1, v_2, ..., v_n. We seek a chord
of an arbitrary cycle μ in G of length greater than three. Let v_i be the vertex on μ
whose label i is smaller than that of any other vertex on μ. Since α is a PEO,
madj(v_i) is complete; whence μ has at least one chord: namely, the edge joining the
two neighboring vertices of v_i in μ. □

2.4. Maximum cardinality search. Rose, Tarjan, and Lueker [41] introduced
the first linear-time algorithm for producing a PEO, known as the lexicographic
breadth first search algorithm. In a set of unpublished lecture notes, Tarjan [44]
introduced a simpler algorithm known as the maximum cardinality search (MCS) al-
gorithm. Tarjan and Yannakakis [46] later described MCS algorithms for both chordal
graphs and acyclic hypergraphs. The MCS algorithm for chordal graphs orders the
vertices in reverse order beginning with an arbitrary vertex v ∈ V for which it sets
α(v) = n. At each step the algorithm selects as the next vertex to label an unlabeled
vertex adjacent to the largest number of labeled vertices, with ties broken arbitrarily.
A high-level description of the algorithm is given in Figure 3. We refer the reader to
Tarjan and Yannakakis [46] for details on how to implement the algorithm to run in
O(n + e) time.

L_{n+1} ← ∅;
for i ← n to 1 step −1 do
    Choose a vertex v ∈ V − L_{i+1} for which |adj(v) ∩ L_{i+1}| is maximum;
    α(v) ← i;   [v becomes v_i]
    L_i ← L_{i+1} ∪ {v_i};
end for

FIG. 3. Maximum cardinality search (MCS).
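A direct transcription of Figure 3 into executable form follows; the code is ours. For simplicity it rescans for the maximum at every step, so it runs in O(n^2) time rather than the O(n + e) obtainable with the bucket data structure of Tarjan and Yannakakis [46]; ties are broken arbitrarily by max.

def maximum_cardinality_search(G):
    """Return an MCS ordering v_1, ..., v_n of the graph G."""
    weight = {v: 0 for v in G}         # |adj(v) ∩ L_{i+1}| for each unlabeled v
    unlabeled = set(G)
    order = []                         # collects v_n, v_{n-1}, ..., v_1
    for _ in range(len(G)):
        v = max(unlabeled, key=weight.get)   # next vertex to label
        order.append(v)
        unlabeled.remove(v)
        for w in G[v] & unlabeled:     # v has just joined the labeled set
            weight[w] += 1
    order.reverse()                    # order[i-1] is v_i
    return order

Combining this with is_peo from Section 2.3 gives a simple, if not optimally fast, chordality test: by Theorems 2.3 and 2.5, G is chordal if and only if the ordering returned by maximum_cardinality_search is a PEO [46].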

The following lemma and theorem prove that the MCS algorithm produces a PEO.
The lemma provides a useful characterization of the orderings of a chordal graph that
are not perfect elimination orderings. Edelman, Jamison, and Shier [9, 43] prove
similar results while studying the notion of convexity in chordal graphs. Theorem 2.5
is then proved by showing that every ordering that is not a PEO is also not an MCS
ordering. The proof is taken from Peyton [34]. Later in Section 4.2, we will provide
a more intuitive view of how the MCS algorithm works: it can be viewed as a special
implementation of Prim's algorithm applied to the weighted clique intersection graph
of G (defined in Section 3.4).

LEMMA 2.4. An ordering α of the vertices in a graph G is not a perfect elimination
ordering if and only if for some vertex v, there exists a chordless path of length greater
than one from v = α^{−1}(i) to some vertex in L_{i+1} through vertices in V − L_i.

Proof. Suppose α is not a PEO. Then, by the definition of a PEO, there exists a
vertex u ∈ V for which madj(u) is not complete in G; hence, there exist two vertices
v, w ∈ madj(u) joined by no edge in E. Without loss of generality assume that
i = α(v) < α(w). Then [v, u, w] is a chordless path of length two from v = α^{−1}(i)
to w ∈ L_{i+1} through u ∈ V − L_i.

Conversely, suppose there exists a chordless path μ = [u_0, u_1, ..., u_r] of length
r ≥ 2 from u_0 = α^{−1}(i) to u_r ∈ L_{i+1} through vertices u_j ∈ V − L_i, 1 ≤ j ≤ r − 1.
Let u_k, where 1 ≤ k ≤ r − 1, be the internal vertex in μ whose label α(u_k) is smaller
than that of any other internal vertex in μ. Then madj(u_k) includes two nonadjacent
vertices: namely, the two neighboring vertices of u_k in μ. It follows that α is not a
PEO. □

THEOREM 2.5 (TARJAN [44], TARJAN AND YANNAKAKIS [46]). Every maxi-
mum cardinality search ordering of a chordal graph G is a perfect elimination ordering.

Proof. Let α be any ordering of a chordal graph G that is not a PEO. We will
show that the ordering α cannot be generated by the MCS algorithm.

By Lemma 2.4, for some vertex u_0 there exists a chordless path μ = [u_0, u_1, ...,
u_{r−1}, u_r] of length r ≥ 2 from u_0 = α^{−1}(i) to u_r ∈ L_{i+1} through vertices
u_j ∈ V − L_i, 1 ≤ j ≤ r − 1. (See Figure 4.) Choose u_0 so that the label i = α(u_0)
is maximum among all the vertices of G for which such a chordless path exists.

To show that α is not an MCS ordering it suffices to show that there exists some
vertex w ∈ V − L_{i+1} for which |adj(w) ∩ L_{i+1}| exceeds |adj(u_0) ∩ L_{i+1}|. We will
show that the vertex u_{r−1} ∈ μ is indeed such a vertex. Note that adj(u_0) ∩ L_{i+1} and
madj(u_0) are by definition identical, and thus it suffices to show that

(1)    madj(u_0) ∪ {u_r} ⊆ adj(u_{r−1}) ∩ L_{i+1}.

For the trivial case madj(u_0) = ∅, the theorem holds since u_{r−1} is adjacent to
u_r ∈ L_{i+1}. Assume instead that madj(u_0) ≠ ∅, and choose a vertex x ∈ madj(u_0). To
see that x is also adjacent to u_{r−1}, consider the path γ = [x, u_0, ..., u_{r−1}, u_r] pictured
in Figure 4. The maximality of i implies that every path of length greater than one
having the following two properties will have a chord: a) the endpoints of the path are
both numbered greater than i, and b) the interior vertices are numbered less than the
minimum of the endpoints. The path γ satisfies these two properties and hence has a
chord. Moreover, since μ = [u_0, u_1, ..., u_r] has no chords, every chord of γ is incident
with x. Let u_k be the vertex in γ adjacent to x which has the largest subscript. If
k ≠ r then [x, u_k, ..., u_r] is a chordless path, again contrary to the maximality of i;
hence (x, u_r) ∈ E.
It follows that σ = [x, u_0, ..., u_{r−1}, u_r, x] is a cycle of length greater than three in G
(recall that r ≥ 2). Since G is chordal, σ must have a chord, and, as argued above,
any such chord must be incident with x. Let u_t be the vertex in σ with the highest
subscript other than r, for which (x, u_t) ∈ E. If t ≠ r − 1, then [x, u_t, ..., u_r, x]
is a chordless cycle of length greater than three, contrary to the chordality of G. In
consequence, (x, u_{r−1}) ∈ E for all x ∈ madj(u_0). But u_{r−1} is also adjacent to u_r ∈
L_{i+1} − madj(u_0), whence (1) holds, completing the proof. □

FIG. 4. Illustration for the proof of Theorem 2.5. The dark solid edges exist by hypothesis; existence
of the lighter broken edges is argued in the proof and the remark that follows it.

Remark. In the preceding proof the argument leading to the inclusion of (x, u_{r−1})
in E can be repeated for every edge (x, u_j), 1 ≤ j ≤ r − 2. In consequence we have

(2)    madj(u_0) ⊆ adj(u_j) ∩ L_{i+1}    for 1 ≤ j ≤ r − 2.

Statement (1) implies that if the MCS algorithm "tried" to generate α, then as the
vertex to be labeled with i is chosen, the priority of u_{r−1} would be greater than that
of u_0. Similarly, (2) implies that the priority of each vertex u_j (1 ≤ j ≤ r − 2) would
be at least as great as that of u_0.

3. Characterizations of clique trees. Let G = (V, E) be any graph. A clique
of G is any maximal set of vertices that is complete in G, and thus a clique is properly
contained in no other clique. We will refer to a "submaximal clique" as complete in G,
as we did in the previous section. Henceforth K_G = {K_1, K_2, ..., K_m} denotes the
set containing the cliques of G, and m will be the number of cliques.

The reader may verify that the graph in Figure 5 is a chordal graph with four
cliques, each of size three. The graph in Figure 5 will be used throughout this section
to illustrate results and key points. For convenience we shall refer to the vertices of
this graph as v_1, v_2, ..., v_7; e.g., the vertex labeled "6" will be referred to as v_6. Note
that the labeling of the vertices is a PEO of the graph.

FIG. 5. Chordal graph with seven vertices and four cliques.

For any chordal graph G there exists a subset of the set of trees on K_G known as
clique trees. Any one of these clique trees can be used to represent the graph, often in
a very compact and efficient manner [24, 46], as we shall see in Section 4. This section
contains a unified and elementary presentation of several key properties of clique trees,
each of which has been shown, somewhere in the literature, to characterize the set of
clique trees associated with a chordal graph.

The notion of clique trees was introduced independently by Buneman [5], Gavril [12],
and Walter [47]. The property we use to introduce and define clique trees in Sec-
tion 3.1 is a simple variant of one of the key properties introduced in their work.
We use this variant because, in our experience, it is more readily apprehended by
those who are studying this material for the first time. Section 3.2 presents the short
argument needed to show that the more recent variant is equivalent to the original.

Clique trees have found application in relational databases, where they can be
viewed as a subclass of acyclic hypergraphs, which are heavily used in that area.
Open problems in relational database theory motivated the pioneering work of Bern-
stein and Goodman [2], Beeri, Fagin, Maier, and Yannakakis [1], and Tarjan and
Yannakakis [46]. Our two final characterizations of clique trees, presented in Sec-
tions 3.3 and 3.4, are based on results from these papers. Section 3.5 summarizes
these results, and also illustrates these results in negative form using the example in
Figure 5.

Throughout this section it will be convenient to assume that G is connected. All
the results can nevertheless be applied to a disconnected graph by applying them
successively to each connected component; thus no loss of generality is incurred by
the restriction. Note also that Sections 3.2, 3.3, and 3.4 can be read independently
of one another, but any of these three subsections should be read only after reading
Section 3.1. As in the previous section, needed definitions and specific references to
the literature are given in the appropriate subsections.

3.1. Definition using the clique-intersection property. Assume that G is
a connected graph (not necessarily chordal), and consider its set of maximal cliques
K_G. In this section we consider the set of trees on K_G that satisfy the following
clique-intersection property:

For every pair of distinct cliques K, K′ ∈ K_G, the set K ∩ K′ is
contained in every clique on the path connecting K and K′ in the
tree.

As an example of a tree that satisfies the clique-intersection property, consider the
tree shown in Figure 6, whose vertices are the cliques of the chordal graph in Figure 5.
The reader may verify that this tree indeed satisfies the clique-intersection property:
for example, the set K_4 ∩ K_2 = {v_7} is contained in K_1, which is the only clique on the
path from K_4 to K_2 in the tree. The reader may also verify that the only other tree
on {K_1, K_2, K_3, K_4} that satisfies the clique-intersection property is obtained from
the tree in Figure 6 by replacing the edge (K_3, K_2) with (K_3, K_1).

FIG. 6. A tree on the cliques of the chordal graph in Figure 5, which satisfies the clique-intersection
property.
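The property can also be checked mechanically. In the sketch below (ours), a tree on the cliques is a dict mapping each clique, represented as a frozenset of vertices so that it can serve as a dict key, to the set of its tree neighbors; every pair of cliques is tested against every clique on the connecting path.

from itertools import combinations

def tree_path(T, K, K2):
    """The cliques on the unique path from K to K2 in the tree T."""
    def dfs(v, path, seen):
        path.append(v)
        if v == K2:
            return True
        seen.add(v)
        if any(dfs(w, path, seen) for w in T[v] - seen):
            return True
        path.pop()
        return False
    path = []
    dfs(K, path, set())
    return path

def has_clique_intersection_property(T):
    """Does the tree T on the cliques satisfy the clique-intersection property?"""
    return all(K & K2 <= M
               for K, K2 in combinations(T, 2)
               for M in tree_path(T, K, K2))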

We will show in Theorem 3.2 below that G is chordal if and only if there exists
a tree on K_G that satisfies the clique-intersection property. For any given chordal
graph G, we shall let T_G^ct denote the nonempty set of trees T = (K_G, E_T) that satisfy
the clique-intersection property, and we shall refer to any member of T_G^ct as a clique
tree of the underlying chordal graph G. In Section 3.2, we prove the original version
of this result, which was introduced independently by Buneman [5], Gavril [12], and
Walter [47].

To prove the main result of this subsection, we require two more definitions and a
simple lemma. A vertex K in a tree T is a leaf if it has precisely one neighbor in T (i.e.,
|adj_T(K)| = 1). We let K_G(v) ⊆ K_G denote the set of cliques containing the vertex v.

The following simple characterization of simplicial vertices has been useful in various
applications. This result has been used widely in the literature [8, 19, 23, 24, 46], and
has been formally stated and proven in at least two places [23, 24].
LEMMA 3.1. A vertex is simplicial if and only if it belongs to precisely one clique.
Proof. Suppose a vertex v belongs to two cliques K, K′ ∈ K_G. Maximality of the
cliques implies the existence of two distinct nonadjacent vertices u ∈ K − K′ and
u′ ∈ K′ − K. Since both u and u′ are adjacent to v, it follows that v is not simplicial.

Assume now that the vertex v belongs to one and only one clique K ∈ K_G. Note
that v is adjacent to a vertex u ≠ v if and only if there exists a clique of G to which
both u and v belong. Consequently adj(v) = K − {v}, whence v is simplicial. □
The first part of the following proof closely resembles the argument given by
Gavril [12] to prove a result that shall be presented in the next section. The second
half was improvised for this paper, and resembles the first half in many of its features.
THEOREM 3.2. A connected graph G is chordal if and only if there exists a tree
T = (K_G, E_T) for which the clique-intersection property holds.
Proof. We proceed by induction on the number of vertices n to show the "only if"
part. The base step n = 1 is obvious. For the induction step, let G be a chordal graph
with n ≥ 2 vertices and assume the result is true for all chordal graphs having fewer
than n vertices. By Lemma 2.2, G has a simplicial vertex, say v. Let K be the single
clique of G that contains v (see Lemma 3.1), and consider the induced subgraph
G′ = G \ {v}. Since G′ is a chordal graph with n − 1 vertices, by the induction
hypothesis there exists a tree T′ = (K_{G′}, E_{T′}) that satisfies the clique-intersection
property.
To complete the proof of the "only if" part, there are two cases to consider. First,
suppose K′ = K − {v} remains maximal in G′ (i.e., K′ ∈ K_{G′}). It is trivial to show
that K_{G′} = K_G ∪ {K′} − {K}, and we leave it for the reader to verify this. It follows
that the only difference between the cliques of G and G′ is the presence in G of the
simplicial vertex v in K and the absence of v from the corresponding clique K′ of
G′. In consequence, the intersection of any pair of cliques in G is identical to the
intersection of the corresponding pair in G′. Let T be the tree on K_G obtained from
T′ by replacing K′ with K. Since T′ has the clique-intersection property, it follows
that T has this property as well, thereby completing the argument for the first case.
Now, suppose S′ = K − {v} is not a maximal clique in G′ (i.e., S′ ∉ K_{G′}). Since
n ≥ 2 and G is connected, v is not an isolated vertex, and we have

S′ = K − {v} = adj(v) ≠ ∅.

Since S′ is complete in G′, there exists a clique P ∈ K_{G′} = K_G − {K} for which
S′ ⊂ P. (As before, we leave it for the reader to verify that K_{G′} = K_G − {K}.) Let
T be the tree on K_G obtained by adding the clique K and the edge (K, P) to T′. We
now verify that T satisfies the clique-intersection property. Because T′ satisfies the
clique-intersection property, the set K_1 ∩ K_2 is contained in every clique on the path
from K_1 to K_2 in T whenever neither K_1 nor K_2 is K. Consider now the set K ∩ K″,
where K″ ∈ K_G − {K} = K_{G′}. Since K − {v} ⊂ P and v belongs to no clique in
K_G − {K}, it follows that K″ ∩ K ⊂ P. Because T′ satisfies the clique-intersection

property, the set K ∩ K″ ⊆ P ∩ K″ is contained in every clique on the path from K
to K″ in T, and T therefore satisfies the clique-intersection property as well.
To prove the "if" part, let G = (V, E) be a graph and suppose there exists a tree
T = (K_G, E_T) that satisfies the clique-intersection property. Again we proceed by
induction on n to show that G is chordal. The base step n = 1 is obvious. For the
induction step, let G be a graph with n ≥ 2 vertices and assume the result is true for
all graphs having fewer than n vertices.
Let K and P be respectively a leaf of T and its sole neighbor (i.e., "parent")
in T. By maximality of the cliques there exists a vertex v ∈ K − P. The vertex v
moreover cannot belong to any clique K′ ∈ K_G − {K, P}, for were it otherwise the
clique P, which is on the path from K to K′ in T, would not contain the set K ∩ K′.
Consequently v belongs to no other clique but K, whence by Lemma 3.1 it is a
simplicial vertex of G.
Consider the reduced graph G′ = G \ {v} and let K′ = K − {v}. If K′ ⊄ P,
then the "reduced" tree T′ for G′ is obtained simply by replacing K with K′ in T; if
K′ ⊂ P, then T′ is obtained by removing from T the vertex K and the single edge
(K, P) incident with it in T. As before, in the first case, K_{G′} = K_G ∪ {K′} − {K}; in
the second case, K_{G′} = K_G − {K}. In either case, it is trivial to verify that the tree
T′ satisfies the clique-intersection property. From the induction hypothesis it follows
that G′ is chordal. Let β be any PEO of G′. A PEO of G can then be obtained by
ordering v first, followed by the remaining vertices of G in the order determined by β.
Thus by Theorem 2.3, G is also chordal, giving us the result. □

3.2. The induced-subtree property. In this section we are concerned with
the set of all trees on K_G that satisfy the induced-subtree property:

For every vertex v ∈ V, the set K_G(v) induces a subtree of T.

We shall let T_G^ist denote the set of all trees on K_G that satisfy the induced-subtree
property.
Consider again the clique tree in Figure 6. Observe that each of the sets K_G(v_3) =
{K_3} and K_G(v_5) = {K_1, K_2, K_3} induces a subtree of this tree. The reader may
verify that this tree satisfies the induced-subtree property. It is trivial to prove that
the clique-intersection and induced-subtree properties are indeed equivalent.
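In code, the equivalent test (again our own sketch, with the same frozenset-keyed tree representation used in Section 3.1) collects the cliques containing each vertex and checks that they induce a connected subgraph of the tree.

def has_induced_subtree_property(T):
    """For every vertex v, does K_G(v) induce a subtree of T?"""
    for v in set().union(*T):
        Kv = {K for K in T if v in K}        # the set K_G(v)
        start = next(iter(Kv))               # grow one component of T(Kv)
        frontier, seen = [start], {start}
        while frontier:
            K = frontier.pop()
            for K2 in (T[K] & Kv) - seen:
                seen.add(K2)
                frontier.append(K2)
        if seen != Kv:                       # disconnected, so not a subtree
            return False
    return True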
THEOREM 3.3. For any connected graph G, we have T_G^ist = T_G^ct.
Proof. To see that T_G^ct ⊆ T_G^ist, let T_ct ∈ T_G^ct and consider the set of cliques
K_G(v) for some vertex v ∈ V. Choose two cliques K, K′ ∈ K_G(v). Since the set
K ∩ K′ lies in every clique on the path joining K and K′ in T_ct, it follows that the
vertex v ∈ K ∩ K′ also lies in each clique along this path. In consequence, the induced
subgraph T_ct(K_G(v)) is connected and hence a subtree of T_ct. It follows that T_ct ∈ T_G^ist,
whence T_G^ct ⊆ T_G^ist, as desired.
To see that T_G^ist ⊆ T_G^ct, let T_ist ∈ T_G^ist. Choose two cliques K, K′ ∈ K_G, and
consider the set K ∩ K′. For each vertex v ∈ K ∩ K′, the set K_G(v) induces a subtree
of T_ist (i.e., a connected subgraph of T_ist); and thus the vertex v lies in each clique
along the path joining K and K′ in T_ist. It follows that T_ist ∈ T_G^ct, whence T_G^ist ⊆ T_G^ct,
as desired. □
We thus have the following well known result from the literature.

THEOREM 3.4 (BUNEMAN [5], GAVRIL [12], WALTER [47]). A connected graph
G is chordal if and only if there exists a tree T = (K_G, E_T) for which the induced-
subtree property holds.

Proof. The result follows immediately from Theorems 3.2 and 3.3. □

3.3. The running intersection property. A total ordering of the cliques in
K_G, say K_1, K_2, ..., K_m, has the running intersection property (RIP) if for each
clique K_j, 2 ≤ j ≤ m, there exists a clique K_i, 1 ≤ i ≤ j − 1, such that

(3)    K_j ∩ (K_1 ∪ K_2 ∪ ⋯ ∪ K_{j−1}) ⊂ K_i.

For any RIP ordering of the cliques, we construct a tree T_rip on K_G by making each
clique K_j adjacent to a "parent" clique K_i identified by (3). (Since more than one
clique K_i, 1 ≤ i ≤ j − 1, may satisfy (3), the parent may not be uniquely determined.)
We let T_G^rip be the set containing every tree on K_G that can be constructed from an
RIP ordering in this manner. We define a reverse topological ordering of any rooted
tree as an ordering that numbers each parent before any of its children. Finally, note
that any RIP ordering is a reverse topological ordering of a rooted tree constructed
from the ordering in the manner specified above.
The ordering K_1, K_2, K_3, K_4 of the cliques shown in Figure 5 is an RIP ordering;
a corresponding RIP-induced parent function is displayed in Figure 7. Note that the
parent function specifies precisely the edges of the clique tree in Figure 6. Indeed, we
can show that for any connected graph G, we have T_G^rip = T_G^ct.

FIG. 7. The clique tree in Figure 6 is an RIP tree. Arrows point from child to parent.
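A sketch (ours) of the construction just described: scan the cliques in the given order, attach each K_j to the first earlier clique that properly contains the running intersection in (3), and report failure when no such parent exists, i.e., when the ordering is not an RIP ordering.

def tree_from_rip_ordering(cliques):
    """cliques: K_1, ..., K_m as a list of frozensets in a candidate RIP order.
    Returns a parent-index list (parent[0] is None), or None if (3) fails."""
    parent = [None] * len(cliques)
    union_so_far = set(cliques[0])           # K_1 ∪ ... ∪ K_{j-1}
    for j in range(1, len(cliques)):
        running = cliques[j] & union_so_far  # left-hand side of (3)
        candidates = [i for i in range(j) if running < cliques[i]]
        if not candidates:
            return None                      # the ordering is not RIP
        parent[j] = candidates[0]            # any candidate may serve as parent
        union_so_far |= cliques[j]
    return parent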

THEOREM 3.5 (BEERI, FAGIN, MAIER, YANNAKAKIS [1]). For any connected
graph G, we have T_G^rip = T_G^ct.

Proof. We first show that T_G^ct ⊆ T_G^rip. Let T_ct ∈ T_G^ct; choose R ∈ K_G; and
root T_ct at R. Consider any reverse topological ordering R = K_1, K_2, ..., K_m of
the rooted tree T_ct. For any clique K_j, 2 ≤ j ≤ m, let K_p be its parent clique in
the rooted tree (whence 1 ≤ p ≤ j − 1). Now, for 1 ≤ i ≤ j − 1, the clique K_i
cannot be a descendant of K_j, hence K_p is on the path in T_ct connecting K_i and
K_j. The clique-intersection property implies that K_j ∩ K_i ⊆ K_p. This implies that
K_j ∩ (K_1 ∪ K_2 ∪ ⋯ ∪ K_{j−1}) ⊆ K_p; furthermore, K_p cannot be a subset of K_j by
maximality, so the containment is proper. Thus, T_ct ∈ T_G^rip, and we have T_G^ct ⊆ T_G^rip.

To see that T_G^rip ⊆ T_G^ct, consider a tree T = (K_G, E) ∉ T_G^ct. We will show that
T ∉ T_G^rip.
Since T ∉ T_G^ct, there exists then a pair of distinct cliques K, K′ ∈ K_G such
that the set K ∩ K′ is not contained in at least one clique on the path connecting K
and K′ in the tree. Choose two such cliques K, K′ ∈ K_G that minimize the length of
the path from K to K′ in T. The key observation on which our argument depends
is that the set K ∩ K′ belongs to no clique on the path connecting K and K′ in the
tree, except K and K′. Let K_1, K_2, ..., K_m be any reverse topological ordering of
T for arbitrary root K_1 ∈ K_G. It suffices to show that (3) does not hold for some
parent-child pair in T.

Consider the path μ = [K = K_{i_0}, K_{i_1}, ..., K_{i_s} = K′] in T. Let K_{i_t} be the clique
with lowest index among the cliques in μ, and without loss of generality assume that
i_0 > i_s. Since under the given reverse topological ordering K_{i_0} is a proper descendant
of K_{i_t} ∈ μ, the clique K_{i_1} is necessarily the parent of K_{i_0} in the rooted tree, and hence
i_0 > i_1. Our choice of K (= K_{i_0}) and K′ (= K_{i_s}) implies that (a) s ≥ 2, and (b)
K_{i_0} ∩ K_{i_s} ⊄ K_{i_r} for each r, 1 ≤ r ≤ s − 1. In consequence, we have K_{i_0} ∩ K_{i_s} ⊄ K_{i_1},
whence (3) does not hold for the parent-child pair K_{i_1} and K_{i_0}, which completes the
proof. □

Remark. In the preceding proof, the argument that T_G^ct ⊆ T_G^rip verifies that any
reverse topological ordering of a clique tree T_ct ∈ T_G^ct is an RIP ordering of the cliques.

3.4. The maximum-weight spanning tree property. Associated with each
chordal graph G is a weighted clique intersection graph, W_G, defined as follows. The
vertex set of W_G is the set of cliques K_G. Two distinct cliques K, K′ ∈ K_G are
connected by an edge if and only if their intersection is nonempty; moreover, each
such edge (K, K′) is assigned a positive weight given by |K ∩ K′|. We let T_G^mst be
the set containing every maximum-weight spanning tree (MST) of W_G.
Figure 8 shows W_G for the chordal graph in Figure 5, and highlights the edges of
the clique tree in Figure 6. Observe that the highlighted clique tree is a maximum-
weight spanning tree of W_G, with edge weights that sum to five. Bernstein and
Goodman [2] first showed that for any chordal graph G, we have T_G^mst = T_G^ct. Our
proof of this result is similar to that given by Gavril [13].
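Before the formal argument, here is a sketch (ours) that grows a maximum-weight spanning tree of W_G directly with Prim's algorithm, always adding the heaviest edge |K ∩ K′| that crosses from the current tree to a new clique; Section 4.2 shows that MCS can be viewed as this computation in disguise.

def max_weight_spanning_tree(cliques):
    """Prim's algorithm on the weighted clique intersection graph W_G.
    cliques: list of frozensets; assumes W_G is connected (G connected).
    Returns the tree as a list of index pairs (i, j)."""
    in_tree, edges = {0}, []
    while len(in_tree) < len(cliques):
        i, j = max(((i, j) for i in in_tree
                    for j in range(len(cliques))
                    if j not in in_tree and cliques[i] & cliques[j]),
                   key=lambda e: len(cliques[e[0]] & cliques[e[1]]))
        edges.append((i, j))    # heaviest crossing edge of W_G
        in_tree.add(j)
    return edges

By Theorem 3.6 below, when the cliques come from a connected chordal graph the tree returned is always a clique tree.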

Our argument requires two ideas commonly used in the study of maximum-weight
(minimum-weight) spanning tree algorithms. First, let T = (K_G, E_T) be a spanning
tree of W_G. It is well known that T is a maximum-weight spanning tree if and
only if for every pair of cliques K, K′ ∈ K_G for which (K, K′) ∉ E_T, the weight of
every edge on the path joining K and K′ in T is no smaller than |K ∩ K′| (see, for
example, Tarjan [45, pp. 71-72]). Second, given an edge (K, K′) in a tree, we define
the fundamental cut set (see Gibbons [18, p. 58]) associated with the edge as follows.
The removal of (K, K′) from the tree partitions the vertices of T into precisely two
sets, say 𝒦_1 and 𝒦_2. The fundamental cut set associated with (K, K′) consists of
every edge with one vertex in 𝒦_1 and the other in 𝒦_2, including (K, K′) itself.

FIG. 8. Weighted clique intersection graph for the graph in Figure 5. Bold edges belong to the clique
tree in Figure 6. Also shown are the intersection sets upon which the weights are based.

THEOREM 3.6 (BERNSTEIN AND GOODMAN [2]). For any connected chordal
graph G, T_G^mst = T_G^ct.
Proof. We first show that T_G^ct ⊆ T_G^mst. Let T_ct ∈ T_G^ct and choose two cliques K
and K′ that are not connected by an edge in T_ct. Consider the cycle formed by adding
the edge (K, K′) to T_ct. By Theorem 3.2 every edge along this cycle has weight no
smaller than |K ∩ K′|, whence T_ct is a maximum-weight spanning tree of W_G.
To see that T_G^mst ⊆ T_G^ct, choose T_mst ∈ T_G^mst. By Theorem 3.2, T_G^ct ≠ ∅. Choose
T_ct ∈ T_G^ct that has a maximum number of edges in common with T_mst. Assume for
the purpose of contradiction that there is an edge (K_1, K_2) of T_mst that is not an edge
of T_ct. Consider the fundamental cut set (in W_G) associated with the edge (K_1, K_2)
of T_mst and also the cycle (in T_ct) obtained by adding the edge (K_1, K_2) to T_ct. Any
cycle containing one edge from the cut set must contain another edge from the cut
set as well. Select from the cycle in T_ct one of the edges (K_3, K_4) ≠ (K_1, K_2) that
belongs to the cut set.

Note that the edge (K_3, K_4) is an edge of T_ct, but it is not an edge of T_mst. Since
T_ct is a clique tree, it follows from Theorem 3.2 that K_1 ∩ K_2 ⊆ K_3 ∩ K_4. However, if
K_1 ∩ K_2 were a proper subset of K_3 ∩ K_4, then replacing (K_1, K_2) in T_mst with (K_3, K_4)
would result in a spanning tree of greater weight, contrary to the maximality of T_mst's
weight. Hence, K_1 ∩ K_2 = K_3 ∩ K_4. Consider the tree obtained by replacing (K_3, K_4)
in T_ct with the edge (K_1, K_2). The reader can easily verify that the resulting tree is
a clique tree. The new clique tree moreover has one more edge in common with T_mst
than originally possessed by T_ct, giving us the contradiction we seek. Consequently,
T_mst = T_ct, and the result holds. □

3.5. Summary. The following corollary summarizes the results presented in this section.

COROLLARY 3.7. For every connected graph G, we have 𝒯_G^ist = 𝒯_G^rip. Furthermore, G is chordal if and only if this set is nonempty, in which case we have

𝒯_G^ct = 𝒯_G^ist = 𝒯_G^rip = 𝒯_G^mst.

Based on Corollary 3.7, we henceforth drop the superscripts from our notation and shall use 𝒯_G to denote the set of clique trees of G. Finally, Figure 9 illustrates Corollary 3.7 in negative form.

FIG. 9. Not a clique tree of the graph in Figure 5.

We now verify that the tree displayed in this figure indeed satisfies none of the characterizations of a clique tree:


[CT] The set K_1 ∩ K_2 is not contained in K_4.
[IST] 𝒦_G(v_5) does not induce a subtree.
[RIP] The reverse topological ordering K_3, K_2, K_4, K_1 is not an RIP ordering: K_1 ∩ (K_4 ∪ K_2 ∪ K_3) = K_1, which is, of course, contained in no other clique. It follows then from the remark after Theorem 3.5 that the tree is not an RIP tree.
[MST] The weight of the tree, which is four, is submaximal by one.

4. Clique trees, separators, and MCS revisited. This section ties together some of the results and concepts presented separately in Sections 2 and 3. Section 4.1 presents results that link the edges in a clique tree with the minimal vertex separators of the underlying chordal graph. Section 4.2 presents an efficient algorithm for computing a clique tree. This algorithm, which is a simple extension of the MCS algorithm, is shown to be an implementation of Prim's algorithm for finding a maximum-weight spanning tree of the weighted clique intersection graph W_G. New definitions and notation will be introduced as needed, and appropriate references to the literature will be given in each subsection. As in the previous section, we assume without loss of generality that G is connected.

4.1. Clique tree edges and minimal vertex separators. Choose a clique tree T ∈ 𝒯_G and let S = K_i ∩ K_j for some edge (K_i, K_j) ∈ E_T. Let T_i = (𝒦_i, E_i) and T_j = (𝒦_j, E_j) denote the two subtrees obtained by removing the edge (K_i, K_j) from T, with K_i ∈ 𝒦_i and K_j ∈ 𝒦_j. We also define vertex sets V_i ⊂ V and V_j ⊂ V by

V_i := ( ⋃_{K ∈ 𝒦_i} K ) − S   and   V_j := ( ⋃_{K ∈ 𝒦_j} K ) − S.

We first prove two technical lemmas, the second of which shows that the set S = K_i ∩ K_j separates V_i from V_j in G. These two results are then used in the proof of Theorem 4.3 to show that for any clique tree T ∈ 𝒯_G the set S' ⊂ V is a minimal vertex separator if and only if S' = K ∩ K' for some edge (K, K') ∈ E_T. The results in this section have appeared in both Ho and Lee [21] and Lundquist [33]. The proofs of Lemma 4.2 and Theorem 4.3 are similar to arguments given by Lundquist [33].

LEMMA 4.1. The sets V_i, V_j, and S form a partition of V.

Proof. Let T, S, K_i, K_j, 𝒦_i, 𝒦_j, V_i, and V_j be as defined in the first paragraph of the subsection. Clearly, V = V_i ∪ V_j ∪ S, and S is disjoint from both V_i and V_j. Hence it suffices to show that V_i ∩ V_j = ∅. By way of contradiction assume that there exists a vertex v ∈ V_i ∩ V_j. It follows that v belongs to some clique K ∈ 𝒦_i and also belongs to some clique K' ∈ 𝒦_j. Since T ∈ 𝒯_G, the vertex v belongs to every clique along the path joining K and K' in T, which necessarily includes both K_i and K_j. In consequence, v ∈ S = K_i ∩ K_j, which is impossible since both V_i and V_j are disjoint from S, whence the result follows. □

LEMMA 4.2. If S = K_i ∩ K_j and (K_i, K_j) ∈ E_T for some T ∈ 𝒯_G, then S is a vw-separator for every pair of vertices v ∈ V_i and w ∈ V_j.

Proof. Again let T, S, K_i, K_j, 𝒦_i, 𝒦_j, V_i, and V_j be as defined in the first paragraph of the subsection. To prove the result it suffices to show that there exists no edge (v, w) ∈ E_G with v ∈ V_i and w ∈ V_j. Now, if (v, w) ∈ E_G, then there exists a clique K ∈ 𝒦_G for which v, w ∈ K. If K ∈ 𝒦_i then clearly v, w ∈ S ∪ V_i. Moreover, since by Lemma 4.1 the sets V_i, V_j, and S form a partition of V, it follows that neither v nor w belongs to V_j. Likewise, if K ∈ 𝒦_j then v, w ∈ S ∪ V_j, and neither v nor w belongs to V_i. In consequence, no edge in E_G joins two vertices v ∈ V_i and w ∈ V_j, which concludes the proof. □

THEOREM 4.3. Let T ∈ 𝒯_G. The set S ⊂ V is a minimal vertex separator of G if and only if S = K ∩ K' for some edge (K, K') ∈ E_T.

Proof. For the "if" part let T ∈ 𝒯_G, and let S = K ∩ K' for some edge (K, K') ∈ E_T. Consider two vertices v ∈ K − S and w ∈ K' − S. By Lemma 4.2, S is a vw-separator. Moreover, since both v and w are adjacent to every vertex in S, it follows that S is a minimal vw-separator, as desired.

To prove the "only if" part, choose T ∈ 𝒯_G and let S be a minimal vw-separator of G. Since (v, w) ∉ E, the sets 𝒦_G(v) and 𝒦_G(w) induce disjoint subtrees of T. Choose K ∈ 𝒦_G(v) and K' ∈ 𝒦_G(w) to minimize the distance in T between K and K'. Consider the path μ = [K = K_0, K_1, ..., K_{r−1}, K_r = K'] in T, where r ≥ 1.

Define S_i := K_i ∩ K_{i+1} for 0 ≤ i ≤ r − 1, and let 𝒮 := {S_0, S_1, ..., S_{r−1}}. We will show that S ∈ 𝒮, which suffices to prove the result.

First, to see that S_i ⊆ S for at least one set S_i ∈ 𝒮, suppose (for the purpose of contradiction) that S_i ⊄ S for every S_i ∈ 𝒮, and choose x_i ∈ S_i − S for each member of 𝒮. Since x_i ∈ K_i ∩ K_{i+1} (0 ≤ i ≤ r − 1), we have a path [v, x_0, x_1, ..., x_{r−1}, w] joining v and w in G \ S, contrary to our assumption that S is a vw-separator. It follows that S_i ⊆ S for at least one set S_i ∈ 𝒮.

Now select S_i ∈ 𝒮 for which S_i ⊆ S, and consider the two subtrees obtained by removing the edge (K_i, K_{i+1}) from T. Let T_v be the subtree containing K_0 ∋ v, and let T_w be the subtree containing K_r ∋ w. Since S_i is contained in the vw-separator S, we clearly have v, w ∉ S_i. Hence, by Lemma 4.2, S_i is a vw-separator. Since S is moreover a minimal vw-separator, we have S = S_i = K_i ∩ K_{i+1} where (K_i, K_{i+1}) ∈ E_T, as required. □
For a clique tree T = (𝒦_G, E_T) ∈ 𝒯_G, consider the set containing every distinct set K ∩ K' where (K, K') ∈ E_T. It follows immediately from Theorem 4.3 that this set is the same for every clique tree T ∈ 𝒯_G. In light of Theorem 4.3, we shall refer to the members of this invariant set as separators. For any clique tree T = (𝒦_G, E_T) ∈ 𝒯_G consider the multiset of separators defined by

ℳ_T := {K ∩ K' | (K, K') ∈ E_T}.

That this multiset is the same for all clique trees T ∈ 𝒯_G is an immediate consequence of a result by Ho and Lee [21]; the result was also proven by Lundquist [33]. The proof is taken directly from Blair and Peyton [4].
THEOREM 4.4 (HO AND LEE [21], LUNDQUIST [33]). The multiset of separators is the same for every clique tree T ∈ 𝒯_G.

Proof. For the purpose of contradiction, suppose there exist two distinct clique trees T, T' ∈ 𝒯_G for which ℳ_T ≠ ℳ_{T'}. From among the clique trees T' ∈ 𝒯_G for which ℳ_{T'} ≠ ℳ_T, choose T' so that it shares as many edges as possible with T. (Note that T and T' cannot share the same edge set, for then they also would share the same multiset of separators.)

Let (K_1, K_2) be an edge of T that does not belong to T'. As in the proof of Theorem 3.6, consider the fundamental cut set (in W_G) associated with the edge (K_1, K_2) of T and also the cycle (in T') obtained by adding the edge (K_1, K_2) to T'. Recall that any cycle containing one edge from the cut set must contain another edge from the cut set as well. Select from the cycle in T' one of the edges (K_3, K_4) ≠ (K_1, K_2) that belongs to the cut set. Note that the edge (K_3, K_4) is an edge of T' but not an edge of T.

Since T ∈ 𝒯_G, it follows by Theorem 3.2 that K_3 ∩ K_4 ⊆ K_1 ∩ K_2; similarly, since T' ∈ 𝒯_G, it follows by Theorem 3.2 that K_1 ∩ K_2 ⊆ K_3 ∩ K_4; hence K_3 ∩ K_4 = K_1 ∩ K_2. By Theorem 3.6, the replacement of (K_3, K_4) in T' with (K_1, K_2) results in a clique tree, which, moreover, clearly has the same multiset of separators that T' has. Contrary to our assumption about T', the modified tree shares one more edge with T, and thus the result follows. □
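Theorem 4.4 is also easy to check computationally. The following minimal Python sketch (our notation, not the authors': cliques as Python sets, clique tree edges as index pairs) computes ℳ_T; by the theorem, it returns the same Counter for every clique tree of a given chordal graph.

from collections import Counter

def separator_multiset(cliques, tree_edges):
    # M_T := multiset of K & K' over the clique tree edges (K, K').
    return Counter(frozenset(cliques[i] & cliques[j]) for i, j in tree_edges)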

4.2. MCS and Prim's algorithm. Prim's algorithm [39] is an efficient method for computing a maximum-weight (minimum-weight) spanning tree of a weighted graph. Thus, by Theorem 3.6, Prim's algorithm applied to the weighted clique intersection graph W_G computes a clique tree T ∈ 𝒯_G. At any point the algorithm has constructed a subtree of the eventual maximum-weight spanning tree T, and at each step it adds one more clique and edge to this subtree. Let 𝒦 ⊂ 𝒦_G be the cliques in the subtree constructed thus far. As the next edge to be added, the algorithm chooses the heaviest edge that joins 𝒦 to 𝒦_G − 𝒦. For a proof that Prim's algorithm correctly computes a maximum-weight spanning tree, we refer the reader to Tarjan [45, pp. 73-75] or Gibbons [18, pp. 40-42]. A version of Prim's algorithm formulated specifically for our problem is given in Figure 10.

E_T ← ∅;
Choose K ∈ 𝒦_G;
𝒦 ← {K};
for r ← 2 to m do
    Choose cliques K ∈ 𝒦 and K' ∈ 𝒦_G − 𝒦
        for which |K ∩ K'| is maximum;
    E_T ← E_T ∪ {(K, K')};
    𝒦 ← 𝒦 ∪ {K'};
end for

FIG. 10. Prim's algorithm for finding a maximum-weight spanning tree of the weighted clique intersection graph W_G.
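For concreteness, here is a minimal Python rendering of the procedure in Figure 10 (a sketch under our own data-structure choices, assuming the underlying chordal graph is connected so that W_G is connected):

def prim_clique_tree(cliques):
    # Grow a maximum-weight spanning tree of W_G one clique at a time.
    # `cliques` is a list of sets (the maximal cliques K_1, ..., K_m);
    # returns the chosen tree edges as pairs of clique indices.
    m = len(cliques)
    in_tree = {0}                      # start from an arbitrary clique
    tree_edges = []
    for _ in range(m - 1):
        # heaviest edge joining the subtree to the remaining cliques
        i, j = max(((a, b) for a in in_tree for b in range(m)
                    if b not in in_tree),
                   key=lambda e: len(cliques[e[0]] & cliques[e[1]]))
        tree_edges.append((i, j))
        in_tree.add(j)
    return tree_edges

The quadratic search above keeps the correspondence with Figure 10 transparent; the point of this section is the search order, not efficiency.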

In this section we will show that the MCS algorithm applied to a chordal graph G can be viewed as an implementation of Prim's algorithm applied to W_G. In Section 4.2.1 we show that since the MCS algorithm generates a PEO, it can easily detect the cliques in 𝒦_G during the course of the computation. Section 4.2.2 shows that 1) the MCS algorithm can be viewed as a block algorithm that "searches" the cliques in 𝒦_G one after the other, and 2) the order in which the cliques are searched is precisely the order in which the cliques are searched by Prim's algorithm in Figure 10. Using the results in Sections 4.2.1 and 4.2.2, we also show how to supplement the MCS algorithm with a few additional statements so that it detects the cliques and a set of clique tree edges as it generates a PEO. A detailed statement of this algorithm appears at the end of Section 4.2.2.

The close connection between the MCS algorithm and Prim's algorithm was, to our knowledge, first presented by Blair, England, and Thomason [3]. Several of the proofs in this section are similar to arguments given by Lewis et al. [24]. Though the techniques discussed in this section can be implemented to run quite efficiently, there are more efficient ways to compute a clique tree when certain data structures that arise in sparse matrix computations are available. The reader should consult Lewis et al. [24] for details on how to compute a clique tree in the course of solving a sparse positive definite linear system.

4.2.1. Detecting the cliques. In this subsection we show that the MCS algorithm can easily and efficiently detect the cliques in 𝒦_G. To do so we exploit the fact that MCS computes a PEO. We shall use the following result from Fulkerson and Gross [10].

LEMMA 4.5 (FULKERSON AND GROSS [10]). Let v_1, v_2, ..., v_n be a perfect elimination ordering of G. The set of maximal cliques 𝒦_G contains precisely the sets {v_i} ∪ madj(v_i) for which there exists no vertex v_j, j < i, such that

(4) {v_i} ∪ madj(v_i) ⊂ {v_j} ∪ madj(v_j).

Proof. Choose K ∈ 𝒦_G and let v_i ∈ K be the vertex whose label i assigned by the PEO is lowest among the labels assigned to a vertex of K. Consider the vertex set {v_i} ∪ madj(v_i). Since K consists of v_i and neighbors of v_i with labels larger than i, clearly K ⊆ {v_i} ∪ madj(v_i). Because the ordering is a PEO, the set {v_i} ∪ madj(v_i) must be complete in G. Thus by maximality of the clique K we have K = {v_i} ∪ madj(v_i), and moreover it follows that (4) holds for no vertex v_j, j < i.

Now, let K = {v_i} ∪ madj(v_i) and suppose that (4) holds for no vertex v_j, j < i. Since the ordering is a PEO, clearly K is complete in G. If K is submaximal, then there exists a vertex v_j ∈ V − K that is adjacent to every vertex of K. But the existence of such a vertex v_j is impossible: if j > i then v_j ∈ madj(v_i), contrary to v_j ∈ V − K; if j < i then (4) holds for v_j, contrary to our assumption. In consequence, no such vertex v_j exists, and the result follows. □
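Lemma 4.5 translates directly into a procedure for listing the maximal cliques from a PEO. The following Python sketch (an illustration under our own naming; `adj` maps each vertex to its neighbor set, and `peo` lists the vertices so that peo[0] is v_1) applies test (4) by brute force:

def maximal_cliques_from_peo(adj, peo):
    label = {v: i for i, v in enumerate(peo)}      # position in the PEO
    # candidate sets {v} U madj(v), in PEO order
    cands = [{v} | {u for u in adj[v] if label[u] > label[v]} for v in peo]
    cliques = []
    for i, K in enumerate(cands):
        # test (4): reject K if it is properly contained in the
        # candidate set of an earlier (lower-labeled) vertex
        if not any(K < cands[j] for j in range(i)):
            cliques.append(K)
    return cliques

Lemma 4.7 below turns this into a test that MCS can apply on the fly, avoiding the explicit subset comparisons.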
Throughout the remainder of the paper we let v_1, v_2, ..., v_n be a PEO obtained by applying the MCS algorithm to a connected chordal graph G. We shall call v_{i_r} the representative vertex of K_r whenever K_r = {v_{i_r}} ∪ madj(v_{i_r}); that is, we let v_{i_1}, v_{i_2}, ..., v_{i_m} be the representative vertices of the cliques K_1, K_2, ..., K_m, respectively, where i_1 > i_2 > ... > i_m. Thus the ordering K_1, K_2, ..., K_m specifies the order in which the cliques are searched by the MCS algorithm.

As the MCS algorithm generates a PEO it can easily detect the representative vertices and hence can easily collect the cliques in 𝒦_G. Condition 2 in the next lemma provides a test for determining when a vertex in an MCS ordering is not a representative vertex. Lemma 4.7 then provides a simple test for detecting the representative vertices.

LEMMA 4.6. Let v_1, v_2, ..., v_n be a perfect elimination ordering obtained by applying the maximum cardinality search algorithm to a connected chordal graph G. For each vertex label i, 1 ≤ i ≤ n − 1, the following are equivalent:

1. {v_{i+1}} ∪ madj(v_{i+1}) ∉ 𝒦_G.
2. |adj(v_i) ∩ C_{i+1}| = |adj(v_{i+1}) ∩ C_{i+2}| + 1.
3. {v_i} ∪ madj(v_i) = {v_i, v_{i+1}} ∪ madj(v_{i+1}).

Proof. First we state two inequalities that prove useful here and in later proofs. Note that the maximum cardinality selection criterion ensures that the following inequality holds true when v_{i+1} (1 ≤ i ≤ n − 1) is selected to be labeled:

(5) |adj(v_{i+1}) ∩ C_{i+2}| ≥ |adj(v_i) ∩ C_{i+2}|.

Equation (5), along with the fact that C_{i+1} = C_{i+2} ∪ {v_{i+1}}, gives us

(6) |adj(v_i) ∩ C_{i+1}| ≤ |adj(v_{i+1}) ∩ C_{i+2}| + 1.

Assume that the first condition in the statement of the lemma holds for v_{i+1}, and consider the vertex v_i selected by the MCS algorithm at the next step. When the algorithm selects v_i there exists (by Lemma 4.5) a vertex u ∈ V − C_{i+1} that is adjacent to every vertex in {v_{i+1}} ∪ madj(v_{i+1}). In light of (6), the existence of such a vertex u ensures that the vertex v_i chosen by the MCS algorithm (perhaps v_i = u) satisfies the second condition.

Assume now that the second condition in the statement of the lemma holds for the two vertices v_i and v_{i+1}. It immediately follows that

|madj(v_i)| = |{v_{i+1}} ∪ madj(v_{i+1})|.

Consequently, to prove that the third condition holds true it suffices to show that madj(v_i) ⊆ {v_{i+1}} ∪ madj(v_{i+1}). Now if it were the case that v_{i+1} ∉ adj(v_i), then from (5) and the fact that C_{i+1} = C_{i+2} ∪ {v_{i+1}} we would have

|adj(v_i) ∩ C_{i+1}| ≤ |adj(v_{i+1}) ∩ C_{i+2}|,

contrary to our assumption that condition 2 holds true. It follows then that v_{i+1} is adjacent to v_i in G. Now choose v_k ∈ madj(v_i) − {v_{i+1}}. Clearly k ≥ i + 2; moreover, since {v_i} ∪ madj(v_i) is complete in G, v_k is necessarily adjacent to v_{i+1} ∈ madj(v_i); whence v_k ∈ madj(v_{i+1}), giving us condition 3.

Finally, by Lemma 4.5 the first condition follows immediately from the third, which completes the proof. □
Further extending the result in Lemma 4.6, we obtain the following technique for detecting the representative vertices of 𝒦_G while generating the MCS ordering.

LEMMA 4.7. Let v_1, v_2, ..., v_n be a perfect elimination ordering obtained by applying the maximum cardinality search algorithm to a connected chordal graph G. Then 𝒦_G contains precisely the following sets: {v_1} ∪ madj(v_1), and the sets {v_{i+1}} ∪ madj(v_{i+1}), 1 ≤ i ≤ n − 1, for which

(7) |adj(v_i) ∩ C_{i+1}| ≤ |adj(v_{i+1}) ∩ C_{i+2}|.

Proof. From Lemma 4.5 it follows that {v_1} ∪ madj(v_1) ∈ 𝒦_G. Consider the set {v_{i+1}} ∪ madj(v_{i+1}) where 1 ≤ i ≤ n − 1. It follows from (6) and the equivalence of conditions 1 and 2 in Lemma 4.6 that {v_{i+1}} ∪ madj(v_{i+1}) is a member of 𝒦_G if and only if (7) holds. This concludes the proof. □

4.2.2. MCS as a block algorithm. Clearly, the MCS algorithm can detect the cliques in 𝒦_G by determining at each step whether or not (7) holds. With the next lemma we show that the MCS algorithm can be viewed as a block algorithm that searches the cliques of 𝒦_G one after the other.

LEMMA 4.8. Let v_1, v_2, ..., v_n be a perfect elimination ordering obtained by applying the maximum cardinality search algorithm to a connected chordal graph G, and let v_{i_1}, v_{i_2}, ..., v_{i_m} be the representative vertices of the cliques K_1, K_2, ..., K_m, respectively, where i_1 > i_2 > ... > i_m. Then

(8) C_{i_r} = ⋃_{s=1}^{r} K_s

for each r, 1 ≤ r ≤ m.

Proof. Choose r, 1 ≤ r ≤ m, and assume v_j ∉ C_{i_r}, i.e., j < i_r. Since clearly v_j ∉ {v_{i_s}} ∪ madj(v_{i_s}) for each s, 1 ≤ s ≤ r, it follows by Lemma 4.5 that v_j ∉ ⋃_{s=1}^{r} K_s. Now assume v_j ∈ C_{i_r} and for convenience of notation define i_0 := n + 1. Choose s, 1 ≤ s ≤ r, for which i_s ≤ j < i_{s−1}. If j = i_s, then clearly v_j ∈ K_s = {v_j} ∪ madj(v_j). If i_s < j, then by repeated application of condition 3 of Lemma 4.6, we have

K_s = {v_{i_s}} ∪ madj(v_{i_s})
    = {v_{i_s}, v_{i_s+1}} ∪ madj(v_{i_s+1})
    = ⋯
    = {v_{i_s}, v_{i_s+1}, ..., v_j} ∪ madj(v_j).

Consequently, v_j ∈ K_s, and the result follows. □


It follows from Lemma 4.8 that the MCS algorithm labels the vertices contiguously in blocks as follows:

{v_{i_1}, v_{i_1+1}, ..., v_n = v_{i_0−1}}     = K_1
{v_{i_2}, v_{i_2+1}, ..., v_{i_1−1}}          = K_2 − K_1
{v_{i_3}, v_{i_3+1}, ..., v_{i_2−1}}          = K_3 − ⋃_{s=1}^{2} K_s
    ⋮
{v_{i_m}, v_{i_m+1}, ..., v_{i_{m−1}−1}}      = K_m − ⋃_{s=1}^{m−1} K_s.

For convenience we define the function clique : V → {1, ..., m} by clique(v_j) := r, where i_0 := n + 1 and v_j ∈ {v_{i_r}, v_{i_r+1}, ..., v_{i_{r−1}−1}} (i.e., i_r ≤ j < i_{r−1}). Clearly clique(v_j) is the lowest index of a clique that contains v_j; that is,

clique(v) = min{r | v ∈ K_r}.

The following lemma is needed to provide a means of detecting the edges of a clique tree, and it is also critical in the proof of the main result in this subsection.

LEMMA 4.9. Let v_1, v_2, ..., v_n be a perfect elimination ordering obtained by applying the maximum cardinality search algorithm to a connected chordal graph G, and let v_{i_1}, v_{i_2}, ..., v_{i_m} be the representative vertices of the cliques K_1, K_2, ..., K_m, respectively, where i_1 > i_2 > ... > i_m. For any integer r, 1 ≤ r ≤ m − 1, there exists an integer p, 1 ≤ p ≤ r, such that

(9) K_{r+1} ∩ (K_1 ∪ K_2 ∪ ⋯ ∪ K_r) ⊆ K_p.

Moreover, Equation (9) is satisfied when p = clique(v_j), where v_j is the vertex in K_{r+1} ∩ C_{i_r} with smallest label j.

Proof. Let 1 ≤ r ≤ m − 1. From Lemma 4.8 it follows that

K_{r+1} ∩ (K_1 ∪ K_2 ∪ ⋯ ∪ K_r) = K_{r+1} ∩ C_{i_r},

so that to prove the result it suffices to show that K_{r+1} ∩ C_{i_r} ⊆ K_p for some p, 1 ≤ p ≤ r. Now consider the set K_{r+1} ∩ C_{i_r}, and choose v_j ∈ K_{r+1} ∩ C_{i_r} with smallest label j. Clearly K_{r+1} ∩ C_{i_r} is complete in G and moreover

(10) K_{r+1} ∩ C_{i_r} ⊆ {v_j} ∪ madj(v_j).

Choose p, 1 ≤ p ≤ r, for which i_p ≤ j < i_{p−1}. (Note that p = clique(v_j).) By the same argument used in the proof of Lemma 4.8, we have

(11) {v_j} ∪ madj(v_j) ⊆ K_p.

Combining (10) and (11), we obtain the result. □
From Lemmas 4.8 and 4.9 it follows that any MCS clique ordering is also an RIP ordering. Furthermore, Lemma 4.9 shows specifically how to use the clique function to obtain the edges of a clique tree in an efficient manner. (This technique for determining a clique tree parent function was introduced by Tarjan and Yannakakis [46] and also appears in Lewis et al. [24].) It follows that the MCS algorithm can generate a clique tree by 1) detecting the cliques via representative vertices (Lemma 4.7) and 2) choosing as the parent of K_{r+1} the clique K_p for which p = clique(v_j), where j is the smallest label in K_{r+1} ∩ C_{i_r}. The following result shows that any clique tree generated in this fashion could also be generated by Prim's algorithm applied to W_G.

THEOREM 4.10. Any order in which the cliques are searched by the maximum cardinality search algorithm is also an order in which the cliques are searched by Prim's algorithm applied to W_G.

Proof. Let K_1, K_2, ..., K_m be an ordering of 𝒦_G generated by the MCS algorithm. Choose r, 1 ≤ r ≤ m − 1. To show that this clique ordering is also a search order for Prim's algorithm applied to W_G (see Figure 10), it suffices to show that there exists p (1 ≤ p ≤ r) for which

(12) |K_{r+1} ∩ K_p| ≥ |K_s ∩ K_t| for every s and t with 1 ≤ s ≤ r < t ≤ m.

To prove that (12) holds, choose any s and t for which 1 ≤ s ≤ r < t ≤ m. Consider the vertex v_j ∈ K_{r+1} ∩ C_{i_r} for which j is minimum, and let p = clique(v_j). By Lemma 4.9, we can write

(13) K_{r+1} ∩ C_{i_r} ⊆ K_{r+1} ∩ K_p.

Lemma 4.8 and the discussion following that result imply that v_{i_r−1} is the vertex from K_{r+1} − C_{i_r} whose label is maximum. By repeated application of condition 3 of Lemma 4.6 (as needed) we obtain the following:

K_{r+1} = {v_{i_{r+1}}} ∪ madj(v_{i_{r+1}})
        = {v_{i_{r+1}}, v_{i_{r+1}+1}} ∪ madj(v_{i_{r+1}+1})
        = ⋯
        = {v_{i_{r+1}}, ..., v_{i_r−1}} ∪ madj(v_{i_r−1}).

In consequence we have

(14) K_{r+1} ∩ C_{i_r} = adj(v_{i_r−1}) ∩ C_{i_r}.

Now, if |K_s ∩ K_t ∩ C_{i_r}| > |adj(v_{i_r−1}) ∩ C_{i_r}|, then for u ∈ K_t − C_{i_r} ≠ ∅, we have

|adj(u) ∩ C_{i_r}| ≥ |K_s ∩ K_t ∩ C_{i_r}| > |adj(v_{i_r−1}) ∩ C_{i_r}|,

contrary to the maximum cardinality search criterion by which the vertices were labeled. It follows then that

(15) |K_s ∩ K_t ∩ C_{i_r}| ≤ |adj(v_{i_r−1}) ∩ C_{i_r}|.

Finally, Lemma 4.8 implies that

(16) K_s ∩ K_t ⊆ C_{i_r}.

Combining (13), (14), (15), and (16) shows that (12) holds, giving us the result. □
From the results in this subsection, we obtain an expanded version of the MCS algorithm, which computes a clique tree in addition to a PEO. The MCS algorithm is shown in Figure 3, and the expanded algorithm is shown in Figure 11.

prev_card ← 0;
C_{n+1} ← ∅;
s ← 0;
E_T ← ∅;
for i ← n to 1 step −1 do
    choose a vertex v ∈ V − C_{i+1} for which
        |adj(v) ∩ C_{i+1}| is maximum;
    α(v) ← i;                          [v becomes v_i]
    new_card ← |adj(v_i) ∩ C_{i+1}|;
    if new_card ≤ prev_card then       [begin new clique]
        s ← s + 1;
        K_s ← adj(v_i) ∩ C_{i+1};      [= madj(v_i)]
        if new_card ≠ 0 then           [get edge to parent]
            k ← min{j | v_j ∈ K_s};
            p ← clique(v_k);
            E_T ← E_T ∪ {(K_s, K_p)};
        end if
    end if
    clique(v_i) ← s;
    K_s ← K_s ∪ {v_i};
    C_i ← C_{i+1} ∪ {v_i};
    prev_card ← new_card;
end for

FIG. 11. An expanded version of MCS, which implements Prim's algorithm in Figure 10.
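The following Python transliteration of Figure 11 may help in reading the pseudocode; it is a sketch under our own data-structure choices (`adj` maps each vertex of a connected chordal graph, with no self-loops, to its neighbor set), not the authors' implementation.

def mcs_clique_tree(adj):
    n = len(adj)
    label = {}                        # alpha(v) = i; labeled set is C_{i+1}
    card = {v: 0 for v in adj}        # |adj(v) & C_{i+1}|, kept incrementally
    cliques, tree_edges = [], []      # K_1, K_2, ... and E_T (0-based indices)
    clique_of = {}                    # clique(v)
    prev_card = 0
    for i in range(n, 0, -1):
        v = max((u for u in adj if u not in label), key=lambda u: card[u])
        label[v] = i                  # v becomes v_i
        new_card = card[v]
        if new_card <= prev_card:     # test (7): begin a new clique K_s
            K = {u for u in adj[v] if u in label}        # madj(v_i)
            cliques.append(K)
            if new_card != 0:         # edge to the parent clique
                vk = min(K, key=lambda u: label[u])      # smallest label in K_s
                tree_edges.append((clique_of[vk], len(cliques) - 1))
        clique_of[v] = len(cliques) - 1
        cliques[-1].add(v)
        for u in adj[v]:              # maintain the cardinalities
            if u not in label:
                card[u] += 1
        prev_card = new_card
    return label, cliques, tree_edges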

We emphasize that the primary purpose of this section is to establish the connection between the MCS algorithm and Prim's algorithm (applied to W_G), and Theorem 4.10 demonstrates that the detailed algorithm in Figure 11 can be viewed as a special implementation of Prim's algorithm shown in Figure 10. Some of the details necessary to represent a chordal graph as a clique tree have been discussed here; for a complete discussion of this topic the reader should consult the papers [24, 46]. It is worth noting that a clique tree is often a much more compact and more computationally efficient data structure than the adjacency lists usually used to represent G.

5. Applications. In this section we briefly review a few recent applications of chordal graphs and clique trees in sparse matrix computations.

5.1. Terminology. Let Ax = b be a sparse symmetric positive definite system of linear equations, whose Cholesky factorization is denoted by A = LL^T. Direct methods for solving such linear systems store and compute only the nonzero entries of the Cholesky factor L. This factorization generally introduces fill (or fill-in) into the matrix; that is, some of the zero entries in A become nonzero entries in L.

Assume the coefficient matrix A is n × n. We associate a graph G_A = (V, E_A) with the matrix A in the usual way: the vertex set is given by V = {v_1, v_2, ..., v_n}, with two vertices v_i and v_j joined by an edge in E_A if and only if a_{ij} ≠ 0. We define the filled graph G_F = (V, E_F) in precisely the same way, where F := L + L^T. Note that G_F is a chordal supergraph of G_A (E_A ⊆ E_F) [40], and the order in which the unknowns are eliminated is a PEO for the corresponding filled graph G_F.

5.2. Elimination trees. More commonly used than the clique tree, the elimination tree associated with the ordered graph G_A has proven very useful in sparse matrix computations. The elimination tree T_A = (V, E_T) for an irreducible graph G_A is a rooted tree defined by a parent function as follows: for each vertex v_j, 1 ≤ j ≤ n − 1, the parent of v_j is v_i, where the first off-diagonal nonzero entry in column j of L occurs in row i > j. If G_A is reducible, one obtains a forest rather than a tree. A topological ordering of T_A is any ordering of the vertices that numbers each parent with a label larger than that of any of its children. The order in which the unknowns are eliminated, for example, is a topological ordering of the tree T_A, and, in fact, any topological ordering of the tree is a PEO of G_F. Elimination trees evidently were introduced by Schreiber [42], though they had earlier been used implicitly in a number of algorithms and applications. Liu [30] has provided a survey of the many uses of elimination trees in sparse matrix computations.
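In code, the parent function defining T_A is immediate once the nonzero structure of L is known. The following minimal Python sketch is our own helper, not from the literature cited above; struct[j] is assumed to hold the sorted row indices of the off-diagonal nonzeros of column j of L (0-based):

def elimination_tree_parents(struct):
    # parent[j] = row index of the first off-diagonal nonzero in column j
    # of the Cholesky factor L; None marks a root (a forest arises when
    # G_A is reducible).
    return [cols[0] if cols else None for cols in struct]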
Liu has also discovered an interesting connection between clique trees and elimination trees. To facilitate our discussion of this connection we need to introduce the following concepts and results. If ℱ is a finite family of nonempty sets, then the intersection graph of ℱ is obtained by representing each set in ℱ by a vertex and connecting two vertices by an edge if and only if the intersection of the corresponding sets is nonempty. A subtree graph is an intersection graph where ℱ is a family of subtrees of a specific tree. Buneman [5], Gavril [12], and Walter [47] independently discovered that the set of chordal graphs coincides with the set of subtree graphs, a result that further extends Theorem 3.4.

Theorem 3.3 provides an obvious way to represent a chordal graph G := G_F as a subtree graph. Choose any clique tree T_ct ∈ 𝒯_G, and consider the family of subtrees of T_ct given by

ℱ = {𝒦_G(v) | v ∈ V}.

Since two vertices are adjacent to one another in G if and only if there exists a clique K ∈ 𝒦_G to which both vertices belong, it follows that for each pair of vertices u, v ∈ V, we have (u, v) ∈ E if and only if the subtree induced by 𝒦_G(u) intersects the subtree induced by 𝒦_G(v). In consequence, G is a subtree graph for the family of subtrees ℱ in any clique tree T_ct ∈ 𝒯_G.

Liu has shown how elimination trees provide another way to view chordal graphs as subtree graphs. Let the row vertex set, denoted Struct(L_{i*}), be defined by

Struct(L_{i*}) := {v_j | ℓ_{ij} ≠ 0}.

Liu [26] has shown that each row vertex set Struct(L_{i*}) induces a subtree of T_A rooted at v_i. In consequence, G_F is a subtree graph for the family of subtrees induced by the row vertex sets of L. For a full discussion of this result, consult Liu [30].

5.3. Equivalent orderings. The fill added to G_A contains precisely the edges needed to make the order in which the unknowns of the linear system are eliminated a PEO of the filled graph G_F [40]. Usually, the primary objective in reordering the linear system is to reduce the storage (i.e., fill) and work required by the factorization. Every PEO of G_F results in precisely the same factorization storage and work requirement [28]. It is common practice in this setting to define all perfect elimination orderings of G_F as equivalent orderings.

Before advanced machine architectures entered the marketplace, there was little reason to consider choosing one PEO of G_F over another. Generally, whatever ordering was produced by the fill-reducing ordering algorithm (e.g., nested dissection [14, 15] or minimum degree [17, 25]) was accepted without modification. But this situation has changed to some extent with the advent of vector supercomputers, powerful RISC-based workstations, and a wide variety of parallel architectures. Algorithms designed for such machines may benefit by choosing one PEO of G_F over the others in order to optimize some secondary objective function. (There is still the underlying assumption that a good fill-reducing ordering is desired, though this assumption is subject to question more than it once was and deserves further study.) The following summarizes a few algorithms designed to produce an equivalent ordering that optimizes some secondary objective function.

Reordering for stack storage reduction. One of the first algorithms of this type was a simple algorithm due to Liu [27] for finding, among all topological orderings of the elimination tree, an ordering that minimizes the auxiliary storage required by the multifrontal factorization algorithm. In addition, Liu [28] gives a heuristic for finding an equivalent ordering that further reduces auxiliary storage for multifrontal factorization. Finding an optimal equivalent ordering for this problem is still an open question.

Jess and Kees reordering. Short elimination trees can be useful when the factorization is to be performed in parallel. Jess and Kees [22] introduced a simple greedy heuristic for finding an equivalent ordering that reduces elimination tree height. Liu [29] has shown that the Jess and Kees ordering scheme minimizes elimination tree height among all equivalent orderings. Liu and Mirzaian [32] introduced an O(n + |E_F|) implementation of the Jess and Kees scheme. Lewis, Peyton, and Pothen [24] used a clique tree of G_F to obtain an O(n + q)-time implementation of the Jess and Kees algorithm, where q = Σ_{i=1}^{m} |K_i|, which in practice is substantially smaller than |E_F|. Because a PEO of G_F is known a priori, a clique tree of G_F can be obtained in O(n) time using output from the symbolic factorization step of the solution process [24].

A block Jess and Kees reordering. Blair and Peyton [4] have studied a block form of the Jess and Kees algorithm that generates a clique tree T ∈ 𝒯_G of minimum diameter. The primary motivation for this algorithm is to minimize the number of expensive communication calls to the general router on a fine-grained parallel machine [19]. The time complexity of their algorithm is also O(n + q) in the sparse matrix setting, where a PEO is known a priori. A similar algorithm motivated by the same application was given by Gilbert and Schreiber [19].

Partitioning (and reordering) for parallel triangular solution. A related problem is the following: find a partition of the columns in the factor L with as few members as possible, such that for each partition member, the elementary elimination matrices associated with that member can be multiplied together without increasing the storage requirement for the factor. Such a partition and its associated PEO is desirable for implementing sparse triangular solution on a fine-grained massively parallel machine. Pothen and Alvarado [37] have solved this problem when the ordering is restricted to topological orderings of the elimination tree. Peyton, Pothen, and Yuan [36] have developed an O(n + |E_F|) algorithm that solves the problem for the larger set of all equivalent orderings; they are also working on an O(n + q) clique-tree-based algorithm for solving the problem [35].

5.4. Clique trees and the multifrontal method. Block algorithms have become increasingly important on advanced machine architectures, in both dense and sparse matrix computations [11]. The multifrontal factorization algorithm [7, 31] is perhaps the canonical example in sparse matrix computation. That clique trees, which represent chordal graphs in block form, might be a useful tool in explaining the multifrontal method is not at all surprising.

Clique trees provide the framework for presenting the multifrontal algorithm in Peyton [34] and Pothen and Sun [38]. The clique tree is rooted and ordered by a postordering of the tree, and each clique K has associated with it a frontal matrix F(K). Let K and P be respectively a clique and its parent in the clique tree. The columns of F(K) are partitioned into two sets: the factor columns of F(K) correspond to the vertices in K \ P, and the update columns of F(K) correspond to the vertices in K ∩ P. For further details consult the two references given above.

Due to its simplicity, the supernodal elimination tree is more commonly used in descriptions of the multifrontal algorithm. Liu's survey article [31], for example, uses the supernodal elimination tree to describe the block version of the algorithm.

5.5. Future progress on the "ordering" problem. Finally, we anticipate that a solid understanding of chordal graphs and clique trees will play a role in future progress in the difficult area of analyzing and understanding ordering heuristics. The problem of finding a fill-minimizing ordering of an arbitrary graph is NP-hard [48]. Consequently, progress in understanding the "ordering" problem will probably require a better understanding of the broad but nonetheless highly restricted classes of graphs G_A that arise in various application areas. If there is some progress in that area, then we further speculate that creating and/or analyzing ordering algorithms for these classes of graphs will involve many interesting properties and features of chordal graphs and clique trees. Some will be the results presented in this paper; perhaps others will be new, or at least a fresh look at familiar concepts.

REFERENCES

[1] C. BEERI, R. FAGIN, D. MAIER, AND M. YANNAKAKIS, On the desirability of acyclic database systems, J. Assoc. Comput. Mach., 30 (1983), pp. 479-513.
[2] P. A. BERNSTEIN AND N. GOODMAN, Power of natural semijoins, SIAM J. Comput., 10 (1981), pp. 751-771.
[3] J. BLAIR, R. ENGLAND, AND M. THOMASON, Cliques and their separators in triangulated graphs, Tech. Rep. CS-78-88, Department of Computer Science, The University of Tennessee, Knoxville, Tennessee, 1988.
[4] J. BLAIR AND B. PEYTON, On finding minimum-diameter clique trees, Tech. Rep. ORNL/TM-11850, Oak Ridge National Laboratory, Oak Ridge, TN, 1991.
[5] P. BUNEMAN, A characterization of rigid circuit graphs, Discrete Math., 9 (1974), pp. 205-212.
[6] G. A. DIRAC, On rigid circuit graphs, Abh. Math. Sem. Univ. Hamburg, 25 (1961), pp. 71-76.
[7] I. DUFF AND J. REID, The multifrontal solution of indefinite sparse symmetric linear equations, ACM Trans. Math. Software, 9 (1983), pp. 302-325.
[8] I. S. DUFF AND J. K. REID, A note on the work involved in no-fill sparse matrix factorization, IMA J. Numer. Anal., 3 (1983), pp. 37-40.
[9] P. EDELMAN AND R. JAMISON, The theory of convex geometries, Geometriae Dedicata, 19 (1985), pp. 247-270.
[10] D. FULKERSON AND O. GROSS, Incidence matrices and interval graphs, Pacific J. Math., 15 (1965), pp. 835-855.
[11] K. GALLIVAN, M. HEATH, E. NG, J. ORTEGA, B. PEYTON, R. PLEMMONS, C. ROMINE, A. SAMEH, AND R. VOIGT, Parallel Algorithms for Matrix Computations, SIAM, Philadelphia, 1990.
[12] F. GAVRIL, The intersection graphs of subtrees in trees are exactly the chordal graphs, J. Combin. Theory Ser. B, 16 (1974), pp. 47-56.
[13] ---, Generating the maximum spanning trees of a weighted graph, J. Algorithms, 8 (1987), pp. 592-597.
[14] A. GEORGE, Nested dissection of a regular finite element mesh, SIAM J. Numer. Anal., 10 (1973), pp. 345-363.
[15] A. GEORGE AND J. W.-H. LIU, An automatic nested dissection algorithm for irregular finite element problems, SIAM J. Numer. Anal., 15 (1978), pp. 1053-1069.
[16] ---, Computer Solution of Large Sparse Positive Definite Systems, Prentice-Hall Inc., Englewood Cliffs, New Jersey, 1981.
[17] ---, The evolution of the minimum degree ordering algorithm, SIAM Review, 31 (1989), pp. 1-19.
[18] A. GIBBONS, Algorithmic Graph Theory, Cambridge University Press, Cambridge, 1985.
[19] J. GILBERT AND R. SCHREIBER, Highly parallel sparse Cholesky factorization, SIAM J. Sci. Stat. Comput., 13 (1992), pp. 1151-1172.
[20] M. GOLUMBIC, Algorithmic Graph Theory and Perfect Graphs, Academic Press, New York, 1980.
[21] C.-W. HO AND R. C. T. LEE, Counting clique trees and computing perfect elimination schemes in parallel, Inform. Process. Lett., 31 (1989), pp. 61-68.
[22] J. JESS AND H. KEES, A data structure for parallel L/U decomposition, IEEE Trans. Comput., C-31 (1982), pp. 231-239.
[23] E. KIRSCH, Practical parallel algorithms for chordal graphs, Master's thesis, Dept. of Computer Science, The University of Tennessee, 1989.
[24] J. LEWIS, B. PEYTON, AND A. POTHEN, A fast algorithm for reordering sparse matrices for parallel factorization, SIAM J. Sci. Stat. Comput., 10 (1989), pp. 1156-1173.
[25] J. W.-H. LIU, Modification of the minimum degree algorithm by multiple elimination, ACM Trans. Math. Software, 11 (1985), pp. 141-153.
[26] ---, A compact row storage scheme for Cholesky factors using elimination trees, ACM Trans. Math. Software, 12 (1986), pp. 127-148.
[27] ---, On the storage requirement in the out-of-core multifrontal method for sparse factorization, ACM Trans. Math. Software, 12 (1986), pp. 249-264.
[28] ---, Equivalent sparse matrix reordering by elimination tree rotations, SIAM J. Sci. Stat. Comput., 9 (1988), pp. 424-444.
[29] ---, Reordering sparse matrices for parallel elimination, Parallel Computing, 11 (1989), pp. 73-91.
[30] ---, The role of elimination trees in sparse factorization, SIAM J. Matrix Anal. Appl., 11 (1990), pp. 134-172.
[31] ---, The multifrontal method for sparse matrix solution: theory and practice, SIAM Review, 34 (1992), pp. 82-109.
[32] J. W.-H. LIU AND A. MIRZAIAN, A linear reordering algorithm for parallel pivoting of chordal graphs, SIAM J. Disc. Math., 2 (1989), pp. 100-107.
[33] M. LUNDQUIST, Zero patterns, chordal graphs and matrix completions, PhD thesis, Dept. of Mathematical Sciences, Clemson University, 1990.
[34] B. PEYTON, Some applications of clique trees to the solution of sparse linear systems, PhD thesis, Dept. of Mathematical Sciences, Clemson University, 1986.
[35] B. PEYTON, A. POTHEN, AND X. YUAN, A clique tree algorithm for partitioning chordal graphs for parallel sparse triangular solution. In preparation.
[36] ---, Partitioning a chordal graph into transitive subgraphs for parallel sparse triangular solution. In preparation.
[37] A. POTHEN AND F. ALVARADO, A fast reordering algorithm for parallel sparse triangular solution, SIAM J. Sci. Stat. Comput., 13 (1992), pp. 645-653.
[38] A. POTHEN AND C. SUN, A distributed multifrontal algorithm using clique trees, Tech. Rep. CS-91-24, Department of Computer Science, The Pennsylvania State University, University Park, PA, 1991.
[39] R. PRIM, Shortest connection networks and some generalizations, Bell System Technical Journal, (1957), pp. 1389-1401.
[40] D. ROSE, A graph-theoretic study of the numerical solution of sparse positive definite systems of linear equations, in Graph Theory and Computing, R. C. Read, ed., Academic Press, 1972, pp. 183-217.
[41] D. ROSE, R. TARJAN, AND G. LUEKER, Algorithmic aspects of vertex elimination on graphs, SIAM J. Comput., 5 (1976), pp. 266-283.
[42] R. SCHREIBER, A new implementation of sparse Gaussian elimination, ACM Trans. Math. Software, 8 (1982), pp. 256-276.
[43] D. SHIER, Some aspects of perfect elimination orderings in chordal graphs, Discr. Appl. Math., 7 (1984), pp. 325-331.
[44] R. TARJAN, Maximum cardinality search and chordal graphs. Unpublished Lecture Notes CS 259, 1976.
[45] ---, Data Structures and Network Algorithms, SIAM, Philadelphia, 1983.
[46] R. TARJAN AND M. YANNAKAKIS, Simple linear-time algorithms to test chordality of graphs, test acyclicity of hypergraphs, and selectively reduce acyclic hypergraphs, SIAM J. Comput., 13 (1984), pp. 566-579.
[47] J. WALTER, Representations of rigid cycle graphs, PhD thesis, Wayne State University, 1972.
[48] M. YANNAKAKIS, Computing the minimum fill-in is NP-complete, SIAM J. Alg. Disc. Meth., 2 (1981), pp. 77-79.
CUTTING DOWN ON FILL USING NESTED DISSECTION: PROVABLY GOOD ELIMINATION ORDERINGS*

AJIT AGRAWAL†, PHILIP KLEIN‡, AND R. RAVI‡

Abstract. In the last two decades, many heuristics have been developed for finding good elimination orderings for sparse Cholesky factorization. These heuristics aim to find elimination orderings with either low fill, low operation count, or low elimination height. Though many heuristics seem to perform well in practice, there has been a marked absence of much theoretical analysis to back these heuristics. Indeed, few heuristics are known to provide any guarantee on the quality of the elimination ordering produced for arbitrary matrices.

In this work, we present the first polynomial-time ordering algorithm that guarantees approximately optimal fill. Our algorithm is a variant of the well-known nested dissection algorithm. Our ordering performs particularly well when the number of elements in each row (and hence each column) of the coefficient matrix is small. Fortunately, many problems in practice, especially those arising from finite-element methods, have such a property due to the physical constraints of the problems being modeled.

Our ordering heuristic guarantees not only low fill, but also approximately optimal operation count, and approximately optimal elimination height. Elimination orderings with small height and low fill are of much interest when performing factorization on parallel machines. No previous ordering heuristic guaranteed even small elimination height.

We will describe our ordering algorithm and prove its performance bounds. We shall also present some experimental results comparing the quality of the orderings produced by our heuristic to those produced by two other well-known heuristics.

1. Introduction. Solution of linear systems of the form Ax = b is a basic tool in numerical analysis and is commonly used in almost all branches of science and engineering. The most popular direct method of solving a system of linear equations is Gaussian elimination, in which the matrix A is first transformed into an upper triangular matrix by forward elimination, and then the solution is found by backward substitution. In many applications, the coefficient matrix A is sparse, i.e. has very few non-zero elements. Since the computations on the zero elements in A can be performed trivially without requiring any floating point operations, it is important to take advantage of the sparsity to keep the computational complexity low. However, as the elimination phase progresses, new non-zero elements are introduced into the coefficient matrix, and the matrix tends to become less sparse. The loss in sparsity translates into an increased storage requirement, and also increased computation time since the non-zero elements enter into subsequent calculations.

The new non-zero elements introduced during the elimination process are called

* Some of the work reported in this paper first appeared in an extended abstract in the Proceedings of the 31st Annual IEEE Conference on the Foundations of Computer Science, 1990 [33].

† Digital Equipment Corp., Massively Parallel Systems Group, 146 Main Street, Maynard, MA 01574.

‡ Brown University, Providence, RI 02912. Research supported by NSF grant CCR-9012357 and an NSF PYI award, together with PYI matching funds from Thinking Machines Corporation and Xerox Corporation. Additional support provided by ONR and DARPA contract N00014-83-K-0146 and ARPA Order No. 6320, Amendment 1.

fill-in. Different orders of eliminating the variables may yield very different fill-in. It is thus of prime importance to be able to choose an elimination ordering of the variables that results in small fill-in. The choice of an elimination ordering that results in small fill-in often conflicts with the requirement of an ordering that ensures numerical stability of the solution process. Fortunately, many of the systems of equations that arise in practice are positive definite, in which case numerical stability is not a problem [17]. In solving such linear equations, we are free to choose an ordering of variables entirely based on our desire to preserve the sparsity of the matrix during the elimination process.

A matrix A is called symmetric if A_{ij} equals A_{ji} for every i, j. Positive definite matrices that are also symmetric frequently arise in structural analysis, signal processing, economics, VLSI simulation, solution of linear programs, and solution of partial differential equations, to name a few. In this work, we study the problem of finding a good elimination ordering for such matrices.

Henceforth, when we refer to a linear system of equations Ax = b, we assume that A is a symmetric positive definite matrix.

Minimizing fill. We shall define the fill for an ordering as the sum of the number of non-zero elements in the matrix and the fill-in introduced by the ordering. The fill for an ordering measures the amount of storage required, and also has bearing on the total time required for the elimination process. It is thus of interest to find an elimination ordering that minimizes fill. For the purposes of analyzing fill, we shall count each pair of symmetric elements only once.

Finding such an ordering is NP-complete [58] and hence is unlikely to have a polynomial-time solution. Nevertheless, this problem is of fundamental importance and a large set of ordering heuristics have been developed [3, 55, 14, 20, 49, 16, 15, 43, 18, 17, 9, 10]. However, none of them are known to give any performance guarantee on the size of the fill for arbitrary symmetric matrices. In this work, we present the first polynomial-time algorithm that guarantees approximately optimal fill. Our algorithm performs particularly well when the number of elements in each row (and hence each column) of the matrix is small. Fortunately, many problems in practice, especially those arising from finite-element methods, have such a property due to the physical constraints of the problems being modeled. As stated earlier, all our results are for the class of symmetric positive definite systems of equations.

THEOREM 1.1. There is a polynomial-time algorithm that finds an elimination ordering yielding approximately optimal fill. The fill for the ordering is within a factor of O(√d log^4 n) of the optimum, where n is the number of variables and d is the maximum number of non-zero entries in any row or column of the coefficient matrix.
Our algorithm is a variant of the well-known nested dissection algorithm [14]. It treats the input matrix as the adjacency matrix of a graph. The algorithm is based on finding a recursive decomposition of the graph associated with the coefficient matrix. The use of graphs in the study of elimination ordering is not new [51, 54]. There is an obvious way of associating a graph with a given symmetric matrix; the variables of the matrix associate with the nodes of the graph, and there is an edge between nodes i and j iff the element (i, j) of the matrix is non-zero. The values of the non-zero elements in the coefficient matrix are not relevant for the ordering problem in the case of symmetric positive definite systems.
Parter [51] first showed how to interpret the elimination process as a graph-theoretical process. Rose [55] characterized the class of graphs for which there is an elimination order with no fill-in, and showed how to find such an ordering. George [14] first proposed a nested dissection approach, designed specifically for grid graphs. The algorithm was shown to be optimal (within constant factors) in terms of operation count for the case of a regular finite-element mesh [29]. George's approach was first generalized to arbitrary graphs by George and Liu [16]. However they used a simple heuristic for graph separators and hence could not prove any bounds on its performance. Lipton, Rose, and Tarjan [41] proved a performance bound for their version of the nested dissection algorithm for graphs with small (O(√n) in size) separators. Gilbert and Tarjan [26] later showed that by using small separators, George and Liu's nested dissection algorithm also gives close to optimal fill for planar graphs, but does not generalize to the class of graphs with O(√n) size separators. A good overview of these algorithms can be found in the books of George and Liu [17] and Duff and Reid [9].
Node separators are fundamental to the nested dissection algorithm. A node separator consists of a set of nodes whose removal breaks the graph up into pieces. For our purposes, every subset of the nodes is a separator. A separator is f-balanced, for some f < 1, if no piece remaining on its removal has more than fn nodes, where n is the number of nodes in the graph.

The fill resulting from applying the approach of Lipton, Rose, and Tarjan depends upon the size of the separators for the class of graphs being dealt with. For example, their approach yields O(n log n) fill for any planar graph, based on the fact that planar graphs have 2/3-balanced separators of size O(√n).
There are three main differences between our work and the work of Lipton, Rose and Tarjan. One, we do not assume the existence of any special separator structure in the graphs. Two, our analysis is more strict; we are able to analyze the quality of our result with respect to the minimum fill achievable over all orderings. Three, our variation of nested dissection is similar to that of George and Liu, and somewhat simpler than that of Lipton, Rose and Tarjan.

Gilbert [22] showed that for a matrix with at most d non-zero elements in each row (and column), there exists a nested dissection algorithm whose fill is within O(d log n) of minimum. His nested dissection algorithm, however, is inherently non-constructive; the choice of separators in his algorithm depends crucially on the optimally filled matrix.
Our nested dissection algorithm also uses balanced node separators. No polynomial-time algorithms are known for finding a minimum-size balanced node separator in a graph. However, we show that choosing near-optimal balanced node separators is sufficient to achieve near-optimal fill. All our proofs are independent of the method by which the near-optimal separators are found. To our knowledge, Leighton and Rao [36, 37] have provided the first and the only polynomial-time algorithm to find an approximately optimal balanced separator in an arbitrary graph. We hence use their method in our nested dissection algorithm. However, in practice, other separator algorithms may be preferable on grounds of efficiency.

Though the performance guarantee of our algorithm deteriorates in going from a graph of small degree to one without this restriction, it is the first such algorithm that gives any non-trivial performance bound on the quality of the fill.

THEOREM 1.2. There is a polynomial-time algorithm that produces an elimination ordering yielding a fill of O(F*^{3/4} √m log^{3.5} n), where F* is the size of the minimum fill, and m is the number of non-zero elements in the given n × n matrix.

Note that the performance guarantee for fill from the above theorem is never worse than O(m^{1/4} log^{3.5} n), since the size of the minimum fill F* must be at least as much as the number of edges m in the graph.

Minimizing the operation count. We show that the ordering generated by our algorithm approximately minimizes not only fill, but also the total number of arithmetic operations required to solve the system of equations using Gaussian elimination. Much of the previous work in elimination ordering has been concerned with minimizing fill, and surprisingly little attention has been given to minimizing the number of operations. The only exceptions we know of are the works of Hoffman et al. [29], Lipton et al. [41] and Gilbert and Tarjan [26], who analyzed the operation counts for their nested dissection algorithms. However, their results only apply to specific classes of problems as explained in the discussions above.

THEOREM 1.3. The elimination ordering produced by our algorithm approximately minimizes the total number of arithmetic operations. The performance guarantee is O(d log^6 n) for an n × n matrix with a maximum of d non-zero elements in each row and column.

Solving sparse systems in parallel. With the advent of parallel machines, much attention has naturally been focussed on solving sparse systems in parallel. In a typical parallel implementation of the elimination process, multiple variables are eliminated simultaneously in a step. Such variables must then have no dependencies between them, i.e. there must be no edge in the graph between the nodes corresponding to the variables. For an elimination ordering, the minimum number of parallel steps required to eliminate all the variables is called its height [47]. It is hence of interest to find an elimination ordering of minimum height. This problem is also known to be NP-complete [53]. Many researchers have given heuristics without any guarantees for this problem in the past [30, 38, 44, 48, 40, 10]. We prove that our nested dissection algorithm can guarantee approximately minimum height.

THEOREM 1.4. The elimination ordering produced by our algorithm has height within a factor of O(log^2 n) of optimal.

Note that our algorithm itself is not parallel, but is to be used to generate an ordering with small height. Such an approach is suitable for problems where the sparsity structure of the matrix is fixed, and the linear system has to be solved for many different coefficient values and/or right-hand sides. A good elimination ordering can hence be found sequentially as a preprocessing step. Our work is thus different

TABLE 1
Performance guarantees for the elimination ordering produced by our algorithm. In the above table, n, m and d respectively denote the number of nodes, edges, and the maximum degree of the graph associated with the coefficient matrix.

Characteristic          Performance Guarantee
Fill                    O(min(√d log^4 n, m^{1/4} log^{3.5} n))
Operation Count         O(d log^6 n)
Elimination Height      O(log^2 n)

from other work addressing the issue of generating the ordering itself in parallel [43, 52, 39, 27, 25, 5].

Along with minimizing the height, it is desirable to keep both the fill and the operation count small, since they determine the total space and the total work done by the algorithm. Our algorithm is the first one known that approximately minimizes all three quantities simultaneously: fill, operation count, and height. By putting Theorems 1.1, 1.3 and 1.4 together, we get the following result.

THEOREM 1.5. The elimination ordering produced by our algorithm simultaneously minimizes height to within a O(log^2 n) factor, fill to within a O(√d log^4 n) factor, and the operation count to within a O(d log^6 n) factor of the respective optimum quantities. In the guarantee, d denotes the maximum number of non-zero elements in any row or column of the n × n coefficient matrix.

Bodlaender et al. [2] have independently presented essentially the same algorithm as ours for finding an elimination ordering of approximately minimum height. However, they do not analyze the fill and operation count for their ordering.

The performance guarantees for the elimination ordering obtained by our algorithm are given in Table 1.
Gilbert [22] has conjectured that there is an ordering that simultaneously minimizes height and approximately minimizes fill (to within a constant factor). Our analysis here represents progress towards proving this conjecture. Our algorithm is a polynomial-time algorithm since we utilize near-optimal node separators in our nested dissection algorithm. By utilizing a minimum node separator (non-polynomial) algorithm, we can prove the following:

THEOREM 1.6. There exists a nested dissection ordering that simultaneously minimizes height to within a O(log n) factor, fill to within a O(√d log^2 n) factor, and operation count to within a O(d log^4 n) factor of the respective optimum quantities. In the guarantee, d denotes the maximum number of non-zero elements in any row or column of the n × n coefficient matrix.

Chordal graphs. An elimination step can be interpreted as updating the associated graph by adding new edges; the fill-in introduced by eliminating a variable i turns the higher-numbered neighbors of node i into a clique. At the termination of the elimination process, the edges of the associated graph thus obtained correspond directly to the fill yielded by the ordering. Rose [55] showed that this updated graph is in fact chordal, and that minimizing fill corresponds to finding a minimum-size chordal graph containing the graph associated with the input matrix. A chordal graph is a graph in which every cycle of length at least four has a chord, i.e. there is an edge between two non-consecutive nodes in the cycle. Chordal graphs are also sometimes referred to as triangulated graphs.

We exploit the characterization of the elimination process given by Rose. We prove that our nested dissection ordering yields a chordal graph of small size with respect to the optimal.

Organization of the paper. We explain the relationship between the elimination
process and graph chordalization in Section 3. We give our algorithm in Section
4, and explain our algorithm in terms of a separator tree in Section 5. The lower
and upper bounds for the fill, operation count, and height are provided in Sections
6, 7 and 8 respectively. We provide some experimental results for the performance of
our algorithm in Section 9, and conclude the paper by discussing some open issues in
Section 10.

2. Notation. A graph G and a matrix A are said to be corresponding if (u, v) ∈
G if and only if A_uv ≠ 0. Very often we shall refer to a matrix as a graph, meaning
the graph corresponding to the matrix. The position of a node v in an elimination
ordering α will be denoted by α(v). Throughout this paper, we shall use G*_α to refer
to the chordal extension of a graph G for a given ordering α. Moreover, we shall refer
to the optimal chordal extension of a graph G by G*; the optimality criteria for the
extension will be clear from the context. By |G| for a graph G, we shall refer to its
number of edges.
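To make this correspondence concrete, the following small Python sketch (our illustration, not part of the original text; the function name is ours) extracts the edge set of the graph corresponding to a symmetric matrix given as a list of lists:

    def matrix_to_graph(A):
        # Edge (u, v) is present exactly when A[u][v] is non-zero (u < v);
        # diagonal entries are ignored, and symmetry of A is assumed.
        n = len(A)
        return {(u, v) for u in range(n)
                       for v in range(u + 1, n) if A[u][v] != 0}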

3. Graph chordalization. An elimination ordering of a graph G is said to
be perfect if the ordering induces zero fill-in on the matrix corresponding to the
graph. A graph is chordal if it has a perfect elimination order. One of the simplest
characterizations of a chordal graph is that every cycle of length at least four in
the graph has an edge between some pair of non-consecutive nodes of the cycle. A
good discussion of chordal graphs can be found in Golumbic's book [28]. Apart from
Gaussian elimination, the study of chordal graphs has applications in pedigree analysis,
and evidence propagation in belief networks [31].
Graph chordalization is the problem of extending an input graph G to a chordal
graph by adding a minimum number of edges. Every minimal chordal extension of
a graph can be completely specified by an ordering of the nodes of the graph; the
ordering is a perfect elimination ordering for the chordal extension of the graph. For
the matrix corresponding to the graph, this ordering of nodes also corresponds to an
elimination ordering of the corresponding variables.
To construct a chordal extension of a graph G from a given ordering α of its nodes,
we mimic the elimination process in terms of the following operations on G [54]. G_0
is set to be the original graph G. Let G_i be the graph at the end of step i, and let v
be the (i+1)th node in the ordering. At step i+1, we augment G_i to obtain G_{i+1}
by turning all the neighbors of v numbered higher than i+1 into a clique. G_n is the
desired unique chordal graph corresponding to the ordering α, where n is the number
of nodes in the graph.
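This elimination process translates directly into code. The following Python sketch is our own rendering of the procedure above (the function and argument names are ours); it takes the graph as a dictionary of adjacency sets and the ordering as a list of nodes, and returns the adjacency sets of G_n:

    def chordal_extension(adj, alpha):
        # pos[v] is the position of v in the ordering, i.e. alpha(v).
        pos = {v: i for i, v in enumerate(alpha)}
        g = {v: set(nbrs) for v, nbrs in adj.items()}   # G_0 = G
        for v in alpha:
            # Neighbors of v (including fill added so far) ordered after v.
            higher = [u for u in g[v] if pos[u] > pos[v]]
            # Turn them into a clique; the new edges are the fill of this step.
            for i in range(len(higher)):
                for j in range(i + 1, len(higher)):
                    a, b = higher[i], higher[j]
                    g[a].add(b)
                    g[b].add(a)
        return g

The edges of the returned graph that are absent from the input are exactly the fill induced by the ordering.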

An ordering of the nodes of a graph is hence sufficient to specify a chordal extension
of a graph. We employ this approach in presenting our algorithm for approximately
minimum chordalization.

4. Our algorithm

Graph separators. Before we give the ordering algorithm, we need some background
on an essential ingredient of our algorithm, namely balanced node separators.
Recall that a set of nodes X in a graph G = (V, E) is called an f-balanced node
separator for some fraction f < 1, if no connected component of G − X is of size more
than the fraction f of |V|. No polynomial-time algorithms are known for finding an f-
balanced node separator of minimum size for a non-trivial constant f. However, using
the technique of Leighton and Rao [36], one can find an approximately minimum-sized
balanced node separator. This was also shown by Makedon and Tragoudas [50].
LEMMA 4.1 ([36, 50]). There exists a polynomial-time algorithm to find a 2/3-
balanced node separator in a graph of size within an O(log n) factor of the optimal
1/2-balanced node separator.

Note that every 1/2-balanced node separator is also a 2/3-balanced node separator.
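The balance condition itself is easy to verify mechanically. Here is a minimal Python sketch (ours, with hypothetical names) that checks whether a vertex set X is an f-balanced node separator by flood-filling the components of G − X:

    def is_balanced_separator(adj, X, f):
        # adj maps each vertex to its set of neighbors; X is the candidate
        # separator; f is the balance fraction (e.g. 2/3).
        bound = f * len(adj)
        unseen = set(adj) - set(X)        # vertices of G - X
        while unseen:
            stack, size = [unseen.pop()], 0
            while stack:                  # one connected component of G - X
                v = stack.pop()
                size += 1
                for u in adj[v]:
                    if u in unseen:
                        unseen.remove(u)
                        stack.append(u)
            if size > bound:
                return False
        return True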

Elimination ordering algorithm. Our ordering algorithm is a nested dissection
algorithm that is based on a recursive decomposition of the graph.
Given a graph G = (V, E) with n nodes, we proceed as follows to number its nodes
in the range [a, b], where b = a + n − 1. If n = 1, we number the single node a. Else,
we find an approximate 2/3-balanced node separator X for G using the algorithm in
Lemma 4.1. We number the vertices in the separator from b − |X| + 1 to b in any
order. The rest of the nodes are numbered as follows. Let G − X have k connected
subgraphs A_1, ..., A_k of sizes n_1, ..., n_k. We recursively number the graph A_i in the
range [a + Σ_{j=1}^{i−1} n_j, a − 1 + Σ_{j=1}^{i} n_j] for each i ∈ [1, k].
We shall refer to this ordering as α for the rest of this paper.
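The numbering scheme is naturally expressed as a recursion. The sketch below (ours) assumes a routine find_separator implementing the approximation algorithm of Lemma 4.1 and a routine components returning the connected pieces of G − X; both are stand-ins rather than given implementations:

    def nested_dissection_order(vertices, adj, find_separator, components, a=0):
        # `vertices` is a set of vertex names; number them in the range
        # [a, a + len(vertices) - 1] and return a dict of vertex -> number.
        order = {}
        if len(vertices) == 1:
            order[next(iter(vertices))] = a
            return order
        X = find_separator(vertices, adj)      # approx. 2/3-balanced separator
        b = a + len(vertices) - 1
        for i, v in enumerate(sorted(X)):      # separator gets b-|X|+1, ..., b
            order[v] = b - len(X) + 1 + i
        start = a
        for A in components(vertices - set(X), adj):
            order.update(nested_dissection_order(A, adj, find_separator,
                                                 components, start))
            start += len(A)
        return order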

5. Separator tree. In this section, we will establish a lower bound on the size
of the optimum chordal extension of a graph in terms of the separator sizes found
by the nested dissection algorithm. In our nested dissection ordering, we employ an
approximation algorithm for finding balanced node separators. However, for ease of
exposition below, we shall assume that we have a separator algorithm that finds the
best balanced node separator in a graph. We shall later forego this assumption.
Consider the following tree, called the separator tree, representing the nested
dissection process on a graph; the separator vertices form the root of the tree, and the
trees of each of the pieces are built recursively. To distinguish from the nodes of the
tree, we shall refer to the nodes of the graph as vertices. A tree node hence stands
for a separator that may consist of several vertices. Our algorithm thus defines an
elimination ordering of the vertices of the original graph that is consistent with a
postorder traversal of the nodes of the separator tree. However, the algorithm orders
the vertices within a tree node arbitrarily.

Let G(V, E) be an input graph and G* be a minimum-size chordal extension of
G. Let T be the separator tree given by our nested dissection ordering (employing
an optimal balanced separator algorithm). With each node x in the separator tree T,
let us associate three quantities S_x, V_x, and G_x. S_x ⊆ V is the set of vertices forming
the node x, V_x ⊆ V is the set of vertices belonging to the nodes of the subtree rooted
at x, and G_x ⊆ G is the subgraph induced by the vertex set V_x in the original graph
G. Let us denote the separator containing a vertex v by X_v and the set of vertices
belonging to any of the nodes in the subtree rooted at X_v by T_v. The separator tree
naturally defines an ancestor relation on the nodes of the tree. We shall also consider
a node to be a trivial ancestor of itself. We shall say that a node u is a proper ancestor
of a node v if u is an ancestor of v but u ≠ v.

We shall derive a lower bound on the size of the optimal chordal extension G* in
terms of the sizes of the separators at any level of the separator tree. By a level of the
tree we mean all the nodes in the tree that are at the same distance from the root.
Let x_1, ..., x_p be the tree nodes at some level of the separator tree. Since V_{x_1}, ..., V_{x_p}
are disjoint, it follows that the graphs G*_{x_1}, ..., G*_{x_p} induced by them in G* are also
disjoint. Thus we have

(1)    Σ_{i=1}^{p} |G*_{x_i}| ≤ |G*|

where |G| refers to the number of edges (size) of G.


We shall now use the following two previously known results.

FACT 5.1. Every node-induced subgraph of a chordal graph is chordal.

Gilbert, Rose, and Edenbrandt [24] showed that every chordal graph has a balanced
clique separator, i.e. a set of nodes that, along with being a balanced node
separator, induces a clique in the chordal graph. Since the number of edges in this
clique can be at most the number of edges in the chordal graph, the following theorem
follows.

THEOREM 5.1 ([24]). Every chordal graph has a 1/2-balanced clique separator, and
hence has a 1/2-balanced node separator of size at most √(2m), where m is the number
of edges in the chordal graph.

By Fact 5.1, each of the graphs G*_{x_i} is chordal. Hence by Theorem 5.1, G*_{x_i} has
a 1/2-balanced node separator of size at most √(2|G*_{x_i}|); since S_{x_i} is an optimal
balanced separator of G_{x_i}, we can write

(2)    |S_{x_i}|² ≤ 2 |G*_{x_i}|

On rewriting (1) using this observation, we have the following lemma.

LEMMA 5.2. The size of the optimal chordal extension is at least one-half the
largest sum of the squares of the sizes of the separators at any level of the nested
dissection separator tree.

One of the main results of this work is to show that the nested dissection algorithm
in fact yields a chordal graph whose size is close to the lower bound given above (see
Section 6.3 for proof).

THEOREM 5.3. For any level l of the nested dissection separator tree, let S_l be
the sum of the squares of the sizes of the separators at this level. Then the size of the
optimal chordal extension of G is at least (1/2) max_l S_l, and at most O(√d log⁴ n) · max_l S_l.
In employing an approximation algorithm for finding balanced node separators
that has a factor of f performance guarantee, we prove that the size of the chordal
graph thus obtained is no more than O(f²) times the size of that obtained by using
the optimal balanced node separators. We employ a separator algorithm with an
O(log n)-factor performance guarantee, and obtain the following result (see Section
6.3 for proof).
THEOREM 5.4. There is a polynomial-time algorithm that generates a nearly
optimal chordal extension of an input graph. The size of the chordal graph is
O(min(|G*| √d log⁴ n, |G*|^{1/2} √m log^3.5 n)), where |G*| is the size of the optimal chordal
extension of an input graph of n nodes, m edges, and maximum degree d.

6. Performance guarantee: Number of edges. In this section, we establish
the performance guarantees for the number of edges, and hence the fill, for our
elimination ordering.

6.1. A lower bound. We shall first establish a lower bound on the number of
edges in the optimally filled graph G*. In Lemma 5.2, we showed a lower bound
for |G*| in terms of the sizes of the separators at any level of the separator tree.
However, we had assumed that we had an optimal separator algorithm. We now
relax that restriction, and derive a similar result using the nested dissection tree
built with the O(log n) separator approximation algorithm of Leighton and Rao (see
Lemma 4.1).
We first state the following simple observation.
PROPOSITION 6.1. Let x_1, ..., x_p be the separators at some level of the separator
tree. The vertex sets T_{x_1}, ..., T_{x_p} of the subtrees rooted at these separators are disjoint.
LEMMA 6.2. Let X_1, ..., X_p be the separators at any level of the nested dissection
separator tree. The size of the optimal chordal extension is Ω((Σ_{i=1}^{p} |X_i|²) / log² n).
Proof. Let G*_i be the subgraph induced in G* by the vertices belonging to the
subtree rooted at X_i. By Theorem 5.1, G*_i has a 1/2-balanced separator of size at most
√(2|G*_i|). Let this separator be X'_i. Then we have

(3)    Σ_{i=1}^{p} |X'_i|² ≤ 2 Σ_{i=1}^{p} |G*_i| ≤ 2 |G*|

The second inequality follows from the disjointness of the subgraphs G*_i, using Proposition
6.1. Let the graph induced in G by the vertices of G*_i be G_i. Since the edges of
G_i form a subset of the edges of G*_i, it follows that X'_i is also a 1/2-balanced separator
in G_i. By construction, the vertex set X_i is a 2/3-balanced node separator in G_i, and
on applying Lemma 4.1, we have

|X_i| = O(log n) |X'_i|

which implies that

(4)    Σ_{i=1}^{p} |X_i|² ≤ O(log² n) Σ_{i=1}^{p} |X'_i|²

On substituting (3) into (4), we get

(5)    Σ_{i=1}^{p} |X_i|² ≤ O(log² n) |G*|

Hence the lemma follows on rewriting the last equation. □

6.2. A characterization of chordal graphs. Our aim is to estimate the number
of edges in the chordal graph corresponding to the ordering given by our algorithm.
To do so, we need a good characterization of these edges. Earlier we discussed one
such characterization by specifying how to extend a graph to be chordal given the
elimination ordering of its nodes. However, there is in fact a more direct characterization
of these edges. We shall employ this characterization in estimating the total
number of edges in the chordal extension resulting from our elimination ordering.
This characterization is the following.
LEMMA 6.3 ([56]). For a given elimination ordering α, an edge (u, v) is in G*_α if
and only if there is a path P = {z_0 = u, z_1, ..., z_p = v} in G such that α(z_i) < α(u)
and α(z_i) < α(v), for each i = 1, ..., p − 1.
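Lemma 6.3 can be applied directly, if naively, to test whether a given pair becomes an edge of the extension: search for a u-v path whose interior vertices are all ordered before both endpoints. A Python sketch (ours, for illustration only; practical fill computation uses far more efficient machinery):

    def is_edge_of_extension(adj, pos, u, v):
        # pos[w] = alpha(w). Returns True iff (u, v) is an edge of G*_alpha,
        # i.e. G has a u-v path through vertices z with
        # pos[z] < min(pos[u], pos[v]).
        cutoff = min(pos[u], pos[v])
        seen, stack = {u}, [u]
        while stack:
            w = stack.pop()
            for z in adj[w]:
                if z == v:
                    return True          # covers the edge (u, v) of G itself
                if z not in seen and pos[z] < cutoff:
                    seen.add(z)          # z can serve as a path interior
                    stack.append(z)
        return False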
Using Lemma 6.3 and the structure of the separator tree, we claim the following
characterization of the edges in a chordal extension given by our nested dissection
ordering. This has also been shown by Gilbert and Tarjan as Lemmas 3 and 4 in
their paper [26].
LEMMA 6.4. Let α be a nested dissection ordering, and w, v ∈ G such that
α(w) < α(v). An edge (w, v) is in G*_α only if there exists an edge (u, v) ∈ G such
that X_v is an ancestor of X_w, and X_w is an ancestor of X_u.
Proof. By Lemma 6.3, we know that if the edge (w, v) exists in G*_α, then there is
a path P = {w = z_0, z_1, ..., z_p = v} from w to v such that all the vertices in the path
are ordered before w and v. We claim this implies that all the vertices in P belong
both to T_w and T_v. We prove it by contradiction. For contradiction, let us assume
that such is not the case, and that there is a vertex z_i ∈ P such that z_i ∉ T_w. Since
X_w is a node separator, any path from a vertex in T_w to a vertex not in T_w must
contain a vertex belonging to a proper ancestor of X_w. But then such a vertex will
be numbered higher than w, since the numbering is consistent with a post-ordering
of the tree nodes. By our assumption on the path P, this cannot be true. Thus each
of the vertices on the path P belongs to T_w. A similar argument shows that each of
these vertices also belongs to T_v.
Thus X_w and X_v are ancestors of X_{z_i} for every 0 ≤ i ≤ p. In particular, X_w and
X_v are ancestors of X_{z_{p−1}}, and the edge (z_{p−1}, v) exists in G. The only way both X_w
and X_v can be ancestors of another separator is if one of them is an ancestor of the
other. Since α(v) > α(w), it follows that X_v must be an ancestor of X_w. Hence the
lemma holds. □

[Figure 1 shows a separator tree, highlighting the node containing v, the nodes
containing neighbors of v in the original graph, and the associated tree for v.]

FIG. 1. The associated tree of a node v.

6.3. An upper bound: Small degree graphs. Now we shall establish an
upper bound on the number of edges in the chordal graph for the ordering given by
our nested dissection algorithm. We shall count the edges to a vertex v from any of
the vertices numbered smaller than v.

Let us define the level of a node v in the tree as the distance of v from the root,
and denote it by level(v). By a level l in the tree, we refer to all the nodes at level l.
By the level of a vertex we shall refer to the level in the separator tree of the node it
belongs to. The depth of a tree refers to the maximum level of any node in the tree.
We claim that the depth of the tree is small.
LEMMA 6.5. The depth of the separator tree is at most O(log n).
Proof. On removing a balanced separator from a graph with n vertices, each of
the pieces has at most (2/3)n vertices. Hence the graph size decreases exponentially with
the increase in recursion depth of the nested dissection algorithm. The depth of the
separator tree is then at most log_{3/2} n. □

We shall now count the number of edges to a vertex v from any of the vertices
numbered smaller than v. For that, we define the notion of an associated tree for each
vertex. The associated tree for a vertex v belonging to a separator X is constructed
as follows. Let v_1, ..., v_k be the neighbors of v such that level(v_i) ≥ level(v), for
1 ≤ i ≤ k. Let X_i be the separator containing v_i. The associated tree for v is the
smallest subtree rooted at X containing each of the separators X_1, ..., X_k (see Figure
1).
Lemma 6.4 implies that for every edge (w, v) ∈ G*_α where α(v) > α(w), w must
belong to the associated tree of v. Thus the total number of edges to v from vertices

numbered lower than v in the ordering is at most the number of vertices belonging
to all the separators in the associated tree of v. We shall refer to this number for v
as the cost of v. Thus the total number of edges in our chordal extension is at most
the sum of the costs of the vertices. (In Liu's terminology [47], the associated tree for
a vertex is exactly the part of the separator tree that contains its "row subtree".)
THEOREM 6.6. The total number of edges in the chordal extension obtained by
our nested dissection ordering is at most O(√d log⁴ n) times optimal, where d is
the maximum degree of the graph.

Proof. Let us estimate the sum of the costs of all vertices at a given level l_1 in the
tree. Let this level consist of separators X_1, ..., X_p. For i = 1, ..., p, consider the
highest-cost vertex of X_i, and let A_i be the associated subtree for this vertex. For
each level l ≥ l_1, let W_l(A_i) be the number of vertices in A_i at level l. Then the sum
of the costs of vertices at level l_1 is no more than the sum, over all levels l ≥ l_1,
of the value

(6)    Σ_{i=1}^{p} |X_i| · W_l(A_i).

Let A_i have q_i separators X_{i,1}, ..., X_{i,q_i} at level l. Since each vertex has a maximum
degree of d, it follows that the associated tree of a vertex has at most d leaves. This
implies that each level of the associated tree has at most d nodes, and hence q_i is at most d.
Substituting into (6), we get

(7)    Σ_{i=1}^{p} Σ_{j=1}^{q_i} |X_i| |X_{i,j}|
           ≤ ( Σ_{i=1}^{p} Σ_{j=1}^{q_i} |X_i|² )^{1/2} ( Σ_{i=1}^{p} Σ_{j=1}^{q_i} |X_{i,j}|² )^{1/2}
           ≤ ( d Σ_{i=1}^{p} |X_i|² )^{1/2} ( Σ_{i=1}^{p} Σ_{j=1}^{q_i} |X_{i,j}|² )^{1/2}

where the first inequality follows from the Cauchy-Schwarz inequality, and the second
from the fact that q_i ≤ d. By Lemma 6.2 it follows that

Σ_{i=1}^{p} |X_i|² = O(|G*| log² n)

and similarly

Σ_{i=1}^{p} Σ_{j=1}^{q_i} |X_{i,j}|² = O(|G*| log² n)

Thus the right-hand side of (7) is O(√d |G*| log² n). Summing over all levels l_1 and
l, we conclude that there are O(√d |G*| log⁴ n) edges. □

Our elimination ordering hence yields a chordal graph which has only a polylog
factor more edges than the optimal if the maximum degree of the graph is at most
polylog in the number of nodes. This also proves that the fill for such graphs is
provably small. Moreover, many problems in practice, for example finite element
problems, have small degree, and thus for these problems our nested dissection ordering
is guaranteed to produce near-optimal fill.

6.4. An upper bound: Large degree graphs. While the performance bound
is polylog for small degree graphs, we cannot claim the same for unbounded degree
graphs. We can, however, claim a non-trivial performance bound which is no worse
than a factor of √m log^3.5 n times the optimal, where m is the number of edges in the
graph. We omit the proof for brevity. The details can be found elsewhere [1].
THEOREM 6.7. For an unbounded degree graph G with n vertices and m edges,
the total number of edges in G*_α is O(|G*|^{1/2} √m log^3.5 n).

7. Performance guarantee: Number of multiplications. In this section,
we shall establish the performance guarantee for the number of multiplications required
by our nested dissection ordering. Since the cost of solving a system of linear
equations is proportional to the number of multiplications required for the process,
this guarantee reflects the guarantee for the total sequential time required to solve
the problem using Gaussian elimination.

7.1. A characterization of the number of multiplications required. We shall
use the following characterization of the total number of multiplications required by
an elimination ordering in terms of the cliques of the filled-in chordal graph. Every
vertex v in G*_α forms a clique with all its neighbors ordered after v. We shall refer to
this clique as the associated clique for the vertex, and denote it by C_v. The number
of multiplications required to eliminate a variable v is the total number of edges in
the clique C_v. Thus the total number of multiplications required to eliminate all the
variables in a chordal graph equals the sum of the number of edges in the associated
cliques of each node.
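In code, this characterization gives a one-pass count over the filled graph. A small Python sketch (ours), where filled_adj holds the adjacency sets of G*_α and alpha is the elimination order:

    def multiplication_count(filled_adj, alpha):
        pos = {v: i for i, v in enumerate(alpha)}
        total = 0
        for v in filled_adj:
            # k = number of neighbors of v ordered after v; the associated
            # clique C_v then has k + 1 vertices and k * (k + 1) / 2 edges.
            k = sum(1 for u in filled_adj[v] if pos[u] > pos[v])
            total += k * (k + 1) // 2
        return total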

7.2. A lower bound. Consider the case when a chordal graph has a clique of
size p. Then for any ordering of variables in the clique, the node numbered i within
the clique has an associated clique of size p − i, for every i from 1 to p. Thus the
total number of multiplications required to eliminate all the variables in this clique
is Σ_{i=1}^{p} (p − i)², which is Ω(p³). By Theorem 5.1, since every chordal graph has a
1/2-balanced clique separator, the following lemma easily follows.

LEMMA 7.1. For any chordal graph G*, if p is the size of its clique separator, then
Ω(p³) is a lower bound on the number of multiplications required for any elimination
ordering.
Let M* be the least multiplication count for any elimination ordering of G. We
shall extend Lemma 7.1 a step further to relate M* to the sizes of the separators at
any level of the separator tree.
LEMMA 7.2. Let a given level in the separator tree obtained by our algorithm have
p separators X_1, ..., X_p. Then Ω((Σ_{i=1}^{p} |X_i|³) / log³ n) is a lower bound on M*.

Proof. Let T_i be the subtree rooted at X_i and let G*_i be the subgraph induced by the
vertices of T_i in G*. Since G*_i is chordal by Fact 5.1, it has a clique separator. Since
our separator approximation has a guarantee of O(log n), the clique separator of G*_i
must have size Ω(|X_i| / log n). By Lemma 7.1, G*_i must then require Ω(|X_i|³ / log³ n)
multiplications. Since the subgraphs G*_1, ..., G*_p are disjoint, it follows that any
ordering in G* must require Ω((Σ_{i=1}^{p} |X_i|³) / log³ n) multiplications. □

[Figure 2 shows three levels l_1, l_2, and l_3 of the separator tree, with vertices v, u, and z.]

FIG. 2. The only vertices z which can contribute to the edge (u, v) must belong both to the associated
tree of v and to the subtree rooted at u.

7.3. An upper bound. We shall now derive an upper bound on the number
of multiplications required. Let M be the number of multiplications required for the
elimination ordering defined by the algorithm. M is given by the sum over all nodes v
of the number of edges in v's associated clique. Thus we can write M as Σ_v Σ_{e ∈ C_v} 1,
which is the same as Σ_e Σ_{v: C_v ∋ e} 1. The contribution of an edge to this sum is the
number of vertices containing the edge in their associated cliques. We shall refer to
this quantity as the contribution of the edge. M is hence the sum of the contributions
of the edges in G*_α. We shall use this characterization along with Lemma 7.2 to relate
M to M*.

THEOREM 7.3. The number of multiplications required by our nested dissection
elimination ordering is O(d log⁶ n) times optimal, where d is the maximum degree of
the graph.

Proof. The contribution of an edge (u, v) is 1 for each vertex z such that C_z
contains the edge (u, v). Without loss of generality, let us assume that α(v) > α(u).
Since (u, v) ∈ G*_α, by Lemma 6.4, u must belong to the associated tree of v. Since u,
v and z belong to the clique C_z, C_z must contain the edges (z, v) and (z, u). Since
α(z) < α(v), the presence of the edge (z, v) in C_z implies that z must also belong
to the associated tree of v (see Figure 2). Similarly, the presence of the edge (z, u) in C_z
implies that z must belong to the subtree rooted at X_u. Thus the only vertices that
can contribute to the edge (u, v) are those which belong to the associated tree of v
and also belong to the subtree rooted at X_u. Note that the latter implies that the level
of such a vertex is at least as high as that of u.

Our approach in counting the total number of multiplications is the following.
We consider all the edges in G*_α that go between two given levels. For each edge
we count the number of vertices in a given third level which contain the edge in their
associated clique. We show that this count over all the edges between two levels is
at most O(d M* log³ n). Since there are O(log³ n) choices of the three levels under
consideration, the total number of multiplications is O(d M* log⁶ n), and we get the
theorem.

So let us consider three levels in the separator tree l_1, l_2, and l_3 such that l_3 ≥
l_2 ≥ l_1. Our aim is to count, for each edge (u, v) between a vertex v in level l_1 and a
vertex u in level l_2, the total number of vertices in the level l_3 that contain (u, v) in

their associated clique. Let this quantity be called M'. M' can be written as

(8)    M' = Σ_{v ∈ level l_1} Σ_{u ∈ level l_2, (u,v) ∈ G*_α} Σ_{z ∈ level l_3, C_z ∋ (u,v)} 1

We want to estimate M'. Let us denote by M_v the sum

Σ_{u ∈ level l_2, (u,v) ∈ G*_α} Σ_{z ∈ level l_3, C_z ∋ (u,v)} 1

for a vertex v. Let X_1, ..., X_q be the separators at level l_1, and let v_i denote the
vertex v in X_i for which M_v is maximum. Then we can rewrite (8) as

(9)     M' = Σ_{i=1}^{q} Σ_{v ∈ X_i} M_v
(10)       ≤ Σ_{i=1}^{q} Σ_{v ∈ X_i} M_{v_i}
(11)       = Σ_{i=1}^{q} |X_i| M_{v_i}

Let us now estimate the value of M_{v_i}. Let A_{v_i} denote the associated tree of v_i. Let
the separators in A_{v_i} at level l_2 be X_{i1}, ..., X_{iq_i}. Each of the edges of v_i to level l_2
must have a vertex in A_{v_i} as its endpoint. Consider all the edges between v_i and the
vertices of the separator X_{ij}. There are a maximum of |X_{ij}| such edges. By the above
discussion, any vertex that has any of these edges in its associated clique must belong
to the subtree of A_{v_i} rooted at X_{ij}. All such vertices at level l_3 must then belong to
one of the separators in the subtree of A_{v_i} rooted at X_{ij}. Let the separators in A_{v_i} at
level l_3 be X_{ij1}, ..., X_{ijq_{ij}}. Then the maximum number of vertices whose associated
cliques can contain an edge between v_i and a vertex in X_{ij} is given by Σ_{k=1}^{q_{ij}} |X_{ijk}|,
and there can be at most |X_{ij}| such edges. Summing over all the separators in A_{v_i} at
level l_2, we get

(12)    M_{v_i} ≤ Σ_{j=1}^{q_i} |X_{ij}| Σ_{k=1}^{q_{ij}} |X_{ijk}|

We can rewrite (11) after substituting (12) as

(13)    M' ≤ Σ_{i=1}^{q} |X_i| Σ_{j=1}^{q_i} |X_{ij}| Σ_{k=1}^{q_{ij}} |X_{ijk}|
By using the inequality Σ_i x_i y_i z_i ≤ Σ_i (x_i³ + y_i³ + z_i³), we can rewrite (13) as

(14)    M' ≤ Σ_{i=1}^{q} Σ_{j=1}^{q_i} Σ_{k=1}^{q_{ij}} |X_i|³ + Σ_{i=1}^{q} Σ_{j=1}^{q_i} Σ_{k=1}^{q_{ij}} |X_{ij}|³
             + Σ_{i=1}^{q} Σ_{j=1}^{q_i} Σ_{k=1}^{q_{ij}} |X_{ijk}|³

Since each vertex has degree at most d, it follows that the associated tree of each of
the vertices has at most d separators at any level. Hence we have Σ_{j=1}^{q_i} Σ_{k=1}^{q_{ij}} 1 ≤ d
for all i, and Σ_{k=1}^{q_{ij}} 1 ≤ d for all i and j. We can then rewrite (14) as

(15)    M' ≤ d · Σ_{i=1}^{q} |X_i|³ + d · Σ_{i=1}^{q} Σ_{j=1}^{q_i} |X_{ij}|³ + Σ_{i=1}^{q} Σ_{j=1}^{q_i} Σ_{k=1}^{q_{ij}} |X_{ijk}|³

Note that each of the terms on the right-hand side of (15) is a sum over the (disjoint)
separators at a single level, and hence we can apply Lemma 7.2. We get

(16)    M' ≤ d · O(M* log³ n) + d · O(M* log³ n) + O(M* log³ n)
(17)       = O(d M* log³ n)

As mentioned before, the total number of multiplications is the sum of M' over all
the possible choices of l_1, l_2, and l_3. There being O(log³ n) such possible choices, the
theorem follows. □
The theorem above shows that the performance guarantee of our nested dissection
algorithm is a polylog factor if the degree of the graph is small. As mentioned earlier,
low degree graphs account for many of the matrices arising in practice.

8. Performance guarantee: Elimination height. Since problems in numerical
analysis are a favorite for parallel machines, it is natural to consider how well one
can perform Gaussian elimination in parallel. The amount of parallel time required
by an elimination ordering can be characterized by its height. Multiple variables can
be eliminated simultaneously in parallel only if the variables do not have any dependencies
between them. In the graph representation, in eliminating a vertex v, we
update all the neighbors of v that are numbered higher than v. Hence two vertices
cannot be eliminated simultaneously if they have an edge between them. If we think
of each edge as being directed from the vertex with a lower number to the other, then
the height of an ordering α is the length of the longest directed path in the chordal graph G*_α.
Alternate characterizations of the height in terms of the elimination tree are given by
Liu [47].
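Since every edge of G*_α is directed from its lower-numbered endpoint to its higher-numbered one, the ordering itself is a topological order of this directed graph, and the height can be computed by a simple dynamic program. A Python sketch (ours, with hypothetical names):

    def elimination_height(filled_adj, alpha):
        # Length, in edges, of the longest directed path of G*_alpha, with
        # each edge directed from the lower-alpha endpoint to the higher.
        pos = {v: i for i, v in enumerate(alpha)}
        longest = {v: 0 for v in alpha}      # longest path ending at v
        for v in alpha:                      # topological order
            for u in filled_adj[v]:
                if pos[u] > pos[v]:          # edge v -> u
                    longest[u] = max(longest[u], longest[v] + 1)
        return max(longest.values(), default=0)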
An elimination ordering that minimizes height does not necessarily minimize other
important quantities like fill, or the multiplication count for the ordering. In fact, for
the example of a simple line graph, the minimum degree heuristic is optimal in terms
of fill, but has much worse height than a nested dissection ordering. Gilbert [21]
has conjectured that there is an ordering that minimizes height and simultaneously
approximately minimizes fill to within a constant factor. The conjecture remains
unresolved.

Finding an ordering that minimizes height itself is NP-hard [53]. Hence we have
to be content with finding an ordering that approximately minimizes height. It turns
out that our nested dissection elimination ordering also approximately minimizes
height, and thus we obtain an algorithm that simultaneously gives low fill, number
of multiplications, and height. Contrary to our performance bounds for the fill and
multiplication count, the guarantee for the height is independent of the degree of the
input graph, and is always a O(log² n) factor of the optimal. We prove this result in
this section.
Bodlaender et al. [2] have independently proposed an ordering scheme similar
to ours that achieves approximately minimum height. The problem of finding an
ordering with small height has been studied by many researchers in the past, and an
excellent survey can be found in the article by Heath, Ng, and Peyton (in [10]).

Since the height of an ordering is of concern when solving a system of linear
equations in parallel, it would be desirable to obtain the ordering itself in parallel.

However, we do not address that issue here. Our implementation of the algorithm
at present is sequential. We use the technique of Leighton and Rao [36] for finding
small balanced separators in a graph, and no efficient parallel implementations are
known for it. Some work has been done [32] on parallelizing the technique, but the
resulting method is still not competitive. We suspect that the algorithm of Leighton
and Rao cannot be parallelized efficiently. However, we hope that other techniques
for finding small graph separators will be developed, which will be more amenable to
parallel implementations. The issue of generating the elimination ordering itself in
parallel has been studied by other researchers [10]. However, none of the previously
proposed algorithms has yielded any performance guarantees.

8.1. A lower bound. From the discussion on the height of an elimination ordering,
it follows that the height of any elimination ordering for a clique of size m is
m. That gives us the following simple lemma.
LEMMA 8.1. For any chordal graph G*, if m is the size of its clique separator,
then the height of any elimination ordering must be Ω(m).
We can build on the above lemma to get the following result.
LEMMA 8.2. Let the largest separator in the separator tree obtained by our algorithm
for a graph G be X. Then any elimination ordering for G must have height
Ω(|X| / log n).
Proof. Let V_X be the set of vertices in the subtree of the separator X and G*
be the chordal graph with minimum height over all elimination orders. By Theorem
5.1, and the performance guarantee of our separator algorithm (see Lemma 4.1),
the graph induced by V_X in G* has a clique separator of size Ω(|X| / log n). This clique
size is a lower bound on the height of any elimination ordering by Lemma 8.1, and
hence the lemma follows. □

8.2. An upper bound. We shall now show that the height generated by our
nested dissection ordering is not too much more than the optimal height.
Consider the separator tree. Let X be the largest separator in the tree. Consider
all the separators at each level. One variable from each of the separators can be
eliminated simultaneously, as there are no direct edges between the variables of different
separators. Hence the number of parallel elimination steps for eliminating all
the variables at a level is no more than the size of the largest separator at the level.
This size is no more than |X| by assumption. Since the number of levels is O(log n),
the height of the ordering is at most O(|X| log n). By Lemma 8.2, the value of |X|
is at most O(log n) times the minimum height of any elimination ordering. It then
follows that the height of our ordering is at most O(log² n) times the minimum height
over all orderings. We have thus proved our claim of this performance guarantee in
Theorem 1.4.

9. Experimental results. In this section we back up the theoretically provable
performance of our ordering with some experimental data. We compared the quality
of our results to two publicly available codes. These two codes use two different well-
known heuristics. The first is the minimum-degree heuristic code by Joseph Liu [43].

The second code is the nested dissection heuristic that is implemented in SPARSPAK
[19].
The minimum-degree heuristic is by far the most commonly used and acknowledged
as the most effective heuristic known for finding good elimination orderings. It
has a rich history. It originated from the work of Markowitz in 1957, has undergone
many enhancements over the last fifteen years, and has been incorporated in many
publicly available codes like MA28, YALESMP, and SPARSPAK. Many statistics regarding
the performance of this heuristic and all the enhancements are also available
in the literature. George and Liu [18] present an excellent survey of the developments
and enhancements in the minimum-degree heuristic. They suggest that a minimum-
degree heuristic with certain enhancements [43] outperforms other variations of this
heuristic. We obtained the latest version of the code implementing this heuristic from
Joseph Liu in July 1991, and that is what we shall refer to as the minimum-degree
code for the purposes of the comparison. We also wanted to compare our nested dissection
ordering against an already existing one. The SPARSPAK nested dissection
was an ideal choice because of its popularity.
We compared the fill, the total number of multiplications, and the height of our
ordering with those obtained by the other two codes for a variety of matrices. These
matrices were obtained from the Harwell-Boeing test set of sparse matrices [7, 6].
They are symmetric positive definite matrices that are derived from real applications
in industry. They have also been extensively used as a test suite by many researchers
[8, 45, 40, 46, 47]. Many of the matrices that we used came from structural
engineering and finite-element analysis problems.

We implemented the algorithm for finding approximate balanced node separators
as described by Leighton and Rao [36]. Their algorithm consists of repeatedly applying
the algorithm for finding an approximately sparsest node separator in a graph.
The algorithm for finding the node separator consists of two phases. In the first
phase, a uniform concurrent flow problem is solved, and in the second phase, the solution
of the concurrent flow problem is rounded to produce a node separator in the
graph. The first phase requires the solution of a linear program, the time complexity
of which, though polynomial, was unacceptable for our purposes. We hence turned
to an approximation algorithm for solving the uniform concurrent flow problem [35],
which was implemented by Sarah Kang and Philip Klein [34].

We report our results below. We give the names of the matrices from the Harwell-
Boeing collection, and the actual values of the three quantities of interest for the
orderings. For the other two codes, we also compute the percentage difference in the
values of the three quantities as compared to the values for our ordering. The number
of non-zero elements in the original matrix is given in the table for reference.

Our fill is usually within ±11% of the minimum-degree ordering. The height
of our ordering is generally better than that of the minimum-degree ordering. The
latter, however, has better performance in terms of the number of multiplications.
Compared to the SPARSPAK nested dissection ordering, our ordering seems to fare
well in all three criteria.

Though our nested dissection algorithm seems to provide competitive results, its
practical use is limited due to the computationally intensive algorithm for finding the

TABLE 2
Comparison of fill: fill is the total number of elements in the matrix that were either non-zero or
became non-zero during the course of elimination

matrix       Order  # entries  Our ordering  Minimum Degree     SPARSPAK
(symmetric)                    #             #      % Change    #      % Change
CANN24       24     92         213           214    +0%         228    +7%
CANN61       61     309        752           669    -11%        781    +4%
CANN96       96     432        1895          1856   -2%         2166   +14%
CANN144      144    720        1683          1746   +4%         1812   +8%
CANN187      187    839        3776          3735   -1%         4067   +8%
CANN229      229    1033       5502          5883   +7%         7439   +35%
BCSSTK01     48     224        901           906    +0%         1072   +19%
BCSSTK04     132    1890       7121          6544   -8%         9414   +32%
BCSSTK05     153    1288       5022          4524   -10%        5415   +8%
BCSSTK06     420    4140       22249         20782  -7%         24116  +8%
BCSPWR02     49     108        212           215    +1%         265    +25%
BCSPWR05     443    1033       2720          2425   -11%        4557   +67%
DWT193       193    1843       8556          8155   -5%         9489   +11%
DWT209       209    976        4118          3812   -7%         6263   +52%
NOS4         100    347        1515          1206   -20%        1754   +16%

approximate separators. Our algorithm may run for hours while the minimum degree
heuristic algorithm or the SPARSPAK nested dissection algorithm might terminate
in minutes or even seconds.

10. Conclusions and open issues. Our study suggests some new directions
for further research and many open issues. We list them here.
• Improving the performance bounds for the ordering problems: The performance
guarantees for the fill and the operation counts for our nested dissection
ordering depend upon the maximum degree of the graph associated with
the coefficient matrix. It is a challenging problem to find a polynomial-time
ordering algorithm whose performance guarantees are independent of the degree
of the input graph. A simpler problem might be to obtain an ordering
algorithm whose performance guarantees are proportional to the average degree
of the input graph. Such a result will be interesting even for the cases
where the graph has excluded minors.
• Experiments with variants of our nested dissection algorithm: While our
nested dissection algorithm seems to perform well in practice, we have not
yet experimented with variants of our algorithm. We think that further
experience with this algorithm might suggest practical enhancements to the
elimination orderings produced by the algorithm. We point out again that
the minimum-degree code against which we compare our heuristic has been
tuned and adjusted over many years.
• Finding in parallel an elimination ordering of small height: Our nested dissection
ordering is a good ordering for solving sparse linear systems in parallel.
However, our algorithm for finding the elimination ordering itself is inherently
sequential at present. That is because no parallel approximation algorithms
are yet known for finding balanced separators in a graph. It is of interest to
find a parallel algorithm that produces an ordering that has provably small
height.

TABLE 3
Comparison of multiplication count

matrix       Order  # entries  Our ordering  Minimum Degree      SPARSPAK
(symmetric)                    #             #       % Change    #       % Change
CANN24       24     92         1076          895     -17%        1162    +8%
CANN61       61     309        5794          3757    -35%        6201    +7%
CANN96       96     432        22983         22360   -3%         30349   +32%
CANN144      144    720        13165         14435   +10%        14172   +8%
CANN187      187    839        49313         48754   -1%         58165   +18%
CANN229      229    1033       104839        119890  +14%        184882  +76%
BCSSTK01     48     224        8688          8893    +2%         11706   +35%
BCSSTK04     132    1890       201202        144314  -28%        316461  +57%
BCSSTK05     153    1288       95389         51549   -46%        110254  +16%
BCSSTK06     420    4140       815252        639199  -21%        993936  +22%
BCSPWR02     49     108        772           658     -15%        1147    +48%
BCSPWR05     443    1033       17299         11568   -33%        50220   +190%
DWT193       193    1843       229852        185812  -19%        291769  +27%
DWT209       209    976        60381         45457   -25%        126012  +109%
NOS4         100    347        15967         8585    -46%        19331   +21%

TABLE 4
Comparison of height

matrix       Order  # edges  Our ordering  Minimum Degree    SPARSPAK
(symmetric)                  #             #     % Change    #     % Change
CANN24       24     92       9             11    +22%        10    +11%
CANN61       61     309      14            24    +71%        18    +29%
CANN96       96     432      28            26    -7%         36    +29%
CANN144      144    720      16            18    +12%        20    +25%
CANN187      187    839      32            42    +31%        34    +6%
CANN229      229    1033     48            52    +8%         71    +48%
BCSSTK01     48     224      25            24    -4%         30    +20%
BCSSTK04     132    1890     61            86    +41%        72    +18%
BCSSTK05     153    1288     41            84    +105%       43    +5%
BCSSTK06     420    4140     97            138   +42%        92    -5%
BCSPWR02     49     108      5             13    +160%       9     +80%
BCSPWR05     443    1033     23            31    +35%        56    +143%
DWT193       193    1843     58            92    +59%        73    +26%
DWT209       209    976      32            54    +69%        54    +69%
NOS4         100    347      24            30    +25%        27    +12%

[Figure 3 shows nonzero-pattern plots of the original matrix (nz = 1777) and of the
filled matrix under each of the three orderings: our ordering (nz = 5502, nops =
104839, h = 48), minimum degree (nz = 5883, nops = 119890, h = 52), and
SPARSPAK (nz = 7439, nops = 184882, h = 71).]

FIG. 3. The quality of the elimination orderings produced by the three codes are compared. The
original matrix is from the Harwell-Boeing test-suite and is called CANN229. The fill, the number of
multiplications, and the height for the orderings are specified.

[Figure 4 shows nonzero-pattern plots of the original matrix (nz = 1743) and of the
filled matrix under each of the three orderings: our ordering (nz = 4118, nops =
60381, h = 32), minimum degree (nz = 3812, nops = 45457, h = 54), and
SPARSPAK (nz = 6263, nops = 126012, h = 54).]

FIG. 4. The quality of the elimination orderings produced by the three codes are compared. The
original matrix is from the Harwell-Boeing test-suite and is called DWT209. The fill, the number of
multiplications, and the height for the orderings are specified.

Finding a parallel algorithm for approximating minimum balanced node separators
in a graph is independently of much interest.
• Running time for finding good balanced separators: The running time of
our nested dissection algorithm for finding an elimination ordering directly
depends upon the running time of the balanced separator algorithm. For
the algorithm to gain acceptance, we must have a faster approximate separator
algorithm. Separators have numerous other applications also, and hence
having fast separator algorithms is of much independent interest.

11. Acknowledgments. We gratefully acknowledge the contributions of Sarah
Kang, John Gilbert, and R. Ravi to this work.

REFERENCES

[1] A. Agrawal, "Network Design and Network Cut Dualities: Approximation Algorithms and
Applications," Ph.D. thesis, Technical Report CS-91-60, Brown University (1991).
[2] H. L. Bodlaender, J. R. Gilbert, H. Hafsteinsson and T. Kloks, "Approximating treewidth,
pathwidth, and minimum elimination tree height," Technical Report CSL-90-01, Xerox
Corporation, Palo Alto Research Center (1990).
[3] E. Cuthill and J. McKee, "Reducing the bandwidth of sparse symmetric matrices," Proceedings
of the 24th National Conference of the ACM (1969), pp. 157-172.
[4] I. S. Duff, A. M. Erisman, and J. K. Reid, "On George's nested dissection method," SIAM
Journal on Numerical Analysis, vol. 13 (1976), pp. 686-695.
[5] I. Duff, N. Gould, M. Lescrenier, and J. K. Reid, "The multifrontal method in a parallel
environment," in Advances in Numerical Computation, M. Cox and S. Hammarling, eds.,
Oxford University Press (1990).
[6] I. Duff, R. Grimes, and J. G. Lewis, "Users' guide for the Harwell-Boeing sparse matrix
collection," Manuscript (1988).
[7] I. Duff, R. Grimes, and J. G. Lewis, "Sparse matrix test problems," ACM Transactions on
Mathematical Software, vol. 15 (1989), pp. 1-14.
[8] I. Duff and J. K. Reid, "The multifrontal solution of indefinite sparse symmetric linear equations,"
ACM Transactions on Mathematical Software, vol. 9 (1983), pp. 302-325.
[9] I. Duff and J. K. Reid, Direct Methods for Sparse Matrices, Oxford University Press (1986).
[10] K. A. Gallivan et al., Parallel Algorithms for Matrix Computations, SIAM (1990).
[11] M. R. Garey and D. S. Johnson, Computers and Intractability: A Guide to the Theory of NP-
Completeness, W. H. Freeman, San Francisco (1979).
[12] J. A. George, "Computer implementation of a finite element method," Tech. Report STAN-
CS-208, Stanford University (1971).
[13] J. A. George, "Block elimination of finite element systems of equations," in Sparse Matrices
and Their Applications, D. J. Rose and R. A. Willoughby, eds., Plenum Press (1972).
[14] J. A. George, "Nested dissection of a regular finite element mesh," SIAM Journal on Numerical
Analysis 10 (1973), pp. 345-367.
[15] J. A. George, "An automatic one-way dissection algorithm for irregular finite-element problems,"
SIAM Journal on Numerical Analysis, vol. 17 (1980), pp. 740-751.
[16] J. A. George and J. W. Liu, "An automatic nested dissection algorithm for irregular finite-
element problems," SIAM Journal on Numerical Analysis, vol. 15 (1978), pp. 1053-1069.
[17] J. A. George and J. W. Liu, Computer Solution of Large Sparse Positive Definite Systems,
Prentice-Hall Inc. (1981).
[18] J. A. George and J. W. Liu, "The evolution of the minimum degree ordering algorithm,"
SIAM Review, vol. 31 (1989), pp. 1-19.
[19] J. A. George, J. W. Liu, and E. G. Ng, "User's guide for SPARSPAK: Waterloo sparse linear
equations package," Tech. Rep. CS78-30 (revised), Dept. of Computer Science, Univ. of
Waterloo, Waterloo, Ontario, Canada (1980).
[20] N. E. Gibbs, W. G. Poole Jr., and P. K. Stockmeyer, "An algorithm for reducing the bandwidth
and profile of a sparse matrix," SIAM Journal on Numerical Analysis, vol. 13 (1976), pp.
236-250.

[21] J. R. Gilbert, "Some Nested Dissection Order is Nearly Optimal," Information Processing
Letters 26 (1987/88), pp. 325-328.
[22] J. R. Gilbert, personal communication (1989).
[23] J. R. Gilbert and H. Hafsteinsson, "Approximating treewidth, minimum front size, and minimum
elimination tree height," manuscript (1989).
[24] J. R. Gilbert, D. J. Rose and A. Edenbrandt, "A separator theorem for chordal graphs," SIAM
J. Alg. Disc. Meth. 5 (1984), pp. 306-313.
[25] J. R. Gilbert and R. Schreiber, "Highly parallel sparse Cholesky factorization," Tech. Report
CSL-90-7, Xerox Palo Alto Research Center (1990).
[26] J. R. Gilbert and R. E. Tarjan, "The analysis of a nested dissection algorithm," Numerische
Mathematik, vol. 50 (1987), pp. 377-404.
[27] J. R. Gilbert and E. Zmijewski, "A parallel graph partitioning algorithm for a message-passing
multiprocessor," International Journal of Parallel Programming, vol. 16 (1987), pp. 427-
449.
[28] M. C. Golumbic, Algorithmic Graph Theory and Perfect Graphs, Academic Press, New York
(1980).
[29] A. J. Hoffman, M. S. Martin, and D. J. Rose, "Complexity bounds for regular finite difference
and finite element grids," SIAM Journal on Numerical Analysis, vol. 10 (1973), pp. 364-
369.
[30] J. Jess and H. Kees, "A data structure for parallel L/U decomposition," IEEE Transactions
on Computers, vol. 31 (1982), pp. 231-239.
[31] U. Kjærulff, "Triangulation of graphs - algorithms giving small total state space," R 90-
09, Institute for Electronic Systems, Department of Mathematics and Computer Science,
University of Aalborg (1990).
[32] P. N. Klein, "A parallel randomized approximation scheme for shortest paths," Technical
Report CS-91-56, Brown University (1991).
[33] P. N. Klein, A. Agrawal, R. Ravi and S. Rao, "Approximation through multicommodity flow,"
Proceedings of the 31st Annual IEEE Conference on Foundations of Computer Science
(1990), pp. 726-737.
[34] P. N. Klein and S. Kang, "Approximating concurrent flow with uniform demands and capacities:
an implementation," Technical Report CS-91-58, Brown University (1991).
[35] P. Klein, C. Stein and E. Tardos, "Leighton-Rao might be practical: faster approximation
algorithms for concurrent flow with uniform capacities," Proceedings of the 22nd ACM
Symposium on Theory of Computing (1990), pp. 310-321.
[36] F. T. Leighton and S. Rao, "An approximate max-flow min-cut theorem for uniform multicommodity
flow problems with application to approximation algorithms," Proceedings of the
29th Annual IEEE Conference on Foundations of Computer Science (1988), pp. 422-431.
[37] F. T. Leighton, F. Makedon and S. Tragoudas, personal communication (1990).
[38] C. Leiserson and J. Lewis, "Orderings for parallel sparse symmetric factorization," in Parallel
Processing for Scientific Computing, G. Rodrigue, ed., Philadelphia, PA, 1987, SIAM, pp.
27-32.
[39] M. Leuze, "Independent set orderings for parallel matrix factorization by Gaussian elimination,"
Parallel Computing, vol. 10 (1989), pp. 177-191.
[40] J. Lewis, B. Peyton, and A. Pothen, "A fast algorithm for reordering sparse matrices for parallel
factorization," SIAM Journal on Scientific and Statistical Computing, vol. 10 (1989), pp.
1156-1173.
[41] R. J. Lipton, D. J. Rose and R. E. Tarjan, "Generalized nested dissection," SIAM Journal on
Numerical Analysis 16 (1979), pp. 346-358.
[42] R. J. Lipton and R. E. Tarjan, "Applications of a planar separator theorem," SIAM Journal on
Computing 9 (1980), pp. 615-627.
[43] J. W. Liu, "Modification of the minimum degree algorithm by multiple elimination," ACM
Transactions on Mathematical Software, vol. 11 (1985), pp. 141-153.
[44] J. W. Liu, "Reordering sparse matrices for parallel elimination," Parallel Computing, vol. 11
(1989), pp. 73-91.
[45] J. W. Liu, "The minimum degree ordering with constraints," SIAM Journal on Scientific and
Statistical Computing, vol. 10 (1989), pp. 1136-1145.
[46] J. W. Liu, "A graph partitioning algorithm by node separators," ACM Transactions on Mathematical
Software, vol. 15 (1989), pp. 198-219.
[47] J. W. Liu, "The role of elimination trees in sparse factorization," SIAM Journal on Matrix
Analysis and Applications, vol. 11 (1990), pp. 134-172.

[48] J. W. Liu and A. Mirzaian, "A linear reordering algorithm for parallel pivoting of chordal
graphs," SIAM Journal on Discrete Mathematics, vol. 2 (1989), pp. 100-107.
[49] J. W. Liu and A. H. Sherman, "Comparative analysis of the Cuthill-McKee and the reverse
Cuthill-McKee ordering algorithms for sparse matrices," SIAM Journal on Numerical Analysis,
vol. 13 (1976), pp. 198-213.
[50] F. Makedon and S. Tragoudas, "Approximating the minimum net expansion: near optimal
solutions to circuit partitioning problems," Manuscript (1991).
[51] S. Parter, "The use of linear graphs in Gaussian elimination," SIAM Review, vol. 3 (1961), pp.
364-369.
[52] F. Peters, "Parallel pivoting algorithms for sparse symmetric matrices," Parallel Computing,
vol. 1 (1984), pp. 99-110.
[53] A. Pothen, "The complexity of optimal elimination trees," Tech. Report CS-88-13, Department
of Computer Science, The Pennsylvania State University, University Park, PA (1988).
[54] D. J. Rose, "Triangulated graphs and the elimination process," Journal of Math. Anal. Appl.
32 (1970), pp. 597-609.
[55] D. J. Rose, "A graph-theoretic study of the numerical solution of sparse positive definite
systems of linear equations," in Graph Theory and Computing, R. C. Read, ed., Academic
Press (1972), pp. 183-217.
[56] D. J. Rose, R. E. Tarjan and G. S. Lueker, "Algorithmic aspects of vertex elimination on
graphs," SIAM J. Comp. 5 (1976), pp. 266-283.
[57] R. Schreiber, "A new implementation of sparse Gaussian elimination," ACM Trans. on Mathematical
Software 8:3 (1982), pp. 256-276.
[58] M. Yannakakis, "Computing the minimum fill-in is NP-complete," SIAM J. Algebraic and
Discrete Methods 2 (1981), pp. 77-79.
AUTOMATIC MESH PARTITIONING

GARY L. MILLER*, SHANG-HUA TENG†, WILLIAM THURSTON‡, AND STEPHEN A. VAVASIS§

Abstract. This paper describes an efficient approach to partitioning unstructured meshes that
occur naturally in the finite element and finite difference methods. This approach makes use of the
underlying geometric structure of a given mesh and finds a provably good partition in random O(n)
time. It applies to meshes in both two and three dimensions. The new method has applications in
efficient sequential and parallel algorithms for large-scale problems in scientific computing. This is
an overview paper written with emphasis on the algorithmic aspects of the approach. Many detailed
proofs can be found in companion papers.
Keywords: Center points, domain decomposition, finite element and finite difference meshes, geometric
sampling, mesh partitioning, nested dissection, Radon points, overlap graphs, separators,
stereographic projections.

1. Introduction. Many large-scale problems in scientific computing are based
on unstructured meshes in two or three dimensions. Examples of such meshes are
the underlying graphs of finite volume methods in computational fluid dynamics or
graphs of the finite element and finite difference methods in structural analysis. These
meshes may have millions of nodes. Quite often the mesh sizes used are determined
by the memory available on the machine rather than the physics of the problem to
be solved. Thus, the larger the memory the larger the mesh used and, hopefully, the
better the simulation produced.
The main goal of this paper is to describe our work on how and under what
conditions unstructured meshes will have partitions into two roughly equal sized pieces
with a small boundary (called small separators, to be defined later). When these
partitions exist they have several important applications to the finite element and
finite difference methods. We list some of them here.
One approach to achieving the large memory and computation power requirements
for large-scale computational problems is to use massively parallel distributed-memory
machines. In such an approach, the underlying computational mesh is divided into
submeshes, inducing a subproblem to be stored on each processor in the parallel
system and boundary information to be communicated [67]. To fully utilize a massively
parallel machine, we need a subdivision in which subproblems have approximately
equal size and the amount of communication between subproblems is relatively small.
This approach will decrease the time spent per iteration. There are also methods
* School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213. Supported
in part by National Science Foundation grant CCR-9016641.
† Xerox Corporation, Palo Alto Research Center, Palo Alto, CA 94304. Part of the work
was done while the author was at Carnegie Mellon University. Current address: Department of
Mathematics, Massachusetts Institute of Technology, Cambridge, MA 02139.
‡ Department of Mathematics, University of California, Berkeley, CA 94720.
§ Department of Computer Science, Cornell University, Ithaca, NY 14853. Supported by an
NSF Presidential Young Investigator award. Revision work on this paper was supported by the
Applied Mathematical Sciences program of the U.S. Department of Energy under contract DE-
AC04-76DP00789 while the author was visiting Sandia National Laboratories.

which use good partitioning to either decrease the number of iterations used or the
time used by direct methods.
Several numerical techniques have been developed using the partitioning method
to solve problems on a parallel system. Examples include domain decomposition and
nested dissection. Domain decomposition divides the nodes among processors of a
parallel computer. An iterative method is formulated that allows each processor to
operate independently. See Bramble, Pasciak and Schatz [11], Chan and Resasco
[13], and Bjørstad and Widlund [9]. Nested dissection is a divide-and-conquer node
ordering for sparse Gaussian elimination, proposed by George [34] and generalized
by George and Liu [36] and Lipton, Rose and Tarjan [49]. Nested dissection was
originally a sequential algorithm, pivoting on a single element at a time, but it is
an attractive parallel ordering as well because it produces blocks of pivots that can
be eliminated independently in parallel. Parallel nested dissection was suggested by
Birkhoff and George [8] and has been implemented in several settings [12, 21, 35, 84];
its complexity was analyzed by Liu [52] (for the regular square grid) and Pan and
Reif [63] (in the general case).
Vaidya has produced results which indicate that the quality of good preconditioners
may also be linked to the existence of good partitions [78].
Therefore, one of the key problems in solving large-scale computational problems
on a parallel machine is the question of how to partition the underlying meshes in
order to reduce the total communication cost and to achieve load balance.
If a mesh has a sufficiently regular structure, then it is easy to decide in advanee
how to distribute it among the processors of a paralleI machine. However, meshes
of many ä.}:;:!:!:;>tions are irregular and unstructured, making the partition problem
much more difficult. In general, there are meshes in three dimensions which have no
small partition [59]. These examples are not the type that would naturally arise in the
finite element methods, but they are meshes. One important goal is to understand
which meshes do and which do not have small partitions.
Various heuristics have been developed and implemented [65, 68, 82]. However, none of the prior mesh partitioning algorithms is both efficient in practice and provably good, especially for meshes from three-dimensional problems. Leighton and Rao [46] have designed a partitioning algorithm, based on multicommodity flow problems, which finds a separator that is optimal to within logarithmic factors. But their algorithm runs in superlinear time, and it remains to be seen whether it can be used in practice for large-scale problems.

1.1. A new method. In a series of papers, the authors (Vavasis [81]; Miller and Thurston [59]; Miller and Vavasis [60]; Miller and Teng [55]; Miller, Teng, and Vavasis [56]) have developed an efficient and provably good mesh partitioning method. This overview paper describes this new approach. It is written with emphasis on the algorithmic aspects of the approach. Many detailed proofs can be found in companion papers [57, 58].
This method applies to meshes in both two and three dimensions. It is based on the following important observation: graphs from large-scale problems in scientific computing are often defined geometrically. They are meshes of elements in a fixed dimension (typically two or three), that are well shaped in some sense, such as having elements of bounded aspect ratio or having elements with angles that are not too small. In other words, they are graphs embedded in two or three dimensions that come with natural geometric coordinates and with structure.
Our approach makes use of the underlying geometric structure of a given mesh and finds a provably good partition efficiently. The main ingredient of this approach is a novel geometric characterization of graphs embedded in a fixed dimension that have a small separator, which is a relatively small subset of vertices whose removal divides the rest of the graph into two pieces of approximately equal size. By taking advantage of the underlying geometric structure, we also develop an efficient algorithm for finding such a small separator.
In contrast, all previous separator results (see Section 1.2) are combinatorial in nature. They not only characterize the small separator property combinatorially, but also find a small separator based only on the combinatorial structure of the given graph. When applied to unstructured geometric meshes, they simply discard the valuable geometric information. The result is that they are either too costly to use or do not find as good a separator as they should. Worst of all, none of the earlier separator results is useful for graphs in three dimensions.

1.2. Separators and earlier work. DEFINITION 1.1 (SEPARATORS). A subset of vertices C of a graph G with n vertices is an f(n)-separator that δ-splits if |C| ≤ f(n) and the vertices of G − C can be partitioned into two sets A and B such that there are no edges from A to B and |A|, |B| ≤ δn, where f is a function and 0 < δ < 1.
Two of the most well-known families of graphs that have small separators are trees and planar graphs. Every tree has a single-vertex separator that 2/3-splits [44]. Lipton and Tarjan [50] proved that every planar graph has a √(8n)-separator that 2/3-splits. Their result improved an earlier one by Ungar [77]. Some extensions of their work have been made [19, 31, 32, 54], and separator theorems have also been obtained for graphs with bounded genus [38, 43] and graphs with a bounded excluded minor [2]. In particular, Gilbert, Hutchinson, and Tarjan showed that all graphs with genus bounded by g have an O(√(gn))-separator, and Alon, Seymour, and Thomas proved that all graphs with an excluded minor isomorphic to the h-clique have an O(h^{3/2}√n)-separator.

Interestingly, all the characterizations above are combinatorial, not geometric, as are their proofs!

Separator results for families of graphs closed under the subgraph operation immediately lead to divide-and-conquer recursive algorithms for many applications. In general, the efficiency of such algorithms depends on δ being bounded away from 1 and f(n) being a slowly growing function.

Perhaps the most classical application of small separator results is nested dissection, a widely used technique for solving a large class of sparse linear systems. This approach was pioneered by George [34], who designed the first O(n^{1.5})-time nested dissection algorithm for linear systems on regular grids, using the fact that the √n × √n grid has a √n-separator. His result was extended to planar linear systems by Lipton, Rose, and Tarjan [49]. Gilbert and Tarjan [40] examined several variants of the nested dissection algorithms. It has been demonstrated, in theory and in practice, that nested dissection can be implemented efficiently in parallel.
In the analysis of sparse matrix algorithms, a priori upper bounds on operation counts are rare in the literature (aside from the trivial dense-matrix upper bounds). The major exception is nested dissection. The a priori bounds attained by nested dissection, which in many cases are asymptotically the best possible, always depend on the associated bounds of the underlying graph-separator algorithm. This means that a careful analysis of separator sizes is an important aspect of nested dissection.
Small separator results have found fruitful applications in VLSI design (Leiserson [47]; Leighton [45]; Valiant [79]) and in efficient message routing (Frederickson and Janardan [28]). They have also been used in proving several complexity-theoretic results (Paterson [62]; Lipton and Tarjan [51]), and in designing efficient graph algorithms such as parallel construction of breadth-first-search trees (Pan and Reif [63]), testing graph isomorphism (Gazit [33]), and approximating NP-complete problems (Lipton and Tarjan [51]).

1.3. Outline of the paper. Section 2 defines a new class of geometric graphs, the overlap graphs, and describes our main separator theorem. This class has a simple definition and contains many important classes of graphs as special cases. Section 3 studies meshes from the finite element and finite difference methods. We show that overlap graphs include "well shaped" meshes. We also show that planar graphs are a special case of overlap graphs in two dimensions. Section 4 presents a partitioning algorithm for overlap graphs. The algorithm first uses the geometric information of the input graph to find a "continuous" separator, then uses the combinatorial structure to compute a "discrete" separator from its continuous counterpart. The central step of the algorithm is to find a center point of a point set in a fixed dimension, where a center point is a point such that every hyperplane passing through it divides the point set approximately evenly. We show that center points always exist and can be computed in polynomial time using linear programming. We also show that the step of computing a "discrete" separator from a continuous one can be performed in linear time. Section 5 introduces geometric sampling, a technique that reduces the problem size and simultaneously guarantees a provably good approximation of the larger problem. Using geometric sampling, we can compute an approximate center point in random constant time and find a "good" separator of an overlap graph in random linear time. We further give a practical heuristic for approximating a center point. We then extend the partitioning algorithm to unstructured meshes. Section 6 gives the proof outline of the main separator theorem. We demonstrate how to use geometric arguments to prove separator properties for graphs embedded in a fixed dimension. Section 7 summarizes the paper and gives open questions.

2. Neighborhood systems and overlap graphs. Our geometric character-


ization of graphs that have small separators is based on the following elementary
concept.

2.1. Neighborhood systems. DEFINITION 2.1. Let P = {p₁, …, p_n} be points in ℝ^d. A k-ply neighborhood system for P is a set {B₁, …, B_n} of closed balls such that (1) B_i is centered at p_i, and (2) no point p ∈ ℝ^d is strictly interior to more than k of the balls.
A 3-ply neighborhood system in two dimensions is illustrated in Figure 1.

FIG. 1. A 3-ply neighborhood system

The following notation will be used throughout this paper. For each positive real α, if B is a ball of radius r in ℝ^d, then α·B denotes the ball with the same center as B but radius αr.
We now state an important property of neighborhood systems [58].
LEMMA 2.2 (BALL INTERSECTION). Suppose {B₁, …, B_n} is a k-ply neighborhood system in ℝ^d. For each d-dimensional ball B with radius r and for each constant β with 0 < β ≤ 1,

$$|\{i : B_i \cap B \neq \emptyset \text{ and } r_i \geq \beta r\}| \leq (1 + 2/\beta)^d\, k,$$

where r_i is the radius of B_i.

2.2. Overlap graphs. DEFINITION 2.3. Let α ≥ 1 and let {B₁, …, B_n} be a k-ply neighborhood system for P = {p₁, …, p_n}. The (α, k)-overlap graph for the k-ply neighborhood system {B₁, …, B_n} is the undirected graph with vertices V = {1, …, n} and edges

$$E = \{(i,j) : B_i \cap (\alpha \cdot B_j) \neq \emptyset \text{ and } (\alpha \cdot B_i) \cap B_j \neq \emptyset\}.$$

For simplicity, we call a (1, k)-overlap graph a k-intersection graph. In the case that α = 1 and k = 1, and no two balls in the neighborhood system have a common point in their interiors, we have the family of graphs known as sphere packings; this interesting class of graphs will be discussed in the next section.

2.3. Main separator theorem. THEOREM 2.4 (MAIN). Let G be an (α, k)-overlap graph for some fixed dimension d. Then G has an

$$O\left(\alpha \cdot k^{1/d} \cdot n^{(d-1)/d} + q(\alpha, k, d)\right)\text{-separator}$$

that (d+1)/(d+2)-splits. Furthermore, such a separator that (d+1+ε)/(d+2)-splits can be computed in random linear time sequentially, and in random constant time using n processors, for any 1/n^{1/2d} < ε < 1.
FIG. 2. Airplane wing (Barth and Jespersen)

FIG. 3. US map

The function q(α, k, d) depends exponentially on d but is independent of n. Since the interesting cases are d = 2 and d = 3 with n large, this term should be considered low-order.
It has been shown in the companion paper [58] that the bound of Theorem 2.4 is tight up to a constant factor. Section 6 will outline the proof of this theorem.

3. Finite element and finite difference meshes. One important aspect that distinguishes a finite element or finite difference mesh from a general graph is that it has two structures: the combinatorial structure and the geometric structure. In general, it can be represented by a pair (G, xyz), where G describes the combinatorial structure of the mesh and xyz gives the geometric information.
3.1. Meshes from the finite element method. The finite element method is a collection of numerical techniques for approximating a continuous problem by a finite structure [69]. To approximate a continuous function, the finite element method subdivides the domain (a subset of ℝ^d) into a mesh of polyhedral elements (Figures 2 and 3), and then approximates the continuous function by a piecewise polynomial on the elements.
A common choice for an element in the finite element method is a d-dimensional simplex, which is the convex hull of d+1 affinely independent points in ℝ^d, e.g., a triangle in two dimensions or a tetrahedron in three dimensions. A d-dimensional simplicial complex is defined to be a collection of d-dimensional simplices that meet only at shared faces [6, 7, 59]. So a 2-dimensional simplicial complex is a collection of triangles that intersect only at shared edges and vertices.
For most applications, a mesh is given as a list of its elements, where each element is described by the hierarchical structure of its lower-dimensional pieces: its faces, edges, and vertices. Moreover, each vertex has geometric coordinates in two or three dimensions.
Associated with each simplicial complex is a natural graph, its 1-skeleton. For example, the 1-skeleton of a 2-dimensional simplicial complex is a planar graph. Conversely, every planar graph can be embedded in the plane such that each edge is mapped to a straight line segment (Fáry [25]; Tutte [74, 75]; Thomassen [71]; Fraysseix, Pach, and Pollack [27]).
In the finite element method, a linear system is defined over a mesh, with variables representing physical quantities at the nodes. Let finite element graph refer to the nonzero structure of the coefficient matrix of such a linear system. In the case of linear finite elements based on a triangulation, such as in Figures 2 and 3, the nodes of the finite element graph are exactly the nodes of the mesh, and hence the finite element graph is the same as the 1-skeleton of the simplicial complex. In the case of higher-order elements, the finite element graph usually contains the 1-skeleton as a proper subset. It can be obtained from the finite element mesh as follows: identify certain points (vertices, points on edges, points in faces, and points in elements) as "nodes," and add edges between every pair of nodes that share an element.
To properly approximate a continuous function, in addition to the conditions that a mesh must conform to the boundaries of the region and be fine enough, each individual element of the mesh must be well shaped. A common shape criterion for elements is the condition that the angles of each element are not too small, or that the aspect ratio of each element is bounded [6, 29].
Several definitions of the aspect ratio have been used in the literature. We list some of them.

1. The ratio of the longest dimension to the shortest dimension of the simplex S, denoted by A₁(S). For a triangle in ℝ², this is the length of the longest side divided by the altitude from the longest side.
2. The ratio of the radius of the smallest containing sphere to the radius of the inscribed sphere of S, denoted by A₂(S).
3. The ratio of the radius of the circumscribing sphere to the radius of the inscribed sphere of S, denoted by A₃(S).
4. The ratio of the diameter to the dth root of the volume of the simplex S, denoted by A₄(S), where the diameter of a d-simplex S is the maximum distance between any pair of points in S.

FIG. 4. Meshes derived from quadtrees (S. Mitchell)
Examples of simplicial complexes with bounded aspect ratio are illustrated in Figure 4 as well as in Figures 2 and 3. The above definitions of the aspect ratio are polynomially related to each other. They are also closely related to the smallest angle of the simplex, which is the smallest angle among all the angles between pairs of supporting hyperplanes of the simplex. Using elementary geometric arguments, one can prove the following set of inequalities: there are constants c₁ < c₂, c₃, c₄ < c₅, depending only on d, such that if the smallest angle of S is θ, then

$$\frac{1}{|\sin\theta|} \le A_1(S) \le \frac{2}{|\sin\theta|},$$
$$c_1 A_1(S) \le A_2(S) \le c_2 A_1(S),$$
$$A_1(S) \le A_3(S) \le c_3 (A_1(S))^2,$$
$$c_4 A_1(S) \le (A_4(S))^d \le c_5 (A_1(S))^{d-1}.$$

Therefore, if one of the above parameters is bounded by a constant, then all of them
are bounded.

3.2. Graphs from the finite difference method. The finite difference method is another useful technique for solving computational problems in scientific computing. It also uses a finite and discrete structure, a finite difference mesh, to approximate a continuous problem.
Finite difference meshes are often produced by inserting a uniform grid of ℝ² or ℝ³ into the domain via a boundary-matching conformal mapping. In general, the derivative of the conformal transformation must be slowly varying with respect to the mesh size in order to produce good results; see, for example, [72]. This means that the mesh will probably satisfy a density condition [5, 60].

Let G be an undirected graph and let π be an embedding of its nodes in ℝ^d. We say π is an embedding of density α if the following inequality holds for all vertices v in G: letting u be the node closest to v and w the node farthest from v that is connected to v by an edge,

$$\frac{\|\pi(w) - \pi(v)\|}{\|\pi(u) - \pi(v)\|} \le \alpha.$$

In general, G is an α-density graph in ℝ^d if there exists an embedding of G in ℝ^d with density α. It can easily be shown that there is a Δ(α, d) depending only on α and d such that the maximum degree of an α-density graph is bounded by Δ(α, d).

FIG. 5. Berger and Bokhari's example of a density graph.
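As an aside, the density of a given embedding can be measured directly from this definition; the following is a minimal sketch (the helper name `embedding_density` is ours, an O(n²) brute-force pass assuming no isolated vertices), not code from the paper:

```python
import numpy as np

def embedding_density(coords, adj):
    """Density alpha of an embedded graph: the maximum over vertices v of
    (distance to v's farthest neighbor) / (distance to the node closest
    to v).  coords is an (n, d) float array with coords[v] = pi(v); adj[v]
    lists the vertices joined to v by an edge (assumed nonempty)."""
    n = len(coords)
    alpha = 0.0
    for v in range(n):
        dists = np.linalg.norm(coords - coords[v], axis=1)
        dists[v] = np.inf                  # exclude v itself
        closest = dists.min()              # ||pi(u) - pi(v)|| for the nearest node u
        farthest = dists[adj[v]].max()     # farthest vertex adjacent to v
        alpha = max(alpha, farthest / closest)
    return alpha
```

This quantity is what Step 4 of Algorithm 5 (Section 5.4) refers to as "the density α of the embedding."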
Furthermore, a finite difference mesh need not be a collection of simplices or elements as a finite element mesh is, so we cannot analyze it as a triangulation. Finite difference meshes are often locally refined by further subdividing some mesh cells; see Figure 5. This means that nodes occur on the sides of some mesh cells and that interpolation must be used in the finite difference approximation. For numerical accuracy of the interpolation, the usual practice is that mesh cells are refined to a level no more than a constant factor smaller than their neighboring cells [5]. In the presence of such refinement, the finite difference mesh still satisfies a constant density condition.

3.3. Overlap graphs and well-shaped meshes. One of the most valuable aspects of the class of overlap graphs is that it enables us to give a unified geometric characterization of graphs with the small separator property. The set of overlap graphs in ℝ^d contains all finite subgraphs of infinite grids, planar graphs, and sphere packing graphs. Moreover, overlap graphs include the graphs associated with finite element and finite difference methods as special cases. The parameter α, in a strong sense, measures the degree to which the mesh is well shaped.
We now show that for each well-shaped mesh, there is an overlap graph, for some pair α and k, that contains the graph defined by (G, xyz) as a subgraph. We say a graph G₁ is a spanning subgraph of another graph G₂ if G₁ can be obtained from G₂ by deleting edges. A graph G is (α, k)-embeddable in ℝ^d if it is a spanning subgraph of an (α, k)-overlap graph in ℝ^d. Notice that the small separator property is preserved under taking spanning subgraphs.
LEMMA 3.1. If G is an α-density graph in ℝ^d, then G is (2α, 1)-embeddable.
Proof: Let π be an embedding of G with density α in ℝ^d. Without loss of generality, assume that G has vertex set V = {1, 2, …, n}. Let P = {π(1), π(2), …, π(n)}. For each p ∈ P, let c(p) denote the point of P − {p} closest to p. Let Γ = {B₁, …, B_n}, where for each i, 1 ≤ i ≤ n, B_i is the ball centered at π(i) whose radius r_i is 0.5‖c(π(i)) − π(i)‖. Clearly, balls from Γ do not intersect each other's interiors. We claim that G is a subgraph of the (2α, 1)-overlap graph G′ of Γ.
For each edge (u, v) of G, we need to show that (u, v) is an edge of G′. Without loss of generality, assume r_u ≤ r_v. Because π is an embedding of density α, and the distance from π(u) to its closest point of P is 2r_u, we have

$$\|\pi(u) - \pi(v)\| \le 2\alpha r_u.$$

So π(v) ∈ (2α)·B_u and, since r_u ≤ r_v, also π(u) ∈ (2α)·B_v; therefore (u, v) is an edge of G′, completing the proof. □

LEMMA 3.2. Suppose G is the 1-skeleton of a simplicial complex K in ℝ^d. Let G′ be the subgraph of G obtained by removing all vertices that appear on the external boundary of K. If the aspect ratio of K is bounded by a constant α, then there is a constant c depending only on d and α such that G′ is (c, 1)-embeddable.
Proof: Because the aspect ratio of the complex K is bounded by α, there is a θ depending only on α and d such that the smallest angle of each simplex of K is at least θ. Therefore, there is a constant Δ depending only on d and α such that the degree of G is bounded by Δ. Furthermore, the bounded aspect ratio implies that the ratio of the longest edge to the shortest edge of any simplex in K is also bounded. Thus, there is a constant c₁ depending only on d and α such that, for each vertex v of K, the ratio of the longest edge to the shortest edge incident to v is bounded by c₁. Moreover, for each vertex v of K that is not on the external boundary, the vertex of K closest to v is connected to v by an edge, because no vertex can appear in the interior of a simplex. Therefore, if we remove all vertices of K on the external boundary, we obtain a c₁-density graph G′. Let c = 2c₁. It follows from Lemma 3.1 that G′ is (c, 1)-embeddable. □

Therefore, the following theorem follows from Theorem 2.4.

THEOREM 3.3.
• If G is the 1-skeleton of a simplicial complex K with bounded aspect ratio, letting n̄ be the number of vertices on the external boundary of K, then G has an

$$O\left(n^{(d-1)/d} + \bar{n}\right)\text{-separator}.$$

• If G is a graph with bounded density α, then G has an

$$O\left(\alpha \cdot n^{(d-1)/d}\right)\text{-separator}.$$

Moreover, such separators that (d+1+ε)/(d+2)-split can be computed in random O(n) time, and in random O(1) time using n processors, for any 1/n^{1/2d} < ε < 1.

An algorithm for partitioning finite element and difference meshes is given in


Section 5.4.

3.4. Overlap graphs include all planar graphs. The proof that planar graphs are a special case of overlap graphs relies on the following theorem of Andreev and Thurston [3, 4, 73], characterizing all planar graphs in a novel geometric fashion.

THEOREM 3.4 (ANDREEV AND THURSTON). Each triangulated planar graph G is isomorphic to a 2-dimensional sphere packing graph.

Simply from the definition, each sphere packing graph is a (1, 1)-overlap graph. Therefore, planar graphs are a special case of overlap graphs.

4. A randomized partitioning algorithm. In this section, we describe a ran-


domized algorithm for computing a small separator of a given overlap graph. In the
next section, we shall show how to make it efficient by using geometric sampling. In
Section 6, we will outline a correctness proof of the algorithm to derive a constructive
proof of Theorem 2.4.

4.1. The algorithm. Given a k-ply neighborhood system Γ = {B₁, …, B_n} and the α-overlap graph G of Γ, let P = {p₁, …, p_n} be the centers of Γ. The following algorithm first finds a (d−1)-sphere S with certain desired properties, to be stated later, and then computes a vertex separator of G from S. In the following algorithm, let U_d be the unit d-sphere in ℝ^{d+1}. We define ST : ℝ^d → U_d to be the standard stereographic projection mapping. This mapping can be described as follows. Assume ℝ^d is embedded in ℝ^{d+1} as the x_{d+1} = 0 coordinate plane, and assume U_d is also embedded in ℝ^{d+1}, centered at the origin. Given a point p in ℝ^d, construct the line L in ℝ^{d+1} passing through p and through the north pole of U_d (that is, the point (0, …, 0, 1)). Line L must pass through one other point q of U_d; we define ST(p) to be q. For a set P = {p₁, …, p_n} in ℝ^d, we denote {ST(p₁), ST(p₂), …, ST(p_n)} by ST(P). Recall that a center point of a point set is a point such that every hyperplane passing through it divides the point set approximately evenly. We will define center points formally in Section 4.2.

Algorithm 1 (Generic Geometric Partitioning)

Input: a neighborhood system Γ and the geometric coordinates of its centers P.
1. Compute Q = ST(P);
2. Find a δ-center point c of Q;
3. Compute a rotation π₁ : U_d → U_d that maps c to a point c′ on the diameter between the south and north poles, say c′ = (0, …, 0, r);
4. Let π₂ be the dilation of ℝ^d by a factor of √((1−r)/(1+r));
5. Choose a random great circle GC of U_d;
6. Transform GC back to ℝ^d using the inverse of the above transformations to obtain a separating sphere S, i.e., S = [ST ∘ π₂ ∘ ST⁻¹ ∘ π₁ ∘ ST]⁻¹(GC);
7. Compute a vertex separator of G from S.

Algorithm 1 defines some point sets that are not explicitly computed. We introduce them below only for the purpose of explaining the algorithm.
• Let Q₁ = π₁(Q) in Step 3 above;
• Let P₁ = ST⁻¹(Q₁), the pre-image of Q₁ in ℝ^d ∪ {∞}. The pre-image of the north pole is defined to be a point at infinity;
• Let P₂ = π₂(P₁) in Step 4;
• Let Q₂ = ST(P₂). Note that the origin (0, 0, …, 0) is a center point of Q₂; see further comments below.

4.2. Center points. Suppose P is a finite set of points in ℝ^d. A hyperplane H in ℝ^d divides P into three subsets: P⁺ = H⁺ ∩ P, P⁻ = H⁻ ∩ P, and P ∩ H. The splitting ratio of H over P, denoted by φ_H(P), is defined as

$$\phi_H(P) = \max\left(\frac{|P^+|}{|P|}, \frac{|P^-|}{|P|}\right).$$

For each 0 < δ < 1, a point c ∈ ℝ^d is a δ-center point of P if every hyperplane containing c δ-splits P. Each d/(d+1)-center point is called simply a center point of P, and the set of all center points is denoted by Center(P). The balanced separation property of a center point makes it very useful for designing efficient divide-and-conquer algorithms [16, 30, 55, 83].
Given a set of points P ⊂ ℝ^d, the answer to the question of whether P has a center point is always affirmative. This follows from Helly's theorem [18].
THEOREM 4.1 (HELLY). Suppose K is a family of at least d+1 convex sets in ℝ^d, and K is finite or each member of K is compact. Then if each d+1 members of K have a common point, there is a point common to all members of K.
LEMMA 4.2 (CENTER POINTS). For each set P ⊆ ℝ^d, Center(P) ≠ ∅.
Proof:¹ We prove the lemma by induction on d. When d = 1, the lemma is clearly true. We now assume that the lemma holds for all d′ < d. If all points of P lie in a (d−1)-dimensional affine space, then we can reduce the dimension by one and apply the induction hypothesis to prove that an even better center point exists.
So without loss of generality, assume that P does not lie in a (d−1)-dimensional affine space. Notice that P induces an equivalence relation on the set of closed halfspaces in ℝ^d: those halfspaces that contain the same subset of points from P are equivalent. Each equivalence class can be identified with a halfspace whose supporting hyperplane passes through d affinely independent points from P.
Let 𝓗 be the set of all closed halfspaces, with supporting hyperplane passing through d affinely independent points of P, that contain more than ⌊d|P|/(d+1)⌋ points of P. We want to show that

$$\mathrm{Center}(P) = \bigcap_{H \in \mathcal{H}} H \neq \emptyset.$$

We first show that ⋂_{H∈𝓗} H ≠ ∅. Clearly, each element of 𝓗 is convex and 𝓗 is finite. By Helly's theorem, it is sufficient to show that for each H₁, …, H_{d+1} ∈ 𝓗, ⋂_{i=1}^{d+1} H_i ≠ ∅.
Note that

$$\bigcap_{i=1}^{d+1} H_i = \mathbb{R}^d - \bigcup_{i=1}^{d+1}\left(\mathbb{R}^d - H_i\right) \supseteq P - \bigcup_{i=1}^{d+1}\left((\mathbb{R}^d - H_i) \cap P\right).$$

Note also

$$\left|\bigcup_{i=1}^{d+1}\left((\mathbb{R}^d - H_i) \cap P\right)\right| \le \sum_{i=1}^{d+1}\left|(\mathbb{R}^d - H_i) \cap P\right| < (d+1)\left\lfloor\frac{|P|}{d+1}\right\rfloor \le |P|.$$

Hence, P − ⋃_{i=1}^{d+1}(ℝ^d − H_i) ≠ ∅.
We now show that each point c in ⋂_{H∈𝓗} H is a center point of P. Suppose c is not a center point of P. Then there is a hyperplane h passing through c and defining a halfspace H such that the interior of H contains at least ⌈d|P|/(d+1)⌉ points of P. Thus, there is a closed halfspace H′ ∈ 𝓗 contained in the interior of H that has at least ⌈d|P|/(d+1)⌉ points of P; since c ∉ H′, this contradicts c ∈ ⋂_{H∈𝓗} H. Therefore, every point in ⋂_{H∈𝓗} H is a center point of P. Similarly, one can show that every center point of P is in ⋂_{H∈𝓗} H. □

¹ We present this proof to indicate that there is an O(n^d)-time algorithm for computing a center point. Similar proofs can be found in many previous works, e.g., [18].

Immediately following from the above proof is an O(n^d)-time algorithm for computing a center point of a set P. This algorithm uses linear programming. It forms a collection of O(n^d) linear inequalities by considering the set of hyperplanes passing through d affinely independent points of P, and finds the common intersection of the halfspaces that contain at least dn/(d+1) points from P. The intersection of the O(n^d) halfspaces can be found in O(n^d) time using Megiddo's linear programming algorithm [22, 53]. We will refer to this algorithm as the LP algorithm. Of course, this algorithm is too slow for applications in practice. An efficient algorithm will be presented in Section 5.
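As a sketch of how the LP algorithm can be realized (for the small point sets it is actually applied to in Section 5), the hypothetical helper `lp_center_point` below builds the halfspace constraints and feeds them to an off-the-shelf solver; this is our own illustration, not the authors' implementation:

```python
import numpy as np
from itertools import combinations
from scipy.optimize import linprog

def lp_center_point(P):
    """Center point of a small set P ((m, d) array, d >= 2) via the LP of
    Section 4.2: intersect every closed halfspace, bounded by a hyperplane
    through d affinely independent points of P, that contains more than
    floor(d*m/(d+1)) points of P.  Builds O(m^d) constraints."""
    m, d = P.shape
    thresh = (d * m) // (d + 1)
    A, b = [], []
    for idx in combinations(range(m), d):
        Q = P[list(idx), :]
        diffs = Q[1:] - Q[0]                 # d-1 direction vectors in the hyperplane
        _, s, vt = np.linalg.svd(diffs)
        if s[-1] < 1e-9 * max(1.0, s[0]):
            continue                         # affinely dependent tuple: skip
        a = vt[-1]                           # unit normal of the hyperplane
        off = a @ Q[0]
        for sa, so in ((a, off), (-a, -off)):
            if np.sum(P @ sa <= so + 1e-9) > thresh:
                A.append(sa); b.append(so)   # the center must satisfy sa . x <= so
    # Feasibility LP: any point of the intersection is a center point.
    res = linprog(np.zeros(d), A_ub=np.array(A), b_ub=np.array(b),
                  bounds=[(None, None)] * d, method="highs")
    return res.x
```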
If c in Algorithm 1 is a δ-center point of Q, then the origin is also a δ-center point of Q₂ [58]. First of all, the point c′ is a δ-center point of Q₁. Now, intuitively, a dilation of ℝ^d moves a center point lying on the diameter between the south and north poles along this diameter, either up or down depending on the dilation factor. We prove in the companion paper [58] that the dilation by the factor √((1−r)/(1+r)) indeed makes the origin a δ-center point of Q₂. So any hyperplane passing through the origin δ-splits Q₂, and hence GC δ-splits Q₂. Because all transformations used in the above partitioning algorithm preserve the splitting ratios of spheres, S also δ-splits P.

4.3. Separating spheres. We now explain how to choose a random great circle in Algorithm 1.
A great circle of U_d is the intersection of U_d with a hyperplane passing through the center of U_d.
Let randn(m) be a function that generates m normally distributed random numbers with mean 0.0 and variance 1.0. A random point p on U_d can be chosen as p = q/‖q‖₂, where q = randn(d+1). A random great circle of U_d is then the great circle normal to the vector p.
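In code, this recipe is essentially one line of NumPy (a sketch; `randn` is `standard_normal` here, and the helper name is ours):

```python
import numpy as np

def random_great_circle_normal(d, rng=None):
    """Unit normal of a uniformly random great circle of U_d: normalize a
    vector of d+1 independent N(0, 1) samples.  The great circle itself
    is {x in U_d : p . x = 0}."""
    rng = rng or np.random.default_rng()
    q = rng.standard_normal(d + 1)       # q = randn(d+1)
    return q / np.linalg.norm(q)         # uniform on U_d by spherical symmetry
```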

Each (d−1)-sphere S separates int(S) from ext(S): any segment connecting a point in int(S) to one in ext(S) must intersect S. In analogy with vertex separators in graph theory, S is called a separating sphere in d-space.
More specifically, for a set of points P = {p₁, …, p_n} in ℝ^d and a constant 0 < δ < 1, we say that S δ-splits P if both |int(S) ∩ P| ≤ δn and |ext(S) ∩ P| ≤ δn.
4.4. Computing a vertex separator from a separating sphere. We now show how to compute a vertex separator of an overlap graph G from a separating sphere S.
One approach is to remove one of the two endpoints of each edge cut by S. We say S cuts an edge (B_i, B_j) of the overlap graph if the line segment between the centers of B_i and B_j has a common point with S. Let E_S be the set of edges cut by S. A ball B_i is a boundary ball with respect to S if some edge cut by S is incident to it. Let U be the set of all boundary balls and let G_S = (U, E_S) be the subgraph of G induced by S. Clearly, G_S is a bipartite graph, with boundary balls from the interior of S on one side and boundary balls from the exterior of S on the other.
The discussion in this section assumes that no point of P lies exactly on S. Because we choose S at random in Algorithm 1, the occurrence of a point of P exactly on S is a zero-probability event. Even if this event were to occur, a slight generalization of the results in this section would cover that case. In fact, we can first put all points of P that appear exactly on S into the vertex separator.
Recall that a vertex cover of a graph G is a subset C of vertices such that each edge of G has an endpoint in C. In other words, deleting C from G removes all edges of G. Simply from the definition of vertex cover, we have
LEMMA 4.3. Suppose Γ = {B₁, …, B_n} is a k-ply neighborhood system in ℝ^d and G is the α-overlap graph of Γ. If S is a (d−1)-sphere that δ-splits Γ, then each vertex cover C of G_S δ-splits G.
Therefore, the best way to compute a vertex separator from S is perhaps to take a minimum vertex cover of G_S. In fact, a minimum vertex cover of a bipartite graph can be computed in polynomial time using the Dulmage–Mendelsohn decomposition [20] (see [64] for related applications of the Dulmage–Mendelsohn decomposition in sparse matrix computations).
On the other hand, a faster way to compute a vertex separator from S is to put all the boundary balls from either the interior of S or the exterior of S, whichever has smaller cardinality, into the separator (see also [84]).
However, in the above constructions one has to examine the structure of the overlap graph to find a vertex separator from a sphere separator. In some applications, only the neighborhood system is given and it is relatively expensive to compute the overlap graph. We now show how to find a small vertex separator directly from the neighborhood system and the separating sphere.
For each edge (B_i, B_j) cut by S, let q_{i,j} be the common point of S and the line segment between the centers of B_i and B_j. Let r_i be the radius of B_i. Without loss of generality, assume r_i ≤ r_j. Notice that q_{i,j} is either in B_j or in α·B_i. If q_{i,j} is in B_j, then we put B_j in D. If q_{i,j} is not in B_j (in which case it must be in α·B_i), we put B_i in D. Clearly, since at least one endpoint of every cut edge is in D, D is a vertex cover of G_S.
We now introduce the notion of an overlap neighbor. Let S be a (d−1)-sphere in ℝ^d whose radius is r. A ball B_i is an overlap neighbor of S if one of the following conditions is true:
1. B_i ∩ S ≠ ∅;
2. (α·B_i) ∩ S ≠ ∅ and r_i ≤ r.
The number of overlap neighbors of S is called the overlap number of S. The set of overlap neighbors of a sphere can be computed in O(n) time directly from the neighborhood system.
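For instance, with balls given by center and radius arrays, the test reduces to comparing each ball's radial distance from the shell of S against its (scaled) radius; the following sketch (our own helper name) performs one vectorized O(n) pass:

```python
import numpy as np

def overlap_neighbors(centers, radii, s, r, alpha):
    """Indices of the overlap neighbors of the sphere S with center s and
    radius r, for a neighborhood system given by ball centers ((n, d)
    array) and radii ((n,) array).  A ball B_i meets the (d-1)-sphere S
    exactly when | ||p_i - s|| - r | <= r_i; condition 2 uses alpha*r_i
    together with r_i <= r."""
    gap = np.abs(np.linalg.norm(centers - s, axis=1) - r)
    cond1 = gap <= radii                            # B_i intersects S
    cond2 = (gap <= alpha * radii) & (radii <= r)   # alpha*B_i intersects S, r_i <= r
    return np.flatnonzero(cond1 | cond2)
```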
LEMMA 4.4. The set of all overlap neighbors of S is a vertex cover of G_S.
Proof: From the discussion above, D is a vertex cover of G_S. To prove the lemma we establish that each ball from D is an overlap neighbor of S.
We partition the set D into two subsets D₁ and D₂, with

D₁ = {B_i ∈ D : B_i ∩ S ≠ ∅},
D₂ = D − D₁.

Clearly, each ball from D₁ is an overlap neighbor. Now we need to show that for all B_i ∈ D₂, r_i ≤ r.
If B_i ∈ D₂, then B_i ∩ S = ∅. There are two possible cases:
• Case 1: If p_i ∈ int(S), then it follows simply from B_i ∩ S = ∅ that r_i ≤ r;
• Case 2: If p_i ∈ ext(S), then from B_i ∈ D₂ it follows that (α·B_i) ∩ S ≠ ∅, and there is a ball B_j in the neighborhood system such that (1) p_j ∈ int(S); (2) B_j ∩ S = ∅; (3) r_i ≤ r_j; and (4) (α·B_i) ∩ B_j ≠ ∅. (Because q_{i,j} is not in B_j — otherwise B_i would not be in D — B_j does not intersect S, and hence condition (2) holds.) Since B_j lies entirely inside S, r_j ≤ r, and so r_i ≤ r_j ≤ r.
Thus, in each case we have r_i ≤ r; i.e., B_i is an overlap neighbor, completing the proof of the lemma. □

So, our second method is to remove the set of all overlap neighbors. This method is efficient when the overlap graph G is not explicitly given. Section 6 shows that the expected number of overlap neighbors generated by the above algorithm is small.
Notice that the definition of overlap neighbors implicitly removes the assumption that no point of P lies exactly on S: if the center of B_i appears exactly on S, then B_i ∩ S ≠ ∅, hence B_i is an overlap neighbor and is placed in the vertex separator.

5. Making the method practical. The running time of the algorithm above crucially depends on the time needed to compute a center point in (d+1)-space. All other steps of the algorithm can be performed in O(n) time, and in constant parallel time using O(n) processors.
Unfortunately, no linear-time algorithm is known for computing center points. As shown in Section 4.2, there is a method that requires solving a set of O(n^d) linear inequalities. The only improved result, due to Cole, Sharir, and Yap [16], is that a center point in two dimensions can be computed in O(n log⁵ n) time, and in three dimensions in O(n² log⁷ n) time. No subquadratic algorithm is known that always returns even an approximate center point.
In this section, we show that an approximate center point can be found efficiently using geometric sampling [15, 42, 66], an important algorithmic technique for designing efficient geometric algorithms.

5.1. Geometric sampling for efficiency. To illustrate the idea, we first show how to use random sampling to compute an approximate center point in one dimension. In this case, the input is a set of 2n integers P = {p₁, …, p_{2n}}. If p_i < p_j for all i < j, then Center(P) = [p_n, p_{n+1}].
Now suppose we randomly select an element from P, say p. The probability that p ∈ {p_n, p_{n+1}} is 1/n, while the probability that p_{⌈n/2⌉} ≤ p ≤ p_{⌈3n/2⌉} is 0.5. So, with probability 0.5, a randomly selected element of P is an ε = 3/4 center point.
We can do better using larger samples! Suppose l random elements S = {r₁, …, r_l} are selected and their median r is the output. Letting I(r) be the rank of r in P, it follows from a simple analysis that E[I(r)] = n and V[I(r)] = (2n+1)(2n−1−2l)/(8l+6). By Chebyshev's inequality,

$$\Pr\left(|I(r) - n| \ge t\right) \le \frac{n^2}{2lt^2}.$$

Thus, with probability at least 0.5, |I(r) − n| ≤ n/√l; i.e., r is a (1/2 + 1/(2√l))-center point of P. So a (1/2 + 1/(2√l))-center point of P can be computed in O(l) time. A similar sampling idea was used by Floyd and Rivest [26] in their fast selection algorithm.
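The one-dimensional scheme is only a couple of lines of code; this sketch (hypothetical helper name, sampling with replacement for simplicity) returns the sample median as the approximate center:

```python
import numpy as np

def approx_center_1d(P, l, rng=None):
    """Median of l random elements of P: by the Chebyshev argument above,
    with probability at least 1/2 it is within n/sqrt(l) ranks of the true
    median, i.e. roughly a (1/2 + 1/(2*sqrt(l)))-center point of P."""
    rng = rng or np.random.default_rng()
    sample = rng.choice(np.asarray(P), size=l, replace=True)
    return np.median(sample)
```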
The algorithm can be generalized to higher dimensions. In d dimensions, the randomized δ-center point algorithm has the following form.

Algorithm 2 (The Sampling Algorithm for Center Points)

Input: a point set P ⊂ ℝ^d.
1. Select a subset S of P of size l uniformly at random;
2. Compute a center point c_S of S, using the LP algorithm for center points given in Section 4.2;
3. Output c_S.

The feasibility of Algorithm 2 rests on the following question: what is the probability that the c_S computed above is an ε-center point?
We now introduce a notion that will be very useful in quantifying the quality of the c_S computed by Algorithm 2. Recall that φ_h(P) is the ratio in which a hyperplane h splits a point set P.
DEFINITION 5.1 (ε-GOOD SAMPLE). Suppose P is a set of points in ℝ^d. S ⊆ P is an ε-good sample if for all hyperplanes h, |φ_h(S) − φ_h(P)| ≤ ε.
The following lemma shows the importance of ε-good samples in approximating center points. Its proof is straightforward.
LEMMA 5.2. For each P ⊂ ℝ^d, if S ⊆ P is an ε-good sample, then each δ-center point of S is a (δ+ε)-center point of P.

Now the question becomes: how often does a set of l randomly chosen points form an ε-good sample? This is not a trivial question, but it was in fact answered by Vapnik and Chervonenkis [80] (see [70] for a detailed proof).
THEOREM 5.3 (VAPNIK AND CHERVONENKIS). There is a constant c_d depending only on d such that for each 0 < ε ≤ 1 and l ≥ 2/ε², if S is a set of l randomly chosen points from P, then

$$\Pr[S \text{ is an } \epsilon\text{-good sample}] \ge 1 - c_d\, l^{d+1} e^{-\epsilon^2 l / 2}.$$

The Vapnik–Chervonenkis bound implies that, for fixed ε, we only need to sample about O(d log d) points to compute an approximate center point with high probability.

THEOREM 5.4. For each P ⊂ ℝ^d, Algorithm 2 computes a (λ_d + ε)-center point of P in O(l^d) time, where λ_d = d/(d+1) and l is the sample size given by Theorem 5.3 (depending only on d, ε, and the failure probability η), with probability at least 1 − η.

Notice that the computation above can be efficiently implemented in parallel.

5.2. Separators using sampling. We now incorporate random sampling into the partitioning algorithm for overlap graphs.

Algorithm 3 (Fast Geometric Partitioning)

Input: a neighborhood system Γ and the geometric coordinates of its centers P.
1. Choose a random sample P′ of the size given by Theorem 5.4;
2. Let Q = ST(P′);
3. Compute a δ-center point c of Q using the LP algorithm;
4. Compute the rotation π₁ and the dilation π₂ that conformally map c to the origin;
5. Choose a random great circle GC of U_d;
6. Let S = [ST ∘ π₂ ∘ ST⁻¹ ∘ π₁ ∘ ST]⁻¹(GC);
7. Induce a vertex separator of G from S.

According to our experiments, about 800 sample points work very well for meshes in two dimensions, and 1100 points work very well for meshes in three dimensions.
THEOREM 5.5. Algorithm 3 computes S in random constant time, and a vertex separator of an overlap graph in random O(n) time. Using p processors, the time can be reduced to O(n/p).
Algorithm 3 demonstrates the usefulness of geometry, sampling, and randomization in mesh partitioning. The random sampling in the above algorithm reduces the problem size and simultaneously guarantees a provably good approximation of the larger problem. It is the underlying geometric structure that ensures the quality of the partition.
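To make Steps 4–6 concrete, the sketch below classifies each point by the side of the random great circle on which its image lands, instead of constructing S explicitly. It is our own rendering under stated assumptions: `stereo`/`stereo_inv` are the projection helpers sketched in Section 4.1, a Householder reflection stands in for the rotation π₁ (any orthogonal map taking c to the vertical axis works), and the center point c (a point of norm less than 1 in ℝ^{d+1}) would come from the LP or Radon-point sketches:

```python
import numpy as np

def stereo(p):
    s = np.dot(p, p)
    return np.append(2.0 * p, s - 1.0) / (s + 1.0)

def stereo_inv(q):                         # q must not be the north pole
    return q[:-1] / (1.0 - q[-1])

def sphere_partition(P, c, rng=None):
    """Split the rows of P (points in R^d) by a random separating sphere,
    given an approximate center point c of ST(P) with ||c|| < 1.
    Each point is pushed through Phi = ST o pi_2 o ST^{-1} o pi_1 o ST and
    labeled by the side of a random great circle on which its image lands;
    the two labels are exactly the interior and exterior of the sphere
    S = Phi^{-1}(GC) (which side is called "inside" is arbitrary)."""
    rng = rng or np.random.default_rng()
    n, d = P.shape
    r = np.linalg.norm(c)
    # pi_1: an orthogonal map (Householder reflection) taking c/||c|| to
    # the north pole, so that pi_1(c) = (0, ..., 0, r).
    e = np.zeros(d + 1); e[-1] = 1.0
    u = c / r - e
    H = np.eye(d + 1) if u @ u < 1e-24 else \
        np.eye(d + 1) - 2.0 * np.outer(u, u) / (u @ u)
    t = np.sqrt((1.0 - r) / (1.0 + r))     # pi_2: dilation factor
    v = rng.standard_normal(d + 1)         # normal of the random great circle
    side = np.empty(n, dtype=bool)
    for i in range(n):
        q = H @ stereo(P[i])               # pi_1(ST(p))
        q = stereo(t * stereo_inv(q))      # ST(pi_2(ST^{-1}(.)))
        side[i] = (v @ q) < 0.0
    return side
```

On a mesh, the resulting mask induces the cut edge set E of Step 3 of Algorithm 5 (Section 5.4), from which a vertex separator is extracted as in Section 4.4.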
5.3. A fast heuristic for center points. Although the sampling algorithm (Algorithm 2) for center points (in fixed dimension) is efficient from a theoretical viewpoint, it uses linear programming to solve the center point problem on a smaller sample point set. The use of linear programming becomes a serious concern in practical implementations. For example, the experimental results show that the sampling algorithm needs a sample of about five hundred to eight hundred points in two dimensions. The sampling algorithm thus needs to solve a system of about $\binom{500}{3} \approx 20$ million linear inequalities! Worse, the state-of-the-art linear programming algorithms (for fixed dimension) have large constants, and the sample size would be larger in higher dimensions. The seemingly efficient sampling algorithm is too expensive for practical applications.
To overcome this difficulty, we have developed a heuristic for approximating center points [55]. The heuristic uses randomization and runs in time linear in the number of sample points. Most importantly, it does not use linear programming. Our algorithm is based on the notion of a Radon point. Let P be a set of points in ℝ^d. A point q ∈ ℝ^d is a Radon point of P [18] if P can be partitioned into two disjoint subsets P₁ and P₂ such that q is a common point of the convex hull of P₁ and the convex hull of P₂. Such a partition is called a Radon partition.

FIG. 6. The Radon point of four points in ℝ². When no point is in the convex hull of the other three (the left figure), the Radon point is the unique crossing of two line segments. Otherwise (the right figure), the point that is in the convex hull of the other three is a Radon point.

FIG. 7. The Radon point of five points in ℝ³. Two cases are similar to those in two dimensions.

The following theorem shows that if |P| ≥ d+2, then a Radon point always exists. Moreover, it can be computed efficiently.

THEOREM 5.6 (RADON [18]). Let P be a set of points in ℝ^d. If |P| ≥ d+2, then there is a partition (P₁, P₂) of P such that the convex hull of P₁ has a point in common with the convex hull of P₂.

Proof: Suppose P = {p₁, …, p_n} with n ≥ d+2. Consider the system of d+1 homogeneous linear equations

$$\sum_{i=1}^{n} \alpha_i = 0 = \sum_{i=1}^{n} \alpha_i\, p_i^j \quad (1 \le j \le d),$$

where p_i = (p_i^1, …, p_i^d) are the usual coordinates of p_i in ℝ^d. Since n ≥ d+2, the system has a nontrivial solution (α₁, …, α_n). Let U be the set of all i for which α_i > 0, let V be the set for which α_i ≤ 0, and let c = Σ_{i∈U} α_i > 0. Then (U, V) induces a partition of P, with Σ_{i∈V} α_i = −c and

$$\sum_{i \in U} (\alpha_i / c)\, p_i = \sum_{i \in V} (-\alpha_i / c)\, p_i.$$

Let q = Σ_{i∈U} (α_i/c) p_i. The point q is simultaneously written as a convex combination of the points indexed by U and a convex combination of the points indexed by V. Hence, q is in the convex hull of each of the two sets, completing the proof. □

To compute a Radon point of P, we need only compute a Radon point of the first d+2 points. It follows from the proof above that a Radon point can be computed in O(d³) time, by solving one (d+1) × (d+2) linear system.
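The proof is directly constructive; a minimal NumPy rendering (our helper name `radon_point`) finds the null vector of the homogeneous system with one SVD:

```python
import numpy as np

def radon_point(P):
    """A Radon point of the rows of P (m >= d+2 points in R^d), following
    the proof of Theorem 5.6: find a nontrivial alpha with
    sum(alpha_i) = 0 = sum(alpha_i * p_i), then average the positive part."""
    P = np.asarray(P, dtype=float)
    d = P.shape[1]
    P = P[: d + 2]                        # the first d+2 points suffice
    M = np.vstack([np.ones(len(P)), P.T]) # (d+1) x (d+2) homogeneous system
    alpha = np.linalg.svd(M)[2][-1]       # null vector: smallest right singular vector
    if not (alpha > 0).any():
        alpha = -alpha                    # ensure the positive part is nonempty
    pos = alpha > 0
    c = alpha[pos].sum()
    return (alpha[pos] / c) @ P[pos]      # q = sum_{i in U} (alpha_i / c) p_i
```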
We now describe our heuristic for approximating center points.

Algorithm 4 (Fast Center Points)

Input: a point set P ⊂ ℝ^d.
1. Construct a complete balanced (d+2)-way tree T with L leaves (for an integer L);
2. For each leaf of T, choose a point from P uniformly at random, independent of the other leaves;
3. Evaluate the tree T in bottom-up fashion, assigning to each internal node of T a point in ℝ^d that is a Radon point of the points assigned to its d+2 children;
4. Output the point associated with the root of T.

A complete (d+2)-way tree with L leaves has about L/(d+1) internal nodes, so Algorithm 4 takes O(d²L) time, with a small constant. Our experimental results suggest that, independent of the size of the mesh, L = 900 leaves suffice for meshes from two dimensions and L = 1200 for meshes from three dimensions. Moreover, about 10 to 30 tries give a small-cost separator that approximately 0.52-splits a mesh.
On the theoretical side, Eppstein, Miller, Sturtivant, and Teng [23] recently gave a proof that Algorithm 4 finds a (1 − 1/√2)-center point with high probability.
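Under the assumption that L is a power of d+2 (so that the tree is complete), the bottom-up evaluation can be written without building the tree at all, reusing the `radon_point` helper from the previous sketch:

```python
import numpy as np

def fast_center_point(P, L, rng=None):
    """Algorithm 4 sketch: draw L leaf points from P uniformly at random
    (assume L = (d+2)^h for some h), then repeatedly replace each group
    of d+2 points by its Radon point until a single point, the value at
    the root of the implicit tree, remains."""
    rng = rng or np.random.default_rng()
    d = P.shape[1]
    pts = P[rng.integers(0, len(P), size=L)]   # the leaves, one level at a time
    while len(pts) > 1:
        pts = np.array([radon_point(pts[j:j + d + 2])
                        for j in range(0, len(pts), d + 2)])
    return pts[0]
```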

5.4. A practical algorithm for partitioning unstructured meshes. A finite element mesh is not given as a neighborhood system or an overlap graph. Fortunately, our partitioning algorithms do not require a neighborhood system representation; only the proof does. To cope with this setting, we show how to adapt our partitioning algorithm.
Algorithm 5 (Partitioning Finite Element Meshes)

Input: the combinatorial structure of a mesh G, and the geometric coordinates xyz of the mesh.
1. Choose a random sample P, of the size given by Theorem 5.4, from xyz;
2. Apply Algorithm 3 (Fast Geometric Partitioning) to compute a separating sphere S for P. In practice, we replace the LP algorithm of Step 3 in Algorithm 3 by Algorithm 4, the Radon-partition-based fast center point algorithm;
3. Partition G into two subgraphs G₁ and G₂ by removing the set E of edges that cut this sphere. This set of edges can be found using the structure of G; each vertex is placed into G₁ or G₂ depending on whether it is mapped to the interior or the exterior of S;
4. Compute the density α of the embedding given by (G, xyz). Then, for each v that is an endpoint of an edge in E, find the ball centered at v whose radius is given in the proof of Lemma 3.1, and use the overlap-neighbor rule of Section 4.4 to compute a vertex separator of (G, xyz).

In practice, however, we do not calculate the density of the embedding to compute a vertex separator. We use more direct and practical subroutines for finding a vertex separator of (G, xyz) in the last step of the algorithm above. For example, a faster way to compute a vertex separator is to take, among the endpoints of edges in E, either those in the interior of S or those in the exterior of S, whichever set has smaller cardinality, and put them into the separator. Notice that the degree of a density graph or a finite element graph is bounded, so this heuristic also finds a vertex separator that satisfies the separator bound of Theorem 3.3. To compute the smallest vertex separator induced by S, we can use the minimum bipartite vertex cover procedure (see Section 4.4). Although the worst-case time complexity of this procedure is quite high, its expected (average) time complexity is much lower, especially since in our case the bipartite graph induced by E is much smaller than the original graph G.
Figure 8 shows the meshes of Figures 2 and 3 partitioned recursively by our experimental implementation [39] of Algorithm 5.
Our mesh partitioning algorithms first use the geometric information to compute a continuous separator, then use the combinatorial structure to find a vertex separator. One major reason the above partitioning algorithm is suitable for efficient practical implementation is the use of geometric sampling, a technique that reduces the problem size and simultaneously guarantees a provably good approximation of the larger problem.

6. Proof outline for the main separator theorem. We now outline the proof of the Main Theorem 2.4, to show how geometric arguments can be used to prove separator properties for graphs embedded in a fixed dimension. The detailed proofs are presented in the companion paper [58]. In Section 6.1, we present a continuous separator theorem, based on which we give a geometric method for proving small separator theorems in Section 6.2. We then apply this method to prove Theorem 2.4 in the remainder of this section.

FIG. 8. Recursive partitioning using Algorithm 5.
6.1. A continuous separator theorem. In the partitioning algorithms presented above, a "continuous" separator, in the form of a sphere, is computed first, and then a "discrete" separator is deduced from its continuous counterpart. The quality of the continuous separating sphere, in a strong sense, determines the quality of the vertex separator.
Suppose f(x) is a real-valued, nonnegative function defined on ℝ^d such that f^k is integrable for all k = 1, 2, 3, …. Such an f is called a cost function. The total volume of the function f is defined as

$$\text{Total-Volume}(f) = \int_{v \in \mathbb{R}^d} (f(v))^d \,(dv)^d.$$

Suppose S is a (d−1)-sphere in ℝ^d. The surface area of S with respect to f is then

$$\text{Area}(f, S) = \int_{v \in S} (f(v))^{d-1} \,(dv)^{d-1}.$$

Let π denote a map from ℝ^d to U_d formed by a stereographic projection, followed by a rotation, followed by an inverse stereographic projection, followed by a dilation, followed by a stereographic projection. Such a map is called an H-map. Our partitioning algorithms compute an H-map, choose a random great circle, and use the inverse of the H-map to transform the great circle to a sphere in ℝ^d.
For each great circle GC of U_d, let S_{GC} be the sphere defined by π⁻¹(GC), and let Cost(GC) = Area(f, S_{GC}). Let Avg(f) be the average cost over all great circles of U_d. We will use the following lemma, whose proof can be found in the companion paper [58].
LEMMA 6.1. Suppose f is a cost function in ℝ^d. Then

$$\text{Avg}(f) = O\left((\text{Total-Volume}(f))^{\frac{d-1}{d}}\right).$$

Consequently, we have the following continuous separator theorem.

THEOREM 6.2 (CONTINUOUS SEPARATOR). Suppose f is a cost function on ℝ^d and P is a set of n distinct points in ℝ^d. Let S be a sphere chosen by the random process described in Algorithm 1. Then S (d+1)/(d+2)-splits P, and with high probability,

$$\text{Area}(f, S) = O\left((\text{Total-Volume}(f))^{\frac{d-1}{d}}\right).$$

The splitting ratio of the separator in the theorem above is (d+1)/(d+2) instead of d/(d+1). This is because the points are mapped from ℝ^d to the unit sphere in ℝ^{d+1}, and the center of the unit sphere is a (d+1)-dimensional center point of the image rather than a d-dimensional center point.

6.2. A new approach to proving small separator theorems. The continuous separator theorem of the last section provides the following generic approach to proving that a graph G embedded in ℝ^d has an O(c^{(d-1)/d})-separator that (d+1)/(d+2)-splits.

A Geometric Approach to Proving Small Separator Theorems

1. Define a real-valued function f based on the structure of G so that Total-Volume(f) is bounded by a function c;
2. Find a (d−1)-dimensional separating sphere S that (d+1)/(d+2)-splits the vertices of G and has Area(f, S) = O(c^{(d-1)/d}), by Theorem 6.2;
3. Deduce a vertex separator of G from the separating sphere S.

In order to deduce a vertex separator from its continuous counterpart, the function f must be faithful, in the sense that the cost of a continuous separator models the size of a vertex separator of the underlying graph. In other words, the continuous function f faithfully encodes the combinatorial properties related to separators of the underlying graph.
We will follow the basic steps above to prove Theorem 2.4. To this end, for each α-overlap graph of a k-ply neighborhood system Γ = {B₁, …, B_n} in ℝ^d, we construct a real-valued function f based on Γ and prove that

$$\text{Total-Volume}(f) = O\left(\alpha^{\frac{d}{d-1}}\, k^{\frac{1}{d-1}}\, n\right);$$

then we show that from each separating sphere S we can deduce a vertex separator of size linearly bounded in Area(f, S).

6.3. Local cost functions. Just as each overlap graph is defined from its neighborhood system B₁, …, B_n, the cost function f itself is defined from local cost functions f₁, …, f_n, with f_i based on B_i.
Let P = {p₁, …, p_n} be the set of centers of {B₁, …, B_n}, and suppose that the radius of B_i is r_i. We define f_i as

$$f_i(x) = \begin{cases} 1/(2\alpha r_i) & \text{if } x \in (2\alpha)\cdot B_i, \text{ i.e., } \|x - p_i\| \le 2\alpha r_i \\ 0 & \text{otherwise.} \end{cases}$$

Intuitively, f_i sets up a cost on each (d−1)-sphere S such that the closer S is to B_i, the more B_i contributes to the surface area of S. The function f_i measures the cost of a sphere passing through B_i and its vicinity.
Notice that the function f_i defined above has Total-Volume(f_i) equal to V_d, the volume of a unit ball in d dimensions.
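This is a one-line check: the support of f_i is a ball of radius 2αr_i on which f_i is constant, so

$$\text{Total-Volume}(f_i) = \int_{(2\alpha)\cdot B_i} \left(\frac{1}{2\alpha r_i}\right)^d dx = \frac{V_d\,(2\alpha r_i)^d}{(2\alpha r_i)^d} = V_d.$$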

6.4. Putting local cost functions together. Now we need to combine the local cost functions into a global cost function. Perhaps the simplest way is to take the sum, f = Σ_i f_i, but this is not the best choice. To see this, consider the extreme case where the neighborhood system is a collection of n identical balls and α = 1, so that k = n. The total volume of the sum is n^d V_d, while we need a cost function of total volume O(n^{d/(d-1)}) to establish Theorem 2.4.
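For the record, the L_{d−1} combination defined below handles this extreme case exactly: for n identical balls of radius r with α = 1, each f_i equals 1/(2r) on the ball 2·B, so

$$f = L_{d-1}(f_1, \ldots, f_n) = n^{\frac{1}{d-1}} \cdot \frac{1}{2r} \ \text{ on } 2\cdot B, \qquad \text{Total-Volume}(f) = n^{\frac{d}{d-1}}\,(2r)^{-d}\, V_d\,(2r)^d = n^{\frac{d}{d-1}}\, V_d,$$

matching the bound of Lemma 6.4 with α = 1 and k = n.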

To achieve a tight bound, we make use of the "slight" difference between the various p-norms when applied to high-dimensional vectors. This technique appears to be new and is interesting in its own right. Recently, Mitchell and Vavasis [61] used a cost function similar to ours to analyze their three-dimensional mesh generation algorithm.
Suppose a₁, …, a_n are reals. For each positive integer p, the L_p norm of a₁, …, a_n, denoted L_p(a₁, …, a_n), is

$$L_p(a_1, \ldots, a_n) = \left(\sum_{i=1}^{n} |a_i|^p\right)^{1/p}.$$

The following lemma states the relationship between different norms.
LEMMA 6.3. Let a₁, …, a_n be reals. If p ≤ q, then L_p(a₁, …, a_n) ≥ L_q(a₁, …, a_n).
Proof: See Hardy, Littlewood, and Pólya [41] (pages 26 and 144). □
We define the global cost function of the overlap graph to be the L_{d−1} norm of f₁, …, f_n, i.e.,

$$f(x) = L_{d-1}(f_1(x), \ldots, f_n(x)).$$

Notice that the L_d norm of the f_i is not a good choice, because its total volume is nV_d for all neighborhood systems. The following lemma, proved in [58], bounds the total volume of the function f.
LEMMA 6.4. Let Γ = {B₁, …, B_n} be a k-ply neighborhood system in ℝ^d. If f₁, …, f_n are the local cost functions of Γ and f is its global cost function, then

$$\text{Total-Volume}(f) = O\left(\alpha^{\frac{d}{d-1}}\, k^{\frac{1}{d-1}}\, n\right).$$

Consequently, by Theorem 6.2, we have the following lemma.

LEMMA 6.5. Suppose {B₁, …, B_n} is a k-ply neighborhood system in ℝ^d. Then there exists a ((d+1)/(d+2))-splitting sphere S of {B₁, …, B_n} with

$$\text{Area}(f, S) = O\left(c\, \alpha\, k^{1/d}\, n^{(d-1)/d}\right), \quad \text{where } c = 2^{d-1} V_d.$$
6.5. A vertex separator from a continuous one. Recall that a ball B_i is an overlap neighbor of a sphere S of radius r if one of the following conditions is true:
1. B_i ∩ S ≠ ∅;
2. (α·B_i) ∩ S ≠ ∅ and r_i ≤ r.
The number of overlap neighbors of S is called the overlap number of S, denoted ϑ_Γ(S). The following lemma bounds the overlap number of a k-ply neighborhood system in ℝ^d.
LEMMA 6.6. Suppose Γ = {B₁, …, B_n} is a k-ply neighborhood system in ℝ^d. Let f₁, …, f_n be the local cost functions defined for the α-overlap graph of Γ and let f = L_{d−1}(f₁, …, f_n). For each (d−1)-sphere S,

$$\vartheta_\Gamma(S) = O(\alpha^d k) + O(\text{Area}(f, S)).$$

The constant in the big-O notation depends only on d.

Theorem 2.4 follows from Lemma 6.4, Theorem 6.2, and Lemma 6.6.
7. Final remarks. We have demonstrated that geometric structure is useful for mesh partitioning. We have shown that geometric sampling can be used to reduce the problem size, making our approach feasible for practical applications in large-scale computation. We have implemented the random linear-time separator algorithm of Section 5 [39] and have experimented with various examples. The numerical results are encouraging. With the help of some heuristics to speed up the geometric transformation and the local optimization, the program is very fast. In practice, our program generates partitions much better than the theoretical results predict; the partitions are competitive with those of previous methods such as the ones based on an expensive eigenvector computation [65].
Recently, Eppstein, Miller, and Teng [24] have shown that a small separator for the intersection graph of a k-ply neighborhood system can in fact be found in deterministic linear time.
We conclude the paper with the following two open questions.

1. What is the computational complexity of deciding whether a graph G is k-embeddable or (α, k)-embeddable?
2. Is there a polynomial-time algorithm for computing the disk packing of a planar graph?

Acknowledgments. We would like to thank David Applegate, Marshall Bern, David Eppstein, John Gilbert, Bruce Hendrickson, Ravi Kannan, Tom Leighton, Mike Luby, Oded Schramm, Doug Tygar, and Kim Wagner for invaluable help and discussions. We would especially like to thank John Gilbert for his editorial contributions, which greatly improved this paper.

REFERENCES

[1] A. Agrawal and P. Klein. Cutting down on fill using nested dissection: Provably good elimination orderings. In Sparse Matrix Computations: Graph Theory Issues and Algorithms, IMA Volumes in Mathematics and its Applications (this book), A. George, J. Gilbert and J. Liu, editors, Springer-Verlag, New York, 1992.
[2] N. Alon, P. Seymour, and R. Thomas. A separator theorem for non-planar graphs. In Proceedings of the 22nd Annual ACM Symposium on Theory of Computing, Maryland, May 1990. ACM.
[3] E. M. Andreev. On convex polyhedra in Lobachevskii space. Math. USSR Sbornik, 10(3):413-440, 1970.
[4] E. M. Andreev. On convex polyhedra of finite volume in Lobachevskii space. Math. USSR Sbornik, 12(2):255-259, 1970.
[5] M. J. Berger and S. Bokhari. A partitioning strategy for nonuniform problems on multiprocessors. IEEE Trans. Comp., C-36:570-580, 1987.
[6] M. Bern, D. Eppstein and J. R. Gilbert. Provably good mesh generation. In 31st Annual Symposium on Foundations of Computer Science, IEEE, 231-241, 1990 (to appear in JCSS).
[7] M. Bern and D. Eppstein. Mesh generation and optimal triangulation. In Computing in Euclidean Geometry, F. K. Hwang and D.-Z. Du, editors, World Scientific, 1992.
[8] G. Birkhoff and A. George. Elimination by nested dissection. In Complexity of Sequential and Parallel Numerical Algorithms, J. F. Traub, editor, Academic Press, 1973.
[9] P. E. Bjørstad and O. B. Widlund. Iterative methods for the solution of elliptic problems on regions partitioned into substructures. SIAM J. Numer. Anal., 23:1097-1120, 1986.

[10] G. E. Blelloch. Vector Models for Data-Parallel Computing. MIT Press, Cambridge, MA, 1990.
[11] J. H. Bramble, J. E. Pasciak, and A. H. Schatz. An iterative method for elliptic problems on regions partitioned into substructures. Math. Comp., 46:361-369, 1986.
[12] D. Calahan. Parallel solution of sparse simultaneous linear equations. In Proceedings of the 11th Annual Allerton Conference on Circuits and Systems Theory, 729-735, 1973.
[13] T. F. Chan and D. C. Resasco. A framework for the analysis and construction of domain decomposition preconditioners. UCLA-CAM-87-09, 1987.
[14] L. P. Chew. Guaranteed quality triangular meshes. Department of Computer Science, Cornell University, TR 89-893, 1989.
[15] K. Clarkson. Fast algorithm for the all-nearest-neighbors problem. In 24th Annual Symposium on Foundations of Computer Science, 226-232, 1983.
[16] R. Cole, M. Sharir and C. K. Yap. On k-hulls and related problems. SIAM J. Computing, 16:61-77, 1987.
[17] J. H. Conway and N. J. A. Sloane. Sphere Packings, Lattices and Groups. Springer-Verlag, 1988.
[18] L. Danzer, B. Grünbaum and V. Klee. Helly's theorem and its relatives. Proceedings of Symposia in Pure Mathematics, American Mathematical Society, 7:101-180, 1963.
[19] H. N. Djidjev. On the problem of partitioning planar graphs. SIAM J. Alg. Disc. Meth., 3(2):229-240, June 1982.
[20] A. L. Dulmage and N. S. Mendelsohn. Coverings of bipartite graphs. Canadian J. Math., 10:517-534, 1958.
[21] I. S. Duff. Parallel implementation of multifrontal schemes. Parallel Computing, 3:193-204, 1986.
[22] M. E. Dyer. On a multidimensional search procedure and its application to the Euclidean one-centre problem. SIAM Journal on Computing, 13:31-45, 1984.
[23] D. Eppstein, G. L. Miller, C. Sturtivant and S.-H. Teng. Approximating center points with and without linear programming. Manuscript, Massachusetts Institute of Technology, 1992.
[24] D. Eppstein, G. L. Miller and S.-H. Teng. A deterministic linear time algorithm for geometric separators and its applications. Manuscript, Xerox Palo Alto Research Center, 1991.
[25] I. Fáry. On straight line representation of planar graphs. Acta Sci. Math., 24:229-233, 1948.
[26] R. W. Floyd and R. L. Rivest. Expected time bounds for selection. CACM, 18(3):165-173, March 1975.
[27] H. de Fraysseix, J. Pach, and R. Pollack. Small sets supporting Fáry embeddings of planar graphs. In Proceedings of the 20th Annual ACM Symposium on Theory of Computing, 426-433, 1988.
[28] G. N. Frederickson and R. Janardan. Separator-based strategies for efficient message routing. In 27th Annual Symposium on Foundations of Computer Science, IEEE, 428-437, 1986.
[29] I. Fried. Condition of finite element matrices generated from nonuniform meshes. AIAA J., 10:219-221, 1972.
[30] A. M. Frieze, G. L. Miller and S.-H. Teng. Separator based divide and conquer in computational geometry. In Proceedings of the 1992 ACM Symposium on Parallel Algorithms and Architectures, 1992.
[31] H. Gazit. An improved algorithm for separating a planar graph. Manuscript, Department of Computer Science, University of Southern California, 1986.
[32] H. Gazit and G. L. Miller. A parallel algorithm for finding a separator in planar graphs. In 28th Annual Symposium on Foundations of Computer Science, IEEE, 238-248, Los Angeles, October 1987.
[33] H. Gazit. A deterministic parallel algorithm for planar graph isomorphism. In 32nd Annual Symposium on Foundations of Computer Science, IEEE, to appear, 1991.
[34] J. A. George. Nested dissection of a regular finite element mesh. SIAM J. Numerical Analysis, 10:345-363, 1973.
[35] A. George, M. T. Heath, J. Liu, E. Ng. Sparse Cholesky factorization on a local-memory multiprocessor. SIAM J. on Scientific and Statistical Computing, 9:327-340, 1988.
[36] J. A. George and J. W. H. Liu. An automatic nested dissection algorithm for irregular finite element problems. SIAM J. on Numerical Analysis, 15:1053-1069, 1978.

[37] J. A. George and J. W. H. Liu. Computer Solution of Large Sparse Positive Definite Systems. Prentice-Hall, 1981.
[38] J. R. Gilbert, J. P. Hutchinson, and R. E. Tarjan. A separator theorem for graphs of bounded genus. J. Algorithms, 5:391-407, 1984.
[39] J. R. Gilbert, G. L. Miller, and S.-H. Teng. Geometric mesh partitioning: Implementation and experiments. Technical Report, Xerox Palo Alto Research Center, to appear, 1992.
[40] J. R. Gilbert and R. E. Tarjan. The analysis of a nested dissection algorithm. Numerische Mathematik, 50(4):377-404, 1987.
[41] G. Hardy, J. E. Littlewood and G. Pólya. Inequalities. Second edition, Cambridge University Press, 1952.
[42] D. Haussler and E. Welzl. ε-nets and simplex range queries. Discrete & Computational Geometry, 2:127-151, 1987.
[43] J. P. Hutchinson and G. L. Miller. On deleting vertices to make a graph of positive genus planar. In Discrete Algorithms and Complexity Theory - Proceedings of the Japan-US Joint Seminar, Kyoto, Japan, pages 81-98, Boston, 1986. Academic Press.
[44] C. Jordan. Sur les assemblages de lignes. Journal Reine Angew. Math., 70:185-190, 1869.
[45] F. T. Leighton. Complexity Issues in VLSI. Foundations of Computing. MIT Press, Cambridge, MA, 1983.
[46] F. T. Leighton and S. Rao. An approximate max-flow min-cut theorem for uniform multicommodity flow problems with applications to approximation algorithms. In 29th Annual Symposium on Foundations of Computer Science, pp. 422-431, 1988.
[47] C. E. Leiserson. Area Efficient VLSI Computation. Foundations of Computing. MIT Press, Cambridge, MA, 1983.
[48] C. E. Leiserson and J. G. Lewis. Orderings for parallel sparse symmetric factorization. In 3rd SIAM Conference on Parallel Processing for Scientific Computing, 1987.
[49] R. J. Lipton, D. J. Rose, and R. E. Tarjan. Generalized nested dissection. SIAM J. on Numerical Analysis, 16:346-358, 1979.
[50] R. J. Lipton and R. E. Tarjan. A separator theorem for planar graphs. SIAM J. of Appl. Math., 36:177-189, April 1979.
[51] R. J. Lipton and R. E. Tarjan. Applications of a planar separator theorem. SIAM J. Comput., 9(3):615-627, August 1981.
[52] J. W. H. Liu. The solution of mesh equations on a parallel computer. In 2nd Langley Conference on Scientific Computing, 1974.
[53] N. Megiddo. Linear programming in linear time when the dimension is fixed. SIAM Journal on Computing, 12:759-776, 1983.
[54] G. L. Miller. Finding small simple cycle separators for 2-connected planar graphs. Journal of Computer and System Sciences, 32(3):265-279, June 1986.
[55] G. L. Miller and S.-H. Teng. Centerpoints and point divisions. Manuscript, School of Computer Science, Carnegie Mellon University, 1990.
[56] G. L. Miller, S.-H. Teng, and S. A. Vavasis. A unified geometric approach to graph separators. In 32nd Annual Symposium on Foundations of Computer Science, IEEE, pp. 538-547, 1991.
[57] G. L. Miller, S.-H. Teng, W. Thurston and S. A. Vavasis. Separators for sphere-packings and nearest neighborhood graphs. In progress, 1992.
[58] G. L. Miller, S.-H. Teng, W. Thurston and S. A. Vavasis. Finite element meshes and geometric separators. In progress, 1992.
[59] G. L. Miller and W. Thurston. Separators in two and three dimensions. In Proceedings of the 22nd Annual ACM Symposium on Theory of Computing, pages 300-309, Maryland, May 1990. ACM.
[60] G. L. Miller and S. A. Vavasis. Density graphs and separators. In Second Annual ACM-SIAM Symposium on Discrete Algorithms, pages 331-336, San Francisco, January 1991. ACM-SIAM.
[61] S. A. Mitchell and S. A. Vavasis. Quality mesh generation in three dimensions. Proc. ACM Symposium on Computational Geometry, pp. 212-221, 1992.

[62] M. S. Paterson. Tape bounds for time-bounded Turing machines. J. Comp. Syst. Sci., 6:116-124, 1972.
[63] V. Pan and J. Reif. Efficient parallel solution of linear systems. In Proceedings of the 17th Annual ACM Symposium on Theory of Computing, pages 143-152, Providence, RI, May 1985. ACM.
[64] A. Pothen and C.-J. Fan. Computing the block triangular form of a sparse matrix. ACM Transactions on Mathematical Software, 16(4):303-324, 1990.
[65] A. Pothen, H. D. Simon, K.-P. Liou. Partitioning sparse matrices with eigenvectors of graphs. SIAM J. Matrix Anal. Appl., 11(3):430-452, July 1990.
[66] J. H. Reif and S. Sen. Polling: A new randomized sampling technique for computational geometry. In Proceedings of the 21st Annual ACM Symposium on Theory of Computing, 394-404, 1989.
[67] E. Schwabe, G. Blelloch, A. Feldmann, O. Ghattas, J. Gilbert, G. Miller, D. O'Hallaron, J. Shewchuk and S.-H. Teng. A separator-based framework for automated partitioning and mapping of parallel algorithms in scientific computing. In First Annual Dartmouth Summer Institute on Issues and Obstacles in the Practical Implementation of Parallel Algorithms and the Use of Parallel Machines, 1992.
[68] H. D. Simon. Partitioning of unstructured problems for parallel processing. Computing Systems in Engineering, 2(2/3):135-148.
[69] G. Strang and G. J. Fix. An Analysis of the Finite Element Method. Prentice-Hall, 1973.
[70] S.-H. Teng. Points, Spheres, and Separators: A Unified Geometric Approach to Graph Partitioning. PhD thesis, School of Computer Science, Carnegie Mellon University, Pittsburgh, 1991. CMU-CS-91-184.
[71] C. Thomassen. Planarity and duality of finite and infinite graphs. Journal of Combinatorial Theory, Series B, 29:244-271, 1980.
[72] J. F. Thompson, Z. U. A. Warsi and C. W. Mastin. Numerical Grid Generation: Foundations and Applications. North Holland, New York, 1985.
[73] W. P. Thurston. The Geometry and Topology of 3-Manifolds. Princeton University Notes, 1988.
[74] W. T. Tutte. Convex representations of graphs. Proc. London Math. Soc., 10(3):304-320, 1960.
[75] W. T. Tutte. How to draw a graph. Proc. London Math. Soc., 13(3):743-768, 1963.
[76] J. D. Ullman. Computational Aspects of VLSI. Computer Science Press, Rockville, MD, 1984.
[77] P. Ungar. A theorem on planar graphs. Journal London Math. Soc., 26:256-262, 1951.
[78] P. M. Vaidya. Constructing provably good cheap preconditioners for certain symmetric positive definite matrices. IMA Workshop on Sparse Matrix Computation: Graph Theory Issues and Algorithms, Minneapolis, Minnesota, October 1991.
[79] L. G. Valiant. Universality considerations in VLSI circuits. IEEE Transactions on Computers, 30(2):135-140, February 1981.
[80] V. N. Vapnik and A. Ya. Chervonenkis. On the uniform convergence of relative frequencies of events to their probabilities. Theory Probab. Appl., 16:264-280, 1971.
[81] S. A. Vavasis. Automatic domain partitioning in three dimensions. SIAM J. Sci. Stat. Comp., 12:950-970, 1991.
[82] R. D. Williams. Performance of dynamic load balancing algorithms for unstructured mesh calculations. Technical Report, California Institute of Technology, 1990.
[83] F.-F. Yao. A 3-space partition and its application. In Proceedings of the 15th Annual ACM Symposium on Theory of Computing, ACM, 258-263, 1983.
[84] E. E. Zmijewski. Sparse Cholesky Factorization on a Multiprocessor. PhD thesis, Department of Computer Science, Cornell University, 1987.
STRUCTURAL REPRESENTATIONS OF SCHUR COMPLEMENTS
IN SPARSE MATRICES

STANLEY C. EISENSTAT* AND JOSEPH W. H. LIU†

Abstract. This paper considers effective implicit representations for the nonzero structure of a Schur complement in a sparse matrix. Each is based on a characterization of the structure in terms of paths in the graph of the matrix and/or its triangular factors. Three path-preserving transformations - quotient graphs, edge pruning, and monotone transitive reduction - are used to further reduce the size/cost.

1. Introduction. Let A be an n × n sparse matrix partitioned as

    A = ( A_{KK}   A_{KK̄} )
        ( A_{K̄K}   A_{K̄K̄} )

where the leading (k−1) × (k−1) principal submatrix A_{KK} is nonsingular and has an LU factorization. Assume that k−1 steps of Gaussian elimination have been performed on A, eliminating the rows/columns associated with A_{KK}. In other words, the matrix A has been factored as

    A = ( L_{KK}   0 ) ( U_{KK}   U_{KK̄} )
        ( L_{K̄K}   I ) ( 0        R_{K̄K̄} )

where L_{KK}U_{KK} is the LU factorization of A_{KK}, L_{K̄K} = A_{K̄K}U_{KK}^{-1}, U_{KK̄} = L_{KK}^{-1}A_{KK̄}, and R_{K̄K̄} = A_{K̄K̄} − A_{K̄K}A_{KK}^{-1}A_{KK̄}. The submatrix R_{K̄K̄} is known as the Schur complement of A_{KK} in A [4].
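As a concrete aside (ours, not part of the paper), this block factorization can be transcribed numerically in a few lines; the following is a minimal numpy sketch assuming a dense matrix with a nonsingular leading block, using 0-based indexing so that K corresponds to indices 0, …, k−2.

    import numpy as np

    def schur_complement(A, k):
        # K = {0, ..., k-2}, Kbar = {k-1, ..., n-1} in 0-based indexing,
        # matching the paper's K = {1, ..., k-1} and Kbar = {k, ..., n}.
        K = slice(0, k - 1)
        Kb = slice(k - 1, A.shape[0])
        # R = A[Kb,Kb] - A[Kb,K] inv(A[K,K]) A[K,Kb]; solve() avoids
        # forming the inverse explicitly.
        return A[Kb, Kb] - A[Kb, K] @ np.linalg.solve(A[K, K], A[K, Kb])

    A = np.array([[4., 1., 0.],
                  [1., 3., 2.],
                  [0., 2., 5.]])
    print(schur_complement(A, 2))  # Schur complement of the leading 1x1 block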

In this paper, we study effective ways to represent the nonzero structure of a Schur complement in a sparse matrix. Such structural representations are helpful in two important contexts. First, they can be used as data structures when designing efficient symbolic factorization schemes. Second, they can be used to provide structural information on the uneliminated portion of the matrix when determining good sparse matrix reorderings. Indeed, even in numerical factorization they can be helpful in the selection of pivots (without having to compute the numeric values of the entire Schur complement).
It is well known that the sparse factorization process can be modeled using a sequence of graph structures, called elimination graphs. Such sequences were described by Parter [9] for the symmetric case, and by Haskins and Rose [5] for the unsymmetric case. But the graph associated with the submatrix R_{K̄K̄} is simply the

* Department of Computer Science and Research Center for Scientific Computation, Yale University, New Haven, Connecticut 06520. The research of this author was supported in part by U. S. Army Research Office contract DAAL03-91-G-0032.
† Department of Computer Science, York University, North York, Ontario, Canada M3J 1P3. The research of this author was supported in part by Natural Sciences and Engineering Research Council of Canada grant A5509, and in part by the Institute for Mathematics and its Applications with funds provided by the National Science Foundation.

elimination graph after all of the nodes corresponding to A_{KK} have been eliminated. Interpreted in these terms, the purpose of this paper is to study effective ways to represent (directed and undirected) elimination graphs.
In §2, we introduce the matrix and graph notation used throughout the paper. In particular, we define notation for edges and paths in the directed graph associated with a matrix.
In §3, we give a number of characterizations of the Schur complement structure in terms of paths in the graph of the matrix and/or its triangular factors. These equivalent conditions are the basis for the structural representations described in later sections. We also provide an interesting interpretation of these conditions in terms of the leaves of certain fill trees.
In §4, we describe a class of structural Schur representations. In §5, we describe three path-preserving transformations - quotient graphs, edge pruning, and monotone transitive reductions - that can be used to further reduce the size/cost of these representations.
In §6, we describe several structural Schur representations using paths in the original graph. We review the use of quotients of strongly connected components by Pagallo and Maulino [8], and give new representations based on edge prunings of the original graph and the quotient graph. In §7, we describe similar representations using paths in the filled graph. In §8, we present several representations using paths in the graphs of the lower and upper triangular factors. We review transitive reductions, symmetric reductions, and path-symmetric reductions, and introduce the quotients associated with the latter two.

2. Notation

2.1. Matrix notation. Consider an n × n matrix M and two subscript sets I and J of {1, …, n}. We let M_{IJ} denote the submatrix of M including only the rows in I and the columns in J. Letting K = {1, …, k−1} and K̄ = {k, …, n}, the block notation introduced in the previous section conforms with this definition. We shall also use L_{•K} and U_{K•} to represent the first k−1 columns of L and the first k−1 rows of U:

    L_{•K} = ( L_{KK} )
             ( L_{K̄K} ),        U_{K•} = ( U_{KK}   U_{KK̄} ).

Note that the Schur complement R_{K̄K̄} is formed by an update to the original block submatrix A_{K̄K̄}. Let S_{K̄K̄} = −A_{K̄K}A_{KK}^{-1}A_{KK̄} be this Schur complement update. Thus, to compute the structure of R_{K̄K̄} = A_{K̄K̄} + S_{K̄K̄}, it is sufficient to obtain the structure of S_{K̄K̄}, which is just the Schur complement of A_{KK} in

    A_0 = ( A_{KK}   A_{KK̄} )
          ( A_{K̄K}   0      ).

To allow for uniform indexing into R_{K̄K̄} and S_{K̄K̄}, we introduce the matrices

    R = ( 0   0        )        S = ( 0   0        )
        ( 0   R_{K̄K̄} ),            ( 0   S_{K̄K̄} ).

FIG. 1. A sparse matrix example and its directed graph.

Henceforth, we shall also refer to R as the Schur complement and to S as the Schur complement update.

If A has an LU factorization, then we let F denote the filled matrix L + U. The corresponding blocks of F are defined accordingly; for example, we have F_{KK} = L_{KK} + U_{KK}, and F_{K•} = ( F_{KK}  U_{KK̄} ). We also introduce

    F_0 = ( L_{KK}   0 ) + ( U_{KK}   U_{KK̄} ) = ( F_{KK}   U_{KK̄} ) = ( F_{KK}   F_{KK̄} )
          ( L_{K̄K}   0 )   ( 0        0      )   ( L_{K̄K}   0      )   ( F_{K̄K}   0      ).

Note that F_0 is actually the filled matrix of

    ( A_{KK}   A_{KK̄}                   ) = ( L_{KK} ) ( U_{KK}   U_{KK̄} ).
    ( A_{K̄K}   A_{K̄K}A_{KK}^{-1}A_{KK̄} )   ( L_{K̄K} )

2.2. Graph notation. For an m-by-n rectangular matrix M, we let G(M) = (V, E) denote the associated directed graph. The nodes in G(M) are referred to by their corresponding row/column indices in M; that is, V = {1, 2, …, max{m, n}}. Edges are directed from row to column; that is, (r, c) ∈ E if and only if m_{rc} ≠ 0. We adopt the notation r →_M c to indicate a directed edge from r to c in G(M). Furthermore, we use the notation r ⇒_M c to indicate a path from r to c. By convention, x ⇒_M x for each node x. Figure 1 contains a 10 × 10 sparse matrix and its corresponding directed graph.

We often consider composite paths, such as r ⇒ i ⇒ c. If there is no restriction on the intermediate node i, then we use the abbreviated form r ⇒ · ⇒ c (this notation is due to John Gilbert).

3. Path characterizations of Schur complement structures

3.1. Equivalent path characterizations. The nonzero structure of the Schur complement R is simply the union of the structures of the original submatrix A_{K̄K̄} and the Schur complement update S. It is therefore sufficient to characterize the nonzero locations in S. We need the following "path theorem" of Rose and Tarjan [11] that characterizes nonzero locations in the factor matrices.

FIG. 2. The partial factorization of the matrix in Figure 1.

THEOREM 3.1. [11, Theorem 1] ℓ_{rc} or u_{rc} is nonzero if and only if there exists a path in G(A) from node r to node c going through a (possibly empty) subset of nodes in {1, …, m}, where m = min{r, c}.

This result can be used to prove the equivalence of a set of path characterizations of the structure of G(S).

THEOREM 3.2. Let r and c be the row and column subscripts respectively with r ≥ k and c ≥ k. The following conditions are equivalent:

(1) r →_S c (S-edge);
(2) there is a path from r to c in G(A_0) with at least one intermediate node, all intermediate nodes lying in K (A_0-path);
(3) there is such a path from r to c in G(F_0) (F_0-path);
(4) r →_{L_{•K}} · →_{U_{K•}} c (L-edge, U-edge);
(5) r ⇒_{L_{•K}} · ⇒_{U_{K•}} c (L-path, U-path);
(6) r →_{A_0} · ⇒_{U_{K•}} c (A_0-edge, U-path);
(7) r ⇒_{L_{•K}} · →_{A_0} c (L-path, A_0-edge).

Consider the matrix A in Figure 1. If the eliminated submatrix A_{KK} corresponds to the leading 6 × 6 diagonal block, then the partial factorization is as given in Figure 2. The structures of the corresponding filled matrix F_0 and Schur complement update S are shown in Figure 3.

The nonzeroes s_{7,10}, s_{8,9}, and s_{10,9} in the Schur complement update correspond respectively to the following paths in G(A_0):

    7 → 4 → 10,
    8 → 2 → 5 → 6 → 3 → 4 → 9,
    10 → 5 → 6 → 2 → 3 → 4 → 9.

These paths are not unique; for example, 8 → 1 → 3 → 4 → 9 is another path from 8 to 9, as is 10 → 5 → 4 → 9 from 10 to 9. Only s_{8,9} and s_{10,9} are fills in the Schur complement since the entry a_{7,10} is already nonzero.
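Condition (2) is also directly algorithmic. The sketch below is ours (the paper gives no code): it recovers the structure of S by a search from each node r ≥ k that is confined to the eliminated nodes K, and on the example above (with its adjacency supplied) it would report the pairs (7, 10), (8, 9), and (10, 9), among any others present.

    def schur_update_structure(adj, n, k):
        # adj: dict mapping each node (1..n) to the set of its successors
        # in G(A_0).  Returns the set of pairs (r, c), r, c >= k, for which
        # condition (2) holds: a path r -> ... -> c with at least one
        # intermediate node, all intermediates in K = {1, ..., k-1}.
        S = set()
        for r in range(k, n + 1):
            stack = [i for i in adj.get(r, ()) if i < k]   # must enter K first
            seen = set(stack)
            while stack:
                i = stack.pop()
                for j in adj.get(i, ()):
                    if j >= k:
                        S.add((r, j))          # left K: an S-edge (r, j)
                    elif j not in seen:
                        seen.add(j)
                        stack.append(j)
        return S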

FIG. 3. The structures of F_0 and S of the example in Figure 1.

FIG. 4. Two fill trees for s_{89} in the example of Figure 1.

3.2. Fill tree interpretation. There is an interesting interpretation of the various path characterizations given in Theorem 3.2 in terms of the leaves of certain trees.

Consider a nonzero ℓ_{rc} in the lower triangular factor L. Since it is nonzero, either a_{rc} is nonzero or there exists a k < min{r, c} such that ℓ_{rk} and u_{kc} are both nonzero. In the latter case, take ℓ_{rk} to be the left son of ℓ_{rc} and u_{kc} to be the right son. Repeating this recursively, we get a binary tree whose root is ℓ_{rc} and whose leaves are entries in A. We call this tree a fill tree for the nonzero entry ℓ_{rc}. Similar trees can be defined for nonzero entries in the upper triangular factor U and the Schur complement update S. Figure 4 gives two different fill trees for the nonzero s_{89}.

The leaves in the tree correspond to nonzero entries in A. Indeed, if the leaf set is listed from left to right, then we obtain an A_0-path as guaranteed by Theorem 3.2. For example, for the nonzero s_{89}, the two fill trees lead to two different A_0-paths:

    8 → 5 → 6 → 5 → 4 → 9,
    8 → 2 → 3 → 4 → 9.

Note that the set of leaves in any subtree of a fill tree corresponds to a path in the filled matrix F. Judicious choices of such subtrees lead to the various path characterizations of Theorem 3.2. For example, the "L-edge, U-edge", "A_0-edge, U-path", and "L-path, A_0-edge" paths for the nonzero s_{89} are illustrated in Figure 5. Here, we have used the first fill subtree of s_{89} from Figure 4.

FIG. 5. Fill subtrees for s_{89} to illustrate the path conditions.

Generic forms of the fill subtrees associated with the path conditions in Theorem 3.2 are shown in Figure 6.

FIG. 6. Generic forms of fill subtrees associated with path conditions.

4. Structural Schur representations. An obvious structural representation for the Schur complement R is the nonzero structure of R itself. During the course of sparse LU factorization, if the numerical values of the intermediate Schur complements are computed and stored, then such an explicit representation of R may be reasonable. But, if the values are not required, then it can be expensive in terms of storage and time due to fill-in. Some form of implicit structural representation may be preferable. Our objective is to find a sparse matrix structure

    X_0 = ( X_{KK}   X_{KK̄} )
          ( X_{K̄K}   0      )

that preserves the structure of the Schur complement update S; that is, one for which the nonzero structure of the Schur complement of X_{KK} in X_0 is the same as that of A_{KK} in A_0. We shall refer to X_0 as a structural Schur representation of A_0 or of S.

Condition (2) in Theorem 3.2 implies that the structure of S can be represented implicitly using paths in G(A_0); that is, we can choose X_0 to be A_0. No additional storage is required, and there is no cost associated with the construction of this representation.

Another obvious choice is the filled matrix F_0. There may be more nonzeroes in F_0 than in A_0 due to fill-in, but the locations of the nonzeroes in its Schur complement update will be identical to those of A_0.

Yet another possibility is the use of condition (4) in Theorem 3.2. Each nonzero in S corresponds to a path of length exactly two, with one edge from L_{K̄K} and another from U_{KK̄}. Therefore,

    X_0 = ( I         U_{KK̄} )
          ( L_{K̄K}   0      )

is a structural Schur representation of A_0.

Note that as far as the structure of G(S) is concerned, we are only interested in whether there is a path from r to c as given by the various conditions in Theorem 3.2. We are not concerned with the intermediate nodes along the path or the number of paths. Any structural representation of G(S) should be able to generate at least one path from r to c if and only if s_{rc} is nonzero.

We can measure the appropriateness of a structural representation of S based on the storage required, the time for constructing the representation, and the time for retrieving the structure. In most cases, we will want to retrieve the nonzero structure of a row or column of S.

5. Path-preserving transformations. Theorem 3.2 characterizes the structure of the Schur complement update S in terms of paths in the graphs of A_0, F_0, L_{•K}, and U_{K•}. In this section, we introduce three classes of path-preserving transformations that will be used in later sections to improve these structural Schur representations.

5.1. Node reductions by path-preserving quotients. Assume that i ⇒ j ⇒ i for two nodes i and j in G(A_{KK}); that is, there is a cycle in G(A_{KK}) containing both i and j. We can form a quotient graph by collapsing i and j. The nodes i and j are replaced by a new node ī. If x ≠ i and x ≠ j, then (x, ī) is an edge in the quotient graph if and only if either x → i or x → j; and (ī, x) is an edge in the quotient graph if and only if either i → x or j → x. Edges between other nodes are not affected. It is not hard to see that if r ≥ k and c ≥ k, then r ⇒ c if and only if there is a path from r to c in the quotient graph.

More generally, consider any partition {q_1, …, q_t} of the node subset {1, …, k−1} that satisfies i, j ∈ q_s (1 ≤ s ≤ t) only if i ⇒ j ⇒ i (that is, i and j have a common cycle). Then Q = {q_1, …, q_t, k, …, n} forms a partition of the node set. The following result is straightforward.

THEOREM 5.1. If r ≥ k and c ≥ k, then r ⇒_{A_0} c if and only if there is a path from r to c in the quotient graph of A_0 using the partition Q.

This path-preserving transformation will reduce the number of nodes, yet each path r ⇒ c will correspond to a (possibly shorter) path in the quotient graph.

It should be emphasized that we do not impose any maximal property on the partition. Indeed, the trivial partition consisting of single nodes satisfies the cycle condition. The amount of node reduction depends on how good the partition is. This will become clear from the examples in the following sections.
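Forming the quotient itself is mechanical once a partition with the cycle property is chosen. A minimal sketch (ours, illustrative only), representing each block by the frozenset of its members:

    def quotient_graph(edges, partition):
        # edges: iterable of directed edges (u, v); partition: list of blocks
        # (collections of nodes) covering all nodes, each block assumed to
        # satisfy the common-cycle property of this section.
        rep = {}
        for block in partition:
            b = frozenset(block)
            for v in block:
                rep[v] = b
        # Edges internal to a block disappear; all other edges are inherited.
        return {(rep[u], rep[v]) for (u, v) in edges if rep[u] != rep[v]}

    # Example call for the matrix of Figure 1 (E its edge list):
    # quotient_graph(E, [{1, 3, 4}, {2, 5, 6}, {7}, {8}, {9}, {10}])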

5.2. Path-preserving edge reductions. Edge pruning is another path-preserving transformation. Consider two distinct nodes i and j in G(M) for which i → j and i → · ⇒ j (that is, there is a path of length at least two from i to j in G(M)). Then the edge i → j can be pruned from G(M) without affecting the set of paths. More generally we can prune any set of such edges yet still preserve the set of paths.

Three types of extremal path-preserving edge reductions have appeared in the literature. A minimum equivalent digraph [7] is a smallest subgraph of G that preserves the path set of G. A minimal equivalent digraph is one that preserves the path set of G, yet no proper subgraph does so. A transitive reduction [1] is a graph that preserves the set of paths in G, yet no graph with fewer edges does so (transitive reductions need not be subgraphs of the original graph).

Finding a minimum equivalent digraph for a general directed graph is an NP-complete problem. On the other hand, finding a transitive reduction of a directed graph has the same time complexity as finding the transitive closure [1]. For a directed acyclic graph (or dag), the transitive reduction is unique and is the same as the minimum equivalent digraph [1]. This is useful since the graphs associated with triangular matrices are dags.
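For the dag case, the transitive reduction can be computed by deleting every edge for which a longer alternative path exists. The following is a simple quadratic-flavored sketch of ours, not the transitive-closure-based algorithm of [1]:

    def transitive_reduction_dag(adj):
        # adj: dict node -> set of successors of a dag.  An edge (u, v) is
        # redundant exactly when v is reachable from u by a path of length
        # at least two; for a dag the result is the unique transitive
        # reduction (= minimum equivalent digraph).
        def longer_path(u, v):
            stack = [w for w in adj.get(u, ()) if w != v]  # skip direct hop
            seen = set(stack)
            while stack:
                x = stack.pop()
                if x == v:
                    return True
                for w in adj.get(x, ()):
                    if w not in seen:
                        seen.add(w)
                        stack.append(w)
            return False
        return {u: {v for v in vs if not longer_path(u, v)}
                for u, vs in adj.items()}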

5.3. Monotone transitive reductions. Let LU be the sparse factorization of a matrix M (without pivoting). Theorem 3.1 characterizes the fills in terms of fill paths, that is, those paths from r to c in G(M) through subsets of nodes in {1, …, i−1}, where i = min{r, c}. A fill path of length at least two from r to c is thus a path r → · ⇒ c whose intermediate nodes all lie in J = {1, …, i−1}, where i = min{r, c}.

If r → c and there is such a fill path of length at least two, then we can prune this edge in G(M). We shall refer to this edge pruning as monotone transitive reduction. (Note that Rose [10] refers to the addition of an edge associated with a fill path as monotone transitive closure.) In terms of the factors L and U of M, we are pruning those edges r → c for which r ⇒_L · ⇒_U c.

Monotone transitive reductions were used in the symmetric case to obtain the skeleton graph/matrix [6], the smallest subgraph of the given graph that preserves the structure of its filled graph. This notion can be generalized to the unsymmetric case, and we shall refer to the pruned structure after all monotone transitive reductions have been performed on M as its skeleton matrix.
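Detecting the prunable edges reduces to a bounded reachability test; the following small sketch of monotone transitive reduction (ours, on an adjacency-set representation) confines the search to nodes below min{r, c}:

    def skeleton(adj):
        # adj: dict node -> set of successors (nodes are positive integers).
        # Edge (r, c) is pruned when a monotone fill path exists: a path
        # r -> ... -> c of length >= 2 with all intermediates < min(r, c).
        def fill_path(r, c):
            bound = min(r, c)
            stack = [i for i in adj.get(r, ()) if i < bound]
            seen = set(stack)
            while stack:
                i = stack.pop()
                for j in adj.get(i, ()):
                    if j == c:
                        return True
                    if j < bound and j not in seen:
                        seen.add(j)
                        stack.append(j)
            return False
        return {r: {c for c in cs if not fill_path(r, c)}
                for r, cs in adj.items()}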

FIG. 7. The directed graph G(A) of Figure 1 and its strong quotient digraph.

6. Structural Schur representations using paths in A_0. Theorem 3.2 is the basis of a number of structural representations for the Schur complement update S. Condition (2) says that the locations of nonzeroes in S are given by the set of paths in A_0 between nodes r and c greater than or equal to k. In this section, we consider ways to improve A_0 as a structural Schur representation by pruning nodes and edges in G(A_0) using the path-preserving transformations described in the last section.

6.1. Quotient digraphs by strongly connected components. For any partition of {1, …, k−1} with the cycle property, the corresponding quotient graph preserves the path set. One possibility, used by Pagallo and Maulino [8], is to use the strongly connected components¹ of the subgraph G(A_{KK}). In other words, each strongly connected component is reduced to one node in the quotient. We shall refer to the resulting graph as the strong quotient digraph.

In Figure 7, we display the directed graph G(A) of Figure 1 with K = {1, 2, 3, 4, 5, 6} as the set of eliminated nodes. Since there are two strongly connected components, {1, 3, 4} and {2, 5, 6}, in the eliminated subgraph G(A_{KK}), the quotient digraph coalesces these two components into supernodes. The paths in G(A_0) corresponding to the nonzeroes s_{7,10}, s_{8,9}, and s_{10,9} that were given in §3.1 can also be expressed in terms of paths through the quotients:

    Paths in G(A_0)                        Paths in quotient
    7 → 4 → 10                             7 → {1,3,4} → 10
    8 → 2 → 5 → 6 → 3 → 4 → 9              8 → {2,5,6} → {1,3,4} → 9
    10 → 5 → 6 → 2 → 3 → 4 → 9             10 → {2,5,6} → {1,3,4} → 9

Let Q(A) be the matrix obtained from A by coalescing each strongly connected component of G(A_{KK}). The matrix Q(A_0) is similarly defined. The following result follows directly from Theorem 5.1.
THEOREM 6.1. [8] The matrix Q(A_0) is a structural Schur representation of A_0.

¹ A strongly connected component of a directed graph is a maximal subgraph in which there is a path between any pair of nodes.

FIG. 8. Q(A) and Q(A_0) of the matrix in Figure 1.

In Figure 8, we display the corresponding Q(A) and Q(A_0) for the matrix in Figure 1. We use [1] to indicate the component {1, 3, 4} and [2] the component {2, 5, 6}.

Pagallo and Maulino [8] observe that the structure of Q(A) can be represented in-place using the nonzero structure of A. This is useful if we want a structural Schur representation that requires no more storage than required for the original matrix.
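The strong quotient digraph is a few lines on top of any strongly connected components routine. Here is a sketch of ours using networkx (a library choice of ours, not the paper's), which collapses only the components of the eliminated subgraph:

    import networkx as nx

    def strong_quotient(G, k):
        # G: nx.DiGraph on nodes 1..n; the eliminated set is K = {1,...,k-1}.
        # Each strongly connected component of the eliminated subgraph is
        # collapsed to a frozenset supernode; nodes >= k remain singletons.
        rep = {v: frozenset([v]) for v in G}
        eliminated = [v for v in G if v < k]
        for comp in nx.strongly_connected_components(G.subgraph(eliminated)):
            block = frozenset(comp)
            for v in comp:
                rep[v] = block
        Q = nx.DiGraph()
        Q.add_nodes_from(set(rep.values()))
        Q.add_edges_from((rep[u], rep[v]) for u, v in G.edges()
                         if rep[u] != rep[v])
        return Q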

6.2. Monotone transitive reductions and skeleton graphs. The quotient digraph approach improves a structural Schur representation by replacing all nodes in a component by a representative, thus removing some edges. This has the important effect of reducing the number of edges that must be considered when checking for the existence of a path. Now, we consider path-preserving edge prunings using monotone transitive reductions.

The skeleton matrix A⁻ of A (see §5.3) is obtained by pruning those edges in G(A) that correspond to fill edges; that is,

    a⁻_{ij} = 0        if there is a fill path of length at least two from i to j;
    a⁻_{ij} = a_{ij}   otherwise.

Thus G(A⁻) is the smallest subgraph of G(A) that preserves the filled graph of G(A). We call G(A⁻) the skeleton graph of G(A).

THEOREM 6.2. The skeleton matrix A_0⁻ is a structural Schur representation of A_0.

In Figure 9, we display A_0⁻ for the matrix in Figure 1. The symbol "∗" is used to represent a nonzero that has been pruned. For example, the nonzero a_{6,4} is removed due to the path 6 → 3 → 4 in G(F) or the path 6 → 2 → 1 → 3 → 4 in G(A). Every nonzero in S can be generated by a path in G(A_0⁻). For example, for 8 →_S 9, we have the path 8 → 2 → 5 → 4 → 9 in G(A_0⁻).

6.3. Path-preserving edge reductions. Since the skeleton graph G(A_0⁻) preserves the filled graph of G(A_0), it must also preserve the set of paths in G(A_0). But we are only interested in the structure of G(S), so it is not necessary for the representation to preserve all of the filled graph of G(A_0). This suggests that we explore other possible edge prunings.

FIG. 9. The skeleton matrix of the matrix in Figure 2.

FIG. 10. An example to show minimal/minimum equivalent digraph and transitive reduction (panels: original matrix, skeleton matrix, minimal equivalent, minimum equivalent, transitive reduction).

Three types of reduced graphs were discussed in §5.2: minimum equivalent digraphs, minimal equivalent digraphs, and transitive reductions. In general, they are all candidates for a structural Schur representation and are all different (see Figure 10).

6.4. Path-preserving reductions on quotient digraphs. In the last two subsections, we discussed the use of edge pruning on the original graph G(A) while preserving its set of paths. These techniques can also be applied to the quotient digraph and its corresponding matrix.

LEMMA 6.3. The graph of the quotient matrix Q(A_{KK}) is a directed acyclic graph (dag).
THEOREM 6.4. The transitive reduction of Q(A_0) is unique (and is the same as its minimum equivalent digraph).

In Figure 11, we show the quotient matrix Q(A_0) of Figure 8, its skeleton matrix, and the matrix associated with its minimum equivalent digraph.

It is interesting to note that Moyles and Thompson [7] compute a minimum equivalent digraph by first computing the transitive reduction of the quotient digraph based on strongly connected components, and then finding a minimum equivalent digraph for each strongly connected component. For our purpose here, we do not need the time-consuming second stage.

FIG. 11. A quotient matrix Q(A_0) and its reduced matrices (panels: the quotient Q(A_0), its skeleton matrix, and its minimum equivalent digraph / transitive reduction).

FIG. 12. Structural Schur representations using F_0.

7. Structural Schur representations using paths in F_0. Section 6 presents several structural Schur representations based on condition (2) in Theorem 3.2. Condition (3) of Theorem 3.2 is similar except that the partial filled graph G(F_0) is used instead of the original graph G(A_0). Therefore, the same path-preserving transformations can be applied. However, very little is gained.

THEOREM 7.1. Let i and j be less than k. Then i and j belong to the same strongly connected component of G(A_{KK}) if and only if they belong to the same strongly connected component of G(F_{KK}).
THEOREM 7.2. G(A_0⁻) is isomorphic to G(F_0⁻).

These results indicate that there is little difference between using A_0 or F_0 in terms of quotient and skeleton graphs. One notable difference is the possible use of condition (4) in Theorem 3.2 (as pointed out in §4). Since we can use paths of length exactly two from the filled matrix, we can replace the leading principal submatrix F_{KK} or its quotient by the identity matrix. In Figure 12, we display the two structural Schur representations using F_0 and its quotient for the matrix in Figure 1.

FIG. 13. L⁻_{•K} and U⁻_{K•} of the matrix in Figure 1.

Another difference is in the minimum equivalent digraphs of A_0 and F_0. For the example of Figure 10, it is easy to see that the transitive reduction of A_0 is a minimum equivalent digraph of F_0, and is slightly smaller than any minimum equivalent digraph of A_0.

8. Structural Schur representations using paths in L_{•K} and U_{K•}. Conditions (5)-(7) of Theorem 3.2 provide necessary and sufficient conditions for nonzeroes in the Schur complement update S in terms of paths in G(L_{•K}) and G(U_{K•}). Any quotients and/or edge prunings that preserve paths in these directed graphs can be used to define structural Schur representations.

8.1. Transitive reductions of G(L_{•K}) and G(U_{K•}). One approach is to remove all redundant edges in G(L_{•K}) and G(U_{K•}) while preserving their path sets [3]. Since triangular matrices correspond to dags, it follows from §5.2 that the transitive reductions of G(L_{•K}) and G(U_{K•}) are unique.

The transitive reduction L⁻_{•K} of L_{•K} can be defined formally as follows. For j < k,

    ℓ⁻_{ij} = 0        if there is a path of length at least two from i to j in G(L_{•K});
    ℓ⁻_{ij} = ℓ_{ij}   otherwise.

Thus, G(L⁻_{•K}) is the smallest subgraph that preserves the set of paths in G(L_{•K}). The transitive reduction U⁻_{K•} can be defined in a similar way.

THEOREM 8.1. The matrix

    X_0 = ( L⁻_{KK} + U⁻_{KK}   U⁻_{KK̄} )
          ( L⁻_{K̄K}             0       )

is a structural Schur representation of A_0.

In Figure 13, we show the transitive reductions of the lower and upper triangular factors of the matrix in Figure 1. Each nonzero pruned is indicated by a "∗". For example, ℓ_{8,1} is pruned because of the path 8 → 6 → 2 → 1 (or the path 8 → 5 → 6 → 4 → 1) in G(L_{•K}); and u_{5,9} because of the path 5 → 6 → 9 in G(U_{K•}).

It should be emphasized that this structural Schur representation is obtained by computing the transitive reductions of L_{•K} and U_{K•} independently. In general, this is quite different from a transitive reduction of F_0. For example, the path 2 → 3 → 1 is a path of length 2 in G(F_0). Therefore, the edge 2 → 1 can be pruned in a transitive reduction of F_0. However, it cannot be pruned in the transitive reduction of G(L_{•K}).

8.2. Symmetric reductions: Edge pruning and quotient digraphs. The transitive reductions L⁻_{•K} and U⁻_{K•} prune all redundant edges from G(L_{•K}) and G(U_{K•}). Pruning any subset of such edges will still result in a structural Schur representation (though not as tight). The idea of symmetric reduction [2] is to find a subset of edges that costs less to determine.

Symmetric reduction uses the structures of both G(L_{•K}) and G(U_{K•}) to identify the pruned edges. We define the symmetrically-reduced matrix F'_0 ≡ L' + U' as follows. For i and j with j < k and j < i,

    ℓ'_{ij} = 0        if s →_{L_{•K}} j →_{U_{K•}} s for some j < s < min{i, k};
    ℓ'_{ij} = ℓ_{ij}   otherwise;

    u'_{ji} = 0        if s →_{L_{•K}} j →_{U_{K•}} s for some j < s < min{i, k};
    u'_{ji} = u_{ji}   otherwise.

The edges pruned to form G(F'_0) are a subset of the edges removed to form G(L⁻_{•K}) and G(U⁻_{K•}) [2]. Therefore, we have the following result.

THEOREM 8.2. The symmetrically-reduced matrix F'_0 is a structural Schur representation of A_0.
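The pruning test involves only a scan for symmetric pairs; a minimal sketch of ours, over a set-of-positions layout of our own choosing:

    def symmetric_reduction(Lpos, Upos, k):
        # Lpos: set of nonzero positions (i, j), i > j, of L restricted to
        # columns in K; Upos: set of positions (j, i), j < i, of U restricted
        # to rows in K.  Prune l_ij and u_ji when some s with
        # j < s < min(i, k) has both l_sj and u_js nonzero.
        cols = {j for (_, j) in Lpos} | {j for (j, _) in Upos}
        sym = {j: [s for s in range(j + 1, k)
                   if (s, j) in Lpos and (j, s) in Upos]
               for j in cols}
        keepL = {(i, j) for (i, j) in Lpos
                 if not any(s < min(i, k) for s in sym[j])}
        keepU = {(j, i) for (j, i) in Upos
                 if not any(s < min(i, k) for s in sym[j])}
        return keepL, keepU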
Symmetric reduction prunes edges from the two factor matrices based on their symmetric nonzeroes, but it can also be used to reduce nodes by path-preserving quotients. We first define an equivalence relation. For i < j < k, i and j are said to be symmetrically-related if and only if i → i_1 → ⋯ → i_t → j and j → i_t → ⋯ → i_1 → i for some nodes i_1 < ⋯ < i_t. In other words, there is a symmetric (undirected) path from i to j in G(F_{KK}) through nodes in increasing order.

It is easy to verify that this relation is reflexive, symmetric, and transitive, and hence is an equivalence relation. We refer to the quotient defined by this relation as the symmetric quotient digraph/matrix, and let Q'(A_0) and Q'(F_0) denote the symmetric quotient matrices of A_0 and F_0 respectively. By definition, the partitions in the symmetric quotient matrix are finer than those in the strong quotient matrix. We have therefore the following result.

THEOREM 8.3. The symmetric quotient matrices Q'(A_0) and Q'(F_0) are structural Schur representations of A_0.

For the example of Figure 1, the symmetric quotient has three partitions: {1, 3, 4}, {2}, and {5, 6}. Figure 14 shows the symmetrically-reduced matrix F'_0 and its symmetric quotient matrices.

FIG. 14. F'_0, Q'(A_0), and Q'(F_0) of the matrix in Figure 1.

8.3. Path-symmetric reductions: Edge pruning and quotient digraphs. In an attempt to find a practical pruning that removes more nonzeroes than symmetric reduction, the authors [2] introduced the notion of path-symmetric reduction of the structures of G(L_{•K}) and G(U_{K•}). We define the path-symmetrically-reduced matrix F''_0 as follows. For i and j with j < k and j < i,

    ℓ''_{ij} = 0        if s ⇒_{L_{•K}} j ⇒_{U_{K•}} s for some j < s < min{i, k};
    ℓ''_{ij} = ℓ_{ij}   otherwise;

    u''_{ji} = 0        if s ⇒_{L_{•K}} j ⇒_{U_{K•}} s for some j < s < min{i, k};
    u''_{ji} = u_{ji}   otherwise.

The edges pruned in G(F''_0) are again a subset of the edges removed to form G(L⁻_{•K}) and G(U⁻_{K•}) [2]. Therefore, we have the following result.

THEOREM 8.4. The path-symmetrically-reduced matrix F''_0 is a structural Schur representation of A_0.

We can define a quotient based on the idea of path-symmetric reduction. For i < j < k, i and j are path-symmetrically-related if and only if s ⇒ i ⇒ s ⇒ j ⇒ s for some node s < k. It can be shown that this is an equivalence relation, and we use Q''(A_0) to denote the resulting quotient matrix of A_0.

THEOREM 8.5. Q''(A_0) is the same as the matrix Q(A_0) corresponding to the strong quotient digraph.

In Figure 15 we display the path-symmetrically-reduced matrix F''_0 and the matrix of the strong quotient graph for the example in Figure 1. Note that the quotient formed by the path-symmetric relation has two partitions: {1, 3, 4} and {2, 5, 6}, which are the same as the partitions from the strong quotient digraph.

9. Concluding remarks. In this paper, we have presented a number of ways to represent the structures of Schur complements in sparse matrices. The choice of a particular representation depends on the application.

FIG. 15. F''_0, Q''(A_0), and Q''(F_0) for the matrix in Figure 1.

For example, in a sparse matrix reordering context, some form of quotients (or its reduction) on the original matrix A_0 will be appropriate, especially since it can be stored in place within the structure of A_0. (Note that the quotient used can be the strong quotient digraph [8] or the symmetric quotient digraph in §8.2.) On the other hand, in a numerical factorization code, where we can make use of the computed structure of F_0 from the numerical phase, quotients and reductions on F_0 should be used.

REFERENCES

[1] A. V. Aho, M. R. Garey, and J. D. Ullman. The transitive reduction of a directed graph. SIAM J. Comput., 1:131-137, 1972.
[2] S. C. Eisenstat and J. W. H. Liu. Exploiting structural symmetry in sparse unsymmetric symbolic factorization. SIAM J. Matrix Anal. Appl., 13:202-211, 1992.
[3] J. R. Gilbert and J. W. H. Liu. Elimination structures for unsymmetric sparse LU factors. Technical Report CS-90-11, Department of Computer Science, York University, North York, Ontario, Canada, 1990.
[4] G. H. Golub and C. Van Loan. Matrix Computations. Johns Hopkins University Press, Baltimore, MD, 1983.
[5] L. Haskins and D. J. Rose. Toward a characterization of perfect elimination digraphs. SIAM J. Comput., 2:217-224, 1973.
[6] J. W. H. Liu. A compact row storage scheme for Cholesky factors using elimination trees. ACM Trans. Math. Software, 12:127-148, 1986.
[7] D. M. Moyles and G. L. Thompson. Finding a minimum equivalent graph of a digraph. Journal of the ACM, 16:455-460, 1969.
[8] G. Pagallo and C. Maulino. A bipartite quotient graph model for unsymmetric matrices. In V. Pereyra and A. Reinoza, editors, Numerical Methods, volume 1005 of Lecture Notes in Mathematics, pages 227-239. Springer-Verlag, 1983. Proceedings of the International Workshop held at Caracas, June 14-18, 1982.
[9] S. V. Parter. The use of linear graphs in Gauss elimination. SIAM Review, 3:119-130, 1961.
[10] D. J. Rose. A graph-theoretic study of the numerical solution of sparse positive definite systems of linear equations. In R. Read, editor, Graph Theory and Computing, pages 183-217. Academic Press, New York, 1972.
[11] D. J. Rose and R. E. Tarjan. Algorithmic aspects of vertex elimination of directed graphs. SIAM J. Appl. Math., 34:176-197, 1978.
IRREDUCIBILITY AND PRIMITIVITY OF PERRON
COMPLEMENTS: APPLICATION OF THE COMPRESSED
DIRECTED GRAPH*

CHARLES R. JOHNSON AND CHRISTOS XENOPHONTOS†

Abstract. The Perron complement is a smaller matrix derived in a natural way from a square nonnegative matrix. By compressing the directed graph of a nonnegative matrix in a certain way, we analyze fully the connected components, irreducibility and primitivity of its Perron complement with respect to a given subset of indices.

0. Introduction. The notion of Perron complement in a nonnegative (irreducible) matrix was introduced in [M] in the context of computing Perron vectors of finite state Markov processes. The notion was so titled because a Perron complement is somewhat analogous to a Schur complement. There are, however, nice algebraic relationships between Perron eigenvectors of a given matrix and those of Perron complements, given in [M]. Whether or not a Perron complement is irreducible or primitive was of interest in [M], and partial results and examples were given there. Our primary interest here is in presenting a complete description of the occurrence of irreducibility and primitivity. It turns out that this is an entirely combinatorial question in terms only of the location of positive entries in the original matrix. Not surprisingly, then, we introduce a combinatorial object, the compressed directed graph with respect to a subset of vertices, to deal with such questions. The compressed directed graph may well be of value in other contexts.

1. The compressed directed graph. Let N = {1, 2, …, n}; we denote the complement with respect to N of a subset α ⊆ N by α^c. For i_1, …, i_k ∈ N (repeats allowed) and a directed graph D on vertex set N, we denote a path consisting of edges (i_1, i_2), (i_2, i_3), …, (i_{k−1}, i_k) in D by p = p(i_1, i_2, …, i_k); a path p(i, j) in D is simply an edge of D. A path whose initial and terminal vertices are the same is called a circuit; we denote the circuit consisting of edges (i_1, i_2), …, (i_{k−1}, i_k), (i_k, i_1) by c = c(i_1, i_2, …, i_k). Thus c(i_1, i_2, …, i_k) is the same as p(i_1, …, i_k, i_1). The length of a path or circuit in D is measured in terms of the number of edges and denoted by ℓ(·). Thus, ℓ(p(i_1, i_2, …, i_k)) = k − 1, and ℓ(c(i_1, i_2, …, i_k)) = k. If we wish to emphasize the graph in which the path occurs, we use ℓ_D(·). We shall also need a special measure of length for a circuit in D relative to a subset β ⊆ N. We use ℓ_β(c) to denote the number of appearances of a vertex in β in the circuit c. Again, ℓ_{D,β}(c) emphasizes the graph. We note that repetitions are allowed and counted, and that a directed graph D may have loops, edges of the form (i, i), i ∈ N, so that a path or circuit may have a vertex repeated consecutively an arbitrary number of times. The greatest common divisor of an arbitrary, possibly infinite, list L of positive integers is denoted gcd(L).
For a directed graph D on vertices N and a subset α ⊆ N, we define the compressed
* This work supported in part by National Science Foundation grant DMS 92-00899 and by Office of Naval Research contract N00014-90-J-1739.
† Department of Mathematics, The College of William and Mary, Williamsburg, Virginia 23187-8795.

directed graph relative to α, denoted C_D[α], as follows. The vertex set of C_D[α] is α, and, for i, j ∈ α, there is a directed edge from i to j in C_D[α] if and only if there is a (directed) path in D from i to j, all of whose intermediate vertices, if any, are in α^c. For example, consider a directed graph D on vertex set {1, 2, 3, 4} and its compression C_D[{2, 3}].

[Figure: the example graph D and the compressed graph C_D[{2, 3}], with the inherited edge drawn solid and the new edges dotted.]

The edge (2, 3) already occurs in D and the dotted edges result from paths via {1, 4} in D: (3, 2) from the path p(3, 1, 2) and (3, 3) from the circuit c(3, 1). Of course, several different paths of the allowed type could result in the same edge in C_D[α].
It is important to note that a compressed directed graph C_D[α] inherits exactly the connectivity among its vertices that occurs in the graph D.
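The definition is a direct reachability computation. The following sketch of ours searches from each retained vertex through complement vertices only; the example graph in the comments is one graph consistent with the description in the text, not necessarily the one drawn in the original figure.

    def compress(adj, alpha):
        # adj: dict vertex -> set of successors of D; alpha: set of retained
        # vertices.  Builds C_D[alpha]: for i, j in alpha there is an edge
        # (i, j) iff some path i -> ... -> j has all intermediates in the
        # complement of alpha.  Self-loops (i, i) arising from circuits
        # through the complement are kept, as required here.
        C = {i: set() for i in alpha}
        for i in alpha:
            stack = list(adj.get(i, ()))
            seen = set(stack)
            while stack:
                v = stack.pop()
                if v in alpha:
                    C[i].add(v)
                else:
                    for w in adj.get(v, ()):
                        if w not in seen:
                            seen.add(w)
                            stack.append(w)
        return C

    # One graph consistent with the example above:
    # D = {1: {2, 3}, 2: {3}, 3: {1}, 4: set()}
    # compress(D, {2, 3}) == {2: {3}, 3: {2, 3}}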

LEMMA 1. If D is a directed graph on vertex set N and α ⊆ N, with i, j ∈ α, then there is a path from i to j in C_D[α] if and only if there is a path from i to j in D.

Proof. If there is a path p(s_1, …, s_k) in C_D[α] from i = s_1 to j = s_k, then for each edge (s_r, s_{r+1}) in p(s_1, …, s_k) there is a path p_r = p(s_r, t_{r1}, …, t_{rg_r}, s_{r+1}) in D, with t_{r1}, …, t_{rg_r} ∈ α^c. Concatenation of the paths p_1, p_2, …, p_{k−1} gives a path

    p(s_1, t_{11}, …, t_{1g_1}, s_2, t_{21}, …, t_{2g_2}, …, s_{k−1}, t_{(k−1)1}, …, t_{(k−1)g_{k−1}}, s_k)

in D, which shows that there is a path from i = s_1 to j = s_k in D.

Conversely, given a path p(t_1, t_2, …, t_m) from i = t_1 to j = t_m in D, mark all the vertices from α, say t_1 = t_{u_1}, t_{u_2}, …, t_{u_h} = t_m. As t_u, u_g < u < u_{g+1}, is then in α^c, we have that (t_{u_g}, t_{u_{g+1}}) is an edge of C_D[α], because there is a path in D from t_{u_g} to t_{u_{g+1}} via the α^c vertices t_{u_g+1}, …, t_{u_{g+1}−1} (if any), g = 1, …, h − 1. Thus p(t_{u_1}, …, t_{u_h}) is a path from i to j in C_D[α], as was to be shown. □
Given a sequence d of vertices from N and a subset α ⊆ N, we denote the subsequence of d lying in α by d[α]; this will be used when d is a path p or circuit c in a directed graph D on N. The following is then straightforward.

OBSERVATION. If c is a circuit in a directed graph D on N and β ⊆ N contains at least one vertex of c, then c[β] is a circuit in C_D[β] and

    ℓ_{D,β}(c) = ℓ_{C_D[β]}(c[β]).



We note that the "v-elimination graph" G_v of [RT] is the special case of our compressed directed graph C_D[α] in which α is the complement of a single vertex v. Furthermore, our C_D[α] may be obtained as (((G_{v_1})_{v_2}) ⋯)_{v_k} in which α^c = {v_1, …, v_k}; the order does not matter. The purpose in [RT] is to study "fill" in a symbolic version (no cancellation assumption) of Gaussian elimination, and nonzero diagonal entries are assumed (and self edges in their directed graphs are suppressed). Our purpose is to study irreducibility and primitivity in a context in which cancellation cannot occur. No assumption need be made about diagonal entries, and we need keep track of self-edges. Because of the formal similarity between Perron and Schur complements, it is not surprising that the compressed directed graph arises naturally in both contexts, and we suspect that the compressed directed graph will have further use in the study of elimination/complements.

2. The Perron complement relative to a subset of indices. For a nonnegative n-by-n matrix A, we denote the spectral radius, or Perron root, by ρ(A). For an arbitrary n-by-n matrix A and subsets α, β ⊆ N, we denote the submatrix of A lying in rows α and columns β by A[α, β]. The principal submatrix A[β, β] is abbreviated to A[β]. Entries of A[α, β] are indexed according to α and β. For an irreducible nonnegative n-by-n matrix A, the Perron complement P_A(β) relative to a proper subset β ⊆ N was defined in [M] by

    P_A(β) = A[β] + A[β, β^c](ρ(A)I − A[β^c])^{−1} A[β^c, β].

The irreducibility assumption insures that ρ(A[β^c]) < ρ(A), so that ρ(A)I − A[β^c] is invertible; in this event P_A(β) is nonnegative, and it is shown in [M] that ρ(P_A(β)) = ρ(A). It is clear that this formula for P_A(β) makes sense as long as ρ(A)I − A[β^c] is invertible; we take the formula as definition at this level of generality (i.e., A need not be irreducible). Recall that the spectral radius is monotone with respect to containment of principal submatrices of a nonnegative matrix, so that ρ(A[β]) ≤ ρ(A), even if A is reducible. The following useful facts are then clear.

OBSERVATION. If A is an n-by-n nonnegative matrix and β ⊆ N, then
(i) P_A(β) is well defined by the above formula if and only if ρ(A[β^c]) < ρ(A);
(ii) in this event P_A(β) is |β|-by-|β| and nonnegative; and
(iii) for any t > 0 and any β for which ρ(A[β^c]) < ρ(A), (1/t)P_{tA}(β) = P_A(β).
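A direct numerical transcription of the definition, as a numpy sketch of ours with the guard from observation (i):

    import numpy as np

    def perron_complement(A, beta):
        # A: nonnegative square numpy array; beta: sorted list of retained
        # indices (0-based).  Requires rho(A[beta^c]) < rho(A).
        n = A.shape[0]
        bc = [i for i in range(n) if i not in set(beta)]
        rho = max(abs(np.linalg.eigvals(A)))
        if bc:
            assert max(abs(np.linalg.eigvals(A[np.ix_(bc, bc)]))) < rho
            M = rho * np.eye(len(bc)) - A[np.ix_(bc, bc)]
            update = A[np.ix_(beta, bc)] @ np.linalg.solve(M, A[np.ix_(bc, beta)])
        else:
            update = 0.0
        return A[np.ix_(beta, beta)] + update

    # For irreducible A, rho(perron_complement(A, beta)) equals rho(A).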

The directed graph of an n-by-n matrix A = (a_{ij}) is, as usual, the graph on vertex set N with a directed edge from i to j if and only if a_{ij} ≠ 0 (including loops in case j = i). We denote it by D(A). For convenience, we often identify notions associated with the matrix A and the graph D(A), for example indices that identify rows or columns of A and vertices in the graph D(A).

Important to all our analysis of irreducibility and primitivity of P_A(β) is the fact that the directed graph of P_A(β) is exactly the compression of the directed graph D(A) relative to the vertex set β ⊆ N.

LEMMA 2. Let A be an n-by-n nonnegative matrix, and let β ⊆ N be such that ρ(A[β^c]) < ρ(A). Then

    D(P_A(β)) = C_{D(A)}[β].

Proof. Let i, j ∈ β. We must show that the i, j entry of P_A(β) is positive if and only if there is a path from i to j in D(A), all of whose intermediate vertices (if any) lie in β^c. As ρ(A) > 0 by hypothesis, we assume, without loss of generality, that ρ(A) = 1. As ρ(A[β^c]) < 1, then (I − A[β^c])^{−1} = I + A[β^c] + A[β^c]^2 + ⋯. Thus,

    P_A(β) = A[β] + A[β, β^c](I + A[β^c] + A[β^c]^2 + ⋯)A[β^c, β].

As the p, q entry of A[β^c]^k is nonzero if and only if there is a path in D(A), entirely via β^c vertices, of length k from p ∈ β^c to q ∈ β^c, we may, via matrix multiplication, observe the following. There is a path from i to j in D(A) with no intermediate vertices if and only if the i, j entry of A[β] is positive, with exactly one intermediate β^c vertex if and only if the i, j entry of A[β, β^c]A[β^c, β] is positive, and with exactly k + 1 intermediate β^c vertices if and only if the i, j entry of A[β, β^c]A[β^c]^k A[β^c, β] is positive, k = 1, 2, …. Since all terms are nonnegative (so that no cancellation is possible) and because of the expression for P_A(β), the proof is complete. □

3. Irreducible components of the Perron complement. Recall that an n-by-n matrix A is reducible if there is a proper subset β ⊆ N for which A[β, β^c] = 0; otherwise, if n ≥ 2, A is irreducible. The matrix A is irreducible if and only if the directed graph D(A) is strongly connected (there is a path in D(A) from any vertex to any other vertex). Any n-by-n reducible matrix A is permutation similar to an irreducible (or Frobenius) normal form

    ( A_{11}  A_{12}  ⋯  A_{1k} )
    (   0     A_{22}       ⋮    )
    (   ⋮       0     ⋱    ⋮    )
    (   0      ⋯      0   A_{kk} )

in which the n_i-by-n_i matrix A_{ii} is irreducible or n_i = 1, i = 1, …, k. Each A_{ii} is a principal submatrix A[α_i] of the original (prior to permutation similarity) matrix A. We refer to the index set α_i as an irreducible component of A. Of course the irreducible components α_1, α_2, …, α_k form a partition of N, and, in the event that A is irreducible, we say that A has only one irreducible component, α_1 = N. For each irreducible component, either |α_i| = 1, or A[α_i] is a maximal irreducible principal submatrix of A.

Our main result in this section is that the irreducible components of a Perron complement P_A(β) are naturally related to those of A. This is actually just a fact about strongly connected components of a compressed directed graph.

THEOREM 3. Let A be an n-by-n nonnegative matrix with irreducible components α_1, α_2, …, α_k and let β ⊆ N be such that ρ(A[β^c]) < ρ(A). The irreducible components of P_A(β) are then the nonempty sets among β ∩ α_1, β ∩ α_2, …, β ∩ α_k.

Proof. It is clear that the nonempty sets among β ∩ α_1, …, β ∩ α_k form a partition of β, which indexes the rows and columns of P_A(β). It suffices to show that

(1) if |β ∩ α_i| ≥ 2, then there is a path in D(P_A(β)) connecting any two vertices in β ∩ α_i; and
(2) if p ∈ β ∩ α_{j_1} and q ∈ β ∩ α_{j_2}, j_1 ≠ j_2, then there is not both a path from p to q and from q to p in D(P_A(β)).
But, since D(P_A(β)) = C_{D(A)}[β], by Lemma 2, and since any two vertices in β are connected in C_{D(A)}[β] if and only if they are in D(A) by Lemma 1, requirement (1) is met, as α_i is a connected component of A. On the other hand, again since D(P_A(β)) = C_{D(A)}[β] and connectivity of β vertices in C_{D(A)}[β] is equivalent to connectivity in D(A), requirement (2) is met because α_{j_1} and α_{j_2} are different connected components. □

Two corollaries of interest follow immediately from Theorem 3. It was a main result of [M] that P_A(β) is (either 1-by-1 or) irreducible whenever the nonnegative matrix A is. Of course, P_A(β) may be irreducible when A is not, which was not addressed in [M].

COROLLARV 4. Let A be an n-by-n nonnegative matrix and let ,8 ~ N, 1,81 ;::: 2, be


such that 'p(A[,8C]) < peA). Then, PA(,8) is irredueible iJ and only iJ,8 is eontained
in an irreducible eomponent oJ A.

COROLLARV 5. Let A be an n-by-n nonnegative matrix. Then PA(,8) is irreducible


(or l-by-l) Jor every nonempty subset ,8 ~ N iJ and only iJ A is irreducible.

4. Primitivity of Perron complements. An n-by-n nonnegative matrix A is


ealled primitive if some power A9 is positive. Irreducibility is necessary for primitivity,
but not sufficient. Whether or not an irreducible nonnegative matrix A is primitive is
also entirely combinatorially determined. In terms of D(A), one deseription focHses
upon the set of lengths of cireuits through a given vertex i. Let Lb(A) = {RD(A)(e) :
i E e}. We shall also need Lb(A),1l = {eD(A),Il(e) : i E e}, the set of alllengths, relative
to ,8 ~ N, of circuits passing through a given vertex i E N. The following fact may
be found, for example, in [HJ].

LEMMA 6. Let A be an n-by-n nonnegative matrix. Then A is primitive iJ and


only iJ A is irreducible and ged(Lb(A») = 1 Jor some i EN.

In [M] it wasnoted by example that a primitive matrix eould have a non-primit.ive


Perron complement and, eonversely, that a non-primitive matrix eould have a prim-
itive Perron eomplement. No explanation of exactly when primitivity oecurs was
given. It is elear that the primitivity of PA(,8) is entirely determined by CD(A)[i3] =
D(PA(,8)) by Lemma 2. By Lemma 6, primitivity is determined by LhD(A)LB]' which
is the same as LV(A),Il' We thus have

THEOREM 7. Let A be an n-by-n nonnegative matrix and let ,8 ~ N be such that


p(A[,8C]) < peA). Then, PA(,8) is primitive iJ and only iJ PA (,8) is irredueible and
ged(Lb(A),Il) = 1 Jor some i E ,8.
Proof First, reeall that RD.!1Ce) = RCDLB] (e[,8]) beeause each is equal to le[,8]I. Since
PA(,8) is primitive if and only if PA(,8) is irreducible and ged(Lb(PA(llll) = 1 for some
i E ,8 (by Lemma 6), and since D(PA(,8)) = CD(A)[,8] (by Lemma 2), it follows
that Lb(A),1l = Lb(PA(llll and ,that an irreducible PA(,8) is primitive if and only if
ged(Lb(A),Il) = 1 for some i E,8. D
106

5. Path product forrnula for the Perron cornplernent. For a path p =


p(i 1 ,i2 , ••• ,ik) in D(A), let IIAP = ai,i2ai2;. ···aik_lik, the path product of entries
from A corresponding to the edges of p. For (3 ~ N and i,j E (3, we may then define

SA,f3(i,j) = LIIAP,
.,,:il~i;ilc=i e
12"",'k-lE{3

the sum of all path products from A whose initial vertices are i, whose terminal
vertices are j and all of whose intermediate vertices (if any) are from (3c.
Analogous with the fact that edges in D(PA((3)) correspond to special paths in
D(A), we may give a path product formula for PA ((3) when A is normalized so that
p(A) = 1. Because of the homogeneity of PA ((3) (observation iii in Section 2), the
latter is no restriction.
THEOREM 8. Let (3 ~ N. If A is an n-by-n nonnegative matrix such that
p(A[(3C]) < p(A) = 1, then for i,j E (3, the i;j entry of PA ((3) is SA,f3(i,j).

Proof. Inspect further the expansion:

PA ((3) = A[(3] + A[(3, (3C]A[(3c, (3] + A[(3, (3C]A[(3C]A[(3c, (3]+


A[(3, (3C]A[(3c]2A[(3c, (3] + ....

By inspection, the i,j entry of A[(3,(3c]A[(3"]mA[(3c, (3] is

L IlA p(it, ... , i m +3 ), m = 0,1,2, ...


~:il=~jim+3=i
12t .•• t 1m+2E{3

Since the i,j entry of A[(3] is IIAP(i,j), summing results in the entry formuIa of
Theorem 8. D

REFERENCES

[HJ) R. Horn and C.R. Johnson, Matrix Analysis, Cambridge University Press, New York, 1985.
[M) C. Meyer, Uncoupling the Perron Eigenvector Problem, Linear Algebra and its Applications
114/115 (1989), 69-94.
[RT) D. Rose and R. Tarjan, Algorithmic Aspects ofVertex Elimination on Directed Graphs, SIA!\f
J. AppI. Math. 34 (1978), 176-197.
PREDICTING STRUCTURE IN
NONSYMMETRIC SPARSE MATRIX FACTORIZATIONS

JOHN R. GILBERT' AND ESMOND G. NGt

Abstract. Many eomputations on sparse matriees have a phase that predicts the nonzero struc-
tu re of the output, followed by a phase that actually performs the numerical eomputation. We study
structure prediction for c.omputations that involve nonsymmetric row and column permutations and
nonsymmetric or non-square matrices. Our tools are bipartite graphs, matchings, and alternat.ing
paths.

Our main new result concerns LU factorization with partial pivoting. We show that if a square
matrix A has the strong Hall property (i.e., is fully indecomposable) then an upper bound due to
George and Ng on the nonzero structure of L + U is as tight as possible. To show this, we prove a
crucial result about alternating paths in strong Hall graphs. The alternating-paths theorem seems
to be of independent interest: it can also be used to prove related results about structure prediction
for QR factorization that are due to eoleman, Edenbrandt, Gilbert, Hare, Johnson, Olesky, Pothen,
and van den Driessche.

Keywords: Gaussian elimination, partial pivoting, orthogonal factorization, matchings in bipartite


graphs, strong Hall property, strueture prediction, sparse matrix factorization.

AMS(MOS) subject c1assifications: 05C50, 05C70, 15A23, 65F05, 65F50.

1. Introduction. Many sparse matrix algorithms prediet the nonzero strueture


of the output of a computation before performing the computation itself. Knowledge
of the output strueture can be used to allocate memory, set up data structures,
schedule parallei tasks, and save time by avoiding operations on zeros. Usually the
output strueture is predieted by doing some sort of symbolic computation on the
nonzero strueture of the input; the aetual input values are ignored until the numerieal
computation begins.

This paper discusses structure predietion for orthogonal faetorization and for
Gaussian elimination with partial pivoting. These algorithms permute the rows and
columns of an input matrix nonsymmetrically: starting with a linear system (or least-
squares system) of the form Ax = b, they instead solve a system (pr ApC) (( PC) T x) =
(prb). Here pr and pc are permutation matrices; pr reorders the rows of A (the
equations), often for numerieal stability or for efficiency, and pc reorders the columns
of A (the variabies), often for sparsity. We are most interested in the case where pc
has already been chosen on grounds of sparsity.

* Xerox Palo Alto Research Center, 3333 Coyote Hill Road, Palo Alto, California 94304-1314
(gilbert@parc.xerox.com). This work was supported in part by the Christian Michelsen Institute,
Bergen, Norway, and by the Institute for Mathematics and Its Applications with funds provided by
the National Science Foundation. Copyright © 1992 by Xerox Corporation. All rights reserved.
t Mathematical Sciences Section, Oak Ridge National Laboratory, P.O. Box 2008, Oak Ridge,
Tennessee 37831-6367 (ngeg@ornl.gov). This author's work was supported by the Applied Mathe-
matieal Sciences Research Program of the Office of Energy Research, U .S. Department of Energy,
under contract DE-AC05-840R21400 and by the Institute for Mathematics and Its Applieations
with funds provided by the National Science Foundation.
108

Our main tools are bipartite graphs, matchings, and alternating paths. A match-
ing corresponds to a choiee of nonzero diagonal elements. Paths in graphs are impor-
Lant in many sparse matrix settings; the notion of alternating paths links matchings,
connectivity, and irreducibility. In this paper we highlight a partieular sort of ir-
reducibility called the strong Hall property: this generalizes the notion of strong
conneetivity (or irreducibility under symmetric permutations) to nonsymmetric per-
mutations and nonsquare matrices. It turns out that accurate strueture predietion is
easier for strong Hall matriees than for general matrices. Fortunately, a non-st rong-
Halllinear system is often most efficiently solved by decomposing it into a sequence
of strong Hall systems.
The next seetion gives definitions and background results, beginning with a def-
inition of exactly what we mean by strueture predietion. Seetion 3 discusses QR
factorization. Most of this seetion reviews earlier work, placing it in a framework
that can be used to study LU factorization as weIl. Seetion 3 also contains a new
tight symbolic result on columnwise orthogonal faetorization. Seetion 4 applies the
framework from Section 3 to LU factorization. It contains the main resuIts of the
paper, which are tight upper and lower bounds on where fill can occur during LU
factorization with partial pivoting. Both Seetions 3 and 4 conelude with remarks and
open probIems; Section 5 makes some final remarks.

2. Preliminaries. We begin this section by defining various kinds of strueture


predietion. We then discuss several graph-theoretic models of sparse matrix strueture.
We define so-called "strong Hall bipartite graphs," whieh model a useful elass of
fundamental matrices. We prove a crucial resuIt (Theorem 2.9) about matchings and
alternating paths in strong Hall graphs, which is the basis for the main resuIts in
the rest of the paper. Finally, we briefly review work on structure predietion for
symmetric and nonsymmetric Gaussian elimination without pivoting.

2.1. Symbolic and exact structure prediction. Suppose f is a funetion from


matrices to matriees, and :F is an algorithm that computes f(A) by applying elemen-
tary transformations (or elementary matrices) to A. The transformations of interest
to us are Gauss transforms (elimination steps), Givens rotations, Householder reflec-
tions, and row and column swaps. (See Golub and Van Loan [18] for detailed de-
scriptions of various elementary matrix transformations.) We will discuss two kinds
of structure predietion, which we call symbolic and exact.
Symbolic structure predietion models the effeet of algorithm :F by modeling the
effeet of each elementary transformation on the nonzero structure of a matrix. Each
elementary transformation is defined to produce zeros in certain positions: aGauss
transform or a Householder refleetion annihilates part of a column, a Givens rotation
annihilates a single element, and a swap interchanges the zeros in two rows or columns.
In symbolic strueture prediction we assume that no zeros are ever produced outside
those well-defined positions, whether because of numerieal coincidence or struetural
singularity. This "no-cancellation" assumption generally guarantees that we compute
an upper bound on the possible nonzero structure of f(A). (At least, it does so if
algorithm :F never makes choiees based on numerical comparison to zero.)
Symbolic strueture prediction can sometimes produce too generous an answer for
109

reasons that have nothing to do with numerical values. For example, consider an
algorithm that solves a nonsymmetric linear system Ax = b by forming the normal
equations ATAx = ATb and factoring the matrix ATA. If Ahas the stmcture

J
x x
x
x

then the symbolic approach wilI predict (correctly) that ATA is full, and then (incor-
rectly) that the factor of this full matrix is a full triangular matrix.

Even though the no-cancellation assumption may not be strictly correct, there
are situations in which symbolic structure prediction is the most useful kind. For
example, an algorithm may produce intermediate jill, or element s that are nonzero
at some point in the computation but zero in the final resulto (Using the normal
equations on the triangular matrix above is an example.) A symbolic prediction can
be used to, identify all possible intermediate filllocations, and thus to set up a static
data structure in which to carry out the entire algorithm. Also, even if an element
can be proved to be zero in exact arithmetic, it may not be computed as zero in
floating-point arithmetic; we may wish to use symbolic stmcture prediction to avoid
having to decide when such an element should really be considered to be zero.
Exact structure prediction, on the other hand, predicts the nonzero stmcture of
f(A) from that of A without regard to the algorithm that computes f(A). For each
input structure S, it yields the set of output positions that are nonzero for some choice
of input A having structure S. Thus the output of an exact structure prediction is

U{structure(J(A)) : structure(A) = S}.

In all the interesting cases that we know, this is equal to

U{structure(J(A») : structure(A) ~ S}.

An exact structure prediction for the normal equations algorithm on the triangular
input above is that the output has the same structure as the input.
If T is the exactly predicted stmcture of J on input stmcture S, then for each
nonzero position (i,j) of T there is some A (depending on i, j, and S) for which
[f(A)};j is nonzero. (We use [J(A)]ij to denote the (i,j) element of f(A).) This is
what we call a one-at-a-time resuit: it promises that every position in the predicted
structure can be made nonzero, but not necessarily all for the same input A. A
stronger result is an all-at-once result, saying that there is some single A depending
only on S for which f(A) has the structure T. Some functions J admit all-at-once
exact structure predictions and some do not. For example, we wilI see that if f(A)
is the upper triangular factor in QR factorization of astrong Hall matrix, then there
is an all-at-once exact prediction; but if f(A) is the upper triangular factor in LU
factorization with partial pivoting of astrong Hall matrix, then the tightest possible
exact prediction is only one-at-a-time.

Exact structure prediction depends only on the input structure, so numerical


coincidence can stilI produce unexpected zeros. For example, the exact structure
110

I'

0'
X X
X
3'
X X X
X X 0'
X X
X X 5'

0'

FIG. 1. A matrix A and its bipartite graph H(A).

prediction of the upper triangular factor of

1 1) = (1 0 0)o (1 1 1)
2 1 1 1 0 1 0
1 2 1 0 1 0 0 1

is that it is full, though in fact its (2,3) element is zero (for the particular choice of
numerical values).
A symbolic upper bound on structure is an exact upper bound, but not vice versa.
In each of Sections 3 and 4, we prove that an exact lower bound is equal to a symbolic
upper boundj it follows that the bound is tight both symbolically and exactly.

2.2. Graphs of matrices: Definitions. We assume the reader is familiar with


basic graph-theoretic terminologyj Harary [20] is a good general reference. We write
Gl ~ G2 to mean that graph Gl is a subgraph of graph G2 •
Suppose A is a matrix with m rows and n columns. We write [A]TC for the element
in the (r, e) position of A.
We will use three graphs to describe the nonzero structure of A. The bipartite
graph of A, which we write H(A), has m "row vertices" and n "column vertices."
The row and column vertices are drawn from two different copies of the positive
integers, which we distinguish by using primes on row vertex names. Thus the row
vertices are 1', 2/, ... , m', and the column vertices are 1, 2, ... , n. When a variable
names a vertex, we will use a prime for a row vertexj thus for example i is a column
vertex, and i' is the row vertex with the same number. The graph H(A) has an edge
(r', e) for each nonzero element [A]TC of A. Figure 1 is an example.
If m = n then A is square, and we also say that H(A) is square. In this case
the directed graph of A is the directed graph G(A) whose n vertices are the integers
1, ... , n, and whose edges are {(r, e) : r =f. e and [A]TC =f. O}. This graph does not
inelude self-Ioops, so we cannot teIl from G(A) whether or not the diagonal element s
of A are zero. Figure 2 is an example.
If m = n and in addition A is symmetric, then the edges of G(A) occur in sym-
metric pairs. An undirected graph with n vertices and one undirected edge for each
111

symmetric pair of off-diagonal nonzeros is often used to represent the structure of


a symmetric matrix. We will write this undirected graph as G(A), and we will not
distinguish between it and the directed graph of A. Figure 3 is an example.
The column intersection graph of an arbitrary m x n matrix A is the undirected
graph Gn(A) whose vertices are the integers 1, ... , n, and whose edges are ((i,j) :
3r with [A)T; =f. 0 and [A).; =f. O}. Thus the vertices of Gn(A) are the columns of A,
and an edge joins two vertices whose columns share a nonzero row in A. Uniess there
is numerical cancellation, Gn(A) is equal to G(ATA)j in all cases Gn(A) :2 G(ATA).
Figure 4 is an example.
Table 1 summarizes this notation, as well as some that is defined in later sections.
We allow both graphs and matrices as arguments to Gn and so onj thus for example
if H = H(A) then Gn(H) means the same as Gn(A).
If x is a vertex of graph G (bipartite, directed, or undirected), we write AdjG(x)
for the set of vertices y such that (x, y) is an edge of G. A walk is a sequence of edges
P = ((xo, Xl), (Xl, X2)' ... , (xp-I, x p)). We can also describe this walk by listing its
vertices, (xo, Xl, ••• , x p). The length of the walk is p. We count the empty sequence
as a walk of length o. A path is a walk in which all the vertices are distinet. We use
P[x;: Xj) to denote the portion of path P from Xi to Xj.l If P is a path from X to y,
and Q is a path from y to z, and y is the only vertex on both P and Q, then PQ is
a path from X to z.
The intermediate vertices of a path P are all its vertices except its endpoints. If
X is a vertex of G and S is aset of vertices of G, we write ReachG(x, S) to denote the
set of vertices y such that G contains a path from X to y with intermediate vertices
from S. In this case we also say that y is reachable from x through S. For a bipartite
graph H, we write ReacheolH(x, S) to mean the column vertices in ReachH(x, S).
The following trivial lemma relates paths in a bipartite graph and in its column
intersection graph.
LEMMA 2.1. Let H be a bipartite graph, and let Gn(H) be its column intersection
graph. For any subset e
of the columns of H, and for any two column vertices x and
y of H, there is a path in H from x to y whose intermediate column vertices all lie
in e if and only if y E ReachGn(H)(X, e).
Proo/. Immediate. D

1 When the graph G is bipartite or undireeted, P[x;: Xj) = «X;,Xi+l}, ... ,(Xj_bXj}) if i::; j,
=
and P[x; :Xj) «X;,X;_l}, ... ,(Xj+bXj}) ifi ~ j.

(:
x
X

X
X
X
X X
X J 'rxl'
FIG. 2. A nonsymmetric matrix A and its directed graph G(A).
112

x
x x '~'
x

x
x
x
x
x
x
;)
FIG. 3. A symmetric matrix A and its undirected graph G(A).

TABLE 1
Graphs associated with the matrix A.

H(A) bipartite graph of arbitrary matrix


G(A) direeted graph of square matrix
G(A) undireeted graph of square symmetric matrix
Gn(A) eolumn intersection graph of arbitrary matrix
G+(A) filled graph (direeted or undireeted) of square matrix
GMA) filled graph of eolumn interseetion graph of arbitrary matrix
HX(A) row merge graph (bipartite) of arbitrary matrix
GX(A) row merge graph (directed) of square matrix

2.3. Bipartite matching: Definitions. We briefly summarize some terminol-


ogy on matehings in bipartite graphs. Lovasz and Plummer [24] is a good general
referenee on matehingj some of our terminology is from Coleman, Edenbrandt, and
Gilbert [5]. Brualdi and Ryser [3, Chapter 4] is a good referenee on deeompositions
of bipartite graphs.

Let H be a bipartite graph with m rows and n eolumns. A matching on H is a


set M of edges, no two of whieh have a common endpoint. A vertex is covered or
matched by M if it is an endpoint of an edge of M. Clearly, no matehing ean have
more than min( m, n) edges. A matching is called column-complete if it has n edges,
and row-complete if it has m edgesj if m = n a matehing with n edges is also called
perfect. Not every bipartite graph has a eolumn-complete or row-eomplete matchingo
If M is a matehing on H, an alternating path (with respeet to M) is a path on
whieh every second edge is an element of M j an alternating walk is a walk on whieh
every secOlid edge is an element of M. Alternating paths and walks eome in two

x x
x
x x x
x x
x x
x x

FIG. 4. A matrix A, its column intersection graph Gn(A), and its lilled column intersection graph
G~(A).
113

Havors: an r-alternating path is one that follows matching edges from columns to
rows and non-matching edges from rows to columns; a e-alternating paih is one that
follows matching edges from rows to columns. The reverse of an r-alternating path
or walk is a c-aIternating path or walk. Suppose the last vertex of one alternating
walk is the first vertex of another. If the aIternating walks are of the same Havor,
their concatenation is an alternating walk of that Havor; if the walks are of opposite
Havors, their concatenation is not an alternating walk.

Suppose that P is an alternating path (of either Havor) from an unmatched vertex v
to a different vertex w. If the last vertex w on P is unmatched, or the last edge on P
belongs to M, then the set of edges M = M EEl P = (M UP) - (M n P) is another
matching; we say that M is obtained from M by alternating along path P. If w is
matched in M, then v is matched and w is unmatched in M, and IMI = IMI. If w
is unmatched in M, then both v and w are matched in M, and IMI = IMI + 1. In
the latter case we also call P an augmenting palh (with respeet to M). A dassical
resuIt of matching theory is that a maximum-sizematching can be construeted by
greedily finding augmenting paths and alternating along them.

A perfeet matching in the bipartite graphH = H(A) of a square matrix can


be thought of as a way to find a row permutation P for A so that the permuted
matrix PA has nonzero diagonal. Then alternating paths in H correspond to direeted
paths in G(P A).

LEMMA 2.2. Suppose Ahas a nonzero diagonai. The directed graph G(A) has
a path from vertex r to vertex e if and only if the bipartite graph H(A) has a path
from row rl to column e that is r-alte1'nating with respeet to the matching oJ diagonaI
edges (il, i).

Proo! Immediate. D

2.4. Hall and strong Hall bipartite graphs. A bipartite graph with m rows
and n columns has the Hall property if every set of k column vertices is adjacent to
at least k row vertices, for all 0 ::; k ::; n. Clearly a Hall graph must have m 2: n.
If a graph is not Hall, it cannot have a column-complete matching, because aset of
columns that is adjacent only to a smaller set of rows cannot all be matched. The
converse is a dassical faet about bipartite matchingo

THEOREM 2.3 (HALL'S THEOREM). A bipartite graph has a eolumn-eomplete


matching iJ and only iJ it has the Hall property. D

COROLLARY 2.4. IJ a matrix Ahas Jull column rank, then H(A) is Hall. Con-
versely, iJ H is Hall then almost all matriees A with H = H(A) have Jull eolumn
rank.

Proo! If H(A) is not Hall, then it has aset of columns with nonzeros in a smaller
number of rows; those columns must be linearly dependent. For the converse, let M
be a column-complete matching on Hand let R be the set of rows that are matched
by M. Consider any matrix A with H(A) = H. The submatrix of A consisting of
rows R and all columns is square. Its determinant is a polynomial in the nonzero
values of A. We claim that this polynomial is not identically zero: if the entries
corresponding to edges of M have the value one and all other entries are zero, the
114

submatrix is a permuted identity matrix and the determinant is ±1. The set of zeros
of a k- variable polynomial has measure zero in Rk, unIess the polynomial is identieally
zero. Thus the set of ways to fill in the values of A to make this submatrix singular
has measure zero. If the submatrix is nonsingular, then all the columns of A are
linearly independent and A has full column rank. 0
A bipartite graph with m rows and n columns has the strong Hall property if every
set of k column vertiees is adjacent to at least k + 1 row vertices, for all 1 :::; k < n. 2
It is easy to see that the strong Hall property implies the Hall property.
If the Hall property is a linear independence condition, the strong Hall property is
an irreducibility condition: any matrix that is not strong Hall can be permuted to a
block upper triangular form called the Dulmage-Mendelsohn deeomposition [3,24,29],
in which each diagonal block is strong Hall. 3 Linear equation systems and least-
squares problems whose matrices are not strong Hall can be solved by performing first
a Dulmage-Mendelsohn decomposition, and then a block backsubstitution that solyes
a system with each strong Hall diagonal block. Strong Hall matriees are therefore of
part.icular interest in sparse Gaussian elimination and least squares probIems.
Brualdi and Shader [4] and Coleman, Edenbrandt, and Gilbert [5] discuss prop-
erties of st rong Hall matrices. In the following result, an independent set is aset
of vertiees no two of which are adjacent; an independent set in a bipartite graph
corresponds to the rows and columns of a zero submatrix.
THEOREM 2.5 (BRUALDI AND SHADER [4]). A bipartite graph having m rows
and n :::; m eolumns is Hall iJ and only iJ it has no independent set oJ more than m
vertiees, and strong Hall iJ and only iJ it has no independent set oJ at least m vertiees
that includes at least one vertex Jrom eaeh part. 0
A square strong Hall matrix is often ealled Jully indeeomposable, meaning that
there is no way to permute its rows and columns into a block triangular form with
more than one block [3]. This gives the following (standard) resulto
THEOREM 2.6. Let H = H(A) be a square strong Hall graph. Then Jor all
row and eolumn permutations pr and pe, the directed graph G(pr Ape) is strongly
eonneeted. 0
We conelude this subsection by proving a theorem (Theorem 2.9) about strong
Hall matrices that is useful in several structure prediction results. The theorem first
appeared in a technical report by Gilbert [15]; other proofs have been given by Hare,
Johnson, Olesky, and van den Driessche [21] and Brualdi and Shader [4]. First we
need two technieal lemmas.
LEMMA 2.7. Let H be astrong Hall graph and let (r/,e) be an edge oJ H. Then
there is a eolumn-eomplete matehing that includes (r', e), and uniess (r', e) is the only
edge oJ H there is a eolumn-eomplete matehing that excludes (r', e).
2 This definition is from Coleman et al. [5]. Another definition that is sometimes used replaces
the bounds on k by 1 ~ k < mj the only dilference is that an m by n matrix with m > n and m - n
zero rows that is strong Hall by our definition is not strong Hall by the other definition. All the
resuits in Section 3 and Seotion 4 hold no matter which definition is used.
3 This assumes m ~ n. More generally, for any m and n, an m x n matrix can be permuted
to a block upper triangular form in which each diagonal block is strong Hall or has a strong Hall
transposeo
115

Proo! First, let H be H without vertiees r' and e and their ineident edges. We
show that H is Hall. Every nonempty set C of eolumns of H is a nonempty proper
subset of eolumns of H, and henee is adjaeent to at least ICI + 1 rows of H. This
indudes at least ICI rows of H. Therefore H is Hall and has a eolumn-eomplete
matehing. That matehing pius edge (r', e) is a eolumn-eomplete matehing on H.
Now assume that H has mare than one edge, and let fi be H without the single
edge (r', e). We show that fi is Hall. Any nonempty proper subset C of eolumns is
adjaeent to at least ICI + 1 rows in H, henee to at least ICI rows in fi. The same
argument works if C is the set of all eolumns and H has at least ICI + 1 nonzero rows.
If C is the set of all eolumns and H has exaetly ICI nonzero rows R, we argue as
follows: If r' were adjaeent only to e in H, then C - e would be adjaeent in H only to
the IC - el rows R - r ' ,4 eontradicting the fact that His strong Hall. Thus C must
be adjaeent in fi to all ICI rows.
Whether or not H is square, then, we eondude that fi is Hall. Thus fi has a
column-<;omplete matehing, whieh is a eolumn-complete matehing on H that exdudes
(r',e). D
LEMMA 2.8. If H is strong Hall and has more nonzero rows than columns, and
M is any column-complete matching on H, then from every row or column vertex w
of H there is a c-alternating path to some unmatched row vertex r' {which depends
on w and MJ.
Proo! This is a standard result on Dulmage-Mendelsohn deeompositioni we in-
du de a proof here only to be self-eontained. If w is an unmatehed row there is nothing
to proveo Otherwise, let C be the set of eolumns reaehable by e-alternating paths
from w. Then C is nonempty. Let R be the set of row vertiees adjaeent to vertiees
of C. Sinee H is strong Hall and has mare nonzero rows than eolumns, IRI is larger
than ICI. Thus there is sam e vertex r' in R that is not matehed to a vertex in C.
Suppase r' is adjaeent to e E C. The e-alternating path from w to e ean be extended
by edge (e, r') to r'. Now if r' were matched, it would be matehed to a vertex v
not in Ci but then there would be a e-alternating path from w to v, eontrary to the
definition of C. Therefore r' is the desired unmatehed row vertex. D

Finally we prove the main result about alternating paths in strong Hall graphs.

THEOREM 2.9 (ALTERNATING-PATHS THEOREM). Let H be astrong Hall


graph with at least two rows, let v be a column vertex of H, and let w be any row
or column vertex of H such that a path exists from v to w. Then H has a column-
complete matching relative to which there exists a c-alternating pa th from v to w {or,
equivalently, an r-alternating pa th from w to v J.
Proo! Sinee some of the vertiees in this proof ean be either row or eolumn vertiees,
we will not use primed variabiesi an unprimed variable may denote either a row or a
eolumn vertex.
If H is square, or if H has only as many nonzero rows as eolumns, then the theorem

4 If e is aset of vertices and e is a vertex, we use e- e to denote the set e- {e}.


116

. (.I-.-=~.-·.. .:
.T----<............ .
.-. j.. • •
r

. 'Õi l ! I
v
:: i i : I
I I

FlG. 5. Gase 1 of Theorem 2.9. The dashed edges are the matching M. P is the horizontal path
from v to w. The light dotted line shows pa th 'ii from u to r. Path P[v : u]'ii[u : x] is c-alternating
with respeet to M.

follows from Theorem 2.6 and Lemma 2.2.

Suppose that H has more nonzero rows than columns. If v = w there is nothing to
proveo Otherwise, by hypothesis there is at le'ast one path from v to W. By Lemma 2.7
there is a column-complete matching that omits the first edge on that path. (Note
that this edge is not the only edge of H since H has more nonzeros than columns.)
If P is a path from v to w and M is a column-complete matching that omits the
first edge on P, let u (dependent on P and M) be the last vertex on P such that
P[v : u] is alternating. Then P[v : u] is c-alternating. Among all such paths and
column-complete matchings, choose P and M such that the length of P[u : w] is
minimum.

If u = w the theorem holds. We shall assume u =f. tv and derive a contradiction.


Let t be the next vertex after u on P. Both the last edge of P[v: u] and the first edge
of P[u: w] (which is (u, t)) must be non-M edges, or else P[v :t] would be alternating.

Because P[v : u] is c-alternating and begins and ends with non-matching edges,
u is a row vertex and hence t is a column vertex. Let s be the vertex matched to t
in M, which may or may not be on P.

Lemma 2.8 implies that there is an unmatched row vertex rand a c-alternating
path n from u to r (possibly u = r). Now t is on path n
if and only if s is. There
are two cases.

Case 1. Both t and s are on n. In this case P[v: u]n[u: t] is a c-alternating walk
from v to t. Therefore there is a c-alternating path n from v to t. Let x be the last
vertex on P that is also on n (so x is on P[t: tv]). Then 15 = n[v: x]P[x: tv] is a path
from v to W. But this is a contradiction: J5 is c-alternating from v at least to x, and
15[x: w] is shorter than P[u: w]. This contradiets the choice of P. Figure 5 illustrates
this case.

Case 2. Neither t nor s is on n. In this case n = (s, t) (t, u) n is a c-alternating


path from s to r. Since r is an unmatched row and s is matched to t, M = M Ef) n
is a column-complete matchingo Path n is c-alternating with respeet to M.

Let x be the first vertex on P that is also on n (so x is on P[ v: u]), and let y be the
last vertex on P that is also on n (so y is on P[t :w]). Then J5 = P[v: x]n[x :y]P[y:w]
is a path from v to w. The path P[v : x] is c-alternating with respeet to both M
and M, because M and M agree on P[v : x]. Depending on whether x precedes or
117

.J---. · · ·
,--=---_..._ .._,

iQ; :Ji ~
\ ,,' I .,
• ~=~It---4I-~~---4I-~~~~.

~ ·I
FIG. 6. Case 2 of Theorem 2.9. The most eomplieated version is shown. The dashed edges are the
matching M. l' is the horizontal path from v to w. The light dotted line shows path R from s to r.
Path 1'[v : z]'R[z : y] is c-alternating with respeet to M Ell 'R. A simpler version, not shown, is if'R
does not interseet l' after u. Then z = u, y = t, and 1'[v : u](u, t} is e-alternating with respeet to
MEIl'R.

follows y on n, the path n[x : y] is c-alternating either with respeet to M or with


respeet to M, because M and M disagree on n. Therefore P[v: y] = P[v: x]n[x: y]
is c-alternaÜng either with respeet to M or to M. Figure 6 illustrates this case.
But this is a contradietion: With respeet to one of the column-complete matchings
M and M, we have shown that 15 is a path from v to w that is c-alternating from v
at least as far as y, and P[y: w] is shorter than P[u: w]. This contradiets the choice
of P and M, and finishes the proof of Theorem 2.9. D

2.5. Gaussian elimination without pivoting. We now briefiy review a graph-


theoretic model of LU factorization without row or column interchanges. The undi-
reeted version of this model is due to Parter [27] and was developed extensively by
Rose [30]; the direeted version was developed by Rose and Tarjan [31]. George and
Liu [11] is a good source for the undirected model. Gilbert [14] surveys these and
related results.
If G = G(A) is a directed or undi reet ed graph, we define the deficiency of a
vertex v of G as the set of edges

{(r, e) : v E Adja(r), e E Adja(v), and e 1. Adja(r)}.

The deficiency of v corresponds to the fill that occurs in A when the (v, v) element
is used as a pivot in Gaussian elimination. Therefore we can define a sequence of
elimination graphs Go, Gb ... , G n , where Go = G(A) and Gi is obtained from Gi-l
by adding the deficiency of vertex i (in Gi-t) and deleting vertex i and its incident
edges. Then Gi is the graph of the (n - i) x (n - i) Schur complement that remains
after eliminating the first i vertiees of A. This is in the symbolic sense--that is, it
ignores possible numeric cancellation. We define the filled graph of A, which we write
G+(A), as the n-vertex graph containing all the edges of all the Gi's. Thus we have
the following resulto
THEOREM 2.10. Suppose the square matrix A can be factored as A = LU without
row or column interchanges. Then G( L + U) ~ G+ (A) with equality unless there is
cancellation in the factorization. In other words, the filled graph contains edges for
all the nonzems of L and U. D,
118

If A is symmetric and G(A) is the undirected graph, then G+(A) is undireeted.


(Remember that we do not distinguish between an undirected graph and a directed
graph with symmetric pairs of edges.) Historically, filled graphs were studied first in
the undirected case, specifically for the Cholesky factorization of symmetric positive
definite matrices. The theory of undirected filled graphs, which are the same as
chordal graphs, is quite rich [19, 30].
We ean eharaeterize the structure of G+(A) in terms of paths in the graph of A,
without actually computing all the elimination graphs. In the following theorem,
the paths ean be interpreted as directed paths for nonsymmetric matrices and either
directed or undirected paths for symmetric matrices.
LEMMA 2.11 (ROSE, TARJAN, AND LUEKER [31, 32]). Let G be a directed or
undirected graph whose vertices are the integers 1 through n, and let G+ be its filled
graph. Then (x, y) is an edge oJ G+ iJ and only iJ there is a path in G Jrom x to y
whose intermediate vertices are all smaller than min(x,y). 0
Paths from x to y whose intermediate vertiees are all smaller than min(x,y) are
sometimes referred to as fill paths.
A graph that is often useful in nonsymmetric strueture prediction is the filled
column intersection graph of an arbitrary m x n matrix A. This graph, whieh we
write GMA), is just G+(Gn(A))j it is the n-vertex undirected filled graph of the
eolumn intersection graph of A. Figure 4 is an example. The graph GMA) is related
to the normal equationsj its structure is the symbolic resuIt of forming ATAand then
computing the Cholesky factor of that matrix. Section 3 diseusses the conditions
under which this symbolic structure prediction is exact.

2.6. Lemmas on exact structure in Gaussian elimination. In this final


subsection we prove some easy lemmas that take into aecount the values of the nonze-
ros in the matrix. These resuIts will be the building bloeks for the exact strueture
predictions in the rest of the paper.
LEMMA 2.12. Suppose A is square and nonsingular, and has a triangular Jactor-
ization A = LU without pivoting. Let r be a row index and e a column index oJ A,
and let K be the submatrix oJ A consisting oJ rows 1 through min(r, c) -1 and r, and
columns 1. through min(r,c) - 1 and c. Then [L + U]re is zero iJ and only iJ K is
singular.

Proof. Let s = min(r,c). Factor K = LKUK. Then [UK] •• = [U]re if r ~ c, and


[UK] •• = [L]re[U]ee if r > e, so [UK] •• is zero if and only if [L + U]re is zero. The
determinant of UK is the same as that of K, and the first s - 1 diagonal elements
of UK are the same as those of U, so [UK] •• = 0 if and only if K iuingular. 0
LEMMA 2.13. Suppose A is square and nonsingular, and has a triangular Jactor-
ization A = LU without pivoting. Suppose also that all the diagonal elements oJ A
except possibly the last one are nonzero, and that every square Hall submatrix oJ A is
nonsingular. Then G(L + U) = G+(A); that is, every nonzero predicted by the filled
graph oJ A is actually nonzero in the Jactorization.
119

Proof. Suppose (r,e) is an edge of G+(A). Then there is a fill path P from r to e
whose intermediate vertices are less than s = min(r, e).
L~t K be the submatrix of A mentioned in Lemma 2.12, consisting of rows 1
through s-I and r, and columns 1 through s-I and e. For convenience, call the
last row and column in K number r and e respectively instead of number s. Then
path P corresponds to a path in H(K) from row vertex r' to column vertex e, which
is r-alternating with respeet to the matehing M of edges (i', i).
Now M is one edge short of being a perfect matehing on K, because column e
and row r' are not matched. However P is an augmenting path with respect to M,
and therefore M Ee P is a perfect matehing on K. Since K has a perfect matching,
it is Hall; thus its determinant is nonzero by hypothesis, and [L + U]rc is nonzero by
Lemma 2.12. D
The hypothesis that Ahas nonzero diagonal in ,Lemma 2.13 is cruciaI. Brayton,
Gustavson, and Willoughby [21 gave the following counterexample in the case when
this hypothesis is not included. Let

J
x x

x x

Then the (4,3) entry in G+(A) is nonzero, but [L]4,3 = 0 regardless of the nonzero
values of A.
LEMMA 2.14. Suppose bipartite graph H has a perfect matching M. Let A be a
matrix with H(A) = H, such that [A]rc > n for (r', e) E M and 0 < [A]rc < 1 for
(r', e) f. M. If A is factored by Gaussian elimination with partial pivoting, then the
edges of M will be the pivots.

Proof. When the rows of matrix Aare permuted so that the edges of M are the
diagonal elements, the values chosen make the permuted matrix strongly diagonally
dominant. D

3. Orthogonal factorization. Let A be a matrix with m rows and n :::; m


columns, with full column rank n. In this section we consider the orthogonal fa.c-
torization A = QR, where Q is an m x m orthogonal matrix (that is, QTQ = I),
and R is an m x n upper triangular matrix with nonnegative diagonal entries. (All
the nonzeros of R are in the n x n upper triangle, so we will think of R as being
n x n.) This fa.ctorization is unique. It arises in least squares and other optimization
problems [18, 22].
To compute the QR factorization, A is transformed into R by multiplying it on
the left by a sequence of orthogonal transformations that annihilate nonzeros below
the main diagonaI. In most applications, Q is not computed explicitly: either the
orthogonal transformations are applied to a right-hand side at the same time as to
A, or else a description of the sequence is saved to be applied later.
At least two structure prediction problems are of interest here. First, what is the
nonzero structure of A at each step of annihilation? Second, what is the nonzero
120

strueture of R? The answer to the first question depends on the algorithm we use to
eompute the factorizationj the answer to the second does not.
In Seetion 3.1 below, we review work of George, Liu, and Ng on intermediate fill
during eolumn QR faetorization. We then give a new tight symbolic resuIt on eolumn
QR factorization. In Seetion 3.2, we survey several authors' work on predieting the
strueture of Rj in Seetion 3.3, we re-prove aresult of Coleman, Edenbrandt, and
Gilbert in a framework that relates it to the new results on LU faetorization in
Seetion 4. Finally, in Seetion 3.4, we brieily survey some related work.

3.1. Nonzero structure of A during annihilation. In this seetion we develop


a symbolie model of the eolumn Givens and Householder algorithms for redueing A
to upper triangular form. Our goal is a tight symbolie result, that is, an aeeurate
deseription of the nonzero strueture of A during the algorithm, under the assumption
that no eaneellation oeeurs.
The standard algorithms to eompute R from A multiply A on the left by a se-
quenee either of Householder reileetions or of Givens rotations [18]. Multiplieation
by a Householder reileetion reileets a veetor with respeet to a specified hyperplanej
a Householder reileetion ean be chosen to annihilate all but one of the entries of the
veetor. MuItiplieation by a Givens rotation rotates a veetor through a specified angle
in the plane of two specified eoordinate axeSj a Givens rotation ean be chosen to
annihilate any single ent ry of the veetor. We eonsider three aIgorithms to eompute R
from A in the sparse setting: row Givens, column Givens, and column Householder.
The sparse row Givens algorithm is due to George and Heath [8]. They first
predict the nonzero strueture of R, and set up a statie data strueture to hold R.
Then they annihilate nonzeros from one row of A at a time, proeessing eaeh row unt il
either it beeomes eompletely zero, or its strueture fits into an empty row of the static
data strueture. This approaeh is attraetive beeause only the data strueture for Rand
the storage for the rows of Aare needed in the annihilation process. Thus the only
strueture predietion neeessary is for R, as deseribed in Seetions 3.2 and 3.3.
The eolumn Givens and eolumn Householder algorithms both annihilate the sub-
diagonaI elements of one eolumn of A at a time. We will analyze them from the
symbolic point of view, that is, assuming that zeros are produeed only by intentional
annihilation and not by numerieal eancellation or coincidence. Define the sequence
A o, At, ... , An, where A o = A and Ai is the (m - i) x (n - i) submatrix remaining
to be processed at the end of step i of the annihilation. For convenience, the columns
of the (m - i) x (n - i) matrix Ai are labeled from i to n, and the rows of Ai are Ia-
beled from i to m. The matrix Ai is obtained from A i - I by annihilating the nonzeros
below the diagonal in column i of A;-I. The Givens algorithm uses one rotation for
each subdiagonal nonzero in column i of Ai-Ij the Householder aIgorithm uses one
reileetion to annihilate the entire column. The struetural effeets are closely related,
so we combine their descriptions.
Consider Givens rotations first. Suppose [Ai-I]ki is nonzero, k > i, and assume
that any nonzero [Ai-I]ji' i < j < k, has been annihilated. Then [Ai-Ihi will be
annihilated by a Givens rotation, whieh is construeted using [Ai-I];; and [Ai-I]ki. This
rotation replaces rows k and i by Iinear combinations of their old valuesj symbolically,
121

except for the (k, i) element, it replaces both their nonzero structures with the union
of their nonzero struetures. Thus the strueture of row k of Ai is the union of the
structures of those rows j of A i - l for which i :::; j :::; k and [Ai-l]ji i= o. Moreover, at
the end of step i, the structure of row i of Ai is the union of the structures of those
rows j of Ai- l for which i :::; j :::; m and [Ai-l]ji i= o.
Now consider (the row-oriented version of) Householder refleetions. The House-
holder reflection that annihilates the subdiagonal nonzeros of column i of Ai-I replaces
all the rows containing those nonzeros with linear combinations of their old values.
Symbolically, every row with a nonzero in column i of Ai-l has the same strueture
in Ai, namely the union of their original struetures in Ai-I.
In terms of struetures, the fundamental difference between Givens rotations and
Householder reflections is the number of rows participating in one reduetion operation.
In one Householder reduetion, all rows that have a nonzero in column i of Ai- l
participate in a reduction step, whereas in a Givens Teduetion, only a subset of those
rows are involved.
We now describe a bipartite graph model that George, Liu, and Ng [12] developed
to analyze the reduetion process using Givens rotations. Their model associates a
bipartite graph H i with the matrix Ai. We number the m - i row vertices of H i from
i+1 to m, and the n-i column vertices from i+l to n. The changes in the structure of
Ai due to the reduetion process are described in terms of transformations on the graph
H i • Because of the similarity between Givens reduetions and Householder reflections,
this model can be extended to cover both cases. We summarize these results below;
proofs can be found in the paper [12]. All these results are symbolic; theyassume
that zeros are introduced only by explicit annihilation, not by cancellation.
The following results contain a parameter p, which we introduce to cover both
of the column algorithms. We define p == r for Givens rotations, and p == m for
Householder reflections.
We begin by formalizing the symbolic effeet of annihilating one column, that is,
the relationship between Hi-l and H i . The four statements in the lemma below are
easily seen to be equivalent.
LEMMA 3.1.

• For r > i, AdjHi(r') ==


AdjHi_' (r'), iJ i fj. AdjHi_' (r'),
{
U{AdjHi_'(S'): i:::; s':::; p,i E AdjHi_' (s')} - {i}, otherwise.

• For r > i, AdjHi(r') == ReachCol Hi _, (r', {i, i', (i + 1)" ... ,p'}).
• For r > i, e E AdjHi(r') iJ and only iJ there exists a path oJ Zength 1 or 3
Jrom r' to e through {i,i',(i + 1),,··· ,p'} in H i- l .
• For r > i and e > i, c E AdjHi(r') iJ and only iJ either e E AdjHo(r') or Jor
some k:::; i, there is a path (r',k,s',c) in Hk-l with k:::; s:::; p.

We wish to charaeterize fill in terms of the structure of the original matrix. George,
122

1'~1
2'
2
3'
3
4'

FIG. 7. The converse of Theorem 3.e is not true.

Liu and Ng [12] provided upper and lower bounds on the structure of H õ, but neither
bound is tight. Their upper bound is as follows.
THEOREM 3.2. For r > i, AdjH.(r') ~ ReachCoIHo(r', {I"", i, 1'"" ,p'}). 0
Note that Theorem 3.2 provides only a .necessarv eondition for a fill element to
oeeur during the annihilation proeess. Figure 7 (from [12]) is an example showing
that Theorem 3.2 is not tight. There is a path (4',2, I', 3) in the graph H o, but it
is easy to verify that no zero element in A beeomes nonzero in reducing A to upper
triangular form by Givens rotations or Householder refleetions.
The George, Liu, and Ng lower bound is as follows.
THEOREM 3.3. Suppose that Ho contains a path (r',ct,r~,c2,r~, .. ·,ct,r~,c)
whose intermediate vertices are all in {I,,,,, i, 1',,,,, p'}. If Ch ~ r~ for k ~ t and
Ck+! ~ rk for k < t, then e E AdjH.(r'). D

Again Theorem 3.3 is a partial eharacterization of fill; it provides only a sufficient


condition. Figure 8 (also from [12]) shows that the condition in Theorem 3.3 is not
necessary. Consider the path (5',2, I', 1,4',3) in Ho. It does not satisfy the condition
in Theorem 3.3 and it is the only path from 5' to 3 in H o. However it is straightforward
to verify that 3 E AdjH2(5') when either Givens rotations or Householder reflections
are used.
We now provide a necessary and sufficient condition, in terms of paths in Ho,
for fill to occur in the symbolic orthogonal factorizations. As in the case of sparse
Gaussian elimination without pivoting, we define a dass of fill paths in Ho for sparse

I'
1
2'
2
3'
3
4'
4
5'

FIG. 8. The converse of Theorem 3.3 is not true.


123

orthogonal faetorizations: a path

in H o is a fill path for sparse Givens rotation or sparse Householder transformation


if either t = 0 or the following eonditions are satisfied.

1. ek < min (r' , e) and rk ::; p, for all k.


2. Let ep be the largest ek. Then there is some q with p ::; q ::; i such that
ep ::; r~ ::; p, and the three paths P[r' : ep], P[ep: r~], and P[r~ : e], are also fill
paths in H o.

By this definition, all edges in H o are also fill paths. The main new resuIt of this
seetion is the following, whieh generalizes the last statement of Lemma 3.1. It gives a
neeessary and suffieient eondition for a zero element of A to become nonzero at some
stage of the annihilation process, in the symbolie sense. The proof of the resuIt is an
easy induetion, and is omitted.
THEOREM 3.4. For r',e > l, e E AdjH;(r') iJ and only iJ there is a fill paih
joining r' and e in Ho . D
Consider the path (4 ' ,2, 1',3) in H o in Figure 7. Sinee it does not satisfy eondi-
tion (2), the (4,3) element of A will remain zero throughout the eomputation, whieh
is indeed the ease for either Givens or Householder. AIso eonsider the example in
Figure 8. Although the path (5',2,1',1,4',3) does not satisfy the eondition in The-
orem 3.3, it does satisfy eondition (2) above. Henee, the (5,3) element of A will
beeome nonzero at some point during the eomputation, assuming exaet numerieal
eaneellation does not oeeur.
Unfortunately, unIike the ease of sparse Gaussian elimination without pivoting,
there does not appear to be a simple and non- reeursive way to express the fill property.
Finally, we define a graph whose strueture eaptures all of the Hi for the ease
of Householder refleetions. The (bipartite) TOW merge graph of a matrix A whose
diagonal is nonzero, whieh we write HX (A), is the union of H j (by the Householder
interpretation) for 1 ::; i ::; n. Thus HX(A) has m row vertiees and n eolumn vertiees,
and is eonstrueted by the following process. Begin with the bipartite graph H(A),
whieh indudes all edges of the form (i', i) beeause Ahas nonzero diagonal. For
eaeh k from 1 to n, add an edge from eaeh row r' :::: k adjaeent to eolumn kto eaeh
eolumn e :::: k adjaeent to any such row. (In other words, take those rows at or below
row k with nonzeros in eolumn k, and merge the parts of their nonzero struetures at
or to the right of eolumn k.)
We also define a direeted version of the row merge graph. The bipartite row
merge graph HX(A) is a bipartite graph with m rows, n ::; m eolumns, and a eolumn-
eomplete matehing of edges (i', i). The (directed) TOW merge graph, whieh we write
GX(A), is the n-vertex direeted graph whose adjaeeney matrix has the strueture of
the first n rows of HX(A).
Theorems 3.2, 3.3, and 3.4 ean be translated into statements about HX(A). We
will need one of these later.
124

COROLLARY 3.5. IJ A is an m x n matrix with nonzero diagonal, m 2': n, and


(r', e) is an edge oJ the row merge graph HX(A), then there is a path in H(A) Jrom
row vertex r' to column vertex e whose intermediate column vertices are all numbered
less than miner', e).
Proof. Immediate from Theorem 3.2 or Theorem 3.4. D

3.2. Upper bounds on nonzero structure of R. If A has full column rank


and factorization A = Q R, it follows from the column Householder algorithm (and
the uniqueness of the factorization) that G(R) ~ GX(A). In this section we state and
prove a bound on the structure of R that seems weaker than this onej then we show
that if A is strong Hall then the weaker bound is tight, and hence in that case the
two bounds are the same.
If A = QR then ATA = RTQTQR = RTR. Thus (the upper triangular part
of) R is equal to the Cholesky factor of the normal-equations matrix ATA (which
is symmetric and positive definite). George and Heath [8] used this fact in their
implementation of sparse orthogonal factorization by Givens rotations. They predict
the structure of ATA to be the column intersection graph Gn(A), which has a nonzero
in position (i,j) whenever columns i and j of A have a common nonzero roWj then
they predict the structure of R to be G;!;(A), the symbolic Cholesky factor of that
structure.
We will derive this prediction as a corollary of a relationship between row merge
graphs and column intersection graphs. We prove this relationship for all of GX(A)
even though the structure of R concerns only the "upper triangle" of GX(A)j we will
need the more general version in Section 4. A similar result for square matrices can
be found in George and Ng [9].
THEOREM 3.6. IJ A is an m x n matrix with m 2': n and nonzero diagonal
elements, then GX(A) ~ G;!;(A).
Proof. Suppose (r,c) is an edge of GX(A). Then (r', e) is an edge of HX(A) with
r' ~ n. Let i = min(r, e) - 1. Then by Corollary 3.5 there is a path from r' to e
in H(A) whose column vertices are all numbered at most i. Since Ahas nonzero
diagonal, (r', r) is an edge of H(A). Thus H(A) contains a path between column
vertices r. and e, whose intermediate column vertices are all smaller than min(r, e).
Therefore (by Lemma 2.1), the column intersection graph Gn(A) contains a path
between vertices r and e, whose intermediate vertices are all smaller than min(r, e).
Thus (by Lemma 2.11), (r,e) is an edge of G;!;(A). D
COROLLARY 3.7 (GEORGE, HEATH, LIU, AND NG [8, 10, 13]). IJ A = QR is
the orthogonal Jactorization oJ a matrix with JuU column rank and nonzero diagonal,
then G(R) ~ G;!i(A). D
Corollary 3.7 says that the structure G;!i(A) of the Cholesky factor of ATA is an
upper bound on the structure of R. This upper bound may be an overestimate for
reasons that have nothing to do with the numerical values of the nonzeros of A. An
example is the upper triangular matrix in Section 2.1.
125

3.3. Lower bounds on nonzero structure of R. Colernan, Edenbrandt, and


Gilbert [5] showed that GMA) does not overpredict G(R) if the matrix A is st rong
Hall. We give a proof that is related to theirs, but (unlike thern) we use the alternating-
paths theorem explicitly, to highlight the similarity between this resuIt and Theo-
rem 4.5 on LU factorization.
The hypotheses of Theorem 3.8 do not indude a nonzero diagonal. This is because
both G( R) and Gt, (H) are independent of the row ordering of H, and since H is strong
Hall its rows can be permuted to make the diagonal nonzero.
THEOREM 3.8 (COLEMAN, EDENBRANDT AND GILBERT [5]). Let H be a bipar-
tite graph with the strong Hall property. Then there is a matrix A with jull column
rank and with H(A) = H, such that the orthogonal jactorization A = QR satisfies
G(R) = Gt,(H).
Proo! First we show that any single edge of Gt,(H) can be made nonzero by an
appropriate choice of A; then we show that there is one choice of A that makes all
those positions nonzero at once. We shall think of the entries of A that correspond
to edges of H as variabIes; a "choice of values for A" means an assignment to those
variabIes. Figure 9 illustrates the proof.

l'

l'
1 2 3 4 5
I' 1 1 2 5
3' 3., 2'

.'
3'
4'
1
1C 1)
2
3
1
1 2
1
s- s ·0
5' 1
6'

"/
FIG. 9. Example for Theorem 3.8. Graph H is shown in Figure 1. Its column intersection graph
and filled column intersection graph are shown in Figure 4. This figure shows the construction that
makes entry [R]₃₅ nonzero. At left, graph H̄ is the subgraph of H induced by column vertices 1
through r = 3 and c = 5, and all the row vertices. The dashed edges are a column-complete matching
M with respect to which there is a c-alternating path Q = (5, 5′, 2, 1′, 1, 3′, 3) from c to r. At center,
A is chosen to have ones in positions M and Q and zeros elsewhere. At right, K is the submatrix
of AᵀA consisting of rows and columns 1 through r − 1 = 2, as well as row r = 3 and column
c = 5. Matrix K is a permutation of a triangular matrix with nonzero diagonal and hence cannot
be singular.

Choose r and c with r < c ≤ n. Take an arbitrary m × n matrix A with factor-
ization QR, such that the first r columns of A are linearly independent. Now let K
be the submatrix of AᵀA consisting of columns 1 through r − 1 and c, and rows 1
through r. Lemma 2.12 applies to AᵀA (because AᵀA is positive definite), and says
that K is singular if and only if [R]rc, the entry in the (r, c) position of R, is zero.
Thus [R]rc is zero if and only if a certain polynomial prc in the nonzero entries of A
(namely the determinant of K) is zero.

We now show that if A is a matrix with H(A) = H and (r, c) is an edge of G⁺∩(H),
then the polynomial prc is not identically zero. (Note that prc has a variable for each
edge of H.) Let H̄ be the subgraph of H induced by all the row vertices and the
column vertices 1, 2, ..., r, and c. Lemma 2.11 says that there is a path P from c to r
in the undirected graph G∩(H) whose intermediate vertices are all smaller than r.
Thus P is also a path in G∩(H̄). By Lemma 2.1, there is a path in H̄ from column
vertex c to column vertex r.
Now H̄ is strong Hall because H is. Therefore the alternating-paths theorem
(Theorem 2.9) applies, and says that there is a column-complete matching M for H̄
and a path Q from c to r that is c-alternating with respect to M.
Choose the values of those nonzeros of A corresponding to edges of M ∪ Q to
be 1, and choose the values of the other "nonzeros" to be 0. Let us examine the r × r
submatrix K of AᵀA defined above. (For simplicity, we will call the last column of K
number c rather than number r; the last row of K is number r.) We claim that the
bipartite graph H(K) has exactly one perfect matching (or, equivalently, that K can
be permuted to a triangular matrix with nonzero diagonal). To prove this, we match
rows of K greedily to columns of K. Take a column j of K. If j is a vertex that
is not on path Q, then the only nonzero in column j of K is [K]jj, and we match
column j to row j′. If j is on Q, i′ is the vertex following j on Q, and k is the vertex
following i′ on Q, then [K]kj is nonzero and we match column j to row k′. (The last
vertex on Q is column r, which is not a column of K.) This is a perfect matching
on H(K). Its uniqueness follows by induction on the length of Q, the induction step
being the fact that column c of K has only one nonzero (because row c is not a row
of K).
This proves the claim that H(K) has exactly one perfect matching. Thus the
determinant of K is just the product of the nonzero values corresponding to elements
of that matching, and is itself nonzero. This shows that the polynomial prc is nonzero
for at least one point, that is, for at least one choice of values for A.
Now the set of zeros of a k-variable polynomial has measure zero in Rᵏ, unless
the polynomial is identically zero. Thus not only do values for the nonzero entries
of A exist that make prc and hence [R]rc nonzero, but almost all choices of values (in
the measure-theoretic sense) work. Therefore, almost all choices of values for A make
every [R]rc nonzero simultaneously. Furthermore, almost all of those choices include
no zero values; that is, for almost all such choices, H(A) = H as desired. Finally, we
observe that we can choose A to have full rank n: for some n × n submatrix of A there
is a choice of values that gives nonzero determinant (namely, ones for the elements of
a column-complete matching of H and zeros elsewhere), and hence almost all choices
of values make that submatrix nonsingular. □
COROLLARY 3.9. If H is strong Hall and has nonzero diagonal, then the upper
triangular parts of G×(H) and G⁺∩(H) are equal.
Proof. By Theorem 3.6 and its corollary we have G(R) ⊆ G×(H) ⊆ G⁺∩(H) for
any A = QR with H(A) = H. If we choose A as in Theorem 3.8, the first and third
graphs are equal, and hence the second and third are also equal. □
COROLLARY 3.10. If H is strong Hall and has nonzero diagonal, then there is
a matrix A with full column rank and with H(A) = H, such that the orthogonal
factorization A = QR satisfies G(R) = G×(H). □
3.4. Remarks on orthogonal factorization. Theorem 3.8 gives a tight pre-
diction of the structure of R in QR factorization, in the exact sense, provided that A
is strong Hall. Recently, Hare, Johnson, Olesky, and van den Driessche [21] extended
this result significantly by giving a tight exact characterization of the structures of
both Q and R, under the weaker assumption that A is Hall, that is, that A is struc-
turally of full column rank. The Hare et al. characterization uses a notion called "Hall
sets," which concerns strong Hall submatrices of A and is related to the Dulmage-
Mendelsohn decomposition of H(A). Hare et al. proved that their structure prediction
was one-at-a-time exact; Pothen [28] then showed that in fact it is all-at-once exact.
Both Hare et al. and Pothen used versions of the alternating-paths theorem in their
work.
Theorem 3.4 gives a tight prediction of the structure of Aᵢ at each step of column
QR factorization, in the symbolic sense. This prediction is not tight in the exact
sense; see Coleman et al. [5] for an example. It is an open problem to give a tight
exact structure prediction for each Aᵢ in column factorization. The techniques of
Hare et al. [21] are probably relevant here.
Recently, Ng and Peyton [26] investigated the structure of the so-called matrix of
Householder vectors. This is a representation of Q in which the vector that generates
the i-th Householder reflection is stored in place of the i-th column of Q. Ng and
Peyton gave a tight exact prediction of the structure of this matrix in the case that A
is either strong Hall or has its columns permuted according to a Dulmage-Mendelsohn
decomposition.
Givens rotations can be used to introduce zeros in other orders than row by row
or column by column; examples are reductions of symmetric sparse matrices to tridi-
agonal form [33] and the Jacobi algorithm for finding eigenvalues [18]. Little work
exists on structure prediction for such problems. For example, it would be interesting
to prove upper and lower bounds on the work required to tridiagonalize a symmetric
matrix A by Givens rotations, in terms of the structure G(A).

4. LU factorization with partial pivoting. Let A be a nonsingular n × n
matrix. The triangular factorization A = LU does not always exist, and is not
always numerically stable when it does exist [18, Chapter 3]. Thus some form of row
or column interchanges is needed in Gaussian elimination; at each step, a nonzero
must be brought into the pivotal position before elimination.
In the dense setting, the pivot is usually chosen as the element of largest magnitude
in the current column (partial pivoting) or in the entire uneliminated matrix (complete
pivoting). In the sparse setting, there are several strategies for choosing pivots to
combine stability and sparsity. Some variations of complete pivoting choose a pivot
at each step to minimize operation count from among candidates that are not too far
from maximum magnitude [6]. Another approach is to preorder the matrix columns
purely to preserve sparsity, and then use partial pivoting to reorder the rows for
stability [13, 16].
This section parallels Section 3 in outline. In Section 4.1, we review a graph
model of Gaussian elimination with row and column interchanges, and we prove some
results on the structure of the matrix during elimination. These results are symbolic;
that is, they assume that zeros are introduced only by explicit elimination, not by
cancellation. In Section 4.2 we give upper bounds on the structure of the factors L
and U obtained by Gaussian elimination with row interchanges. In Section 4.3, we
give an exact lower bound on L and U. This result is tight (that is, best possible)
and is the main new result of this paper. We conclude the section with remarks and
open problems.

We write LU factorization with row and column interchanges as follows:

    A₀ = A,
    A₁ = L₁⁻¹ P₁ʳ A₀ P₁ᶜ,
    A₂ = L₂⁻¹ P₂ʳ A₁ P₂ᶜ,
      ⋮
    Aₙ₋₁ = Lₙ₋₁⁻¹ Pₙ₋₁ʳ Aₙ₋₂ Pₙ₋₁ᶜ = U.

Here Pᵢʳ is an n × n elementary permutation matrix corresponding to the row inter-
change at step i, Pᵢᶜ is an n × n elementary permutation matrix corresponding to the
column interchange at step i, Lᵢ is an n × n elementary lower triangular matrix whose
i-th column contains the multipliers at step i, and U is an n × n upper triangular
matrix. Since each elementary permutation matrix (Pᵢʳ or Pᵢᶜ) is its own inverse, we
can write the final factorization as

(1)    A = P₁ʳ L₁ P₂ʳ L₂ ⋯ Pₙ₋₁ʳ Lₙ₋₁ U Pₙ₋₁ᶜ ⋯ P₂ᶜ P₁ᶜ.

We define L as the n × n matrix whose i-th column is the i-th column of Lᵢ, so
that L − I = Σᵢ (Lᵢ − I). Note a subtle point about L: we can also think of Gaussian
elimination as computing a factorization PʳAPᶜ = L⁰U, but this L⁰ is not the same
as L. The two matrices are both unit lower triangular, and they contain the same
nonzero values, but in different positions; L⁰ has its rows in the order described by
the entire row pivoting permutation, while L has the rows of its i-th column in the
order described by only the first i interchanges. The matrix L is essentially a data
structure for storing L⁰; either can be used in solving systems of equations. The
structure prediction results in Sections 4.2 and 4.3 below will be about L, not L⁰.

Note also that our notation is slightly different than in the previous section: now
Aᵢ is always n × n, not (n − i) × (n − i).

4.1. Nonzero structure of A during elimination. In this subsection we de-
velop a symbolic model of Gaussian elimination with row and/or column interchanges.
The model is based on that of Golumbic [19] and Gilbert [17]. Theorem 4.2 is new.

Let H₀ = H(A) be the bipartite graph of A = A₀. Assume [A₀]rc is nonzero and
is chosen as pivot at step 1. Define the deficiency of the edge (r′, c) of H₀ to be the
set of edges

    {(i′, j) : c ∈ Adj_H₀(i′), j ∈ Adj_H₀(r′), and j ∉ Adj_H₀(i′)}.


We obtain the bipartite graph H₁ of the (n − 1) × (n − 1) submatrix that remains
after eliminating (r′, c) as follows: delete from H₀ vertices r′ and c and all edges
incident on them, then add the edges in the deficiency of (r′, c). The edges in
the deficiency of (r′, c) correspond to the zero elements of A₀ that become nonzero
when [A₀]rc is eliminated. (Note that the labelling of the vertices of H₁ refers to
the labelling in the original matrix A₀.) Thus, given a sequence of pivot elements
(r₁′, c₁), (r₂′, c₂), ..., (rₙ₋₁′, cₙ₋₁) (some of which may be fill edges), we can follow the
recipe above to construct a sequence of bipartite graphs H₀, H₁, ..., Hₙ, where Hᵢ
describes the structure of the (n − i) × (n − i) Schur complement remaining after
step i.
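As an illustration, the following minimal Python sketch (ours; H is represented as a dict mapping each row vertex to its set of adjacent column vertices) carries out one step of this symbolic bipartite elimination:

def eliminate(H, r, c):
    # Pivot on the edge (r, c) of the bipartite graph H: delete row r
    # and column c and their incident edges, then add the deficiency
    # edges, i.e., every remaining row adjacent to c becomes adjacent
    # to every column adjacent to r.
    assert c in H[r]
    H1 = {}
    for i, cols in H.items():
        if i == r:
            continue
        merged = cols | H[r] if c in cols else cols
        H1[i] = merged - {c}
    return H1

For example, with H0 = {1: {1, 2}, 2: {1, 3}, 3: {2, 3}}, eliminate(H0, 1, 1) yields {2: {2, 3}, 3: {2, 3}}; the new entry (2, 2) is the fill edge from the deficiency of (1′, 1).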
It is possible to prove bipartite versions of several of the results from Section 2.5.
We will use the following lemma in the exact lower bound proof later in this section.
LEMMA 4.1. Let A be a square matrix, and let M be a perfect matching on H(A).
Let H₀, ..., Hₙ be the sequence of bipartite elimination graphs described above, when
elimination is carried out by pivoting on the edges of M. If (r′, c) is a non-matching
edge of Hᵢ, then there is a path from r′ to c in H(A) that is r-alternating with respect
to M, and whose intermediate vertices are all endpoints of edges of M eliminated at
or before step i.

Proof. We induct on the smallest i such that (r′, c) is an edge of Hᵢ. If i = 0 then
(r′, c) itself is the path. Otherwise, (r′, c) is in the deficiency of the matching edge
(rᵢ′, cᵢ) in Hᵢ₋₁, so edges (r′, cᵢ) and (rᵢ′, c) are non-matching edges of Hᵢ₋₁. Applying
the induction hypothesis to those edges, we get r-alternating paths P from r′ to cᵢ
and Q from rᵢ′ to c in H(A). Then P(cᵢ, rᵢ′)Q is an r-alternating walk from r′ to c in
H(A) whose intermediate vertices are all eliminated at or before step i. Thus there
exists an r-alternating path with the same property. □
One interesting fact about symbolic bipartite elimination, which is new and is
stated below as a theorem, is that it preserves the Hall and strong Hall properties.
THEOREM 4.2. Let H₀ be a bipartite graph and let (r′, c) be an edge of H₀. Let
H₁ be the bipartite graph resulting from the elimination of edge (r′, c). If H₀ has the
Hall property, then H₁ also has the Hall property. If H₀ has the strong Hall property,
then H₁ also has the strong Hall property.

Proof. Recall Theorem 2.5, which says that an m × n bipartite graph is Hall if and
only if it has no independent set of more than m vertices, and strong Hall if and only
if it has no independent set of exactly m vertices that includes at least one vertex
from each part.
Let R₁ and C₁ be the row and column vertices in a largest independent set in H₁.
It is not possible that both r′ ∈ Adj_H₀(C₁) and c ∈ Adj_H₀(R₁), for that would imply
an edge between R₁ and C₁ in H₁. Therefore either R₁ ∪ C₁ ∪ {r′} or R₁ ∪ C₁ ∪ {c}
is an independent set in H₀. If H₀ is Hall, that set has size at most m, and hence
R₁ ∪ C₁ has size at most m − 1, so H₁ is also Hall. The strong Hall case follows
the same argument, considering only independent sets that include both rows and
columns. □

4.2. Upper bounds on L and U with partial pivoting. For the remainder
of this section, we restrict our attention to the case in which only row interchanges
are performed during Gaussian elimination, so the column ordering is fixed initially.

This subsection proves symbolic upper bounds on the structures of L and U, making
no assumptions on the row pivoting strategy. For the case where A is strong Hall
and rows are ordered by partial pivoting, the next subsection proves matching exact
lower bounds. Therefore the symbolic upper bound is in fact a tight exact bound in
this case. As we will see, the tight exact bound is a one-at-a-time result; there is no
tight all-at-once bound on L and U in general.
In the rest of this section we require A to have a nonzero diagonal. The rows of
any nonsingular square matrix can be permuted to put nonzeros on the diagonal (by
Theorem 2.3 and Corollary 2.4). In fact, only the bounds on L below depend on a
nonzero diagonal; the bounds on U hold for arbitrary nonsingular A.
Since the row interchanges Pᵢʳ depend on the numerical values, it is in general im-
possible to determine where fill will occur in L and U from the structure of A. George
and Ng [13] suggested a way to get an upper bound on possible fill locations. At step
i of Gaussian elimination with row interchanges, call the rows that have nonzeros
in column i below the diagonal candidate pivot rows. George and Ng observed that
fill can only occur in candidate pivot rows, and only in columns that are nonzero in
some candidate pivot row. Thus the structure that results from the elimination step
is bounded by replacing each candidate pivot row by the union of all the candidate
pivot rows (to the right of column i). We need the fact that the diagonal of A is
nonzero to argue that this models the effect of row interchanges correctly: row i is
itself a candidate pivot row at step i, and therefore interchanging row i with another
candidate pivot row does not affect the structure of the bound.
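A minimal Python sketch of this bounding step, under our own 0-based set-of-sets representation of the rows of A, might look as follows:

def row_merge_bound(rows):
    # rows[k]: set of column indices of the nonzeros in row k of the
    # square matrix A; the diagonal is assumed nonzero.  At step i,
    # every candidate pivot row (a row k >= i with a nonzero in
    # column i) is replaced by the union of all candidate pivot rows
    # to the right of column i.
    rows = [set(r) for r in rows]
    n = len(rows)
    for i in range(n):
        candidates = [k for k in range(i, n) if i in rows[k]]
        union = set()
        for k in candidates:
            union |= {j for j in rows[k] if j > i}
        for k in candidates:
            rows[k] |= union
    return rows

The merged row structures returned by this sketch correspond to the row merge bound described in the text.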
This procedure for bounding the structures of L and U is precisely the construction
of the row merge graph from Section 3. Therefore we have the following theorem.
(Note that G×(A) = H×(A) since A is square.)
THEOREM 4.3 (GEORGE AND NG [13]). Let A be a nonsingular square ma-
trix with nonzero diagonal. Suppose A is factored by Gaussian elimination with row
interchanges as

    A = P₁ʳ L₁ P₂ʳ L₂ ⋯ Pₙ₋₁ʳ Lₙ₋₁ U,

and L is the union of the Lᵢ as described above. Then

    G(L + U) ⊆ G×(A),

that is, the structures of L and U are subsets of the lower and upper triangles of the
row merge graph of A. □

COROLLARY 4.4. Let A be a nonsingular square matrix with nonzero diagonal,
factored by Gaussian elimination with row interchanges as in Theorem 4.3. Then

    G(L + U) ⊆ G⁺∩(A),

that is, the structures of L and U are subsets of the lower and upper triangles of the
(symmetric) filled column intersection graph of A. □
George, Liu, and Ng [10, 13] gave an algorithm for Gaussian elimination with
partial pivoting that uses G×(A) to build a data structure to hold the factors of A as
elimination progresses. The structure may be overgenerous in the sense that it stores

FIG. 10. Example for Theorem 4.5. On the left is a 6 × 6 matrix A. On the right is the bound
G×(A) on the structures of L and U. In the case r < c, Figure 11 shows how to make [U]₃₅ nonzero.
In the case r > c, Figure 12 shows how to make [L]₅₄ nonzero.

some zeros, but it has the advantage that it is static; the structure does not change
as pivoting choices are made. George, Liu, and Ng's numerical experiments indicated
that (with a judicious choice of a column reordering for sparsity) the total storage
and execution time required to compute the LU decomposition using the static data
structure were quite competitive with other approaches.

4.3. Lower bounds on L and U with partial pivoting. In this section we
show that Theorem 4.3 is tight in the exact sense for strong Hall A. In other words,
if a given input structure is strong Hall, then for every edge of the row merge graph
there is a way to fill in the values so that the corresponding position of L or U is
nonzero. This implies that George and Ng's static data structure [13] is the tightest
possible for Gaussian elimination with partial pivoting. This is a one-at-a-time result;
as we will see, no all-at-once result is possible.
The case r < c of Theorem 4.5 (that is, the proof for U) first appeared in a
technical report by Gilbert [15]; the case r > c (for L) has not appeared before.
(Gilbert actually related U to G⁺∩(A) rather than G×(A), but the U parts of those
graphs are the same for strong Hall A by Corollary 3.9.)
THEOREM 4.5. Let H be the structure of a square strong Hall matrix with nonzero
diagonal. Let (r, c) be an edge of the row merge graph G×(H). There exists a nonsin-
gular matrix A (depending on r and c) with H(A) = H, such that if A is factored by
Gaussian elimination with partial pivoting into L and U as described in Theorem 4.3,
then [L + U]rc ≠ 0.
Proof. Figure 10 shows an example of the bound. The cases r < c (that is, U)
and r > c (that is, L) are similar. Row interchanges make the L case a little more
complicated; thus we prove the two cases separately.
Case r < c (structure of U). Figure 11 illustrates this case. According to
Corollary 3.5, there is a path P in H from row vertex r′ to column vertex c whose
intermediate column vertices are all at most r.
Let H̄ be the subgraph of H induced by all the row vertices and the column
vertices 1 through r and c. Now H̄ is strong Hall because H is. Therefore the
FIG. 11. Example for Case 1 of Theorem 4.5, showing the construction that makes [U]₃₅ nonzero
in the structure from Figure 10. At top left, the graph H̄ is the subgraph of H induced by column
vertices 1 through r = 3 and c = 5, and all the row vertices. The dashed edges are a column-complete
matching M with respect to which there is a c-alternating path Q = (5, 5′, 2, 1′, 1, 3′, 3) from c to r.
At top right, A is chosen to have large values in positions M and small values elsewhere. At bottom
left, Ar is the submatrix of PA with columns 1 through r and c and the rows in the corresponding
positions after 3 pivot steps. The element [U]₃₅ is in position * of the factor of Ar. At bottom right,
the directed graph G(Ar) has a path (3, 1, 2, 5); therefore (3, 5) fills in. In the original A, the first
pivot step does no row swap and fills position (3′, 2); the second pivot step swaps rows 2′ and 5′ and
fills position (3′, 5).


FIG. 12. Example for Case 2 of Theorem 4.5, showing the construction that makes [L]₅₄ nonzero
in the structure from Figure 10. At top left, the graph H̄ is the subgraph of H induced by column
vertices 1 through c = 4, and all the row vertices. Then d = 2 is the first column vertex on some path
from r′ to c. The dashed edges are a column-complete matching M with respect to which there is a
c-alternating path Q = (4, 4′, 3, 3′, 1, 1′, 2) from c to d. At top right, A is chosen to have large values
in positions M and small values elsewhere. At bottom left, Ac is the submatrix of PA with columns
1 through c and r and the rows in the corresponding positions after 4 pivot steps. The element [L]₅₄
is in position * of the factor of Ac. The fifth and last row of Ac is 5′, the fifth row of A, because 5′
was not involved in a pivoting swap during the first 4 steps; therefore s′ = r′ = 5′ and the argument
about an alternating path from r′ to s′ is not needed in this example. At bottom right, the directed
graph G(Ac) has a path (5, 2, 1, 4); therefore (5, 4) fills in. In the original A, the first pivot step fills
position (1′, 4), and the second pivot step fills position (5′, 4).

alternating-paths theorem (Theorem 2.9) applies, and says that there is a column-
complete matching M for H̄ and a path Q from c to r that is c-alternating with
respect to M.
Choose the values of those nonzeros of A corresponding to edges of M to be
larger than n, and the values of the other nonzeros of A to be between 0 and 1.
Further, choose the values so as to make every square submatrix of A that is Hall,
including A itself, nonsingular. (Such a choice is possible by an argument like that
in Theorem 3.8: the determinant of a Hall submatrix is a polynomial in its nonzero
values, not identically zero because the Hall property implies a perfect matching.
Therefore the set of values that make any Hall submatrix singular has measure zero,
and can be avoided.)
Now we prove that this choice of values makes [U]rc nonzero. In the first r steps
of elimination of A, the pivot elements are nonzeros corresponding to edges of M.
Let P be the permutation matrix that describes the first r row interchanges (that is,
P is the product of the row interchange matrices of steps r down through 1 in
Theorem 4.3). Let Ar be the (r + 1) × (r + 1) principal submatrix
of PA that includes the first r columns and column c, and the corresponding rows.
Thus the columns of Ar are those numbered 1 through r and c in H; the first r rows
of Ar are those matched to columns 1 through r of H by M; and it does not matter
which row of H the last row of Ar is. We will consider the rows and columns of the
bipartite graph H(Ar) to have the same numbers that they did in H; thus the column
vertex numbers are 1 through r and c, and the row numbers may be anything. In the
directed graph G(Ar), we will also number the vertices 1 through r and c, but bear
in mind that the row of Ar corresponding to a vertex v was not necessarily row v′
in H.
Now the first r diagonal elements of Ar are nonzero, and dominant. Let Lr and Ur
be the triangular factors of Ar without pivoting, Ar = LrUr. Then the element [U]rc
mentioned in the statement of the theorem is in fact [Ur]rc, the element in the last
column and next-to-last row of Ur. We proceed to show that [Ur]rc ≠ 0.
All square Hall submatrices of Ar are nonsingular; thus, by Lemma 2.13, G⁺(Ar)
is exactly the structure of [Lr + Ur]. Therefore [U]rc is nonzero if and only if G(Ar)
contains a directed path from vertex r to vertex c, through vertices numbered less
than r.
Recall the path Q, which is a path in H̄ from c to r that is c-alternating with
respect to M. The matching M consists of exactly the edges on the diagonal of Ar
(except for the one in the last column, which cannot be an edge of Q because Q is
c-alternating). Therefore Q corresponds to a directed path from r to c in G(Ar).
Every vertex of G(Ar) except r and c is numbered less than r, so this is the desired
directed fill path and the proof of this case is complete.
Note that the proof never explicitly identified the row of H that ended up in
position (r, c) of U; it is the row matched to column r by M, and is the second-last
vertex on the path Q.
Case r > c (structure of L). Figure 12 illustrates this case. The proof for this
case is much like that for U, but it needs to do some extra work to identify the row
of H that ends up in position (r, c) of L, because that row has not yet been matched
(pivoted on) when Lrc is computed.

Again by Corollary 3.5, there is a path P in H from row vertex r′ to column
vertex c whose intermediate column vertices are all at most c. Let d be the first
column vertex on P (this is the vertex after r′ on P; possibly d = c).
Let H̄ be the subgraph of H induced by all the row vertices and the column
vertices 1 through c. (This has one less column than in the proof for U.) Then P[d : c]
is a path (possibly of length 0) in H̄ from column vertex d to column vertex c.
Again, therefore, there is a column-complete matching M for H̄ and a path Q from
c to d that is c-alternating with respect to M.
Again we choose A so that edges of M have values larger than n, other edges have
values between 0 and 1, and every square Hall submatrix of A is nonsingular.
The first c steps of elimination of A pivot on nonzeros corresponding to edges of M.
Let P be the permutation matrix that describes the first c row interchanges (that is,
P is the product of the row interchange matrices of steps c down through 1 in
Theorem 4.3). Let Ac be the (c + 1) × (c + 1) principal submatrix
of PA that includes the first c columns and column r, and the rows in corresponding
positions of PA. Thus the columns of Ac are those numbered 1 through c and r in
H; the first c rows of Ac are those matched to columns 1 through c of H by M. The
last row of Ac is some row number s′ in H that is not matched by M. (Row s′ may
or may not be matched to column r in the final factorization of A.)
Again, we give the rows and columns of the bipartite graph H(Ac) the same
numbers they had in H; the column vertex numbers are 1 through c and r, and the
row numbers may be anything (but the last row is s′). In the directed graph G(Ac),
we will also number the vertices 1 through c and r; again, bear in mind that the row
of Ac corresponding to a vertex v was not necessarily row v′ in H, and in particular
the row corresponding to vertex r of G(Ac) is row s′ of H.
Now the first c diagonal elements of Ac are nonzero, and dominant. Let Lc and
Uc be the triangular factors of Ac without pivoting, Ac = LcUc. The element [L]rc
mentioned in the statement of the theorem is in fact [Lc]rc, the element in the last
row and next-to-last column of Lc.
As before, we show that [Lc]rc ≠ 0 by exhibiting a directed path from vertex r
to vertex c of G(Ac), based on a c-alternating path in H. However there is not
necessarily an edge between column vertex r and row vertex s′ in H; thus we must
find a c-alternating path that ends at s′, not r. The details of how to do that will
complete the proof.
We now trace the pivoting process to discover where row s′ came from. If row r′
of H was not used as one of the first c pivots, then it has not moved and s′ = r′.
If row r′ was used as a pivot, suppose it was in column c₁ ≤ c, and that the row
interchanged with r′ at step c₁ was row r₁′. (Recall that all row and column numbers
are vertex numbers of H.) Again, either r₁′ = s′ or else r₁′ was later used as a pivot
in some column c₂ > c₁, when it was interchanged with some row r₂′. Continuing
inductively, we eventually arrive at a row rₖ′ which is equal to s′, which was not used
as a pivot in the first c steps.

The sequence of nonzeros we followed while tracing the pivoting process was

    (r′, c₁), (r₁′, c₁), (r₁′, c₂), (r₂′, c₂), ..., (rₖ₋₁′, cₖ), (rₖ′, cₖ).

Each (cᵢ, rᵢ′) is an edge of one of the bipartite elimination graphs H₀, H₁, ..., Hc
corresponding to the first c steps of symbolic Gaussian elimination of H. Therefore,
by Lemma 4.1, there is a c-alternating path in H from cᵢ to rᵢ′ for each i. Furthermore
each (rᵢ₋₁′, cᵢ) is an edge of M, and is thus a one-edge c-alternating path from rᵢ₋₁′
to cᵢ. Concatenating these paths yields a c-alternating walk W (which may repeat
vertices or edges) from r′ to s′ in H.
Now if edge (d, r′) is not an edge of M, then Q followed by (d, r′) followed by W
is a c-alternating walk from column c to row s′. Alternatively, if (d, r′) is an edge
of M, then d = c₁, and Q followed by W[d : s′] is a c-alternating walk from column c
to row s′. Either way, we have a walk in H from c to s′ that is c-alternating with
respect to M. This walk corresponds to a directed walk from vertex r to vertex c
of G(Ac). Thus there is a directed path from vertex r to vertex c of G(Ac). The
intermediate vertices on this path are less than both r and c, because r and c are the
last two vertices of G(Ac). Therefore (r, c) is an edge of G⁺(Ac). Since all square Hall
submatrices of Ac are nonsingular, therefore, [Lc]rc is nonzero. Thus [L]rc is nonzero
and the proof is complete. □

4.4. Remarks on LU factorization with pivoting. Theorem 4.5 showed that
G×(A) is a tight exact bound on the structure of the factors L and U, assuming that
the structure of A is not only strong Hall, but also has its rows permuted so that
the diagonal is nonzero. We can get a tight exact bound on U without assuming a
nonzero diagonal. The following result does not depend on row ordering.
COROLLARY 4.6. Let H be a square bipartite graph with the strong Hall property.
Let (r, c) be an edge of the filled column intersection graph G⁺∩(H). Then there is a
nonsingular matrix A (depending on r and c) with H(A) = H, such that the upper
triangular factor U of A in Gaussian elimination with partial pivoting has [U]rc ≠ 0.
Proof. Since H is strong Hall, it has a column-complete matching. Let H̃ be
H with its row vertices permuted so that (i′, i) is a matching edge for all i. The
filled column intersection graph is independent of the row permutation, so G⁺∩(H) =
G⁺∩(H̃). Corollary 3.9 says that the upper triangles of G⁺∩(H̃) and G×(H̃) are the same.
Therefore (r, c) is an edge of G×(H̃). Then, by Theorem 4.5, there is a nonsingular
matrix Ã with H(Ã) = H̃, such that the upper triangular factor Ũ of Ã in Gaussian
elimination with partial pivoting has [Ũ]rc ≠ 0.
By a measure-theoretic argument like that in Theorem 3.8, we can choose Ã so that
there is never a tie for the choice of pivot element, that is, so that at each elimination
step all the subdiagonal nonzeros of the pivot column have different magnitudes.
Under this assumption, the upper triangular factor Ũ is independent of the row
ordering of Ã. Let A be Ã with its rows permuted so that H(A) = H. The upper
triangular factor U of A is equal to Ũ, and hence [U]rc ≠ 0. □
Theorem 4.5 on LU differs from Theorem 3.8 on QR in that the latter is all-
at-once; that is, for each structure a single matrix exists that fills all the predicted
nonzeros. Theorem 4.5 is not all-at-once, and no tight exact all-at-once result is
possible for LU factorization with partial pivoting. To see this, consider a matrix
that is tridiagonal plus a full first column,

    x  x
    x  x  x
    x  x  x  x
    x     x  x  x
    x        x  x

The graph H(A) is strong Hall. The row merge graph G×(A) is full. As Theorem 4.5
says, any single position in L or U can be made nonzero by an appropriate choice of
pivots. But the first row of U will have the same structure as some row of A, so it is
impossible for U to be full.
One application of structure prediction for partial pivoting is to predict which
columns of A will update which other columns if the factorization is done with a
column-by-column algorithm. For example, Gilbert [15] gave a parallel implementa-
tion of LU factorization with partial pivoting in which tasks (columns of the factor-
ization) were scheduled dynamically to processors, based on a precedence relationship
determined by precomputing the elimination tree [23] of G∩(A). Since [U]ij is nonzero
if and only if column i updates column j during the factorization, a corollary of The-
orem 4.5 is that, for strong Hall A, this is the tightest prediction possible from the
structure of A alone.
COROLLARY 4.7 (GILBERT [15]). Let a strong Hall structure for the square
matrix A be given. If k is the parent of j in the elimination tree of G∩(A), then there
exists a choice of nonzero values of A that will make column j update column k during
factorization with partial pivoting. □
This corollary is a one-at-a-time result. However, if we restrict our attention to the
edges of the elimination tree of G∩(A) instead of all of G×(A), it may be possible to
prove an all-at-once result. We conjecture that for every square strong Hall structure H,
there exists a single matrix A with H(A) = H such that every edge of the elimination
tree of G∩(A) corresponds to a nonzero in the upper triangular factor U of A with
partial pivoting.
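For reference, the parent function of the elimination tree used in this discussion can be read off the filled column intersection graph: the parent of a column is its smallest higher-numbered neighbor in the filled graph. A minimal Python sketch (ours; the adjacency-set input format is an assumption):

def elimination_tree(filled):
    # filled: the symmetric filled column intersection graph, as a dict
    # mapping each column to its set of neighbors.  The parent of j is
    # the smallest neighbor numbered higher than j; a column with no
    # higher neighbor is a root of its tree.
    parent = {}
    for j, nbrs in filled.items():
        higher = [k for k in nbrs if k > j]
        parent[j] = min(higher) if higher else None
    return parent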

Little if anything is known about the case when H(A) is not strong Hall. Hare
et al. [21] gave a complete exact result for QR factorization assuming only the Hall
property; is a similar analysis possible for partial pivoting? In particular, since the
upper triangles of G×(A) and G⁺∩(A) can differ in the non-strong Hall case, how tight
is the former for partial pivoting? There are non-strong Hall structures for which
G×(A) is tight but G⁺∩(A) is not; an example is a matrix whose only nonzeros are the
diagonal and the first row.

5. Remarks. The theme of this paper is that, when solving a nonsymmetric
linear system, structure prediction is easier if the matrix is strong Hall. On the other
hand, a system whose matrix is not strong Hall can be partitioned (by Dulmage-
Mendelsohn decomposition) into smaller strong Hall systems. This useful coincidence
makes some intuitive sense. Symbolic independence of vectors (the Hall property) is a
weaker condition than numeric linear independence. In a sense, Dulmage-Mendelsohn
decomposition tries to wring as much as possible out of symbolic relationships before
Gaussian elimination takes over to handle numeric relationships; the tight exact (i.e.
numeric) lower bounds in this paper say that Dulmage-Mendelsohn decomposition is
doing its job.
Predicting structure in algorithms that combine numerical and structural infor-
mation is an interesting challenge. Murota et al. [25] have studied block triangular
decompositions that take some but not all of the numerical values into account.
We point out once more that Hare, Johnson, Olesky, van den Driessche, and
Pothen [21, 28] have recently obtained tight exact bounds on both Q and R in the
general Hall case, thus extending the work of Coleman, Edenbrandt, and Gilbert that
we reviewed in Section 3. It would be interesting to see whether our bounds on L
and U for partial pivoting, in Section 4, could be similarly extended.
We conclude by mentioning three open problem areas for nonsymmetric structure
prediction.
First, it would be interesting to understand the relationship between the structure
of L and the structure of L⁰, both of which are different ways of storing the lower
triangular factor in Gaussian elimination with partial pivoting. Can the techniques
discussed in this paper be used to obtain bounds on the structure of L⁰?
Second, it would be useful to achieve a complete structural understanding of the
Bunch-Kaufman symmetric indefinite factorization [18, Chapter 4.4]. Here a sym-
metric indefinite matrix is factored symmetrically by choosing pivots from the di-
agonal, but each pivot may be either an element or a 2 × 2 submatrix. Thus the
factorization is PAPᵀ = LDLᵀ, where P is a permutation, L is lower triangular, and
D is block diagonal with 1 × 1 and 2 × 2 blocks. This factorization is particularly
useful for solving "augmented systems" of the form

    ( K   A )
    ( Aᵀ  0 ),

where A is rectangular and K is symmetric and (perhaps) positive definite [1]. Even
the common case K = I is not well understood.
Third, it would be interesting to understand the structural issues in the incomplete
LU factorizations sometimes used to precondition iterative methods for solving linear
systems [7].

REFERENCES

[1] Åke Björck. A note on scaling in the augmented system methods, 1991. Unpublished manuscript.
[2] Robert K. Brayton, Fred G. Gustavson, and Ralph A. Willoughby. Some results on sparse matrices. Mathematics of Computation, 24:937-954, 1970.
[3] Richard A. Brualdi and Herbert J. Ryser. Combinatorial Matrix Theory. Cambridge University Press, 1991.
[4] Richard A. Brualdi and Bryan L. Shader. Strong Hall matrices. IMA Preprint Series #909, Institute for Mathematics and Its Applications, University of Minnesota, December 1991.
[5] Thomas F. Coleman, Anders Edenbrandt, and John R. Gilbert. Predicting fill for sparse orthogonal factorization. Journal of the Association for Computing Machinery, 33:517-532, 1986.
[6] I. S. Duff and J. K. Reid. Some design features of a sparse matrix code. ACM Transactions on Mathematical Software, 5:18-35, 1979.
[7] Howard Elman. A stability analysis of incomplete LU factorization. Mathematics of Computation, 47:191-218, 1986.
[8] Alan George and Michael T. Heath. Solution of sparse linear least squares problems using Givens rotations. Linear Algebra and its Applications, 34:69-83, 1980.
[9] Alan George and Joseph Liu. Householder reflections versus Givens rotations in sparse orthogonal decomposition. Linear Algebra and its Applications, 88:223-238, 1987.
[10] Alan George, Joseph Liu, and Esmond Ng. A data structure for sparse QR and LU factorizations. SIAM Journal on Scientific and Statistical Computing, 9:100-121, 1988.
[11] Alan George and Joseph W. H. Liu. Computer Solution of Large Sparse Positive Definite Systems. Prentice-Hall, 1981.
[12] Alan George, Joseph W. H. Liu, and Esmond Ng. Row ordering schemes for sparse Givens transformations I. Bipartite graph model. Linear Algebra and its Applications, 61:55-81, 1984.
[13] Alan George and Esmond Ng. Symbolic factorization for sparse Gaussian elimination with partial pivoting. SIAM Journal on Scientific and Statistical Computing, 8:877-898, 1987.
[14] John R. Gilbert. Predicting structure in sparse matrix computations. Technical Report 86-750, Cornell University, 1986. To appear in SIAM Journal on Matrix Analysis and Applications.
[15] John R. Gilbert. An efficient parallel sparse partial pivoting algorithm. Technical Report 88/45052-1, Christian Michelsen Institute, 1988.
[16] John R. Gilbert and Tim Peierls. Sparse partial pivoting in time proportional to arithmetic operations. SIAM Journal on Scientific and Statistical Computing, 9:862-874, 1988.
[17] John Russell Gilbert. Graph Separator Theorems and Sparse Gaussian Elimination. PhD thesis, Stanford University, 1980.
[18] Gene H. Golub and Charles F. Van Loan. Matrix Computations. The Johns Hopkins University Press, second edition, 1989.
[19] Martin Charles Golumbic. Algorithmic Graph Theory and Perfect Graphs. Academic Press, 1980.
[20] Frank Harary. Graph Theory. Addison-Wesley Publishing Company, 1969.
[21] Donovan R. Hare, Charles R. Johnson, D. D. Olesky, and P. van den Driessche. Sparsity analysis of the QR factorization, 1991. To appear in SIAM Journal on Matrix Analysis and Applications.
[22] Michael T. Heath. Numerical methods for large sparse linear least squares problems. SIAM Journal on Scientific and Statistical Computing, 5:497-513, 1984.
[23] Joseph W. H. Liu. The role of elimination trees in sparse factorization. SIAM Journal on Matrix Analysis and Applications, 11:134-172, 1990.
[24] L. Lovász and M. D. Plummer. Matching Theory. North Holland, 1986.
[25] Kazuo Murota, Masao Iri, and Masataka Nakamura. Combinatorial canonical form of layered mixed matrices and its application to block-triangularization of systems of linear/nonlinear equations. SIAM Journal on Algebraic and Discrete Methods, 8:123-149, 1987.
[26] Esmond G. Ng and Barry W. Peyton. A tight and explicit representation of Q in sparse QR factorization. Technical Report ORNL/TM-12059, Oak Ridge National Laboratory, 1992.
[27] S. Parter. The use of linear graphs in Gauss elimination. SIAM Review, 3:119-130, 1961.
[28] Alex Pothen. Predicting the structure of sparse orthogonal factors. Manuscript, 1991.
[29] Alex Pothen and Chin-Ju Fan. Computing the block triangular form of a sparse matrix. ACM Transactions on Mathematical Software, 16:303-324, 1990.
[30] Donald J. Rose. Triangulated graphs and the elimination process. Journal of Mathematical Analysis and Applications, 32:597-609, 1970.
[31] Donald J. Rose and Robert Endre Tarjan. Algorithmic aspects of vertex elimination on directed graphs. SIAM Journal on Applied Mathematics, 34:176-197, 1978.
[32] Donald J. Rose, Robert Endre Tarjan, and George S. Lueker. Algorithmic aspects of vertex elimination on graphs. SIAM Journal on Computing, 5:266-283, 1976.
[33] H. R. Schwarz. Tridiagonalization of a symmetric band matrix. Numerische Mathematik, 12:231-241, 1968.
HIGHLY PARALLEL SPARSE TRIANGULAR SOLUTION*

FERNANDO L. ALVARADO†, ALEX POTHEN‡ AND ROBERT SCHREIBER§

Abstract. In this paper we survey a recent approach for solving sparse triangular systems
of equations on highly parallel computers. This approach employs a partitioned representation
of the inverse of the triangular matrix so that the solution can be computed by matrix-vector
multiplication. The number of factors in the partitioned inverse is proportional to the number of
general communication steps (router steps on a CM-2) required in a highly parallel algorithm. We
describe partitioning algorithms that minimize the number of factors in the partitioned inverse over
all symmetric permutations of the triangular matrix such that the permuted matrix continues to
be triangular. For a Cholesky factor we describe an O(n) time and space algorithm to solve the
partitioning problem above, where n is the order of the matrix. Our computational results on a CM-
2 demonstrate the potential superiority of the partitioned inverse approach over the conventional
substitution algorithm for highly parallel sparse triangular solution. Finally we describe current and
future extensions of these results.
AMS(MOS) subject classifications: primary 65F50, 65F25, 68R10.
Keywords. chordal graph, directed acyclic graph, elimination tree, graph partitioning, mas-
sively parallel computers, partitioned inverse, sparse triangular systems, transitive closure.

1. Introduction. We survey some recent developments in the solution of sparse
triangular linear systems of equations on a highly parallel computer. For concreteness,
we consider a unit lower triangular system Lx = b, but the results in the paper apply
in a straightforward manner to upper triangular systems as well. We discuss the
situation when there are multiple right-hand side vectors b, and all these vectors are
not necessarily available at once. Such situations occur in finite element applications,
preconditioned iterative solvers for linear systems, solution of initial value problems by
implicit methods, variants of Newton's method for the solution of nonlinear equations,
and in numerical optimization.
There are two possible approaches to the parallel solution of triangular systems of
equations. One approach is to exploit whatever limited parallelism is available in the
usual substitution algorithm [4, 7, 9, 11]. The second approach requires preprocessing,
and works with a partitioned representation of L⁻¹.

* A part of this work was done while the authors were visiting the Institute for Mathematics
and its Applications (IMA) at the University of Minnesota. We thank the IMA for its support.
† Electrical and Computer Engineering Department, 1425 Johnson Drive, The University of
Wisconsin, Madison, WI 53706 (alvarado@ece.wisc.edu). This author was supported under NSF
Contracts ECS-8822654 and ECS-8907391.
‡ Department of Computer Science, University of Waterloo, Waterloo, Ontario, Canada N2L
3G1 (apothen@narnia.uwaterloo.ca, na.pothen@na-net.ornl.gov). This author was supported by
NSF grant CCR-9024954 and by U. S. Department of Energy grant DE-FG02-91ER25095 at the
Pennsylvania State University and by the Canadian Natural Sciences and Engineering Research
Council under grant OGP0008111 at the University of Waterloo.
§ RIACS, MS T045-1, NASA Ames Research Center, Moffett Field, CA 94035
(schreiber@riacs.edu). This author was supported by the NAS Systems Division under Coopera-
tive Agreement NCC 2-387 between NASA and the University Space Research Association (USRA).

To begin we review the partitioned inverse approach to parallel triangular solution.

Any unit lower triangular matrix L can be expressed as a product of elementary
matrices: L = L₁L₂ ⋯ Lₙ, where the elementary matrix Lᵢ is unit lower triangular
and nonzero below the diagonal only in column i. Hence it has the representation
Lᵢ = I + mᵢeᵢᵀ, where mᵢ has its first i components zero, and eᵢ is the i-th coordinate
vector. (Here it will be convenient to include Lₙ = I among the elementary matrices.)
The elementary lower triangular matrices can be grouped together to form m unit
lower triangular factors L = P₁P₂ ⋯ Pₘ, where each factor Pᵢ is chosen to have the
property that Pᵢ⁻¹ can be represented in the same space as Pᵢ. (Here m ≤ n is
a number to be determined.) Each factor Pᵢ is the product of the consecutive
elementary matrices Lₖ for k = eᵢ, ..., eᵢ₊₁ − 1, with e₁ = 1 < e₂ <
⋯ < eₘ < eₘ₊₁ = n + 1. The factor Pᵢ is lower triangular and is zero below its
diagonal in all columns except columns eᵢ through eᵢ₊₁ − 1. This leads to a partitioned
representation of the inverse of L of the form L⁻¹ = Pₘ⁻¹ ⋯ P₂⁻¹P₁⁻¹ (each Pᵢ⁻¹ is explicitly
stored) that can be stored in just the space required for L.
It follows that the solution to Lx = b can be computed by means of m matrix-
vector products

    x = L⁻¹b = Pₘ⁻¹ ⋯ P₂⁻¹P₁⁻¹ b.

By using as many virtual processors as there are nonzeros in Pᵢ and summing the
products {(Pᵢ⁻¹)ₖₗ bₗ : (Pᵢ⁻¹)ₖₗ ≠ 0} in logarithmic time, we may exploit parallelism
fully in computing the matrix-vector products.
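As a serial illustration of this formula, the following Python sketch (dense NumPy for brevity; the function name and data layout are our own, not the authors' implementation) applies the explicitly stored inverse factors one after another. On a highly parallel machine, each of the m products is performed in parallel, one general communication step per factor.

import numpy as np

def partitioned_solve(inv_factors, b):
    # inv_factors: the explicitly stored inverses P1^{-1}, ..., Pm^{-1},
    # in that order (sparse in practice; dense here for brevity).
    # Computes x = Pm^{-1} ... P1^{-1} b, applying P1^{-1} first,
    # with one matrix-vector product per factor.
    x = np.asarray(b, dtype=float)
    for inv in inv_factors:          # i = 1, ..., m
        x = inv @ x
    return x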
We consider in this paper the problem of computing partitioned inverses with the
fewest factors, since in practice the complexity of highly parallel triangular solution is
determined by the number of factors. There are two variations of this problem, and
we describe them next after introducing some notation.
A matrix X is invertible in place if and only if (X⁻¹)ᵢⱼ = 0 whenever xᵢⱼ = 0,
for any assignment of (nonzero) numerical values to the nonzeros in X. Since the
elementary lower triangular matrices are invertible in place, there is always at least
one partition of L with factors that invert in place. A partition in which the factors
Pᵢ are invertible in place is called a no-fill partition. A no-fill partition of L with
the fewest factors is a best no-fill partition. An admissible permutation Q of L is a
symmetric permutation of the rows and columns of L such that the permuted matrix
QLQᵀ is lower triangular. A best reordered partition of L is a best no-fill partition of
QLQᵀ with the fewest factors over all admissible permutations Q of L.
An overview of this survey is as follows. We shall describe efficient algorithms for
computing best no-fill and best reordered partitions of lower triangular matrices in
section 2. Then we shall show in section 3 that if L is restricted to be the unit lower
triangular matrix from an LDLᵀ (Cholesky) factorization, there is an even more effi-
cient algorithm for computing these partitions that makes use of the elimination tree.
In section 4 we demonstrate the usefulness of these ideas in practice by comparing the
partitioned inverse approach with a conventional triangular solution algorithm on a
Connection Machine CM-2. We conclude by summarizing our findings and describing
both ongoing and future extensions of this work in section 5.
We have taken the opportunity of writing this survey article to provide a unified
discussion of the algorithms that have appeared in two different papers, to improve
the description of the algorithms (especially Algorithm RP2 in section 2), to illustrate
the differences between the algorithms by means of examples, and to correct minor
errors.

2. Two partitioning problems. We begin by providing formal statements and
graph models of the best no-fill and best reordered partitioning problems, and then
describe algorithms for computing the partitions when L is obtained from unsymmet-
ric, symmetric indefinite, or incomplete factorizations.

2.1. Graph models. A formal statement of the best no-fill partitioning problem
is as follows:
(Pr1) Given a unit lower triangular matrix L = L₁L₂ ⋯ Lₙ, find a partition into factors
L = P₁P₂ ⋯ Pₘ, where
1. each Pᵢ is the product of the consecutive elementary matrices Lₖ for
k = eᵢ, ..., eᵢ₊₁ − 1, with e₁ = 1 < e₂ < ⋯ < eₘ < eₘ₊₁ = n + 1,
2. each Pᵢ inverts in place, and
3. m is minimum over all partitions satisfying the given conditions.


FIG. 1. A lower triangular matrix, its DAG, and its partitions. The original ordering and the
partition found by Algorithm P1 are shown on the left, and the ordering and partition found by
Algorithms RP1 or RP2 are shown on the right.

It is helpful to consider a graph model of (Pr1) and the other partitioning problems.
Let G(L) denote a directed graph with vertices V = {1, ..., n} corresponding to the
columns of L and edges E = {(j, i) : i > j and lᵢⱼ ≠ 0}. The edge (j, i) is directed
from the lower-numbered vertex j to the higher-numbered vertex i. It follows that
G(L) is a directed acyclic graph (DAG). If there is a directed path from a vertex j to
a vertex i in G(L), we will say that j is a predecessor of i, and that i is a successor
of j. In particular, if (j, i) ∈ E, then j is a predecessor of i and i is a successor of j.
Given a subset P of the columns of L, the column subgraph of G(L) induced by P
is the graph whose edge set is the subset of edges in E that are directed from vertices
in P to all vertices in V, and whose vertex set is the subset of vertices which are the
endpoints of such edges. Thus the column subgraph of P is the subgraph induced by
the edges that correspond to nonzeros in the column set P.

In what follows, we identify a subset of columns P with the factor formed by
multiplying, in order of increasing column number, the elementary matrices corre-
sponding to columns in P. The condition that the nonzero structure of a factor P
should be the same as the structure of its inverse corresponds in the graph model to
the requirement that the column subgraph of P should be transitively closed [8]. (A
DAG G is transitively closed if and only if for every pair of vertices j and i such that
there is a directed path in G from j to i, the edge (j, i) is present in G.)
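A direct Python rendering of this test (our own helper; succ[v] denotes the successor set of v in G(L)) checks transitive closure of the column subgraph of a candidate vertex set P:

def inverts_in_place(succ, P):
    # The column subgraph of the vertex set P is transitively closed
    # iff, for every v in P and every successor w of v that also lies
    # in P, each successor of w is already a successor of v.  Longer
    # paths are covered by applying this condition along each edge.
    P = set(P)
    for v in P:
        for w in succ[v]:
            if w in P and not succ[w] <= succ[v]:
                return False
    return True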
Hence the graph model of (Pr1) is as follows:
(Pr1′) Find an ordered partition P₁ ≺ P₂ ≺ ⋯ ≺ Pₘ of the vertices of G(L) such
that
1. for every v ∈ V, if v ∈ Pᵢ then all vertices numbered less than v belong
to P₁, ..., Pᵢ,
2. the column subgraph of each Pᵢ is transitively closed, and
3. m is minimum over all partitions satisfying the given conditions.
The reader will note that we have identified a factor Pᵢ in (Pr1) with a subset in the
vertex partition in its graph model (Pr1′), since there is a one-to-one correspondence
between them. We will not distinguish between them, and it should be clear from
the context whether the factor or the vertex subset is being discussed.
We illustrate these concepts by means of an example. Consider the matrix L
with graph G(L) illustrated in Fig. 1. The original ordering of its rows and columns
is shown on the left; the corresponding vertex ordering of G(L) is indicated by the
vertex numbers on the left. As shown in the partition of the matrix on the left in
Fig. 1, L has a best no-fill partition with four factors.

It is possible to symmetrically permute the rows and columns of L such that
L remains lower triangular; this corresponds to reordering the elementary matrices
while preserving the lower triangular structure of L. The permutation corresponds
to a reordering of the vertices of G(L). If in Fig. 1 we reorder the vertices with the
numbers shown on the right, then a best no-fill partition of the permuted L is

    L = (L₁ ⋯ L₆)(L₇ ⋯ L₁₂),

which has only two factors. The matrix on the right in Fig. 1 corresponds to the
reordered matrix, and its partition into two factors is also shown.
A formal statement of the best reordered partitioning problem is as follows:

(Pr2) Given a unit lower triangular matrix L = L₁L₂ ⋯ Lₙ, find an admissible permu-
tation Q and a partition L_Q ≡ QLQᵀ = P₁P₂ ⋯ Pₘ, where
1. each Pᵢ is the product of the consecutive elementary matrices Lₖ for
k = eᵢ, ..., eᵢ₊₁ − 1, with e₁ = 1 < e₂ < ⋯ < eₘ < eₘ₊₁ = n + 1,
2. each Pᵢ is invertible in place, and
3. m is minimum over all permutations Q such that L_Q is lower triangular.
As noted above, the action of the permutation Q on L is to reorder the elementary
matrices whose product is L; however, these elementary matrices cannot be arbitrarily
reordered, since we require the resulting matrix L_Q to be lower triangular. From the

equation Lᵢ = I + mᵢeᵢᵀ it can be verified that the elementary matrices Lᵢ and Lᵢ₊₁
can be permuted if and only if lᵢ₊₁,ᵢ = 0. These precedence constraints on the order
in which the elementary matrices may appear are nicely captured in a graph model of
(Pr2).
A topological ordering of G(L) is an ordering of its vertices in which predeces-
sors are numbered lower than successors; i.e., for every edge (j, i) ∈ E, i > j. By
construction, the original vertex numbering of G(L) is a topological ordering. A per-
mutation Q that leaves L_Q lower triangular corresponds to a topological reordering
of the vertices of G(L).
The graph model of (Pr2) is:
(Pr2′) Find an ordered partition P₁ ≺ P₂ ≺ ⋯ ≺ Pₘ of the vertices of G(L),
numbered in a topological ordering, such that
1. for every v ∈ V, if v ∈ Pᵢ then all predecessors of v belong to P₁, ...,
Pᵢ,
2. the column subgraph of each Pᵢ is transitively closed, and
3. m is minimum subject to these conditions.

The permutation Q in (Pr2) can be obtained by renumbering the vertices in the
ordered partition P₁ to Pₘ in increasing order, and in topological order within each
subset Pᵢ.

2.2. Partitioning algorithms. We now describe "greedy" algorithms for solv-
ing the best no-fill and best reordered partitioning problems.

Input: A unit lower triangular matrix L = L₁L₂ ⋯ Lₙ and its DAG G(L).
Output: A best no-fill partition of L.
i ← 1; {Lᵢ is the lowest-numbered elementary matrix not included in a factor yet}
k ← 1; {Pₖ is the factor being computed}
while (i ≤ n) do
    {Find the largest integer r ≥ i such that Lᵢ ⋯ Lᵣ is invertible in place}
    r ← i;
    while r < n and in G(L) every successor of the vertex r is a successor
        of all predecessors v of r such that i ≤ v < r do r ← r + 1; od
    Pₖ ← {i, ..., r}; k ← k + 1; i ← r + 1;
od

FIG. 2. Algorithm P1.

Best no-fill partitions. Algorithm P1, shown in Fig. 2, was proposed by Al-
varado, Yu and Betancourt [3]. This algorithm greedily tries to include as many
elementary matrices in the current factor as possible, while maintaining the two
properties that a factor should invert in place, and that the 'left-to-right' precedence
constraint in problem (Pr1) should be obeyed. The condition that in the graph G(L)
every successor of a vertex r is also a successor of every predecessor of r ensures that
inclusion of the vertex r in the current factor Pₖ will continue to make G(Pₖ) transi-
tively closed, and thus Pₖ will be invertible in place. Alvarado, Yu, and Betancourt
did not consider the issue of optimality, but later it was proved by Alvarado and
Schreiber [2] that Algorithm P1 solves problem (Pr1).
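The following Python sketch is our own rendering of Algorithm P1, not the authors' code; it is phrased as testing whether the next vertex r + 1 can join the current factor, which keeps the column subgraph of the factor transitively closed.

def p1_partition(succ, pred, n):
    # succ[v], pred[v]: successor and predecessor sets of G(L),
    # vertices 1..n in the fixed original order.
    # Extend the current factor {i, ..., r} by vertex r+1 as long as
    # every successor of r+1 is a successor of each predecessor v of
    # r+1 with i <= v <= r, so that the factor still inverts in place.
    factors = []
    i = 1
    while i <= n:
        r = i
        while r < n and all(succ[r + 1] <= succ[v]
                            for v in pred[r + 1] if i <= v <= r):
            r += 1
        factors.append(list(range(i, r + 1)))
        i = r + 1
    return factors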
Best reordered partitions. Now we describe Algorithm RP1, which solves the
reordered partitioning problem (Pr2). A vertex v in the DAG G(L) is a source if
there are no edges directed into v; i.e., there are no edges (u, v). The level of a vertex
v is the length of a longest directed path into v. It follows that if v is a source, then
level(v) = 0; furthermore, if v is not a source, then level(v) is the length of a longest
path from a source to v. The level values of all the vertices of G(L) can be computed
in O(e) time. We define the set hadj(v) to be the set of all vertices adjacent to v and
numbered higher than v.
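The O(e) level computation can be done with a single topological sweep; the following Python sketch (ours) uses Kahn's method:

from collections import deque

def levels(succ, indegree):
    # Longest-path levels of a DAG in O(e) time: a source has level 0;
    # otherwise level(v) is one more than the maximum level of its
    # predecessors.  succ[v]: successor set; indegree[v]: in-degree.
    level = {v: 0 for v in succ}
    count = dict(indegree)
    queue = deque(v for v in succ if count[v] == 0)
    while queue:
        v = queue.popleft()
        for w in succ[v]:
            level[w] = max(level[w], level[v] + 1)
            count[w] -= 1
            if count[w] == 0:
                queue.append(w)
    return level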

Input: A lower triangular matrix L = L₁ ⋯ Lₙ and its DAG G(L).

Output: A permutation Q : V → {1, ..., n} and a partition of the permuted matrix
L_Q into factors.
Compute level(v) for all v ∈ V;
max_level ← max{level(v) : v ∈ V};
i ← 0; {i elementary matrices have been included in factors}
k ← 1; {Pₖ is the factor being computed}
while i < n do
    Pₖ ← ∅;
    ℓ ← min{j | there is an unnumbered vertex at level j};
    repeat
        for every vertex v at level ℓ do
            if (([Condition 1a] v is unnumbered) and
                ([Condition 1b] every predecessor of v has been numbered) and
                ([Condition 2] every successor of v is a successor of all
                    u ∈ Pₖ such that u is a predecessor of v)) then
                i ← i + 1; Q(v) ← i; Pₖ ← Pₖ ∪ {v};
            fi
        od
        ℓ ← ℓ + 1;
    until ℓ > max_level or no vertices at level ℓ − 1 were included in Pₖ;
    k ← k + 1;
od

FIG. 3. Algorithm RP1.

Algorithm RP1, shown in Fig. 3, renumbers the elementary matrices during the
course of its execution, since it computes an appropriate symmetric permutation Q to
minimize the number of factors. Conditions 1a and 1b in the algorithm ensure that
the first condition of problem (Pr2) is satisfied; similarly, condition 2 ensures that the
column subgraphs of the factors are transitively closed.
Alvarado and Schreiber [2] proved that Algorithm RP1 finds a best reordered
partition. The time complexity of the algorithm is dominated by the checking of
condition 2: in the worst case, this cost is Σ_{v∈V} d_I(v) d_O(v), where d_I(v) is the indegree
and d_O(v) is the outdegree of v. Since d_I(v) ≤ n − 1 and Σ_{v∈V} d_O(v) = e, the time
complexity of the algorithm is O(ne). If we assume that the indegrees and outdegrees
are bounded by d, then the complexity is O(d²n). The space complexity is O(e).

Input: A lower triangular matrix L = L1 ⋯ Ln and its DAG G(L).
Output: A permutation Q : V → {1, …, n} and a partition of L into factors.
forall v ∈ V do
    pred(v) ← {u : L_{vu} ≠ 0};  count(v) ← indegree(v);
od
i ← 0; {i elementary matrices have been included in factors}
k ← 1;  Pk ← ∅; {Pk is the factor being computed}
E ← {v ∈ V : count(v) = 0};
    {vertices eligible to be tested for inclusion in current factor}
E⁺ ← ∅; {vertices eligible to be tested for inclusion in the next factor}
while i < n do
    while E ≠ ∅ do
        choose v ∈ E and delete it from E;
        if ( [Condition 2'] Every successor of v is a successor of all
                u ∈ Pk ∩ pred(v) ) then
            i ← i + 1;  Q(v) ← i;  Pk ← Pk ∪ {v};
            for every successor w of v do
                pred(w) ← pred(w) \ pred(v);  count(w) ← count(w) − 1;
                if count(w) = 0 then E ← E ∪ {w}; fi
            od
        else
            include v in E⁺;
        fi
    od
    k ← k + 1;  Pk ← ∅;  E ← E⁺;  E⁺ ← ∅;
od

FIG. 4. Algorithm RP2.

At the expense of additional space, in most cases we can reduce the running time
required by Algorithm RP1 by incorporating two enhancements.
The first improvement is that a vertex need not be tested for inclusion into a factor
Pk until all of its predecessors have been numbered. To accomplish this, in count(v)
we count the number of unnumbered predecessors of each vertex v; initially, this is its
indegree. When this count becomes zero, we include v in a set E of vertices eligible
to be tested for inclusion in a factor Pk.
If an eligible vertex v satisfies condition 2', then it is deleted from E and included
in the factor Pk. Otherwise, it is included in E⁺, the set of vertices eligible to be
tested for inclusion in the next factor P_{k+1}. Further, since newly eligible vertices are
adjacent to currently eligible vertices, we need to maintain only the sets E and E⁺ in
the algorithm. Thus we can dispense with the processing of vertices by level values.
The second improvement is to reduce the cost of checking condition 2 in Algorithm
RP1. If u and v are both numbered vertices which have been included in the current
factor Pk, and v is a successor of u, then hadj(v) ⊆ hadj(u); otherwise v would
have failed condition 2. Thus we need not consider vertex u when applying the
requirements of condition 2 to a vertex that is also a successor of v. We make use of
this in the faster implementation by deleting from pred(w) some of the predecessors
of w that need not be examined in checking condition 2. In the situation above,
when v is included in Pk we remove u from the predecessor sets of v's successors, thus
avoiding some of the unnecessary checking.
In condition 2', the test whether u ∈ Pk ∩ pred(v) can be done efficiently by
maintaining an array that maps each vertex to the factor to which it has been assigned.
Fig. 4 contains a description of Algorithm RP2.
The worst-case time complexity of Algorithm RP2 is O(ne) as well (and there are
DAGs which attain this bound), though in practice the above improvements should
reduce the running times in many cases.

3. Cholesky factorization. Now we consider the restriction of (Pr2) to Cholesky
factors. Then the graph G(L), viewed as an undirected graph, is chordal; i.e., every
cycle with more than three edges has a chord, an edge joining two nonconsecutive
vertices on the cycle. The chordality of G(L) simplifies the problem a great deal, since
it suffices to consider the transitive reduction of G(L), the elimination tree, instead
of G(L). This simplification enables the design of an O(n)-time and space algorithm
(Algorithm RPtree) for computing the partition.

FIG. 5. A Cholesky factor L, its DAG, and its partitions. The original ordering and the partition
found by Algorithm P1 are shown on the left, and the ordering and partition found by Algorithm
RPtree are shown on the right.

In Fig. 5, we display the structure of a Cholesky factor L and the associated
chordal graph G(L). The vertex numberings on the left in G(L) correspond to the
matrix on the left, and those on the right correspond to the reordered matrix shown on
the right. Algorithm P1 partitions L into six factors as shown on the left; Algorithms
RP1, RP2, or RPtree will partition it into three factors as shown on the right.
The elimination tree of L (equivalently of G(L)) is a directed tree T = (V, E_T),
whose vertices are the columns of L, with a directed edge (j, i) ∈ E_T if and only if
the lowest-numbered row index of a subdiagonal nonzero in the j-th column of L is
i. (The edge is directed from j to i.) The vertex i is the parent of j, and j is a
child of i. If (j, i) is an edge in the elimination tree, the lowest-numbered vertex in
hadj(j) is i. The elimination tree of the graph G(L) in Fig. 5 is shown in Fig. 6. (The
vertex numbering corresponds to the original ordering shown on the left in Fig. 5.) A
comprehensive survey of the role of elimination trees in sparse Cholesky factorization
has been provided by Liu [13].

FIG. 6. The elimination tree of the Cholesky factor in Fig. 5.
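The parent relation can be read directly off the column structures of L: parent(j) is
the smallest row index i > j with l_{ij} ≠ 0. A minimal sketch of this computation,
assuming L is held in compressed-column (CSC) form with sorted row indices and
0-based numbering; the names are illustrative, not from the paper.

#include <stdio.h>

/* Build the elimination tree from the structure of a Cholesky factor L
 * in CSC form: rowind[colptr[j] .. colptr[j+1]-1] holds the sorted row
 * indices of column j.  parent[j] is the lowest-numbered row index of
 * a subdiagonal nonzero in column j, or -1 if j is a root. */
void etree_from_factor(int n, const int *colptr, const int *rowind,
                       int *parent)
{
    for (int j = 0; j < n; j++) {
        parent[j] = -1;
        for (int k = colptr[j]; k < colptr[j+1]; k++)
            if (rowind[k] > j) {          /* first subdiagonal nonzero */
                parent[j] = rowind[k];
                break;
            }
    }
}

int main(void)
{
    /* Columns with rows {0,1}, {1,2}, {2}: parents are 1, 2, root. */
    int colptr[] = {0, 2, 4, 5};
    int rowind[] = {0, 1, 1, 2, 2};
    int parent[3];
    etree_from_factor(3, colptr, rowind, parent);
    for (int j = 0; j < 3; j++)
        printf("parent(%d) = %d\n", j, parent[j]);
    return 0;
}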
Our partitioning algorithm will require as input the elimination tree, with vertices
numbered in a topological ordering. It also requires the subdiagonal nonzero counts of
each column v of L, stored in an array hd(v) (the higher degree of v). The algorithm
uses a variable member to partition the vertices; member(v) = ℓ implies that v belongs
to the set P_ℓ.
Unlike Algorithms RP1 and RP2, which compute the factors P1, …, Pm in that
sequence, Algorithm RPtree examines the vertices of the elimination tree in increasing
order of their numbers. If a vertex v is a leaf of the tree, then it is included in the
first member (the vertices in P1). Otherwise, it divides the children of v into two
sets: G1 is the subset of the children u such that the column subgraph of G(L)
induced by u and v is transitively closed, and G2 denotes the subset of the remaining
children. Let m1 denote the maximum member value of a child in G1 and m2 denote
the maximum member value of a child in G2. Set m_i = 0 if G_i = ∅. If G1 is empty,
or if m1 ≤ m2, then we will show that v cannot be included in the same member as
any of its children, and hence v begins a new member (m2 + 1). Otherwise, m1 > m2,
and v can be included together with some child u ∈ G1 such that member(u) = m1.
We now describe the details of an implementation. The vertices of the elimination
tree are numbered in a topological ordering from 1 to n. The descendant relationships
in the elimination tree are represented by two arrays of length n, child and sibling.
The entry child(v) is the first child of v, and sibling(v) is the right sibling of v, where
the children of each vertex are ordered arbitrarily. If child(v) = 0, then v has no
child and is a leaf of the elimination tree; if sibling(v) = 0, then v has no right
sibling. Algorithm RPtree is shown in Fig. 7.
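The child/sibling arrays are easily produced from a parent array in one pass; a
minimal sketch follows, using the paper's 1-based vertex numbering with parent(v) = 0
marking a root, and illustrative names of our own.

/* Convert a parent array into the child/sibling representation used by
 * Algorithm RPtree.  child[v] is one child of v (0 marks none) and
 * sibling[v] is the next child of v's parent (0 marks none). */
void build_child_sibling(int n, const int *parent, int *child, int *sibling)
{
    for (int v = 1; v <= n; v++) {
        child[v] = 0;
        sibling[v] = 0;
    }
    /* Scanning in increasing order and pushing onto the front of each
     * child list leaves each list in decreasing vertex order, which is
     * acceptable since the children may be ordered arbitrarily. */
    for (int v = 1; v <= n; v++) {
        int p = parent[v];
        if (p != 0) {
            sibling[v] = child[p];
            child[p] = v;
        }
    }
}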
The reader can verify that P1 = {1, 3, 4, 7, 8, 9}, P2 = {2, 5, 10}, and P3 =
{6, 11, 12} for the graph in Fig. 5. The time and space complexities of the algorithm
are easily shown to be O(n). We turn to a discussion of the correctness of the
algorithm.

Input: The elimination tree of a DAG G(L) and the higher degrees of the vertices.
Output: A mapping of the vertices such that member(v) = ℓ implies that v ∈ P_ℓ.
for v := 1 to n do
    if child(v) = 0 then {v is a leaf}
        member(v) := 1;
    else {v is not a leaf}
        u := child(v);  m1 := 0;  m2 := 0;
        while u ≠ 0 do
            if hd(u) = 1 + hd(v) then
                m1 := max{m1, member(u)};
            else {hd(u) < 1 + hd(v)}
                m2 := max{m2, member(u)};
            fi
            u := sibling(u);
        od
        if m1 ≤ m2 then {v begins a new factor}
            member(v) := m2 + 1;
        else {m1 > m2, v can be included in a factor which includes a child}
            member(v) := m1;
        fi
    fi
rof

FIG. 7. Algorithm RPtree.

Condition 1 of problem (Pr2) requires that if a vertex v belongs to P_ℓ, then all
predecessors of v must belong to P1, …, P_ℓ. The elimination tree T, being the transitive
reduction of the DAG G(L), preserves path structure: i.e., there exists a directed
path from v to w in G(L) if and only if there is a (possibly some other) directed path
from v to w in the elimination tree T. Hence the predecessors of a vertex in G(L) remain
its predecessors in the elimination tree. Further, since we assign member values
in a topological ordering of the vertices in the elimination tree, to satisfy Condition
1 we need consider only the children of a vertex v among its predecessors. Now since
Algorithm RPtree assigns member values such that member(v) is greater than or
equal to member(u) for any child u, the condition is satisfied.
Condition 2 requires that each factor P_ℓ be transitively closed. An important
property of the elimination tree [13] is that if v is the parent of a vertex u in the
elimination tree, then hadj(u) ⊆ {v} ∪ hadj(v). Hence hd(u) ≤ 1 + hd(v). On
the other hand, if u and v can be included in the same transitively closed column
subgraph, then hadj(u) ⊇ {v} ∪ hadj(v). It then follows that u and v can possibly be
included in the same column subgraph only if hadj(u) = {v} ∪ hadj(v), or equivalently,
hd(u) = 1 + hd(v). Furthermore, if v has a child u not satisfying the degree condition,
then v but not u is adjacent to some higher-numbered vertex x, and hence v cannot
belong to the same member as u. Thus we partition the children of v into two
subsets: G1 consists of children u such that u and v can be included in the same
column subgraph; G2 includes the rest of its children. It follows that if m_i is the
maximum member value among vertices in G_i, then the inclusion of v into a column
subgraph containing a child preserves transitivity only if m1 > m2.
It can be established by induction that Algorithm RPtree solves (Pr2) by partitioning
G(L) into the minimum number of factors over all topological orderings [16].

4. Experimental results. In this section we provide experimental results to
demonstrate the superiority of the partitioned inverse approach over the conventional
substitution algorithm for highly parallel triangular solution. First we describe the
performance of the various partitioning algorithms, and then we report results for
triangular solution on a CM-2.

4.1. Partitioning algorithms. We implemented Algorithms RP1, RP2, and
RPtree and compared their performance on eleven problems from the Harwell-Boeing
collection [6]. All the algorithms were implemented in C, within Alvarado's Sparse
Matrix Manipulation System [1]. Each problem was initially ordered using the
Multiple-Minimum-Degree ordering of Liu [12], and the structure of the resulting
lower triangular factor L was computed. We call this the primary ordering step.
Then Algorithm RP1, RP2, or RPtree was used in a secondary ordering step to
reorder the structure of L to obtain the minimum number of partitions over reorderings
that preserve the DAG G(L). All three algorithms lead to the same number of
factors in the partition since they solve the same problem.
The experiments were performed on a Sun SPARCstation IPC with 24 Mbytes of
main memory and a 100 Mbyte swap space, running the SunOS 4.1 version of the Unix
operating system. The unoptimized standard system compiler was used to compile
the code. Let τ(A) denote the number of nonzeros in the strict lower triangle of A;
τ(L) is then e, the number of edges in G(L). We scale these numbers by a thousand
for convenience. In Table 1, we report the scaled values of τ(A) and τ(L), the CPU
times taken by the primary and secondary ordering algorithms (in seconds), and the
height of the elimination tree obtained from the primary ordering.
Table 1 also reports the number of factors in the partitioned inverse of L. The
number in the column 'Factors (Pr1)' corresponds to the number of factors in the
solution of problem (Pr1), i.e., the best no-fill partition problem. The number in
the column 'Factors (Pr2)' indicates the number of factors in the solution of problem
(Pr2), i.e., the best reordered partitioning problem. Note the substantial decrease in
the number of factors obtained by the permutation.
Later results in this section will show that when the partitioned inverse is employed
on a highly parallel computer, the number of factors in the partitioned inverse
determines the complexity of parallel triangular solution. On the other hand, the
complexity of a conventional triangular solution algorithm is governed by the height
of the elimination tree. Table 1 shows both these quantities, and it is seen that the
number of factors in the partitioned inverse is severalfold smaller (by a factor of
sixteen on the average) than the elimination tree height. Hence the use of the partitioned
inverse potentially leads to much faster parallel triangular system solution on
massively parallel computers.
TABLE 1
Comparison of execution times on a Sun SPARCstation IPC for three secondary reordering schemes
with the MMD primary ordering. The parameters τ(A) and τ(L) have been scaled by a thousand for
convenience.

Problem        n     τ(A)   MMD time   τ(L)   etree height    RP1    RP2   RPtree   Factors (Pr1)   Factors (Pr2)
BCSPWR10    5,300    8.27     1.72     23.2       128         1.07   1.26    0.10         70              32
BCSSTK13    2,003    40.9     4.74     264        654        61.1   22.1     0.05         53              24
BCSSTM13    2,003    9.97     1.12     42.6       261         5.08   2.63    0.03         25              16
BLCKHOLE    2,132    6.37     0.73     53.8       224         3.15   2.58    0.05         24              15
CAN1072     1,072    5.69     0.72     19.4       151         0.78   0.92    0.02         21              16
DWT2680     2,680    11.2     1.82     49.9       371         2.43   2.45    0.05         50              36
LSHP3466    3,466    10.2     1.03     81.2       341         4.48   4.14    0.07         37              25
NASA1824    1,824    18.7     1.42     72.2       259         6.01   3.88    0.03         34              16
NASA4704    4,704    50.0     3.92     275        553        33.8   16.1     0.12         41              17
39x39 9pt   1,521    10.9     0.50     31.6       185         1.35   1.50    0.02         19              15
79x79 9pt   6,241    45.9     2.17     190        429        12.7   11.4     0.12         30              23

For the k × k model grid problem ordered by the optimal nested dissection ordering,
the height of the elimination tree is 3k + Θ(1), while the number of factors (in (Pr1)
and (Pr2)) is 2 log₂ k + Θ(1). The results in Table 1 show that the number of factors
for these irregular problems is only weakly dependent on the order of A, compatible
with logarithmic growth.
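As a concrete illustration of the gap implied by these asymptotics (ignoring the Θ(1)
terms), a 1024 × 1024 model grid has elimination tree height about 3 · 1024 = 3072 but
only about 2 log₂ 1024 = 20 factors, a ratio of more than 150.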
The RPtree algorithm has O(n) time complexity, while RP1 and RP2 are both
O(nτ(L)) algorithms (recall that τ(L) = e). This is confirmed by the experiments:
on the average problem in this test set, RPtree is more than a hundred times faster
than RP1 or RP2, and the advantage increases with increasing problem size. From a
practical perspective, the time needed by the RPtree algorithm is quite small when
compared to the cost of computing the initial MMD ordering. An equally important
advantage of the RPtree algorithm is that it requires only O(n) additional space,
whereas both RP1 and RP2 require O(τ(L)) additional space. However, Algorithms
RP1 and RP2 can be used to partition triangular factors arising from approximate or
incomplete Cholesky factorizations as well as unsymmetric and symmetric indefinite
factorizations.
We have also experimented with a variant of the Minimum-Length-Minimum-Degree
(MLMD) ordering [5] as the primary ordering, but we do not report detailed
results here. The MLMD ordering incurs a great deal more fill in L than the MMD
algorithm, and its current, fairly straightforward implementation is quite slow compared
to the MMD algorithm. We believe an implementation comparable in sophistication
to the MMD algorithm should not be significantly slower than MMD, and may also
reduce fill. In spite of the greater fill, the MLMD ordering is more effective in almost
all cases than MMD in reducing the number of factors in the partition of both L and
L_Q. In some cases, the initial number of factors obtained when MLMD is used as the
primary ordering is lower than the final number of levels obtained with MMD after
the secondary reordering (Q in problem (Pr2)). However, because of the increased fill,
choosing between MMD and MLMD as the primary ordering is not straightforward.

4.2. Triangular solution on a CM-2. Now we compare the performance of
the partitioned inverse approach with the conventional substitution algorithm for
triangular solution on a CM-2.

An efficient parallel substitution algorithm was implemented in CM Fortran, a
dialect of Fortran 90. The data structure consists of several arrays of length equal
to τ(L). We associate a set of τ(L) virtual processors with the nonzeros, one with
each position in these arrays. We store the factors L̄ and D, where L = DL̄ and L̄ is
unit triangular, and solve Lx = b via L̄x = D⁻¹b, since this removes a multiplication
from the inner loop. The matrix L̄ is stored as a one-dimensional array containing
its nonzeros in column-major order. In addition, the level of vertex j in G(L) is
stored along with L_{ij}. The elements of b are stored at the processors containing the
corresponding diagonal elements of L̄. The solution x overwrites b. Finally, a Boolean
vector indicates the location of the diagonal elements in the arrays. This vector thus
segments the arrays into columns of differing lengths; in other words, the matrix is
stored as a ragged array of columns of nonzeros. The Connection Machine software
provides some operations for such data structures. It allows broadcast of values from
diagonal elements to all elements of the corresponding column (called a segmented
copy scan) and summation of the values in a column (with a segmented add scan).
Also, the Connection Machine router allows processors to send data to any other
processor or read from any other processor; this is expressed using a vector-valued
subscript in CM Fortran. Calls to utility library routines were used for the scan
operations, which are not part of CM Fortran.
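The semantics of the two segmented operations can be captured by a short serial
reference model over the ragged array. The sketch below is only illustrative: the CM-2
primitives act on all segments concurrently, and the names are ours, not the actual
CM library routines.

/* Serial reference semantics for the two segmented operations on the
 * ragged array of columns.  seghead[k] = 1 marks the first slot of a
 * segment (here, the diagonal entry of a column). */

/* Segmented copy scan: broadcast the value at each segment head to
 * every slot of that segment. */
void seg_copy_scan(int len, const int *seghead, double *val)
{
    double head = 0.0;
    for (int k = 0; k < len; k++) {
        if (seghead[k]) head = val[k];
        else            val[k] = head;
    }
}

/* Segmented add scan (reduction form): sum the values of each segment
 * into its head slot. */
void seg_add_scan(int len, const int *seghead, double *val)
{
    int h = 0;                       /* index of current segment head */
    for (int k = 0; k < len; k++) {
        if (seghead[k]) h = k;
        else            val[h] += val[k];
    }
}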
The substitution algorithm for triangular solution loops sequentially over levels
of G(L), starting with the source vertices (level zero). At the beginning of step ℓ,
those elements x_j for which level(j) = ℓ are known. Recall that x_j is stored in the
processor holding L_{jj}. These known values of x are sent to the processors holding
the corresponding column of L by a segmented copy scan. These processors then
multiply their L value by the element of x they receive. The router is then used
to permute all these products into row-major order, so that the elements of each
set R_i = {L_{ij} x_j | L_{ij} ≠ 0 and level(j) = ℓ} are stored in consecutive locations.
The vector-valued subscript used to accomplish this permutation is computed in a
preprocessing step. The rows R_i now form a ragged array. A segmented add scan
forms the sums of these partial results within rows. Finally, the router is used to
send the sum of the elements of R_i to the processor holding L_{ii} and b_i, where it can
be subtracted from b_i. (In fact, our code avoids this last subtraction entirely, doing
it as part of the add scan above.)
Thus an iteration of the loop involves one parallel multiplication, one copy scan
and one add scan, and two uses of the router for permutation of data. The time
required to set up and load the data structure, including the computation of the
permutation used, the loading of b and extraction of x, was not timed.
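A serial sketch of this level-by-level schedule may help fix ideas: it performs exactly
the dependency-respecting updates described above, with the parallel multiply, scans,
and router traffic of the CM-2 version collapsed into ordinary loops. The CSC arrays,
0-based indexing, and names are illustrative assumptions of ours.

/* Level-scheduled substitution for L x = b with L unit lower triangular
 * in CSC form (colptr/rowind/lval) and level[j] precomputed as earlier.
 * On entry x holds b; on exit it holds the solution. */
void level_solve(int n, const int *colptr, const int *rowind,
                 const double *lval, const int *level, int nlevels,
                 double *x)
{
    for (int lev = 0; lev < nlevels; lev++)
        for (int j = 0; j < n; j++)
            if (level[j] == lev)                 /* x[j] is now known   */
                for (int k = colptr[j]; k < colptr[j+1]; k++) {
                    int i = rowind[k];
                    if (i > j)                   /* subdiagonal entries */
                        x[i] -= lval[k] * x[j];  /* update b_i          */
                }
}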
The sequential algorithm for solving a triangular system is quite similar to the
algorithm for computing a matrix-vector product involving a triangular matrix. Hence
the code for triangular solution using the partitioned inverse approach is nearly identical
to that for our substitution method, in that the inner loop involves the same
operations. But the number of executions of the loop in the partitioned inverse approach
is equal to the number of factors in the partition of L rather than the number
of levels in its graph.
We report the CM performance of these two methods for an n × n, dense, unit
lower-triangular matrix L. Our results are obtained with CM Fortran in the 'slicewise'
execution model, which treats each Weitek chip of the CM-2 as a processor. For
this experiment we used 256 Weitek processors on the Connection Machine at NASA
Ames. Results are given in Table 2. Clearly, the partitioned method is superior, by
a factor roughly equal to the ratio of the number of levels in G(L) (which is n in this
case) to the number of factors in the partition of L (one).

TABLE 2
CM-2 times (seconds) for full matrix substitution and partitioned solution.

  n      τ(L)     Levels in G(L)   Substitution time   Factors   Partitioned soln. time
 256     32,896        256               7.34             1              0.04
 512    131,328        512              50.22             1              0.21

TABLE 3
CM-2 times (seconds) for sparse triangular substitution and partitioned solution.

Matrix   Ordering   Factorization     τ(L)    Levels in G(L)   Substitution time   Factors   Partitioned soln. time
  L1       RCM          ILU          23,526        823              19.22            816            11.66
  L2       MLMD         ILU          26,793         78               1.38             66             1.20
  L3       MLMD        exact        118,504        311              16.78             16             0.87


Next, we performed an experiment using a sparse matrix A of order 4037, obtained
from a triangular mesh in the region around a three-element airfoil. Three unit
lower triangular matrices L1, L2, and L3 were obtained by approximate factorization.
L1 is obtained by an incomplete LU factorization of A; we carry out the Gaussian
elimination process, but we allow nonzeros in L (and U) only where there is a nonzero
in A². The ordering of A is obtained from a lexicographic sort of the (x, y) coordinates
of the grid which leads to the matrix; this ordering produces a large number of levels
in G(L). L2 is the incomplete LU factor obtained when a variant of MLMD is used
as the primary ordering of A. L3 is the exact lower-triangular factor of A, with the
same primary ordering as for L2.

In Table 3 we give the size of these factors, the number of levels (this is proportional
to the time required for our parallel substitution algorithm), and the number of
factors, which is in practice proportional to the time required by the partitioned
inverse approach.

These results confirm that the time required to solve a triangular system by partitioning
of the inverse is quite well predicted by the number of factors in the partition.
It also shows that the number of levels in G(L) is a good predictor of the time required
for solution by substitution methods. We see that when L has a fairly rich structure
the partitioned inverse approach is much better than the substitution method, but
when L is very sparse there is little gained. The use of an MLMD primary ordering
improves both substitution and partitioned methods. However, with the introduction
of the additional fill in the exact factor L3 (compared with L2), the number of
levels in G(L) increases sharply (as does the time for substitution), while the number
of factors in the best reordered partition drops dramatically. The difference in the
solution time, even for this problem of modest size, is about a factor of twenty. Thus
we conclude that the method can be quite useful on highly parallel machines when

the matrix L has a rich enough structure, as happens when it is an exact triangular
factor.

5. Concluding remarks. We have considered the characterization and computation
of partitioned inverse representations involving the fewest factors of a given
sparse triangular matrix L. Three variants of this problem have been considered in
this paper; for completeness, here we will also describe a fourth variant, which will be
considered in a future paper. The four problems minimize the number of factors in
the in-place partitioned inverse of L
• over a fixed row and column ordering of L (problem (Pr1)),
• over symmetric permutations of L which preserve its lower triangular
structure (problem (Pr2)),
• over symmetric permutations which preserve its lower triangular structure,
when L is a Cholesky factor (problem (Pr2-Cholesky)), and
• over symmetric permutations of the filled matrix F = L + D + Lᵀ which
preserve the structure of F, again when L is a Cholesky factor (problem
(Pr3)).
When L is obtained from unsymmetric, symmetric indefinite, or incomplete factorization,
Algorithms RP1 and RP2 may be used to compute the best reordered
partition in O(ne) time and space. When L is a Cholesky factor, Algorithm RPtree
is an extremely efficient O(n) time and space algorithm for computing the reordered
partition corresponding to the third problem listed above. We believe that the results
in this paper demonstrate the potential superiority of the partitioned inverse
approach over the conventional substitution algorithm for sparse triangular solution
on highly parallel computers.
We now discuss the fourth variant of the partitioning problem. Given the factorization
A = LDLᵀ of a symmetric, positive definite matrix, consider the filled matrix
F = L + D + Lᵀ and the corresponding undirected graph G(F), which is chordal.
In problem (Pr3) we ask for the minimum number of factors m in the partitioned
inverse representation of L over all vertex orderings that preserve the structure of the
filled graph G(F) (rather than preserving the structure of the DAG G(L), as (Pr2)
does). Such an ordering would have to be applied to the original matrix A, before the
computation of the factorization.
We illustrate this idea by means of the example in Fig. 8. The filled matrix F
shown on the left corresponds to the Cholesky factor L in Fig. 5, and the filled graph
G(F) is the undirected graph corresponding to the DAG G(L) in the latter figure.
It can be shown that a solution to (Pr3) would partition the vertices of G(F) into
two factors P1 = {1, 3, 4, 7, 8, 9, 12} and P2 = {2, 5, 6, 10, 11}. A renumbering of
the columns and rows of F obtained by numbering the vertices in P1 before P2 (and
topologically within each Pi) and the corresponding partition are shown in the matrix
on the right of Fig. 8. (The corresponding vertex ordering is shown on the right in
G(F).) It is easily seen that the permutation moves some elements of the lower
triangle into the upper triangle and vice versa, and that it preserves the structure
of F.

FIG. 8. Partitioning over the class of perfect elimination orderings of the filled graph G(F). The
original ordering is shown on the left, and an ordering and a partition that solve problem (Pr3) are
shown on the right. The figure illustrates that the upper and lower triangles of the matrix on the left
are not preserved under the permutation, but the structure of F is preserved.

Problem (Pr3) turns out to be much harder than (Pr2), but can be solved by developing
several results concerning transitive perfect elimination orderings: i.e., perfect
elimination orderings of subgraphs of chordal graphs which make them transitively
closed subgraphs as well. Algorithms to solve (Pr3) will be reported in [14, 15].

The numerical stability of the partitioned inverse approach to triangular solution
has been investigated in [10]. The method has been shown to be normwise backward
stable and normwise forward stable when a certain scalar, which can be considered as
a 'growth factor', is small; this condition is satisfied when L is well-conditioned. However,
the method does not enjoy the componentwise backward and forward stability
properties of the substitution algorithm. If the factors are not invertible in place,
then the partitioned inverse approach may be unstable even when the 'growth factor'
is small.

There are several other directions to explore in future work.

The first issue concerns the uniqueness of the minimum-cardinality partition. The
algorithms that we have designed for computing the partitions in all of the above
problems belong to the class of greedy algorithms. Further, for each i, they include
as many elementary matrices as possible into a factor Pi, subject to the condition
that the factors P1, …, P_{i−1} have been obtained similarly. It is easily seen, however,
that minimum-cardinality partitions need not be unique: for instance, for problem
(Pr2-Cholesky), it is possible to design an algorithm that processes the vertices of the
elimination tree from n down to 1 to compute the partition in the order Pm, …, P1.
Such a partition would put, for each i, as many elementary matrices as possible into
Pi, subject to the condition that factors Pm, …, P_{i+1} have been obtained similarly.
In the context of highly parallel triangular solution, it may be preferable to have
each factor contain roughly the same number of nonzeros, since the matrix-vector
multiplication involves as many virtual processors as the number of nonzeros.

There is some flexibility in assigning elementary matrices to factors, and this may
be exploited to reduce the disparity between the number of nonzeros in the different
factors. If no successor of v ∈ P_i is included in P_i or P_{i+1}, then it is possible to include
v in the latter factor, while preserving the fact that the column subgraph G(P_{i+1}) is
transitively closed.
A second issue is to reduce the number of factors in the partitioned inverse further
by permitting some fill during the inversion of the factors, taking care that the method
remains numerically stable. A third issue is to partition the matrix L into a block
triangular matrix, ensuring only that the diagonal blocks incur no fill upon inversion.
Then the subdiagonal blocks of L need not be inverted, since the solution to the
triangular system can be decomposed in the usual way into the solution of several
subsystems. The solution vector associated with each subsystem can be computed
using the partitioned inverse of its diagonal block, and then by block back-substitution
the contribution this solution vector makes to higher-numbered subsystems can be
eliminated. Fourthly, it may be possible to use the transitive reduction of the DAG
G(L) to design partitioning algorithms which are more efficient in practice than RP1
or RP2.

REFERENCES

[1] F. L. ALVARADO, Manipulation and visualization of sparse matrices, ORSA J. Comput., 2
(1990), pp. 180-207.
[2] F. L. ALVARADO AND R. SCHREIBER, Optimal parallel solution of sparse triangular systems,
SIAM J. Sci. Stat. Comput., to appear, 1992.
[3] F. L. ALVARADO, D. C. YU, AND R. BETANCOURT, Partitioned sparse A⁻¹ methods, IEEE
Trans. Power Systems, 5 (1990), pp. 452-459.
[4] E. ANDERSON AND Y. SAAD, Solving sparse triangular systems on parallel computers, International
Journal of High Speed Computing, 1 (1989), pp. 73-95.
[5] R. BETANCOURT, An efficient heuristic ordering algorithm for partial matrix refactorization,
IEEE Trans. Power Systems, 3 (1988), pp. 1181-1187.
[6] I. S. DUFF, R. G. GRIMES, AND J. G. LEWIS, Sparse matrix test problems, ACM Trans.
Math. Softw., 15 (1989), pp. 1-14.
[7] S. C. EISENSTAT, M. T. HEATH, C. S. HENKEL, AND C. H. ROMINE, Modified cyclic
algorithms for solving triangular systems on distributed-memory multiprocessors, SIAM J.
Sci. Stat. Comput., 9 (1988), pp. 589-600.
[8] J. R. GILBERT, Predicting structure in sparse matrix computations, Tech. Report 86-750,
Computer Science, Cornell University, 1986.
[9] S. W. HAMMOND AND R. SCHREIBER, Efficient ICCG on a shared memory multiprocessor,
International Journal of High-Speed Computing, to appear, 1992.
[10] N. J. HIGHAM AND A. POTHEN, The stability of the partitioned inverse approach to parallel
sparse triangular solution, Tech. Report CS-92-52, Computer Science, University of Waterloo,
Oct. 1992. (Also Numerical Analysis Report No. 222, Department of Mathematics,
University of Manchester, England.) Submitted to SIAM J. Sci. Stat. Comput.
[11] G. LI AND T. F. COLEMAN, A new method for solving triangular systems on a distributed
memory multiprocessor, SIAM J. Sci. Stat. Comput., 10 (1989), pp. 382-396.
[12] J. W. H. LIU, Modification of the minimum-degree algorithm by multiple elimination, ACM
Trans. Math. Softw., 11 (1985), pp. 141-153.
[13] J. W. H. LIU, The role of elimination trees in sparse factorization, SIAM J. Mat. Anal. Appl.,
11 (1990), pp. 134-172.
[14] B. W. PEYTON, A. POTHEN, AND X. YUAN, Partitioning a chordal graph into transitive
subgraphs for parallel sparse triangular solution. Work in preparation, Oct. 1992.
[15] B. W. PEYTON, A. POTHEN, AND X. YUAN, A clique tree algorithm for partitioning a chordal
graph into transitive subgraphs. Work in preparation, 1992.
[16] A. POTHEN AND F. L. ALVARADO, A fast reordering algorithm for parallel sparse triangular
solution, SIAM J. Sci. Stat. Comput., 13 (1992), pp. 645-653.
THE FAN-BOTH FAMILY OF COLUMN-BASED
DISTRIBUTED CHOLESKY FACTORIZATION ALGORITHMS*

CLEVE ASHCRAFT†

Abstract. There are two classic column-based Cholesky factorization methods, the fan-out
method that communicates factor columns among processors, and the fan-in method that communicates
aggregate update columns among processors. In this paper we show that these two very
different methods are members of a "fan-both" algorithm family.

To each member of this algorithm family is associated a pair of integers, (q1, q2), where q1 q2 = p,
the number of processors. The fan-out method is a (p, 1) method, while the fan-in method is a (1, p)
method. Methods with 1 < q1, q2 < p have characteristics of both fan-out and fan-in, and thus give
the family of methods its name.

The fan-out and fan-in methods have upper bounds on message counts of (p − 1)|V| and message
volume of (p − 1)|E_L|, where |V| is the size of the matrix and |E_L| is the number of nonzero entries
in the Cholesky factor. In general these bounds are (q1 + q2 − 2)|V| and (q1 + q2 − 2)|E_L|, and a
(√p, √p) method has bounds 2(√p − 1)|V| and 2(√p − 1)|E_L|.

1. Introduction. The Cholesky factorization of a sparse symmetric positive definite
matrix on a distributed architecture is an area of active research. This fundamental
matrix computation is found throughout scientific and engineering applications,
and achieving good performance for this computation in a distributed environment
is an important goal of the research community.
To date there are four column-based distributed Cholesky algorithms: fan-out
[6], [8], [9], multifrontal [15], fan-in [2], and domain fan-out [3], [10]. Any column-based
Cholesky method must communicate among the processors factor columns
l_{*,i} and/or aggregate update columns Σ_i l_{*,i} l_{j,i}. The fan-out method communicates
only factor columns, while the fan-in method communicates only update columns.
The multifrontal and domain fan-out methods communicate both factor and update
columns.
The major contribution of this paper is to describe a family of column-based
factorization algorithms that contains the very different fan-out and fan-in methods
as members. Associated with each algorithm in the family is a pair of integers, (q1, q2),
where q1 q2 = p, the number of processors. Fan-out has parameters (p, 1), while fan-in
has parameters (1, p). Any (q1, q2) method where 1 < q1, q2 < p has properties of
both fan-out and fan-in, communicating both factor columns and update columns,
and thus we give the name "fan-both" to the algorithm family.
To our knowledge, the new fan-both methods have neither appeared in the literature
nor been implemented in a distributed environment. We limit ourselves at
present to presenting a general formulation for the algorithm family, analyzing the

* This work has been partially supported by the University of Minnesota IMA Workshop on
Graph Theory and Sparse Matrices.
† Boeing Computer Services, P. O. Box 24346, MS 7L-21, Seattle, Washington 98124,
cleve@espresso.boeing.com

required communication, and examining some properties on a set of matrices taken
from the Harwell-Boeing matrix collection [4]. In short, the central (√p, √p) fan-both
method can achieve superior load balance when compared with the fan-out and fan-in
methods. Upper bounds on message counts and message volumes for the central
method are 2(√p − 1)|V| and 2(√p − 1)|E_L|, where |V| is the size of the matrix and
|E_L| is the number of entries in the Cholesky factor. These results compare favorably
with the corresponding upper bounds of (p − 1)|V| and (p − 1)|E_L| for the fan-out
and fan-in methods.
We assume that the reader is familiar with graph theory and sparse matrices at
the level of [7]. We have strived to make this paper self-contained, but familiarity
with the original sources [2], [6], [8], [9] for the fan-out and fan-in methods may be
helpful to the reader.
In Section 2 we present the notation that will be used in this paper. Section 3
describes a general formulation of a column-based distributed Cholesky factorization.
The manner in which the computations are assigned to processors is explicitly defined
by a computation map. There are two important ways in which the computations are
scheduled: forward-looking algorithms (fan-out is an example) and backward-looking
algorithms (fan-in is an example). Using the computation map, the number of messages
and matrix entries communicated between processors is precisely determined.
Section 4 describes the fan-both family of algorithms, defined by a particular form
of computation map. The classic fan-out and fan-in algorithms are described in detail.
Upper bounds on communication are defined in terms of the (q1, q2) parameters for
any member of the algorithm family.
Section 5 presents some empirical studies on four matrices from the Harwell-Boeing
collection [4]. We first prove an upper bound on the speedup of a factorization for any
column-based factorization algorithm. For each matrix, a number of processors close
to the maximum speedup is chosen. (Any of the factorization methods can achieve
good performance for a small number of processors.) A new computation map is then
described, which balances computations among subsets of the processors.
For each of the four test matrices we have a number of processors and a map of
computations to processors. We then examine three properties.
1. Load Balance: We present profiles of the operation counts for the processors.
The elapsed computation time is at least proportional to the maximum
number of operations performed by a processor. Using these profiles we are
able to determine upper bounds on efficiency for the methods.
2. Working Storage: Fan-in and fan-out require no temporary working storage,
aside from a message buffer required by all the methods. The (√p, √p)
fan-both method does require working storage, which can be as much as
n²/(√p − 1) extra entries to factor an n × n dense symmetric matrix. The
working storage for the sparse test matrices is much less, though non-trivial.
3. Communication: The loose bounds on communication cited above can
be considerably tightened, particularly for large numbers of processors. We
show that the (√p, √p) fan-both method does require less communication
than the fan-in or fan-out methods (with one exception), but the difference
is nowhere near as great as predicted by the loose bounds.

There are two key ideas inherent in the fan-both family of algorithms.
1. Both factor entries and aggregate update entries can be exchanged among
processors.
2. The sets of processors that exchange each type of information can be restricted,
and this can significantly reduce the communication.
These two ideas can be used in submatrix factorizations, where the fundamental
data structure is a submatrix, and matrix-matrix computations form the fundamental
computational tasks. An algorithm that sends entries of L, entries of Lᵀ, and
aggregate update entries Σ_i l_{k,i} l_{j,i} can employ three dimensions of communication,
with a resulting O(p^{1/3} |E_L|) communication volume. Section 6 outlines this extension
and presents some concluding remarks.

2. Notation. In this section we introduce the notation, the data structures, and
the fundamental computational tasks that are performed during a column-based sparse
Cholesky factorization. The matrix A is n × n and symmetric. The rows and columns
of A are numbered 0, 1, …, n − 1. The nonzero structure of A is represented by a
graph G_A = (V, E_A), where a_{k,j} ≠ 0 if and only if (k, j) ∈ E_A. The nonzero structure
of the Cholesky factor L is represented in a similar manner by the graph G_L = (V, E_L),
where l_{k,j} ≠ 0 if and only if (k, j) ∈ E_L. Throughout the paper we attempt to follow
this convention: if i, j and/or k appear as indices, then i ≤ j ≤ k.

2.1. The fundamental equation to compute a factor column. Using the
definition of matrix multiplication for A = L Lᵀ, the equation to compute the j-th
column of L is given below:

(1)   [ l_{j,j}, l_{j+1,j}, …, l_{n−1,j} ]ᵀ l_{j,j}
          = [ a_{j,j}, a_{j+1,j}, …, a_{n−1,j} ]ᵀ
            − Σ_{i=0}^{j−1} [ l_{j,i}, l_{j+1,i}, …, l_{n−1,i} ]ᵀ l_{j,i},

or more compactly, l_{*,j} l_{j,j} = a_{*,j} − Σ_{i=0}^{j−1} l_{*,i} l_{j,i}. If A is sparse, then most of the a_{i,j} and
l_{i,j} entries will be zero. The sets of indices for the nonzero entries in row j and column j
are given as follows:

R_j = {i | l_{j,i} ≠ 0}   and   C_j = {k | l_{k,j} ≠ 0}.

Equation (1) can now be written

(2)   l_{*,j} l_{j,j} = a_{*,j} − Σ_{i ∈ R_j \ {j}} l_{*,i} l_{j,i}

In this equation the '*' in l_{*,j} is short for C_j, the row indices of the nonzero entries of
column j in L. The '*' in a_{*,j} stands for the indices (k, j) ∈ E_A, where k ≥ j. The '*' in l_{*,i}
is short for C_i ∩ C_j, the row indices of column i that will update column j.
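The column structures C_j come directly from the usual compressed-column storage
of L; the row structures R_j can be built from them by one transpose-pattern pass
over the nonzeros. The sketch below assumes CSC arrays with illustrative names of
our own.

/* Build the row structures R_j = { i | l_{j,i} != 0 } from the CSC
 * column structures (colptr/rowind) of L.  rptr (length n+1) and rind
 * (length nnz) must be preallocated by the caller. */
void build_row_structures(int n, const int *colptr, const int *rowind,
                          int *rptr, int *rind)
{
    for (int j = 0; j <= n; j++) rptr[j] = 0;
    for (int i = 0; i < n; i++)                 /* count nonzeros per row */
        for (int k = colptr[i]; k < colptr[i+1]; k++)
            rptr[rowind[k] + 1]++;
    for (int j = 0; j < n; j++) rptr[j+1] += rptr[j];
    for (int i = 0; i < n; i++)                 /* fill: R_j gains index i */
        for (int k = colptr[i]; k < colptr[i+1]; k++) {
            int j = rowind[k];
            rind[rptr[j]++] = i;
        }
    for (int j = n; j > 0; j--) rptr[j] = rptr[j-1];  /* undo pointer shift */
    rptr[0] = 0;
}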

2.2. The fundamental data structures. The right hand side of equation (2)
is composed of a number of columns, one for each index in R_j. The vector associated
with i in the row structure of j we define below:

t^i_{*,j} = a_{*,j}  if i = j,   and   t^i_{*,j} = −l_{*,i} l_{j,i}  if i ≠ j.

The '*' in t^i_{*,j} is short for C_i ∩ C_j. The subscript j means this vector will contribute
to l_{*,j}. The superscript i means that the update is from column i. The superscript
index can be extended to sets. For example,

t^{{h,i}}_{*,j} = t^h_{*,j} + t^i_{*,j}

is the sum of two component columns. The '*' subscript in t^{{h,i}}_{*,j} is the set C_j ∩ (C_h ∪ C_i).
In general, for a set α we have:

t^α_{*,j} = Σ_{i∈α} t^i_{*,j}
Using this notation, equation (2) can be written as

(3)   l_{*,j} l_{j,j} = t^{R_j}_{*,j}

The best way to interpret this equation is to view the temporary vector t^{R_j}_{*,j} as a
vector that will accumulate the original column a_{*,j} and the updates from preceding
factor columns.

2.3. The fundamental computation tasks. The general sparse algorithm to
compute column j of L can be expressed as follows.

zero    t^{R_j}_{*,j} := 0
load    t^{R_j}_{*,j} := t^{R_j}_{*,j} + a_{*,j}
for i ∈ R_j \ {j}
    update  t^{R_j}_{*,j} := t^{R_j}_{*,j} − l_{*,i} l_{j,i}
scale   l_{*,j} l_{j,j} = t^{R_j}_{*,j}

There are three types of computations performed in equation (3). The first type is
to load the original entries in column j into the accumulation vector. The second type
is to perform the update from column i to column j. The third type is to compute the
factor column by scaling the accumulated column. The update task is the cmod(j,i)
task well known in the literature, while the scale task is the cdiv(j) task [13].
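A dense, serial rendering of these three tasks makes the loop structure concrete. The
sketch below (column-major arrays, 0-based indices, names of our own) computes one
factor column with a load, cmod-style updates, and a cdiv-style scale; it is only an
illustration, not the distributed algorithm itself.

#include <math.h>

/* Serial, dense sketch of the fundamental tasks for column j of
 * A = L L^T: load, update (cmod(j,i)), and scale (cdiv(j)).  a and l
 * are n x n column-major arrays; t is caller-supplied workspace of
 * length n - j holding the accumulation vector t_{*,j}. */
void factor_column(int n, int j, const double *a, double *l, double *t)
{
    /* load: t := a_{*,j} (the lower part of column j of A) */
    for (int k = j; k < n; k++)
        t[k - j] = a[k + j * n];

    /* update: t := t - l_{*,i} l_{j,i} for each i in R_j \ {j} */
    for (int i = 0; i < j; i++) {
        double lji = l[j + i * n];
        if (lji != 0.0)
            for (int k = j; k < n; k++)
                t[k - j] -= l[k + i * n] * lji;
    }

    /* scale: l_{*,j} l_{j,j} = t, so l_{j,j} = sqrt(t_j) */
    double ljj = sqrt(t[0]);
    for (int k = j; k < n; k++)
        l[k + j * n] = t[k - j] / ljj;
}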

3. A general formulation for the column-based distributed factorization.
The previous section defined the fundamental data structures that will be in
existence during the Cholesky factorization, and identified the fundamental tasks that
are performed on them. In a distributed environment, the data structures and the
fundamental tasks will be partitioned among the processors in some manner.
Subsection 3.1 presents the computation map that specifies which processor will
perform each fundamental task. The possibility that a total accumulation vector t^{R_j}_{*,j}
may have components on two or more processors requires a new type of computation,
the assembly task.

Subsection 3.2 describes two ways by which an l_{*,j} factor column can be computed.
At step j, the entries a_{*,j} will be loaded, and the column will be scaled. However,
there is considerable freedom in specifying which update operations are performed.
At step j, a backward-looking method executes updates to column j from preceding
columns, while a forward-looking method executes updates from column j to succeeding
columns. Each of these two families has a member represented in the literature.

Subsection 3.3 analyzes the communication that will be performed during a distributed
factorization. We measure two simple properties, the number of messages
and the number of floating point numbers communicated. These two characteristics
are precisely determined by the computation map.
In this section we do not impose any special format on the computation map.
Data structures, algorithms, and communication analysis are described for a general
computation map. In Section 4 we will discuss and analyze a family of algorithms
defined by a specific type of computation map.

3.1. The computation map. The responsibility for computing and accumulating
the t^i_{*,j} temporary vectors will be distributed among p processors in some fashion.
The manner in which the computations are distributed is defined by the computation
map from the edge set of L to the processor ids:

m : E_L → {0, 1, …, p − 1}

For i < j and (j, i) ∈ E_L, processor m(j, i) is responsible for computing t^i_{*,j} = −l_{*,i} l_{j,i}.
Processor m(j, j) performs the load and scale tasks for column j.
The structure of row j in L can be partitioned into a number of disjoint sets,
based on which processor owns the t^i_{*,j} temporary vector, as follows:

R_j = R^0_j ∪ R^1_j ∪ ⋯ ∪ R^{p−1}_j   where   R^q_j = {i | m(j, i) = q}

In a similar manner, the structure of column j is partitioned as follows:

C_j = C^0_j ∪ C^1_j ∪ ⋯ ∪ C^{p−1}_j   where   C^q_j = {k | m(k, j) = q}

For column j, the right hand side of equation (3) can be expressed as the following
sum:

t^{R_j}_{*,j} = Σ_{q=0}^{p−1} t^{R^q_j}_{*,j}

To simplify the notation, we write t^{R^q_j}_{*,j} ≡ t^q_{*,j}. Note the difference between t^i_{*,j} and
t^q_{*,j}: the first has a column id as a superscript, the second a processor id.
The total accumulated vector t^{R_j}_{*,j} we write as t^*_{*,j}. This column is the sum of at
most p partial nonzero accumulated vectors:

t^*_{*,j} = t^{R_j}_{*,j} = Σ_{q=0}^{p−1} t^q_{*,j} = t^0_{*,j} + t^1_{*,j} + ⋯ + t^{p−1}_{*,j}

The partial vector t^q_{*,j} will be nonzero only if the index set R^q_j is nonempty.
It will be useful to specify sets of processors that have interactions with each
column j, either performing updates to j from preceding columns, or using l_{*,j} to
update succeeding columns. We define subsets of {0, 1, …, p − 1} as follows:

α_j = {m(k, j) | k ∈ C_j}   and   β_j = {m(j, i) | i ∈ R_j}

Note that processor m(j, j) is in both α_j and β_j.
Our description of a general column-based distributed factorization uses this notation,
and requires one new fundamental task. Let q and r be processors in β_j with
q = m(j, j) and r ≠ q. Processors q and r will each compute their nonzero t^q_{*,j} and
t^r_{*,j} columns. These must be assembled together into the total accumulation vector.
The contributions from q will be loaded into t^*_{*,j} on processor q. The contributions
from r must be accumulated into t^*_{*,j}, and are assembled on processor q, as:

assemble  t^*_{*,j} := t^*_{*,j} + t^r_{*,j}

For column j, there are |β_j| − 1 assembly tasks, one for each processor r ∈ β_j such
that r ≠ m(j, j).

3.2. Two families of distributed algorithms. The Cholesky factorization of
an n × n matrix can be naturally broken up into n steps. At the completion of step
j, the updates to column j have been completed and l_{*,j} is determined. Within this
framework there is a great deal of freedom to specify at which step an update from
column i to column j will be performed.
We consider two possibilities. In the first, at step j all updates from j to k ∈
C_j \ {j} are performed. This first algorithm is forward-looking, for column j acts on
all succeeding columns. In the second method, at step j all updates to column j are
performed by columns i ∈ R_j \ {j}. This second algorithm is backward-looking, for
column j receives updates from all preceding columns.
These are by no means the only two possible ways to structure the column updates.
For example, at step j one could perform all updates from preceding columns i ∈
R_j \ {j} such that i + j is even, and perform all updates from j to succeeding columns
k ∈ C_j \ {j} such that j + k is odd. This is not meant as a practical example, but
serves to illustrate the freedom that the designer of an algorithm has to structure the
computations.
The n steps need not be executed in ascending order, as step 0, step 1, …, step
(n − 1), but the execution sequence of the steps must obey a partial order. If column
i updates column j, then step i must be executed before step j. The dependencies
between columns, and thus the dependencies of the steps of the factorization, are
modeled using an elimination tree [16], [14].
The algorithms in Figures 1 and 2 specify the action taken by a processor
at step j of the factorization. These descriptions are true in spirit to many of the
distributed factorization algorithms in the literature to date. The fan-out method [8]
is a forward-looking method, while the fan-in method [2] is backward-looking. While
a code that implements one of these algorithms faithfully may not be as efficient as
it could be, it would be a good starting point from which further improvements may
be made.

An edge (j, i) ∈ E_L is identified with each update and scale computation task. At
step j, the processors in α_j perform updates in forward Cholesky, while the processors
in β_j perform updates in backward Cholesky. The assembly tasks are present in
the distributed Cholesky algorithms, but are not found in the serial general sparse
method. The additions in the assembly tasks can be considered overhead, but through
careful coding the extra additions can be avoided. The major source of overhead in
the distributed factorizations is the communication of factor and/or update columns
among the processors. The number of messages and their volume is analyzed in the
following subsection.

if q = m(j,j) then
    // processor owns the load and scaling tasks
    // load partial accumulations from self
    load  t^*_{*,j} := t^q_{*,j}
    // receive and assemble partial accumulations from other processors
    for r ∈ β_j \ {q}
        receive  t^r_{*,j} from processor r
        assemble  t^*_{*,j} := t^*_{*,j} + t^r_{*,j}
    // scale the column
    scale  l_{*,j} l_{j,j} = t^*_{*,j}
    // send the factor column to all processors that need it
    for r ∈ α_j \ {q}
        send  l_{*,j} to processor r
    // use the factor column in owned update tasks
    for k ∈ C^q_j \ {j}
        update  t^q_{*,k} := t^q_{*,k} − l_{*,j} l_{k,j}
else
    if q ∈ β_j then
        // send partial accumulation to the owning processor
        send  t^q_{*,j} to processor m(j,j)
    if q ∈ α_j then
        // receive the factor column and use in update tasks
        receive  l_{*,j} from processor m(j,j)
        for k ∈ C^q_j
            update  t^q_{*,k} := t^q_{*,k} − l_{*,j} l_{k,j}

FIG. 1. Forward Sparse Cholesky, Step j for Processor q

3.3. Communication analysis. The communication counts and volumes of
the messages that will be sent during the factorization can easily be computed. Any
communication associated with column j will depend on the number of processors in
α_j and β_j. At step j there are |α_j| − 1 messages containing l_{*,j} and |β_j| − 1 partial
accumulated t^q_{*,j} columns sent among the processors. The message count and volume
are given below:

traffic = Σ_{j=0}^{n−1} (|α_j| + |β_j| − 2)   and   volume = Σ_{j=0}^{n−1} (|α_j| + |β_j| − 2)|C_j|

(If one knows when a t_{*,k} column is to receive its first update or assembly, the operation can be
performed without the additions, as update t^q_{*,k} := −l_{*,j} l_{k,j} or assemble t^*_{*,k} := t^r_{*,k}.)

if q = m(j,j) then
    // processor owns the load and scaling tasks
    // compute owned updates to column j
    for i ∈ R^q_j \ {j}
        update  t^q_{*,j} := t^q_{*,j} − l_{*,i} l_{j,i}
    // receive and assemble partial accumulations from other processors
    for r ∈ β_j \ {q}
        receive  t^r_{*,j} from processor r
        assemble  t^*_{*,j} := t^*_{*,j} + t^r_{*,j}
    // scale the column
    scale  l_{*,j} l_{j,j} = t^*_{*,j}
    // send the factor column to all processors that need it
    for r ∈ α_j \ {q}
        send  l_{*,j} to processor r
else
    if q ∈ β_j then
        // compute owned updates to column j
        for i ∈ R^q_j
            update  t^q_{*,j} := t^q_{*,j} − l_{*,i} l_{j,i}
        // send partial accumulation to the owning processor
        send  t^q_{*,j} to processor m(j,j)
    if q ∈ α_j then
        // receive the factor column
        receive  l_{*,j} from processor m(j,j)

FIG. 2. Backward Sparse Cholesky, Step j for Processor q

Rather loose upper bounds can be obtained by noting that |α_j| ≤ p and |β_j| ≤ p, and
that |V| is n, the number of vertices, and |E_L| is the number of edges in the graph of L:

traffic ≤ Σ_{j=0}^{n−1} (p + p − 2) = 2(p − 1)|V|

volume ≤ Σ_{j=0}^{n−1} (p + p − 2)|C_j| = 2(p − 1) Σ_{j=0}^{n−1} |C_j| = 2(p − 1)|E_L|

For each j, we also have |α_j| ≤ |C_j| and |β_j| ≤ |R_j|. For large p there will be many columns
where |R_j| < p and/or |C_j| < p, and the above bounds will not be achievable. A more
realistic upper bound on communication is given below:

traffic ≤ Σ_{j=0}^{n−1} (min(p, |C_j|) + min(p, |R_j|) − 2)

volume ≤ Σ_{j=0}^{n−1} (min(p, |C_j|) + min(p, |R_j|) − 2)|C_j|

For simplicity, we avoid a more detailed model of communication. For example, if
a factor column l_{*,j} must be sent from processor q to |α_j| − 1 other processors, it could
be sent as |α_j| − 1 messages, each sent directly to the receiving processors, or it could
be sent in some other fashion, e.g., along a spanning tree rooted at q and including the
receiving processors. In the latter case, there would still be at least |α_j| − 1 messages
and (|α_j| − 1)|C_j| entries in those messages, and these quantities are monitored as our
traffic and volume. The total number of hops (or path length) to distribute the factor
column may be very dependent on the way it is sent among the processors, and an
efficient code should take into account the processor interconnection network.
In a similar way we do not specify the manner in which update columns are sent
among processors. For example, processors r and s may both have nonzero t^r_{*,j} and t^s_{*,j}
partial accumulated columns that need to be accumulated on processor q = m(j,j).
We assume for simplicity that each of the processors r and s sends its partial column to q,
which performs the assembly task. It could be more efficient that processor r sends
t^r_{*,j} to processor s, where it is accumulated into t^s_{*,j}, which is then sent to processor
q. In general, the partial accumulated columns could be accumulated in a spanning
tree with processor q at the root. This could prove more efficient than our simple
model, for the total number of hops may be fewer. In any case, there will be at least
|β_j| − 1 sends of a partial accumulated column for j and (|β_j| − 1)|C_j| entries in those
messages.

4. The fan-both algorithm family. The preceding section presented column-based
distributed factorizations in two flavors, forward-looking and backward-looking.
The computation map is the tool used to assign computations to processors. The sizes
of α_j and β_j, the sets of processors associated with updates from and to column j,
easily determine the message traffic and volume that will occur during the factorization.
In this section we present an algorithm family that has a very particular type
of computation map.
The members of this family share a common formulation of their computation
maps. Basically, the processors are considered logically arranged in a q1 × q2 mesh,
where p = q1 q2. For example, six processors can be arranged in four different ways,
where the processors are numbered column-major in the mesh:

1 × 6:  [ p0 p1 p2 p3 p4 p5 ]

2 × 3:  [ p0 p2 p4 ]
        [ p1 p3 p5 ]

3 × 2:  [ p0 p3 ]
        [ p1 p4 ]
        [ p2 p5 ]

6 × 1:  [ p0 p1 p2 p3 p4 p5 ]ᵀ

When p is prime, there are only two possible configurations, a 1 × p mesh and a p × 1
mesh. When p = 2^d, as for hypercubes, there are d + 1 possible configurations.

Using a q1 × q2 processor mesh, our computation map has a simple definition:

m(j, i) = r(j) + c(i) q1

There are two functions used in this definition, the column map c : V → {0, …, q2 − 1}
and the row map r : V → {0, …, q1 − 1}.
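In code, the map is one line once r and c are fixed. The formulation leaves r and
c open; the cyclic choices below are only an illustrative assumption of ours, chosen
so that (q1, q2) = (p, 1) degenerates to a fan-out style map m(j, i) = r(j) and (1, p)
to a fan-in style map m(j, i) = c(i).

/* The fan-both computation map m(j,i) = r(j) + c(i)*q1 over a q1 x q2
 * processor mesh (p = q1*q2).  The row map r and column map c are left
 * open by the formulation; these cyclic versions are illustrative. */
static int row_map(int j, int q1) { return j % q1; }

static int col_map(int i, int q1, int q2) { return (i / q1) % q2; }

/* Processor m(j,i) computes the update t^i_{*,j} = -l_{*,i} l_{j,i};
 * processor m(j,j) performs the load and scale tasks for column j. */
int comp_map(int j, int i, int q1, int q2)
{
    return row_map(j, q1) + col_map(i, q1, q2) * q1;
}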

The column map takes its name from the following property. If (j1, i) and (j2, i) are
both in E_L, then the owning processors are

m(j1, i) = r(j1) + c(i) q1   and   m(j2, i) = r(j2) + c(i) q1

The updates to j1 and j2 are both from column i. The processors owning the two
update tasks are both found in column c(i) of the processor mesh. In general, the set
α_i, the processors that will execute update tasks from column i to succeeding columns,
is found in column c(i) of the processor mesh.
In a similar way, r(j) is the row map. If column j receives updates from columns
i1 and i2, the processors owning these update tasks are

m(j, i1) = r(j) + c(i1) q1   and   m(j, i2) = r(j) + c(i2) q1

Both processors are found in row r(j) of the processor mesh. For any column j, the
set β_j, the processors that will execute updates to column j, is found in row r(j) of the
processor mesh.

It is clear how to bound the sizes of the αj and βj sets. Each set αj is found in
a column of the processor mesh, and so |αj| ≤ q1. Each set βj is found in a row
of the processor mesh, and so |βj| ≤ q2. We can easily obtain loose upper bounds on
the communication statistics.

    (4)  traffic = Σ_{j=0}^{n−1} (|αj| + |βj| − 2) ≤ (q1 + q2 − 2)|V|

    (5)  volume = Σ_{j=0}^{n−1} (|αj| + |βj| − 2)|Cj| ≤ (q1 + q2 − 2)|EL|

It is simple to minimize these upper bounds with respect to the q1 and q2 mesh dimensions,
subject to the constraint p = q1·q2. The √p × √p processor mesh configuration
minimizes these bounds:

    traffic ≤ 2(√p − 1)|V|   and   volume ≤ 2(√p − 1)|EL|.

The 1 × p and p × 1 mesh configurations have the following communication bounds:

    traffic ≤ (p − 1)|V|   and   volume ≤ (p − 1)|EL|,

that are one half the bounds for a general computation map.
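The minimization is easy to verify numerically. The short Python sketch below (our illustration) enumerates every mesh shape q1 × q2 with q1·q2 = p and evaluates the loose-bound coefficient q1 + q2 − 2 from equations (4) and (5).

    def mesh_coefficients(p):
        # all factorizations p = q1 * q2 and the coefficient q1 + q2 - 2
        return [(q1, p // q1, q1 + p // q1 - 2)
                for q1 in range(1, p + 1) if p % q1 == 0]

    # for p = 64 the 8 x 8 mesh gives coefficient 14 = 2(sqrt(p) - 1),
    # while the 1 x 64 and 64 x 1 meshes give 63 = p - 1
    for q1, q2, coeff in mesh_coefficients(64):
        print(f"{q1:2d} x {q2:2d}: {coeff}")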
In many cases these bounds may overestimate the message traffic and volume,
even when the map is chosen to balance the computations over the processors. By
noting that |αj| ≤ |Cj| and |βj| ≤ |Rj|, we obtain the following tighter bounds:

    (6)  traffic ≤ Σ_{j=0}^{n−1} (min(q1, |Cj|) + min(q2, |Rj|) − 2)

    (7)  volume ≤ Σ_{j=0}^{n−1} (min(q1, |Cj|) + min(q2, |Rj|) − 2)|Cj|

There are two special members of this algorithm family that have appeared in
the literature. The fan-out method [6] is the first column-based distributed Cholesky
method to appear, and is forward-looking with a p × 1 processor configuration. The
fan-in method [2] is backward-looking with a 1 × p processor configuration. Because
of the historical importance and the simplicity of the algorithms, we present these
methods in more detail in the following subsections.

4.1. The fan-out factorization algorithm. The fan-out method was not designed
from the perspective of mapping computations to processors. Instead, the
columns of the matrix were mapped to processors, and a very specific rule was used
to specify the processor to perform the update from one column to another column.

The processor that "owns" column j will perform all updates from preceding
columns i ∈ Rj. This means that all factor columns L_{*,i} that update column j must
be resident on the processor owning column j, at least temporarily. Once column j
has been computed, it must be sent to all processors that require it to update following
columns k in Cj \ {j}. It is the "fanning-out" of a factor column to the processors
that is the origin of the fan-out name. The processor mesh configuration is p × 1,
and therefore c(j) = 0 for all columns j. The column processor set αj is a subset of
{0, 1, ..., p − 1} as usual, but the row processor set βj is the singleton {r(j)}. One
immediate consequence is that the factorization algorithm greatly simplifies, which
we present in Figure 3.

Since the row processor set βj is {r(j)}, each partial accumulated column t^q_{*,j} is
identically zero for q ≠ r(j). Therefore no partial accumulation columns are sent, and
the storage for t^q_{*,j} can be just the storage for L_{*,j} on processor r(j). The factorization
can be performed using one temporary vector to hold incoming factor columns from
other processors. Once a factor column is received, it is immediately used to update
all succeeding owned columns.

Equations (6) and (7) reduce to the following:

    traffic ≤ Σ_{j=0}^{n−1} (min(p, |Cj|) − 1)   and   volume ≤ Σ_{j=0}^{n−1} (min(p, |Cj|) − 1)|Cj|

    if q = m(j,j) then
        // processor owns the column
        load t^q_{*,j} := a_{*,j}
        // scale the column
        scale L_{*,j} l_{j,j} = t^q_{*,j}
        // send the factor column to all processors that need it
        for r ∈ αj \ {q}
            send L_{*,j} to r
        // use the factor column in update tasks
        for k ∈ C^q_j \ {j}
            update t^q_{*,k} := t^q_{*,k} − L_{*,j} l_{k,j}
    else if q ∈ αj then
        // receive the factor column and use it in update tasks
        receive L_{*,j} from processor m(j,j)
        for k ∈ C^q_j
            update t^q_{*,k} := t^q_{*,k} − L_{*,j} l_{k,j}

FIG. 3. Fan-Out Cholesky, Step j for Processor q
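To connect the figure with the reduced bounds above, here is a small Python simulation (ours, not from the paper); a dense matrix and a cyclic column map are assumed purely for concreteness, and each "processor" is virtual. Sending a factor column is modeled by counting one message per distinct receiving processor.

    import numpy as np

    def fan_out_cholesky(A, p):
        # right-looking Cholesky; column j is owned by processor j mod p
        n = A.shape[0]
        owner = lambda j: j % p
        t = np.tril(A).astype(float)       # t[:, j] plays the role of t^q_{*,j}
        traffic = 0
        for j in range(n):
            t[j:, j] /= np.sqrt(t[j, j])   # scale: L_{*,j} l_{j,j} = t_{*,j}
            # fan the factor column out to every processor that updates with it
            receivers = {owner(k) for k in range(j + 1, n) if t[k, j] != 0.0}
            traffic += len(receivers - {owner(j)})
            for k in range(j + 1, n):      # the update tasks from column j
                if t[k, j] != 0.0:
                    t[k:, k] -= t[k:, j] * t[k, j]
        return t, traffic

    A = np.array([[4., 1, 0], [1, 4, 1], [0, 1, 4]])
    L, traffic = fan_out_cholesky(A, p=2)
    assert np.allclose(L @ L.T, A)

Because the receivers of column j are exactly the distinct owners of the columns in Cj \ {j}, the counted traffic for a dense matrix approaches the Σ (min(p, |Cj|) − 1) bound above.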

4.2. The fan-in factorization algorithm. Like the fan-out method, the fan-in
method was not designed from the perspective of mapping computations to processors.
The columns of the matrix were mapped to processors, and a second specific rule was
used to specify the processor to perform the update from one column to another
column.

The processor that "owns" column j will perform all updates to succeeding columns
k ∈ Cj. This means that a temporary data structure t^q_{*,k} will exist for each processor
q that owns a column L_{*,j} that will update column k. To complete the accumulation
of t_{*,k}, the partial accumulation columns t^q_{*,k} must "fan-in" from the other processors
and be assembled. The processor mesh configuration is 1 × p, and therefore r(j) = 0
for all columns j. The row processor set βj is a subset of {0, 1, ..., p − 1}, and the
column processor set αj is simply {c(j)}.

The fan-in factorization algorithm is also simplified when compared to the general
formulation. Figure 4 presents a variant of the fan-in algorithm where only one
temporary vector is needed to accumulate the partial column t^q_{*,j}, where R^q_j ≠ ∅. This
column will either be sent to the owner of column j or will be factored, but it does
not need to be computed before step j of the factorization. This is because all L_{*,i}
factor columns are always resident on the processor that needs to use them.

Equations (6) and (7) reduce to the following:

    traffic ≤ Σ_{j=0}^{n−1} (min(p, |Rj|) − 1)   and   volume ≤ Σ_{j=0}^{n−1} (min(p, |Rj|) − 1)|Cj|

5. Empirical studies. The fan-out and fan-in methods have been implemented
in several studies [2], [3], [6], [8], [9]. To our knowledge, fan-both methods have yet
to be implemented on any distributed architecture. In this section we examine the
fan-both family of algorithms on four test matrices, two from the Harwell-Boeing
matrix collection [4]. We are interested in the following questions:

    if q = m(j,j) then
        // processor owns the column
        // load the original entries
        load t^q_{*,j} := a_{*,j}
        // compute updates from previously owned columns
        for i ∈ R^q_j \ {j}
            update t^q_{*,j} := t^q_{*,j} − L_{*,i} l_{j,i}
        // receive and assemble partial columns from other processors
        for r ∈ βj \ {q}
            receive t^r_{*,j} from processor r
            assemble t^q_{*,j} := t^q_{*,j} + t^r_{*,j}
        // scale the column
        scale L_{*,j} l_{j,j} = t^q_{*,j}
    else if q ∈ βj then
        // compute updates from previously owned columns
        // and send the partial column to processor m(j,j)
        for i ∈ R^q_j
            update t^q_{*,j} := t^q_{*,j} − L_{*,i} l_{j,i}
        send t^q_{*,j} to processor m(j,j)

FIG. 4. Fan-In Cholesky, Step j for Processor q
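The heart of the fan-in variant is the aggregation of all of a processor's updates to column j into a single partial column before anything is sent. The Python fragment below (our sketch; dense storage and the name `owned` are assumptions made for illustration) forms the one vector that processor q would transmit to m(j,j).

    import numpy as np

    def partial_column(L, j, owned):
        # accumulate -sum over {i in owned : l_{j,i} != 0} of l_{j,i} * L_{j:n, i};
        # this single vector is all that processor q sends for column j
        n = L.shape[0]
        t = np.zeros(n - j)
        for i in owned:                     # columns i < j owned by q
            if L[j, i] != 0.0:
                t -= L[j, i] * L[j:, i]
        return t

The owner m(j,j) then assembles the incoming partial columns by simple addition and scales, exactly as in Figure 4.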

1. What limitations on speedup does any column-based factorization have?
2. One upper bound on the efficiency of any computation is a function of the
load balance of the processors, or equivalently, the range of the number of operations
performed by the processors. How evenly do the fan-in, fan-out and
fan-both methods distribute the factor operations among the processors?
3. The fan-in and fan-out methods can both be implemented using only one
vector of size |V| as working storage. The fan-both method does require
non-trivial working storage; for an n × n dense symmetric matrix the storage
per processor can be as much as n²/(2√p), as opposed to n²/(2p) for either
fan-in or fan-out. How much working storage will be required by a (√p, √p)
fan-both method for sparse matrices?
4. Communication between processors is the largest source of overhead for distributed
factorizations. How do the loose and tight bounds on communication
compare with the true communication on real matrices?
We present results for four types of matrices, each representative of a different
type of behavior that we have observed over the test collection. The statistics for the
four matrices are given in the table below.

Matrix GRD6363ND is the 9-point finite difference discretization of a 63 × 63 grid
that has been ordered using nested dissection. Matrix GRD151515ND is the 27-point
finite difference discretization of a 15 × 15 × 15 grid that has been ordered using nested
dissection. Both GRD6363ND and GRD151515ND are optimally ordered with respect
to fill and factor operations. These two grid matrices are considered well suited for a
parallel factorization, for their elimination trees are well balanced, short and bushy.

TABLE 1
Dimensions and Statistics of Four Test Matrices

                 GRD6363ND   GRD151515ND    BCSPWR10     BCSSTK16
  |V|                 3969          3375        5300         4884
  |EA|               34969        882882       21842       290378
  |EL|               99358        367237       27961       691388
  ops          4.15 × 10^6   63.2 × 10^6  0.25 × 10^6   124 × 10^6
  max speedup          140           256          52          172

The matrix BCSPWR10 models the electric grid of the eastern United States.
Power matrices are usually very sparse, and the Cholesky factor has relatively few fill
entries.

Matrix BCSSTK16 is a finite element model of a Corps of Engineers dam. The
finite elements used in this model appear to be linear hexahedral elements with three
degrees of freedom at each grid point. The original element structure for this problem
is no longer available, but we have recovered an element structure from the adjacency
structure of the matrix. The generated element mesh is compact, having a small
diameter of elements in any direction.
With the exception of the grid problems, each matrix was ordered by the multiple
minimum degree algorithm [12] and post-processed using the Jess and Kees algorithm
[11]. One hundred runs of the ordering were performed, and the best ordering with
respect to factor operations was chosen. The final ordering is a post-order traversal
of the resulting elimination tree.

Diagrams of the elimination trees for these matrices are given in Figures 5, 6, 7
and 8. The separate nodes are visible for GRD6363ND and BCSPWR10 as small
filled circles. In each figure, the nodes are equally spaced with respect to height, and
their horizontal displacement is determined as follows. The nodes are ordered in a
post-order traversal of the elimination tree that minimizes the stack storage for the
multifrontal method. The leaf nodes are equally spaced in the horizontal direction,
and the position of an interior node is the average of the positions of its children.

5.1. The maximum speedup for a column-based Cholesky factorization.
An upper bound on the maximum speedup of a column-based factorization
can be easily obtained. Column L_{*,j} cannot be computed until each L_{*,i} has been
computed, where i ∈ Rj \ {j}. Furthermore, the update from i to j must be performed,
and then the scaling operations.

Let Ti be the minimum elapsed time to compute column L_{*,i} as measured in floating
point operations. If node i is a leaf in the elimination tree, meaning that Ri = {i}, it
receives no updates from previously numbered nodes. Its only cost is the scale task,
and so Ti = |Ci| − 1.

If node i updates node j, then a lower bound on Tj is the sum of three terms:
• Ti, the time when node i can perform the update
• 2|Ci ∩ Cj|, the number of operations in the update
• |Cj| − 1, the number of operations in the scale task

GRD6363ND
3969 vertices, 1024 leaves, height 176

FIG. 5. GRD6363ND Elimination Tree

GRD151515ND
3375 vertices, 512 leaves, height 471

FIG. 6. GRD151515ND Elimination Tree

BCSPWR10
5300 vertices, 2157 leaves, height 106

FIG. 7. BCSPWR10 Elimination Tree

BCSSTK16
4884 vertices, 237 leaves, height 1326

FIG. 8. BCSSTK16 Elimination Tree

In general, the minimum elapsed time to compute column j is

    Tj = max_{i ∈ Rj\{j}} { Ti + 2|Ci ∩ Cj| } + |Cj| − 1.

Therefore, a lower bound on the elapsed operations to compute L is

    TL = max_{j a root} { Tj }.

An upper bound on speedup for the factorization is simply the number of factor
operations divided by TL.
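The recurrence is easy to evaluate; the following Python sketch (ours) computes Tj for every column and the resulting speedup bound. It assumes the columns are numbered so that i < j whenever i updates j (true of any topological ordering of the elimination tree) and that R[j] and C[j] are supplied as Python sets.

    def speedup_bound(R, C, roots, ops):
        # T[j]: minimum elapsed operations before column j is complete
        n = len(C)
        T = [0] * n
        for j in range(n):
            latest = 0
            for i in R[j] - {j}:               # columns that update j
                latest = max(latest, T[i] + 2 * len(C[i] & C[j]))
            T[j] = latest + len(C[j]) - 1      # plus the scale task
        TL = max(T[j] for j in roots)
        return ops / TL

For a leaf node the inner loop is empty and T[j] = |Cj| − 1, matching the base case above.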
For each of the four test matrices, the upper bounds on speedup in Table 1 are
quite modest with respect to the amount of computation. We will consider the cases
where the number of processors used to compute the factorization is the closest square
number to the maximum speedup. The average number of degrees of freedom, entries
and factor operations per processor is given in Table 2.
TABLE 2
Storage and Work per Processor

  matrix          p   avg. d.o.f.   avg. entries   avg. operations
  GRD6363ND     144          27.6            710             29643
  GRD151515ND   256          13.2           1434            246875
  BCSPWR10       49         108.2            537              4807
  BCSSTK16      169          28.9           4020            720930

The power network matrix is definitely an outlier, where the amount of parallelism
for the computation is large. In some sense this can be inferred from the elimination
tree, which has short height and is very bushy. The other three matrices show a much
larger number of entries and operations per processor. As the matrices grow larger
in terms of storage and factor operations, the available parallelism does increase, but
the storage and work per processor increase as well. As the matrix size increases, machine
architectures must have more memory per processor and faster CPUs to maintain
relative performance.

5.2. The balanced mesh computation map. Some thought must be given
to designing a suitable computation map for each matrix and (q1, q2) processor configuration.
After some experimentation we chose the following procedure to define the
row and column maps. It is a greedy procedure that attempts to balance computation
among rows and columns of the processor mesh.

To each node j are associated two weights in terms of floating point operations: a
forward weight

    fj = |Cj| − 1 + Σ_{k ∈ Cj\{j}} 2|Cj ∩ Ck|,

and a backward weight

    bj = Σ_{i ∈ Rj\{j}} 2|Ci ∩ Cj| + |Cj| − 1.

The forward weight fj includes the scale task for column j and all updates from j to
succeeding columns. The backward weight bj includes all updates to j from preceding
columns and the scale task for column j. The forward computations for node j are
performed by the processors in column c(j) of the processor mesh, while the backward
computations are performed by the processors in row r(j) of the processor mesh.

The basic idea behind the balanced mesh map is to assign node j to the row
of processors with the least accumulated backward weights and to the column of
processors with the least accumulated forward weights.

Balanced Mesh Computation Map

    set RowOps(0 : q1 − 1) and ColOps(0 : q2 − 1) to zero
    for j ∈ V in some order
        let q such that RowOps(q) ≤ RowOps(r) for all r ≠ q
        RowOps(q) := RowOps(q) + bj
        r(j) := q
        let q such that ColOps(q) ≤ ColOps(r) for all r ≠ q
        ColOps(q) := ColOps(q) + fj
        c(j) := q
It should be clear that the resulting computation map depends on the tie-breaking
scheme used to choose a row or column of the mesh. This is a weak dependence, for the
distributions of the forward and backward weights are sufficiently varied that ties occur
infrequently after the beginning columns are assigned.
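In Python the greedy procedure is only a few lines. This is a sketch of ours under the same notation; b and f are the weight arrays defined above, and `order` is the node sequence, e.g. a post-order of the elimination tree.

    def balanced_mesh_map(order, b, f, q1, q2):
        # greedy: node j goes to the least-loaded processor row and column
        row_ops, col_ops = [0] * q1, [0] * q2
        r, c = {}, {}
        for j in order:
            q = min(range(q1), key=row_ops.__getitem__)
            row_ops[q] += b[j]
            r[j] = q
            q = min(range(q2), key=col_ops.__getitem__)
            col_ops[q] += f[j]
            c[j] = q
        return r, c            # computation map: m(j, i) = r[j] + c[i] * q1

Note that `min` with a key breaks ties by the lowest processor index, which is one concrete instance of the tie-breaking scheme just discussed.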

Note two special cases. The (p, 1) fan-out method has q2 = 1, so all processors lie
in one column of the mesh, and the forward weights fj do not contribute to the map.
Likewise, the (1, p) fan-in method does not use the backward weights bj to define the
map.

One crucial aspect of the map algorithm needs to be defined, namely the order
in which the nodes are processed to distribute the forward and backward weights.
We shortly will evaluate the fan-in, fan-both and fan-out methods with respect to
load balance, working storage and communication. We do this with respect to one
particular way of processing the nodes to obtain the computation map, that of the
post-order traversal of the elimination tree. We choose this in hopes of minimizing
the working storage of the fan-both method. Other sequences of nodes to define the
map could be considered.

1. A breadth first traversal of the elimination tree could be useful, if it is desired
that each processor start computation as soon as possible. On the other
hand, if a post-order traversal of the tree is used to generate the map, it
is still likely that the leaf nodes will be distributed fairly evenly over the
processors.
2. Ordering the nodes by descending order of the forward weights fj should
balance operations for the fan-in method.
3. Ordering the nodes by descending order of the backward weights bj should
balance operations for the fan-out method.
4. Ordering the nodes by descending order of the sums of the forward and
backward weights (bj + fj) should balance operations for the fan-both method.

Why do we not consider nested maps, e.g., a subtree-subcube map [8]? There are
two reasons.
1. Most of the elimination trees we have generated for the symmetric Harwell-Boeing
test matrices (using multiple minimum degree and post-processed
by the Jess and Kees algorithm) resemble that of BCSSTK16 in Figure 8.
There are two main trunks with many small subtrees hanging off the trunks.
For this matrix, 90% of the factor operations are performed in updates from
nodes in the trunks to nodes in the trunks. A subtree-subcube map would
allow at most a factor of two decrease in the communication in each of the
two large subtrees, and not decrease at all the communication involving the
large top separator.
2. For large numbers of processors, close to the maximum speedup for a matrix,
a subtree-subcube map will show little to moderate improvement. This can
be seen by examining equations (6) and (7), replacing q1 and q2 by functions
q1(j) ≤ q1 and q2(j) ≤ q2. Most of the nodes where |Cj| and |Rj| are large
occur near the top of the elimination tree, where q1(j) ≈ q1 and q2(j) ≈ q2.
Those nodes where q1(j) ≪ q1 and q2(j) ≪ q2 lie at the lower levels of the
elimination tree, and have small |Cj| and |Rj| values.

5.3. Load balance and an upper bound on efficiency. In Figure 9 we provide
the operation count profiles for the three algorithms and the four test matrices.
In these plots and all that follow, the (1, p) fan-in method is a solid line, the (√p, √p)
fan-both method has long dashes, and the (p, 1) fan-out method has short dashes.

For each matrix and method, the operations for each processor were computed,
then scaled by the average number of operations per processor. The operation counts
were then sorted in ascending order and plotted.

Since the maximum number of operations a processor performs is a lower bound
on the execution time, an upper bound on efficiency can be easily computed using
these plots. For example, the fan-out method for GRD6363ND has one processor
that performs 20% more operations than the average. The upper bound on efficiency
is therefore 1/1.2 or 83%.
Whether these efficiencies are attainable is largely a function of the communication
cost and the implementation. We have performed simulations of the factorization
assuming zero communication cost. The resulting speedups are usually within 5% of
those predicted by the upper bounds on efficiency.

It is probably as important to balance the operations over the processors as it
is to minimize the maximum number performed by any processor. All methods do
roughly the same for the power matrix. The fan-in and fan-both methods do well for
BCSSTK16, while the fan-out method has a large imbalance. The two grid problems
show a large imbalance, particularly for the fan-out method and to some degree
the fan-in method. The elimination trees for these matrices are broad and bushy,
supposedly well suited for parallel computation. If one were to implement the fan-out
method for these matrices, some effort should be made to find a computation
map that is better balanced with respect to operations.

(Four panels: load balance profiles for GRD6363ND on 144 processors, GRD151515ND on 256 processors, BCSPWR10 on 49 processors, and BCSSTK16 on 169 processors.)

Figure 9: Load Balance Profiles

5.4. Working storage. One advantage of the classic fan-in and fan-out methods
is that they can be implemented using only one temporary storage vector for the
message buffer, whose size is that of the largest column in the factor matrix. They
are efficient with regards to working storage, an important property for architectures
with limited amounts of storage on each processor.

The fan-both method has a serious drawback in the amount of working storage
that is required. The worst scenario is to factor a symmetric dense matrix, where a
(√p, √p) method using a balanced mesh map requires as much as n²/(2√p) storage for
a processor, as opposed to n²/(2p) for a fan-in or fan-out method.

The working storage for a fan-both method is largely a function of the shape of
the elimination tree and the order in which the columns are processed. The load
balance study in Section 5.3 and the communication study in Section 5.5 are
independent of the particular order in which the columns are processed. Working storage
for a fan-both method is not. In our experiments we process the columns in the
post-order traversal of the elimination tree, as presented in Figures 5, 6, 7 and 8.

Figure 10 presents the working storage profiles for the four matrices and three
methods. We plot the profiles scaled by the average number of factor entries per
processor, and we do not count the one message buffer vector. Since fan-in and fan-out
use no extra working storage, their profiles are clustered near one on the y-axis.

There are two curves for the (√p, √p) fan-both method. The upper curve is for
a backward-looking method, where factor columns are held until all their updates
are completed. The lower curve is for a forward-looking method, where an update
column is created when the first update to it is performed, and is released when the
last has been computed.

(Note that the classic fan-in method has a (1, p) computation map implemented
as a backward-looking method, and the classic fan-out method is the dual, a (p, 1)
computation map with a forward-looking method. Their converses, a (1, p) forward
method and a (p, 1) backward method, require much more working storage than either
a forward or backward fan-both method, and are not practical.)

We see from these plots that the working storage for the (√p, √p) backward fan-both
method is always more than the corresponding forward method. The working
storage for the forward method is fairly small, never more than six times the average
number of factor entries per processor. (Note, here the working storage includes the
factor entries in columns owned by the processor, i.e., the overhead storage is the
profile minus one.) If storage per processor is tight, the fan-both method is at a
disadvantage.

5.5. Communication studies. The points we make are relevant when communication
is the largest source of overhead for a distributed factorization. The fan-both
method has great potential to reduce the communication if the loose bounds of equations
(4) and (5) are to be believed. However, when the number of processors is large,
we expect the loose bounds to overestimate the communication, particularly for the
fan-in and fan-out methods, and the tight bounds from equations (6) and (7) to be
more accurate.
(Four panels: working storage profiles for GRD6363ND on 144 processors, GRD151515ND on 256 processors, BCSPWR10 on 49 processors, and BCSSTK16 on 169 processors.)

Figure 10: Working Storage Profiles

Table 3 presents some statistics for the four test matrices. As in the preceding
load balance and working storage studies, the number of processors chosen for each
matrix is the square number closest to the maximum speedup. The balanced mesh
computation map is used to assign computations to processors.
TABLE 3
Communication Statistics, Tight Bounds, Loose Bounds and Observed

GRD6363ND, 144 processors
                      Traffic / |V|                Volume / |EL|
               Fan-In  Fan-Both  Fan-Out    Fan-In  Fan-Both  Fan-Out
loose bounds    143.0      22.0    143.0     143.0      22.0    143.0
tight bounds     21.3      16.0     24.0      45.2      19.4     40.8
observed         17.8      12.8     22.3      37.0      17.3     37.9

GRD151515ND, 256 processors
                      Traffic / |V|                Volume / |EL|
               Fan-In  Fan-Both  Fan-Out    Fan-In  Fan-Both  Fan-Out
loose bounds    255.0      30.0    255.0     255.0      30.0    255.0
tight bounds     73.7      25.3    106.0     127.9      28.7    166.0
observed         61.9      22.7     91.1     105.3      27.5    142.9

BCSPWR10, 49 processors
                      Traffic / |V|                Volume / |EL|
               Fan-In  Fan-Both  Fan-Out    Fan-In  Fan-Both  Fan-Out
loose bounds     48.0      12.0     48.0      48.0      12.0     48.0
tight bounds      4.0       5.2      4.3      10.1       7.6      8.1
observed          3.5       3.8      4.0       8.3       6.2      7.6

BCSSTK16, 169 processors
                      Traffic / |V|                Volume / |EL|
               Fan-In  Fan-Both  Fan-Out    Fan-In  Fan-Both  Fan-Out
loose bounds    143.0      22.0    143.0     143.0      22.0    143.0
tight bounds     75.4      19.9    109.4     101.2      21.4    130.2
observed         70.8      19.0     95.9      95.3      20.9    117.6

TABLE 4
Average Size of Communicated Columns

                              avg. communicated |Cj|
  matrix        avg. |Cj|   Fan-In   Fan-Both   Fan-Out
  GRD6363ND          25.0     52.0       33.8      42.5
  GRD151515ND       108.8    185.1      131.8     170.7
  BCSPWR10            5.3     12.5        8.6      10.0
  BCSSTK16          141.6    190.5      155.7     173.6

The statistics in Table 3 are the sum over all processors of the messages and
matrix entries, scaled by the matrix size and the number of factor matrix entries,
respectively. For each matrix and method, the loose bounds of equations (4) and
(5), the tight bounds of equations (6) and (7), and the observed communication are
presented.

Note that the loose bounds always overestimate the tight bounds and observed
communication, many times by quite a lot, particularly for the fan-in and fan-out
methods. The tight bounds do a much better job of predicting the expected communication.

We can try to understand where the majority of communication is performed for
each of the three methods. If the level of a node is measured by the maximum distance
from it to a leaf in the elimination tree, the nodes at high levels have larger column
sizes than nodes at the lower levels, with the exception of nodes near the root, where
the column size decreases to one at the root.

In Table 4 we present the average size of a factor column for each matrix, along
with the average size of a communicated column for each of the three methods. Note
that in all cases the average column that is communicated is larger than the average
column of the matrix. The small columns at the lower levels of the elimination tree
are communicated many fewer times than columns of larger sizes at the higher levels.

1. The fan-in method has the largest average column size, for the nodes at
the highest levels of the tree most likely will receive the largest numbers of
aggregate update columns, and these nodes have the largest column sizes in
the matrix.
2. The fan-out method has the second largest average column size of the three
methods. This can be explained by realizing that the nodes within the first i
levels of the root cannot send more than i − 1 messages, for i ≤ p. Therefore,
it is columns on the "shoulders" of the elimination tree, with moderate to
large size, that are sent the most times in the fan-out method.
3. The fan-both method shows a more regular communication pattern for the
columns in different levels of the elimination tree. Each node can receive
as many as q2 − 1 aggregate update columns and send off at most q1 − 1
factor columns. For q1 = q2 = √p, it is likely that columns in many levels
of the tree, from the root down to near the leaves, can receive close to the
√p − 1 update columns and send off close to the √p − 1 factor columns. This
explains both the close correlations between the loose bounds, tight bounds
and observed communication for this method, as well as why the average
communicated column size is fairly close to the average column size of the
factor matrix.

Figures 11, 12, 13 and 14 present the profiles of the number of messages sent
and received and the number of matrix entries sent and received by the processors.
For the two grid matrices and the dam matrix, the fan-both method always has less
communication than the fan-in and fan-out methods, and the profiles are flatter,
meaning that the communication is more evenly distributed among the processors.

The average slope of the profiles is largest for the fan-out method and smallest
for the fan-both method. The slope of the fan-in method lies somewhere between.
There are sizable spikes at the right and/or left ends of the profiles. This probably
means that there is a reasonably small number of nodes that either send or receive
the full number of columns possible, and that the processors owning these columns
have their statistics increased accordingly.

The power matrix has different profiles. For once the fan-in method generates the
fewest messages, though the fan-both method has the least volume.
(Four panels for GRD6363ND on 144 processors: send traffic, receive traffic, send volume and receive volume profiles.)

Figure 11: GRD6363ND Communication Profiles

(Four panels for GRD151515ND on 256 processors: send traffic, receive traffic, send volume and receive volume profiles.)

Figure 12: GRD151515ND Communication Profiles

(Four panels for BCSPWR10 on 49 processors: send traffic, receive traffic, send volume and receive volume profiles.)

Figure 13: BCSPWR10 Communication Profiles

(Four panels for BCSSTK16 on 169 processors: send traffic, receive traffic, send volume and receive volume profiles.)

Figure 14: BCSSTK16 Communication Profiles

6. Conclusions and extensions. This paper has presented a general formulation
of column-based Cholesky factorizations in a distributed environment. The two
classic algorithms, the fan-out and fan-in methods, have been shown to be members
of a more general fan-both family of algorithms. The communication complexity of
this family has been analyzed. Upper bounds on the message counts and message
volumes are (p − 1)|V| and (p − 1)|EL| for the fan-out and fan-in methods. One central
member of this algorithm family is a blend of both the fan-out and fan-in methods,
and has communication bounds of 2(√p − 1)|V| and 2(√p − 1)|EL|. Empirical studies
of four test matrices have been presented.

The key idea of the fan-both family of algorithms is very simple. The fan-out
method communicates only factor columns, each of which may be sent to (p − 1) other
processors. The fan-in method communicates only update columns, each of which may
also be sent to (p − 1) other processors. The central fan-both method communicates
both factor and update columns, each of which may be sent to only (√p − 1) other
processors. The sum of the number of types of communicated data times the number
of processors to which each is sent is the leading coefficient of the communication
complexity.
This idea can be extended to further lower the communication volume. During
the Cholesky factorization there are three types of active data: the entries of L, the
entries of L^T, and partially accumulated entries Σ_i l_{k,i} l_{j,i}. A computation map

    m : V × V × V → {0, 1, ..., p − 1}

specifies the processor m(k, j, i) to perform the computation t_{k,j} := t_{k,j} − l_{k,i} l_{j,i}. If the
computation map has the form

    m(k, j, i) = r(k) + c(j) p^{1/3} + l(i) p^{2/3},

where the row, column and level maps have the form

    r : V → {0, 1, ..., p^{1/3} − 1},   c : V → {0, 1, ..., p^{1/3} − 1},   and   l : V → {0, 1, ..., p^{1/3} − 1},

the number of matrix entries communicated during the factorization is bounded above
by 3 p^{1/3} |EL|, a lower complexity than the column-based methods in this paper. This
algorithm is presented in [1] for the dense LU factorization.
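A sketch of this three-dimensional map in Python (ours; p is assumed to be a perfect cube, and the three maps are again taken to be cyclic purely for illustration):

    q = 4                                  # p = q**3 = 64 processors

    def map3(k, j, i):
        # m(k, j, i) = r(k) + c(j) * q + l(i) * q**2, with cyclic r, c, l
        return (k % q) + (j % q) * q + (i % q) * q * q

    # every task lands on a valid processor index in {0, ..., p - 1}
    assert all(0 <= map3(k, j, i) < q**3
               for k in range(8) for j in range(8) for i in range(8))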
The fan-both method has great potential to outperform the classic fan-in and fan-out
methods, particularly on architectures where communication cost is relatively
high or where large numbers of processors are used. Its one drawback is non-trivial:
the amount of working storage required to hold external factor and aggregate update
columns. We do have prototype codes running on an i860 hypercube, and the fan-both
method does compute the factorization faster than the fan-in and fan-out methods.
We refrain from presenting incomplete results at this time. Valid comparison must
include the multifrontal method, as well as computation maps designed specifically
for the fan-in and fan-out methods.

REFERENCES

[1] C. Ashcraft, A taxonomy of distributed dense LU factorization methods, Boeing Computer
Services Technical Report, ECA-TR-161, March 1990.
[2] C. Ashcraft, S. Eisenstat and J. Liu, A fan-in algorithm for distributed sparse numerical
factorization, SIAM J. Sci. Stat. Comput., 11, 1990.
[3] C. Ashcraft, S. Eisenstat, J. Liu and A. Sherman, A comparison of three column-based
distributed sparse factorization schemes, Technical Report YALEU/DCS/RR-810, Department
of Computer Science, Yale University, 1990.
[4] I. Duff, R. Grimes and J. Lewis, Sparse matrix test problems, ACM Trans. Math. Soft., 15,
pp. 1-14, 1989.
[5] J. A. George, Nested dissection of a regular finite element mesh, SIAM J. Num. Anal., 10,
pp. 345-363, 1973.
[6] J. A. George, M. Heath, J. Liu and E. Ng, Sparse Cholesky factorization on a shared-memory
multiprocessor, SIAM J. Sci. Stat. Comput., 9, pp. 327-340, 1988.
[7] J. A. George and J. Liu, Computer Solution of Large Sparse Positive Definite Systems,
Prentice-Hall, Englewood Cliffs, N. J., 1981.
[8] J. A. George, J. Liu and E. Ng, Communication reduction in parallel sparse Cholesky on a
hypercube, in Hypercube Multiprocessors 1987, M. Heath, ed., SIAM Press, 1987.
[9] J. A. George, J. Liu and E. Ng, Communication results for parallel sparse Cholesky on a
hypercube, Parallel Computing, 10, pp. 287-298, 1989.
[10] L. Hulbert and E. Zmijewski, Limiting communication in parallel sparse Cholesky factorization,
SIAM J. Sci. Stat. Comput., 12, pp. 1184-1197, 1991.
[11] J. Lewis, B. Peyton and A. Pothen, A fast algorithm for re-ordering sparse matrices for
parallel factorization, SIAM J. Sci. Stat. Comput., 10, pp. 1146-1173, 1989.
[12] J. Liu, Modification of the minimum degree algorithm by multiple elimination, ACM Trans.
Math. Soft., 11, pp. 141-153, 1985.
[13] J. Liu, Computational models and task scheduling for parallel sparse Cholesky factorization,
Parallel Computing, 3, pp. 327-342, 1986.
[14] J. Liu, The role of elimination trees in sparse factorization, SIAM J. Matrix Anal. Appl.,
11, pp. 134-172, 1990.
[15] R. Lucas, Solving planar systems of equations on distributed multiprocessors, Ph.D. thesis,
Department of Electrical Engineering, Stanford University, 1987.
[16] R. Schreiber, A new implementation of sparse Gaussian elimination, ACM TOMS, 8, pp.
256-276, 1982.
SCALABILITY OF SPARSE DIRECT SOLVERS

ROBERT SCHREIBER*

Abstract. We shall say that a scalable algorithm achieves efficiency that is bounded away from
zero as the number of processors and the problem size increase in such a way that the size of the
data structures increases linearly with the number of processors. In this paper we show that the
column-oriented approach to sparse Cholesky for distributed-memory machines is not scalable. By
considering message volume, node contention, and bisection width, one may obtain lower bounds
on the time required for communication in a distributed algorithm. Applying this technique to
distributed, column-oriented, dense Cholesky leads to the conclusion that N (the order of the matrix)
must scale with P (the number of processors) so that storage grows like P². So the algorithm is not
scalable. Identical conclusions have previously been obtained by consideration of communication and
computation latency on the critical path in the algorithm; these results complement and reinforce
that conclusion.

For the sparse case, both theory and some new experimental measurements, reported here, make
the same point: for column-oriented distributed methods, the number of gridpoints (which is O(N))
must grow as P² in order to maintain parallel efficiency bounded away from zero. Our sparse matrix
results employ the "fan-in" distributed scheme, implemented on machines with either a grid or a
fat-tree interconnect using a subtree-to-submachine mapping of the columns.

The alternative of distributing the rows and columns of the matrix to the rows and columns of
a grid of processors is shown to be scalable for the dense case. Its scalability for the sparse case
has been established previously [10]. To date, however, none of these methods has achieved high
efficiency on a highly parallel machine.

Finally, open problems and other approaches that may be more fruitful are discussed.

Keywords. massively parallel computer, sparse Cholesky factorization, distributed-memory,
scalable algorithms.

AMS(MOS) subject classifications: 65F50, 65F25, 68R10.

1. Introduction. An efficient, highly parallel, distributed-memory, direct solution
algorithm for the sparse linear system Ax = b remains undiscovered, despite
some prolonged and extensive investigations by a number of researchers [2, 3, 4, 9, 10,
14, 15, 18, 19, 30]. The arrival of highly and massively parallel supercomputers makes
this an opportune time to decide whether or not to continue the search, perhaps along
different lines, or to give it up in favor of iterative methods.

Two lines of attack have been taken up to now. The MIMD, message-passing-machine
community has tended to concentrate on methods that are column oriented
[2, 3, 4, 9, 14, 18, 19, 30]. In these methods, columns of the matrix A and
its Cholesky factor L are assigned to processors in some way - column j is held by
processor map(j) and map is determined as part of the method. Furthermore, the
methods organize the computation as a collection of column-oriented tasks: sparse
column scaling and sparse DAXPY. This class of methods has also been proposed
and used for the dense problem on message-passing machines [1]. When scalability is

* Research Institute for Advanced Computer Science, MS T045-1, NASA Ames Research Center,
Moffett Field, CA 94035. This author's work was supported by the NAS Systems Division via
Cooperative Agreement NCC 2-387 between NASA and the University Space Research Association
(USRA).

not a primary issue, these methods may be entirely appropriate. They are likely to
be very useful on moderately parallel, shared memory machines.

A second approach is to map the data in two dimensions. This approach is favored
by Gilbert and Schreiber [10], Kratzer [15], and Venugopal and Naik [29]. Recently,
Dongarra, Van de Geijn, and Walker [7] have shown the value of this approach for
the dense problem on MIMD message passing machines; the author has also used it
successfully for the dense problem on the Maspar MP-1, a massively parallel SIMD
machine.

In this paper we investigate the scalability of these classes of methods for distributed
sparse Cholesky factorization. By a scalable algorithm for this problem, we
mean one that maintains efficiency bounded away from zero as the number P of processors
grows and the size of the data structures grows roughly linearly in P. We
concentrate on the model problem arising from the 5-point, finite difference stencil
on an Ng × Ng grid. We will show that the column-oriented methods cannot work
well when the number of gridpoints (N = Ng²) grows like O(P) or even O(P log P).
We show that communication will make any column-oriented, distributed algorithm
useless, no matter what the mapping of columns to processors. This is true because
column-oriented distribution is very bad for dense problems of order N when N is
not large compared with P.

Two improvements seem to be required.

1. A two-dimensional wrap mapping of the dense frontal matrices, at least for
those corresponding to fronts near the top of the elimination tree.
2. A "fan-out" submatrix Cholesky algorithm with multicast instead of individual
messages.

It is reasonable to ask why one should be concerned with machines having thousands
of processors. Figure 1 should illustrate the reasons for believing that supercomputer
architecture is now making an inevitable and probably permanent transition
from the modestly parallel to the highly parallel (257 - 4,096 processors) or massively
parallel (4,097 - 65,536 processors). The following estimation of supercomputer architecture
during the coming decade helps motivate the work presented here.

• 1 Gflop processor chips (with multiple processors);
• Physically distributed memory; while hardware may provide the illusion of
shared memory, the latency for nonlocal access will be large and bandwidth
to nonlocal memory will be a constraining resource.
• Communication speed between nodes will be on the order of 100 bits in parallel
at 100 MHz. Communicating a matrix element during a sparse Cholesky
factorization requires that a packet consisting of an 8-byte floating-point
number and a 4-byte index be sent. Since this requires 12 bytes = 96 bits,
roughly 100 Mwds/sec will be the achievable speed. Thus, the ratio of computation
speed per processor to interprocessor communication speed will be
in the 5 - 50 range. Patterson [23] gives comparable estimates.
• Interconnect may be a 2D or 3D grid or torus, or maybe a fat tree.

This paper builds on previous efforts at analysis of distributed matrix computations.
The work of Leiserson [16] prefigures much later work of this type. Notable

(Plot titled "Uniprocessor Performance": per-CPU performance of supercomputer and desktop processors, 1984-1993, on a logarithmic scale.)

FIG. 1. Microprocessor and supercomputer performance per CPU.

efforts for dense problems include those of Li and Coleman [17] for dense triangular
systems, and Saad and Schultz [28]; Ostrouchov, et al. [22], and George, Liu,
and Ng [8] have made some analyses for the sparse, column-mapped algorithms.
An interesting analysis of the effect of a memory hierarchy on sparse Cholesky has
been provided by Rothberg and Gupta [26]. These investigators, working with a
nonuniform-access shared-memory system, have recently come to conclusions similar
to ours [27].

In Section 2 we introduce distributed implementations of Cholesky factorization;
Section 3 develops some lower bounds on communication time; in Section 4 we compute
these bounds for the dense case and use them to illustrate the problem with
column mapping; Section 5 extends this work through an experiment for the sparse
case; in Section 6 we consider the problems that are still unresolved.

2. Distributed sparse Cholesky. Cholesky factorization may be understood
as the following program

    cholesky(A, N)
        for k = 1 to N do
            cdiv(k);
            for j = k + 1 to N do
                cmod(j, k);
            od
        od

Procedure cdiv(k) computes the square root of the diagonal element A_{kk} and scales
the kth column of A by 1/√A_{kk} to produce the kth factor column L_{*k}; procedure
cmod(j, k) subtracts L_{jk} times the kth column from the jth column.

The execution order of this program is not the only one possible. The true dependences
require only that cmod(j, k) must follow cdiv(k) and cdiv(k) must follow all
the cmod(k, l) for l < k and L_{kl} ≠ 0. A second form of Cholesky is this:

    cholesky(A, N)
        for k = 1 to N do
            for l = 1 to k - 1 do
                cmod(k, l);
            od
            cdiv(k);
        od

The first form is sometimes called "submatrix" Cholesky and sometimes called
a "right-looking" method. The second form goes by the names "column" or "left-looking".

In the sparse case, sparsity is exploited within the vector cdiv and cmod operations.
Furthermore, most cmod operations are omitted altogether because the multiplying
scalar L_{jk} is zero.
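For concreteness, here is the second (left-looking) form written out in NumPy (our transcription, not from the paper); dense storage is used so that cdiv and cmod are each visible as a one-line vector operation.

    import numpy as np

    def cholesky_left(A):
        L = np.tril(A).astype(float)
        N = L.shape[0]
        for k in range(N):
            for l in range(k):                 # cmod(k, l)
                if L[k, l] != 0.0:             # this test skips most cmods in the sparse case
                    L[k:, k] -= L[k, l] * L[k:, l]
            L[k:, k] /= np.sqrt(L[k, k])       # cdiv(k)
        return L

    A = np.array([[4., 1, 0], [1, 4, 1], [0, 1, 4]])
    L = cholesky_left(A)
    assert np.allclose(L @ L.T, A)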
The column-oriented distributed methods map columns to processors; column k
is stored at processor map[k]. The operation cdiv(k) is performed at map[k]. The
operation cmod(j, k) may be performed at map[j], in which case the column L_{*k}
must be sent out from map[k] after the cdiv(k) is performed. This approach is known
as a "fan-out" implementation. Alternatively, the cmod(j, k) may be performed at
map[k], as follows. Consider the set of updates to column j. There is one update
(cmod(j, k)) for each k < j such that L_{jk} ≠ 0. Processor π can compute the scaled
column L_{jk} L_{*k} for each such k for which map[k] = π. It then adds these scaled
columns together to form an "aggregate update" vector u[j, π] and sends this vector
to processor map[j]. All communication is in the form of these aggregate updates.
Whenever one arrives at a processor, it is subtracted from the updated column. This
method is known as the "fan-in" distributed algorithm.

The node code for a fan-in method is shown in Figure 2. The data structure at
processor π is the integer N-vector map, the columns of A and L mapped to π, the
set mycols = {k | map[k] = π}, and the sets row[j, π] = {k | map[k] = π and L_{jk} ≠ 0},
1 ≤ j ≤ N.

As befits an MIMD code, the schedule of computation is not that presented in
either of the sequential methods above. Instead computations occur at times that
depend on the sequence of arriving data, and are not determined in advance.
It is clear from this code that running time has an O(N) term, because of the

    fan-in(A, L, N, map)

        integer N, map[];
        real L[], A[];

        mycols = { j | map[j] = myname };

        for j = 1 to N do
            if ( row[j, myname] ≠ ∅ or j ∈ mycols ) then
                t = 0;
                for k ∈ row[j, myname] do
                    t = t + L_{jk} (L_{jk}, ..., L_{Nk})^T;
                od
                if ( j ∉ mycols ) then
                    Send aggregate update column t to processor map[j]
                else
                    L_{*j} = (A_{jj}, ..., A_{Nj})^T − t;
                    while not all aggregate updates have been received do
                        Receive an aggregate update column u[j, π] for column j;
                        L_{*j} = L_{*j} − u[j, π];
                    od
                    L_{*j} = L_{*j} / √L_{jj};
                fi
            fi
        od

FIG. 2. Fan-in distributed, column-oriented Cholesky.



outer, sequential for loop; this alone can be shown to imply nonscalability. Actual,
efficient implementations, however, avoid this sequential loop.

3. Methodology. Consider any distributed-memory computation. In order to
assess the communication costs analytically, we have found it useful to employ certain
abstract lower bounds. Our approach is neither new nor deep; it is a straightforward
accounting for communication costs.

Our model assumes that machine topology is given. It assumes that memory consists
of the memories local to processors. It assumes that the communication channels
are the edges of a given undirected graph G = (W, L), and that processor-memory
units are situated at some, possibly all, of the vertices of the graph. The model
includes hypercube and grid-structured message-passing machines, shared-memory
machines having physically distributed memory (the Tera machine) as well as tree-structured
machines like a CM-5.

Let V ⊆ W be the set of all processors and L be the set of all communication
links.

We assume identical links. Let β be the inverse bandwidth (slowness) of a link in
seconds per word. (We ignore start-up costs in this model.)
We assume that processors are identical. Let φ be the inverse computation rate
of a processor in seconds per floating-point operation. Let β0 be the rate at which a
processor can send or receive data, in seconds per word. We expect that β0 and β
will be roughly the same.

A distributed-memory computation consists of a set of processes that exchange
information by sending and receiving messages. Let M be the set of all messages
communicated. For m ∈ M, |m| denotes the number of words in m. Each message
m has a source processor src(m) and a destination processor dest(m), both elements
of V.

For m ∈ M, let d(m) denote the length of the shortest machine path from the
source of the message m to its destination. We assume that each message takes
a certain path of links from its source to its destination processor. Let p(m) =
(l1, l2, ..., l_{d(m)}) be the path taken by message m. For any link l ∈ L, let the set of
messages whose paths utilize l, {m ∈ M | l ∈ p(m)}, be denoted M(l).

The following are obviously lower bounds on the completion time of the computation.
The first three bounds are computable from the set of messages M, each of
which is characterized by its size and its endpoints. The last depends on knowledge
of the paths p(M) taken by the messages.
1. (Average flux)

    ( Σ_{m∈M} |m| · d(m) · β ) / |L| .

2. (Bisection width) Given V0, V1 ⊆ W, V0 and V1 disjoint, define

    sep(V0, V1) ≡ min { |L'| : L' ⊆ L is an edge separator of V0 and V1 }

and

    flux(V0, V1) ≡ Σ_{ m ∈ M : src(m) ∈ Vi, dest(m) ∈ V1−i } |m| .

The bound is

    flux(V0, V1) · β / sep(V0, V1) .

3. (Arrivals/Departures (also known as node congestion))

    max_{v ∈ V} Σ_{dest(m) = v} |m| β0 ;    max_{v ∈ V} Σ_{src(m) = v} |m| β0 .

4. (Edge contention)

    max_{l ∈ L} Σ_{m ∈ M(l)} |m| β .

Of course, the actual communication time may be greater than any of the bounds.
In particular, the communication resources (the wires in the machine) need to be
scheduled. This can be done dynamically or, when the set of messages is known in
advance, statically. With detailed knowledge of the schedule of use of the wires, better
bounds can be obtained. For the purposes of analysis of algorithms and assignment
of tasks to processors, however, we have found this more realistic approach to be
unnecessarily cumbersome. We prefer to use the four bounds above, which depend
only on the integrated (i.e. time-independent) information M and, in the case of the
edge-contention bound, the paths p(M).
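These bounds are straightforward to evaluate mechanically. The Python sketch below (ours, not from the paper) computes the average-flux, node-congestion, and edge-contention bounds for a given message set; shortest-path routing on an undirected machine graph is our assumption (the model allows any paths p(m)), and the bisection-width bound is omitted since it requires choosing a cut.

    from collections import deque

    def comm_lower_bounds(adj, messages, beta, beta0):
        # adj: {processor: [neighbors]}; messages: list of (src, dest, words)
        def path_edges(s, t):
            # BFS tree from s; return the edges of a shortest s-t path
            prev = {s: None}
            queue = deque([s])
            while queue:
                u = queue.popleft()
                for v in adj[u]:
                    if v not in prev:
                        prev[v] = u
                        queue.append(v)
            edges, u = [], t
            while prev[u] is not None:
                edges.append(frozenset((u, prev[u])))
                u = prev[u]
            return edges

        n_links = sum(len(nbrs) for nbrs in adj.values()) // 2
        flux = 0
        arrive, depart, load = {}, {}, {}
        for s, t, w in messages:
            p = path_edges(s, t)
            flux += w * len(p)                 # |m| * d(m)
            depart[s] = depart.get(s, 0) + w
            arrive[t] = arrive.get(t, 0) + w
            for e in p:
                load[e] = load.get(e, 0) + w
        return {"average flux": flux * beta / n_links,
                "node congestion": beta0 * max(max(arrive.values()),
                                               max(depart.values())),
                "edge contention": beta * max(load.values())}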

4. Dense Cholesky. In this section we consider dense, distributed Cholesky
factorization. Since, for the model problem, a constant and substantial fraction of
the work in a sparse Cholesky factorization is spent doing a final dense Cholesky
factorization of a matrix of order Ng, efficiency on this final dense problem is a sine
qua non for a scalable sparse algorithm.

4.1. Mapping columns. Let us consider a right-looking, fan-out distributed
Cholesky. (Our experiments for the sparse case use the fan-in method of Figure 2,
which has been shown to be better than fan-out with respect to communication cost.
For the dense problem, however, the opposite is true.) Assume that the columns of a
dense symmetric matrix of order N are mapped to processors cyclically: column j is
stored in processor map(j) ≡ j mod P.

We first examine the parallelism and the critical path. Execution time must
be at least N²φ max(1/2, N/(3P)), no matter the scheduling of the tasks. The second
term comes from the operation count, N³/3. The first is due to the longest path
in the computation DAG, which has N²/2 multiplies and multiply-adds. This is

TABLE 1
Average interprocessor distances.

            Grid            Torus
  2D     (2/3)√P         (1/2)√P
  3D      P^{1/3}     (3/4)P^{1/3}

the path cdiv(1), cmod(2,1), cdiv(2), cmod(3,2), .... By making column operations
an atomic unit of computation, we have lengthened the critical path from O(N) to
O(N²) operations. Therefore, at most O(N) processors can be used efficiently.
Next, consider communication costs on two-dimensional grid or toroidal machines.
Suppose that P is a perfect square and that the machine is a √P × √P grid. (This
assumption is not necessary for our conclusions, but it simplifies things.)

Consider a mapping of the computation in which the operation cmod(j, k) is performed
by processor map(j) (a fan-out method). After performing the operation
cdiv(k), processor map(k) must send column k to all processors {map(j) | j > k}.

Two possibilities present themselves. These sends may be done separately and
sequentially by processor map(k), with separate messages each taking its own path
to the several destinations; or they may be sent through a spanning tree of the processor
graph, whose root is processor map(k) and whose nodes include the destination
processors.

In order to compute average flux, we need to know the average path length traversed
by the messages. Let us assume that N is greater than P. (Otherwise we
clearly have idle processors.) We first assume that the average message distance is
just the average distance between two randomly chosen processors in the machine.
For a mesh in 2D this is (2/3)√P; for a 2D torus it is (1/2)√P. In 3D the square
roots become cube roots and the constants change to 1 for grids and (3/4) for tori
(Table 1). Even if we are clever about assigning data to processors, and we place
the early, large columns in the middle of the grid, we can at best reduce the average
distances by a modest constant factor. So we will stick to the estimate based on
random positions for source and destination.
Let us fix our attention on 2D grids. If separate messages are sent, the total flux
is (1/3)N²P^{3/2}. There are a total of |L| = 2P links; the total machine bandwidth is
roughly 2P/β and the flux-per-link bound is (1/6)N²√P β seconds.

With spanning tree multicast, the "average distance" computation changes. Most
of the sends will use a tree of total length P reaching all the processors. Every matrix
element will therefore travel over P links, so the total information flux is (1/2)N²P
and the average flux bound is (1/4)N²β seconds.

With multicast, only O(N²/P) words leave any processor. If N ≫ P, processors
see almost the whole (1/2)N² words of the matrix as arriving factor columns. The
bandwidth per processor is β0, so the arrivals bound is (1/2)N²β0 seconds. If N ≈ P
the bound drops to half that, (1/4)N²β0 seconds.

TABLE 2
Communication Costs for Column-Mapped Full Cholesky.

  Type of Bound       Lower bound       Communication Scheme
  Arrivals            (1/4)N²β0
  Average flux        (1/4)N²β          tree multicast
  Average flux        (1/6)N²√P β       separate messages
  Bisection width     (1/4)N²β          tree multicast
  Bisection width     (1/4)N²√P β       separate messages

Consider a bisection of the machine through its vertical midline. Since most sends
must arrive at all processors, we may approximate the flux across the line by assuming
that every factor column crosses. With individual messages, it crosses (1/2}P times,
for a total flux of (1/4}N2 P words. With spl'nning tree multieast, the shape of the
tree plays a role. The number of crossings is at least one. This observation leads to a
weak bound. Instead, we will use a more realistie estimate that is not in fact a bound.
A realistic assumption is that on average the tree intersects the cut in (1/2}VP edges,
since the tree uses half of all edges and there are ,;p of them in the cut. Thus the
flux is (1/4}N2,;p words. The resulting lower bounds are (1/4}N2,;pf3 seconds with
separate messages and (1/4}N2 seeonds with tree multieast.
We summarize these bounds for 2D grids in Table 2.
From the critical path, average work per processor, and the bisection width bounds, we have that the completion time is roughly max(N³φ/3P, (3/2)N²φ, (1/4)N²β) with tree multicast and max(N³φ/3P, (3/2)N²φ, (1/4)N²√P β) with separate messages. Contours of efficiency (in the case P = 1,024) are shown in Figures 3 and 4.
We can immediately conclude that without spanning tree multicast, this is a nonscalable distributed algorithm. We suffer a loss of efficiency as P is increased, with speedup limited to O(√N). Even with spanning tree multicast, we may not take P > 2N/9 and still achieve high efficiency. For example, with β = 10φ and P = 1,000, we require N > 12,000 (72,000 matrix elements per processor) in order to achieve 50% efficiency. This is excessive for dense problems and will prove to be excessive in the sparse case, too.
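To make the 50% efficiency example concrete, here is a minimal sketch of the performance model (Python; illustrative names, and it assumes the reconstructed bounds above). It sums the three lower bounds rather than taking their maximum; summing is the smoothing that reproduces the N > 12,000 figure quoted in the text:

    def efficiency(N, P, phi=1.0, beta=10.0, multicast=True):
        # Efficiency estimate for column-mapped dense Cholesky.
        work = N**3 * phi / (3.0 * P)      # average work per processor
        cpath = 1.5 * N**2 * phi           # critical path of the task graph
        comm = 0.25 * N**2 * beta          # bisection bound, tree multicast
        if not multicast:
            comm *= P ** 0.5               # separate messages
        return work / (work + cpath + comm)

    print(efficiency(12000, 1000))         # -> 0.5, matching the text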

4.2. Mapping blocks. Dongarra, Van de Geijn, and Walker have already shown that on the Intel Touchstone Delta machine (P = 528), mapping blocks is better than mapping columns. In such a mapping, we view the machine as a Pr × Pc grid and we map elements A(i,j) and L(i,j) to processor (mapr(i), mapc(j)). We assume a cyclic mapping here: mapr(i) ≡ i mod Pr, and similarly for mapc. In a right-looking method, two portions of column k are needed to update the block A(rows, cols): L(rows, k) and L(cols, k)

FIG. 3. Iso-efficiency lines for dense Cholesky with column cyclic mapping; separate messages.

FIG. 4. Iso-efficiency lines for dense Cholesky with column cyclic mapping, P = 1,024; tree multicast.

TABLE 3
Communication Costs for Torus-Mapped Full Cholesky.

Type of Bound      Lower bound                       Comment
Arrivals           (1/4)N²β₀ (1/Pr + 1/Pc)
Edge contention    N²β (1/Pr + 1/Pc)                 tree multicast
Edge contention    (1/2)N²β (Pc/Pr + Pr/Pc)          separate messages

(rows and cols are integer vectors here). Again, we may send the data in the form of individual messages from the Pr processors holding the data to those processors that need it, or we may use multicast.
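As a concrete sketch of this mapping (Python; illustrative, not from the paper):

    def owner(i, j, Pr, Pc):
        # Torus-wrap (cyclic) owner of element A(i,j) on a Pr x Pc grid:
        # mapr(i) = i mod Pr, mapc(j) = j mod Pc.
        return (i % Pr, j % Pc)

    # Successive rows and columns cycle over the grid, so the shrinking
    # trailing submatrix of Cholesky stays spread over all Pr*Pc processors.
    print(owner(10, 3, Pr=4, Pc=8))   # -> (2, 3)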
The analysis of the preceding section may now be done for this mapping. Now the compute time must be at least N²φ max(1/(2Pr), N/(3PrPc)); the longest path in the task graph has N²/(2Pr) multiplies and multiply-adds. For the multicast approach, the spanning trees are linear connections of the processors in machine rows and columns. With this information about the paths p(m) taken by messages, we may compute the use of the most heavily loaded edge. This bound dominates the average flux and bisection width. Results are summarized in Table 3. With Pr and Pc both O(√P), the communication time drops like O(P^(-1/2)). With this mapping and with efficient multicast, the algorithm is scalable even when β > φ. Note that P = O(N²), so that storage per processor is O(1). (In fact, this scalable algorithm for distributed Cholesky is due to O'Leary and Stewart in 1985 [21].)
Contours of efficiency for P = 1,024 and Pr = Pc = 32 are shown in Figures 5 and 6.

5. Distributed sparse Cholesky and the model problem. The interesting questions are about the sparse and not the dense case. The best way to extend the results of the last section to the sparse case would be to carry out the same analysis: in the dense case, it provides exact leading terms in the various communication cost measures. But, even for the model problem, this proved to be dauntingly complicated.
The general conclusion can be analytically derived, however. George, Liu, and Ng have shown that the total number of words communicated during a distributed, load-balanced Cholesky factorization of the model problem must be O(P Ng²). Thus, the node congestion bound must be at least O(Ng²). Therefore, since the total time cannot be less than this bound, and the operation count is O(Ng³), asymptotic efficiency requires that P = O(Ng) at most.
We have been able to provide more detailed information, however, from experimental measurement of communication loads. The experiment simulates (on a Sun workstation) the fan-in, distributed, column-oriented sparse Cholesky described above. The software used was Matlab, version 4.0, which has sparse matrix operations and storage [11].
FIG. 5. Iso-efficiency lines for dense Cholesky with 2D cyclic mapping; separate messages.

FIG. 6. Iso-efficiency lines for dense Cholesky with 2D cyclic mapping; tree multicast.

FIG. 7. Four lower bounds (average ops per processor, bisection width, average flux, and maximum arrivals) as functions of the grid size Ng; Pr = Pc = 8.


First, the sparse Laplacian for an Ng × Ng grid was generated and factored, in order to obtain the structure of the factor L. The elimination tree was then computed. Finally, columns were "assigned" to processors by the subtree-to-submesh mapping as follows (a small sketch of this recursive assignment appears after the statistics list below):
1. the top-level separator was mapped cyclically to the whole machine;
2. the left subtree was mapped recursively to the left half-machine;
3. the right subtree was mapped recursively to the right half-machine.
The number of the processor assigned to column k is computed and stored in map(k). The matrix L and the vector map are all that is needed by a simulation that collects the time-integrated statistics
• a vector of operation counts per processor;
• a vector of counts of arriving words per processor;
• the total flux of data in word-links;
• the flux of data (in words) crossing the horizontal and vertical midlines of the machine.
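A small sketch of the recursive assignment described above (Python; the tree representation, names, and the flattened processor list are assumptions for illustration, not the authors' Matlab code):

    def assign(tree, node, procs, mapping):
        # Deal the separator's columns cyclically over 'procs',
        # then recurse on the two half-machines.
        sep = tree[node]
        for t, col in enumerate(sep["columns"]):
            mapping[col] = procs[t % len(procs)]
        half = len(procs) // 2
        if sep["left"] is not None:
            assign(tree, sep["left"], procs[:half], mapping)
        if sep["right"] is not None:
            assign(tree, sep["right"], procs[half:], mapping)

    # Tiny example: a 7-column problem, top separator {6}, two leaf subtrees.
    tree = {
        0: {"columns": [6], "left": 1, "right": 2},
        1: {"columns": [0, 1, 2], "left": None, "right": None},
        2: {"columns": [3, 4, 5], "left": None, "right": None},
    }
    mapping = {}
    assign(tree, 0, procs=list(range(4)), mapping=mapping)
    print(mapping)   # column k -> processor number, i.e. map(k)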
Figure 7 shows the computational load on the processors (ops per processor), the bisection width bound, the maximum number of words arriving at any processor, and the average flux of words per machine link as a function of the grid size Ng with Pr = Pc = 8; there are three data points on each curve, for grids of size Ng = 15, 31, and 63. The slope of the operations-per-processor curve is greater than that of the communication curves, as expected, and when Ng ≫ P efficiency will be good.
Figure 8 shows the behavior of these four metrics as P increases and Ng is fixed at 31. Now, the operations-per-processor curve drops as 1/P, the communication curves

FIG. 8. Four lower bounds; Ng = 31.

do not, and efficiency is very poor when P is not much smaller than Ng.
Figures 9 and 10 show two measures of efficiency over a range of values of Ng and P, with the ratio fixed at one half and at two. The first is a measure of load imbalance: the number of operations done by the most heavily loaded processor (MaxOps) scaled by the average load (AvgOps). The other is a measure of relative communication overhead: the average computation load per processor (AvgOps) scaled by the average number of words transferred per communication link (AvgFlux).
The results for the dense case lead us to suspect that efficiency will be roughly constant if the ratio Ng/P is fixed.
The load-imbalance curve is practically flat, showing that load imbalance is not a concern with this scaling. We cannot conclude from this data that load imbalance would ruin efficiency with P = O(Ng²). (We also cannot conclude that it would not.)
But the communication-overhead curve is still dropping as P and Ng increase. This confirms the main result of this work: one must scale the number of gridpoints at least as the square of the number of processors in order to have efficiency bounded above zero as P is increased. (For this implementation, using subtree-to-subgrid mapping to a grid, even that may not be fast enough.) Thus, the method is not scalable by our earlier definition.
Recently, Thinking Machines Corporation has introduced a highly parallel machine with a "fat-tree" interconnect scheme. A fat tree is a binary tree of nodes. Leaves are processors and internal nodes are switches. The link bandwidth increases geometrically with increasing distance from the leaves. These could potentially work significantly better than meshes, since average interprocessor distance is now O(log P)

FIG. 9. Scaled communication overhead and load balance with Ng = (1/2)P.

FIG. 10. Scaled communication overhead and load balance with Ng = 2P.



FIG. 11. Scaled communication overhead for fat trees, with Ng ∝ P; curves for Ng = Np, Ng = Np/2, and Ng = Np/4, plotted against fat-tree height.

and the bisection bandwidth of the machine is O(P) instead of O(√P).


We simulated column-mapped sparse Cholesky for a fat tree with bandwidth that doubles at each tree level. Columns were mapped in a subtree-to-subtree manner:
1. the top-level separator was mapped cyclically to the whole machine;
2. the left subtree was mapped recursively to the left half-machine;
3. the right subtree was mapped recursively to the right half-machine.
Figure 11 gives scaled communication-overhead curves. It appears that the communication curves are flat for P = O(Ng). Clearly, our conclusions hold for fat trees as well, although they seem to scale somewhat better than meshes. This is additional evidence that column-mapped methods are not scalable for highly parallel machines.

6. Further work. This work should be extended in several ways.

• Experimental performance data should be taken from actual distributed dense and sparse Cholesky and compared with our predictions.
• Variants that map the sparse matrix data in some form of two-dimensional cyclic map, as has been suggested by Gilbert and the author, by Kratzer, and by Venugopal and Naik, should also be scrutinized experimentally.
• The whole Cholesky factorization can be viewed as a DAG whose nodes are arithmetic operations and edges are values. (An n-input SUM operator should be used so as not to predefine the order of updates to an element.) Let us call this the computation DAG. The ultimate problem is to assign all the nodes to processors in such a way that the completion time is minimized.
The computation DAG is quite large. Methods that work with an uncompressed representation of this DAG suffer from excessive storage costs. (This idea is quite like the very old one of generating straight-line code for sparse Cholesky, in which the size of the program is proportional to the number of flops, and hence is larger than the matrix and its factor.)
Of course, Cholesky DAGs have underlying regularity that allows for compressed representations. One such representation is the structure of L. Others, smaller still, have been derived from the supernodal structure of L and are usually only as large as a constant multiple of the size of A.
All approaches to the problem to date have employed an assignment of computation to processors that is derived from the structure of L rather than from the computation DAG. None has succeeded. It is not known, however, if this failure is due to a poor choice of assignment, or alternatively if any assignment based only on the structure of L must in some way fail, or indeed whether there is any assignment for sparse Cholesky computation DAGs that will succeed. These issues require some investigation.
• In these proceedings, Ashcraft proposes a new class of column-oriented methods in which the assignment of work to processors differs from the assignment used in the algorithms we have investigated. His approach may make for a substantial reduction in the average flux and bisection width requirements of the method, and so it should be investigated further. We note, however, that it will not reduce the length of the critical path, since it is based on the same task graph as all column-oriented methods.
• It appears that the scalable implementation of iterative methods is much easier than it is for sparse Cholesky. Indeed, even a naive distributed implementation of attractive iterative methods is quite efficient. For example, with a regular grid, simple mappings of gridpoints to processors allow fast calculation of matrix-vector products. Total flux is kept to a small fraction of the operation count by mapping compact subgrids to processors, so that most edges of the grid connect gridpoints that reside on the same processor. Recent work of Hammond [12], Pommerell, Annaratone, and Fichtner [24], and Pothen, Simon, and Wang [25] makes it clear that this can be done, at some noticeable but supportable preprocessing cost, even for irregular grids. When Krylov subspace methods are used, dot products may be annoying; but all that we require to make them tolerable is, at worst, that the number of gridpoints grow like P log P, not P². Useful, fully parallel preconditioners have also been developed. Finally, domain decomposition methods (which can be viewed as the class of preconditioned Krylov subspace methods designed to take advantage of spatial locality) are even more suitable in the distributed-memory environment. A good example of the power of parallel domain decomposition methods has recently been provided by Bjørstad and Skogen [6], who found that P = 16,384 was no impediment to the efficient solution of finite difference equations with Ng equal to only 640.
• We conclude by admitting that it is not yet clear whether sparse direct solvers can be made competitive at all for highly (P > 256) and massively (P > 4,096) parallel machines.

REFERENCES

[1] E. ANDERSON, A. BENZONI, J. DONGARRA, S. MOULTON, S. OSTROUCHOV, B. TOURANCHEAU, AND R. VAN DE GEIJN, LAPACK for distributed memory architectures: progress report, in Parallel Processing for Scientific Computing, SIAM, 1992.
[2] C. ASHCRAFT, S. C. EISENSTAT, AND J. W. H. LIU, A fan-in algorithm for distributed sparse numerical factorization, SIAM J. Scient. Stat. Comput. 11 (1990), pp. 593-599.
[3] C. ASHCRAFT, S. C. EISENSTAT, J. W. H. LIU, AND A. H. SHERMAN, A comparison of three column-based distributed sparse factorization schemes, Research Report YALEU/DCS/RR-810, Comp. Sci. Dept., Yale Univ., 1990.
[4] C. ASHCRAFT, S. C. EISENSTAT, J. W. H. LIU, B. W. PEYTON, AND A. H. SHERMAN, A compute-ahead fan-in scheme for parallel sparse matrix factorization, in D. Pelletier, editor, Proceedings, Supercomputing Symposium '90, pp. 351-361, Ecole Polytechnique de Montreal, 1990.
[5] C. ASHCRAFT, The fan-both family of column-based distributed Cholesky factorization algorithms, these proceedings.
[6] P. BJØRSTAD AND M. D. SKOGEN, Domain decomposition algorithms of Schwarz type, designed for massively parallel computers, Proceedings of the Fifth International Symposium on Domain Decomposition, SIAM, 1992.
[7] J. DONGARRA, R. VAN DE GEIJN, AND D. WALKER, A look at scalable dense linear algebra libraries, Proceedings, Scalable High Performance Computer Conference, Williamsburg, VA, 1992.
[8] A. GEORGE, J. W. H. LIU, AND E. NG, Communication results for parallel sparse Cholesky factorization on a hypercube, Parallel Comput. 10 (1989), pp. 287-298.
[9] A. GEORGE, M. T. HEATH, J. W. H. LIU, AND E. NG, Solution of sparse positive definite systems on a hypercube, J. Comput. Appl. Math. 27 (1989), pp. 129-156.
[10] J. R. GILBERT AND R. SCHREIBER, Highly parallel sparse Cholesky factorization, SIAM J. Scient. Stat. Comput., to appear.
[11] J. R. GILBERT, C. MOLER, AND R. SCHREIBER, Sparse matrices in MATLAB: design and implementation, SIAM J. Matrix Anal. Appl. 13 (1992), pp. 333-356.
[12] S. W. HAMMOND, Mapping Unstructured Grid Computations to Massively Parallel Computers, PhD thesis, Dept. of Comp. Sci., Rensselaer Polytechnic Institute, 1992.
[13] S. W. HAMMOND AND R. SCHREIBER, Mapping unstructured grid problems to the Connection Machine, in P. Mehrotra, J. Saltz, and R. Voigt, editors, Unstructured Scientific Computation on Multiprocessors, pp. 11-30, MIT Press, 1992.
[14] M. T. HEATH, E. NG, AND B. W. PEYTON, Parallel algorithms for sparse linear systems, SIAM Review 33 (1991), pp. 420-460.
[15] S. G. KRATZER, Massively parallel sparse matrix computations, in P. Mehrotra, J. Saltz, and R. Voigt, editors, Unstructured Scientific Computation on Multiprocessors, pp. 178-186, MIT Press, 1992. A more complete version will appear in J. Supercomputing.
[16] C. E. LEISERSON, Fat-trees: universal networks for hardware-efficient supercomputing, IEEE Trans. Comput. C-34 (1985), pp. 892-901.
[17] G. LI AND T. F. COLEMAN, A parallel triangular solver for a distributed memory multiprocessor, SIAM J. Scient. Stat. Comput. 9 (1988), pp. 485-502.
[18] M. MU AND J. R. RICE, Performance of PDE sparse solvers on hypercubes, in P. Mehrotra, J. Saltz, and R. Voigt, editors, Unstructured Scientific Computation on Multiprocessors, pp. 345-370, MIT Press, 1992.
[19] M. MU AND J. R. RICE, A grid based subtree-subcube assignment strategy for solving PDEs on hypercubes, SIAM J. Scient. Stat. Comput. 13 (1992), pp. 826-839.
[20] A. T. OGIELSKI AND W. AIELLO, Sparse matrix algebra on parallel processor arrays, these proceedings.
[21] D. P. O'LEARY AND G. W. STEWART, Data-flow algorithms for parallel matrix computations, Comm. ACM 28 (1985), pp. 840-853.
[22] L. S. OSTROUCHOV, M. T. HEATH, AND C. H. ROMINE, Modeling speedup in parallel sparse matrix factorization, Tech. Report ORNL/TM-11786, Mathematical Sciences Section, Oak Ridge National Lab., December 1990.
[23] D. PATTERSON, Massively parallel computer architecture: observations and ideas on a new theoretical model, Comp. Sci. Dept., Univ. of California at Berkeley, 1992.
[24] C. POMMERELL, M. ANNARATONE, AND W. FICHTNER, A set of new mapping and coloring heuristics for distributed-memory parallel processors, SIAM J. Scient. Stat. Comput. 13 (1992), pp. 194-226.
[25] A. POTHEN, H. D. SIMON, AND L. WANG, Spectral nested dissection, Report CS-92-01, Comp. Sci. Dept., Penn State Univ. Submitted to J. Parallel and Distrib. Comput.
[26] E. ROTHBERG AND A. GUPTA, The performance impact of data reuse in parallel dense Cholesky factorization, Stanford Comp. Sci. Dept. Report STAN-CS-92-1401.
[27] E. ROTHBERG AND A. GUPTA, An efficient block-oriented approach to parallel sparse Cholesky factorization, Stanford Comp. Sci. Dept. Tech. Report, 1992.
[28] Y. SAAD AND M. H. SCHULTZ, Data communication in parallel architectures, Parallel Comput. 11 (1989), pp. 131-150.
[29] S. VENUGOPAL AND V. K. NAIK, Effects of partitioning and scheduling sparse matrix factorization on communication and load balance, Proceedings, Supercomputing '91, pp. 866-875, IEEE Computer Society Press, 1991.
[30] L. HULBERT AND E. ZMIJEWSKI, Limiting communication in parallel sparse Cholesky factorization, SIAM J. Matrix Anal. Applics. 12 (1991), pp. 1184-1197.
SPARSE MATRIX FACTORIZATION
ON SIMD PARALLEL COMPUTERS

STEVEN G. KRATZER* AND ANDREW J. CLEARY†

Abstract. Massively parallel SIMD computers, in principle, should be good platforms for performing direct factorization of large, sparse matrices. However, the high arithmetic speed of these machines can easily be overcome by overhead in intra- and inter-processor data motion. Furthermore, load balancing is difficult for an "unstructured" sparsity pattern that cannot be dissected conveniently into equal-size domains. Nevertheless, some progress has been made recently in LU and QR factorization of unstructured sparse matrices, using some familiar concepts from vector-supercomputer implementations (elimination trees, supernodes, etc.) and some new ideas for distributing the computations across many processors. This paper describes programs based on the standard data-parallel computing model, as well as those using a SIMD machine to implement a dataflow paradigm.

Key words. sparse matrices, parallel processing

AMS(MOS) subject classifications. 65Y05, 65F50

1. INTRODUCTION

Computations involving large, sparse matrices have a great deal of inherent concurrency. Therefore, one would expect fine-grained, "massively parallel" computers to provide good performance in these applications. In fact, several workers have reported excellent throughput figures for iterative algorithms, such as conjugate gradients and relaxation, running on these machines [1]. However, in some applications, the structural and numerical properties of the problem make direct solution (by factorization, forward and backward substitution) more appropriate. Development of massively parallel approaches for sparse matrix factorization has been slow because, even though the factorization problem may contain enough concurrency to occupy a large number of processors, the machine's high arithmetic throughput can be overwhelmed easily by overhead in data motion.
Early work on sparse matrix factorization emphasized reduction in the storage and operation counts, since memory was at a premium and software-controlled, floating-point arithmetic was much slower than other operations. Recent work, e.g. [2], has been driven by the performance characteristics of modern computer architectures, which often include high-speed floating-point hardware. Very high performance has been obtained from pipelined supercomputers [3], while cost-effective processing has been demonstrated on workstations [4]. Sparse-matrix codes have been run successfully on MIMD (multiple instruction/multiple data stream) multiprocessors, including shared-memory [5] and distributed-memory architectures [6]. Research on SIMD (single instruction/multiple data stream) implementation of sparse matrix factorization is less mature, and the current status of this work is the subject of this paper.

*Supercomputing Research Center, 17100 Science Drive, Bowie, MD 20715.

†Sandia National Laboratory and Australian National University.

In some applications, the sparse matrix has a special structure that allows it to be mapped efficiently to an array of processors. For example, when nested dissection is used on a uniform grid problem, the factorization decomposes into equal-sized subproblems [7]. However, there are many "unstructured" problems where such simplifications are not appropriate, and the original matrix must be treated with general sparse-matrix techniques. This paper is concerned with LU and QR factorizations of these unstructured matrices on SIMD computers. For LU factorization, we assume that the nonzero structure of A is symmetric, although the numerical values may be nonsymmetric. For most applications, wherever aij ≠ 0 and aji = 0, we can treat aji as a nonzero with very little additional cost. This amounts to using the (symmetric) structure of A + Aᵀ in place of that of A, and greatly simplifies both the symbolic preprocessing and the numerical factorization.

2. SIMD ARCHITECTURES

Several parallel computers have been based on the SIMD computing model, including ILLIAC-IV, MPP, DAP, Connection Machine CM-2, and MasPar MP-1. The last two designs will be summarized briefly here, since they are the focus of most of the current research on numerical applications. In any SIMD architecture, the processors all receive the same instruction from a sequencer, and each processor may be disabled by a conditional test based on local data.
The MasPar MP-1 has a rectangular array of processors, each of which can communicate directly with its eight nearest neighbors. Grid-based communications operations include nearest-neighbor transfer and broadcast of a specified element in each row (column) across the entire row (column). A router provides general point-to-point communications, but they are much slower than grid-based transfers. Each processor has a local memory that can be addressed indirectly, to support parallel array access with a processor-dependent index.
The Connection Machine CM-2 consists of a hypercube communications network where each node (called a "sprint node") contains 32 bit-serial processors, a floating-point processor, and a memory unit. The hypercube can be configured via software and firmware to emulate a grid with any reasonable number of dimensions. As with the MasPar, the processors have indirect-addressing facilities and a router for arbitrary communications patterns.

3. ALGORITHMS

In this section, the algorithms for LU and QR factorization will be reviewed, with particular emphasis on the opportunities for exploiting parallelism in a SIMD architecture.

3.1 LU factorization. Concurrency in a computation often can be illustrated by the format in which the algorithm is written. The code for dense LU factorization by rows in Figure 3-1 gives us no quick insight on potential speedups due to parallel processing. For example, simple but nontrivial analysis is required to see that the iterations of the innermost loop (on j) can be done in parallel, while the loops on i and k must run sequentially. Figure 3-2, on the other hand, shows very clearly how

concurrency, in the form of vector + scalar*vector primitives, could be used in the dense factorization. For a sparse matrix, additional parallelism is available because the iterations of the outer loop (on j) need not be done in sequence. If U(k,j) = 0 then column j of L can be calculated without waiting for column k. (Throughout this paper, A(:,j) and A(i,:) mean column j and row i of A, respectively, while A(i:i', j:j') indicates the submatrix of elements in rows i through i' and columns j through j'.)
In Figure 3-3, outer product operations are used to express concurrency. The column-scaling operation (to compute a column of L) has n-fold parallelism. However, the Schur complement update, where most of the work is required, allows n²-fold parallelism. LU factorization can also be expressed in terms of inner products, but the required summation of values that are spread across the processor grid is relatively slow on the CM-2 and MP-1, so this formulation is not well suited to current SIMD machines.
To extend the outer product formulation to the sparse case, yielding the "multifrontal" algorithm, we use the concept of the elimination tree [8, 9]. This tree has one node for each column of A. The parent of node j is node i if L(i,j) is the first nonzero in column j below the diagonal. An important property of this tree is that if L(i,j) ≠ 0 (meaning that a multiple of column j must be subtracted from column i), then node i must be an ancestor (parent, grandparent, etc.) of node j. This allows us to process the nodes by a postorder (depth-first) traversal of the tree, rather than in order (1, 2, ..., n) as in Figure 3-3, since a postorder traversal visits a node before its ancestors.
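A tiny sketch of this parent computation (Python; illustrative):

    def etree_parents(col_struct, n):
        # parent(j) = row index of the first nonzero below the
        # diagonal in column j of L; the root has no parent.
        parent = [None] * n
        for j in range(n):
            below = [i for i in col_struct[j] if i > j]
            if below:
                parent[j] = min(below)
        return parent

    # Arrowhead example: columns 0..2 each have a nonzero in row 3.
    cols = {0: [0, 3], 1: [1, 3], 2: [2, 3], 3: [3]}
    print(etree_parents(cols, 4))   # -> [3, 3, 3, None]; node 3 is the root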

Figure 3-1. LU factorization of a dense, n × n matrix, written in scalar form.

For i = 1 to n {
    For k = 1 to i-1 {
        L(i,k) = A(i,k)/U(k,k)
        For j = k+1 to n
            A(i,j) = A(i,j) - L(i,k)*U(k,j)
    }
    For j = i to n
        U(i,j) = A(i,j)
}

Figure 3-2. Dense LU factorization written in vector form.

For j = 1 to n {
    For k = 1 to j-1 {
        A(j:n, j) = A(j:n, j) - L(j:n, k)*U(k,j)
    }
    L(j+1:n, j) = A(j+1:n, j)/A(j,j)
    U(j, j:n) = A(j, j:n)
}

Figure 3-3. Dense LU factorization using outer products.

For j = 1 to n {
    /* compute row j of U */
    U(j, j:n) = A(j, j:n)
    /* compute column j of L */
    L(j,j) = 1
    L(j+1:n, j) = A(j+1:n, j)/U(j,j)
    /* outer product to update Schur complement */
    A(j+1:n, j+1:n) = A(j+1:n, j+1:n) - L(j+1:n, j)*U(j, j+1:n)
}
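For a runnable reference point, here is a NumPy transcription of the outer-product formulation of Figure 3-3 (a sketch: no pivoting, and it assumes all pivots are nonzero):

    import numpy as np

    def lu_outer(A):
        # Dense LU by rank-1 (outer product) updates, as in Figure 3-3.
        A = A.astype(float).copy()
        n = A.shape[0]
        L, U = np.eye(n), np.zeros((n, n))
        for j in range(n):
            U[j, j:] = A[j, j:]                    # row j of U
            L[j+1:, j] = A[j+1:, j] / A[j, j]      # column j of L
            A[j+1:, j+1:] -= np.outer(L[j+1:, j], U[j, j+1:])  # Schur update
        return L, U

    A = np.array([[4., 2., 1.], [2., 5., 2.], [1., 2., 6.]])
    L, U = lu_outer(A)
    print(np.allclose(L @ U, A))   # -> True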

This method is expressed in Figure 3-4 as a recursive procedure, which is initiated by calling factor(n). Since factor(j) processes each child of node j before dealing with node j itself, a depth-first traversal of the tree results. In a parallel environment, sibling nodes can be processed concurrently, in addition to the parallelism in each outer product computation.
The multifrontal algorithm has advantages for sequential computers, especially those with storage hierarchies such as caches and virtual memory paging. Only a small amount of data must reside in fast memory while processing a given node, and the depth-first traversal leads to a high degree of data access locality. Another advantage is that, compared with other formulations (such as Figure 3-2), fewer indirect-addressing operations are needed. This reduction is achieved by noting that when a node has the same nonzero structure as its parent, the "gather" and "scatter" in Figure 3-4 are redundant, because the elements to be gathered for front(parent(j)) are the same ones that were stored from front(j). A string of adjacent nodes in the tree with identical nonzero structures is known as a "supernode." Even in an unstructured matrix, the fill-in that occurs during factorization tends to produce large supernodes. It is convenient to transform the tree into a "supernode tree" by merging the nodes of each supernode, and the computational kernel is now the "partial factorization" of a front matrix. For a supernode J = {j1, j2, ..., jk}, the partial factorization of front(j1) is simply the first k steps of the dense factorization algorithm in Figure 3-3.

Figure 3-4. Recursive procedure for multifrontal sparse LU factorization.

factor(j) {
    /* factor the submatrix for the subtree rooted at node j */
    for each child j' of j
        { factor(j'); }
    gather front(j) = [ A(i,k) : L(i,j) ≠ 0 and L(k,j) ≠ 0 ]
    eliminate x_j from front(j)
    save U(j,:) and L(:,j)
    scatter front(j) back to memory
}

3.2 QR factorization. A relatively simple method for sparse QR factorization using Givens rotations is the "diagonal pivot row" algorithm [10]. First, using the results of a symbolic Cholesky factorization of AᵀA, a data structure for R is allocated and initialized to 0. The numerical values of A are then processed one row at a time. For each row vector A(i,:) we apply a sequence of rotation matrices {Qij}, each combining A(i,:) with a row of R, to produce an updated triangular matrix R'; we then replace R with R' and process the next row of A.


Qij is a Givens rotation operator that zeroes the j-th component of A(i,:). After this rotation, the next nonzero component of A(i,:) is in position parent(j), the parent of node j in the elimination tree derived from the symbolic Cholesky factorization of AᵀA. This is due to the definition of parent(j) and the fact that the upper-triangular factors from the Cholesky factorization of AᵀA and the QR factorization of A are equal. Clearly, each rotation allows some parallelism, because the nonzeroes can be processed simultaneously. Furthermore, rotations can be done in parallel as long as they involve different A rows and different R rows.
LU factorization of a large, sparse matrix can be reduced to a large number of small, dense subproblems by means of the multifrontal algorithm. Similarly, the "generalized row merge" algorithm developed by Liu [11] can reduce a large, sparse QR factorization problem to a collection of small, "essentially dense" subproblems. This method is based on the "row merge tree," which is identical to the elimination tree for Cholesky factorization of AᵀA with the addition of a leaf node for each row of A. This QR factorization algorithm is similar to the multifrontal algorithm for LU factorization (Figure 3-4) in that it performs a depth-first traversal of the row merge tree. The basic kernel to execute for each interior tree node is the "submatrix merge" operation: the two upper trapezoidal matrices R1 and R2 are stacked and their QR factorization is computed, yielding a third trapezoidal matrix R3.

Givens or Householder transformations can be used in this factorization [12]. Certainly the submatrix merge computation permits some parallelism, since several transformations can be performed simultaneously, and each transformation can be regarded as a vector operation. Furthermore, sibling nodes in the tree can be processed in parallel, as in the multifrontal algorithm.

4. DATA-PARALLEL METHODS

In data-parallel programming, the algorithm's data structures are partitioned into many pieces, and each piece is assigned to a processor. It may appear that this is the only way to program a SIMD machine, but Section 5 shows that this is not true. However, the data-parallel viewpoint has led to some useful ideas, as this section illustrates.

4.1 Breadth-first multifrontal LU factorization. The elimination tree, aside from its role in sequential algorithms, also serves as a precedence graph for parallel factorization. That is, the tree exposes the constraints that must be placed on execution order. In particular, if elimination of one node (the outer product operation of Figure 3-3) is considered to be an atomic operation, then a node can be eliminated if and only if all of its children have been eliminated. (Violating this constraint will give a fill pattern that is different - and possibly denser - than that predicted by the symbolic factorization which was used for storage allocation.) Therefore, we can extract the maximum parallelism by processing the tree one level at a time, beginning farthest from the root, and eliminating all nodes at a given level simultaneously. The elimination itself can be parallelized by mapping each nonzero to a separate processor.
Gilbert and Schreiber [13] used this approach to perform sparse Cholesky factorization on the CM-2. In their first implementation they found that most of the run time was spent on communication, rather than computation, because the router was used to map the elements of front(j) to those of front(parent(j)). They also reported on an improved version, which exploits the supernode structure of the Cholesky factor to reduce router usage. This is similar to the use of supernodes to reduce gather/scatter overhead in vector-supercomputer implementations. Each "major step" of the program factors all of the supernodes at a given level of the supernode tree. Between major steps, the router must be used to transport data among the fronts. Within a major step, each front matrix is mapped to a square region of the two-dimensional "playing field" of virtual processors, and grid-based communications are used to transport pivot rows and columns across each region.
Although this approach allows massive parallelism to be applied even to matrices of modest size, processor utilization could suffer for unstructured matrices because, at a given tree level, the supernodes do not all have the same size. The number of minor steps (each of which involves a multiply-add in every processor) in one major step depends on the size of the largest front matrix to be factored at the current tree level. After the smallest front (the one with the fewest nodes to eliminate) has been completely processed, processor utilization begins to fall.
The highest throughput reported by the authors in [13] using their star-Lisp implementation was quite small, less than one Mflops (million floating-point operations per second). Contributing factors include communications costs and the high overhead of star-Lisp. The authors point out that a fast implementation of dense matrix factorization is an essential building block in order to achieve high performance on sparse matrices.

4.2 Depth-first multifrontal LU factorization. A method for LU factorization of unstructured, sparse matrices on a SIMD machine was described recently [14]. The program processes one supernode at a time, using a depth-first traversal of the elimination tree. Although this approach has been used successfully with vector supercomputers, at first glance one would not expect it to yield enough parallelism for the SIMD machines we are considering here. However, experiments with matrices from the Harwell-Boeing test collection show that the processor utilization achieved by this method is surprisingly high.

While processing an individual supernode J = {j1, j2, ...}, the matrix elements being accessed, namely front(j1), can be regarded as a dense matrix. Therefore, the most obvious way to map this computation onto a P × P processor grid is to assign rows of the front to consecutive processor rows {1, ..., P} (wrapping cyclically from P back to 1) and use a similar assignment for columns. This mapping would maximize processor utilization, given that we are processing one supernode at a time. However, a large overhead would occur when transferring data between fronts. This is because, for a given nonzero U(i,j) or L(i,j), each term contributing to its final value comes from a different processor, and the router would be required to handle the irregular transfer patterns.
In order to reduce the communications cost, a mapping based on the elimination tree can be used. Matrix elements are loaded into the processors according to a mapping function φ as follows: A(i,j) is stored in processor (φ(i), φ(j)), where

    φ(k) = (level(k) mod P) + 1.

Level(k) is the level of node k in the tree, defined as the distance of this node from the root. All contributions to the final value of U(i,j) or L(i,j) are computed in processor (φ(i), φ(j)), so the router is not needed. Grid-based communications (accessed via the FORTRAN-90 "spread" primitive) are used to spread pivot rows and columns across the processor array during elimination. Many massively parallel computers are designed to give higher bandwidth for such grid-based transfers than for arbitrary communication patterns.
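A small sketch of this level-based mapping (Python; it reuses the parent-array representation from the earlier sketch and is illustrative, not the authors' Fortran):

    def levels(parent):
        # Distance of each node from the root (root has parent None).
        lev = []
        for k in range(len(parent)):
            j, d = k, 0
            while parent[j] is not None:
                j, d = parent[j], d + 1
            lev.append(d)
        return lev

    def phi(k, lev, P):
        # Nodes at the same level (mod P) share a grid coordinate,
        # so element A(i,j) lives on processor (phi(i), phi(j)).
        return (lev[k] % P) + 1

    parent = [3, 3, 3, None]      # the small arrowhead tree from above
    lev = levels(parent)
    print([phi(k, lev, P=2) for k in range(4)])   # -> [2, 2, 2, 1]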
To illustrate this mapping, we show in Figure 4-1 a matrix structure that typically arises from dissection orderings.¹ The sparsity structure, the elimination supernode tree, and the mapping are included in Figures 4-1a, b, and c, respectively. Figure 4-1c applies when the processor grid dimension P is at least as large as the tree height H. If P < H, as is usually the case, then the mapping causes the front matrices to be folded onto the grid.

¹ Dissection is mentioned here only for illustrative purposes; any reordering algorithm (such as minimum degree) can be used with this parallel factorization method.

Figure 4-1a. Sparsity structure from dissected grid.

Figure 4-1b. Elimination supernode tree; each numbered branch corresponds to a block in Figure 4-1a.

Figure 4-1c. Mapping of submatrices onto P × P processor grid.

For a typical sparse matrix, this mapping is not one-to-one; many nonzeros will be mapped to each processor. For example, if level(j) = level(j') + kP for some integer k, then A(j,j) and A(j',j') are mapped into the same processor. The lower- and upper-triangular parts of the original matrix A, as well as the final L and U factors, are stored in a three-dimensional array called MATRIX, which can be viewed as a one-dimensional array within each processor. Space is allocated in MATRIX by a simple bin sorting procedure during the preprocessing phase. To determine the amount of storage used by L(:,j) in processors {(1, φ(j)), ..., (P, φ(j))}, we set up P empty bins and then assign each nonzero L(i,j) to bin φ(i). ("Assigning" a nonzero simply means incrementing the bin's counter by 1.) L(:,j) will occupy mj words of storage in each of the aforementioned processors, where mj is the largest of the P bin counts.
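A sketch of that bin count, under the reconstruction of mj above (Python; lev is the level array from the earlier sketch, and all names are illustrative):

    def column_height(rows_j, lev, P):
        # m_j: the largest number of nonzeros of L(:, j) that land in
        # any one of the P bins, where nonzero L(i,j) goes to the bin
        # of phi(i) (bins are 0-based here).
        bins = [0] * P
        for i in rows_j:
            bins[lev[i] % P] += 1
        return max(bins)

    # Each processor in grid column phi(j) then reserves m_j words for L(:, j).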
Before performing the partial factorization of a front matrix front(j), the elements of front(j) are gathered from MATRIX into a four-dimensional array FRONT of size (mj, mj, P, P), where mj was computed by the above-mentioned storage allocation process. Using compiler directives, the last two dimensions of FRONT are declared to be "physical," meaning that they correspond to the two dimensions of the P × P processor grid (or, in the case of the CM-2 or the CM-5, the embedding of this grid in the true physical network). The first two dimensions of FRONT are "virtual" and are mapped along the local memory space within each processor. Thus, each processor contains an mj × mj submatrix of front(j).
Using the outer-product algorithm of Figure 3-3, the number of iterations (i.e., the number of parallel multiply-add operations) required to eliminate node j is mj². The value of mj depends on the number of nonzeros in front(j) as well as the distribution of the mapping coordinates (φ(i), φ(j)) for these nonzeros in the space [1, P] × [1, P]. If this distribution is very non-uniform, meaning that some processors have many more elements of front(j) than do others, then mj will be larger than necessary and many elements of FRONT will be unused; although each parallel arithmetic instruction performs P² scalar operations, only a fraction of these operations are actually used. To measure this phenomenon, we define the processor efficiency η by

    η = Ws / (2 P² Σj mj²),

where the serial work (in flops) to factor A is Ws = 2 Σj nj², and nj is the number of nonzeros in L(:,j). For an n × n dense matrix such that P divides n evenly, η = 1.
Note that η measures a structural quantity, namely the uniformity with which nonzeros have been mapped onto the processor grid by the function φ. η does not measure the relative amounts of time spent on computation, communication, gather/scatter, and other overheads. Certainly these overhead costs are important, and they depend on the details of software implementation and compiler performance.
In Figure 4-2, η is plotted against P (the processor grid dimension) for matrices from finite element grids of various sizes. The highest efficiency is obtained when a large matrix is stored on a small processor grid. Table 1 shows the efficiency for several matrices from the Harwell-Boeing collection, demonstrating that efficiencies above 30% are obtained for many practical problems on arrays of 4096 processors. While most CM-2 and MP-1 systems contain at least 4096 processors, some new architectures such as the CM-5 and iWarp will pack more processing throughput into a smaller number of processors. This approach will provide even better efficiency for these new systems.

Figure 4-2. Processor efficiency η vs. P for sparse matrices from finite element grids (50×50, 100×100, and 150×150), with 9-point operator and minimum-degree ordering.

Table 1. Processor efficiency for matrices from the Harwell-Boeing collection on 64 × 64 processor grids.

Matrix       Size (n)   Processor efficiency % (η)
BCSSTK27     1224       20.2
BCSSTK24     3562       35.6
BCSSTK28     4410       29.4
BCSSTK29     13992      48.8
BCSSTK30     28924      46.0

Although the tree-based mapping eliminates the need for the router, indirect addressing (indexing) is required before and after each supernode elimination in order to implement the gather and scatter operations in Figure 3-4. (The elements of a given front are not stored at the same memory address in every processor.) The CM-2 or MP-1 hardware performs indexing at speeds comparable with that of floating-point computation, but we must also be concerned with the complexity of computing the indices of the values being accessed. In principle, one could precompute and store all of the indices, but for large matrices the indices would use far more storage than the numerical values, and this would severely limit the applicability of the program. The method used in this work involves storing one integer (derived from the elimination tree) for each nonzero, and performing grid-based communications and an integer addition in order to compute the indices. Details are provided in [14].

This approach to sparse LU factorization has been implemented as a Fortran program for the MasPar MP-1. Initial throughput measurements were not encouraging, since for a large matrix (BCSSTK30) on a 4096-processor machine only 11.3 Mflops was obtained. As Figure 4-3 shows, most of the time was spent on communications; blocking methods may help to reduce this problem. The time spent on arithmetic and gather/scatter also needs to be reduced in order for this program to be of practical value.

Figure 4-3. Relative time spent on communications, arithmetic, and gather/scatter during LU factorization of the BCSSTK30 matrix on an MP-1 with 4K processors.

5. MESSAGE-PASSING METHODS

In the previous section we concentrated on algorithms designed around the data-parallel model of computation, which is the typical model utilized for SIMD computers. The message-passing model of computation is commonly used with MIMD machines, but in this section we look at programs that use this model for SIMD machines. Two programs will be considered: one for QR factorization, and one for Cholesky factorization. In both cases, computations and communications are applied to fixed-length segments of rows or columns of the matrices, where the segment length is an adjustable parameter that controls the granularity of the computations. Segmentation of the data allows efficient use of the CM-2 floating point hardware, which has a vector-type architecture, and reduces the number of communications startups.
In these SIMD message-passing programs, data and computations are mapped to the hardware so that only nearest-neighbor communication is required. In the QR algorithm, the processors are configured into a one-dimensional ring, while for the Cholesky algorithm a two-dimensional torus is used. These and other differences are mainly a result of the differing requirements of the QR calculation versus that of Cholesky.

5.1 QR factorization. An implementation of QR factorization on the CM-2 is described in [15]. The mapping of computations to processors bears some resemblance to the tree-based mapping for LU factorization that was described in Section 4.2, but the temporal alignment of computations is very different. At any given time, rather than working together on one tree node (or supernode), each

sprint node (each cluster of 32 processors - see Section 2) works on a different tree node. Grid-based communications are used to carry messages between sprint nodes; each message includes floating-point data and control tokens.
QR factorization by Givens rotations can be mapped efficiently to a ring of P "stages" (a stage will be defined later) with nearest-neighbor connectivity. All rotations that eliminate nonzeroes in column j are mapped to stage φ(j), where

    φ(j) = (level(j) mod P) + 1,

as in Section 4.2. Here, level(j) is the level of node j in the elimination tree of AᵀA.
This mapping is a one-dimensional version of the mapping of Section 4.2, and is illustrated in Figure 5-1. The motivation for this mapping is that, after a row A(i,:) has undergone a rotation in stage φ(j) to eliminate the nonzero in column position j, the next nonzero in this row will be in column j' = parent(j). Therefore level(j') = level(j) - 1, and φ(j') = (φ(j) - 1) mod P. Hence, with this mapping, nearest-neighbor communications are exactly what the algorithm needs, and we can exploit the high-bandwidth grid communications facilities of the CM-2.
Although the QR factorization of A is closely related to the Cholesky factorization of AᵀA, these two operations differ in the precedence constraints that govern concurrency. For multifrontal Cholesky factorization (or LU factorization of symmetric-pattern matrices), in order to preserve the precomputed fill pattern, a node can be eliminated only if it currently is a leaf - that is, only if its children have been eliminated. For QR factorization, however, work can be performed for any node j for which the leftmost nonzero in some row A(i,:) is in column j.
To implement sparse QR factorization on a ring of processors, the rows of A are passed around the ring, while the rows of R are stored in the local memories of the stages. If A(i,j) is the leftmost nonzero in row A(i,:), then this row is loaded initially into stage φ(j), which is where row R(j,:) resides. Each row of A that is initially loaded into this stage, or sent from the previous stage, carries a tag that includes a "node index" between 0 and k - 1, where k is the number of R rows stored in this stage. This node index is used by the stage to fetch the appropriate R row from local memory. The stage computes the Givens rotation

    [  c  s ] [ A(i,:) ]
    [ -s  c ] [ R(j,:) ],

then returns the modified value of R(j,:) to local memory and delivers the new value of A(i,:) to the next stage. The tag accompanying the latter value is found from a lookup table in the stage, and differs from the one carried by A(i,:) as it entered this stage.
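A scalar sketch of the rotation kernel a stage applies (Python), with the sign convention chosen so that position j of the arriving A row is annihilated; the convention and names are assumptions for illustration:

    import math

    def givens_apply(a_row, r_row, j):
        # Compute and apply the Givens rotation that zeroes a_row[j]
        # against r_row (dense lists standing in for row segments).
        f, g = r_row[j], a_row[j]
        h = math.hypot(f, g)
        c, s = (1.0, 0.0) if h == 0.0 else (f / h, g / h)
        for k in range(j, len(a_row)):
            rk, ak = r_row[k], a_row[k]
            r_row[k] = c * rk + s * ak    # updated R row stays in the stage
            a_row[k] = -s * rk + c * ak   # rotated A row moves on; entry j is now 0
        return c, s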

Figure 5-1. Mapping of tree nodes for sparse QR factorization to stages of the ring; the P stages handle tree levels {0, P, 2P, ...}, {1, P+1, 2P+1, ...}, ..., {P-1, 2P-1, ...}.

In order for this approach to give reasonable processor efficiency, the number of stages P must be less than the tree height H, so that the mapping "wraps around" from stage P to stage 1. If we let each sprint node be one stage, then a CM-2 with 16,384 processors has P = 512 stages; many applications give rise to sparse matrices with tree heights exceeding this number. Even so, one might expect the load balance to be poor, because in a typical tree, each level in the "canopy" (farthest from the root) has more nodes than each level near the root. However, the amount of work to be done at a given level is not simply proportional to the number of nodes at this level. Each canopy node is associated with only a few rows from A, and each of these rows is very sparse. Closer to the root, many A rows arrive to be processed at each node, and the rows are denser because they have suffered fill-in from previous rotations. These effects combine to yield processing efficiency (the average fraction of processors that are doing useful work) around 50% for most matrices from regular and irregular finite element grids, as long as the tree height is greater than P.
The CM-2 is used here as a ring-connected pipeline of stages; each stage (each sprint node) functions as a vector processor, with a vector length of 32. A row with more than 32 nonzeroes is broken up into segments, where each segment contains up to 32 nonzeroes and a control tag. Aside from the aforementioned "node index," this tag also includes bit fields to mark the first and last segments of each row. If a row A(i,:) is several segments long, then it can be pipelined through several stages, with each stage applying a different rotation to a different segment of the row at any given moment.
As a row from A passes through a stage, the rotation applied to it may contribute some fill, lengthening the row, because only the nonzeroes are transmitted between stages. Even though the stages operate synchronously, they accept and deliver data at differing rates, because they are operating on different portions of the matrix, with different amounts of fill-in occurring. Software-managed queues are placed between the stages to absorb fluctuations in data rates, thereby increasing processor utilization.
Figure 5-2 shows a typical profile of processor utilization versus time step for this QR factorization method. The fluctuations in Figure 5-2 are due to the complex interactions among the P coupled queuing processes. Average utilization was approximately 50% for this run; the tree height for this example was 1050. Note that "utilization" in this context refers to the fraction of virtual processors (out of 16384) that are busy during each iteration. Therefore, this measure is similar to the quantity η that was defined for sparse LU factorization in Section 4.2; both are structural measures of load balance that do not address the question of overheads (for communications, gather/scatter, etc.). For this CM-2 C/Paris code, those processors that are not idle (due to load imbalance) spend about 25% of their time on floating-point arithmetic, and the remainder on overhead. Even some of this 25% is wasted because most of the processors are disabled during the computation of the Givens rotation parameters; this is an unfortunate consequence of the SIMD computation model. For this reason, more extensive performance measurements were not conducted with this code. Perhaps better performance could be obtained by implementing this approach on a MIMD machine, where some processors can compute rotation parameters while others are performing other tasks.

5.2 Cholesky factorization. This section describes work in progress by the authors on SIMD implementation of sparse Cholesky factorization, using a message-passing approach that is similar to that described above for QR factorization. While the QR and Cholesky factorizations are closely related, there are some key differences that must be considered when migrating from one to the other. In this section we discuss the Cholesky program, including differences from the QR work and the motivations for these changes.
While QR factorization by Givens rotations seems most naturally stated in terms of rows, Cholesky factorization usually deals with columns. With dense matrices this distinction affects only the memory addressing patterns, but for sparse matrices there is a stronger reason for choosing to work with columns. The main computational kernel in Cholesky factorization by rows is the computation of an inner product between two rows. If the matrix is sparse, then the rows involved in a particular calculation are sparse, and we see no efficient way for a SIMD machine to encode, or to determine at run time, the intersection of the nonzero structures of the two rows. In contrast, with a column-oriented algorithm, the main computation will be a sparse linear combination of two columns, where it is known a priori that the nonzero structure of one column will be a subset of the structure of the other. Thus, with suitable gather and scatter operations, the linear combination can be done with no wasted operations. Therefore we have chosen a column-oriented approach.
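A sketch of this column kernel (Python; the subset property means a gather/scatter by row index wastes no work, and all names here are illustrative):

    def cmod(target_rows, target_vals, src_rows, src_vals, mult):
        # Subtract mult * (source column) from the target column.
        # struct(source) is a subset of struct(target), so every source
        # nonzero finds a slot in the target via the scatter index.
        pos = {r: t for t, r in enumerate(target_rows)}
        for r, v in zip(src_rows, src_vals):
            target_vals[pos[r]] -= mult * v

    rows_j, vals_j = [2, 5, 7, 9], [4.0, 1.0, 3.0, 2.0]
    cmod(rows_j, vals_j, [5, 9], [2.0, 1.0], mult=0.5)
    print(vals_j)   # -> [4.0, 0.0, 3.0, 1.5]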
Figure 5-2. Processor utilization vs. time step for a CM-2 with 16K processors (512 stages) factoring matrix BCSSTK24 from the Harwell-Boeing collection.

There are several choices for the type of columns to be used in communication operations. One possibility that was ruled out is to pass partially-computed columns of L among processors; each time a column arrives at a processor, it is updated by one or more columns of L that are stored in that processor. This scheme is inappropriate because there is no way to guarantee that, when a column arrives at a processor to be updated by another column, the updating column is completed. Thus, the two possibilities that we have considered are using completed columns of the Cholesky factor (the "fan-out" method) or using partial updates (the "fan-in" method) [16]. On MIMD machines it has been demonstrated that using partial updates leads to a reduced communication volume [17] and to more efficient algorithms (see [18] for a comparison on the Ncube/2). However, communication volume by itself is not as important a factor for SIMD machines, as the number of idle processors must be considered also. For the present we have chosen to use completed columns, mostly because the control information needed to route these is simpler than for partial updates. In the future we plan to further consider the possibility of using partial updates.
Thus, the major action during each step of Cholesky factorization is the subtraction of a multiple of a segment of a completed column of L from that of an uncompleted column. When an uncompleted column has had its last modification performed on it, it is post-processed and then sent out to other processors that need it to modify columns. More than one copy of the completed column may exist in the processor grid at one time, allowing it to modify more than one column at a time.

The QR method of section 5.1 does not exploit the independence of sibling
nodes, even though this is an easily exploited source of large-grained parallelism.
This is a consequence of the level-to-processor mapping necessitated by the use of
the one-dimensional ring. To allow us to exploit this parallelism, we have expanded
the ring to a two-dimensional torus. One reason for this change is that in Cholesky
factorization, work can begin only at the leaves of the tree, and the leaves are often
not weIl distributed across the height of the tree. This would lead to poor load
balanee, especially for problems with short trees. By changing to a two dimensional
grid, we have essentially reduced the tree height that is required for good load
balanee, by allowing much greater flexibility in how we map the tree to the processor
grid.
The mapping of matrix columns to processors affects both communication over-
head and load balance. Communication overhead depends on the total number of
processors that must eventually receive a column, and the number of processors
that a column must pass through without being used. An ideal scheme would re-
sult in many modifying columns being sent at once, but each going to only a few
processors, so that communication is reduced and yet each processor still has useful
work to do. We have chosen to use a mapping which increases the load balance at
the possible expense of increased communication, based on the fact that the CM
has a high communication bandwidth and will most likely be able to handle
the extra communication. All grid communications take place in the south or west
grid directions. The mapping assigns column j to the north or east neighbor of the
sprint node handling column parent(j). The sprint node chosen is the one which has
currently been assigned fewer columns, helping to maintain the total load balance;
a sketch of this assignment rule follows. A CM-2 implementation of this approach
is currently being tested, and results will be reported in a future publication.
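The following sketch shows one way the assignment rule just described could be realized. The paper gives no code for it, so the function, the traversal order, and all names are our assumptions; treat this as a plausible reading of the mapping, not the authors' implementation.

```python
def map_columns_to_torus(parent, root, grid_shape):
    """Map elimination-tree columns onto a 2-D processor torus.

    Column j is placed on the north or east torus neighbor of the
    processor owning parent(j), whichever neighbor currently holds
    fewer columns; a child placed north or east of its parent then
    sends south or west, matching the stated communication directions.
    `parent` maps each non-root column to its elimination tree parent.
    """
    rows, cols = grid_shape
    owner = {root: (0, 0)}                  # place the root anywhere
    load = {(r, c): 0 for r in range(rows) for c in range(cols)}
    load[(0, 0)] = 1
    children = {}
    for j, p in parent.items():
        children.setdefault(p, []).append(j)
    stack = [root]                          # parents before children
    while stack:
        p = stack.pop()
        pr, pc = owner[p]
        for j in children.get(p, []):
            north = ((pr - 1) % rows, pc)
            east = (pr, (pc + 1) % cols)
            dest = north if load[north] <= load[east] else east
            owner[j] = dest
            load[dest] += 1
            stack.append(j)
    return owner
```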
6. CONCLUSIONS

Massively-parallel SIMD computers offer very high arithmetic speeds for pro-
grams that can exploit the hardware efficiently. This exploitation requires careful
attention to several issues, including granularity, load balancing, and overhead.
Each of these issues also arises in MIMD implementations, but the criteria are
different. Massively-parallel SIMD architectures generally call for a relatively fine
granularity of decomposition, such as one processor for each nonzero (rather than
one per column, subtree or submatrix) during any given step.
Load balancing for a SIMD program means equalizing not only the amount of
work for all processors, but also the type of work. For example, some processors
cannot be inverting pivot elements (division) while others are updating Schur com-
plements (multiplication and addition). Even if the programming language allows
constructs such as if-then-else or loops with processor-dependent iteration counts,
these are implemented by conditionally disabling some processors, leading to effi-
ciency loss. This issue is especially important when dealing with "unstructured"
sparse matrices. One way to deal with this problem is to break the data into equal-
sized blocks, or segments, whose size is related to the processor array dimensions.
227

Several types of overhead can slow down a SIMD program. Communications
overhead can easily overwhelm the speed of arithmetic in current SIMD machines
even for dense-matrix factorization. Communications speed depends significantly
on the transfer pattern, although this dependence may weaken in future machine
designs. Within each processor, local memory access is not "free"; in fact, it can
be a significant overhead cost. It is relatively easy to amortize these overhead
costs during a very large (say, 5000 × 5000) dense matrix operation, but the dense
submatrices that arise in a sparse factorization are generally much smaller, making
these overhead costs harder to amortize.
Another type of overhead is control complexity. This is clearly present in SIMD
message-passing programs such as those described in Section 5, where distributed-
control tokens must be decoded. However, even data-parallel programs often incur
overhead in computing contexts (for conditional execution) and array indices.
In spite of these obstacles, good performance of sparse matrix factorization code
on SIMD machines is a very real possibility. Large, sparse problems have enormous
amounts of inherent concurrency, but as Section 3 points out, this concurrency is
exhibited in several different dimensions (across tree levels, within each level, within
each supernode, etc.). Future hardware designs with faster, more flexible commu-
nications networks will help to exploit this concurrency, and new system software
may allow the programmer to more easily trade off vectorization and parallelism.

REFERENCES

[1] O. MCBRYAN, The Connection Machine: PDE Solution on 65,536 Processors, Thinking
Machines Corp. Technical Report CS86-1, 1986.
[2] A. DAVE AND I. DUFF, Sparse Matrix Calculations on the Cray-2, Parallel Comput., 5 (1987),
pp. 55-64.
[3] C. YANG, A Vector/Parallel Implementation of the Multifrontal Method for Sparse Sym-
metric Positive Definite Linear Systems on the Cray Y-MP, Cray Research Inc. Technical
Report, 1990.
[4] E. ROTHBERG AND A. GUPTA, Techniques for Improving the Performance of Sparse Matrix
Factorization on Multiprocessor Workstations, Stanford Univ. Report CSL-TR-90-430, 1990.
[5] A. GEORGE, M. HEATH AND J. LIU, Parallel Cholesky Factorization on a Shared-Memory
Multiprocessor, Lin. Alg. Appl., 77 (1986), pp. 165-187.
[6] R. LUCAS, W. BLANK AND J. TIEMAN, A Parallel Solution Method for Large Sparse Systems
of Equations, IEEE Trans. Computer-Aided Design, CAD-6 (1987), pp. 981-991.
[7] P. WORLEY AND R. SCHREIBER, Nested Dissection on a Mesh-Connected Processor Array, in
New Computing Environments: Parallel, Vector and Systolic, ed. by A. Wouk, SIAM, 1986.
[8] J. LIU, The Role of Elimination Trees in Sparse Factorization, SIAM J. Matrix Anal. Appl.,
11 (1990), pp. 134-172.
[9] R. SCHREIBER, A New Implementation of Sparse Gaussian Elimination, ACM Trans. Math.
Software, 8 (1982), pp. 256-276.
[10] A. GEORGE AND M. HEATH, Solution of Sparse Linear Least Squares Problems Using Givens
Rotations, Lin. Alg. Appl., 34 (1980), pp. 69-83.
[11] J. LIU, On General Row Merging Schemes for Sparse Givens Transformations, SIAM J. Sci.
Stat. Comp., 7 (1986), pp. 1190-1211.
[12] A. GEORGE AND J. LIU, Householder Reflections versus Givens Rotations in Sparse Orthog-
onal Decomposition, Lin. Alg. Appl., 88 (1987), pp. 223-238.
[13] J. GILBERT AND R. SCHREIBER, Highly Parallel Sparse Cholesky Factorization, SIAM J.
Scientific and Statistical Computing, 13 (1992), pp. 1151-1172.
[14] S. KRATZER, Sparse LU Factorization on Massively Parallel SIMD Computers, Technical
Report SRC-TR-92-072, Supercomputing Research Center, April 1992.
[15] S. KRATZER, Massively Parallel Sparse Matrix Computations, Technical Report
SRC-TR-90-008, Supercomputing Research Center, February 1990.
[16] M. HEATH, E. NG AND B. PEYTON, Parallel Algorithms for Sparse Linear Systems, SIAM
Review, 33 (1991), pp. 420-460.
[17] C. ASHCRAFT, S. EISENSTAT, J. LIU, AND A. SHERMAN, A Comparison of Three Col-
umn-based Distributed Sparse Factorization Schemes, Technical Report, Dept. of Computer
Science, York Univ., 1990.
[18] A. CLEARY, A Comparison of Algorithms for Cholesky Factorization on a Massively Parallel
MIMD Computer, Proc. 5th SIAM Conf. on Parallel Processing, March 1991.
THE EFFICIENT PARALLEL ITERATIVE SOLUTION OF LARGE
SPARSE LINEAR SYSTEMS*

MARK T. JONES AND PAUL E. PLASSMANN†

Abstract. The development of efficient, general-purpose software for the iterative solution of
sparse linear systems on parallel MIMD computers depends on recent results from a wide variety of
research areas. Parallel graph heuristics, convergence analysis, and basic linear algebra implemen-
tation issues must all be considered.
In this paper, we discuss how we have incorporated these results into a general-purpose iter-
ative solver. We present two recently developed asynchronous graph coloring heuristics. Several
graph reduction heuristics are described that are used in our implementation to improve individ-
ual processor performance. The effect of these various graph reduction schemes on the solution of
sparse triangular systems is categorized. Finally, we report on the performance of this solver on
two large-scale applications: a piezoelectric crystal finite-element modeling problem, and a nonlin-
ear optimization problem to determine the minimum energy configuration of a three-dimensional
superconductor model.
Key words: graph coloring heuristics, iterative methods, parallel algorithms, preconditioned con-
jugate gradients, sparse matrices
AMS(MOS) subject classifications: 65F10, 65F50, 65Y05, 68R10

1. Introduction. The computational kernel of many large-scale applications is
the solution of sparse linear systems. Given the increasing performance of individual
processors and the dramatic recent improvements in engineering parallel machines
composed of these processors, a scalable parallel computer is an attractive vehicle for
solving these problems. In this paper we endorse a particular perspective: (1) we
note that in many applications one is interested in solving as large a problem as can
feasibly fit into the available memory of the machine, and (2) that the underlying
geometric structure of these applications is often three-dimensional or greater. These
observations, and a simple "back-of-the-envelope" calculation,¹ lead one to conclude
that a parallel direct factorization method is in general not feasible for such problems,
in terms of the amount of space and time required. This perspective motivates one
to consider an approach to the iterative solution of sparse linear systems in a manner
that ensures scalable performance.²

* This paper is based on a talk presented by the second author at the IMA Workshop on Sparse
Matrix Computations: Graph Theory Issues and Algorithms, October 14-18, 1991. This work was
supported by the Applied Mathematical Sciences subprogram of the Office of Energy Research,
U.S. Department of Energy, under Contract W-31-109-Eng-38.
† Mathematics and Computer Science Division, Argonne National Laboratory, 9700 South Cass
Ave., Argonne, Illinois 60439
¹ For example, consider a three-dimensional problem discretized on an O(k × k × k) grid and
ordered by nested dissection. We assume that we must solve a dense system of size O(k²), the size
of the largest separator. This task requires O(k⁶) work and O(k⁴) space. By contrast, for an iterative
scheme we assume that the number of iterations required is at worst O(k) (i.e., proportional to the
relative refinement of the mesh). The work per iteration is proportional to the size of the linear
system, or O(k³). Thus, the total work required by the iterative method would be O(k⁴) and the
space required O(k³).

In this paper we present an approach to solving such systems that satisfies the
requirements above. Central to our method is a reordering of the matrix based on a
coloring of the symmetric graph corresponding to the nonzero structure of the matrix,
or a related graph. To determine this ordering, we use a recently developed parallel
heuristic. However, if many colors are used, a straightforward parallel implementa-
tion, as is described in [10], suffers from poor processor performance on a high-performance
processor such as the Intel i860. In this paper we present several possible graph reduc-
tions that can be employed to greatly improve the performance of an implementation
on high-performance RISC processors.
Consider an implementation of any of the standard general-purpose iterative meth-
ods [7, 16]: consistently ordered SOR, SSOR accelerated by conjugate gradients (CG),
or CG preconditioned with an incomplete matrix factorization. It is evident that the
major obstacle to a scalable implementation [6] is the inversion of sparse triangular
systems with a structure based on the structure of the linear system. For example,
the parallelism inherent in computing and applying an incomplete Cholesky precondi-
tioner is limited by the solution of the triangular systems generated by the incomplete
Cholesky factors [21]. It was noted by Schreiber and Tang [20] that if the nonzero
structure of the triangular factors is identical to that of the original matrix, the mini-
mum number of major parallel steps possible in the solution of the triangular system
is given by the chromatic number of the symmetric adjacency graph representing
those nonzeros. Thus, given the nonzero structure of a matrix A, one can generate
greater parallelism by computing a permutation matrix, P, based on a coloring of the
symmetric graph G(A). The incomplete Cholesky factor L of the permuted matrix
PAPᵀ is computed, instead of the factor based on the original matrix A.
In this permutation, vertices of the same color are grouped and ordered con-
secutively. As a consequence, during the triangular system solves, the unknowns
corresponding to vertices of the same color can be solved for in parallel, after the
updates from previous color groups have been performed. The result of Schreiber
and Tang states that the minimum number of inherently sequential computational
steps required to solve either of the triangular systems, Ly = b or Lᵀx = y, is given
by the minimum possible number of colors, or chromatic number, of the graph.
We note that this bound on the number of communication steps assumes that
only vector operations are performed during the triangular system solves. This
assumption is equivalent to restricting oneself to a fine-grained parallel computational
model, where we assign each unknown to a different processor. When many unknowns
are assigned to a single processor, it is possible to reduce the number of communication
steps by solving non-diagonal submatrices of L on individual processors at each step.
In this case, the minimum number of communication steps is given by a coloring of a
quotient graph obtained from a partitioning of unknowns to processors.
The remainder of the paper is organized as follows. In §2 we review two recently
developed asynchronous parallel graph coloring heuristics. In §3 we present several
possible graph reductions, including the clique partitions that allow for the use of
higher-level Basic Linear Algebra Subprograms (BLAS) in the software. We consider

² That is, we are interested in a solver where, for fixed problem size per processor, the performance
per processor is essentially independent of the number of processors used.

a general framework that can incorporate these ideas into efficient triangular system
solvers in §4. Finally, in §5 we present experimental results obtained for our software
implementation on the Intel DELTA for problems arising in two different applications,
and in §6 we discuss our conclusions.

2. Asynchronous parallel graph coloring heuristics. In this section we
consider two recently developed graph coloring heuristics suitable for asynchronous
parallel computers. Our perspective is that if a scalable iterative solver is to be based
on a matrix ordering derived from a graph coloring, then a scalable heuristic is nec-
essary to determine this coloring. The two parallel heuristics we review are based on
Monte Carlo steps for which expected running times are known: a synchronous PRAM
heuristic developed by Luby [15], and a recent asynchronous heuristic presented by
Jones and Plassmann [13]. The interesting aspect of the asynchronous method is that
it combines aspects of sequential greedy graph coloring heuristics with a Monte Carlo
step to determine independent sets. In this section we show how a modification can
be made to Luby's maximal independent set heuristic to make it both asynchronous
and satisfy the same running time bound obtained for the second heuristic.
First, we briefly review the graph coloring problem. Let G = (V, E) be a symmetric
graph with vertex set V, with |V| = n, and edge set E. We say that the function
σ : V → {1, ..., s} is an s-coloring of G if σ(v) ≠ σ(w) for all edges (v, w) ∈ E. We
denote the minimum possible value for s, the chromatic number of G, by χ(G).
The question as to whether a general graph G is s-colorable is NP-complete [5].
It is known that unless P = NP, there does not exist a polynomial approximation
scheme for solving the graph coloring problem [5]. In fact, the best polynomial time
heuristic known [8] can theoretically guarantee a coloring of only size c(n/log n)χ(G),
where c is some constant.
Given these pessimistic theoretical results, it is quite surprising that, for certain
classes of graphs, there exist a number of sequential graph coloring heuristics that
are very effective in practice. For graphs arising from a number of applications, it
has been demonstrated that these heuristics are often able to find colorings that are
within one or two of an optimal coloring [4, 10].
These sequential heuristics are based on a greedy heuristic that colors vertices
in an order determined by a cost function. Choices for the cost function that are
particularly effective are the saturation degree order (choose the most constrained
vertex [3]) or the incidence degree order (choose the vertex adjacent to the maximum
number of previously colored vertices [4]). Unfortunately, these heuristics do not
parallelize well, because they essentially represent a breadth-first search of the graph.
A different approach was suggested by Luby [15]. His observation was that if one
can determine a maximal independent set efficiently in parallel, then a partition of
the vertices of the graph into maximal independent sets yields a coloring. Luby's
algorithm for determining an independent set, I, is based on the following Monte
Carlo rule. Here we denote the set of vertices adjacent to vertex v by adj(v).

1. For each vertex v ∈ V determine a distinct, random number ρ(v).
2. v ∈ I ⟺ ρ(v) > ρ(w), ∀w ∈ adj(v).
232

In the Monte Carlo algorithm described by Luby [15], this initial independent set
is augmented to obtain a maximal independent set. The approach is the following.
After the initial independent set is found, the set of vertices adjacent to a vertex in
I, the neighbor set N(I), is determined. The union of these two sets is deleted from
V, the subgraph induced by this smaller set is constructed, and the Monte Carlo step
is used to choose an augmenting independent set. This process is repeated until the
candidate vertex set is empty and a maximal independent set (MIS) is obtained. The
complete Monte Carlo algorithm suggested by Luby for generating an MIS is shown
in Fig. 1. In this figure we denote by G(V′) the subgraph of G induced by the vertex
set V′. Luby shows that an upper bound for the expected time to compute an MIS
by this algorithm on a CRCW P-RAM is EO(log(n)). The algorithm can be adapted
to a graph coloring heuristic by using it to determine a sequence of distinct maximal
independent sets and by coloring each MIS a different color. Thus, this approach will
solve the (Δ + 1) vertex coloring problem, where Δ is the maximum degree of G, in
expected time EO((Δ + 1) log(n)).

I ← ∅;
V′ ← V;
While G(V′) ≠ ∅ do
    Choose an independent set I′ in G(V′);
    I ← I ∪ I′;
    V′ ← V′ \ (I′ ∪ N(I′));
enddo

FIG. 1. Luby's Monte Carlo algorithm for determining a maximal independent set
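A compact serial rendering of Fig. 1 may help make the algorithm concrete. The sketch below is ours (Python is used throughout for illustration); `adj` maps each vertex to the set of its neighbors. Note that fresh random numbers are drawn in every round, which is precisely the source of the synchronization cost discussed next.

```python
import random

def luby_mis(adj):
    """Luby's Monte Carlo maximal independent set (Fig. 1, serially).

    Each round draws fresh random numbers, adds to I every vertex that
    beats all of its remaining neighbors, and deletes I together with
    its neighbors from the candidate set.
    """
    I = set()
    V = set(adj)
    while V:
        rho = {v: random.random() for v in V}   # distinct w.p. 1
        new = {v for v in V
               if all(rho[v] > rho[w] for w in adj[v] & V)}
        I |= new
        nbrs = set().union(*(adj[v] for v in new)) if new else set()
        V -= new | nbrs
    return I
```

Calling `luby_mis` repeatedly on the subgraph of still-uncolored vertices, and assigning each returned set a new color, yields the (Δ + 1)-coloring described above.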

A major deficiency of this approach on currently available parallel computers is
that each new choice of random numbers in the MIS algorithm requires a global syn-
chronization of the processors. A second problem is that each new choice of random
numbers incurs a great deal of computational overhead, because the data structures
associated with the random numbers must be recomputed. The asynchronous heuris-
tic proposed by Jones and Plassmann [13] avoids both of these drawbacks. This
heuristic is presented in Fig. 2. The heuristic is written assuming that each vertex
v is assigned to a different processor and the processors communicate by passing
messages.
With the asynchronous heuristic the first drawback (global synchronization) is
eliminated by choosing the independent random numbers only at the start of the
heuristic. With this modification, the interprocessor communication can proceed
asynchronously once these numbers are determined. The second drawback (compu-
tational overhead) is alleviated because with this heuristic, once a processor knows
the values of the random numbers of the vertices to which it is adjacent, the number
of messages it needs to wait for can be computed and stored. Likewise, each pro-
cessor computes only once the processors to which it needs to send a message once
its vertex is colored. Finally, note that this heuristic has more of the "flavor" of the
sequential heuristic, since we choose the smallest color consistent with the adjacent
vertices previously colored.

Choose ρ(v);
n-wait = 0;
send-queue = ∅;
For each w ∈ adj(v) do
    Send ρ(v) to processor responsible for w;
    Receive ρ(w);
    if (ρ(w) > ρ(v)) then n-wait = n-wait + 1;
    else send-queue ← send-queue ∪ {w};
enddo
n-recv = 0;
While (n-recv < n-wait) do
    Receive σ(w);
    n-recv = n-recv + 1;
enddo
σ(v) = smallest available color consistent with the
    previously colored neighbors of v;
For each w ∈ send-queue do
    Send σ(v) to processor responsible for w;
enddo

FIG. 2. An asynchronous parallel coloring heuristic
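Because each vertex is colored only after all of its higher-ρ neighbors, a serial program that visits the vertices in decreasing ρ order computes exactly the coloring that the message-passing heuristic of Fig. 2 would produce. The following sketch (ours, for illustration) exploits this equivalence:

```python
import random

def jones_plassmann_coloring(adj):
    """Serial simulation of the asynchronous coloring heuristic (Fig. 2).

    Each vertex draws one random number rho(v) up front; visiting
    vertices in decreasing rho order, each receives the smallest color
    unused by its already-colored neighbors, mimicking the order in
    which the message-passing version colors vertices.
    """
    rho = {v: random.random() for v in adj}
    sigma = {}
    for v in sorted(adj, key=rho.get, reverse=True):
        used = {sigma[w] for w in adj[v] if w in sigma}
        c = 1
        while c in used:
            c += 1
        sigma[v] = c
    return sigma
```

Like any greedy coloring, this never uses more than Δ + 1 colors, and in practice it often uses far fewer.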

An upper bound for the expected running time of a synchronous version of this
algorithm of EO(log(n)/log log(n)) can be obtained for graphs of bounded degree [13].
The central idea for the proof of this bound is the observation that the running time
of the heuristic is proportional to the maximum length monotonic path in G. A
monotonic path of length t is defined to be a path of t vertices {v₁, v₂, ..., vₜ} in G
such that ρ(v₁) > ρ(v₂) > ⋯ > ρ(vₜ).
We now show that Luby's MIS algorithm can be modified to obtain the same
bound. Consider the following modification to the asynchronous coloring heuristic
given in Fig. 2. Let the function γ(v) equal one if v is in the independent set I, two if
v is in N(I), and let it be undefined otherwise. In Fig. 3 we present an asynchronous
algorithm to determine an MIS.
The following lemma proves the correctness of the asynchronous algorithm.

LEMMA 2.1. At the termination of the algorithm given in Fig. 3, the function
γ(v), v ∈ V defines a maximal independent set.

Proof: At the completion of the algorithm in Fig. 3, γ(v) is defined for each v ∈ V.
Thus, each vertex v ∈ V satisfies one of the following, based on the definition of γ:
1. v ∈ I, or
2. v ∈ N(I).
It is clear that the set I is independent, and each member of N(I) must be adjacent
to a member of I. Thus, the above two conditions imply that the independent set I
is maximal. □

Choose ρ(v);
n-wait = 0;
send-queue = ∅;
For each w ∈ adj(v) do
    Send ρ(v) to processor responsible for w;
    Receive ρ(w);
    if (ρ(w) > ρ(v)) then n-wait = n-wait + 1;
    else send-queue ← send-queue ∪ {w};
enddo
n-recv = 0;
While (n-recv < n-wait) do
    Receive γ(w);
    n-recv = n-recv + 1;
enddo
if (all the previously assigned neighbors w of v
    have γ(w) = 2) then γ(v) = 1;
else γ(v) = 2;
endif
For each w ∈ send-queue do
    Send γ(v) to processor responsible for w;
enddo

FIG. 3. An asynchronous algorithm to determine a maximal independent set

Based on Theorem 3.3 and Corollary 3.5 given in [13], we have the following
corollary.

COROLLARY 2.2. For graphs of bounded degree Δ, the expected running time is
EO(log(n)/log log(n)) for the maximal independent set algorithm given in Fig. 3.

Proof: As for the bound for the asynchronous parallel coloring heuristic, the expected
running time for the asynchronous maximal independent set algorithm is proportional
to the expected length of the longest monotonic path. By Theorem 3.3 and Corol-
lary 3.5 in [13] this length is bounded by EO(log(n)/log log(n)). □
Finally, we note that this maximal independent set algorithm can be used in place
of Luby's MIS algorithm to generate a sequence of maximal independent sets, each
of which can be colored a different color. The running time of this coloring heuristic
would again be bounded by EO(log(n)/log log(n)) because the maximum number of
colors used is bounded by Δ + 1, and we have assumed the maximum degree Δ of
the graph is bounded.

3. Graph reductions. In this section we present several graph reductions that
are used in our iterative solver implementation. These reductions are employed in
§4 to describe several possible alternatives for the solution of the triangular systems
involving the preconditioned systems.

It is often observed that the sparse systems arising in many applications have a
great deal of special local structure, even if the systems are described as "unstruc-
tured." We have attempted to illustrate some of this local structure, and how it can
be identified, in the following sequence of figures.
In Fig. 4 we depict a subsection of a graph that would arise from a two-dimensional,
linear, multicomponent finite-element model with three degrees of freedom per node
point. We illustrate the three degrees of freedom by the three dots at each node
point; the linear elements imply that the twelve degrees of freedom sharing the four
node points of each face are completely connected. In the figure we show edges
only between the nodes; these edges represent the complete interconnection of all the
vertices on each element or face.

FIG. 4. A subgraph generated by a two-dimensional, linear finite element model with three degrees
of freedom per node point. The geometric partition shown by the dotted lines yields an assignment
of the vertices in the enclosed subregion to one processor.

The dashed lines in the figure represent a geometric partitioning of the grid; we
assume that the vertices in the central region are all assigned to one processor. We
make several observations about the local structure of this subgraph. First, we note
that the adjacency structures of the vertices at the same geometric node (i.e., the
nonzero structures of the associated variables) are identical, and we call such vertices
identical vertices. It was noted by Schreiber and Tang [20] that a coloring of the graph
corresponding to the geometric nodes results in a system with small dense blocks, of
order the number of degrees of freedom per node, along the diagonal. We note that
this observation can also be used to decrease the storage required for indirect indexing
of the matrix rows, since the structures are identical.
We also consider another graph reduction based on the local clique structure of
the graph. In Fig. 5 the dotted lines show one possible way the vertices assigned to
the shown partition and its neighbors can be partitioned into cliques. Denote such a
partition by Q. If we associate a super vertex with each clique, the quotient graph

FIG. 5. A partition of the vertices into cliques

G/Q can be constructed based on the rule that there exists an edge between two
super vertices v and w if and only if there exists an edge between two vertices of their
respective partitions in G. The quotient graph constructed by the clique partition
shown in Fig. 5 is shown in Fig. 6.

FIG. 6. The quotient graph given the clique partition shown in Fig. 5

Of course the quotient graph reduction is not limited to the choice of a maximal
clique partition; any local partition of the subgraph assigned to a processor can be
used to generate the reduced graph. We use a clique decomposition because the
submatrix associated with the clique is dense, thus allowing for the use of higher-level
dense linear algebra operations (BLAS) in an implementation. This aspect of
the graph reduction is discussed in more detail in §4. Finally, we note that the

efficient determination of identical nodes, and a local maximal clique decomposition, is
straightforward. Since the adjacency structure of the vertices assigned to a processor
is known locally, no interprocessor communication is required, and a greedy heuristic
can be used to determine a clique partition.
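A sketch of the two local computations involved, finding identical vertices and forming the quotient graph of a partition, is given below. This is our illustration (the greedy clique heuristic itself is omitted), and all names are ours, not the paper's.

```python
def identical_vertices(adj):
    """Group vertices with identical closed adjacency structure.

    Two vertices are "identical" when their closed neighborhoods
    coincide (e.g., the degrees of freedom at one finite-element node);
    keying a dict on the frozen closed neighborhood finds all groups
    in a single pass over the local subgraph.
    """
    groups = {}
    for v, nbrs in adj.items():
        key = frozenset(nbrs | {v})         # closed neighborhood of v
        groups.setdefault(key, []).append(v)
    return list(groups.values())

def quotient_graph(adj, partition):
    """Build G/Q for a partition Q of the vertices (e.g., into cliques).

    Super vertices are the partition indices; two super vertices are
    adjacent iff some edge of G joins their parts, as in the text.
    """
    part_of = {v: i for i, part in enumerate(partition) for v in part}
    q_adj = {i: set() for i in range(len(partition))}
    for v, nbrs in adj.items():
        for w in nbrs:
            if part_of[v] != part_of[w]:
                q_adj[part_of[v]].add(part_of[w])
    return q_adj
```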
It is important to note that the graph reductions described in this section are
highly dependent on the determination of a good partition (assignment of vertices to
processors). We do not consider the problem of determining a good partition in this
paper. For the application problems we consider in §5, a physical partition can be
used to generate a good vertex assignment to processors. When the determination of a
partition is not straightforward, a partitioning heuristic would have to be used. Some
possibilities exist; for example, recent advances in the automatic partitioning of three-
dimensional domains [22] or in spectral dissection methods [18] could be employed.
However, the parallel graph partitioning problem deserves much additional research.

4. The inversion of triangular systems. In this section we review the prob-
lem of the parallel solution of a sparse triangular system. The triangular system
solution is the central problem in the parallelization of the standard iterative meth-
ods. For example, it is involved in the application of a preconditioner derived from
an incomplete factorization, or in an SOR or SSOR iteration.

For i = 1, ..., χ do
    1. Local Solve (requires no interprocessor communication):
        L_{i,i} y_i = b_i
    2. Update (communication without interdependencies):
        b_{J_i} ← b_{J_i} − L_{J_i,K_i} y_{K_i}
enddo

FIG. 7. A general framework for the parallel forward elimination of the lower triangular system
Ly = b

Consider the lower triangular matrix L decomposed into the following block struc-
ture:

(4.1)    L = [ L_{1,1}                            ]
             [ L_{2,1}   L_{2,2}                  ]
             [   ⋮          ⋮       ⋱             ]
             [ L_{χ,1}   L_{χ,2}    ⋯    L_{χ,χ}  ]

In Fig. 7 we present a general framework for the forward elimination required to solve
the system Ly = b. By y_i and b_i we mean the partition of components implied by the
block partition of L given above. The index sets J_i and K_i can be anything equivalent
to the standard forward elimination algorithm. With this framework we divide the
solution into two phases. In phase 1, the diagonal block solution phase, we assume
that no interprocessor communication is required. In the second phase, when the
partial updates to the right-hand side are performed, we include all the interprocessor
communication, but we assume that this communication can be performed in any
order. Thus, the number of major communication steps required in this framework is
χ, the number of diagonal blocks.
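As a serial illustration of this two-phase framework, the following sketch performs block forward elimination with dense blocks. The actual implementation works with sparse, distributed data; the storage layout and names here are our assumptions.

```python
import numpy as np

def block_forward_solve(L_blocks, b_parts):
    """Forward elimination in the two-phase framework of Fig. 7.

    L_blocks[i][k] holds the dense block L_{i,k} (None when zero) of a
    chi x chi block lower triangular matrix; b_parts is the conformally
    partitioned right-hand side.  Phase 1 solves the diagonal block
    locally; phase 2 applies the off-diagonal updates, which could be
    performed in any order (this is where all the interprocessor
    communication would occur).
    """
    chi = len(b_parts)
    y = [None] * chi
    for i in range(chi):
        # Phase 1: local solve with the diagonal block L_{i,i}.
        y[i] = np.linalg.solve(L_blocks[i][i], b_parts[i])
        # Phase 2: update the remaining right-hand-side parts.
        for j in range(i + 1, chi):
            if L_blocks[j][i] is not None:
                b_parts[j] = b_parts[j] - L_blocks[j][i] @ y[i]
    return y
```

When the blocks come from a coloring, each L_{i,i} is (block) diagonal, so the phase 1 "solve" reduces to applying precomputed small inverses, which is where the Level-2 BLAS enter in the approaches classified next.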


We classify a number of possible approaches to solving these triangular systems
based on the choice of the diagonal blocks L_{i,i} as follows:

Pointwise colorings - Given a coloring of the graph G(A) for the incomplete
factorization matrix A, we order unknowns corresponding to same-colored
vertices consecutively. An implementation based on this approach and com-
putational results are given in [10].
Partitioned inverse - One can determine a product decomposition of L; for ex-
ample,

(4.2)    L = ∏_{i=1}^{κ} L_i,

where the nonzero structure, S, of the product elements satisfies S(L_i) =
S(L_i^{-1}) [1,2]. The inversion of L can be performed with κ matrix products
once the partitioned inverse is formed. We note that this can always be
done with a pointwise coloring, where κ is the number of colors used. It has
been observed by Robert Schreiber [19] that the partitioned inverse approach
can reduce the steps in the pointwise coloring approach by a factor of two
(see the sketch after this list). Suppose two colors are used. We write the
pointwise system as

(4.3)    L = [ D_{1,1}      0     ]
             [ L_{2,1}   D_{2,2}  ],

where D_{1,1} and D_{2,2} are diagonal. Schreiber makes the following observation:

(4.4)    L^{-1} = [ D_{1,1}^{-1}                          0          ]
                  [ −D_{2,2}^{-1} L_{2,1} D_{1,1}^{-1}   D_{2,2}^{-1} ],

where the structures of L and L^{-1} are identical. Thus, one can group pairs
of colors together and form the inverse of the combined diagonal block by a
simple rescaling of the off-diagonal part.
Nodewise colorings - Identify adjacent vertices with identical structure. As de-
scribed in §3, such vertices often arise in finite element models for indepen-
dent degrees of freedom defined at the same geometric node. Let the set
I identify identical nodes. A matrix ordering based on a coloring of G/I,
where identically colored nodes are ordered consecutively, yields a system
where L_{i,i} is block diagonal, with dense blocks the size of the number of
identical nodes at each node point. Given a geometric partition of the nodes,
these dense blocks are local to a processor. In addition, the observation
made by Schreiber and illustrated in Equation 4.4 can be used to decrease
the number of major communication steps by a factor of two for a nodewise
coloring. The inverse formula given in Equation 4.4 with D_{1,1} and D_{2,2} block
diagonal will still preserve the nonzero structure of L, because the nonzero
structures of the columns in each dense block are identical.
Quotient graph colorings derived from a local clique partition - This ap-
proach is used in our implementation. The local cliques correspond to local
dense diagonal blocks in L_{i,i}. The inverses of these blocks are computed.
Thus the local solve, step 1 in Fig. 7, can be implemented using Level-2

BLAS. Usually the number of colors required to color the quotient graph will
be smaller than the number of colors required for the original graph. How-
ever, if fewer colors are used, recent theoretical results [11] indicate that the
convergence of the iterative algorithm could suffer. This aspect is discussed
more fully in §5.
Quotient graph colorings derived from general local systems - Any local
structure can be chosen for the diagonal systems L_{i,i}. However, if general sparse
systems are used, the processor performance is not necessarily improved over
a pointwise coloring. In addition, load balancing becomes more difficult as
larger partitions are chosen.
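Equation 4.4 is easy to verify numerically. The small check below (ours, with illustrative sizes) confirms that the inverse of a two-color system is obtained by rescaling the off-diagonal block alone, with no fill:

```python
import numpy as np

# Check of Schreiber's observation (Equation 4.4): for a two-color
# system L = [[D1, 0], [L21, D2]] with D1, D2 diagonal, the inverse is
# [[inv(D1), 0], [-inv(D2) @ L21 @ inv(D1), inv(D2)]], which has the
# same nonzero structure as L itself.
n1, n2 = 4, 3
D1 = np.diag(np.random.rand(n1) + 1.0)
D2 = np.diag(np.random.rand(n2) + 1.0)
L21 = np.random.rand(n2, n1)

L = np.block([[D1, np.zeros((n1, n2))], [L21, D2]])
D1i = np.diag(1.0 / np.diag(D1))
D2i = np.diag(1.0 / np.diag(D2))
Linv = np.block([[D1i, np.zeros((n1, n2))], [-D2i @ L21 @ D1i, D2i]])

assert np.allclose(Linv, np.linalg.inv(L))  # inverse by rescaling only
```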

Given the possibilities above, we have chosen to implement a method based on
quotient graph colorings derived from a local clique partition. This approach enables
our software to take advantage of both any identical node structure and local clique
partitions. The former allows for a reduction in the indirect indexing required; the
latter allows for the use of larger dense blocks and consequently better performance
with the Level-2 BLAS. The software is designed so that the maximum size of the
identical node sets, the maximum clique size, and the maximum number of cliques per
color can all be set by the user in case of load balancing or convergence problems.
However, for the results presented in §5, no such limits were imposed.

5. Computational results. In this section we present computational results
obtained on the Intel DELTA with the software we have developed. We consider
two applications: a piezoelectric crystal modeling problem, and a three-dimensional
superconductivity modeling problem. These problems are described in more depth
in [12]; we give only a brief description of them here.

5.1. The piezoelectric crystal modeling problem. The first set of sparse
systems that we consider arises from a second-order finite element model of a piezoelectric
crystal strip oscillator. These crystals are thin strips of quartz that vibrate at a fixed
frequency when an electric forcing field is applied to the crystal. A diagram of a strip
oscillator affixed to an aluminum substrate with epoxy is shown in Fig. 8.

FIG. 8. Piezoelectric crystal strip oscillator (a quartz strip mounted on an aluminum substrate)

Second-order, 27-node finite elements are used to model the crystal. Higher-
order elements are required to accurately model high-frequency vibrational modes of

the crystal. There are four degrees of freedom at each geometric node point: three
mechanical displacements and an electric field potential. The solution phase has two
steps. First, the deformation of the crystal caused by thermal displacement is found.
For example, if the crystal was mounted on aluminum at 25°C, it will deform when
the temperature is raised to 35°C. This requires solving a nonlinear static thermal
stress problem. Second, to find the vibrational modes of interest for the deformed
crystal, we solve a linear vibration problem, a generalized eigenproblem.
To solve the nonlinear static thermal stress problem, a series of linear systems
of the form Ku = f must be solved, where K represents the stiffness matrix, u
represents the displacements, and f represents the forces due to thermal loads and
displacement constraints. The major task here, of course, is the solution of very large,
sparse systems of equations.
To solve the linear vibration problem, we must solve a generalized eigenproblem
of the form Kx = ω²Mx, where K represents the stiffness matrix, M represents
the mass matrix, x is a vibrational mode shape, and ω is a vibrational mode. We
use a shifted, inverted variant of the Lanczos algorithm to solve this eigenproblem
[17]. This method has been shown to be very efficient for the parallel solution of the
vibration problem [9]. Again, the major computational task is the solution of large
sparse systems of linear equations.
The three-dimensional finite element grid needed to model the crystals is much
more refined in the length and width directions than it is in the thickness direction.
We can take advantage of this fact and partition the grid among the processors in
only the length and width directions. This approach reduces communication and
maps nicely onto the DELTA architecture. Each processor is assigned a rectangular
solid corresponding to a portion of the three-dimensional grid. Each processor is
responsible for evaluating the finite elements in its partition and for maintaining all
relevant geometric and solution data for its partition.
TABLE 1
Average megaflop rates per processor for the triangular system solution as a function of the number
of processors used. The problem size per processor is kept approximately constant. Shown are the
number of processors used (p), the problem sizes (n), and the number of nonzeros in the lower
triangular systems (nnz). Also shown are the sizes of the reduced systems once identical nodes
(n_i-node) and local cliques (n_clique) are identified.

  p  |    n    |    nnz    | n_i-node | n_clique | Avg. Mflops/Processor
 512 | 640050  | 137516706 |  94875   |  16002   |         4.97
 256 | 318770  |  68285090 |  46875   |   7938   |         4.87
 128 | 158130  |  33669282 |  22875   |   3906   |         5.00
  64 |  77490  |  16330914 |  11163   |   1922   |         4.88

In Tables 1 and 2 we present results obtained on the Intel DELTA for solving
linear systems generated for the piezoelectric crystal modeling problem. The average
megaflop rates given in Table 1 demonstrate the scalable performance of the solver;
for fixed problem size per processor, the performance per processor is essentially
independent of the number of processors used.
In Table 2 we show the times required for the symbolic manipulations and for the

TABLE 2
Times (in seconds) to find the identical nodes (t_i-node), local cliques (t_clique), and colorings
(t_color) for the piezoelectric crystal problem. The time to reduce the graph (i.e., compute the
quotient graph) is included in the times t_i-node and t_clique. Also given is the time (in seconds)
for one back solve, t_BS, and one forward solve, t_FS. The asynchronous parallel coloring heuristic
given in Fig. 2 was used to compute the coloring for the reduced graph. Also given are the number
of colors, χ, used by the parallel coloring heuristic.

  p  | t_i-node | t_clique | t_color |  t_FS  |  t_BS  | χ
 512 |   2.55   |  0.208   | 0.0320  | 0.0600 | 0.0517 | 16
 256 |   2.48   |  0.196   | 0.0260  | 0.0628 | 0.0530 | 17
 128 |   2.43   |  0.167   | 0.0221  | 0.0635 | 0.0517 | 15
  64 |   2.37   |  0.160   | 0.0271  | 0.0625 | 0.0521 | 14

solution of the triangular systems. Note that the symbolic manipulation is done only
once, to initialize the conjugate gradient iteration. In fact, since the structure of
the sparse system is constant, these symbolic data structures remain the same for the
linear system to be solved at each nonlinear iteration. The implementation of the
matrix multiplication is done in essentially the same manner as the forward and back
solves. Thus, the time for one conjugate gradient iteration is roughly 2(t_FS + t_BS).
From the results given in Table 2, the total time required to determine the identi-
cal nodes, local cliques, and coloring, and to set up the required data structures, corresponds
to roughly 10 to 12 conjugate gradient iterations. Note that these times include
all the required symbolic work; we have included the time to compute the quotient
graphs with these times. Since the number of conjugate gradient iterations required
for these problems is typically several hundred, and considering the time required to
integrate and assemble the stiffness matrix, the time required for the symbolic work
is relatively inexpensive.
The two Level-2 BLAS routines that are involved in the triangular system solves
are DGEMV and DTRMV, the matrix-vector multiplication routines for general and
triangular matrices, respectively. By comparing the results presented in Table 2 with
the performance of these routines on one processor, we can get some idea of the relative
efficiency of our implementation. We have used the assembler implementation of the
BLAS routines developed by Kuck & Associates [14]. For matrix sizes of 20 and 50
they achieve performances of 2.39 and 5.35 megaflops for the DTRMV routine on a
single i860 processor. Likewise, for the DGEMV routine they achieve performances
of 6.73 and 15.90 megaflops for matrix sizes of 20 and 50. Since the average clique
size for the problems presented in Table 1 is approximately 40, the measured per
processor performance for the parallel implementation appears to be quite good.

5.2. The layered superconductor modeling problem. The sparse linear
systems for the superconductivity problem arise in the determination of the damped
Newton step in the inner loop of an optimization algorithm. The optimization algo-
rithm attempts to determine the minimizer of a free energy functional that is defined
on a three-dimensional rectangular mesh with the geometric layout depicted in Fig. 9.
The structure of the sparse linear system is determined by the Hessian of the free energy,
given a linear finite difference discretization of the model.

FIG. 9. The three-dimensional layered superconductor model partitioned in two dimensions

Shown in the figure are alternating layers of superconducting and insulating ma-
terial. The independent variables are two vector fields, one defined in the supercon-
ducting sheets, and the other in the insulating layers. The two fields are coupled in the
free energy formulation. When the model is discretized, a finer degree of resolution
is generally given to the insulating layers. For the problems of interest, the number
of grid points necessary to represent the model in the direction perpendicular to the
layers (the X-axis in Fig. 9) is smaller than the number of points required in the two
directions parallel to the layers (the Y-axis and Z-axis in Fig. 9). We make use of
this property and partition the grid in the Y and Z directions. For example, in Fig. 9
the Y-Z domain is shown partitioned among 9 processors.
We denote the discretization in the X, Y, and Z directions by NX, NY, and NZ,
respectively. As the discretization within an insulating layer, NK, varies, the size of
the local cliques changes, and therefore so does the individual processor performance.
In Table 3 we note the effect of varying the layer discretization on the i860 processor
performance during the solution of the linear systems. For these numbers we have
used 128 processors and fixed the local problem size to be roughly equivalent. The
second column shows the average size of the identical node sets found in the graph by
the solver; the third column shows the average clique size found. The final column
shows the average computational rate per processor during the solution of the linear
systems.
TABLE 3
The effect of varying the layer discretization on the processor performance in solving the linear
systems

 NK | Avg. I-Node Size | Avg. Clique Size | Avg. Mflops/Processor
  2 |       8.0        |       32.0       |         2.97
  4 |      14.0        |       44.8       |         5.42
  6 |      20.0        |       60.0       |         6.71
  8 |      26.0        |       78.0       |         8.96

In Table 4 we present results for the linear solver on three problems with differing
geometric configurations on 512 processors.

TABLE 4
Computational results obtained for three different problem configurations on 512 processors

        | PROBLEM-1 | PROBLEM-2 | PROBLEM-3
 NX     |    24     |    64     |    20
 NK     |     8     |     4     |     2
 NY     |    80     |    64     |   150
 NZ     |    96     |    96     |   150
 N      | 6.0 × 10⁵ | 1.6 × 10⁶ | 1.8 × 10⁶
 NNZ    | 2.0 × 10⁸ | 1.7 × 10⁸ | 1.9 × 10⁸
 GFlops |   3.25    |   2.55    |   1.38

In the solution of both of these systems, the diagonal of the matrix was scaled
to be one. If the incomplete factorization fails (a negative diagonal element is created
during the factorization), a small multiple of the identity is added to the diagonal, and the
factorization is restarted. This process is repeated until a successful factorization is
obtained [16]; a sketch of this restart strategy is given below. The average number of
conjugate gradient iterations required to solve
one nonlinear iteration of the thermal equilibrium problem for the crystal model to
a relative accuracy of 10⁻⁷ is approximately 700. The average number of conjugate
gradient iterations required per nonlinear iteration for the superconductivity problem
is approximately 250. The linear systems arising in the superconductivity problem
are solved to a relative accuracy of 5.0 × 10⁻⁴. However, it should be noted that
these are special linear systems: they are highly singular (more than one-fifth of the
eigenvalues are zero, because of physical symmetries). However, they are consistent
near a local minimizer, because a projection of the right-hand side (the gradient of
the free energy function) onto the null space of the matrix is zero near the minimizer.
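The restart strategy is easy to express in code. In the sketch below (our illustration, after [16]), a dense Cholesky factorization stands in for the incomplete factorization, which fails in the analogous way, by encountering a nonpositive pivot; the initial shift and the growth factor are assumptions, not values from the paper.

```python
import numpy as np

def shifted_factorization(A, alpha0=1e-3):
    """Restarted factorization with diagonal shifts (cf. [16]).

    If a factorization attempt encounters a nonpositive pivot, add a
    small multiple of the identity to the (unit-diagonal-scaled) matrix
    and restart, growing the shift until the factorization succeeds.
    """
    alpha = 0.0
    while True:
        try:
            return np.linalg.cholesky(A + alpha * np.eye(A.shape[0]))
        except np.linalg.LinAlgError:
            # Factorization failed: shift the diagonal and try again.
            alpha = alpha0 if alpha == 0.0 else 10.0 * alpha
```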

6. Conclusions. In this paper we have presented an implementation of a general-
purpose iterative solver for MIMD machines. The scalable performance of the solver
is based on a reordering of the sparse system according to a graph coloring of a re-
duced graph obtained from the nonzero structure of the sparse linear system. This
approach is effective for any of the standard iterative methods; however, the experi-
mental results we present are for the conjugate gradient algorithm with an incomplete
matrix factorization preconditioner.
We have emphasized an approach where all the manipulations required by the
solver are done in parallel. In this spirit, we have presented two recently de-
veloped parallel heuristics for determining a graph coloring. We have shown that
the synchronous heuristic proposed by Luby, based on determining a sequence of
maximal independent sets, can be modified to run in an asynchronous manner.
Furthermore, we show that the expected running time of the modified heuristic is
EO(log(n)/log log(n)) for bounded degree graphs, using the bounds developed for
the other coloring heuristic.
A number of possible approaches toward the solution of the sparse triangular
systems are classified. We have chosen to use a graph reduction based on
a clique partition in our implementation for two reasons: (1) to allow for the use
of higher-level BLAS in a triangular system solver, and (2) to reduce the number

of required colors and the size of the quotient graph. The implementation allows
the user to specify the maximum clique size and the maximum number of cliques
per color, in case load-balancing or convergence problems arise. In the experimental
results section we demonstrate the improvement in processor performance for larger
clique sizes for the superconductivity problem. In addition, the concentration of the
basic computation in the BLAS allows for an efficient, portable implementation.
Finally, we note that recent theoretical results have shown that for a model prob-
lem, the convergence rate improves as the number of colors is increased [11]. This
possibility was investigated for the piezoelectric crystal problem, and a definite, but
moderate, decrease in the convergence rate was found in going from a pointwise col-
oring (≈ 108 colors) to a clique coloring (≈ 10 colors). However, the increase in
efficiency of the implementation for the clique coloring more than offset the conver-
gence differences.
Overall, we feel that this approach represents an effective approach for efficiently
solving large, sparse linear systems on massively parallel machines. We have demon-
strated that our implementation is able to solve general sparse systems from two
different applications, achieving both good processor performance and convergence
properties.

Acknowledgment. The second author acknowledges helpful discussions with Fer-
nando Alvarado, Stanley Eisenstat, and Robert Schreiber while attending the IMA
workshop. In addition, we thank the referee for a number of constructive comments
on the paper.

REFERENCES

[1] F. L. ALVARADO, A. POTHEN, AND R. SCHREIBER, Highly parallel sparse triangular solution,
Tech. Rep. CS-92-09, The Pennsylvania State University, May 1992.
[2] F. L. ALVARADO AND R. SCHREIBER, Optimal parallel solution of sparse triangular systems,
SIAM Journal on Scientific and Statistical Computing, (to appear).
[3] D. BRELAZ, New methods to color the vertices of a graph, Comm. ACM, 22 (1979), pp. 251-256.
[4] T. F. COLEMAN AND J. J. MORE, Estimation of sparse Jacobian matrices and graph coloring
problems, SIAM Journal on Numerical Analysis, 20 (1983), pp. 187-209.
[5] M. R. GAREY AND D. S. JOHNSON, Computers and Intractability, W. H. Freeman, New York,
1979.
[6] J. L. GUSTAFSON, G. R. MONTRY, AND R. E. BENNER, Development of parallel methods
for a 1024-processor hypercube, SIAM Journal on Scientific and Statistical Computing, 9
(1988), pp. 609-638.
[7] L. A. HAGEMAN AND D. M. YOUNG, Applied Iterative Methods, Academic Press, New York,
1981.
[8] D. S. JOHNSON, Worst case behavior of graph coloring algorithms, in Proceedings 5th South-
eastern Conference on Combinatorics, Graph Theory, and Computing, Utilitas Mathemat-
ica Publishing, Winnipeg, 1974, pp. 513-527.
[9] M. T. JONES AND M. L. PATRICK, The Lanczos algorithm for the generalized symmetric
eigenproblem on shared-memory architectures, Preprint MCS-P182-1090, Mathematics and
Computer Science Division, Argonne National Laboratory, Argonne, Ill., 1990.
[10] M. T. JONES AND P. E. PLASSMANN, Scalable iterative solution of sparse linear systems,
Preprint MCS-P277-1191, Mathematics and Computer Science Division, Argonne National
Laboratory, Argonne, Ill., 1991.
[11] M. T. JONES AND P. E. PLASSMANN, The effect of many-color orderings on the convergence
of iterative methods, in Proceedings of the Copper Mountain Conference on Iterative
Methods, SIAM LA-SIG, 1992.
[12] M. T. JONES AND P. E. PLASSMANN, Solution of large, sparse systems of linear equations in
massively parallel applications, Preprint MCS-P313-0692, Mathematics and Computer
Science Division, Argonne National Laboratory, Argonne, Ill., 1992.
[13] M. T. JONES AND P. E. PLASSMANN, A parallel graph coloring heuristic, SIAM Journal on
Scientific and Statistical Computing, 14 (1993).
[14] KUCK & ASSOCIATES, CLASSPACK Basic Math Library User's Guide (Release 1.1), Kuck &
Associates, Inc., Champaign, IL, 1990.
[15] M. LUBY, A simple parallel algorithm for the maximal independent set problem, SIAM Journal
on Computing, 15 (1986), pp. 1036-1053.
[16] T. A. MANTEUFFEL, An incomplete factorization technique for positive definite linear systems,
Mathematics of Computation, 34 (1980), pp. 473-497.
[17] B. NOUR-OMID, B. N. PARLETT, T. ERICSSON, AND P. S. JENSEN, How to implement the
spectral transformation, Mathematics of Computation, 48 (1987), pp. 663-673.
[18] A. POTHEN, H. SIMON, AND K.-P. LIOU, Partitioning sparse matrices with eigenvectors of
graphs, SIAM Journal on Matrix Analysis, 11 (1990), pp. 430-452.
[19] R. SCHREIBER, Private communication, 1991.
[20] R. SCHREIBER AND W.-P. TANG, Vectorizing the conjugate gradient method, Unpublished
manuscript, Department of Computer Science, Stanford University, 1982.
[21] H. A. VAN DER VORST, High performance preconditioning, SIAM Journal on Scientific and
Statistical Computing, 10 (1989), pp. 1174-1185.
[22] S. VAVASIS, Automatic domain partitioning in three dimensions, SIAM Journal on Scientific
and Statistical Computing, 12 (1991), pp. 950-970.
