
Xiaoming Chen · Yu Wang · Huazhong Yang

Parallel Sparse Direct Solver for Integrated Circuit Simulation
Xiaoming Chen
Department of Computer Science and Engineering
University of Notre Dame
Notre Dame, IN, USA
and
Tsinghua University, Beijing, China

Yu Wang
Tsinghua University, Beijing, China

Huazhong Yang
Tsinghua University, Beijing, China

ISBN 978-3-319-53428-2 ISBN 978-3-319-53429-9 (eBook)


DOI 10.1007/978-3-319-53429-9
Library of Congress Control Number: 2017930795

© Springer International Publishing AG 2017


This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part
of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations,
recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission
or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar
methodology now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this
publication does not imply, even in the absence of a specific statement, that such names are exempt from
the relevant protective laws and regulations and therefore free for general use.
The publisher, the authors and the editors are safe to assume that the advice and information in this
book are believed to be true and accurate at the date of publication. Neither the publisher nor the
authors or the editors give a warranty, express or implied, with respect to the material contained herein or
for any errors or omissions that may have been made. The publisher remains neutral with regard to
jurisdictional claims in published maps and institutional affiliations.

Printed on acid-free paper

This Springer imprint is published by Springer Nature


The registered company is Springer International Publishing AG
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland
Preface

With the advances in the scale and complexity of modern integrated circuits (ICs),
Simulation Program with Integrated Circuit Emphasis (SPICE) based circuit sim-
ulators are facing performance challenges, especially for post-layout simulations.
Advances in semiconductor technologies have greatly promoted the development of
parallel computers, and, hence, parallelization has become a promising approach to
accelerate circuit simulations. Parallel circuit simulation has been a popular research
topic for a few decades since the invention of SPICE. The sparse direct solver implemented by sparse lower–upper (LU) factorization is the biggest bottleneck in modern full SPICE-accurate IC simulations, since it is extremely difficult to parallelize. This is a practical challenge which both academia and industry are facing.

This book describes algorithmic methods and parallelization techniques that aim to realize a parallel sparse direct solver named NICSLU (NICS is short for Nano-Scale Integrated Circuits and Systems, the name of our laboratory at Tsinghua University), which is specially targeted at SPICE-like circuit simulation problems. We propose innovative numerical algorithms and a parallelization framework for designing NICSLU. We describe the complete flow and detailed parallel algorithms of NICSLU. We also show how to improve the performance of NICSLU by developing novel numerical techniques. NICSLU can be applied to any SPICE-like circuit simulator and has been proven to deliver high performance in actual circuit simulation applications.
There are eight chapters in this book. Chapter 1 gives a general introduction to
SPICE-like circuit simulation and also describes the challenges of parallel circuit
simulation. Chapter 2 comprehensively reviews existing work on parallel circuit
simulation techniques, including various software algorithms and hardware accel-
eration techniques. Chapter 3 covers the overall flow and all the core steps of
NICSLU.
Starting from Chap. 4, we present the proposed algorithmic methods and par-
allelization techniques of NICSLU in detail. We will describe two parallel factor-
ization algorithms, a full factorization with partial pivoting and a re-factorization
without partial pivoting, based on an innovative parallelization framework. The two
algorithms are both compatible with SPICE-like circuit simulation applications.


Three improvement techniques are presented in Chap. 5 to further enhance the performance of NICSLU. Test results of NICSLU, including benchmark results and
circuit simulation results, are presented and analyzed in Chap. 6.
In Chap. 7, we present a graph-based performance model to evaluate the performance and find the bottlenecks of NICSLU. This model is expected to help readers understand the performance of NICSLU in more depth, and, thus, potentially find further improvement points.
Chapter 8 concludes the book by summarizing the proposed innovative tech-
niques and discussing possible future research directions.
To better understand the algorithms and parallelization techniques presented in
this book, readers can download the source code of an old version of NICSLU from
http://nics.ee.tsinghua.edu.cn/people/chenxm/nicslu.htm.
The content of this book describes part of my Ph.D. work at the Department of Electronic Engineering, Tsinghua University, Beijing, China. This work could not have been accomplished without the support of my advisors, colleagues, family members, and friends. First of all, I would like to thank my advisors, Profs. Yu Wang and Huazhong Yang, for their endless guidance and support in this challenging research. I would like to acknowledge Boxun Li, Yuzhi Wang, Ling Ren, Gushu Li, Wei Wu, Du Su, and Shuai Tao for their great help during my Ph.D. study. Last but certainly not least, I would like to thank my parents and my wife. Without their unconditional support and encouragement, I would not have been able to study sincerely at Tsinghua University. I really cherish the nine years I spent at Tsinghua University, and I cannot imagine how difficult my life would have been without their support and encouragement.

Beijing, China Xiaoming Chen


Contents

1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 Circuit Simulation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1.1 Mathematical Formulation . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.1.2 LU Factorization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.1.3 Simulation Flow . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.2 Challenges of Parallel Circuit Simulation . . . . . . . . . . . . . . . . . . . . 7
1.2.1 Device Model Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.2.2 Sparse Direct Solver . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.2.3 Theoretical Speedup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
1.3 Focus of This Book . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.1 Direct Parallel Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.1.1 Parallel Direct Matrix Solutions . . . . . . . . . . . . . . . . . . . . . 14
2.1.2 Parallel Iterative Matrix Solutions . . . . . . . . . . . . . . . . . . . . 19
2.2 Domain Decomposition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.2.1 Parallel BBD-Form Matrix Solutions . . . . . . . . . . . . . . . . . 22
2.2.2 Parallel Multilevel Newton Methods . . . . . . . . . . . . . . . . . . 24
2.2.3 Parallel Schwarz Methods . . . . . . . . . . . . . . . . . . . . . . . . . . 25
2.2.4 Parallel Relaxation Methods . . . . . . . . . . . . . . . . . . . . . . . . 26
2.3 Parallel Time-Domain Simulation . . . . . . . . . . . . . . . . . . . . . . . . . . 28
2.3.1 Parallel Numerical Integration Algorithms . . . . . . . . . . . . . 28
2.3.2 Parallel Multi-Algorithm Simulation . . . . . . . . . . . . . . . . . . 30
2.3.3 Time-Domain Partitioning . . . . . . . . . . . . . . . . . . . . . . . . . . 31
2.3.4 Matrix Exponential Methods . . . . . . . . . . . . . . . . . . . . . . . . 32
2.4 Hardware Acceleration Techniques . . . . . . . . . . . . . . . . . . . . . . . . . 33
2.4.1 GPU Acceleration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
2.4.2 FPGA Acceleration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35


3 Overall Solver Flow . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43


3.1 Overall Flow. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
3.2 Pre-analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
3.2.1 Zero-Free Permutation/Static Pivoting . . . . . . . . . . . . . . . . . 46
3.2.2 Matrix Ordering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
3.2.3 Symbolic Factorization . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
3.3 Numerical Full Factorization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
3.3.1 Symbolic Prediction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
3.3.2 Numerical Update . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
3.3.3 Partial Pivoting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
3.3.4 Pruning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
3.4 Numerical Re-factorization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
3.4.1 Factorization Method Selection . . . . . . . . . . . . . . . . . . . . . . 58
3.5 Right-Hand-Solving . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
3.5.1 Forward/Backward Substitutions . . . . . . . . . . . . . . . . . . . . . 60
3.5.2 Iterative Refinement . . . . . . . . . . . . . . . . . . . . . . . . 60
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
4 Parallel Sparse Left-Looking Algorithm . . . . . . . . . . . . . . . . . . . . . . . 63
4.1 Parallel Full Factorization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
4.1.1 Data Dependence Representation . . . . . . . . . . . . . . . . . . . . 64
4.1.2 Task Scheduling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
4.1.3 Algorithm Flow . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
4.1.4 Implementation Details . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
4.2 Parallel Re-factorization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
4.2.1 Data Dependence Representation . . . . . . . . . . . . . . . . . . . . 74
4.2.2 Task Scheduling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
4.2.3 Algorithm Flow . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
4.2.4 Implementation Details . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
5 Improvement Techniques . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
5.1 Map Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
5.1.1 Motivation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
5.1.2 Map Definition and Construction . . . . . . . . . . . . . . . . 81
5.1.3 Sequential Map Re-factorization . . . . . . . . . . . . . . . . . . . . . 82
5.1.4 Parallel Map Re-factorization . . . . . . . . . . . . . . . . . . . . . . . 84
5.2 Supernodal Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
5.2.1 Motivation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
5.2.2 Supernode Definition and Storage . . . . . . . . . . . . . . . . 85
5.2.3 Supernodal Full Factorization . . . . . . . . . . . . . . . . . . . . . . . 86
5.2.4 Supernodal Re-factorization . . . . . . . . . . . . . . . . . . . . . . . . 90

5.3 Fast Full Factorization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94


5.3.1 Motivation and Pivoting Reduction . . . . . . . . . . . . . . . . . . . 94
5.3.2 Sequential Fast Full Factorization . . . . . . . . . . . . . . . . . . . . 96
5.3.3 Parallel Fast Full Factorization . . . . . . . . . . . . . . . . . . . . . . 97
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
6 Test Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
6.1 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
6.2 Performance Metric . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
6.2.1 Speedups . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
6.2.2 Performance Profile . . . . . . . . . . . . . . . . . . . . . . . . . 101
6.3 Results of Benchmark Test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
6.3.1 Comparison of Different Algorithms . . . . . . . . . . . . . . . . . . 102
6.3.2 Relative Speedups . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
6.3.3 Speedups . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
6.3.4 Other Comparisons . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
6.4 Results of Simulation Test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116
7 Performance Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
7.1 DAG-Based Performance Model . . . . . . . . . . . . . . . . . . . . . . . . . . 117
7.2 Results and Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121
7.2.1 Theoretical Maximum Relative Speedup . . . . . . . . . . . . . . . 121
7.2.2 Predicted Relative Speedup . . . . . . . . . . . . . . . . . . . . . . . . . 122
7.2.3 Bottleneck Analysis. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123
8 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125
Reference . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126
Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127
Chapter 1
Introduction

In the 1970s, the Electronics Research Laboratory of the University of California, Berkeley developed the Simulation Program with Integrated Circuit Emphasis
(SPICE), which is a general-purpose integrated circuit (IC) simulator that can be used
to check the integrity of circuit designs and predict circuit behaviors at the transistor
level. In the following decades, SPICE served as the simulation kernel of a number of
circuit simulators from both academia and industry, and also greatly promoted the
advance of the Electronic Design Automation (EDA) industry. Today, SPICE has
become the de facto standard transistor-level IC simulation tool. SPICE-like circuit
simulators are widely used in analog and mixed-signal circuit design and verification.
However, with the advances in the scale and complexity of modern ICs, SPICE-like
circuit simulators are facing performance challenges. In the SPICE-like circuit sim-
ulation flow, the lower–upper (LU) factorization-based sparse direct solver is usually very time-consuming, so it is a severe performance bottleneck. In this chapter, we will
give a fundamental introduction to SPICE. Following that, we will explain the chal-
lenges in parallelization of SPICE-like circuit simulators. Finally, we will present
the focus of this book.

1.1 Circuit Simulation

With the rapid development of the IC and computer technologies, EDA techniques
have become an important subject in the electronics area. The appearance and devel-
opment of EDA techniques have greatly promoted the development of the semi-
conductor industry. The development trend of modern very-large-scale integration
(VLSI) circuits is to integrate more functionalities into a single chip. To facilitate this,
the scale of modern ICs is extremely large and electronic systems are also becoming
more complex generation by generation. In addition, electronic devices are upgrad-
ing frequently and IC vendors are facing a huge challenge of the time-to-market.
Such a rapid developing electronic world, on the other hand, has brought a challenge
to EDA techniques: the performance of modern EDA tools must keep pace with

the development of modern ICs, such that IC vendors can ceaselessly develop new
products for the world.
As one of the core components of EDA techniques, SPICE [1]-like transistor-
level circuit simulation is an essential step in the design and verification process
of a very broad range of ICs and electronic systems such as processors, memories,
analog and mixed-signal circuits, etc. It serves as a critical and inexpensive way of predicting circuit performance and identifying possible faults before the expensive chip fabrication. As a fundamental step in the IC design and verification process, SPICE simulation techniques, including fundamental algorithms and parallel simulation approaches, have been widely studied and under long-term active development in the decades since the invention of SPICE. Today, there are a number of
circuit simulators from both academia and industry which are developed based on the
original SPICE code. SPICE has already become the de facto standard transistor-level
simulation tool. SPICE-like circuit simulators are widely adopted by universities and
IC vendors all over the world.
Modern SPICE-like circuit simulators usually integrate a large number of device
models, including resistors, capacitors, inductors, independent and dependent
sources, various semiconductor devices including diodes, metal–oxide–semiconductor field-effect transistors (MOSFETs), junction field-effect transistors (JFETs), etc., as well as many macro-models that represent complicated IC components. Modern circuit simulators also support a wide variety of circuit analyses, including direct current (DC) analysis, alternating current (AC) analysis, transient analysis, noise analysis, sensitivity analysis, pole-zero analysis, etc. DC analysis, which calculates a quiescent operating point, serves as a basic starting point for almost all of the other simulations. Transient analysis, also called time-domain simulation, which calculates the transient response in a given time interval, is the most widely used function in analog and mixed-signal circuit design and verification among all the simulation functions offered by SPICE. All of the device models and simulation functions provided by modern circuit simulators provide strong support for the transistor-level simulation of modern complementary metal–oxide–semiconductor (CMOS) circuit design and verification.
Figure 1.1 shows a typical framework of SPICE-like circuit simulators, in which
the blue boxes are essential components and the dotted boxes are supplementary functionalities provided by some software packages. The SPICE kernel accepts text netlist files as input. Although some software packages have a graphical interface such that users can draw circuit schematics using built-in symbols, the schematics are automatically compiled into netlist files by front-end tools before simulation. Netlist files describe everything about the circuit to be simulated, including the circuit structure, parameters of devices and models, simulation type and control parameters, etc. The SPICE kernel reads the netlist files and builds internal data structures. Then device models are initialized according to the model parameters specified in the input files. Based on Kirchhoff's laws [2], a circuit equation is
created, which is then solved by numerical engines to get the response of the circuit.
Finally, back-end tools like waveform viewer can be used to show and analyze the
response.

Fig. 1.1 General framework of SPICE-like circuit simulators (components: graphical user interface, schematic and symbol library, netlist file, model parameters, SPICE simulation kernel, device model library, output files, waveform post-processing)
1.1.1 Mathematical Formulation

SPICE employs the modified nodal analysis (MNA) [3] method to create the circuit
equation. In this subsection, we will use transient simulation as an example to explain
the principle and simulation flow of SPICE. In transient simulation, the equation cre-
ated by MNA has a unified form which can be expressed by the following differential
algebraic equation (DAE)

f(x(t)) + \frac{\mathrm{d}}{\mathrm{d}t} q(x(t)) = u(t),    (1.1)

where t is the time, x(t) is the unknown vector containing node voltages and branch
currents, f (x(t)) is a nonlinear vector function denoting the effect of static devices
in the circuit, q(x(t)) is a nonlinear vector function denoting the effect of dynamic
devices in the circuit, and u(t) is the known stimulus of the circuit.
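For readers less familiar with MNA, the short sketch below shows how a single linear resistor contributes to the conductance part of the circuit matrix by "stamping". It is an illustrative aside rather than actual SPICE code: the dense matrix storage, the node numbering convention, and the stamp_resistor helper are assumptions made only for clarity.

#include <stddef.h>

/* Minimal sketch (not SPICE code): stamp a resistor of R ohms connected
 * between nodes n1 and n2 into a dense MNA conductance matrix G of
 * dimension n. Node 0 is ground and is not stored.                    */
static void stamp_resistor(double *G, size_t n, int n1, int n2, double R)
{
    double g = 1.0 / R;                        /* branch conductance     */
    if (n1 > 0) G[(n1 - 1) * n + (n1 - 1)] += g;
    if (n2 > 0) G[(n2 - 1) * n + (n2 - 1)] += g;
    if (n1 > 0 && n2 > 0) {
        G[(n1 - 1) * n + (n2 - 1)] -= g;       /* off-diagonal couplings */
        G[(n2 - 1) * n + (n1 - 1)] -= g;
    }
}

Each device type contributes an analogous stamp, and the sum of all stamps yields the matrices behind Eq. (1.1).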
In most practical cases, Eq. (1.1) does not have an analytical solution, so the
only way to solve it is to use numerical methods. The implicit backward Euler and
trapezoid methods [4] are usually adopted to solve the DAE in SPICE-like circuit
simulators. If we adopt the backward Euler method to discretize Eq. (1.1) in the time
domain, we get

f(x(t_{n+1})) + \frac{\mathrm{d}q(x)}{\mathrm{d}x^{T}} \cdot \frac{x(t_{n+1}) - x(t_n)}{t_{n+1} - t_n} = u(t_{n+1}),    (1.2)

where tn and tn+1 are the discrete time nodes. If the solutions at and before time node
tn (i.e., x(t0 ), x(t1 ), . . . , x(tn )) are all known, then the solution at time node tn+1 (i.e.,
x(tn+1 )) can be solved from Eq. (1.2).
Equation (1.2) is nonlinear and can be abstracted into the following implicit equa-
tion:
Fn+1 (x(tn+1 )) = 0, (1.3)

where Fn+1 denotes the implicit nonlinear function at time node tn+1 , which can
be solved by the Newton–Raphson method [4]. Namely, Eq. (1.3) is solved by the
following iteration form:

J\left(x(t_{n+1})^{(k)}\right) x(t_{n+1})^{(k+1)} = -F_{n+1}\left(x(t_{n+1})^{(k)}\right) + J\left(x(t_{n+1})^{(k)}\right) x(t_{n+1})^{(k)},    (1.4)

where J is the Jacobian matrix and the superscript (k) denotes the iteration number.
Equation (1.4) can be further abstracted into a linear system form

Ax = b, (1.5)

where the matrix A and the right-hand-side (RHS) vector b only depend on the
intermediate results of the kth iteration. So far, we have shown that the core operation to solve the circuit equation Eq. (1.1) in SPICE-like circuit simulation is to solve the linear system Eq. (1.5), which is obtained by discretizing and linearizing the DAE using numerical integration methods (e.g., the backward Euler method) and the Newton–Raphson method.
Although the above equations are all derived from transient simulation, the core method is similar for other simulation functions. Basically, for ordinary differential equations (ODEs), implicit integration methods are adopted to discretize the equation in the time domain, and then the Newton–Raphson method is adopted to linearize the nonlinear equation at a particular time point. Consequently, for any type of SPICE-like simulation, the core operation is always solving linear equations associated with the circuit and the simulation function. The major difference lies in the format of the equation. For example, in frequency-domain simulation, we need to solve complex linear systems instead of real linear systems. Therefore, the linear solver is an extremely important component in any SPICE-like circuit simulator.
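To connect the formulas above, the following sketch outlines one transient time point as a Newton–Raphson loop over Eqs. (1.3)–(1.5). The assemble() and lu_solve() hooks, the dense storage, and the simple infinity-norm convergence test are illustrative assumptions; a real simulator uses sparse data structures and more elaborate convergence criteria.

#include <math.h>
#include <stdlib.h>

/* Hypothetical hooks assumed to be provided by the simulator:
 *  assemble() evaluates the device models at the current Newton iterate x
 *  and loads the Jacobian A and RHS b of Eq. (1.5) for time point t with
 *  step h (backward Euler discretization);
 *  lu_solve() solves A * x_new = b, e.g., by sparse LU factorization.   */
void assemble(double *A, double *b, const double *x, double t, double h, size_t n);
int  lu_solve(const double *A, const double *b, double *x_new, size_t n);

/* One transient time point: Newton-Raphson iteration on Eq. (1.3).
 * On entry x holds the converged solution of the previous time point;
 * on successful return it holds the solution at time t.                 */
int transient_step(double *x, size_t n, double t, double h,
                   int max_iter, double tol)
{
    double *A = malloc(n * n * sizeof(double));
    double *b = malloc(n * sizeof(double));
    double *x_new = malloc(n * sizeof(double));
    int converged = 0;

    for (int k = 0; k < max_iter && !converged; ++k) {
        assemble(A, b, x, t, h, n);            /* Eq. (1.4): build A and b */
        if (lu_solve(A, b, x_new, n) != 0)     /* Eq. (1.5): solve A x = b */
            break;
        double diff = 0.0;                     /* infinity-norm of update  */
        for (size_t i = 0; i < n; ++i) {
            double d = fabs(x_new[i] - x[i]);
            if (d > diff) diff = d;
            x[i] = x_new[i];
        }
        converged = (diff < tol);
    }
    free(A); free(b); free(x_new);
    return converged ? 0 : -1;
}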

1.1.2 LU Factorization

LU factorization [5], also called triangular factorization, which is a variant of the Gaussian elimination method and belongs to direct methods [6], is widely adopted
to solve linear systems in many practical applications. LU factorization factorizes
a matrix A into the product of a lower triangular matrix L and an upper triangular
matrix U. In theory, the matrix does not need to be square; however, LU factorization

is usually applied to square matrices to solve linear systems. For an N × N square matrix A, LU factorization can be described by the following form:

\begin{pmatrix} A_{11} & A_{12} & \cdots & A_{1N} \\ A_{21} & A_{22} & \cdots & A_{2N} \\ \vdots & \vdots & \ddots & \vdots \\ A_{N1} & A_{N2} & \cdots & A_{NN} \end{pmatrix} = \begin{pmatrix} L_{11} & & & \\ L_{21} & L_{22} & & \\ \vdots & \vdots & \ddots & \\ L_{N1} & L_{N2} & \cdots & L_{NN} \end{pmatrix} \begin{pmatrix} U_{11} & U_{12} & \cdots & U_{1N} \\ & U_{22} & \cdots & U_{2N} \\ & & \ddots & \vdots \\ & & & U_{NN} \end{pmatrix}.    (1.6)

Elements of L and U are mathematically computed by the following two equations:

U_{ij} = A_{ij} - \sum_{k=1}^{i-1} L_{ik} U_{kj}, \quad i = 1, 2, \ldots, N; \; j = i, i+1, \ldots, N,    (1.7)

L_{ij} = \frac{1}{U_{jj}} \left( A_{ij} - \sum_{k=1}^{j-1} L_{ik} U_{kj} \right), \quad i = 1, 2, \ldots, N; \; j = 1, 2, \ldots, i-1.    (1.8)
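A minimal dense implementation of Eqs. (1.7) and (1.8) (Doolittle-style LU with an implicit unit diagonal of L and no pivoting) is sketched below for reference; it is not how a sparse circuit solver is written, but it makes the recurrences concrete.

#include <stddef.h>

/* Dense Doolittle LU without pivoting, directly following Eqs. (1.7)-(1.8):
 * A (n x n, row-major) is overwritten so that its upper triangle holds U and
 * its strict lower triangle holds L (the unit diagonal of L is implicit).
 * Returns -1 on a zero pivot. Reference sketch only; real solvers pivot and
 * exploit sparsity.                                                        */
int lu_doolittle(double *A, size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        for (size_t j = i; j < n; ++j) {           /* Eq. (1.7): U(i, j)    */
            double s = A[i * n + j];
            for (size_t k = 0; k < i; ++k)
                s -= A[i * n + k] * A[k * n + j];
            A[i * n + j] = s;
        }
        if (A[i * n + i] == 0.0)
            return -1;                             /* zero pivot            */
        for (size_t j = i + 1; j < n; ++j) {       /* Eq. (1.8): L(j, i)    */
            double s = A[j * n + i];
            for (size_t k = 0; k < i; ++k)
                s -= A[j * n + k] * A[k * n + i];
            A[j * n + i] = s / A[i * n + i];
        }
    }
    return 0;
}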

To solve a linear system using LU factorization, at least the following two steps are
required: triangular factorization (i.e., A = LU) and forward/backward substitutions
(solving y from Ly = b and solving x from Ux = y). In practice, due to the
numerical instability problem caused by round-off errors, one needs to do pivoting
when performing LU factorization. In most cases, a proper permutation of rows (or columns) is sufficient to ensure the numerical stability of LU factorization. Such an approach is called partial pivoting. Row permutation-based LU factorization
with partial pivoting can be expressed as follows:

PA = LU, (1.9)

where P is the row permutation matrix indicating the row pivoting order. LU factor-
ization with full pivoting involves both row and column permutations, i.e.,

PAQ = LU, (1.10)

where P and Q are the row and column permutation matrices indicating the row and
column pivoting orders, respectively.
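Given the factorization PA = LU of Eq. (1.9), the two triangular solves look as follows. This is again a dense reference sketch; the packed storage of L and U in one array and the perm vector are assumptions made for illustration.

#include <stddef.h>

/* Solve A x = b given the factorization P A = L U of Eq. (1.9).
 * LU is the packed factor matrix (strict lower triangle = L with a unit
 * diagonal, upper triangle = U) and perm[i] gives the original row placed
 * in position i by partial pivoting.                                      */
void lu_solve_perm(const double *LU, const size_t *perm,
                   const double *b, double *x, size_t n)
{
    /* Forward substitution: L y = P b (y is stored in x). */
    for (size_t i = 0; i < n; ++i) {
        double s = b[perm[i]];
        for (size_t k = 0; k < i; ++k)
            s -= LU[i * n + k] * x[k];
        x[i] = s;                          /* L has a unit diagonal */
    }
    /* Backward substitution: U x = y. */
    for (size_t i = n; i-- > 0; ) {
        double s = x[i];
        for (size_t k = i + 1; k < n; ++k)
            s -= LU[i * n + k] * x[k];
        x[i] = s / LU[i * n + i];
    }
}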
The time complexity of LU factorization is O(N^3) for dense matrices, so it can
be very time-consuming when solving large linear systems. However, for sparse
matrices, the time complexity is greatly reduced, so efficiently solving a large sparse
linear system by LU factorization is possible. In order to enhance the performance of
solving sparse linear systems by LU factorization, an additional pre-analysis step to
reorder the row and column permutations to minimize fill-ins [6] is required before
factorization, which will be explained in detail in Chap. 3.

Fig. 1.2 Typical flow of SPICE-like transient simulation: after netlist parsing, matrix creation by MNA and pre-analysis, and a DC analysis, the transient iteration repeatedly runs the SPICE iteration (device model evaluation, matrix/RHS load, sparse LU factorization A = LU, and forward/backward substitutions Ly = b, Ux = y) until the Newton–Raphson iteration converges, then updates the time node until the simulation interval ends and the waveform is output

1.1.3 Simulation Flow

Figure 1.2 shows a typical flow of SPICE-like transient simulation, which can be
derived from the mathematical formulation presented in Sect. 1.1.1. The SPICE ker-
nel first reads a circuit netlist written in a pure text format, and then parses the netlist
file to build internal data structures. A complete SPICE flow also includes many
auxiliary and practical functionalities, e.g., netlist check and circuit topology check.
After internal data structures are built, the SPICE kernel calculates the symbolic
pattern of the circuit matrix by MNA, followed by a pre-analysis step on the sym-
bolic pattern. Typically, the pre-analysis step reorders the matrix to minimize fill-ins
during sparse LU factorization. We will discuss the pre-analysis step in Sect. 3.2 in
detail. After a DC analysis to obtain the quiescent operating point, the SPICE kernel
enters the main body of transient simulation, taking the quiescent operating point as
the initial condition.
The main body of transient simulation is marked in blue in Fig. 1.2. Accord-
ing to the mathematical formulation presented in Sect. 1.1.1, SPICE-like transient
simulation has two nested levels of loops. The outer level is the transient iteration
and the inner level is the nonlinear Newton–Raphson iteration. The outer level loop

discretizes the DAE Eq. (1.1) into Eq. (1.2) (i.e., Eq. (1.3)) in the time domain
by some numerical integration method. The inner level loop solves the nonlinear
equation Eq. (1.3) using the Newton–Raphson method (i.e., Eq. (1.4)) at a particular time node. Once the Newton–Raphson method converges, the simulation moves to the next time node, whose step size is determined by estimating the local truncation error (LTE) of the adopted numerical integration method, and then the inner level loop runs again at the new time node. Typically, a
SPICE-like transient simulation can perform thousands of iterations.
Each iteration in the inner level loop is called a SPICE iteration. In the SPICE iter-
ation, a device model evaluation is first performed, which is followed by matrix/RHS
load. Device model evaluation uses the solution obtained in the previous SPICE iter-
ation. The purpose of the two steps is to calculate the Jacobian matrix and the RHS of
Eq. (1.4), i.e., the coefficient matrix A and the RHS vector b. After the linear system
is constructed, a sparse solver is invoked to solve it, and then we get the solution
of the current SPICE iteration. Typically, SPICE-like circuit simulators adopt sparse
LU factorization to solve the linear system. Matrices created by SPICE-like circuit
simulators have a unique feature: although the values change during SPICE iterations, the symbolic pattern of the matrix remains unchanged. This is also one of the reasons that SPICE-like circuit simulators usually adopt sparse LU factorization, because some symbolic computations need to be executed only once.
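The consequence of a fixed symbolic pattern can be illustrated by the following structural sketch. The function names are hypothetical placeholders, not the actual NICSLU interface: the point is only that analysis runs once, while factorization (or the cheaper re-factorization, discussed in later chapters) and the triangular solves repeat inside every SPICE iteration.

/* Structural sketch only: these declarations are hypothetical and do NOT
 * correspond to the actual NICSLU API. Pattern-dependent work (ordering,
 * symbolic analysis) runs once; only numerical work repeats per iteration. */
typedef struct solver solver_t;

solver_t *solver_analyze(int n, const int *row_ptr, const int *col_idx);
int solver_factorize  (solver_t *s, const double *values);  /* full, with pivoting  */
int solver_refactorize(solver_t *s, const double *values);  /* reuses pivot order   */
int solver_solve      (solver_t *s, const double *b, double *x);

/* Newton loop for one time point; solver_analyze() was called once before. */
int newton_loop(solver_t *s, double *values, double *b, double *x,
                int (*load)(double *values, double *b, const double *x),
                int (*converged)(const double *x), int max_iter)
{
    for (int k = 0; k < max_iter; ++k) {
        load(values, b, x);                     /* model evaluation + matrix/RHS load */
        if (solver_refactorize(s, values) != 0) /* same pattern, new numerical values */
            solver_factorize(s, values);        /* fall back to pivoting factorization */
        solver_solve(s, b, x);                  /* forward/backward substitutions     */
        if (converged(x))
            return 0;
    }
    return -1;
}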
It is well known that there are two types of methods to solve linear systems:
direct methods [6] and iterative methods [7]. SPICE-like circuit simulators usually
adopt sparse LU factorization, which belongs to direct methods. The main reasons for using direct methods are the high numerical stability of direct methods and the poor convergence of iterative methods. Iterative methods usually require good preconditioners to make the matrix diagonally dominant such that they can converge quickly. However, circuit matrices created by MNA are typically quite irregular and often close to singular, so they are difficult to precondition. In addition, during SPICE iterations, the matrix values always change, so the preconditioner must be recomputed in every iteration, which leads to a high performance penalty. On the contrary, direct methods do not have this limitation. By carefully pivoting during sparse LU factorization, we can always get accurate solutions except when the matrix is ill-conditioned. Another advantage of using direct methods in SPICE-like circuit simulation is that, if a fixed time step is used in transient simulation of linear circuits, the coefficient matrix A remains the same over all time nodes, so the LU factors also remain the same and only forward/backward substitutions are required to solve the linear system, which significantly saves the runtime of sparse LU factorization.

1.2 Challenges of Parallel Circuit Simulation

With the advances in the scale and complexity of modern ICs, SPICE-like circuit
simulators are facing performance challenges. For modern analog and mixed-signal
circuits, pre-layout simulations can usually take a few days [8] and post-layout sim-
ulations can even take a few weeks [9]. The extremely long simulation time may

significantly affect the design efficiency and the time-to-market. In recent years, the
rapid evolution of parallel computers has greatly promoted the development of par-
allel SPICE simulation techniques. Accelerating SPICE-like circuit simulators by
parallel processing of simulation tasks has been a popular research topic for decades.
Generally speaking, parallelism can be achieved at two different granularities:
multi-core parallelism and multi-machine parallelism. In this book, we will focus on
multi-core parallelism, as it is easier to implement and the communication cost is
much smaller. Typically, multi-core parallelism is implemented by multi-threading
on shared-memory machines. Parallelism can be integrated into every step of the
SPICE-like simulation flow shown in Fig. 1.2. Considering the runtime of each step,
there are two major bottlenecks in SPICE-like transient simulation: device model
evaluation and the sparse direct solver. The two steps consume most of the simulation
time. To parallelize and accelerate SPICE-like circuit simulators, the primary task
is to parallelize the two steps. In this section, we will explain the challenges of
parallelizing SPICE-like circuit simulators.

1.2.1 Device Model Evaluation

Device model evaluation dominates the total simulation time for pre-layout circuits.
It may take up to 75% of the total simulation time and scales linearly with the
circuit size [10]. Parallelizing device model evaluation is straightforward, as one
only needs to distribute all the device models on multiple cores, achieving a simple
task-level parallelism. The inter-thread communication cost is almost zero, and load
balance is very easy to achieve by evenly distributing all the devices on multiple
cores. Such a method will demonstrate a good scalability for the device model eval-
uation step. However, even if the parallel efficiency of device model evaluation reaches 100%, the overall parallel efficiency is still limited by the many non-negligible sequential simulation tasks. Another challenge comes from the pure computational
cost. As modern MOSFET models become more complex, the computational cost
also increases rapidly. To reduce the computational cost of device model evalua-
tion, people have proposed some acceleration techniques, such as piecewise linear
approximation of device models [11, 12] and hardware acceleration approaches [13].
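A minimal sketch of the task-level parallelism described at the beginning of this subsection is shown below, assuming a hypothetical device_t type whose eval() callback writes only into per-device storage; with OpenMP, evenly distributing the devices over the cores is a one-line loop.

#include <omp.h>

/* Sketch of task-level parallel device model evaluation. device_t and its
 * eval() callback are illustrative assumptions; each callback is assumed to
 * write only into per-device storage, so no inter-thread conflicts arise.  */
typedef struct device {
    void (*eval)(struct device *self, const double *x); /* model equations  */
    /* per-device stamps, geometry, model parameters ...                    */
} device_t;

void evaluate_models(device_t **devices, int num_devices, const double *x)
{
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < num_devices; ++i)
        devices[i]->eval(devices[i], x);     /* independent, load-balanced   */
}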

1.2.2 Sparse Direct Solver

The sparse direct solver dominates the total simulation time for post-layout circuits. It may consume 50–90% of the total simulation time for large post-layout circuits [10]. Parallelizing the sparse direct solver is quite difficult. It is a big challenge that has not been well solved for several decades. Although there are many popular software packages that implement parallel sparse direct solvers, they are not suitable for circuit matrices created by MNA. The following three features of circuit matrices make the sparse direct solver difficult to parallelize.

• Circuit matrices created by MNA are extremely sparse. The average number of nonzero elements per row is typically less than 10. Such a sparsity is much lower than that of matrices from other areas, such as finite element analysis. This feature leads to a strong requirement for a high-efficiency scheduling algorithm. If the scheduling efficiency is not high enough, the scheduling overhead may dominate the solver time, as the computational cost of each task is relatively small.
• Data dependence in sparse LU factorization is quite strong. To realize a high-efficiency parallel sparse direct solver, one should carefully investigate the data dependence and explore as much parallelism as possible. Due to the sparse nature of circuit matrices, data-level parallelism is not suitable; instead, task-level parallelism should be adopted.
• The symbolic pattern of circuit matrices is irregular. This feature affects the load balance of parallel LU factorization. In addition to the irregular symbolic pattern of the matrix, dynamic numerical pivoting also changes the symbolic pattern of the LU factors at runtime, making it difficult to achieve load balance, especially when assigning tasks offline.

Because of these features, the parallel efficiency of the sparse direct solver cannot be high. Unlike device model evaluation, which can nearly achieve a 100% parallel efficiency, one can only expect a 4–6× speedup using eight cores for the sparse direct solver. The scalability becomes even poorer as the number of cores grows. In some cases, the performance may even decrease when more cores are used.

1.2.3 Theoretical Speedup

The famous Amdahl's law [14] says that the theoretical speedup of a parallel program is mainly determined by the percentage of sequential tasks, as shown in the following equation:

\mathrm{speedup} = \frac{1}{r_s + \frac{r_p}{P}},    (1.11)
where r_s and r_p (r_s + r_p = 1) denote the portions of sequential and parallel tasks, respectively, and P is the number of cores used. In SPICE-like circuit simulation, many tasks must be executed sequentially; otherwise, the parallelization cost can be very high. For example, the matrix/RHS load after device model evaluation is also difficult to parallelize, mainly due to memory conflicts. Namely, different devices may fill the same position of the matrix/RHS, so a lock must be used for every position of the matrix/RHS, leading to high cost due to numerous races. The cost becomes even higher as the number of cores grows. These sequential tasks significantly affect the efficiency and scalability of parallel SPICE simulations.
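The memory-conflict problem can be illustrated by the sketch below, which protects every update with an OpenMP atomic; per-position locks or per-thread accumulation buffers are common alternatives. The stamp_t layout is an illustrative assumption.

#include <omp.h>

/* Sketch of why parallel matrix/RHS load is costly: several devices can
 * stamp the same matrix entry, so each update must be protected. Here an
 * OpenMP atomic is used per update; the resulting contention is the cost
 * discussed above.                                                        */
typedef struct {
    int    pos;   /* index of the target nonzero in the value array */
    double val;   /* contribution computed by model evaluation      */
} stamp_t;

void load_matrix(double *values, const stamp_t *stamps, int num_stamps)
{
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < num_stamps; ++i) {
        #pragma omp atomic
        values[stamps[i].pos] += stamps[i].val;   /* contended update */
    }
}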

Fig. 1.3 Theoretical speedup of parallel SPICE simulation (Amdahl's law) versus the number of cores (2-16), for six runtime splits of sequential / model evaluation / sparse solver percentages: 5/80/15, 5/50/45, 5/20/75, 10/70/20, 10/45/45, and 10/20/70

According to Eq. (1.11), Fig. 1.3 plots some predicted theoretical speedups of parallel SPICE simulation. In this illustration, the parallel efficiencies of device model evaluation and the sparse direct solver are assumed to be 100% and 70%, respectively. As can be seen, even if the percentage of sequential tasks is only 5%, the speedup can only be about 8× when using 16 cores. If the percentage of sequential tasks is 10%, the speedup reduces to about 6× when using 16 cores, corresponding to an overall parallel efficiency of only 37.5%. To achieve highly scalable parallel simulations, the parallel efficiency of all tasks must be very close to 100%, which also means that the percentage of sequential tasks must be very close to zero. However, this is impossible in practical SPICE-like circuit simulators. Consequently, for a practical simulator, linear scalability cannot be achieved by simply parallelizing every task in the simulation flow.
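The numbers quoted above can be reproduced with a generalized form of Eq. (1.11) that assigns each component its own parallel efficiency (100% for model evaluation and 70% for the solver, as assumed in Fig. 1.3):

#include <stdio.h>

/* Generalized Amdahl's law with per-component parallel efficiencies,
 * reproducing the estimates behind Fig. 1.3.                          */
static double speedup(double r_seq, double r_model, double r_solver,
                      double eff_model, double eff_solver, int P)
{
    return 1.0 / (r_seq + r_model / (eff_model * P)
                        + r_solver / (eff_solver * P));
}

int main(void)
{
    /* 5% sequential, 80% model evaluation, 15% solver, 16 cores */
    printf("%.1f\n", speedup(0.05, 0.80, 0.15, 1.0, 0.7, 16));  /* about 8.8 */
    /* 10% sequential, 70% model evaluation, 20% solver, 16 cores */
    printf("%.1f\n", speedup(0.10, 0.70, 0.20, 1.0, 0.7, 16));  /* about 6.2 */
    return 0;
}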

1.3 Focus of This Book

As explained in Sect. 1.2, device model evaluation is easy to parallelize and there are many techniques to accelerate it, but the sparse direct solver is difficult to parallelize or accelerate due to the three challenges described above. In this book, we will describe a parallel sparse direct solver named NICSLU (NICS is the abbreviation of Nano-Scale Integrated Circuits and Systems, the name of our laboratory at Tsinghua University). NICSLU is specially designed for SPICE-like circuit simulation applications. In particular, NICSLU is well suited for DC and transient simulations in SPICE-like simulators. The following technical features make NICSLU a high-performance solver in circuit simulation applications:
• Three numerical techniques are integrated in NICSLU to achieve high numerical stability: an efficient static pivoting algorithm in the pre-analysis step, a partial pivoting algorithm in the factorization step, and an iterative refinement algorithm in the right-hand-solving step.
• We propose an innovative framework to parallelize sparse LU factorization. It is based on a detailed dependence analysis and contains two different scheduling strategies, cluster mode and pipeline mode, to fit different data dependence and sparsity of the matrix, making the scheduling efficient on multi-core central processing units (CPUs).
• Novel parallel sparse LU factorization algorithms are developed. Sufficient parallelism is explored among highly dependent tasks by a novel pipeline factorization algorithm.
• In addition to the standard sparse LU factorization algorithm, we also propose a map algorithm and a lightweight supernodal algorithm to accelerate the factorization of extremely sparse matrices and slightly dense matrices, respectively. To integrate the three numerical kernels together, we propose a simple but effective method to automatically select the best algorithm according to the sparsity of the matrix.
• A numerically stable pivoting reduction technique is proposed to reuse previous information as much as possible during successive factorizations in circuit simulation.
We have published five papers about NICSLU [15–19]. Most techniques presented in this book are based on these publications. However, this book adds more introductory content and updates the technical descriptions and experimental results.

References

1. Nagel, L.W.: SPICE 2: A computer program to simulate semiconductor circuits. Ph.D. thesis, University of California, Berkeley (1975)
2. Paul, C.: Fundamentals of Electric Circuit Analysis, 1st edn. Wiley, Manhattan, US (2001)
3. Ho, C.W., Ruehli, A.E., Brennan, P.A.: The modified nodal approach to network analysis. IEEE Trans. Circuits Syst. 22(6), 504–509 (1975)
4. Süli, E., Mayers, D.F.: An Introduction to Numerical Analysis, 2nd edn. Cambridge University Press, England (2003)
5. Turing, A.M.: Rounding-off errors in matrix processes. Q. J. Mech. Appl. Math. 1(1), 287–308 (1948)
6. Davis, T.A.: Direct Methods for Sparse Linear Systems, 1st edn. Society for Industrial and Applied Mathematics, US (2006)
7. Saad, Y.: Iterative Methods for Sparse Linear Systems, 2nd edn. Society for Industrial and Applied Mathematics, Boston, US (2004)
8. Ye, Z., Wu, B., Han, S., Li, Y.: Time-domain segmentation based massively parallel simulation for ADCs. In: Design Automation Conference (DAC), 2013 50th ACM/EDAC/IEEE, pp. 1–6 (2013)
9. Cadence Corporation: Accelerating analog simulation with full SPICE accuracy. Technical report (2008)
10. Daniels, R., Sosen, H.V., Elhak, H.: Accelerating analog simulation with HSPICE precision parallel technology. Synopsys Corporation, Technical report (2010)
11. Li, Z., Shi, C.J.R.: A quasi-Newton preconditioned Newton-Krylov method for robust and efficient time-domain simulation of integrated circuits with strong parasitic couplings. In: Asia and South Pacific Conference on Design Automation 2006, pp. 402–407 (2006)
12. Li, Z., Shi, C.J.R.: A quasi-Newton preconditioned Newton-Krylov method for robust and efficient time-domain simulation of integrated circuits with strong parasitic couplings. IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst. 25(12), 2868–2881 (2006)
13. Kapre, N., DeHon, A.: Performance comparison of single-precision SPICE model-evaluation on FPGA, GPU, Cell, and multi-core processors. In: 2009 International Conference on Field Programmable Logic and Applications, pp. 65–72 (2009)
14. Amdahl, G.M.: Validity of the single processor approach to achieving large scale computing capabilities. In: Proceedings of the April 18–20, 1967, Spring Joint Computer Conference, pp. 483–485 (1967)
15. Chen, X., Wu, W., Wang, Y., Yu, H., Yang, H.: An EScheduler-based data dependence analysis and task scheduling for parallel circuit simulation. IEEE Trans. Circuits Syst. II: Express Briefs 58(10), 702–706 (2011)
16. Chen, X., Wang, Y., Yang, H.: An adaptive LU factorization algorithm for parallel circuit simulation. In: Design Automation Conference (ASP-DAC), 2012 17th Asia and South Pacific, pp. 359–364 (2012)
17. Chen, X., Wang, Y., Yang, H.: NICSLU: an adaptive sparse matrix solver for parallel circuit simulation. IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst. 32(2), 261–274 (2013)
18. Chen, X., Wang, Y., Yang, H.: A fast parallel sparse solver for SPICE-based circuit simulators. In: Design, Automation Test in Europe Conference Exhibition (DATE), 2015, pp. 205–210 (2015)
19. Chen, X., Xia, L., Wang, Y., Yang, H.: Sparsity-oriented sparse solver design for circuit simulation. In: Design, Automation Test in Europe Conference Exhibition (DATE), 2016, pp. 1580–1585 (2016)
Chapter 2
Related Work

Parallel circuit simulation has been a popular research topic for several decades since
the invention of SPICE. Researchers have proposed a large number of parallelization techniques for SPICE-like circuit simulation [1]. In this chapter, we will comprehensively review state-of-the-art studies on parallel circuit simulation techniques. Before that, we would like to briefly introduce classifications of these parallel techniques. Parallel circuit simulation techniques can be classified from different points of view. From the implementation platform point of view, they can be classified into software techniques and hardware techniques. Hardware techniques include field-programmable gate array (FPGA)- and graphics processing unit (GPU)-based acceleration approaches. For software techniques, from the domain of parallel processing point of view, they can be further classified into direct parallel methods, parallel circuit-domain techniques, and parallel time-domain techniques. From the algorithm level of parallel processing point of view, there are intra-algorithm and inter-algorithm parallel techniques.

2.1 Direct Parallel Methods

According to the simulation flow shown in Fig. 1.2, the most straightforward way
to parallelize SPICE-like circuit simulators is to parallelize every step in the SPICE
simulation flow. Basically, the following major steps in the SPICE simulation flow
can be parallelized: netlist parsing and simulation setup, matrix pre-analysis, device
model evaluation, sparse direct solver, matrix/RHS load, and time node control.
However, as explained in Sect. 1.2, some steps are quite sequential and difficult to parallelize. In addition, steps before entering SPICE iterations (i.e., netlist parsing, simulation setup, and matrix pre-analysis) are executed only once, so their performance has little impact on the overall performance. According to the runtime percentages, one may focus only on the parallelization of device model evaluation and

the sparse direct solver, which are the two most time-consuming components in the
SPICE flow. Such simulation techniques can be called direct parallel methods as
they are straightforward to implement in existing SPICE-like simulation tools. This
is also the conventional parallelization method adopted by many commercial prod-
ucts. As explained in Sect. 1.2, the parallel efficiency of device model evaluation can
be close to 100% but the parallel efficiency of other steps, especially the sparse direct
solver, cannot be as high as expected. This means that the overall parallel efficiency
is mainly limited by the poor scalability of those steps that cannot be efficiently par-
allelized. A detailed description of direct parallel methods is presented in an early
publication [2]. It gives several methods to improve the parallel efficiency for the
matrix/RHS load step using multiple locks or barriers.
In fact, for direct parallel methods, people pay more attention to the parallelization
of the sparse direct solver, due to its high runtime percentage and high difficulty of
parallelization. In what follows, we will review existing techniques for parallel direct
and iterative matrix solutions.

2.1.1 Parallel Direct Matrix Solutions

SPICE-like circuit simulators typically use sparse LU factorization to solve linear systems. Although there are many popular software packages that implement par-
allel sparse LU factorization algorithms, most of the efforts in recent years have been made for general-purpose sparse linear system solving, while very few studies have been carried out specifically for circuit simulation problems. The main difficulty in parallelizing sparse direct solvers for SPICE-like circuit simulation problems comes from the highly sparse and irregular nature of circuit matrices. Unstructured and irregular sparse operations must be processed with load balance. In addition, the parallelization overheads associated with the small number of floating-point operations (FLOPs) of each task must be well controlled. On the other hand, the sparsity offers a new opportunity to parallelize sparse direct methods, as multiple rows or columns may be computed simultaneously. In direct methods, the numerical factorization step usually takes much more time than the forward/backward substitutions, so people's interest mainly focuses on parallelization of the numerical factorization step.
State-of-the-art popular sparse direct solvers include the SuperLU series (SuperLU, SuperLU_MT, and SuperLU_Dist) [3–7], the Unsymmetric Multifrontal Package (UMFPACK) [8], KLU [9], the Parallel Sparse Direct Solver (PARDISO) [10–12], the Multifrontal Massively Parallel Sparse Direct Solver (MUMPS) [13–15], the Watson Sparse Matrix Package (WSMP) [16], etc. Among these solvers, only KLU is specially designed for circuit simulation applications. However, KLU is purely sequential. According to the fundamental algorithms adopted by these solvers,
they can be classified into two main categories: dense submatrix-based methods
and non-dense-submatrix-based methods. The basic idea of dense submatrix-based
solvers is to collect and reorganize arithmetic operations of sparse nonzero elements

into regular dense matrix operations, such that the basic linear algebra subprograms (BLAS) [17] and/or the linear algebra package (LAPACK) [18] can be invoked to deal
with dense submatrices. These solvers can be further classified into two categories:
supernodal methods and multifrontal methods.

2.1.1.1 Supernodal Methods

A supernode is generally defined as a set of successive rows or columns of U or L with a full triangular diagonal block and the same structure in the rows or columns below or on the right side of the diagonal block [3, 6]. Row- and column-order supernodes with
the same row and column indexes can also be combined to form a single supernode, as
illustrated in Fig. 2.1. Supernodes can be treated as dense submatrices for storage and
computation, such that both the computational efficiency and cache performance can
be improved. To efficiently process dense matrix computations, vendor-optimized
BLAS and/or LAPACK is usually required.
To explore parallelism from the sparsity, a task graph, which is a directed acyclic graph (DAG), is usually used to represent the data dependence in sparse LU factorization. In SuperLU_MT [5, 7], the task graph is named the elimination tree (ET) [19]. SuperLU_MT is based on the sparse left-looking algorithm developed by Gilbert and Peierls [20], known as the G-P algorithm, and utilizes column-order supernodes. Based on the dependence represented by the ET, SuperLU_MT uses a pipelined supernodal algorithm to schedule the parallel LU factorization. Because partial pivoting can interchange the order of rows, an exact column-level dependence graph cannot be determined before factorization; the ET therefore contains all potential column-level dependence regardless of the actual pivot choices, which means that it is an upper limit of the column-level dependence. An example of the ET is shown in Fig. 2.2. The use of the ET enables a static scheduling graph that can be determined before factorization, but the overhead is that the ET overdetermines the column-level dependence and contains much redundant dependence.
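For reference, the sketch below shows the classic elimination-tree construction (Liu's algorithm with path compression) for a matrix with a symmetric nonzero pattern. The column elimination tree used for unsymmetric LU with partial pivoting is the elimination tree of A^T A; production codes compute it without forming A^T A explicitly, which this simplified sketch does not attempt.

#include <stdlib.h>

/* Sketch of Liu's elimination-tree algorithm for a sparse matrix with a
 * SYMMETRIC nonzero pattern, given in compressed-column form
 * (col_ptr/row_idx). parent[j] receives the ET parent of column j, or -1
 * for a root.                                                            */
void elimination_tree(int n, const int *col_ptr, const int *row_idx,
                      int *parent)
{
    int *ancestor = malloc(n * sizeof(int));   /* path-compressed links    */
    for (int j = 0; j < n; ++j) {
        parent[j] = -1;
        ancestor[j] = -1;
        for (int p = col_ptr[j]; p < col_ptr[j + 1]; ++p) {
            int i = row_idx[p];
            while (i != -1 && i < j) {         /* climb from row i up to j */
                int next = ancestor[i];
                ancestor[i] = j;               /* path compression         */
                if (next == -1)
                    parent[i] = j;             /* i had no parent yet      */
                i = next;
            }
        }
    }
    free(ancestor);
}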
PARDISO [10–12] also utilizes supernodes to realize a parallel LU factorization
algorithm, but its strategy is quite different from SuperLU. The authors of PARDISO

Fig. 2.1 Examples of supernodes



Fig. 2.2 Example of an elimination tree: a matrix A; b elimination tree

have developed a parallel left-right-looking algorithm [11], which is associated with a complete block supernode diagonal pivoting method, where rows and columns of a diagonal supernode can be interchanged without affecting the task dependence graph. Such a strategy enables a complete static task dependence graph that represents the dependence exactly without any redundancy, but the overhead is that it can sometimes lead to unstable solutions, so an iterative refinement is required after forward/backward substitutions in PARDISO. PARDISO further explores the parallel scalability by a two-level dynamic scheduling [12]. According to the comparison of different sparse solvers presented in [21], PARDISO is one of the highest performance sparse solvers for general sparse matrices.

2.1.1.2 Multifrontal Methods

The main purpose of the multifrontal [22] technique is somewhat similar to that of the
supernodal technique, but the basic theory and implementation are quite different.
The multifrontal technique factorizes a sparse matrix with a sequence of dense frontal
matrices, each of which corresponds to one or more steps of the LU factorization. We
use the example shown in Fig. 2.3 to demonstrate the basic idea of the multifrontal
method. The first pivot, say element (1, 1), is selected, and then the first frontal
matrix is constructed by collecting all the nonzero elements that will contribute to
the elimination of the first pivot row and column by the right-looking algorithm, as
shown in Fig. 2.3b. The frontal matrix is then factorized by a dense right-looking-like
pivoting operation, resulting in the factorized frontal matrix shown in Fig. 2.3c. As
can be seen, the computations of the frontal matrix can be done by dense kernels
such as BLAS so the performance can be enhanced. After eliminating the first pivot,
the second pivot, say element (3, 2), is selected. A new frontal matrix is constructed
by collecting all the contributing elements that are from the original matrix and
the previous frontal matrix, as shown in Fig. 2.3d. It is then also factorized and the

(a) 1 2 3 4 5 6 7 (b) 1 4 5 (c) 1 4 5


1 1 X X X 1 U U U
2 2 X 2 L X X
3 3 X 3 L X X
4 4 X 4 L X X
5 7 X 7 L X X
6 First pivot: First pivot:
7 before after
Matrix A factorization factorization

(d) 2 3 4 5 7 (e) 2 3 4 5 7
3 X X X X X 3 U U U U U
4 X X X X X 4 L U U U U
5 X X 5 L L X X X
7 X 7 L L X X X
Second pivot: Second pivot:
before factorization after factorization

Fig. 2.3 Illustration of the multifrontal method [23]

resulting frontal matrix is shown in Fig. 2.3e. The same procedure will be continued
until the LU factors are complete. The multifrontal technique can also be combined
with the supernodal technique to further improve the performance by simultaneously
processing multiple frontal matrices with the identical pattern.
There are several levels of parallelism in the multifrontal algorithm [14]. First, one can use the ET to schedule the computational tasks, such that independent frontal matrices can be processed concurrently; this is a task-level parallelism. Second, if a frontal matrix is large, it can be factorized by a parallel BLAS; this is a data-level parallelism. Third, the factorization of the dense node at the root position of the ET can be performed by a parallel LAPACK.
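The first, task-level form of parallelism can be expressed directly as a tree traversal, as in the hedged OpenMP sketch below; process_front() is a hypothetical hook standing in for the dense partial factorization of one frontal matrix, and the first-child/next-sibling representation of the tree is an assumption made for brevity.

#include <omp.h>

/* Sketch of task-level parallelism guided by the elimination tree: a node's
 * frontal matrix can be processed only after all of its children, so each
 * node spawns tasks for its children and waits for them. A single root is
 * assumed; a forest would loop over all roots.                            */
void process_front(int node);                 /* assumed per-node work      */

typedef struct { int *first_child, *next_sibling; } etree_t;

static void visit(const etree_t *t, int node)
{
    for (int c = t->first_child[node]; c != -1; c = t->next_sibling[c]) {
        #pragma omp task firstprivate(c)
        visit(t, c);                          /* independent subtrees       */
    }
    #pragma omp taskwait                      /* all children must finish   */
    process_front(node);                      /* then factorize this front  */
}

void parallel_multifrontal(const etree_t *t, int root)
{
    #pragma omp parallel
    {
        #pragma omp single
        visit(t, root);
    }
}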
Many software packages are based on the multifrontal technique. UMFPACK [8] is an implementation of the multifrontal method to solve sparse linear systems. Although the solver itself is purely sequential, its parallelism can be simply exploited by invoking a parallel BLAS. MUMPS [13–15] is a multifrontal-based distributed sparse direct solver. WSMP [16] is a collection of various algorithms to solve sparse linear systems that can be executed both sequentially and in parallel. For sparse unsymmetric matrices, it adopts the multifrontal algorithm.

2.1.1.3 Non-Submatrix-Based Methods

Unlike the supernodal or multifrontal algorithms, this category of methods does not
form any dense submatrices during sparse LU factorization. A representative solver
is KLU [9], which is an improved implementation of the G-P sparse left-looking
algorithm [20]. As circuit matrices are generally extremely sparse, it is difficult
to form large dense submatrices during sparse LU factorization, and, thus, this type
of solver is considered to be more suitable for circuit simulation applications, as
supported by the test results of KLU.
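To make the left-looking idea behind the G-P algorithm (and hence KLU) concrete, the dense toy sketch below computes column k of the LU factors from a triangular solve against the already finished columns; the real sparse algorithm performs the same solve sparsely, preceded by a symbolic depth-first search that predicts the nonzero pattern of the solution. This is only an illustration, not KLU's implementation, and it omits partial pivoting and the block triangular pre-ordering that KLU applies.

```python
import numpy as np

def left_looking_lu(A):
    """Dense toy version of the left-looking column LU algorithm
    (no pivoting; for illustration only)."""
    n = A.shape[0]
    L = np.eye(n)            # unit lower triangular, filled column by column
    U = np.zeros((n, n))
    for k in range(n):
        # Solve L * x = A(:, k); only the already-computed columns of L
        # matter, because the remaining columns are still identity columns.
        x = np.linalg.solve(L, A[:, k])
        U[:k + 1, k] = x[:k + 1]
        if k + 1 < n:
            L[k + 1:, k] = x[k + 1:] / U[k, k]
    return L, U
```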
A multi-granularity parallel LU factorization algorithm has been proposed in [24].
However, it can only be applied to symmetric matrices. Actually, in nonlinear circuit
simulation, the matrix is usually unsymmetric, so symmetric LU factorization is
useless. In addition, for symmetric matrices, Cholesky factorization [25] is about
twice as efficient as LU factorization.
ShyLU [26], developed by the Sandia National Laboratory, is a two-level hybrid
sparse linear solver. The first level hybrid comes from the combined direct and
iterative algorithms. The matrix is partitioned into four blocks, i.e.,

$$A = \begin{pmatrix} D & C \\ R & G \end{pmatrix}, \qquad (2.1)$$

where D and G are square and D is a non-singular block-diagonal matrix. D can


be easily factorized by a sparse LU factorization, and then an approximate Schur
complement [27] is calculated, i.e.,

Algorithm 1 Algorithm of ShyLU [26].

1: Factorize D by sparse LU factorization
2: Compute the approximate Schur complement:
   $\tilde{S} \approx G - R D^{-1} C$
3: Solve
   $D z = b_1$
4: Solve
   $S x_2 = b_2 - R z$
   using iterative methods, where $\tilde{S}$ is used as the pre-conditioner. $S$ is the exact Schur
   complement, but it does not need to be explicitly formed
5: Solve
   $D x_1 = b_1 - C x_2$

$$\tilde{S} \approx G - R D^{-1} C. \qquad (2.2)$$

The approximate Schur complement serves as a pre-conditioner to solve the linear


equations corresponding to the right-bottom block using iterative methods. For a
linear system
$$\begin{pmatrix} D & C \\ R & G \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \end{pmatrix} = \begin{pmatrix} b_1 \\ b_2 \end{pmatrix}, \qquad (2.3)$$

ShyLU solves it using the algorithm shown in Algorithm 1. The second level hybrid
comes from the combined multi-machine and multi-core parallelism. ShyLU has also
been tested in SPICE-like circuit simulation. According to very limited results [28],
the performance of ShyLU in circuit simulation, especially the speedup over KLU,
is not so remarkable (the speedup over KLU is only about 20× using 256 cores for a
particular circuit).
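A compact sketch of Algorithm 1 using SciPy is shown below. Here `S_approx` stands for the approximate Schur complement of Eq. (2.2) (how its entries are dropped or assembled is not shown), the exact Schur complement is applied only implicitly inside GMRES, and all names are illustrative rather than ShyLU's actual API.

```python
import numpy as np
import scipy.sparse.linalg as spla

def schur_hybrid_solve(D, C, R, G, b1, b2, S_approx):
    """Sketch of the direct/iterative hybrid solve of Algorithm 1."""
    lu_D = spla.splu(D.tocsc())                           # step 1: factorize D
    prec = spla.LinearOperator(G.shape,
                               matvec=spla.splu(S_approx.tocsc()).solve)

    # Exact Schur complement applied implicitly: S v = G v - R D^{-1} C v
    S_op = spla.LinearOperator(G.shape,
                               matvec=lambda v: G @ v - R @ lu_D.solve(C @ v))

    z = lu_D.solve(b1)                                    # step 3
    x2, info = spla.gmres(S_op, b2 - R @ z, M=prec)       # step 4
    x1 = lu_D.solve(b1 - C @ x2)                          # step 5
    return np.concatenate([x1, x2]), info
```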
Until now, very few sparse linear solvers have been specially designed for circuit
simulation applications, and very few public results of sparse linear solvers are reported
for circuit matrices. We believe that a comprehensive comparison and investigation
of various sparse linear solver algorithms on circuit matrices from different
applications can provide many new insights and guidelines for the development of
sparse linear solvers for circuit simulation.

2.1.2 Parallel Iterative Matrix Solutions

Compared with direct methods, iterative methods can significantly reduce the mem-
ory requirement as they are executed almost in-place. Iterative methods are also quite
easy to parallelize as the core operation is just sparse matrix-vector multiplication
(SpMV). There are a great number of parallel SpMV implementations on modern
multi-core CPUs, many-core GPUs, and reconfigurable FPGAs [29–34]. However,
very few studies have investigated iterative methods for
solving linear systems in SPICE-like circuit simulation applications. Commercial
general-purpose circuit simulators rarely use iterative methods, mainly due to the
convergence and robustness issues of iterative methods. To improve the convergence,
iterative methods require good pre-conditioners, which should have the following
two properties. First, the pre-conditioner should approximate the matrix very well
to ensure good convergence. Second, the inverse of the pre-conditioner should be
cheap to compute to reduce the runtime of the linear solver. In most cases, we do
not need to explicitly calculate the inverse but the equivalent implicit computations
should also be cheap. For parallel iterative methods, many research efforts have been
carried out on how to build robust pre-conditioners, as iterative methods themselves
are straightforward to parallelize.
An example of a pre-conditioned linear system can be simply expressed as follows:

$$M^{-1} A x = M^{-1} b, \qquad (2.4)$$

where M is the pre-conditioner. M is selected such that solving the linear system
of Eq. (2.4) by iterative methods can converge much faster than solving the original
linear system Ax = b. If M is exactly A, then the coefficient matrix of Eq. (2.4) becomes
the identity matrix, so the system can be trivially solved. However, if we have obtained
the exact $A^{-1}$, this is equivalent to having already solved the original linear system.
In other words, it is unnecessary to compute the exact inverse. On the contrary, the
pre-conditioner should be selected such that it can approximate the matrix as closely
as possible with a very cheap method.
In mathematics, pre-conditioning techniques can be classified into two main
categories: incomplete factorization pre-conditioners and approximate inverse pre-
conditioners [35]. Incomplete factorization tries to find an approximate factorization
of the matrix, i.e.,
$$A \approx \tilde{L} \tilde{U}. \qquad (2.5)$$

Typically, the approximate factors are obtained from LU factorization by dropping


small values under a given threshold. Tradeoffs can be explored between the number
of fill-ins in the approximate factors, i.e., the approximation accuracy, and the compu-
tational cost of the pre-conditioner. The approximate inverse pre-conditioner tries to
calculate a sparse matrix M which minimizes the Frobenius norm of the following
residual:
$$F(M) = \|I - AM\|_F^2. \qquad (2.6)$$

Researchers have proposed some iterative algorithms that can efficiently calculate
the sparse approximate inverse matrix M [35].
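Of the two categories, the incomplete-factorization pre-conditioner of Eq. (2.5) is the easiest to prototype; the following minimal SciPy sketch (a generic example, not one of the circuit-specific pre-conditioners of [36–38]) shows how the drop threshold trades the accuracy of the approximate factors against their fill-in and cost.

```python
import scipy.sparse.linalg as spla

def ilu_gmres(A, b, drop_tol=1e-4, fill_factor=10):
    """GMRES with an incomplete-LU preconditioner (Eq. (2.5))."""
    ilu = spla.spilu(A.tocsc(), drop_tol=drop_tol, fill_factor=fill_factor)
    M = spla.LinearOperator(A.shape, matvec=ilu.solve)  # applies (L~ U~)^{-1}
    return spla.gmres(A, b, M=M)
```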
Based on the theory of these pre-conditioners, a few parallel pre-conditioners have
been developed for circuit simulation problems [36–38]. A common feature of these
early works is that they treat the pre-conditioner and the iterative solver as a black box
and do not utilize any information from circuit simulation.
In SPICE-like circuit simulation, there is another opportunity to apply pre-
conditioners for iterative solvers to solve linear systems. Due to the quadratic con-
vergence of the Newton-Raphson method, the matrix values change slowly during
SPICE iterations, especially when the Newton-Raphson iterations are converging.
This property provides an opportunity to utilize the LU factors of a certain iteration
as a pre-conditioner for subsequent iterations, which are then solved by sequen-
tial or parallel generalized minimal residual (GMRES) methods [39–41]. Compared
with the previous approaches that apply additional pre-conditioners, the computa-
tional cost of the pre-conditioner can almost be ignored in these methods, as com-
putation of the pre-conditioner, i.e., the complete LU factorization, is an inherent
step in circuit simulation. Another advantage is that the pre-conditioner can be used
in multiple iterations if the matrix values change very slowly. However, due to the
sensitivity of iterative methods to matrix values, it is difficult to judge when the
pre-conditioner becomes invalid. To overcome this problem, the nonlinear device models
are piecewise linearized, and once nonlinear devices change their operating regions,
the pre-conditioner needs to be updated [39–41].
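A minimal sketch of this reuse strategy, assuming the complete LU factors of an earlier Newton-Raphson iteration are available as a SciPy `splu` object, is given below; the refactorization policy in the comments is only illustrative.

```python
import scipy.sparse.linalg as spla

def gmres_with_reused_lu(A_new, b, lu_prev, maxiter=50):
    """Use the complete LU factors of an earlier iteration's matrix as the
    GMRES preconditioner for the current (slightly different) matrix."""
    M = spla.LinearOperator(A_new.shape, matvec=lu_prev.solve)
    return spla.gmres(A_new, b, M=M, maxiter=maxiter)

# usage sketch:
#   lu_prev = spla.splu(A_k.tocsc())           # full LU at some iteration k
#   x, info = gmres_with_reused_lu(A_next, b, lu_prev)
#   if info != 0:                               # pre-conditioner is stale
#       lu_prev = spla.splu(A_next.tocsc())     # refactorize
#       x = lu_prev.solve(b)
```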

[Figure 2.4: (a) original nonlinear circuit, (b) linearized circuit, (c) original weighted graph, (d) sparsified weighted graph, (e) support circuit]
Fig. 2.4 Example of support circuit pre-conditioner [42–44]

The above pre-conditioners are purely based on the matrix information, com-
pletely ignoring the circuit-level information. In other words, they are pure matrix-
based methods. Another type of pre-conditioner, named the support circuit pre-conditioner
[42–44], utilizes circuit-level information and is based on the support graph
and graph sparsification theories [45]. The basic idea is to extract a highly sparsified
circuit network, called a support circuit, that is very close to the original cir-
cuit, so that matrix factorization for the support circuit can be done quickly, almost in
linear time, and can serve as the pre-conditioner for GMRES. Figure 2.4 shows
an example of the creation of the support circuit pre-conditioner.
The Sandia National Laboratory has proposed another type of pre-conditioner for
SPICE-like circuit simulation [46]. It first partitions the circuit into several blocks
and then uses the block Jacobi pre-conditioner for the GMRES solver. This approach

fails on some circuits so its applicability in real SPICE-like circuit simulation needs
further investigation.
A common problem with pre-conditioned iterative methods in SPICE-like circuit
simulation is universality. Although existing studies have shown that the
proposed approaches work well for the circuits they have tested, unlike direct
methods, there is no guarantee that these approaches will also work well for any circuit.
All of the existing iterative methods in circuit simulation are essentially ad hoc approaches,
and, hence, greater universality still needs to be explored.

2.2 Domain Decomposition

The concept of domain decomposition has different meanings under various con-
texts. Generally speaking, domain decomposition can be described as a method that
solves a large problem by partitioning the problem into multiple small subprob-
lems and then solving these subproblems separately. From the circuit point of view,
to realize parallel simulation, a natural idea is to partition the circuit into multiple
subcircuits such that each subcircuit can be solved independently, if the boundary
condition is properly formulated at either circuit level or matrix level. Actually,
domain decomposition is widely used in modern parallel circuit simulation tools,
especially in fast SPICE simulation techniques. There are basically several types
of methods in domain decomposition-based parallel simulation techniques: parallel
bordered block-diagonal (BBD)-form matrix solutions, parallel multilevel Newton
methods, parallel Schwarz methods, and parallel waveform relaxation methods.

2.2.1 Parallel BBD-Form Matrix Solutions

This type of method is more of a matrix-level technique than a domain
decomposition technique. However, building a BBD-form matrix requires partitioning
the circuit, and the performance of solving the BBD-form matrix strongly depends on
the quality of the partition, so we put this type of method under domain decomposition
instead of direct parallel methods.
Figure 2.5 illustrates how to create the BBD form by circuit partitioning. The
circuit is partitioned into K non-overlapped subdomains, in which one subdomain
contains all the interface nodes and the other subdomains are subcircuits. After such a
partitioning, the matrix created by MNA naturally has a BBD form, where there are
K − 1 diagonal block matrices $D_1, \ldots, D_{K-1}$, K − 1 bottom-border block matrices
$R_1, \ldots, R_{K-1}$, K − 1 right-border block matrices $C_1, \ldots, C_{K-1}$, and a right-bottom
block matrix G. The diagonal blocks correspond to the internal equations of all the
subcircuits. The border blocks correspond to all the connections between subcircuits
and interface nodes. The right-bottom block corresponds to the internal equations
of interface nodes. LU factorization of a BBD-form matrix is based on the Schur

[Figure 2.5: (a) circuit partitioning into subcircuits 1, ..., K−1 and interface nodes; (b) the resulting BBD-form matrix with diagonal blocks $D_1, \ldots, D_{K-1}$, border blocks $R_k$ and $C_k$, and right-bottom block G]
Fig. 2.5 Illustration of how to create the BBD form by circuit partitioning

Algorithm 2 LU factorization of a BBD-form matrix.

1: Factorize the K − 1 diagonal blocks
   $D_1 = L_1 U_1, \ldots, D_{K-1} = L_{K-1} U_{K-1}$
2: Update the K − 1 bottom-border blocks
   $R_1 = R_1 U_1^{-1}, \ldots, R_{K-1} = R_{K-1} U_{K-1}^{-1}$
3: Update the K − 1 right-border blocks
   $C_1 = L_1^{-1} C_1, \ldots, C_{K-1} = L_{K-1}^{-1} C_{K-1}$
4: Accumulate updates to the right-bottom block
   $G = G - \sum_{k=1}^{K-1} R_k C_k$
5: Factorize the right-bottom block
   $G = L_K U_K$

complement theory [27]. The factorization process can be described by Algorithm 2.


There are several opportunities to parallelize the factorization of a BBD-form matrix.
First, factorizations of diagonal blocks are completely independent so they can be
trivially parallelized. In addition, factorization of each diagonal block can also be
parallelized. The same conclusion also holds for the updates to all the border blocks.

Second, accumulation to the right-bottom block can be partially parallelized. Third,


factorization of the right-bottom block can also be parallelized.
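The structure of Algorithm 2 and the independence of the per-subdomain work can be sketched as follows; the thread pool only illustrates that steps 1-3 are independent across subdomains, the right-bottom block is kept dense for simplicity, and the serial accumulation loop marks exactly the place where a real parallel solver needs per-entry synchronization.

```python
import scipy.sparse as sp
import scipy.sparse.linalg as spla
from concurrent.futures import ThreadPoolExecutor

def factorize_bbd(D_blocks, R_blocks, C_blocks, G, nthreads=4):
    """Sketch of Algorithm 2 for a BBD-form matrix.

    D_blocks/R_blocks/C_blocks are lists of the K-1 sparse diagonal,
    bottom-border and right-border blocks; G is the dense right-bottom
    block (a numpy array).
    """
    def per_subdomain(k):
        lu = spla.splu(D_blocks[k].tocsc())              # step 1: Dk = Lk Uk
        # Steps 2-3 are kept implicit: only the product Rk Dk^{-1} Ck is
        # needed here for the update of the right-bottom block.
        return lu, R_blocks[k] @ lu.solve(C_blocks[k].toarray())

    with ThreadPoolExecutor(max_workers=nthreads) as pool:
        results = list(pool.map(per_subdomain, range(len(D_blocks))))

    S = G.copy()                                         # step 4: accumulate
    for _, update in results:                            # (serial here; a real
        S -= update                                      #  solver needs locks)

    lu_G = spla.splu(sp.csc_matrix(S))                   # step 5
    return [lu for lu, _ in results], lu_G
```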
Constructing the BBD-form matrix can be achieved by either matrix-level meth-
ods or circuit-level methods. Existing approaches often create the BBD-form matrix
by partitioning the circuit [47–50], although pure matrix-level methods also exist [51–
53]. From the circuit design point of view, large circuits are usually designed hier-
archically and in a structured manner. This greatly helps reduce the difficulty of partitioning
the circuit. In fact, matrix-level methods create the BBD-form matrix by creating
a network based on the symbolic pattern of the matrix, and then partitioning the
network into subdomains.
A few practical issues should be considered when implementing parallel BBD-
form matrix solutions. First, the right-bottom block can be a severe bottleneck in the
parallel solver, as it can be quite dense and dominate the overall computational time.
In fact, not only can factorizing the right-bottom block be expensive, but accumulating
updates to that block can also be time-consuming. The reason is that the accumulation
cannot be efficiently parallelized, as multiple different submatrices may update the
same position, which requires a lock for each nonzero element in the right-bottom
block. As the size of the right-bottom block depends on the number of interface
nodes, a high-quality partitioning is required. Second, load balance is a big problem.
As can be seen, the size of each diagonal block depends on the size of each subcircuit
after circuit partitioning. In practice, the sizes of different circuit modules can vary
greatly, so it is difficult to obtain equal-sized subcircuits. If we force a partition
with equal-sized subcircuits, the number of cut-off links will be significantly large,
i.e., the right-bottom block will be large. One solution is to partition the circuit into
a number of small subcircuits such that load balance can be achieved by dynamic
scheduling. However, this method implies that the circuit is quite large and has many
submodules, which is not always true in practice.

2.2.2 Parallel Multilevel Newton Methods

The above BBD-form matrix solutions are still matrix-level approaches rather than real
circuit-level approaches. The idea can also be extended to solving nonlinear equa-
tions through the concept of the multilevel Newton technique [54–57]. Multilevel Newton
methods are actually algorithm-level methods, but they operate at the circuit
level.
The basic idea of multilevel Newton methods can be described as follows. Each
subdomain is first solved separately using the Newton-Raphson method with a given
boundary condition, and then the top-level nonlinear equation is solved by integrating
the updated solutions from all the subdomains. The two levels of Newton-Raphson
iterations are repeated until all the boundary conditions converge. Multilevel
Newton methods can be formulated as follows. After the circuit is partitioned into
K subdomains in which one subdomain contains the interface nodes, we have K
equations to describe the whole system

$$f_i(x_i, u) = 0, \quad i = 1, 2, \ldots, K-1, \qquad (2.7)$$

$$g(x_1, x_2, \ldots, x_{K-1}, u) = 0, \qquad (2.8)$$

where $x_i$ is the unknown vector of subdomain i (i = 1, 2, ..., K − 1), and u is the
unknown vector corresponding to the interface nodes. Equation (2.7) is the local non-
linear equation of subdomain i, and Eq. (2.8) corresponds to the top-level nonlinear
equation. Equations (2.7) and (2.8) are solved hierarchically by multilevel Newton
methods. First, an inner Newton-Raphson iteration loop is performed to solve each
local equation Eq. (2.7) under a fixed boundary condition u until convergence. After
that, an outer Newton-Raphson iteration loop is performed to solve the top-level
global equation Eq. (2.8) based on the solutions received from all the local equa-
tions. The two levels of Newton-Raphson iteration loops are repeated until all
the solutions, i.e., $x_i$ and u, converge.
In addition to being easy to parallelize, multilevel Newton methods have another
unique advantage. In general cases, the quadratic convergence property of the
Newton-Raphson method is still retained in multilevel Newton methods. Meanwhile,
the overall computational cost can be significantly reduced, as the Newton-Raphson
method for each subcircuit converges quickly due to the small size of each subcircuit.
Consequently, the performance improvement of parallel multilevel Newton methods
comes from two aspects: one is the improved fundamental algorithm and the other
is the parallelism.
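A structural sketch of this two-level iteration is given below. The subdomain interface `solve_local`, as well as the top-level functions `g` and `g_jacobian`, are assumptions introduced for illustration; in a real simulator the interface Jacobian would be assembled from subdomain sensitivities.

```python
import numpy as np

def multilevel_newton(subdomains, g, g_jacobian, u0, tol=1e-9, max_outer=50):
    """Sketch of the two-level Newton iteration of Eqs. (2.7)-(2.8).

    Each entry of `subdomains` is assumed to expose solve_local(u), which
    runs the inner Newton-Raphson loop on f_i(x_i, u) = 0 for a fixed
    boundary vector u and returns x_i; these calls are mutually independent
    and can be executed in parallel. g(xs, u) and g_jacobian(xs, u) describe
    the top-level interface equation (2.8). All names are illustrative.
    """
    u = np.asarray(u0, dtype=float).copy()
    for _ in range(max_outer):
        # inner loops: solve every subdomain under the fixed boundary u
        xs = [sd.solve_local(u) for sd in subdomains]    # parallelizable

        # outer loop: one Newton update of the interface unknowns
        r = g(xs, u)
        if np.linalg.norm(r) < tol:
            return xs, u
        u = u - np.linalg.solve(g_jacobian(xs, u), r)
    raise RuntimeError("multilevel Newton iteration did not converge")
```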

2.2.3 Parallel Schwarz Methods

The above parallel BBD-form matrix solutions and parallel multilevel Newton meth-
ods are both master-slave approaches, in which the master may be a severe bottle-
neck. To resolve this bottleneck, Schwarz methods can be adopted [58]. Different
from the above nonoverlapping partition methods, in Schwarz methods, the circuit
is partitioned into multiple overlapped subdomains.
A parallel simulation approach using the Schwarz alternating procedure has been
proposed in [59, 60]. A circuit can be partitioned into K − 1 nonlinear subdomains
$\Omega_1, \Omega_2, \ldots, \Omega_{K-1}$ and a linear subdomain $\Omega_K$. This is equivalent to partitioning the
matrix A into K − 1 overlapped submatrices $A_1, A_2, \ldots, A_{K-1}$ corresponding to all
the nonlinear subdomains and a background matrix $A_K$ corresponding to the overlaps
of subdomains $\Omega_1, \Omega_2, \ldots, \Omega_{K-1}$ and the linear subdomain $\Omega_K$, as illustrated in
Fig. 2.6. After partitioning, the linear systems arising during SPICE simulation are solved
by the Schwarz alternating procedure, as shown in Algorithm 3, in which all the subdomains
are solved in parallel.
Compared with parallel BBD-form matrix solutions and parallel multilevel New-
ton methods, the main advantage of parallel Schwarz methods is that they do not
belong to the master-slave parallelization framework; they only involve
point-to-point communications, potentially resulting in better

[Figure 2.6: (a) overlapped circuit partitioning, (b) the corresponding matrix partitioning into overlapped submatrices $A_1, \ldots, A_K$]
Fig. 2.6 Illustration of overlapped circuit partitioning and its corresponding matrix partitioning

Algorithm 3 Schwarz alternating procedure [59, 60].

1: Choose initial guess of the solution x
2: Calculate the residual
   $r = b - Ax$
3: repeat
4:   for k = 1 to K in parallel do
5:     Solve
       $A_k \delta_k = r_k$
6:     Update solution
       $x_k = x_k + \delta_k$
7:     Update residuals on boundary
8:   end for
9: until all boundary conditions are converged

parallel scalability, as the bottleneck of the master is avoided. Since Schwarz meth-
ods belong to the category of iterative methods, they suffer from the convergence
problem. A general conclusion is that the convergence speed can be significantly
improved by increasing the overlapping areas. However, increasing the overlaps leads to
higher computational cost.
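For reference, a small SciPy sketch of the subdomain solves in Algorithm 3 is shown below; the sweep is written sequentially for clarity, whereas the parallel variant of Algorithm 3 solves all overlapped subdomains concurrently and only exchanges boundary residuals.

```python
import numpy as np
import scipy.sparse.linalg as spla

def schwarz_solve(A, b, subdomain_idx, max_iter=100, tol=1e-8):
    """Sketch of Algorithm 3 for a sparse matrix A partitioned into
    overlapped subdomains; subdomain_idx[k] holds the unknown indices of
    subdomain k (index sets may overlap)."""
    x = np.zeros_like(b, dtype=float)
    # factorize each overlapped submatrix A_k once
    lus = [spla.splu(A[idx, :][:, idx].tocsc()) for idx in subdomain_idx]

    for _ in range(max_iter):
        for idx, lu in zip(subdomain_idx, lus):
            r_local = b[idx] - (A @ x)[idx]    # residual on subdomain k
            x[idx] += lu.solve(r_local)        # local correction delta_k
        if np.linalg.norm(b - A @ x) <= tol * np.linalg.norm(b):
            break
    return x
```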

2.2.4 Parallel Relaxation Methods

Relaxation techniques have been developed to solve linear or nonlinear equations
in a variety of areas. In the circuit simulation area, there is a large body of research
on parallel simulation using relaxation methods. Relaxation can
be applied to three types of equations: linear equations, nonlinear equations, and

differential equations. Remember that we have presented several fundamental equa-


tions of SPICE-like circuit simulation in Sect. 1.1.1. Relaxation for linear equations
is applied to Eq. (1.5). Typical algorithms include the Gauss-Jacobi method and
the Gauss-Seidel method. They are iterative methods so they both suffer from the
convergence problem. The linear relaxation methods can also be extended to non-
linear equations, e.g., Eq. (1.3). For both linear and nonlinear relaxation methods,
the convergence speed is linear. In circuit simulation, a large portion of the efforts
is focused on relaxation for differential equations. This leads to a type of method
called waveform relaxation [61–69], which solves the circuit differential equation
(i.e., Eq. (1.1)) in a given time interval by relaxation techniques.
We briefly explain waveform relaxation in a mathematical form. Equation (1.1)
can be rewritten into a different form
$$\frac{dq(x)}{dx^T} \frac{dx(t)}{dt} + f(x(t)) - u(t) = 0. \qquad (2.9)$$

Let $C(x(t))$ be $\frac{dq(x)}{dx^T}$ and $b(u(t), x(t))$ be $f(x(t)) - u(t)$, then Eq. (2.9) is further
rewritten into the following form

$$C(x(t)) \frac{dx(t)}{dt} + b(u(t), x(t)) = 0. \qquad (2.10)$$
If we use the Gauss-Seidel method to solve Eq. (2.10), it results in the following
linear system

$$\sum_{j=1}^{i} C_{ij}\left(x_1^{(k+1)}, \ldots, x_i^{(k+1)}, x_{i+1}^{(k)}, \ldots, x_N^{(k)}\right) \frac{dx_j^{(k+1)}}{dt}
+ \sum_{j=i+1}^{N} C_{ij}\left(x_1^{(k+1)}, \ldots, x_i^{(k+1)}, x_{i+1}^{(k)}, \ldots, x_N^{(k)}\right) \frac{dx_j^{(k)}}{dt}
+ b_i\left(x_1^{(k+1)}, \ldots, x_i^{(k+1)}, x_{i+1}^{(k)}, \ldots, x_N^{(k)}\right) = 0, \quad i = 1, 2, \ldots, N \qquad (2.11)$$

where the superscript is the iteration count. Waveform relaxation solves the circuit
DAE Eq. (1.1) in a given time interval by iterating Eq. (2.11) until the solution
converges.
To enable parallel waveform relaxation, one also needs to partition the circuit into
subcircuits, while the interactions between subcircuits can be approximated by proper
devices, e.g., artificial sources. A DAE is built for each subcircuit and then solved
by waveform relaxation based on Eq. (2.11). When solving a subcircuit, interactions
from other subcircuits are considered, and the latest solutions of the interacting subcircuits
are always used. As can be seen, parallel waveform relaxation combines both domain
decomposition and time-domain parallelism.
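The sketch below shows the idea on a linear test system C dx/dt + Gx = u(t), using the Gauss-Jacobi variant (trivially parallel over the unknowns) rather than the Gauss-Seidel form of Eq. (2.11), and assuming a diagonal C, i.e., a grounded capacitor at every node; it is an illustration only, not a production waveform-relaxation engine.

```python
import numpy as np

def waveform_relaxation(C, G, u, x0, t, n_sweeps=20):
    """Gauss-Jacobi waveform relaxation for the linear test DAE
    C dx/dt + G x = u(t) on the time grid t, with diagonal C.

    Each unknown plays the role of a one-node "subcircuit"; its whole
    waveform is integrated with backward Euler using the previous sweep's
    waveforms of the other unknowns, so the loop over i is trivially
    parallel. The Gauss-Seidel form of Eq. (2.11) would instead use the
    already updated waveforms of x_1, ..., x_{i-1} within the sweep.
    """
    N, T = len(x0), len(t)
    X = np.tile(np.asarray(x0, dtype=float)[:, None], (1, T))
    for _ in range(n_sweeps):
        X_old = X.copy()
        for i in range(N):                                 # parallel over i
            for n in range(1, T):
                h = t[n] - t[n - 1]
                # coupling from the other unknowns, taken from the last sweep
                coupling = G[i, :] @ X_old[:, n] - G[i, i] * X_old[i, n]
                rhs = C[i, i] * X[i, n - 1] / h + u(t[n])[i] - coupling
                X[i, n] = rhs / (C[i, i] / h + G[i, i])
    return X
```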
Although waveform relaxation has been widely studied since the 1980s, it is
actually not widely used in practical circuit simulators today. The reasons mainly
include the convergence conditions and limitations of waveform relaxation. As wave-
form relaxation is an iterative method, convergence is always a problem. A necessary

condition for Eq. (2.10) to have a unique solution is that the matrix $C(x(t))^{-1}$
exists. This also implies that there must be a grounded capacitor at each node. Such
a requirement cannot always be satisfied for actual circuits, especially for pre-layout
circuits. In addition, waveform relaxation also requires that one node of each inde-
pendent voltage source or inductor be the ground, which also restricts the
applicability of waveform relaxation.

2.3 Parallel Time-Domain Simulation

Except for the relaxation methods, most of the above-mentioned methods share a common
point: parallelism is explored at each time node. If we shift our focus to the
whole time axis in transient simulation, parallelism can also be explored in the
time domain by many other techniques. Namely, different time nodes in the time
domain may be computed concurrently by either parallel integration algorithms or
multiple algorithms calculating different time nodes. As mentioned in Sect. 1.1.3, the
DAE associated with transient simulation is usually solved by numerical integration
algorithms. Numerical integration algorithms are typically completely sequential at
the time node level, as a node can be computed only when one or more previous
nodes are finished. To explore parallelism in the time domain, one needs to carefully
resolve this problem.

2.3.1 Parallel Numerical Integration Algorithms

To explore parallelism along the time axis in SPICE-like transient simulation,


WavePipe has been proposed [70]. WavePipe enables the simultaneous computation
of multiple time nodes by two novel techniques: backward pipelining and forward
pipelining.

2.3.1.1 Backward Pipelining

An illustration of the backward pipelining is shown in Fig. 2.7. Consider a two-step


numerical integration method. Using the solutions at time nodes t1 and t2 as the initial
conditions, a thread can calculate the solution at t3 . To enable backward pipelining,
at the same time, a second thread can calculate the solution at t3' which is smaller than
t3, using the solutions at t1 and t2 as the initial conditions as well. One may argue that
the solution at t3' is useless because t3 is always beyond t3' due to the use of the latest
solutions, which means that t3' does not contribute to a faster calculation. However, the
calculation of t3' is actually useful for parallel simulation. Recall that the time step of
numerical integration methods is determined by the LTE of the previous integration
step. Due to the existence of t3', the first thread, which will calculate the solution at

[Figure 2.7: two threads; thread 1 computes t3 and then t4, thread 2 computes the backward steps t3' and t4']
Fig. 2.7 Backward pipelining of WavePipe [70]

a new time node using the solutions at t3 and t3' as the initial conditions, can move
forward by a larger time step to t4, compared with a sequential integration method
that uses the solutions at t3 and t2 as the initial conditions, due to the reduced LTE.
At the same time, the second thread calculates the solution at t4' which is smaller than
t4. The calculations of t3' and t4' are called backward steps. As can be seen, backward
pipelining results in larger time steps and thus accelerates transient simulation along the
time axis. The basic principle behind backward pipelining is that it provides better
initial conditions so that the integration time step can be larger.

2.3.1.2 Forward Pipelining

Forward pipelining operates in a different way than backward pipelining. As illus-


trated in Fig. 2.8, a thread is calculating the solution at t3 using the solutions at t1
and t2 as the initial conditions, and a second thread is attempting to calculate the
solution at t4 which is beyond t3 . The problem is that, if the second thread also uses
the solutions at t2 and t1 as the initial conditions, the calculated solution at t4 is unsta-
ble if the maximum step size is already exhausted at t3 . In the forward pipelining
approach, the second thread uses the solutions at t3 and t2 as the initial conditions
to calculate the solutions at t4 . Obviously, t3 is still under calculation so its final
solution is not available at this time. Recall that the Newton-Raphson method con-
verges quadratically, and in SPICE-like transient simulation, only a few
Newton-Raphson iterations are required to achieve convergence at each time node. When t3 is
under calculation and its intermediate solution does not satisfy the LTE tolerance,
the solution should be close to the final solution. Hence, the second thread can use
the intermediate solution of t3 as the initial condition to calculate the solution at

[Figure 2.8: thread 1 computes t3 from t1 and t2; thread 2 computes the forward step t4 using the intermediate solution at t3]
Fig. 2.8 Forward pipelining of WavePipe [70]

t4. The penalty is the increased number of Newton-Raphson iterations at t4, due to
the inaccurate initial conditions. The calculation of t4 is called a forward step. The
authors of WavePipe have proposed how to predict the time step and maintain the
accuracy and stability. In addition, backward pipelining and forward pipelining can
be combined together by a carefully designed thread scheduling policy.
There is no doubt that WavePipe has provided new insights into the development of
parallel time-domain simulation techniques, and the method can also be applied
to other problems that require solving differential equations. However, it can be
expected that WavePipe requires fine-grained inter-thread communications, so the
scalability will become poor when the number of threads increases.

2.3.2 Parallel Multi-Algorithm Simulation

A completely different parallel time-domain simulation technique named multi-
algorithm simulation has been proposed in recent years [71–74]. Different from all the
other parallel simulation techniques, which explore intra-algorithm parallelism, i.e.,
parallel computing is only applied within a single algorithm, multi-algorithm simulation
other parallel simulation techniques which explore intra-algorithm parallelism, i.e.,
parallel computing is only applied in a single algorithm, multi-algorithm simulation
explores parallelism between different algorithms. The starting point of this method
is the applicability of different integration algorithms to different circuit behaviors,
e.g., an algorithm that is suitable for smooth waveforms may not be suitable for
oscillating waveforms. Consequently, using a single algorithm may not always be
the best solution in circuit simulation. Instead, running a pool of algorithms of differ-
ent characteristics can be a better way. The challenge is how to efficiently schedule
multiple algorithms and integrate the solutions of multiple algorithms together on
the fly.
Figure 2.9 shows the general framework of parallel multi-algorithm simulation.
To explore parallelism between algorithms, n different algorithms are running

[Figure 2.9: a circuit solution vector holding the K latest time nodes between t_tail and t_head, protected by a lock and shared by Algorithm 1 through Algorithm n]
Fig. 2.9 Framework of parallel multi-algorithm simulation [71–74]



independently in parallel to process the same simulation task. Each algorithm main-
tains a complete SPICE context including the sparse direct solver, device model
evaluation, numerical integration method, Newton-Raphson iterations, etc. Due to
the different characteristics of these algorithms, their speeds along the time axis are also
different. The high performance is achieved by an algorithm selection strategy. To
synchronize the solutions of these algorithms, a solution vector containing the K latest
time nodes is maintained. Let t_head and t_tail be the first and last time nodes of the
solution vector. As the vector is global and can be accessed by all the algorithms, a
lock is required when an algorithm attempts to access it. The update strategy for the
solution vector can be described as follows. Once an algorithm finishes solving one
time node, it accesses the solution vector by acquiring the lock. If the current time
node, say t_alg, is beyond t_head, then t_head is updated to t_alg and t_tail is moved forward
by one node. If t_alg is between t_tail and t_head, then t_alg is inserted and t_tail is still moved
forward by one node. However, if t_alg is behind t_tail, it indicates that this algorithm
is too slow, so the current solution is discarded, and then the current algorithm picks
up the latest time node in the solution vector, i.e., t_head, to calculate the next time
node. Additionally, before each algorithm starts to calculate the next new time node,
it also checks the solution vector to load the latest time node. Such a scheduling and
update policy implies an algorithm selection strategy that always selects the fastest
algorithms at any time, so that the speedups over a single-algorithm-based simulation
can be far beyond the number of used cores.
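The update policy of the shared solution vector can be sketched as follows; the class and method names are illustrative and are not taken from [71–74].

```python
import threading

class SharedSolutionWindow:
    """Sketch of the K-latest-time-node solution vector shared by all
    algorithms in multi-algorithm simulation (names are illustrative)."""

    def __init__(self, K):
        self.K = K
        self.nodes = []                 # [(time, solution)] sorted by time
        self.lock = threading.Lock()

    def submit(self, t_alg, x_alg):
        """Called by an algorithm after it solves time node t_alg.
        Returns the latest accepted node the caller should continue from."""
        with self.lock:
            t_tail = self.nodes[0][0] if self.nodes else -float("inf")
            if t_alg > t_tail:
                # insert in time order (covers both "beyond t_head" and
                # "between t_tail and t_head"); drop the oldest node
                self.nodes.append((t_alg, x_alg))
                self.nodes.sort(key=lambda p: p[0])
                if len(self.nodes) > self.K:
                    self.nodes.pop(0)
            # if t_alg <= t_tail the solution is discarded: this algorithm
            # is too slow and must catch up from the head node
            return self.nodes[-1]       # (t_head, x_head)
```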

2.3.3 Time-Domain Partitioning

Different from the above two approaches, a cruder method to implement parallel time-
domain simulation is to directly partition the time domain, such that each segment of
the time domain can be computed in parallel [75]. The major problem is that the initial
solution of each segment, which is necessary in any numerical integration method,
is unknown. However, considering the fact that many actual circuits have a stationary
operating state, with different initial solutions the circuit will eventually settle into the
stationary state, so the response will finally converge. This fact enables us to simulate
the time-domain response in parallel by partitioning the time domain into multiple
segments. The initial solution of each segment is selected as the DC operating point.
Of course, the waveform obtained by this method has errors. However, if we only need
to calculate some high-level or frequency-domain factors of analog circuits, such as
the signal to noise-plus-distortion ratio, this method can be applied, because a small
error in the waveform does not affect the frequency-domain response. Experimental
results show that this method can accelerate analog circuit simulations by more than
50× using 100 cores. However, such a method is not a unified approach,
and it can only be applied to special simulations of special circuits.

2.3.4 Matrix Exponential Methods

The matrix exponential method [76] is another approach to solve the circuit DAE
expressed as Eq. (1.1). Unlike conventional numerical integration methods such as
the backward Euler method or the trapezoid method [77] which are implicit, the
matrix exponential method is explicit but also A-stable [78].
For the circuit DAE expressed as Eq. (1.1), the matrix exponential method says
that its solution within the time interval $[t_n, t_{n+1}]$ can be written in the following
form [79]:

$$x(t_{n+1}) = e^{(t_{n+1}-t_n)J(x(t_n))} x(t_n) + \int_0^{t_{n+1}-t_n} e^{(t_{n+1}-t_n-\tau)J(x(t_n))} C^{-1}(x(t_n)) \left( -f(x(t_n+\tau)) + u(t_n+\tau) \right) d\tau, \qquad (2.12)$$
where C(x(tn )) is a matrix of capacitances and inductances linearized at tn . If we
assume that the charges in nonlinear elements behave linearly within the time interval
[tn , tn+1 ], then the integration can be approximately calculated and the second-order
implicit approximation is of the following form:

$$\begin{aligned}
x(t_{n+1}) ={}& -\frac{t_{n+1}-t_n}{2} C^{-1}(x(t_{n+1})) f(x(t_{n+1})) \\
&+ e^{(t_{n+1}-t_n)J(x(t_n))} \left( x(t_n) - \frac{t_{n+1}-t_n}{2} C^{-1}(x(t_n)) f(x(t_n)) \right) \\
&+ \left( e^{(t_{n+1}-t_n)J(x(t_n))} - I \right) J^{-1}(x(t_n)) C^{-1}(x(t_n)) u(t_n) \qquad (2.13) \\
&+ \left( e^{(t_{n+1}-t_n)J(x(t_n))} - (t_{n+1}-t_n) J(x(t_n)) - I \right) J^{-2}(x(t_n)) \\
&\quad \times \frac{C^{-1}(x(t_{n+1})) u(t_{n+1}) - C^{-1}(x(t_n)) u(t_n)}{t_{n+1}-t_n}.
\end{aligned}$$
The cost of computing the matrix exponential $e^{(t_{n+1}-t_n)J(x(t_n))}$ can be reduced using
Krylov subspace methods [80, 81]. Parallelism can be trivially explored in Krylov
subspace methods, as their major operation is just SpMV.
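A generic Arnoldi-based sketch of this Krylov approximation is shown below; it computes exp(hJ)v from a small m-dimensional Hessenberg matrix, so the sparse matrix J is touched only through matrix-vector products.

```python
import numpy as np
from scipy.linalg import expm

def krylov_expm_apply(J, v, h, m=30):
    """Approximate exp(h*J) @ v with an m-dimensional Krylov (Arnoldi)
    subspace, so that only SpMV with J is required [80, 81]."""
    n = len(v)
    V = np.zeros((n, m + 1))
    H = np.zeros((m + 1, m))
    beta = np.linalg.norm(v)
    V[:, 0] = v / beta
    for j in range(m):
        w = J @ V[:, j]                       # the only sparse operation
        for i in range(j + 1):                # modified Gram-Schmidt
            H[i, j] = V[:, i] @ w
            w = w - H[i, j] * V[:, i]
        H[j + 1, j] = np.linalg.norm(w)
        if H[j + 1, j] < 1e-12:               # happy breakdown
            m = j + 1
            break
        V[:, j + 1] = w / H[j + 1, j]
    e1 = np.zeros(m)
    e1[0] = 1.0
    # exp(h*J) v  ~=  beta * V_m * exp(h*H_m) * e1
    return beta * V[:, :m] @ (expm(h * H[:m, :m]) @ e1)
```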
Generally speaking, compared with conventional numerical integration methods,
the matrix exponential method has advantages in performance, accuracy, and
scalability. It has been studied in both nonlinear [82–84] and linear [85, 86] cir-
cuit simulation. However, as a new technique in SPICE-like circuit simulation, its
applicability to general nonlinear circuits, especially to highly stiff systems, still
requires further investigation.

2.4 Hardware Acceleration Techniques

In recent years, with the rapid development of various accelerators such as GPUs and
FPGAs, hardware acceleration techniques are widely used in many areas to accel-
erate scientific computing. Underlying state-of-the-art accelerators provide much
more computing and memory resources than general-purpose CPUs, offering much
higher computing capability and memory bandwidth. However, regardless of the
claimed generality in computing, there are some architectural limitations that must
be dealt with when developing general-purpose applications such as circuit simu-
lation. GPUs and FPGAs have been investigated to accelerate SPICE-like circuit
simulation recently. Existing studies are mainly focused on accelerating device
model evaluation and the sparse direct solver.

2.4.1 GPU Acceleration

GPUs, known as graphics processors, have been extended to general-purpose com-
puting since about 10 years ago. Programming languages including the famous
compute unified device architecture (CUDA) [87] and open computing language
(OpenCL) [88] have also been developed to help users easily program GPUs. GPUs
compute unified device architecture (CUDA) [87] and open computing language
(OpenCL) [88] have also been developed to help users easily program GPUs. GPUs
offer massive thread-level parallelism by integrating thousands of cores in a sin-
gle processor. Modern GPUs execute programs in a single-instruction-multiple-data
(SIMD) manner. This means that threads are grouped into batches and each batch
executes the same instruction on different data. Parallelism is explored both within one
batch and between multiple batches. By executing thousands of concurrent threads,
the peak performance of high-end GPUs can be one order of magnitude higher than
that of state-of-the-art CPUs.
Accelerating device model evaluation by GPUs is straightforward, as the comput-
ing processes of all the devices with the same model are almost identical, and there
are no inter-model communications. Such a computational pattern can be perfectly
mapped to a GPU's SIMD engine. Speedups of several dozen times can be achieved by GPUs,
compared with CPU-based device model evaluation [89–91].
Different from device model evaluation, porting sparse direct solvers onto GPUs
faces many challenges. As the SIMD-based GPU architecture is designed for highly
regular applications, the irregular computational and memory access patterns involved in
sparse direct solvers can significantly affect the performance of GPUs, and, hence,
they must be well dealt with when implementing sparse direct linear solvers on
GPUs. Like most of the CPU-based sparse direct solvers, general-purpose GPU-
based sparse direct solvers also gather nonzero elements into dense submatrices and
then adopt the CUDA BLAS [92] to solve them [93–101]. Since only small dense
submatrices can be formed in sparse matrix factorization, the overhead associated
with kernel launching and data transfer between the CPU and GPU can be larger than
the computational cost. One can invoke batched BLAS executions on GPUs to avoid

this problem. However, another problem of load imbalance arises, as it is impossible
to form a batch of dense submatrices with identical sizes. Two GPU-based sparse
direct solvers have been proposed for circuit matrices by employing hybrid task-level
and data-level parallelism, which invoke a single kernel to process all computations
without invoking the CUDA BLAS [102–104].
Basically, sparse direct solvers are memory-intensive applications, so the high
computing capability of modern GPUs cannot be fully exploited. Reported results
indicate that most of the existing GPU-based sparse direct solvers achieve only
speedups of a few times compared with single-threaded CPU-based solvers. Such
performance can easily be exceeded by a parallel CPU-based solver. The performance
of sparse direct solvers running on GPUs is mainly restricted by the off-chip memory
bandwidth of GPUs. In contrast, modern CPUs have a large cache which can
significantly reduce the requirement for off-chip memory bandwidth. From this
point of view, it is not a good idea to use present GPUs to accelerate sparse direct
solvers, especially for those solvers designed for extremely sparse circuit matrices.

2.4.2 FPGA Acceleration

FPGAs are known for their reconfigurability, so they combine advantages of both general-purpose
CPUs and application-specific integrated circuits (ASICs). On one hand, FPGAs can
be programmed to different functionalities, so they can also be treated as general-
purpose processors. On the other hand, the performance of FPGAs can be higher
than that of CPUs and close to that of customized ASICs, because an FPGA can be
reconfigured specifically for a parallel algorithm. In recent years, FPGAs have been
widely investigated to accelerate SPICE-like circuit simulation in three respects:
device model evaluation, the sparse direct solver, and the whole flow control.
FPGA-based device model evaluation has been proposed in several studies [105–
108]. FPGA-based sparse direct solvers have been widely studied in the past few
years [109–115]. Some of them are specially targeted at circuit matrices. Due to
the complete reconfigurability, FPGAs can realize very fine-grained parallel LU
factorization at the basic operation level. Figure 2.10 illustrates an example of the
dataflow graph used in [110]. In addition, SPICE iteration control has also been
ported onto FPGAs to achieve more speedups [116–118]. Generally speaking, the
speedups obtained by FPGA-based acceleration techniques are similar to those of
GPUs.
One major shortcoming of FPGA-based sparse direct solvers is the lack of universality. As
an FPGA can be configured to fit a specific matrix, i.e., the FPGA is programmed to
fit a specific symbolic pattern and computational flow, once the symbolic pattern of
the LU factors changes due to different pivot choices, which also leads to a change
of the computational flow, the FPGA needs to be re-programmed. This issue greatly
restricts the practicability of FPGA-based sparse direct solvers.
Almost all of the existing hardware acceleration techniques are experimental. It is
difficult to apply them in practical applications due to the inflexibility and poor

[Figure 2.10: a dataflow graph of division, multiplication, and subtraction operations on individual matrix entries]
Fig. 2.10 Dataflow graph for sparse LU factorization on FPGAs [110]

universality of the hardware platforms. For example, memory reallocation and
dynamic memory management, which are required by the partial pivoting of sparse direct
solvers, are difficult to implement on both GPUs and FPGAs. Another important prob-
lem is that the performance of hardware platforms may strongly depend on the run-
time configurations. For example, the performance of many CUDA programs strongly
lem is that the performance of hardware platforms may strongly depend on the run-
time configurations. For example, performance of many CUDA programs strongly
depends on the number of launched threads. The optimal number of threads, in turn,
depends on the underlying hardware. Such problems require users to have rich knowl-
edge about the GPU architecture and the code to tune the runtime configurations.

References

1. Li, P.: Parallel circuit simulation: a historical perspective and recent developments. Found.
Trends Electron. Des. Autom. 5(4), 211318 (2012)
2. Saleh, R.A., Gallivan, K.A., Chang, M.C., Hajj, I.N., Smart, D., Trick, T.N.: Parallel circuit
simulation on supercomputers. Proc. IEEE 77(12), 19151931 (1989)
3. Li, X.S.: Sparse gaussian elimination on high performance computers. Ph.D. thesis, Computer
Science Division, UC Berkeley, California, US (1996)
4. Li, X.S., Demmel, J.W.: SuperLU_DIST: a scalable Distributed-Memory sparse direct solver
for unsymmetric linear systems. ACM Trans. Math. Softw. 29(2), 110140 (2003)
5. Li, X.S.: An overview of SuperLU: algorithms, implementation, and user interface. ACM
Trans. Math. Softw. 31(3), 302325 (2005)
6. Demmel, J.W., Eisenstat, S.C., Gilbert, J.R., Li, X.S., Liu, J.W.H.: A supernodal approach to
sparse partial pivoting. SIAM J. Matrix Anal. Appl. 20(3), 720755 (1999)
7. Demmel, J.W., Gilbert, J.R., Li, X.S.: An asynchronous parallel supernodal algorithm for
sparse gaussian elimination. SIAM J. Matrix Anal. Appl. 20(4), 915952 (1999)

8. Davis, T.A.: Algorithm 832: UMFPACK V4.3-An Unsymmetric-Pattern multifrontal method.


ACM Trans. Math. Softw. 30(2), 196199 (2004)
9. Davis, T.A., Palamadai Natarajan, E.: Algorithm 907: KLU, a direct sparse solver for circuit
simulation problems. ACM Trans. Math. Softw. 37(3), 36:1–36:17 (2010)
10. Schenk, O., Gärtner, K.: Solving unsymmetric sparse systems of linear equations with PAR-
DISO. Future Gener. Comput. Syst. 20(3), 475–487 (2004)
11. Schenk, O., Gärtner, K., Fichtner, W.: Efficient sparse LU factorization with Left-Right look-
ing strategy on shared memory multiprocessors. BIT Numer. Math. 40(1), 158–176 (2000)
12. Schenk, O., Gärtner, K.: Two-Level dynamic scheduling in PARDISO: improved scalability
on shared memory multiprocessing systems. Parallel Comput. 28(2), 187–197 (2002)
13. Amestoy, P.R., Duff, I.S., L'Excellent, J.Y., Koster, J.: A fully asynchronous multifrontal solver
using distributed dynamic scheduling. SIAM J. Matrix Anal. Appl. 23(1), 15–41 (2001)
14. Amestoy, P.R., Guermouche, A., L'Excellent, J.Y., Pralet, S.: Hybrid scheduling for the parallel
solution of linear systems. Parallel Comput. 32(2), 136–156 (2006)
15. Amestoy, P., Duff, I., L'Excellent, J.Y.: Multifrontal parallel distributed symmetric and unsym-
metric solvers. Comput. Methods Appl. Mech. Eng. 184(2–4), 501–520 (2000)
16. Gupta, A., Joshi, M., Kumar, V.: WSMP: A High-Performance Shared- and Distributed-
Memory parallel sparse linear solver. Technical report, IBM T. J. Watson Research Center
(2001)
17. Dongarra, J.J., Cruz, J.D., Hammerling, S., Duff, I.S.: Algorithm 679: a set of level 3 basic
linear algebra subprograms: model implementation and test programs. ACM Trans. Math.
Softw. 16(1), 1828 (1990)
18. Anderson, E., Bai, Z., Bischof, C., Blackford, S., Demmel, J., Dongarra, J., Du Croz, J.,
Greenbaum, A., Hammarling, S., McKenney, A., Sorensen, D.: LAPACK Users' Guide, 3rd
edn. Society for Industrial and Applied Mathematics, Philadelphia, PA (1999)
19. Liu, J.W.H.: The role of elimination trees in sparse factorization. SIAM J. Matrix Anal. Appl.
11(1), 134172 (1990)
20. Gilbert, J.R., Peierls, T.: Sparse partial pivoting in time proportional to arithmetic operations.
SIAM J. Sci. Statist. Comput. 9(5), 862874 (1988)
21. Gould, N.I.M., Scott, J.A., Hu, Y.: A numerical evaluation of sparse direct solvers for the
solution of large sparse symmetric linear systems of equations. ACM Trans. Math. Softw.
33(2), 132 (2007)
22. Liu, J.W.H.: The multifrontal method for sparse matrix solution: theory and practice. SIAM
Rev. 34(1), 82109 (1992)
23. Zitney, S., Mallya, J., Davis, T., Stadtherr, M.A.: Multifrontal vs. frontal techniques for chemical
process simulation on supercomputers. Comput. Chem. Eng. 20(6–7), 641–646 (1996)
24. Fischer, M., Dirks, H.: Multigranular parallel algorithms for solving linear equations in VLSI
circuit simulation. IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst. 23(5), 728736
(2004)
25. Davis, T.A.: Direct Methods for Sparse Linear Systems, 1st edn. Society for Industrial and
Applied Mathematics, US (2006)
26. Rajamanickam, S., Boman, E., Heroux, M.: ShyLU: A Hybrid-Hybrid solver for multi-
core platforms. In: 2012 IEEE 26th International Parallel Distributed Processing Symposium
(IPDPS), pp. 631643 (2012)
27. Zhang, F.: The Schur Complement and Its Applications. Numerical Methods and Algorithms.
Springer, Berlin, Germany (2005)
28. Thornquist, H.K., Rajamanickam, S.: A hybrid approach for parallel Transistor-Level Full-
Chip circuit simulation. In: International Meeting on High-Performance Computing for Com-
putational Science, pp. 102111 (2015)
29. MehriDehnavi, M., El-Kurdi, Y., Demmel, J., Giannacopoulos, D.: Communication-Avoiding
Krylov techniques on graphic processing units. IEEE Trans. Magn. 49(5), 17491752 (2013)
30. Fowers, J., Ovtcharov, K., Strauss, K., Chung, E.S., Stitt, G.: A High memory bandwidth
FPGA accelerator for sparse Matrix-Vector multiplication. In: 2014 IEEE 22nd Annual
International Symposium on Field-Programmable Custom Computing Machines (FCCM),
pp. 3643 (2014)

31. Tang, W.T., Tan, W.J., Ray, R., Wong, Y.W., Chen, W., Kuo, S.H., Goh, R.S.M., Turner,
S.J., Wong, W.F.: Accelerating sparse matrix-vector multiplication on GPUs using Bit-
Representation-Optimized schemes. In: 2013 SC - International Conference for High Per-
formance Computing, Networking, Storage and Analysis (SC), pp. 1–12 (2013)
32. Greathouse, J.L., Daga, M.: Efficient sparse Matrix-Vector multiplication on GPUs using the
CSR storage format. In: SC14: International Conference for High Performance Computing,
Networking, Storage and Analysis, pp. 769780 (2014)
33. Ashari, A., Sedaghati, N., Eisenlohr, J., Parthasarath, S., Sadayappan, P.: Fast sparse Matrix-
Vector multiplication on GPUs for graph applications. In: SC14: International Conference for
High Performance Computing, Networking, Storage and Analysis, pp. 781792 (2014)
34. Grigoras, P., Burovskiy, P., Hung, E., Luk, W.: Accelerating SpMV on FPGAs by compressing
nonzero values. In: 2015 IEEE 23rd Annual International Symposium on Field-Programmable
Custom Computing Machines (FCCM), pp. 6467 (2015)
35. Saad, Y.: Iterative Methods for Sparse Linear Systems, 2nd edn. Society for Industrial and
Applied Mathematics, Boston, US (2004)
36. Basermann, A., Jaekel, U., Hachiya, K.: Preconditioning parallel sparse iterative solvers for
circuit simulation. In: Proceedings of the 8th SIAM Proceedings on Applied Linear Algebra,
Williamsburg VA (2003)
37. Suda, R.: New iterative linear solvers for parallel circuit simulation. Ph.D. thesis, University
of Tokio (1996)
38. Basermann, A., Jaekel, U., Nordhausen, M., Hachiya, K.: Parallel iterative solvers for sparse
linear systems in circuit simulation. Future Gener. Comput. Syst. 21(8), 12751284 (2005)
39. Li, Z., Shi, C.J.R.: An efficiently preconditioned GMRES method for fast Parasitic-Sensitive
Deep-Submicron VLSI circuit simulation. In: Design, Automation and Test in Europe, Vol.
2, pp. 752757 (2005)
40. Li, Z., Shi, C.J.R.: A Quasi-Newton preconditioned Newton-Krylov method for robust and
efficient Time-Domain simulation of integrated circuits with strong parasitic couplings. Asia
S. Pac. Conf. Des. Autom. 2006, 402407 (2006)
41. Li, Z., Shi, C.J.R.: A Quasi-Newton preconditioned newtonKrylov method for robust and
efficient Time-Domain simulation of integrated circuits with strong parasitic couplings. IEEE
Trans. Comput.-Aided Des. Integr. Circuits Syst. 25(12), 28682881 (2006)
42. Zhao, X., Han, L., Feng, Z.: A Performance-Guided graph sparsification approach to scalable
and robust SPICE-Accurate integrated circuit simulations. IEEE Trans. Comput.-Aided Des.
Integr. Circuits Syst. 34(10), 16391651 (2015)
43. Zhao, X., Feng, Z.: GPSCP: A General-Purpose Support-Circuit preconditioning approach to
Large-Scale SPICE-Accurate nonlinear circuit simulations. In: 2012 IEEE/ACM International
Conference on Computer-Aided Design (ICCAD), pp. 429435 (2012)
44. Zhao, X., Feng, Z.: Towards efficient SPICE-Accurate nonlinear circuit simulation with On-
the-Fly Support-Circuit preconditioners. In: Design Automation Conference (DAC), 2012
49th ACM/EDAC/IEEE, pp. 11191124 (2012)
45. Bern, M., Gilbert, J.R., Hendrickson, B., Nguyen, N., Toledo, S.: Support-Graph precondi-
tioners. SIAM J. Matrix Anal. Appl. 27(4), 930951 (2006)
46. Thornquist, H.K., Keiter, E.R., Hoekstra, R.J., Day, D.M., Boman, E.G.: A parallel precondi-
tioning strategy for efficient Transistor-Level circuit simulation. In: 2009 IEEE/ACM Inter-
national Conference on Computer-Aided DesignDigest of Technical Papers, pp. 410417
(2009)
47. Chan, K.W.: Parallel algorithms for direct solution of large sparse power system matrix equa-
tions. IEE Proc.Gener. Transm. Distrib. 148(6), 615622 (2001)
48. Zecevic, A.I., Siljak, D.D.: Balanced decompositions of sparse systems for multilevel parallel
processing. IEEE Trans. Circuits Syst. I: Fundam. Theory Appl. 41(3), 220233 (1994)
49. Koester, D.P., Ranka, S., Fox, G.C.: Parallel Block-Diagonal-Bordered sparse linear solvers
for electrical power system applications. In: Proceedings of the Scalable Parallel Libraries
Conference, 1993, pp. 195203 (1993)

50. Paul, D., Nakhla, M.S., Achar, R., Nakhla, N.M.: Parallel circuit simulation via binary link
formulations (PvB). IEEE Trans. Compon. Packag. Manuf. Technol. 3(5), 768782 (2013)
51. Hu, Y.F., Maguire, K.C.F., Blake, R.J.: Ordering unsymmetric matrices into bordered block
diagonal form for parallel processing. In: Euro-Par'99 Parallel Processing: 5th International
Euro-Par Conference Toulouse, pp. 295–302 (1999)
52. Aykanat, C., Pinar, A., Çatalyürek, Ü.V.: Permuting sparse rectangular matrices into Block-
Diagonal form. SIAM J. Sci. Comput. 25(6), 1860–1879 (2004)
53. Duff, I.S., Scott, J.A.: Stabilized bordered block diagonal forms for parallel sparse solvers.
Parallel Comput. 31(34), 275289 (2005)
54. Frohlich, N., Riess, B.M., Wever, U.A., Zheng, Q.: A new approach for parallel simulation
of VLSI circuits on a transistor level. IEEE Trans. Circuits Syst. I: Fundam. Theory Appl.
45(6), 601613 (1998)
55. Honkala, M., Roos, J., Valtonen, M.: New multilevel Newton-Raphson method for parallel
circuit simulation. Proc. Eur. Conf. Circuit Theory Des. 1, 113116 (2001)
56. Zhu, Z., Peng, H., Cheng, C.K., Rouz, K., Borah, M., Kuh, E.S.: Two-Stage Newton-Raphson
method for Transistor-Level simulation. IEEE Trans. Comput.-Aided Des. Integr. Circuits
Syst. 26(5), 881895 (2007)
57. Rabbat, N., Sangiovanni-Vincentelli, A., Hsieh, H.: A multilevel newton algorithm with
macromodeling and latency for the analysis of Large-Scale nonlinear circuits in the time
domain. IEEE Trans. Circuits Syst. 26(9), 733741 (1979)
58. Smith, B., Bjorstad, P., Gropp, W.: Domain Decomposition: Parallel Multilevel Methods for
Elliptic Partial Differential Equations, 1st edn. Cambridge University Press (2004)
59. Peng, H., Cheng, C.K.: Parallel transistor level circuit simulation using domain decomposition
methods. In: 2009 Asia and South Pacific Design Automation Conference, pp. 397402 (2009)
60. Peng, H., Cheng, C.K.: Parallel transistor level full-Chip circuit simulation. In: 2009 Design,
Automation Test in Europe Conference Exhibition, pp. 304307 (2009)
61. Lelarasmee, E., Ruehli, A.E., Sangiovanni-Vincentelli, A.L.: The waveform relaxation
method for Time-Domain analysis of large scale integrated circuits. IEEE Trans. Comput.-
Aided Des. Integr. Circuits Syst. 1(3), 131145 (1982)
62. Achar, R., Nakhla, M.S., Dhindsa, H.S., Sridhar, A.R., Paul, D., Nakhla, N.M.: Parallel and
scalable transient simulator for power grids via waveform relaxation (PTS-PWR). IEEE Trans.
Very Large Scale Integr. (VLSI) Syst. 19(2), 319332 (2011)
63. Odent, P., Claesen, L., Man, H.D.: A combined waveform Relaxation-Waveform relaxation
newton algorithm for efficient parallel circuit simulation. In: Proceedings of the European
Design Automation Conference, 1990, EDAC, pp. 244248 (1990)
64. Rissiek, W., John, W.: A dynamic scheduling algorithm for the simulation of MOS and Bipolar
circuits using waveform relaxation. In: Design Automation Conference, 1992, EURO-VHDL
'92, EURO-DAC '92. European, pp. 421–426 (1992)
65. Saviz, P., Wing, O.: PYRAMID-A hierarchical waveform Relaxation-Based circuit simulation
program. In: IEEE International Conference on Computer-Aided Design, 1988. ICCAD-88.
Digest of Technical Papers, pp. 442445 (1988)
66. Erdman, D.J., Rose, D.J.: A newton waveform relaxation algorithm for circuit simulation. In:
1989 IEEE International Conference on Computer-Aided Design, 1989. ICCAD-89. Digest
of Technical Papers, pp. 404407 (1989)
67. Saviz, P., Wing, O.: Circuit simulation by hierarchical waveform relaxation. IEEE Trans.
Comput.-Aided Des. Integr. Circuits Syst. 12(6), 845860 (1993)
68. Fang, W., Mokari, M.E., Smart, D.: Robust VLSI circuit simulation techniques based on over-
lapped waveform relaxation. IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst. 14(4),
510518 (1995)
69. Gristede, G.D., Ruehli, A.E., Zukowski, C.A.: Convergence properties of waveform relax-
ation circuit simulation methods. IEEE Trans. Circuits Syst. I: Fundam. Theory Appl. 45(7),
726738 (1998)
70. Dong, W., Li, P., Ye, X.: WavePipe: parallel transient simulation of analog and digital circuits
on Multi-Core Shared-Memory machines. In: Design Automation Conference, 2008. DAC
2008. 45th ACM/IEEE, pp. 238243 (2008)

71. Ye, X., Dong, W., Li, P., Nassif, S.: MAPS: Multi-Algorithm parallel circuit simulation. In:
2008 IEEE/ACM International Conference on Computer-Aided Design, pp. 7378 (2008)
72. Ye, X., Li, P.: Parallel program performance modeling for runtime optimization of Multi-
Algorithm circuit simulation. In: 2010 47th ACM/IEEE Design Automation Conference
(DAC), pp. 561566 (2010)
73. Ye, X., Li, P.: On-the-fly runtime adaptation for efficient execution of parallel Multi-Algorithm
circuit simulation. In: 2010 IEEE/ACM International Conference on Computer-Aided Design
(ICCAD), pp. 298304 (2010)
74. Ye, X., Dong, W., Li, P., Nassif, S.: Hierarchical multialgorithm parallel circuit simulation.
IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst. 30(1), 4558 (2011)
75. Ye, Z., Wu, B., Han, S., Li, Y.: Time-Domain segmentation based massively parallel simulation
for ADCs. In: Design Automation Conference (DAC), 2013 50th ACM/EDAC/IEEE, pp. 16
(2013)
76. Chua, L.O., Lin, P.Y.: Computer-Aided analysis of electronic circuits: algorithms and com-
putational techniques, 1st edn. Prentice Hall Professional Technical Reference (1975)
77. Süli, E., Mayers, D.F.: An Introduction to Numerical Analysis, 2nd edn. Cambridge University
Press, England (2003)
78. Dahlquist, G.G.: A special stability problem for linear multistep methods. BIT Numer. Math.
3(1), 2743 (1963)
79. Nie, Q., Zhang, Y.T., Zhao, R.: Efficient Semi-Implicit schemes for stiff systems. J. Comput.
Phys. 214(2), 521537 (2006)
80. Hochbruck, M., Lubich, C.: On Krylov subspace approximations to the matrix exponential
operator. SIAM J. Numer. Anal. 34(5), 19111925 (1997)
81. Saad, Y.: Analysis of some Krylov subspace approximations to the matrix exponential oper-
ator. SIAM J. Numer. Anal. 29(1), 209228 (1992)
82. Zhuang, H., Wang, X., Chen, Q., Chen, P., Cheng, C.K.: From circuit theory, simulation
to SPICE_Diego: a matrix exponential approach for Time-Domain analysis of Large-Scale
circuits. IEEE Circuits Syst. Mag. 16(2), 1634 (2016)
83. Zhuang, H., Yu, W., Kang, I., Wang, X., Cheng, C.K.: An algorithmic framework
for efficient Large-Scale circuit simulation using exponential integrators. In: 2015 52nd
ACM/EDAC/IEEE Design Automation Conference (DAC), pp. 16 (2015)
84. Weng, S.H., Chen, Q., Wong, N., Cheng, C.K.: Circuit simulation via matrix exponential
method for stiffness handling and parallel processing. In: 2012 IEEE/ACM International
Conference on Computer-Aided Design (ICCAD), pp. 407414 (2012)
85. Chen, Q., Zhao, W., Wong, N.: Efficient matrix exponential method based on extended Krylov
subspace for transient simulation of Large-Scale linear circuits. In: 2014 19th Asia and South
Pacific Design Automation Conference (ASP-DAC), pp. 262266 (2014)
86. Zhuang, H., Weng, S.H., Lin, J.H., Cheng, C.K.: MATEX: A distributed framework for tran-
sient simulation of power distribution networks. In: 2014 51st ACM/EDAC/IEEE Design
Automation Conference (DAC), pp. 16 (2014)
87. NVIDIA Corporation: NVIDIA CUDA C Programming Guide. http://docs.nvidia.com/cuda/
cuda-c-programming-guide/index.html
88. Khronos OpenCL Working Group: The OpenCL Specification v1.1 (2010)
89. Gulati, K., Croix, J.F., Khatri, S.P., Shastry, R.: Fast circuit simulation on graphics processing
units. In: 2009 Asia and South Pacific Design Automation Conference, pp. 403408 (2009)
90. Poore, R.E.: GPU-Accelerated Time-Domain circuit simulation. In: 2009 IEEE Custom Inte-
grated Circuits Conference, pp. 629632 (2009)
91. Bayoumi, A.M., Hanafy, Y.Y.: Massive parallelization of SPICE device model evaluation
on GPU-based SIMD architectures. In: Proceedings of the 1st International Forum on Next-
generation Multicore/Manycore Technologies, pp. 12:112:5 (2008)
92. NVIDIA Corporation: CUDA BLAS. http://docs.nvidia.com/cuda/cublas/
93. Christen, M., Schenk, O., Burkhart, H.: General-Purpose sparse matrix building blocks Using
the NVIDIA CUDA technology platform. In: First Workshop on General Purpose Processing
on Graphics Processing Units. Citeseer (2007)
40 2 Related Work

94. Krawezik, G.P., Poole, G.: Accelerating the ANSYS direct sparse solver with GPUs. In: 2009
Symposium on Application Accelerators in High Performance Computing (SAAHPC09)
(2009)
95. Yu, C.D., Wang, W., Pierce, D.: A CPU-GPU hybrid approach for the unsymmetric multi-
frontal method. Parallel Comput. 37(12), 759770 (2011)
96. George, T., Saxena, V., Gupta, A., Singh, A., Choudhury, A.: Multifrontal factorization of
sparse SPD matrices on GPUs. In: 2011 IEEE International Parallel Distributed Processing
Symposium (IPDPS), pp. 372383 (2011)
97. Lucas, R.F., Wagenbreth, G., Tran, J.J., Davis, D.M.: Multifrontal Sparse Matrix Factorization
on Graphics Processing Units. Technical report. Information Sciences Institute, University of
Southern California (2012)
98. Lucas, R.F., Wagenbreth, G., Davis, D.M., Grimes, R.: Multifrontal computations on GPUs
and their Multi-Core Hosts. In: Proceedings of the 9th International Conference on High
Performance Computing for Computational Science, pp. 7182 (2011)
99. Kim, K., Eijkhout, V.: Scheduling a parallel sparse direct solver to multiple GPUs. In: 2013
IEEE 27th International Parallel and Distributed Processing Symposium Workshops Ph.D.
Forum (IPDPSW), pp. 14011408 (2013)
100. Hogg, J.D., Ovtchinnikov, E., Scott, J.A.: A sparse symmetric indefinite direct solver for GPU
architectures. ACM Trans. Math. Softw. 42(1), 1:11:25 (2016)
101. Sao, P., Vuduc, R., Li, X.S.: A distributed CPU-GPU sparse direct solver. In: Euro-Par 2014
Parallel Processing: 20th International Conference, pp. 487498 (2014)
102. Ren, L., Chen, X., Wang, Y., Zhang, C., Yang, H.: Sparse LU factorization for parallel circuit
simulation on GPU. In: Proceedings of the 49th Annual Design Automation Conference. DAC
12, pp. 11251130. ACM, New York, NY, USA (2012)
103. Chen, X., Ren, L., Wang, Y., Yang, H.: GPU-Accelerated sparse LU factorization for circuit
simulation with performance modeling. IEEE Trans. Parallel Distrib. Syst. 26(3), 786795
(2015)
104. He, K., Tan, S.X.D., Wang, H., Shi, G.: GPU-Accelerated parallel sparse LU factorization
method for fast circuit analysis. IEEE Trans. Very Large Scale Integr. (VLSI) Syst. 24(3),
11401150 (2016)
105. Kapre, N., DeHon, A.: Accelerating SPICE Model-Evaluation using FPGAs. In: 17th IEEE
Symposium on Field Programmable Custom Computing Machines, 2009. FCCM 09, pp.
3744 (2009)
106. Kapre, N.: Exploiting input parameter uncertainty for reducing datapath precision of SPICE
device models. In: 2013 IEEE 21st Annual International Symposium on Field-Programmable
Custom Computing Machines (FCCM), pp. 189197 (2013)
107. Martorell, H., Kapre, N.: FX-SCORE: a framework for fixed-point compilation of SPICE
device models using Gappa++. In: Field-Programmable Custom Computing Machines
(FCCM), pp. 7784 (2012)
108. Kapre, N., DeHon, A.: Performance comparison of Single-Precision SPICE Model-Evaluation
on FPGA, GPU, Cell, and Multi-Core processors. In: 2009 International Conference on Field
Programmable Logic and Applications, pp. 6572 (2009)
109. Wu, W., Shan, Y., Chen, X., Wang, Y., Yang, H.: FPGA accelerated parallel sparse matrix
factorization for circuit simulations. In: Reconfigurable Computing: Architectures, Tools and
Applications: 7th International Symposium, ARC 2011, pp. 302315 (2011)
110. Kapre, N., DeHon, A.: Parallelizing sparse matrix solve for SPICE circuit simulation using
FPGAs. In: International Conference on Field-Programmable Technology, 2009. FPT 2009,
pp. 190198 (2009)
111. Wang, X., Jones, P.H., Zambreno, J.: A configurable architecture for sparse LU decomposition
on matrices with arbitrary patterns. SIGARCH Comput. Archit. News 43(4), 7681 (2016)
112. Wu, G., Xie, X., Dou, Y., Sun, J., Wu, D., Li, Y.: Parallelizing sparse LU decomposition
on FPGAs. In: 2012 International Conference on Field-Programmable Technology (FPT),
pp. 352359 (2012)
References 41

113. Johnson, J., Chagnon, T., Vachranukunkiet, P., Nagvajara, P., Nwankpa, C.: Sparse LU decom-
position using FPGA. In: International Workshop on State-of-the-Art in Scientific and Parallel
Computing (PARA) (2008)
114. Siddhartha, Kapre, N.: Heterogeneous dataflow architectures for FPGA-based sparse LU fac-
torization. In: 2014 24th International Conference on Field Programmable Logic and Appli-
cations (FPL), pp. 14 (2014)
115. Siddhartha, Kapre, N.: Breaking sequential dependencies in FPGA-Based sparse LU fac-
torization. In: 2014 IEEE 22nd Annual International Symposium on Field-Programmable
Custom Computing Machines (FCCM), pp. 6063 (2014)
116. Kapre, N., DeHon, A.: VLIW-SCORE: beyond C for sequential control of SPICE FPGA
acceleration. In: 2011 International Conference on Field-Programmable Technology (FPT),
pp. 19 (2011)
117. Kapre, N., DeHon, A.: SPICE2: spatial processors interconnected for concurrent execution
for accelerating the SPICE circuit simulator using an FPGA. IEEE Trans. Comput.-Aided
Des. Integr. Circuits Syst. 31(1), 922 (2012)
118. Kapre, N.: SPICE2A spatial parallel architecture for accelerating the SPICE circuit simu-
lator. Ph.D. thesis, California Institute of Technology (2010)
Chapter 3
Overall Solver Flow

In this chapter, we will present the basic flow of our proposed solver NICSLU, as
necessary background for the parallelization techniques. We will also introduce the
usage of NICSLU in SPICE-like circuit simulators. Basically, a sparse direct solver
uses the following three steps to solve sparse linear systems:
Pre-analysis or pre-processing. This step performs row and column reordering to
minimize fill-ins which will be generated in numerical LU factorization. NICSLU
also performs a symbolic factorization to predict the sparsity of the matrix and
pre-allocate memories for numerical factorization.
Numerical LU factorization. This step factorizes the matrix obtained from the
first step into LU factors. This is the most complicated and time-consuming step
in a sparse direct solver. NICSLU has two different factorization methods: full
factorization with partial pivoting and re-factorization without partial pivoting. In
circuit simulation, NICSLU can automatically decide which method to call according to
the numerical features of the matrix.
Right-hand-solving. This step solves the linear system by forward/backward sub-
stitutions. NICSLU also has an iterative refinement step which can be invoked to
refine the solution when necessary. NICSLU can also automatically decide whether to
call iterative refinement according to the numerical features of the matrix.
As the main contents of this book are focused on the numerical LU factorization part,
in this chapter, we will also present the sequential LU factorization algorithm adopted
by NICSLU, which will be the foundation of the proposed parallel LU factorization
algorithms. Although our descriptions are for NICSLU, most of the algorithms and
techniques are actually general and not restricted to NICSLU.
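To make the three-step usage concrete, the following C sketch shows how a SPICE-like simulator could drive such a solver inside its Newton–Raphson loop. All function names here (solver_analyze, solver_factorize, solver_refactorize, solver_solve, stamp_matrix, and the two convergence tests) are hypothetical placeholders introduced only for illustration; they are not the actual NICSLU interface.

/* Hypothetical interface: none of these names are the real NICSLU API. */
void solver_analyze(int n, const int *ap, const int *ai, const double *ax);
void solver_factorize(const double *ax);    /* full factorization, with partial pivoting */
void solver_refactorize(const double *ax);  /* re-factorization, no partial pivoting     */
void solver_solve(const double *b, double *x);
void stamp_matrix(double *ax, double *b, const double *x); /* device model evaluation    */
int  iterations_converging(void);           /* relaxed test, cf. Eq. (3.4)               */
int  newton_converged(const double *x);     /* convergence test, cf. Eq. (3.3)           */

void newton_raphson_solve(int n, const int *ap, const int *ai,
                          double *ax, double *b, double *x)
{
    /* pre-analysis is performed only once, because the sparsity pattern of the
     * circuit matrix never changes during the SPICE iterations */
    solver_analyze(n, ap, ai, ax);
    do {
        stamp_matrix(ax, b, x);             /* values change, the pattern does not      */
        if (iterations_converging())
            solver_refactorize(ax);         /* cheap path when the iterations converge  */
        else
            solver_factorize(ax);           /* full factorization when values jump      */
        solver_solve(b, x);                 /* substitutions (+ optional refinement)    */
    } while (!newton_converged(x));
}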


3.1 Overall Flow

Figure 3.1 shows the overall flow of NICSLU. The above-mentioned three steps are
clearly marked in this figure. The pre-analysis step is performed only once but the
numerical LU factorization and right-hand-solving steps are both executed many
times in the Newton–Raphson iterations of a SPICE-like circuit simulation flow.
During the SPICE iterations, the symbolic pattern of the matrix remains the same while
the values change. This is an important feature of the sparse matrices in SPICE-like
circuit simulators, which avoids multiple executions of the pre-analysis step.
The pre-analysis step of NICSLU includes three steps: a static pivoting or zero-free
permutation, the approximate minimum degree (AMD) algorithm, and a symbolic
factorization. Once the symbolic factorization is finished, we calculate a sparsity
ratio (SPR) which is an estimation of the sparsity of the matrix. The SPR will be
used to select the factorization algorithm, such that the performance of NICSLU is
always high for different matrix sparsity.
As mentioned above, NICSLU offers two numerical factorization methods: full
factorization and re-factorization. The factorization method is selected according to
the concept of pseudo condition number (PCN), which is calculated at the end of the
numerical factorization step. For both methods, NICSLU provides three different
factorization algorithms: map algorithm, column algorithm, and supernodal algo-
rithm. The factorization algorithm is selected according to the SPR value to achieve
high performance for various sparsity. For full factorization, there is a minimum
suitable sparsity above which parallel factorization can actually achieve a speedup over
sequential factorization. If the sparsity of a matrix is smaller than this suitable
sparsity, parallel factorization may even be slower than sequential factorization, and, thus,
we should choose sequential factorization in this case. The SPR is used to control
whether full factorization should be executed in parallel or sequentially.
The right-hand-solving step includes two steps: forward/backward substitutions
and iterative refinement. Forward/backward substitutions obtain the solution by solv-
ing two triangular equations and the iterative refinement refines the solution to make
it more accurate. Substitutions involve much fewer numerical computations than
numerical factorization, so they are always executed sequentially in NICSLU. If the
iterative refinement step is executed, NICSLU automatically controls when the
refinement should stop according to the PCN value.
All algorithms and parallelization techniques of NICSLU will be described in
three chapters. In this chapter, we will introduce the pre-analysis step, the sequential
column algorithm, and the right-hand-solving step, which render a general flow of
the solver. In the next chapter, we will introduce the parallelization techniques for
the column algorithm. In Chap. 5, we will introduce the map algorithm and the
supernodal algorithm, as well as their parallelization techniques.
Fig. 3.1 Overall flow of NICSLU. The pre-analysis step consists of static pivoting/zero-free permutation, approximate minimum degree ordering, and symbolic factorization, after which SPR = FLOPs/NNZ(L + U − I) is computed. The numerical LU factorization step selects between full factorization and re-factorization (method selection), runs sequentially or in parallel, and chooses the map, column, or supernodal algorithm (algorithm selection); PCN = max_k |Ukk| / min_k |Ukk| is computed afterwards. The right-hand-solving step performs forward/backward substitutions and automatically controlled iterative refinement. The latter two steps are repeated inside the Newton–Raphson iterations

3.2 Pre-analysis

In this section, we will introduce the pre-analysis step of NICSLU. Since the pre-
analysis algorithms adopted by NICSLU are all existing algorithms, we only briefly
explain their fundamental theories without presenting their detailed algorithm flows.
Interested readers may refer to the corresponding references cited in the following contents.

3.2.1 Zero-Free Permutation/Static Pivoting

This is the first step of pre-analysis. The primary purpose of this step is to obtain a
zero-free diagonal. NICSLU offers two options to perform the zero-free permutation.
The first option is to permute the matrix only based on the symbolic pattern regardless
of the numerical values. The other option is to permute the matrix such that the product
of the diagonal absolute values is maximized. Permuting a zero-free diagonal or
putting large elements on the diagonal helps reduce off-diagonal pivots during the
numerical LU factorization phase. If the latter option is selected, we also call it static
pivoting. We adopt the MC64 algorithm [1, 2] from the Harwell subroutine library
(HSL) [3] to implement static pivoting. If one only wants to obtain a symbolically
zero-free diagonal, a zero-free permutation algorithm numbered MC21 [4, 5] in HSL
is invoked. We will briefly introduce the two algorithms in the following contents.

3.2.1.1 Zero-Free Permutation (MC21 Algorithm)

If the MC64 algorithm is not selected or it fails, NICSLU performs the MC21 algo-
rithm to obtain a zero-free diagonal. The MC21 algorithm tries to find a maximum
matching for all the rows and all the columns, such that each column is matched to
one row and each row is matched to at most one column. If a complete matching
cannot be found, i.e., there are rows and columns that cannot be matched, it means
that the matrix is structurally singular and NICSLU returns an error code to indicate
such an error.
The MC21 algorithm is based on depth-first search (DFS). To perform DFS, a
bipartite graph with 2N vertexes is created from the matrix, in which N vertexes
correspond to rows and the other N vertexes correspond to columns. A vertex cor-
responding to row i is marked as R(i) and a vertex corresponding to column j is
marked as C( j). Any nonzero element in the matrix Ai j corresponds to an undirected
edge (R(i), C(j)) in the bipartite graph. An array σ = {σ1, σ2, . . . , σN} is used to
record matched rows and columns. σi = j means that row i is matched to column
j, and the nonzero element Aij is the matched element that will be exchanged to the
diagonal after the MC21 algorithm is finished. The MC21 algorithm starts from each
column vertex C( j). All the adjacent row vertexes of C( j) are visited. If there is a
Fig. 3.2 Illustration of the MC21 algorithm. a The symbolic structure of matrix A. b The bipartite graph, where the red edges indicate the final matched rows and columns

row vertex R(i) that is not matched to any column vertex, then row i is matched to
column j, i.e., σi = j. If all the adjacent row vertexes of C(j) have been matched,
then a DFS procedure is performed based on matched rows and columns to find a
path until an unmatched row vertex is reached. All the row and column vertexes on
the path are marked as matched one-to-one. Figure 3.2 shows an example of such
a procedure. Assume that the first 4 columns have already been matched and the
matched elements are marked in red in Fig. 3.2a. Now we are trying to visit column
5 which has two nonzero elements at rows 3 and 6. Unfortunately, rows 3 and 6 both
have already been matched. Therefore, we start DFS from the columns which are
matched to the rows of the nonzero elements in column 5. First, column 3 is revisited
and we find an unmatched row 5, so column 3 is now matched to row 5. Then, column
5 can be matched to row 3. The same procedure will be continued until all the rows
and columns are matched one-to-one. Figure 3.2b shows the final matching results of
this example by red edges. Finally, for each row i = 1, 2, . . . , N, the matched column
σi is exchanged to column i, and then all the diagonal elements are symbolic nonzeros. Mathematically,
the MC21 algorithm is equivalent to finding a column permutation matrix Q, such that
AQ has a zero-free diagonal.
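To illustrate the idea, the following C sketch implements a simple augmenting-path matching of the kind described above, on a matrix stored in compressed sparse column (CSC) form (col_ptr/row_idx). It is only a minimal sketch of the principle; the actual MC21 code uses a more refined search order and data structures, and the names here are assumptions of the sketch.

#include <string.h>

/* Try to match column j to some row, re-matching previously matched rows if needed. */
static int augment(int j, const int *col_ptr, const int *row_idx,
                   int *match_row, int *visited)
{
    for (int p = col_ptr[j]; p < col_ptr[j + 1]; p++) {
        int i = row_idx[p];
        if (visited[i]) continue;
        visited[i] = 1;
        /* row i is free, or the column currently holding it can be re-matched */
        if (match_row[i] < 0 ||
            augment(match_row[i], col_ptr, row_idx, match_row, visited)) {
            match_row[i] = j;
            return 1;
        }
    }
    return 0;
}

/* match_row[i] = column matched to row i (or -1); returns the number of matched columns.
 * A result smaller than n means the matrix is structurally singular. */
int zero_free_matching(int n, const int *col_ptr, const int *row_idx,
                       int *match_row, int *visited /* work array of length n */)
{
    int matched = 0;
    for (int i = 0; i < n; i++) match_row[i] = -1;
    for (int j = 0; j < n; j++) {
        memset(visited, 0, n * sizeof(int));
        matched += augment(j, col_ptr, row_idx, match_row, visited);
    }
    return matched;
}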

3.2.1.2 Static Pivoting (MC64 Algorithm)

Static pivoting is an alternative and better method for zero-free permutation. The MC64
algorithm has two steps. First, it finds a column permutation such that the product of
all the diagonal absolute values is maximized. The second step is to scale the matrix
such that each diagonal element is 1 and each off-diagonal element is bounded by
1 in the absolute value.
The MC64 algorithm first tries to find a permutation σ = {σ1, σ2, . . . , σN} to
maximize the product of all the diagonal absolute values, i.e.,

∏_{j=1}^{N} |A_{σ_j, j}|.    (3.1)

σj records that row σj is matched to column j. After the permutation is found, column
j is exchanged to column σj, such that all the diagonal elements are nonzeros and
the product of the diagonal absolute values is maximized. Mathematically, this is
equivalent to finding a column permutation matrix Q, such that the product of the
diagonal values of AQ is maximized. The algorithm to find the permutation is based
on Dijkstra's shortest path algorithm. The basic idea is quite similar to zero-free
permutation. When performing a DFS, the length of the path, which equals
an inverse form of the product of the absolute values of elements on the path, is
recorded. The shortest path is found from all possible paths, which corresponds to
the permutation that maximizes the product of the diagonal absolute values. Once
the permutation is found, two diagonal scaling matrices Dr and Dc are generated
to scale the matrix, such that each diagonal element of Dr AQDc is 1 and all the
off-diagonal elements are in the range of [−1, +1]. Details of the MC64 algorithm
can be found in [2].
By default, NICSLU runs the MC64 algorithm first. If static pivoting cannot find
a shortest path that makes all the rows and columns matched one-to-one, this means
that the matrix is numerically singular. In this case, NICSLU will abandon static
pivoting and run zero-free permutation instead. NICSLU also provides an option to
specify whether scaling the matrix is required. If not, NICSLU only maximizes the
product of all the diagonal absolute values without scaling the matrix.

3.2.2 Matrix Ordering

The purpose of matrix ordering is to find an optimal permutation to reorder the matrix
such that fill-ins are minimized during sparse LU factorization. This is a special step
in sparse matrix factorizations. Figure 3.3 explains why matrix ordering is important
in sparse LU factorization. For sparse matrix factorization, different orderings can
generate significantly different fill-ins. If the matrix is ordered like the case shown
in Fig. 3.3a, then after LU factorization, both L and U are fully filled, leading to
a high fill-in ratio. On the contrary, if the matrix is ordered like the case shown in
Fig. 3.3b, no fill-ins are generated after LU factorization. For this simple example, it
is obvious that the ordering shown in Fig. 3.3b is a good one. As the computational
cost of sparse LU factorization is almost proportional to the number of FLOPs, which
in turn, depends on the number of fill-ins, generating too many fill-ins will greatly
degrade the performance of sparse direct solvers. Consequently, matrix ordering
is a necessary step for every sparse direct solver. Finding the optimal ordering to
minimize the fill-ins is actually a nondeterministic polynomial time complete (NPC)
problem [6], and, hence, people use heuristic algorithms to find suboptimal solutions
to this problem.
NICSLU adopts the AMD algorithm [7, 8], which is a very popular ordering algo-
rithm, to perform matrix ordering for fill-in reduction. The heuristic in AMD is that
the matrix ordering is done step by step, and in each step, we use a greedy strategy to
select the pivot to eliminate, such that fill-ins are minimized only at the current step,
Fig. 3.3 Different orderings generate different fill-ins. a A bad ordering leads to full fill-ins.
b A good ordering does not generate any fill-in

without considering its impact on the subsequent elimination steps. AMD can only
be applied to symmetric matrices, so a matrix after the zero-free permutation/static
pivoting step, say A, is first symmetrized by calculating Ã = A + A^T. Mathematically,
AMD finds a permutation matrix P and then applies symmetric row and column
permutations to the symmetric matrix, i.e., PÃP^T, such that factorizing PÃP^T generates
much fewer fill-ins than directly factorizing Ã. As Ã is constructed from A,
factorizing PÃP^T also tends to generate fewer fill-ins than factorizing A.
Figure 3.4 illustrates the basic theory of AMD based on the elimination graph (EG)
model. The EG is defined as an undirected graph, with N vertexes numbered from 1
to N corresponding to the rows and columns of the matrix. Except for the diagonal,
any nonzero element in Ã, say Ãi,j, corresponds to an undirected edge (i, j) in the
EG. According to the Gaussian elimination procedure, eliminating a vertex from the
EG will generate a clique (a clique means a subgraph where its vertexes are pairwise
connected) which is composed of vertexes which are adjacent to the eliminated
vertex. For the example illustrated in Fig. 3.4, if vertex 1 is eliminated, vertexes
{2, 3, 4} form a new clique so they are connected pairwise. The newly generated
edges, i.e., (2, 4) and (3, 4), correspond to the four fill-ins in the matrix, i.e., A2,4 ,
A3,4 , A4,2 and A4,3 , which are denoted by red squares in Fig. 3.4. According to this
observation, in order to minimize fill-ins, one should always select the vertex that
generates the fewest fill-ins at each step. However, calculating the exact number of
fill-ins is an expensive task, so AMD uses the approximate vertex degree instead of
the number of fill-ins, when selecting pivots to eliminate. Such an approximation
leads to a very fast speed without affecting the ordering quality for most practical
matrices [7].
Fig. 3.4 Illustration of the elimination process. Eliminating node 1 generates a clique {2, 3, 4}. The newly added edges (2, 4) and (3, 4) correspond to fill-ins generated in the matrix

As can be seen, additional edges are generated in the EG during the elimination
process. This leads to two challenges in the implementation of the AMD algorithm.
First, it is difficult to predict the required memory for the EG before the algorithm
starts, so we need to dynamically reallocate the memory. Second, after a vertex is
eliminated from the EG, additional edges are required to be inserted into the EG.
This leads to a severe problem that the memory spaces which store edges need to
be moved frequently. To overcome the two problems, a realistic implementation of
AMD actually adopts the concept of quotient graph [9], which can be operated in-
place and is much faster than the EG model. We omit the detailed implementation
of AMD in this book. Readers can refer to [7].

3.2.3 Symbolic Factorization

The main purpose of symbolic factorization in the pre-analysis step of NICSLU
includes workload prediction and sparsity estimation. Symbolic factorization pre-
dicts the symbolic pattern of the LU factors without considering the numerical values.
Different from other symbolic factorization methods that calculate an upper bound of
the symbolic pattern by considering all possible pivoting choices [10], we do not con-
sider anything about pivoting and just assume that there are no off-diagonal pivots.
This also means that the symbolic pattern predicted by our method is a lower limit.
The upper limit severely overestimates the number of nonzeros in the LU factors,
but the lower limit just underestimates the symbolic pattern a little in most cases.
Symbolic factorization is performed column by column, as shown in Algorithm 4.
For each column, a symbolic prediction and a pruning are performed. We will
introduce the two steps in Sects. 3.3.1 and 3.3.4 respectively. Basically, symbolic
Algorithm 4 Symbolic factorization.


Input: Symbolic pattern of an N × N matrix A
Output: Symbolic pattern of the LU factors without considering pivoting
1: for k = 1 : N do
2: Symbolic prediction for column k (Sect. 3.3.1)
3: Pruning for column k (Sect. 3.3.4)
4: end for

Algorithm 5 Calculating FLOPs.


Input: Symbolic patterns of L and U
Output: FLOPs
1: FLOPs = 0
2: for j = 1 : N do
3: for i < j where Uij is a nonzero element do
4: FLOPs += 2 × NNZ(L(i + 1 : N, i))
5: end for
6: FLOPs += NNZ(L(j + 1 : N, j))
7: end for

prediction calculates the symbolic pattern of a column and pruning is used to reduce
the computational cost for subsequent columns. Without any numerical computa-
tions, the symbolic factorization is typically much faster than numerical LU factor-
ization.
Once the symbolic factorization is finished, we calculate the number of FLOPs
by using Algorithm 5, and then estimate the sparsity of the matrix by calculating the
SPR defined as
SPR = FLOPs / NNZ(L + U − I)    (3.2)

where NNZ means the number of nonzeros. The SPR estimates the average number of
FLOPs per nonzero in the LU factors, which is a good estimator of the sparsity of the
matrix. Davis has pointed out that circuit matrices typically have a very small SPR [11].
As mentioned above, in our symbolic factorization, the SPR may underestimate
the actual sparsity if some off-diagonal elements are selected as pivots during LU
factorization. Fortunately, in most cases, there are not too many off-diagonal pivots,
so the underestimated sparsity can be very close to the actual sparsity.
The estimated SPR is used to select the LU factorization algorithm (map algorithm,
column algorithm, and supernodal algorithm), as illustrated in Fig. 3.1. Basically, if
the matrix is too sparse, the map algorithm runs faster than the column algorithm.
When the matrix is slightly denser, the supernodal algorithm runs faster than the col-
umn algorithm. Consequently, the optimal factorization algorithm should be selected
according to the matrix sparsity. We will further explain this point in Chaps. 5 and 6. In
addition, the SPR is also used to control whether full factorization will be executed in
parallel or sequential. The basic observation behind such a strategy is that, for highly
sparse matrices, due to the extremely low computational cost, the overhead caused by
parallelism (scheduling overhead, synchronization overhead, workload imbalance,
memory and cache conflicts, etc.) can be a non-negligible part of the total runtime.
What we have found from experiments is that for extremely sparse matrices, parallel
full factorization cannot be faster than sequential full factorization. Consequently,
we use the SPR to automatically control the sequential or parallel execution of full
factorization. According to our results, the threshold is selected to be 50. Namely,
NICSLU runs parallel full factorization when the SPR is larger than 50; otherwise
sequential full factorization is selected. We will explain the selection of the threshold
by experimental results in Chap. 6.
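As a concrete illustration, the C sketch below computes the FLOP count of Algorithm 5 and the SPR of Eq. (3.2), assuming the predicted patterns are available column-wise: lnz[j] is the number of strictly-lower nonzeros of L(:, j) (the unit diagonal excluded) and up/ui describe the strictly-upper pattern of U in CSC form. These array names are assumptions made for the sketch, not NICSLU data structures.

/* Sketch of Algorithm 5 and Eq. (3.2): count the MAD/division FLOPs predicted by
 * the symbolic factorization and divide by NNZ(L + U - I). */
double sparsity_ratio(int n, const int *lnz, const int *up, const int *ui)
{
    double    flops  = 0.0;
    long long nnz_lu = n;                    /* the N diagonal entries of U            */
    for (int j = 0; j < n; j++) {
        for (int p = up[j]; p < up[j + 1]; p++) {
            int i = ui[p];                   /* U(i,j) != 0 with i < j                 */
            flops += 2.0 * lnz[i];           /* one MAD per nonzero of L(i+1:N, i)     */
        }
        flops += lnz[j];                     /* the division that forms L(:, j)        */
        nnz_lu += lnz[j] + (up[j + 1] - up[j]);
    }
    return flops / (double)nnz_lu;           /* SPR = FLOPs / NNZ(L + U - I)           */
}

The returned value would then drive both the factorization-algorithm selection and the sequential/parallel decision, e.g., parallel full factorization only when the SPR exceeds 50.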

3.3 Numerical Full Factorization

In this section, we will introduce the fundamental theories of the numerical LU
factorization step in NICSLU, which is also the most important step of NICSLU. For
convenience and simplicity, the matrix obtained from the pre-analysis step is still
denoted as A. As the pre-analysis step is executed only once in circuit simulation,
this simplification will not cause ambiguity.
The basic numerical LU factorization algorithm adopted by NICSLU is a modified
version of the G-P sparse left-looking algorithm [12]. The primary modification is
the use of the pruning algorithm [13], which is not adopted by the original G-P
algorithm. The modified G-P algorithm is also adopted by KLU so the sequential
algorithm of NICSLU is almost the same as that of KLU. In this section, we will give a
basic introduction to the modified G-P algorithm, which is also the column algorithm
shown in Fig. 3.1. The parallelization methodologies and improvement techniques

Algorithm 6 Modified G-P sparse left-looking algorithm [12].


Input: N × N matrix A obtained from pre-analysis
Output: Matrices L and U
1: L = I
2: for k = 1 : N do
3: Symbolic prediction: determine the symbolic pattern of column k, i.e., the
columns that will update column k
4: Numeric update: solve Lx = A(:, k) using Algorithm 7
5: Partial pivoting on x using Algorithm 8
6: U(1 : k, k) = x(1 : k)
7: L(k : N, k) = x(k : N) / xk
8: Pruning: reduce the symbolic prediction cost of subsequent columns
9: end for
of the G-P algorithm, which are also the primary innovations of NICSLU, will be
presented in the next two chapters.
The modified G-P algorithm factorizes an N × N square matrix by sequentially
processing each column in four main steps: (1) symbolic prediction; (2) numerical
update; (3) partial pivoting; and (4) pruning, as shown in Algorithm 6. The algorithm
flow clearly explains why this algorithm is also called left-looking: when doing the
symbolic prediction and numerical update for a given column, dependent columns
on its left side will be visited. We will present brief descriptions and algorithms of
the four steps in the following four subsections.
As mentioned above, NICSLU offers two numerical LU factorization methods:
full factorization and re-factorization. The main difference between them is that re-
factorization does not invoke partial pivoting. In this section, we introduce the full
factorization algorithm, and in the next section, we will introduce the re-factorization
algorithm.

3.3.1 Symbolic Prediction

Symbolic prediction is the first step of factorizing a column. It calculates the symbolic
pattern of a given column, which also indicates the dependent columns that will be
visited in the numerical update step. Like zero-free permutation, symbolic prediction
is also done by DFS. In order to perform DFS, we also need to construct a DAG.
In symbolic prediction, the DAG is constructed from the symbolic pattern of L with
finished columns. The DAG has N vertexes corresponding to all the columns. Except
for the diagonal elements in L, any nonzero element in L, say L i, j , corresponds to
a directed edge (i, j) in the DAG. For a given column, say column k, the DFS
procedure starts from nonzero elements in A(:, k) until all reachable vertexes are
visited. For each nonzero element in A(:, k), we can get a vertex sequence by DFS.
All the vertexes in all the sequences are topologically sorted, and, finally, we get
the symbolic pattern of column k. The resulting symbolic pattern contains nonzero
elements of the given column of both L and U.
Figure 3.5 illustrates an example of the DFS procedure. Suppose that we are
doing symbolic prediction for column 10. There are two nonzero elements in
A(:, 10): A1,10 and A2,10 . Starting from A1,10 , we get a DFS sequence {1, 3, 5, 8, 10}.
Starting from A2,10 , we get another DFS sequence {2, 4, 9, 12, 7, 10, 11}. The
two sequences are merged and topologically sorted, so we get the final sequence
{1, 2, 3, 4, 5, 7, 8, 9, 10, 11, 12}, which indicates the symbolic pattern of column 10.
Note that the DAG is updated once the symbolic prediction of a column is finished.
The updated DAG will be used for symbolic predictions of subsequent columns.
The above descriptions are conceptual. In a practical implementation of the
symbolic prediction, the DAG does not need to be explicitly constructed. The storage
of L is directly used in symbolic prediction. In addition, topological sorting is not an
actual step, either. The topological order is automatically guaranteed by an elaborate
update order to the resulting sequence during the DFS procedure.
Fig. 3.5 Illustration of the DFS for symbolic prediction [11]. This example is illustrated for when
we are doing symbolic prediction for column 10
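The following C sketch shows one way to implement this reach computation in the spirit of the G-P algorithm [12]: an iterative DFS over the pattern of the already-factorized columns of L (stored in CSC form as lp/li), started from the nonzeros of A(:, k) and emitting the visited vertexes in topological order. Pruning and the row permutation caused by partial pivoting are deliberately ignored here, and the array names are assumptions of the sketch.

/* Compute the predicted pattern of column k as the set of vertexes reachable in
 * the graph of L from the nonzeros of A(:, k), emitted in topological order.
 * mark must be cleared (e.g., by walking the returned pattern) before the next
 * column is processed; stack and pstack are work arrays of size n. */
int reach(int n, const int *lp, const int *li,
          const int *ap, const int *ai, int k,
          int *pattern, int *mark, int *stack, int *pstack)
{
    int top = n;
    for (int p = ap[k]; p < ap[k + 1]; p++) {
        if (mark[ai[p]]) continue;
        int head = 0;
        stack[0] = ai[p];                        /* start a DFS at this row index     */
        while (head >= 0) {
            int j = stack[head];
            if (!mark[j]) { mark[j] = 1; pstack[head] = lp[j]; }
            int done = 1;
            for (int q = pstack[head]; q < lp[j + 1]; q++) {
                int i = li[q];
                if (i <= j || mark[i]) continue; /* follow each edge L(i,j), i > j    */
                pstack[head] = q + 1;            /* remember where to resume          */
                stack[++head] = i;               /* descend into vertex i             */
                done = 0;
                break;
            }
            if (done) pattern[--top] = stack[head--];  /* finished vertexes come out  */
        }                                              /* in reverse postorder, i.e.  */
    }                                                  /* topological order           */
    return top;   /* pattern[top..n-1] holds the predicted pattern of column k */
}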

3.3.2 Numerical Update

The purpose of numerical update is to calculate the numerical values for a given
column based on the symbolic pattern obtained in the symbolic prediction. Algo-
rithm 7 shows the algorithm flow of the numerical update for a given column. This
is typically the most time-consuming step in numerical LU factorization.

Algorithm 7 Solving Lx = A(:, k).


Input: Values and nonzero patterns of columns 1 to k − 1 of L, and symbolic pattern
of column k of U
Output: x // x is a column vector of length N
1: x = A(:, k)
2: for j < k where Ujk is a nonzero element do
3: x(j + 1 : N) = x(j + 1 : N) − L(j + 1 : N, j) × xj // MAD operation
4: end for

When updating a given column, say column k, numerical update uses dependent
columns on the left side to update column k. The dependence is determined by
the symbolic pattern of U(1 : k − 1, k). Namely, column k depends on column j
( j < k), if and only if U jk is a nonzero element. The numerical update is actually
a set of multiplication-and-add (MAD) operations. Figure 3.6 illustrates the MAD
operation in a clearer way. In this example, we are doing numerical update for column
k and U(1 : k − 1, k) has two nonzero elements. The numerical update for column
k involves three MAD operations, as marked by different colors in Fig. 3.6.
As can be seen from Algorithm 7, numerical update requires an uncompressed
array x of length N . This array serves as a temporary working space and stores all the
Fig. 3.6 Illustration of the numerical update

intermediate results during numerical update, as well as the final results of numerical
update. The necessity of this array is explained as follows. The symbolic patterns
of column k and its dependent columns are different, so for compressed storage
formats, it is expensive to simultaneously access two nonzero elements at the same
row in the two columns with different symbolic patterns. For example, assume that
we are using column j to update column k. We traverse the compressed array of
L(:, j), and for each nonzero element in L(:, j), say L i j , we need to find the address
of L ik or Uik to perform the numerical update. Since L and U are both stored in
compressed arrays, finding the address of L ik or Uik requires a traversal on L(:, k)
or U(:, k). On the contrary, if we use an uncompressed array x instead, the desired
address is simply the ith position of the array x. To integrate the uncompressed
array into numerical update, we need an operation named scatter–gather. Namely,
the numerical values of the nonzero elements are first scattered into x, and after
numerical update is finished, the numerical values stored in x will be gathered into
the compressed arrays of L and U. Figure 3.7 illustrates such an operation. Assume
that we are performing numerical update on column k. First, all the nonzero elements

Fig. 3.7 Illustration of the scatter–gather operation. a The compressed columns j and k. b Scatter column k into an uncompressed array. c Numerical update on the uncompressed array. d Gather nonzero elements into compressed storage



in column k are scattered into the uncompressed array x. Then, numerical update is
performed using all the dependent columns. Finally, the numerical results stored in
the uncompressed array x are gathered into the compressed storage of column k.
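The following C sketch combines Algorithm 7 with the scatter–gather idea for one column k. It assumes CSC storage of the finished part of L (lp/li/lx, strictly-lower entries only, unit diagonal implied) and of A (ap/ai/ax), and that the U-pattern of column k produced by the symbolic prediction is given in topological order in upat; these names are assumptions of the sketch, and partial pivoting and the final gather are left to the caller.

/* Numeric update of column k on the uncompressed working array x (Algorithm 7). */
void column_update(int k, const int *lp, const int *li, const double *lx,
                   const int *ap, const int *ai, const double *ax,
                   const int *upat, int nu, double *x /* size N, zeroed */)
{
    /* scatter A(:, k) into the dense work array */
    for (int p = ap[k]; p < ap[k + 1]; p++)
        x[ai[p]] = ax[p];

    /* x(j+1:N) -= L(j+1:N, j) * x(j) for every nonzero U(j, k), j < k */
    for (int t = 0; t < nu; t++) {
        int    j  = upat[t];
        double xj = x[j];
        for (int p = lp[j]; p < lp[j + 1]; p++)
            x[li[p]] -= lx[p] * xj;          /* MAD on the uncompressed array */
    }
    /* the caller now gathers x into the compressed storage of U(1:k, k) and
     * L(k:N, k), divides L(:, k) by the pivot x[k], and clears x */
}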

3.3.3 Partial Pivoting

It is well-known that numerical problems can occur in Gaussian elimination if some
diagonal elements are too small. In order to ensure the numerical stability, pivoting
is required to put elements with large magnitude on the diagonal. NICSLU adopts
a threshold-based partial pivoting strategy, as shown in Algorithm 8. For a given
column, say column k, partial pivoting is done by two main steps. First, the element
with the largest magnitude is found from the elements in column k of L, say xm .
Second, we check if the diagonal magnitude is large enough (i.e., whether |xk |
|xm |, where is the given threshold whose default value is 0.001), and if not, the
diagonal element xk and the element with the largest magnitude xm are exchanged.
The permutation caused by partial pivoting is also recorded. Once partial pivoting is
finished, values stored in the uncompressed array x are stored back to the compressed
storages of L and U, as shown in lines 6 and 7 of Algorithm 6.

Algorithm 8 Partial pivoting on x for column k


Input: k, x, and pivoting threshold θ // the default value of θ is 10^−3
Output: x // elements of x may be exchanged when returning
1: Find the element with the largest magnitude from x(k : N), say xm
2: if |xk| < θ × |xm| then // the diagonal element is not large enough
3: Exchange the positions of xk and xm, and record the permutation as well
4: end if

According to Algorithm 8, the word partial that describes the pivoting method
means that the pivot of a column is selected from the corresponding column of
L, but not the full column or the full matrix. Note that full pivoting can also be
adopted to achieve a better numerical stability. However, full pivoting involves more
complicated row and column permutations. In most cases, partial pivoting can achieve
satisfactory numerical stability.
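A minimal C sketch of this threshold pivoting step is given below; lpat holds the candidate row indices of column k of L (including row k itself) in the uncompressed array x, and theta is the pivoting threshold θ. The names and the exact bookkeeping of the permutation are assumptions of the sketch.

#include <math.h>

/* Threshold partial pivoting (Algorithm 8) on the uncompressed array x.
 * Returns the chosen pivot row; the caller records the row exchange k <-> m. */
int partial_pivot(int k, const int *lpat, int nl, double *x, double theta)
{
    int    m    = k;
    double amax = fabs(x[k]);
    for (int t = 0; t < nl; t++) {               /* find the largest magnitude     */
        int i = lpat[t];
        if (fabs(x[i]) > amax) { amax = fabs(x[i]); m = i; }
    }
    if (m == k || fabs(x[k]) >= theta * amax)
        return k;                                /* the diagonal is large enough   */
    double tmp = x[k]; x[k] = x[m]; x[m] = tmp;  /* exchange xk and xm             */
    return m;                                    /* permutation recorded by caller */
}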

3.3.4 Pruning

Pruning is the last step of factorizing a column. Pruning is actually not a necessary
step in the left-looking algorithm; the original G-P algorithm does not use it. However,
pruning can significantly reduce the computational cost of the symbolic prediction
Fig. 3.8 Illustration of pruning. a After column k is factorized, column j is pruned. b When we
are doing symbolic prediction for column m, the pruned nonzero elements in column j are skipped
when performing DFS

step of subsequent columns. The detailed theory of pruning is proposed in [13]. We
use Fig. 3.8 to briefly illustrate the theory of pruning. Suppose that column k which
depends on column j has been finished, and column j has a nonzero element at
row k. The pruning theory says that any nonzero element in column j with row
index larger than k can be pruned, as shown in Fig. 3.8a. When doing symbolic
prediction for a subsequent column that also depends on column j, say column m, the
pruned elements in column j can be skipped during DFS, as shown in Fig. 3.8b. The
reason can be explained as follows. As column k depends on column j, according
to the theory of symbolic prediction, the pruned elements will also generate nonzero
elements in column k at the same rows, as shown in Fig. 3.8a. Since column j has a
nonzero element at row k, it will also generate a nonzero element at the same row in
column m, and, hence, column m must depend on column k. This guarantees that the
effects of the pruned nonzero elements to column m will not lose, as there must exist
the corresponding nonzero elements at the same rows in column k. The key factor
that makes pruning effective is that column j must have a nonzero element at row k,
otherwise column m may not depend on column k, and then the effect of the pruned
nonzero elements will lose. Please note that in a practical implementation with partial
pivoting, one must use the pivoted row index instead of the original row index. We
omit the details of this point in this book as they are complicated and our focus is to
present the fundamental theories. Algorithm 9 shows the algorithm flow for pruning
a given column. Please note that pruning does not mean that the nonzero elements are
really eliminated. It just marks that some nonzero elements are not required to be visited
during DFS in symbolic prediction.
Algorithm 9 Pruning for column j.


Input: Symbolic pattern of U(:, j) and symbolic pattern of L before column j
Output: Pruned positions of columns before column j
1: for i < j where Ui j is a nonzero element do
2: if column i is not pruned then
3: for k = i + 1 : N where L ki is a nonzero element do
4: if k == j then
5: Prune column i at row k
6: break
7: end if
8: end for
9: end if
10: end for

3.4 Numerical Re-factorization

In the previous section, we have presented the full LU factorization algorithm. NIC-
SLU also offers another numerical factorization method named re-factorization. The
main difference between them is the use of partial pivoting. Re-factorization does
not perform partial pivoting. This difference leads to many other differences between
the two factorization methods. In the case with partial pivoting, partial pivoting can
exchange row orders so symbolic prediction is required for every column. This also
means that symbolic prediction cannot be separated from numerical factorization
because the symbolic pattern depends on the numerical pivot choices. However,
if partial pivoting is not adopted, the symbolic pattern does not change, so all the
symbol-related computations, i.e., symbolic prediction and pruning, can be skipped.
Consequently, in numerical LU re-factorization, we only need to perform numerical
update for each column. The premise is that the symbolic pattern of the LU factors
must be known prior to re-factorization, so re-factorization can only be called after
full LU factorization has been called at least once. Re-factorization uses the symbolic
pattern and pivoting order obtained in the last full factorization. Algorithm 10 shows
the algorithm flow of numerical LU re-factorization. The scatter–gather operation is
also required in the re-factorization algorithm, which means that the uncompressed
array x is also required.

3.4.1 Factorization Method Selection

Without partial pivoting, there may be small elements on the diagonal so the numer-
ical instability problem may occur. However, in SPICE-like circuit simulation,
there is an opportunity that we can call many more re-factorizations than full fac-
torizations without making the results unstable. The opportunity comes from the
Algorithm 10 Numerical LU re-factorization algorithm.


Input: Matrix A and the symbolic pattern of the LU factors
Output: Numerical values of the LU factors
1: L = I
2: for k = 1 : N do
3: x = A(:, k) // x is a column vector of length N
4: for j < k where Ujk is a nonzero element do
5: x(j + 1 : N) = x(j + 1 : N) − L(j + 1 : N, j) × xj // MAD operation
6: end for
7: U(1 : k, k) = x(1 : k)
8: L(k : N, k) = x(k : N) / xk
9: end for

Newton–Raphson method. As the Newton–Raphson iterative method converges
quadratically, when it is converging, the matrix values change very slowly. This key
observation allows us to call re-factorization instead of full factorization without
affecting the numerical stability when the Newton–Raphson iterations are converging.
Based on the above observation, NICSLU offers two alternative methods to control
the selection of the factorization method. The first method is completely controlled
by users and NICSLU does not intervene in the selection. We can utilize the conver-
gence check method used in conventional SPICE-like circuit simulators to select
the factorization method. In SPICE-like circuit simulators, the following method is
usually used to check whether the Newton–Raphson iterations have converged:

||x^(k) − x^(k−1)|| < AbsTol + RelTol · min(||x^(k)||, ||x^(k−1)||)    (3.3)

where the superscript k is the iteration count, and AbsTol and RelTol are two given
absolute and relative tolerances for checking convergence. Since the Newton–Raphson
method has the feature of quadratic convergence, we can simply relax
the two tolerances to larger values to judge whether the Newton–Raphson iterations
are converging, i.e.,

||x^(k) − x^(k−1)|| < BigAbsTol + BigRelTol · min(||x^(k)||, ||x^(k−1)||)    (3.4)

where BigAbsTol >> AbsTol and BigRelTol >> RelTol. They can be determined
empirically. If Eq. (3.4) holds, it indicates that the Newton–Raphson iterations are
converging, so one can invoke re-factorization instead of full factorization; otherwise
full factorization must be called.
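As a sketch of how a simulator might evaluate Eq. (3.4), the following C function uses the infinity norm; the actual norm, and the relaxed tolerances BigAbsTol and BigRelTol, are chosen by the simulator and are assumptions here.

#include <math.h>

/* Relaxed convergence test of Eq. (3.4): nonzero return value means the
 * Newton-Raphson iterations are converging, so re-factorization may be used. */
int iterations_converging(int n, const double *x_new, const double *x_old,
                          double big_abstol, double big_reltol)
{
    double dx = 0.0, nrm_new = 0.0, nrm_old = 0.0;
    for (int i = 0; i < n; i++) {
        double d = fabs(x_new[i] - x_old[i]);
        if (d > dx) dx = d;
        if (fabs(x_new[i]) > nrm_new) nrm_new = fabs(x_new[i]);
        if (fabs(x_old[i]) > nrm_old) nrm_old = fabs(x_old[i]);
    }
    double nrm = nrm_new < nrm_old ? nrm_new : nrm_old;   /* min of the two norms */
    return dx < big_abstol + big_reltol * nrm;
}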
Although the above method is quite effective in practice, the solver is not a black
box under such a usage. This increases the difficulty for users to use the solver. The
second method is completely controlled by the solver itself so the usage is black box.
Toward this goal, we calculate the PCN after each full factorization or re-factorization
by
PCN = max_k |Ukk| / min_k |Ukk|.    (3.5)

We determine the factorization method in the (k + 1)th iteration according to the
PCN values of the previous two iterations by the following method:

PCN^(k) > γ · PCN^(k−1)    (3.6)

where γ is a given threshold whose default value is 5. If Eq. (3.6) holds, it means
that the matrix values change dramatically, so full factorization should be called;
otherwise we can invoke re-factorization instead.
Please note that for both methods, the thresholds should be selected to be a little
conservative such that the numerical stability can always be guaranteed.
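The black-box selection can be sketched as follows in C: compute the PCN of Eq. (3.5) from the diagonal of U after each factorization and apply the test of Eq. (3.6). The array and variable names are assumptions of the sketch.

#include <math.h>

/* PCN = max_k |U_kk| / min_k |U_kk| over the stored diagonal of U (Eq. (3.5)). */
double pseudo_condition_number(int n, const double *udiag)
{
    double umax = fabs(udiag[0]), umin = fabs(udiag[0]);
    for (int k = 1; k < n; k++) {
        double a = fabs(udiag[k]);
        if (a > umax) umax = a;
        if (a < umin) umin = a;
    }
    return umax / umin;
}

/* Eq. (3.6): full factorization is needed if the PCN grew by more than gamma. */
int need_full_factorization(double pcn_curr, double pcn_prev, double gamma)
{
    return pcn_curr > gamma * pcn_prev;
}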

3.5 Right-Hand-Solving

This is the last step, following numerical LU factorization or re-factorization. In
NICSLU, right-hand-solving includes two steps: forward/backward substitutions and
an iterative refinement which is automatically controlled by NICSLU.

3.5.1 Forward/Backward Substitutions

Forward/backward substitutions solve the two triangular equations Ly = b and Ux =
y to get the solution of Ax = b. The implementation is quite straightforward and simple.
It is worth mentioning that forward/backward substitutions involve many fewer
FLOPs than a numerical factorization, so parallelizing forward/backward substitutions
may not yield any performance gain over sequential forward/backward substitutions
due to the extremely low SPR. Therefore, in NICSLU, forward/backward
substitutions are always sequential.
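A minimal C sketch of the two substitutions is shown below, assuming column-wise (CSC) storage where L keeps only its strictly-lower entries (unit diagonal implied) and each column of U stores its diagonal entry last; the permutations from ordering and pivoting are applied outside this sketch, and the array names are assumptions.

/* Forward substitution Ly = b; on return, b holds y. */
void lower_solve(int n, const int *lp, const int *li, const double *lx, double *b)
{
    for (int j = 0; j < n; j++)
        for (int p = lp[j]; p < lp[j + 1]; p++)
            b[li[p]] -= lx[p] * b[j];
}

/* Backward substitution Ux = y; on return, y holds x.  The diagonal entry of
 * each column of U is assumed to be stored last within that column. */
void upper_solve(int n, const int *up, const int *ui, const double *ux, double *y)
{
    for (int j = n - 1; j >= 0; j--) {
        y[j] /= ux[up[j + 1] - 1];                 /* divide by U(j, j)        */
        for (int p = up[j]; p < up[j + 1] - 1; p++)
            y[ui[p]] -= ux[p] * y[j];              /* update rows above j      */
    }
}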

3.5.2 Iterative Refinement

The purpose of iterative refinement is to refine the solution to obtain a more accurate
result. NICSLU automatically determines whether iterative refinement is required
according to whether the PCN is in a given range, i.e.,

α < PCN < β    (3.7)
Algorithm 11 Iterative refinement.


Input: Matrix A, RHS b, initial solution x, residual requirement eps and maximum
number of iterations maxiter
Output: refined solution x
1: iter = 0
2: Calculate residual r = Ax − b
3: r0 = ||r||_2^2
4: if r0 ≤ eps then // residual is satisfied, exit
5: return
6: end if
7: while iter++ < maxiter do
8: Solve Ad = r
9: Update solution x −= d
10: Update residual r = Ax − b
11: r1 = ||r||_2^2
12: if r1 ≤ eps then // residual is satisfied, exit
13: break
14: end if
15: if 2 × r1 ≤ r0 then // significant improvement, continue
16: r0 = r1
17: continue
18: end if
19: if r1 ≤ r0 then // insignificant improvement, exit
20: break
21: end if
22: x += d // bad refinement, restore the previous solution and exit
23: break
24: end while

where the default values of α and β are 10^12 and 10^40, respectively. If the condition
number is small, it means that the matrix is well-conditioned and the solution is
accurate enough, so refinement is not required. If the condition number is too large,
it indicates that the matrix is highly ill-conditioned. In this case, iterative refinement
usually does not have any effect. These two points explain why we use Eq. (3.7) to
determine whether iterative refinement is required.
The iterative refinement algorithm used in NICSLU is shown in Algorithm 11. It
is a modified version of the well-known Wilkinson algorithm [14]. If one of the
following four conditions holds, the iterations stop.
The number of iterations reaches the allowed number maxiter (line 7). maxiter
is given by users and its default value is 3 in NICSLU.
The residual ||Ax − b||_2^2 satisfies the requirement eps (line 12). eps is given by
users and its default value is 1 × 10^−20.
The residual saturates (line 19). This means that the residual changes slightly
compared with the residual in the previous iteration. Although the residual may still
be reduced by running more iterations, it is uneconomical as the iterative refinement
causes additional computational cost but the improvement of the solution is tiny.
The residual reaches the minimum (line 22). This means that the residual becomes
larger after a certain number of iterations. If this happens, NICSLU restores the
solution corresponding to the minimal residual and then stops the iterative refine-
ment.
It is worth mentioning that the iterative refinement algorithm is not always suc-
cessful. It is possible that for some ill-conditioned matrices, although the solution
is inaccurate, the iterative refinement algorithm cannot improve the solution at all.
Since it is an iterative algorithm, it has convergence conditions. Deriving the con-
vergence conditions is beyond the scope of this book. A detailed derivation can be
found in [15].
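For illustration, the residual and its squared 2-norm used by Algorithm 11 can be computed from the CSC storage of A as in the C sketch below; solving Ad = r then reuses the existing LU factors via the substitutions of Sect. 3.5.1. The array names are assumptions of the sketch.

/* Compute r = Ax - b for A in CSC form (ap/ai/ax) and return ||r||_2^2. */
double residual(int n, const int *ap, const int *ai, const double *ax,
                const double *x, const double *b, double *r)
{
    double nrm2 = 0.0;
    for (int i = 0; i < n; i++) r[i] = -b[i];
    for (int j = 0; j < n; j++)                      /* r += A * x, column by column */
        for (int p = ap[j]; p < ap[j + 1]; p++)
            r[ai[p]] += ax[p] * x[j];
    for (int i = 0; i < n; i++) nrm2 += r[i] * r[i]; /* squared 2-norm of the residual */
    return nrm2;
}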

References

1. Duff, I.S., Koster, J.: The design and use of algorithms for permuting large entries to the
diagonal of sparse matrices. SIAM J. Matrix Anal. Appl. 20(4), 889–901 (1999)
2. Duff, I.S., Koster, J.: On algorithms for permuting large entries to the diagonal of a sparse
matrix. SIAM J. Matrix Anal. Appl. 22(4), 973–996 (2000)
3. STFC Rutherford Appleton Laboratory: The HSL Mathematical Software Library. http://www.hsl.rl.ac.uk/
4. Duff, I.S.: On algorithms for obtaining a maximum transversal. ACM Trans. Math. Softw. 7(3),
315–330 (1981)
5. Duff, I.S.: Algorithm 575: permutations for a zero-free diagonal. ACM Trans. Math. Softw.
7(3), 387–390 (1981)
6. Yannakakis, M.: Computing the minimum fill-in is NP-complete. SIAM J. Algebraic Discrete
Meth. 2(1), 77–79 (1981)
7. Amestoy, P.R., Davis, T.A., Duff, I.S.: An approximate minimum degree ordering algorithm.
SIAM J. Matrix Anal. Appl. 17(4), 886–905 (1996)
8. Amestoy, P.R., Davis, T.A., Duff, I.S.: Algorithm 837: AMD, an approximate minimum degree
ordering algorithm. ACM Trans. Math. Softw. 30(3), 381–388 (2004)
9. George, A., Liu, J.W.H.: A quotient graph model for symmetric factorization. In: Sparse Matrix
Proceedings, pp. 154–175 (1979)
10. George, A., Ng, E.: Symbolic factorization for sparse Gaussian elimination with partial pivoting.
SIAM J. Sci. Stat. Comput. 8(6), 877–898 (1987)
11. Davis, T.A., Palamadai Natarajan, E.: Algorithm 907: KLU, a direct sparse solver for circuit
simulation problems. ACM Trans. Math. Softw. 37(3), 36:1–36:17 (2010)
12. Gilbert, J.R., Peierls, T.: Sparse partial pivoting in time proportional to arithmetic operations.
SIAM J. Sci. Stat. Comput. 9(5), 862–874 (1988)
13. Eisenstat, S.C., Liu, J.W.H.: Exploiting structural symmetry in a sparse partial pivoting code.
SIAM J. Sci. Comput. 14(1), 253–257 (1993)
14. Martin, R.S., Peters, G., Wilkinson, J.H.: Iterative refinement of the solution of a positive
definite system of equations. Numerische Mathematik 8(3), 203–216 (1966)
15. Moler, C.B.: Iterative refinement in floating point. J. ACM 14(2), 316–321 (1967)
Chapter 4
Parallel Sparse Left-Looking Algorithm

In this chapter, we will propose parallelization methodologies for the G-P sparse
left-looking algorithm. Parallelizing sparse left-looking LU factorization faces three
major challenges: the high sparsity of circuit matrices, the irregular structure of the
symbolic pattern, and the strong data dependence during sparse LU factorization.
To overcome these challenges, we propose an innovative framework to realize par-
allel sparse LU factorization. The framework is based on a detailed task-level data
dependence analysis and composed of two different scheduling modes to fit different
data dependences: a cluster mode suitable for independent tasks and a pipeline mode
that explores parallelism between dependent tasks. Under the proposed scheduling
framework, we will implement several different parallel algorithms for parallel full
factorization and parallel re-factorization. In addition to the fundamental theories,
we will also present some critical implementation details in this chapter.

4.1 Parallel Full Factorization

In this section, we will present parallelization methodologies for numerical full fac-
torization. Due to partial pivoting, the symbolic pattern of the LU factors depends
on detailed pivot choices, so the column-level dependence cannot be
determined before numerical factorization. In addition, the dependence dynamically
changes during numerical factorization. However, we need to know the detailed data
dependence before scheduling the parallel algorithm. This is the major challenge
when developing scheduling techniques for parallel numerical full factorization.


4.1.1 Data Dependence Representation

According to the theory of the G-P sparse left-looking algorithm, it is easy to derive
that column k depends on column j ( j < k), if and only if U jk is a nonzero element.
This conclusion describes the fundamental column-level dependence in the sparse
left-looking algorithm. Our parallel algorithms are based on the column-level paral-
lelism. In order to schedule the parallel factorization, a DAG that expresses all the
column-level dependence is required. However, the problem is that we cannot obtain
the exact dependence graph before numerical factorization because partial pivoting
can change the symbolic pattern of the LU factors. To solve this problem, we adopt
the concept of ET [1], which has already been mentioned in Sect. 2.1.1.1, to construct
an inexact dependence graph. The ET describes an upper bound of the column-level
dependence by considering all possible pivoting choices during a partial pivoting-
based factorization. In other words, regardless of the actual pivoting choices, the
column-level dependence is always contained in the dependence graph described
by the ET. Consequently, the ET may greatly overestimate the actual column-level
dependence.
An ET is actually a DAG, with N vertexes corresponding to all the columns in
the matrix. A directed edge in the ET (i, j) means that column j potentially depends
on column i. In this case, vertex j is the parent of vertex i, and vertex i is a child of
vertex j. Since the column-level dependence described by the ET is an upper bound,
the edge (i, j) does not necessarily mean that column j must depend on column i.
It just means that there exists a pivoting order, and if the matrix is strong Hall and
pivoted following that order, column j depends on column i. The original ET theory
is derived only based on symmetric matrices; however, the ET can also be applied
to unsymmetric matrices. For unsymmetric matrices, the ET can be constructed
from AT A [2, 3]. More specifically, if Lc denotes the Cholesky factor of AT A (i.e.,
Lc LcT = AT A), then the parent of vertex i is the row index j of the first nonzero
element below the diagonal of column i of Lc . The ET can be computed from A
in time almost linear to the number of nonzero elements in A by a variant of the
algorithm proposed in [1], without explicitly constructing AT A.

4.1.2 Task Scheduling

4.1.2.1 Scheduling Method Consideration

Once the ET is obtained, tasks (i.e., columns) can be scheduled by the ET, as the ET
contains all the potential column-level dependence. Many practical parallel applica-
tions adopt dynamic scheduling as it can usually achieve good load balance. We take
SuperLU_MT [2, 3] as an example to introduce the dynamic scheduling method.
Each column is assigned a flag indicating its status, which can
be one of the following four values: unready, ready, busy, and done. A ready

Algorithm 12 Dynamic scheduling.


1: Initialize: put ready tasks into task pool by the main thread
2: for all available threads running in parallel do
3: loop
4: Lock task pool
5: Fetch a task and remove it from task pool
6: Unlock task pool
7: if task not fetched then
8: Exit
9: end if
10: Mark the task as busy and execute it
11: Once the task is finished, mark it as done
12: Lock task pool
13: Put tasks which become ready into task pool
14: Unlock task pool
15: end loop
16: end for

task means that all of its children are finished. A task pool is maintained to store ready
tasks. The task pool is global and can be accessed by all the working threads. Once a
thread finishes its last task, it tries to fetch a new task from the task pool. As the task
pool is shared by all the threads, any access to the task pool is a critical section [4]
and requires a mutex [5] to avoid conflicts. For example, without using a mutex, 2
threads may fetch the same ready task if they access the task pool simultaneously.
Mutex operations involve system calls [6], so the overhead is quite large. A mutex
operation can typically cost thousands of CPU clock cycles. Once a new task is
fetched from the task pool, it is removed from the task pool, and then the thread
marks it as busy and executes it. After the task is finished, it is marked as done. The
thread then searches all the unready tasks which now become ready and puts them into
the task pool. This is the so-called dynamic scheduling method, a standard approach
used in many practical parallel applications. Algorithm 12
shows a typical flow of the dynamic scheduling method.
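To make the cost of this approach concrete, the following is a minimal C sketch of a mutex-protected task pool in the spirit of Algorithm 12, using POSIX threads. The pool layout and function names are illustrative; they are not taken from SuperLU_MT or NICSLU.

#include <pthread.h>

/* Shared pool of ready tasks; every access is a critical section. */
typedef struct {
    int            *ready;   /* indices of ready tasks (capacity: total task count) */
    int             count;   /* number of ready tasks currently in the pool         */
    pthread_mutex_t lock;
} task_pool;

/* Fetch one ready task; returns -1 if the pool is currently empty. */
static int pool_fetch(task_pool *pool)
{
    pthread_mutex_lock(&pool->lock);           /* enter critical section */
    int task = (pool->count > 0) ? pool->ready[--pool->count] : -1;
    pthread_mutex_unlock(&pool->lock);         /* leave critical section */
    return task;
}

/* Put a task that has just become ready into the pool. */
static void pool_put(task_pool *pool, int task)
{
    pthread_mutex_lock(&pool->lock);
    pool->ready[pool->count++] = task;
    pthread_mutex_unlock(&pool->lock);
}

Every fetch and every insertion pays for a pair of lock/unlock calls, which is exactly the overhead that becomes unacceptable when the per-column work is tiny.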
However, such a dynamic scheduling method is not suitable for parallel LU fac-
torization for circuit matrices. The difficulty comes from the high sparsity of circuit
matrices. Sparse matrices from other applications are generally denser than circuit
matrices, so the computational cost of a task can be much larger than its scheduling
cost. In this case, dynamic scheduling can be adopted, since the scheduling cost of
each task can be ignored compared with the computational cost. However, for circuit
matrices, the computational cost of a task can be extremely small, so the schedul-
ing cost may be larger than the computational cost, leading to very low scheduling
efficiency.
To reduce the scheduling cost, we propose two different scheduling methods for
NICSLU: a static scheduling method and a pseudo-dynamic scheduling method. In

Fig. 4.1 Illustration of the static scheduling method: tasks T_1, T_2, ..., T_P, T_{P+1}, T_{P+2}, ..., T_{2P}, ... are assigned to threads 1, 2, ..., P in a round-robin manner

both scheduling methods, tasks are sorted in a topological order, such that sequen-
tially finishing these tasks does not violate any dependence constraint. Suppose there
are M tasks, denoted as T_1, T_2, ..., T_M in a topological order. Let P
be the number of available threads. Static scheduling assigns tasks to
threads in order, as shown in Fig. 4.1. In short, task T_j is assigned to thread

    (j mod P) + 1,  if j mod P ≠ 0,
    P,              if j mod P = 0.        (4.1)

Once a thread finishes its last task, it begins to process the next task by increasing
the task index by P. Such a static scheduling method is quite easy to implement with
a negligible assignment overhead, as the assignment is completely known and fixed
before execution. However, it is well known that static scheduling may cause load
imbalance due to the unequal workloads of tasks. Load imbalance can also be caused
by runtime factors. For example, when a thread begins to execute a new task, say task
T_i, the previous task in the task sequence, T_{i-1}, may not have started yet. This also means
that task T_{i-1} is skipped in the time sequence. Figure 4.1 shows such an example
in which thread 2 runs faster than the other threads. In this case, the workloads of the
threads may differ greatly, and, hence, the load imbalance problem arises.
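In code, the assignment of Eq. (4.1) reduces to a strided loop over the topologically ordered task list; a minimal sketch with 0-based indices is shown below (factorize_column is a hypothetical per-column kernel, not an actual NICSLU function).

extern void factorize_column(int column);   /* hypothetical per-column kernel */

/* Static scheduling sketch: thread 'tid' (0-based) processes tasks
 * tid, tid + P, tid + 2P, ... of the topologically ordered task list. */
void static_worker(int tid, int P, int M, const int *task_order)
{
    for (int j = tid; j < M; j += P)
        factorize_column(task_order[j]);
}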
To solve the load imbalance problem of static scheduling, we further propose
a pseudo-dynamic scheduling method, which uses atomic operations and combines
advantages of both dynamic scheduling and static scheduling. In the pseudo-dynamic
scheduling method, a pointer named max_busy is maintained to point to the head-
most task that is being executed. Once any thread finishes its last task, max_busy is
atomically increased by one and then pointed to the next task. Figure 4.2 illustrates
the pseudo-dynamic scheduling method. The atomicity guarantees that even if multi-
ple threads are increasing max_busy simultaneously, they will get different results.
Algorithm 13 shows the proposed pseudo-dynamic scheduling method. It has two
advantages compared with static scheduling and conventional dynamic scheduling.
On one hand, such a method ensures that any thread always executes the next task with
the smallest index and no task can be skipped; thus, the workloads of the threads tend
to be balanced and load imbalance is mitigated. On the other hand, compared
with the conventional dynamic scheduling method, the pseudo-dynamic scheduling

Fig. 4.2 Illustration of the pseudo-dynamic scheduling method: max_busy always points to the headmost busy task and is atomically advanced as threads fetch new tasks

Algorithm 13 Pseudo-dynamic scheduling.


Input: M tasks T_1, T_2, ..., T_M in topological order, and P available threads
1: Initialize: max_busy = P, and for p = 1 : P, thread p begins to execute task T_p
2: for all available threads running in parallel do
3: loop
4: if the current task is finished then
5: k = atomic_add(&max_busy)//atomic_add performs an atomic add on
the input parameter and returns the result
6: if k > M then
7: Exit
8: end if
9: Execute task Tk
10: end if
11: end loop
12: end for

method greatly reduces the scheduling overhead, since an atomic operation is much
cheaper than a mutex operation.
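A minimal C sketch of the pseudo-dynamic worker loop is shown below, using C11 atomics; atomic_fetch_add plays the role of atomic_add in Algorithm 13, and the kernel name is a placeholder.

#include <stdatomic.h>

extern void factorize_column(int column);    /* hypothetical per-column kernel */

static atomic_int max_busy;                  /* next unassigned position in the task sequence */

/* Pseudo-dynamic scheduling sketch (0-based task positions). Thread 'tid'
 * (0 <= tid < P) starts with position tid; afterwards every thread grabs the
 * next unassigned position with one atomic increment. C11 atomic_fetch_add
 * returns the old value, so initializing max_busy to P makes the first value
 * returned equal to the first unassigned position. */
void pseudo_dynamic_worker(int tid, int M, const int *task_order)
{
    int k = tid;
    while (k < M) {
        factorize_column(task_order[k]);
        k = atomic_fetch_add(&max_busy, 1);
    }
}

/* The main thread initializes the counter before launching the P workers:
 *     atomic_init(&max_busy, P);                                           */

The only shared state is a single counter, so the per-task scheduling cost shrinks to one atomic instruction, in contrast with the lock/unlock pairs of the mutex-based pool.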
In NICSLU, the parallel supernodal full factorization (described in the next chapter)
uses the static scheduling method, while all other factorization and re-factorization
methods use the pseudo-dynamic scheduling method.

4.1.2.2 Dual-Mode Scheduling

Figure 4.3b shows an example of the ET. Here we first give a simple explanation
of the statement that the ET is an upper bound of the column-level dependence.
If we do not consider any pivoting, the column-level dependence is determined by
the symbolic pattern of U. As can be seen from Fig. 4.3a, column 10 only depends
on column 7. However, the ET shows that column 10 can potentially depend on 8
columns out of all the 10 columns, except column 3 and column 10 itself. In order to
schedule tasks by utilizing the ET, we further levelize the ET, as shown in Fig. 4.3c.
The levelization is actually an ASAP scheduling of the ET. In other words, we can
define a level for each vertex in the ET as the maximum length from the vertex to
leaf vertexes, where a leaf vertex is defined as a vertex without any children (the
level of a leaf vertex is 1). The level of a vertex can be calculated by the following
equation:

level(k) = max{level(c_1), level(c_2), ...} + 1,    (4.2)

where c_1, c_2, ... are the children of vertex k. Visiting all the vertexes in a
topological order can calculate their levels in linear time. After the ET is levelized,
we can rewrite the ET into a tabular form, which is named Elimination Scheduler
(ESched), as illustrated in Fig. 4.4. It is obvious that tasks in the same level are
completely independent, so they can be factorized in parallel. Guided by the ESched,
we will propose a dual-mode scheduling method for parallel LU factorization. In
NICSLU, all parallel factorization methods are based on the proposed dual-mode
scheduling method.
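As an illustration, the levels defined by Eq. (4.2) can be computed in one linear pass over the parent array of the ET, because in an elimination tree every parent has a larger index than its children; the sketch below makes this explicit (array names are illustrative).

/* Levelization sketch for Eq. (4.2): level[k] = max level of k's children + 1,
 * with leaf vertexes at level 1. Since parent[k] > k in an elimination tree,
 * scanning the vertexes in increasing order is already a topological order,
 * so level[k] is final before it is propagated to its parent. */
void compute_levels(int n, const int *parent, int *level)
{
    for (int k = 0; k < n; k++)
        level[k] = 1;                        /* leaves stay at level 1        */
    for (int k = 0; k < n; k++) {
        int p = parent[k];                   /* -1 marks a root               */
        if (p != -1 && level[k] + 1 > level[p])
            level[p] = level[k] + 1;         /* Eq. (4.2): max over children  */
    }
}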
A fundamental observation about the ESched is that the front levels have many tasks
while the remaining levels have far fewer tasks. This observation is
caused by the ASAP nature of the ESched: leaf tasks are all put into the first level and
tasks with weak dependence are put into front levels. According to this observation,

Fig. 4.3 Illustration of the ET and levelization of the ET: (a) matrix A, (b) ET, (c) levelized ET



Fig. 4.4 Illustration of the ESched: level 1 contains tasks {1, 2, 3, 6}, level 2 contains {4, 5}, level 3 contains {7}, level 4 contains {8}, level 5 contains {9}, and level 6 contains {10}

Fig. 4.5 ESched-guided dual-mode task scheduling: levels 1 and 2 are processed in the cluster mode and levels 3–6 in the pipeline mode, with the tasks of each part distributed between thread 1 and thread 2

we can set a threshold to distinguish the two cases. In what follows, we assume that
there are L levels in total, and the first L_c levels and the remaining L_p = L − L_c
levels are distinguished, i.e., the first L_c levels have many tasks in each level and the
remaining L_p levels have very few tasks in each level.
For the L_c front levels that have many tasks in each level, tasks in each level can
be factorized in parallel as tasks in the same level are completely independent. This
parallel mode is called cluster mode. All the levels belonging to cluster mode are
processed level by level. For each level, tasks are assigned to different threads (tasks
assigned to one thread are regarded as a cluster), and the load balance is achieved
by equalizing the number of tasks among all the clusters. Each thread executes the
same code (i.e., the modified G-P sparse left-looking algorithm) to factorize the
tasks which are assigned to it. Task-level synchronization is not required since tasks
in the same level are independent, which removes the bulk of the synchronization cost.
However, a barrier is required to synchronize all the threads, which means that the
cluster mode is a level-synchronization algorithm. Figure 4.5 shows an example of
task assignment to 2 threads in the cluster mode.
For the remaining L_p levels, each level has very few tasks, which also means that
there is insufficient task-level parallelism, so the cluster mode cannot be efficient. We
explore parallelism between dependent levels by proposing a new approach called
pipeline mode. First, all the tasks belonging to the pipeline mode are sorted into a

Fig. 4.6 Time diagram of the cluster mode and the pipeline mode, compared with sequential factorization

topological sequence (in the above example shown in Fig. 4.4, the topological
sequence is {7, 8, 9, 10}), and then static scheduling or pseudo-dynamic
scheduling is performed to assign tasks to working threads. Parallelism is explored between depen-
dent tasks, and, thus, task-level synchronization is required in the pipeline mode. Each
thread factorizes a fetched column at a time. During the factorization, it needs to wait
for dependent columns to finish. Figure 4.5 also shows an example of task assign-
ment to 2 threads in the pipeline mode. To better understand the two modes, Fig. 4.6
illustrates the time diagram of the two parallel modes, compared with sequential
factorization.

4.1.3 Algorithm Flow

In the cluster mode, each thread executes the modified G-P sparse left-looking algo-
rithm to factorize columns that are assigned to the thread. Since there is no column-
level synchronization in the cluster mode, fine-grained inter-thread communication
is not required. We only need a barrier to synchronize all the threads for each level
belonging to the cluster mode.
The pipeline mode is more complicated. In the pipeline mode, all the available
threads run in parallel, as shown in Algorithm 14. Suppose that a thread begins to
factorize a new column, say column k. The pseudo-code can be partitioned into two
parts: pre-factorization and post-factorization. In both parts, a set S is maintained to
store all the newly detected columns that are found in the last symbolic prediction.
Pre-factorization is composed of two passes of incomplete symbolic prediction and
numerical update. In both passes, symbolic prediction skips all unfinished columns,
and then all the finished columns stored in S are used to update the current column

Algorithm 14 Pipeline mode full factorization algorithm.


1: for all available threads running in parallel do
2: while the tail of the pipeline sequence is not reached do
3: Get a new un-factorized column, say column k//by static scheduling or
pseudo-dynamic scheduling
4: if the previous column in the pipeline sequence is not finished then //pre-
factorization
5: S = ∅
6: Symbolic prediction
7: Determine which columns will update column k
8: Skip all unfinished columns
9: Put newly detected columns into S
10: Numerical update
11: Use the columns stored in S to update column k
12: end if
13: if there are skipped columns in the above symbolic prediction then
14: S = ∅
15: Symbolic prediction
16: Determine which columns will update column k
17: Skip all unfinished columns
18: Put newly detected columns into S
19: Numerical update
20: Use the columns stored in S to update column k
21: end if
22: Wait for all the children of column k to finish
23: S = ∅ //post-factorization
24: Symbolic prediction
25: Determine the exact symbolic pattern of column k
26: Determine which columns will update column k
27: Without skipping any columns
28: Put newly detected columns into S
29: Numerical update
30: Use the columns stored in S to update column k
31: Partial pivoting
32: Pruning
33: end while
34: end for

k. These columns are marked as used and they will not be put into S again in later
symbolic predictions when factorizing column k. The second pass of symbolic pre-
diction starts from the skipped columns in the first pass, and then the thread waits for
all the children of column k to finish. After that, the thread enters post-factorization.
In post-factorization, the thread performs a complete symbolic prediction without

skipping any columns, as all the dependent columns are finished now, to determine
the exact symbolic pattern of column k. However, used columns will not be put into
S so S only contains the dependent columns which have not been used by column k.
The thread uses these newly detected columns to perform the remaining numerical
update on column k. Finally, partial pivoting and pruning are performed.
The pipeline mode exploits parallelism by pre-factorization. In the sequential
algorithm, one column, say column k, starts strictly after the previous column, i.e.,
column k − 1, is finished. However, in the pipeline mode, before the previous column
is finished, column k has already accumulated some numerical updates from
dependent columns that are already finished.
Although partial pivoting can change the row ordering, it cannot cause inter-
thread conflicts in the pipeline mode algorithm. The reason is that the ET contains
all possible column-level dependence if partial pivoting is adopted. If two columns
can cause conflicts due to partial pivoting, they cannot be factorized at the same
time since one of the two columns must depend on the other column in the ET.
However, pruning in the pipeline mode algorithm may cause inter-thread conflicts.
For example, if one thread is pruning a column but another thread is trying to visit
that column, it will cause unpredictable results or even a program crash. We will
discuss how to solve this problem in the next subsection.

4.1.4 Implementation Details

The pipeline mode algorithm involves two practical issues in the implementation,
which require special attention.

How to determine whether a column is finished and how to guarantee the topo-
logical order during the symbolic prediction in the pre-factorization? We have
found that using only a flag for each column to indicate whether it is finished is

Fig. 4.7 Example used to illustrate the problem of symbolic prediction in pre-factorization: (a) dependence graph, (b) time diagram

insufficient. This problem can be explained by an example illustrated in Fig. 4.7.


Assume that Fig. 4.7a is a part of the DAG used in symbolic prediction, as explained
in Sect. 3.3.1. One thread is performing DFS and visiting vertex a. If a is not fin-
ished currently, then a is certainly skipped. Of course b and c are also skipped
since they are children of a and they are both unfinished. Then the thread tries to
visit d. If d is finished, and in the interval between this thread skipping a and
starting to visit d, a and c are both finished by other threads, then this thread will
visit c after visiting d. To better understand this case, a time diagram is shown in
Fig. 4.7b. This leads to an error, because a is not visited but its child c is visited
first, leading to an incorrect topological order which will also lead to wrong results
in numerical update.
The reason for this problem is that the judgement of whether a and its children b
and c are finished must be done simultaneously. In other words, the judgement of
whether they are finished should be an atomic operation. However, in the above
example, the judgements of a and c are not done at the same time, leading to the
problem that c is visited first without visiting its parent, a. In this case, the critical
section is too long, so it is too expensive to use a mutex to lock the critical section.
An alternative solution is to snapshot the states of all the columns before symbolic
prediction, and then a thread always visits the snapshotted states during symbolic
prediction, regardless of the actual states of columns. However, snapshotting the
states of all the columns is also expensive since a mutex is also required.
In the implementation of NICSLU, we develop a much cheaper pseudo-snapshot
method. Besides max_busy used in the pseudo-dynamic
scheduling method, we use another pointer named min_busy to point to the min-
imum busy task that is being executed. Before each pass of symbolic prediction, a
snapshot is taken for min_busy, i.e., a copy of min_busy is made. The snap-
shot of the states of all the columns is done by the copy of min_busy. Although
min_busy may be updated during symbolic prediction, the copied value cannot
change. During symbolic prediction, if the index of a task that is being visited
is smaller than the copied min_busy, then this task is finished; otherwise it is
considered unfinished regardless of its actual state. Once a thread finishes its last
task, min_busy is updated. The new min_busy equals the minimum value of
the minimum busy tasks of all the threads. Updating min_busy does not require
atomic operations. Although multiple threads can update min_busy simultane-
ously, the resulting min_busy may only be smaller than or equal to the actual
minimum busy task, but it can never exceed the actual minimum busy task. This
guarantees the correctness of min_busy without any atomic operations. There
may be a small performance penalty if min_busy is smaller than the actual
minimum busy task; however, such a pseudo-snapshot method is quite cheap to
implement and completely solves this problem.
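The following C sketch illustrates the pseudo-snapshot bookkeeping; the per-thread arrays and function names are assumptions made for this example, not NICSLU's actual data structures.

/* thread_busy[t] holds the pipeline position of the task thread t is currently
 * executing (a value >= M means the thread is idle or finished). */
extern int thread_busy[];    /* length P */
extern int min_busy;         /* shared: points to the minimum busy task */

/* Called by a thread when it finishes a task. No atomics are required: a racy
 * update can only leave min_busy smaller than or equal to the true minimum
 * busy task, which is safe (merely conservative), as discussed above. */
void update_min_busy(int P)
{
    int m = thread_busy[0];
    for (int t = 1; t < P; t++)
        if (thread_busy[t] < m)
            m = thread_busy[t];
    min_busy = m;
}

/* During one pass of symbolic prediction, the snapshot is taken once before
 * the pass and never re-read, so the set of columns seen as finished cannot
 * change while the pass is running. */
int column_is_finished(int pipeline_pos, int min_busy_snapshot)
{
    return pipeline_pos < min_busy_snapshot;
}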
How to perform pruning in parallel full factorization? The problem is that pruning
can change the order of the L indexes of some columns. If one thread is pruning a
column, and another thread is visiting this column simultaneously, it will
cause unpredictable behaviors or even a program crash. Although this problem can
be resolved by utilizing a mutex, the cost is expensive since the critical section

is too long. Our solution is to store an additional copy of the L indexes for each
column. The original L indexes are used for pruning and the copy will never be
changed. If a thread is going to visit a column during symbolic prediction, it first
checks whether this column is pruned. If so, it visits the pruned indexes. This will
not cause any problem since if a column is pruned, it will not be pruned again, so it
will not be changed any more. If the visiting column is not pruned, it indicates that
the column may be pruned at any time in the future, so we can only visit the copied
indexes. This method completely avoids the conflict but leads to some additional
storage overhead and runtime penalty.
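A minimal sketch of such per-column dual index storage could look as follows; the struct layout and field names are illustrative only, not NICSLU's actual data structure.

/* Per-column index storage for conflict-free pruning (illustrative layout). */
typedef struct {
    int *idx;        /* L row indexes; reordered in place when the column is pruned */
    int *idx_copy;   /* immutable copy of the original L row indexes                */
    int  nnz;        /* number of stored row indexes                                */
    int  pruned_len; /* number of indexes to traverse after pruning                 */
    int  is_pruned;  /* 0: not pruned yet; nonzero: pruned                          */
} column_indexes;

/* A visiting thread picks which array to traverse. A pruned column is never
 * pruned again, so reading idx is safe; an unpruned column may be pruned at
 * any moment, so only the immutable copy may be read. */
static const int *indexes_to_visit(const column_indexes *col, int *len)
{
    if (col->is_pruned) {
        *len = col->pruned_len;
        return col->idx;
    }
    *len = col->nnz;
    return col->idx_copy;
}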

4.2 Parallel Re-factorization

In this section, we will present parallelization methodologies for numerical re-


factorization. Re-factorization assumes that the symbolic pattern of the LU factors
is known, so it can only be invoked after full factorization with partial pivoting has
been executed at least once. Re-factorization re-uses the existing symbolic pattern and
pivoting order. Without partial pivoting, the symbolic pattern of the LU factors is
fixed during numerical re-factorization, and, hence, the column-level dependence
is also known and fixed. Consequently, scheduling parallel re-factorization is much
easier than scheduling parallel full factorization. In addition, in this case, we can
further optimize the implementation by fully utilizing the fixed dependence graph.

4.2.1 Data Dependence Representation

As mentioned above, the column-level dependence of the sparse left-looking algo-


rithm is determined by the symbolic pattern of U. For re-factorization, the dependence
represented by the symbolic pattern of U is actually exact. In other words, unlike the
ET, the symbolic pattern of U does not contain any redundant column-level depen-
dence. The symbolic pattern of U can be described by a DAG, which is named EG.
Please note that a previous concept of EG has already been introduced in Sect. 3.2.2,
but, here, the EG is different from that introduced in Sect. 3.2.2. The EG here is com-
posed of N vertexes, corresponding to all the columns of the matrix. Except for the
diagonal elements, any nonzero element in U, say U_ij, corresponds to a directed edge
(i, j) (i < j) in the EG, indicating that column j depends on column i. Figure 4.8b
shows an example of the EG, corresponding to the symbolic pattern of U shown in
Fig. 4.8a. Although the EG and the ET are similar in the sense that they both represent
column-level dependence, there is a big difference between them. Because the ET
contains much redundant dependence, the ET is longer and narrower, while
the EG tends to be wider and shorter.

Fig. 4.8 Illustration of the EG and levelization of the EG: (a) matrix U, (b) EG, (c) levelized EG

Please note that we do not need to explicitly construct the dependence graph
for parallel re-factorization. As the column-level dependence can be completely
determined by the symbolic pattern of U, the dependence graph is implied in the
symbolic pattern of U. Namely, the symbolic pattern of U is just the EG.

4.2.2 Task Scheduling

For parallel re-factorization, we also adopt the dual-mode scheduling method pro-
posed in Sect. 4.1.2.2 to schedule tasks. First, the EG is levelized by calculating the
level of each vertex using Eq. (4.2), as illustrated in Fig. 4.8c. The EG has a similar
feature as the ET. Some front levels have many tasks in each level and the remaining
levels have very few tasks in each level. An ESched is constructed according to the
levelized EG. The cluster mode and the pipeline mode are launched based on the
ESched. For the example shown in Fig. 4.8, the scheduling result is shown in Fig. 4.9,
assuming that there are 2 threads.

Fig. 4.9 ESched-guided dual-mode task scheduling for the example in Fig. 4.8: levels 1 ({1, 2, 3, 7}) and 2 ({4, 5, 6}) are processed in the cluster mode, and levels 3 ({8}), 4 ({9}), and 5 ({10}) in the pipeline mode, with tasks distributed between thread 1 and thread 2

Algorithm 15 Pipeline mode re-factorization algorithm.


1: for all available threads running in parallel do
2: while the tail of the pipeline sequence is not reached do
3: Get a new un-factorized column, say column k//by pseudo-dynamic
scheduling
4: x = A(:, k)//x is a column vector of length N
5: for j < k where U_jk is a nonzero element do
6: Wait for column j to finish//inter-thread communication
7: x(j + 1 : N) = x(j + 1 : N) − L(j + 1 : N, j) · x_j //MAD operation
8: end for
9: U(1 : k, k) = x(1 : k)
10: L(k : N, k) = x(k : N) / x_k
11: Mark column k as finished
12: end while
13: end for

4.2.3 Algorithm Flow

In the cluster mode, each thread executes Algorithm 10 to factorize the columns that
are assigned to it. Like the cluster mode of full factorization, inter-thread synchro-
nization is not required, but we need a barrier to synchronize all the threads for each
level belonging to the cluster mode.
The pipeline mode algorithm in re-factorization is also much simpler than that
in full factorization. Algorithm 15 shows the pipeline mode re-factorization algo-
rithm. The major difference between Algorithm 15 and Algorithm 10 is in line 6
of Algorithm 15. In the pipeline mode re-factorization algorithm, when a thread is
trying to access a column, it will first wait for that column to finish. This is the only
inter-thread communication in the pipeline mode re-factorization algorithm. Such
a pipeline mode algorithm breaks the computational task of each column into fine-
grained subtasks, such that the column-level dependence is also broken down into finer-grained constraints. Parallelism is
explored between dependent columns by running multiple subtasks in parallel. The
pipeline mode algorithm ensures a detailed computational order such that all the
numerical updates are done in a correct topological order.
We use Fig. 4.10 to illustrate the pipeline mode algorithm. Suppose that 2 threads
are factorizing column j and column k simultaneously. Column k depends on column
j and another column i (i < j < k). Assume that column i is already finished. While
factorizing column k, column k can be first updated by column i, corresponding to
the red line in Fig. 4.10. When it needs to use column j, it waits for column j until
it is finished (if currently column j is already finished, then no waiting is required).
Once column j is finished, column k can be updated by column j, corresponding
to the blue line in Fig. 4.10. At this moment, the thread that factorized column j

Fig. 4.10 Illustration of the pipeline mode algorithm

just now is now factorizing another unfinished column. Such a parallel execution
approach is very similar to the pipeline mechanism of CPUs, so we call it pipeline
mode.

4.2.4 Implementation Details

Waiting for a dependent column to finish can be done by two methods: blocked wait-
ing and spin waiting. Blocked waiting does not consume CPU resources; however,
it involves system calls, so the performance overhead is quite large. In the pipeline
mode algorithm, since inter-thread synchronization happens very frequently, blocked
waiting can significantly degrade the performance. Consequently, in NICSLU, we
use spin waiting for inter-thread synchronization. Implementing spin waiting is quite
easy. A binary flag is set for each column. If the flag is 0, it indicates that the column
is unfinished; otherwise the column is finished. We use a spin loop to implement the
waiting operation. There is another problem that must be resolved. If the column that
is being waited for fails in factorization due to some reason, e.g., a zero pivot, then the
waiting thread will never exit the waiting loop because the dependent column can
never be finished. In this case, the waiting thread falls into an infinite loop. To resolve this
problem, we set an error code for each thread. During the waiting loop, we check all

Algorithm 16 Spin waiting (waiting for column k).


1: while state[k] == 0 do
2: for t = 1 : P do
3: if err[t] != 0 then //zero indicates success and nonzero indicates failure
4: Exit the current function
5: end if
6: end for
7: end while
8: Continue other operations...

the error codes. Once an error from some other thread is detected, the waiting thread
exits the waiting loop and also exits the current function. The spin waiting method is
shown in Algorithm 16, assuming that we are waiting for column k and there are P available
threads in total. In Algorithm 16, err is the array that stores all the error codes of all
the threads, and state is the state flag assigned to each column to indicate whether
a column is finished. The overhead of spin waiting is that it always consumes CPU
resources. Therefore, when spin waiting is adopted, the number of invoked working
threads cannot exceed the number of available cores; otherwise the performance will
be dramatically degraded due to CPU resource conflicts.
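For illustration, Algorithm 16 could be written in C roughly as follows, using C11 atomics for the state and error flags; the flag layout and memory-ordering choices are assumptions of this sketch.

#include <stdatomic.h>

/* state[k] becomes nonzero when column k is finished; err[t] becomes nonzero
 * when thread t fails (e.g., a zero pivot). */
extern atomic_int state[];    /* one flag per column       */
extern atomic_int err[];      /* one error code per thread */

/* Spin-wait for column k. Returns 0 when the column is finished, or -1 if any
 * thread reported an error, in which case the caller aborts its own work. */
int wait_for_column(int k, int P)
{
    while (atomic_load(&state[k]) == 0) {
        for (int t = 0; t < P; t++)
            if (atomic_load(&err[t]) != 0)
                return -1;             /* break the potential infinite loop */
    }
    return 0;
}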

References

1. Liu, J.W.H.: The role of elimination trees in sparse factorization. SIAM J. Matrix Anal. Appl.
11(1), 134–172 (1990)
2. Li, X.S.: Sparse Gaussian elimination on high performance computers. Ph.D. thesis, Computer
Science Division, UC Berkeley, California, US (1996)
3. Demmel, J.W., Gilbert, J.R., Li, X.S.: An asynchronous parallel supernodal algorithm for sparse
Gaussian elimination. SIAM J. Matrix Anal. Appl. 20(4), 915–952 (1999)
4. Wikipedia: Critical Section. https://en.wikipedia.org/wiki/Critical_section
5. Wikipedia: Mutual Exclusion. https://en.wikipedia.org/wiki/Mutual_exclusion
6. Wikipedia: System Call. https://en.wikipedia.org/wiki/System_call
Chapter 5
Improvement Techniques

In the previous two chapters, we have presented the basic flow of our solver
and the parallelization methodologies for both numerical full factorization and re-
factorization, as well as the factorization method selection strategy. The numerical
factorization algorithms described are based on the G-P sparse left-looking algo-
rithm, which is a column-level algorithm. Although the G-P algorithm is widely
used in circuit simulation problems, whether it is really the best algorithm
for circuit matrices is actually unclear. Till now, very little work has been published to com-
prehensively analyze the performance of different computational granularities for
circuit matrices, but most efforts have been done for general sparse matrices which
are much denser than circuit matrices. In this chapter, we will point out that the
pure G-P algorithm is not always the best for circuit matrices. We will introduce two
improvement techniques for the G-P sparse left-looking algorithm. Inspired by the
observation that the best algorithm depends on the matrix sparsity, we will propose a
map algorithm and a supernodal algorithm which are suitable for extremely sparse
and slightly dense circuit matrices, respectively. Combining with the G-P algorithm,
we will integrate three algorithms in NICSLU. For a given matrix, the best algorithm
is selected according to the matrix sparsity, such that NICSLU always achieves high
performance for circuit matrices with various sparsity. In addition, based on the
observation that the matrix values change slowly during Newton–Raphson iterations,
we will propose a novel pivoting reduction technique for numerical full factoriza-
tion to reduce the computational cost of symbolic prediction without affecting the
numerical stability.

5.1 Map Algorithm

In this section, we will introduce the map algorithm including the map definition and
the algorithm flow in detail. The map algorithm is proposed to reduce the overhead
of cache miss and data transfer for extremely sparse circuit matrices, so that the
performance can be improved for such matrices.

5.1.1 Motivation

As mentioned in Sect. 3.3.2 and Fig. 3.7, an uncompressed array x of length N is


required to store intermediate results for the scatter-gather operation during numerical
update in the G-P sparse left-looking algorithm. This array is needed to solve the
indexing problem for compressed arrays, because sparse matrices are stored in
a compressed form, i.e., only the values and positions of nonzero elements are stored.
This leads to a problem when we want to visit a nonzero element of the sparse matrix
from the compressed storage, because we do not know its address in the compressed
form in advance. The key idea to solve this problem in the G-P algorithm is to use the
uncompressed array x, which temporarily holds the immediate values of a column
which is being updated.
The use of the uncompressed array x leads to the following two problems if the
matrix is extremely sparse.
Except for x, the matrices A, L, and U are all stored in compressed arrays. In the G-
P algorithm, for each column, we need to transfer values between compressed
matrices and the uncompressed vector x (lines 6 and 7 in Algorithm 6 and lines 7
and 8 in Algorithm 10). Figure 5.1 illustrates such a transfer. When we want to store
the values of the uncompressed vector x back to the compressed array, we need
to traverse the compressed array. For each nonzero element, we get the position
from the compressed array, read the value from the corresponding position in x,
and then write it back to the compressed array. If the matrix is extremely sparse,
such a data transfer can dominate the total computational cost because numerical
computations are too few.
Generally speaking, visiting successive memory addresses benefits cache hits, but
visiting random memory addresses will lead to a high cache miss rate. When the
matrix is extremely sparse, each column has very few nonzero elements. In this
case, it is easy to understand that visiting nonzero elements in x can lead to a high
cache miss rate because of the large address stride between nonzero elements,
especially when the matrix is large. Circuit matrices are typically very sparse.
It can be obtained from [1] that the SPR (defined in Eq. (3.2)) of many circuit
matrices is less than 5, but for non-circuit matrices, the SPR can be up to 1000.
This means that there are very few nonzero elements in each row/column of the
LU factors of many circuit matrices.

Fig. 5.1 Transferring data between the compressed array and the uncompressed array: compressed values {a, b, c} with positions {5, 3, 8} correspond to entries x[5], x[3], and x[8] of the uncompressed vector x

The map algorithm is proposed to resolve the above two problems for extremely
sparse matrices. In the map algorithm, the uncompressed array x is avoided. Instead,
the addresses corresponding to the positions that will be updated during sparse LU
factorization are recorded in advance.

5.1.2 Map Definition and Construction

The map algorithm does not use the uncompressed array x. Instead, the compressed
storages of L and U are directly used in the numerical factorization. To solve the
indexing problem, the concept of map is proposed. The map is defined as a pointer
array which records all the addresses corresponding to the positions that will be
updated during the G-P left-looking sparse LU factorization. The map records all such
addresses in sequence. By employing the map, in the G-P algorithm, we only need to
directly update the numerical values which are pointed by the corresponding pointers
recorded in the map, instead of searching the update positions from compressed
arrays. After each update operation, the pointer is increased by one to point to the
next update position.
Creating the map is trivial. We just need to go through the factorization process
and record all the positions which are updated during sparse LU factorization in
sequence. Algorithm 17 shows the algorithm flow for creating the map. Besides
the map itself, we also record another array ptr, which stores the location of each
column's first pointer in the map and will be used for parallel map-based re-factorization. In
SPICE-like circuit simulation, the map is created after each full factorization. As
most of the factorizations are re-factorizations, the map is re-created very few times,
so its computational cost can be ignored. Actually our tests have shown that the time
overhead of creating a map is generally less than the runtime of one full factorization.

Algorithm 17 Creating the map.


Input: The symbolic pattern of the LU factors
Output: The map map and the map pointers ptr
1: Allocate memory spaces for map and ptr
2: ptr [1] = 0
3: for k = 1 : N do
4: for j < k where U_jk is a nonzero element do
5: for i = j + 1 : N where L_ij is a nonzero element do
6: *map = the position of U_ik or L_ik in its compressed storage
7: ++map
8: end for
9: end for
10: ptr[k + 1] = map //ptr records the location of each column's first pointer in map
11: end for

5.1.3 Sequential Map Re-factorization

Algorithm 18 shows the sequential map re-factorization algorithm. It is much simpler


than the original G-P sparse left-looking re-factorization algorithm which is shown
in Algorithm 10. The flow of the three for loops is quite similar to a dense LU
factorization or Gaussian elimination algorithm. For each numerical update, we do
not need to find the update position by employing the uncompressed array x. Instead,
the positions recorded in the map are directly used to indicate the update positions
(line 4). After each numerical update, the map pointer is increased by one to point
to the next update position (line 5).
To better understand the map re-factorization algorithm, Fig. 5.2 illustrates a sim-
ple example. Assume that we are now updating column k, and column k depends on
columns i1 and i2 because U_{i1,k} and U_{i2,k} are nonzero elements. We first use column
i1 to update column k. Column i1 has two nonzero elements, at rows i2 and i4 (note
that nonzero elements in each column of L are not required to be stored in order).
The first operation is U_{i2,k} = U_{i2,k} − U_{i1,k} · L_{i2,i1} (the red lines in Fig. 5.2). The address of
U_{i2,k} is the first pointer of the pointers for column k in the map. The second operation
is L_{i4,k} = L_{i4,k} − U_{i1,k} · L_{i4,i1} (the blue lines in Fig. 5.2). The address of L_{i4,k} is the second
pointer of the pointers for column k in the map. The third operation is to use the sole
nonzero element in column i2 to update column k (the green lines in Fig. 5.2), and it
can be done in a similar way.
The map re-factorization algorithm brings us two advantages for extremely sparse
matrices. First, the cache efficiency is improved because the uncompressed vector
x is avoided. Second, indirect memory accesses are also reduced, because directly
visiting compressed arrays of sparse matrices only involves successive and direct
memory accesses. However, the map algorithm is only suitable for extremely sparse
matrices. There are two reasons for this point. First, if a matrix is not extremely
sparse, the overhead of cache miss and data transfer between compressed arrays
and the uncompressed array can be ignored as the floating-point computational cost

Algorithm 18 Sequential map re-factorization algorithm.


Input: Matrix A, the symbolic pattern of the LU factors, and the map map
Output: Numerical values of the LU factors
1: for k = 1 : N do
2: for j < k where U_jk is a nonzero element do
3: for i = j + 1 : N where L_ij is a nonzero element do
4: *map = *map − L_ij · U_jk //numerical update
5: ++map //point to the next update position
6: end for
7: end for
8: L(k : N, k) = L(k : N, k) / U_kk
9: end for

Fig. 5.2 Illustration of the map algorithm: (a) numerical update of column k, (b) the map

dominates the total runtime. Second, for non-extremely sparse matrices, the map can
be so long that the main memory may not hold the map.
Please note that the map algorithm can only be applied to re-factorization but not
full factorization, because the map can be created only when the symbolic pattern
of the LU factors is known. As shown in Fig. 3.1, in NICSLU, if the map algo-
rithm is selected, we still perform the column algorithm in full factorization. In
re-factorization, the map is first created if it is not yet created. In SPICE-like cir-
cuit simulation, the map algorithm not only takes advantage of the high sparsity of
circuit matrices, but also utilizes the unique feature that the matrix values change
slowly in Newton–Raphson iterations. Since full factorization is performed very few
times, map creation is also required infrequently. This means that successive re-
factorizations which follow the same full factorization can use the same map, so the
map is not required to be re-created for these re-factorizations. This feature signifi-
cantly saves the overhead of map creation.
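For illustration, the inner update of Algorithm 18 might look roughly as follows in C, assuming the map is stored as an array of pointers into the compressed value storage of L and U; all array names and the exact storage layout are assumptions of this sketch, not NICSLU's implementation.

/* Map-based update of one column k (Algorithm 18, lines 2-7). up/ui/ux is the
 * CSC storage of U (row indexes of each column sorted, diagonal included);
 * lp/lx is the CSC storage of the strictly lower part of L. Column k of the
 * LU storage is assumed to be pre-loaded with A(:, k). 'map' walks the
 * recorded update targets of column k in exactly the recorded order. */
static double **map_update_column(int k, const int *up, const int *ui,
                                  const double *ux, const int *lp,
                                  const double *lx, double **map)
{
    for (int t = up[k]; t < up[k + 1]; t++) {
        int j = ui[t];
        if (j >= k)
            continue;                /* only off-diagonal U_jk (j < k) drive updates */
        double ujk = ux[t];          /* U_jk is final here: earlier iterations only
                                        modified entries below row j                 */
        for (int p = lp[j]; p < lp[j + 1]; p++) {
            **map -= lx[p] * ujk;    /* *map points at U_ik or L_ik of column k      */
            ++map;                   /* advance to the next recorded target          */
        }
    }
    return map;                      /* caller finishes with L(k:N, k) /= U_kk       */
}

Note that no uncompressed vector and no position search appear anywhere in the loop; every update is a single indirect load-modify-store through a precomputed pointer.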

5.1.4 Parallel Map Re-factorization

Algorithm 19 Pipeline mode map re-factorization algorithm.


1: for all available threads running in parallel do
2: while the tail of the pipeline sequence is not reached do
3: Get a new un-factorized column, say column k//by pseudo dynamic schedul-
ing
4: map = ptr [k]//get the map for column k
5: for j < k where U_jk is a nonzero element do
6: Wait for column j to finish//inter-thread communication
7: for i = j + 1 : N where L_ij is a nonzero element do
8: *map = *map − L_ij · U_jk //numerical update
9: ++map //point to the next update position
10: end for
11: end for
12: L(k : N, k) = L(k : N, k) / U_kk
13: Mark column k as finished
14: end while
15: end for

In parallel map re-factorization, we also apply the dual-mode scheduling strategy (i.e.,
the cluster and pipeline modes) to schedule tasks. The only point that is worth men-
tioning is that, in the parallel map re-factorization algorithm, since each thread does
not compute successive columns, the map pointers ptr constructed in Algorithm 17
are required to obtain the map starting position for desired columns. Algorithm 19
shows the algorithm flow of the pipeline mode map re-factorization algorithm. Before
factorizing a column, a thread first obtains the map for that column from ptr , i.e.,
the first update position of that column (line 4). The numerical update part is almost
the same as that in the sequential map algorithm, i.e., Algorithm 18. In the pipeline
mode, before visiting a dependent column, we also need to wait for it to finish (line
6).

5.2 Supernodal Algorithm

In this section, we will present the supernodal algorithm in detail. The supernodal
algorithm is proposed to enhance the performance for slightly dense circuit matri-
ces by utilizing dense submatrix kernels. Different from the supernodal algorithm
adopted by SuperLU and SuperLU_MT [2, 3] which is actually a supernode-panel (in
SuperLU and SuperLU_MT, a panel means a set of successive columns which may
have different symbolic patterns) algorithm, our supernodal algorithm is a supernode-
column algorithm. Although circuit matrices can be sometimes slightly dense, they

are still much sparser than sparse matrices from other applications, such as finite ele-
ment analysis. Such an observation prevents us from adopting such a heavyweight
supernode-panel algorithm. Instead, we adopt the lightweight supernode-
column algorithm which can well fit slightly dense circuit matrices.

5.2.1 Motivation

Although circuit matrices are usually very sparse, they can also be dense in some
special cases. For example, post-layout circuits will contain large power and ground
meshes so matrices created by MNA can be dense due to the mesh nature. For the
LU factors of such matrices, there are many nonzero elements that can form dense
submatrices. To efficiently solve such matrices, we borrow the concept of supernode
from SuperLU and develop a lightweight supernode-column algorithm which is quite
suitable for slightly dense circuit matrices. The performance can be greatly improved
by utilizing a vendor-optimized BLAS library.

5.2.2 Supernode Definition and Storage

We have already given a brief introduction to supernodes in Sect. 2.1.1.1. In NIC-


SLU, the definition of supernode is a special case of the cases introduced in
Sect. 2.1.1.1. We adopt the same definition of supernode as that adopted by SuperLU
and SuperLU_MT. A supernode is defined as a set of successive columns of L with
triangular diagonal block full and the same structure in the columns below the diago-
nal block [4, 5]. Figure 5.3a illustrates an example of a supernode, which is composed

Fig. 5.3 Supernode definition and storage of a supernode: (a) supernode, (b) storage of a supernode (column order), with the upper triangular part of the diagonal block left as padding



of 4 columns of L. A supernode is stored by a column-wise dense matrix. The upper


triangular diagonal part of U is not stored in the supernode so these positions are
left blank, i.e., they can be regarded as paddings. Figure 5.3b illustrates the storage
corresponding to the supernode shown in Fig. 5.3a. Besides the dense matrix used to
store the numerical values, we also need an integer array to store the row indexes of
the supernode.
Employing supernodes in the G-P sparse left-looking algorithm is straightforward,
as the two are compatible. Supernode construction in the G-P algorithm is easy. As
the G-P algorithm is a column-based algorithm, once the symbolic pattern of a column,
say column k, is known, we can compare its symbolic pattern with that of the previous
(left) column, i.e., column k − 1, to check whether they can belong to the same supernode.
Namely, if the number of nonzero elements in L(k : N, k) equals the number of nonzero
elements in L(k − 1 : N, k − 1) minus one, and the symbolic pattern of L(k : N, k)
is a subset of that of L(k − 1 : N, k − 1), then columns k and k − 1 belong to the
same supernode.
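A minimal sketch of this detection test is shown below; for simplicity it assumes that the row indexes of each column of L are stored sorted with the diagonal entry first, which is an assumption of this sketch only (NICSLU does not require sorted storage, in which case a marker-array subset test would be used instead).

/* Can column k be appended to the supernode that ends at column k-1?
 * lp/li is the CSC storage of L; column k occupies positions lp[k]..lp[k+1]-1. */
static int extends_supernode(int k, const int *lp, const int *li)
{
    int nnz_prev = lp[k]     - lp[k - 1];  /* entries of L(k-1:N, k-1) */
    int nnz_cur  = lp[k + 1] - lp[k];      /* entries of L(k:N, k)     */
    if (nnz_cur != nnz_prev - 1)
        return 0;
    /* With sorted indexes and the counts matching, the subset test reduces to
     * an element-wise comparison against column k-1 with its diagonal skipped. */
    for (int t = 0; t < nnz_cur; t++)
        if (li[lp[k] + t] != li[lp[k - 1] + 1 + t])
            return 0;
    return 1;
}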

5.2.3 Supernodal Full Factorization

5.2.3.1 Sequential Algorithm

After grouping columns with the same symbolic pattern together, numerical updates
from these columns can be combined together by utilizing supernodal operations, i.e.,
supernode-column updates. Figure 5.4 explains why we can perform a supernode-
column update instead of multiple column-column updates. Suppose we are fac-
torizing column k, and there is a nonzero element U jk in column k. This means
that column k depends on column j. We further assume that column j belongs to a
supernode which ends at column s, as illustrated in Fig. 5.4. We do not care about the first
(leftmost) column of the supernode, since it has no impact on the supernode-column
update. According to the theory of the symbolic prediction presented in Sect. 3.3.1,
there must be fill-ins at rows j + 1, j + 2, . . . , s in column k. Consequently, column

Fig. 5.4 Explanation of the supernode-column update

Algorithm 20 Sequential supernodal full factorization algorithm.


Input: N N matrix A obtained from pre-analysis
Output: Matrix L and U
1: L = I
2: for k = 1 : N do
3: Symbolic prediction: determine the symbolic pattern of column k, i.e., the
columns that will update column k
4: Supernode detection: determine whether column k belongs to the same
supernode as column k − 1
5: Numeric update: solve Lx = A(:, k) using Algorithm 21
6: Partial pivoting on x using Algorithm 8
7: U(1 : k, k) = x(1 : k)
8: L(k : N, k) = x(k : N) / x_k
9: Pruning: reduce the symbolic prediction cost of subsequent columns
10: end for

Algorithm 21 Solving Lx = A(:, k) using supernodal updates.


Input: Values, nonzero patterns and supernode information of columns 1 to k − 1
of L, and symbolic pattern of column k of U
Output: x//x is a column vector of length N
1: x = A(:, k)
2: for j < k where U jk is a nonzero element do
3: if column j has not been used to update column k then
4: if column j belongs to a supernode that ends at column s then //perform
supernode-column update
5: x(j : s) = L(j : s, j : s)^{-1} · x(j : s)
6: x(s + 1 : N) = x(s + 1 : N) − L(s + 1 : N, j : s) · x(j : s)
7: Mark columns j to s as used
8: else //perform column-column update
9: x(j + 1 : N) = x(j + 1 : N) − x_j · L(j + 1 : N, j)
10: end if
11: end if
12: end for

k must also depend on columns j + 1, j + 2, ..., s. To state this point in general: if
a column depends on another column that belongs to a supernode, then
this column must depend on a set of successive columns from the said dependent
column to the last (rightmost) column of the supernode.
The construction of supernodes means that the numerical updates from suc-
cessive columns in a supernode can also be grouped together by utilizing two BLAS
routines: triangular solving dtrsv (or ztrsv for complex numbers) and matrix vec-
tor multiplication dgemv (or zgemv for complex numbers). Algorithm 20 shows the

algorithm flow of the sequential supernodal full factorization algorithm, where the
numerical update flow is shown in Algorithm 21. Compared with the basic column-
based G-P algorithm which is shown in Algorithms 6 and 7, there are two major
differences. First, after the symbolic prediction of each column, supernode detec-
tion (line 4 of Algorithm 20) is performed to determine whether the current column
belongs to the same supernode as the previous column. Second, the numerical update
is different. As shown in lines 4–10 of Algorithm 21, if a dependent column belongs
to a supernode, we use two BLAS routines to perform a supernode-column update;
otherwise the conventional column-column update is executed. It is easy to verify that
the supernode-column update is equivalent to multiple successive column-column
updates in theory.
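For illustration, lines 5 and 6 of Algorithm 21 could be realized with CBLAS roughly as follows, assuming the supernode is stored as a column-major dense block plus a row-index array as in Fig. 5.3; the function name, the scratch buffer, and the scatter step are assumptions of this sketch rather than NICSLU's actual code.

#include <cblas.h>

/* Supernode-column update sketch. The supernode holds columns j..s of L as a
 * column-major dense block of size nrow x nsup (nsup = s - j + 1) whose first
 * nsup rows form the triangular diagonal block; rows[] maps local rows to
 * global row indexes. x is the uncompressed working vector of length N and
 * work is a scratch buffer of length nrow - nsup. */
static void supernode_column_update(const double *snode, int nrow, int nsup,
                                    const int *rows, int j, double *x,
                                    double *work)
{
    int nbelow = nrow - nsup;

    /* x(j:s) = L(j:s, j:s)^{-1} x(j:s): triangular solve with the diagonal
     * block; CblasUnit is used because L has a unit diagonal in this algorithm. */
    cblas_dtrsv(CblasColMajor, CblasLower, CblasNoTrans, CblasUnit,
                nsup, snode, nrow, &x[j], 1);

    if (nbelow > 0) {
        /* work = L(s+1:N, j:s) * x(j:s) on the supernode's dense rows ...    */
        cblas_dgemv(CblasColMajor, CblasNoTrans, nbelow, nsup,
                    1.0, snode + nsup, nrow, &x[j], 1, 0.0, work, 1);
        /* ... then scatter-subtract into x, because the rows below the
         * diagonal block are not contiguous in the global numbering.         */
        for (int r = 0; r < nbelow; r++)
            x[rows[nsup + r]] -= work[r];
    }
}

Because calling BLAS has a fixed overhead, such a routine only pays off for reasonably wide supernodes, which is consistent with the decision discussed later in this section to keep single-column updates out of BLAS.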
The proposed supernodal algorithm has three advantages compared with the
column-based G-P algorithm for slightly dense matrices. First, due to the dense
storage of supernodes, indirect memory accesses within supernodes are avoided.
Second, we can utilize a vendor-optimized BLAS library to compute dense submatrix
operations, so that the performance can be significantly enhanced. Finally, the cache
efficiency can also be improved because supernodes are stored in contiguous arrays.
The proposed supernode-column algorithm is different from SuperLU or PAR-
DISO, although they also utilize supernodes to enhance the performance for dense
submatrices. SuperLU and PARDISO both use a so-called supernode-supernode or
supernode-panel algorithm, where each supernode is updated by dependent supern-
odes. The reason why they use such a method is that, when multiple columns depend
on a same supernode, the common dependent supernode will be read for multiple
times to update these columns separately. Consequently, gathering these columns
into a destination supernode (regardless of whether they have the same symbolic
pattern) and updating them together will make the common dependent supernode be
read only once. However, considering the fact that modern CPUs always have large
caches and supernodes in circuit matrices cannot be too large, many supernodes can
reside in cache simultaneously. Reading a supernode multiple times cannot signifi-
cantly degrade the performance. In addition, the supernode-panel algorithm adopted
by SuperLU and SuperLU_MT can introduce some additional computations and
fill-ins. Consequently, we develop the supernode-column algorithm which is more
lightweight than the supernode-supernode or supernode-panel algorithm adopted by
SuperLU and PARDISO. Another difference from SuperLU lies in the implementation
of the supernodal numerical update step. In SuperLU and SuperLU_MT, actually there
are only supernodes but there is no concept of column. Even if a column cannot
form a supernode with its neighboring columns, it is still treated as a supernode.
Any numerical update is performed by calling BLAS routines. In NICSLU, how-
ever, we do not call BLAS for column-column updates, which are computed by our
own code. As calling library routines involves some extra penalty, such as the stack
operations, using BLAS to compute a single-column supernode is not a good idea,
since the computational cost is too small, compared with other overhead associated
with calling library routines.

Algorithm 22 Pipeline mode supernodal full factorization algorithm.


1: for all available threads running in parallel do
2: while the tail of the pipeline sequence is not reached do
3: Get a new un-factorized column, say column k//by static scheduling or
pseudo dynamic scheduling
4: if the previous column in the pipeline sequence is not finished then //pre-
factorization
5: S = ∅
6: Symbolic prediction
7: Determine which columns and supernodes will update column k
8: Skip all unfinished columns
9: Put newly detected columns and supernodes into S
10: Numerical update
11: Use the columns and supernodes stored in S to update column k
12: end if
13: if there are skipped columns in the above symbolic prediction then
14: S = ∅
15: Symbolic prediction
16: Determine which columns and supernodes will update column k
17: Skip all unfinished columns
18: Put newly detected columns and supernodes into S
19: Numerical update
20: Use the columns and supernodes stored in S to update column k
21: end if
22: Wait for all the children of column k to finish
23: S = ∅ //post-factorization
24: Symbolic prediction
25: Determine the exact symbolic pattern of column k
26: Determine which columns and supernodes will update column k
27: Without skipping any columns
28: Put newly detected columns and supernodes into S
29: Supernode detection
30: determine whether column k belongs to the same supernode as column k − 1
31: Numerical update
32: Use the columns and supernodes stored in S to update column k
33: Partial pivoting
34: Pruning
35: end while
36: end for

5.2.3.2 Parallel Algorithm

In parallel supernodal full factorization, we also adopt the dual-mode scheduling


strategy. It is worth mentioning that we use the column-based cluster mode without
detecting supernodes. Namely, supernode detection and supernode-column updates
are only performed in the pipeline mode. The reason behind this point is that
supernode detection and construction can generate dependence between indepen-
dent columns. In the cluster mode, columns that can be factorized concurrently are
completely independent. However, supernode detection will add extra dependence
between these columns which will further affect the parallelism. Fortunately, there
are only a small number of columns belonging to the cluster mode and they are very
sparse, so they tend not to form (big) supernodes.
Algorithm 22 shows the algorithm flow of the pipeline mode supernodal full fac-
torization. It is quite similar to Algorithm 14 with a major difference that supernode-
related operations are integrated in the symbolic prediction and numerical update
steps. In symbolic prediction, unfinished columns are skipped and all the finished
and dependent columns are recorded. Different from the column-based pipeline mode
algorithm, which is shown in Algorithm 14, the set S here records both columns and
supernodes that will be used to update the current column. Numerical update is per-
formed in a supernode-column or column-column manner, depending on whether
the dependent column belongs to a supernode or not, at the time when it is used, just
like the numerical update flow shown in Algorithm 21. Except for these operations,
other steps are almost unchanged from the column-based pipeline algorithm. We will
not explain them again for concision.

5.2.4 Supernodal Re-factorization

In re-factorization, the symbolic pattern of the LU factors is fixed so all the supern-
odes are also fixed. Namely, whether a column belongs to a supernode and which
supernode it belongs to are known and fixed. Consequently, like the column-based
re-factorization algorithm, we also only need to perform the numerical update in the
supernodal re-factorization algorithm.

5.2.4.1 Sequential Algorithm

Algorithm 23 shows the algorithm flow of the sequential supernodal re-factorization algorithm. It is quite similar to the supernodal numerical update algorithm shown in Algorithm 21. Compared with the sequential column-based re-factorization algorithm shown in Algorithm 10, the only difference is the numerical update step. As shown in lines 5–16 of Algorithm 23, when using a column, say column j, to update the current column, say column k, we first check whether column j belongs to a supernode. If so, we perform a supernode-column update (lines 10 and 11), and

Algorithm 23 Sequential supernodal re-factorization algorithm.


Input: Matrix A, the symbolic pattern of the LU factors, and the supernode information
Output: Numerical values of the LU factors
1: L = I
2: for k = 1 : N do
3: x = A(:, k) //x is a column vector of length N
4: for j < k where U_jk is a nonzero element do
5: if column j has not been used to update column k then
6: if column j belongs to a supernode that ends at column s then //perform supernode-column update
7: if column k belongs to the same supernode as column j then
8: s = k − 1
9: end if
10: x(j : s) = L(j : s, j : s)⁻¹ x(j : s)
11: x(s + 1 : N) = x(s + 1 : N) − L(s + 1 : N, j : s) x(j : s)
12: Mark columns j to s as used
13: else //perform column-column update
14: x(j + 1 : N) = x(j + 1 : N) − x_j · L(j + 1 : N, j)
15: end if
16: end if
17: end for
18: U(1 : k, k) = x(1 : k)
19: L(k : N, k) = x(k : N) / x_k
20: end for

columns belonging to the supernode are all marked as used so that they will not be used to update column k again (line 12); otherwise, a column-column update is performed (line 14). A special case is that columns j and k belong to the same supernode. In this case, the index of the last column of the supernode is larger than or equal to k; however, only columns j to k − 1 of the supernode are required to update column k, so we need to treat column k − 1 as the last column of the supernode instead of its actual last column (lines 7–9).
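The dense kernels behind a supernode-column update can be sketched as follows. This is a minimal NumPy/SciPy illustration of lines 10 and 11 of Algorithm 23 under the assumption that the diagonal block L(j:s, j:s) and the block below it are available as dense arrays; the function name, the array layout, and the toy example are assumptions made for this sketch and are not NICSLU's actual interface.

import numpy as np
from scipy.linalg import solve_triangular

def supernode_column_update(x, L_block, L_below, j, s):
    """Apply the update of supernode columns j..s (1-based, inclusive) to the
    working column x, mimicking lines 10 and 11 of Algorithm 23.

    L_block: dense (s-j+1) x (s-j+1) unit lower-triangular block L(j:s, j:s)
    L_below: dense (N-s) x (s-j+1) block L(s+1:N, j:s)
    """
    lo, hi = j - 1, s                      # 0-based, half-open [lo, hi)
    # x(j:s) = L(j:s, j:s)^(-1) x(j:s): triangular solve with the diagonal block
    x[lo:hi] = solve_triangular(L_block, x[lo:hi], lower=True, unit_diagonal=True)
    # x(s+1:N) = x(s+1:N) - L(s+1:N, j:s) x(j:s): one dense matrix-vector product
    x[hi:] -= L_below @ x[lo:hi]
    return x

# Toy example: a 2-column supernode (columns 2..3 of a 5x5 factor) updating x
rng = np.random.default_rng(0)
N, j, s = 5, 2, 3
x = rng.standard_normal(N)
L_block = np.tril(rng.standard_normal((2, 2)), -1) + np.eye(2)   # unit lower triangular
L_below = rng.standard_normal((N - s, 2))
supernode_column_update(x, L_block, L_below, j, s)

In a real implementation these two operations map to BLAS triangular-solve and matrix-vector kernels, which is why one supernode-column update outperforms the equivalent sequence of column-column updates.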

5.2.4.2 Parallel Algorithm

The parallel supernodal re-factorization algorithm is also scheduled by the dual-mode strategy. We will not introduce the details here. The only point worth mentioning is the waiting mechanism in the pipeline mode. In the column-based pipeline mode, when we want to access a dependent column to perform the numerical update, we need to wait for it to finish. The same method can be applied to the pipeline mode supernodal algorithm. In other words, if a dependent column belongs

Fig. 5.5 Illustration of the supernodal pipeline mode algorithm (timelines comparing the naive supernodal pipeline, the column-based pipeline, and the proposed supernodal pipeline, showing supernode-column updates, column-column updates, and waiting due to unfinished columns)

to a supernode, we can simply wait for the entire supernode to finish. This does not cause any accuracy problem, but it does cause a performance problem. If the supernode is very large, i.e., it is composed of many columns, the waiting cost can be high, and the performance may even be poorer than that of column-based re-factorization. In the column-based pipeline mode, we can access a dependent column immediately after it is finished; however, if we wait for the entire supernode to finish, we can access the supernode only after all the columns belonging to the supernode are finished. In this case, the waiting time can be very long. We still use the example shown in Fig. 5.4 to illustrate this problem. Column k depends on columns j to s. When we are factorizing column k and want to use columns j to s to perform a supernode-column update, if column s is not finished, we need to wait until column s is finished and only then perform the supernode-column update. In other words, access to column j is delayed until column s is finished, rather than until column j itself is finished. In the column-based pipeline mode algorithm, by contrast, updates from already finished columns can be performed before column s is finished. Figure 5.5 illustrates and compares the two cases (naive supernodal pipeline and column pipeline). Note that in the column-based pipeline mode algorithm, we may still wait for some additional time due to previous unfinished column-column updates.
To solve this problem, we propose to partition a large supernode into two parts. Please note that the partition does not mean that we explicitly store a large supernode as two separate parts. It only means that when performing supernode-column updates, a large supernode is treated as two smaller supernodes, so that two supernode-column updates are performed. We can treat the finished columns in a large supernode

Algorithm 24 Pipeline mode supernodal re-factorization algorithm.


1: for all available threads running in parallel do
2: while the tail of the pipeline sequence is not reached do
3: Get a new un-factorized column, say column k //by pseudo dynamic scheduling
4: x = A(:, k) //x is a column vector of length N
5: for j < k where U_jk is a nonzero element do
6: if column j has not been used to update column k then
7: if column j belongs to a supernode that ends at column s then //perform supernode-column update
8: if column k belongs to the same supernode as column j then
9: s = k − 1
10: end if
11: if s − j + 1 < 2P then //small supernode, one supernode-column update
12: Wait for column s to finish
13: x(j : s) = L(j : s, j : s)⁻¹ x(j : s)
14: x(s + 1 : N) = x(s + 1 : N) − L(s + 1 : N, j : s) x(j : s)
15: else //large supernode, two supernode-column updates
16: Wait for column s − P to finish //first supernode-column update
17: x(j : s − P) = L(j : s − P, j : s − P)⁻¹ x(j : s − P)
18: x(s − P + 1 : N) = x(s − P + 1 : N) − L(s − P + 1 : N, j : s − P) x(j : s − P)
19: Wait for column s to finish //second supernode-column update
20: x(s − P + 1 : s) = L(s − P + 1 : s, s − P + 1 : s)⁻¹ x(s − P + 1 : s)
21: x(s + 1 : N) = x(s + 1 : N) − L(s + 1 : N, s − P + 1 : s) x(s − P + 1 : s)
22: end if
23: Mark columns j to s as used
24: else //perform column-column update
25: Wait for column j to finish
26: x(j + 1 : N) = x(j + 1 : N) − x_j · L(j + 1 : N, j)
27: end if
28: end if
29: end for
30: U(1 : k, k) = x(1 : k)
31: L(k : N, k) = x(k : N) / x_k
32: Mark column k as finished
33: end while
34: end for

as a small supernode and use them to perform a supernode-column update first. After that, we can wait for the rest of the columns in the supernode, and finally use them to perform the remaining numerical update by a second supernode-column
update. Figure 5.5 also illustrates this case (supernodal pipeline). Because a supernode-column update is faster than the equivalent multiple column-column updates, the first supernode-column update tends to finish quickly, and, hence, the second supernode-column update can usually be started immediately after column s is finished. Consequently, the total runtime may be reduced compared with the column-based pipeline mode. To optimize this implementation, the second part of the supernode should contain only a few columns; otherwise the second supernode-column update may still consume too much runtime. In NICSLU, the threshold used to judge whether a supernode is so large that it needs to be partitioned into two parts is 2P, where P is the number of invoked threads. The size of the second part of the supernode is always set to P. The key reason behind this setting is that, if there are two columns, say columns j and k (column j is on the left of column k), whose positions in the pipeline sequence differ by more than P, and if column k is being factorized, then column j must have already been finished, because there are only P threads. Consequently, setting the size of the second part of the supernode to P ensures that no waiting happens for the first supernode-column update. According to this principle, we present the algorithm flow of the pipeline mode supernodal re-factorization in Algorithm 24. Lines 12–14 correspond to the case in which only one supernode-column update is invoked, and lines 16–21 correspond to the case in which two supernode-column updates are invoked. The other operations in this algorithm flow have already been explained before, so we will skip them here.
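The waiting rule and the 2P/P partitioning described above can be condensed into the following control-flow sketch. The helpers wait_for and update_range stand for whatever synchronization primitive and supernode-column kernel an implementation provides; they are assumptions of this sketch, not NICSLU's API.

def apply_supernode_update(j, s, P, wait_for, update_range):
    """Use the dependent supernode spanning columns j..s to update the current
    column in the pipeline mode, following the partitioning rule above.

    wait_for(c)        : blocks until column c is factorized
    update_range(a, b) : performs one supernode-column update with columns a..b
    P                  : number of invoked threads (partition threshold is 2P)
    """
    if s - j + 1 < 2 * P:
        # Small supernode: a single supernode-column update once column s is done.
        wait_for(s)
        update_range(j, s)
    else:
        # Large supernode: the leading s-P columns are already finished, because
        # their pipeline positions differ from the current column by more than P.
        wait_for(s - P)              # should return immediately
        update_range(j, s - P)       # first supernode-column update
        wait_for(s)                  # only the trailing P columns may be in flight
        update_range(s - P + 1, s)   # second supernode-column update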

5.3 Fast Full Factorization

In this section, we will present a fast full factorization algorithm based on a novel
pivoting reduction technique. The proposed technique is used to accelerate full fac-
torization and improve its scalability for sparse matrices. It is also well compatible
with the SPICE-like circuit simulation flow.

5.3.1 Motivation and Pivoting Reduction

KLU and NICSLU both have full factorization and re-factorization to perform numer-
ical LU factorization. Re-factorization does not perform any pivoting so it is faster
than full factorization. However, re-factorization is numerically unstable, so we can
use it only when we can guarantee the numerical stability. Full factorization accom-
modates partial pivoting, but it is slower and its scalability is poor. In the Newton-Raphson iterations of SPICE-like circuit simulation, the matrix values change slowly and the difference between the matrix values of two successive iterations is small, especially when the Newton-Raphson method is converging. In this case, if full factorization is invoked, it tends to reuse most of the previous pivot choices. Consider an extreme case in which the second full factorization completely reuses the pivoting
order generated in the first full factorization. In this case, the symbolic predictions
performed in the second full factorization are actually useless because the symbolic
pattern is unchanged. However, before the second full factorization, we do not know
whether it really reuses the pivoting order so we still need to do pivoting during
factorization. If only a few columns change their pivot choices, the same issue arises: the symbolic predictions of the remaining columns in the second full factorization are useless. Our test statistics show that the symbolic prediction costs on
average 20% of the total runtime of full factorization. For extremely sparse matrices,
this ratio can be up to 50%. Therefore, if the useless symbolic predictions can be
avoided, the performance of full factorization can be significantly improved.
Why not borrow some ideas from re-factorization? Re-factorization is based on
the prerequisite that the symbolic pattern of the LU factors and the pivoting order are
known and fixed. In the second full factorization, when we are factorizing a column,
its symbolic pattern can be considered known from the first full factorization.
Here the only difference between full factorization and re-factorization is that, in full
factorization, the symbolic pattern may be changed if the pivot choice of that column
is changed from the first full factorization. However, before the symbolic prediction
of the column, we can assume that its symbolic pattern is known so we can directly
use the symbolic pattern obtained in the first full factorization. Then the symbolic
prediction of that column can be skipped, and the numerical update can be done as
usual. After that, partial pivoting is performed. If the pivot choice is changed, it means
that for subsequent columns, the symbolic pattern is also changed so the symbolic
prediction cannot be skipped. On the contrary, if that column still uses the previous
pivot choice, our assumption holds and the symbolic prediction of the next column
can still be skipped. To maximize the number of skipped symbolic predictions, we should reuse the previous pivot choices as much as possible. Toward this goal, we develop a pivoting reduction technique, which is quite simple but effective. This gives us an opportunity to skip the symbolic prediction for as many columns as possible.
In the conventional partial pivoting method, the diagonal element has the highest
priority when searching for the pivot. As shown in Algorithm 8, if the diagonal
absolute value is larger than or equal to the product of the threshold and the maximum
absolute value in the corresponding column, then the diagonal element can be the

Algorithm 25 Pivoting reduction-based partial pivoting on x for column k


Input: k, x, previous pivot position p, and pivoting threshold ε //the default value of ε is 10⁻³
Output: x //elements of x may be exchanged when returning
1: Find the element with the largest magnitude in x(k : N), say x_m
2: if |x_p| ≥ ε · |x_m| then //the element at the previous pivot position is large enough
3: return
4: else if |x_k| < ε · |x_m| then //re-pivoting required, do conventional partial pivoting
5: Exchange the positions of x_k and x_m, and record the permutation as well
6: end if

pivot, even if it does not have the maximum magnitude in the column. In the pivoting reduction technique, the element at the previous pivot position has the highest priority. Namely, when we are doing partial pivoting for a column, we first check whether the element at the previous pivot position is still large enough in absolute value to remain the pivot. If so, the pivot order is not changed; otherwise, conventional partial pivoting is performed. This is the so-called pivoting reduction technique. Algorithm 25 shows the algorithm flow of the pivoting reduction technique. It reuses previous pivot choices as often as possible and, hence, helps keep the symbolic pattern unchanged as much as possible. Based on the pivoting reduction technique, we will develop a fast full factorization algorithm.
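A compact sketch of Algorithm 25 in Python (0-based indices) is given below. The parameter name eps and the way the permutation is recorded are illustrative simplifications; the essential point is the priority order, where the element at the previous pivot position is tested first and a pivot exchange happens only when neither it nor the diagonal is large enough.

import numpy as np

def pivoting_reduction(x, k, p, perm, eps=1e-3):
    """Sketch of Algorithm 25 with 0-based indices.

    x    : dense working column; candidates are x[k:]
    p    : previous pivot position for column k
    perm : row permutation, updated when an exchange happens
    eps  : pivoting threshold (default 10^-3, as in Algorithm 25)
    Returns the position finally chosen as the pivot of column k.
    """
    m = k + int(np.argmax(np.abs(x[k:])))     # largest magnitude in x(k:N)
    if abs(x[p]) >= eps * abs(x[m]):
        return p                               # previous pivot is still acceptable
    if abs(x[k]) < eps * abs(x[m]):
        # conventional partial pivoting (Algorithm 8): exchange x_k and x_m
        x[k], x[m] = x[m], x[k]
        perm[k], perm[m] = perm[m], perm[k]
        return m
    return k                                   # the diagonal element is acceptable

x = np.array([0.1, 5.0, 0.2, 4.9])
perm = list(range(4))
print(pivoting_reduction(x, k=1, p=3, perm=perm))   # prints 3: previous pivot kept

Checking whether the returned position differs from the previous pivot is one way to flag the re-pivoting that makes the fast full factorization of Sect. 5.3.2 fall back to the normal flow.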

5.3.2 Sequential Fast Full Factorization

Algorithm 26 shows the algorithm flow of the sequential fast full factorization, which has two major parts: fast full factorization and normal full factorization. In the fast full factorization (lines 3–13), for each column, the symbolic prediction is skipped and the numerical update is performed based on the symbolic pattern obtained in a previous full factorization. Then the pivoting reduction-based partial pivoting shown in Algorithm 25 is performed. Once a re-pivoting has occurred, the fast factorization is stopped and we enter the normal factorization (lines 14–22), which computes the remaining columns without skipping the symbolic prediction.
It should be noticed that when a re-pivoting occurs at a column k, not all the
subsequent columns (i.e., columns k + 1, k + 2, . . . , N ) are required to be factorized
by the normal full factorization. Only those columns which directly or indirectly
depend on column k require the normal full factorization. However, identifying exactly those dependent columns would itself require traversing all the subsequent columns, which is time-consuming. Consequently, we adopt a simple but effective policy: once a re-pivoting occurs, all the subsequent columns are computed by the normal full factorization algorithm.
From the fast full factorization algorithm presented above, it can be concluded that the performance of fast factorization strongly depends on how much the matrix changes during the Newton-Raphson iterations. If the matrix values change little between iterations, each fast factorization can always reuse the previous pivoting order so that no re-pivoting happens, which is the best case. On the contrary, if the matrix values change dramatically, re-pivoting will always happen. The worst case is that re-pivoting happens at the first column of each fast factorization, so that fast factorization degenerates to the normal full factorization. Consequently, fast factorization should never be slower than normal full factorization.
Although Algorithm 26 is presented for the column algorithm, the idea of fast factorization
can also be easily applied to the supernodal full factorization algorithm. One point
that is worth mentioning is that, if re-pivoting occurs at a column which belongs
to a supernode, the supernode will be changed. If the column in which re-pivoting

Algorithm 26 Sequential fast full factorization.


Input: N × N matrix A; at least one full factorization has been performed for a matrix with the identical symbolic pattern as A
Output: Matrices L and U
1: L = I
2: k = 1
3: while k ≤ N do //fast full factorization
4: Numeric update: solve Lx = A(:, k) using Algorithm 7
5: Pivoting reduction-based partial pivoting on x using Algorithm 25
6: U(1 : k, k) = x(1 : k)
7: L(k : N, k) = x(k : N) / x_k
8: Pruning: reduce the symbolic prediction cost of subsequent columns
9: ++k
10: if re-pivoting has occurred then
11: break
12: end if
13: end while
14: while k ≤ N do //normal full factorization
15: Symbolic prediction: determine the symbolic pattern of column k, i.e., the columns that will update column k
16: Numeric update: solve Lx = A(:, k) using Algorithm 7
17: Partial pivoting on x using Algorithm 8
18: U(1 : k, k) = x(1 : k)
19: L(k : N, k) = x(k : N) / x_k
20: Pruning: reduce the symbolic prediction cost of subsequent columns
21: ++k
22: end while

occurs is the first column of a supernode, then the supernode will be completely destroyed; otherwise, the supernode is ended at that column. Note that subsequent columns may still belong to the same supernodes as before; however, this cannot be determined before those columns are factorized, since their symbolic patterns may be changed due to the re-pivoting. We will not present the details of the supernodal fast factorization algorithm for brevity.
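The two-phase structure of Algorithm 26 can be summarized by the following driver sketch. The per-column routines are passed in as callables because their details (Algorithms 7, 8, and 25) are given elsewhere; the function and parameter names are assumptions of this sketch, not NICSLU's interface.

def fast_then_normal(N, fast_column, normal_column):
    """Control-flow sketch of Algorithm 26.

    fast_column(k)   : factorizes column k without symbolic prediction, using the
                       previous pattern and Algorithm 25; returns True if the
                       previous pivot was reused
    normal_column(k) : factorizes column k with symbolic prediction and
                       conventional partial pivoting (Algorithms 7 and 8)
    Returns the number of columns handled by the fast phase.
    """
    k = 0
    while k < N:                          # fast full factorization
        pivot_reused = fast_column(k)
        k += 1
        if not pivot_reused:              # re-pivoting: remaining patterns may change
            break
    fast_columns = k
    while k < N:                          # normal full factorization
        normal_column(k)
        k += 1
    return fast_columns

# Toy usage: pretend a re-pivoting happens at column 3 of a 6-column matrix
trace = []
def demo_fast(k):
    trace.append(("fast", k))
    return k < 3
def demo_normal(k):
    trace.append(("normal", k))
print(fast_then_normal(6, demo_fast, demo_normal), trace)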

5.3.3 Parallel Fast Full Factorization

We also use the dual-mode scheduling method to perform the parallel fast full fac-
torization. After applying the pivoting reduction-based partial pivoting method to the cluster mode and the pipeline mode, we call the new cluster mode and new pipeline

Fig. 5.6 Scheduling framework of the parallel fast full factorization: the fast cluster mode and the fast pipeline mode are executed when no re-pivoting occurs; once re-pivoting occurs, the remaining columns are handled by the normal cluster mode and/or the normal pipeline mode

mode as fast cluster mode and fast pipeline mode. Figure 5.6 shows the scheduling
framework of the parallel fast full factorization. If no re-pivoting occurs, the fast
cluster and fast pipeline modes are executed successively. If a re-pivoting occurs in
the fast cluster mode, the remaining levels belonging to the cluster mode are com-
puted by the normal cluster mode, and the other levels are computed by the normal
pipeline mode. If re-pivoting occurs in the fast pipeline mode, the fast pipeline mode stops and the normal pipeline mode is invoked for the remaining columns.
It is worth mentioning that in the fast pipeline mode, once a re-pivoting has
occurred at a column, say column k, all the finished computations of subsequent
columns by other threads must be abandoned, and the normal pipeline mode must
be completely restarted from column k + 1, as the finished computations of columns
after column k are based on the old symbolic pattern of column k before the re-
pivoting occurs.

References

1. Davis, T.A., Palamadai Natarajan, E.: Algorithm 907: KLU, a direct sparse solver for circuit simulation problems. ACM Trans. Math. Softw. 37(3), 36:1–36:17 (2010)
2. Demmel, J.W., Gilbert, J.R., Li, X.S.: An asynchronous parallel supernodal algorithm for sparse Gaussian elimination. SIAM J. Matrix Anal. Appl. 20(4), 915–952 (1999)
3. Li, X.S.: An overview of SuperLU: algorithms, implementation, and user interface. ACM Trans. Math. Softw. 31(3), 302–325 (2005)
4. Li, X.S.: Sparse Gaussian elimination on high performance computers. Ph.D. thesis, Computer Science Division, UC Berkeley, California, US (1996)
5. Demmel, J.W., Eisenstat, S.C., Gilbert, J.R., Li, X.S., Liu, J.W.H.: A supernodal approach to sparse partial pivoting. SIAM J. Matrix Anal. Appl. 20(3), 720–755 (1999)
Chapter 6
Test Results

In this chapter, we will present the experimental results of NICSLU and the compar-
isons with PARDISO and KLU. The excellent performance of NICSLU is demon-
strated by two tests: benchmark test and simulation test. We will first describe the
experimental setup, and then present the detailed results of the two tests.

6.1 Experimental Setup

Both the benchmark test and the simulation test are carried out on a Linux server equipped with two Intel Xeon E5-2690 CPUs running at 2.9 GHz and 64 GB of memory. All codes are
compiled by the Intel C++ compiler (version 14.0.2) with O3 optimization. PARDISO
is from Intel Math Kernel Library (MKL) 11.1.2. Both NICSLU and PARDISO use
BLAS provided by Intel MKL.
In the benchmark test, we compare NICSLU with PARDISO and KLU for 40
benchmarks obtained from the University of Florida sparse matrix collection [1].
Table 6.1 shows the basic information (dimension, number of nonzeros, and the
average number of nonzeros in each row) of the benchmarks. All these bench-
marks are unsymmetric circuit matrices obtained from SPICE-based DC, transient,
or frequency-domain simulations. We exclude symmetric circuit matrices because Cholesky factorization [2] is about 2× more efficient than LU factorization for symmetric matrices. The dimension of these benchmarks covers a very wide range, from about two thousand to more than five million. The average number of nonzeros in each row clearly shows that circuit matrices are extremely sparse: for most of these benchmarks, there are on average fewer than 10 nonzero elements in each row. Even for the few slightly denser circuit matrices, the average number of nonzeros in each row is only a little larger than 10.
In the simulation test, we use an in-house SPICE-like circuit simulator to compare
NICSLU and KLU by running three self-generated circuits and six circuits modified

Table 6.1 Benchmarks used in our test


Benchmark    N    NNZ(A)    NNZ(A)/N
add20 2395 17319 7.23
add32 4960 23884 4.82
asic_100k 99340 954163 9.61
asic_320k 321821 2635364 8.19
asic_680k 682862 3871773 5.67
bcircuit 68902 375558 5.45
circuit_1 2624 35823 13.65
circuit_2 4510 21199 4.70
circuit_3 12127 48137 3.97
circuit_4 80209 307604 3.84
circuit5m_dc 3523317 19194193 5.45
circuit5m 5558326 59524291 10.71
ckt11752_tr_0 49702 333029 6.70
dc1 116835 766396 6.56
freescale1 3428755 18920347 5.52
hcircuit 105676 513072 4.86
memchip 2707524 14810202 5.47
memplus 17758 126150 7.10
onetone1 36057 341088 9.46
onetone2 36057 227628 6.31
raj1 263743 1302464 4.94
rajat03 7602 32653 4.30
rajat15 37261 443573 11.90
rajat18 94294 485143 5.15
rajat20 86916 605045 6.96
rajat21 411676 1893370 4.60
rajat22 39899 197264 4.94
rajat23 110355 556938 5.05
rajat24 358172 1948235 5.44
rajat25 87190 607235 6.96
rajat26 51032 249302 4.89
rajat27 20640 99777 4.83
rajat28 87190 607235 6.96
rajat29 643994 4866270 7.56
rajat30 643994 6175377 9.59
rajat31 4690002 20316253 4.33
scircuit 170998 958936 5.61
trans4 116835 766396 6.56
transient 178866 961790 5.38
twotone 120750 1224224 10.14

from IBM power grid benchmarks [3]. Our self-generated benchmarks are post-
layout-like, i.e., there are large power and ground networks with a few transistors.
Since IBM power grid benchmarks are pure linear circuits, a few inverter chains are
inserted between the power network and the ground network to make the benchmarks
nonlinear. In order to reduce the impact of device model evaluation as much as
possible such that the total simulation time is dominated by the solver time, only a
few transistors are inserted in each benchmark.

6.2 Performance Metric

In order to measure and compare the performance of different solvers, quantitative performance metrics are required. In this book, we adopt speedups and performance profiles to compare the solvers; both metrics are introduced below.

6.2.1 Speedups

Speedup is the most intuitive factor that can be used to compare the runtime of dif-
ferent solvers. In the following results, two types of speedups are defined to compare
the performance:
speedup = runtime of the other solver / runtime of NICSLU, (6.1)
relative speedup = runtime of NICSLU (sequential) / runtime of NICSLU (parallel). (6.2)

In short, when we refer to speedup, it means that we are comparing NICSLU with another solver. Relative speedup is only related to NICSLU itself, so it also reflects the scalability of NICSLU. The runtime in speedup and relative speedup can refer to the computational time of any step or combination of steps of interest in a sparse direct solver, e.g., the total computational time of numerical factorization and right-hand solving.

6.2.2 Performance Profile

In the following results, some figures are plotted using the concept of the performance profile [4], which is defined as follows, taking the computational time as an example. Assume that we have a solver set S and a problem set P. t_{p,s} is defined as the runtime to solve problem p ∈ P by solver s ∈ S. If solver s cannot solve
problem p, then t_{p,s} = +∞. We want to compare the performance of solver s on problem p with the best possible performance on this problem. Toward this goal, a baseline is required. For a given problem, the baseline is the performance of the best solver on this problem. We first define the performance ratio as follows:

r_{p,s} = t_{p,s} / min{t_{p,s} : s ∈ S}. (6.3)

The performance ratio measures the ratio of the runtime of solver s on problem p to the runtime of the best solver on the same problem. If solver s can solve problem p within τ (τ ≥ 1) times the runtime of the best solver on the same problem, i.e., the runtime of solver s is less than τ · min{t_{p,s} : s ∈ S}, then problem p is said to be τ-solvable by solver s. The performance profile of solver s is defined as the ratio of the number of τ-solvable problems to the total number of problems, i.e.,

P_s(τ) = |{p ∈ P : r_{p,s} ≤ τ}| / |P|,  τ ≥ 1 (6.4)

where |·| denotes the size of a set. P_s(τ) measures the probability for solver s that r_{p,s} is within a factor τ of the best possible ratio. For a given τ, a higher performance profile value means that solver s has higher performance. If τ = 1, the performance profile measures for what fraction of the problems solver s achieves the best performance.
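Given a table of runtimes, the performance profile of Eqs. (6.3) and (6.4) can be computed directly, as in the following sketch. The runtimes in the example are made up purely to show the data layout (with float('inf') marking a failed solve), and τ is written as tau, following the threshold symbol used in the definition above.

def performance_profile(times, solvers, problems, taus):
    """Compute P_s(tau) of Eq. (6.4) for every solver and every tau.

    times[(p, s)] : runtime of solver s on problem p (float('inf') if it failed)
    Returns a dict mapping each solver to the list [P_s(tau) for tau in taus].
    """
    best = {p: min(times[(p, s)] for s in solvers) for p in problems}
    # performance ratio of Eq. (6.3)
    ratio = {(p, s): times[(p, s)] / best[p] for p in problems for s in solvers}
    return {s: [sum(ratio[(p, s)] <= tau for p in problems) / len(problems)
                for tau in taus]
            for s in solvers}

# Made-up runtimes for two solvers on three problems, purely for illustration
times = {("p1", "A"): 1.0, ("p1", "B"): 2.0,
         ("p2", "A"): 3.0, ("p2", "B"): 1.5,
         ("p3", "A"): 0.5, ("p3", "B"): float("inf")}
print(performance_profile(times, ["A", "B"], ["p1", "p2", "p3"], [1, 2, 4]))
# {'A': [0.67, 1.0, 1.0], 'B': [0.33, 0.67, 0.67]} (values rounded)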

6.3 Results of Benchmark Test

In this section, we will present the detailed performance results of the benchmark
test, and also analyze the relation between the performance and the matrix sparsity.
We will first investigate how to select the optimal algorithm from the map algorithm,
column algorithm, and the supernodal algorithm by analyzing the matrix sparsity. We
will then present the relative speedups of NICSLU. We will also comprehensively
compare NICSLU with KLU and PARDISO in terms of factorization time, resid-
ual, and the number of fill-ins to show the superior performance of NICSLU.

6.3.1 Comparison of Different Algorithms

In this subsection, we will analyze and compare the performance of the map algo-
rithm, the column algorithm, and the supernodal algorithm. As explained in Chap. 5,
the performance of sparse LU factorization strongly depends on the matrix spar-
sity, which is evaluated by the SPR defined in Eq. (3.2). Figure 6.1 plots the SPR
values of all the 40 benchmarks in increasing order. As can be seen, the SPR
of circuit matrices covers a wide range from zero to more than 1000. For the 40

Fig. 6.1 Sparsity ratio (SPR) of the 40 benchmarks, sorted in increasing order

benchmarks, their SPR values are almost uniformly distributed on a logarithmic
scale. By analyzing the performance of these benchmarks, we are able to compre-
hensively investigate the performance of NICSLU.
To investigate how to select the optimal algorithm according to the value of SPR,
the map algorithm and the supernodal algorithm are compared with the pure column
algorithm, i.e., the G-P algorithm. Figure 6.2 shows the comparison, which is for the
re-factorization time. It clearly shows that the performance of the three algorithms
strongly depends on the matrix sparsity. The map algorithm is generally faster than the column algorithm for extremely sparse matrices, i.e., the matrices on the leftmost side. By comparing the map algorithm with the column algorithm in the sequential and parallel cases, we can conclude that for matrices with SPR < 20, we should select the map algorithm. The parallel map algorithm achieves higher speedups over the corresponding column algorithm than the sequential map algorithm does. This is because the parallel column algorithm has a higher cache miss rate than the sequential column algorithm, as multiple uncompressed arrays x share the same cache in the parallel column algorithm. For the parallel map algorithm, the threshold can be up to 40. However, for a simple implementation, we use the same SPR threshold to select the sequential or parallel map algorithm in NICSLU. For denser matrices, the map algorithm not only runs slower than the column algorithm, but also consumes more memory to store the map. As shown in Fig. 6.2, the map algorithm fails on three large matrices due to insufficient memory. The supernodal algorithm is faster than the column algorithm for nearly half of the matrices on the rightmost side, i.e., the slightly denser matrices. By comparing the supernodal algorithm with the column algorithm, we can conclude that for matrices with SPR > 80, we should select the supernodal algorithm rather than the column algorithm. By applying such

Fig. 6.2 Comparison of different algorithms: speedups of the map and supernodal algorithms over the column algorithm for re-factorization (Map vs. Column and Supernodal vs. Column, each with T = 1 and T = 8) across the benchmarks

a sparsity-based algorithm selection strategy in NICSLU, we achieve about a 1.5× average speedup compared with the pure column algorithm over all the 40 benchmarks.
The comparison of the three algorithms also provides clear evidence for an important observation: the G-P algorithm adopted by KLU is not always the best choice for circuit matrices. On the contrary, the optimal algorithm depends on
the sparsity. By integrating different algorithms and a smart sparsity-based selection
strategy together, NICSLU is able to achieve higher performance than the pure G-P
algorithm.
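The selection rule derived from this comparison can be stated as a few lines of code. The thresholds 20 and 80 are the ones quoted above; the function name and the returned labels are illustrative only and do not correspond to NICSLU's internal interface.

def select_algorithm(spr):
    """Pick a numerical kernel from the sparsity ratio (SPR) of the matrix."""
    if spr < 20:
        return "map"          # extremely sparse matrices: map algorithm
    if spr > 80:
        return "supernodal"   # slightly denser matrices: supernodal algorithm
    return "column"           # otherwise: column (G-P) algorithm

print(select_algorithm(5), select_algorithm(50), select_algorithm(500))
# map column supernodal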

6.3.2 Relative Speedups

In this subsection, we will analyze the relative speedups of full factorization and re-factorization of NICSLU to evaluate the scalability. Here we focus on the rela-
tive performance of the parallel algorithms of NICSLU, so we only consider the
factorization time or re-factorization time, while the right-hand-solving time is not
considered.

Fig. 6.3 Relative speedup of full factorization (T = 8 and T = 16) across the benchmarks

6.3.2.1 Full Factorization

Figure 6.3 shows the relative speedups of full factorization for all the 40 benchmarks. As mentioned in Sect. 3.2.3, NICSLU selects sequential full factorization if the SPR is smaller than 50. Therefore, for the first 22 matrices on the leftmost side, NICSLU does not run parallel full factorization, so the relative speedup is always 1. For the other 18 matrices, NICSLU runs parallel full factorization. However, the relative speedups are not high. The average relative speedups of the 18 matrices when using 8 threads and 16 threads are 2.0× and 2.22×, respectively. The reason for the low scalability is that we use the ET to schedule tasks in parallel full factorization, and the ET severely overestimates the column-level dependence. For a few matrices, the performance when using 16 threads is even lower than that when using 8 threads. This abnormal phenomenon is caused by the hardware platform. We use two Intel CPUs to run all the experiments. Each CPU has 8 cores, so if we run the solver using 8 threads, all the communications are within one CPU. However, if we run the solver using more than 8 threads, inter-CPU communications are invoked, and the overhead of inter-CPU communication is much larger than that of intra-CPU communication. This hardware characteristic also limits the scalability of the solver when too many threads are used. Fortunately, in SPICE-like circuit simulation, full factorization is invoked only a few times, so the low scalability of parallel full factorization does not significantly affect the overall performance of circuit simulators.

Fig. 6.4 Relative speedup of re-factorization (T = 8 and T = 16) across the benchmarks

6.3.2.2 Re-factorization

Figure 6.4 shows the relative speedups of re-factorization of NICSLU for all the 40
benchmarks. Since in SPICE-like circuit simulation, most factorizations during the
Newton-Raphson iterations are re-factorizations, the scalability of re-factorization
will have a significant impact on the overall performance of circuit simulation. For-
tunately, compared with full factorization, the scalability of re-factorization is much
better. The reason is that the EG used for task scheduling in parallel re-factorization
stores the exact column-level dependence. Compared with the ET, the EG is wider
and shorter, indicating that the EG implies more parallelism. For almost all of these
benchmarks, parallel re-factorization can be faster than sequential re-factorization.
The average relative speedups of re-factorization when using 8 threads and 16 threads
are 3.76× and 4.29×, respectively.
Figure 6.4 also shows that the relative speedups of re-factorization tend to be
higher for denser matrices. To investigate the relation between the relative speedup
and the matrix sparsity, Fig. 6.5 shows a scatter plot of the SPR versus the relative speedup of re-factorization. It clearly shows that the relative speedup has an approximately linear relation with the logarithm of the SPR. This observation indicates that the relative speedup, i.e., the scalability, is better for denser matrices. However, circuit matrices are highly sparse, so the scalability of circuit matrix-oriented sparse solvers cannot be as high as that of solvers for general sparse matrices from other applications. The reason is simply that the communication overhead is relatively large for highly sparse matrices, since the computational cost is small. From this observation, we can also obtain an early estimate of the relative speedup of re-factorization once the SPR is known in the pre-analysis step.

Fig. 6.5 Relation between the SPR and the relative speedup (T = 8)

6.3.3 Speedups

In this subsection, we will compare NICSLU with KLU and PARDISO in terms of runtime. Since our purpose here is to evaluate the three solvers in circuit simulation applications, we will compare the total runtime of factorization/re-factorization and forward/backward substitutions, as these two steps are both repeated in the Newton-Raphson iterations.

6.3.3.1 Full Factorization

Table 6.2 compares the total runtime of full factorization and forward/backward sub-
stitutions. Please note that for PARDISO, the runtime also includes the iterative
refinement step which is a necessary step for PARDISO. When comparing NICSLU
with KLU and PARDISO, due to the different pre-analysis algorithms adopted, the
number of fill-ins may differ dramatically, so the runtime also shows great differ-
ences. Therefore, the geometric mean is fairer than the arithmetic mean when com-
paring the runtime. Recall that KLU is a sequential solver. Compared with KLU,
NICSLU achieves 3.46×, 4.56×, and 4.56× speedups on average when NICSLU uses 1 thread, 8 threads, and 16 threads, respectively. NICSLU is on average faster
than PARDISO when using 1 thread, and slower than PARDISO when using mul-
tiple threads. This is mainly due to the low scalability of NICSLU. As NICSLU
uses the ET which contains all the potential column-level dependence to schedule
tasks in full factorization and PARDISO uses a fixed dependence graph to schedule
tasks, PARDISO naturally has better scalability than full factorization of NICSLU.
However, such a direct comparison is unfair, because by adopting partial pivoting,
NICSLU has much better numerical stability than PARDISO, which can only select
pivots from diagonal blocks.

Table 6.2 Speedup for full factorization and forward/backward substitutions


Benchmark NICSLU versus KLU NICSLU versus PARDISO
T =1 T =8 T = 16 T =1 T =8 T = 16
add32 2.18 2.32 2.33 7.26 3.37 2.44
rajat21 1.98 1.98 1.98 8.32 1.35 1.29
circuit_3 2.20 2.21 2.22 6.55 2.61 1.65
rajat22 2.30 2.32 2.32 6.34 1.70 1.46
hcircuit 2.31 2.35 2.36 6.84 1.79 1.47
rajat26 2.36 2.37 2.39 6.98 1.40 1.58
rajat23 2.35 2.40 2.42 6.29 1.32 1.32
rajat18 2.03 2.08 2.10 9.35 1.87 2.18
rajat27 2.25 2.25 2.27 5.86 1.95 1.35
add20 2.17 2.28 2.29 4.86 2.11 2.04
memplus 2.21 2.20 2.24 5.49 1.54 1.53
bcircuit 2.31 2.34 2.33 4.05 0.75 0.80
circuit_4 1.86 1.87 1.87 5.69 1.19 1.35
circuit_2 2.12 2.16 2.19 5.55 2.01 2.54
circuit5m 2.04 2.04 2.04 112.41 20.74 15.97
circuit_1 1.80 1.82 1.80 2.54 1.47 1.53
scircuit 2.17 2.17 2.17 3.45 0.58 0.55
trans4 1.76 1.78 1.79 3.54 0.82 0.80
dc1 1.73 1.72 1.74 3.46 0.69 0.79
ckt11752_tr_0 6.14 6.19 6.17 2.10 0.45 0.38
rajat03 1.56 1.57 1.58 1.46 0.43 0.47
rajat29 1.51 1.51 1.50 4.13 0.71 0.63
rajat15 2.24 4.06 3.08 1.23 0.70 0.26
raj1 184.16 285.53 160.61 1.66 0.39 0.20
transient 2.08 2.30 1.65 2.30 0.43 0.34
asic_680k 1.99 2.17 1.69 28.66 5.46 2.66
rajat24 65.51 82.35 62.63 2.10 0.38 0.30
onetone2 4.48 8.93 8.70 1.36 0.56 0.63
freescale1 2.27 3.22 3.68 1.34 0.29 0.27
asic_320k 2.04 3.17 2.56 1.98 0.54 0.44
rajat30 10.21 18.08 18.16 1.35 0.42 0.40
rajat28 8.35 15.58 14.77 0.85 0.45 0.29
rajat25 8.80 16.64 16.75 0.97 0.40 0.45
asic_100k 2.98 5.52 5.58 0.89 0.28 0.24
rajat20 7.45 13.33 13.80 1.01 0.40 0.44
circuit5m_dc 2.48 4.25 5.29 0.71 0.17 0.20
onetone1 29.45 66.81 99.25 1.07 0.65 0.78
twotone 13.34 61.00 98.67 0.93 0.68 0.88
rajat31 2.47 6.12 6.81 0.13 0.04 0.03
memchip 2.02 5.89 8.94 0.04 0.02 0.02
Arithmetic mean 10.04 16.37 14.57 6.78 1.58 1.32
Geometric mean 3.46 4.56 4.56 2.65 0.76 0.68

6.3.3.2 Re-factorization

Table 6.3 compares the total runtime of re-factorization and forward/backward substi-
tutions. Re-factorization has better scalability than full factorization, so the speedups
of NICSLU compared with KLU and PARDISO are also higher. Compared with
KLU, NICSLU is faster for almost all of the benchmarks. The average speedups are 2.58×, 7.51×, and 7.94× when NICSLU uses 1 thread, 8 threads, and 16 threads, respectively. Compared with PARDISO, NICSLU is faster for most of the benchmarks and slower for only a few very dense matrices, as for such dense matrices, the supernode-supernode algorithm adopted by PARDISO is more suitable. The average speedups compared with PARDISO are 3.15×, 2.01×, and 1.90× when NICSLU and PARDISO both use 1 thread, 8 threads, and 16 threads, respectively.
Figure 6.6 shows the performance profile for the total runtime of re-factorization
and forward/backward substitutions, which approximately evaluates the overall
solver performance in SPICE-like circuit simulators. It clearly shows that multi-
threaded NICSLU has the highest performance, and multi-threaded PARDISO is the
second best. The performance of sequential NICSLU is just a little lower than that of
multi-threaded PARDISO. Sequential PARDISO and KLU generally have the lowest
performance.

6.3.4 Other Comparisons

6.3.4.1 Floating-Point Performance

Figure 6.7 compares the three solvers in terms of giga floating-point operations per second (GFLOP/s). GFLOP/s measures the floating-point computational performance achieved by the three solvers. From the trend point of view, the GFLOP/s of the three solvers increases as the matrix becomes denser. 16-threaded NICSLU generally has the highest GFLOP/s and sequential PARDISO has the lowest GFLOP/s. Such a trend is consistent with the runtime performance of the three solvers. For a few benchmarks, PARDISO shows very high GFLOP/s. This is sometimes due to the large number of fill-ins caused by the pre-analysis step of PARDISO. For example, for benchmark circuit5M, 16-threaded PARDISO runs at a high computational rate of 125 GFLOP/s; however, PARDISO is actually slow on this benchmark, as shown in Tables 6.2 and 6.3. For rajat21, rajat18, rajat29, and asic_680k, we can see a similar situation. For the last three benchmarks (onetone1, rajat31, and memchip), the high GFLOP/s of PARDISO does reflect its real computational performance. Consequently, GFLOP/s is a one-sided metric that cannot by itself estimate the real performance of sparse solvers. When comparing GFLOP/s, we should also compare the runtime or speedup to avoid the one-sidedness of GFLOP/s.
On the other hand, the GFLOP/s values shown in Fig. 6.7 indicate that the floating-point performance achieved by the three solvers is far from the peak performance of the CPUs used in our experiments.

Table 6.3 Speedup for re-factorization and forward/backward substitutions


Benchmark NICSLU versus KLU NICSLU versus PARDISO
T =1 T =8 T = 16 T =1 T =8 T = 16
add32 1.54 1.55 1.56 11.31 4.96 3.60
rajat21 1.46 2.07 2.04 14.60 3.36 3.16
circuit_3 1.23 2.11 1.47 9.66 6.56 2.86
rajat22 1.45 3.05 2.70 9.82 5.49 4.17
hcircuit 1.59 3.34 3.29 10.39 5.61 4.52
rajat26 1.43 3.37 2.96 10.46 4.95 4.85
rajat23 1.49 3.32 3.32 9.79 4.49 4.47
rajat18 1.26 2.88 2.74 15.02 6.68 7.33
rajat27 1.32 3.16 2.24 8.16 6.48 3.16
add20 1.07 2.09 1.30 5.05 4.09 2.45
memplus 1.31 4.00 3.59 6.59 5.70 5.00
bcircuit 1.23 3.46 3.67 4.99 2.57 2.92
circuit_4 1.26 3.44 3.29 7.46 4.22 4.60
circuit_2 1.10 2.23 1.62 5.55 4.00 3.62
circuit5m 1.31 2.86 3.06 129.39 52.32 43.05
circuit_1 0.89 1.91 1.37 2.38 2.93 2.20
scircuit 1.31 3.28 3.83 3.87 1.64 1.82
trans4 1.14 3.48 3.14 3.89 2.73 2.38
dc1 1.14 3.47 3.47 3.91 2.39 2.69
ckt11752_tr_0 5.26 20.59 17.35 2.28 1.91 1.36
rajat03 0.92 4.18 4.26 1.46 1.95 2.18
rajat29 0.93 2.57 2.78 4.95 2.35 2.26
rajat15 1.78 6.93 7.54 1.34 1.64 0.87
raj1 187.82 592.57 792.80 1.92 0.91 1.11
transient 1.83 5.29 5.66 2.69 1.32 1.55
asic_680k 1.81 5.47 6.69 32.99 17.39 13.27
rajat24 64.40 189.68 233.81 2.45 1.04 1.33
onetone2 4.18 18.32 20.54 1.52 1.36 1.78
freescale1 2.10 5.51 6.83 1.51 0.60 0.61
asic_320k 1.85 6.57 8.91 2.14 1.32 1.81
rajat30 10.21 39.15 53.81 1.53 1.02 1.35
rajat28 7.83 34.75 44.07 0.91 1.13 1.00
rajat25 8.72 38.14 43.13 1.09 1.04 1.31
asic_100k 3.07 13.20 16.32 1.01 0.74 0.78
rajat20 7.13 31.78 41.13 1.09 1.07 1.46
circuit5m_dc 2.32 5.60 6.59 0.75 0.26 0.27
onetone1 26.88 142.42 216.32 1.11 1.56 1.93
twotone 10.31 61.61 111.51 0.96 0.92 1.33
rajat31 2.41 9.97 12.85 0.13 0.08 0.07
memchip 2.12 9.04 12.75 0.04 0.03 0.02
Arithmetic mean 9.46 32.46 42.91 8.40 4.27 3.66
Geometric mean 2.58 7.51 7.94 3.15 2.01 1.90

Fig. 6.6 Performance profile for the total runtime of re-factorization and forward/backward substitutions (NICSLU with T = 1, 8, 16; KLU; PARDISO with T = 1, 8, 16)

Fig. 6.7 Comparison on GFLOP/s (NICSLU with T = 1 and T = 16, KLU, PARDISO with T = 1 and T = 16) across the benchmarks



Theoretically, the peak floating-point performance of 16 threads on our CPUs should be 2.9 × 2 × 2 × 16 = 185.6 GFLOP/s (the two factors of 2 are for hyper-threading [5] and Streaming SIMD Extensions 2 (SSE2) instructions [6], respectively). However, for about half of the benchmarks, the achieved performance is even less than 10 GFLOP/s. For only 2 benchmarks can the performance achieved by 16-threaded PARDISO exceed 100 GFLOP/s. Such an observation indicates that the computational capacity of CPUs cannot be fully utilized by sparse solvers for circuit matrices. This is mainly due to the high sparsity of circuit matrices, which makes sparse solvers for circuit matrices highly memory-intensive applications.

6.3.4.2 Fill-ins and Residual

In addition to the runtime, speedup, and GFLOP/s which directly or indirectly reflect
the performance of sparse solvers, we will also compare some other factors among
the three solvers to present a comprehensive analysis. Figures 6.8 and 6.9 compare NICSLU with KLU and PARDISO in terms of the residual and the number of
fill-ins, by plotting the corresponding performance profiles.

Fig. 6.8 Performance profile for the residual (NICSLU without refinement, NICSLU with refinement, KLU, PARDISO)

Fig. 6.9 Performance profile for the number of fill-ins (NICSLU, KLU, PARDISO)

Figure 6.8 compares the three solvers in terms of the residual. The residual is defined as the root-mean-square error (RMSE) of the residual vector Ax − b, i.e.,

r = Ax − b,
residual = √( (1/N) Σ_{i=1}^{N} r_i² ). (6.5)

As mentioned in Sects. 3.1 and 3.5, NICSLU has a feature that it can automatically
control the iterative refinement step, which can potentially improve the accuracy of
the solution. In Fig. 6.8, we evaluate the residual of NICSLU both when the iterative refinement step is disabled and when it is enabled. Please note that when itera-
tive refinement is enabled, NICSLU may not perform any iterations as the solution is
already accurate enough or cannot be refined, as shown in Algorithm 11. The compar-
ison illustrated in Fig. 6.8 clearly shows that NICSLU with refinement generally has
the highest solution accuracy. Even when iterative refinement is disabled, NICSLU
still generates more accurate solutions than KLU and PARDISO. Compared with
KLU, NICSLU has an additional step of static pivoting, i.e., the MC64 algorithm,
which is introduced in Sect. 3.2.1.2, so the accuracy of the solution can be improved.
Compared with PARDISO, NICSLU adopts the partial pivoting strategy, which has a
larger pivoting selection space and generates more accurate solutions than the block
supernode diagonal pivoting method adopted by PARDISO. Actually, we have found
that for a few matrices, due to the incomplete pivot selection space, PARDISO fails
to get an accurate solution, which means that the residual is unreasonably large. For
NICSLU, by integrating the MC64 algorithm, partial pivoting, and/or the iterative
refinement algorithm together, we can always obtain accurate solutions even when the matrix is nearly ill-conditioned.
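The residual of Eq. (6.5) is straightforward to compute from any solver's output. The sketch below uses SciPy's own sparse LU merely to produce a solution for a tiny example system; it is independent of NICSLU, KLU, and PARDISO.

import numpy as np
from scipy.sparse import csc_matrix
from scipy.sparse.linalg import splu

def rmse_residual(A, x, b):
    """Root-mean-square error of the residual r = Ax - b, as in Eq. (6.5)."""
    r = A @ x - b
    return np.sqrt(np.mean(r ** 2))

# Tiny example system, solved with SciPy's sparse LU just to obtain x
A = csc_matrix([[4.0, 1.0, 0.0],
                [1.0, 3.0, 1.0],
                [0.0, 1.0, 2.0]])
b = np.array([1.0, 2.0, 3.0])
x = splu(A).solve(b)
print(rmse_residual(A, x, b))   # a tiny value, on the order of machine precision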
Figure 6.9 compares the three solvers in terms of the number of fill-ins, i.e., the number of nonzero elements of L + U − I. Generally, NICSLU generates the
fewest fill-ins, and KLU and PARDISO have a similar performance on the number
of fill-ins. The difference in the fill-ins is mainly caused by the different algorithms
adopted in the pre-analysis step. KLU permutes the matrix into a block triangular form
(BTF) [7, 8] in the pre-analysis step. It is claimed that nearly all circuit matrices are
permutable to a BTF [9]; however, whether such a form can improve the performance
is unclear and needs further investigations. Our results from the benchmark test tend
to indicate that the effect of BTF on reducing fill-ins is somewhat small. On the
contrary, the MC64 algorithm adopted by NICSLU is helpful for improving the
numerical stability and reducing fill-ins. Although PARDISO also adopts the MC64
algorithm in the pre-analysis step, it uses a different ordering algorithm based on the
nested dissection method [10, 11], which can generate better orderings only for very
large matrices. Combined with the MC64 algorithm, the AMD [12, 13] algorithm adopted by NICSLU is generally more efficient for most practical problems.

6.4 Results of Simulation Test

In this section, we will present the detailed results of the simulation test. We have
created an in-house SPICE-like circuit simulator with the BSIM3.3 and BSIM4.7
MOSFET models [14] integrated. The simulator integrates NICSLU and KLU, so we
can easily compare the performance of NICSLU and KLU by running the simulator.
Six IBM power grid benchmarks for transient simulation [3] are adopted. Since they
are purely linear circuits, only forward/backward substitutions are required during transient simulation, which makes it difficult to evaluate the performance of numerical LU factorization; therefore, we artificially insert a few transistors into each benchmark to make them nonlinear. We also create three power grid-like benchmarks
with large power and ground networks. The power and ground networks in the self-
generated benchmarks are completely regular meshes. A few inverter chains which
act as the functional circuit are inserted between the power network and the ground
network, making the circuit nonlinear as well. Figure 6.10 illustrates the power and
ground networks.
Table 6.4 compares the total transient simulation time between NICSLU and KLU.
NICSLU is faster than KLU in transient simulation for all of the nine adopted bench-
marks, regardless of the number of threads invoked by NICSLU. NICSLU achieves 3.62×, 6.42×, and 9.03× speedups on average compared with KLU in transient simulation, when NICSLU uses 1 thread, 4 threads, and 8 threads, respectively. The high performance of NICSLU comes from two factors: fewer fill-ins/FLOPs and more advanced algorithms. To explain this, Table 6.5 compares the numbers of fill-ins and FLOPs. For some benchmarks (ibmpg1t mod., ibmpg2t mod., ibmpg3t mod., ibmpg5t mod., and ibmpg6t mod.), NICSLU generates far fewer fill-ins and FLOPs than KLU, and, thus, NICSLU runs much faster than KLU in transient simulation. However, for the other benchmarks, NICSLU generates more fill-ins and FLOPs than KLU, but NICSLU still runs faster than KLU. For example, for ibmpg4t mod., NICSLU generates 6% more FLOPs than KLU, but NICSLU is 2.36× faster than KLU even when NICSLU runs sequentially. This speedup is clearly due to the more advanced algorithms adopted by NICSLU.

Fig. 6.10 Illustration of the power and ground networks in our self-generated benchmarks (regular Vdd and ground meshes connected by inverter chains driven by Vin)

Table 6.4 Comparison on the transient simulation time (in seconds)


Benchmark  KLU time  NICSLU time (T = 1)  Speedup  NICSLU time (T = 4)  Speedup  NICSLU time (T = 8)  Speedup
ibmpg1t 8.531 6.633 1.29 5.853 1.46 5.407 1.58
mod.
ibmpg2t 1057 91.76 11.52 49.78 21.23 34.29 30.83
mod.
ibmpg3t 12550 3275 3.83 1959 6.41 1271 9.87
mod.
ibmpg4t 12460 5270 2.36 2829 4.40 1867 6.67
mod.
ibmpg5t 23100 2365 9.77 1435 16.10 994.7 23.22
mod.
ibmpg6t 24610 2072 11.88 1510 16.30 1103 22.31
mod.
ckt1 1276 855.8 1.49 457.4 2.79 330.8 3.86
ckt2 21140 10240 2.06 4102 5.15 2970 7.12
ckt3 90780 40840 2.22 16040 5.66 10420 8.71
Arithmetic mean 5.16 8.83 12.69
Geometric mean 3.62 6.42 9.03

Table 6.5 Comparison on the numbers of fill-ins and FLOPs


Benchmark  KLU fill-ins  KLU FLOPs (×10^6)  NICSLU fill-ins  Ratio  NICSLU FLOPs (×10^6)  Ratio
ibmpg1t 9.33E+05 2.71E+01 7.58E+05 0.81 1.76E+01 0.65
mod.
ibmpg2t 1.93E+07 9.14E+03 8.85E+06 0.46 1.54E+03 0.17
mod.
ibmpg3t 1.54E+08 1.05E+05 1.19E+08 0.77 6.64E+04 0.64
mod.
ibmpg4t 1.61E+08 1.12E+05 1.61E+08 1.00 1.18E+05 1.06
mod.
ibmpg5t 1.79E+08 1.83E+05 1.35E+08 0.75 5.01E+04 0.27
mod.
ibmpg6t 2.02E+08 1.00E+05 1.35E+08 0.67 3.46E+04 0.35
mod.
ckt1 1.39E+06 1.41E+02 1.43E+06 1.03 1.74E+02 1.24
ckt2 7.34E+06 1.39E+03 7.88E+06 1.07 1.97E+03 1.42
ckt3 2.02E+07 6.18E+03 2.01E+07 1.00 7.03E+03 1.14

The comparison on the number of fill-ins and FLOPs shown in Table 6.5 indicates that the BTF algorithm adopted by KLU seems to be more suitable for regular meshed circuits, as KLU generates fewer fill-ins and FLOPs than NICSLU for the three self-generated regular meshed circuits. However, whether this conclusion holds in general requires further investigation, which is beyond the scope of this book.
In summary, NICSLU has been shown to deliver high performance for time-consuming post-layout simulation problems.

References

1. Davis, T.A., Hu, Y.: The University of Florida sparse matrix collection. ACM Trans. Math. Softw. 38(1), 1:1–1:25 (2011)
2. Davis, T.A.: Direct Methods for Sparse Linear Systems, 1st edn. Society for Industrial and Applied Mathematics, US (2006)
3. Li, Z., Li, P., Nassif, S.R.: IBM Power Grid Benchmarks. http://dropzone.tamu.edu/~pli/PGBench/
4. Dolan, E.D., Moré, J.J.: Benchmarking optimization software with performance profiles. Math. Program. 91(2), 201–213 (2002)
5. Wikipedia: Hyper-threading. https://en.wikipedia.org/wiki/Hyper-threading
6. Wikipedia: SSE2. https://en.wikipedia.org/wiki/SSE2
7. Duff, I.S., Reid, J.K.: Algorithm 529: permutations to block triangular form [F1]. ACM Trans. Math. Softw. 4(2), 189–192 (1978)
8. Duff, I.S.: On permutations to block triangular form. IMA J. Appl. Math. 19(3), 339–342 (1977)
9. Davis, T.A., Palamadai Natarajan, E.: Algorithm 907: KLU, a direct sparse solver for circuit simulation problems. ACM Trans. Math. Softw. 37(3), 36:1–36:17 (2010)
10. George, A.: Nested dissection of a regular finite element mesh. SIAM J. Numer. Anal. 10(2), 345–363 (1973)
11. Lipton, R.J., Rose, D.J., Tarjan, R.E.: Generalized nested dissection. SIAM J. Numer. Anal. 16(2), 346–358 (1979)
12. Amestoy, P.R., Davis, T.A., Duff, I.S.: An approximate minimum degree ordering algorithm. SIAM J. Matrix Anal. Appl. 17(4), 886–905 (1996)
13. Amestoy, P.R., Davis, T.A., Duff, I.S.: Algorithm 837: AMD, an approximate minimum degree ordering algorithm. ACM Trans. Math. Softw. 30(3), 381–388 (2004)
14. BSIM Group: Berkeley Short-Channel IGFET Model. http://bsim.berkeley.edu/
Chapter 7
Performance Model

In the previous chapter, we have shown the test results of NICSLU, where the relative
speedups vary over a wide range for different benchmarks. In order to understand the
performance difference and find possible limiting factors of the scalability, further
investigations are required. Toward this goal, in this chapter, we will build a perfor-
mance model to analyze the performance and find bottlenecks of the scalability of
NICSLU. The performance model is based on an as-soon-as-possible (ASAP) analy-
sis on the dependence graph (i.e., the EG) used for parallel numerical re-factorization.
Under a unified assumption about the computational and synchronization costs, the
performance model predicts the theoretical maximum relative speedup and the max-
imum relative speedup when using a given number of cores. With the performance model, one
can also analyze the parallel efficiency to further understand the bottlenecks in the
parallel algorithm.

7.1 DAG-Based Performance Model

In order to focus on the most important operations of sparse LU factorization and avoid the impact of less important factors, the proposed performance model analyzes re-factorization rather than full factorization. The performance model is based on an ASAP analysis on the dependence graph (i.e., the EG) used for scheduling parallel re-factorization. In the model, we consider only the FLOPs and the essential synchronization cost. We assume that each FLOP takes one unit of runtime and each synchronization takes Tsync units of runtime. This is a unified assumption which will be used throughout the model.
For a given column, all the FLOPs can be classified into two parts. One part is
related to the numerical update from dependent columns, corresponding to line 5 of
Algorithm 10. When using column j to update column k, the operation is denoted
as OPupd(j, k), which takes 2 · NNZ(L(j + 1 : N, j)) units of runtime. The other
part is related to the normalization of column k of L, corresponding to line 8 of
Algorithm 10, which is denoted as OPnorm(k) and takes NNZ(L(k + 1 : N, k))
units of runtime. Finishing OPnorm(k) is equivalent to finishing the factorization of
column k.

Fig. 7.1 Example to illustrate the task flow graph and the timing constraints (a task flow graph with nodes 1-6; the edges into node 6 correspond to OPupd(3,6), OPupd(4,6), and OPupd(5,6), which, together with OPnorm(6), are executed sequentially by one thread)
The above-mentioned operations can be easily mapped onto the dependence graph
used for scheduling parallel re-factorization. A directed edge (j, k) in the dependence
graph corresponds to OPupd(j, k), and a node labeled k corresponds to OPnorm(k).
According to this mapping, the dependence graph becomes a task flow graph that
describes all the FLOPs required to factorize the matrix. The task flow
graph also implies the timing constraints that must be satisfied during parallel re-
factorization. Figure 7.1 shows an example of the task flow graph. Take node 6 as
an example to illustrate the timing constraints:

• OPupd(3, 6) can only be started after OPnorm(3) is finished, and the same holds for
OPupd(4, 6) and OPupd(5, 6).
• OPnorm(6) can only be started after OPupd(3, 6), OPupd(4, 6), and OPupd(5, 6) are
all finished.
• According to the thread-level scheduling method, the four tasks OPupd(3, 6),
OPupd(4, 6), OPupd(5, 6), and OPnorm(6) are executed by one thread, so they are
executed sequentially.
These timing constraints imply that we can use an ASAP algorithm to calculate the
earliest finish time of all the tasks shown in the dependence graph. Before presenting
the ASAP algorithm, we first define some symbols which will be used in the ASAP
algorithm.
• FT(k): the earliest finish time of OPnorm(k), which is also the earliest finish time
of the factorization of column k.
• FT: the earliest finish time of the entire dependence graph.
• FTcore(p): the time when core p finishes its last task.

We have two algorithms to evaluate the performance of NICSLU. The first one
is shown in Algorithm 27. It assumes infinite cores and calculates the earliest finish
time of the entire graph. The algorithm calculates the theoretical minimum finish
time for a given matrix by accumulating the computational cost of FLOPs and the
synchronization cost, while the above-mentioned timing constraints are satisfied.
After the earliest finish time is calculated, the predicted relative speedup can be
calculated as follows:
predicted relative speedup = FLOPs / FT. (7.1)
As Algorithm 27 assumes that an infinite number of cores is used, the relative speedup
estimated by Algorithm 27 and Eq. (7.1) is the theoretical upper limit of the relative
speedup for a given matrix; namely, the actual relative speedup of any practical
execution cannot exceed this value, regardless of how many threads are running
in parallel. The theoretical maximum relative speedup cannot be used to predict
actual relative speedups as it assumes infinite cores; however, it gives us a good
estimate of the parallelism of a given matrix. In other
words, it estimates the maximum parallelism that can be achieved by parallel re-
factorization, regardless of the number of cores used. If the theoretical maximum
relative speedup is too low, it indicates that the given matrix is not suitable for par-
allel factorization.

Algorithm 27 Performance model algorithm (infinite cores).
Input: Symbolic pattern of U
Output: The earliest finish time FT
1: For k = 1, 2, . . . , N, set FT(k) = 0
2: for k = 1 : N do
3:   for j < k where Ujk is a nonzero element do
4:     FT(k) = max{FT(k), FT(j)}
5:     FT(k) += Tsync
6:     FT(k) += 2 · NNZ(L(j + 1 : N, j))
7:   end for
8:   FT(k) += NNZ(L(k + 1 : N, k))
9: end for
10: FT = max_k {FT(k)}
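To make the model concrete, the following Python sketch implements Algorithm 27 together with Eq. (7.1). It is only an illustration and not part of NICSLU; the input format (u_rows[k] holding the row indices j < k of the nonzeros in column k of U, and l_nnz_below[j] holding NNZ(L(j+1:N, j))) as well as all names are assumptions made for this sketch.

    # A minimal Python sketch of Algorithm 27 (infinite cores), under the assumed
    # column-wise input format described above. Not a NICSLU interface.
    def asap_infinite_cores(u_rows, l_nnz_below, t_sync=10):
        n = len(u_rows)
        ft = [0] * n          # FT(k): earliest finish time of column k
        flops = 0             # total FLOPs, used by Eq. (7.1)
        for k in range(n):
            for j in u_rows[k]:                 # numerical updates OPupd(j, k)
                ft[k] = max(ft[k], ft[j])       # column j must be finished first
                ft[k] += t_sync                 # one synchronization per update
                ft[k] += 2 * l_nnz_below[j]     # cost of OPupd(j, k)
                flops += 2 * l_nnz_below[j]
            ft[k] += l_nnz_below[k]             # normalization OPnorm(k)
            flops += l_nnz_below[k]
        return max(ft), flops                   # FT of the whole graph, total FLOPs

    # Eq. (7.1): theoretical maximum relative speedup = flops / ft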

Algorithm 28 Performance model algorithm (limited cores).
Input: Symbolic pattern of U and the number of cores P
Output: The earliest finish time FT, waiting cost Cwait, and synchronization cost Csync
1: For p = 1, 2, . . . , P, set FTcore(p) = 0
2: Cwait = 0
3: Csync = 0
4: for k = 1 : N do
5:   q = argmin_p {FTcore(p)}
6:   FT(k) = FTcore(q)
7:   for j < k where Ujk is a nonzero element do
8:     if FT(j) > FT(k) then
9:       Cwait += FT(j) - FT(k)
10:      FT(k) = FT(j)
11:    end if
12:    FT(k) += Tsync
13:    Csync += Tsync
14:    FT(k) += 2 · NNZ(L(j + 1 : N, j))
15:   end for
16:   FT(k) += NNZ(L(k + 1 : N, k))
17:   FTcore(q) = FT(k)
18: end for
19: FT = max_k {FT(k)}

We have another algorithm to calculate the earliest finish time under the condition
that limited cores are used, as shown in Algorithm 28. For each task (i.e., a column),
the core which finishes its last task earliest among all the available cores is selected to
execute the current task. Except for this point, the algorithm to calculate the earliest
finish time is the same as Algorithm 27. After Algorithm 28 is finished, we can
estimate the maximum relative speedup under limited cores using Eq. (7.1). The
estimated relative speedup can be used to predict actual relative speedups.
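For completeness, a corresponding sketch of Algorithm 28 is given below, under the same assumed input format as the previous sketch; it uses a min-heap over FTcore(p) to pick the earliest-available core and additionally accumulates the waiting and synchronization costs (the line numbers in the comments refer to Algorithm 28).

    import heapq

    # A minimal Python sketch of Algorithm 28 (limited cores); illustration only.
    def asap_limited_cores(u_rows, l_nnz_below, num_cores, t_sync=10):
        n = len(u_rows)
        ft = [0] * n
        c_wait = 0
        c_sync = 0
        cores = [0] * num_cores        # FTcore(p): finish time of the last task on core p
        heapq.heapify(cores)           # the earliest-available core is always at the top
        for k in range(n):
            ft[k] = heapq.heappop(cores)        # lines 5-6: pick core q with minimum FTcore
            for j in u_rows[k]:
                if ft[j] > ft[k]:               # lines 8-11: wait until column j is finished
                    c_wait += ft[j] - ft[k]
                    ft[k] = ft[j]
                ft[k] += t_sync                 # lines 12-13: synchronization cost
                c_sync += t_sync
                ft[k] += 2 * l_nnz_below[j]     # line 14: OPupd(j, k)
            ft[k] += l_nnz_below[k]             # line 16: OPnorm(k)
            heapq.heappush(cores, ft[k])        # line 17: core q is busy until FT(k)
        return max(ft), c_wait, c_sync          # line 19 plus the two cost counters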
Besides predicting the relative speedups, we are also interested in investigating the
bottlenecks of the parallel algorithm. There are two potential factors that may limit
the scalability. The first factor is the parallelism. If there is not enough parallelism,
the parallel efficiency will be low. The other factor is the synchronization cost. If
the synchronization cost takes a large portion of the total computational time, the
parallel efficiency will also be low. Parallelism is not easy to quantify directly, as sparse
LU factorization is a task-driven application. In this model, we therefore use the waiting
cost as an indirect measure of the parallelism. When we are trying to use a
column to update another column, the former column must be finished; otherwise
we need to wait until it is finished. It can be explained intuitively why the waiting
cost can be treated as an estimation of the parallelism. If the parallelism is high,
there tend to be many independent columns that can be factorized in parallel, and,
therefore, the dependence graph tends to be wide and the critical path tends to be
short. In other words, the data dependence in the pipeline mode tends to be weak. It is
easy to understand that weak dependence leads to low waiting cost. On the contrary,
if the parallelism is low, the dependence graph will be narrow and the critical path
tends to be long. In this case, the dependence is strong, leading to high waiting cost
as tasks are closely dependent. Please note that directly analyzing the dependence
graph used for scheduling parallel re-factorization does not yield a good estimate of
the parallelism, because we use the proposed pipeline mode scheduling strategy to
explore parallelism between dependent vertices in the DAG. In other words, an
inter-vertex-level analysis underestimates the parallelism. To analyze the impact of
the parallelism and the synchronization on the parallel efficiency, we also collect the
waiting cost and the synchronization cost in Algorithm 28, as shown in lines 9 and
13. Once Algorithm 28 is finished, we can calculate the percentages of the waiting
cost and the synchronization cost based on

waiting% = (Cwait / FLOPs) × 100%,
synchronization% = (Csync / FLOPs) × 100%. (7.2)
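As a usage illustration of the two sketches above, the following lines evaluate Eqs. (7.1) and (7.2) for a small assumed pattern; the dependence structure and the NNZ counts are invented for this example and only loosely follow Fig. 7.1.

    u_rows = [[], [0], [1], [1], [1], [2, 3, 4]]   # 0-based; column 6 depends on columns 3, 4, 5
    l_nnz_below = [1, 3, 2, 2, 2, 0]               # assumed NNZ(L(k+1:N, k)) per column
    ft_inf, flops = asap_infinite_cores(u_rows, l_nnz_below)
    ft_8, c_wait, c_sync = asap_limited_cores(u_rows, l_nnz_below, num_cores=8)
    print(flops / ft_inf)                          # theoretical maximum relative speedup, Eq. (7.1)
    print(flops / ft_8)                            # predicted relative speedup with 8 cores
    print(100.0 * c_wait / flops)                  # waiting%, Eq. (7.2)
    print(100.0 * c_sync / flops)                  # synchronization%, Eq. (7.2)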
Bottlenecks of parallel LU re-factorization can be investigated by comparing the
waiting cost and the synchronization cost obtained from Algorithm 28. One can
also judge whether the matrix is suitable for parallel factorization by analyzing the
waiting percentage and the synchronization percentage according to Eq. (7.2). If at
least one of the two percentages is high, e.g., above 50%, the parallel efficiency cannot
be high for the given matrix due to the high waiting or synchronization cost.
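This rule of thumb can be expressed as a tiny helper; the 50% threshold is merely the example value mentioned above, not a fixed parameter of NICSLU.

    def likely_poor_parallel_efficiency(waiting_pct, sync_pct, threshold=50.0):
        # High waiting or synchronization overhead relative to the FLOPs suggests
        # that the matrix is not well suited for parallel factorization.
        return waiting_pct >= threshold or sync_pct >= threshold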

7.2 Results and Analysis

In this section, we show and analyze the results of the proposed performance
model from three aspects: the theoretical maximum relative speedup, the predicted
relative speedup, and the bottlenecks of parallel LU re-factorization.
Tsync is set to 10 in these experiments.

7.2.1 Theoretical Maximum Relative Speedup

Figure 7.2 plots the theoretical maximum relative speedup of all the 40 benchmarks
calculated by Algorithm 27. Since the theoretical maximum relative speedup is the
theoretical upper limit of the relative speedup, Fig. 7.2 plots the maximum possible
relative speedup that we can achieve, regardless of how many cores are used to execute
parallel LU re-factorization.

Fig. 7.2 Predicted theoretical maximum relative speedup of re-factorization (the vertical axis is the theoretical maximum relative speedup on a logarithmic scale from 1 to 10000; the horizontal axis lists the 40 benchmark matrices)

The theoretical maximum relative speedup generally
tends to increase when the matrix becomes denser. If we look back to Fig. 6.4, we can
find that the theoretical maximum relative speedup and the actual 8-thread relative
speedup of re-factorization have a similar trend. This means that the theoretical
maximum relative speedup is consistent with the actual performance. For extremely
sparse matrices, the theoretical maximum relative speedup is quite low (less than
100), indicating that the actual scalability is not high in practice, and there must be
some limiting factors that restrict the scalability.

7.2.2 Predicted Relative Speedup

Figure 7.3 shows a scatter plot that describes the relation between the predicted
relative speedup and the actual relative speedup of re-factorization when 8 threads
are used. It clearly shows that the predicted relative speedup is consistent with the
actual relative speedup, and there is an approximately linear relationship between
them. Consequently, the proposed performance model can be used to predict the
parallel efficiency of re-factorization of NICSLU. Of course, there are many detailed
factors that affect the actual performance, which cannot all be captured by our
model. However, it is possible to capture the major factors and reasonably predict
the performance by a simple performance model. In what follows, we will analyze
the bottlenecks that can affect the scalability of NICSLU.
Fig. 7.3 Relation between the predicted relative speedup and the actual relative speedup of re-factorization (T = 8) (the horizontal axis is the predicted relative speedup, from 0 to 8; the vertical axis is the actual relative speedup)

7.2.3 Bottleneck Analysis

In order to investigate the bottlenecks in sparse LU re-factorization, we plot the
percentages of the waiting cost and the synchronization cost in Fig. 7.4.

Fig. 7.4 Percentages of the waiting cost and the synchronization cost (the vertical axis is the percentage, from 0% to 200%; the two series are Waiting% and Synchronization%; the horizontal axis lists the 40 benchmark matrices)

As the matrix becomes denser, the waiting cost and the synchronization cost both tend to
decrease, as the computational cost, i.e., the number of FLOPs, tends to increase
for dense matrices. For a few extremely sparse matrices, i.e., the matrices on the far
left of the figure, the synchronization cost is higher than the waiting cost, and both
can be very high. This observation indicates that extremely sparse matrices are not
suitable for parallel factorization, as the synchronization cost is too high. Additionally,
the waiting cost is also high due to the insufficient parallelism. However, when the
matrix is not so sparse, the synchronization cost decreases rapidly, and the waiting
cost dominates the parallel overhead. Even for slightly dense matrices, the waiting
percentage can be up to 20%. This also means that the parallelism is the major
limiting factor of the scalability of NICSLU for those matrices.
Chapter 8
Conclusions

Efficiently parallelizing the sparse direct solver in SPICE-like circuit simulators is a
practical problem and also an industrial challenge. The high sparsity and the irregular
symbolic pattern of circuit matrices, and the strong data dependence during sparse
LU factorization, make the sparse direct solver extremely difficult to parallelize.
In this book, we have introduced NICSLU, a parallel sparse direct solver which is
specially targeted at circuit simulation applications. We have described algorithmic
methods and parallelization techniques that aim to realize a parallel sparse direct
solver for SPICE-like circuit simulators. Based on the baseline G-P sparse left-
looking algorithm [1], we have presented an innovative parallelization framework
and novel parallel algorithms of the sparse direct solver in detail. We have also shown
how to improve the performance by simple yet effective numerical techniques. Not
only the features of circuit matrices, but also the features of the circuit simulation
flow are fully taken into account when developing NICSLU. In particular, we have
developed the following innovative techniques in NICSLU:

• An innovative framework to parallelize sparse LU factorization is proposed, which
is based on a detailed dependence analysis and contains two different scheduling
strategies that fit the different data dependences and sparsity levels of circuit matrices.
• In addition to the existing G-P sparse LU factorization algorithm, we have also
proposed two fundamental algorithms to fit different levels of sparsity of circuit matrices.
A simple yet effective method is proposed to select the best algorithm according
to the matrix sparsity. We have shown that carefully designing different algorithms
and selecting the optimal one according to the sparsity achieves better performance
than using the pure G-P algorithm.
• Sufficient parallelism is explored among highly dependent tasks by a novel pipeline
factorization algorithm.


• A numerically stable pivoting reduction technique is proposed to reuse previous
information as much as possible during successive factorizations in circuit simu-
lation. This technique fully utilizes the unique features of SPICE iterations. We have
also proposed a simple yet effective method to select the factorization method
during SPICE iterations.
The sparse direct solver techniques described in this book have been proven to
deliver high performance in actual circuit simulation applications and can be applied
to any SPICE-like circuit simulator. The parallelization and improvement techniques of
the sparse direct solver can also be applied to other sparse matrix algorithms.
We have also developed a performance model to deeply analyze the bottlenecks of
NICSLU. For extremely sparse matrices, due to the insufficient parallelism and the low
computational cost, the synchronization cost dominates the total runtime. For slightly
dense matrices, the parallelism is the major bottleneck. In order to reduce the synchro-
nization cost and explore the parallelism, blocked parallel factorization algorithms
can be developed and studied in the future. In such approaches, an efficient circuit or
matrix partitioning method is required, and the load balance problem needs special
attention. As circuit matrices are of high sparsity, developing a low-overhead but
efficient scheduling method is still a challenge.
As a final note, NICSLU can be downloaded from http://nics.ee.tsinghua.edu.
cn/people/chenxm/nicslu.htm.

Reference

1. Gilbert, J.R., Peierls, T.: Sparse partial pivoting in time proportional to arithmetic operations.
SIAM J. Sci. Statist. Comput. 9(5), 862–874 (1988)
Index

A
Amdahl's law, 9, 10
Approximate minimum degree (AMD), 44, 47, 48, 113
As-soon-as-possible (ASAP), 68, 117, 118
Atomic operation, 66, 67, 73

B
Basic linear algebra subprogram (BLAS), 15–17, 33, 85, 87, 88, 99
Benchmark test, 99, 102, 113
Blocked waiting, 77
Bordered block-diagonal (BBD), 22–25

C
Cache efficiency, 82, 88
Circuit simulator, 1–4, 7, 8, 10, 13, 14, 19, 27, 43, 44, 56, 99, 105, 109, 114
Cluster mode, 11, 63, 69, 70, 75, 76, 90, 97, 98
Compressed array, 53, 80, 82
Compute unified device architecture (CUDA), 33–35
Critical path, 121

D
Data dependence, 9, 11, 15, 63, 64, 74, 121
Dependence graph, 15, 16, 64, 74, 75, 107, 117, 118, 121
Depth-first search (DFS), 45, 46, 51, 52, 54, 73
Differential algebraic equation (DAE), 3, 4, 7, 27, 28, 32
Direct acyclic graph (DAG), 15, 51, 52, 64, 73, 74, 117, 121
Direct method, 4, 7, 14, 19, 22
Domain decomposition, 22

E
Earliest finish time, 118–120
Electronic Design Automation (EDA), 1, 2
Elimination graph (EG), 48, 74, 75, 106, 117
Elimination tree (ET), 15, 17, 64, 68, 72, 74, 75, 105–107
ESched, 68, 69, 75

F
Fast cluster, 98
Fast factorization, 96, 97
Fast pipeline, 98
Field programmable gate array (FPGA), 13, 19, 33–35
Fill-in, 5, 6, 20, 43, 47, 48, 86, 88, 102, 107, 109, 112–116
Floating-point operations (FLOP), 14, 47, 49, 57, 117–119, 123
Forward/backward substitutions, 5, 7, 14, 16, 43, 44, 57, 107–111, 114

G
Gaussian elimination, 4, 48, 53, 82
Giga FLOP (GFLOP/s), 109, 111, 112
Graphics processing unit (GPU), 13, 19, 33–35

I
Incomplete factorization, 20
Indirect memory access, 82, 88
Inter-thread synchronization, 76, 77
Iterative method, 7, 18–20, 22, 26, 27, 56
Iterative refinement, 11, 16, 43, 44, 57–59, 107, 113

J
Jacobian matrix, 4, 7

K
Krylov subspace, 32

L
Left-looking, 15, 18, 50, 51, 54, 63, 64, 69, 70, 74, 79–82, 86
Linear algebra package (LAPACK), 15, 17
Linear system, 4, 5, 7, 14, 17, 19, 20, 25, 27, 43
Load imbalance, 34, 66
Lower-upper (LU) factor, 6, 7, 9, 11, 17, 20, 22, 34, 43, 48, 49, 55, 63, 64, 74, 80–83, 85, 90, 95
Lower-upper (LU) factorization, 4, 5, 7, 14–16, 18, 20, 23, 34, 43–45, 47, 49–52, 55, 57, 63, 65, 68, 81, 82, 94, 99, 102, 114, 117, 120

M
Map algorithm, 11, 44, 49, 79, 81–84, 102, 103
Matrix exponential method, 32
Matrix ordering, 47
Model evaluation, 7–11, 13, 14, 31, 33, 34, 101
Modified nodal analysis (MNA), 3, 6, 7, 9, 85
Multi-core parallelism, 8, 19
Multifrontal method, 15–17
Multiplication-and-add (MAD), 52, 55, 76

N
Newton-Raphson method, 4, 7, 20, 24, 25, 29, 56, 94
Numerical factorization, 14, 43, 44, 55, 57, 63, 64, 79, 81, 101
Numerical stability, 5, 7, 11, 53, 54, 56, 57, 79, 94, 107, 113
Numerical update, 51–53, 55, 70–73, 76, 80, 82, 84, 86–88, 90, 91, 93, 95, 96, 117

P
Parallel circuit simulation, 7, 13, 22
Parallel efficiency, 8–10, 14, 117, 120–122
Partial pivoting, 5, 11, 15, 35, 43, 50, 51, 53–55, 63, 64, 71, 72, 74, 87, 90, 94–97, 107, 113
Performance model, 117, 119–122
Performance profile, 101, 102, 109, 111, 112
Pipeline mode, 11, 63, 69–72, 75–77, 83, 84, 90–92, 94, 97, 98, 121
Pivoting reduction, 11, 79, 94–97
Post-layout simulation, 7, 116
Power grid, 101, 114
Pre-analysis, 5, 6, 11, 13, 43–45, 48, 50, 87, 106, 107, 109, 113
Pre-conditioner, 7, 18–21
Pruning, 49–51, 54, 55, 71–74, 90, 96
Pseudo condition number (PCN), 44, 56, 57
Pseudo-dynamic scheduling, 65–67, 70, 71, 73, 76, 84, 90, 92
Pseudo-snapshot, 73

R
Re-factorization, 43, 44, 51, 55–57, 63, 67, 74–76, 79, 81–84, 90–92, 94, 95, 103, 104, 106, 107, 109–111, 117–119, 121–123
Relative speedup, 101, 102, 104–106, 117, 119–123
Relaxation method, 22, 26–28
Re-pivoting, 95, 96, 98
Residual, 20, 26, 58, 59, 102, 112, 113
Root-mean-square error (RMSE), 113

S
Scalability, 8–10, 14, 16, 26, 30, 32, 94, 101, 104–107, 109, 117, 120, 122, 124
Scatter plot, 106, 122
Scatter-gather, 52, 53, 55, 80
Schwarz method, 22, 25, 26
Simulation Program with Integrated Circuit Emphasis (SPICE), 1–4, 6–10, 13, 14, 19, 20, 22, 25, 27–29, 31, 34, 44, 81, 94, 99, 105, 106, 109, 114
Simulation test, 99, 114
Sparse direct solver, 1, 8–11, 13, 14, 17, 31, 33–35, 43, 47, 101
Sparse matrix-vector multiplication (SpMV), 19, 32
Sparsity ratio (SPR), 44, 49, 50, 57, 80, 102, 103, 105–107
Speedup, 101, 103, 104, 107–110, 112, 114
Spin waiting, 77, 78
Static pivoting, 11, 44–47, 113
Static scheduling, 15, 65–67, 70, 71, 90
Submatrix kernel, 84
Supernodal method, 15
Supernode, 15, 84–88, 90, 92, 96, 113
Supernode-column algorithm, 84, 85, 88
Symbolic factorization, 43, 44, 48, 49
Symbolic pattern, 6, 7, 9, 24, 34, 44, 45, 48–52, 54, 55, 63, 64, 68, 71, 72, 74, 75, 81–84, 86–88, 90, 95–98, 119, 120
Symbolic prediction, 49–52, 54, 55, 70–74, 79, 86, 88, 90, 95, 96
Synchronization cost, 69, 117, 119–121, 123

T
Task flow graph, 118
Timing constraint, 118, 119
Transient analysis, 2

U
University of Florida sparse matrix collection, 99

V
Very-large-scale integration (VLSI), 1

W
Waiting cost, 92, 120, 121, 123

Z
Zero-free permutation, 44–47, 51
