MIMO-OFDM

Application-Specific Processor for MIMO-OFDM
Software-Defined Radio
Diss. ETH No. 18582
A dissertation submitted to
ETH ZURICH
for the degree of

Doctor of Sciences
presented by
STEFAN EBERLI
Dipl. El.-Ing. ETH
born 15.4.1978
citizen of Zürich (ZH) and Hüttwilen (TG)
accepted on the recommendation of

Prof. Dr. W. Fichtner, examiner
Prof. Dr. H. Meyr, co-examiner
2009
Acknowledgments
I would like to express my gratitude to Prof. Dr. Fichtner, who gave

me the opportunity to pursue my Ph.D. at the Integrated Systems Lab-
oratory (IIS), in an excellent scientific environment and with excellent
colleagues. His concise advice kept me focused on the project, without
losing sight of the big picture. My gratitude goes also to Prof. Dr. Meyr for
reading, correcting, and co-examining this thesis.
I am very thankful to Dr. Norbert Felber and Dr. Hubert Kaeslin
for their support during all the years at the IIS, and for their valuable
input while proof-reading this manuscript – thank you Norbert, e
grazie mille Hubert. Next, I would like to acknowledge Prof. Dr. Burg.
He has been a reference since the beginning and his precious and
practical advice guided me many times – thank you Andy.
BridgeCo AG has been the industrial partner of this Ph.D. project.
Among the many colleagues at BridgeCo, I am especially thankful
to Thomas Thaler, who contributed to setting up the project; and to
Dr. Markus Thalmann, Matthias Tramm, and Dr. Manfred Stadler
for their support and the fruitful discussions and advice on processors.
Thanks go also to the Swiss innovation promotion agency CTI that
enabled the project.
Among the colleagues at the IIS, my gratitude goes to Dr. David
Perels for carefully following me in the early stage of my thesis. Also, a
great fortune was the possibility to discuss with top-class and brilliant
colleagues such as Christoph Studer, Felix Bürgin, Flavio Carbognani, Frank,
Jürg Treichler, Marc Wegmüller, Markus Wenk, Matthias Brändli,
Peter Lüthi, Simon Häne, and Stephan Oetiker.
At the Communication Theory Group, Davide Cescato helped
me a lot with mathematical discussions and contributed with the
divide-and-conquer matrix inversion method – grazie Davide.

I also would like to thank the following students for their contri-
butions to this thesis: Luca Henzen, Christoph Pedretti, and Lorenzo
Bardelli (master's theses); and Benjamin Dietrich and Lukas Haas
(semester theses).
Finally, I owe much to my family, and to Mattea and Caterina –
thanks for coloring life ♥.
Abstract
Software-defined radios (SDRs) present a promising approach to face

the demands of today’s fast evolving environment of wireless communi-
cation standards. Ever increasing requirements in terms of performance
and flexibility call for programmable, high-performance signal pro-
cessing platforms. Performance is necessary to cope with the high
computational burden inherent to wireless communications. Flexibility
is desired to shorten the time-to-market. Unfortunately, performance
and flexibility are antagonists and are difficult to combine in one single
efficient platform, as one is obtained at the cost of the other.
This thesis contributes to the SDR research domain, addressing
the implementation of a 2 × 2 MIMO-OFDM receiver on an SDR
platform. Appropriate receiver algorithms are evaluated and the
corresponding computational complexity is derived. The use of low-
complexity algorithms is imperative to spare the limited processing
resources of a programmable platform. Three software-programmable
architectures are evaluated to find a suitable SDR platform, eventually
leading to the selection of a design-time configurable application-
specific processor as platform. The 2 × 2 MIMO-OFDM receiver is
split into two parts which are mapped onto two application-specific
processors, each tailored to the computational needs of the associated
digital signal processing kernels. The first processor performs the
per-stream MIMO-OFDM processing. The second processor handles
the MIMO detection. Finally, the 0.18 µm 1P/6M CMOS technology
layout of both fabricated application-specific processors is presented:
the silicon area required by the two processors is 7.65 mm² and
real-time baseband processing is possible on these engines running at a
clock frequency of 250 MHz.
Zusammenfassung
Software-defined radios (SDRs) are a promising approach for adapting
quickly to today's fast-evolving environment of wireless communication
standards. Ever-increasing requirements, expressed in terms of
performance and flexibility, call for programmable and powerful signal
processing platforms. Performance is necessary to cope with the high
computational effort inherent to wireless communication. Flexibility
is desired to shorten the time-to-market. Unfortunately, performance
and flexibility are antagonists and are difficult to combine in a single,
efficient, and powerful platform, since one is obtained at the cost of
the other.
This thesis contributes to the SDR research field and addresses
the implementation of a 2 × 2 MIMO-OFDM receiver on an SDR
platform. To this end, suitable receiver algorithms are evaluated and
the corresponding computational effort is derived. The use of
low-complexity algorithms is imperative to save the precious computing
resources of a programmable platform. Three software-programmable
architectures are evaluated to find a suitable SDR platform. A
configurable, application-specific processor is finally selected as the
platform. The 2 × 2 MIMO-OFDM receiver is then split into two parts,
which are mapped onto two such application-specific processors, with
the computational units of the two processors tailored to the
characteristics of the algorithms. The first processor performs the
processing of the MIMO-OFDM data stream. The second processor handles
the MIMO detection. Finally, the 0.18 µm 1P/6M CMOS layout of both
fabricated processors is presented: the combined silicon area of the
two processors is 7.65 mm², and real-time data processing is possible
at a clock frequency of 250 MHz.
Contents
Acknowledgments
Abstract
Zusammenfassung
1 Introduction
1.1 Motivation – Mobility and Wireless Communications
1.2 This Thesis
1.3 Outline
2 State of the Art
2.1 Design Considerations
2.1.1 Flexible architectures
2.1.2 Technology scaling
2.1.3 Real-valued vs. complex-valued functional units
2.2 Flexible Architectures for OFDM Baseband Processing
2.2.1 Academic players
2.2.2 Relevant examples for industrial implementations
2.3 Flexible Architecture for MIMO-OFDM Baseband Processing
2.4 Summary and Discussion
3 Algorithms and Computational Complexity
3.1 MIMO-OFDM System Model
3.2 Performance and Computational Complexity Metrics
3.3 Choice of the MIMO Detector
3.3.1 Brute-force maximum-likelihood (ML)
3.3.2 Sphere decoding (SD)
3.3.3 K-Best (KB)
3.3.4 Successive interference cancellation (SIC)
3.3.5 Linear detection
3.3.6 Results and conclusion
3.4 Linear MMSE Detection
3.4.1 Adjoint method
3.4.2 LR-decomposition
3.4.3 LDL-decomposition
3.4.4 GS-decomposition
3.4.5 QR-decomposition
3.4.6 Rank-1 update
3.4.7 Divide-and-Conquer algorithm
3.4.8 Results and conclusion
3.5 MIMO-OFDM Receiver Algorithms
3.5.1 Frame-start detection
3.5.2 STF processing
3.5.3 LTF processing
3.5.4 MIMO channel processing
3.5.5 Data processing
3.5.6 Computational complexity of the presented algorithms
3.6 Summary and Conclusion
4 Design Space Exploration
4.1 C6455
4.1.1 SISO-OFDM transceiver
4.1.2 Results, discussion, and conclusion
4.2 MSEC4
4.2.1 Architecture details
4.3 ASPE
4.3.1 Architecture
4.3.2 SISO-OFDM receiver
4.4 Summary and Conclusion
5 MIMO-OFDM SDR Receiver
5.1 SDR Platform Overview
5.1.1 The Modified ASPE
5.1.2 Receiver Architecture
5.1.3 Common ASPE A and ASPE B configuration
5.2 ASPE A – MIMO-OFDM Processing
5.2.1 Datapath configuration
5.2.2 BB processing on ASPE A
5.3 ASPE B – MIMO Detection
5.3.1 Datapath configuration
5.3.2 BB processing on ASPE B
5.4 Dictionary Based Program-Code Compression
5.4.1 Reference design
5.4.2 DBCC with NOP bitmask
5.5 Implementation Results
6 Summary and Conclusions
A MIMO Detection Methods
A.1 Sphere Decoding
A.2 K-Best
A.3 Successive Interference Cancellation
A.4 Linear Detection – Matrix Decomposition and Inversion Methods
A.4.1 Adjoint method
A.4.2 LR-decomposition
A.4.3 LDL-decomposition
A.4.4 GS-decomposition
A.4.5 QR-decomposition
A.4.6 Rank-1 update method
A.4.7 Divide-and-conquer method
B Datasheet
Chapter 1
Introduction
1.1 Motivation – Mobility and Wireless Communications
The beginnings. Long-distance wireless communication across the
Atlantic Ocean was inaugurated in 1901 by Marconi. On Signal
Hill in Newfoundland (CA) he received a message transmitted from
Cornwall (GB). Since then, the domain of wireless communications has
evolved tremendously and, with the advances in both hardware and
wireless technology, wireless communications have diffused into
everyone's daily life.
Diffusion to mass market. In the late 1980s, the definition of
mobile phone communication standards paved the way for the proliferation
of economically affordable mobile wireless communications to the
mass market, culminating today in penetration rates near 100 %. The
success of mobile wireless communications was – and still is – fueled by
the human need for communication and social interaction, combined
with a global environment demanding flexibility and mobility.
This need for mobility is also well reflected and sustained by the
rapid development and deployment of portable personal computers
where, again, wireless communications played an important role. Ini-
tially, wired local area networks (LANs) were conceived for large
universities and research laboratories to interconnect their computers.

In the early 1990s, the Internet's tentacles began to grow dramatically,
interconnecting an increasing number of LANs via metropolitan area
networks (MANs). The Internet diffused into the homes, providing the
mass market with the possibility to share and disseminate information
across almost the entire globe. In the late 1990s, wireless communica-
tion was introduced into the LAN infrastructure thanks to the global
standardization effort, which led to the initial IEEE 802.11 wireless
LAN (WLAN) standard (2 Mbit/s). Wired LAN connections could
finally be replaced by wireless ones, and gradually, new wireless access
points appeared. Nowadays, train stations, airports, and even coffee
bars provide WLAN infrastructure, leading to an almost ubiquitous
Internet connectivity. MANs are experiencing a similar fate, and
wireless links started to replace wired last-mile connections.
Today: plethora of standards. Despite all attempts at global
standardization, the world of (consumer) wireless communications is
populated by differing standards. These standards are tailored to the
particular needs of their own application domain, employing the appro-
priate modulation techniques to best exploit the application’s wireless
channel. To name a few, the WMAN IEEE 802.16 standard, the
IEEE 802.11a/b/g WLAN standards, the Bluetooth wireless personal
area network (WPAN) standard, as well as the mobile phone GSM
standard all populate this heterogeneous world. Figure 1.1 illustrates
the situation. Today, the interaction between services relying on these
different standards is crucial for the mobile end-user. For instance, a
calendar has to be synchronized between laptop and mobile phone via
Bluetooth, and, at the same time, the mobile phone has to support
GSM and possibly UMTS for its original duty.
While infrastructure components can concentrate on a single wire-
less standard and do not pose overly tight requirements to power
consumption, mobile terminals (e.g., mobile phones, laptops) evolve
towards multi-standard platforms that are very sensitive to power
consumption. For conventional mobile terminals, this means the in-
tegration of one dedicated, hardwired modem for each supported
standard. The result is an efficient but complex platform that has
long redesign times and little flexibility: an unacceptable condition in
this rapidly evolving world.

Figure 1.1: Wireless networks form a heterogeneous environment
(source [1]).

SDR. Innovative solutions group multiple standards onto a single,
programmable and thus flexible platform, pointing towards the
software-defined radio (SDR)¹. The ideal SDR is capable of handling all
imaginable standards, from the radio frequency (RF) front-end to the
baseband processing and up to MAC layer processing – all on the same
platform. The platform on which the SDR resides is programmed in
software enabling run-time adaptation to each particular standard. For
instance, a single SDR chip embedded in a mobile phone would provide
the connectivity to the mobile phone network, the WLAN, and the
WPAN; either in mutual exclusion, or even concurrently, time-sharing
the hardware resources to process the different standards.
Compared to dedicated hardwired solutions, the advantages of
SDR platforms are evident. They reduce the time to market due to
the inherent programmability and design re-use. Algorithms can be
improved or adapted to evolving standards after chip-fabrication, and
both hardware and software bugs can be fixed. In other words, they
perfectly fit this heterogeneous and rapidly evolving habitat.

¹ www.sdrforum.org

Unfortunately, the stringent – and, when combined, somewhat
utopian – requirements of the ideal SDR pose significant challenges to
all components of the underlying platform. With today's technology,
the ideal SDR is unrealistic and, indeed, no such fully generic SDR
exists. Although the advances in silicon integration technology,
expressed by Moore's law [2] as exponential in time, bring some relief,
they are partially nullified by the concurrent complexity increase
associated with the exponential growth of communication data rates
predicted by Edholm's law [3].
In this scenario, the search for an efficient hardware platform
contributing to the SDR realization is imperative. It appears
reasonable to step back from the ambition of directly implementing the
ideal SDR. Current research efforts focus on the different components
of the SDR (i.e., the RF front-end, the baseband processing, etc.)
separately. In the baseband processing domain – in which this thesis
is situated – an important aspect to be considered is the platform, or
the flexible architecture (FA) employed to implement the SDR func-
tionality. This FA has to have the right balance between flexibility
and processing performance required to support the considered stan-
dards. Commercial SDRs are appearing as standard-specific baseband
software-programmable architectures that address the relatively low
data rates of mobile phone standards (e.g., [4, 5]). For high data rates,
the implementation on the limited processing resources of an FA is
challenging and commercial solutions are still lacking.
1.2 This Thesis

This thesis contributes to the SDR development by concentrating on
the WLAN domain and by implementing the relevant MIMO-OFDM
baseband processing on an FA. Instead of pursuing the ambition
of designing a fully-generic platform, the more pragmatic standard-
specific approach is followed.
In the WLAN domain, the push towards higher and higher data
rates is well documented by the numerous amendments to the initial
IEEE 802.11 standard. Orthogonal frequency-division multiplexing
(OFDM) is included as a modulation technique in the 802.11a-1999
supplement, enabling data rates up to 54 Mbit/s. Currently, the use of
multiple antennas at both ends of the wireless link – commonly referred
to as multiple-input multiple-output (MIMO) – is being integrated in
what will become the 802.11n supplement [6, 7, 8].
With respect to single-antenna communication systems (single-
input single-output, SISO), MIMO systems deliver significant perfor-
mance gains that manifest themselves in a broader coverage, a higher
throughput, or a more reliable wireless link. For the transmission
over a wideband frequency-selective channel, MIMO is typically com-
bined with OFDM (as in the 802.11n supplement), which divides
the wideband frequency-selective channel into a set of narrowband
frequency-flat, parallel subchannels. As a result, the communication is
more robust in multipath-propagation environments and the channel
equalization at the receiver becomes less computationally intensive.
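
The effect of OFDM on equalization can be illustrated with a minimal,
noise-free, single-antenna sketch. It is not the receiver of this thesis:
the subcarrier count, the QPSK symbols, and the channel taps are
hypothetical values, and the cyclic prefix is modeled implicitly by a
circular convolution. Under these assumptions, each subcarrier sees one
complex gain, so equalization reduces to one division per subcarrier.

```python
import cmath

def dft(x, inverse=False):
    """Naive O(N^2) DFT; with inverse=True it computes the normalized IDFT."""
    n = len(x)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        acc = sum(x[m] * cmath.exp(sign * 2j * cmath.pi * k * m / n)
                  for m in range(n))
        out.append(acc / n if inverse else acc)
    return out

def circular_convolve(x, h):
    """Multipath channel h acting on one OFDM symbol x, assuming a cyclic
    prefix at least as long as the channel memory."""
    n = len(x)
    return [sum(h[l] * x[(m - l) % n] for l in range(len(h)))
            for m in range(n)]

# Hypothetical example values (not taken from the thesis).
N = 8
tx_symbols = [1+1j, 1-1j, -1+1j, -1-1j, 1+1j, -1-1j, 1-1j, -1+1j]  # QPSK
channel_taps = [1.0, 0.5, 0.25j]  # frequency-selective channel

# Transmitter: the IDFT maps the subcarrier symbols to a time-domain symbol.
time_signal = dft(tx_symbols, inverse=True)

# Channel: wideband, frequency-selective (noise omitted in this sketch).
received = circular_convolve(time_signal, channel_taps)

# Receiver: the DFT turns the channel into N flat subchannels Y[k] = H[k] X[k].
rx_freq = dft(received)
channel_freq = dft(channel_taps + [0] * (N - len(channel_taps)))

# One-tap equalization: a single complex division per subcarrier.
equalized = [y / h for y, h in zip(rx_freq, channel_freq)]

max_error = max(abs(e - s) for e, s in zip(equalized, tx_symbols))
print("largest equalization error:", max_error)
```

In a real receiver the channel gains are estimated from training fields
rather than known, and noise makes the plain division a design choice
rather than the only option.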
However, compared to SISO-OFDM, the signal processing load
inherent to MIMO-OFDM communication is significant: it scales at
least with the number of transmit antennas and it depends on the
employed receiver algorithm. Therefore, for real-time operation on FAs,
only low-complexity MIMO-OFDM transceivers with computationally
efficient algorithms can be considered.
In this thesis, three different FAs are assessed by mapping compu-
tationally hard OFDM baseband processing kernels onto the following
architectures: Texas Instruments high-end C6455 DSP, a special pur-
pose baseband processor, and a design-time customizable application-
specific instruction-set processor (ASIP) developed in the predecessor’s
Ph.D. thesis [9]. Eventually, the proposed ASIP qualifies best, and
proceeding further on this track for the 2 × 2 MIMO-OFDM baseband
implementation is worthwhile.
Contributions. In summary, the two main contributions of this
thesis are:
• The implementation of the complete IEEE 802.11a baseband

processing (except Viterbi decoding) on an ASIP, published
in [10] and described in Section 4.3. This contribution shows that
the real-time implementation of OFDM baseband processing on
the selected ASIP is possible, enabling data rates up to 54 Mbit/s.
Compared to other work in the domain, the solution presented
here is very competitive in terms of silicon area.

• The implementation of the relevant baseband processing ker-
nels of a 2 × 2 MIMO-OFDM receiver on a pair of properly
customized ASIPs, published in [11] and detailed in Chapter 5.
The comparison of the proposed baseband receiver with pub-
lished related results [12] indicates that – thanks to appropriate
design-time customization and to low-complexity algorithms –
the solution presented here is very area-efficient. In addition, it
suggests that ASIPs are appropriate and efficient vehicles for
the further development of baseband processing in SDRs.
A number of minor contributions were necessary to construct and

consolidate the two main contributions:
• The classification of FAs that are employed for OFDM baseband
processing and are reported in the open literature. The clas-
sification clarifies the options available to design FAs and the
obtained performance. The material is presented in Chapter 2.
• The detailed analysis of existing MIMO detection algorithms,
performed with special attention to computational complexity,
in view of the implementation on an FA. The analysis can be
used as a reference for future work in the domain, for instance,
when more sophisticated detectors are required. The material is
presented in Chapter 3, partly in [11] and [13].
• The mapping of hard computational kernels identified during the
algorithm evaluation on different FAs permits their assessment
for (MIMO-)OFDM processing. The material is presented in
Chapter 4.
• Finally, this thesis acts as proof of concept for the design-time

customizable ASIP framework, proposed in the predecessor’s
Ph.D. thesis [9].
1.3 Outline
The remainder of this thesis is organized as follows:
Chapter 2 reviews related work. FAs implementing OFDM-related
baseband processing tasks are presented. Relevant figures of merit
are gathered and/or extrapolated from the open literature. The final
discussion allows a comparison of the different FAs.
Chapter 3 is dedicated to the algorithms. After an introduction
to the domain of MIMO-OFDM wireless communications, two crucial
MIMO-OFDM design considerations are made. First, the computational
complexities of several MIMO detectors are compared, which leads to the
selection of linear MMSE detection as an appropriate detector. Second,
algorithms to compute linear MMSE detectors are assessed in terms of
their computational complexity and BER performance. Finally, the
complete baseband
processing for the MIMO-OFDM receiver considered in this thesis is
described and the associated computational complexity is derived.
Chapter 4 evaluates three different FAs, by mapping computation-
ally hard OFDM baseband processing kernels onto each one of them.
The evaluation suggests considering the design-time customizable ASIP
for the case study described in the subsequent chapter.
Chapter 5 details the implementation of the relevant baseband
processing kernels of the 2 × 2 MIMO-OFDM receiver. The receiver
is split onto two properly configured ASIPs. The chapter concludes
with the comparison to the only known related work [12] and provides a
reference for the silicon complexity.
Chapter 6 summarizes and discusses the achieved results and
draws the appropriate conclusions.
Chapter 2
State of the Art
This chapter reviews the literature of flexible architectures that are

employed as SDR platforms for OFDM baseband processing in wireless
communication systems. The review includes selected examples from
both academia and industry, and it aims at presenting the architecture
design environment in which this thesis is situated.
Flexible architectures are briefly described and characterized by
their area, processing performance, and power consumption. At the
end of the chapter, these characteristics are presented in a unified
manner to get an overview of the domain. Before the actual literature
review, a few concepts and considerations related to the design of FAs
need to be introduced.
2.1 Design Considerations

2.1.1 Flexible architectures
The term flexible architecture (FA) unifies two architecture categories,
namely: software-programmable architectures (SPAs) and reconfigurable
architectures (RAs). Figure 2.1, on the left, delineates the generic
FA concept with a block diagram. FAs allow their datapath (DP) to
operate in several modes or configurations through F control bits that
are determined by the control path (CP). Inside the DP, a number of
functional units (FUs) perform the actual data processing. Memory
stores the data to be processed, as well as instructions or configurations.
Data and instructions or configurations may also be stored in separate
memories.

Figure 2.1: Left: Flexible architecture (FA) block diagram with control
path (CP), datapath (DP) containing the functional units (FUs), F control
bits, and memory. Right: Software-programmable architectures (SPAs),
reconfigurable architectures (RAs), and a combination of these
(SPA+RA) are FA sub-classes.

SPAs determine their DP operation with a sequence of instructions
that are fetched from the instruction memory: the DP can change
its operation mode at each clock cycle. However, the instruction set
architecture is determined at design time, and it cannot change after
the SPA is fabricated. The desired functionality is implemented by
selecting and sequencing the appropriate instructions into program-
code, ideally with the aid of a software development tool. RAs, instead,
keep one DP configuration F for several clock cycles, and, commonly,
a change of the DP configuration requires several clock cycles to take
place. The configurations are determined after design-time by the
user/programmer and are loaded into the configuration memory during
an initialization phase. A combination of SPA and RA (SPA+RA)
results in part of the DP being able to change operation mode at each
clock cycle, and another part – the reconfigurable one – to continue
with the same configuration for many clock cycles. This classification
is depicted on the right side of Figure 2.1.
Processing performance. The peak processing performance of FAs
is commonly reported in millions of operations per second (MOPS). It
is derived by multiplying the number of functional units (FUs) that can
be addressed in parallel during one clock cycle by the operating clock
frequency. Conventionally, to compute the processing performance, all
operations that can be executed in parallel are taken into account (e.g.,
load/store, data processing operations, address generation operations).
The so-obtained processing performance is a qualitative measure
that allows only a rough comparison between differing architectures.
Because the operations on the various architectures are not necessarily
the same, comparisons are error-prone.
In this thesis, for the comparison with the computational load
inherent to the algorithms presented in Chapter 3, the processing
performance (PP) is defined as the millions of data-operations per
second (MdOp/s) an FA can execute. This unit considers only the
data processing relevant operations that can be executed in parallel on
the FA, multiplied with the operating clock frequency. The relevant
data processing operations are those typical to digital signal processors:
additions, subtractions, multiplications and combinations of these.
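
The difference between the conventional peak figure and the PP metric
defined above can be sketched as follows. The issue-slot mix and the
200 MHz clock are hypothetical, and counting a multiply-accumulate as
one data operation is an assumption of this sketch, not a rule stated
in the text.

```python
# Operation kinds treated as data processing (typical DSP arithmetic).
# Counting "mac" as a single data operation is an assumption of this sketch.
DATA_OPERATIONS = {"add", "sub", "mul", "mac"}

def peak_mops(issue_slots, clock_mhz):
    """Conventional peak figure: every operation issued in parallel counts."""
    return len(issue_slots) * clock_mhz

def processing_performance_mdops(issue_slots, clock_mhz):
    """PP in MdOp/s: only data-processing operations are counted."""
    data_slots = [op for op in issue_slots if op in DATA_OPERATIONS]
    return len(data_slots) * clock_mhz

# Hypothetical FA: eight operations per clock cycle, four of them arithmetic.
issue_slots = ["load_store", "load_store", "address_gen", "address_gen",
               "add", "mul", "mac", "mac"]
clock_mhz = 200

peak = peak_mops(issue_slots, clock_mhz)
pp = processing_performance_mdops(issue_slots, clock_mhz)
print(f"peak: {peak} MOPS, PP: {pp} MdOp/s")
```

For this hypothetical mix the PP figure is half the peak figure, which
shows why the two numbers should not be compared with each other.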
2.1.2 Technology scaling

Important figures of merit that characterize ICs include silicon area,
processing performance (e.g., the throughput), and energy consumption.
These figures of merit are related to one IC realization in a given
CMOS technology. Since the architectures reported in the literature
may be implemented in different technologies, their figures of merit
cannot be compared directly to each other, especially when considering
architecture design aspects.
Therefore, in this thesis, these technology-dependent figures of
merit are normalized to a 0.18 µm CMOS technology. Although scaling
to a reference technology represents an approximation, it allows
more meaningful conclusions. The scaling from the original technology,
utilized in the considered publication, to the 0.18 µm reference
technology is performed according to [14], assuming devices and wires
scale equally with 1/αD:

\[ A_{0.18} = A \cdot 1/\alpha_D^2 \tag{2.1} \]
\[ f_{0.18} = f \cdot \alpha_D \tag{2.2} \]
\[ P_{0.18} = P \cdot (\gamma/\alpha_D)^2, \tag{2.3} \]

where A, f, and P stand for the area, clock frequency, and power
consumption in the original technology and the normalized quantities
are indicated by the 0.18 subscript. The technology scaling factor
αD = x/0.18 is derived from the half-minimum-feature-size x in the
original technology. Although not standardized, the CMOS technology's
name commonly indicates approximately the associated
half-minimum-feature-size x: for instance, in a 0.13 µm CMOS
technology, x = 0.13. The correction factor γ reflects the voltage scaling
V0.18 = V · γ/αD from the original to the 0.18 µm reference technology
(typically ≈ 1).
Figure 2.2 illustrates the area-time product curves (AT-plot, [15])
for a 16 bit adder and a 16 bit multiplier, synthesized for 0.25 µm,
0.18 µm, and 0.13 µm UMC CMOS technologies. The curves have
been generated synthesizing the corresponding circuit with Synopsys
Design Compiler (Z-2007.03-SP3) and applying the timing constraints
indicated by the curve’s markers. The blue circles show the results
obtained when taking the 0.25 µm CMOS technology as starting point
and scaling the two designs to the 0.18 µm (αD = 0.25/0.18), and
0.13 µm (αD = 0.25/0.13) technologies, according to (2.1) and (2.2).
It is comforting to note that the scaled results reflect the results
achieved with synthesis well: the distance between scaled result-point
and synthesized result-curve is minimal.
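
The scaling rules (2.1) to (2.3) can be sketched numerically with the
factor αD = 0.25/0.18 quoted above for the step from 0.25 µm to
0.18 µm. The input area, frequency, and power are hypothetical, and
the voltage correction factor is set to 1 as an assumption.

```python
def scale_to_reference(area, freq, power, alpha_d, voltage_correction=1.0):
    """Apply the scaling rules (2.1)-(2.3).

    alpha_d is the technology scaling factor; voltage_correction is the
    correction factor of (2.3), assumed to be 1 in this sketch.
    """
    area_ref = area / alpha_d ** 2                          # (2.1)
    freq_ref = freq * alpha_d                               # (2.2)
    power_ref = power * (voltage_correction / alpha_d) ** 2  # (2.3)
    return area_ref, freq_ref, power_ref

# Factor used in the text for scaling a 0.25 um design to 0.18 um.
alpha_d = 0.25 / 0.18

# Hypothetical design: 10000 um^2, 200 MHz, 50 mW in 0.25 um CMOS.
area_018, freq_018, power_018 = scale_to_reference(10000.0, 200.0, 50.0, alpha_d)
print(f"area  : {area_018:.1f} um^2")
print(f"clock : {freq_018:.1f} MHz")
print(f"power : {power_018:.2f} mW")
```

Area and power shrink by the same factor here only because the
correction factor is 1; a different voltage ratio would change the
power result but not the area.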
2.1.3 Real-valued vs. complex-valued functional

units
Most baseband algorithms operate on complex-valued numbers. Since
conventional FAs provide FUs that perform real-valued arithmetic
operations, it is necessary to map the complex-valued operations
into real-valued ones – which is easily performed (c.f. Section 3.2).
However, an interesting aspect is whether the support of complex-
valued operations by the FA’s datapath is desirable or not.
Figure 2.3 reports the AT-plot for a real-valued and a complex-
valued multiplier-FU, both synthesized for a 0.18 µm CMOS technology.
The two synthesized units are shown in Figure 2.4. While the real-
valued FU requires six clock cycles to compute the complex-valued
multiplication, the complex-valued FU requires just one clock cycle
AT−plot for 16bit Adder

16000
umc250
14000 umc180
umc130
12000
10000
Area [μ m2]
8000
6000
4000
2000
0
0.5 1 1.5 2 2.5 3 3.5 4 4.5
Longest path [ns]
AT−plot for 16bit x 16bit Multiplier

60000
umc250
umc180
50000
umc130
40000
Area [μ m2]
30000
20000
10000
0
1 2 3 4 5 6 7 8
Longest path [ns]
Figure 2.2: Area-time product curves (AT-plot) for 0.25 µm, 0.18 µm,
and 0.13 µm CMOS technology. Top: 16 bit adder, Bottom: 16 bit
multiplier.
for the same task.1 This difference is visible in the AT-plot, where the
isomorphic, complex-valued unit attains the higher throughput (at
the cost of a larger area), whereas the decomposed, real-valued unit
occupies the smallest area (at the cost of a lower throughput). As
shown by the black dashed curve, the AT-efficiency of the two units
is nearly the same, with the complex-valued unit being slightly more
efficient. Operating with FUs that perform only real-valued arithmetic
results in a lower throughput than operating with complex-valued
FUs, since the circuit cannot be synthesized for timing constraints
below 12 ns per data item. Thus, for algorithms that require high
throughput and mostly operate on complex-valued numbers, it is
advantageous to incorporate complex-valued FUs into the FA’s
datapath.
1 The computation of the complex-valued multiplication $C = A \cdot B$ is
divided into six real-valued steps: $s_1 = \Re\{A\}\,\Re\{B\}$,
$s_2 = \Im\{A\}\,\Im\{B\}$, $s_3 = \Re\{A\}\,\Im\{B\}$, $s_4 = \Im\{A\}\,\Re\{B\}$,
$s_5 = s_1 - s_2$, $s_6 = s_3 + s_4$, where the $\Re\{\cdot\}$ and $\Im\{\cdot\}$
operators return the real and imaginary parts of their argument,
respectively. The complex-valued result is $C = s_5 + j\,s_6$.
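The six-step decomposition of footnote 1 can be sketched in a few lines of Python. The sketch is illustrative only: the function name is hypothetical, floating-point arithmetic is used, and the 16 bit fixed-point word widths and shift stages of the units in Figure 2.4 are not modeled.

```python
def complex_mul_six_steps(a: complex, b: complex) -> complex:
    """Multiply two complex numbers using only real-valued operations.

    Each assignment corresponds to one step (one clock cycle on the
    real-valued unit): four multiplications, one subtraction, one addition.
    """
    s1 = a.real * b.real  # step 1: Re{A} Re{B}
    s2 = a.imag * b.imag  # step 2: Im{A} Im{B}
    s3 = a.real * b.imag  # step 3: Re{A} Im{B}
    s4 = a.imag * b.real  # step 4: Im{A} Re{B}
    s5 = s1 - s2          # step 5: real part of the result
    s6 = s3 + s4          # step 6: imaginary part of the result
    return complex(s5, s6)


if __name__ == "__main__":
    a, b = complex(3, 2), complex(1, -4)
    # Compare against Python's built-in complex multiplication.
    print(complex_mul_six_steps(a, b), a * b)
```

A complex-valued FU would evaluate the four products and the two sums within one clock cycle, which is the source of the throughput difference discussed above.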
Figure 2.3: AT-plot (UMC180) for complex-valued multiplication with
real-valued functional unit (RMULADD) and complex-valued functional
unit (CMUL). Axes: time per data item (T) [ns] vs. area [µm2];
annotated curve: AT = 410700 µm2 ns.
Figure 2.4: Left: complex-valued multiplication unit. Right: real-valued
unit able to compute one complex-valued multiplication in six
steps, or clock cycles.
2.2 Flexible Architectures for OFDM Baseband Processing
2.2.1 Academic players
Reconfigurable datapath (RD) The potential of run-time recon-
figurable hardware in supporting different standards that exploit the
same modulation technique is explored in [16] (2003). The RD is
designed to support synchronization and demapping of IEEE 802.11a
and HiperLAN/2 standards, both relying on OFDM. Other tasks that
are also required by the two standards, as for instance the computation
of a fast Fourier transform (FFT) and Viterbi decoding, are deemed
as too computationally intensive and delegated to dedicated hardware
blocks.
The RD is depicted in Figure 2.5. The datapath is steered by a
controller and one of three possible configurations, stored in a config-
uration memory, is applied to the datapath. The configuration bits
for the three operation modes are determined without the aid of a
software tool, i.e. by hand. Small memories for temporary data storage
are distributed across the datapath. The architecture falls into the
category of RAs.
The circuit is realized in 0.35 µm CMOS technology and runs at a
clock frequency of 100 MHz [16]. No power consumption figures are
reported. The area of the circuit is normalized to the area of one 8 bit
multiplier, and amounts to 143. The area savings obtained by sharing
parts of the datapath instead of implementing distinct units for each
computation amount to 20 % [16]. The datapath has a wordwidth of
8 bit and incorporates nine multiplier units, one divider, and twelve
adders (the two CORDIC units are neglected). Consequently, the
processing performance is PP = 2’200 MdOp/s.2
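The processing-performance figure follows from counting the datapath operations that can execute concurrently and multiplying this count by the clock frequency. A minimal Python sketch of this calculation for the RD is given here, assuming (as footnote 2 does) that each counted unit contributes one dOp per clock cycle. The function and variable names are hypothetical.

```python
def peak_processing_performance(fu_counts: dict, clock_mhz: float) -> float:
    """Return the peak processing performance PP in MdOp/s.

    Assumes that every counted functional unit completes one data
    operation (dOp) per clock cycle, which is an upper bound.
    """
    ops_per_cycle = sum(fu_counts.values())
    return ops_per_cycle * clock_mhz


# Unit counts of the RD as given in the text (CORDIC units neglected).
rd_units = {"multiplier": 9, "divider": 1, "adder": 12}
print(peak_processing_performance(rd_units, 100))  # 22 dOp x 100 MHz
```

The same product of concurrent operations and clock frequency underlies the PP values quoted for the other architectures in this chapter.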
RaPiD In [17] (2004) a 4 antenna OFDM receiver has been implemented
on the RaPiD architecture [18] (1996). The receiver performs
the timing synchronization necessary to detect an OFDM frame and to
align the received symbol boundaries. The FFT on the four received
streams, required to demodulate the received data, is performed as well.
2 PP = 22 dOp × 100 MHz = 2’200 MdOp/s.
Figure 2.5: Reconfigurable datapath block diagram (source [16]).

Figure 2.6: RaPiD block diagram (source [18]).
The same receiver tasks are also mapped onto an ASIC, an FPGA,
and a DSP for comparing the achieved performance-over-cost. One
of the conclusions in [17] is that RAs fill the performance-over-cost
gap between ASICs and DSPs. It is found that there is a six-fold
increase in complexity compared to an ASIC implementation, while
the cost is reduced by a factor of six compared to an implementation
on conventional DSPs. Thus, compared to ASICs, lower NRE-cost
production is possible, while realizing a higher performance-over-cost
than on DSPs.
The RaPiD architecture is reported in Figure 2.6. Its datapath
consists of a heterogeneous, linear array of FUs. The FUs are ALUs,
multipliers (MULT), registers (REG), and storage units (RAM) that
are connected through a configurable interconnect network. The
number and type of FUs are scalable, and determined at design-time.
The interconnect is built by multiplexers that select the input to the
functional units, by tristate buffers for driving the output of functional
units onto the wanted bus, and by bus connectors that split long bus
segments into smaller ones, enabling concurrent utilization of two bus
segments belonging to the same long bus.
The configuration bits required to operate the RaPiD architecture
are divided into hard and soft configuration bits. Hard configuration
bits (as for example those controlling the bus inter-connectors) do
not change while an application is running, whereas soft configuration
bits (e.g., control bits for functional units) can change at each clock
cycle. The control part is sophisticated and implements instruction
compression through the addition of instruction repeat counters and
automatic loop generation. RaPiD is programmed by means of the
RaPiD-C language. According to the definition at the beginning of this
chapter, the RaPiD architecture falls into the category of SPA+RA.
The figures of merit reported in [19] refer to one RaPiD benchmark
cell. The datapath of one such RaPiD benchmark cell has a word-
width of 16 bit, and includes 1 multiplier and 3 ALUs. Its area is
5.07 mm2 in 0.5 µm CMOS technology at 100 MHz. The estimated
power consumption lies between 1.9 W, for performing a 16-tap FIR
filter, and 6.1 W peak power consumption. The receiver described
in [17] utilizes 16 RaPiD benchmark cells, thus resulting in an area of
approximately 16 · 5.07 mm2 = 81 mm2 in 0.5 µm CMOS technology,
at a power consumption of 16 · 1.9 W = 30.4 W. The corresponding
datapath processing performance is PP = 6’400 MdOp/s.3
MS1 and MaRS In [20] (2005), a Viterbi decoder and an FFT are
implemented on MaRS as relevant wireless communication kernels.
MaRS [21] is the scalable successor of MorphoSys MS1 [22]. Since
only marginal information regarding MaRS is available, the following
description concentrates on the MS1, shown in Figure 2.7. Its datapath
is composed of an 8 × 8 array of reconfigurable cells (RCs). Each RC
incorporates an ALU, a multiplier, and a register file. The RCs are
configured through 32 bit context words. Two RC-array configuration
modes exist: in row-mode, all RCs of the same row receive the same
context word; analogously, in column-mode, all RCs of the same column
are provided with the same context word. As a result, the operation is
either row-wise or column-wise SIMD-like. The MS1 is an RA.
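The two configuration modes can be illustrated with a small Python sketch that distributes eight context words over the 8 × 8 RC array. This is a behavioral illustration under simplifying assumptions: the context words are treated as opaque integers, their 32 bit encoding is not modeled, and the function name and interface are hypothetical.

```python
from typing import List

ROWS = 8
COLS = 8


def broadcast_contexts(context_words: List[int], mode: str) -> List[List[int]]:
    """Return the context word seen by each RC as grid[row][column].

    mode == "row":    all RCs of row r receive context_words[r].
    mode == "column": all RCs of column c receive context_words[c].
    """
    if len(context_words) != ROWS:
        raise ValueError("expected one context word per row/column")
    if mode == "row":
        return [[context_words[r]] * COLS for r in range(ROWS)]
    if mode == "column":
        return [list(context_words) for _ in range(ROWS)]
    raise ValueError("mode must be 'row' or 'column'")


if __name__ == "__main__":
    grid = broadcast_contexts(list(range(8)), "row")
    print(grid[3])  # every RC in row 3 sees the same context word
```

In either mode, only eight distinct context words are in effect for 64 RCs at a time, which is what makes the operation SIMD-like along one dimension of the array.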
In [22], the MS1 architecture is synthesized for a 0.35 µm CMOS
technology and occupies 180 mm2 (entire chip, including periphery).
One RC occupies an area of 1.5 mm2 and the achieved clock frequency
is estimated to be 100 MHz [22]. The peak data processing performance
is PP = 6’400 MdOp/s.4
3 PP = 16 tiles × 4 dOp/tile × 100 MHz = 6’400 MdOp/s.
4 PP = 64 RCs × 1 dOp/RC × 100 MHz = 6’400 MdOp/s.
Figure 2.7: MS1 block diagram (source [22]).

SODA The SODA architecture [23] (2006) is especially designed
for the SDR domain. In [23], W-CDMA and IEEE 802.11a are taken
as two wireless communication standards that rely on
completely different modulation techniques, and are implemented on
the SODA architecture. The achieved throughput is 2 Mbit/s for
W-CDMA, and 24 Mbit/s for 802.11a (including Viterbi decoding).
The SODA architecture (see Figure 2.8) is composed of an ARM
Cortex-M3 processor for top-level control tasks connected to a system
bus. The system bus, in turn, connects four processing elements
(PEs) and a global memory. The PEs (see Figure 2.9) are designed
to support data-level parallelism and data transfers, since these are
key elements of the analyzed communication algorithms. Each PE
contains a scalar unit and a 32-way SIMD unit. Their 16 bit datapaths
are interconnected through a shuffle network for the conversion from
scalar to vector operation, and vice-versa. The scalar unit incorporates
one ALU, whereas the 32-way SIMD unit incorporates 32 multipliers
and 32 adders. Each PE contains a scalar, as well as an SIMD
scratch-pad memory, and the corresponding register-file counterparts.
Programming SODA’s PEs is done in C-language with additional
optimization and mapping of processing kernels supported by a software
tool. The SODA architecture falls into the category of SPAs.
SODA occupies an area of 26.6 mm2 and runs at a clock frequency
of 400 MHz in 0.18 µm CMOS technology [23]. The power consumption
lies around 3 W. The datapath’s peak processing performance, for the
four PEs together, is considerable: PP = 51’200 MdOp/s.5
5 PP = 4 PEs × 32 dOp/PE × 400 MHz = 51’200 MdOp/s.

Figure 2.8: SODA multi-core architecture (source [23]).

Figure 2.9: One SODA PE (source [23]).

2.2.2 Relevant examples for industrial implementations
Montium – Recore Systems Recore Systems, founded in 2005 in
Enschede, The Netherlands, sells the Montium processor as intellectual
property (IP). The Montium processor has its origins at the University
of Twente, The Netherlands.
In [24] (2004), the implementation of an OFDM receiver on the
Montium reconfigurable architecture is described. The receiver is
implemented on three Montium tiles. The first tile performs
frequency offset correction, the second computes a 64-point FFT,
and the last performs channel equalization, phase offset
correction, and the subsequent demapping. The proposed platform
achieves data rates up to 54 Mbit/s.
One Montium tile [25, 26] is depicted in Figure 2.10. The architec-
ture is given by a linear array of five ALUs connected to ten memory
units. All units are connected to each other by a dense bus network.
The ALUs embody one multiplier, three adders, and may be extended
at design-time with user defined functionality. Each ALU stores four
different configurations in corresponding registers. These configuration
registers are addressed by the ALU decoder to control the ALU’s
operation. A similar concept is employed for controlling the memory,
the bus network, and the ALU’s input registers, thus resulting in a
hierarchical control system with the main program sequencer selecting
the configurations of the four sub-systems. The Montium processor is
programmed by means of the MontiumC language. It falls into the
SPA+RA category.
The figures of merit for one Montium tile are obtained from [27].
One Montium tile occupies an area of 2 mm2 in 0.13 µm CMOS tech-
nology. The achievable clock frequency is 100 MHz, which is rather low
when compared to the gate delays of that technology. The power effi-
ciency for one tile is estimated to be 0.5 mW/MHz and thus the power
consumption is derived as 50 mW [27]. The peak datapath performance
of one Montium tile is determined by the concurrent operation of the
five ALUs, resulting in 500 MdOp/s.6 Finally, the total area for
the above-described OFDM receiver, which employs three Montium
6 PP = 5 ALUs × 1 dOp/ALU × 100 MHz = 500 MdOp/s.
Figure 2.10: Montium tile block diagram (source [27]).
tiles, is 6 mm2, the power consumption scales to 150 mW, and the data
processing performance to PP = 1’500 MdOp/s.
EVP – NXP NXP, formerly Philips, acquired the company
SystemOnIC AG, Dresden, Germany, in early 2003. SystemOnIC developed
DSP1 [28], the predecessor of EVP [5]. DSP1, in turn, leans upon the
M3-DSP architecture [29] developed at TU Dresden, Germany.
Reference [5] analyzes different wireless baseband processing kernels
(including OFDM baseband processing) and derives the architectural
requirements that an SPA must meet to support baseband processing
efficiently. The conclusions are that, although SIMD operation can be
employed heavily, the support of scalar operations is still required. The
common wordwidth in the evaluated algorithms is 16 bit, with a few
exceptions requiring 8 bit or 32 bit precision. The embedded vector
processor (EVP), stylized in Figure 2.11, meets these requirements.
Figure 2.11: EVP block diagram (source [5]).
The EVP’s datapath includes one scalar unit, as well as a set of 16-
way SIMD units. The datapath is controlled by very long instruction
words (VLIWs). The data memory feeds the 16 vector registers from
where the execution units retrieve their operands. The programming
of the EVP is performed in EVP-C, an extension to the C-language
for supporting the SIMD units. The EVP falls into the category of
SPAs.
The EVP described in [5] is synthesized for a 90 nm CMOS technology.
It runs at a frequency of 300 MHz and occupies an area of 2 mm2.
The power efficiency is 1 mW/MHz, leading to a power consumption
of 300 mW. The EVP’s processing performance is derived by observing
that one multiplication and one ALU operation can be executed in
parallel on the 16-way SIMD datapath, thus resulting in a peak data
processing performance of PP = 9’600 MdOp/s.7
7 PP = 2 units × 16 dOp/unit × 300 MHz = 9’600 MdOp/s.

SB3010 – Sandbridge Sandbridge Technologies Inc. (Tarrytown
NY, USA) was founded in 2001, targeting the domain of baseband
processors for 3G wireless phones. No specific academic project is
behind the company, but rather different personalities of the digital
signal processor scene, especially from IBM (eLite DSP project [30]).
In [31] a WiMAX receiver, which relies on OFDM, is implemented
on the SB3010 platform. The receiver performs timing synchronization,
frequency offset compensation, channel equalization, and demodulates
BPSK symbols via 256-point FFTs. Viterbi decoding is also imple-
mented on the SB3010 platform. The Sandblaster architecture [32, 33]
(see Figure 2.12) encapsulates four DSP cores, which are controlled by
a general purpose processor (ARM9). The scheduling of tasks among
these four cores is dynamic. Each DSP core is designed to support
4-way SIMD instructions, scalar and general-purpose instructions, as
well as memory address generation. The I-decode unit distributes the
instructions to these three parts. A single memory delivers the data
for the SIMD and the scalar parts. Eight banks guarantee enough
bandwidth to keep the SIMD register file filled. The SB3010 platform
is programmed in C-language by means of a powerful software
development kit. Sandblaster falls into the category of SPAs.
The SB3010 chip is fabricated in 90 nm CMOS technology and
each DSP core runs at 600 MHz [31]. The power consumption is
reported as 150 mW in [33]. No area figures are disclosed. The total
data processing performance delivered by the four DSP cores is
PP = 9’600 MdOp/s.8
8 PP = 4 DSP cores × 4 dOp/DSP core × 600 MHz = 9’600 MdOp/s.

Figure 2.12: SB3010 architecture (source [32, 33]). Top: entire platform.
Bottom: one DSP slice.
BBP1 and BBP2 – Coresonic Coresonic is a start-up company
founded in 2004 in Linköping, Sweden. The company has its roots in
the BBP1 processor research project at Linköping University, Sweden.
BBP1 [34] is a multi-standard baseband processor mainly designed
for WLAN standards (e.g., IEEE 802.11a/b/g). The attained per-
formance is, for instance, sufficient to sustain 54 Mbit/s in the IEEE
802.11a standard (OFDM, no Viterbi decoding). The main idea leading
to the BBP1 architecture reported in Figure 2.13, is that many wireless
communication standards employ the same set of functions (e.g., filters,
FFTs, interleaving, etc.), configured with standard-specific parameters
(e.g., filter coefficients, number of FFT points, permutations used for
interleaving, etc.). The resulting architecture contains a baseband
processor-core that is connected to a set of specialized, parameterizable
data processing blocks, and to data memories (DM). The processor
core controls the specialized units, and it is equipped with an ALU and
a complex-valued MAC unit. Vector instructions are used to schedule
the processing of data blocks on the specialized blocks. Programming
is eased by an assembler and an instruction set simulator. The BBP1
is classified as an SPA and its accelerators as RAs.
The figures of merit for a BBP1 realization in 0.18 µm CMOS
technology are collected from [34]. The BBP1 runs up to a frequency of
240 MHz and occupies an area of 2.9 mm2. The power consumption is
126 mW when operating at 160 MHz (which is the frequency required
for the 802.11a receiver operation). The processing performance is hard
to estimate because of the heterogeneous granularity of the datapath.
Therefore, no performance figures are extrapolated.
The BBP2 processor, described in [35, 36] (see Figure 2.14), is the
successor of the BBP1. It is designed for multi-standard baseband pro-
cessing and its datapath includes two 4-way SIMD-units that operate
on 16 bit complex-valued vectors. The first unit is a complex-valued
ALU, and the second a complex-valued MAC. A simple controller unit
steers the two SIMD units, through the corresponding vector control
units, and performs the program control flow. The controller supports
up to three contexts, for three different tasks. Four memory banks for
complex-valued data and one bank for real-valued data compose the
data memory of the BBP2 processor. Each of the four complex-valued
data banks contains four memories that are accessed concurrently.
As a result, enough bandwidth for the operation on one of the two
Figure 2.13: BBP1 block diagram (source [34]).
Figure 2.14: BBP2 block diagram (source [35]).
4-way SIMD units is delivered. The BBP2 is programmed in assembler
language and debugged with a bit- and cycle-true C-simulator. The
BBP2 processor is classified as an SPA.
The implementation in 0.13 µm CMOS technology runs at a clock
frequency of 240 MHz and occupies an area of 11 mm2 [35]. The data
processing performance is determined by the two SIMD units that
can execute 8 complex-valued operations per clock cycle. Accordingly,
the real-valued data processing performance becomes PP =
5’760 MdOp/s.9
9 PP = 2 units × 4 CdOp/unit × 240 MHz = 24 RdOp/unit × 240 MHz =
5’760 MdOp/s.
CSP2xxx Series – Silicon Hive Silicon Hive (Eindhoven, The
Netherlands) was spun out of Philips Research in 2007.10 It bases its
CSP2xxx processor series upon the AVISPA processor [37]. Three
AVISPA architectures are reported in literature: AVISPA, AVISPA+,
and AVISPA-CH. AVISPA and AVISPA+ are designed for OFDM
baseband processing [38]. AVISPA-CH [39], the successor of AVISPA
and AVISPA+, is designed for the multi-standard digital television
baseband processing and incorporates complex-valued FUs.
The AVISPA architecture [37] is shown in Figure 2.15. The top level
architecture contains a control processing and storage element (PSE)
and a mesh of four PSEs for data processing. Each data processing PSE
instantiates different FUs, namely: a 16 bit ALU, a 16 bit multiplier, a
40 bit accumulator, a 40 bit barrel shifter, an address generation unit,
and two 16 bit load/store units connected to a local dual-port data
memory. Small register files (RFs), connected to the FUs through
a local interconnect network, enable temporary data storage. The
AVISPA processor is programmed by means of different tools that
allow code to be written in a subset of the C-language and that
extract the instruction-level parallelism. The AVISPA architecture
is classified as an SPA.
The AVISPA architecture is realized in 0.13 µm CMOS technology,
running at 150 MHz and consuming an area of 6.5 mm2 [38]. The
processor consumes around 127 mW and the peak data processing
performance is PP = 1’200 MdOp/s.11
2.3 Flexible Architecture for MIMO-OFDM Baseband Processing
Today, to the best of the author’s knowledge, only one MIMO-OFDM
baseband processing implementation case study exists that is comparable
to the one presented in this thesis. The corresponding FA
is described here and revisited in Chapter 5, when
presenting the implementation results of this thesis.
10 http://www.siliconhive.com
11 PP = 4 PSE × 2 dOp/PSE × 150 MHz = 1’200 MdOp/s.
Figure 2.15: AVISPA block diagram (source [37]).

ADRES – IMEC The ADRES processor [40] was developed at the
IMEC research center in Leuven, Belgium. The implementation of the
complete 2 × 2 MIMO-OFDM baseband processing on the ADRES
processor is presented in [12, 41] (only Viterbi decoding is performed
on a dedicated unit). The presented receiver can process data rates
up to 108 Mbit/s.
The generic ADRES architecture template [40] and the realization
for the baseband receiver in [12] are depicted in Figure 2.16. The
ADRES core is composed of one VLIW part and one coarse-grained
array (CGA) part. These two parts operate in mutual exclusion. For
the realization in [12], the CGA part consists of a 4 × 4 array of 4-way
SIMD 16 bit FUs and the VLIW part comprises 3 FUs. The VLIW
and CGA parts exchange data over a shared register-file. A four-bank
scratch-pad memory completes the storage capabilities. The ADRES
core is programmed in C-language; the mapping to the VLIW and
CGA parts is done by the DRESC compiler. It is interesting to note
that the CGA architecture and the interconnect are based on the MS1
architecture analyzed in Section 2.2.1 [42].
The ADRES processor is fabricated in 90 nm CMOS technology.
It occupies an area of 5.79 mm2, runs at a frequency of 400 MHz,
and consumes around 220 mW [12]. The data processing performance
is determined by the 4-way SIMD CGA, which can perform PP =
25’600 MdOp/s.12
12 PP = 16 FUs × 4 dOp/FU × 400 MHz = 25’600 MdOp/s.
Figure 2.16: ADRES block diagram (source: [12, 40]). Top: Generic
architecture template. Middle: 4 × 4 CGA realization for the 2 × 2
MIMO-OFDM receiver. Bottom: FU template.
2.4 Summary and Discussion

Summary This chapter reviewed selected flexible architectures (FAs)
employed as SDR platforms, with a focus on the OFDM
baseband processing domain. The review provided insight
into the structure of these architectures, which range from RAs (e.g., RD,
MS1), over SPA+RA (e.g., RaPiD, BBP1), to pure SPAs (e.g., SODA,
SB3010). For a concise overview, the key figures of merit of these FAs
are reported in Table 2.1. In Table 2.2, these figures are normalized
to a 0.18 µm CMOS technology, according to the technology scaling
described in Section 2.1.
In the following, the data of Table 2.2 is further elaborated to
highlight different design aspects associated with the reviewed FAs (where
the corresponding figures of merit are available).
Discussion Figure 2.17 depicts the data processing performance per
area attained by the various FAs. In these terms, the RD is clearly
the most efficient architecture. Thanks to its datapath, which is
tailored to support two specific and very similar tasks, it reaches
a high processing performance at a small silicon area expense. The
SODA and RaPiD architectures follow next in the ranking. Both
provide slightly more than 1’500 MdOp/s/mm2, which is already a
factor of three less than that of the RD. The DSP1, EVP, and ADRES
architectures deliver more than 500 MdOp/s/mm2, while MS1, BBP2,
Montium, and AVISPA deliver less than that.
Figure 2.18 expands the view to the energy efficiency. The most
processing- and energy-efficient architectures reside in the upper-right
corner of the figure. The energy efficiencies of the presented FAs are
widely spread, and lie between 20 MdOp/s/mW (ADRES, SODA) and
2 MdOp/s/mW (RaPiD) – one order of magnitude apart. Accordingly,
the power densities attained by the various architectures lie in the
strip between 10 mW/mm2 and 1 W/mm2.
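The three derived figures of merit used in this discussion (performance per area, energy efficiency, and power density) are simple ratios of area, power, and PP. The following Python sketch computes them for three of the reviewed FAs. It is a sketch under stated assumptions: the class and method names are hypothetical, and the input values are the original-technology figures quoted in Sections 2.2 and 2.3, not the values normalized to 0.18 µm CMOS, so the results differ from those plotted in Figures 2.17 and 2.18.

```python
from dataclasses import dataclass


@dataclass
class FlexibleArchitecture:
    name: str
    area_mm2: float   # silicon area [mm2]
    power_mw: float   # power consumption [mW]
    pp_mdops: float   # peak processing performance [MdOp/s]

    def performance_per_area(self) -> float:
        """Processing performance per area [MdOp/s/mm2]."""
        return self.pp_mdops / self.area_mm2

    def energy_efficiency(self) -> float:
        """Processing performance per power [MdOp/s/mW]."""
        return self.pp_mdops / self.power_mw

    def power_density(self) -> float:
        """Power per area [mW/mm2]."""
        return self.power_mw / self.area_mm2


# Original-technology values as quoted in the text (not normalized).
fas = [
    FlexibleArchitecture("SODA", 26.6, 3000.0, 51200.0),
    FlexibleArchitecture("EVP", 2.0, 300.0, 9600.0),
    FlexibleArchitecture("ADRES", 5.79, 220.0, 25600.0),
]

for fa in fas:
    print(f"{fa.name}: {fa.performance_per_area():.0f} MdOp/s/mm2, "
          f"{fa.energy_efficiency():.1f} MdOp/s/mW, "
          f"{fa.power_density():.0f} mW/mm2")
```

A fair comparison across technologies requires applying the scaling of Section 2.1 to area, frequency, and power before forming these ratios.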
Nonetheless, when considering a real-world implementation, the
absolute figures of merit are important. In this perspective, Figure 2.19
illustrates the processing performance vs. the power consumption of the
considered FAs; in addition, the marker size is proportional
to the FA’s area. The power consumption is a crucial aspect for
portable mobile devices since it determines the battery endurance.
Today, typical batteries of mobile devices have a capacity of 3600 mWh.
At a power consumption of, for instance, 1 W, an endurance of 3.6 h can
be attained from the battery. Thus, for reasonable real-life applications,
the power consumption of mobile devices should remain well below
1 W (cf. vertical line in Figure 2.19). In the reference 0.18 µm CMOS
technology, both the SODA and RaPiD architectures violate this
constraint.
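The endurance estimate is the battery capacity divided by the power drawn. A minimal Python sketch of this arithmetic follows, with a hypothetical function name and assuming a constant power draw and an ideal battery:

```python
def battery_endurance_hours(capacity_mwh: float, power_mw: float) -> float:
    """Return the endurance in hours for a constant power draw."""
    if power_mw <= 0:
        raise ValueError("power must be positive")
    return capacity_mwh / power_mw


# 3600 mWh battery at the 1 W limit discussed in the text.
print(battery_endurance_hours(3600, 1000))
```

Halving the power consumption doubles the endurance, which is why the 1 W line in Figure 2.19 is treated as an upper limit rather than a target.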
On the other hand, the processing performance required
to sustain the OFDM baseband processing is significant. The
horizontal lines at 680 MdOp/s, 1’530 MdOp/s, 3’650 MdOp/s, and
8’580 MdOp/s indicate the estimated data-processing performance
required for the SISO-OFDM, 2 × 2, 3 × 3, and 4 × 4 MIMO-OFDM
WLAN baseband processing, as described later in Chapter 3 (without
considering Viterbi decoding). Apparently, all reported FAs can
support single-antenna OFDM baseband processing. AVISPA-CH and
DSP1 possibly support 2 × 2, and EVP 3 × 3 MIMO-OFDM operation.
ADRES possibly supports up to 4 × 4 MIMO-OFDM operation.
However, it must be stressed that the data processing performance
attributed to the FAs and the algorithmic processing performance
requirements are qualitative measures. The processing performance
requirements assume that the underlying FA is able to compute exactly
the required operation at the correct point in time. The data
processing performance, instead, assumes that all FUs inside the FA’s
datapath are fully exploited, which is rather difficult to achieve and
depends on how well the FA’s datapath matches the application domain.13
Despite these two assumptions, the big picture presented in
Figure 2.19 remains valid and is especially helpful for comparing
the various FAs with each other.
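The comparison against the horizontal lines of Figure 2.19 amounts to a threshold lookup. The Python sketch below encodes the four requirement levels quoted above (680, 1’530, 3’650, and 8’580 MdOp/s) and returns the largest antenna configuration whose requirement a given peak PP meets. The function name is hypothetical and the PP values in the usage example are arbitrary illustrative inputs, not figures taken from Table 2.2. As stressed in the text, the result is an optimistic indication, since it assumes full utilization of all FUs.

```python
from typing import Optional

# Estimated requirements in MdOp/s (without Viterbi decoding), ascending.
REQUIRED_MDOPS = [
    ("SISO", 680),
    ("2x2", 1530),
    ("3x3", 3650),
    ("4x4", 8580),
]


def largest_supported_config(pp_mdops: float) -> Optional[str]:
    """Return the largest configuration whose requirement is met, or None."""
    supported = None
    for name, required in REQUIRED_MDOPS:
        if pp_mdops >= required:
            supported = name
    return supported


if __name__ == "__main__":
    for pp in (500, 2000, 5000, 9000):  # hypothetical peak PP values
        print(pp, largest_supported_config(pp))
```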
To conclude, it can be stated that a vast number of FAs exists and
that an increasing number of publications describe the implementation
of (MIMO-)OFDM baseband processing related tasks onto these FAs.
As this vast number of architectures suggests, there is no unique and
optimal FA able to support MIMO-OFDM baseband processing yet,
but rather there are many ways aiming at the same goal. As the
13 The discrepancy between estimated algorithmic data-processing performance
and the data-processing performance delivered by the FA can be seen as an indicator
of how well the FA matches the algorithm. The lower this discrepancy, the better
the FA matches the algorithm since the number of overhead instructions is reduced.
complexity of the underlying architectures increases, the importance of
the support by a powerful programming tool grows. Indeed, many of
the reviewed FAs go in this direction (e.g., EVP, Montium, ADRES)
and provide various – more or less sophisticated – tool chains. Finally,
among the related work, only one publication addresses the complete
implementation of a 2 × 2 MIMO-OFDM receiver, namely on the
coarse-grained ADRES architecture [41].
Further reading The review in this chapter presented only a set
of selected FAs employed in the OFDM baseband processing domain
and strictly related to this thesis. The following list summarizes, in
more general terms, the surveys and descriptions related to FAs:
• In 1999, Enzler explores the status of the early reconfigurable
computing research [43]. After an analysis and description of
the reconfigurable computing paradigm a list of over sixty (!)
different FAs is given. In his thesis [44] many of the issues faced
by the early FAs are described.
• In 2001, Hartenstein gives a survey of FAs [45]. Herein, the archi-
tectures reported are: DP-FPGA, KressArray, Colt, MATRIX,
RAW, GARP, REMARC, MorphoSys, CHESS, DReAM, CS2000
family, MECA family, CALISTO, FIPSOC, RaPiD, PipeRench,
PADDI and PADDI-2, and Pleiades. The author concludes that
the exploding design costs of dedicated VLSI solutions and the
shrinking product life-cycles strengthen the demand for FAs. The
challenge lies in the development of software tools that effectively
support designers and hence reduce time-to-market.
• In 2006, Amano [46] lists recent FAs that have gained industrial
maturity: CS2112 (Chameleon), DAPDNA-2 (IPFlex), DRP-
1 (NEC Electronics), FE-FA (Hitachi), XPP-64 (PACT), D-
Fabrix (Exilent), Kilokore KC256 (Rapport), ADRES (IMEC),
S5-engine (Stretch), Cluster machine (Fujitsu). The author
concludes that many FAs have gained industrial attention and
that their structure greatly varies according to the application
domain. It is argued that in the near future the structure will be
generated automatically to match the target application domain.
• In 2008, IEEE Micro dedicates a complete issue to hardware
accelerators [47].
Table 2.1: Figures of merit for the reviewed FAs, in the original technology.

Flexible                 CMOS   Area    Freq.   Power       Proc. Perf.
Architecture (a)         [µm]   [mm²]   [MHz]   [W]         [MdOp/s]
RD [48], 2003            0.35   2.86    100     n.a.        2’200
RaPiD [18, 17], 1996     0.5    81      100     30.4        6’400
MS1 [22], 1999           0.35   180     100     n.a.        6’400
SODA [23], 2006          0.18   26.6    400     3           51’200
Montium [49, 24], 2003   0.13   6       100     0.150       1’500
DSP1 [28, 5], 2002       0.13   1.5     160     0.128       2’560
EVP [5], 2005            0.09   2       300     0.300       9’600
SB3010 [32, 31], 2002    0.09   n.a.    600     0.150       9’600
BBP1 [34], 2005          0.18   2.9     240     0.189 (b)   n.a.
BBP2 [35], 2007          0.13   11      240     n.a.        5’760
AVISPA [37], 2003        0.13   6.5     150     0.127       1’200
ADRES [40, 12], 2003     0.09   5.79    400     0.22        25’600

(a) The first reference indicates the description of the architecture, while the
second refers to the SDR/OFDM baseband processing related work (if different
from the first). The year refers to the first publication of the architecture.
(b) Linearly scaled from 126 mW @ 160 MHz to 189 mW @ 240 MHz.
Table 2.2: Figures of merit for the reviewed FAs, scaled to 0.18 µm CMOS technology.

Flexible        Scaling         Area     Freq.   Power   Proc. Perf.
Architecture    αD              [mm²]    [MHz]   [W]     [MdOp/s]
RD              1.94    1.06    0.76     194     n.a.    4’278
RaPiD           2.78    1.52    10.5     278     9       17’778
MS1             1.94    1.06    47.61    194     n.a.    12’444
SODA            1       1       26.6     400     3       51’200
Montium         0.72    1.08    11.5     72      0.338   1’083
DSP1            0.72    1.08    2.88     116     0.288   1’849
EVP             0.5     0.9     8        150     0.972   4’800
SB3010          0.5     0.9     n.a.     300     0.486   4’800
BBP1            1       1       2.9      240     0.189   n.a.
BBP2            0.72    1.08    21.09    173     n.a.    4’160
AVISPA          0.72    1.08    12.46    108     0.286   867
ADRES           0.5     0.9     23.16    200     0.713   12’800
Figure 2.17: Performance/area, normalized to 0.18 µm CMOS technology.
(Bar chart; vertical axis: performance over area [MdOp/s/mm²], with
reference levels at 500 and 1’500 MdOp/s/mm².)
Figure 2.18: Performance/area vs. energy efficiency, normalized to 0.18 µm CMOS technology.
(Log-log scatter plot; horizontal axis: energy efficiency [MdOp/s/mW];
vertical axis: performance over area [MdOp/s/mm²]; plotted FAs: RaPiD,
Montium, SODA, DSP1, EVP, AVISPA, AVISPA+, AVISPA−CH, ADRES.)
Figure 2.19: Processing performance vs. power consumption, normalized to 0.18 µm CMOS technology.
The marker size is proportional to the corresponding circuit area.
(Log-log scatter plot; horizontal axis: power consumption [mW];
vertical axis: processing performance [MdOp/s], with marked levels at
680, 1’530, 3’650, and 8’580 MdOp/s; plotted FAs: RaPiD, Montium, SODA,
DSP1, EVP, AVISPA, AVISPA+, AVISPA−CH, ADRES.)
Chapter 3
Algorithms and Computational Complexity
The current chapter describes the system model of the MIMO-OFDM
transceiver considered in this thesis. It details and evaluates the
corresponding receiver algorithms with the aim of eventually finding
a suitable candidate that fits the limited processing resources of FAs,
while delivering an acceptable receiver signal quality. To this end,
the mathematical relation used to model the MIMO-OFDM system is
presented first. Next, well-known MIMO detectors are reviewed and
evaluated with special attention to their computational complexity
and to their impact on the receive signal quality. The evaluation
yields the decision criteria that lead to the selection of linear minimum
mean-squared error (MMSE) detection as the best candidate.
Many methods exist to implement linear MMSE detection. The
difference among these methods does not lie in the achieved result or
quality, but in the way the result is attained.1 Again, in order to find
the best trade-off between computational complexity and achievable
receive quality, different candidates are assessed and two promising
methods are identified. With the appropriate linear MMSE detector
at hand, the algorithms for the practical MIMO-OFDM receiver
considered in this thesis are detailed. The summary with the
computational complexity of the presented MIMO receiver and the
subsequent discussion of these results conclude the chapter.
1 The achieved signal quality is dictated by the type of MIMO detector (linear
MMSE in this case). When the computations are performed in infinite precision, all
methods using the same type of MIMO detector lead to the same result. However,
when considering finite precision computations the picture changes. Rounding
differences among the methods used to implement the MIMO detector may lead to
a better or worse received signal quality. For this reason, Section 3.4, describing
linear MMSE detection, also considers finite precision effects by implementing the
methods in fixed-point.
Notation $x \in \mathbb{C}^{a}$ is a complex-valued vector with $a$ entries, and $x(n)$
is the nth element of vector $x$. $X \in \mathbb{C}^{a \times b}$ is a complex-valued matrix
with $a$ rows and $b$ columns. The superscripts $(.)^T$ and $(.)^H$ denote
the transpose and conjugate transpose, respectively. The notation
$z \sim \mathcal{CN}(u, R)$ indicates that the random vector $z$ is characterized
by a circularly-symmetric complex-valued Gaussian distribution with
mean $u$ and covariance matrix $R$.
3.1 MIMO-OFDM System Model

Figure 3.1 depicts a generic $M_R \times M_T$ MIMO-OFDM transceiver. In
this thesis, the transceiver operates in spatial multiplexing mode, or
space-division multiplex mode. It employs $M_T$ transmit antennas and
$M_R \geq M_T$ receive antennas. The transmission is frame based, and
one OFDM-frame is composed of a preamble and the data payload
consisting of one or more OFDM-symbols. The preamble serves the
receiver for estimating physical system impairments, whereas the
OFDM-symbols carry the actual data to be transmitted.
Accordingly, the operation of the receiver can be divided into three
main phases, depending on whether it is processing a frame or not;
and, if it is processing a frame, depending on which part of the frame
is being processed. During the frame-start detection phase the receiver
is not processing a frame, but it is analyzing the received samples
to discover a frame start. Then, the preamble is processed during
the preprocessing phase, while the payload is processed during the
data processing phase.

Figure 3.1: MR × MT MIMO-OFDM transceiver.
(Block diagram with the parts Transmitter, Channel, and Receiver.
Blocks: serial-to-parallel conversion, Mapping, OFDM Mod., OFDM Demod.,
MIMO processing with MIMO detection and bit-metric computation,
parallel-to-serial conversion, FEC decoding. Signals from TxData to
RxData: b, x, sk, Hk, nk (noise), yk, ŝk, x̂, b̂.)

Figure 3.2: M-ary QAM constellation points with $M = 2^Q$, and
corresponding Gray-mapped binary labels. (For 16-QAM and 64-QAM,
only four labels are shown for clarity.)
(Panels: (a) BPSK, M = 2, Q = 1. (b) QPSK, M = 4, Q = 2.
(c) 16-QAM, M = 16, Q = 4. (d) 64-QAM, M = 64, Q = 6.
Axes: Re and Im.)
The transceiver’s building blocks illustrated in Figure 3.1 are now
described one after the other, from the transmitter to the receiver.
In the description, the subscript $k = 1, \ldots, N$ indicates the OFDM-
subchannel a variable belongs to. The superscript $n = 1, \ldots, M_T$
associates a variable to one of the $M_T$ lower-rate datastreams. $\mathcal{A}$ is
an alphabet containing M-ary QAM constellation points that have
modulation order $M = 2^Q$, mean zero, and average energy $1/M_T$. $Q$
is the number of bits encoded by one constellation point. Figure 3.2
depicts the alphabets $\mathcal{A}$ for $M_T = 1$, and $M$ = 2, 4, 16 and 64.
Forward Error Correction The transmitter starts by convolutionally
encoding the incoming binary datastream (TxData, with bits b).
Encoding increases the transmission’s robustness by adding redundancy
to the incoming bitstream. It is performed by the forward error
correction (FEC) block with a coding rate R, meaning that each bit of
the FEC’s output encodes R bits of its input (typical coding rates are
R = 1/2, 2/3, 3/4, and 5/6). Then, in spatial multiplexing mode, the
encoded binary datastream is split into $M_T$ lower-rate datastreams.
OFDM Modulation For obtaining OFDM modulated data in a
system with N OFDM-subchannels, the operations described in the
following are performed, independently, on each of the $M_T$ lower-rate
transmit streams (typically 2 to 4 streams).
1. The encoded lower-rate bitstream is partitioned into groups of
Q bits represented by the binary labels
$x_k^{(n)} = [\bar{b}^{(n)}_{(k-1)Q}, \bar{b}^{(n)}_{(k-1)Q+1}, \ldots, \bar{b}^{(n)}_{kQ-1}]$.
These binary labels $x_k^{(n)}$ are Gray-mapped, by the Mapping-
block in Figure 3.1, into the complex-valued constellation points
$s_k^{(n)} \in \mathcal{A}$ according to $G : x_k^{(n)} \mapsto s_k^{(n)}$.
2. Next, by using an N-point inverse Fourier transform, groups of
N constellation points are mapped into time-domain OFDM-
symbols. Each time-domain OFDM-symbol is prepended by
a cyclic extension of itself. This extension is named guard
interval (GI) and adds robustness against interference caused by
multipath propagation.
3. Finally, the MIMO-OFDM preamble is inserted in front of the
resulting $M_T$ time-domain OFDM-symbols, and the complete
OFDM-frame is transmitted – all $M_T$ streams concurrently and
in the same frequency band – over the wireless baseband (BB)
channel.
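The first two modulation steps can be illustrated with a minimal Python sketch for a single transmit stream. It is only an illustration under stated assumptions: QPSK with unit average energy (i.e., MT = 1), a naive inverse DFT instead of an optimized IFFT, and hypothetical helper names (`map_bits`, `idft`, `add_guard_interval`, `ofdm_modulate`) and sizes (8 subchannels, GI of 2 samples) that do not stem from the thesis. Preamble insertion (step 3) is omitted.

```python
import cmath

# Gray-mapped QPSK alphabet (Q = 2) with unit average energy, i.e. M_T = 1.
# The label-to-point assignment is an illustrative assumption.
QPSK = {
    (0, 0): complex(-1, -1) / 2 ** 0.5,
    (0, 1): complex(-1, +1) / 2 ** 0.5,
    (1, 0): complex(+1, -1) / 2 ** 0.5,
    (1, 1): complex(+1, +1) / 2 ** 0.5,
}


def map_bits(bits, q=2):
    """Step 1: partition the bitstream into Q-bit labels and map them to points."""
    return [QPSK[tuple(bits[i:i + q])] for i in range(0, len(bits), q)]


def idft(points):
    """Step 2: naive N-point inverse DFT (a real design would use an IFFT)."""
    n = len(points)
    return [
        sum(points[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n
        for t in range(n)
    ]


def add_guard_interval(symbol, gi_len):
    """Step 2: prepend the last gi_len time-domain samples (cyclic extension)."""
    return symbol[len(symbol) - gi_len:] + symbol


def ofdm_modulate(bits, n_sub, gi_len):
    """Map one group of n_sub * Q bits to a time-domain OFDM-symbol with GI."""
    points = map_bits(bits)
    if len(points) != n_sub:
        raise ValueError("expected exactly n_sub constellation points")
    return add_guard_interval(idft(points), gi_len)


# Hypothetical example: 8 subchannels, guard interval of 2 samples.
bits = [0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0]
tx_symbol = ofdm_modulate(bits, n_sub=8, gi_len=2)
```

The guard interval is visible in the result: the first two samples of `tx_symbol` repeat its last two samples.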
OFDM Demodulation The receiver performs the inverse process
of the transmitter. After removal of the guard interval in the time-
domain, the $M_R$ received datastreams are individually OFDM
demodulated using Fourier transforms. This transformation (back) into the
frequency domain, and the successive stacking of the demodulated
OFDM-symbols, yields the $M_R$-dimensional received vector
$y_k = H_k s_k + n_k$, (3.1)
where the corresponding OFDM-subchannel is indicated by the index
k. The transmitted frequency-domain vector-symbols
$s_k = [s_k^{(1)}, s_k^{(2)}, \ldots, s_k^{(M_T)}]^T$
are obtained by stacking the $M_T$ constellation points of subchannel
k into one vector. Each vector-symbol $s_k$ then conveys $M_T \cdot Q \cdot R$
bits of the binary datastream TxData. In (3.1), the noise experienced
at the receiver is modeled by the additive noise $n_k$ whose entries
are distributed according to $\mathcal{CN}(0, \sigma^2)$. The matrix $H_k \in \mathbb{C}^{M_R \times M_T}$
describes the gain and phase of the $M_R \times M_T$ wireless BB channel.
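As a small illustration of the input-output relation (3.1), the following Python sketch draws one channel realization and computes the received vector for a single subchannel. The helper names (`cn`, `received_vector`), the 2 × 2 dimensions, the i.i.d. CN(0, 1) channel entries, and the noise variance are assumptions chosen for the example only.

```python
import random


def cn(variance, rng):
    """Draw one circularly-symmetric complex Gaussian sample CN(0, variance)."""
    std = (variance / 2) ** 0.5
    return complex(rng.gauss(0, std), rng.gauss(0, std))


def received_vector(h, s, sigma2, rng):
    """Compute y = H s + n for one subchannel; h is a list of M_R rows."""
    return [
        sum(h_ij * s_j for h_ij, s_j in zip(row, s)) + cn(sigma2, rng)
        for row in h
    ]


rng = random.Random(0)            # fixed seed, for reproducibility only
m_r, m_t = 2, 2
h = [[cn(1.0, rng) for _ in range(m_t)] for _ in range(m_r)]
s = [complex(1, 1) / 2, complex(-1, 1) / 2]   # QPSK points with energy 1/M_T
y = received_vector(h, s, sigma2=0.01, rng=rng)
```

Setting `sigma2` to zero removes the noise term, so that `y` reduces to the product of `h` and `s`.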
MIMO Detection The task of the receiver is to recover the
transmitted binary datastream by observing the received datastream. More
precisely, the $M_T \cdot Q \cdot R$ data bits conveyed by the vector-symbol $s_k$
have to be recovered by observing the corresponding received vector
$y_k$. To this end, the MIMO detector processes the received vector $y_k$
and outputs one row vector
$\hat{x}_k^{(n)} = [L^{(n)}_{(k-1)Q}, L^{(n)}_{(k-1)Q+1}, \ldots, L^{(n)}_{kQ-1}]$
for each of the $M_T$ entries of $s_k$ ($n = 1, 2, \ldots, M_T$). The row vector
$\hat{x}_k^{(n)}$ contains Q entries. Each entry $L_i^{(n)}$ [$i = (k-1)Q, (k-1)Q+1, \ldots, kQ-1$]
delivers decision information employed to detect the
corresponding bit of the binary label of the constellation point $s_k^{(n)}$.
MIMO detectors (and detectors in general) can be classified into two
categories according to the type of decision information they deliver.
Hard detectors (or hard-out detectors) output only two values, usually
−1 and +1, according to whether the bit that has to be detected
is estimated to be 0 or 1, respectively. Soft detectors (or soft-out
detectors), instead, deliver an entire range of values that usually lie in
the interval [−1, +1]. The sign indicates whether the bit is estimated
to be 0 or 1, and larger absolute values indicate more reliable estimates
than smaller ones. Soft detection is superior to hard detection in
the receiver’s signal quality. On the other hand, depending on the
MIMO detection algorithm, the generation of soft information is not
always possible, or it may be associated with an overly increased
computational complexity.
Another distinction is made according to whether the MIMO de-
tector performs coherent or non-coherent detection. In this thesis,
receivers that perform coherent detection are considered: for coherent
detection, the receiver needs to take the effect of the wireless channel
into account for correct operation – instead non-coherent detection is
performed without channel knowledge. The coherent receiver has to
estimate the wireless channel Hk during an appropriate training phase
of the transmission (typically at the beginning of an OFDM-frame).
The channel-estimate is then expressed as matrix Ĥk .
Finally, to perform MIMO-detection, several algorithms that vary
in computational complexity and in the receiver’s signal quality are
known (e.g., [50]). Before choosing an appropriate MIMO detector for
the implementation on an FA in Section 3.3, the evaluation metrics
required to make this choice are introduced in Section 3.2.
Convolutional Decoding / Viterbi Decoding As a last step,
the decision information (either hard or soft) computed by the MIMO
detector is multiplexed into one single stream and fed into the
convolutional decoder that eventually delivers the received bitstream.
Please note that from now on, the OFDM-subchannel index k
will be dropped for the sake of brevity. Also, in the following, the
channel estimate $\hat{H}$ is assumed to be perfect (i.e., $\hat{H} = H$), thus
rendering the distinction with the channel matrix $H$ superfluous. The
OFDM-subchannel index k and channel-estimate matrix $\hat{H}$ will only
be considered when necessary.
3.2 Performance and Computational Complexity Metrics

The algorithms evaluated in Section 3.3 are classified according to the
following characteristics.
BER performance The wireless receiver signal quality is measured
by means of the bit error rate $\mathrm{BER} = b_{err}/b_{tot}$, where $b_{err}$ is the
number of erroneous bits in the received datastream and $b_{tot}$ is the
total number of transmitted bits.
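The definition translates directly into a few lines of Python. The function name `bit_error_rate` and the example bit sequences are hypothetical and serve only to illustrate the ratio.

```python
def bit_error_rate(tx_bits, rx_bits):
    """Return BER = b_err / b_tot for two equally long bit sequences."""
    if len(tx_bits) != len(rx_bits) or not tx_bits:
        raise ValueError("bit sequences must be non-empty and of equal length")
    b_err = sum(t != r for t, r in zip(tx_bits, rx_bits))
    return b_err / len(tx_bits)


# One erroneous bit out of four transmitted bits.
example_ber = bit_error_rate([0, 1, 1, 0], [0, 1, 0, 0])
```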
In this thesis, the BER performance is obtained by running Monte
Carlo simulations, and it is computed for different receiver signal-to-
noise ratios (SNRs). In each simulation cycle, randomly generated
bits are sent through a rate R = 1/2 convolutional encoder (which
has generator polynomials $[133_8\ 171_8]$ and constraint length 7)2 and
successively Gray-mapped onto points of a 64-QAM constellation
(recall Figure 3.2). The resulting complex-valued symbols are stacked
to build the vector $s$, which has average energy 1. The vector $s$ is
transmitted over the MIMO channel according to (3.1), where the
entries of the channel matrix $H$ are chosen to be independent and
identically-distributed (i.i.d.) as $\mathcal{CN}(0, 1)$. The SNR at the receiver is
$1/\sigma^2$. In the simulations, the receiver perfectly estimates the wireless
channel (i.e., $\hat{H} = H$), as well as the noise variance $\sigma^2$.
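The encoder used in the simulation setup can be sketched as follows. This is an illustrative Python sketch, not the simulation code of the thesis: it assumes that the most significant bit of each octal generator taps the current input bit and that the encoder starts in the all-zero state; the function name `conv_encode` is hypothetical.

```python
G0 = 0o133   # generator polynomial 133 (octal)
G1 = 0o171   # generator polynomial 171 (octal)
K = 7        # constraint length


def conv_encode(bits):
    """Rate-1/2 convolutional encoder: two output bits per input bit."""
    state = 0                      # K-bit shift register, newest bit at the MSB
    out = []
    for b in bits:
        state = ((b & 1) << (K - 1)) | (state >> 1)
        out.append(bin(state & G0).count("1") % 2)   # parity of the taps of G0
        out.append(bin(state & G1).count("1") % 2)   # parity of the taps of G1
    return out


# The impulse response interleaves the coefficients of both generators.
impulse_response = conv_encode([1, 0, 0, 0, 0, 0, 0])
```

Under the stated bit-ordering assumption, the even-indexed entries of `impulse_response` spell out the binary expansion of 133 (octal) and the odd-indexed entries that of 171 (octal).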
Depending on the purpose of the simulations, the numerical
precision for the entire receiver is set to either floating-point, in order
to fathom its limits, or parts of the receiver may be written to
emulate fixed-point behavior, introducing quantization errors. This makes
it possible to assess the receiver’s performance achievable on practical FAs
that typically support word-widths of 16 bits and perform fixed-point
computations. For instance, the decision of transmitting 64-QAM
constellation points is taken in this perspective: the higher numerical
precision requirements of 64-QAM, compared to 16-QAM, QPSK, or
BPSK, make it possible to derive the fixed-point requisites for the datapath of
an FA that supports up to 64-QAM modulation.
2 The polynomials are expressed in octal format. In this case, the generator
polynomials are $g_0(x) = x^6 + x^4 + x^3 + x + 1$ and $g_1(x) = x^6 + x^5 + x^4 + x^3 + 1$,
in GF(2).
Computational complexity (CC) The CC of a given algorithm
depends on the number of operations and on the type of operations, i.e.,
the atomic operations, required to complete it. The CC for algorithm a
can be described by $C_a = \sum_{o \in \mathcal{N}} N_o w_o$, where $\mathcal{N}$ is the set of all atomic
operations needed for algorithm a; $N_o$ is the number of operations
performed with atomic operation o, and $w_o$ the cost of o. In general,
$N_o$ depends on the specific realization of algorithm a, whereas $w_o$
varies according to the target platform selected for implementing that
algorithm.
For estimating the CC, with the implementation on an FA in
mind, the atomic operations commonly available on digital signal
processing platforms are split into the following categories (recall
also Chapter 2):
• ADD: add, subtract, arithmetic shift, compare.
• MAC: multiply, multiply and accumulate, multiply and subtract.
Additional atomic operations that are required in parts of the MIMO
receiver, and are thus also considered with special care as atomic
operations of a signal processor, are:
• DIV: invert (1/x), divide (y/x).
• ANGLE: compute angle α of a complex number z, $\alpha = \angle(z)$.
Other trigonometric functions.
• SQRT: square root $\sqrt{x}$.
With this subdivision, the CC for algorithm a is computed as:
$C_a = N_{MAC} w_{MAC} + N_{ADD} w_{ADD} + N_{DIV} w_{DIV} + N_{ANGLE} w_{ANGLE} + N_{SQRT} w_{SQRT}$.
$N_{MAC}$, $N_{ADD}$, $N_{DIV}$, $N_{ANGLE}$, and $N_{SQRT}$ are the number of MAC,
ADD, DIV, ANGLE, and SQRT atomic operations, whereas the costs
associated to these atomic operations are set to
$w_{MAC} = w_{ADD} = w_{DIV} = w_{ANGLE} = w_{SQRT} = 1$. These weights reflect the clock cycles
required to complete the corresponding operation. It is important
to note that the weights for the additional atomic operations DIV,
ANGLE, and SQRT are optimistic and assume single-cycle atomic
operations. Thus, the CC of algorithms that involve one of these atomic
operations is rather underestimated. Summarizing, the so-defined CC
is a rough measure of the clock cycles needed to accomplish algorithm
a.
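The weighted sum can be written down in a few lines of Python. The operation counts in the example are hypothetical and do not correspond to any algorithm of this chapter; the second call merely illustrates how a platform with a multi-cycle divider (here assumed to take 8 cycles) would change the estimate.

```python
# Unit weights, i.e. every atomic operation is assumed to take one clock cycle.
WEIGHTS = {"MAC": 1, "ADD": 1, "DIV": 1, "ANGLE": 1, "SQRT": 1}


def computational_complexity(counts, weights=WEIGHTS):
    """Return C_a as the sum of N_o * w_o over all atomic operations o."""
    unknown = set(counts) - set(weights)
    if unknown:
        raise ValueError(f"no weight defined for {sorted(unknown)}")
    return sum(n * weights[op] for op, n in counts.items())


example_counts = {"MAC": 16, "ADD": 4, "DIV": 1}         # hypothetical counts
cc_unit = computational_complexity(example_counts)
cc_slow_div = computational_complexity(example_counts, dict(WEIGHTS, DIV=8))
```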
Most of the BB processing algorithms deal with complex-valued
numbers. Since conventional FAs do not offer dedicated execution
units for complex-valued operations, the CCs are reported in two
flavors. First, accounting the operations as if only real-valued atomic
operations were available on the platform, and second, accounting the
operations honoring complex-valued atomic operations. The mapping
from complex- to real-valued operations, and vice-versa, is:
• 1 complex-valued addition ↔ 2 real-valued additions,
• 1 complex-valued multiplication ↔ 4 real-valued multiplications
and 2 real-valued additions,
• 1 complex-valued multiply and accumulate ↔ 4 real-valued mul-
tiply and accumulate (or, 4 real-valued multiplications and 4
additions).
The mapping of complex- into real-valued division is not necessary
since the considered BB algorithms require only real-valued divisions.
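The mapping can be captured in a small conversion table. The Python sketch below uses hypothetical names (`REAL_COST`, `to_real_counts`, `complex_mac`), books real-valued multiplications in the MAC category, and additionally shows that one complex-valued multiply and accumulate can indeed be carried out with four real-valued MAC-type operations.

```python
# Real-valued operations per complex-valued operation, following the list above.
REAL_COST = {
    "CADD": {"ADD": 2},
    "CMUL": {"MAC": 4, "ADD": 2},
    "CMAC": {"MAC": 4},
}


def to_real_counts(complex_counts):
    """Convert counts of complex-valued operations into real-valued counts."""
    real = {"MAC": 0, "ADD": 0}
    for op, n in complex_counts.items():
        for real_op, per_op in REAL_COST[op].items():
            real[real_op] += n * per_op
    return real


def complex_mac(acc, a, b):
    """acc + a * b on (re, im) pairs using four real MAC-type operations."""
    acc_re, acc_im = acc
    acc_re += a[0] * b[0]   # multiply and accumulate
    acc_re -= a[1] * b[1]   # multiply and subtract
    acc_im += a[0] * b[1]   # multiply and accumulate
    acc_im += a[1] * b[0]   # multiply and accumulate
    return acc_re, acc_im


# Example: a 2 x 2 complex matrix-vector product needs 4 complex MACs.
matvec_2x2 = to_real_counts({"CMAC": 4})
```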
Finally, the processing performance requirement $P_a$ necessary to
compute algorithm a in real-time is obtained by $P_a = C_a/T_a$, where $T_a$
is the time lapse, or the duty cycle, at disposal to complete algorithm
a. The required processing performance $P_a$ is expressed in millions of
data-operations per second (MdOp/s) the platform has to execute (cf.
Chapter 2, with the MdOp/s delivered by the various FAs).
3.3 Choice of the MIMO Detector

This section reviews the most popular types of MIMO detectors and
derives the corresponding CCs, in light of an implementation on an FA
equipped with the atomic operations ADD, MAC, DIV, ANGLE, and
SQRT. At the end of the section, the CC of the evaluated detectors is
reported for one OFDM-subchannel and it is split into two parts. It is
reported separately for the receiver’s preprocessing phase (in Table 3.1)
and the data processing phase (in Table 3.2), since the algorithms
involved in the two phases differ. The subsequent discussion and
comparison of the CCs and the BER performances make it possible to
select a MIMO detector that is appropriate for an FA.
For the implementation as dedicated VLSI components, a good
reference that compares the appropriate CCs and the BER performances
of various MIMO detection algorithms is [51].
3.3.1 Brute-force maximum-likelihood (ML)

MIMO detection is a decision problem: among all possible transmitted
vector-symbols $s \in \mathcal{A}^{M_T}$, the MIMO detector has to choose the vector-
symbol $\hat{s}$ that maximizes the probability of a correct decision (i.e.,
that $\hat{s} = s$).3 From statistics, it is well known that in this case the ML
rule (e.g., [7, 52]) maximizes the probability of a correct decision. For
the considered MIMO system, the ML rule can be reduced to
$\hat{s} = \arg\min_{s \in \mathcal{A}^{M_T}} \| y - Hs \|^2$. (3.2)
The solution of (3.2) can be found by first estimating the channel
to get $H$ and by precomputing all possible $Hs$ candidates – during the
preprocessing –, followed by exhaustively testing all $|\mathcal{A}^{M_T}| = M^{M_T}$
candidates against the current received vector $y$ – during the data
processing.
For hard detection, once the solution of (3.2) is obtained, the entries
of $\hat{s}$ are directly translated into the binary labels of the corresponding
constellation points (demapping). The resulting CCs associated to both
the preprocessing and the data processing phases are prohibitive [in
the order of $O(M^{M_T})$]. The rough calculation of the CC for a 64-QAM
2 × 2 MIMO-OFDM receiver is sufficient to show the overwhelming
complexity of the problem. In total, $64^2$ = 4’096 candidate vector-
symbols have to be tested – just for one OFDM-subchannel. The
assumption that one test can be completed in one clock cycle, for an
OFDM system with 52 subchannels and an OFDM-symbol duration
of 4 µs, would lead to a required processing performance of 4’096 ×
52/4 µs = 53’248 MdOp/s (!). Soft detection would lead to an even
higher CC [53].
3 The vector-symbol $s \in \mathcal{A}^{M_T}$ has $M_T$ entries $s \in \mathcal{A}$.
These results show that brute-force ML is not a viable path to find
the solution of (3.2). For this reason, brute-force ML is dropped from
the candidate list.
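For small alphabets the exhaustive search of (3.2) is easy to write down, which also makes its exponential growth tangible. The following Python sketch is an illustration only: it uses QPSK instead of 64-QAM to keep the search small, an arbitrary example channel, no noise, and a hypothetical function name `ml_detect`. The last lines reproduce the rough calculation for 64-QAM in a 2 × 2 system with 52 subchannels and an OFDM-symbol duration of 4 µs.

```python
from itertools import product


def ml_detect(y, h, alphabet):
    """Brute-force ML: test all M**M_T candidates and keep the closest one."""
    m_t = len(h[0])
    best, best_metric = None, float("inf")
    for cand in product(alphabet, repeat=m_t):
        metric = sum(
            abs(y_i - sum(h_ij * s_j for h_ij, s_j in zip(row, cand))) ** 2
            for y_i, row in zip(y, h)
        )
        if metric < best_metric:
            best, best_metric = cand, metric
    return list(best)


# QPSK alphabet with average energy 1/M_T for M_T = 2.
qpsk = [complex(re, im) / 2 for re in (-1, 1) for im in (-1, 1)]
h = [[1 + 0.5j, 0.3 - 0.2j], [-0.4 + 0.1j, 0.9 + 0.7j]]   # arbitrary example
s = [qpsk[0], qpsk[3]]
y = [sum(h_ij * s_j for h_ij, s_j in zip(row, s)) for row in h]   # noise-free
s_hat = ml_detect(y, h, qpsk)

# Candidate count for 64-QAM, M_T = 2, and the resulting requirement when
# one test per clock cycle is assumed (operations per microsecond = MdOp/s).
candidates_64qam_2x2 = 64 ** 2
required_mdops = candidates_64qam_2x2 * 52 / 4
```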
3.3.2 Sphere decoding (SD)

A more sophisticated and computationally less complex method to
obtain the ML solution is to map the problem (3.2) onto an equivalent
tree structure, in a first step. Then, in a second step, the problem
can be solved by applying appropriate tree search algorithms that use
pruning on forlorn branches to find the ML solution earlier. Eventually,
this leads to a lower CC than brute-force ML. In this category, the most
prominent and promising algorithm is SD [54], which is considered in
the following.
In order to perform the mapping onto the appropriate tree structure,
during the preprocessing, the QR-decomposition $H = QR$ has to
be computed. The QR-decomposition leads to the $M_R \times M_T$ orthonormal
matrix $Q$ and to the $M_T \times M_T$ right-triangular matrix $R$.4 Then,
during data processing, the received vector $y$ is left-multiplied by $Q^H$
leading to the modified input-output relation [cf. (3.1)] $\tilde{y} = Rs + \tilde{n}$,
where $\tilde{y} = Q^H y$, and $\tilde{n} = Q^H n$ has the same statistics as $n$. This
modified input-output relation enables the mapping onto an equivalent
tree structure (see Appendix A.1).
The CC of SD during data processing is proportional to the average
number of visited tree nodes (Nav ). Nav , in its turn, depends on the
SNR-regime the receiver is operating at. At low SNR, Nav is larger
than at high SNR, as visible in Figure 3.3. The CC of SD, for one
visited node, is reported in Table 3.2.
One problem of SD resides exactly in its varying CC and thus its
varying run-time. On average, its complexity is significantly lower
than that of brute-force ML; in the worst case, however, it is equal.
Thus, if a certain throughput at a fixed BER must be guaranteed,
the implementation of SD has to be dimensioned for the worst case,
resulting in a design that fully exploits its resources only occasionally.
To overcome this problem, run-time constraints can be imposed on
SD such that the search is terminated after a pre-determined amount
of time or number of operations, as proposed in [55]. Applying this
restriction makes SD attractive, but it slightly degrades the BER
performance. In [55], the average number of visited nodes has to be
set to values between Nav = 7 and 18 for obtaining a reasonable BER
performance in a 4 × 4 MIMO system with 16-QAM.
4 The precise QR-decomposition is detailed later, in Section 3.4.

Figure 3.3: CC of SD with respect to SNR for a 2 × 2 and 4 × 4 MIMO
system with 64-QAM.
(Plot; horizontal axis: SNR [dB], 0 to 40; vertical axis: average number
of visited nodes [Nav]; curves: 2x2 MIMO and 4x4 MIMO.)
Both above-described SD incarnations, [54, 55], deliver hard deci-
sion information. The results in [56, 57] describe a realization of SD
capable of both, delivering soft decision information and respecting
run-time constraints. Although providing a much better BER perfor-
mance, for the implementation on a reasonably-dimensioned FA, the
CC of soft-out SD is not yet manageable.
3.3.3 K-Best (KB)

Another tree-search method that completes in a fixed run-time, at the
cost of diverging from the ML performance, is the KB algorithm [58].
As for SD, the preprocessing requires the QR-decomposition of H
for obtaining the tree structure. In [59], a high-throughput VLSI
implementation of KB is presented where the complex-valued input-
output relation (3.1) is decomposed into a real-valued problem, through
its real-valued decomposition (RVD). As a consequence, the size of the
involved vectors and matrices doubles, but the computations are all real-
valued instead of complex-valued. During data processing, at each tree
level KB keeps only the K best solutions in its candidate list. All other
candidates are neglected. Once the lowest tree level is attained, the
best of the K solutions is returned and the so-obtained vector-symbol
is declared as the transmitted vector-symbol (see Appendix A.2).
Although the CC during data processing is reduced compared to
SD, involving only real-valued operators due to the RVD, it is still
considerable (as depicted in Figure 3.4, at the end of this section). The
BER performance slightly diverges from the ML BER performance in
the high SNR regime. The larger K, the later the KB BER performance
diverges from that of ML.5
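The breadth-first search with a bounded candidate list can be sketched in Python as follows. Note the assumptions: the sketch operates directly on the complex-valued triangular system ỹ = Rs + ñ rather than on the real-valued decomposition used in [59], it extends every surviving candidate by all constellation points, and the function name `k_best_detect` as well as the example matrix are hypothetical.

```python
def k_best_detect(y_tilde, r, alphabet, k):
    """K-Best search on y_tilde = R s + n with right-triangular R."""
    m_t = len(r)
    # Each candidate is (partial distance, symbols of the levels visited so far).
    candidates = [(0.0, [])]
    for i in range(m_t - 1, -1, -1):            # from tree level M_T down to 1
        extended = []
        for dist, tail in candidates:
            # Contribution of the symbols already fixed on this path.
            interf = sum(r[i][j] * tail[j - i - 1] for j in range(i + 1, m_t))
            for point in alphabet:
                inc = abs(y_tilde[i] - interf - r[i][i] * point) ** 2
                extended.append((dist + inc, [point] + tail))
        extended.sort(key=lambda c: c[0])
        candidates = extended[:k]               # keep only the K best paths
    return candidates[0][1]


qpsk = [complex(re, im) / 2 for re in (-1, 1) for im in (-1, 1)]
r = [[1.2, 0.4 - 0.3j], [0.0, 0.8]]             # arbitrary example matrix
s = [qpsk[1], qpsk[2]]
y_tilde = [sum(r_ij * s_j for r_ij, s_j in zip(row, s)) for row in r]
s_hat = k_best_detect(y_tilde, r, qpsk, k=2)
```

The run-time is fixed by K, the alphabet size, and the number of tree levels, independently of the received vector.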
5 Reference [60] describes the implementation of the KB list sphere decoding on
a transport triggered architecture. The achieved throughput is 5.3 Mbit/s.

3.3.4 Successive interference cancellation (SIC)

During preprocessing the SIC algorithm relies on the QR-decomposition,
as it is required for SD and for KB. However, during the data processing
phase, SIC has a much lower CC than SD or KB. SIC maps the
detection problem onto the equation $\hat{y} = R^{-1} Q^H y$. Thanks to the
right-triangular structure of the $M_T \times M_T$ matrix $R$, the unknown
$\hat{y}$ is stepwise reconstructed through back-substitution (3.3), solving
$\hat{y} = R^{-1} \tilde{y}$ (where $\tilde{y} = Q^H y$):
$\hat{y}_i = \tilde{y}_i - \sum_{j=i+1}^{M_T} r_{i,j} \hat{s}_j$ (3.3)
$\hat{s}_i = Q(\hat{y}_i, r_{i,i})$, (3.4)
and $i = M_T, \ldots, 1$. After each back-substitution step i, the obtained
solution $\hat{y}_i$ is mapped to the nearest constellation point in the alphabet
$\mathcal{A}$ (3.4), leading to the (hard) detected vector-symbol $\hat{s}$.6 The main
drawback of SIC is that no good-quality soft information can easily be
extracted [61].
The BER performance of the SIC algorithm lies between that of
KB and linear detection. The CC of SIC during the data processing is
derived in Appendix A.3 and it is reported in Table 3.2 at the end of
this section.
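The back-substitution with slicing can be sketched in Python as follows. The sketch assumes that the right-triangular matrix and the rotated received vector are already available from the preprocessing; the function names `slice_scaled` and `sic_detect`, the example matrix, and the small perturbation that stands in for noise are hypothetical. The slicer compares against the alphabet scaled by the diagonal element, so that no division is needed.

```python
def slice_scaled(y_hat_i, r_ii, alphabet):
    """Nearest constellation point, with the alphabet scaled by r_ii."""
    return min(alphabet, key=lambda p: abs(y_hat_i - r_ii * p))


def sic_detect(y_tilde, r, alphabet):
    """Hard-output SIC by back-substitution on a right-triangular system."""
    m_t = len(r)
    s_hat = [None] * m_t
    for i in range(m_t - 1, -1, -1):            # i = M_T, ..., 1
        # Cancel the interference of the already detected symbols.
        y_hat_i = y_tilde[i] - sum(r[i][j] * s_hat[j] for j in range(i + 1, m_t))
        s_hat[i] = slice_scaled(y_hat_i, r[i][i], alphabet)
    return s_hat


qpsk = [complex(re, im) / 2 for re in (-1, 1) for im in (-1, 1)]
r = [[1.2, 0.4 - 0.3j], [0.0, 0.8]]             # arbitrary example matrix
s = [qpsk[1], qpsk[2]]
y_tilde = [sum(r_ij * s_j for r_ij, s_j in zip(row, s)) + 0.05 for row in r]
s_hat = sic_detect(y_tilde, r, qpsk)
```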
3.3.5 Linear detection

Linear detectors reduce the CC in the receiver by splitting the MIMO
detection problem into $M_T$ independent SISO problems, before
applying the ML rule to each stream. To this end, the received vector
$y$ is multiplied with an estimator matrix $G$. Commonly the matrix
$G$ is obtained as either zero forcing (ZF) or minimum mean-squared
error (MMSE) estimator. ZF has a slightly worse BER performance
and almost the same computational complexity as MMSE, therefore
usually MMSE is preferred.
To derive the MMSE estimate $\hat{y}$, given the received vector $y$ in (3.1)
and the channel estimate $H$, the following three steps are required [7]:
$F = H^H H + M_T \sigma^2 I$ (3.5)
$G = F^{-1} H^H$ (3.6)
$\hat{y} = Gy$. (3.7)
It is important to note that in (3.6) the matrix $F$ has to be inverted
to obtain $G$, constituting a major computational challenge when it
comes to the fixed-point implementation on the FA. Furthermore, to
obtain the ZF estimator matrix it suffices to set $\sigma^2 = 0$ in (3.5).
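For a 2 × 2 system the three steps (3.5) to (3.7) can be followed in a short Python sketch. It is a floating-point illustration under stated assumptions: the matrix F is inverted with the closed-form 2 × 2 inverse, which is a simplification chosen for brevity and not a fixed-point-friendly inversion method; the function names `mmse_estimator_2x2` and `apply_estimator` and the example channel are hypothetical.

```python
def mmse_estimator_2x2(h, sigma2, m_t=2):
    """Return G = (H^H H + M_T sigma^2 I)^(-1) H^H for a 2 x 2 matrix H."""
    # Conjugate transpose H^H.
    hh = [[h[row][col].conjugate() for row in range(2)] for col in range(2)]
    # (3.5): F = H^H H + M_T sigma^2 I
    f = [[sum(hh[i][k] * h[k][j] for k in range(2)) for j in range(2)]
         for i in range(2)]
    f[0][0] += m_t * sigma2
    f[1][1] += m_t * sigma2
    # Closed-form inverse of the 2 x 2 matrix F.
    det = f[0][0] * f[1][1] - f[0][1] * f[1][0]
    f_inv = [[f[1][1] / det, -f[0][1] / det],
             [-f[1][0] / det, f[0][0] / det]]
    # (3.6): G = F^(-1) H^H
    return [[sum(f_inv[i][k] * hh[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]


def apply_estimator(g, y):
    """(3.7): y_hat = G y."""
    return [sum(g_ij * y_j for g_ij, y_j in zip(row, y)) for row in g]


h = [[1 + 0.5j, 0.3 - 0.2j], [-0.4 + 0.1j, 0.9 + 0.7j]]   # arbitrary example
s = [complex(1, 1) / 2, complex(-1, 1) / 2]
y = [sum(h_ij * s_j for h_ij, s_j in zip(row, s)) for row in h]  # noise-free
y_hat_zf = apply_estimator(mmse_estimator_2x2(h, sigma2=0.0), y)
y_hat_mmse = apply_estimator(mmse_estimator_2x2(h, sigma2=0.01), y)
```

With zero noise variance the estimator reduces to ZF and, in the noise-free example, `y_hat_zf` recovers the transmitted vector up to rounding; the MMSE estimate with nonzero noise variance is slightly biased toward zero.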
The vector $\hat{y}$ is further processed to obtain the estimated
transmitted symbol $\hat{s} = Q(\hat{y})$ through the slicing operation $Q(.)$, when
performing hard detection. As alternative, with only minor additional
CC, $\hat{y}$ can further be elaborated to obtain appropriate soft information
as detailed in [53, 62]. Here, the possibility of obtaining good-quality
soft information represents a significant advantage over SIC. The CC
of linear MMSE detection during the preprocessing in Table 3.1 is
obtained using the rank-1 update method (proposed in [63]) for the
inversion of $F$.
6 Note that the mapping $Q(.)$ takes the diagonal elements $r_{i,i}$ of $R$ into account
for scaling the decision boundaries, such that no division is required.
3.3.6 Results and conclusion

Complexity Tables 3.1 and 3.2 summarize the CCs of the above
described MIMO detectors for the preprocessing and the data processing,
respectively. Both the case in which only real-valued atomic operations
are available and the case in which complex-valued atomic operations
are available are considered. The CCs are
listed for one OFDM-subchannel, and thus, for obtaining the CC of
the MIMO-OFDM detector, they have to be scaled with the number
of data-carrying OFDM-subchannels.
Although the reported CCs represent estimates and do not take
any implementation overhead into account, they are essential to relate
the detectors among each other. Figure 3.4 visualizes the findings for
different, symmetric ($M_T = M_R$), antenna configurations. For KB the
CC was obtained with K = 5, whereas for SD the average number of
visited tree nodes was set to Nav = 2.5, 3.75, and 5 for the 2 × 2, 3 × 3,
and 4 × 4 systems, respectively. These Nav values are optimistic, and
correspond to operating in the high SNR regime (cf. Figure 3.3).
While all considered detectors have a comparable CC during
preprocessing, the situation is completely different during data processing.
As expected, SD and KB have a much higher CC than SIC and linear
MMSE detection. SIC is slightly more complex than linear MMSE.
Examining the $M_T = M_R = 2$ case, with real-valued atomic operations
[see Figure 3.4(a)], discloses that the preprocessing has a CC of 224
(with QR-decomposition), and the data processing a CC of 1’940 with
SD, of 574 with KB, of 96 with SIC, and of 16 with linear MMSE
detection. Scaling these results to a MIMO-OFDM system as the
one described later on in Section 3.5, leads to a required processing
performance of more than 2’900 MdOp/s for the preprocessing, and
25’000 MdOp/s, 7’400 MdOp/s, 1’200 MdOp/s, and 200 MdOp/s for
SD, KB, SIC, and linear MMSE, during the data processing phase.7
As these figures testify, linear MMSE and SIC detection may possibly
fit on a conventional high-end digital signal processor (DSP) – e.g., TI’s
C6455 (with a peak data processing performance of 4’000 MdOp/s)
or ADI’s TigerSHARC (with 2’400 MdOp/s peak data performance).8
SD and KB, however, are far beyond that possibility.
7 The MIMO-OFDM system considered in Section 3.5 employs 52 data-carrying
An FA with execution units performing complex-valued arithmetic
would reduce the CC [see Figure 3.4(b)] and the processing require-
ments would become: 800 MCdOp/s (millions of complex-valued op-
erations per second) for the preprocessing with QR-decomposition,
and 80 400 MCdOp/s (SD), 70 400 MCdOp/s (KB), 400 MCdOp/s (SIC),
and 50 MCdOp/s (MMSE) for the data payload processing.
BER performance First, the BER performance without any de-

coding is considered. The simulations were run according to the setup
described in Section 3.2, and the results are reproduced in Figure 3.5
for both a 2 × 2 and a 4 × 4 MIMO system. It is easy to observe
that the performance gap between SD and KB, which achieve ML and
near-ML performance, and SIC and MMSE is substantial. For the
2 × 2 system the gap is of around 5 dB at an SNR of 30 dB, wheres for
the 4 × 4 system it is of more than 10 dB.
Figure 3.6 illustrates the BER performance obtained when consid-
ering encoding and decoding, employing a MIMO detector delivering
hard-out information. The gap between SD and KB, and SIC and
MMSE is perceptible also here. In addition, the figure shows the
BER performance obtained with soft-out information for SD and lin-
ear MMSE. Please note that the simulations are performed with the
setup described in Section 3.2, where the channel H has i.i.d. entries
∼ CN (0, 1), the BER performance difference between these two MIMO
detectors is only minimal. Using a different channel model leads to
a different gap in the BER performance between soft-out SD and
soft-out MMSE detection. With the TGn channel, for instance, the
gap becomes larger [64].
OFDM-subchannels and the data processing has to be concluded in 4 µs. Hence,

the required processing performance is computed as Pa = Ca · 52/4 µs.
8 The data processing performance is derived as: PP = 2-way SIMD ×
2 dOp/SIMD × 1 GHz = 40 000 MdOp/s for the C6455. For the TigerSHARC
ADSP TS201S: PP = 2-way SIMD × 2 dOp/SIMD × 600 MHz = 20 400 MdOp/s.
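The scaling from a per-subchannel CC to a required processing performance (footnote 7) can be checked with a short sketch. It is illustrative only: the function names are hypothetical, the defaults (64-QAM, K = 5, Nav = 2.5) follow the 2 × 2 setting of Figure 3.4, and the reading of the SD rows of Table 3.2 with M rather than √M is an assumption, chosen because it reproduces the values plotted in Figure 3.4(a).

```python
import math

def data_cc_real(detector, m_t, m_r, m=64, k=5, n_av=2.5):
    """Real-valued per-subchannel data-processing CC, as read from Table 3.2."""
    root_m = math.sqrt(m)
    if detector == "SD":
        return n_av * (12 * m + 4 * m_t)
    if detector == "KB":
        return (k * (6 * root_m - 1) + 4 * root_m) * m_t + 2 * k * m_t ** 2
    if detector == "SIC":
        return 2 * (2 * root_m + math.log2(m)) * m_t + 2 * m_t ** 2
    if detector == "MMSE":
        return 4 * m_r * m_t
    raise ValueError(f"unknown detector: {detector}")

def required_mdops(cc, n_subchannels=52, t_symbol_us=4.0):
    """Footnote 7: P = C * 52 / 4 us (operations per microsecond = MdOp/s)."""
    return cc * n_subchannels / t_symbol_us

if __name__ == "__main__":
    for det in ("SD", "KB", "SIC", "MMSE"):
        cc = data_cc_real(det, m_t=2, m_r=2)
        print(f"{det:>4}: CC = {cc:6.0f}, P = {required_mdops(cc):7.0f} MdOp/s")
```

The exact products are somewhat larger than the rounded figures quoted in the text, which appear to be rounded down to the nearest hundred.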
Table 3.1: Per-subchannel CC of different MIMO detectors – Preprocessing.

Detector        Ops.     Preprocessing
Brute-force ML  C-Ops.   M_R M_T M^{M_T}
Brute-force ML  R-Ops.   4 M_R M_T M^{M_T}
SD, KB, SIC     C-Ops.   (17/2 + 2 M_R) M_R M_T + 3/2 M_R M_T^2
SD, KB, SIC     R-Ops.   4 (7 + 2 M_R) M_R M_T + 6 M_R M_T^2
Linear MMSE     C-Ops.   2 M_R + 2 M_R M_T + 4 M_R M_T^2
Linear MMSE     R-Ops.   3 M_R + 8 M_R M_T + 14 M_R M_T^2
Conclusion Although SD and KB exhibit a much better BER performance
than SIC and MMSE, the analysis of their CCs and the
resulting processing performance requirements excludes them from
the candidates suitable for an FA. Between SIC and MMSE there is
no practical BER performance gain. Thus, considering that MMSE
detection has the lowest CC during both preprocessing and data
processing, the implementation of a linear MMSE detector seems the most
reasonable step towards a MIMO-OFDM SDR implementation on an
FA. Further, this choice is reinforced by observing that it is still
possible to boost the MMSE detector's BER performance by generating
soft decision information, as illustrated in Figure 3.6.
3.4 Linear MMSE Detection

The hardest computational kernel involved in the computation of the
linear MMSE estimator matrix G in (3.6) is the matrix inversion F−1
performed during the preprocessing. Therefore, this section inspects
different methods to obtain G and, especially, to invert the MT × MT
matrix F. By comparing the associated CCs and the achievable BER
performance, as done in the previous section, it is possible to quantify
the qualities of the evaluated methods, which eventually permits
selecting the method that best fits the target platform.
The methods considered for computing G are: classical adjoint
method, indirect inversion of F through LR-decomposition, through
LDL-decomposition, through GS-decomposition, and through QR-
[Figure 3.4: bar charts of the CC versus antenna configuration
(M × M, M = 2, 3, 4) for SD (Nav = 2.5, 3.75, 5), KB (K = 5), SIC,
and LMMSE, each with a "Preprocessing" and a "Symbol processing"
panel. (a) CC considering real-valued atomic operations. (b) CC
considering complex-valued atomic operations.]
Figure 3.4: CC of different MIMO detectors.

[Figure 3.5: two plots of BER versus SNR [dB] with curves for SD,
KB (K = 5), SIC, and MMSE.]
Figure 3.5: Uncoded BER performance for different MIMO detectors
with 64-QAM. Top: 2 × 2 MIMO system. Bottom: 4 × 4 MIMO
system.
[Figure 3.6: two plots of BER versus SNR [dB] with curves for
hard-out SD, hard-out KB (K = 5), hard-out SIC, hard-out MMSE,
soft-out SD, and soft-out MMSE.]
Figure 3.6: Coded BER performance for different hard-out MIMO
detectors with 64-QAM. Top: 2 × 2 MIMO system. Bottom: 4 × 4
MIMO system.
Table 3.2: Per-subchannel CC of different MIMO detectors – Data
processing.

Detector        Ops.     Symbol detection
Brute-force ML  C-Ops.   (2 M_R + 1) M^{M_T}
Brute-force ML  R-Ops.   (6 M_R + 2) M^{M_T}
SD              C-Ops.   N_av (4 M + M_T + 1)
SD              R-Ops.   N_av (12 M + 4 M_T)
KB              C-Ops.   (K (6 √M − 1) + 4 √M) M_T + 2 K M_T^2
KB              R-Ops.   (K (6 √M − 1) + 4 √M) M_T + 2 K M_T^2
SIC             C-Ops.   (1/2 + √M + log2 M) M_T + M_T^2 / 2
SIC             R-Ops.   2 (2 √M + log2 M) M_T + 2 M_T^2
Linear MMSE     C-Ops.   M_R M_T
Linear MMSE     R-Ops.   4 M_R M_T
decomposition; direct inversion of F by a series of Rank-1 updates,

and by the Divide-and-Conquer (D&C) algorithm. While the accurate
CCs and the steps leading to the inverse of F for the above mentioned
algorithms are all detailed in Appendix A.4, in the following only
the underlying principles are explained and the corresponding atomic
operations are identified. The discussion focuses on the steps that are
performed during the preprocessing and on the resulting CC, since the
CC of the data processing is equal for all methods. The CC of the
data processing arises from the matrix-vector multiplication ŷ = Gy
that is part of the detection. It amounts to 4MR MT when considering
only real-valued atomic operations, or to MR MT with complex-valued
atomic operations.
Finally, we remark that F is Hermitian and positive-definite by
construction. A matrix F is Hermitian if it satisfies F^H = F and
positive-definite if x^H F x > 0 holds for all nonzero x ∈ C^{M_T} [65].
Some of the presented methods exploit the structure of F to reduce
the CC, while others do not since no significant advantages accrue.
3.4.1 Adjoint method

The classical way of presenting the inverse of an MT × MT matrix F
is by the adjoint method:
\[ \mathbf{F}^{-1} = \frac{\operatorname{adj}(\mathbf{F})}{\det(\mathbf{F})}. \tag{3.8} \]
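The solution is trivial for the case M = 1 since in this case F is a
positive scalar and its CC amounts to 1 (i.e., one division). For M = 2,
(3.8) corresponds to:
\[ \mathbf{F}^{-1} = \begin{bmatrix} a & b \\ b^{*} & c \end{bmatrix}^{-1} = \frac{1}{ac - bb^{*}} \begin{bmatrix} c & -b \\ -b^{*} & a \end{bmatrix}. \tag{3.9} \]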
To solve (3.9) employing complex-valued atomic operations, 5 CMACs

and 1 real-valued DIV are required, resulting in a CC of 6. When
considering real-valued atomic operations 20 MACs and 1 real-valued
DIV are necessary, leading to a CC of 21. In total, for the computation
of G the CC is 22 (with complex-valued operations), or 69 when
accounting real-valued atomic operations.
In the case MT > 2, however, the high CC and the large dynamic
range render the use of (3.8) impractical.
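As an illustration of (3.9), the following minimal sketch inverts a 2 × 2 Hermitian positive-definite matrix with a single real-valued division. The function name and the argument layout (a, b, c as in (3.9)) are assumptions made for this example, not taken from the implementation described in this thesis.

```python
def inverse_hermitian_2x2(a, b, c):
    """Invert F = [[a, b], [conj(b), c]] (a, c real and positive) via (3.9)."""
    det = a * c - (b * b.conjugate()).real    # ac - bb* is real for Hermitian F
    if det <= 0:
        raise ValueError("F must be positive-definite")
    inv_det = 1.0 / det                       # the only division (real-valued)
    return [[c * inv_det, -b * inv_det],
            [-b.conjugate() * inv_det, a * inv_det]]
```

All remaining operations are multiplications by the real scalar 1/det, which is why the dynamic range of det becomes the critical quantity in fixed-point arithmetic.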
3.4.2 LR-decomposition
LR-decomposition [66] decomposes F into a left-triangular MT × MT
matrix L and a right-triangular MT × MT matrix R, such that LR =
F. Then, two successive back-substitution steps lead to the matrix G:
1) A = L^{-1} H^H, and 2) G = R^{-1} A. No divisions are required during
the first back-substitution, since the diagonal entries of L are all 1.
However, for the second back-substitution the inversion of the diagonal
elements of R is required. The atomic operations for LR-decomposition
are ADD, MAC, and DIV.
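The two steps above can be sketched as follows. This is a plain Doolittle factorization without pivoting (acceptable here only because F is Hermitian positive-definite), written for readability rather than for the operation counts of Appendix A.4; the function names are hypothetical.

```python
def lr_decompose(f):
    """Doolittle LR-decomposition without pivoting: F = L R with diag(L) = 1."""
    n = len(f)
    l = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    r = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):            # row i of R
            r[i][j] = f[i][j] - sum(l[i][k] * r[k][j] for k in range(i))
        for j in range(i + 1, n):        # column i of L
            l[j][i] = (f[j][i] - sum(l[j][k] * r[k][i] for k in range(i))) / r[i][i]
    return l, r

def solve_lr(l, r, b):
    """Solve L R X = B: step 1) A = L^-1 B, step 2) X = R^-1 A."""
    n, m = len(b), len(b[0])
    a = [[0.0] * m for _ in range(n)]
    for i in range(n):                   # no division: diagonal of L is 1
        for c in range(m):
            a[i][c] = b[i][c] - sum(l[i][k] * a[k][c] for k in range(i))
    x = [[0.0] * m for _ in range(n)]
    for i in reversed(range(n)):         # one reciprocal per diagonal entry of R
        inv = 1.0 / r[i][i]
        for c in range(m):
            x[i][c] = (a[i][c] - sum(r[i][k] * x[k][c] for k in range(i + 1, n))) * inv
    return x
```

Passing B = H^H yields G directly; passing the identity yields F^{-1}.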
3.4.3 LDL-decomposition
With LDL-decomposition [66], F is decomposed into the left-triangular
MT × MT matrix L and the diagonal MT × MT matrix D, with the
characteristic LDL^H = F. To perform the detection, first the
multiplication R = DL^H leads to a right-triangular matrix R.
Thereafter, the two consecutive back-substitution steps utilized in the
LR-decomposition are performed to obtain the MMSE estimator G. As
for LR-decomposition, the atomic operations are ADD, MAC, and
DIV.
3.4.4 GS-decomposition
When using GS-decomposition9 [66] the augmented channel matrix
\[ \bar{\mathbf{H}} = \begin{bmatrix} \mathbf{H}^{H} & \sqrt{M_T}\,\sigma\,\mathbf{I}_{M_T} \end{bmatrix}^{H} \in \mathbb{C}^{(M_R+M_T)\times M_T} \]
is decomposed into the (MR + MT) × MT orthonormal matrix Q̄ and the
right-triangular MT × MT matrix R, such that Q̄R = H̄. Then, the
augmented matrix Ḡ = R^{-1} Q̄^H is computed. During data processing,
detection is performed as ŷ = Ḡȳ, where ȳ is the received vector y
extended with zeros to render the multiplication with Ḡ possible.
The presence of the SQRT atomic operation in the GS-decomposition,
in addition to the common atomic operations ADD, MAC, and DIV,
is a clear drawback of this method: when two methods have nearly
equal CC, the method with the fewer, or less costly, atomic operations
is favored.10
3.4.5 QR-decomposition
The classical QR-decomposition [66] involves the same steps as the GS-
decomposition. The only difference is the method used to obtain the
matrices Q̄ and R. The QR-decomposition analyzed in Appendix A.4.5
relies on Givens-rotations, which require the computation of the arc
tangent as a fundamental operation. Therefore, compared to
GS-decomposition, instead of the SQRT atomic operation, the ANGLE
atomic operation is required in addition to ADD, MAC, and DIV. As
for GS-decomposition, the presence of an additional atomic operation
(here ANGLE) represents a potential disadvantage.
9 Named after the initials of its two independent discoverers: Jørgen
Pedersen Gram and Erhard Schmidt. Gram published in 1883 [67], whereas
Schmidt in 1907 [68].
10 Recall that, for deriving the CC, the weights of all atomic operations
have been set to 1 (in Section 3.2) which is, of course, an approximation.
If the difference among the CCs of the various linear MMSE detection
methods that are evaluated becomes too small for a clear choice, these
weights may be refined. However, this will not be necessary, as the final
discussion will show.
3.4.6 Rank-1 update

Rank-1 update (R1) [63] directly inverts F starting from
P^{(0)} = I_{M_T}/(M_T σ^2) and performing a series of MT rank-1
updates (k = 1, 2, . . . , MT):
\[ \mathbf{P}^{(k)} = \mathbf{P}^{(k-1)} - \frac{\mathbf{P}^{(k-1)} \mathbf{h}_k^{H} \mathbf{h}_k \mathbf{P}^{(k-1)}}{1 + \mathbf{h}_k \mathbf{P}^{(k-1)} \mathbf{h}_k^{H}}, \]
where h_k denotes the kth row of the channel estimate H. Eventually,
the inverse of F is obtained as F^{-1} = P^{(M_T)} and further employed
to compute G = F^{-1} H^H. The required atomic operations are ADD,
MAC, and DIV.
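A minimal sketch of the rank-1 update recursion is given below. It assumes F = H^H H + M_T σ² I and applies one update per row of H (M_R rows, which equals M_T in the symmetric case considered here); the function name is hypothetical and floating-point arithmetic is used instead of the fixed-point arithmetic evaluated later.

```python
def rank1_inverse(h, sigma2):
    """Return F^-1 for F = H^H H + M_T * sigma2 * I via successive rank-1 updates."""
    m_t = len(h[0])
    p = [[1.0 / (m_t * sigma2) if i == j else 0.0 for j in range(m_t)]
         for i in range(m_t)]                                  # P(0)
    for row in h:                                              # row = h_k
        # u = P h_k^H (column vector), v = h_k P (row vector)
        u = [sum(p[i][j] * row[j].conjugate() for j in range(m_t))
             for i in range(m_t)]
        v = [sum(row[i] * p[i][j] for i in range(m_t)) for j in range(m_t)]
        # 1 + h_k P h_k^H is real and positive because P is positive-definite
        denom = (1.0 + sum(row[i] * u[i] for i in range(m_t))).real
        p = [[p[i][j] - u[i] * v[j] / denom for j in range(m_t)]
             for i in range(m_t)]
    return p
```

Each update needs exactly one real-valued division, in line with the ADD, MAC, and DIV operations named above.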
3.4.7 Divide-and-Conquer algorithm

Divide-and-Conquer algorithm (D&C) [13] is a recursive matrix inver-
sion method. The key element of D&C is the partitioning of F as
\[ \mathbf{F} = \begin{bmatrix} \mathbf{A} & \mathbf{B} \\ \mathbf{B}^{H} & \mathbf{C} \end{bmatrix}, \tag{3.10} \]
with A ∈ C^{p×p} and C ∈ C^{(M_T−p)×(M_T−p)}, 1 ≤ p < MT. Then, using
the Banachiewicz formula for the inverse of a partitioned matrix [69],11
F^{-1} is computed as:
\[ \mathbf{F}^{-1} = \begin{bmatrix} \mathbf{A}^{-1} + \mathbf{A}^{-1}\mathbf{B}\mathbf{S}^{-1}\mathbf{B}^{H}\mathbf{A}^{-1} & -\mathbf{A}^{-1}\mathbf{B}\mathbf{S}^{-1} \\ -\mathbf{S}^{-1}\mathbf{B}^{H}\mathbf{A}^{-1} & \mathbf{S}^{-1} \end{bmatrix}, \tag{3.11} \]
with S ≜ C − B^H A^{-1} B being the Schur complement of A in F.
As a result, the task of inverting the MT × MT matrix F can be
replaced by the simpler tasks of inverting the p × p matrix A and the
(MT − p) × (MT − p) matrix S, followed by combining the resulting A^{-1}
and S^{-1} according to (3.11). If the matrix A (or S) has dimension
2 × 2 or less, direct matrix inversion is performed to obtain A^{-1} (or
S^{-1}). Otherwise, A (S) is partitioned as in (3.10), leading to a
recursive procedure to obtain A^{-1} (S^{-1}). The recursion terminates
when the matrix A (S) of the current level of recursion can be inverted
by scalar or 2 × 2 direct inversion.
11 The formula for the inverse of a nonsingular partitioned block matrix
was introduced in 1937 by the astronomer Tadeusz Banachiewicz
(1882-1954) [70]. However, note that, as stated in [69], closely related
results were obtained earlier in 1923 by the geodesist Hans Boltz
(1883-1947) [71] and in 1933 by Ralf Lohan (1902-2000) [72].
The atomic operations required by D&C are ADD, MAC, and DIV.
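The recursion defined by (3.10) and (3.11) can be sketched as follows. The split point p = ⌊n/2⌋ and all helper names are assumptions made for this illustration; the partitioning actually used for the CC figures is the one detailed in Appendix A.4. The lower-left block is obtained as the Hermitian transpose of the upper-right block, which is valid only because F is Hermitian.

```python
def matmul(x, y):
    return [[sum(x[i][k] * y[k][j] for k in range(len(y)))
             for j in range(len(y[0]))] for i in range(len(x))]

def herm(x):
    """Hermitian (conjugate) transpose."""
    return [[complex(x[i][j]).conjugate() for i in range(len(x))]
            for j in range(len(x[0]))]

def sub(x, y):
    return [[a - b for a, b in zip(rx, ry)] for rx, ry in zip(x, y)]

def neg(x):
    return [[-a for a in row] for row in x]

def dc_inverse(f):
    """Invert a Hermitian positive-definite matrix F following (3.10)-(3.11)."""
    n = len(f)
    if n == 1:                                   # scalar inversion
        return [[1.0 / f[0][0]]]
    if n == 2:                                   # direct 2 x 2 inversion, cf. (3.9)
        det = f[0][0] * f[1][1] - f[0][1] * f[1][0]
        return [[f[1][1] / det, -f[0][1] / det],
                [-f[1][0] / det, f[0][0] / det]]
    p = n // 2                                   # assumed split point
    a = [row[:p] for row in f[:p]]
    b = [row[p:] for row in f[:p]]
    c = [row[p:] for row in f[p:]]
    bh = herm(b)
    a_inv = dc_inverse(a)
    a_inv_b = matmul(a_inv, b)
    s_inv = dc_inverse(sub(c, matmul(bh, a_inv_b)))   # inverse Schur complement
    top_right = neg(matmul(a_inv_b, s_inv))           # -A^-1 B S^-1
    bottom_left = herm(top_right)                     # -S^-1 B^H A^-1
    # A^-1 + A^-1 B S^-1 B^H A^-1 = A^-1 - top_right * (B^H A^-1)
    top_left = sub(a_inv, matmul(top_right, matmul(bh, a_inv)))
    return ([ra + rb for ra, rb in zip(top_left, top_right)] +
            [rc + rd for rc, rd in zip(bottom_left, s_inv)])
```

For a 3 × 3 matrix this performs one scalar and one 2 × 2 inversion; for a 4 × 4 matrix it performs two 2 × 2 inversions.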
3.4.8 Results and conclusion

Complexity Table 3.3 summarizes the preprocessing CCs for the
above-described linear MMSE detection methods. The CCs for both
the cases of real- and complex-valued atomic operations are formulated,
as done for the MIMO detector comparison. Figure 3.7 shows the
corresponding complexity-scaling for symmetric (MT = MR ) MIMO
systems and for the processing of one OFDM-subchannel.
When considering an FA that provides real-valued atomic operations,
the LR and LDL algorithms have a slightly lower CC than D&C
[Figure 3.7(a)]. The GS-decomposition, the Rank-1 update algorithm,
and the QR-decomposition have higher CCs. Recall that these
CCs are higher even though the weights wSQRT (for GS) and wANGLE
(for QR) were set to 1. On the other hand, when complex-valued atomic
operations are available, the D&C algorithm has the lowest CC for
a given antenna configuration, which is explained by the fact that
D&C has more complex-valued multiplications and fewer complex-valued
additions than LR has. The remaining rankings remain unchanged
[Figure 3.7(b)].
Thus, neglecting any implementation overhead, an FA that provides
real-valued ADD, MAC, and DIV atomic operations will require
the fewest clock cycles with LR- or LDL-decomposition, whereas
a platform with the same set of complex-valued atomic operations
will require fewer cycles with D&C. For instance, the processing
performance required for a 2 × 2 MIMO-OFDM system with 52
data-carrying OFDM-subchannels amounts to approximately 850 MdOp/s
and 280 MCdOp/s for real- and complex-valued atomic operations,
respectively.
Table 3.3: CC of different linear MMSE detectors – Preprocessing.

Method  Ops.     Preprocessing
LR      C-Ops.   1 − 2 M_R + (−1/3 + 5/2 M_R) M_T + (1 + 3/2 M_R) M_T^2 + M_T^3 / 3
LR      R-Ops.   2 − 4 M_R + (−7/3 + 6 M_R) M_T + (2 + 6 M_R) M_T^2 + 4/3 M_T^3
LDL     C-Ops.   −2 M_R + (4/3 + 5 M_R / 2) M_T + 3/2 (1 + M_R) M_T^2 + M_T^3 / 6
LDL     R-Ops.   −4 M_R + (4/3 + 6 M_R) M_T + (5 + 6 M_R) M_T^2 + 2/3 M_T^3
GS      C-Ops.   −1 + (4 + M_R) M_T + (1/2 + 2 M_R) M_T^2 + M_T^3 / 2
GS      R-Ops.   −2 + (25/3 + 5 M_R) M_T + (2 + 7 M_R) M_T^2 + 5/3 M_T^3
QR      C-Ops.   −1 + (2 + 9 M_R + 2 M_R^2) M_T + 2 M_R M_T^2
QR      R-Ops.   −2 + (3 + 30 M_R + 8 M_R^2) M_T + 8 M_R M_T^2
R1      C-Ops.   2 M_R + 2 M_R M_T + 4 M_R M_T^2
R1      R-Ops.   3 M_R + 8 M_R M_T + 14 M_R M_T^2
D&C     C-Ops.   −9 + (41 + 3 M_R) M_T / 6 + (−1 + 3 M_R) M_T^2 / 2 + 2/3 M_T^3
D&C     R-Ops.   −30 + (127/6 + 2 M_R) M_T + (−3/2 + 6 M_R) M_T^2 + 7/3 M_T^3
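To make the ranking between LR and D&C concrete, the corresponding rows of Table 3.3 can be evaluated exactly. The sketch below is an illustration only: the function names are hypothetical, and the formulas are transcribed from the table as reproduced here, so any transcription error in the table carries over.

```python
from fractions import Fraction as Fr

def cc_lr_complex(m_r, m_t):
    return (1 - 2 * m_r + (Fr(-1, 3) + Fr(5, 2) * m_r) * m_t
            + (1 + Fr(3, 2) * m_r) * m_t ** 2 + Fr(1, 3) * m_t ** 3)

def cc_lr_real(m_r, m_t):
    return (2 - 4 * m_r + (Fr(-7, 3) + 6 * m_r) * m_t
            + (2 + 6 * m_r) * m_t ** 2 + Fr(4, 3) * m_t ** 3)

def cc_dc_complex(m_r, m_t):
    return (-9 + Fr(41 + 3 * m_r, 6) * m_t
            + Fr(-1 + 3 * m_r, 2) * m_t ** 2 + Fr(2, 3) * m_t ** 3)

def cc_dc_real(m_r, m_t):
    return (-30 + (Fr(127, 6) + 2 * m_r) * m_t
            + (Fr(-3, 2) + 6 * m_r) * m_t ** 2 + Fr(7, 3) * m_t ** 3)

if __name__ == "__main__":
    for m in (2, 3, 4):
        print(m, float(cc_lr_real(m, m)), float(cc_dc_real(m, m)),
              float(cc_lr_complex(m, m)), float(cc_dc_complex(m, m)))
```

With these formulas, LR is slightly cheaper than D&C for real-valued operations, while D&C is cheaper for complex-valued operations, which matches the ranking stated above.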
BER performance Figure 3.8 illustrates the uncoded BER performance
achievable in a 4 × 4 MIMO system transmitting 64-QAM
symbols, and where the linear MMSE estimator G is computed
employing the previously described algorithms with 16 bit fixed-point
precision. Each algorithm has been independently optimized for
obtaining the highest possible BER performance. In the figure, the
floating-point BER curve serves as a reference to quantify the
implementation loss. As one can see, the Rank-1 update algorithm breaks
off first from the reference curve and manifests the highest error floor
at BER = 5 · 10^{−2}. The D&C algorithm's error floor is at
BER = 10^{−2}. The LR, LDL, GS, and QR decompositions all have a lower
error floor.
Conclusion For the implementation of MIMO(-OFDM) systems with two
or fewer antennas at the receiver, the adjoint method (direct matrix
inversion) is the most promising. For systems with more than two
antennas at the receiver, the D&C and LR algorithms are the most
promising ones when low CC is desired. However, if the BER performance
attained by D&C and LR is deemed to be insufficient, the GS- and
QR-decomposition come into play at the price of a higher CC. Finally,
for the implementation on an FA, the most reasonable choice is to
start by considering a 2 × 2 MIMO-OFDM system where the CC appears
affordable.
[Figure 3.7: CC versus number of antennas (2 to 5) for D&C,
LR-decomposition, LDL-decomposition, GS-decomposition, Rank-1, and
QR-decomposition. (a) CC accounting real-valued operators. (b) CC
accounting complex-valued operators.]
Figure 3.7: CC-scaling of different linear MMSE MIMO detectors
with MT = MR antennas during the preprocessing phase and for one
OFDM-subchannel.
[Figure 3.8: BER versus SNR [dB] (0 to 60 dB) with curves for
floating point, LR, GS, LDL, QR, R1, and DC.]
Figure 3.8: Uncoded BER performance for a linear MMSE 4 × 4 MIMO
system using different matrix decomposition methods. Floating point
vs. quantized 16 bit fixed-point operation.
[Figure 3.9: frame structure over time for receive antennas 1 and 2.
Preamble: STF (8 µs; t1, t2, . . . , t10), LTF 1 (8 µs; GI2, T1A, T1B),
LTF 2 (4 µs; GI, T2). Data payload: MIMO-OFDM symbols S1 to SNd
(4 µs each; GIm, Dm). 1 µs corresponds to 20 samples. Receiver states:
1) frame detection, 2) STF processing, 3) LTF processing, 4) MIMO
channel processing, 5) data processing; states 2) to 4) form the
preprocessing phase.]
Figure 3.9: 2 × 2 MIMO-OFDM-frame structure (top), and
corresponding receiver states (bottom).
3.5 MIMO-OFDM Receiver Algorithms

Figure 3.9 shows the time-domain frame structure for the MIMO-
OFDM system considered in this thesis.12 Although the illustrated
frame-structure is specific for the 2 × 2 case, it can easily be extended
to the generic MR × MT case. The considered frame structure is
similar to that of the IEEE 802.11n standard [8] in HT-Greenfield
mode.
Frames start with a short training field (STF) that is composed
of ten identical short training sequences (t1 , t2 , . . . , t10 ), each of
length NSP samples. This sequence is designed to support frame-start
detection, automatic gain control adjustment,13 and coarse frequency
offset estimation. The STF is followed by a sequence of MT long
training fields (LTF1, LTF2, . . . , LTFMT ). The first long training
field (LTF1) comprises a long guard interval (GI2) of NGI2 samples and
two identical long training symbols (T1A and T1B ), each of length NLP
samples. LTF1 is exploited to refine the frequency offset estimation
and participates in the channel estimation together with the remaining
long training fields (LTFn, n = 2, 3, . . . , MT ). Each of the remaining
LTFs is composed of a guard interval (GI) of length NGI samples,
followed by a training symbol Tn of length NLP . The MIMO-OFDM
data symbols Sm have a GIm of NGI samples and carry the data Dm
(m = 1, 2, . . . , Nd ). The number of data-carrying OFDM-subchannels
is Nc and the remaining N − Nc subchannels are either unused or carry
pilot symbols. One OFDM-symbol has a duration Tsym and a length
of Ns = NGI + N samples, at a sample rate fs . The OFDM parameters
for the system under consideration are reported in Table 3.4.
12 In the following, the term OFDM-frame is used when referring to a
SISO-OFDM-frame as well as to a MIMO-OFDM-frame.
13 Automatic gain control (AGC) has not been implemented in this work.
Table 3.4: OFDM modulation parameters for the system under
consideration.

Parameter  N   Nc  NSP  NLP  NGI2  NGI  Tsym  Ns  fs
Value      64  52  16   64   64    16   4 µs  80  20 MHz
Based on the above-described frame structure, proper reception of
an OFDM-frame can be divided into five states: frame-start detection,
STF processing, LTF processing, MIMO channel processing, and data
payload processing. The bottom section of Figure 3.9 illustrates how
these five receiver states are traversed during the reception of an
OFDM-frame. Note that the exact point in time for switching from
one receiver state to the next varies depending on the quality of the
received signal and, consequently, on when the frame start is detected.
Typically, the first 4 to 6 short training sequences are corrupted by
AGC.
3.5.1 Frame-start detection

The frame-start detection is the receiver’s idle state. In this state, the
presence of a new OFDM-frame has to be detected by analyzing the
incoming received BB samples. The corresponding detection algorithm
is extended from the well established single-antenna algorithm proposed
in [73]. The basic idea of this extended algorithm is to compute two
metrics p_L^{(n)}[d] and m_L^{(n)}[d] for each time-domain BB sample
r^{(n)}[d], for all receive antennas n = 1, . . . , MR, according to
\[ p_L^{(n)}[d] = \sum_{j=0}^{L-1} r^{(n)}[d+j]^{H} \cdot r^{(n)}[d+j+L] \tag{3.12} \]
\[ m_L^{(n)}[d] = \sum_{j=0}^{L-1} \left| r^{(n)}[d+j+L] \right|^{2} \tag{3.13} \]
with L = NSP . As a result, (3.12) correlates the received BB samples
over the length of two adjacent short training sequences and (3.13)
computes the energy over the length of one short training sequence.
Next, the p_L^{(n)}[d] and m_L^{(n)}[d] metrics from all receive
antennas are averaged to obtain
\[ \bar{p}_L[d] = \frac{1}{M_R} \sum_{j=1}^{M_R} p_L^{(j)}[d], \tag{3.14} \]
\[ \bar{m}_L[d] = \frac{1}{M_R} \sum_{j=1}^{M_R} m_L^{(j)}[d]. \tag{3.15} \]
As a last step, p̄_L[d] and m̄_L[d] are compared. A frame start is
detected for the first discrete sample-time index d = d̂_SP that satisfies
the threshold detection inequality
\[ \left| \bar{p}_L[d] \right|^{2} > \left| \bar{m}_L[d] \right|^{2} / 2. \tag{3.16} \]
In that case, the receiver proceeds to the STF processing state.

The atomic operations required to compute the correlation (3.12)
and the mean energy (3.13) are multiply and accumulate (MAC)
operations, the arithmetic means in (3.14) and (3.15) require additions,
whereas the threshold detection in (3.16) requires comparisons.
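The detection rule (3.12) to (3.16) can be sketched in a few lines. This is a direct, non-optimized illustration: the function name and the list-of-lists input format are assumptions, and the sums are recomputed for every d instead of being updated incrementally as a streaming implementation would do.

```python
def frame_start(r, L):
    """Return the first index d satisfying (3.16), or None.

    r: one list of complex BB samples per receive antenna (equal lengths).
    L: correlation length (N_SP during frame-start detection).
    """
    m_r = len(r)
    n = len(r[0])
    for d in range(n - 2 * L + 1):
        # (3.12) and (3.14): correlation, averaged over the antennas
        p_bar = sum(sum(r[a][d + j].conjugate() * r[a][d + j + L]
                        for j in range(L)) for a in range(m_r)) / m_r
        # (3.13) and (3.15): energy, averaged over the antennas
        m_bar = sum(sum(abs(r[a][d + j + L]) ** 2
                        for j in range(L)) for a in range(m_r)) / m_r
        if abs(p_bar) ** 2 > abs(m_bar) ** 2 / 2:               # (3.16)
            return d
    return None
```

Because the threshold is crossed as soon as most of the correlation window overlaps the periodic training sequence, the returned index can lie slightly before the true frame start, which is one reason for the subsequent fine time-synchronization.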
3.5.2 STF processing

Once a frame start has been detected, the remaining short training
sequences are exploited to roughly estimate the rotation induced by
the carrier frequency offset on the received BB samples, i.e., to perform
the coarse frequency offset estimation (FOE). The corresponding phase
increment between two consecutive received BB samples is given by
φ = ∠(p̄_{NSP}[d̂_SP])/NSP .14 The phase φ can be computed by the CORDIC
algorithm (COordinate Rotation DIgital Computer, e.g. [74, 75]),
for which the required atomic operations are additions, shifts, and
comparisons.
To compensate for the estimated frequency offset at the receiver, all
received time-domain BB samples are rotated through multiplication
with a complex-valued phasor: r̃[d] = r[d] · e^{−jφd}. The atomic
operations associated with this frequency offset compensation (FOC) are
real-valued multiplications and additions (which together correspond
to a complex-valued multiplication).
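A single-antenna sketch of the coarse FOE and FOC steps is given below. It is an illustration under stated assumptions: `cmath.phase` stands in for the CORDIC-based angle computation, the antenna averaging of (3.14) is omitted, and the function names are hypothetical.

```python
import cmath

def coarse_foe(r, d_hat, L):
    """Estimate the per-sample phase increment phi = angle(p_L[d_hat]) / L."""
    p = sum(r[d_hat + j].conjugate() * r[d_hat + j + L] for j in range(L))
    return cmath.phase(p) / L          # unambiguous only for |phi * L| < pi

def foc(r, phi):
    """Rotate every sample back: r~[d] = r[d] * exp(-j * phi * d)."""
    return [x * cmath.exp(-1j * phi * d) for d, x in enumerate(r)]
```

The unambiguous range noted in the comment is a consequence of the correlation lag: the longer period NLP used in the LTF processing state gives a finer but narrower estimate, which suits a two-stage (coarse, then fine) estimation.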
Next, fine time-synchronization takes place in order to refine the
estimate of the location of the OFDM-symbol boundaries. To this
end, the computations executed for the frame start detection [(3.12)–
(3.16)] are repeated on r̃[d], with the only difference that L = NLP
now corresponds to the period of the two long training sequences T1A
and T1B . The start of LTF1 is detected for the first sample (where
d = d̂_LP ) satisfying the threshold detection inequality (3.16). Then,
the time-domain frame start index is updated and the receiver proceeds
to the LTF processing state.
3.5.3 LTF processing

On the first part of LTF1, carrier FOE is performed a second time
based on the autocorrelation p̄_{NLP}[d̂_LP ], to allow for a more accurate
estimation of the residual phase rotation remaining after coarse FOC.
The residual phase rotation is compensated, involving the same algo-
rithms and atomic operations required for computing the phase φ and
for the coarse FOC performed in the STF processing state.
After removing GI2, the arithmetic mean between the first and the
second long training sequences is computed (i.e., T1 = (T1A + T1B )/2).
For each receive stream, the resulting N arithmetic-mean values are
transformed by an N-point Fourier transform, leading to the received
frequency-domain vectors zk [1] ∈ C^{MR}, k = 1, 2, . . . , N. Next, the
GI is removed from the following long training fields (LTFn) and
the residual phase rotation is compensated. The remaining samples
are N-point Fourier transformed as well, resulting in the received
frequency-domain vectors zk [n], n = 2, . . . , MT . The FFT's atomic
operations are additions and multiplications. It is important to note
that for the computation of the FFT all N input samples are required
concurrently, whereas for the preceding computations the processing
is executed at sample rate.15 Hence, during LTF processing, the
processing granularity changes from single sample to entire
OFDM-symbol and it remains unchanged until the end of the OFDM-frame.
FOC represents the only exception and continues at sample rate.
14 The atomic operation ∠(z) returns the angle (argument) of a complex
number z, i.e., the angle determined by its real and imaginary parts.
Once the last long training field is received, frequency-offset com-
pensated, and Fourier transformed, the wireless MIMO channel can
be estimated. To perform this operation, the receiver switches to the
MIMO channel processing state.
3.5.4 MIMO channel processing

The knowledge of the transmitted long training fields is exploited at
the receiver to estimate the MIMO channel Ĥk for each subchannel
k. During the training phase, the received frequency-domain vectors
for the kth OFDM-subchannel,
\[ \mathbf{Z}_k = \begin{bmatrix} \mathbf{z}_k[1], & \mathbf{z}_k[2], & \ldots, & \mathbf{z}_k[M_T] \end{bmatrix} \in \mathbb{C}^{M_R \times M_T}, \]
can be described by Zk = Hk Tk + nk , where the MT × MT -dimensional
matrix Tk is the known training sequence. With this knowledge,
Ĥk = Zk Tk^{-1} yields the ZF channel estimate.
In the system under consideration, Tk is a Hadamard matrix,
which implies that the scaled entries of the corresponding inverse
Tk^{-1} are +1 or −1. Therefore, the atomic operations for the matrix
multiplication required to compute Ĥk can be reduced to only additions
and subtractions, followed by the correct scaling; or, if preferred, the
matrix multiplication can be performed utilizing the multiplication
atomic operation. Next, the channel estimate Ĥk is used to obtain the
linear MMSE estimator matrix [recall (3.6)]
\[ \mathbf{G}_k = \left( \hat{\mathbf{H}}_k^{H} \hat{\mathbf{H}}_k + M_T \sigma_k^{2} \mathbf{I}_{M_T} \right)^{-1} \hat{\mathbf{H}}_k^{H} = \mathbf{F}_k^{-1} \hat{\mathbf{H}}_k^{H}, \]
where σk^2 denotes the noise variance on subchannel k.
The matrix inversion required to obtain Fk^{-1} can be performed
by one of the methods presented earlier in Section 3.4. For the 2 × 2
MIMO-OFDM receiver considered in Chapter 5, direct matrix
inversion is used. The atomic operations required to compute Gk are
multiplications and additions (matrix multiplication). Matrix inversion
additionally requires real-valued divisions as atomic operations. After
computing the MMSE estimator Gk , the receiver proceeds to the data
processing state for decoding the OFDM-symbols that carry the actual
payload.
15 In practical systems, the input samples of the FFT may be processed
slightly staggered, potentially exploiting pipelining and thus slightly
reducing the FFT's processing latency.
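For the 2 × 2 case, the two steps of this state can be sketched as follows. The concrete training matrix T = [[1, 1], [1, −1]] is an assumption for this illustration (the text only states that Tk is a Hadamard matrix), as are the function names; the inversion uses the direct 2 × 2 method of Section 3.4.1.

```python
def estimate_channel_2x2(z):
    """ZF estimate H^ = Z T^-1 for the assumed T = [[1, 1], [1, -1]].

    Since T^-1 = T / 2, only additions, subtractions, and a scaling are needed.
    """
    return [[(row[0] + row[1]) / 2, (row[0] - row[1]) / 2] for row in z]

def mmse_estimator_2x2(h, sigma2):
    """G = (H^H H + M_T * sigma2 * I)^-1 H^H with M_T = 2."""
    hh = [[h[j][i].conjugate() for j in range(2)] for i in range(2)]   # H^H
    f = [[sum(hh[i][k] * h[k][j] for k in range(2)) for j in range(2)]
         for i in range(2)]
    f[0][0] += 2 * sigma2
    f[1][1] += 2 * sigma2
    det = (f[0][0] * f[1][1] - f[0][1] * f[1][0]).real   # real for Hermitian F
    f_inv = [[f[1][1] / det, -f[0][1] / det],
             [-f[1][0] / det, f[0][0] / det]]
    return [[sum(f_inv[i][k] * hh[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]
```

In a receiver, both functions would be evaluated once per data-carrying subchannel k, using the noise variance of that subchannel.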
3.5.5 Data processing

During the data processing state, the guard interval GIm of each
received OFDM-symbol Sm is removed. Fine FOC is applied to the
remaining time-domain samples of the OFDM-symbol. The result
is then Fourier transformed into frequency domain, leading to the
received vectors yk (see Figure 3.1). Next, for each subchannel k,
sk is estimated by first computing ŷk = Gk yk , followed by detection.
Detection maps the entries of ŷk to the nearest constellation points in
A, resulting in the detected vector-symbol ŝk = Q(ŷk ) (cf. Figure 3.1).
Then, the constellation points composing ŝk are directly translated
into the corresponding binary labels (demapped) and the so-resulting
bitstream is de-interleaved and directed to a Viterbi decoder.16
The computation of ŷk requires a complex-valued matrix-vector
multiplication where the atomic operations are multiplications and
additions. Detection, to obtain ŝk , can be performed by shift operations.
When performing hard detection, the subsequent demapping requires a
table look-up, or, for soft detection, it requires a series of comparisons,
additions, and multiplications [53, 62].
16 Viterbi decoding is not taken into account for the implementation on
an FA. Its required performance (approximately MT · 4000 MdOp/s) is too
high for an efficient implementation on an FA and will thus be
performed on a dedicated hardware block.
3.5.6 Computational complexity of the presented algorithms
The CCs for the five receiver states described in the previous section
are presented in Table 3.5. The complexities are given for a generic
MR × MT system and account for real-valued atomic operations. Fig-
ure 3.10 visualizes the CCs resulting for symmetric (MT = MR )
MIMO-OFDM systems and derives the corresponding processing per-
formance for sustaining real-time operation. These requirements are
obtained with the OFDM configuration summarized in Table 3.4 and
by constraining the duration of the duty cycle Tdc , i.e., the duration
of each task in a state, to the duration Tdc = Tsym = 4 µs of one
OFDM-symbol. Consequently, the number of samples to be processed
in one duty cycle amounts to Ndc = Ns = 80, accommodating the
computation’s granularity over all three receiver phases.
The processing performance requirements for a given antenna
configuration are dictated by the receiver state with the highest CC.
In particular, for the symmetric system at hand, it is dictated by
the LTF processing and data processing, for systems with up to
2 antennas. For systems with more antennas, the CC of MIMO
channel processing, which grows polynomially and is inherent to matrix
inversion/manipulation, prevails over the CCs of the remaining states
and sets the processing performance mark.
Clearly, the estimates in Table 3.5 (and Figure 3.10) depend on the
specific implementation of the algorithm and do not take any overhead
(e.g., control, address-generation, load/store, or sorting of data) into
account. Nevertheless, they constitute a valid indicator of the required
processing performance and are instrumental in identifying the hard
computational kernels and suitable custom datapath configurations for
an implementation on an application-specific processor (such as the
one described in Chapter 5). In summary, the complexity evaluation
presented here indicates that a SISO-OFDM receiver requires a
platform delivering a pure data-processing performance of
approximately 680 MdOp/s, and a 2 × 2, 3 × 3, and 4 × 4 MIMO-OFDM
receiver a platform delivering approximately 1 530, 3 650, and
8 580 MdOp/s, respectively. As a comparison, Appendix A of [34]
formulates the processing requirements for an IEEE 802.11a receiver
(SISO system), which amount to 600 MdOp/s.
Table 3.5: CC for an MR × MT MIMO-OFDM receiver (estimated
real-valued CC per state and task).

Frame-start detection
Task             MAC                          MUL       ADD
Correlation      4MR (NSP + 2(Ndc − 1))       –         2Ndc ⌈log2(MR)⌉
Mean energy      2MR (NSP + 2(Ndc − 1))       –         Ndc ⌈log2(MR)⌉
Th. detection    –                            4Ndc      3Ndc

STF processing
Task             MAC                          MUL       ADD
Phase comp.      –                            –         96
FOC              –                            4MR Ndc   2MR Ndc
Correlation      4MR (NLP + 2(Ndc − 1))       –         2Ndc ⌈log2(MR)⌉
Mean energy      2MR (NLP + 2(Ndc − 1))       –         Ndc ⌈log2(MR)⌉
Th. detection    –                            4Ndc      3Ndc

LTF1 processing
Task             MAC       MUL            ADD
Phase comp.      –         –              96
FOC              –         2MR · 4Ndc     2MR · 2Ndc
Mean LTF1        –         –              2NLP
FFT LTF1 (a)     –         768MR          768MR

LTFn processing, for n = 2, 3, . . . , MR
Task             MAC       MUL            ADD
FOC              –         MR · 4Ndc      MR · 2Ndc
FFT LTFn (a)     –         768MR          768MR

MIMO channel processing
Task             MAC               DIV         ADD
Channel est. Ĥ   –                 –           2MR MT^2 Nc
Matrix G (b)     Nc NDC,MAC (c)    Nc MT /2    Nc NDC,ADD (d)

Data processing
Task                  MAC                   MUL                ADD
FOC                   –                     MR · 4Ndc          MR · 2Ndc
FFT (a)               –                     768MR ⌊Ndc /80⌋    768MR ⌊Ndc /80⌋
Hard detect (64QAM)   4Nc MT MR ⌊Ndc /80⌋   –                  12Nc MR ⌊Ndc /80⌋

(a) Assuming a radix-2 FFT implementation.
(b) D&C is used for the matrix inversion.
(c) NDC,MAC = 4(−6 + (4 − MR /2)MT + (−1 + 6MR )MT^2 /4 + MT^3 /2).
(d) NDC,ADD = −6 + 14/3 MT − MT^2 /2 + MT^3 /3.
Table 3.6: 2 × 2 MIMO-OFDM receiver processing requirements (2 × 2 system).

State / Task                       Op        MdOp/s
Frame start detection
  Correlation (3.14), L = N_SP     1’552     388
  Mean energy                      776       194
  Th. detection                    560       140
  TOTAL                            2’888     722
STF processing
  Phase comp.                      96        24
  FOC                              960       240
  Correlation                      1’936     484
  Mean energy                      968       242
  Th. detection                    560       140
  TOTAL                            4’520     1’130
LTF1 processing
  Phase comp.                      96        24
  FOC                              1’920     480
  Mean LTF1                        128       32
  FFT LTF1 (a)                     3’072     768
  TOTAL                            5’216     1’304
LTF2 processing
  FOC                              960       240
  FFT LTF2 (a)                     3’072     768
  TOTAL                            4’032     1’008
MIMO channel processing
  Channel est. Ĥ                   832       208
  Matrix G (b)                     3’380     845
  TOTAL                            4’212     1’053
Data processing
  FOC                              960       240
  FFT (a)                          3’072     768
  Hard detect (64QAM)              2’080     520
  TOTAL                            6’112     1’528

(a) 64-point radix-2 FFT implementation.
(b) Direct matrix inversion is used.
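The relation between the formulas of Table 3.5 and the numbers of Table 3.6 can be retraced with a short script. The following is only a sketch: the parameter values N_SP = 16, N_LP = 64, N_dc = 80, and N_c = 52, as well as the conversion of operations into MdOp/s by dividing by a 4 µs OFDM-symbol duration, are assumptions inferred from matching the table entries, and the function names are hypothetical.

```python
from math import ceil, floor, log2

# Assumed parameter values, inferred by matching the entries of Table 3.6.
N_SP, N_LP, N_DC, N_C = 16, 64, 80, 52
T_SYM_US = 4.0  # assumed OFDM-symbol duration in microseconds


def frame_start_ops(m_r):
    """Real-valued operations (MAC + MUL + ADD) per frame-start task."""
    tree = ceil(log2(m_r))  # depth of the adder tree over the receive antennas
    correlation = 4 * m_r * (N_SP + 2 * (N_DC - 1)) + 2 * N_DC * tree
    mean_energy = 2 * m_r * (N_SP + 2 * (N_DC - 1)) + N_DC * tree
    threshold = 4 * N_DC + 3 * N_DC
    return correlation, mean_energy, threshold


def data_processing_ops(m_r, m_t):
    """Real-valued operations per data-processing task (FOC, FFT, detection)."""
    symbols = floor(N_DC / 80)
    foc = m_r * 4 * N_DC + m_r * 2 * N_DC
    fft = 2 * 768 * m_r * symbols
    detect = 4 * N_C * m_t * m_r * symbols + N_C * m_r * 12 * symbols
    return foc, fft, detect


def mdops(ops):
    """Operations per symbol converted to MdOp/s (ops per microsecond)."""
    return ops / T_SYM_US


if __name__ == "__main__":
    for name, ops in (("frame start", frame_start_ops(2)),
                      ("data processing", data_processing_ops(2, 2))):
        print(name, ops, "total:", sum(ops), "->", mdops(sum(ops)), "MdOp/s")
```

For M_R = M_T = 2 the script yields the per-task counts of the frame start detection and data processing blocks of Table 3.6; the MIMO channel processing row is left out because Table 3.6 uses direct matrix inversion, whose operation count is not given in closed form here.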
Figure 3.10: CC and processing requirements for a 1 × 1 (SISO), a
2 × 2, a 3 × 3, and a 4 × 4 MIMO-OFDM receiver. [Bar chart: CC versus
number of antennas (1 to 4), broken down into frame start detection,
STF, LTF, MIMO, and data payload; annotated processing requirements:
680, 1528, 3653, and 8580 MdOp/s.]
3.6 Summary and Conclusion

Summary This chapter started with the description of the con-
sidered MIMO-OFDM system along with the evaluation of different
MIMO detectors. Linear MMSE detection appeared to have the best
trade-off between CC and achievable BER, for an FA. The following
fixed-point assessment of methods for computing the linear MMSE
estimator matrix G showed that the spectrum of achievable BER per-
formance vs. CC is vast. For a 2 × 2 system, direct matrix inversion
was identified as best candidate method, whereas for systems with
more antennas either the D&C method, the LR-decomposition, or QR-
decomposition are the most promising, depending on whether the focus
lies more on low CC or good BER performance. Once linear MMSE
detection was determined as the detection method, the description of
the three receiver phases – frame-start detection, preprocessing, and
data processing – could be tackled, eventually leading to the detailed
complexity analysis of the entire MIMO-OFDM receiver.
Conclusion For implementing a MIMO-OFDM receiver on an FA,
linear MMSE detection seems the most reasonable
choice to start with. This statement is primarily dictated by the
limited processing resources available on these platforms. In addition,
for assessing the challenges of a practical implementation, it seems
reasonable to start with a SISO-OFDM receiver, upon which the 2 × 2
MIMO-OFDM receiver can be built. The architecture exploration in
the next chapter follows exactly this line.
Chapter 4
Design Space
Exploration
In this chapter, three different SPAs are evaluated. The architectures
are representatives of three different signal processor classes: the Texas
Instruments (TI) C6455 represents the general purpose DSP class, the
MSEC4 is a special purpose BB processor, whereas the ASPE was
designed for multimedia streaming applications. The evaluation aims
at characterizing these architectures for OFDM BB processing in
an exemplary manner and, although it is far from exercising all
corners of the design space, it helps to distinguish important from
unnecessary properties.
The considered architectures are briefly described, unveiling their
key characteristics. Subsequently, some of the hard BB processing
kernels identified in Chapter 3 are mapped onto each architecture, to
assess the available processing performance. The chapter ends
with a discussion of the achieved results and with the suggestion of
a suitable SPA.
4.1 C6455
The C6455 [76] is a commercial high-performance fixed-point VLIW
DSP. Its core is depicted in Figure 4.1. The CPU consists of fetch and
decode logic together with two identical datapaths. Each datapath
incorporates four units, namely: a logical/arithmetic unit (.L), a
shift/branch unit (.S), a multiply unit (.M), and a data load/store
unit (.D). In each clock cycle, up to
eight instructions can be fetched from the instruction memory and
assembled to form one VLIW. Thus, up to eight instructions per
clock cycle can be executed in parallel, exploiting the eight units. The
memory hierarchy is composed of two levels. L1 is a 32 KiB cache and
L2 a 2 MiB on-chip RAM. The C6455 is implemented in 90 nm CMOS
technology, occupies an area of 91 mm^2, and runs at a clock frequency
of 1 GHz. Figure 4.2 depicts the C6455 die used for measuring the
silicon area. The C6455’s peak performance is 4’000 MdOp/s. A
powerful software development kit (SDK), offering many options to
produce optimized code, is available for programming the DSP (Code
Composer Studio).

Figure 4.1: The C6455’s core. [Block diagram: C64+ CPU with instruction
fetch, SPLOOP buffer, 16/32-bit instruction dispatch, and instruction
decode; datapath 1 (.L1, .S1, .M1, .D1, register file A) and datapath 2
(.D2, .M2, .S2, .L2, register file B); L1P and L1D cache/SRAM with their
memory controllers; L2 cache/SRAM with its memory controller; IDMA; DMA
slave interface and master port; interrupt and exception controller;
power control; advanced event triggering (AET).]
4.1.1 SISO-OFDM transceiver

The implementation of an acoustic single-antenna OFDM transmitter
and receiver on the C6455 allows assessing its suitability as a BB
processor [77]. The lower data rates of the acoustic domain permit
a real-time and real-life implementation, with data streaming across
the C6455’s input and output ports, while the programming does not
require a thorough optimization of the employed algorithms. In
addition, the acoustic physical front-end is more economical than its RF
counterpart.¹

Figure 4.2: Die photograph of the C6455. The black dots are solder
balls remaining after etching the package away.

The OFDM parameters used for the acoustic communication system
are summarized in Table 4.2 at the end of this section. Out of the 64
allocated OFDM-subchannels, 54 carry data. To avoid digital up and
down conversion, the real-valued acoustic passband signal is generated
using a 128-point FFT. The first 64 tones are determined by the
transmit constellation points, and the remaining 64 correspond to the
symmetric complex conjugate of the first half (i.e., the constellation
points on the subchannels k = 64, 65, ..., 127 are determined by
s_k = s_{127−k}^H). Eventually, one time-domain OFDM-symbol consists of 208
samples, of which 80 are GI samples. It is interesting to note that the
duration of the acoustic-domain GI is relatively long compared to the
OFDM-symbol duration; this is necessary to absorb the delay spread
of the acoustic channel.²

¹ Although not specifically considered here, all important physical RF system
impairments are recognizable in the acoustic domain as well, enabling the study of
appropriate countermeasures (e.g., sample rate estimation and tracking, carrier
frequency offset estimation and tracking, etc.).

Figure 4.3: Acoustic transceiver setup with two C6455 DSK boards.
[Setup: Tx PC and Rx PC, each connected to a C6455 DSK board over an
Ethernet link.]
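The principle of obtaining a real-valued passband signal directly from a 128-point inverse transform can be illustrated with a small sketch. It uses the textbook DFT conjugate-symmetry condition X[N − k] = X[k]*, with the DC and Nyquist bins set to zero, which is an assumption of this sketch; the exact index convention of the implemented system may differ. The helper names and the test constellation values are hypothetical, and a naive inverse DFT replaces the FFT for brevity.

```python
import cmath

N = 128  # transform size used for the real-valued passband signal


def hermitian_extend(half):
    """Fill N bins so that X[N-k] = conj(X[k]) for k = 1 .. N/2 - 1.

    Only half[1:] is used; the DC (k = 0) and Nyquist (k = N/2) bins are
    left at zero, which keeps the time-domain signal real.
    """
    bins = [0j] * N
    for k in range(1, N // 2):
        bins[k] = half[k]
        bins[N - k] = half[k].conjugate()
    return bins


def idft(bins):
    """Naive O(N^2) inverse DFT (stand-in for the inverse FFT)."""
    n_bins = len(bins)
    return [
        sum(x * cmath.exp(2j * cmath.pi * k * n / n_bins)
            for k, x in enumerate(bins)) / n_bins
        for n in range(n_bins)
    ]


# Arbitrary complex test values on the first 64 tones (not a real mapping).
half = [complex((k % 3) - 1, (k % 5) - 2) for k in range(N // 2)]
samples = idft(hermitian_extend(half))
max_imag = max(abs(s.imag) for s in samples)
print(f"largest imaginary part: {max_imag:.2e}")
```

The imaginary parts vanish up to rounding errors, so only the real parts need to be handed to the audio codec, and no separate up-conversion stage is required.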
Figure 4.3 illustrates the system setup. Both the transmitter and
the receiver are composed of a PC connected to a C6455 DSK board,
over an Ethernet link. On the transmit side, the PC generates random
data that is assembled into Ethernet packets and sent to the transmit
DSK board over the Ethernet link.
The transmit C6455 generates the OFDM-frame’s preamble while
the first transmit data samples arrive in the DSP’s input buffer via
the Ethernet link. The preamble comprises an STF and an LTF, as
described in Section 3.5 (Figure 3.9). The STF is composed of ten
identical short training sequences, each having a length of 16 samples,
whereas the LTF is composed of two identical long training symbols
that have a length of 208 samples each. The data payload is rate
R = 1/2 convolutionally encoded, with the generator polynomials
[133_8 171_8] (octal) and constraint length 7. The encoded data is then
interleaved and
OFDM modulated through inverse FFTs. Finally, the OFDM-frame is
written to the output buffer, from where it is sent to the board’s audio
codec. Double buffering of the output data avoids interference between
data processing and streaming out the processed data. The two output
buffers both have a size of NObuf samples and are read at a rate of
48 kS/s by the audio codec. Hence, a full output buffer is emptied
in TObuf = NObuf /(48 kS/s), and real-time operation is possible if the
processing of NObuf transmit samples takes a time Tp < TObuf.

² The measured delay spread is approximately 1 ms in the anechoic chamber.
This constraint is fulfilled in the considered acoustic implementation,
where the output buffer’s size is chosen to be NObuf = 41’600
samples, leading to TObuf = 866 ms.³ The profiling results for a 64-QAM
transmission are summarized in Table 4.1: one OFDM-symbol
is processed in 22.8 µs and thus the entire OFDM-frame, consisting
of 200 OFDM-symbols, is processed in Tp = 200 · 22.8 µs = 4.6 ms <
TObuf = 866 ms.
The receive C6455 double buffers the data samples received over
the audio codec’s input line. Each input buffer has a size of NIbuf =
NObuf received samples. These samples are processed block-wise for
detecting the OFDM frame start, in accordance with the method
described in Section 3.5.1, [(3.14)-(3.16)]. Once a frame has been
detected and the timing reconstructed, the acoustic channel is esti-
mated using the LTF. Next, the time-domain OFDM-symbols are
FFT-ed, demapped taking the channel estimates into account, and
deinterleaved. Note that neither FOE nor FOC are performed on the
receiver, reducing the processing power requirements (and, of course,
also the BER performance). Finally, the on-chip Viterbi coprocessor is
set up to decode the received data and to convey them to the receive
PC. The BER performance of the acoustic communication system is
computed on the receive-PC by comparing the transmitted with the
received data.
For the receiver to operate in real-time, the processing of the input
data buffer has to be finished in Tp < TIbuf (= TObuf ). Or, equivalently,
the processing rates associated with the three receiver phases frame-start
detection, preprocessing, and data processing have to be respected. As
detailed in the next paragraph, the real-time constraint is respected
during all three phases.

³ The length of one OFDM-symbol is NGI + N = 80 + 128 samples. The
output buffer can hold at most 41’600/208 = 200 OFDM-symbols, cf. Table 4.2.

Table 4.1: Processing times resulting from the profiling of the
transmission (Tx) and reception (Rx) of one 64-QAM OFDM-symbol on
the C6455. In italics (columns marked RF), the processing times for the
RF system discussed in Section 4.1.2.

                                        Time [µs]                          Time [µs]
Phase                   Tx-Task         Acoustic  RF    Rx-Task            Acoustic  RF
Frame-start detection   –               –         –     Frame start det.   1600      1600
Preprocessing           –               –         –     Channel est.       3.3       2.9
Data proc.              Encode          10.7      9.5   Decode             12.6      11.2
                        Interleave      2         1.8   Deinterleave       1.2       1.1
                        Map             8.8       7.8   Demap              10.2      9.0
                        I-FFT           1.3       0.5   FFT                1.3       0.5
Total                                   22.8      19.6                     12.7      10.6
Frame start detection over the NIbuf received samples requires
1.6 ms in the worst case, i.e., when no frame start is detected and all
NIbuf samples have to be processed. The corresponding per-sample
processing completes in 38.5 ns, which is much less than the sample
duration of 20.8 µs, satisfying the real-time constraint. The subsequent
preprocessing, which consists of estimating the channel, is completed
in 3.3 µs and can thus be performed within the duration Tsym = 4.3 ms
of one OFDM-symbol. Then, as the data processing time breakdown
in Table 4.1 reveals, computing the 128-point FFT requires 1.3 µs,
the subsequent demapping takes 10.2 µs, and the deinterleaving 1.2 µs,
summing up to a total of 12.7 µs. Viterbi decoding on the DSP’s
on-chip coprocessor occurs concurrently, and one 64-QAM OFDM-symbol
(324 raw bits or 162 data bits) is decoded in 12.6 µs. The
processing time for one OFDM-symbol is dictated by the longer of
the two concurrent tasks, and is thus 12.7 µs. Again, the real-time
constraint is met.
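The real-time conditions stated for the acoustic transmitter and receiver can be collected in a few lines. The sketch below only re-evaluates the quoted profiling figures against the time budgets that follow from the 48 kS/s codec rate; the buffer size is taken as 200 OFDM-symbols of 208 samples, and the dictionary keys are hypothetical labels.

```python
FS = 48_000          # audio codec sample rate [S/s]
N_BUF = 200 * 208    # buffer size [samples]: 200 OFDM-symbols of 208 samples


def realtime_report():
    """Return (required, available) time pairs in seconds for each check."""
    t_buf = N_BUF / FS                       # time to fill or drain one buffer
    t_sym = 208 / FS                         # acoustic OFDM-symbol duration
    rx_serial = 1.3e-6 + 10.2e-6 + 1.2e-6    # FFT + demapping + deinterleaving
    return {
        "tx_frame": (200 * 22.8e-6, t_buf),
        "rx_frame_start_per_sample": (1.6e-3 / N_BUF, 1 / FS),
        "rx_channel_estimation": (3.3e-6, t_sym),
        # Viterbi decoding runs concurrently on the coprocessor, so the
        # longer of the two branches sets the per-symbol processing time.
        "rx_symbol": (max(rx_serial, 12.6e-6), t_sym),
    }


report = realtime_report()
for name, (required, available) in report.items():
    print(f"{name}: {required:.3e} s of {available:.3e} s available, "
          f"ok={required < available}")
```

All four checks leave a margin of two to three orders of magnitude, which is why the acoustic system tolerates unoptimized code.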
4.1.2 Results, discussion, and conclusion

Results A first evident result is that the acoustic SISO-OFDM
system implemented on the C6455 operates in real-time, respecting the
OFDM system parameter specifications of Table 4.2. The maximum
design data rate achievable with these specifications is obtained when
transmitting 64-QAM OFDM-symbols and amounts to 75 kbit/s.⁴
Unfortunately, the much harder RF constraints lead to a different
situation. The profiling results for the RF domain have been
obtained by compiling the transmit and receive programs with the
RF OFDM-system parameters in Table 4.2, and by setting the size of
the FFT to 64 points. The profiling results are reported in italics in
Table 4.1. The computations performed for frame start detection are
the same as in the acoustic domain, and hence 38.5 ns are necessary
to process one received sample. The duration of one received sample in
the RF domain, however, is only 50 ns, which reduces the safety
margin but still allows for real-time operation. The preprocessing is
completed in 2.9 µs, which is less than the duration Tsym = 4 µs of one
OFDM-symbol. The hurdle comes during the data processing. The
transmitting C6455 requires 19.6 µs to prepare one 64-QAM OFDM-symbol,
while the receive C6455 needs 11.2 µs to decode it. Since the
duration of one OFDM-symbol is Tsym = 4 µs, neither transmission nor
reception works in real-time. The corresponding data rate attained by
the transmitting C6455 for 64-QAM is 14.7 Mbit/s, while the receive
C6455 sustains data rates up to 25.7 Mbit/s.⁵
The power consumption of the C6455 during the reception of one
RF OFDM-frame has been determined with [78] and is approximately
2 W. Accordingly, the energy efficiency becomes 4’000 MdOp/s/2 W =
2 MdOp/s/mW.
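The data rates and the energy efficiency quoted above follow from simple ratios, as the sketch below retraces. The function name is hypothetical, and the peak performance is taken as 4000 MdOp/s, the value consistent with the stated efficiency of 2 MdOp/s/mW at 2 W.

```python
def raw_rate(data_carriers, bits_per_carrier, seconds_per_symbol):
    """Raw (uncoded) data rate in bit/s for one OFDM-symbol per time slot."""
    return data_carriers * bits_per_carrier / seconds_per_symbol


# Design data rates: bits per OFDM-symbol divided by the symbol duration.
design_acoustic = raw_rate(54, 6, 4.3e-3)   # acoustic system, 64-QAM
design_rf = raw_rate(48, 6, 4e-6)           # RF system, 64-QAM

# Achieved data rates: bits per OFDM-symbol divided by the processing time.
achieved_tx_rf = raw_rate(48, 6, 19.6e-6)
achieved_rx_rf = raw_rate(48, 6, 11.2e-6)

# Energy efficiency in MdOp/s per mW (peak performance over power).
efficiency = 4000 / 2000

print(f"design, acoustic: {design_acoustic / 1e3:.0f} kbit/s")
print(f"design, RF:       {design_rf / 1e6:.0f} Mbit/s")
print(f"achieved Tx, RF:  {achieved_tx_rf / 1e6:.1f} Mbit/s")
print(f"achieved Rx, RF:  {achieved_rx_rf / 1e6:.1f} Mbit/s")
print(f"efficiency:       {efficiency:.0f} MdOp/s/mW")
```

The gap between the 72 Mbit/s design rate of the RF system and the achieved rates makes the missing factor of roughly three to five in processing speed explicit.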
⁴ The design data rate is computed as: (54 data carriers × 6 bits per
carrier)/(4.3 ms per OFDM-symbol) = 75 kbit/s.
⁵ The transmit data rate is computed as: (48 · 6 bit)/19.6 µs = 14.7 Mbit/s and
the receive data rate as: (48 · 6 bit)/11.2 µs = 25.7 Mbit/s.

Discussion Surprisingly, the transmitter achieves a lower data rate
than the receiver. A closer look at the profiling results reveals that the
convolutional encoding of the transmit bitstream requires most of the
processing time, closely followed by the Gray-mapping of the binary
labels onto constellation points. Convolutional encoding operates on
the incoming bitstream at a bitwise granularity. Even though the
C6455 supports bitwise operations, it is necessary to manipulate and
shift single bits, eventually resulting in an inefficient implementation.
At the receiver, instead, the decoding is more efficient since it is assisted
by the dedicated Viterbi coprocessor. The demapping at the receiver,
however, experiences the same difficulties as the Gray-mapping at the
transmitter. The Gray-mapping of the binary labels onto constellation
points relies either on extensive use of if-then-else statements or on
extensive masking and shifting followed by a table look-up. Both
options result in inefficient C-code.
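The bit-level nature of convolutional encoding can be seen in a minimal sketch of a rate-1/2 encoder with constraint length 7 and the generator polynomials 133 and 171 (octal). This is an illustration, not the profiled C implementation; the tap ordering (current input bit at the most significant register position, as commonly used for IEEE 802.11a) is an assumption, and the function names are hypothetical.

```python
K = 7                    # constraint length
G0, G1 = 0o133, 0o171    # generator polynomials in octal notation


def parity(value):
    """Return the XOR of all bits of a non-negative integer."""
    return bin(value).count("1") & 1


def conv_encode(bits):
    """Rate-1/2 convolutional encoder: two coded bits per input bit."""
    state = 0            # K-bit shift register, newest bit at the MSB
    coded = []
    for bit in bits:
        # Shift the register by one position and insert the new bit.
        state = ((state >> 1) | (bit << (K - 1))) & ((1 << K) - 1)
        # Each output bit is the parity of the register taps of one generator.
        coded.append(parity(state & G0))
        coded.append(parity(state & G1))
    return coded


# The response to a single 1 reproduces the tap patterns of G0 and G1.
impulse = conv_encode([1, 0, 0, 0, 0, 0, 0])
print(impulse)
```

Every input bit triggers a shift, two mask operations, and two parity computations. On a word-oriented datapath each of these steps occupies a full-width unit while producing a single bit, which is the granularity mismatch discussed above.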
Lowering the modulation order and transmitting only BPSK-modulated
OFDM-symbols lightens the computational burden, and could
possibly enable real-time operation. In BPSK transmission, each tone
carries only one instead of six bits. Then, the transmitter processes
one OFDM-symbol in 3.7 µs. The receiver performs the FFT in 0.5 µs.
Demapping requires 1.5 µs, the deinterleaving 0.2 µs, and the concurrent
Viterbi decoding requires 1.9 µs. Consequently, the processing
time of one OFDM-symbol at the receiver amounts to 2.2 µs, which is
less than the duration of one OFDM-symbol. Since the processing of
one OFDM-symbol at both the transmitter and the receiver requires less
than the OFDM-symbol duration, real-time operation is possible and the
associated data rate is 12 Mbit/s.
Table 4.2 also compares the obtained performance figures to those
of related work. Sereni [79] describes an IEEE 802.11a receiver im-
plementation for a C64x DSP running at 600 MHz and reports the
computational complexity for transmitter and receiver. Tariq [80]
presents an OFDM system centered on two C62x platforms connected
through a cable for the transmission in the BB. Video bursts are trans-
mitted at a sustained datarate of 1.7 Mbit/s. Cinquino [81] reports
cycle counts for an OFDM based system and maps these counts into
a datarate of 4.9 Mbit/s for a C64x platform that uses the Viterbi
coprocessor.
Conclusion Although the C-code of the presented transceiver still
leaves room for optimization, it can be stated that the C6455’s
datapath is not well suited for the fine-grained bit-wise operations
required by interleaving and deinterleaving, convolutional encoding and
decoding. The dedicated (partially configurable) Viterbi coprocessor
is a good example of how a dedicated component relieves the CPU from
operations that do not match its granularity. However, despite the
dedicated Viterbi coprocessor, the processing performance delivered
by the C6455 is on the edge of what is required for the reception of
BPSK-modulated OFDM-frames in an RF system such as the one
sketched above, at a relatively low energy efficiency of 2 MdOp/s/mW.
In summary, for systems that require higher modulation orders the
available processing performance is not sufficient, nor is it sufficient to
meet the even higher requirements of an RF MIMO-OFDM system.
Finally, the main limiting factor of the C6455 is its power consumption
of 2 W (in 90 nm CMOS technology), which is definitely too high for a
mobile wireless device.
4.2 MSEC4
The MSEC4, developed at the IIS, ETH Zurich [82], is a fixed-point
SIMD BB processor targeted at OFDM and CDMA BB processing.
Figure 4.4 shows the block diagram of the MSEC4’s core. It contains
four parallel processing elements (PEs), a program control unit (PCU),
and an address generation unit (AGU). The MSEC4’s memory consists
of multiple tightly-coupled memories that directly supply the PEs via
parallel data buses. These memories are addressed via the AGU. The
MSEC4 core further contains a register file for intermediate data stor-
age and a system control unit (SCU). The SCU comprises instruction
fetch and decode, as well as several registers for device control. The
MSEC4 has two pipeline stages: in the first stage instruction fetch,
decode, and AGU address calculation are performed. Execution and
write-back take place in the second stage.
The description of the MSEC4’s building blocks in the next subsec-
tions shall underline important design aspects considered for tailoring
the processor to the BB processing domain.
Table 4.2: OFDM system parameters and performance for the acoustic and RF systems considered in this
thesis, and comparison to related work. The reported data rates refer to the raw over-the-air rate and do
not consider coding.

                                       This work, C6455         [79] (a)    [80]       [81] (a)
OFDM parameter                         Acoustic      RF         C64x        C62x       C64x
Channel bandwidth [MHz]                0.024         20         20          20         n.a.
# OFDM-subchannels (N)                 64            64         64          64         64
# OFDM data carriers (Nc)              54            48         48          48         64
Subchannel spacing [kHz]               0.375         312.5      312.5       312.5      n.a.
GI [# samples] / [µs]                  80 / 1’666    16 / 0.8   16 / 0.8    16 / 0.8   n.a.
OFDM-symbol [# samples] / [µs]         208 / 4’300   80 / 4     80 / 4      80 / 4     n.a.
BB sample rate [MS/s]                  0.048         20         20          20         n.a.
Modulation                             64-QAM        64-QAM     BPSK        QPSK       16-QAM
Coding                                 yes           yes        yes         no         yes
Design data rate (b) [Mbit/s]          0.075         72         12          24         n.a.
Achieved Rx data rate (c) [Mbit/s]     25.5          25.7       n.a. (d)    1.7        4.9

(a) Implements the receiver on a C62x platform and scales the results to a C64x platform.
(b) Data rate specified by the OFDM parameters, i.e., bits per OFDM-symbol divided by the OFDM-symbol duration.
(c) Effective data rate achieved on the DSP, i.e., bits per OFDM-symbol divided by the OFDM-symbol processing time.
(d) The CC is estimated to be 2977 MOPS for the processing of one OFDM-symbol.
Figure 4.4: MSEC4 core. [Block diagram: AGU, register file, PCU (branch
and loop control), SCU, and processing elements PE0 to PE3, connected
by data address buses, an instruction address bus, and the instruction
bus.]
4.2.1 Architecture details

Processing elements A crucial computation kernel in many BB
processing algorithms, including OFDM (cf. Table 3.5), is the Fourier
transform. As a result of its complex internal structure, its computation
is a tedious and time-consuming task when performed on architectures
that do not provide special Fourier-transform support. The fast Fourier
transform (FFT) reduces the CC of an N-point Fourier transform from
N^2 to (N/r) log_r N, where r is the radix of the FFT algorithm and N
has to be a power of r. The atomic operation of a radix-r FFT is called
a butterfly.
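A radix-4 butterfly can be sketched as a 4-point transform that needs only additions and multiplications by ±1 and ±j; the twiddle-factor multiplications that precede the butterfly in a complete FFT are left out. The sketch also counts the (N/r) log_r N butterflies of a radix-r FFT. Function names are hypothetical, and the naive DFT serves only as a reference for checking the butterfly.

```python
import cmath


def radix4_butterfly(a, b, c, d):
    """4-point DFT using only additions and multiplications by +-1, +-j."""
    t0, t1 = a + c, a - c
    t2, t3 = b + d, b - d
    return (t0 + t2, t1 - 1j * t3, t0 - t2, t1 + 1j * t3)


def dft(values):
    """Naive reference DFT."""
    n = len(values)
    return [
        sum(v * cmath.exp(-2j * cmath.pi * k * i / n)
            for i, v in enumerate(values))
        for k in range(n)
    ]


def butterfly_count(n, radix):
    """Number of radix-r butterflies, (n / r) * log_r(n), n a power of r."""
    stages, remaining = 0, n
    while remaining > 1:
        if remaining % radix:
            raise ValueError("n must be a power of the radix")
        remaining //= radix
        stages += 1
    return (n // radix) * stages


inputs = (1 + 2j, -3 + 0.5j, 0.25 - 1j, 2 - 2j)
outputs = radix4_butterfly(*inputs)
reference = dft(list(inputs))
print(outputs)
print(butterfly_count(64, 4), "radix-4 vs", butterfly_count(64, 2),
      "radix-2 butterflies for N = 64")
```

For N = 64 the radix-4 decomposition needs 48 butterflies in three stages, compared with 192 radix-2 butterflies in six stages, which motivates a datapath that executes one radix-4 butterfly per cycle.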
Accordingly, the MSEC4’s main execution unit (consisting of the
four PEs) has the structure of a radix-4 butterfly, enabling fast and
efficient FFT computation. Each of the four identical PEs includes
a 16 × 8 bit complex-valued multiplier, a ‘trivial multiplier’, a 32 bit
complex-valued ALU with accumulator register, and a second-stage
32 bit complex-valued adder [see Figure 4.5 a)]. This composition
allows interconnecting the four PEs in such a way as to form one
complete single-cycle radix-4 butterfly [see Figure 4.5 b)] or two
radix-2 butterflies. In addition, each PE can perform one complex-valued
MAC instruction in a single clock cycle, providing optimal support for
vector/matrix operations, filters, and other convolutional algorithms.

Figure 4.5: a) MSEC4’s processing element and b) radix-4 butterfly.
[Panel a): PE0 with operand inputs X and Y, complex multiplier, trivial
multiplier, ALU, exchange paths to and from PE1 and PE2, and result
output Z; panel b): radix-4 butterfly with −1 and j factors.]
The trivial multiplier is employed for efficient signal de-/spreading
with complex-valued binary codes, as found in various communication
protocols such as UMTS. Here, signal de-/spreading is performed
by multiplying the original data signal with a complex-valued
binary code sequence ∈ {±1±j}. While this could be done with the
main multiplier, a dedicated solution allows performing the same
trivial multiplication more efficiently, both in terms of power
consumption and memory usage.
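Why such a multiplication is called trivial becomes apparent when the product with a code value from {±1±j} is written out: it reduces to sign changes and two additions. The following sketch illustrates this; the function name and the test value are hypothetical and do not describe the actual PE implementation.

```python
def trivial_multiply(x, code_re, code_im):
    """Multiply x by (code_re + j*code_im) with code_re, code_im in {+1, -1}.

    Only sign selections and two additions are used, no multiplier.
    """
    xr, xi = x.real, x.imag
    # Contribution of the real code part: code_re * (xr + j*xi).
    pr = xr if code_re > 0 else -xr
    pi = xi if code_re > 0 else -xi
    # Contribution of the imaginary code part: j*code_im * (xr + j*xi).
    qr = -xi if code_im > 0 else xi
    qi = xr if code_im > 0 else -xr
    return complex(pr + qr, pi + qi)


sample = complex(3, -5)
for code_re in (+1, -1):
    for code_im in (+1, -1):
        product = trivial_multiply(sample, code_re, code_im)
        print(code_re, code_im, product)
```

A general complex multiplication costs four real multiplications and two additions, so replacing it by sign selections saves both switching activity and the storage of full-width code coefficients.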
Data memories In contrast to conventional load/store architectures,
MSEC4 supports direct data processing from memory to memory. This
capability offers great advantages for stream-oriented applications that
often rely on data-block processing. Block sizes, though, tend to be
rather small for minimizing latency and limiting memory requirements.
The MSEC4 memory architecture meets these characteristics
by employing sixteen 64-word data-memory blocks and four 64-word
coefficient-memory blocks. At the same time, 8 memory read-ports and
4 write-ports allow for very high memory bandwidth. The data memory
has a word width of 16 bit, whereas for the coefficients a word width of
8 bit is sufficient. Furthermore, direct memory processing eliminates
the need for constant register file reloading, which causes considerable
overhead and is a frequent system bottleneck in conventional
load/store architectures. Very short access times are achieved by using
tightly-coupled memories that are, apart from their access latency,
comparable to large register files, but much smaller in area.⁶ Memory
access conflicts that occur when multiple PEs address the same memory
bank are resolved automatically by stalling the computation until all
data items are ready, and are transparent to the programmer.
Address Generation Unit The AGU can compute up to twelve
addresses per clock cycle: eight read (operands) and four write addresses
(results). Four specialized address generation modes are provided
for the efficient implementation of DSP algorithms: linear addressing
(standard arithmetic) for general purpose computations; modulo
addressing, allowing efficient data access in circular buffers; and radix-4
and radix-2 bit-reverse addressing for FFT address calculation. These
modes can be arbitrarily combined with the AGU operations for
address modification, which include post-increment/decrement,
post-increment by signed offset, and indexed by signed offset, allowing
for a rich variety of addressing schemes.
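The addressing modes named above can be mimicked in software as follows. The sketch shows modulo addressing with post-increment for a circular buffer, and radix-2 bit-reversed and radix-4 digit-reversed index computation as used for FFT reordering. Function names and parameter conventions are hypothetical and do not reflect the AGU's actual instruction encoding.

```python
def modulo_post_increment(address, base, length, step):
    """Return the next address inside the circular buffer [base, base+length)."""
    return base + (address - base + step) % length


def bit_reverse(index, bits):
    """Radix-2 bit-reversed index over the given number of bits."""
    reversed_index = 0
    for _ in range(bits):
        reversed_index = (reversed_index << 1) | (index & 1)
        index >>= 1
    return reversed_index


def digit_reverse_radix4(index, digits):
    """Radix-4 digit-reversed index: reverses the order of 2-bit digits."""
    reversed_index = 0
    for _ in range(digits):
        reversed_index = (reversed_index << 2) | (index & 3)
        index >>= 2
    return reversed_index


# Circular buffer of 8 words starting at address 100, walked with step 3.
address, trace = 100, []
for _ in range(5):
    trace.append(address)
    address = modulo_post_increment(address, base=100, length=8, step=3)
print(trace)

# Reordering indices for a 64-point FFT (6 bits, or 3 radix-4 digits).
print([bit_reverse(i, 6) for i in range(4)])
print([digit_reverse_radix4(i, 3) for i in range(4)])
```

On a processor without such modes each of these index computations costs several instructions per access; performing them in a dedicated unit in parallel with the datapath removes this overhead from the inner loops.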
Program Control Unit All tasks related to program sequence
control, i.e., tasks that manipulate the program counter (PC), are
performed by the PCU. This includes jumps, branches, subroutine calls,
return instructions, and looping. MSEC4 supports up to four nested
zero-overhead hardware loops, an important feature for performance
optimization of heavily loop-based signal processing algorithms.
⁶ Depending on the underlying CMOS technology and design library, there might
be a critical memory size below which the instantiation of a register file is more
area-efficient.
Instruction Set Architecture MSEC4 is based on an orthogonal,
4-way SIMD instruction set that allows maximum use of processing
resources and high program memory utilization. Moreover, the
instruction set’s orthogonal and regular structure enables fast instruction
decoding and eases the design verification. Dedicated instructions
for complex arithmetic lead to compact code for complex-valued
algorithms. Instructions are 62 bits long and can address up to four sets of
three operands (two sources, one destination) with individual address
modification operations in each clock cycle.
In addition to conventional SIMD instructions, MSEC4 provides
reconfigurable instructions [83]. As a first step, operation with
reconfigurable instructions requires application-specific PE
configurations to be loaded into a dedicated on-chip memory at runtime.
Then, on request, they can be activated by executing the corresponding
reconfigurable instruction, which contains the address in the
configuration memory and the operands to be processed. Hence, the four
PEs can be individually configured, effectively enabling the execution
of almost arbitrary operations. In particular, it becomes possible to
perform different operations on different PEs, which represents an
essential enhancement over common SIMD architectures. The integration
of reconfigurable instructions into the MSEC4 architecture provides
an increase in terms of both flexibility and performance, yet without
introducing notable architectural complexity.

Results MSEC4, synthesized for a 0.25 µm CMOS technology, runs
at a clock frequency of 65 MHz and occupies an area of 8.14 mm^2. The
data and coefficient memories of the synthesized version have a size of
4 KiB and 512 B, respectively. The resulting data processing performance is
1040 MdOp/s.⁷ The power consumption of the MSEC4’s core has been
determined in [84] (p. 70) by collecting the node toggling activities
experienced while computing a 1024-point FFT, on a placed and routed
design. The resulting power consumption amounts to 2.4 W on the
considered 0.25 µm CMOS technology.

⁷ For the performance, complex-valued operators are mapped to real-valued
ones, i.e., PP = 4 PEs × 4 real-valued dOp/PE × 65 MHz = 1040 MdOp/s. The
flexibility does not consider reconfigurable instructions.

The processing performance achieved by MSEC4 for the most
commonly used BB processing algorithms (FFT, FIR, LMS, etc.) is
evaluated and compared to the TI DSP generations for mobile (C55x)
and high performance (C64x) applications [85]. The results of the
evaluation are summarized in Table 4.3. As shown, speedup factors
between two and fifteen in terms of cycle counts are achieved by the
MSEC4 design. The prime example is the computation of radix-4
FFTs, which are greatly accelerated thanks to the PE’s butterfly layout.
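The cycle estimates of Table 4.3 for the filter and matrix benchmarks can be re-evaluated from their formulas, as sketched below for the numeric examples of the table (nx = 100, nh = 32, and 16 × 16 matrices). Only the formulas quoted in the table are used; the function names and the dictionary layout are hypothetical, and the integer divisions assume parameter values for which the formulas yield whole cycle counts.

```python
def real_fir(nx, nh):
    return {"C55x": nx * (nh + 4) // 2,
            "C64x": nx * (nh + 11) // 4 + 15,
            "MSEC4": nx * (nh // 2 + 2) // 4 + 36}


def complex_fir(nx, nh):
    return {"C55x": nx * (2 * nh + 4),
            "C64x": nx * nh + 24,
            "MSEC4": nx * nh // 4 + 14}


def complex_delayed_lms(nx, nh):
    return {"C55x": nx * (8 * nh + 5),
            "C64x": nx * (3 * nh + 17),
            "MSEC4": nx * (3 * nh // 4 + 10) + 24}


def complex_matrix_product(r1, c1, c2):
    return {"C55x": 2 * r1 * c1 * c2 + 4 * r1 * c2 + 10 * c2,
            "C64x": r1 * c1 * c2 + 9 * r1 * c2 // 2 + 11,
            "MSEC4": r1 * c1 * c2 // 4 + 6 * r1 * c2 + 6 * c2 + 24}


def relative_to_msec4(cycles):
    """Cycle counts as a multiple of the MSEC4 count (speedup factor)."""
    return {name: count / cycles["MSEC4"] for name, count in cycles.items()}


benchmarks = {
    "R FIR": real_fir(100, 32),
    "C FIR": complex_fir(100, 32),
    "C delayed LMS": complex_delayed_lms(100, 32),
    "C matrix product": complex_matrix_product(16, 16, 16),
}
for name, cycles in benchmarks.items():
    factors = {k: round(v, 2) for k, v in relative_to_msec4(cycles).items()}
    print(name, cycles, factors)
```

The ratios reproduce the bracketed percentages of the table within rounding, for instance a factor of about 8.4 between the C55x and the MSEC4 for the complex FIR filter.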
Discussion A 64-point FFT, as required by the MIMO-OFDM
receiver detailed in Section 3.5, is computed in only 93 clock cycles
compared to the 182 clock cycles required by the C6455.
Unfortunately, the single-cycle radix-4 configuration also determines the
longest timing path, thereby limiting the operating frequency of the
circuit. As a consequence, the achieved clock frequency of 65 MHz
is low compared to what is offered by the employed 0.25 µm CMOS
technology, and a performance gain in terms of cycles is watered down
by longer execution times. Indeed, scaling the MSEC4 to 90 nm
CMOS technology, for a comparison with the C6455, confirms this
concern. The MSEC4’s scaling to 90 nm technology leads to an
area of 2.1 mm^2 at a frequency of 130 MHz. The processing
performance becomes 2080 MdOp/s and the power consumption scales down
to approximately 320 mW. With these performance figures, the
computation of the 64-point FFT requires 715 ns on the MSEC4, whereas
on the C6455 only 182 ns are necessary.
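The trade-off between cycle count and clock frequency can be made explicit by evaluating the FFT cycle formulas of Table 4.3 for N = 64 and converting the results into execution times at the quoted clock frequencies. The sketch below does this; the function names are hypothetical.

```python
def log4(n):
    """Integer base-4 logarithm; n must be a power of 4."""
    stages = 0
    while n > 1:
        if n % 4:
            raise ValueError("n must be a power of 4")
        n //= 4
        stages += 1
    return stages


def fft_cycles_msec4(n):
    """MSEC4 cycle estimate: (N/4 + 10) log4(N) + 15."""
    return (n // 4 + 10) * log4(n) + 15


def fft_cycles_c64x(n):
    """C64x cycle estimate: 0.75 N log4(N) + 38."""
    return 3 * n * log4(n) // 4 + 38


def execution_time_ns(cycles, clock_mhz):
    return cycles * 1e3 / clock_mhz


msec4_cycles = fft_cycles_msec4(64)
c64x_cycles = fft_cycles_c64x(64)
print("MSEC4:", msec4_cycles, "cycles,",
      round(execution_time_ns(msec4_cycles, 130)), "ns at 130 MHz")
print("C6455:", c64x_cycles, "cycles,",
      round(execution_time_ns(c64x_cycles, 1000)), "ns at 1 GHz")
```

Although the MSEC4 needs about half as many cycles, the roughly eightfold clock-frequency advantage of the C6455 turns this into an execution time that is almost four times shorter.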
Although OFDM-symbol processing is not jeopardized by the resulting
90 nm technology processing time, and the introduction of pipeline
stages inside the PEs to shorten the critical timing path can relieve
this shortcoming, the evaluation highlighted a few other characteristics
of the MSEC4 that prevent a re-design for subsequent use as a
MIMO-OFDM BB processor.
The high memory bandwidth required to feed the 4-way SIMD
datapath is obtained by splitting the memory into multiple small
memory banks that can be accessed concurrently. This solution is
excellent for computations that have a block-wise granularity because
the data samples processed simultaneously during one clock cycle
are distributed over different memory banks, allowing for single-cycle
memory accesses. Instead, when the granularity is fine, e.g., as during
the frame-start detection or STF processing, the computations become
less efficient. The per-sample granularity forces frequent
accesses to data samples residing in the same memory bank, which
results in penalty clock cycles necessary to fetch the data through the
single read port provided by the addressed memory bank.
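The effect of bank conflicts can be illustrated with a toy model of sixteen 64-word banks with one read port each. The mapping of addresses to banks is not specified here; the sketch assumes a blocked mapping in which consecutive addresses share a bank, purely for illustration, and all names are hypothetical.

```python
from collections import Counter

BANK_WORDS = 64   # words per data-memory block
NUM_BANKS = 16    # number of data-memory blocks


def bank_of(address):
    """Hypothetical blocked mapping: consecutive addresses share a bank."""
    return (address // BANK_WORDS) % NUM_BANKS


def access_cycles(addresses):
    """Cycles needed to serve simultaneous reads with one port per bank."""
    return max(Counter(bank_of(a) for a in addresses).values())


# Block-wise access, e.g., four butterfly operands with a stride of 64 words.
block_wise = [0, 64, 128, 192]
# Sample-wise access, e.g., four neighbouring samples of a sliding window.
sample_wise = [10, 11, 12, 13]

print("block-wise: ", access_cycles(block_wise), "cycle(s)")
print("sample-wise:", access_cycles(sample_wise), "cycle(s)")
```

Under this assumed mapping the strided operands of a block-wise kernel are served in a single cycle, whereas four neighbouring samples hit the same bank and are serialized over four cycles, which corresponds to the penalty cycles described above.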
In addition, as for the C6455, no explicit support for computing
the angle of a complex number (required for FOE) is provided by
the MSEC4’s PEs, and thus FOE would be computationally
expensive and inefficient. The MSEC4 also offers no appropriate
support for the subsequent FOC.
Conclusion The MSEC4 is computationally efficient and well suited
for regular, complex-valued processing kernels that exhibit sufficient
data-level parallelism to exploit the four-way SIMD datapath.
However, for sample-based computations, the large memory
bandwidth provided by the MSEC4 is not well exploited due to frequent
memory access conflicts and the associated penalty, which burdens the
processing with overhead.
4.3 ASPE
The adaptive stream processing engine (ASPE), developed at the
IIS, ETH Zurich [9], is a modular coarse-grained ASIP architecture
optimized for multimedia stream processing, which mainly consists of
regular and repetitive tasks.
4.3.1 Architecture
Figure 4.6 shows the ASPE architecture. The ASPE is tightly-coupled
with a general purpose processor (GPP) responsible for controlling and
setting up the ASPE, as well as for executing performance-uncritical
tasks. In addition, the ASPE has access to the system bus providing
the capability of autonomously handling datastreams.
The ASPE consists of a datapath and a controlpath. The datapath
employs two types of building blocks: functional units (FUs) and
storage units (SUs). FUs perform the arithmetic operations and SUs
provide local storage for the data processing. Design-time configuration
permits the selection of appropriate SUs and FUs, which provide suitable
address generation modes and the atomic operations required for a
particular application, respectively. The units are selected from a
library to which, at design-time, new units can easily be added
via user-defined modules implemented in predefined wrappers. In the
ASPE architecture, the FUs and SUs are connected through a run-time
reconfigurable network which allows multiple FUs and SUs to be
combined to form a single atomic operation (e.g., SU → FU → FU → SU
→ FU → SU). This datapath reconfiguration provides an advantage
over conventional VLIW architectures, since data does not need
to pass through the bottleneck of a complex full-custom multi-port
register-file.

Figure 4.6: Block-diagram of the generic ASPE framework described
in [9]. (Labels in the figure: GPP; data, commands, control; system bus;
controlpath with SEQs; C-Net and D-Net carrying data and CWs; datapath
with SUs, FUs, and the register file (RF); slot numbers 0 to 15.)

Table 4.3: Performance evaluation for typical DSP algorithms.
Cycle estimates (formula and numeric example).

Benchmark               C55x (a)                           C64x (b)                        MSEC4 (c)
N (= 4^k) Point C FFT   n.a.                               0.75 N log4 N + 38              (N/4 + 10) log4 N + 15
  N = 256               4786 [1540%]                       614 [197%]                      311 [100%]
R FIR Filter            nx/2 (nh + 4)                      nx/4 (nh + 11) + 15             nx/4 (nh/2 + 2) + 36
  nx = 100, nh = 32     1800 [370%]                        1090 [220%]                     486 [100%]
C FIR Filter            nx (2 nh + 4)                      nx nh + 24                      nx/4 nh + 14
  nx = 100, nh = 32     6800 [840%]                        3224 [400%]                     814 [100%]
C Delayed LMS Filter    nx (8 nh + 5) (d)                  nx (3 nh + 17) (d)              nx (3/4 nh + 10) + 24
  nx = 100, nh = 32     26100 [760%]                       11300 [330%]                    3424 [100%]
C Matrix Product        2 r1 c1 c2 + 4 r1 c2 + 10 c2 (d)   r1 c1 c2 + 4.5 r1 c2 + 11 (d)   r1 c1 c2/4 + 6 r1 c2 + 6 c2 + 24
  r1 = c1 = c2 = 16     9376 [350%]                        5259 [200%]                     2680 [100%]

nx: number of samples; nh: number of taps; ri: rows in matrix i; ci: columns in matrix i (c1 = r2)
(a) Two real processing units; (b) Four real processing units; (c) Four complex processing units
All TI benchmarks are from [85], except: (d) Extrapolated from corresponding R benchmarks.
The controlpath consists of sequencer units (SEQs). The SEQs
provide the FUs and SUs with the necessary 16 bit control words
(CWs) that determine their operation mode, and they control the
reconfigurable network (D-Net) to route the data between FUs and
SUs. The SEQs support zero-overhead loops and data-dependent
control flow.
4.3.2 SISO-OFDM receiver

The ASPE is evaluated through the implementation of a SISO-OFDM
receiver that is based on the IEEE 802.11a physical layer specifica-
tion [86]. The algorithms mapped onto the ASPE are described in
Section 3.5 and obtained by setting MT = MR = 1 [10]. The careful
analysis of the selected SISO-OFDM BB algorithms reveals that an
ASPE configuration composed of 1×SEQ, 3×FUs, and 8×SUs is necessary
in order to sustain real-time operation. As in the MSEC4 design,
nearly all operations (except one real-valued comparison operation)
required by the selected algorithms deal with complex-valued numbers;
thus all three FUs are designed to support complex-valued arithmetic.
Memory-access bottlenecks are avoided by storing the real and
imaginary parts in the lower and upper half of the same data word,
respectively.
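The storage convention described above can be illustrated with a minimal sketch. It assumes signed 16 bit two's-complement components and a 32 bit data word; the function names are hypothetical and do not stem from the ASPE tool flow.

```python
def pack_complex(re: int, im: int) -> int:
    """Pack a complex sample into one 32 bit word.

    Real part in the lower half, imaginary part in the upper half
    (two's-complement encoding is an assumption of this sketch).
    """
    if not (-32768 <= re <= 32767 and -32768 <= im <= 32767):
        raise ValueError("components must fit into signed 16 bit")
    return ((im & 0xFFFF) << 16) | (re & 0xFFFF)


def _to_signed16(value: int) -> int:
    # Interpret a 16 bit pattern as a signed two's-complement number.
    return value - 0x10000 if value & 0x8000 else value


def unpack_complex(word: int) -> tuple[int, int]:
    """Return (real, imaginary) from one packed 32 bit word."""
    return _to_signed16(word & 0xFFFF), _to_signed16((word >> 16) & 0xFFFF)
```

With this layout a single memory access delivers both components of a sample, which is why one read port per bank suffices for complex-valued operands.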
Figure 4.7 depicts the block diagram of the ASPE customized for
SISO-OFDM BB processing. The single SEQ is configured with a
program memory of 762 words, each 192 bits wide, to store the program
control flow for the SEQ itself and the 16 bit CWs for the eleven
units (FUs and SUs) (see footnote 8). The three FUs correspond to one
complex-valued multiply and accumulate (CMAC) unit, and two
complex-valued arithmetic logic units (CALUs). The former performs atomic
operations such as multiply, multiply with complex-conjugate, and the
corresponding accumulate operations [e.g., as required by (3.12) and
(3.13)]. The CALUs implement basic ALU functionalities (i.e., add,
sub, shift, max, min, bit-wise and, and bit-wise or) and provide the
SEQ with flags for data-dependent decisions, as required for the state
transition according to the threshold detection in (3.16). In addition,
they have been enhanced by a set of BB processing operations (CORDIC,
detect and demap, and absolute value) that are implemented by sharing
the already available CALU resources, thus adding only a minimal,
control-related hardware overhead. Six of the eight SUs are composed
of 256×32 bit memories and incorporate addressing schemes for
bit-reversal (shown to be important by the MSEC4 architecture) and,
additionally, de-interleaving. One SU acts as an input data buffer
(FIFO) and has a size of 64×32 bit; the last SU is a register-file of
eight registers.

Footnote 8: The resulting instruction word length justifies the
classification of this ASPE customization as a VLIW architecture.

Figure 4.7: ASPE configured for SISO-OFDM BB processing. (Labels in
the figure: Data In, Data Req, Data Ack; VLIW program memory with
program length P and number of units Nu; sequencer with PC; control
network; data network; 16 bit CWs; RAM0 to RAM5, I-BUF, RF; CMAC,
CALU0, CALU1; 2-way SIMD units.)
All FUs implement a two-stage pipeline, resulting in an equal
execution time for all FUs independent of their complexity. The
potential advantage of exploiting the FUs’ different execution times for
higher hardware efficiency is traded off against the advantage of regular
assembler programming and thus a shorter development time.
The FUs and SUs have been enhanced to operate in a 2-way SIMD
manner for better exploiting the data level parallelism inherent to
many signal processing algorithms – OFDM BB processing included.
Finally, a datapath word-width of 16 bit guarantees sufficient precision
for all the required computations.
Careful scheduling is required to efficiently share all FUs and SUs.
Table 4.4 summarizes the assembler cycle counts for the SISO-OFDM
BB implementation and the corresponding processing times, whereas
Figure 4.8 depicts the ASPE’s task schedule for the reception of a
BPSK modulated OFDM-frame during the data processing state. A
clock frequency of 160 MHz together with the duty cycle of Tdc = 4 µs
yields a total of 640 clock cycles available for performing all
data processing related tasks. This has proven to be sufficient for
real-time reception of OFDM-frames modulated up to 64-QAM.

Figure 4.8: Data processing task schedule for BPSK modulated
OFDM-symbols. (Within one duty cycle Tdc = 4 µs, symbol Si is buffered
while symbol Si-1 is processed on the resources SU0, SU1, SU2, ...,
CMAC, CALU0, and CALU1. Tasks: GI removal & FOC 140 cycles; 64-point
FFT 160 cycles; channel compensation 96 cycles; demapping & detection
36 cycles; deinterleaving 66 cycles; total 498 cycles / 640 cycles.)
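The cycle budget quoted above follows directly from the clock frequency and the duty cycle. The following sketch recomputes it and compares it against the per-task cycle counts listed with Figure 4.8 for BPSK; the variable names are illustrative only.

```python
F_CLK_HZ = 160e6   # clock frequency of the ASPE
T_DC_S = 4e-6      # duty cycle Tdc (one OFDM symbol period)

# Number of clock cycles available per OFDM symbol.
cycle_budget = round(F_CLK_HZ * T_DC_S)

# Per-task cycle counts for BPSK as listed with Figure 4.8.
TASK_CYCLES = {
    "GI removal & FOC": 140,
    "64-point FFT": 160,
    "Channel compensation": 96,
    "Demapping & detection": 36,
    "Deinterleaving": 66,
}

used_cycles = sum(TASK_CYCLES.values())
slack_cycles = cycle_budget - used_cycles
utilization = used_cycles / cycle_budget

print(f"budget {cycle_budget}, used {used_cycles}, slack {slack_cycles}, "
      f"utilization {utilization:.1%}")
```

The remaining slack for BPSK is what leaves room for the more expensive demapping of the higher-order modulations.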

Results Using the described ASPE architecture, the complete BB
processing of an IEEE 802.11a receiver has been implemented in
assembler language. Its BER performance has been verified using
bit-true MATLAB models and shows only a small degradation compared
to a floating-point implementation. As attested by the cycle counts in
Table 4.4, real-time operation up to 54 Mbit/s is possible at a clock
frequency of 160 MHz, which has been achieved when synthesizing the
ASPE for the 0.13 µm CMOS target process. The corresponding silicon
area amounts to 1.9 mm², and is low compared to the area required
by similar approaches, e.g., Montium [87] and MS1/MaRs [88] (in
Chapter 2). The resulting processing performance is 2560 MdOp/s (see
footnote 9). Unfortunately, no power consumption figures were
extracted from this implementation.

Footnote 9: The processing performance is obtained as: PP = 2-way SIMD ×
8 dOps/SIMD × 160 MHz = 2560 MdOp/s.

Table 4.4: Assembler cycle counts and processing times for SISO-OFDM
BB processing on ASPE running at 160 MHz.

State / Task                        Assembler cycle counts   #     Time [µs]
Frame-start detection
  Correlation                       2Ns + NSP + 20           196   1.22
  Mean energy and th.               Ns + 20                  100   0.63
  TOTAL                             3Ns + NSP + 40           296   1.85
Short preamble processing (init)
  coarse FOE                        75                       75    0.47
  coarse FOC                        Ns + 10                  90    0.56
  TOTAL                             Ns + 85                  165   1.03
Short preamble processing
  coarse FOC                        Ns + 10                  90    0.56
  LTF1 start detect                 6Ns + 40                 520   3.25
  TOTAL                             7Ns + 50                 610   3.81
LTF1 processing
  fine FOE                          75                       75    0.47
  fine FOC                          Ns + 10                  90    0.56
  mean LTF1                         NLP/2 + 10               42    0.26
  FFT on LTF1                       160                      160   1.00
  Channel estimation                NLP/2 + 10               42    0.26
  TOTAL                             Ns + NLP + 265           409   2.55
Data processing
  fine FOC                          Ns + 10                  90    0.56
  FFT                               160                      160   1.00
  Channel compensation              114                      114   0.71
  Demap 64QAM                       270                      270   1.69
  TOTAL                             Ns + 554                 634   3.96
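The entries of Table 4.4 can be cross-checked with a short script. The parameter values Ns = 80, NSP = 16, and NLP = 64 are not stated next to the table; they are assumptions inferred here from the listed formulas and cycle counts. The state totals are obtained by summing the individual task rows.

```python
# Assumed parameter values, inferred from the formulas and counts in Table 4.4.
N_S, N_SP, N_LP = 80, 16, 64
F_CLK_MHZ = 160  # clock frequency in MHz (cycles per microsecond)

# Cycle counts of the individual tasks, grouped by receiver state.
STATE_TASKS = {
    "Frame-start detection": [2 * N_S + N_SP + 20, N_S + 20],
    "Short preamble processing (init)": [75, N_S + 10],
    "Short preamble processing": [N_S + 10, 6 * N_S + 40],
    "LTF1 processing": [75, N_S + 10, N_LP // 2 + 10, 160, N_LP // 2 + 10],
    "Data processing": [N_S + 10, 160, 114, 270],
}

totals = {state: sum(tasks) for state, tasks in STATE_TASKS.items()}
times_us = {state: cycles / F_CLK_MHZ for state, cycles in totals.items()}

for state, cycles in totals.items():
    print(f"{state:34s} {cycles:4d} cycles  {times_us[state]:.2f} us")
```

Under these assumptions the data processing state takes 634 cycles, which stays within the 640 cycles available per 4 µs OFDM symbol at 160 MHz.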
Discussion The complete BB processing implementation permitted
an extensive evaluation of the described ASPE architecture. Although
real-time operation is possible, there remains potential for further
increasing the hardware efficiency. The implementation revealed, for
instance, that the SEQ program code exhibits a very poor density, and
that the processor’s efficiency in computing the employed algorithms
could be further increased by only minimal hardware modifications
that facilitate the intra-/inter-kernel data sorting. As an example, the
computation of the SIMD 64-point radix-2 FFT requires 160 clock
cycles in the implementation described above, instead of the theoretical
192/2 = 96 clock cycles, showing that there is indeed a substantial
overhead. Another source of inefficiency is the control network residing
inside the ASPE’s controlpath. The control network is responsible
for scheduling the potentially concurrent accesses of multiple SEQs to
the same FU. The complexity of the resulting network is such that
it limits the achievable clock frequency. Thus, especially for ASPE
incarnations where only one SEQ is used, the control network becomes
a severe bottleneck.
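The theoretical FFT cycle count quoted in the discussion can be reproduced as follows. The sketch assumes that each SIMD lane completes one radix-2 butterfly per clock cycle, which is the idealization behind the figure of 96 cycles; the function name is illustrative.

```python
def radix2_butterflies(n_points: int) -> int:
    """Number of radix-2 butterflies of an n-point FFT (n a power of two)."""
    if n_points < 2 or n_points & (n_points - 1):
        raise ValueError("n_points must be a power of two >= 2")
    stages = n_points.bit_length() - 1      # log2(n_points)
    return (n_points // 2) * stages


SIMD_WAYS = 2
MEASURED_CYCLES = 160                        # cycle count reported for the ASPE

butterflies = radix2_butterflies(64)         # 32 butterflies x 6 stages
ideal_cycles = butterflies // SIMD_WAYS      # one butterfly per lane and cycle
efficiency = ideal_cycles / MEASURED_CYCLES

print(butterflies, ideal_cycles, f"{efficiency:.0%}")
```

The ratio of ideal to measured cycles quantifies the overhead attributed to data sorting in the text.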
The CORDIC algorithm support provided by the CALU unit
proved to be essential for FOE because it enabled a cycle efficient
computation of the angle of a complex-valued number.
Conclusion The ASPE’s two-fold adaptivity to tasks of different

granularities has proven to be important for the successful implemen-
tation of the SISO-OFDM BB processing. The evaluation showed
that the ASPE’s datapath can easily be tailored to the needs of the
application domain at hand: the design-time configurability provides
an enormous flexibility and enables the design of appropriate units
that, at run-time, provide exactly the right flexibility/granularity.
4.4 Summary and Conclusion

Summary The design space exploration presented in this chapter
highlighted important characteristics a software-programmable plat-
form demands for its deployment as an OFDM BB processor.
Facts The C6455 high performance DSP is extremely flexible and al-
lows for rapid code development thanks to its powerful SDK. However,
despite the two-fold datapath and the high clock frequency, its process-
ing performance is hardly sufficient to sustain real-time SISO-OFDM
BB processing with BPSK modulation. The MSEC4 special purpose
BB processor incorporates many efficient mechanisms that support
OFDM BB processing (e.g., radix-4 butterfly structure, flexible AGUs
for intra-kernel data sorting, zero overhead loop support). However,
mainly due to its long critical timing-path, but also because of its
difficulty of performing the per-sample processing required during the
initial reception phase, the processor cannot be employed for efficient
MIMO-OFDM BB processing without a substantial re-design. Finally,
the properly configured ASPE streaming processor comes with enough
flexibility to sustain both per-sample and per-symbol computations,
and delivers enough performance to sustain real-time SISO-OFDM
BB processing.
Important characteristics Although the OFDM BB processing is
mainly composed of regular and repetitive processing tasks and is
thus well suited for DSPs, frame-start detection and STF processing
present irregular tasks that demand sample-based processing and
program control-flow. The sample-based FOE requires the support
of dedicated hardware (e.g., support for the CORDIC algorithm) to
be computationally efficient. Next, extensive support of
complex-valued arithmetic greatly reduces the overhead otherwise experienced
when complex-valued operations are performed on single, real-valued
operators (cf. Chapter 2). Another important characteristic required
to sustain the differing granularities of OFDM BB processing is the
presence of mechanisms that assist efficient intra- and inter-kernel data
sorting/addressing.
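The overhead of complex-valued operations on real-valued operators mentioned above can be made concrete with a minimal sketch. It uses the textbook decomposition of one complex multiplication into four real multiplications and two real additions; the function name and the operation counters are illustrative.

```python
REAL_MULTIPLICATIONS = 4   # per complex multiplication in this decomposition
REAL_ADDITIONS = 2


def complex_multiply_real_ops(a_re, a_im, b_re, b_im):
    """Complex product (a_re + j a_im)(b_re + j b_im) using real operators only."""
    re = a_re * b_re - a_im * b_im   # two multiplications, one subtraction
    im = a_re * b_im + a_im * b_re   # two multiplications, one addition
    return re, im


print(complex_multiply_real_ops(1, 2, 3, 4))
```

A unit with native complex-valued support performs the same product as one atomic operation, which is the saving the text refers to.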
Conclusion Table 4.5 reports the area and the corresponding
operating frequency of the evaluated DSPs for their original target
technology, as well as normalized to a 0.18 µm CMOS technology. As
reinforced graphically by Figure 4.9, the best area efficiency is attained
by the ASPE, followed by the MSEC4. The C6455’s efficiency is by far
the lowest, which can be attributed, on the one hand, to the DSP’s large
on-chip memory, and on the other hand to its enormous flexibility.

Table 4.5: Areas and clock frequencies for the original designs and
their normalized versions for a 0.18 µm CMOS technology.

               Original                        Normalized to 0.18 µm
Architecture   CMOS [µm]   f [MHz]   A [mm²]   f0.18 [MHz]   A0.18 [mm²]
C6455 [76]     0.09        1000      91        500           360
MSEC [82]      0.25        65        8.14      90            4.2
ASPE [10]      0.13        160       1.9       115           3.6
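The normalized values of Table 4.5 are consistent with first-order technology scaling. The scaling rule is not spelled out next to the table, so the following sketch rests on an assumption: the clock frequency scales linearly with the inverse feature size and the area scales quadratically with the feature size. The C6455 clock frequency is taken as 1000 MHz.

```python
def normalize(f_mhz: float, area_mm2: float, node_um: float,
              target_um: float = 0.18) -> tuple[float, float]:
    """Scale frequency and area from node_um to target_um (first-order rule)."""
    s = node_um / target_um          # > 1 when moving to a finer technology
    return f_mhz * s, area_mm2 / s ** 2


DESIGNS = {
    # name: (f [MHz], A [mm^2], CMOS node [um])
    "C6455": (1000.0, 91.0, 0.09),
    "MSEC": (65.0, 8.14, 0.25),
    "ASPE": (160.0, 1.9, 0.13),
}

normalized = {name: normalize(*values) for name, values in DESIGNS.items()}

for name, (f_norm, a_norm) in normalized.items():
    print(f"{name:6s} f0.18 = {f_norm:6.1f} MHz   A0.18 = {a_norm:6.2f} mm^2")
```

Under this assumption the script reproduces the tabulated values to within rounding, e.g. roughly 116 MHz and 3.6 mm² for the ASPE.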

To conclude, the ASPE architecture has the best prerequisites to
successfully implement the MIMO-OFDM BB processing. Thanks to
its design-time configurability, it has the best potential to be tailored
to its application domain and thus to be finely positioned in the
performance-flexibility design space, a capability that neither the
MSEC4 nor the C6455 offers.
Figure 4.9: Processing performance per area, for the evaluated SPAs.
(Bar chart of performance/area in MdOPS/mm², axis from 0 to 600, for
ASPE, MSEC4, and C6455.)
Chapter 5
MIMO-OFDM SDR Receiver
This chapter presents the mapping of the relevant BB processing
algorithms described in Section 3.5 onto an SDR platform, in order to
form a 2 × 2 MIMO-OFDM receiver. The SDR platform is composed
of two ASIPs, each tailored to the computational needs of the
associated digital signal processing kernels. The first processor
performs the per-stream MIMO-OFDM processing. The second processor
handles the MIMO detection.
5.1 SDR Platform Overview

The two application-specific processors used to implement the SDR
platform are based on a modified version of the ASPE design framework
that prevailed against its two competitor architectures evaluated in the
previous chapter. The modified ASPE design framework is described
next.
5.1.1 The Modified ASPE

Figure 5.1 illustrates the modified ASPE design framework. Three
modifications mainly characterize this framework. The addition of
dedicated input and output buffers (IBUF and OBUF), dictated by
the streaming-like nature of the BB processing tasks, enables direct
access to the data source and sink, thus reducing the data movement
overhead otherwise experienced across the D-Net and the system bus.
In this way, the load on the system bus is also relieved, and the
bandwidth gained is available for other, possibly more control-related,
tasks, for instance reloading the SEQ’s program memory.

Figure 5.1: Block-diagram of the modified ASPE design framework.
(Labels in the figure: GPP; data, commands, control; system bus;
controlpath with one SEQ containing DICTIONARY and INSTRUCTIONS;
datapath with IBUF, OBUF, SUs, FUs, the register file (RF), and the
D-Net carrying data and CWs; slot numbers 0 to 15.)
The second modification concerns the controlpath structure. While
the initial framework permitted the use of many SEQs, the modified
ASPE framework supports just one SEQ. This restriction is motivated
by two considerations. First, the use of a single SEQ considerably eases
the assembler programming, allowing more rapid progress. Second, the
complexity of the control network, which was found to be a limiting
factor in the previous chapter, is greatly reduced, enabling operation
at higher clock frequencies.
The third modification concerns the SEQ structure. In order to
improve the low program-code density, which is inherent to VLIW
architectures as pointed out by the SISO-OFDM receiver implementation
in the previous chapter, the SEQ is enhanced to support
dictionary-based program-code compression. With this technique, the ASPE
instruction words are stored in a dictionary memory that is indexed
through the content of a much narrower, but deeper program memory.
This mechanism reduces the overall storage requirements. The
method used to compress the program-code is detailed later on in
Section 5.4, whereas the next section concentrates on the high-level
receiver architecture.
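The principle of dictionary-based compression can be sketched in a few lines. This is a generic illustration of the technique and not the method of Section 5.4: every distinct instruction word is stored once in a dictionary, and the program is replaced by a sequence of dictionary indices. The function names and the example words are hypothetical.

```python
def compress(program: list[int]) -> tuple[list[int], list[int]]:
    """Split a program into a dictionary of unique words and a pointer list."""
    dictionary: list[int] = []
    index_of: dict[int, int] = {}
    pointers: list[int] = []
    for word in program:
        if word not in index_of:          # first occurrence: new dictionary entry
            index_of[word] = len(dictionary)
            dictionary.append(word)
        pointers.append(index_of[word])   # program memory stores only the index
    return dictionary, pointers


def decompress(dictionary: list[int], pointers: list[int]) -> list[int]:
    """Reproduce the original instruction sequence from dictionary and pointers."""
    return [dictionary[p] for p in pointers]


# Hypothetical program with repeated instruction words.
program = [0xA, 0xB, 0xA, 0xA, 0xC, 0xB]
dictionary, pointers = compress(program)
print(dictionary, pointers)
```

The saving grows with the number of repeated instruction words, since each repetition costs only one narrow pointer instead of a full-width word.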
5.1.2 Receiver Architecture

The choice of the high-level architecture for implementing the 2×2
SDR MIMO-OFDM receiver is mainly guided by the findings of the
algorithm analysis in Section 3.5 and of the SISO-OFDM BB im-
plementation in Section 4.3. The results of the BB implementation
indicate that the ASPE architecture has enough processing power to
achieve real-time operation for a single-antenna OFDM receiver. On
the other hand, the first order complexity-estimates in Table 3.5 reveal
that, in a two-antenna MIMO-OFDM receiver, the MIMO-OFDM
processing alone requires slightly more than twice as many operations
compared to a single-antenna receiver. In addition, significant effort
is required for the computation of the MIMO estimator matrix G [in
(3.6)], especially for the involved matrix inversion, but also for the
MIMO detection itself.
Nevertheless, a system architecture with only two ASPEs has been
chosen. In the proposed configuration, the first ASPE (named ASPE A
in the following) is dedicated to the OFDM-related processing
of the two receive chains. The second ASPE (ASPE B) handles the
computation of the MMSE estimator matrix and the MIMO detection.
The two ASPEs are connected through their dedicated I/O-buffers.
More precisely, ASPE A’s OBUF is connected to ASPE B’s IBUF,
allowing data to be streamed from the first processor to the second. The
partitioning of the functionality is illustrated by the dashed boxes
in Figure 5.2 (cf. also Figure 3.1).
This approach appears reasonable from a structural point of view,
since MIMO-OFDM processing and MIMO-detection rely on different
computational kernels. The former requires mainly sample rate
processing and FFTs, while the latter relies on matrix manipulations
such as inversions, matrix-matrix and matrix-vector multiplications.
Thus, one ASPE can be customized for the OFDM processing, while the
other ASPE can be tailored to the MIMO processing. In addition,
both ASPEs can operate concurrently in a pipelined fashion while
decoding an OFDM-frame, maximizing their resource utilization.

Figure 5.2: Simplified block diagram of the considered 2 × 2 MIMO-OFDM
platform and task partitioning onto ASPE A and ASPE B.
5.1.3 Common ASPE A and ASPE B configuration
The design-time configuration of the two ASPEs aims at fully
supporting the target application while reducing the differences between
the two architectures to a minimum, which simplifies the portability of
the design tools (e.g., assembler, interface, HDL test-bench). From this
perspective, the characteristics that are common to both ASPE A and
ASPE B are described in the following.
Since nearly all operations required by the selected algorithms
deal with complex-valued numbers, and since this proved expedient in
the evaluation in Chapter 4, both ASPEs implement FUs that are designed
to support 16 bit fixed-point complex-valued arithmetic. As for the
SISO-OFDM implementation, memory-access bottlenecks are
avoided by storing the real and imaginary parts in the lower and upper
half of the same data word, respectively. The SUs are composed of
256×32 bit SIMD single-ported memories and incorporate common
addressing schemes (i.e., post increment by one, post decrement by
one, post increment by an offset register, and bit-reversal). Both the
SIMD IBUF and OBUF have a size of 64×32 bit and come with a read
and a write port to operate as first-in first-out buffers. Finally, the
temporary storage capability of the two ASPEs is provided by eight
SIMD registers that compose the register-file (two read-ports and one
write-port).
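Of the addressing schemes listed above, bit-reversal is the least obvious. The following sketch shows the address sequence such a scheme generates, as typically needed to reorder FFT data; it is a generic illustration, and the function name is hypothetical.

```python
def bit_reverse(index: int, bits: int) -> int:
    """Reverse the lowest `bits` bits of index (e.g. bits = 6 for 64 points)."""
    result = 0
    for _ in range(bits):
        result = (result << 1) | (index & 1)  # move the current LSB to the result
        index >>= 1
    return result


# Address sequence for an 8-point transform (3 address bits).
print([bit_reverse(i, 3) for i in range(8)])
```

Generating these addresses in the storage unit avoids a separate reordering pass over the data.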
In both ASPEs, the SEQ is configured with a VLIW dictionary
memory of 256 words, each 192 bits wide, to store the 16 bit CWs
for the SEQ itself and the CWs for the eleven units (FUs and SUs)
instantiated inside each ASPE. The program memory can contain up
to 1024 dictionary pointers. It stores the pointers to the dictionary
memory and guarantees proper sequencing of the dictionary entries in
order to reproduce the original uncompressed program.
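The memory sizes given above can be put into perspective with a short calculation. Two points are assumptions of this sketch: the 192 bit word is read as twelve 16 bit CWs (eleven units plus the SEQ), and each dictionary pointer is taken to be 8 bit wide, the minimum needed to address 256 entries. The 762-word figure is the SISO-OFDM program of Chapter 4, so the comparison is only indicative.

```python
CW_BITS = 16
UNITS = 11                                   # FUs and SUs controlled by the SEQ
word_bits = CW_BITS * (UNITS + 1)            # eleven unit CWs plus one for the SEQ

# Uncompressed SISO-OFDM program memory (Section 4.3.2).
siso_program_bits = 762 * word_bits

# Compressed organization: dictionary plus pointer memory.
DICT_WORDS = 256
PROGRAM_POINTERS = 1024
POINTER_BITS = (DICT_WORDS - 1).bit_length() # assumed pointer width: 8 bit

dictionary_bits = DICT_WORDS * word_bits
pointer_bits_total = PROGRAM_POINTERS * POINTER_BITS
compressed_bits = dictionary_bits + pointer_bits_total

# Storage that 1024 uncompressed instruction words would require.
uncompressed_equiv_bits = PROGRAM_POINTERS * word_bits

print(word_bits, siso_program_bits, compressed_bits, uncompressed_equiv_bits)
```

Under these assumptions a program of up to 1024 instructions fits into less than a third of the storage that 1024 full-width words would occupy.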
The number and the type of FUs and SUs selected to realize the
particular datapath of ASPE A and of ASPE B are described in more
detail in the next two sections (Section 5.2 and Section 5.3). However,
the partitioning of the 16 bit CWs used to control the FUs’ operation is
common to all FUs across the two designs. The CWs are partitioned
into the orthogonal fields:

Instr   Shamt   OpA     OpB
4 bit   4 bit   4 bit   4 bit
where OpA and OpB select the operand sources (i.e., SUs or FUs),
Shamt defines a possible shift amount, and Instr codes the instruction
to be executed on the FU.
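The CW partitioning can be sketched as a pair of encode and decode helpers. The bit order is an assumption: Instr is placed in the most significant nibble and OpB in the least significant one, following the left-to-right order in which the fields are listed. The function names are hypothetical.

```python
def encode_cw(instr: int, shamt: int, op_a: int, op_b: int) -> int:
    """Assemble a 16 bit control word from its four 4 bit fields."""
    for field in (instr, shamt, op_a, op_b):
        if not 0 <= field <= 0xF:
            raise ValueError("each field must fit into 4 bit")
    # Assumed layout: [15:12] Instr, [11:8] Shamt, [7:4] OpA, [3:0] OpB.
    return (instr << 12) | (shamt << 8) | (op_a << 4) | op_b


def decode_cw(cw: int) -> tuple[int, int, int, int]:
    """Split a 16 bit control word into (Instr, Shamt, OpA, OpB)."""
    return (cw >> 12) & 0xF, (cw >> 8) & 0xF, (cw >> 4) & 0xF, cw & 0xF


# Example: instruction code 1010 with Shamt = 0 and operand selectors 2 and 3.
cw = encode_cw(0b1010, 0, 2, 3)
print(hex(cw), decode_cw(cw))
```

Because the fields are orthogonal, each one can be decoded independently of the others.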
5.2 ASPE A – MIMO-OFDM Processing

The particular datapath configuration of ASPE A is detailed before
the description of the per-stream OFDM processing on the properly
configured architecture.
5.2.1 Datapath configuration

Figure 5.3 depicts the block diagram of the design-time configured
ASPE A. The analysis of the atomic operations required by the
selected BB algorithms reveals that, for the 2×2 MIMO-OFDM related
processing, an ASPE configuration with three FUs, five SUs, one IBUF,
one OBUF, and the register-file can deliver the necessary processing
performance.

Figure 5.3: ASPE A: datapath configuration. (Labels in the figure:
Data In, Data Out, Data Req, Data Ack; SEQ with VLIW dictionary
memories and IDX memory; control network; data network; CWs; SU0 to
SU4, OBUF, IBUF, REG; CMAC, CALU0, CALU1; 2-way SIMD units.)
The three FUs have been selected principally by observing that the
hardest computational kernel in the MIMO-OFDM processing part
is the computation of the 64-point FFT and that this kernel is
best supported by butterfly operators that significantly
speed up its computation (see Table 3.5). Consequently, the three
FUs can be interconnected to form a radix-2 butterfly. The FUs are:
one complex-valued multiply and accumulate (CMAC) unit, and two
complex-valued arithmetic logic units (CALU0 and CALU1). The
CMAC unit is implemented with three pipeline stages, while the two
CALUs require only two pipeline stages to attain the same critical
path length on all FUs.
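How one multiplier and two adder/subtractor units form a radix-2 butterfly can be sketched as follows. The assignment of the twiddle multiplication to the CMAC and of the sum and difference to the two CALUs is an assumption of this sketch; the recursive decimation-in-time FFT around it only serves to check the butterfly and is not the ASPE schedule.

```python
import cmath


def butterfly(a: complex, b: complex, w: complex) -> tuple[complex, complex]:
    """Radix-2 butterfly: one complex multiply, one add, one subtract."""
    t = w * b            # twiddle multiplication (role assumed for the CMAC)
    return a + t, a - t  # sum and difference (roles assumed for CALU0, CALU1)


def fft_radix2(x: list[complex]) -> list[complex]:
    """Recursive decimation-in-time FFT built solely from radix-2 butterflies."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft_radix2(x[0::2])
    odd = fft_radix2(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        w = cmath.exp(-2j * cmath.pi * k / n)
        out[k], out[k + n // 2] = butterfly(even[k], odd[k], w)
    return out


print(fft_radix2([1, 0, 0, 0]))  # the FFT of a unit impulse is flat
```

For 64 points this structure executes 192 butterflies, each keeping all three units busy once the pipeline is filled.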
The CALU unit

The structure of one CALU is illustrated in Figure 5.4. The CALU
implements basic ALU functionalities (i.e., add, sub, shift, max, min,
bit-wise and, and bit-wise or) and it provides the SEQ with flags
for data-dependent decisions, as required, for example, for the state
transition triggered by the threshold detection in (3.16). As for the
SISO-OFDM implementation, the two CALUs have been enhanced
by a set of BB processing operations (CORDIC and absolute value
computation) that are implemented by sharing the already available
CALU resources. Table 5.1 lists the Instr-field coding of the CALU’s
CW.
The operation of the CALU is illustrated using the example of the
specialized CORDIC instruction, since it differs from the conventional
ALU operation. Figure 5.5 shows how the two CALUs interact to
compute the angle of a complex-valued number z = x + j y by using
the CORDIC algorithm [75]. This datapath configuration is required,
for instance, by the FOE in Section 3.5.3. The CORDIC algorithm
performs a series of iterations that successively lead to the angle
φ = ∠(z). One CORDIC iteration is defined by

x^{(i+1)} = x^{(i)} - d^{(i)} y^{(i)} 2^{-i}              (5.1)
y^{(i+1)} = y^{(i)} + d^{(i)} x^{(i)} 2^{-i}              (5.2)
\phi^{(i+1)} = \phi^{(i)} - d^{(i)} \arctan(2^{-i}),      (5.3)

where i = 0, 1, ..., N_COR − 1 is the iteration index, N_COR the number of
iterations, and d^{(i)} = −sign(x^{(i)} y^{(i)}) indicates whether φ^{(i)} is positive
(d^{(i)} = +1), or negative (d^{(i)} = −1). The initial values are x^{(0)} = x,
y^{(0)} = y, and φ^{(0)} = 0.
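The iteration (5.1) to (5.3) can be checked with a floating-point sketch. It follows the equations literally, takes sign(0) as +1 (an assumption, since the text does not define this case), and is meant for inputs with x > 0, where the accumulated φ approaches the angle of z. The function name is hypothetical and no fixed-point effects are modeled.

```python
import math


def cordic_angle(x: float, y: float, n_iter: int = 16) -> float:
    """Angle of z = x + j*y via the CORDIC iteration (5.1)-(5.3), for x > 0."""
    phi = 0.0
    for i in range(n_iter):
        d = -1.0 if x * y >= 0 else 1.0          # d = -sign(x*y), sign(0) := +1
        # (5.1) and (5.2): both updates use the values of iteration i.
        x, y = x - d * y * 2.0 ** -i, y + d * x * 2.0 ** -i
        # (5.3): accumulate the elementary rotation angle.
        phi = phi - d * math.atan(2.0 ** -i)
    return phi


print(cordic_angle(1.0, 0.5), math.atan2(0.5, 1.0))
```

After 16 iterations the residual error is on the order of arctan(2^-15), i.e. the granularity of the last elementary rotation.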
The CORDIC instruction of the CALU exploits and configures
the datapath to compute (5.1), (5.2), as well as d^(i), which is mapped
onto the CALU’s flags. By checking the corresponding flag, the SEQ
decides whether to perform an addition or a subtraction between
the operands of the second CALU, eventually computing (5.3). The
arctan(2^−i) values are loaded into the ASPE A’s data memory before
the computation starts. In addition, to facilitate the FOC executed
by the CMAC unit, the arctan(2^−i) values are scaled by a factor of
256/2π (see the later description of the CMAC). The alternative of
hard-wiring these values into a look-up table that resides inside the
CALU has been discarded in favor of more numerical flexibility at
run-time. The FU’s datapath word-width of 16 bit permits running up
to NCOR = 16 iterations: arctan(2^−15) = 3.0518 · 10^−5, which, quantized
for a fixed-point representation [16 15], corresponds to 1 and is the
smallest representable quantized number (see footnote 1). Listings 5.1
and 5.2 show two assembler code snippets that compute the CORDIC
algorithm on 16 complex-valued data samples. The former code snippet
is compact and, at first sight, seems efficient in terms of both program
memory and computation. However, a closer look at the CordicIterLoop
at line 20 reveals that only two of the five VLIWs that compose the loop
perform effective data operations (i.e., one of those at lines 21 or 22
depending on the flag of CALU0, and that at line 23), while the
remaining two instructions are required to fill the two branch delay
slots. Thus, the code is not computationally efficient.
The second code snippet (Listing 5.2) is computationally more
efficient. At every second clock cycle an effective data operation takes
place (again, depending on the flag of CALU0). The apparent program
code inefficiency caused by the several repetitions of the same VLIWs
turns out not to be an issue. Thanks to the dictionary-based program
code compression only the unique VLIWs need to be stored inside the
dictionary memory, hence resulting in a compact code. Eventually,
this code snippet is used for the MIMO-OFDM BB implementation.
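The statement that 16 bit permit up to NCOR = 16 iterations can be verified numerically. The sketch below assumes round-to-nearest quantization and a signed [16 15] format, i.e. x = x_q / 2^15; both the rounding mode and the helper names are assumptions, and the additional 256/2π scaling is not applied here.

```python
import math


def quantize(value: float, ww: int = 16, fw: int = 15) -> int:
    """Quantize a real value to a signed fixed-point number in [ww fw] format."""
    xq = round(value * 2 ** fw)                  # assumed: round to nearest
    lo, hi = -(2 ** (ww - 1)), 2 ** (ww - 1) - 1
    return max(lo, min(hi, xq))                  # saturate to the word width


def to_real(xq: int, fw: int = 15) -> float:
    """Map a fixed-point number back to its real value: x = xq / 2^fw."""
    return xq / 2 ** fw


# Quantized elementary angles arctan(2^-i) for i = 0 ... 16.
atan_table = [quantize(math.atan(2.0 ** -i)) for i in range(17)]
print(atan_table[15], atan_table[16])
```

The entry for i = 15 quantizes to 1, the smallest non-zero value, whereas the entry for i = 16 quantizes to 0, so a 17th iteration could no longer change the accumulated angle.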
The CMAC unit

The CMAC unit is illustrated in Figure 5.6. It performs 16 bit
fixed-point complex-valued operations such as multiply, multiply with
complex-conjugate, and the corresponding accumulate operations [e.g.,
as required by (3.12) and (3.13)]. In addition, the CMAC provides
circuitry to support FOC, i.e., r̃[d] = r[d] · e^{−jφd}. This circuitry includes
a look-up table (LUT) containing 256 phasors L[k] = e^{−j2πk/256}, with
k = 0, 1, ..., 255 being the LUT’s address.
FOC is performed by first loading the scaled phase increment (256/2π) φ,
Footnote 1: The notation [ww fw] means that a fixed-point number x_q has
a word-width of ww bits and that its decimal point is positioned at the
fw-th bit, starting from the least significant bit. The mapping from a
fixed-point number to its corresponding real-valued number is:
x = x_q / 2^{fw}.
Listing 5.1: ASPE A assembler code snippet to compute CORDIC
within two loops. The code is compact, but overhead affected.
   // Init
   // D0: contains data samples
   // D1: contains atan values
   $6 = NSAMPLE; \\ Load number of samples to be processed
 5 nop, \
     D0.or0R = 0, \
     D1.or0R = 16, \
     D2.or0R = 0, \
     D3.or0R = 0, \
10   D4.or0R = 0, \
     COEFS.or0R = 0; \\ Init right SIMD memory offset regs
   // Only work on right SIMD memory banks

   // Do loop over all samples and for each sample do the
15 // CORDIC iterations
   SampleLoop:
   nop, D0[rpR]+1; // Get new data sample
   nop, calu0 = cordic(D0, 0);
   $7 = NITER-1, D1[rpR]+1, calu0, calu1 = regs0, regs0[0];
20 CordicIterLoop:
   if (!flag0), calu0, calu1 = int + D1;
   if (!cond), calu0, calu1 = int - D1;
   if (--$7) goto CordicIterLoop, calu0 = cordic(calu0, 1), calu1;
   nop, D1[rpR]+1, calu0, calu1; // Fill delay slot 1
25 nop, calu0, calu1; // Fill delay slot 2
   if (--$6) goto SampleLoop, D1[rpR]-or0, D2[wpR]+1 = caluR1;
   nop; // Fill delay slot 1
   nop; // Fill delay slot 2
   // D2 now contains the angles of all processed samples
Listing 5.2: ASPE A assembler code snippet to compute CORDIC
within one loop. The code is repetitive, but computationally efficient.
 1 // D0: contains data samples, D1: contains atan values
   $6 = NSAMPLE; \\ Load number of samples to be processed
   \\ Init right SIMD memory offset regs:
   nop, D0.or0R = 0, D1.or0R = 16, D2.or0R = 0, D3.or0R = 0, \
        D4.or0R = 0, COEFS.or0R = 0;
 6 // Loop over all samples and for each sample do 16
   // CORDIC iterations
   SampleLoop:
   nop, D0[rpR]+1; // Get new data sample
   nop, calu0 = cordic(D0, 0);
11 nop, calu0;
   nop, D1[rpR]+1, calu0=cordic(calu0, 1), calu1=regs0[0];
   if (!flag10), D1[rpR]+1, calu0=cordic(int, 1), calu1=int+D1;
   if (!cond), D1[rpR]+1, calu0=cordic(int, 1), calu1=int-D1;
16 if (!cond), D1[rpR]+1, calu0=cordic(int, 1), calu1=int-D1;
21 if (!flag10), D1[rpR]+1, calu0=cordic(int, 1), calu1=int+D1;
   ... // Omitted 6 instructions for sake of brevity
31 if (!cond), D1[rpR]+1, calu0=cordic(int, 1), calu1=int-D1;
   if (!flag10), calu0=cordic(int, 1), calu1=int+D1;
   if (!cond), calu0=cordic(int, 1), calu1=int-D1;
   if (--$6) goto SampleLoop;
41 nop, D1[rpR]-or0;
   nop, D2[wpR]+1 = caluR1;
   // D2 now contains the angles of all processed samples
Figure 5.4: One of the two identical datapaths that compose the 2-way SIMD CALU FUs of ASPE A.
(Labels in the figure: operands A and B (16|16 bit) with real(B) and
imag(B); 16 bit control word and decoder; input selection network; |x|²;
max, min; two adder/subtractors; two shifters; flags; result selection
network; saturation; output OUT (16|16 bit).)
Figure 5.5: Configuration and combination of the two CALUs used to compute one CORDIC iteration.
(Labels in the figure: first CALU with operands A and B, control word,
decoder, iteration counter, Re{.} and Im{.} paths, shifters, and
adder/subtractors controlled by −d^(i), with outputs OUT (16|16 bit) and
FLAG to the SEQ; second CALU with operands A and arctan(2^−i), control
word, decoder, and one adder/subtractor producing OUT.)
Table 5.1: CALU instruction coding.
OUT: CALU output, A: operand A, B: operand B, SHAMT: shift
amount provided by the Shamt-field inside the CW.

Instr   Mnemonic     Meaning
0000    HOLD         OUT ← OUT
0001    ADD          OUT ← A + (B >> SHAMT)
0010    SUB          OUT ← A − (B >> SHAMT)
0011    MAX          OUT ← max(A, B)
0100    MIN          OUT ← min(A, B)
0101    SQU          OUT ← |A|² >> SHAMT
0110    NOP          OUT ← A
1000    AND          OUT ← bitand(A, B)
1001    OR           OUT ← bitor(A, B)
1010    CORDIC (a)   OUT ← CORDIC(A)

(a) The CORDIC instruction performs one CORDIC iteration. The first of a
sequence of CORDIC instructions has to reset the internal iteration counter. This
is done by setting the corresponding CW’s Shamt-field to ’0000’.
which is computed by the CALU through CORDIC rotations, into the

corresponding initialization register PHI.2 The scaling of φ is performed
to directly obtain the LUT’s address in the subsequent steps. For each
received sample, the scaled phase increment is accumulated in the phase
accumulator register PHIACCU that resides in the accumulator stage of
the CMAC. Concurrently, each received sample is multiplied by the
corresponding phasor L[PHIACCU] pointed to by the phase accumulator
register. The involved multiplication is performed on the CMAC's
multiply stage; thereafter, the result bypasses the accumulator stage
and is available at the output.
The two accumulator registers (ACCU and PHIACCU) ensure that neither
the accumulated data nor the accumulated phase has to be reloaded
for the CMAC's operation. Thus, no reloading penalty is incurred
when switching back and forth between FOC and FFT computation
during the data processing.
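The FOC operation described above can be illustrated with a minimal software sketch. It assumes a LUT of 256 phasors and a LUT that stores the conjugate (compensating) phasors; neither the table size nor the sign convention is stated in this section, and floating-point complex numbers stand in for the 16 bit fixed-point datapath. The names ldphi and foc are hypothetical helpers, named after the corresponding instructions.

```python
import cmath
import math

LUT_SIZE = 256  # assumed table size; not given in the text

# L[k]: compensating unit phasor for k/LUT_SIZE of a full turn (assumed sign)
LUT = [cmath.exp(-2j * math.pi * k / LUT_SIZE) for k in range(LUT_SIZE)]


def ldphi(phi_rad):
    """Scale a phase increment in radians to LUT-address units (cf. LDPHI)."""
    return round(phi_rad / (2 * math.pi) * LUT_SIZE) % LUT_SIZE


def foc(samples, phi):
    """Per sample: OUT = A * L[PHIACCU]; PHIACCU = PHIACCU + PHI."""
    phiaccu = 0  # cleared when PHI is loaded
    out = []
    for a in samples:
        out.append(a * LUT[phiaccu])
        phiaccu = (phiaccu + phi) % LUT_SIZE  # wrap-around replaces reloading
    return out


# A tone rotating by 3 LUT steps per sample is rotated back to a constant.
rx = [cmath.exp(2j * math.pi * 3 * n / LUT_SIZE) for n in range(16)]
compensated = foc(rx, ldphi(2 * math.pi * 3 / LUT_SIZE))
```

Because the phase is scaled to address units once, the per-sample work reduces to one table look-up, one complex multiplication, and one modular addition.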
The CMAC's FOC operation is another example of how the computational
blocks composing one unit can be combined, raising the utilization of
these blocks and allowing for the support of a broader range of
operations, while adding only a minimal control-related hardware
overhead.
To conclude, Table 5.2 summarizes the Instr field coding of the
CMAC's CW.
2 Loading the PHI-register is done with the load PHI (LDPHI) instruction, see
Table 5.2.
Table 5.2: CMAC instruction coding.
OUT: CMAC output, A: operand A, B: operand B, SHAMT: shift amount
provided by the Shamt-field inside the CW, ACCU: accumulator register,
PHI: φ register, PHIACCU: φ accumulator register.

Instr   Mnemonic   Meaning
0000    NOP        OUT ← ACCU >> SHAMT
0010    MAC        OUT ← ACCU >> SHAMT;        ACCU ← ACCU + A · B
0100    MUL        OUT ← (A · B) >> SHAMT;     ACCU ← 0
0110    MSU        OUT ← ACCU >> SHAMT;        ACCU ← ACCU − A · B
1000    cMAC       OUT ← ACCU >> SHAMT;        ACCU ← ACCU + A · B^H
1010    cMUL       OUT ← (A · B^H) >> SHAMT;   ACCU ← 0
1100    cMSU       OUT ← ACCU >> SHAMT;        ACCU ← ACCU − A · B^H
1110    LDPHI      PHI ← A;                    PHIACCU ← 0
0001    FOC        OUT ← A · L[PHIACCU];       PHIACCU ← PHIACCU + PHI
Figure 5.6: One of the two identical datapaths that compose the 2-way
SIMD CMAC FU of ASPE A.
[Block diagram: operands A and B (16|16 bit) and control word; four 16 × 16 bit multipliers; accumulator registers Re{ACCU} and Im{ACCU}; PHI and PHIACCU registers addressing the LUT L[PHIACCU]; FOC bypass; shift; output OUT (16|16 bit).]
5.2.2 BB processing on ASPE A

This section summarizes the tasks performed on ASPE A from the
frame-start detection to the data processing states. It illustrates the
corresponding datapath configurations and it reports clock cycle counts
required by the assembler implementation to process blocks of Ndc
samples.3 Together, these results are instrumental to determine the
clock frequency the processor has to attain for operating in real-time
and to determine a viable task schedule.
IBUF data sorting The IBUF has a capacity of 64 words. The

received data samples are stored in the left and right parts of the
SIMD memory according to the receive stream they belong to, as
shown in Figure 5.7. The buffer provides the upstream circuitry
with a handshake interface to control the data flow and avoid buffer
overflow. The upstream circuitry is required to provide
the IBUF with the received datastream sampled at a rate of 20 MS/s
and to be capable of buffering at least Ndc samples.
To increase the efficiency of the frame-start detection, the incoming
samples are processed in blocks of the length of one OFDM-symbol
(Ndc = Ns samples).
Frame start detection In a first step and through appropriate

VLIWs, the datapath is configured as illustrated in Figure 5.8(a) for
computing m̄16 [(3.15) in Section 3.5.1]. While computing m̄16 , the
received data samples r[d] are also temporarily stored into SU1 and
SU2, to be further processed in a second step. ASPE A requires
2Ndc + 35 clock cycles for obtaining Ndc mean-energy values m̄16 .
Next, the received data samples previously buffered in SU1 and SU2
are used to compute p̄16 [cf. (3.14)], with the datapath configuration
shown in Figure 5.8(b). To complete this task, another 2Ndc + 35 clock
cycles are required. Finally, the threshold detection (3.16) is performed
in 3Ndc + 20 cycles, configuring the datapath as in Figure 5.9. For
this operation, the datapath crossing mechanism at the input A of
the CALU is exploited, and only one of the two SIMD units is then
employed for the comparison. Once the frame start is detected, the
receiver proceeds into the STF processing state.
3 The reported clock cycle counts are rounded up to the next multiple of five.
The terms 'clock cycle' and 'cycle' are used interchangeably in this chapter.
Figure 5.7: IBUF receive-data sorting.
[Diagram: the interleaved rx datastream is split so that the samples r(2)[d] are stored in the left (L) half and the samples r(1)[d] in the right (R) half of the IBUF.]
Summarizing, frame start detection requires 7Ndc + 90 clock cycles
to process blocks of Ndc received data samples.
STF processing First, FOE as described in Section 3.5.2 is per-

formed in 60 cycles. FOE needs to be computed only once during
STF processing. The datapath configuration for the computation of
the phase rotation by the CORDIC algorithm uses both CALUs and
has been detailed in Section 5.2.1. Then, coarse FOC is performed
on the Ndc received samples employing the CMAC unit and, for that,
Ndc + 10 cycles are taken. The subsequent computation of the mean
energy values m̄64 (3.15), the correlation p̄64 (3.14), and the threshold
detection in (3.16) with L = NLP = 64, necessary to detect the start of
LTF1, are performed in 6Ndc + 75 cycles. If the threshold is detected,
the SU’s base addresses are aligned to match the OFDM-symbol’s
boundary in at most Ndc + 10 cycles.
Thus, in steady state (without FOE), STF processing over Ndc
received data samples requires 8Ndc + 95 cycles.
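The cycle counts above translate directly into a lower bound on the clock frequency. The following sketch adds up the per-task counts reported in the text and relates them to the 20 MS/s sample rate. The block length Ndc = 80 (a 64-sample OFDM symbol plus a 16-sample guard interval) is an assumption made for illustration only; the function names are hypothetical.

```python
SAMPLE_RATE_HZ = 20e6  # input sample rate stated in the text (20 MS/s)


def frame_start_cycles(ndc):
    """m16 computation + p16 computation + threshold detection."""
    return (2 * ndc + 35) + (2 * ndc + 35) + (3 * ndc + 20)


def stf_cycles(ndc):
    """Steady-state STF: coarse FOC + m64/p64/threshold + SU alignment."""
    return (ndc + 10) + (6 * ndc + 75) + (ndc + 10)


def min_clock_hz(cycles, ndc):
    """Clock needed to finish `cycles` while the next ndc samples arrive."""
    return cycles * SAMPLE_RATE_HZ / ndc


NDC = 80  # assumed block length, not taken from this section
for name, cycles in (("frame start", frame_start_cycles(NDC)),
                     ("STF", stf_cycles(NDC))):
    print(f"{name}: {cycles} cycles, >= {min_clock_hz(cycles, NDC) / 1e6:.2f} MHz")
```

The closed forms 7Ndc + 90 and 8Ndc + 95 follow from the sums in the two helper functions for any block length.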
Figure 5.8: Datapath configuration for frame-start detection.
(a) Computation of m̄16. (b) Computation of p̄16.
[Block diagrams: (a) the received samples r(1)[d] and r(2)[d] are read from the IBUF, copied to SU1 and SU2, and passed through CMAC (mean energy), CALU0 (average), and CALU1 (absolute value squared) to produce m̄16; (b) the buffered samples are passed through CMAC (correlation), CALU0 (average), and CALU1 (absolute value squared) to produce p̄16.]
Figure 5.9: Datapath configuration for threshold detection.
[Block diagram: p̄16 and m̄16 are read from the left and right halves of SU0 and compared on CALU0 (threshold detection); the resulting flags go to the SEQ.]
LTF processing Coarse FOC is performed on the samples of the

LTF2 not yet compensated during STF processing (in at most Ndc + 10
cycles). Subsequently, fine FOE takes place (in 60 cycles) employing
p̄64 [dˆ64 ], followed by fine FOC on the three OFDM-symbols that
together compose LTF1 and LTF2, in order to remove the residual
frequency offset (in 250 cycles). Thereafter, the granularity of the
operations switches to that of an OFDM-symbol and the average over
the two OFDM-symbols T1A and T1B composing LTF1 is calculated
(in 85 cycles). Finally, the long guard interval GI2 is removed, and
the averaged LTF1 and the LTF2 are fast Fourier transformed in 250
cycles each.
The datapath configuration for FOE and FOC is the same as
employed for the corresponding task during STF processing. Only
the addressing of the SUs changes and reflects the increased lag of
the correlation: from L = NSP = 16 to L = NLP = 64. To compute
the average over the two OFDM-symbols T1A and T1B that compose
LTF1, one CALU reads the corresponding samples of the LTF1 from
SU1 and SU2 and averages them. The interleaved 64-point FFT
datapath configuration with the corresponding data sorting is shown
in Figure 5.10. Listing 5.3 illustrates the assembler code snippet used
to compute the second stage of the 64-point interleaved FFT.
Figure 5.10: Datapath configuration for 64-point FFT computation.
[Block diagram: interleaved data (r0, r16, r32, r48, ...) in the left and right halves of the SUs, twiddle factors (c0, c1, ...) in SU1; the CMAC multiplies by the twiddle factors and CALU0/CALU1 form the sum and difference of the radix-2 butterfly in the left and right SIMD datapaths.]

Listing 5.3: Second FFT-stage on ASPE A.

// Stage 2
// Read data and coefficients
nop, D3[rpL,rpR]+1, COEFS[rpL,rpR]+1;
// -- Fill pipelines (prologue)
nop, cmac=D3*COEFS>>SHAMT, D3[rpL,rpR]+1;
nop, cmac=D3*COEFS>>SHAMT, D3[rpL,rpR]+1;
nop, cmac=D3*COEFS>>SHAMT, D2[rpL,rpR]+1, D3[rpL,rpR]+1;
nop, calu0=D2+cmac, calu1=D2-cmac, cmac=D3*COEFS>>SHAMT,
     D2[rpL,rpR]+1, D3[rpL,rpR]+1;
nop, calu0=D2+cmac, calu1=D2-cmac, cmac=D3*COEFS>>SHAMT,
     D2[rpL,rpR]+1, D3[rpL,rpR]+1;
// -- Radix-2 butterflies (steady-state)
nop, D0[wpL,wpR]+1=[caluL0,caluR0],
     D1[wpL,wpR]+1=[caluL1,caluR1], calu0=D2+cmac,
     calu1=D2-cmac, cmac=D3*COEFS>>SHAMT, D2[rpL,rpR]+1,
     D3[rpL,rpR]+1;
     D3[rpL,rpR]+1;
... // -- Removed code to shorten listing
     D3[rpL,rpR]+1;
nop, D1[wpL,wpR]+1=[caluL0,caluR0],
     D3[rpL,rpR]-or0;
// -- Empty pipeline stages (epilogue)
     calu1=D2-cmac, cmac=D3*COEFS>>SHAMT, D2[rpL,rpR]+1;
     calu1=D2-cmac, D2[rpL,rpR]+1;
     calu1=D2-cmac, D2[rpL,rpR]-or0;
nop, D1[wpL,wpR]+1=[caluL0,caluR0],
     calu1=D2-cmac;
     D0[wpL,wpR]+1=[caluL1,caluR1];
nop, D1[wpL,wpR]-or1=[caluL0,caluR0],
     D0[wpL,wpR]-or0=[caluL1,caluR1];
// -- Total: 39 clock cycles for stage 2
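The arithmetic performed by each steady-state VLIW of the listing (cmac=D3*COEFS>>SHAMT, calu0=D2+cmac, calu1=D2-cmac) is a radix-2 butterfly. The sketch below restates it in software, as an illustration only: Python complex floats stand in for the 16 bit fixed-point values, the right shift is modeled as a division by a power of two, and the function names are hypothetical. Pipelining and SU addressing are not modeled.

```python
def radix2_butterfly(d2, d3, coef, shamt=0):
    """Return (d2 + cmac, d2 - cmac) with cmac = (d3 * coef) >> shamt."""
    cmac = (d3 * coef) / (1 << shamt)  # CMAC: twiddle multiplication and scaling
    return d2 + cmac, d2 - cmac        # CALU0: sum, CALU1: difference


def fft_stage(upper, lower, coefs, shamt=0):
    """Apply the butterfly to paired operands, as one FFT stage does."""
    sums, diffs = [], []
    for d2, d3, c in zip(upper, lower, coefs):
        s, d = radix2_butterfly(d2, d3, c, shamt)
        sums.append(s)
        diffs.append(d)
    return sums, diffs


# One butterfly with twiddle factor j: (3 - 1j) * 1j = 1 + 3j
top, bottom = radix2_butterfly(1 + 2j, 3 - 1j, 1j)
```

In the listing, the three operations of one butterfly are spread over consecutive VLIWs so that the CMAC and both CALUs are busy in every steady-state cycle.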
MIMO channel processing First, the coarse and fine phases are
added together to generate a single, fine frequency offset estimate (in
15 cycles). Fine FOC takes place next (Ndc + 10 cycles). Then, the
channel estimate is computed by the matrix-matrix multiplication
Ĥ = Z T^{-1} described in Section 3.5.4, for which ASPE A requires
4Nc + 10 cycles. The resulting channel estimates are transferred
to ASPE B, via the OBUF→IBUF link between the two processors.
Once on ASPE B, they are further elaborated to obtain the linear
MMSE estimator matrix G. Then, the processor switches to the data
processing state.
Data processing ASPE A performs fine FOC and computes an FFT

on the received samples, in Ndc + 10 and 250 cycles respectively. The
transformed data is then conveyed to ASPE B for proper detection.
5.3 ASPE B – MIMO Detection

This section describes the FUs, SUs, and the operation of ASPE B as
part of the SDR platform, analogously to the description of ASPE A.
5.3.1 Datapath configuration

ASPE B is dedicated to MIMO detection and configured with four FUs,
four SUs, an IBUF, an OBUF, and with the register-file as illustrated
in Figure 5.11. The FUs necessary for the matrix computations
involved in MIMO detection are: one CALU, two CMACs, and one
real-valued divider unit (DIV). The DIV unit is necessary to perform
matrix-inversion, as described in Section 3.5.4.
In order to reach the highest possible clock frequency, the FUs of
ASPE B have two different pipeline depths. The CALU and the two
CMAC units have three pipeline stages each, while the DIV unit has
been pipelined with seven stages. As a result, all units have the same
critical-path length.
Figure 5.11: Datapath configuration of ASPE B.
[Block diagram: control path with SEQ, index memory, and VLIW dictionary memories distributing the CWs over the control network; datapath with SU0 to SU3, OBUF, IBUF, REG, CMAC0, CMAC1, CALU, and DIV connected by the data network; Data In/Out with Req/Ack handshake signals. All datapath units are 2-way SIMD units.]
The CALU unit

The CALU (see Figure 5.12) implements the instructions reported in
Table 5.3. Compared to the CALU of ASPE A, no CORDIC support is
provided, since it is not required during MIMO detection. Instead, the
important feature of adding only the real parts of two complex-valued
numbers, while setting the imaginary part to zero, is implemented.
The CMAC unit

The CMAC unit of ASPE B is illustrated in Figure 5.13. It performs
16 bit fixed-point complex-valued operations such as multiply, multiply
with complex-conjugate, and the corresponding accumulate opera-
tions. The Instr field of the CMAC codes the instructions reported
in Table 5.4. Even though its main functionality is the same as the
CMAC of ASPE A, its structure is adapted to the particular needs
of the MIMO detection algorithms. In particular, the FOC support
is dropped in favor of a few additional variations of complex-valued
multiply operations (e.g., cMULn, cMULcn) and a wider shift range
for the CMAC's result. Overall, the circuit's complexity is reduced
compared to the CMAC unit of ASPE A.
Figure 5.12: One of the two identical datapaths that compose the
2-way SIMD CALU FU of ASPE B.
[Block diagram: operands A and B (16|16 bit) and control word; SHREG shift-amount register; add/subtract on the real and imaginary parts; shift; saturation; output OUT (16|16 bit).]
Table 5.3: ASPE B CALU instruction coding.
OUT: CALU's output, A: operand A, B: operand B, SHREG: shift
amount register, SHAMT: shift amount provided by the Shamt-field
inside the CW.

Instr   Mnemonic   Meaning
0000    NOP        OUT ← B >> SHAMT
0001    HOLD       OUT ← OUT
0010    LDSHAMT    SHREG ← B
0011    SHIFT      OUT ← B >> (SHREG + SHAMT)
0100    ADDRe      OUT ← (Re{A} + Re{B}) >> (SHREG + SHAMT)
0110    SUB        OUT ← (A − B) >> (SHREG + SHAMT)
0111    ADD        OUT ← (A + B) >> (SHREG + SHAMT)
The DIV unit
Figure 5.14 depicts the real-valued divider unit, involved in the 2 × 2
direct matrix inversion required to compute the estimator matrix G. The
divider performs 16 bit divisions and provides the instructions
reported in Table 5.5. The DIV instruction performs a standard, signed
division between the operands A and B. The SHDIV instruction, instead,
is mainly used to compute the shifted inverse 2^⌊log2 B⌋ / B of the
real-valued fixed-point operand B. Note that for the SHDIV instruction,
both operands of the DIV unit have to be set to B. The
so-obtained shifted inverse allows the dynamic range of the division
to be fully exploited. The shift amount ⌊log2 B⌋ is always available
at the DIV's output one cycle before the actual division result and can
thus be incorporated in the subsequent computations. The shifted inverse
of a negative number results in the largest positive value at the output
of the DIV unit. This behavior is desired for the computation of the
matrix inversion where the determinant of the Hermitian matrix F,
which has to be inverted, is always positive.
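The benefit of the shifted inverse can be seen in a small integer sketch. It is an illustration under assumptions, not the DIV unit's actual arithmetic: the quotient is represented with 15 fractional bits, it is saturated to the largest positive 16 bit value, and an operand of zero is treated like a negative one. The function name shdiv merely mirrors the instruction mnemonic.

```python
def shdiv(b, frac_bits=15):
    """Return (shift, q) with shift = floor(log2 b) and q ~ 2**shift / b.

    q is an integer with `frac_bits` fractional bits (assumed format).
    """
    q_max = (1 << frac_bits) - 1
    if b <= 0:
        # Text: a negative operand yields the largest positive output.
        # Treating zero the same way is an assumption of this sketch.
        return 0, q_max
    shift = b.bit_length() - 1               # floor(log2 b) for b > 0
    q = ((1 << shift) << frac_bits) // b     # 2**shift / b lies in (0.5, 1]
    return shift, min(q, q_max)              # saturate the exact-1.0 case


def inverse_from(shift, q, frac_bits=15):
    """Recover 1/b by undoing the scaling and the shift."""
    return q / (1 << frac_bits) / (1 << shift)


shift, q = shdiv(100)
approx = inverse_from(shift, q)  # close to 0.01
```

Since 2^⌊log2 B⌋ / B always lies in (0.5, 1], the quotient uses nearly the full word width regardless of the magnitude of B; the shift amount is applied later, which matches the role of the separately available ⌊log2 B⌋ output described above.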
Figure 5.13: One of the two identical datapaths that compose the
2-way SIMD CMAC FU of ASPE B.
[Block diagram: operands A and B (16|16 bit) and control word; four 16 × 16 bit multipliers; SHREG shift-amount register; accumulator registers Re{ACCU} and Im{ACCU}; shift; output OUT (16|16 bit).]
Table 5.4: ASPE B CMAC instruction coding.
OUT: CMAC's output, A: operand A, B: operand B, SHREG: shift
amount register, SHAMT: shift amount provided by the Shamt-field
inside the CW.

Instr   Mnemonic   Meaning
0001    LDSHAMT    SHREG ← B
0010    MUL        OUT ← (A · B) >> (SHREG + SHAMT)
0011    MAC        OUT ← ACCU >> (SHREG + SHAMT);   ACCU ← ACCU + A · B
0100    cMUL       OUT ← (A · B^H) >> (SHREG + SHAMT)
0101    cMAC       OUT ← ACCU >> (SHREG + SHAMT);   ACCU ← ACCU + A · B^H
0110    cMULn      OUT ← −(A · B^H) >> (SHREG + SHAMT)
1000    MACn       OUT ← ACCU >> (SHREG + SHAMT);   ACCU ← ACCU − A · B^H
1001    cMULcn     OUT ← −(A^H · B) >> (SHREG + SHAMT)
1010    cMACcn     OUT ← ACCU >> (SHREG + SHAMT);   ACCU ← ACCU − A^H · B
1011    NOP        OUT ← ACCU >> (SHREG + SHAMT)
Table 5.5: ASPE B DIV instruction coding.
OUT: DIV's output, A: operand A, B: operand B, SHAMT: shift
amount provided by the Shamt-field inside the CW.

Instr   Mnemonic   Meaning
0000    NOP        OUT ← ⌊log2(A)⌋ >> (SHREG + SHAMT)
0001    DIV        OUT ← A/B >> (SHREG + SHAMT)
0010    SHDIV      OUT ← 2^⌊log2(A)⌋ / B >> (SHREG + SHAMT)
0011    HOLD       OUT ← OUT >> (SHREG + SHAMT)
Figure 5.14: One of the two identical datapaths that compose the
2-way SIMD DIV FU of ASPE B.
[Block diagram: operands Re{A} and Re{B} (16 bit) and control word; X = log2(Re{A}) if A > 0, otherwise X = 0; pipelined divider operating on 2^X; shift; output OUT (16|16 bit).]
5.3.2 BB processing on ASPE B

ASPE B remains idle until it is triggered by the first channel-estimate
samples dropping into its IBUF in order to start with MIMO channel
processing.
MIMO channel processing ASPE B computes blocks of four 2 × 2
linear MMSE estimator matrices G in 44 cycles: the computation of
four F matrices requires 12 clock cycles, the matrix inversions resulting
in four F^{-1} matrices are then performed in 20 clock cycles, and the
matrix multiplications to obtain four G = F^{-1} Ĥ^H are computed in
12 clock cycles. Figure 5.15 illustrates the data sorting adopted and
how the FUs are chained in order to compute F = H^H H + M_T σ^2 I
(the addition of M_T σ^2 to the diagonal elements of H^H H is performed
on the CALU and is not shown in the figure).
Data processing During the data processing state, ŝ is computed
by a matrix-vector multiplication combined with an arithmetic shift
(thus performing hard-out detection). The configuration is similar
to that of Figure 5.15 and 16 cycles are required for computing four
symbols ŝ, independent of the modulation order M. Finally, the
detected data is written to the OBUF where it is ready to be demapped,
de-interleaved, and conveyed to an external Viterbi decoder.
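The three steps of the estimator computation (F, its direct 2 × 2 inversion, and G) can be summarized by the following floating-point sketch. It is only meant to make the sequence of operations explicit: fixed-point scaling, the shifted inverse of the DIV unit, and the SIMD processing of four matrices at a time are not modeled, and all function names are hypothetical.

```python
def herm(m):
    """Conjugate transpose of a 2x2 matrix stored as nested lists."""
    return [[m[0][0].conjugate(), m[1][0].conjugate()],
            [m[0][1].conjugate(), m[1][1].conjugate()]]


def matmul(a, b):
    """Product of two 2x2 matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]


def mmse_estimator(h, sigma2, m_t=2):
    """G = F^-1 H^H with F = H^H H + M_T * sigma^2 * I."""
    hh = herm(h)
    f = matmul(hh, h)                 # step 1: F (CMACs, then CALU for diagonal)
    f[0][0] += m_t * sigma2
    f[1][1] += m_t * sigma2
    # Step 2: direct inversion; det(F) is real and positive for Hermitian F.
    det = (f[0][0] * f[1][1] - f[0][1] * f[1][0]).real
    f_inv = [[f[1][1] / det, -f[0][1] / det],
             [-f[1][0] / det, f[0][0] / det]]
    return matmul(f_inv, hh)          # step 3: G


def detect(g, y):
    """Linear detection: s_hat = G y for a length-2 receive vector y."""
    return [g[0][0] * y[0] + g[0][1] * y[1],
            g[1][0] * y[0] + g[1][1] * y[1]]


H = [[1 + 1j, 0.5 + 0j], [0.2j, 2 - 1j]]
G = mmse_estimator(H, sigma2=0.0)     # without noise, G equals the inverse of H
```

With a noise variance of zero the estimator reduces to the channel inverse, which gives a simple consistency check for the sketch.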
Figure 5.15: Datapath configuration for the matrix-matrix multiplication
Ĥ^H Ĥ.
[Block diagram: the elements of X_L and X_R are stored in the left and right halves of ST0 and ST1; CMAC0 and CMAC1 compute Y_L = X_L^H X_L and Y_R = X_R^H X_R in the left and right SIMD datapaths; the results are written to ST2 and ST3.]
5.4 Dictionary Based Program-Code Compression

The SISO-OFDM BB processing implementation on the ASPE de-
scribed in Section 4.3 revealed that the achieved program code density
is poor, as it is typical for VLIW architectures.4 VLIW processors can
exploit the instruction-level parallelism (ILP) inherent to data process-
ing algorithms by operating the FUs allocated inside their datapath
concurrently. Unfortunately, the implemented application often does
not require all available units to run concurrently, and therefore the
code contains many no-operations (NOPs) for the unemployed units.
Especially for VLIW processors such as the ASPE, which store the entire
VLIW in one program memory word (i.e., fixed instruction format
VLIW processors), this means that the code-density is low. Moreover,
the application may require one VLIW to be repeated several
times inside the program memory, resulting in large programs (cf.
Listing 5.2).
In this thesis, dictionary-based program-code compression (DBCC)
is implemented to alleviate this inefficiency. DBCC schemes collect
all unique instructions of a program into a dictionary memory at
compile-time. Then, at run-time, the dictionary entries are indexed
through a much narrower, but deeper memory, from which the original
program flow is reconstructed. In [89], dictionary-based
program-code compression was introduced to VLIW architectures.
Compared to other methods that use, for instance, entropy encoding
(e.g., as in [90, 91, 92]) and are better suited for flexible instruction
format VLIWs, DBCC is well suited for fixed instruction width VLIW
processors. It achieves good compression ratios, while the additional
hardware overhead, required to decompress and restore the original
program flow, is low. Motivated by the observation that the dictionary
memory still contains a significant number of NOP CWs, in this thesis,
the dictionary memory is further compressed, which represents an
addition to conventional DBCC.
In the following, first the initial SISO-OFDM ASPE program
memory configuration is reported, which will serve as a reference. Then,
the actual DBCC method with the additional dictionary compaction
step is described and compared to the reference setup.
4 In this thesis, the code-density is defined as
ρ = (# of instructions) / (# of instructions + # of NOPs).
Table 5.6: Benchmark program sizes and code densities.

Benchmark                P     ρ     ρ̄ (a)   bunc
SISO BB proc.            762   24%   76%     146304
16 Tap FIR (b)           44    24%   76%     8448
64-point FFT             149   47%   53%     28608
Two 64-point FFT (c)     253   47%   53%     49152
MIMO BB kernel (d)       506   34%   66%     97152

(a) ρ̄ = 1 − ρ.
(b) 16 tap finite impulse response (FIR) filter implementation.
(c) Implementation of two 64-point FFTs interleaved in memory, as used in
Section 5.2.2.
(d) Implementation of MIMO detection, as used in Section 5.2.2.
5.4.1 Reference design
The ASPE performing the SISO-OFDM BB processing of Section 4.3
is equipped with eleven units (FUs and SUs, cf. Figure 4.7). One of
its VLIWs is 192 bits wide and comprises Nu = 12 CWs, to control
the SEQ unit and the eleven units (each CW is 16 bits wide). The
program memory can contain P = 762 VLIWs, thus resulting in an
uncompressed storage capacity of bunc = 146 304 bit.
Figure 5.16 illustrates the area breakdown of the various building
blocks of the reference ASPE, placed and routed for 0.18 µm CMOS
technology. Here, the SEQ share amounts to almost 25% of the proces-
sor’s total silicon area (or 1 mm2 ), which underlines that the required
program-code memory area is indeed considerable. The program
code-density for the SISO-OFDM assembler program of Section 4.3
is reported in Table 5.6, together with a set of other benchmark pro-
grams. The SISO-OFDM assembler code implementation attains a
low code-density: 24% of all CWs are useful instructions, whereas the
remaining 76% of the CWs are filled with NOPs. For the additional
benchmarks the situation is similar.
Figure 5.16: Area breakdown of the ASPE placed and routed for
0.18 µm CMOS technology. The total area amounts to 4.2 mm2, the
SEQ containing the program memory occupies 1 mm2.
[Pie chart shares: SEQ 24%, CMAC 13%, Other 12%, CALU0 7%, CALU1 7%, ST0 5%, ST1 5%, ST2 5%, ST3 5%, ST4 5%, ST5 4%, IBUF 4%, Regfile 2%.]
5.4.2 DBCC with NOP bitmask

DBCC Figure 5.17 depicts the hardware components necessary to
implement DBCC (without the additional dictionary compression step
yet). To generate the dictionary, the initial assembler program is
parsed and each unique VLIW is saved into the dictionary binary
file. Concurrently, to generate the indexes from which the original
program is reconstructed, for each parsed VLIW the address of the
corresponding VLIW inside the dictionary is stored into the index binary
file. These two files are loaded into the index and dictionary memories,
after which the ASPE is ready to operate. The sequencer starts by
fetching an index from the index memory. This index addresses the
dictionary, from where the effective VLIW is fetched. Then, the CWs
composing the VLIW are directed to the corresponding units, i.e., to
the sequencer and to the eleven datapath units. In the figure, NOP
CWs are indicated by '----'.
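The dictionary and index generation described above amounts to de-duplicating the VLIWs of the program. The following sketch shows the idea with VLIWs represented as strings; the function names and the first-seen ordering of the dictionary are assumptions made for illustration, not the tool flow used in this thesis.

```python
def build_dbcc(program):
    """Split a program into (dictionary, indexes).

    dictionary: the unique VLIWs, in order of first appearance (assumed order)
    indexes:    for every program position, the dictionary address of its VLIW
    """
    dictionary = []
    address_of = {}
    indexes = []
    for vliw in program:
        if vliw not in address_of:
            address_of[vliw] = len(dictionary)
            dictionary.append(vliw)
        indexes.append(address_of[vliw])
    return dictionary, indexes


def fetch(dictionary, indexes, pc):
    """Two-step fetch: index memory first, then dictionary memory."""
    return dictionary[indexes[pc]]


program = ["A", "B", "A", "C", "B"]  # placeholder VLIWs
dictionary, indexes = build_dbcc(program)
restored = [fetch(dictionary, indexes, pc) for pc in range(len(program))]
```

The two-step fetch in the sketch corresponds to the additional instruction fetch step through the index memory.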
As suggested by the dashed contour in Figure 5.17, the change to
the SEQ structure required to support DBCC is only minimal. The
interfaces to the program memory remain the same as for the SEQ
operating without DBCC. The only difference lies in the additional
instruction fetch step introduced by the index memory, which is
necessary to retrieve the actual VLIW that has to be executed. Jumps and
branches are handled as without the dictionary; the programmer only
has to respect the additional latency cycle.
Figure 5.17: Dictionary-based decompression hardware.
[Block diagram: the sequencer's PC addresses the index memory (program length P); the fetched index addresses the uncompressed dictionary memory (dictionary length L, width Nu CWs), from which the CWs are sent to the units. NOP CWs are shown as '----'.]
Table 5.7 summarizes the results obtained when applying DBCC
to the benchmark programs of Table 5.6. The program length
P remains the same as in the reference design, but the word width of
the program (index) memory is reduced to ⌈log2 L⌉ bits. L denotes the
number of words of the dictionary memory, and bcmp the storage bits
required for the index memory and the dictionary memory together. The
compression ratio is commonly defined as Rbit = bcmp / bunc and thus
takes the additional overhead caused by the index memory into account.
Low Rbit values indicate good compression and, as the table reveals,
a first improvement compared to the reference design is achieved. However,
the code densities ρ attained by the DBCC benchmark programs are
still poor. The additional dictionary compression step described next
further improves this aspect.
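The storage figures can be reproduced from P and L alone. The sketch below assumes, as in the reference design, Nu = 12 CWs of 16 bits per VLIW; the function names are hypothetical.

```python
def uncompressed_bits(p, n_units=12, cw_bits=16):
    """b_unc: P VLIWs of n_units * cw_bits bits each."""
    return p * n_units * cw_bits


def index_width(l):
    """ceil(log2 L) bits are needed to address L dictionary words."""
    return (l - 1).bit_length()


def dbcc_bits(p, l, n_units=12, cw_bits=16):
    """b_cmp: index memory (P words) plus dictionary memory (L VLIWs)."""
    return p * index_width(l) + l * n_units * cw_bits


# SISO BB processing benchmark: P = 762, L = 433
b_unc = uncompressed_bits(762)   # 146304
b_cmp = dbcc_bits(762, 433)      # 89994
r_bit = b_cmp / b_unc            # about 0.62
```

For the SISO benchmark, the index memory contributes 762 × 9 = 6858 bits and the dictionary 433 × 192 = 83136 bits, so the dictionary dominates the compressed size.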
Table 5.7: DBCC benchmark program sizes and code densities.

Benchmark           P     L     ρ     ρ̄ (a)   bcmp    Rbit (b)
SISO BB proc.       762   433   27%   73%     89994   62%
16 Tap FIR          44    13    26%   74%     2672    32%
64-point FFT        149   118   46%   54%     23699   83%
Two 64-point FFT    253   145   49%   51%     29864   61%
MIMO BB kernel      506   213   42%   58%     44944   46%

(a) ρ̄ = 1 − ρ.
(b) The lower the better.
DBCC with NOP bitmask The dictionary memory is compressed

in a second step. Figure 5.18 depicts the decompression hardware
block diagram and illustrates the concept adopted for the dictionary
compression step. Again, it is interesting to note that the interfaces to
the program memory (see dashed contour) remain the same as without
DBCC, enabling a modular design. The uncompressed dictionary is
shown on top (label 1). Here, the three highlighted VLIWs can be
condensed together into one compressed dictionary memory entry.
The generated compressed dictionary, with the corresponding pro-
gram memory containing indexes and mask bits, is shown in the middle
of Figure 5.18 (label 2). The three highlighted VLIWs satisfy the test
described in the next paragraph and have been mapped onto the single,
condensed VLIW highlighted in the compressed dictionary memory.
The reconstruction of the original VLIW requires an additional bitmask
to be stored inside the index memory, together with the pointer
to the corresponding compressed dictionary word. This bitmask
consists of Nu bits and identifies whether each CW
in the original VLIW is an effective instruction or a NOP. If the ith
bit of the bitmask is '0', then the CW at position i ∈ {1, . . . , Nu}
in the original VLIW was a NOP. Thus, the CW is masked out and
replaced with a NOP CW, regardless of the content of the condensed
VLIW at that position. Otherwise, if the ith bit of
the bitmask is '1', the considered CW was an effective instruction and
is not replaced. As shown at the bottom of Figure 5.18 (label 3),
three appropriate bitmasks allow the original VLIWs to be reconstructed
from the condensed one by setting the CWs of unused units
to NOPs.
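The mask logic can be sketched in a few lines. The condensed VLIW and the three masks below are the example values shown in Figure 5.18; the string representation (four hexadecimal characters per CW, '----' for a NOP), the convention that the leftmost mask bit belongs to the first CW, and the function names are assumptions of this illustration.

```python
NOP = "----"


def split_cws(vliw, cw_chars=4):
    """Cut a VLIW string into its control words."""
    return [vliw[i:i + cw_chars] for i in range(0, len(vliw), cw_chars)]


def apply_mask(condensed, mask):
    """Keep the CW where the mask bit is '1', force a NOP where it is '0'."""
    cws = split_cws(condensed)
    return "".join(cw if bit == "1" else NOP for cw, bit in zip(cws, mask))


condensed = "0038600040D930E0600E7800A100E1D5E14C4400E1D5E1D5"
for mask in ("000011000011", "000000000011", "100000000000"):
    print(mask, apply_mask(condensed, mask))
```

Each of the three masks selects a different subset of the twelve CWs, so a single dictionary word serves three distinct VLIWs of the original program.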
Figure 5.18: Decompressor hardware for dictionary-based code
decompression with NOP bitmask.
[Block diagram: label 1 shows the uncompressed dictionary memory, from which further NOPs are removed; label 2 shows the index and mask memory (program length P) addressing the compressed dictionary memory (dictionary length L′, width Nu CWs); label 3 shows the mask logic delivering the CWs to the units. In the depicted example, the condensed VLIW 0038600040D930E0600E7800A100E1D5E14C4400E1D5E1D5 is combined with the masks 000011000011, 000000000011, and 100000000000 to reconstruct the three highlighted original VLIWs.]
Two VLIWs can be condensed together if they pass the following
test. For each of the Nu unit slots, the two corresponding CWs of the
two VLIWs are compared. The comparison has four possible outcomes:
1. If the CWs of both VLIWs are NOPs, then the chance of
condensing the two VLIWs is still intact.
2. If the CWs of both VLIWs are effective instructions (i.e.,
non-NOPs) and the two CWs are equal, then the chance of
condensing the two VLIWs is still intact.
3. If the CW of one VLIW is an effective instruction and that of
the other is a NOP CW, then the chance of condensing the two VLIWs
is still intact.
4. If none of the above three conditions holds, then the two VLIWs
cannot be condensed.
Once all Nu CW-pairs are tested and it is found that the two VLIWs
can be merged, the resulting condensed VLIW is stored inside the
compressed dictionary memory.
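The four-outcome test can be written compactly, as the following sketch suggests. VLIWs are modeled as lists of CW strings with '----' denoting a NOP; this representation and the function names are assumptions made for illustration and do not reproduce Algorithm 1.

```python
NOP = "----"


def can_condense(a, b):
    """True if every CW pair is NOP/NOP, NOP/instruction, or two equal CWs."""
    return all(x == NOP or y == NOP or x == y for x, y in zip(a, b))


def condense(a, b):
    """Merge two compatible VLIWs, keeping the effective CW of each slot."""
    if not can_condense(a, b):
        raise ValueError("conflicting control words")
    return [y if x == NOP else x for x, y in zip(a, b)]


v1 = [NOP, "600E", "7800", NOP, "E1D5"]
v2 = ["0038", NOP, NOP, NOP, NOP]
v3 = [NOP, "600E", "A100", NOP, NOP]   # conflicts with v1 in the third slot

merged = condense(v1, v2)
```

Only outcome 4 (two different effective instructions in the same slot) prevents a merge, which is why a single boolean expression per slot is sufficient.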
The algorithm used to generate the compressed dictionary is illustrated
by the pseudo-code fragment in Algorithm 1. The uncompressed
dictionary memory is considered as an L × Nu matrix M, and the
resulting compressed dictionary as an L′ × Nu matrix N. The entries
of both matrices are CWs of the datapath units (FUs and SUs) and
the SEQ. The notation M(m, :) means that the entire mth row of M is
accessed, and similarly, M(:, n) means that the entire nth column of M
is accessed. At line 1, the sortcols(.) function permutes the columns of
the uncompressed dictionary matrix M such that the first column of the
output matrix T1 contains the column of M with the most non-NOP
CWs, and the last column of T1 the column of M with the fewest
non-NOP CWs. The vector t1i stores the column-permutation indexes
of T1. The sortrows(.) function at line 3, instead, sorts the rows of
its argument vector according to how often the CWs occur inside the
original, uncompressed program. The most frequent CW is permuted
to the first row of t2, whereas the least frequent CW becomes the
last entry of t2. The function genpattern(.) (line 15) generates an
appropriate pattern pat that is used to test whether the four
above-mentioned conditions can be satisfied. The operator =∼ performs
this test at once.
Table 5.8: DBCC with NOP masking for the benchmark programs.

Benchmark           P     L′    ρ     ρ̄ (a)   bcmp    Rbit (b)
SISO BB proc.       762   142   53%   47%     42504   29%
16 Tap FIR          44    8     33%   67%     2196    25%
64-point FFT        149   74    59%   41%     17039   59%
Two 64-point FFT    253   97    59%   41%     23431   48%
MIMO BB kernel      506   116   63%   37%     31886   33%

(a) ρ̄ = 1 − ρ.
(b) The lower the better.
After the completion of Algorithm 1, the compressed dictionary
is available in the matrix N. The compressed dictionary pointers and
the NOP bitmasks can be easily generated by parsing the VLIWs of the
original uncompressed program M and comparing these VLIWs with
the entries of the compressed dictionary memory. If the original VLIW
can be reconstructed from the condensed VLIW by an appropriate
bitmask, then the corresponding address of the condensed VLIW and
the NOP bitmask are stored inside the index memory.
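
The condensing test and the reconstruction through a NOP bitmask can be illustrated with a minimal Python sketch. It is not the tool flow used in this work: the string encoding of the CWs, the polarity of the mask bits, and all function names (can_condense, condense, nop_mask, expand) are assumptions made for the illustration.

```python
NOP = "nop"  # assumed placeholder for the NOP control word of any unit


def can_condense(vliw_a, vliw_b):
    """Two VLIWs can be condensed if, for every unit, the CWs are equal
    or at least one of them is a NOP."""
    return all(a == b or a == NOP or b == NOP for a, b in zip(vliw_a, vliw_b))


def condense(vliw_a, vliw_b):
    """Merge two compatible VLIWs, keeping the effective CW of each unit."""
    if not can_condense(vliw_a, vliw_b):
        raise ValueError("VLIWs conflict on at least one unit")
    return tuple(b if a == NOP else a for a, b in zip(vliw_a, vliw_b))


def nop_mask(original, condensed):
    """Per-unit mask (assumed polarity: True keeps the condensed CW,
    False forces a NOP). Returns None if no mask can rebuild the original."""
    if any(o != NOP and o != c for o, c in zip(original, condensed)):
        return None
    return tuple(o != NOP for o in original)


def expand(condensed, mask):
    """Rebuild the original VLIW from a condensed VLIW and its mask."""
    return tuple(c if keep else NOP for c, keep in zip(condensed, mask))


# Two VLIWs for three hypothetical units.
v1 = ("mac", NOP, "load")
v2 = ("mac", "add", NOP)
entry = condense(v1, v2)
restored = [expand(entry, nop_mask(v, entry)) for v in (v1, v2)]
```

In this sketch one dictionary entry serves both VLIWs, and only the per-VLIW mask differs.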
Finally, Table 5.8 shows the results obtained with the additional
dictionary compression step. Compared to DBCC without this
compaction step, the compression ratio decreases further, to
Rbit = 39 % on average. The decoding latency remains one clock
cycle, and the resulting decompression hardware overhead of
0.015 mm² is relatively small.
SEQ program memory configuration The definitive SEQ's index
and mask memory can contain P = 1024 pointers to the compressed
dictionary memory. The compressed dictionary memory is selected to
contain L0 = 256 condensed VLIWs, which, together with the support
of Nu = 12 units, defines the width of the index and mask memory as
⌈log2 L0⌉ + Nu = 20 bits. With this program memory configuration,
the total area of the SEQ amounts to 0.78 mm². Thus, compared to
the reference ASPE, where the SEQ occupied an area of 1 mm², DBCC
Algorithm 1 Dictionary compression step.

In: Uncompressed dictionary M
Out: Compressed dictionary N
 1: [T1, t1i] = sortcols(M)
 2: for k = 1, 2, . . . , Nu do
 3:   t2 = sortrows(T1(:, k))
 4:   for l = 1, 2, . . . , length(t2) do
 5:     for m = 1, 2, . . . , L do
 6:       cw = t2(l) // Get the CW to be processed
 7:       if cw == M(m, t1i(k)) then
 8:         if M(m, :) ∈ N then
 9:           // VLIW is already in compressed dictionary N
10:         else
11:           if size(N) == 0 then
12:             N(1, :) = M(m, :) // Init
13:           else
14:             for o = 1, . . . , length(N) do
15:               pat = genpattern(N(o, :))
16:               if pat =∼ M(m, :) then
17:                 for p = 1, 2, . . . , Nu do
18:                   if N(o, p) == M(m, p) then
19:                     // CWs are equal
20:                   else if (N(o, p) == NOP) and (M(m, p) != NOP) then
21:                     N(o, p) = M(m, p) // Modify compressed dict. entry
22:                   else if (N(o, p) != NOP) and (M(m, p) == NOP) then
23:                     // Compressed dict. CW will be masked out.
24:                   else
25:                     error()
26:                   end if
27:                 end for
28:                 break
29:               else
30:                 if o == length(N) then // No matching entry was found
31:                   N(o + 1, :) = M(m, :) // Append VLIW to end of compr. dict.
32:                 end if
33:               end if
34:             end for
35:           end if
36:         end if
37:       end if
38:     end for
39:   end for
40: end for
enabled an area saving of 22 %.

In addition, this configuration supports all of the considered
benchmark programs and can run programs containing more than 762
VLIWs, namely up to 1024 VLIWs when the ratio L0/P ≤ 0.25. As a
reference, the average L0/P ratio of the five considered benchmark
programs amounts to 0.29, resulting in an average supported program
length of 256/0.29 = 882 VLIWs.
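
The figures of this memory configuration can be re-derived with a few lines of Python. This is only a numerical cross-check of the values quoted above; the function names are hypothetical.

```python
import math


def index_word_width(dict_entries, num_units):
    """Width of one index-and-mask word: a pointer into the compressed
    dictionary plus one NOP mask bit per unit."""
    pointer_bits = (dict_entries - 1).bit_length()  # ceil(log2(dict_entries))
    return pointer_bits + num_units


def supported_program_length(dict_entries, pointers, dict_to_program_ratio):
    """Longest program (in VLIWs) that fits, given the ratio L0/P of the
    program and the capacity of both memories."""
    return min(pointers, math.floor(dict_entries / dict_to_program_ratio))


width_bits = index_word_width(dict_entries=256, num_units=12)
best_case = supported_program_length(256, 1024, 0.25)
average_case = supported_program_length(256, 1024, 0.29)
seq_area_saving = 1.0 - 0.78 / 1.0  # SEQ area with DBCC vs. reference ASPE
```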
5.5 Implementation Results

Task Schedule for ASPE A and ASPE B Careful scheduling
is required to efficiently share all FUs and SUs on both ASPEs.
Tables 5.9 and 5.10 summarize the cycle counts of the tasks performed
on ASPE A and ASPE B, respectively. With these results the schedule
of Figure 5.19 is determined. The figure depicts how the tasks are
scheduled between the two ASPE instances when running at a clock
frequency of 250 MHz.
The hard computational kernels that have to be performed just
after the detection of a MIMO-OFDM-frame on ASPE A (e.g., LTF
processing) almost fill one duty cycle, fully exploiting the available
computational power. However, during the data processing part of
the frame ASPE A is less loaded. The situation is similar on ASPE B.
Here the computational load required for performing the matrix in-
version fully claims the available processing power during the MIMO
channel processing state. Later, while performing data processing, the
computational load is reduced and the resources that were allocated
for the highest-load period are only partially utilized.
At first sight, the suboptimal resource utilization during the major
part of the frame suggests that the duty cycle should be extended
to allow for a reduction of the hardware complexity. Unfortunately,
such an extension of the duty cycle would also increase the receiver's
latency, which is usually constrained (in the solution at hand, by the
IEEE 802.11n standard), and is thus not a viable solution.
Silicon realization The system composed of two different ASPE

configurations presented in this chapter is capable of implementing the
BB processing relevant tasks required for a 2×2 MIMO-OFDM receiver,
Table 5.9: Assembler cycle counts and processing times for 2 × 2
MIMO-OFDM processing on ASPE A running at 250 MHz.

State / Task                       Cycle counts        #     Time [µs]
Frame start detection
  correlation                      2Ndc + 35           195   0.78
  mean energy                      2Ndc + 35           195   0.78
  threshold det.                   3Ndc + 20           260   1.04
  TOTAL                            7Ndc + 90           650   2.60
Short preamble processing (init)
  coarse FOE                       60                  60    0.24
  coarse FOC                       Ndc + 10            90    0.36
  TOTAL                            Ndc + 70            150   0.60
Short preamble processing
  coarse FOC                       Ndc + 10            90    0.36
  LTF1 start detect                6Ndc + 75           555   2.22
  align SU addresses               Ndc + 10            90    0.36
  TOTAL                            8Ndc + 95           735   2.94
LTF processing
  coarse FOC                       max. Ndc + 10       90    0.36
  fine FOE                         60                  60    0.24
  fine FOC LTF1&LTF2               3Ndc + 10           250   1.00
  average LTF1                     85                  85    0.34
  FFT on LTF1                      250                 250   1.00
  FFT on LTF2                      250                 250   1.00
  TOTAL                            4Ndc + 665          985   3.94
MIMO channel processing
  φf + φc                          15                  15
  fine FOC                         Ndc + 10            90    0.36
  Channel est.                     4N + 10             218   0.87
  FFT on S1                        250                 250   1.00
  TOTAL                            Ndc + 4N + 285      573   2.29
Data processing
  fine FOC                         Ndc + 10            90    0.36
  FFT                              250                 250   1.00
  TOTAL                            Ndc + 260           340   1.36
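
As a plausibility check of Table 5.9, the following Python sketch evaluates the per-state totals as the sums of the listed task expressions and compares them with the duty cycle of 1000 clock cycles. The values Ndc = 80 and N = 52 are not stated in the table; they are inferred here from 2Ndc + 35 = 195 and 4N + 10 = 218. The name of the fifth state and all identifiers are likewise assumptions of this sketch.

```python
CLOCK_HZ = 250e6      # post-layout clock frequency
DUTY_CYCLE = 1000     # clock cycles per duty cycle (4 us at 250 MHz)
N_DC = 80             # inferred: 2*Ndc + 35 = 195
N = 52                # inferred: 4*N + 10 = 218

# Per-state totals, written as the sum of the task expressions of each state.
ASPE_A_CYCLES = {
    "frame start detection": 7 * N_DC + 90,
    "short preamble processing (init)": N_DC + 70,
    "short preamble processing": 8 * N_DC + 95,
    "LTF processing": 4 * N_DC + 665,
    "MIMO channel processing": N_DC + 4 * N + 285,  # state name assumed
    "data processing": N_DC + 260,
}


def cycles_to_us(cycles, clock_hz=CLOCK_HZ):
    """Convert a cycle count into a processing time in microseconds."""
    return cycles / clock_hz * 1e6


for state, cycles in ASPE_A_CYCLES.items():
    load = cycles / DUTY_CYCLE
    print(f"{state:34s} {cycles:4d} cycles  {cycles_to_us(cycles):5.2f} us  "
          f"{load:6.1%} of one duty cycle")
```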
[Figure 5.19: timing diagram. Duty cycle: 1000 clock cycles at
250 MHz, i.e., 4 µs. Rows, from top to bottom: signals at antennas 1
and 2; ASPE A input, activity, and output; ASPE B input, activity, and
output; receiver states (1) frame detection, (2) STF processing,
(3) LTF processing, (4) MIMO channel processing, (5) data processing;
tasks executed in each duty cycle.]
Figure 5.19: Scheduling of all receiver tasks for the presented 2 × 2 MIMO-OFDM system.
Table 5.10: Assembler cycle counts and processing times for 2 × 2
MIMO detection on ASPE B running at 250 MHz.

State / Task                       Cycle counts        #     Time [µs]
MIMO channel processing
  linear MMSE estimator G          11N                 572   2.29
  TOTAL                            11N                 572   2.29
Data processing
  Demapping                        4N                  208   0.83
  TOTAL                            4N                  208   0.83

as described in Section 3.5. Both ASPE designs were synthesized and
placed in a 0.18 µm 1P/6M CMOS technology, and fabricated on a
multi-project wafer run. The achieved post-layout clock frequency
of 250 MHz is sufficient to follow the schedule of Figure 5.19,
allowing the system to operate in real-time. ASPE A occupies an area
of 3.95 mm² and ASPE B of 3.7 mm². Thus, the total area required by
the two programmable solutions together amounts to 7.65 mm², or to
approximately 792 kGE.⁵ Tables 5.11 and 5.12 summarize the
post-layout key figures of the implemented designs, whereas
Figure 5.20 shows the final floorplan of the two ASPEs with the main
building blocks highlighted. The post-layout power consumption of each
ASPE amounts to approximately 700 mW, and consequently the complete
system with two ASPEs consumes around 1.4 W (see Appendix B).
Comparison with 2×2 MIMO-OFDM receiver in [12] To the
best of my knowledge, the sole comparable solution described in the
literature is [12] (cf. Section 2.3), where the silicon implementation
of the ADRES architecture for a 90 nm TSMC process is described (with
a particular focus on power efficiency). The ADRES-core is responsible
for performing the MIMO-OFDM BB processing tasks, which are
comparable to those described in this thesis. The resulting SDR is
reported to run at a clock frequency of 400 MHz, while occupying a
total silicon
⁵ The area of one gate equivalent (GE) corresponds to the silicon area occupied
by one low-drive 2-input NAND gate. On the 0.18 µm CMOS technology considered,
this amounts to 9.67 µm².
Table 5.11: Post-layout results for ASPE A on a 0.18 µm 1P/6M
CMOS technology.

Entity     Area [mm²]   Complexity [kGE]   Area share [%]
CMAC       0.54         56.418             13.7
CALU 0     0.32         32.713             8.1
CALU 1     0.31         31.881             7.8
SU 0       0.22         22.479             5.6
SU 1       0.22         22.562             5.6
SU 2       0.23         23.933             5.8
SU 3       0.21         21.963             5.3
SU 4       0.21         22.114             5.3
IBUF       0.17         18.095             4.3
OBUF       0.18         18.722             4.5
SEQ        0.78         80.862             20
other      0.56         56.641             14
ASPE A     3.95         408.388            100

Table 5.12: Post-layout results for ASPE B on a 0.18 µm 1P/6M
CMOS technology.

Entity     Area [mm²]   Complexity [kGE]   Area share [%]
DIV        0.40         41.792             11
CMAC0      0.37         38.131             10
CMAC1      0.38         39.620             10
CALU       0.10         10.487             3
SU 0       0.22         22.623             6
SU 1       0.22         22.531             6
SU 2       0.23         23.801             6
SU 3       0.22         22.575             6
IBUF       0.10         10.976             3
OBUF       0.11         11.538             3
SEQ        0.78         80.459             21
other      0.57         58.576             15
ASPE B     3.70         383.109            100

Figure 5.20: Floorplan of the two fabricated chips: (a) ASPE A for
MIMO-OFDM processing; (b) ASPE B for MIMO detection. The main building
blocks are highlighted and the corresponding areas (in kGE) are
reported. The non-labeled area is occupied by the D-Net, by the
control logic of the SUs, and mainly by filler cells.
area of 5.79 mm². The power consumption amounts to approximately
220 mW.
Without memories, the ADRES-core occupies approximately 45%
of the total silicon area, or 2.6 mm², on the 90 nm TSMC process [12].
For a comparison with the solution presented here, this area is scaled
to the 0.18 µm CMOS reference technology, which leads to a scaled
core area of 2.6 mm² · 4 = 10.4 mm². Scaling the frequency leads to
400 MHz · 1/2 = 200 MHz, and the power consumption (of the entire
chip) scales to 220 mW · (0.9/0.5)² = 713 mW.
Although these figures are only rough estimates, they clearly show
that the implementation presented here is very competitive in terms
of area-efficiency.
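
The scaling arithmetic above can be reproduced with the short Python sketch below. The scaling factors are taken directly from the text; reading the area factor 4 as (180/90)² and the frequency factor 1/2 as 90/180 is an interpretation made here, and the function name is hypothetical.

```python
def scale_to_reference_technology(core_area_mm2, clock_mhz, power_mw,
                                  area_factor=4.0,      # read here as (180/90)**2
                                  freq_factor=0.5,      # read here as 90/180
                                  power_ratio=0.9 / 0.5):
    """Scale 90 nm figures to the 0.18 um reference technology using the
    factors quoted in the text. The power scales with power_ratio squared."""
    return (core_area_mm2 * area_factor,
            clock_mhz * freq_factor,
            power_mw * power_ratio ** 2)


adres_core_area_mm2 = 0.45 * 5.79   # core share of the total ADRES area
scaled_area, scaled_clock, scaled_power = scale_to_reference_technology(
    core_area_mm2=2.6, clock_mhz=400.0, power_mw=220.0)
```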
Chapter 6
Summary and Conclusions
Summary
As the domain of mobile wireless communications becomes increasingly
populated with differing communication protocols, the importance
of mobile software defined radio (SDR) terminals grows. The high
datarates prescribed, however, render an implementation on the limited
processing resources of a flexible architecture extremely challenging.
This trend is especially perceptible in the wireless local area network
(WLAN) domain, where the datarates are already high, compared for
instance to the datarates of mobile phone standards. Moreover, the
tight power consumption constraint, necessary to ensure long operation
times from battery, does not relax this challenge either.
Nevertheless, the implementation of the 2 × 2 MIMO-OFDM baseband
receiver algorithms on an SDR platform composed of two
application-specific instruction-set processors (ASIPs), as proposed
in this thesis, indicates a viable solution to overcome these tough
constraints. To this end, a joint analysis of the computational
complexity and the BER performance of MIMO detection algorithms was
necessary to identify suitable, low-complexity algorithms (Chapter 3).
The subsequent mapping of computationally hard OFDM baseband
processing kernels onto three different software-programmable
architectures made it possible to select the ASPE architecture [9]
for the 2 × 2 MIMO-OFDM receiver implementation (Chapter 4). The ASPE
architecture was further improved by three modifications: 1) the
addition of dedicated input and output buffers simplified the
data-stream handling; 2) the processor's control path was restricted
to a single, VLIW-fashioned program sequencer, reducing the control
overhead and thus the critical timing path; 3) the program sequencer
was enhanced to support dictionary-based code decompression, which
further increased the ASPE's area-efficiency. Next, the MIMO-OFDM
receiver was split into two parts and mapped onto two properly
configured ASPEs. The first, ASPE A, was prepared for frame-start
detection, frequency offset estimation and compensation, and OFDM
processing. The second, ASPE B, was responsible for the MIMO
detection. The two ASPEs were placed and routed for a 0.18 µm CMOS
technology, and the final post-layout clock frequency of 250 MHz
permitted the receiver to operate in real-time. The silicon area of
both ASPEs together amounts to 7.65 mm², or approximately 792 kGE.¹
This area was then compared to that of a similar approach on the ADRES
processor [12], showing the competitiveness of the solution presented
in this thesis (Chapter 5).
Conclusions
The implementation of the 2 × 2 MIMO-OFDM baseband receiver
algorithms on the two ASPEs proved to be extremely challenging. On
the one hand, among the vast number of MIMO-OFDM receiver algorithms,
appropriate ones had to be selected. To this end, extensive Monte
Carlo simulations were run to assess the achievable (bit-true) BER
performance, and the involved atomic operations were counted to
derive the algorithms' computational complexity. On the other hand,
the algorithms had to be implemented on the ASPE architecture.
For this purpose, the ASPE's datapath first had to be configured with
appropriate units, followed by the mapping of the assembly-coded,
hand-scheduled algorithms onto it. Bit-true hardware description
language (HDL) simulations were run afterwards to verify the
implementation's correctness. Among these design-flow steps, the most
time-consuming was the iterative scheduling and assembly coding of
the various algorithms onto the ASPEs, combined with possible changes
to the ASPE configuration.

¹ The area of one gate equivalent (GE) corresponds to the silicon area occupied
by one low-drive 2-input NAND gate. On the 0.18 µm CMOS technology considered,
this amounts to 9.67 µm².

Development tool support A first general, but important
conclusion can be drawn from this observation. In the future, for an
efficient and rapid development of applications on the ASPE (and
on ASIPs in general), the support of a software toolkit is imperative.
Clearly, human interaction still remains and is desired for taking
important design decisions, but on the other hand, the time-consuming
and repetitive tasks may well be automated and accelerated. The ideal
software toolkit has to include a (bit-true) simulator and an
appropriate compiler, both enabling faster iterations from the
algorithm level to the architecture. The recent advances in ASIP
development reinforce this conclusion, and indeed, such toolkits are
today starting to gain commercial maturity (e.g., LISA processor
description language [93]).
Application-specific implementation On the architectural level,
it can be stated that the modified ASPE design framework has all
necessary characteristics to act as a platform for SDR baseband
processing. As shown by the 2 × 2 MIMO-OFDM implementation, the
design-time configurability made it possible to instantiate units
matching the particular application domain. This is an important
characteristic that provides just as much flexibility and processing
performance as required by the application domain.
The functional units of both ASPEs perform complex-valued
computations as required by most of the involved algorithms, and a
word width of 16 bits proved to be sufficient for the 2 × 2 MIMO-OFDM
receiver. The datapath of ASPE A was tailored to computationally
intensive kernels such as correlations (required for frame-start
detection), the CORDIC algorithm (for frequency offset
estimation/compensation), and fast Fourier transforms (OFDM
processing). ASPE B, in contrast, was configured for matrix
manipulations such as matrix inversion, matrix-matrix and
matrix-vector multiplications, all of which are computational kernels
required for the (linear MMSE) MIMO detection. Hence, the partitioning
of the receiver into two parts, respecting the granularity and the
type of the involved computational kernels, was a key decision for an
efficient implementation. This is a crucial difference to other
related architectures such as Montium [27] or ADRES [12] (Chapter 2),
which provide a 1D and a 2D array, respectively, of almost identical
functional units.
An interesting observation, highlighted by the partitioning of the
receiver, is that the tasks involved in the MIMO-OFDM baseband
processing can be split into the following, elementary computational
kernels:
• correlation and filtering (atomic operation: MAC),
• fast Fourier transform (radix-r butterfly),
• computation of the angle of a complex number (CORDIC iteration),
• matrix inversion, matrix-matrix and matrix-vector multiplication
  (MAC),
• interleaving, convolutional encoding, and other bitwise atomic
  operations,
• convolutional decoding (e.g., Viterbi decoder, or add-compare-select
  units), and
• generation of appropriate addressing modes to support the above
  listed elementary kernels.
An efficient ASIP SDR platform thus requires units that respect and
support the granularity of these elementary kernels.
DLP and ILP The degree of data level parallelism (DLP) inherent
to many signal processing algorithms should be exploited to increase
the processing performance while reducing the datapath control
overhead. In this thesis, the DLP inherent to the two receive streams
of the 2 × 2 MIMO-OFDM receiver was efficiently exploited by
extending the ASPE's datapath to operate in a 2-way SIMD manner. The
instruction level parallelism (ILP) was exploited by controlling the
ASPE's datapath through very long instruction words (VLIWs), which
made it possible to efficiently map the above-mentioned elementary
kernels onto the instantiated functional units. Finally, the
implemented dictionary-based program-code decompression mechanism
increased the otherwise low program-code density that is inherent to
fixed-format VLIW architectures.
Final words and future work To conclude, the 2 × 2 MIMO-OFDM
receiver presented in this thesis has shown that ASIPs are
suitable and efficient platforms for the baseband processing of SDRs.
Nevertheless, it must be noted that some dedicated units (here the
Viterbi decoder) with little or no configurability at all may still be
necessary to allow for real-time operation.
Although the proposed direction is especially promising, the way
leading to an efficient, fully functional, multi-standard SDR is still
long and difficult. Future work needs to consider and introduce
low-power design techniques in order to reduce the power consumption:
the power consumption achieved by the system presented in this
thesis is still too high for integration into a commercial mobile
device. Further, future work needs to address the integration of the
two ASPEs into a single system-on-chip (SoC), with the addition of a
general-purpose processor responsible for controlling the two ASPEs
and for handling the MAC layer protocol, which is a challenging
engineering task. The extension of the ASPEs to support a second
wireless standard is a next necessary step towards multi-standard
systems. Finally, despite the extensive simulations, the
implementation needs to be validated with real-life data. Here, a
possible solution is to design a PCB for the two fabricated ASPEs and
to embed this PCB into an appropriate testbed, such as the MIMO-OFDM
testbed of ETH Zurich [94, 95].
Appendix A
MIMO Detection Methods

Maximum likelihood (ML) detection [e.g., [7, 52]] maximizes the
probability of a correct decision (i.e., that $\hat{s} = s$). The ML
detector has to solve

$$\hat{s} = \arg\min_{s \in \mathcal{A}^{M_T}} \| y - H s \|^2 . \tag{A.1}$$

A.1 Sphere Decoding

Before deriving the CC for processing one visited tree-node in the
sphere decoding (SD) algorithm, the algorithm itself is briefly reviewed.
The material presented in this review is mainly taken from [54].
Review SD maps the problem of finding the solution of (A.1) onto
an appropriate tree structure. To perform this mapping, the
QR-decomposition $H = QR$ has to be computed during the preprocessing,
leading to the $M_R \times M_T$ orthonormal matrix $Q$ and to the
$M_T \times M_T$ right-triangular matrix $R$.¹ Then, during data
processing, the received vector $y$ is left-multiplied by $Q^H$,
leading to the modified input-output relation [cf. (A.6)]
$\tilde{y} = Rs + \tilde{n}$, where $\tilde{y} = Q^H y$, and
$\tilde{n} = Q^H n$ has the same statistics as $n$.

¹ QR-decomposition is detailed later, in Section 3.4.

The tree structure constructed to use SD is represented by $M_T$ tree
levels. The convention is that the tree root is at level $M_T$,
whereas the leaves are at level 1. At each tree level
$i = M_T, M_T - 1, \ldots, 1$ there are
$N_{nodes}(i) = M^{(M_T + 1 - i)}$ tree nodes. The path to each tree
node at level $i$ represents one possible combination of constellation
points $s^{(i)} = [s_i, s_{i+1}, \ldots, s_{M_T}]$ that could have
been sent, up to the $i$th tree level.
Vector-symbol candidates of the MIMO alphabet $\mathcal{A}^{M_T}$ that
violate the sphere constraint can be excluded from the search by
pruning the corresponding tree branches. As a result, the CC is
greatly reduced compared to brute-force ML. The sphere constraint is
defined as $d(s) = \| \tilde{y} - Rs \|^2 < r^2$, with $s$ being the
tested vector-symbol candidate. The radius $r$ constrains the search
and influences the number of visited tree branches. Finally, the ML
solution of (A.1) is given by the vector-symbol candidate
$\hat{s} = \arg\min_{s \,|\, d(s) < r^2} d(s)$, i.e., the
vector-symbol candidate $s$ that leads to the smallest distance $d(s)$.
The distance $d(s)$ can be computed recursively while descending the
tree levels $i = M_T, M_T - 1, \ldots, 1$. The partial distances are
described by

$$T_i(s^{(i)}) = T_{i+1}(s^{(i+1)}) + \Big\| \tilde{y}_i - \sum_{j=i}^{M_T} r_{ij} s_j \Big\|^2 , \tag{A.2}$$

where $s_j$ is the $j$th element of the tested vector-symbol candidate
$s^{(i)}$, $r_{ij}$ is the element in the $i$th row and $j$th column
of $R$, and $T_{M_T+1} = 0$. Finally, once a tree leaf is reached, the
distance for the corresponding symbol is obtained by $d(s) = T_1$.
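
The recursion (A.2) can be checked numerically with the following Python sketch, which uses 0-based indices and plain lists of complex numbers. The function names and the example values are chosen for the illustration only.

```python
def partial_distances(R, y_tilde, s):
    """Partial distances of (A.2) for candidate s, with 0-based indices.
    Entry 0 of the result corresponds to T_1, i.e., the full distance d(s)."""
    mt = len(s)
    t = [0.0] * (mt + 1)               # t[mt] plays the role of T_{MT+1} = 0
    for i in range(mt - 1, -1, -1):    # descend from the root level to the leaves
        residual = y_tilde[i] - sum(R[i][j] * s[j] for j in range(i, mt))
        t[i] = t[i + 1] + abs(residual) ** 2
    return t[:mt]


def full_distance(R, y_tilde, s):
    """Direct evaluation of d(s) = ||y_tilde - R s||^2."""
    mt = len(s)
    return sum(abs(y_tilde[i] - sum(R[i][j] * s[j] for j in range(mt))) ** 2
               for i in range(mt))


# Example with a right-triangular R (arbitrary values).
R_example = [[2.0 + 0j, 1.0 - 1.0j],
             [0j, 1.5 + 0j]]
y_example = [1.0 + 1.0j, -0.5 + 2.0j]
s_example = [1 - 1j, -1 + 1j]
t_levels = partial_distances(R_example, y_example, s_example)
```

Because every level adds a non-negative term, the partial distances never decrease while descending the tree, which is what makes pruning against the sphere radius safe.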
Implementation The efficient implementation of the SD algorithm
relies upon a strategy that leads to the ML solution by pruning as
many tree branches as possible (without losing the ML solution).
One efficient implementation traverses the tree in a depth-first
search. Herein, the hardest computations that have to be performed at
level $i$ and for each visited tree node are

$$b_{i+1}(s^{(i+1)}) = \tilde{y}_i - \sum_{j=i+1}^{M_T} r_{ij} s_j , \tag{A.3}$$
$$D_i(s_k) = \| b_{i+1}(s^{(i+1)}) - r_{ii} s_k \|^2 , \quad k = 1, 2, \ldots, M , \tag{A.4}$$
$$T_i = T_{i+1} + \min_k D_i(s_k) , \quad k = 1, 2, \ldots, M . \tag{A.5}$$

First, (A.3) is computed. This requires 1 complex-valued ADD and, in
the worst case, $M_T - 1$ complex-valued MACs. Then, all distances to
the $M$ nodes $s_k$ that can be reached from the current node at level
$i + 1$ are computed in (A.4), where $M$ is the size of the QAM
alphabet. This step requires $M$ additions and $2M$ complex-valued
multiplications. Finally, the minimum over the previously computed
distances $D_i(s_k)$ is taken in (A.5), in order to find the child
node of $s^{(i+1)}$ leading to the smallest partial Euclidean distance
$T_i$. The tree search proceeds downwards from the node $s_k$ whose
distance leads to exactly this minimum $T_i$.
The CC for the preprocessing and for each visited tree node during
the data processing are summarized in Table A.1. To obtain the
overall CC for the entire tree search, the per-node results of the
data processing have to be multiplied by the average number of
visited tree nodes $N_{av}$.
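
To make the depth-first search concrete, a compact Python sketch follows. It is an illustration under stated assumptions, not the implementation evaluated in this appendix: the children of a node are visited in ascending order of their distance increment, and the radius is reduced whenever a leaf is reached. Both are common choices that are assumed here. A brute-force ML search is included only for cross-checking, and all names are hypothetical.

```python
from itertools import product


def sphere_decode(R, y_tilde, alphabet, radius_sq=float("inf")):
    """Depth-first tree search with pruning. Returns (s_hat, d(s_hat)),
    or (None, radius_sq) if no candidate lies inside the sphere."""
    mt = len(R)
    best = {"s": None, "d": radius_sq}
    s = [0j] * mt

    def descend(i, t_parent):
        # (A.3): remove the contribution of the already fixed symbols.
        b = y_tilde[i] - sum(R[i][j] * s[j] for j in range(i + 1, mt))
        # (A.4): distance increments of all M children, smallest first.
        increments = sorted((abs(b - R[i][i] * c) ** 2, idx)
                            for idx, c in enumerate(alphabet))
        for d_inc, idx in increments:
            t = t_parent + d_inc
            if t >= best["d"]:
                break              # this child and all remaining ones are pruned
            s[i] = alphabet[idx]
            if i == 0:
                best["s"], best["d"] = list(s), t   # leaf: shrink the radius
            else:
                descend(i - 1, t)

    descend(mt - 1, 0.0)
    return best["s"], best["d"]


def brute_force_ml(R, y_tilde, alphabet):
    """Exhaustive search over all candidates, for cross-checking only."""
    mt = len(R)

    def dist(cand):
        return sum(abs(y_tilde[i] - sum(R[i][j] * cand[j] for j in range(i, mt))) ** 2
                   for i in range(mt))

    best = min(product(alphabet, repeat=mt), key=dist)
    return list(best), dist(best)


QPSK = [1 + 1j, 1 - 1j, -1 + 1j, -1 - 1j]
R_sd = [[2.0 + 0j, 0.5 - 0.3j],
        [0j, 1.2 + 0j]]
y_sd = [0.7 - 1.9j, -1.0 + 1.4j]
sd_result = sphere_decode(R_sd, y_sd, QPSK)
ml_result = brute_force_ml(R_sd, y_sd, QPSK)
```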
A.2 K-Best
The K-best algorithm is a breadth-first search algorithm that operates
on the same tree structure as the SD algorithm. The K-best algorithm
employed here and a corresponding ASIC implementation are presented
in [59]. In the following, the K-best algorithm is only briefly
described. The aim is to highlight the differences to the SD
algorithm, and to compute its CC.
Two preprocessing methods are considered in [59]. With
QR-decomposition the resulting problem is complex-valued, whereas with
real-valued decomposition (RVD) and subsequent QR-decomposition,
the problem becomes purely real-valued. With RVD each tree level
has $\sqrt{M}$ nodes and the tree depth doubles to $2M_T$ (without
RVD, each of the $M_T$ tree levels has $M$ nodes). The BER
Table A.1: CC for SD.

Preprocessing
Step                  CMAC                                          ANGLE       Total
QR-Decomposition (a)  (13/2 + 2M_R) M_R M_T + 3/2 M_R M_T^2         2 M_R M_T   T_C (b)
Total C-Ops.          (13/2 + 2M_R) M_R M_T + 3/2 M_R M_T^2         2 M_R M_T   T_C
Total R-Ops.          2(13 + 4M_R) M_R M_T + 6 M_R M_T^2            2 M_R M_T   T_R (c)

Detection
Step                  CMAC                CADD        Total
(A.3)                 M_T − 1             1           M_T
(A.4)                 2M                  M           3M
(A.5)                 0                   M + 1       M + 1
Total C-Ops.          2M + M_T − 1        2M + 2      4M + M_T + 1
Total R-Ops.          8M + 4M_T − 4       4M + 4      12M + 4M_T

(a) As described in Section A.4.5, (A.25).
(b) T_C = (17/2 + 2M_R) M_R M_T + 3/2 M_R M_T^2
(c) T_R = 4(7 + 2M_R) M_R M_T + 6 M_R M_T^2

performance of the RVD problem is slightly better than when operating
on the QR-decomposed problem.
In contrast to SD, at each tree level i, only the nodes that lead
to the K smallest distances are further considered and expanded.
Once the lowest tree level is attained, the vector-symbol s among the
K evaluated ones leading to the smallest distance d(s) is declared
as the received vector-symbol. The solution is not necessarily the
ML-solution.
For the CC, the case with RVD during the preprocessing is considered,
and thus all atomic operations are real-valued. With
QR-decomposition, however, the results lead to a similar complexity.
The KB algorithm is sketched in pseudo code in Algorithm 2 and the
corresponding estimated operation count is given by:

$$N_{MULT} = \sum_{i=1}^{2M_T} \left[ \sqrt{M} + K(2M_T - i + \sqrt{M}) \right] = \left( K(2\sqrt{M} - 1) + 2\sqrt{M} \right) M_T + 2 K M_T^2 ,$$
$$N_{ADD} = 2 M_T \sqrt{M} (1 + 2K) .$$

Table A.2 summarizes these findings (for $K < \sqrt{M}$).
Algorithm 2 KB.
In: R, ỹ
Out: ŝ
 1: for i = 2M_T, 2M_T − 1, . . . , 1 do
 2:   for n = 1, . . . , √M do
 3:     d_i(s_n) = r_ii s_n − ỹ_i // Mult: 1, Add: 1
 4:   end for
 5:   for k = 1, . . . , K do
 6:     b_{i+1}(s^(i+1)(k)) = Σ_{j=i+1}^{2M_T} r_ij s_j(k) // Mult: 2M_T − i
 7:     for n = 1, . . . , √M do
 8:       D(k, n) = T_{i+1}(k) + ‖b_{i+1}(s^(i+1)(k)) + d_i(s_n)‖² // Mult: 1, Add: 2
 9:     end for
10:   end for
11:   T_i[1 : K] = sort(D)[1 : K] // sort in ascending order and take the K smallest distances
12:   Store the K candidate vector-symbols s^(i)(k) that lead to T_i[1 : K]
13: end for
14: ŝ = min(s^(1)[1 : K])
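
The closed-form operation counts of the KB algorithm can be cross-checked by accumulating the per-line annotations of Algorithm 2 in a loop. The Python sketch below does this for perfect-square alphabet sizes M; the function names are hypothetical.

```python
from math import isqrt


def kb_counts_by_loop(mt, m, k):
    """Accumulate the Mult/Add annotations of Algorithm 2, level by level."""
    root_m = isqrt(m)                            # sqrt(M); M assumed a perfect square
    mult = add = 0
    for i in range(1, 2 * mt + 1):
        mult += root_m                           # line 3: 1 Mult per node
        add += root_m                            # line 3: 1 Add per node
        mult += k * ((2 * mt - i) + root_m)      # line 6 plus line 8, per candidate
        add += k * 2 * root_m                    # line 8: 2 Adds per node and candidate
    return mult, add


def kb_counts_closed_form(mt, m, k):
    """Closed-form expressions for N_MULT and N_ADD."""
    root_m = isqrt(m)
    mult = (k * (2 * root_m - 1) + 2 * root_m) * mt + 2 * k * mt ** 2
    add = 2 * mt * root_m * (1 + 2 * k)
    return mult, add


example_counts = kb_counts_closed_form(mt=2, m=16, k=3)
```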
Table A.2: CC for KB decoding.

Preprocessing
Step                  CMAC                                          ANGLE                Total
QR-Decomposition (a)  (13/2 + 2M_R) M_R M_T + 3/2 M_R M_T^2         2 M_R M_T            T_pp,C (b)
Total C-Ops.          (13/2 + 2M_R) M_R M_T + 3/2 M_R M_T^2         2 M_R M_T            T_pp,C
Total R-Ops.          2(13 + 4M_R) M_R M_T + 6 M_R M_T^2            2 M_R M_T            T_pp,R (c)

Symbol-Vector Detection
Step                  MAC                                           ADD                  Total
KB                    (K(2√M − 1) + 2√M) M_T + 2K M_T^2             2 M_T √M (1 + 2K)    T_dp,C (d)
Total C-Ops.          (K(2√M − 1) + 2√M) M_T + 2K M_T^2             2 M_T √M (1 + 2K)    T_dp,C
Total R-Ops.          (K(2√M − 1) + 2√M) M_T + 2K M_T^2             2 M_T √M (1 + 2K)    T_dp,R (e)

(a) As described in Section A.4.5, (A.25). RVD is neglected.
(b) T_pp,C = (17/2 + 2M_R) M_R M_T + 3/2 M_R M_T^2
(c) T_pp,R = 4(7 + 2M_R) M_R M_T + 6 M_R M_T^2
(d) T_dp,C = (K(6√M − 1) + 4√M) M_T + 2K M_T^2
(e) T_dp,R = (K(6√M − 1) + 4√M) M_T + 2K M_T^2

Algorithm 3 SIC.
In: R, ỹ
Out: ŝ
 1: for i = M_T, . . . , 1 do
 2:   ŷ_i = ỹ_i − Σ_{j=i+1}^{M_T} r_{i,j} ŝ_j // Mult: M_T − i, Add: 1
 3:   ŝ_i = Q(ŷ_i, r_{i,i}) // Mult: ⌊√M⌋, Comp: log2 M
 4: end for

A.3 Successive Interference Cancellation

The successive interference cancellation (SIC) algorithm does not
achieve ML performance. As for SD and KB, performing SIC on the
received vector-symbol first requires the QR-decomposition of the
channel matrix $H$, which transforms the problem into
$\tilde{y} = Rs + \tilde{n}$. Then, SIC solves the detection problem
by back-substitution. After each back-substitution step, the entry of
the obtained solution is sliced and mapped to the nearest
constellation point in the alphabet $\mathcal{A}$. SIC is summarized
in Algorithm 3 and requires the following number of operations:

$$N_{MULT} = \sum_{i=1}^{M_T} \left( M_T - i + \lfloor \sqrt{M} \rfloor \right) = (-1/2 + \sqrt{M}) M_T + M_T^2 / 2 ,$$
$$N_{ADD} = (1 + \log_2 M) M_T .$$

Table A.3 summarizes the findings.
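
A minimal Python sketch of Algorithm 3 is given below. The slicer Q(ŷ_i, r_ii) is modeled here as a division by r_ii followed by a nearest-point search over the alphabet, which is an assumption made for readability and not necessarily how the slicer is realized. Function names and example values are hypothetical.

```python
def slice_symbol(value, alphabet):
    """Map a complex value to the nearest constellation point."""
    return min(alphabet, key=lambda c: abs(value - c))


def sic_detect(R, y_tilde, alphabet):
    """Back-substitution with slicing after every step (0-based indices)."""
    mt = len(R)
    s_hat = [0j] * mt
    for i in range(mt - 1, -1, -1):
        # Cancel the interference of the symbols that are already detected.
        y_hat = y_tilde[i] - sum(R[i][j] * s_hat[j] for j in range(i + 1, mt))
        # Assumed slicer: normalize by the diagonal entry, then quantize.
        s_hat[i] = slice_symbol(y_hat / R[i][i], alphabet)
    return s_hat


QPSK_SIC = [1 + 1j, 1 - 1j, -1 + 1j, -1 - 1j]
R_sic = [[2.0 + 0j, 0.5 - 0.3j],
         [0j, 1.2 + 0j]]
s_sent = [1 - 1j, -1 + 1j]
noise = [0.1 - 0.05j, -0.08 + 0.1j]
y_sic = [sum(R_sic[i][j] * s_sent[j] for j in range(2)) + noise[i] for i in range(2)]
s_detected = sic_detect(R_sic, y_sic, QPSK_SIC)
```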
Table A.3: CC for SIC.

Preprocessing
Step                  CMAC                                          ANGLE                Total
QR-Decomposition (a)  (13/2 + 2M_R) M_R M_T + 3/2 M_R M_T^2         2 M_R M_T            T_pp,C (b)
Total C-Ops.          (13/2 + 2M_R) M_R M_T + 3/2 M_R M_T^2         2 M_R M_T            T_pp,C
Total R-Ops.          2(13 + 4M_R) M_R M_T + 6 M_R M_T^2            2 M_R M_T            T_pp,R (c)

Detection
Step                  CMAC                                          CADD                 Total
SIC                   (−1/2 + √M) M_T + M_T^2 / 2                   (1 + log2 M) M_T     T_dp,C (d)
Total C-Ops.          (−1/2 + √M) M_T + M_T^2 / 2                   (1 + log2 M) M_T     T_dp,C
Total R-Ops.          (−1/2 + √M) 4M_T + 2 M_T^2                    (1 + log2 M) 2M_T    T_dp,R (e)

(a) As described in Section A.4.5, (A.25). RVD is neglected.
(b) T_pp,C = (17/2 + 2M_R) M_R M_T + 3/2 M_R M_T^2
(c) T_pp,R = 4(7 + 2M_R) M_R M_T + 6 M_R M_T^2
(d) T_dp,C = (1/2 + √M + log2 M) M_T + M_T^2 / 2
(e) T_dp,R = 2(2√M + log2 M) M_T + 2 M_T^2

A.4 Linear Detection – Matrix Decomposition and Inversion Methods

The MIMO input-output relation and the steps necessary to reconstruct
the transmitted symbol-vector $s$ with linear MMSE detection are
recapitulated in (A.6)-(A.9) below.

$$y = Hs + n \in \mathbb{C}^{M_R} \tag{A.6}$$
$$F = H^H H + M_T \sigma^2 I \in \mathbb{C}^{M_T \times M_T} \tag{A.7}$$
$$G = F^{-1} H^H \in \mathbb{C}^{M_T \times M_R} \tag{A.8}$$
$$\hat{y} = G y \in \mathbb{C}^{M_T} \tag{A.9}$$

At the receiver, the (perfect) channel estimate
$H \in \mathbb{C}^{M_R \times M_T}$ as well as the received vector
$y \in \mathbb{C}^{M_R}$ are known. The noise $n$ is unknown, whereas
its variance $\sigma^2$ is assumed to be known. Note that the matrix
$F$ has to be inverted to obtain $G$.
For hard detection, the vector $\hat{y}$ is further processed to
obtain the estimated transmitted symbol $\hat{s} = Q(\hat{y})$ through
slicing. The CC of slicing is not considered in this evaluation, since
this step is common to all methods and would not change the overall
rankings.
The decomposition and inversion methods to obtain $F^{-1}$, which are
evaluated in the following, are mainly taken from [66, 51]. Their CC is
derived by counting and weighting the number of operations necessary
to detect one received symbol vector $\hat{y}$. For this evaluation,
which targets a software-programmable platform, the considered
operators are ADD, MUL, and DIV (plus, in certain cases, additional
operators such as SQRT or ANGLE, see Section 3.2).

For the decomposition methods that construct an upper triangular
matrix $R$, the detection of $\hat{y}$ through back-substitution (BS)
is considered as well. BS is applied to linear equation systems
$Rb = a$, with $R \in \mathbb{C}^{M \times M}$,
$a \in \mathbb{C}^{M}$, and $b \in \mathbb{C}^{M}$ being the unknown.
BS consists in solving the $M$th equation of the system, i.e.,
$b_M = a_M / r_{M,M}$, substituting the result into the $(M - 1)$th
equation, and then repeating the procedure until $b$ is obtained.
Otherwise, when no upper triangular matrix is constructed, only
detection by matrix multiplication (MM) is considered. MM consists in
computing (A.9) directly. The notation MM and BS has been taken
from [51].
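
Back-substitution as described above can be written in a few lines of Python. The sketch uses 0-based indices and plain lists; the function name and the example system are chosen for the illustration.

```python
def back_substitute(R, a):
    """Solve R b = a for an upper triangular R with non-zero diagonal."""
    m = len(a)
    b = [0j] * m
    for i in range(m - 1, -1, -1):       # start with the last equation
        acc = a[i] - sum(R[i][j] * b[j] for j in range(i + 1, m))
        b[i] = acc / R[i][i]
    return b


# Example: build a right-hand side from a known solution and recover it.
R_bs = [[2.0, 1.0 - 1.0j, -1.0],
        [0.0, 3.0, 2.0j],
        [0.0, 0.0, 4.0]]
b_true = [1.0 + 0j, -2.0 + 1.0j, 0.5j]
a_bs = [sum(R_bs[i][j] * b_true[j] for j in range(3)) for i in range(3)]
b_solved = back_substitute(R_bs, a_bs)
```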
A.4.1 Adjoint method

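The classical way of writing and presenting the inverse of an
$M \times M$ matrix $F$ employs the adjoint method

$$F^{-1} = \frac{\mathrm{adj}(F)}{\det(F)} . \tag{A.10}$$

When $M = 3$, (A.10) becomes slightly more demanding than the
case where $M = 2$ presented in Section 3.4:

$$F^{-1} = \begin{bmatrix} f_{11} & f_{12} & f_{13} \\ f_{12}^* & f_{22} & f_{23} \\ f_{13}^* & f_{23}^* & f_{33} \end{bmatrix}^{-1} = \frac{1}{D} \begin{bmatrix} fi_{11} & fi_{12} & fi_{13} \\ fi_{12}^* & fi_{22} & fi_{23} \\ fi_{13}^* & fi_{23}^* & fi_{33} \end{bmatrix} , \tag{A.11}$$

where the determinant of $F$ is
$D = f_{11}(f_{33} f_{22} - f_{23}^* f_{23}) - f_{12}^*(f_{33} f_{12} - f_{23}^* f_{13}) + f_{13}^*(f_{23} f_{12} - f_{22} f_{13})$
and the entries of $F^{-1}$ (up to the factor $1/D$) are:

$$fi_{11} = f_{33} f_{22} - f_{23}^* f_{23}$$
$$fi_{12} = -(f_{33} f_{12} - f_{23}^* f_{13})$$
$$fi_{13} = f_{23} f_{12} - f_{22} f_{13}$$
$$fi_{22} = f_{33} f_{11} - f_{13}^* f_{13}$$
$$fi_{23} = -(f_{23} f_{11} - f_{12}^* f_{13})$$
$$fi_{33} = f_{22} f_{11} - f_{12}^* f_{12} .$$
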
Algorithm 4 Implementation of 2 × 2 HPD matrix inversion with
the adjoint method.
In: F = [a b; b* c]
Out: F^(−1) = [x y; y* z]
 r1 ← ac           CMUL
 r2 ← r1 − bb*     CMAC
 r3 ← r2^(−1)      DIV
 x ← r3 c          CMUL
 y ← −r3 b         CMUL
 z ← r3 a          CMUL

In total, roughly 21 complex-valued MAC operations are required, or,
translating into real-valued operations, 84 real-valued MAC operations.
Thus, the CC of (A.11) is roughly 21 when counting complex-valued
atomic operations, and 84 when counting real-valued ones. However, the
high numerical precision and dynamic range required to compute $1/D$
render the use of (A.10) impractical for matrices with $M > 2$.
Finally, Algorithm 4 describes the 2 × 2 Hermitian and positive-
definite (HPD) matrix inversion as it is implemented in this thesis (on
ASPE B).
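The steps of Algorithm 4 can be mirrored in a short Python sketch. It assumes that the input matrix is F = [[a, b], [b*, c]] with real a and c, which is inferred here from the operations r1 = ac and r2 = r1 - bb* and is not stated explicitly in the algorithm; the function name is hypothetical.

```python
def invert_hpd_2x2(a, b, c):
    """Invert F = [[a, b], [conj(b), c]] (assumed layout) step by step.

    Returns (x, y, z) such that F^-1 = [[x, y], [conj(y), z]].
    """
    r1 = a * c                            # CMUL
    r2 = r1 - (b * b.conjugate()).real    # CMAC: determinant, real for HPD input
    r3 = 1.0 / r2                         # DIV: the only division
    x = r3 * c                            # CMUL
    y = -r3 * b                           # CMUL
    z = r3 * a                            # CMUL
    return x, y, z
```

For a = 2, b = 1 + 1j, c = 3 the determinant is 4, so the sketch returns x = 0.75, y = -(0.25 + 0.25j), and z = 0.5; both diagonal entries are positive, as expected for the inverse of an HPD matrix.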
A.4.2 LR-decomposition
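The LR-decomposition (or LU-decomposition) in Algorithm 5 is stable
for strictly diagonally dominant matrices, which is the case for the
matrix $\mathbf{F}$ in the low-SNR regime. The steps leading to
$\hat{\mathbf{y}}$ are:
\[
[\mathbf{L}, \mathbf{R}] = \mathrm{LRdecomp}(\mathbf{F}), \tag{A.12}
\]
\[
\mathbf{A} = \mathrm{BS}(\mathbf{L}, \mathbf{H}^{H}) = \mathbf{L}^{-1}\mathbf{H}^{H} \in \mathbb{C}^{M_T \times M_R}. \tag{A.13}
\]
• For BS-based detection we proceed as follows:
\[
\mathbf{m} = \mathbf{A}\mathbf{y} \in \mathbb{C}^{M_T}, \tag{A.14}
\]
\[
\hat{\mathbf{y}} = \mathrm{BS}(\mathbf{R}, \mathbf{m}) = \mathbf{R}^{-1}\mathbf{m} \in \mathbb{C}^{M_T}. \tag{A.15}
\]
The corresponding CC is reported in Table A.4.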

Algorithm 5 $[\mathbf{L}, \mathbf{R}] = \mathrm{LRdecomp}(\mathbf{F})$

In: $\mathbf{F}$
Out: $\mathbf{L}\mathbf{R} = \mathbf{F}$
1: for $i = 0, 1, \ldots, M_T - 1$ do
2:   for $j = i, i + 1, \ldots, M_T - 1$ do
3:     $r_{i,j} = f_{i,j} - \sum_{k=0}^{i-1} r_{k,j} \cdot l_{i,k}$   // Mult: $i(M_T - i)$, Add: $M_T - i$
4:     $l_{j,i} = \bigl(f_{j,i} - \sum_{k=0}^{i-1} r_{k,i} \cdot l_{j,k}\bigr) / r_{i,i}$   // Mult: $(i + 1)(M_T - i - 1)$, Div: 1, Add: $M_T - i - 1$
5:   end for
6: end for
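• For MM-based detection the matrix $\mathbf{G}$ is constructed by two
successive BSs:
\[
\mathbf{G} = \mathrm{BS}(\mathbf{R}, \mathbf{A}) = \mathbf{R}^{-1}\mathbf{A} \in \mathbb{C}^{M_T \times M_R}, \tag{A.16}
\]
\[
\hat{\mathbf{y}} = \mathbf{G}\mathbf{y} \in \mathbb{C}^{M_T}.
\]
The corresponding CC is reported in Table A.5.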
The number of operations required to complete Algorithm 5 is
estimated as follows:
\[
N_{MULT} = \sum_{i=0}^{M_T-1} \bigl[ i(M_T - i) + (i + 1)(M_T - i - 1) \bigr] = \tfrac{1}{3} M_T (M_T^2 - 1),
\]
\[
N_{DIV} = M_T,
\]
\[
N_{ADD} = \sum_{i=1}^{M_T-1} \bigl[ (M_T - i) + (M_T - i - 1) \bigr] = 1 - 2M_T + M_T^2.
\]
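As an illustration of Algorithm 5, the following Python sketch performs the same row-of-R, column-of-L sweep without pivoting. The function name and the list-of-lists layout are assumptions for this example; a zero pivot would raise a `ZeroDivisionError`, which is acceptable here only because F is assumed to be diagonally dominant as stated above.

```python
def lr_decompose(F):
    """Return (L, R) with L unit lower triangular and R upper triangular, L R = F.

    Hypothetical helper following the loop structure of Algorithm 5 (0-based).
    """
    n = len(F)
    L = [[0j] * n for _ in range(n)]
    R = [[0j] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            # Line 3: entry (i, j) of R uses rows 0..i-1 of R and row i of L.
            R[i][j] = F[i][j] - sum(R[k][j] * L[i][k] for k in range(i))
            # Line 4: entry (j, i) of L; R[i][i] is available from j == i onwards.
            L[j][i] = (F[j][i] - sum(R[k][i] * L[j][k] for k in range(i))) / R[i][i]
    return L, R
```

For F = [[4, 2], [2, 3]] this yields L = [[1, 0], [0.5, 1]] and R = [[4, 2], [0, 2]]; note that only one division per column is strictly necessary if the reciprocal of each pivot is stored, which matches the count N_DIV = M_T.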
A.4.3 LDL-decomposition
Three versions of the LDL-decomposition are reported. Algorithm 6 is
the implementation reported in [66], Algorithm 7 and Algorithm 8 are
modified versions with slightly lower CC.
Table A.4: LR-decomposition’s CC, BS-based detection.

Preprocessing
Step | CMAC | DIV | CADD | Total
(A.7) | $M_T M_R (M_T + 1)/2$ | $0$ | $M_T$ | $(1 + M_R/2) M_T + M_R M_T^2/2$
(A.12) | $\tfrac{1}{3} M_T (M_T^2 - 1)$ | $M_T$ | $1 - 2M_T + M_T^2$ | $1 - \tfrac{4}{3} M_T + M_T^2 + M_T^3/3$
(A.13) | $M_T M_R (M_T - 1)/2$ | $0$ | $M_R (M_T - 1)$ | $-M_R + M_R M_T/2 + M_R M_T^2/2$
Total C-Ops. | $T_1$ (a) | $M_T$ | $T_2$ (b) | $1 - M_R - M_T/3 + M_R M_T + (1 + M_R) M_T^2 + M_T^3/3$
Total R-Ops. | $4T_1$ | $M_T$ | $2T_2$ | $T_R$ (c)

Detection
Step | CMAC | DIV | CADD | Total
(A.14) | $M_T M_R$ | $0$ | $0$ | $M_T M_R$
(A.15) | $M_T (M_T + 1)/2$ | $0$ (d) | $M_T - 1$ | $-1 + \tfrac{3}{2} M_T + M_T^2/2$
Total C-Ops. | $(1/2 + M_R) M_T + M_T^2/2$ | $0$ | $M_T - 1$ | $-1 + (3/2 + M_R) M_T + M_T^2/2$
Total R-Ops. | $(1 + 2M_R) 2M_T + 2M_T^2$ | $0$ | $2M_T - 2$ | $-2 + 4(1 + M_R) M_T + 2M_T^2$

(a) $T_1 = -M_T/3 + M_R M_T^2 + M_T^3/3$
(b) $T_2 = 1 - M_R + (M_R - 1) M_T + M_T^2$
(c) $T_R = 2 - 2M_R + (-7/3 + 2M_R) M_T + (2 + 4M_R) M_T^2 + \tfrac{4}{3} M_T^3$
(d) Division is avoided: $1/r_{i,i}$ is computed during preprocessing.
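Table A.5: LR-decomposition’s CC, MM-based detection.

Preprocessing
Step | CMAC | DIV | CADD | Total
(A.12) | $\tfrac{1}{3} M_T (M_T^2 - 1)$ | $M_T$ | $1 - 2M_T + M_T^2$ | $1 - \tfrac{4}{3} M_T + M_T^2 + M_T^3/3$
(A.16) | $M_T M_R (M_T + 1)/2$ | $0$ | $M_R (M_T - 1)$ | $-M_R + \tfrac{3}{2} M_R M_T + M_R M_T^2/2$
Total C-Ops. | $T_1$ (a) | $M_T$ | $T_2$ (b) | $T_C$ (c)
Total R-Ops. | $4T_1$ | $M_T$ | $2T_2$ | $T_R$ (d)

Detection
Total C-Ops. | $M_T M_R$ | $0$ | $0$ | $M_R M_T$
Total R-Ops. | $4 M_T M_R$ | $0$ | $0$ | $4 M_R M_T$

(a) $T_1 = (-1/3 + M_R/2) M_T + \tfrac{3}{2} M_R M_T^2 + M_T^3/3$
(b) $T_2 = 1 - 2M_R + (-1 + 2M_R) M_T + M_T^2$
(c) $T_C = 1 - 2M_R + (-1/3 + \tfrac{5}{2} M_R) M_T + (1 + \tfrac{3}{2} M_R) M_T^2 + M_T^3/3$
(d) $T_R = 2 - 4M_R + (-7/3 + 6M_R) M_T + (2 + 6M_R) M_T^2 + \tfrac{4}{3} M_T^3$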
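To detect $\hat{\mathbf{y}} = \mathbf{G}\mathbf{y}$ using the
LDL-decomposition, with $\mathbf{L}\mathbf{D}\mathbf{L}^{H} = \mathbf{F}$
and observing that
$\mathbf{F}^{-1} = \mathbf{L}^{-H}\mathbf{D}^{-1}\mathbf{L}^{-1}$, the
computations are:
\[
\mathbf{F} = \mathbf{H}^{H}\mathbf{H} + M_T \sigma^2 \mathbf{I} \in \mathbb{C}^{M_T \times M_T},
\]
\[
[\mathbf{L}, \mathbf{D}] = \mathrm{LDLdecomp}(\mathbf{F}), \tag{A.17}
\]
\[
\mathbf{A} = \mathrm{BS}(\mathbf{L}, \mathbf{H}^{H}) = \mathbf{L}^{-1}\mathbf{H}^{H} \in \mathbb{C}^{M_T \times M_R},
\]
\[
\mathbf{R} = \mathbf{D}\mathbf{L}^{H} \in \mathbb{C}^{M_T \times M_T}. \tag{A.18}
\]
• For BS-based detection, we proceed as follows:
\[
\mathbf{m} = \mathbf{A}\mathbf{y} \in \mathbb{C}^{M_T},
\qquad
\hat{\mathbf{y}} = \mathrm{BS}(\mathbf{R}, \mathbf{m}) = \mathbf{R}^{-1}\mathbf{m} \in \mathbb{C}^{M_T},
\]
and the corresponding CC is reported in Table A.6.
• For MM-based detection, the matrix $\mathbf{G}$ is constructed by a
second BS, before performing MM:
\[
\mathbf{G} = \mathrm{BS}(\mathbf{R}, \mathbf{A}) = \mathbf{R}^{-1}\mathbf{A} \in \mathbb{C}^{M_T \times M_R},
\qquad
\hat{\mathbf{y}} = \mathbf{G}\mathbf{y} \in \mathbb{C}^{M_T}.
\]
The corresponding CC is reported in Table A.7.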
The number of operations for Algorithm 6 is estimated as:
\[
N_{MULT} = \sum_{i=1}^{M_T} \Bigl[ \sum_{j=1}^{i-1} 1 + (i - 1) + \sum_{j=i+1}^{M_T} i \Bigr] = -7M_T/6 + M_T^2 + M_T^3/6,
\]
\[
N_{DIV} = M_T,
\qquad
N_{ADD} = -M_T/2 + M_T^2/2.
\]
The estimated number of operations to complete Algorithm 7 is:
\[
N_{MULT} = \sum_{i=1}^{M_T} \Bigl[ (i - 1) + \sum_{j=1}^{i-1} \bigl( (j - 1) + 1 \bigr) \Bigr] = -2M_T/3 + M_T^2/2 + M_T^3/6,
\]
\[
N_{DIV} = M_T,
\qquad
N_{ADD} = -M_T/2 + M_T^2/2.
\]
Algorithm 6 $[\mathbf{L}, \mathbf{D}] = \mathrm{LDLdecomp}(\mathbf{F})$, Golub-version [66].

In: $\mathbf{F}$
Out: $\mathbf{L}\mathbf{D}\mathbf{L}^{H} = \mathbf{F}$
1: for $i = 1, \ldots, M_T$ do
2:   for $j = 1, \ldots, i - 1$ do
3:     $v_j = l_{i,j}^{H} d_j$   // Mult: 1
4:   end for
5:   $v_i = f_{i,i} - \sum_{k=1}^{i-1} v_k \cdot l_{i,k}$   // Mult: $i - 1$, Add: 1
6:   $d_i = v_i$
7:   $r_i = 1/d_i$   // Div: 1
8:   for $j = i + 1, \ldots, M_T$ do
9:     $l_{j,i} = \bigl(f_{j,i} - \sum_{m=1}^{i-1} v_m \cdot l_{j,m}\bigr) \cdot r_i$   // Mult: $i - 1 + 1$, Add: 1
10:   end for
11: end for
Algorithm 7 $[\mathbf{L}, \mathbf{D}] = \mathrm{LDLdecomp}(\mathbf{F})$. Version with $M_T$ divisions.

In: $\mathbf{F}$
Out: $\mathbf{L}\mathbf{D}\mathbf{L}^{H} = \mathbf{F}$
1: for $i = 1, \ldots, M_T$ do
2:   for $j = 1, \ldots, i - 1$ do
3:     $v_j = f_{i,j} - \sum_{k=1}^{j-1} v_k \cdot l_{j,k}^{H}$   // Mult: $j - 1$, Add: 1
4:     $l_{i,j} = v_j \cdot r_j$   // Mult: 1
5:   end for
6:   $d_i = f_{i,i} - \sum_{k=1}^{i-1} v_k \cdot l_{i,k}^{H}$   // Mult: $i - 1$, Add: 1
7:   $r_i = 1/d_i$   // Div: 1
8: end for
Algorithm 8 $[\mathbf{L}, \mathbf{D}] = \mathrm{LDLdecomp}(\mathbf{F})$. Version with more than $M_T$
divisions.

In: $\mathbf{F}$
Out: $\mathbf{L}\mathbf{D}\mathbf{L}^{H} = \mathbf{F}$
1: for $i = 1, \ldots, M_T$ do
2:   for $j = 1, \ldots, i - 1$ do
3:     $v_j = f_{i,j} - \sum_{k=1}^{j-1} v_k \cdot l_{j,k}^{H}$   // Mult: $j - 1$, Add: 1
4:     $l_{i,j} = v_j / d_j$   // Div: 1
5:   end for
6:   $d_i = f_{i,i} - \sum_{k=1}^{i-1} v_k \cdot l_{i,k}^{H}$   // Mult: $i - 1$, Add: 1
7: end for
The number of operations to complete Algorithm 8 is:
\[
N_{MULT} = \sum_{i=1}^{M_T} \Bigl[ (i - 1) + \sum_{j=1}^{i-1} (j - 1) \Bigr] = -M_T/6 + M_T^3/6,
\]
\[
N_{DIV} = \sum_{i=1}^{M_T} \sum_{j=1}^{i-1} 1 = -M_T/2 + M_T^2/2,
\]
\[
N_{ADD} = -M_T/2 + M_T^2/2.
\]
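To make the variant with M_T divisions more concrete, the following Python sketch follows the loop structure of Algorithm 7. The function name, the 0-based indexing, and the return format (L as a list of rows, D as a list of its diagonal entries) are assumptions chosen for this example.

```python
def ldl_decompose(F):
    """Return (L, d) with L unit lower triangular and F = L diag(d) L^H.

    Hypothetical helper mirroring Algorithm 7: one reciprocal r[i] = 1/d[i]
    per row, all other operations are multiply-accumulates.
    """
    n = len(F)
    L = [[(1.0 + 0j) if r == c else 0j for c in range(n)] for r in range(n)]
    d = [0.0] * n
    r = [0.0] * n  # stored reciprocals of the diagonal entries
    for i in range(n):
        v = [0j] * i
        for j in range(i):
            # v[j] equals l[i][j] * d[j] before the scaling by r[j].
            v[j] = F[i][j] - sum(v[k] * L[j][k].conjugate() for k in range(j))
            L[i][j] = v[j] * r[j]
        d[i] = (F[i][i] - sum(v[k] * L[i][k].conjugate() for k in range(i))).real
        r[i] = 1.0 / d[i]
    return L, d
```

For F = [[2, 1 + 1j], [1 - 1j, 3]] the sketch returns d = [2, 2] and l_{2,1} = (1 - 1j)/2. Replacing the multiplication by r[j] with a division by d[j] gives the structure of Algorithm 8, which trades the stored reciprocals for additional divisions.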
A.4.4 GS-decomposition
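To perform Gram-Schmidt based QR-decomposition, we start by observing
that with
$\bar{\mathbf{H}} = [\mathbf{H}^{H} \;\; \sqrt{M_T}\,\sigma \mathbf{I}_{M_T}]^{H} \in \mathbb{C}^{(M_R + M_T) \times M_T}$
we obtain
\[
\begin{aligned}
\bar{\mathbf{G}} &= (\bar{\mathbf{H}}^{H}\bar{\mathbf{H}})^{-1}\bar{\mathbf{H}}^{H} \\
&= (\mathbf{H}^{H}\mathbf{H} + M_T \sigma^2 \mathbf{I}_{M_T})^{-1} [\mathbf{H}^{H} \;\; \sqrt{M_T}\,\sigma \mathbf{I}_{M_T}] \\
&= [\mathbf{G} \;\; \tilde{\mathbf{H}}],
\end{aligned} \tag{A.19}
\]
where $\tilde{\mathbf{H}} = \sqrt{M_T}\,\sigma \mathbf{F}^{-1}$.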
Table A.6: LDL-decomposition’s CC, BS-based detection.

Preprocessing
Step | CMAC | DIV | CADD | Total
Algorithm 6 | $-7M_T/6 + M_T^2 + M_T^3/6$ | $M_T$ | $-M_T/2 + M_T^2/2$ | $-2M_T/3 + 3M_T^2/2 + M_T^3/6$
Algorithm 7 | $-2M_T/3 + M_T^2/2 + M_T^3/6$ | $M_T$ | $-M_T/2 + M_T^2/2$ | $-M_T/6 + M_T^2 + M_T^3/6$
Algorithm 8 | $-M_T/6 + M_T^3/6$ | $-M_T/2 + M_T^2/2$ | $-M_T/2 + M_T^2/2$ | $-7M_T/6 + M_T^2 + M_T^3/6$
(A.18) | $M_T (M_T + 1)/2$ | $0$ | $0$ | $M_T (M_T + 1)/2$
Total C-Ops. (a) | $T_1$ (b) | $M_T$ | $T_2$ (c) | $T_C$ (d)
Total R-Ops. | $4T_1$ | $M_T$ | $2T_2$ | $T_R$ (e)

Detection
(A.15) | $M_T (M_T + 1)/2$ | $0$ (f) | $M_T - 1$ | $-1 + \tfrac{3}{2} M_T + M_T^2/2$
Total R-Ops. | $4(1/2 + M_R) M_T + 2M_T^2$ | $0$ | $2M_T - 2$ | $-2 + 4(1 + M_R) M_T + 2M_T^2$

(a) For the Total, Algorithm 7 has been taken into account, since it has a lower CC.
(b) $T_1 = -M_T/6 + (1 + M_R) M_T^2 + M_T^3/6$
(c) $T_2 = -M_R + (1/2 + M_R) M_T + M_T^2/2$
(d) $T_C = -M_R + (4/3 + M_R) M_T + (3/2 + M_R) M_T^2 + M_T^3/6$
(e) $T_R = -2M_R + (4/3 + 2M_R) M_T + (5 + 4M_R) M_T^2 + 2M_T^3/3$
(f) Division is avoided if $1/r_{i,i}$ can be stored during preprocessing.
Table A.7: LDL-decomposition’s CC, MM-based detection.

Preprocessing
Step | CMAC | DIV | CADD | Total
(A.7) | $M_T M_R (M_T + 1)/2$ | $0$ | $M_T$ | $(1 + M_R/2) M_T + M_R M_T^2/2$
Algorithm 6 | $-7M_T/6 + M_T^2 + M_T^3/6$ | $M_T$ | $-M_T/2 + M_T^2/2$ | $-2M_T/3 + 3M_T^2/2 + M_T^3/6$
Algorithm 7 | $-2M_T/3 + M_T^2/2 + M_T^3/6$ | $M_T$ | $-M_T/2 + M_T^2/2$ | $-M_T/6 + M_T^2 + M_T^3/6$
Algorithm 8 | $-M_T/6 + M_T^3/6$ | $-M_T/2 + M_T^2/2$ | $-M_T/2 + M_T^2/2$ | $-7M_T/6 + M_T^2 + M_T^3/6$
(A.18) | $M_T (M_T + 1)/2$ | $0$ | $0$ | $M_T (M_T + 1)/2$
(A.16) | $M_T M_R (M_T + 1)/2$ | $0$ (a) | $M_R (M_T - 1)$ | $-M_R + \tfrac{3}{2} M_R M_T + M_R M_T^2/2$
Total C-Ops. (b) | $T_1$ (c) | $M_T$ | $T_2$ (d) | $T_C$ (e)
Total R-Ops. | $4T_1$ | $M_T$ | $2T_2$ | $T_R$ (f)

Detection
Total C-Ops. | $M_R M_T$ | $0$ | $0$ | $M_R M_T$
Total R-Ops. | $4 M_R M_T$ | $0$ | $0$ | $4 M_R M_T$

(a) Division is avoided if $1/r_{i,i}$ can be stored during preprocessing.
(b) For the Total, Algorithm 7 has been taken into account, since it has a lower CC.
(c) $T_1 = (-1/6 + M_R/2) M_T + (1 + \tfrac{3}{2} M_R) M_T^2 + M_T^3/6$
(d) $T_2 = -2M_R + (1/2 + 2M_R) M_T + M_T^2/2$
(e) $T_C = -2M_R + (4/3 + 5M_R/2) M_T + \tfrac{3}{2}(1 + M_R) M_T^2 + M_T^3/6$
(f) $T_R = -4M_R + (4/3 + 6M_R) M_T + (5 + 6M_R) M_T^2 + \tfrac{2}{3} M_T^3$
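Now we can write
\[
\hat{\mathbf{y}} = \bar{\mathbf{G}}\bar{\mathbf{y}} = [\mathbf{G} \;\; \tilde{\mathbf{H}}] \begin{bmatrix} \mathbf{y} \\ \mathbf{0} \end{bmatrix}. \tag{A.20}
\]
By taking the GS-decomposition of $\bar{\mathbf{H}}$
\[
[\bar{\mathbf{Q}}, \mathbf{R}] = \mathrm{GSdecomp}(\bar{\mathbf{H}}) \tag{A.21}
\]
we obtain the matrix $\bar{\mathbf{Q}} \in \mathbb{C}^{(M_R + M_T) \times M_T}$,
which has orthonormal columns, and the upper triangular matrix
$\mathbf{R} \in \mathbb{C}^{M_T \times M_T}$.
Substituting $\bar{\mathbf{H}}$ with $\bar{\mathbf{Q}}\mathbf{R}$ in (A.19) leads to:
\[
\begin{aligned}
\bar{\mathbf{G}} &= \bigl((\bar{\mathbf{Q}}\mathbf{R})^{H}(\bar{\mathbf{Q}}\mathbf{R})\bigr)^{-1}(\bar{\mathbf{Q}}\mathbf{R})^{H} \\
&= \bigl(\mathbf{R}^{H}\bar{\mathbf{Q}}^{H}\bar{\mathbf{Q}}\mathbf{R}\bigr)^{-1}(\mathbf{R}^{H}\bar{\mathbf{Q}}^{H}) \\
&= (\mathbf{R}^{H}\mathbf{R})^{-1}(\mathbf{R}^{H}\bar{\mathbf{Q}}^{H}) \\
&= \mathbf{R}^{-1}\bar{\mathbf{Q}}^{H}.
\end{aligned}
\]
Thus, according to (A.20), we have
$\hat{\mathbf{y}} = \mathbf{R}^{-1}\bar{\mathbf{Q}}^{H}\bar{\mathbf{y}}$.
Since the last $M_T$ entries of $\bar{\mathbf{y}}$ are all zero and with
\[
\bar{\mathbf{Q}} = \begin{bmatrix} \mathbf{Q} \\ \tilde{\mathbf{Q}} \end{bmatrix},
\]
the detection problem simplifies to
$\hat{\mathbf{y}} = \mathbf{R}^{-1}\mathbf{Q}^{H}\mathbf{y}$. Depending on
the choice of the detection method we proceed as follows.
• For BS-based detection:
\[
\mathbf{m} = \mathbf{Q}^{H}\mathbf{y} \in \mathbb{C}^{M_T} \tag{A.22}
\]
\[
\hat{\mathbf{y}} = \mathrm{BS}(\mathbf{R}, \mathbf{m}) = \mathbf{R}^{-1}\mathbf{m} \in \mathbb{C}^{M_T}. \tag{A.23}
\]
Table A.8 summarizes the CC of detecting $\hat{\mathbf{y}}$ by following
the procedure described above.
• Whereas for MM-based detection we compute:
\[
\mathbf{G} = \mathrm{BS}(\mathbf{R}, \mathbf{Q}^{H}) = \mathbf{R}^{-1}\mathbf{Q}^{H} \in \mathbb{C}^{M_T \times M_R} \tag{A.24}
\]
\[
\hat{\mathbf{y}} = \mathbf{G}\mathbf{y} \in \mathbb{C}^{M_T}.
\]
Table A.9 summarizes the corresponding CC.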
Table A.8: GS-decomposition’s CC, BS-based.

Preprocessing
Step | CMAC | DIV/SQRT | CADD | Total
(A.21) | $T_1$ (a) | $2M_T$ | $T_2$ (b) | $T_C$ (c)
Total C-Ops. | $T_1$ | $2M_T$ | $T_2$ | $T_C$
Total R-Ops. | $4T_1$ | $2M_T$ | $2T_2$ | $T_R$ (d)

Detection
(A.22) | $M_R M_T$ | $0$ | $0$ | $M_R M_T$
(A.23) | $M_T (M_T + 1)/2$ | $0$ (e) | $M_T - 1$ | $-1 + \tfrac{3}{2} M_T + M_T^2/2$
Total R-Ops. | $4(1/2 + M_R) M_T + 2M_T^2$ | $0$ | $2M_T - 2$ | $-2 + (2 + 4(1/2 + M_R)) M_T + 2M_T^2$

(a) $T_1 = (7/6 + M_R) M_T + (1/2 + M_R) M_T^2 + M_T^3/3$
(b) $T_2 = (-1 - 3M_R) M_T/6 + M_R M_T^2/2 + M_T^3/6$
(c) $T_C = (6 + M_R) M_T/2 + (1 + 3M_R) M_T^2/2 + M_T^3/2$
(d) $T_R = (19/3 + 3M_R) M_T + (2 + 5M_R) M_T^2 + \tfrac{5}{3} M_T^3$
(e) Division is avoided since $1/r_{i,i}$ is computed during preprocessing and stored.
Table A.9: GS-decomposition’s CC, MM-based.

Preprocessing
Step | CMAC | DIV/SQRT | CADD | Total
(A.21) | $T_1$ (a) | $2M_T$ | $T_2$ (b) | $T_3$ (c)
(A.24) | $M_R M_T (M_T + 1)/2$ | $0$ (d) | $M_T - 1$ | $-1 + (2 + M_R) M_T/2 + M_R M_T^2/2$
Total C-Ops. | $T_4$ (e) | $2M_T$ | $T_5$ (f) | $T_C$ (g)
Total R-Ops. | $4T_4$ | $2M_T$ | $2T_5$ | $T_R$ (h)

(a) $T_1 = (7/6 + M_R) M_T + (1/2 + M_R) M_T^2 + M_T^3/3$
(b) $T_2 = (-1 - 3M_R) M_T/6 + M_R M_T^2/2 + M_T^3/6$
(c) $T_3 = (6 + M_R) M_T/2 + (1 + 3M_R) M_T^2/2 + M_T^3/2$
(d) Division is avoided since $1/r_{i,i}$ is computed during preprocessing and stored.
(e) $T_4 = (7 + 9M_R) M_T/6 + (1 + 3M_R) M_T^2/2 + M_T^3/3$
(f) $T_5 = -1 + (5 - 3M_R) M_T/6 + M_R M_T^2/2 + M_T^3/6$
(g) $T_C = -1 + (4 + M_R) M_T + (1/2 + 2M_R) M_T^2 + M_T^3/2$
(h) $T_R = -2 + (25/3 + 5M_R) M_T + (2 + 7M_R) M_T^2 + \tfrac{5}{3} M_T^3$
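Algorithm 9 $[\bar{\mathbf{Q}}, \mathbf{R}] = \mathrm{GSdecomp}(\bar{\mathbf{H}})$

In: $\bar{\mathbf{Q}} = \bar{\mathbf{H}}$, $\mathbf{R} = \mathbf{0}_{M_T \times M_T}$.
Out: $\bar{\mathbf{Q}}\mathbf{R} = \bar{\mathbf{H}}$.
1: for $i = 1, 2, \ldots, M_T$ do
2:   $r_{i,i} = \sqrt{\bar{\mathbf{q}}_i^{H} \bar{\mathbf{q}}_i}$   // Mult: $M_R + i$, Sqrt: 1
3:   $\bar{\mathbf{q}}_i = \bar{\mathbf{q}}_i / r_{i,i}$   // Mult: $M_R + i$, Div: 1
4:   for $k = i + 1, \ldots, M_T$ do
5:     $r_{i,k} = \bar{\mathbf{q}}_i^{H} \bar{\mathbf{q}}_k$   // Mult: $M_R + i - 1$
6:     $\bar{\mathbf{q}}_k = \bar{\mathbf{q}}_k - r_{i,k} \bar{\mathbf{q}}_i$   // Mult: $M_R + i$, Add: $M_R + i$
7:   end for
8: end for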
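The number of operations for completing Algorithm 9 is:
\[
\begin{aligned}
N_{MULT} &= \sum_{i=1}^{M_T} \bigl[ 2(M_R + i) + (M_T - i)(2M_R + 2i - 1) \bigr] \\
&= (7/6 + M_R) M_T + (1/2 + M_R) M_T^2 + M_T^3/3,
\end{aligned}
\]
\[
N_{SQRT} = M_T,
\qquad
N_{DIV} = M_T,
\]
\[
\begin{aligned}
N_{ADD} &= \sum_{i=1}^{M_T} (M_T - i)(M_R + i) \\
&= -M_T/6 - M_R M_T/2 + M_R M_T^2/2 + M_T^3/6.
\end{aligned}
\]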
We note that the square root operator (SQRT) is required. For
computing the CC we set the weight of the SQRT operator to an
optimistic value of wSQRT = 1. With this choice, we obtain a lower
bound of the GS-decomposition’s CC, as defined in Section 3.2.
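A compact way to see the structure of Algorithm 9 is the following Python sketch of the same (modified) Gram-Schmidt sweep. The function name and the column-wise storage of the stacked matrix are assumptions made for this illustration; the caller is assumed to have already appended the scaled identity block to H.

```python
import math

def gs_decompose(H_bar_columns):
    """Return (Q_columns, R) with orthonormal columns and Q R = H_bar.

    Hypothetical helper: H_bar_columns is a list of M_T columns, each a list
    of M_R + M_T numbers; columns are normalized one by one as in Algorithm 9.
    """
    cols = [list(col) for col in H_bar_columns]
    n = len(cols)
    R = [[0j] * n for _ in range(n)]
    for i in range(n):
        # Lines 2-3: norm (one SQRT) and normalization (one DIV per column).
        norm = math.sqrt(sum(abs(x) ** 2 for x in cols[i]))
        R[i][i] = norm
        cols[i] = [x / norm for x in cols[i]]
        for k in range(i + 1, n):
            # Lines 5-6: projection coefficient and update of the later column.
            r_ik = sum(q.conjugate() * x for q, x in zip(cols[i], cols[k]))
            R[i][k] = r_ik
            cols[k] = [x - r_ik * q for x, q in zip(cols[k], cols[i])]
    return cols, R
```

With the two columns [3, 4] and [1, 0], the sketch yields r_{1,1} = 5, r_{1,2} = 0.6, and r_{2,2} = 0.8. The one SQRT and one DIV per column are the operations that enter the DIV/SQRT column of Tables A.8 and A.9 as 2M_T.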
A.4.5 QR-decomposition
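In order to detect $\hat{\mathbf{y}} = \mathbf{G}\mathbf{y}$ using the
classical Givens-rotations based QR-decomposition, we proceed in an
analogous way as for GS-decomposition. With
$\bar{\mathbf{H}} = [\mathbf{H}^{H} \;\; \sqrt{M_T}\,\sigma \mathbf{I}_{M_T}]^{H}$
and (A.19), we obtain (A.20). By taking the QR-decomposition of
$\bar{\mathbf{H}} = \bar{\mathbf{Q}}\bar{\mathbf{R}}$ we obtain a unitary
matrix $\bar{\mathbf{Q}} \in \mathbb{C}^{(M_R + M_T) \times (M_R + M_T)}$ and
an upper triangular matrix
$\bar{\mathbf{R}} = [\mathbf{R}^{H} \;\; \mathbf{0}]^{H} \in \mathbb{C}^{(M_T + M_R) \times M_T}$.
Since the last $M_R$ rows of $\bar{\mathbf{R}}$ are all zero, and with the
optimization in [51], we can reduce the number of required operations by
modifying the QR-decomposition to compute only
$\bar{\bar{\mathbf{Q}}} = [\mathbf{Q}^{H} \;\; \tilde{\mathbf{Q}}^{H}]^{H} \in \mathbb{C}^{(M_R + M_T) \times M_T}$,
such that $\bar{\bar{\mathbf{Q}}}\mathbf{R} = \bar{\mathbf{H}}$.
Substituting the decomposed $\bar{\mathbf{H}}$ matrix with
$\bar{\bar{\mathbf{Q}}}$ and $\mathbf{R}$ in (A.19) we obtain:
\[
\begin{aligned}
\bar{\mathbf{G}} &= \bigl((\bar{\bar{\mathbf{Q}}}\mathbf{R})^{H}(\bar{\bar{\mathbf{Q}}}\mathbf{R})\bigr)^{-1}(\bar{\bar{\mathbf{Q}}}\mathbf{R})^{H} \\
&= \bigl(\mathbf{R}^{H}\bar{\bar{\mathbf{Q}}}^{H}\bar{\bar{\mathbf{Q}}}\mathbf{R}\bigr)^{-1}(\mathbf{R}^{H}\bar{\bar{\mathbf{Q}}}^{H}) \\
&= (\mathbf{R}^{H}\mathbf{R})^{-1}(\mathbf{R}^{H}\bar{\bar{\mathbf{Q}}}^{H}) \\
&= \mathbf{R}^{-1}\bar{\bar{\mathbf{Q}}}^{H}.
\end{aligned}
\]
This leads to the solution of the detection problem by computing
$\hat{\mathbf{y}} = \mathbf{R}^{-1}\bar{\bar{\mathbf{Q}}}^{H}\bar{\mathbf{y}} = \mathbf{R}^{-1}\mathbf{Q}^{H}\mathbf{y}$.
The optimized QR-decomposition is described in Algorithm 10 and delivers
\[
[\mathbf{Q}, \mathbf{R}] = \mathrm{QRdecomp}(\mathbf{H}, \sigma). \tag{A.25}
\]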
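The so-called Givens rotations required by the QR-decomposition
algorithm are performed by aid of two unitary matrices
$\mathbf{E}_m(\alpha)$ and $\mathbf{U}_{m,n}(\beta)$, defined as follows:
\[
\mathbf{E}_m(\alpha) =
\begin{bmatrix}
1 & & & & \\
& \ddots & & & \\
& & e^{j\alpha} & & \\
& & & \ddots & \\
& & & & 1
\end{bmatrix},
\]
where $e^{j\alpha}$ is the entry in row $m$ and column $m$, and
\[
\mathbf{U}_{m,n}(\beta) =
\begin{bmatrix}
1 & & & & & & \\
& \ddots & & & & & \\
& & \cos\beta & \cdots & \sin\beta & & \\
& & \vdots & \ddots & \vdots & & \\
& & -\sin\beta & \cdots & \cos\beta & & \\
& & & & & \ddots & \\
& & & & & & 1
\end{bmatrix},
\]
where $\cos\beta$ is the entry at positions $(m, m)$ and $(n, n)$,
$\sin\beta$ the entry at $(m, n)$, and $-\sin\beta$ the entry at $(n, m)$.
All remaining diagonal entries are 1 and all remaining off-diagonal
entries are 0.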
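The effect of the two matrices on a pair of rows can be illustrated with a small Python sketch. It is a sketch under stated assumptions, not the thesis implementation: the helper names are hypothetical, both rows are phase-rotated to make the two entries real (Algorithm 10 relies on one of them being real already), the phase is applied with a negative sign so that the entry becomes real and non-negative, and `atan2` replaces the plain arc tangent to avoid a division by zero.

```python
import cmath
import math

def apply_E(A, m, alpha):
    """Left-multiply A by E_m(alpha): scale row m by exp(j * alpha)."""
    w = cmath.exp(1j * alpha)
    A[m] = [w * x for x in A[m]]

def apply_U(A, m, n, beta):
    """Left-multiply A by U_{m,n}(beta): real rotation of rows m and n."""
    c, s = math.cos(beta), math.sin(beta)
    row_m, row_n = A[m], A[n]
    A[m] = [c * x + s * y for x, y in zip(row_m, row_n)]
    A[n] = [-s * x + c * y for x, y in zip(row_m, row_n)]

def zero_entry(A, q, p, col):
    """Annihilate A[p][col] against A[q][col] (assumed sign conventions)."""
    for row in (q, p):
        apply_E(A, row, -cmath.phase(A[row][col]))  # make the entry real
    beta = math.atan2(A[p][col].real, A[q][col].real)
    apply_U(A, q, p, beta)
```

After `zero_entry(A, 0, 1, 0)` the entry A[1][0] vanishes and A[0][0] holds the Euclidean norm of the original column segment, which is the elementary step repeated for every pair (p, q) in the QR-decomposition.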
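According to the detection method we proceed as follows.
• For BS-based detection:
\[
\mathbf{m} = \mathbf{Q}^{H}\mathbf{y} \in \mathbb{C}^{M_T},
\qquad
\hat{\mathbf{y}} = \mathrm{BS}(\mathbf{R}, \mathbf{m}) = \mathbf{R}^{-1}\mathbf{m} \in \mathbb{C}^{M_T}.
\]
Table A.10 summarizes the CC of detecting $\hat{\mathbf{y}}$ by the
procedure described above.²
• Whereas for MM-based detection the steps are:
\[
\mathbf{G} = \mathrm{BS}(\mathbf{R}, \mathbf{Q}^{H}) = \mathbf{R}^{-1}\mathbf{Q}^{H} \in \mathbb{C}^{M_T \times M_R},
\qquad
\hat{\mathbf{y}} = \mathbf{G}\mathbf{y} \in \mathbb{C}^{M_T}.
\]
The number of operations to complete Algorithm 10 is:
\[
\begin{aligned}
N_{MULT} &= \sum_{n=1}^{M_T} \sum_{m=1}^{M_R} \bigl[ 2(M_T - n + 2) + (M_T - n) + 4m + 2 \bigr] \\
&= \tfrac{5}{2} M_R M_T + \tfrac{3}{2} M_R M_T^2 + 2M_R (1 + M_R) M_T + 2M_R M_T \\
&= (13/2 + 2M_R) M_R M_T + \tfrac{3}{2} M_R M_T^2,
\end{aligned}
\]
\[
N_{DIV} = 0,
\qquad
N_{ADD} = 0,
\qquad
N_{ANGLE} = 2 M_R M_T.
\]
² No division is required during symbol processing if the slicing
operation Q(.) takes the matrix $\mathbf{R}$ into account.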
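Algorithm 10 $[\mathbf{Q}, \mathbf{R}] = \mathrm{QRdecomp}(\mathbf{H}, \sigma)$

In: $\mathbf{H}$, $\sigma$
Out: $\mathbf{Q}\mathbf{R} = \mathbf{H}$
1: $\bar{\mathbf{R}} = [\mathbf{H}^{H} \;\; \sqrt{M_T}\,\sigma \mathbf{I}_{M_T}]^{H}$
2: $\bar{\mathbf{Q}} = [\mathbf{I}_{M_R} \;\; \mathbf{0}_{M_R \times M_T}]^{H}$
3: for $n = 1, 2, \ldots, M_T$ do
4:   for $m = 1, 2, \ldots, M_R$ do
5:     $p = M_R + n - m$
6:     $q = p - 1$
7:     $\varphi = \operatorname{atan}\bigl(\Im(r_{q,n}) / \Re(r_{q,n})\bigr)$   // Angle: 1
8:     $\bar{\mathbf{R}} = \mathbf{E}_q(\varphi)\bar{\mathbf{R}}$   // Mult: 1
9:     $\bar{\mathbf{Q}}^{H} = \mathbf{E}_q(\varphi)\bar{\mathbf{Q}}^{H}$   // Mult: 1
10:     $\vartheta = \operatorname{atan}\bigl(\Re(r_{p,n}) / \Re(r_{q,n})\bigr)$   // Angle: 1
11:     $\bar{\mathbf{R}} = \mathbf{U}_{p,q}(\vartheta)\bar{\mathbf{R}}$   // Mult: $2(M_T - n + 2) + (M_T - n)$
12:     $\bar{\mathbf{Q}}^{H} = \mathbf{U}_{p,q}(\vartheta)\bar{\mathbf{Q}}^{H}$   // Mult: $2 \cdot 2m$
13:   end for
14: end for
15: $\mathbf{Q} = \bar{\mathbf{Q}}_{M_R \times M_T}$
16: $\mathbf{R} = \bar{\mathbf{R}}_{M_T \times M_T}$
Note: With infinite precision both $r_{p,n}$ and $r_{q,n}$ are real-valued in
$\vartheta = \operatorname{atan}\bigl(\Re(r_{p,n}) / \Re(r_{q,n})\bigr)$.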
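Algorithm 11 $\mathbf{F}^{-1} = \mathrm{Rank1Update}(\mathbf{H}, 1/(M_T \sigma^2))$

In: $\mathbf{H}$, $1/(M_T \sigma^2)$
Out: $\mathbf{F}^{-1}$
1: $\mathbf{P}^{(0)} = 1/(M_T \sigma^2)\,\mathbf{I}$
2: for $i = 1, 2, \ldots, M_R$ do
3:   $\mathbf{m} = \mathbf{P}^{(i-1)} \mathbf{h}_i^{H}$   // Mult: $M_T^2$
4:   $s = 1 + \mathbf{h}_i \mathbf{m}$   // Mult: $M_T$, Add: 1
5:   $s_e = \lfloor \log_2(s) \rfloor$, $s_m = 2^{s_e}/s$   // Div: 1
6:   $\tilde{\mathbf{m}} = s_m \mathbf{m}$   // Mult: $M_T$
7:   $\mathbf{P}^{(i)} = \mathbf{P}^{(i-1)} - \tilde{\mathbf{m}}\mathbf{m}^{H} 2^{-s_e}$   // Mult: $M_T^2$, Add: $M_T^2$
8: end for
9: $\mathbf{F}^{-1} = \mathbf{P}^{(M_R)}$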
The number of operations that require the arc tangent function
[atan()] is $N_{ANGLE}$. For the computation of the QR-decomposition’s
CC we assume that an execution unit delivering the result in one clock
cycle is available and thus $w_{ANGLE} = 1$. However, we recall that if
another decomposition method has a similar CC but requires fewer atomic
operators, we prefer that other method.
A.4.6 Rank-1 update method
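The Rank-1 update method inverts the matrix $\mathbf{F}$ directly:
\[
\mathbf{F}^{-1} = \mathrm{Rank1Update}(\mathbf{H}, 1/(M_T \sigma^2)) \in \mathbb{C}^{M_T \times M_T} \tag{A.26}
\]
\[
\mathbf{G} = \mathbf{F}^{-1}\mathbf{H}^{H} \in \mathbb{C}^{M_T \times M_R},
\qquad
\hat{\mathbf{y}} = \mathbf{G}\mathbf{y} \in \mathbb{C}^{M_T}.
\]
The associated CC is summarized in Table A.12, and the Rank-1
update algorithm is described by Algorithm 11 [63].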
Table A.10: QR-decomposition’s CC, BS-based detection.

Preprocessing
Step | CMAC | ANGLE | CADD | Total
(A.25) | $(13/2 + 2M_R) M_R M_T + \tfrac{3}{2} M_R M_T^2$ | $2 M_R M_T$ | $0$ | $(17/2 + 2M_R) M_R M_T + \tfrac{3}{2} M_R M_T^2$
Total C-Ops. | $(13/2 + 2M_R) M_R M_T + \tfrac{3}{2} M_R M_T^2$ | $2 M_R M_T$ | $0$ | $(17/2 + 2M_R) M_R M_T + \tfrac{3}{2} M_R M_T^2$
Total R-Ops. | $2(13 + 4M_R) M_R M_T + 6 M_R M_T^2$ | $2 M_R M_T$ | $0$ | $4(7 + 2M_R) M_R M_T + 6 M_R M_T^2$

Detection
(A.23) | $M_T (M_T + 1)/2$ | $M_T$ (a) | $M_T - 1$ | $-1 + \tfrac{3}{2} M_T + M_T^2/2$
Total C-Ops. | $(1/2 + M_R) M_T + M_T^2/2$ | $M_T$ | $M_T - 1$ | $-1 + (5/2 + M_R) M_T + M_T^2/2$
Total R-Ops. | $(2 + 4M_R) M_T + 2M_T^2$ | $M_T$ | $2M_T - 2$ | $-2 + (5 + 4M_R) M_T + 2M_T^2$

(a) Division can be avoided if the BS(.) performs slicing and takes the factor $r_{i,i}$ into account for adjusting the decision boundaries.
Table A.11: QR-decomposition CC, MM-based detection.

Preprocessing
Step | CMAC | ANGLE | CADD | Total
(A.25) | $(13/2 + 2M_R) M_R M_T + \tfrac{3}{2} M_R M_T^2$ | $2 M_R M_T$ | $0$ | $(17/2 + 2M_R) M_R M_T + \tfrac{3}{2} M_R M_T^2$
(A.24) | $M_R M_T (M_T + 1)/2$ | $M_T$ | $M_T - 1$ | $-1 + (2 + M_R/2) M_T + M_R M_T^2/2$
Total C-Ops. | $(7 + 2M_R) M_R M_T + 2 M_R M_T^2$ | $(1 + 2M_R) M_T$ | $M_T - 1$ | $-1 + (2 + 9M_R + 2M_R^2) M_T + 2 M_R M_T^2$
Total R-Ops. | $4(7 + 2M_R) M_R M_T + 8 M_R M_T^2$ | $(1 + 2M_R) M_T$ | $2M_T - 2$ | $-2 + (3 + 30M_R + 8M_R^2) M_T + 8 M_R M_T^2$
Table A.12: Rank-1 update CC for MM-based detection.

Preprocessing
Step | CMAC | DIV | CADD | Total
(A.26) | $2M_R (M_T + M_T^2)$ | $M_R$ | $M_R (1 + M_T^2)$ | $2M_R + 2M_R M_T + 3M_R M_T^2$
(A.8) | $M_R M_T^2$ | $0$ | $0$ | $M_R M_T^2$
Total C-Ops. | $2M_R M_T + 3M_R M_T^2$ | $M_R$ | $M_R (1 + M_T^2)$ | $2M_R + 2M_R M_T + 4M_R M_T^2$
Total R-Ops. | $8M_R M_T + 12M_R M_T^2$ | $M_R$ | $2M_R (1 + M_T^2)$ | $3M_R + 8M_R M_T + 14M_R M_T^2$
The number of operations for completing Algorithm 11 is:
\[
N_{MULT} = \sum_{i=1}^{M_R} 2(M_T + M_T^2) = 2M_R (M_T + M_T^2),
\]
\[
N_{DIV} = M_R,
\]
\[
N_{ADD} = \sum_{i=1}^{M_R} (1 + M_T^2) = M_R (1 + M_T^2).
\]
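The recursion of Algorithm 11 is an instance of the Sherman-Morrison identity applied once per row of H. The Python sketch below shows the floating-point core of that recursion; it deliberately omits the power-of-two normalization (s_e, s_m) of lines 5 to 7, which only matters for fixed-point arithmetic. The function name and the argument `reg` (standing for M_T σ²) are assumptions made for this example.

```python
def rank1_update_inverse(H, reg):
    """Return P = (H^H H + reg * I)^-1 by one rank-1 update per row of H.

    Hypothetical helper: H is a list of M_R rows with M_T entries each,
    reg is assumed to equal M_T * sigma**2 > 0.
    """
    n = len(H[0])
    P = [[(1.0 / reg if r == c else 0.0) + 0j for c in range(n)] for r in range(n)]
    for h in H:
        u = [x.conjugate() for x in h]                       # column vector h_i^H
        m = [sum(P[r][c] * u[c] for c in range(n)) for r in range(n)]
        s = 1.0 + sum(h[c] * m[c] for c in range(n)).real   # real since P is Hermitian
        for r in range(n):
            for c in range(n):
                P[r][c] -= m[r] * m[c].conjugate() / s       # P - m m^H / s
    return P
```

For H = [[1, 0], [0, 2]] and reg = 1 the matrix F equals diag(2, 5), and the sketch returns diag(0.5, 0.2). One division per row of H is needed, in line with N_DIV = M_R.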
A.4.7 Divide-and-conquer method

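The Divide-and-Conquer (D&C) method recursively inverts $\mathbf{F}$. The
steps leading to MM-based MMSE detection are:
\[
\mathbf{F} = \mathbf{H}^{H}\mathbf{H} + M_T \sigma^2 \mathbf{I} \in \mathbb{C}^{M_T \times M_T}
\]
\[
\mathbf{F}^{-1} = \mathrm{D\&C}(\mathbf{F}) \in \mathbb{C}^{M_T \times M_T} \tag{A.27}
\]
\[
\mathbf{G} = \mathbf{F}^{-1}\mathbf{H}^{H} \in \mathbb{C}^{M_T \times M_R},
\qquad
\hat{\mathbf{y}} = \mathbf{G}\mathbf{y} \in \mathbb{C}^{M_T}
\]
The complete, recursive D&C method is described in Algorithm 12 [13].
The corresponding CC is reported in Table A.13.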
Table A.13: D&C CC for MM-based detection.

Preprocessing
Step | CMAC | DIV | CADD | Total
(A.27) | $T_1$ (a) | $M_T/2$ | $T_2$ (b) | $T_3$ (c)
(A.8) | $M_R M_T^2$ | $0$ | $0$ | $M_R M_T^2$
Total C-Ops. | $T_4$ (d) | $M_T/2$ | $-3 + \tfrac{7}{3} M_T - M_T^2/4 + M_T^3/6$ | $T_C$ (e)
Total R-Ops. | $4T_4$ | $M_T/2$ | $-6 + \tfrac{14}{3} M_T - M_T^2/2 + M_T^3/3$ | $T_R$ (f)

(a) $T_1 = -6 + 4M_T - M_T^2/4 + M_T^3/2$
(b) $T_2 = -3 + \tfrac{4}{3} M_T - M_T^2/4 + M_T^3/6$
(c) $T_3 = -9 + \tfrac{35}{6} M_T - M_T^2/2 + \tfrac{2}{3} M_T^3$
(d) $T_4 = -6 + (4 + M_R/2) M_T + (-1 + 6M_R) M_T^2/4 + M_T^3/2$
(e) $T_C = -9 + (41 + 3M_R) M_T/6 + (-1 + 3M_R) M_T^2/2 + \tfrac{2}{3} M_T^3$
(f) $T_R = -30 + (127/6 + 2M_R) M_T + (-3/2 + 6M_R) M_T^2 + \tfrac{7}{3} M_T^3$
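Algorithm 12 $\mathbf{F}^{-1} = \mathrm{D\&C}(\mathbf{F})$

In: $\mathbf{F} \in \mathbb{C}^{M \times M}$
Out: $\mathbf{F}^{-1}$
1: if $M = 1$ then
2:   $\mathbf{F}^{-1} = 1/f_{1,1}$
3: else if $M = 2$ then
4:   $\mathbf{F}^{-1} = \begin{bmatrix} f_{1,1} & f_{1,2} \\ f_{1,2}^{*} & f_{2,2} \end{bmatrix}^{-1} = \dfrac{1}{f_{1,1} f_{2,2} - f_{1,2} f_{1,2}^{*}} \begin{bmatrix} f_{2,2} & -f_{1,2} \\ -f_{1,2}^{*} & f_{1,1} \end{bmatrix}$
5: else
6:   pick a suitable $p$ satisfying $1 \le p < M$
7:   partition $\mathbf{F}$ as in (3.10)
8:   $\mathbf{A}^{-1} = \mathrm{D\&C}(\mathbf{A})$
9:   $\mathbf{S} = \mathbf{C} - \mathbf{B}^{H}\mathbf{A}^{-1}\mathbf{B}$
10:   $\mathbf{S}^{-1} = \mathrm{D\&C}(\mathbf{S})$
11:   $\mathbf{F}^{-1}$ as in (3.11)
12: end if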
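The recursion of Algorithm 12 can be sketched in Python as follows. Equations (3.10) and (3.11) are not reproduced in this appendix, so the sketch assumes the usual partition F = [[A, B], [B^H, C]] with a p × p block A and the standard block-inverse formula based on the Schur complement S; the choice p = M // 2 and all helper names are likewise assumptions.

```python
def matmul(X, Y):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def herm(X):
    """Conjugate transpose of a list-of-rows matrix."""
    return [[v.conjugate() for v in col] for col in zip(*X)]

def dc_inverse(F):
    """Invert a Hermitian positive-definite matrix by divide and conquer."""
    M = len(F)
    if M == 1:
        return [[1.0 / F[0][0]]]
    if M == 2:
        det = F[0][0] * F[1][1] - F[0][1] * F[1][0]
        return [[F[1][1] / det, -F[0][1] / det],
                [-F[1][0] / det, F[0][0] / det]]
    p = M // 2                                    # assumed split point
    A = [row[:p] for row in F[:p]]
    B = [row[p:] for row in F[:p]]
    C = [row[p:] for row in F[p:]]
    Ai = dc_inverse(A)
    AiB = matmul(Ai, B)                           # A^-1 B
    BhAiB = matmul(herm(B), AiB)
    S = [[c - t for c, t in zip(rc, rt)] for rc, rt in zip(C, BhAiB)]
    Si = dc_inverse(S)
    AiBSi = matmul(AiB, Si)
    corr = matmul(AiBSi, herm(AiB))               # A^-1 B S^-1 B^H A^-1
    top_left = [[a + t for a, t in zip(ra, rt)] for ra, rt in zip(Ai, corr)]
    top_right = [[-v for v in row] for row in AiBSi]
    bottom_left = herm(top_right)                 # valid because F is Hermitian
    return ([tl + tr for tl, tr in zip(top_left, top_right)]
            + [bl + br for bl, br in zip(bottom_left, Si)])
```

Only the two base cases contain divisions, so the number of divisions grows with the number of leaves of the recursion rather than with the number of matrix entries.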
Appendix B
Datasheet
Pinout
Figure B.1 illustrates the pinout of ASPE A and ASPE B in their
PGA120 package. The two ASPEs are pin-compatible. Table B.1
describes the functionality of the I/O signal pads.
Figure B.1: ASPE A and ASPE B pinout. [Pinout diagram of the PGA120
package with pads numbered 1 to 120; legend: Core Power, Pad Power, GND.]
Table B.1: Description of signal pads.

Signal Name I/O Width Description
RstxRBI in 1 Asynchronous reset (active-low)
ClkxCI in 1 Core clock
SCKxCI in 1 SPI clock
SSxSBI in 1 SPI slave select (active-low)
SlaveInxDI in 1 SPI slave data in
SlaveOutxDO out 1 SPI slave data out
FifoInWritexSI in 2 Input FIFO, write assert
FifoInDataxDI in 16 Input FIFO, data in
FifoInDataReqxSO out 1 Input FIFO, data request
FifoOutDataReqxSI in 1 Output FIFO, data request
FifoOutWritexSO out 2 Output FIFO, write data assert
FifoOutDataxDO out 16 Output FIFO, data out
StBistOutxTO out 14 SU, BIST output
SeqBistOutxTO out 3 SEQ, BIST output
BistModexTI in 1 Bist mode;
0: Show BIST ok,
1: Show BIST done
BistEnablexTI in 1 Bist enable;
0: disabled,
1: enabled
ScanModexTI in 1 Scan mode;
0: no RAM bypassing,
1: RAM bypassing
ScanEnablexTI in 1 Scan enable;
0: normal mode, 1: scan mode
Table B.2: Input signal timing constraints.

Group Pad Setup Hold
SPI timing SSxSBI 5 ns 0.5 ns
(SCKxC, min. 7 ns) SlaveInxDI 2.25 ns 1.8 ns
I/O timing FifoInDataxDI 2.4 ns 0.6 ns
(ClkxC, min. 4 ns) FifoInWritexSI 0.7 ns 0.8 ns
FifoOutDataReqxSI 0.8 ns 0.6 ns
Test timing ScanModexTI 8.2 ns 0.8 ns
ScanEnablexTI 8 ns 4.7 ns
BistEnablexTI 5.1 ns
BistModexTI 7.2 ns 4.1 ns
Table B.3: Output signal timing constraints.

Group Pad Prop. Delay
SPI timing SlaveOutxDO 4 ns
I/O timing FifoOutDataxDO 4 ns
FifoInDataReqxSO 3.6 ns
Test timing StBistOutxTO 5.7 ns
SeqBistOutxTO 5.1 ns
Post-layout Timing and Power Consumption
The timing constraints and power consumption of both ASPEs are
extracted using the design tools and the post-layout netlist
generated with CDS First Encounter v6.2. The power consumption
is estimated from the simulation-based node toggling activity
obtained with the post-layout netlist.
Timing The maximum core clock frequency (ClkxCI) resulting from
the post-layout simulation is 250 MHz and the maximum SPI clock
frequency (SCKxCI) is 142 MHz. Table B.2 reports the corresponding
input timing constraints, and Table B.3 the output timing
constraints. These constraints are achieved on both ASPEs.
Table B.4: Post-layout power consumption.

Clock period [ ns] Power consumption [ mW]
ASPE A ASPE B
4 700 700
5
7.5
Power For ASPE A, the node toggling activity was extracted while
simulating the computation of a 64-point FFT. For ASPE B, the
node toggling activity was extracted while simulating the computation
of the inverse of a 2 × 2 matrix. The resulting power consumption is
reported in Table B.4. At 250 MHz the resulting power consumption is
around 700 mW for each ASPE.
Operating Modes
Configuration through SPI
The serial peripheral interface (SPI) provides access to the ASPE’s
memories. This access is essential for loading software into the
sequencer’s program memory. Also, while debugging, access to the SUs
and FUs through SPI facilitates error localization. ScanEnablexTI,
ScanModexTI, and BistEnablexTI shall be tied to ground during
configuration.
Figure B.2 shows the generic timing diagram of one SPI transaction.
One transaction begins with SSxSBI being de-asserted, while SCKxCI
is low. Next, an SPI-command-byte is transmitted, bit by bit and
from MSB to LSB, over the SlaveInxDI pin. Concurrently, one SPI-
response-byte is output from MSB to LSB, over the SlaveOutxDO pin.
Three SPI-commands are available: READ from and WRITE to an ASPE
memory, and NOP.
READ The structure of an SPI-READ sequence is shown in Figure B.3.
Reading one 32 bit data-word from one ASPE memory-location involves
four phases:
1. Issue an SPI-READ command. Requires one SPI transaction.
2. Send address, from least significant byte (A0) to most significant
one (A2). Requires three SPI transactions.
3. Send SPI-NOP commands until the RREADY response is output on
SlaveOutxDO. Requires a variable number of transactions.
4. Receive data, from the least significant byte (D0) to the most
significant one (D3). Requires four SPI transactions.
Figure B.2: SPI transaction. [Timing diagram of SCKxCI (clock cycles
1 to 8), SSxSBI, SlaveInxDI, and SlaveOutxDO.]
WRITE Figure B.4 illustrates one SPI-WRITE sequence, used for
writing one 32 bit data-word D (bytes D3-D0) to the ASPE memory at
address A (bytes A2-A0). The operation is divided into four phases:
1. Issue an SPI-WRITE command. Requires one SPI transaction.
2. Send write address, from least significant byte (A0) to most
significant one (A2). Requires three SPI transactions.
3. Send data-word to be written, from the least significant byte (D0)
to the most significant one (D3). Requires four SPI transactions.
4. Send SPI-NOP commands until the WDONE response is output on
SlaveOutxDO. Requires a variable number of transactions.
Figure B.3: SPI-READ. [Timing diagram with SCKxCI and SSxSBI; byte
sequences per transaction:]
   SlaveInxDI:  READ A0 A1 A2 NOP NOP NOP NOP NOP NOP NOP
   SlaveOutxDO: NOP READ RACK0 RACK1 RACK2 READING RREADY D0 D1 D2 D3
Figure B.4: SPI-WRITE. [Timing diagram with SCKxCI and SSxSBI; byte
sequences per transaction:]
   SlaveInxDI:  WRITE A0 A1 A2 D0 D1 D2 D3 NOP NOP NOP
   SlaveOutxDO: NOP WRITE RACK0 RACK1 RACK2 DACK0 DACK1 DACK2 DACK3 WRITING WDONE
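The byte order of the READ and WRITE sequences above can be illustrated with a small Python sketch that only assembles the command bytes sent on SlaveInxDI. The opcode values are taken from Table B.5; the function names are hypothetical, and the variable-length NOP polling phase (waiting for RREADY or WDONE) is not modeled because it depends on the responses of the device.

```python
SPI_NOP, SPI_READ, SPI_WRITE = 0x00, 0x01, 0x02   # opcodes as listed in Table B.5

def little_endian_bytes(value, count):
    """Split value into count bytes, least significant byte first."""
    return [(value >> (8 * i)) & 0xFF for i in range(count)]

def write_sequence(address, data):
    """Command byte, address bytes A0..A2, then data bytes D0..D3."""
    return [SPI_WRITE] + little_endian_bytes(address, 3) + little_endian_bytes(data, 4)

def read_sequence(address):
    """Command byte followed by address bytes A0..A2; NOPs follow afterwards."""
    return [SPI_READ] + little_endian_bytes(address, 3)
```

For example, `write_sequence(0x780060, 0x00000000)` yields the eight bytes 0x02, 0x60, 0x00, 0x78, 0x00, 0x00, 0x00, 0x00, each of which is sent in its own SPI transaction from MSB to LSB.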
SPI commands Table B.5 summarizes the SPI commands and SPI
responses.
Table B.5: SPI commands and SPI responses.

SPI commands (SlaveInxDI)
Mnemonic Cmd Meaning
NOP 0x00 No-operation
READ 0x01 Read data-word
WRITE 0x02 Write data-word
SPI responses (SlaveOutxDO)
Mnemonic Cmd Meaning
NOP 0x00 No-operation
READ 0x01 Read acknowledge
WRITE 0x02 Write acknowledge
ACK0 0x03 R/W A0 address acknowledge
DACK0 0x06 W D0 data acknowledge
RREADY 0x0A Read done
READING 0x0B Read access in progress
WDONE 0x0C Write done
WRITING 0x0D Write access in progress
Memory Map Figure B.5 depicts the physical design of the ASPE’s
index and dictionary memory. The index memory is a 1024 × 20 bit
SRAM and the dictionary memory is composed of three 256 × 64 bit
SRAMs (DICT0-DICT2). The 16 bit control words (CWs) used to
control the SEQ, the FUs, and SUs are assigned to the dictionary
memory slots as in Table B.6 (cf. Figure B.5). Finally, the memory
map in Table B.7 gives access to the index and dictionary memories
over SPI.
Access to the SUs and FUs is also possible over SPI. For the FU
and SU access, the 24 bit SPI-address A (bytes A0-A2) has the bit
structure 11uu uuaa aaaa aaaa aaaa aa00. The 4 bits uuuu encode
the unit’s internal slot number and the 16 bits aaaa aaaa aaaa aaaa
the unit’s CW. The internal slot number (uuuu) assignment is reported
in Table B.8. The memory map required to access the ASPE’s storage
units is reported in Table B.9. The IBUF and OBUF are accessed in
an analogous way.
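The bit structure 11uu uuaa aaaa aaaa aaaa aa00 can be turned into a small address helper, sketched below in Python. The function names are hypothetical, and the control-word value 0x6800 used in the example is only back-computed from the first SU0 address in Table B.9 for illustration.

```python
def unit_address(slot, cw):
    """Build the 24 bit SPI address 11uu uuaa aaaa aaaa aaaa aa00.

    slot: internal slot number uuuu (4 bits, see Table B.8)
    cw:   16 bit field aaaa aaaa aaaa aaaa
    """
    if not 0 <= slot <= 0xF:
        raise ValueError("slot must fit into 4 bits")
    if not 0 <= cw <= 0xFFFF:
        raise ValueError("cw must fit into 16 bits")
    return (0b11 << 22) | (slot << 18) | (cw << 2)

def address_bytes(address):
    """Split a 24 bit address into the bytes A0, A1, A2 (A0 least significant)."""
    return [(address >> (8 * i)) & 0xFF for i in range(3)]
```

With slot 0x8 (SU0) and the field value 0x6800, `unit_address` returns 0xE1A000, the first SU0 address listed in Table B.9; `address_bytes(0xE1A000)` then gives A0 = 0x00, A1 = 0xA0, and A2 = 0xE1.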
Normal operation
The ASPE starts its autonomous operation once the stall register
residing inside the SEQ is cleared. To this end, the SPI command ’WRITE
0x780060 0x00000000’ is issued (write the data word 0x00000000 to
address 0x780060, which corresponds to the SEQ stall register). The
SEQ then starts fetching the first dictionary pointer located at address
0x000 of the index memory.
ScanEnablexTI, ScanModexTI, and BistEnablexTI shall be tied to
ground during normal operation.
BIST
The memory BIST writes a chess-board pattern into the memories.
Thereafter, it reads out the memories and checks whether the pattern
matches the expected one or not. If the check for one memory passes
the BIST-signal of that memory is raised, otherwise it remains zero.
Figure B.6 shows the signaling scheme used to enable the memory
built-in self-test (BIST) mode. To enter the BIST mode, the
BistEnablexTI signal has to be set to 1 before RstxRBI is released.
BistModexTI selects whether the "Bist DONE" or "Bist OK" status is
visible on StBistOutxTO_X and SeqBistOutxTO_X. If BistModexTI
is set to 1, the "Bist DONE" status is visible, otherwise "Bist OK".
Table B.11 reports the StBistOutxTO_X and SeqBistOutxTO_X
mapping to the different memories.
SCAN
The scan chain is activated through ScanEnablexTI (active if 1).
ScanModexTI isolates the memories from the scan chain and needs to be
set to 1 during test mode.

Figure B.5: Physical view of ASPE’s index and dictionary memories.
[Diagram: index memory IDX with addresses 0x000 to 0x3FF; dictionary
memories DICT2 (slots U0 to U3, U0 being the SEQ), DICT1 (slots U4 to
U7), and DICT0 (slots U8 to U11), each with 256 rows and addresses
0x000 to 0x3FF.]

Table B.6: Assignment of dictionary memory slots to SEQ, FUs, and
SUs.
Dict. Slot | ASPE A | ASPE B
U0 | SEQ | SEQ
U1 | REG | REG
U2 | CMAC | DIV
U3 | CALU0 | CMAC0
U4 | CALU1 | CMAC1
U5 | SU0 | CALU
U6 | SU1 | SU0
U7 | SU2 | SU1
U8 | SU3 | SU2
U9 | SU4 | SU3
U10 | OBUF | OBUF
U11 | IBUF | IBUF

Table B.7: Index and dictionary memory maps for ASPE A and
ASPE B (write only).
Unit | Address Range
DICT0 | 0x720000 to 0x720FFC
DICT1 | 0x760000 to 0x760FFC
DICT2 | 0x7A0000 to 0x7A0FFC
IDX | 0x7B0000 to 0x7B0FFC

Table B.8: Internal slot number uuuu assignment for FUs and SUs.
Internal slot uuuu | ASPE A | ASPE B
0x0 | REG | REG
0x1 | nil | nil
0x2 | nil | DIV
0x3 | nil | CMAC0
0x4 | CMAC | CMAC1
0x5 | CALU0 | CALU
0x6 | CALU1 | nil
0x7 | nil | nil
0x8 | SU0 | SU0
0x9 | SU1 | SU1
0xA | SU2 | SU2
0xB | SU3 | SU3
0xC | SU4 | OBUF
0xD | OBUF | IBUF
0xE | IBUF | nil
Table B.9: Memory maps for addressing SUs of ASPE A and ASPE B.
Unit Address Range Meaning
SU0 0xE1A000 Write to both SIMD mems
0xE1A3FF
0xE18000 Write to R mem
0xE183FF
0xE19000 Write to L mem
0xE193FF
0xE14000 Read from R mem
0xE143FF
0xE15000 Read from L mem
0xE153FF
0xE5A3FF
0xE583FF
0xE593FF
0xE543FF
0xE553FF
0xE9A3FF
0xE983FF
0xE993FF
0xE943FF
0xE953FF
Table B.10: Memory maps for addressing SUs of ASPE A and ASPE B.
Note that SU4 is only available on ASPE A.
Unit Address Range Meaning
SU3 0xEDA000 Write to both SIMD mems
0xEDA3FF
0xED8000 Write to R mem
0xED83FF
0xED9000 Write to L mem
0xED93FF
0xED4000 Read from R mem
0xED43FF
0xED5000 Read from L mem
0xED53FF
SU4 0xF1A000 Write to both SIMD mems
0xF1A3FF
0xF18000 Write to R mem
0xF183FF
0xF19000 Write to L mem
0xF193FF
0xF14000 Read from R mem
0xF143FF
0xF15000 Read from L mem
0xF153FF
Figure B.6: BIST signaling. [Timing diagram of ClkxCI, BistEnablexTI
(set before RstxRBI goes high), RstxRBI, BistModexTI (show BIST DONE,
then show BIST OK), and the resulting status (BIST DONE, then BIST OK)
on StBistOutxTO_X and SeqBistOutxTO_X.]
Table B.11: BIST pad-to-memory mapping.

Pad ASPE A ASPE B
StBistOutxTO_X:
0: SU0, RAM R SU0, RAM 0
1: SU0, RAM L SU0, RAM 1
8: SU4, RAM R always 1
9: SU4, RAM L always 0
10: always 0 always 0
11: always 0 always 0
12: OBUF OBUF
13: IBUF IBUF
SeqBistOutxTO_X:
0: DICT0 DICT0
1: DICT1 DICT1
2: DICT2 DICT2
Bibliography
[1] M. Eteläperä and J.-P. Soininen, “4G mobile terminal

architectures,” Technical Research Centre of Finland (VTT),
Tech. Rep., 2007. [Online]. Available: http://rooster.oulu.fi/
materiaalit/4G%20terminal%20architectures.pdf 3
[2] G. E. Moore, “Cramming more components onto integrated
circuits, reprinted from Electronics, volume 38, number 8,” IEEE
Solid-State Circuits Newsletter, vol. 20, no. 3, pp. 33–35, Sep.
2006. 4
[3] S. Cherry, “Edholm’s law of bandwidth,” IEEE Spectrum, vol. 41,

no. 7, pp. 58–60, Jul. 2004. 4
[4] U. Ramacher, “Software-defined radio prospects for multistandard

mobile phones,” Computer, vol. 40, no. 10, pp. 62–69, Oct. 2007.
4
[5] K. van Berkel, F. Heinle, P. P. E. Meuwissen, K. Moerman, and

M. Weiss, “Vector processing as an enabler for software-defined
radio in handheld devices,” EURASIP Journal on Applied Signal
Processing, vol. 2005, no. 16, pp. 2613–2625, 2005. 4, 26, 27, 38
[6] H. Bölcskei, “MIMO-OFDM wireless systems: basics, perspectives,

and challenges,” IEEE Wireless Communications, vol. 13, no. 4,
pp. 31–37, Aug. 2006. 5
[7] A. J. Paulraj, D. A. Gore, R. U. Nabar, and H. Bölcskei, “An

overview of MIMO communications - a key to gigabit wireless,”
Proceedings of the IEEE, vol. 92, no. 2, pp. 198–218, Feb. 2004.
5, 53, 57, 161
[8] IEEE, Draft Standard for Information Technology-
Telecommunications and information exchange between systems–
Local and metropolitan area networks–Specific requirements– Part
11: Wireless LAN Medium Access Control (MAC) and Physical
Layer (PHY) specifications: Amendment 4: Enhancements for
Higher Throughput, 2007. 5, 72
[9] T. Boesch, “Adaptive stream processor for network multimedia
consumer electronic devices,” Ph.D. dissertation, ETH Zurich,
2004. 5, 6, 98, 100, 156
[10] S. Eberli, A. Burg, T. Boesch, and W. Fichtner, “An IEEE 802.11a
baseband receiver implementation on an application specific pro-
cessor,” in Circuits and Systems, 2007. MWSCAS 2007. 50th
Midwest Symposium on, Montreal, Que., Aug. 5–8, 2007, pp.
1324–1327. 5, 101, 107
[11] S. Eberli, A. Burg, and W. Fichtner, “Implementa-
tion of a 2 × 2 MIMO-OFDM receiver on an appli-
cation specific processor,” Microelectronics Journal, vol.
In Press, Corrected Proof, pp. –, 2009 (invited). [On-
line]. Available: http://www.sciencedirect.com/science/article/
B6V44-4VWJ1YP-1/2/bbcb3d4c513f25e650913b83fef4c11d 6
[12] B. Bougard, B. De Sutter, S. Rabou, D. Novo, O. Allam,
S. Dupont, and L. Van der Perre, “A coarse-grained array based
baseband processor for 100Mbps+ software defined radio,” in De-
sign, Automation and Test in Europe, 2008. DATE ’08, Munich,
Germany, Mar. 2008, pp. 716–721. 6, 7, 33, 34, 38, 151, 154, 156,
158
[13] S. Eberli, D. Cescato, and W. Fichtner, “Divide-and-conquer ma-
trix inversion for linear MMSE detection in SDR MIMO receivers,”
in NORCHIP, 2008., Tallinn, Nov. 2008, pp. 162–167. 6, 67, 189
[14] R. H. Dennard, J. Cai, and A. Kumar, “A perspective on today’s
scaling challenges and possible future directions,” Solid-State
Electronics, vol. 51, no. 4, pp. 518–525, April 2007. 11
[15] H. Kaeslin, Digital Integrated Circuits: From VLSI Architectures

to CMOS Fabrication. Cambridge University Press, May 2008.
12
[16] T. Pionteck, L. D. Kabulepa, and M. Glesner, “Exploring the

capabilities of reconfigurable hardware for OFDM-based WLANs,”
ser. IFIP International Federation for Information Processing, vol.
200. Springer Boston, 2006, pp. 149–164. [Online]. Available:
http://www.springerlink.com/content/0148404m07082636/ 17,
18
[17] C. Ebeling, C. Fisher, G. Xing, M. Shen, and H. Liu, “Implement-

ing an OFDM receiver on the RaPiD reconfigurable architecture,”
IEEE Transactions on Computers, vol. 53, no. 11, pp. 1436–1448,
Nov. 2004. 17, 19, 20, 38
[18] C. Ebeling, D. C. Cronquist, and P. Franklin, Field-Programmable

Logic Smart Applications, New Paradigms and Compilers, ser.
Lecture Notes in Computer Science. Springer Berlin / Heidelberg,
1996, vol. 1142, ch. RaPiD – Reconfigurable pipelined datapath,
pp. 126–135. 17, 19, 38
[19] D. C. Cronquist, C. Fisher, M. Figueroa, P. Franklin, and C. Ebel-

ing, “Architecture design of reconfigurable pipelined datapaths,”
in Advanced Research in VLSI, 1999. Proceedings. 20th Anniver-
sary Conference on, Atlanta, GA, Mar. 1999, pp. 23–40. 20
[20] A. Kamalizad, N. Tabrizi, N. Bagherzadeh, and A. Hatanaka,

“A programmable DSP architecture for wireless communication
systems,” in Application-Specific Systems, Architecture Processors,
2005. ASAP 2005. 16th IEEE International Conference on, Jul.
23–25, 2005, pp. 231–238. 20
[21] N. Tabrizi, N. Bagherzadeh, A. Kamalizad, and H. Du, “MaRS: A

macro-pipelined reconfigurable system,” ACM Computing Fron-
tiers, pp. 343 – 349, 2004. 20
[22] H. Singh, M.-H. Lee, G. Lu, F. J. Kurdahi, N. Bagherzadeh, and

E. M. Chaves Filho, “MorphoSys: an integrated reconfigurable
system for data-parallel and computation-intensive applications,”
IEEE Transactions on Computers, vol. 49, no. 5, pp. 465–481,

May 2000. 20, 21, 38
[23] Y. Lin, H. Lee, M. Woh, Y. Harel, S. Mahlke, T. Mudge,

C. Chakrabarti, and K. Flautner, “SODA: A low-power architec-
ture for software radio,” in Computer Architecture, 2006. ISCA
’06. 33rd International Symposium on, Boston, MA, 2006, pp.
89–101. 22, 23, 24, 38
[24] G. K. Rauwerda, P. M. Heysters, and G. J. M. Smit, “An

OFDM receiver implemented on the coarse-grain reconfigurable
Montium processor,” in Proceedings of the 9th International
OFDM-Workshop, Dresden, Germany, September 2004, pp. 197–
201. [Online]. Available: http://eprints.eemcs.utwente.nl/1496/
25, 38
[25] P. M. Heysters, G. J. Smit, and E. Molenkamp, “Montium -

balancing between energy-efficiency, flexibility and performance,”
in International Conference on Engineering of Reconfigurable
Systems and Algorithms, ERSA, 2003, pp. 235–241. [Online].
Available: http://doc.utwente.nl/46380/ 25
[26] G. K. Rauwerda, “Multi-standard adaptive wireless communica-

tion receivers: adaptive applications mapped on heterogeneous
dynamically reconfigurable hardware,” Ph.D. dissertation, Univ.
of Twente, Enschede, January 2008. [Online]. Available:
http://dx.doi.org/10.3990/1.9789036526074 25
[27] G. K. Rauwerda, P. M. Heysters, and G. J. M. Smit, “Towards

software defined radios using coarse-grained reconfigurable hard-
ware,” IEEE Transactions on Very Large Scale Integration (VLSI)
Systems, vol. 16, no. 1, pp. 3–13, Jan. 2008. 25, 26, 158
[28] J. Kneip, M. Weiss, W. Drescher, V. Aue, J. Strobel, T. Oberthür,

M. Bolle, and G. Fettweis, “Single chip programmable baseband
ASSP for 5GHz wireless LAN applications,” IEICE Trans. Elec-
tron., vol. E85-C, pp. 359 – 367, Feb. 2002. 26, 38
[29] T. Richter, W. Drescher, E. Engel, S. Kobayashi, V. Nikolajevic,
M. Weiss, and G. Fettweis, “A platform-based highly parallel digital
signal processor,” in Custom Integrated Circuits, 2001, IEEE
Conference on., San Diego, CA, May 2001, pp. 305–308. 26
[30] J. H. Moreno, V. Zyuban, U. Shvadron, F. D. Neeser, J. H.

Derby, M. S. Ware, K. Kailas, A. Zaks, A. Geva, S. Ben-David,
S. W. Asaad, T. W. Fox, D. Littrell, M. Biberstein, D. Naishlos,
and H. Hunter, “An innovative low-power high-performance pro-
grammable signal processor for digital communications,” IBM J.
Res. Dev., vol. 47, no. 2-3, pp. 299–326, 2003. 28
[31] D. Iancu, H. Ye, E. Surducan, M. Senthilvelan, J. Glossner, V. Sur-

ducan, V. Kotlyar, A. Iancu, G. Nacer, and J. Takala, “Software
implementation of WiMAX on the Sandbridge Sandblaster platform,”
in SAMOS, 2006, pp. 435–446. 28, 38
[32] C. J. Glossner, T. Raja, E. Hokenek, and M. Moudgill, “A multithreaded
processor architecture for SDR,” Proceedings of the Korean
Institute of Communication Sciences, pp. 70–85, November 2002.
28, 29, 38
[33] M. Schulte, J. Glossner, S. Jinturkar, M. Moudgill, S. Mamidi,

and S. Vassiliadis, “A low-power multithreaded processor for
software defined radio,” The Journal of VLSI Signal Processing,
vol. 43, no. 2-3, pp. 143–159, June 2006. [Online]. Available:
http://www.springerlink.com/content/t535603181168478/ 28, 29
[34] E. Tell, “Design of programmable baseband processors,” Ph.D.

dissertation, Linköping University, 2005. [Online]. Available: http:
//liu.diva-portal.org/smash/get/diva2:20611/FULLTEXT01 30,
31, 38, 78
[35] A. Nilsson, “Design of programmable multi-standard baseband

processors,” Ph.D. dissertation, Linköping University, 2007.
[Online]. Available: http://www.ep.liu.se/smash/get/diva2:
23639/FULLTEXT01 30, 31, 38
[36] A. Nilsson, E. Tell, and D. Liu, “An 11 mm², 70 mW fully programmable
baseband processor for mobile WiMAX and DVB-T/H
in 0.12 µm CMOS,” in Solid-State Circuits, IEEE Journal
of, vol. 44, no. 1, Lille, France, Jan. 2009, pp. 90–97. 30
[37] J. Leijten, G. Burns, J. Huisken, E. Waterlander, and A. van

Wel, “AVISPA: a massively parallel reconfigurable accelerator,”
System-on-Chip, 2003. Proceedings. International Symposium on,
pp. 165 – 168, Nov. 19–21, 2003. 32, 33, 38
[38] T. R. Halfhill, “Silicon hive breaks out,” Microprocessor Report,

pp. 165 – 168, Dec. 1, 2003. 32
[39] I. Held and B. VanderWiele, “AVISPA CH - embedded communi-

cations signal processor for multi-standard digital television,” in
GSPx TV to Mobile, Mar. 2006. 32
[40] B. Mei, S. Vernalde, D. Verkest, H. D. Man, and R. Lauwereins,

“ADRES: An architecture with tightly coupled VLIW processor
and coarse-grained reconfigurable matrix,” in FPL, 2003, pp.
61–70. 33, 34, 38
[41] B. Bougard, A. Bourdoux, F. Naessens, M. Glassee, V. Derudder,

and L. Van der Perre, “Energy-efficient software-defined radio so-
lutions for MIMO-based broadband communication,” in European
Signal Processing Conference. Proceedings of, Poznan, Poland,
Sep. 2007. 33, 37
[42] D. Novo, W. Moffat, V. Derudder, and B. Bougard, “Mapping

a multiple antenna SDM-OFDM receiver on the ADRES coarse-
grained reconfigurable processor,” IEEE Workshop on Signal
Processing Systems Design and Implementation, pp. 473–478,
Nov. 2–4, 2005. 33
[43] R. Enzler, “The current status of reconfigurable computing,” ETH

Zurich, Electronics Laboratory, Tech. Rep., 1999. 37
[44] ——, “Architectural trade-offs in dynamically reconfigurable pro-

cessors,” Ph.D. dissertation, ETH Zurich, 2004. 37
[45] R. Hartenstein, “A decade of reconfigurable computing: a vision-

ary retrospective,” Design, Automation and Test in Europe, 2001.
Conference and Exhibition 2001. Proceedings, pp. 642 – 649, 13-16
Mar. 2001. 37
[46] H. Amano, “A survey on dynamically reconfigurable processors,”

IEICE Transactions on Communications, vol. E89-B, no. 12,
pp. 3179–3187, 2006. [Online]. Available: http://ietcom.
oxfordjournals.org/cgi/content/abstract/E89-B/12/3179 37
[47] “IEEE Micro Special Issue: Accelerator Architectures,” IEEE
Micro, vol. 28, no. 4, Jul./Aug. 2008. 37
[48] T. Pionteck, L. D. Kabulepa, and M. Glesner, “Exploring the
capabilities of reconfigurable hardware for OFDM-based WLANs,” in
VLSI-SOC, Darmstadt, Germany, Dec. 1–3, 2003, pp. 149–164.
38
[49] P. M. Heysters, G. J. Smit, and E. Molenkamp, “A Flexible and
Energy-Efficient Coarse-Grained Reconfigurable Architecture for
Mobile Systems,” The Journal of Supercomputing, vol. 26, no. 3,
pp. 283 – 308, November 2003. 38
[50] A. Paulraj, R. Nabar, and D. Gore, Introduction to Space-Time
Wireless Communications. Cambridge Univ. Press, 2003. 49
[51] A. Burg, “VLSI circuits for MIMO communication systems,” Ph.D.
dissertation, ETH Zurich, 2006. 53, 167, 168, 182
[52] R. G. Gallager, Principles of Digital Communication. Cambridge
University Press, 2008. 53, 161
[53] I. B. Collings, M. R. G. Butler, and M. McKay, “Low complexity
receiver design for MIMO bit-interleaved coded modulation,” in
Spread Spectrum Techniques and Applications, 2004 IEEE Eighth
International Symposium on, Aug. 30–Sep. 2, 2004, pp. 12–16. 54,
58, 77
[54] A. Burg, M. Borgmann, M. Wenk, M. Zellweger, W. Fichtner, and
H. Bölcskei, “VLSI implementation of MIMO detection using the
sphere decoding algorithm,” IEEE Journal of Solid-State Circuits,
vol. 40, no. 7, pp. 1566–1577, Jul. 2005. 54, 55, 161
[55] A. Burg, M. Borgmann, M. Wenk, C. Studer, and H. Bölcskei,
“Advanced receiver algorithms for MIMO wireless communications,”
in Design, Automation and Test in Europe, 2006. DATE ’06.
Proceedings, vol. 1, Mar. 6–10, 2006. 55
[56] C. Studer, M. Wenk, A. Burg, and H. Bölcskei, “Soft-output

sphere decoding: Performance and implementation aspects,” in
Signals, Systems and Computers, 2006. ACSSC ’06. Fortieth
Asilomar Conference on, Pacific Grove, CA, Oct./Nov. 2006, pp.
2071–2076. 55
[57] C. Studer, A. Burg, and H. Bölcskei, “Soft-output sphere decoding:

algorithms and VLSI implementation,” IEEE Journal on Selected
Areas in Communications, vol. 26, no. 2, pp. 290–300, Feb. 2008.
55
[58] K.-W. Wong, C.-Y. Tsui, R. S.-K. Cheng, and W.-H. Mow,

“A VLSI architecture of a k-best lattice decoding algorithm for
MIMO channels,” in Circuits and Systems, 2002. ISCAS 2002.
IEEE International Symposium on, vol. 3, 2002, pp. 273–276. 56
[59] M. Wenk, M. Zellweger, A. Burg, N. Felber, and W. Fichtner,

“K-best MIMO detection VLSI architectures achieving up to 424
Mbps,” in Circuits and Systems, 2006. ISCAS 2006. Proceedings.
2006 IEEE International Symposium on, Island of Kos. 56, 163
[60] J. Antikainen, P. Salmela, O. Silven, M. Juntti, J. Takala, and

M. Myllyla, “Application-specific instruction set processor im-
plementation of list sphere detector,” in Signals, Systems and
Computers, 2007. ACSSC 2007. Conference Record of the Forty-
First Asilomar Conference on, Pacific Grove, CA, Nov. 4–7, 2007,
pp. 943–947. 56
[61] E. Zimmermann and G. Fettweis, “Adaptive vs. hybrid iterative

MIMO receivers based on MMSE linear and soft-SIC detection,”
in Personal, Indoor and Mobile Radio Communications, 2006
IEEE 17th International Symposium on, Helsinki, Sep. 2006, pp.
1–5. 57
[62] S. Haene, A. Burg, D. Perels, P. Luethi, N. Felber, and W. Ficht-

ner, “Silicon implementation of an MMSE-based soft demapper
for MIMO-BICM,” in Circuits and Systems, 2006. ISCAS 2006.
Proceedings. 2006 IEEE International Symposium on, May 21–24,
2006. 58, 77
[63] A. Burg, S. Haene, D. Perels, P. Luethi, N. Felber, and W. Ficht-

ner, “Algorithm and VLSI architecture for linear MMSE detection
in MIMO-OFDM systems,” in Circuits and Systems, 2006. ISCAS
2006. Proceedings. 2006 IEEE International Symposium on, May
2006. 58, 67, 185
[64] M. Wenk, “Real-time MIMO-OFDM testbed: Challenges, imple-
mentations, and measurement results,” Ph.D. dissertation, ETH
Zurich, 2010, to appear. 59
[65] R. A. Horn and C. R. Johnson, Matrix Analysis, Cambridge, U.K.,
1985. 64
[66] G. H. Golub and C. F. Van Loan, Matrix computations (3rd ed.).
Baltimore, MD, USA: Johns Hopkins University Press, 1996. 65,
66, 167, 170, 174
[67] J. P. Gram, “Ueber die entwickelung reeller functionen in reihen
mittelst der methode der kleinsten quadrate,” Journal für die
reine und angewandte Mathematik, vol. 94, pp. 41–73, 1883. 66
[68] E. Schmidt, “Zur theorie der linearen und nichtlinearen integral-
gleichungen. i. teil: Entwicklung willkürlicher funktionen nach
systemen vorgeschriebener,” Mathematische Annalen, vol. 63, pp.
433–476, 1907. 66
[69] F. Zhang, Ed., The Schur Complement and Its Applications, ser.
Numerical Methods and Algorithms. Springer US, March 30
2006, vol. 4. 67
[70] T. Banachiewicz, “Méthode de résolution numérique des équations
linéaires, du calcul des déterminants et des inverses, et de réduction
des formes quadratiques,” Bull. internat. Acad. Polonaise
Sci. Lett., Cl. Sci. math. natur., A, vol. 1938, pp. 393–404, 1938.
67
[71] H. Boltz, “Entwickelungsverfahren zum ausgleichen geodätischer
netze nach der methode der kleinsten quadrate,” in Veröffentlichungen
des Preussischen Geodätischen Institutes, Neue
Folge 90, Druck und Verlag von P. Stankiewicz’ Buchdruckerei,
1923. 67
[72] R. Lohan, “Das entwicklungsverfahren zum ausgleichen geodätischer
netze nach boltz im matrizenkalkül,” Zeitschrift für angewandte
Mathematik und Mechanik, vol. 13, pp. 59–60, 1933. 67
[73] T. M. Schmidl and D. C. Cox, “Robust frequency and timing syn-
chronization for OFDM,” IEEE Transactions on Communications,
vol. 45, no. 12, pp. 1613–1621, Dec. 1997. 73
[74] J. Volder, “The CORDIC trigonometric computing technique,” in
IRE Trans. Electronic Computers, vol. EC-8, no. 3, Sep. 1959, pp.
330–334. 75
[75] B. Parhami, Computer Arithmetic: Algorithms and Hardware
Designs. Oxford University Press, 2000. 75, 115
[76] SPRS276H, TMS320C6455 – Fixed-Point Digital Signal Processor,
Texas Instruments Incorporated, May 2005. 83, 107
[77] L. Bardelli, L. Henzen, and C. Pedretti, “Design of a real-time
underwater acoustic MIMO-OFDM communication system,” Master’s
thesis, ETH Zürich, Jun. 2007. 85
[78] SPRAAE8B, TMS320C6455/C6454 Power Consumption Sum-
mary (Rev. B), Texas Instruments Incorporated, Oct. 2007.
[Online]. Available: http://focus.ti.com.cn/cn/lit/an/spraae8b/
spraae8b.pdf 89
[79] E. Sereni, S. Culicchi, V. Vinti, E. Luchetti, S. Ottaviani, and
M. Salvi, “A software radio OFDM transceiver for WLAN
applications,” 2001. [Online]. Available: http://www.di.uoa.gr/
speech/dsp/X/PERUGI.PDF 90, 92
[80] M. Tariq, Y. Baltaci, T. Horseman, M. Butler, and A. Nix, “De-
velopment of an OFDM based high speed wireless LAN platform
using the TI C6x DSP,” Communications, 2002. ICC 2002. IEEE
International Conference on, vol. 1, pp. 522 – 526, 28 April-2 May
2002. 90, 92
[81] A. L. Cinquino and Y. Shayan, “A real-time software implementation
of an OFDM modem suitable for software defined radios,”
Electrical and Computer Engineering, 2004. Canadian Conference
on, vol. 2, pp. 697 – 701, 2-5 May 2004. 90, 92
[82] M. Schoenes, S. Eberli, A. Burg, D. Perels, S. Haene, N. Felber,

and W. Fichtner, “A novel SIMD DSP architecture for software
defined radio,” 2003. MWSCAS ’03. Proceedings of the 46th IEEE
International Midwest Symposium on Circuits and Systems, vol. 3,
pp. 1443–1446, Dec. 2003. 91, 107
[83] F. Barat and R. Lauwereins, “Reconfigurable instruction set pro-

cessors: a survey,” in Rapid System Prototyping, 2000. RSP 2000.
Proceedings. 11th International Workshop on, Paris, France, 2000,
pp. 168–173. 96
[84] C. Merk and C. Studer, “Power and performance optimization

of a complex-number arithmetic logic unit,” ETH Zürich, Tech.
Rep., Feb. 2003. 96
[85] Benchmarks for TMS320C55x and TMS320C64x DSP Families,

Texas Instruments Incorporated. [Online]. Available: http:
//dspvillage.ti.com 97, 99
[86] IEEE Std., Part 11: Wireless LAN medium Access Control (MAC)
and Physical Layer (PHY) specifications, High-speed Physical
Layer in the 5GHz Band, 1999. 101
[87] P. M. Heysters, G. K. Rauwerda, and G. J. Smit, “Implementa-

tion of a HiperLAN/2 receiver on the reconfigurable Montium
architecture,” in Parallel and Distributed Processing Symposium,
2004. Proceedings. 18th International, 26-30 Apr. 2004, p. 147.
103
[88] A. Niktash, R. Maestre, and N. Bagherzadeh, “A case study of

performing OFDM kernels on a novel reconfigurable DSP architec-
ture,” in Military Communications Conference, 2005. MILCOM
2005. IEEE, 17-20 oct 2005, pp. 1813–1818. 103
[89] M. Ros and P. Sutton, “Compiler optimization and ordering

effects on VLIW code compression,” in CASES ’03: Proceedings
of the 2003 international conference on Compilers, architecture
and synthesis for embedded systems. New York, NY, USA: ACM,
2003, pp. 95–103. 139
[90] S. Y. Larin and T. M. Conte, “Compiler-driven cached code com-

pression schemes for embedded ILP processors,” in Microarchitec-
ture, 1999. MICRO-32. Proceedings. 32nd Annual International
Symposium on, Haifa, Nov. 16–18, 1999, pp. 82–92. 139
[91] Y. Xie, W. Wolf, and H. Lekatsas, “Code compression for em-
bedded VLIW processors using variable-to-fixed coding,” IEEE
Transactions on Very Large Scale Integration (VLSI) Systems,
vol. 14, no. 5, pp. 525–536, May 2006. 139
[92] C. H. Lin, Y. Xie, and W. Wolf, “LZW-based code compression
for VLIW embedded systems,” in Design, Automation and Test
in Europe Conference and Exhibition, 2004. Proceedings, vol. 3,
Feb. 16–20, 2004, pp. 76–81. 139
[93] O. Schliebusch, H. Meyr, and R. Leupers, Optimized
ASIP Synthesis from Architecture Description Language
Models. Springer Netherlands, 2007. [Online]. Available:
http://www.springerlink.com/content/wpg782/ 157
[94] S. Haene, D. Perels, and A. Burg, “A real-time 4-stream MIMO-
OFDM transceiver: System design, FPGA implementation, and
characterization,” IEEE Journal on Selected Areas in Communi-
cations, vol. 26, no. 6, pp. 877–889, Aug. 2008. 159
[95] M. Wenk, P. Luethi, T. Koch, P. Maechler, M. Lerjen, N. Felber,
and W. Fichtner, “Hardware platform and implementation of a real-time
multi-user MIMO-OFDM testbed,” in Proc. IEEE Int. Symp.
on Circuits and Systems (ISCAS’09), May 2009, pp. 789–792. 159
Curriculum Vitae
Stefan Eberli was born in 1978 in Lugano. He studied electrical engineering
at the ETH Zürich where he received the Dipl. Ing. degree in
2003, writing a thesis about a novel digital signal processor for software
defined radio applications at the Integrated Systems Laboratory (IIS).
After two years at BridgeCo AG, Dübendorf, Switzerland, where
he worked in the field of ASIC functional verification, he joined the IIS,
ETH Zürich in 2005, as a research and teaching assistant. His research
interests lie in the domain of digital signal processing and include, in
particular, the design of efficient platforms for software-defined radios.