
Hardware Accelerator for Minimum Mean Square Error Interference Alignment


Markus Kock, Institute of Microelectronic Systems, Leibniz Universität Hannover, Hannover, Germany, kock@ims.uni-hannover.de
Steffen Busch, Institute of Cartography and Geoinformatics, Leibniz Universität Hannover, Hannover, Germany, steffen.busch@ikg.uni-hannover.de
Holger Blume, Institute of Microelectronic Systems, Leibniz Universität Hannover, Hannover, Germany, blume@ims.uni-hannover.de

Abstract: A dedicated hardware architecture for the digital baseband processing of minimum mean square error interference alignment is presented. The computationally intensive task of calculating the precoding and decoding matrices has been implemented and the underlying algorithm has been optimized for real-time capability, efficiency and flexibility. The required number of iterations has been optimized and appropriate low-latency algorithms for the computation of basic operations have been identified to meet a real-time constraint of 1 ms processing latency. The architecture has been verified and synthesized for a Xilinx Virtex-6 LX550T FPGA. The maximum number of antennas, users and data streams is configurable at synthesis time. The actual parameters are configurable at runtime. Different degrees of parallelism allow a trade-off between resource requirements, latency and throughput. The target FPGA resources are sufficient for real-time system configurations up to 5 users with 3 antennas.

Index Terms: Interference Alignment; Hardware Accelerator; Testbed

I. INTRODUCTION

Channel capacity in multi-user wireless communication systems is limited by interference. A well known approach towards exploiting the available channel capacity is Interference Alignment (IA) [1]. Depending on the context and application, different approaches to IA are feasible. A multitude of different algorithms for IA in a variety of applications and constraints exist to leverage system channel capacity in multi-user (MU) multiple-input-multiple-output (MIMO) systems [2]. The focus of this paper is on minimum mean square error interference alignment as presented in [3]. Here, linear precoding and decoding matrices are applied at the transmitter and receiver, respectively. From the hardware point of view, current general purpose and Software-Defined Radio (SDR) systems are unable to compute these precoding and decoding matrices in real-time.

Due to its iterative nature and the required type of operations, the computational complexity of this algorithm is demanding, especially for mobile devices [4]. Thus, for high data rate systems, a high throughput real-time Minimum Mean Square Error (MMSE) IA architecture is required. Furthermore, the acceptable latency depends on the varying channel. In this paper, the acceptable latency requirement is assumed to be 1 ms for fast changing channels. The precoding matrices have to be adapted to the time-varying channel sufficiently fast to achieve real-time operation. In particular, outdated precoding matrices result in insufficiently aligned interference at the receivers. Thus, the channel has to be tracked and the precoding matrices have to be updated at a sufficiently low latency. The energy budget for computing one set of precoding and decoding matrices is limited, especially in mobile OFDM systems with a high subcarrier count.

From the hardware perspective, previous testbeds focused on demonstrating the real-world feasibility of Interference Alignment with questions in system level integration and algorithmics [5] [6] [7] [8] [9]. These testbeds have provided the proof-of-concept for IA by leveraging rapid-prototyping approaches to hardware implementation. However, real-time capability for fast changing channels and hardware efficiency are not within the scope of these papers. Processing latency and hardware resource requirements have usually not been published. For real-world usage, highly efficient hardware architectures with sufficiently low latency and power consumption and sufficient throughput are required. Very little has been published about resource-efficient digital hardware implementations feasible for low-power and mobile devices. In this contribution, we focus on efficient digital hardware architectures for the computation of the linear precoding and decoding matrices according to the Minimum Mean Square Error criterion. An estimate of the computational complexity of MMSE IA has been previously published in [4].

The rest of the paper is organized as follows. The system model and MMSE IA algorithm are recapitulated in Sections II and III. Algorithm optimizations aiming at hardware implementation are presented in Section IV. The hardware architecture and synthesis results are discussed in Sections V and VI, respectively.

978-1-4799-8058-1/15/$31.00 ©2015 IEEE


II. MIMO SYSTEM MODEL

The interference channel model is shown in Fig. 1. $K$ users share a common transmission medium in a point-to-point communication scenario. Each transmitter TX$_k$ multiplies the vector of $d$ discrete transmit data streams with a precoding matrix $V_k$ and applies the result to $N_t$ transmit antennas. The receiver RX$_k$ applies the decoding matrix $U_k$ to the signals picked up by $N_r$ receive antennas. The received signal $y_i$ at each receiver $i$ is

$$y_i = H_{ii} V_i s_i + \sum_{j \neq i} H_{ij} V_j s_j + n_i \qquad (1)$$

with $s_j$ being the transmit data at transmitter $j$ and $n_i$ being the noise picked up by receiver $i$. The $H_{ii}$ are the desired channels, all $H_{ij}$ with $j \neq i$ convey interference.

[Figure 1: block diagram]
Fig. 1. Multi-user 3 × 3 MIMO system with precoding matrices $V_j$ and decoding matrices $U_i$

III. MMSE IA ALGORITHM

The MMSE IA algorithm used to compute the precoding and decoding matrices $V$ and $U$ is taken from [3]. This section presents the algorithm details required throughout the rest of the paper.

Arbitrary initial values are chosen for the precoding matrices $V_k$ in the first algorithm step. Then, the decoding matrices $U_k$ are computed according to Eq. 2. In an iterative process, the precoding matrices $V_k$ are updated according to Eq. 3 based on the current set of decoding matrices. This outermost loop continues until convergence. For each update of $V_k$, the Lagrange multiplier $\lambda_k \geq 0$ has to be chosen iteratively to satisfy the transmit power scaling constraint $\|V_k\|_F^2 \leq 1$.

$$U_k = \left( \sum_{j=1}^{K} H_{kj} V_j V_j^H H_{kj}^H + \sigma^2 I \right)^{-1} H_{kk} V_k \qquad (2)$$

$$V_k = \left( \sum_{j=1}^{K} H_{jk}^H U_j U_j^H H_{jk} + \lambda_k I \right)^{-1} H_{kk}^H U_k \qquad (3)$$

Here, $\sigma^2$ denotes the noise variance. Figure 2 shows the corresponding flowchart.

[Figure 2: flowchart with the following steps]
(0) Init $V_k$
(1) $U_k = \left( \sum^{K} f(V) \right)^{-1} f(H, V_k)$
(2) $R_k = \sum^{K} f(U)$
(3) $\lambda_k = 0$
(4) $V_k = (R_k + \lambda_k I)^{-1} f(H, U_k)$
(5) $P = \|V_k\|_F^2 - 1$
(6) update $\lambda_k$, then return to (4)
(7) $|P|$ below threshold? no: go to (6); yes: go to (8)
(8) MSE
(9) MSE below threshold? no: return to (1); yes: finished
Fig. 2. MMSE IA algorithm flowchart

IV. ALGORITHM OPTIMIZATION

Two reference software models have been implemented in MATLAB using floating-point representation. Emphasis was put on numerical stability by using MATLAB's SVD-based pinv() function for matrix inversion and the fzero() function for root-finding to solve for $\lambda_k$. This MATLAB model serves as the base and verification reference for the algorithmic optimizations described below. The optimized model serves as a reference for hardware verification.

The algorithm loop structure shown in Fig. 2 allows the identification of the most often executed core operations. The data dependencies dictate the sequential execution of the shown loops. The following optimization approaches have been considered in order to achieve real-time execution with a maximum latency of 1 ms:
- reduced number of loop iterations
- use of low latency operations
- parallelization

The required iteration counts of both loops in Fig. 2 are data dependent. Proper system setups within the bounds of $N_t = N_r = 2..11$, $d = 1..5$, $K = 3..19$ were used to simulate 100 random channels each. Evaluating the convergence has shown that the final channel capacity has been reached within 1% after at most 100 iterations. Therefore, the maximum number of outer loop iterations has been fixed to 100.
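The loop structure of Fig. 2 with the fixed outer iteration budget can be sketched in NumPy as follows. This is a minimal floating-point illustration under several assumptions that are not taken from the paper: the function names (`update_u`, `update_v`, `total_mse`, `mmse_ia`) are hypothetical, the power constraint is enforced by plain bisection on $\lambda_k$ (chosen for brevity, not the root-finding method of this work), the initial precoders are random with unit Frobenius norm, and the MSE expression assumes uncorrelated unit-power data symbols and white noise of variance `sigma2`. `H[k][j]` denotes the channel from transmitter `j` to receiver `k`.

```python
import numpy as np

def herm(a):
    """Conjugate transpose."""
    return a.conj().T

def update_u(H, V, k, sigma2):
    """Eq. 2: decoding matrix of user k from all current precoders."""
    n_r = H[k][k].shape[0]
    cov = sigma2 * np.eye(n_r, dtype=complex)
    for j in range(len(V)):
        hv = H[k][j] @ V[j]
        cov += hv @ herm(hv)
    return np.linalg.solve(cov, H[k][k] @ V[k])

def update_v(H, U, k, tol=1e-9, max_steps=200):
    """Eq. 3: precoder of user k, lambda_k >= 0 found by bisection (assumption)."""
    n_t = H[k][k].shape[1]
    R = np.zeros((n_t, n_t), dtype=complex)
    for j in range(len(U)):
        hu = herm(H[j][k]) @ U[j]
        R += hu @ herm(hu)
    rhs = herm(H[k][k]) @ U[k]

    def v_of(lam):
        return np.linalg.solve(R + lam * np.eye(n_t), rhs)

    def excess(lam):
        # P = ||V_k||_F^2 - 1, step (5) of Fig. 2
        return np.linalg.norm(v_of(lam)) ** 2 - 1.0

    if excess(0.0) <= 0.0:
        return v_of(0.0)          # constraint already met with lambda_k = 0
    lo, hi = 0.0, 1.0
    while excess(hi) > 0.0:       # grow the bracket until it encloses the root
        lo, hi = hi, 2.0 * hi
    for _ in range(max_steps):
        mid = 0.5 * (lo + hi)
        if excess(mid) > 0.0:
            lo = mid
        else:
            hi = mid
        if hi - lo < tol:
            break
    return v_of(hi)               # hi always satisfies the power constraint

def total_mse(H, V, U, sigma2):
    """Sum MSE over all users (unit-power symbols, white noise assumed)."""
    mse = 0.0
    for k in range(len(V)):
        d = V[k].shape[1]
        mse += np.linalg.norm(herm(U[k]) @ H[k][k] @ V[k] - np.eye(d)) ** 2
        mse += sigma2 * np.linalg.norm(U[k]) ** 2
        for j in range(len(V)):
            if j != k:
                mse += np.linalg.norm(herm(U[k]) @ H[k][j] @ V[j]) ** 2
    return float(mse)

def mmse_ia(H, d, sigma2=0.01, max_outer=100, mse_threshold=1e-3, seed=0):
    """Outer loop of Fig. 2 with at most max_outer iterations."""
    rng = np.random.default_rng(seed)
    K = len(H)
    n_t = H[0][0].shape[1]
    V = []
    for _ in range(K):
        v = rng.standard_normal((n_t, d)) + 1j * rng.standard_normal((n_t, d))
        V.append(v / np.linalg.norm(v))
    history = []
    for _ in range(max_outer):
        U = [update_u(H, V, k, sigma2) for k in range(K)]   # step (1)
        V = [update_v(H, U, k) for k in range(K)]           # steps (2) to (7)
        history.append(total_mse(H, V, U, sigma2))          # step (8)
        if history[-1] < mse_threshold:                     # step (9)
            break
    return V, U, history
```

The two list comprehensions in `mmse_ia` have no dependencies between users, so each could be evaluated in parallel; only the order "all $U_k$, then all $V_k$" is fixed.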

Now consider the inner loop. An explicit solution of $\|V_k(\lambda_k)\|_F^2 = 1$ cannot be given. Therefore, the equation has to be solved for $\lambda_k$ numerically. However, it can be shown that a unique solution exists [3]. Several standard root-finding approaches have been evaluated in this work, including Newton iterations, the secant method and Brent's method. It must be noted that the standard algorithms do not always find the desired root for $\lambda_k \geq 0$, but instead sometimes get stuck at negative $\lambda_k$.

The knowledge about the function can be used to guarantee convergence and improve convergence speed, i.e. the number of iterations in the numerical root-finding loop. We propose a modified secant method. Based on the curvature of the function, simple heuristic rules have been established for a correction of the root estimated by the secant method. Depending on the system parameters, this reduces the number of required iterations by a factor between 3 and 10 compared to the unmodified secant method. The second modification is the choice of the initial value for $\lambda$ in step (3). If the last accepted $\lambda$ from the previous global iteration is reused as the initial value, the average number of root-finding function evaluations can be reduced to about 3.9. Fig. 3 shows the number of iterations for the standard and optimized root-finding algorithms. Representative system setups from the space of proper setups as discussed above have been chosen for simulation. The discrete setups are plotted on the x-axis.

[Figure 3: plot of iterations (y-axis, 0 to 80) over setup (x-axis, 0 to 80) for the modified secant method, the secant method, Newton's iterations and binary search]
Fig. 3. Average number of root-finding function evaluations vs. system configurations

Matrix inversion is the most time-demanding operation in the innermost loop in step (4) from a hardware perspective. All matrix inversions shown in Fig. 2 have to be carried out sequentially. The other required basic mathematical operations are not discussed here in detail in order to focus on the most latency-relevant operations. For example, the matrix multiplications can be fully parallelized and require only the delay of one real MAC operation and thus have a small latency compared to matrix inversions.

Computing a decoding matrix $U_k$ in step (1) depends on all matrices $V_k$ from the previous iteration. As there is no data dependency between the computation of individual $U_k$, they can be computed in parallel for all users. Steps (2) through (7) update the precoding matrices $V_k$, including numerical root-finding in steps (4) through (7). These computations depend on all $U_k$ from step (1). It can be noted that computing $V_k$ can also be parallelized across all users.

V. HARDWARE ARCHITECTURE

A fixed-point dedicated hardware accelerator has been implemented for use in an OCP- or AXI-based System-on-Chip. Its structure is shown in Fig. 4. All processing units and the data flow are controlled by a top-level controller. The outer loop from Fig. 2 and communication with external memory are handled here. In the initialization phase, channel matrices $H$ and initial values for $V$ are loaded from external memory into an on-chip BRAM cache. Then, the matrices $V$ and $U$ are updated iteratively according to Eq. 2 and 3 by processing elements (PEs). Each PE handles the update of either $V$ or $U$ for one user at a time, and up to $K$ PEs work in parallel. The number of instantiated PEs is configurable at synthesis time. The MSE unit computes the total mean square error for the current set of matrices in parallel to the next matrix update. When a predefined threshold has been reached, the outer iteration loop is stopped.

Fig. 5 shows the block structure of a PE. It can be configured to compute either an update of the decoding matrix $U$ or of the precoding matrix $V$ for one user according to step (1) or steps (2) to (7), respectively. Computing $U$ requires a subset of the hardware resources needed to compute $V$. The matrix multipliers denoted as x sequentially compute $H_{kj} V_j V_j^H H_{kj}^H$ or $H_{jk}^H U_j U_j^H H_{jk}$ for all users $j = 1..K$ and sum up the result. The matrix multiplications can be configured to either use three matrix multipliers in parallel ($n_{MM} = 3$) as shown in Fig. 5 or sequentially compute the expressions above using a single matrix multiplier ($n_{MM} = 1$) to save resources. Each pipelined matrix multiplier computes one full complex matrix multiplication per clock cycle.

The latency of the equation system solver required in steps (1) and (4) is crucial for the overall system latency. Several approaches including LU- and QR-decomposition with back-substitution as well as Singular Value Decomposition have been evaluated for their achievable latency in hardware and for precision in the context of this application.

Gaussian elimination has been chosen as the equation system solver. The Bareiss algorithm [10] is a division-free, integer-preserving variant of Gaussian elimination especially suitable for low latency hardware implementation. Here, division-free means that no divisions are required in the

elimination loop; the elimination is achieved with multiply and add operations. Only one final division is required per result variable, but all divisions can be computed in parallel.

In this work, a two-step Bareiss algorithm Systolic Array Processor has been implemented as a tradeoff between resource requirements, stability and latency. Two variables are eliminated per step, requiring two clock cycles. To achieve sufficient numerical stability, a row-wise renormalization step has been inserted after each elimination using shifts, requiring one additional clock cycle. Table I summarizes the required number of clock cycles per operation. $M$ is the data word length in bits. The worst-case evaluated system setup ($N_t = N_r = 11$, $K = 19$) requires a total of 26022 clock cycles including overhead, leading to a minimum clock frequency of 26.02 MHz for 1 ms latency. The hardware implementation has been verified against the MATLAB floating-point reference model.
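The idea of an elimination loop without divisions can be illustrated with a small sketch. It is not the two-step systolic array of this work: it eliminates one variable per step, uses real integers instead of complex fixed-point values, omits the renormalization shifts, and relies on Python's arbitrary-precision integers. The function name `solve_division_free` is hypothetical.

```python
from fractions import Fraction

def solve_division_free(a, b):
    """Solve A x = b for integer A and b; the loop uses only multiply and subtract."""
    n = len(a)
    # Augmented matrix [A | b] of Python ints (no overflow, no rounding).
    m = [list(row) + [rhs] for row, rhs in zip(a, b)]
    for p in range(n):
        if m[p][p] == 0:
            # Simple row exchange so that the pivot is nonzero.
            swap = next((r for r in range(p + 1, n) if m[r][p] != 0), None)
            if swap is None:
                raise ValueError("matrix is singular")
            m[p], m[swap] = m[swap], m[p]
        pivot = m[p]
        for r in range(n):
            if r == p:
                continue
            factor = m[r][p]
            # Cross-multiplication: row_r <- pivot[p] * row_r - factor * pivot
            m[r] = [pivot[p] * x - factor * y for x, y in zip(m[r], pivot)]
    # The matrix is now diagonal: one division per result variable,
    # and the n divisions are independent of each other.
    return [Fraction(m[i][n], m[i][i]) for i in range(n)]

print(solve_division_free([[2, 1], [1, 3]], [3, 5]))
```

For the 2 × 2 example the result is x = (4/5, 7/5). The intermediate integers grow with every elimination step, which is the kind of growth a fixed word length has to counter, for example by the renormalization step described above.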

TABLE I
CLOCK CYCLES PER OPERATION (PE MODULE)

Operation                           Clock cycles
Matrix mult. block, n_MM = 3        K + 1
Matrix mult. block, n_MM = 1        3K
Gaussian elimination (Bareiss)      2 N_{t,r} + M/2
Matrix norm                         2
λ update                            3 + M/2

[Figure 5: block diagram]
Fig. 5. V/U Processing Element detail
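The minimum clock frequency quoted for the worst-case setup follows directly from the cycle count and the latency target. The short sketch below reproduces this arithmetic; the helper names are hypothetical, and the second call only evaluates how many cycles fit into 1 ms at the 50 MHz clock reported in the synthesis results.

```python
def min_clock_mhz(cycles: int, latency_us: int) -> float:
    """Lowest clock frequency in MHz that fits `cycles` into `latency_us`."""
    return cycles / latency_us  # cycles per microsecond equals MHz

def cycle_budget(clock_hz: int, latency_us: int) -> int:
    """Whole clock cycles available within `latency_us` at `clock_hz`."""
    return clock_hz * latency_us // 1_000_000

print(min_clock_mhz(26022, 1000))      # 26.022
print(cycle_budget(50_000_000, 1000))  # 50000
```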
[Figure 4: block diagram]
Fig. 4. MMSE IA hardware accelerator top level block diagram

VI. SYNTHESIS RESULTS

The system has been synthesized for a Xilinx XC6VLX550T-2 FPGA using Xilinx ISE 14.7. A 50 MHz clock constraint was met for all configurations. This frequency has been chosen to enable the sequential computation of two sets of matrices within 1 ms. Table II summarizes the resource requirements for certain system configurations and degrees of parallelism. System $K = 3$, $N_t = N_r = 2$ and $d = 1$ is the smallest feasible IA system. The system configurations shown in Table II have been successfully mapped to the target FPGA. System configurations with more users or antennas require more DSP48E1 blocks than available on the target FPGA. Although optimization and deeper pipelining could increase the throughput, the potential for an improved total latency is fundamentally limited by the logic depth in the iterative loops.

TABLE II
SYNTHESIS RESULTS FOR XILINX VIRTEX-6 XC6VLX550T FPGA

K    N_{t,r}   d    n_PE   n_MM   FF        LUT       DSP48E1
3    2         1    3      3      51,531    81,560    364
3    2         1    1      3      20,942    32,702    148
3    2         1    1      1      16,585    25,682    80
5    3         1    1      3      32,980    56,974    330
5    3         1    1      1      24,916    43,514    232
Available                         687,360   343,680   864

VII. CONCLUSION

The minimum mean square error interference alignment algorithm has been optimized for low latency hardware implementability. The overall required operation count has been reduced and the algorithm has been implemented on an FPGA. It has been shown that the computation of the precoding and decoding matrices is possible in hardware under a real-time latency constraint of one millisecond. The synthesis results show high DSP resource requirements even for small system configurations.
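The DSP share behind this last statement can be read off Table II. The sketch below only divides the DSP48E1 counts of the synthesized configurations by the number of available blocks; the helper name `dsp_utilization` is hypothetical.

```python
AVAILABLE_DSP48E1 = 864  # "Available" row of Table II

# (K, N, DSP48E1 count) for each synthesized configuration in Table II
configs = [(3, 2, 364), (3, 2, 148), (3, 2, 80), (5, 3, 330), (5, 3, 232)]

def dsp_utilization(used: int, available: int = AVAILABLE_DSP48E1) -> float:
    """Share of the available DSP48E1 blocks in percent."""
    return 100.0 * used / available

for k, n, dsp in configs:
    print(f"K={k}, N={n}: {dsp_utilization(dsp):.1f} % of DSP48E1 blocks")
```

With these numbers the fully parallel variant of the smallest system occupies about 42 % of the DSP48E1 blocks.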

REFERENCES
[1] V. Cadambe and S. Jafar, "Interference alignment and degrees of freedom of the K-user interference channel," Information Theory, IEEE Transactions on, vol. 54, no. 8, pp. 3425–3441, Aug. 2008.
[2] D. Schmidt, C. Shi, R. Berry, M. Honig, and W. Utschick, "Comparison of distributed beamforming algorithms for MIMO interference networks," Signal Processing, IEEE Transactions on, vol. 61, no. 13, pp. 3476–3489, July 2013.
[3] D. Schmidt, C. Shi, R. Berry, M. Honig, and W. Utschick, "Minimum mean squared error interference alignment," in Signals, Systems and Computers, 2009 Conference Record of the Forty-Third Asilomar Conference on, Nov. 2009, pp. 1106–1110.
[4] M. Kock, S. Hesselbarth, M. Pfitzner, and H. Blume, "Hardware-accelerated design space exploration framework for communication systems," Analog Integrated Circuits and Signal Processing, vol. 78, no. 3, pp. 557–571, 2014. [Online]. Available: http://dx.doi.org/10.1007/s10470-013-0127-6
[5] J. A. García-Naya, L. Castedo, Ó. González, D. Ramírez, and I. Santamaría, "Experimental evaluation of interference alignment under imperfect channel state information," in 19th European Signal Processing Conference (EUSIPCO 2011), Barcelona, Spain, August 2011.
[6] O. González, D. Ramírez, I. Santamaría, J. García-Naya, and L. Castedo, "Experimental validation of interference alignment techniques using a multiuser MIMO testbed," in Smart Antennas (WSA), 2011 International ITG Workshop on, Feb. 2011, pp. 1–8.
[7] P. Greisen, S. Haene, and A. Burg, "Simulation and emulation of MIMO wireless baseband transceivers," EURASIP Journal on Wireless Communications and Networking, vol. 2010, no. 1, 2010.
[8] J. Massey, J. Starr, S. Lee, D. Lee, A. Gerstlauer, and R. Heath, "Implementation of a real-time wireless interference alignment network," in Signals, Systems and Computers (ASILOMAR), 2012 Conference Record of the Forty Sixth Asilomar Conference on, 2012, pp. 104–108.
[9] P. Zetterberg and N. N. Moghadam, "An experimental investigation of SIMO, MIMO, interference-alignment (IA) and coordinated multi-point (CoMP)," CoRR, vol. abs/1111.3616, 2011.
[10] E. H. Bareiss, "Sylvester's identity and multistep integer-preserving Gaussian elimination," Math. Comp., vol. 22, pp. 565–578, 1968.
