© All Rights Reserved

Markus Kock, Institute of Microelectronic Systems, Leibniz Universität Hannover, Hannover, Germany, kock@ims.uni-hannover.de

Steffen Busch, Institute of Cartography and Geoinformatics, Leibniz Universität Hannover, Hannover, Germany, steffen.busch@ikg.uni-hannover.de

Holger Blume, Institute of Microelectronic Systems, Leibniz Universität Hannover, Hannover, Germany, blume@ims.uni-hannover.de

Abstract: A dedicated hardware architecture for the digital baseband processing of minimum mean square error interference alignment is presented. The computationally intensive task of calculating the precoding and decoding matrices has been implemented and the underlying algorithm has been optimized for real-time capability, efficiency and flexibility. The required number of iterations has been optimized and appropriate low-latency algorithms for the computation of basic operations have been identified to meet a real-time constraint of 1 ms processing latency. The architecture has been verified and synthesized for a Xilinx Virtex-6 LX550T FPGA. The maximum number of antennas, users and data streams is configurable at synthesis time. The actual parameters are configurable at runtime. Different degrees of parallelism allow a trade-off between resource requirements, latency and throughput. The target FPGA resources are sufficient for real-time system configurations up to 5 users with 3 antennas.

Index Terms: Interference Alignment; Hardware Accelerator; Testbed

I. INTRODUCTION

Channel capacity in multi-user wireless communication systems is limited by interference. A well-known approach towards exploiting the available channel capacity is Interference Alignment (IA) [1]. Depending on the context and application, different approaches to IA are feasible. A multitude of algorithms for IA exist for a variety of applications and constraints to leverage system channel capacity in multi-user (MU) multiple-input multiple-output (MIMO) systems [2]. The focus of this paper is on minimum mean square error interference alignment as presented in [3]. Here, linear precoding and decoding matrices are applied at the transmitter and receiver, respectively. From the hardware point of view, current general-purpose and Software-Defined Radio (SDR) systems are unable to compute these precoding and decoding matrices in real time.

Due to its iterative nature and the required type of operations, the computational complexity of this algorithm is demanding, especially for mobile devices [4]. Thus, for high data rate systems, a high-throughput real-time Minimum Mean Square Error (MMSE) IA architecture is required. Furthermore, the acceptable latency depends on the varying channel. In this paper, the acceptable latency requirement is assumed to be 1 ms for fast changing channels. The precoding matrices have to be adapted to the time-varying channel sufficiently fast to achieve real-time operation. In particular, outdated precoding matrices result in insufficiently aligned interference at the receivers. Thus, the channel has to be tracked and the precoding matrices have to be updated at a sufficiently low latency. The energy budget for computing one set of precoding and decoding matrices is limited, especially in mobile OFDM systems with a high subcarrier count.

From the hardware perspective, previous testbeds focused on demonstrating the real-world feasibility of Interference Alignment, addressing questions of system-level integration and algorithmics [5] [6] [7] [8] [9]. These testbeds have provided the proof of concept for IA by leveraging rapid-prototyping approaches to hardware implementation. However, real-time capability for fast changing channels and hardware efficiency are not within the scope of these papers. Processing latency and hardware resource requirements have usually not been published. For real-world usage, highly efficient hardware architectures with sufficient latency, throughput and power consumption are required. Very little has been published about resource-efficient digital hardware implementations feasible for low-power and mobile devices. In this contribution, we focus on efficient digital hardware architectures for the computation of the linear precoding and decoding matrices according to the Minimum Mean Square Error criterion. An estimate of the computational complexity of MMSE IA has been previously published in [4].

The rest of the paper is organized as follows. The system model and MMSE IA algorithm are recapitulated in Sections II and III. Algorithm optimizations aiming at hardware implementation are presented in Section IV. The hardware architecture and synthesis results are discussed in Sections V and VI, respectively.

II. MIMO SYSTEM MODEL

$K$ users share a common transmission medium in a point-to-point communication scenario. Each transmitter TX$_k$ multiplies the vector of $d$ discrete transmit data streams with a precoding matrix $V_k$ and applies the result to $N_t$ transmit antennas. The receiver RX$_k$ applies the decoding matrix $U_k$ to the signals picked up by $N_r$ receive antennas. The received signal $y_i$ at each receiver $i$ is

$$ y_i = H_{ii} V_i s_i + \sum_{j \neq i} H_{ij} V_j s_j + n_i \tag{1} $$

with $s_j$ being the transmit data at transmitter $j$ and $n_i$ being the noise picked up by receiver $i$. $H_{ii}$ are the desired channels, all $H_{ij}$ with $j \neq i$ convey interference.

Fig. 1. Multi-user $3 \times 3$ MIMO system with precoding matrices $V_j$ and decoding matrices $U_i$

III. MMSE IA ALGORITHM

The MMSE IA algorithm used to compute the precoding and decoding matrices $V$ and $U$ is taken from [3]. This section presents the algorithm details required throughout the rest of the paper.

Arbitrary initial values are chosen for the precoding matrices $V_k$ in the first algorithm step. Then, the decoding matrices $U_k$ are computed according to Eq. 2. In an iterative process, the precoding matrices $V_k$ are updated according to Eq. 3 based on the current set of decoding matrices. This outermost loop continues until convergence. For each update of $V_k$, the Lagrange multiplier $\lambda_k \geq 0$ has to be chosen iteratively to satisfy the transmit power scaling constraint $\|V_k\|_F^2 \leq 1$.

$$ U_k = \left( \sum_{j=1}^{K} H_{kj} V_j V_j^H H_{kj}^H + \sigma^2 I \right)^{-1} H_{kk} V_k \tag{2} $$

$$ V_k = \left( \sum_{j=1}^{K} H_{jk}^H U_j U_j^H H_{jk} + \lambda_k I \right)^{-1} H_{kk}^H U_k \tag{3} $$

Figure 2 shows the corresponding flowchart.

Fig. 2. MMSE IA algorithm flowchart, with the following steps:

(0) Init $V_k$
(1) $U_k = \left( \sum_K f(V) \right)^{-1} f(H, V_k)$
(2) $R_k = \sum_K f(U)$
(3) $\lambda_k = 0$
(4) $V_k = (R_k + \lambda_k I)^{-1} f(H, U_k)$
(5) $P = \|V_k\|_F^2 - 1$
(6) update $\lambda_k$
(7) $|P|$ below threshold? If no, continue with (6) and return to (4); if yes, continue with (8)
(8) MSE
(9) MSE below threshold? If no, return to (1); if yes, finished

IV. ALGORITHM OPTIMIZATIONS

Two reference software models have been implemented in MATLAB using floating-point representation. Emphasis was put on numerical stability by using MATLAB's SVD-based pinv() function for matrix inversion and the fzero() function for root-finding to solve for $\lambda_k$. This MATLAB model serves as the base and verification reference for the algorithmic optimizations described below. The optimized model serves as a reference for hardware verification.

The algorithm loop structure shown in Fig. 2 allows the identification of the most often executed core operations. The data dependencies dictate the sequential execution of the shown loops. The following optimization approaches have been considered in order to achieve real-time execution with a maximum latency of 1 ms:

- reduced number of loop iterations
- use of low-latency operations
- parallelization

The required iteration counts of both loops in Fig. 2 are data dependent. Proper system setups within the bounds of $N_t = N_r = 2..11$, $d = 1..5$, $K = 3..19$ were used to simulate 100 random channels each. Evaluating the convergence has shown that the final channel capacity is reached to within 1% after at most 100 iterations. Therefore, the maximum number of outer loop iterations has been fixed to 100.
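The loop structure of Fig. 2 can be sketched in plain Python for a small setup ($K = 3$, $N_t = N_r = 2$, $d = 1$). This is a minimal illustration under stated assumptions, not the MATLAB reference model: plain bisection stands in for the root-finding step, the MSE stopping test of steps (8) and (9) is replaced by a fixed iteration count, and all function names, the noise variance `sigma2 = 0.01` and the random channels are hypothetical choices made for the example.

```python
import random

def matmul(a, b):
    """Product of two complex matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def herm(a):
    """Conjugate (Hermitian) transpose."""
    return [[a[i][j].conjugate() for i in range(len(a))]
            for j in range(len(a[0]))]

def solve(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    m = [list(a[i]) + list(b[i]) for i in range(n)]
    for k in range(n):
        p = max(range(k, n), key=lambda r: abs(m[r][k]))
        m[k], m[p] = m[p], m[k]
        for i in range(n):
            if i != k:
                f = m[i][k] / m[k][k]
                m[i] = [x - f * y for x, y in zip(m[i], m[k])]
    return [[m[i][n + j] / m[i][i] for j in range(len(b[0]))]
            for i in range(n)]

def fro2(a):
    """Squared Frobenius norm."""
    return sum(abs(x) ** 2 for row in a for x in row)

def gram_sum(mats, diag):
    """Return sum_j M_j M_j^H + diag * I."""
    n = len(mats[0])
    acc = [[complex(diag) if i == j else 0j for j in range(n)]
           for i in range(n)]
    for mj in mats:
        g = matmul(mj, herm(mj))
        acc = [[acc[i][j] + g[i][j] for j in range(n)] for i in range(n)]
    return acc

def update_precoder(R, rhs, bisections=50):
    """Steps (3) to (7), with bisection as an assumed stand-in root-finder."""
    n = len(R)
    def v_of(lam):
        shifted = [[R[i][j] + (lam if i == j else 0.0) for j in range(n)]
                   for i in range(n)]
        return solve(shifted, rhs)
    v = v_of(0.0)
    if fro2(v) <= 1.0:
        return v                      # power constraint inactive, lambda = 0
    lo, hi = 0.0, 1.0
    while fro2(v_of(hi)) > 1.0:       # grow the bracket until it holds the root
        lo, hi = hi, 2.0 * hi
    for _ in range(bisections):
        mid = 0.5 * (lo + hi)
        if fro2(v_of(mid)) > 1.0:
            lo = mid
        else:
            hi = mid
    return v_of(hi)                   # hi always satisfies the constraint

def mmse_ia(H, V, sigma2, outer_iters=100):
    """Alternate Eq. 2 and Eq. 3 for a fixed number of outer iterations."""
    K = len(H)
    for _ in range(outer_iters):
        U = [solve(gram_sum([matmul(H[k][j], V[j]) for j in range(K)], sigma2),
                   matmul(H[k][k], V[k])) for k in range(K)]
        V = [update_precoder(
                 gram_sum([matmul(herm(H[j][k]), U[j]) for j in range(K)], 0.0),
                 matmul(herm(H[k][k]), U[k])) for k in range(K)]
    return U, V

def cgauss():
    """One complex Gaussian sample (hypothetical channel model)."""
    return complex(random.gauss(0, 1), random.gauss(0, 1))

random.seed(0)
K, N, d = 3, 2, 1
H = [[[[cgauss() for _ in range(N)] for _ in range(N)]
      for _ in range(K)] for _ in range(K)]      # H[k][j]: TX j to RX k
V0 = [[[cgauss()] for _ in range(N)] for _ in range(K)]
U, V = mmse_ia(H, V0, sigma2=0.01, outer_iters=20)
```

Because $\|V_k(\lambda_k)\|_F^2$ decreases monotonically for $\lambda_k \geq 0$, bisection always stays on the desired non-negative root, at the cost of many more function evaluations than a secant-type method would need.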


Now consider the inner loop. An explicit solution of $\|V_k(\lambda_k)\|_F^2 = 1$ cannot be given. Therefore, the equation has to be solved for $\lambda_k$ numerically. However, it can be shown that a unique solution exists [3]. Several standard root-finding approaches have been evaluated in this work, including Newton iterations, the secant method and Brent's method. It must be noted that the standard algorithms do not always find the desired root for $\lambda_k \geq 0$, but instead sometimes get stuck at negative $\lambda_k$.

The knowledge about the function can be used to guarantee convergence and improve convergence speed, i.e. the number of iterations in the numerical root-finding loop. We propose a modified secant method. Based on the curvature of the function, simple heuristic rules have been established for a correction of the root estimated by the secant method. Depending on the system parameters, this reduces the number of required iterations by a factor between 3 and 10 compared to the unmodified secant method. The second modification is the choice of the initial value for $\lambda_k$ in step (3). If the last accepted $\lambda_k$ from the previous global iteration is reused as the initial value, the average number of root-finding function evaluations can be reduced to about 3.9. Fig. 3 shows the number of iterations for the standard and optimized root-finding algorithms. Representative system setups from the space of proper setups as discussed above have been chosen for simulation. The discrete setups are plotted on the x-axis.

Fig. 3. Average number of root-finding function evaluations vs. system configurations (y-axis: iterations, 0 to 80; x-axis: setup, 0 to 80; curves: modified secant method, secant method, Newton's iterations, binary search)

Matrix inversion is the most time-demanding operation in the innermost loop in step (4) from a hardware perspective. All matrix inversions shown in Fig. 2 have to be carried out sequentially. The other required basic mathematical operations are not discussed here in detail in order to focus on the most latency-relevant operations. For example, the matrix multiplications can be fully parallelized and require only the delay of one real MAC operation and thus have a small latency compared to matrix inversions.

Computing a decoding matrix $U_k$ in step (1) depends on all matrices $V_k$ from the previous iteration. As there is no data dependency between the computation of individual $U_k$, they can be computed in parallel for all users. Steps (2) through (7) update the precoding matrices $V_k$, including a numerical root-finding in steps (4) through (7). These computations depend on all $U_k$ from step (1). It can be noted that computing $V_k$ can also be parallelized across all users.

V. HARDWARE ARCHITECTURE

A fixed-point dedicated hardware accelerator has been implemented for use in an OCP- or AXI-based System-on-Chip. Its structure is shown in Fig. 4. All processing units and the data flow are controlled by a top-level controller. The outer loop from Fig. 2 and communication with external memory are handled here. In the initialization phase, channel matrices $H$ and initial values for $V$ are loaded from external memory into an on-chip BRAM cache. Then, the matrices $V$ and $U$ are updated iteratively according to Eq. 2 and 3 by processing elements (PEs). Each PE handles the update of either $V$ or $U$ for one user at a time, and up to $K$ PEs work in parallel. The number of instantiated PEs is configurable at synthesis time. The MSE unit computes the total mean square error for the current set of matrices in parallel to the next matrix update. When a predefined threshold has been reached, the outer iteration loop is stopped.

Fig. 5 shows the block structure of a PE. It can be configured to compute either an update of the encoding matrix $V$ or the decoding matrix $U$ for one user according to steps (1) or (2) to (7), respectively. Computing $U$ requires a subset of the hardware resources needed to compute $V$. The matrix multiplication block computes $H_{kj} V_j V_j^H H_{kj}^H$ or $H_{jk}^H U_j U_j^H H_{jk}$ for all users $j = 1..K$ and sums up the results. The matrix multiplications can be configured to either use three matrix multipliers in parallel ($n_{MM} = 3$) as shown in Fig. 5 or sequentially compute the expressions above using a single matrix multiplier ($n_{MM} = 1$) to save resources. Each pipelined matrix multiplier computes one full complex matrix multiplication per clock cycle.

The latency of the equation system solver required in steps (1) and (4) is crucial for the overall system latency. Several approaches including LU- and QR-decomposition with back-substitution as well as Singular Value Decomposition have been evaluated for their achievable latency in hardware and for precision in the context of this application.

Gaussian elimination has been chosen as the equation system solver. The Bareiss algorithm [10] is a division-free, integer-preserving variant of Gaussian elimination especially suitable for low-latency hardware implementation. Here, division-free means that no divisions are required in the
elimination loop, the elimination is achieved with multiply and add operations. Only one final division is required per result variable, but all divisions can be computed in parallel. In this work, a two-step Bareiss algorithm Systolic Array Processor has been implemented as a tradeoff between resource requirements, stability and latency. Two variables are eliminated per step, requiring two clock cycles. To achieve sufficient numerical stability, a row-wise renormalization step has been inserted after each elimination using shifts, requiring one additional clock cycle. Table I summarizes the required number of clock cycles per operation. $M$ is the data word length in bits. The worst-case evaluated system setup ($N_t = N_r = 11$, $K = 19$) requires a total of 26022 clock cycles including overhead, leading to a minimum clock frequency of 26.02 MHz for 1 ms latency. The hardware implementation has been verified against the MATLAB floating-point reference model.
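The idea of a division-free elimination loop with one final division per result variable can be illustrated on exact integers. The sketch below is a single-step, software-only illustration and not the two-step systolic array described above: it uses cross-multiplication in a Gauss-Jordan loop, relies on Python's arbitrary-precision integers instead of the shift-based renormalization, and omits the exact-division step of the classical Bareiss recurrence. The function name `solve_division_free` is hypothetical.

```python
from fractions import Fraction

def solve_division_free(a, b):
    """Solve a x = b for a square integer matrix a and integer vector b.

    The elimination loop uses only multiplications and subtractions;
    the single division per unknown happens in the final line.
    """
    n = len(a)
    m = [list(row) + [rhs] for row, rhs in zip(a, b)]   # augmented [a | b]
    for k in range(n):
        if m[k][k] == 0:                                # find a usable pivot
            for r in range(k + 1, n):
                if m[r][k] != 0:
                    m[k], m[r] = m[r], m[k]
                    break
            else:
                raise ValueError("matrix is singular")
        pivot = m[k][k]
        for i in range(n):
            if i == k:
                continue
            factor = m[i][k]
            # row_i <- pivot * row_i - factor * row_k  (no division)
            m[i] = [pivot * m[i][j] - factor * m[k][j] for j in range(n + 1)]
    # One final division per result variable; these are independent of
    # each other and could therefore be evaluated in parallel.
    return [Fraction(m[i][n], m[i][i]) for i in range(n)]

x = solve_division_free([[2, 1], [1, 3]], [3, 5])
```

Without renormalization the intermediate integers grow with every elimination step, which is harmless for Python integers but is the reason a fixed-point implementation needs a rescaling step such as the shifts mentioned above.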

Fig. 5. V/U Processing Element detail

TABLE I
CLOCK CYCLES PER OPERATION (PE MODULE)

Operation                             Clock cycles
Matrix mult. block, $n_{MM} = 3$      $K + 1$
Matrix mult. block, $n_{MM} = 1$      $3K$
Gaussian elimination (Bareiss)        $2 N_{t,r} + M/2$
Matrix norm                           2
$\lambda$ update                      $3 + M/2$
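The entries of Table I and the clock-frequency bound quoted for the worst-case setup are simple arithmetic and can be checked in a few lines. The sketch below assumes the Gaussian elimination entry reads $2 N_{t,r} + M/2$ and that $M$ is even; the word length `M = 16` is a hypothetical value chosen only for the example, since the word length used in the design is not stated here. Function names are likewise hypothetical.

```python
def pe_cycles(K, N, M, n_mm):
    """Clock cycles per PE operation, as read from Table I.

    K: number of users, N: antennas (N_t or N_r), M: word length in bits
    (assumed even), n_mm: number of matrix multipliers (1 or 3).
    """
    if n_mm not in (1, 3):
        raise ValueError("Table I lists only n_mm = 1 and n_mm = 3")
    return {
        "matrix_mult_block": K + 1 if n_mm == 3 else 3 * K,
        "gaussian_elimination": 2 * N + M // 2,
        "matrix_norm": 2,
        "lambda_update": 3 + M // 2,
    }

def min_clock_hz(total_cycles, latency_s):
    """Lowest clock frequency that fits total_cycles into latency_s."""
    return total_cycles / latency_s

# Worst-case evaluated setup; M = 16 is an assumption of this sketch.
cycles = pe_cycles(K=19, N=11, M=16, n_mm=3)
f_min = min_clock_hz(26022, 1e-3)
print(cycles)
print(f"{f_min / 1e6:.2f} MHz")
```

The second printed line reproduces the 26.02 MHz bound stated for 26022 clock cycles and a 1 ms latency budget.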

Fig. 4. MMSE IA hardware accelerator top level block diagram

VI. SYNTHESIS RESULTS

The system has been synthesized for a Xilinx XC6VLX550T-2 FPGA using Xilinx ISE 14.7. A 50 MHz clock constraint was met for all configurations. This clock frequency was chosen to allow two sets of matrices to be computed sequentially within 1 ms. Table II summarizes the resource requirements for certain system configurations and degrees of parallelism. The system with $K = 3$, $N_t = N_r = 2$ and $d = 1$ is the smallest feasible IA system. The system configurations shown in Table II have been successfully mapped to the target FPGA. System configurations with more users or antennas require more DSP48E1 blocks than are available on the target FPGA. Although optimization and deeper pipelining could increase the throughput, the potential for an improved total latency is fundamentally limited by the logic depth in the iterative loops.

TABLE II

K          Nt = Nr   d    PEs   nMM   Registers   LUTs      DSP48E1
3          2         1    3     3     51,531      81,560    364
3          2         1    3     1     20,942      32,702    148
3          2         1    1     1     16,585      25,682    80
5          3         1    1     3     32,980      56,974    330
5          3         1    1     1     24,916      43,514    232
Available                             687,360     343,680   864

VII. CONCLUSION

The minimum mean square error interference alignment algorithm has been optimized for low-latency hardware implementation. The overall required operation count has been reduced and the algorithm has been implemented on an FPGA. It has been shown that the computation of the precoding and decoding matrices is possible in hardware under a real-time latency constraint of one millisecond. The synthesis results show high DSP resource requirements even for small system configurations.
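The remark on DSP requirements can be made concrete by relating the Table II figures to the available device resources. The short sketch below assumes that the three numeric columns of Table II are slice registers, LUTs and DSP48E1 blocks, in that order; this column reading and all names in the code are assumptions of the example.

```python
# Device totals from the "Available" row of Table II (column meaning assumed).
AVAILABLE = {"registers": 687_360, "luts": 343_680, "dsp48e1": 864}

def utilization(used):
    """Fraction of each available resource type that a configuration uses."""
    return {name: used[name] / AVAILABLE[name] for name in AVAILABLE}

# Rows with the highest and lowest DSP counts in Table II.
largest = utilization({"registers": 51_531, "luts": 81_560, "dsp48e1": 364})
smallest = utilization({"registers": 16_585, "luts": 25_682, "dsp48e1": 80})

for label, shares in (("highest DSP count", largest),
                      ("lowest DSP count", smallest)):
    print(label, {name: f"{share:.1%}" for name, share in shares.items()})
```

In both rows the DSP share is the largest of the three fractions, which is consistent with DSP48E1 blocks being the resource that limits larger configurations.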


REFERENCES

[1] V. Cadambe and S. Jafar, "Interference alignment and degrees of freedom of the K-user interference channel," Information Theory, IEEE Transactions on, vol. 54, no. 8, pp. 3425-3441, Aug. 2008.

[2] D. Schmidt, C. Shi, R. Berry, M. Honig, and W. Utschick, "Comparison of distributed beamforming algorithms for MIMO interference networks," Signal Processing, IEEE Transactions on, vol. 61, no. 13, pp. 3476-3489, July 2013.

[3] D. Schmidt, C. Shi, R. Berry, M. Honig, and W. Utschick, "Minimum mean squared error interference alignment," in Signals, Systems and Computers, 2009 Conference Record of the Forty-Third Asilomar Conference on, Nov. 2009, pp. 1106-1110.

[4] M. Kock, S. Hesselbarth, M. Pfitzner, and H. Blume, "Hardware-accelerated design space exploration framework for communication systems," Analog Integrated Circuits and Signal Processing, vol. 78, no. 3, pp. 557-571, 2014. [Online]. Available: http://dx.doi.org/10.1007/s10470-013-0127-6

[5] J. A. García-Naya, L. Castedo, Ó. González, D. Ramírez, and I. Santamaría, "Experimental evaluation of interference alignment under imperfect channel state information," in 19th European Signal Processing Conference (EUSIPCO 2011), Barcelona, Spain, August 2011.

[6] O. González, D. Ramírez, I. Santamaría, J. García-Naya, and L. Castedo, "Experimental validation of interference alignment techniques using a multiuser MIMO testbed," in Smart Antennas (WSA), 2011 International ITG Workshop on, Feb. 2011, pp. 1-8.

[7] P. Greisen, S. Haene, and A. Burg, "Simulation and emulation of MIMO wireless baseband transceivers," EURASIP Journal on Wireless Communications and Networking, vol. 2010, no. 1, 2010.

[8] J. Massey, J. Starr, S. Lee, D. Lee, A. Gerstlauer, and R. Heath, "Implementation of a real-time wireless interference alignment network," in Signals, Systems and Computers (ASILOMAR), 2012 Conference Record of the Forty Sixth Asilomar Conference on, 2012, pp. 104-108.

[9] P. Zetterberg and N. N. Moghadam, "An experimental investigation of SIMO, MIMO, interference-alignment (IA) and coordinated multi-point (CoMP)," CoRR, vol. abs/1111.3616, 2011.

[10] E. H. Bareiss, "Sylvester's identity and multistep integer-preserving Gaussian elimination," Math. Comp., vol. 22, pp. 565-578, 1968.
