# A 757 Mb/s 1.5 mm<sup>2</sup> 90 nm CMOS Soft-Input Soft-Output MIMO Detector for IEEE 802.11n

C. Studer<sup>\*</sup>, S. Fateh<sup>‡</sup>, and D. Seethaler<sup>\*</sup>

ETH Zurich, 8092 Zurich, Switzerland e-mail: \*{studerc,seethal}@nari.ee.ethz.ch, <sup>‡</sup>fateh@iis.ee.ethz.ch

Abstract: Multiple-input multiple-output (MIMO) wireless technology is the key to meeting the demands for data rate, quality-of-service, and bandwidth efficiency of modern wireless communication systems. MIMO technology is therefore adopted in many recent communication standards, such as IEEE 802.11n. Here, the MIMO detector has a strong impact on the overall system performance. In fact, the full potential of MIMO communication systems can only be achieved by means of iterative MIMO decoding using soft-input soft-output (SISO) data detection. In this paper, we present, to the best of our knowledge, the first VLSI implementation of a SISO detector for iterative MIMO decoding. The presented ASIC supports SISO detection for four spatial streams and enables more than 6 dB signal-to-noise-ratio improvement over state-of-the-art MIMO detectors. The 1.5 mm<sup>2</sup> ASIC is fabricated in 90 nm CMOS and achieves 757 Mb/s, which exceeds the 600 Mb/s IEEE 802.11n peak data-rate.

# I. INTRODUCTION

Modern wireless communication systems, such as the IEEE 802.11n WLAN standard [1], are based on multiple-input multiple-output (MIMO) technology, which meets the demand for reliable, high-speed, and bandwidth-efficient data transmission. In these systems, MIMO detection (i.e., the separation of the spatially-multiplexed data streams) and channel decoding are among the main challenges in terms of computational complexity, and corresponding efficient implementations are key to enabling high-performance and low-cost user equipment.

ASIC implementations of state-of-the-art high-performance MIMO detection using sphere-decoding (SD) [2], [3] are unable to achieve the 600 Mb/s peak data-rate of IEEE 802.11n, which is due to SD's prohibitive worst-case complexity. Recent ASIC implementations of suboptimum MIMO detection, e.g., the k-Best detector [4] or soft-output minimum mean-square error (MMSE) detection [5], exceed the 600 Mb/s peak data-rate, but at the cost of inferior error-rate performance, which eventually degrades the system throughput, coverage, and range. All these techniques rely on a single channel-decoding step without iteratively exchanging information with the MIMO detector. However, as shown in [6], the full potential of MIMO wireless communication systems can only be achieved through *iterative MIMO decoding*.

At the heart of an iterative MIMO decoder is a soft-input soft-output (SISO) MIMO detector (referred to as "SISO detector"), which iteratively exchanges reliability information of the coded bits with a SISO channel decoder. A SISO detector exhibits, in general, very high computational complexity (see, e.g., [6]), which necessitates the design of low-complexity algorithms and corresponding dedicated ASIC implementations.

*Contributions:* In this paper, we present, to the best of our knowledge, the first ASIC implementation of a SISO detection algorithm for iterative MIMO decoding. To this end, we develop a reduced-complexity variant of the MMSE parallel interference cancellation (PIC) algorithm proposed in [7] and design a VLSI architecture consisting of eight parallel processing units (PUs) to achieve the peak data-rate of IEEE 802.11n. We provide measurement results of the 90 nm CMOS ASIC and finally demonstrate that substantial performance gains can be achieved compared to state-of-the-art (non-iterative) MIMO-detector implementations.

*Notation:* Matrices are set in boldface capital letters, vectors in boldface lowercase letters. The superscript  $^{H}$  stands for conjugate transpose and  $\mathbf{I}_{M}$  is the  $M \times M$  identity matrix.  $P[\cdot]$  denotes probability. Expectation and variance are referred to as  $\mathbb{E}[\cdot]$  and  $\operatorname{Var}[\cdot]$ , respectively.

# II. MIMO SYSTEM AND ALGORITHM DESCRIPTION

We consider a coded MIMO system with $M_{\rm T}$ transmit and $M_{\rm R} \ge M_{\rm T}$ receive antennas (see Fig. 1) employing spatial multiplexing as specified in IEEE 802.11n [1]. The information bits **b** are encoded (e.g., using a convolutional code) and the coded bit-stream **x** is mapped to a sequence of transmit vectors $\mathbf{s} \in \mathcal{O}^{M_{\rm T}}$, where $\mathcal{O}$ corresponds to the scalar complex constellation of size $2^Q$. Each transmit vector **s** is associated with $M_{\rm T}Q$ binary values $x_{i,b} \in \{0,1\}$, $i = 1, \ldots, M_{\rm T}$, $b = 1, \ldots, Q$, corresponding to the *b*th bit of the *i*th entry (i.e., spatial stream) of **s**. The baseband input-output relation of the wireless MIMO channel is given by $\mathbf{y} = \mathbf{H}\mathbf{s} + \mathbf{n}$, where **H** stands for the $M_{\rm R} \times M_{\rm T}$ complex-valued channel matrix, **y** is the $M_{\rm R}$-dimensional received vector, and **n** is an $M_{\rm R}$-dimensional i.i.d. zero-mean complex Gaussian noise vector with variance $N_0$ per entry.

## A. Principle of Iterative MIMO Decoding

Iterative MIMO decoding applies the key ideas of turbo-decoding [8] to data detection in MIMO systems. Here, reliability information of the coded bits, expressed in terms of log-likelihood ratios (LLRs), is iteratively exchanged between the SISO detector and the SISO channel decoder (see Fig. 1) to successively improve the error-rate performance. In each iteration, the SISO detector computes the LLRs [6]

$$L_{i,b}^{\mathrm{D}} = \log\left(\frac{\mathrm{P}[x_{i,b}=1 \mid \mathbf{y}]}{\mathrm{P}[x_{i,b}=0 \mid \mathbf{y}]}\right) \tag{1}$$



Figure 1. MIMO communication system using iterative MIMO decoding.

for each coded bit $x_{i,b}$, based on the received vector $\mathbf{y}$, the channel matrix $\mathbf{H}$, and the a-priori LLRs $L_{i,b}^{A}$, $\forall i, b$. The LLRs $L_{i,b}^{D}$ are then delivered to the SISO channel decoder, which computes *new* a-priori LLRs $L_{i,b}^{A}$, $\forall i, b$, that are used by the SISO detector in the next iteration. After a given number of iterations (denoted by $I$), the SISO channel decoder computes final estimates $\hat{\mathbf{b}}$ for the information bits.

## B. Reduced-Complexity SISO MMSE-PIC Algorithm

Even for a small number of spatial streams (say  $M_{\rm T} > 2$ ), exact computation of the LLRs in (1) exhibits prohibitive complexity. Therefore, a complexity-reduced variant of the SISO MMSE-PIC algorithm in [7] is considered in the following. Our algorithm performs SISO detection in *six steps* and is summarized below (refer to [9] for more details):

1) Gram matrix and matched-filter: To reduce the amount of recurrent (and hence, redundant) operations, compute the Gram matrix $\mathbf{G} = \mathbf{H}^H \mathbf{H}$ and the matched-filter output according to $\mathbf{y}^{\text{MF}} = \mathbf{H}^H \mathbf{y}$.

2) Soft-symbols and variances: Compute soft-symbols for each spatial stream  $i = 1, ..., M_{\rm T}$ , according to

$$\hat{s}_i = \mathbb{E}[s_i] = \sum_{a \in \mathcal{O}} a \prod_{b=1}^{Q} \mathrm{P}[x_{i,b} = [a]_b] \tag{2}$$

where  $[a]_b$  corresponds to the *b*th bit associated with the constellation point  $a \in \mathcal{O}$ . The soft-estimates in (2) are computed on the basis of the a-priori LLRs  $L_{i,b}^{A}$  (provided by the SISO channel decoder) according to  $P[x_{i,b} = x] = \frac{1}{2}(1 + (2x - 1) \tanh(\frac{1}{2}L_{i,b}^{A}))$ . In the first iteration, no a-priori information is available, which implies  $L_{i,b}^{A} = 0, \forall i, b$ . The variances  $E_i = \operatorname{Var}[s_i]$  of the soft-symbols are computed analogously to (2).

*3) Parallel interference cancellation (PIC):* Next, the SISO detector performs PIC according to

$$\hat{\mathbf{y}}_{i}^{\mathrm{MF}} = \mathbf{y}^{\mathrm{MF}} - \sum_{j \neq i} \mathbf{g}_{j} \hat{s}_{j} = \mathbf{g}_{i} s_{i} + \underbrace{\mathbf{H}^{H}\mathbf{n} + \sum_{j \neq i} \mathbf{g}_{j} e_{j}}_{\text{noise-plus-interference}} \tag{3}$$

for each stream *i*, where  $\mathbf{g}_i$  stands for the *i*th column of  $\mathbf{G}$  and  $e_j = s_j - \hat{s}_j$ . The SISO MMSE-PIC algorithm now performs approximate detection based on (3). To this end, the single-stream system in (3) is considered as *independent* from the other spatial streams  $j \neq i$  and the errors  $e_j$  are assumed as zero-mean Gaussian with variances  $E_j$ .

4) Matrix inversion for MMSE filtering: For each spatial stream in (3), an MMSE-filter operation is performed to suppress the noise-plus-interference term. The original algorithm [7] requires  $M_{\rm T}$  matrix inversions for the computation of all  $M_{\rm T}$  MMSE filter vectors, which inhibits the efficient implementation in hardware. Hence, we deploy a low-complexity method that yields the same LLRs (see [9] for the proof) and only requires one matrix inversion of the same size for the simultaneous computation of all filter vectors. To this end, we compute the inverse  $\mathbf{A}^{-1} = (\mathbf{G}\mathbf{\Lambda} + N_0\mathbf{I}_{M_{\rm T}})^{-1}$ , where  $\mathbf{\Lambda}$  is an  $M_{\rm T} \times M_{\rm T}$  diagonal matrix with  $\Lambda_{i,i} = E_i$ ,  $\forall i$  and the rows of  $\mathbf{A}^{-1}$  correspond to the  $M_{\rm T}$  filter vectors.

5) *MMSE filtering:* Compute the MMSE filter outputs according to  $z_i = \mu_i^{-1} \mathbf{a}_i^H \hat{\mathbf{y}}_i^{\text{MF}}$ ,  $\forall i$ , where  $\mathbf{a}_i^H$  is the *i*th row of the matrix  $\mathbf{A}^{-1}$  and  $\mu_i = \mathbf{a}_i^H \mathbf{g}_i$ .

6) *LLR computation:* The SISO MMSE-PIC algorithm finally approximates the LLRs in (1) according to

$$L_{i,b}^{\rm D} \approx \rho_i \left( \min_{a \in \mathcal{Z}_b^{(0)}} |z_i - a|^2 - \min_{a \in \mathcal{Z}_b^{(1)}} |z_i - a|^2 \right) \tag{4}$$

where $\rho_i = \frac{\mu_i}{1 - E_i \mu_i}$ is the post-equalization signal-to-noise-plus-interference ratio of the *i*th stream, and $\mathcal{Z}_b^{(0)}$ and $\mathcal{Z}_b^{(1)}$ denote the subsets of $\mathcal{O}$ in which the *b*th bit is 0 and 1, respectively.
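The six steps above can be sketched in a few lines of floating-point Python. This is only an illustration under stated assumptions, not the ASIC's fixed-point datapath: the Gray-mapped QPSK alphabet, the function names (`soft_symbol`, `inverse`, `mmse_pic`), and the Gauss-Jordan inverse used in place of the LUD-based unit are all hypothetical choices. The symbol probabilities in Step 2 are formed as the product of the per-bit probabilities, i.e., the coded bits are assumed independent.

```python
import math

# Hypothetical Gray-mapped QPSK alphabet (unit average energy): bit pair -> point.
S = 1 / math.sqrt(2)
QPSK = {(0, 0): complex(-S, -S), (0, 1): complex(-S, S),
        (1, 0): complex(S, -S), (1, 1): complex(S, S)}

def soft_symbol(llrs):
    """Step 2: mean and variance of one symbol given its a-priori LLRs."""
    p1 = [0.5 * (1 + math.tanh(0.5 * llr)) for llr in llrs]  # P[x_b = 1]
    mean, power = 0j, 0.0
    for bits, a in QPSK.items():
        p = math.prod(p1[b] if bit else 1 - p1[b] for b, bit in enumerate(bits))
        mean += p * a
        power += p * abs(a) ** 2
    return mean, power - abs(mean) ** 2

def inverse(A):
    """Gauss-Jordan inverse with partial pivoting (software stand-in only)."""
    n = len(A)
    M = [list(A[i]) + [1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        piv = M[c][c]
        M[c] = [v / piv for v in M[c]]
        for r in range(n):
            if r != c:
                f = M[r][c]
                M[r] = [vr - f * vc for vr, vc in zip(M[r], M[c])]
    return [row[n:] for row in M]

def mmse_pic(H, y, N0, llr_a):
    """One SISO MMSE-PIC pass; returns the LLRs L^D for every stream and bit."""
    mr, mt = len(H), len(H[0])
    # Step 1: Gram matrix G = H^H H and matched-filter output y_mf = H^H y.
    G = [[sum(H[k][i].conjugate() * H[k][j] for k in range(mr))
          for j in range(mt)] for i in range(mt)]
    y_mf = [sum(H[k][i].conjugate() * y[k] for k in range(mr)) for i in range(mt)]
    # Step 2: soft-symbols and their variances E_i.
    soft = [soft_symbol(llrs) for llrs in llr_a]
    s_hat = [m for m, _ in soft]
    E = [v for _, v in soft]
    # Step 4: A = G*Lambda + N0*I, inverted once for all streams.
    A = [[G[i][j] * E[j] + (N0 if i == j else 0.0) for j in range(mt)]
         for i in range(mt)]
    A_inv = inverse(A)
    llr_d = []
    for i in range(mt):
        # Step 3: cancel the soft estimates of all other streams (g_j = column j of G).
        y_i = [y_mf[k] - sum(G[k][j] * s_hat[j] for j in range(mt) if j != i)
               for k in range(mt)]
        # Step 5: MMSE filtering with a_i^H = row i of A^{-1}.
        mu = sum(A_inv[i][k] * G[k][i] for k in range(mt)).real
        z = sum(A_inv[i][k] * y_i[k] for k in range(mt)) / mu
        # Step 6: max-log LLRs scaled by rho_i.
        rho = mu / (1 - E[i] * mu)
        bits_out = []
        for b in range(2):
            d0 = min(abs(z - a) ** 2 for bits, a in QPSK.items() if bits[b] == 0)
            d1 = min(abs(z - a) ** 2 for bits, a in QPSK.items() if bits[b] == 1)
            bits_out.append(rho * (d0 - d1))
        llr_d.append(bits_out)
    return llr_d
```

Calling `mmse_pic` with all-zero a-priori LLRs corresponds to the first iteration; in later iterations the channel decoder's output LLRs would be passed in as `llr_a`.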

# III. VLSI ARCHITECTURE

In order to efficiently compute the reduced-complexity SISO MMSE-PIC algorithm in hardware, we propose an architecture consisting of eight processing units (PUs), each having a similar structure. The high-level VLSI architecture, including the PU partitioning and the corresponding six processing steps (as described in Sec. II-B), is depicted in Fig. 2. We optimized each PU independently, which led to a high-throughput and area-efficient VLSI architecture while requiring low development and verification time.

The proposed architecture processes six receive vectors concurrently and in a pipelined manner. Each PU performs the assigned tasks in  $T_{\rm s} = 18$  clock cycles, which was chosen to arrive at a low silicon complexity while achieving the 600 Mb/s peak data rate of IEEE 802.11n. The results of a PU are passed to the subsequent PUs (or to the output of the detector) every 18th cycle, which is referred to as the "exchange-cycle" in the following. This systolic-like processing scheme leads to an overall latency of 108 clock cycles and achieves a constant throughput of  $\frac{M_{\rm T}Q}{T_{\rm s}}f_{\rm clk}$  bit/s scaling linearly in the clock frequency  $f_{\rm clk}$ . Consequently, in this architecture, the throughput is maximized by minimizing the lengths of the critical paths in all PUs.
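As a quick numerical check of the throughput expression above, the following sketch evaluates it for four spatial streams and 64-QAM ($Q = 6$). The helper names are illustrative; the 568 MHz clock is the measured value reported in Sec. IV-B, and the latency assumes six pipeline stages of $T_{\rm s}$ cycles each, as suggested by the six concurrently processed receive vectors.

```python
def throughput_bps(m_t, q, t_s, f_clk_hz):
    """Constant detector throughput M_T * Q / T_s * f_clk in bit/s."""
    return m_t * q / t_s * f_clk_hz

def latency_cycles(stages, t_s):
    """Pipeline latency when every stage takes t_s clock cycles."""
    return stages * t_s

# Four spatial streams, 64-QAM (Q = 6), T_s = 18 cycles, 568 MHz clock.
peak = throughput_bps(m_t=4, q=6, t_s=18, f_clk_hz=568e6)

# Minimum clock frequency needed to sustain the 600 Mb/s peak rate.
min_clk_hz = 600e6 * 18 / (4 * 6)

print(f"throughput: {peak / 1e6:.0f} Mb/s")
print(f"minimum clock for 600 Mb/s: {min_clk_hz / 1e6:.0f} MHz")
print(f"latency: {latency_cycles(6, 18)} cycles")
```

Under these assumptions, a clock of 450 MHz would already suffice for 600 Mb/s, which shows why minimizing the critical path translates directly into throughput margin.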

## A. Processing Unit (PU) Architecture

The architectural principle underlying each PU is depicted on the left-hand side of Fig. 3. Each PU contains a finite state machine (FSM) controlling the data memory, an interconnection network, and a task-specific set of arithmetic units (AUs). The data memories are formed by arrays of flip-flops in order to meet the high memory-bandwidth required by the parallel AU-instances and to enable irregular access to multiple data words. The total set of AUs corresponds to adders, multipliers, multiply-accumulate (MAC) units, arithmetic shifters (mainly used to improve numerical precision), comparators, look-up tables (required to approximate the probabilities $P[x_{i,b} = [a]_b]$), and reciprocal units. The set of AUs required by a specific PU is determined such that all required operations are completed in exactly $T_{\rm s} = 18$ clock cycles.

Figure 2. Proposed high-level VLSI architecture of the SISO MMSE-PIC detector.

To minimize the length of the critical path, fixed-point arithmetic is used and the AU-internal word-lengths are optimized with the aid of simulations. Further reduction of the length of the critical path is obtained by inserting a pipeline-register at the input of each AU, which is then re-timed with the aid of the synthesis tool. The feed-through capability allows a parallel transfer of all the data-memory contents from one PU to the subsequent PU(s) in the exchange-cycle. In this cycle, some AUs also provide computation results, which are directly passed to the corresponding next PU. To reduce dynamic power consumption in the case that no data-frame needs to be processed, the clock of each PU can be gated individually.

## B. Matrix Inversion Using the LU-Decomposition

The computation of $\mathbf{A}^{-1}$ in Step 4 of the SISO MMSE-PIC algorithm (see Section II-B) dominates the computational complexity of the algorithm. In order to perform matrix inversion at high throughput and with sufficiently high arithmetic precision, we propose the use of an LU-decomposition (LUD) based inversion procedure. Compared to other methods (e.g., QR-based matrix inversion), we observed that it is economical and exhibits good numerical stability. As shown in Fig. 2, the required inversion computations are performed in two separate PUs. The first PU computes the LUD $\mathbf{A} = \mathbf{L}\mathbf{U}$, where $\mathbf{L}$ and $\mathbf{U}$ are lower- and upper-triangular matrices, respectively, and performs the forward-substitution procedure to solve $\mathbf{L}\mathbf{v}_i = \mathbf{e}_i$ for $\mathbf{v}_i$, $i = 1, \ldots, M_{\rm T}$, where $\mathbf{e}_i$ denotes the *i*th unit vector. The second PU performs the back-substitution to solve $\mathbf{U}\mathbf{x}_i = \mathbf{v}_i$ for $\mathbf{x}_i$, $i = 1, \ldots, M_{\rm T}$, which finally yields the desired inverse according to $\mathbf{A}^{-1} = [\mathbf{x}_1 \cdots \mathbf{x}_{M_{\rm T}}]$.
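The split into an LUD/forward-substitution stage and a back-substitution stage can be illustrated with a short floating-point sketch. It assumes a Doolittle factorization (unit diagonal in $\mathbf{L}$) without pivoting; the text does not state which LUD variant or pivoting strategy the ASIC uses, and all function names are hypothetical.

```python
def lu_decompose(A):
    """Doolittle LU without pivoting: A = L U with a unit diagonal in L."""
    n = len(A)
    L = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    U = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):          # row i of U
            U[i][j] = A[i][j] - sum(L[i][k] * U[k][j] for k in range(i))
        for j in range(i + 1, n):      # column i of L
            L[j][i] = (A[j][i] - sum(L[j][k] * U[k][i] for k in range(i))) / U[i][i]
    return L, U

def forward_substitute(L, e):
    """Solve L v = e for v (first inversion PU in the text)."""
    v = []
    for i in range(len(L)):
        v.append((e[i] - sum(L[i][k] * v[k] for k in range(i))) / L[i][i])
    return v

def back_substitute(U, v):
    """Solve U x = v for x (second inversion PU in the text)."""
    n = len(U)
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (v[i] - sum(U[i][k] * x[k] for k in range(i + 1, n))) / U[i][i]
    return x

def invert(A):
    """Assemble A^{-1} = [x_1 ... x_n] column by column."""
    n = len(A)
    L, U = lu_decompose(A)
    cols = []
    for i in range(n):
        e_i = [1.0 if k == i else 0.0 for k in range(n)]
        cols.append(back_substitute(U, forward_substitute(L, e_i)))
    # cols[j] holds column j of the inverse; return it in row-major form.
    return [[cols[j][i] for j in range(n)] for i in range(n)]
```

Because each column $\mathbf{x}_i$ depends only on $\mathbf{L}$, $\mathbf{U}$, and its own unit vector, the $M_{\rm T}$ substitutions are independent of each other, which suits the parallel AUs of a PU.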

## C. Newton-Raphson-Based Reciprocal Unit

At various steps of the algorithm (such as for the matrix inversion in Step 4 and the computation of  $\mu_i$  and  $\rho_i$  in Step 5 and Step 6, respectively) reciprocal values (i.e., 1/x) have to be computed. We identified these reciprocal computations as critical in terms of the maximum achievable clock-frequency as well as in terms of the required arithmetic precision.



Figure 3. Left: PU architecture overview. Right: Register transfer-level architecture of the pipelined Newton-Raphson-based reciprocal unit.

Therefore, we designed a custom reciprocal unit delivering one reciprocal value per clock cycle (shown on the right-hand side of Fig. 3). First, to improve numerical precision, the input value $x$ is shifted according to $\tilde{x} = 2^{\alpha}x$ (with $\alpha \in \mathbb{Z}$) such that the MSB of $\tilde{x}$ becomes non-zero (the scaling $2^{\alpha}$ is accounted for in later stages using arithmetic shifters). Next, based on an initial guess $\tilde{x}_0$ of $1/\tilde{x}$ obtained from an 8 bit look-up table (LUT), a *single* Newton-Raphson iteration according to $\tilde{x}_1 \leftarrow 2\tilde{x}_0 - \tilde{x}\tilde{x}_0^2$ is performed. The resulting unit provides $\tilde{x}_1 \approx 1/\tilde{x}$ with 15 bit precision (excluding the initial shift), which was shown to be sufficient to attain a very small implementation loss (see Fig. 5). The insertion of two pipeline stages in the reciprocal unit finally moved the critical path to a 24 bit × 28 bit multiplier of the back-substitution PU.
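The normalize, look-up, and refine sequence described above can be mimicked in software as follows. This is a floating-point sketch, not the fixed-point RTL: it assumes that "8 bit LUT" means 8 address bits (256 entries covering $[0.5, 1)$), stores unquantized midpoint reciprocals, and uses `math.frexp` to play the role of the initial shift. The names `LUT_BITS`, `LUT`, and `reciprocal` are illustrative.

```python
import math

LUT_BITS = 8  # assumption: 8 address bits, i.e. 256 entries over [0.5, 1)
# Each entry stores the reciprocal of the midpoint of its input interval.
LUT = [1.0 / (0.5 + (k + 0.5) / 2 ** (LUT_BITS + 1)) for k in range(2 ** LUT_BITS)]

def reciprocal(x):
    """Approximate 1/x for x > 0 using one Newton-Raphson iteration."""
    if x <= 0:
        raise ValueError("x must be positive")
    # Normalization: x = x_t * 2**e with 0.5 <= x_t < 1 (the shift by alpha = -e).
    x_t, e = math.frexp(x)
    # Initial guess from the look-up table, addressed by the leading bits of x_t.
    k = int((x_t - 0.5) * 2 ** (LUT_BITS + 1))
    x0 = LUT[k]
    # Single Newton-Raphson step: x1 = 2*x0 - x_t*x0^2.
    x1 = 2 * x0 - x_t * x0 * x0
    # Undo the normalization: 1/x = (1/x_t) * 2**(-e).
    return math.ldexp(x1, -e)
```

With this table, the relative error of the initial guess is at most $2^{-9}$, and one Newton-Raphson step squares it to at most $2^{-18}$ in exact arithmetic; a fixed-point realization with quantized table entries and finite multiplier widths would lose some of this margin.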

# IV. PERFORMANCE AND IMPLEMENTATION RESULTS

The final design performs SISO detection of four spatial streams and supports BPSK, QPSK, 16-QAM, and 64-QAM. The ASIC was fabricated in 90 nm (1P/9M) CMOS technology. Fig. 4 shows the chip micrograph (due to library constraints, no signal routing was used on the 9th metal layer).

## A. Performance of Iterative MIMO Decoding

Fig. 5 demonstrates the SNR-performance advantages of iterative MIMO decoding using the SISO MMSE-PIC detector (based on the proposed SISO MMSE-PIC ASIC and on the corresponding ideal floating-point algorithm according to [7]) over non-iterative (i.e., $I = 1$) state-of-the-art MIMO detection schemes based on hard-output SD [2], [3], k-Best detection [4], and soft-output MMSE detection [5]. We note that for $I = 1$, SISO MMSE-PIC detection coincides with soft-output MMSE detection. One can observe that for the non-iterative case, hard-output SD slightly outperforms all other detection algorithms at 1% PER. However, two or four iterations with the SISO MMSE-PIC detector yield 3.9 dB and 6 dB SNR improvement, respectively, over the corresponding non-iterative algorithms. Finally, one can observe that the implementation loss of the proposed SISO MMSE-PIC ASIC compared to the ideal algorithm is less than 0.2 dB.

Figure 4. SISO MMSE-PIC chip micrograph with highlighted PUs (I/O refers to logic required for the input/output interface of the chip).

Figure 5. Packet error-rate (PER) comparison of various MIMO detection algorithms in a typical 40 MHz IEEE 802.11n scenario (MCS 27) with four spatial streams, 16-QAM, rate-1/2 convolutional code, 864 information bits per packet, and using a TGn type C channel model. The arrow indicates the SNR-performance gain through iterative MIMO decoding using the SISO MMSE-PIC algorithm with four iterations over (non-iterative) hard-output SD.

## B. Implementation Results

The proposed SISO MMSE-PIC ASIC has the following key characteristics (see also Tbl. I).<sup>1</sup> Its core area is $1.5 \text{ mm}^2$ (at 86% cell density) and its maximum clock frequency is 568 MHz, which results in a maximum throughput of 757 Mb/s per iteration (for four spatial streams and 64-QAM) and thus exceeds the 600 Mb/s peak data-rate specified in IEEE 802.11n with margin. The power consumption<sup>2</sup> is 769 mW, leading to an energy efficiency of 1.02 nJ/bit per iteration.
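The energy-efficiency figure follows directly from the measured power and throughput. The short check below uses only the numbers quoted in this section; the helper name is illustrative.

```python
def energy_per_bit_nj(power_w, throughput_bps):
    """Energy per detected bit (per iteration) in nanojoules."""
    return power_w / throughput_bps * 1e9

# 769 mW at the maximum throughput of 757 Mb/s.
e_bit = energy_per_bit_nj(power_w=0.769, throughput_bps=757e6)
print(f"{e_bit:.2f} nJ/bit per iteration")
```

Since the figure is per iteration, running $I$ detector iterations on the same data would scale the energy per bit by roughly $I$, assuming the power stays at the measured value.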

Table I. ASIC implementation results and comparison.

|                                                  | This work        | Burg<br>et al. [5]  | Shabany and<br>Gulak [4] |
|--------------------------------------------------|------------------|---------------------|--------------------------|
| Detection algorithm                              | SISO<br>MMSE-PIC | soft-output<br>MMSE | hard-output<br>k-Best    |
| Iterative MIMO decoding                          | yes              | no                  | no                       |
| SNR operating point <sup>a</sup> [dB]            | 13.7             | 21.7                | 20.3                     |
| CMOS technology [nm]                             | 90               | 130                 | 130                      |
| Preprocessing area [kGE]<br>Detection area [kGE] | 410 <sup>b</sup> | 251<br>67           | _<br>114                 |
| Max. throughput [Mb/s]                           | 757              | 1386 <sup>c</sup>   | 950 <sup>c</sup>         |

<sup>*a*</sup>Corresponding to the minimum SNR required for 1% PER (see Fig. 5).

<sup>*b*</sup>One gate equivalent (GE) corresponds to a 2-input drive-1 NAND gate.

<sup>*c*</sup>Throughput scaled by 1.45 to account for 130 nm CMOS technology.

Tbl. I compares the proposed SISO MMSE-PIC ASIC with two state-of-the-art non-iterative MIMO detector implementations [4], [5] that exhibit constant throughput and achieve the 600 Mb/s peak data-rate of the IEEE 802.11n standard.<sup>3</sup> We note that no VLSI implementation of a SISO detection algorithm for iterative MIMO decoding has been reported in the open literature. From Tbl. I one can observe that, while enabling the significant SNR-performance gains offered by iterative MIMO decoding, the proposed SISO MMSE-PIC ASIC is only about two times less efficient in terms of kGE/(Mb/s) than the soft-output MMSE detector of [5]. We note that the area result of the k-Best detector in [4] is rather optimistic, as it does not include the necessary preprocessing circuitry.
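The area-efficiency comparison can be reproduced from the entries of Tbl. I. The sketch below assumes that the single 410 kGE figure for this work covers both preprocessing and detection, so it is compared against the sum of the two area entries of [5]; the dictionary layout and helper name are illustrative.

```python
# Area and (technology-scaled) throughput values taken from Tbl. I.
designs = {
    "this work": {"area_kge": 410, "mbps": 757},
    "Burg et al. [5]": {"area_kge": 251 + 67, "mbps": 1386},
}

def kge_per_mbps(design):
    """Silicon area spent per unit of throughput, in kGE/(Mb/s)."""
    return design["area_kge"] / design["mbps"]

ratio = kge_per_mbps(designs["this work"]) / kge_per_mbps(designs["Burg et al. [5]"])
for name, design in designs.items():
    print(f"{name}: {kge_per_mbps(design):.2f} kGE/(Mb/s)")
print(f"ratio: {ratio:.1f}")
```

Under this assumption the ratio comes out a little above two, in line with the comparison stated above.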

# ACKNOWLEDGMENTS

The authors gratefully acknowledge the support from H. Bölcskei, A. Burg, N. Felber, W. Fichtner, F. Gürkaynak, and Q. Huang during the ASIC design and writing of the paper.

# REFERENCES

- [1] *IEEE Draft Standard; Part 11: Wireless LAN Medium Access Control (MAC) and Physical Layer (PHY) specifications; Amendment 4: Enhancements for Higher Throughput*, P802.11n/D3.0, Sep. 2007.
- [2] A. Burg *et al.*, "VLSI implementation of the sphere decoding algorithm," in *Proc. IEEE ESSCIRC*, Sep. 2004, pp. 303–306.
- [3] C.-H. Yang and D. Marković, "A 2.89 mW 50 GOPS 16×16 16-core MIMO sphere decoder in 90 nm CMOS," in *Proc. IEEE ESSCIRC*, Sep. 2009, pp. 344–347.
- [4] M. Shabany and P. G. Gulak, "A 0.13 μm CMOS, 655 Mb/s 4×4 64-QAM k-best MIMO detector," in *Dig. Techn. Papers, IEEE ISSCC*, Feb. 2009.
- [5] A. Burg *et al.*, "A 4-stream 802.11n baseband transceiver in 0.13 μm CMOS," in *Dig. Techn. Papers, Symp. on VLSI Circuits*, Jun. 2009, pp. 282–283.
- [6] B. M. Hochwald and S. ten Brink, "Achieving near-capacity on a multiple-antenna channel," *IEEE T-COM*, vol. 51, no. 3, pp. 389–399, Mar. 2003.
- [7] X. Wang and H. V. Poor, "Iterative (turbo) soft interference cancellation and decoding for coded CDMA," *IEEE T-COM*, vol. 47, no. 7, pp. 1046–1061, Jul. 1999.
- [8] C. Berrou, A. Glavieux, and P. Thitimajshima, "Near Shannon limit error-correcting coding and decoding," in *Proc. of IEEE ICC*, May 1993, pp. 1064–1070.
- [9] C. Studer, S. Fateh, and D. Seethaler, "Soft-input soft-output MIMO detection using MMSE parallel interference cancellation: Algorithm and VLSI implementation," *in preparation*.

<sup>1</sup>All measurements (of maximum clock frequency and power consumption) were carried out on an HP 83000 F660 VLSI test system.

<sup>2</sup>Measured at max. throughput, 1.2 V core supply, and T = 300 K.

<sup>3</sup>We note that the hard-output SD implementations of [2], [3] do *not* achieve the 600 Mb/s peak data-rate of the IEEE 802.11n standard [1].