Abstract—This paper presents a novel data detector ASIC for massive multiuser multiple-input multiple-output (MU-MIMO) wireless systems. The ASIC implements a modified version of the large-MIMO approximate message passing algorithm (LAMA), which achieves near-optimal error-rate performance (i) under realistic channel conditions and (ii) for systems with as many users as base-station (BS) antennas. The hardware architecture supports 32 users transmitting up to 256-QAM simultaneously and in the same frequency band, and provides soft-input soft-output capabilities for iterative detection and decoding. The fabricated 28nm CMOS ASIC occupies 0.37 mm², achieves a throughput of 354 Mb/s, consumes 151 mW, and improves the SNR by more than 11 dB compared to existing data detectors in systems with 32 BS antennas and 32 users for realistic wireless channels. In addition, the ASIC achieves $4 \times$ higher throughput per area than a recently proposed message-passing detector.

Contributions: We propose the first data detector ASIC that achieves near-MAP performance for 32 UEs under realistic propagation conditions. Furthermore, the ASIC provides soft-input soft-output (SISO) capabilities for iterative detection and decoding. The algorithm builds upon the large-MIMO approximate message passing (LAMA) algorithm [3], which achieves MAP-optimal error-rate performance for Rayleigh fading channels in the large-antenna limit, assuming that the UE-to-BS antenna ratio is less than a threshold that depends on the constellation. In contrast to linear data detectors, LAMA exploits information on the constellation to improve performance; for QPSK, for example, LAMA achieves optimal performance in the large-antenna limit even for systems in which the numbers of UE and BS antennas are identical. Since practical systems are finite-dimensional and real-world channels exhibit correlation, we include algorithm-level optimizations to support realistic channels with LAMA. To achieve high throughput at low area, our ASIC uses coarse-grained pipeline interleaving, processing two detection problems within the same architecture. The fabricated 28nm CMOS ASIC outperforms existing designs under realistic channel conditions and for systems in which the number of UEs is comparable to the number of BS antennas.

I. INTRODUCTION

Massive MU-MIMO enables higher per-cell spectral efficiency compared to conventional, small-scale MIMO. This improvement, however, comes at the cost of a significant increase in baseband processing complexity [1]. In particular, data detection at the base-station (BS) in the massive MU-MIMO uplink is among the most critical tasks in terms of power consumption and throughput [2]. To make matters worse, the complexity of optimal maximum a-posteriori (MAP) data detection grows exponentially in the number of user equipment (UE) antennas [3], which prevents its implementation in practice.

To enable high-throughput massive MU-MIMO data detection, a variety of low-complexity algorithms (see, e.g., [1], [4]) and application-specific integrated circuits (ASICs) [5]–[7] have been proposed. These algorithms and ASIC designs either rely on idealistic channel-hardening assumptions [5], [6] or deploy approximations [1], [4] to reduce complexity. Unfortunately, both of these simplifications result in high error rates (i) under realistic propagation conditions, such as correlation and per-user path loss, and (ii) in systems where the number of UEs is equal to the number of BS antennas. As a consequence, achieving near-optimal performance in realistic systems necessitates novel data detection algorithms that can be implemented efficiently.

A. Iterative MIMO Decoding

Iterative detection and decoding achieves near-optimal spectral efficiency in MIMO wireless systems [8].
Reliability information on the coded bits, often expressed as log-likelihood ratios (LLRs), is iteratively exchanged between the MIMO data detector and the channel decoder. In each iteration, a soft-input soft-output (SISO)-capable MIMO data detector computes extrinsic LLRs for the coded bits $x_{u,q}$ as

$$
\Lambda_{u,q}^d = \log \left( \frac{P[x_{u,q} = 1 | y]}{P[x_{u,q} = 0 | y]} \right) - \Lambda_{u,q}^\text{prior},
$$

using the received vector $y$ and a-priori LLRs $\Lambda_{u,q}^\text{prior}$, $u = 1, \ldots, U$, $q = 1, \ldots, Q$, obtained from the channel decoder. The extrinsic LLRs $\Lambda_{u,q}^d$, which represent reliability estimates for each coded bit $x_{u,q}$, are then passed to the channel decoder, which computes new a-priori LLRs $\Lambda_{u,q}^\text{prior}$, $\forall u,q$, that are used by the MIMO data detector in the next iteration. After a small number of iterations $I$, the channel decoder generates final decisions $\hat{b}$ for the information bit vector $b$.
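As an illustration of the extrinsic LLR definition above, the following minimal Python sketch evaluates it for a single real-valued AWGN observation of one symbol. The scalar channel model, the function name `extrinsic_llrs`, and its arguments are assumptions made for illustration only; the detector itself operates on the full MIMO system.

```python
import math

def _logsumexp(values):
    """Numerically stable log(sum(exp(v))) over a non-empty list."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def extrinsic_llrs(z, noise_var, points, labels, prior_llrs):
    """Extrinsic LLRs for one symbol observed as z = s + n, n ~ N(0, noise_var).

    points     : real constellation points (hypothetical scalar model)
    labels     : tuple of bits (0/1) for each point
    prior_llrs : a-priori LLRs log(P[x=1] / P[x=0]), one per bit
    """
    log_post = []
    for s, bits in zip(points, labels):
        log_lik = -((z - s) ** 2) / (2.0 * noise_var)
        # log P[bits], up to an additive constant shared by all symbols
        log_prior = sum(llr for b, llr in zip(bits, prior_llrs) if b == 1)
        log_post.append(log_lik + log_prior)
    out = []
    for q, prior in enumerate(prior_llrs):
        ones = [lp for lp, bits in zip(log_post, labels) if bits[q] == 1]
        zeros = [lp for lp, bits in zip(log_post, labels) if bits[q] == 0]
        # posterior LLR minus a-priori LLR gives the extrinsic LLR
        out.append(_logsumexp(ones) - _logsumexp(zeros) - prior)
    return out

# Two-point example: the extrinsic LLR equals 2 * z / noise_var for any prior.
llr = extrinsic_llrs(0.5, 1.0, [-1.0, 1.0], [(0,), (1,)], [0.3])
```

In this two-point example, subtracting the a-priori LLR removes the dependence on the prior, so the detector passes on only the information contributed by the observation.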

### B. Hardware-Friendly LAMA Algorithm

LAMA is an efficient data detection algorithm based on approximate message passing (AMP) that is provably optimal (in terms of error-rate performance) in the large-system limit (i.e., fix $\beta = B/U$ and $B \rightarrow \infty$) with i.i.d. Rayleigh fading channels [3]. In each of its $t_{\text{max}}$ iterations, LAMA decouples the MIMO system into parallel and independent AWGN channels with equal signal-to-interference-plus-noise ratio (SINR). As a result, LAMA optimally denoises the parallel AWGN channels in every iteration, which successively increases the post-equalization SINR and improves the error-rate performance.

To deal with realistic channel conditions (such as correlation and per-UE path loss), we apply algorithm-level modifications to the original LAMA algorithm in [3]. First, we transform LAMA so that it operates on the $U \times U$ dimensional Gram matrix $G = H^H H$ instead of the $B \times U$ channel matrix $H$, which reduces the per-iteration complexity. Second, we deploy message damping techniques [9] to reduce the performance loss of LAMA in finite-dimensional systems that exhibit correlation and large-scale UE fading. Specifically, we damp the updates of $\hat{\tau}_d$ and $\rho^t$ by a factor $\theta \in (0, 1)$, i.e., we use $\hat{\tau}_d^t$ instead of $\tau^t$ in line 6 of Algorithm 1, and $\rho^{t+1} = \theta \rho^t + (1 - \theta) \rho_{\text{damp}}^t$. Third, we include support for iterative detection and decoding. The implemented LAMA algorithm is summarized in Algorithm 1.
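The Gram-domain preprocessing and the damping step can be sketched in a few lines of Python. The sketch follows the preprocessing quantities listed in line 2 of Algorithm 1 (the Gram matrix $G = H^H H$, the matrix $I_U - \text{diag}(G)^{-1} G$, the normalized matched-filter output, and $g_u = G_{uu}/U$). The function names `gram_preprocess` and `damp` are hypothetical, and the damping rule is written as a generic convex combination of an old and a new value, which is an assumption about how the factor $\theta$ enters.

```python
def gram_preprocess(H, y):
    """Gram-domain preprocessing for a B x U channel matrix H (lists of complex)."""
    B, U = len(H), len(H[0])
    # Gram matrix G = H^H H (U x U)
    G = [[sum(H[b][i].conjugate() * H[b][j] for b in range(B))
          for j in range(U)] for i in range(U)]
    # Matched-filter output H^H y
    mf = [sum(H[b][i].conjugate() * y[b] for b in range(B)) for i in range(U)]
    d = [G[i][i].real for i in range(U)]          # real, positive diagonal
    # I_U - diag(G)^{-1} G: zero diagonal, normalized off-diagonal interference
    G_tilde = [[(1.0 if i == j else 0.0) - G[i][j] / d[i]
                for j in range(U)] for i in range(U)]
    y_mf = [mf[i] / d[i] for i in range(U)]       # diag(G)^{-1} H^H y
    g = [d[i] / U for i in range(U)]              # g_u = G_uu / U
    return G_tilde, y_mf, g

def damp(old, new, theta):
    """Generic damping (assumed form): convex combination with theta in (0, 1)."""
    return theta * old + (1.0 - theta) * new

# Small 3 x 2 example channel (arbitrary values, for illustration only).
H = [[1 + 1j, 0.5j], [0.2, 1 - 1j], [-1j, 0.3 + 0.1j]]
y = [1 + 0j, -1j, 0.5 + 0.5j]
G_tilde, y_mf, g = gram_preprocess(H, y)
```

After preprocessing, every iteration works with $U$-dimensional quantities only, which is the source of the per-iteration complexity reduction when $B \geq U$.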

### Algorithm 1 Large MIMO AMP (LAMA) Algorithm

1. inputs: $H, y, N_0, \Lambda_{u,q}^\text{prior}, \forall u,q$
2. preprocessing: $\tilde{G} = I_U - \text{diag}(G)^{-1}G$ with $G = H^H H$, $\hat{\mathbf{y}}_{\text{MF}} = \text{diag}(G)^{-1} H^H y$, and $g_u = G_{uu}/U$, $u = 1, \ldots, U$
3. initialize: $\hat{\mathbf{y}} = \mathbf{0}_{U \times 1}$, and $\rho^t = 0$
4. for $t = 1, 2, \ldots, t_{\text{max}}$ do
5. mean and variance estimation:
   $$
   \hat{s}_{t+1} = F(\tau^t, \rho^t, \Lambda_{u,q}^\text{prior}) \quad \text{(mean update)}
   $$
   $$
   \tau_{t+1} = G(\tau^t, \rho^t, \Lambda_{u,q}^\text{prior}) \quad \text{(variance update)}
   $$
   $$
   \rho_{t+1} = \frac{1}{\tau_{t+1}^2} \left( \frac{1}{\tau_{t+1}^2} + \frac{1}{\rho^t} \right) \quad \text{(Onsager term)}
   $$
6. interference cancellation:
   $$
   \tau_{t+1} = \hat{\mathbf{y}}_{\text{MF}} + \hat{\tau}_{t+1}^\text{IC} + \rho_{t+1}^\text{IC} \tau_{t+1}^\text{IC} \quad \text{(interference cancellation)}
   $$
   $$
   \rho_{t+1} = \frac{1}{\tau_{t+1}^\text{IC}} \left( \frac{1}{\tau_{t+1}^\text{IC}} + \frac{1}{\rho_{t+1}^\text{IC}} \right) \quad \text{(post-equalization SINR update)}
   $$
7. end for
8. output: extrinsic LLR values $\Lambda_{u,q}^d$, $\forall u = 1, \ldots, U$, $q = 1, \ldots, Q$

Message damping details are omitted from Algorithm 1. The functions $F$ and $G$ correspond to the posterior mean and variance, applied element-wise, i.e., $F(z, \rho, \Lambda_{u,q}^\text{prior}) = \mathbb{E}_q(S \mid z = S + \rho^{-1/2} N)$ with $N \sim \mathcal{CN}(0, 1)$, where the prior $p(S)$ can be derived from the a-priori LLRs $\Lambda_{u,q}^\text{prior}$; $G$ can be derived similarly (see [3] for the details).
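To make the posterior mean and variance functions concrete, the following Python sketch evaluates both for a discrete constellation observed through the scalar channel $z = S + \rho^{-1/2} N$, $N \sim \mathcal{CN}(0, 1)$. The function name `posterior_mean_var`, the explicit symbol-prior argument, and the QPSK example are assumptions for illustration; this is the exact symbol-domain computation, not the simplified bit-domain version implemented in hardware.

```python
import math

def posterior_mean_var(z, rho, points, prior=None):
    """Posterior mean and variance of S given z = S + rho**-0.5 * N, N ~ CN(0, 1).

    points : complex constellation points
    prior  : strictly positive symbol probabilities p(S); uniform if omitted
    """
    if prior is None:
        prior = [1.0 / len(points)] * len(points)
    # Log-weights log p(s) - rho * |z - s|^2, shifted by the max for stability
    logw = [math.log(p) - rho * abs(z - s) ** 2 for s, p in zip(points, prior)]
    m = max(logw)
    w = [math.exp(v - m) for v in logw]
    total = sum(w)
    w = [v / total for v in w]
    mean = sum(wi * s for wi, s in zip(w, points))
    var = sum(wi * abs(s) ** 2 for wi, s in zip(w, points)) - abs(mean) ** 2
    return mean, max(var, 0.0)  # clamp tiny negative rounding errors

# Unit-energy QPSK constellation used as an example.
QPSK = [complex(a, b) / math.sqrt(2) for a in (-1, 1) for b in (-1, 1)]
mean0, var0 = posterior_mean_var(0j, 1.0, QPSK)          # uninformative observation
mean1, var1 = posterior_mean_var(QPSK[0], 100.0, QPSK)   # high-SINR observation
```

An uninformative observation returns the prior mean and variance, whereas a high-SINR observation collapses the estimate onto a constellation point with vanishing variance.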

### III. VLSI ARCHITECTURE

Fig. 1 depicts the top-level architecture of the LAMA data detector. LAMA performs two main tasks per iteration: The first task estimates the mean and variance (MV) of the data transmitted by each UE; the second task cancels interference (IC) among the UEs. Both tasks are detailed below. To maximize throughput, two independent detection problems are processed simultaneously in a pipeline-interleaved manner, i.e., one problem per task. The two main processing units, namely MV and IC, perform the assigned computations in $T_s$ clock cycles, and the results of both units are exchanged for further processing in the subsequent iteration. In the last iteration ($t = t_{\text{max}}$), the outputs from the IC unit are sent to the LLR computation unit, which takes $T_{\text{LLR}}$ clock cycles. Thus, LAMA delivers a new set of $UQ$ LLR values at a sustained throughput of

$$
\Theta = \frac{UQ}{t_{\text{max}} T_s + T_{\text{LLR}}} f_{\text{clk}} \quad \text{[bit/s].}
$$

The final design supports 32 UEs, which requires $T_s = 36$ clock cycles and $T_{\text{LLR}} = 1$ clock cycle.
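For concreteness, the throughput expression can be evaluated numerically. In the sketch below, $Q = 8$ bits per symbol follows from 256-QAM and $f_{\text{clk}} = 400$ MHz is the measured clock frequency reported in Section IV. The value $t_{\text{max}} = 8$ is an assumption, chosen because it is consistent with the reported 354 Mb/s; it is not stated explicitly in the text.

```python
def throughput_bps(U, Q, t_max, T_s, T_llr, f_clk_hz):
    """Sustained throughput in bit/s: U*Q LLRs every (t_max*T_s + T_llr) cycles."""
    return U * Q * f_clk_hz / (t_max * T_s + T_llr)

theta = throughput_bps(U=32, Q=8, t_max=8, T_s=36, T_llr=1, f_clk_hz=400e6)
print(f"{theta / 1e6:.1f} Mb/s")
```

Under these assumptions, one detection result of 256 LLRs is delivered every 289 clock cycles.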

**A. Mean and Variance Estimation (MV) Unit**

In the first task (line 5 in Algorithm 1), the MV unit receives estimates of the UEs’ data and the associated SINR to compute mean and variance values. As shown in Fig. 2, a straightforward MV unit would require a large number of multipliers. Although statistical independence of the real and imaginary parts of the transmitted constellation points simplifies computation from $M^2$-QAM to two $M$-PAM constellations, mean and variance computation for 16-PAM (to support 256-QAM) still requires 16 function units and a division, resulting in high complexity. Furthermore, the intrinsic LLR values obtained from the channel decoder must be transformed from bit-domain to symbol-domain for SISO processing. To reduce complexity, existing ASICs [5]–[7] use hard-symbol clipping or linear approximations, and do not provide support for SISO processing. However, accurate message mean and variance computation is key to supporting realistic channels and systems with a comparable number of UEs and BS antennas.

To accurately compute the message mean and variance at low complexity, we (i) compute all quantities in the bit-domain, (ii) exploit Gray-mapping symmetries, and (iii) use the max-log approximation. The conversion into the bit-domain and the max-log approximation require only 4 log-likelihood functions for 16-PAM, instead of 16 functions in the symbol domain. In addition, we avoid the need for a division per UE by using a LUT-based $\tanh(\cdot)$ function as in [8] with 7 input bits.

The resulting architecture is depicted in Fig. 3 and naturally supports SISO processing for iterative MIMO decoding. Our simulations in Section IV for various antenna configurations and channel models show that our approach entails a negligible performance loss at around $4\times$ lower area.
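The role of a LUT-based $\tanh(\cdot)$ can be illustrated with a short Python sketch. For a bit mapped to $\pm 1$ with LLR $\Lambda$, the mean is $\tanh(\Lambda/2)$, so a table lookup can replace the division that an explicit probability normalization would require. Only the 7-bit address width is taken from the text; the clipping range `LLR_MAX`, the uniform quantization, the sign handling, and the function name `soft_bit` are assumptions.

```python
import math

IN_BITS = 7                      # LUT address width, as stated in the text
LLR_MAX = 8.0                    # assumed clipping range for |LLR|
STEP = LLR_MAX / 2 ** IN_BITS    # assumed uniform quantization step

# Entry i stores tanh(x / 2) at the midpoint of the i-th magnitude bin.
TANH_LUT = [math.tanh(0.5 * (i + 0.5) * STEP) for i in range(2 ** IN_BITS)]

def soft_bit(llr):
    """Approximate mean tanh(llr / 2) of a +/-1 bit via table lookup."""
    idx = min(int(abs(llr) / STEP), 2 ** IN_BITS - 1)  # clip to the last entry
    value = TANH_LUT[idx]
    return value if llr >= 0 else -value               # odd symmetry of tanh

approx = soft_bit(1.7)
exact = math.tanh(1.7 / 2)
```

With these assumed parameters, the lookup error stays below the quantization step, which suggests that a small table is adequate for this purpose.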

B. Interference Cancellation (IC) Unit

In the second task (line 6 in Algorithm 1), the IC unit performs interference cancellation and updates the SINR.

1) 32-MAC matrix-vector multiplication: Interference cancellation requires a $32 \times 32$ complex-valued matrix-vector multiplication, which we compute sequentially in 32 clock cycles using a linear array of 32 complex-valued multiply-accumulate (MAC) units in a column-by-column fashion. To minimize the critical path caused by the large fan-out of a conventional linear array of MAC units, we use a simplified version of Cannon’s algorithm [10], shown in Fig. 4, which circularly shifts the array’s input vector while sequentially processing rows of the matrix over multiple clock cycles; this reduces the vector memory fan-out from 32 MAC units to one MAC unit and a register. To further reduce the critical path and simplify placement, each row of the Gram matrix is stored next to each MAC unit with standard-cell-based latch-arrays.
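The cyclic-shift schedule can be modeled in software as follows. In this Python sketch, MAC unit $u$ holds row $u$ of the matrix and a single vector register; in every cycle, each unit multiplies its register with the matching matrix entry and then passes the register content to its neighbor. The exact skew and shift direction used in the ASIC are not specified in the text, so the indexing below is an assumption, as is the function name `cyclic_matvec`.

```python
def cyclic_matvec(A, s):
    """Matrix-vector product A @ s using a circularly shifted input vector.

    Models U MAC units over U cycles; unit u only reads its own register
    and its own row of A, so no value is broadcast to all units.
    """
    U = len(s)
    acc = [0j] * U
    regs = list(s)                       # unit u initially holds s[u]
    for cycle in range(U):
        for u in range(U):
            col = (u + cycle) % U        # vector entry currently held by unit u
            acc[u] += A[u][col] * regs[u]
        regs = regs[1:] + regs[:1]       # circular shift to the neighboring unit
    return acc

# Small 4 x 4 example checked against the direct product.
A = [[complex(i + 1, j) for j in range(4)] for i in range(4)]
s = [1 + 0j, 2j, -1 + 1j, 0.5 + 0j]
result = cyclic_matvec(A, s)
direct = [sum(A[u][k] * s[k] for k in range(4)) for u in range(4)]
```

Each unit sees every vector entry exactly once over $U$ cycles, so the result matches the direct product while every register drives only one MAC unit and the next register.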

2) SINR computation: The post-equalization SINR is computed in parallel using a Newton-Raphson (NR) reciprocal unit [8]. We first shift the input $x$ according to $\bar{x} = 2^\alpha x$, $\alpha \in \mathbb{Z}$ so that $\bar{x} \in [0.5, 1)$, resulting in high numerical stability. Based on an initial guess obtained from a look-up table, a single NR iteration is sufficient to compute $\bar{y}_1 \approx \bar{x}^{-1}$; the final result $y = 2^\alpha \bar{y}_1$ corresponds to an approximation of $x^{-1}$.
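A floating-point model of this reciprocal computation is sketched below in Python. The normalization to $[0.5, 1)$, the single NR iteration $\bar{y}_1 = \bar{y}_0 (2 - \bar{x} \bar{y}_0)$, and the final rescaling follow the description above; the look-up table size (`LUT_BITS = 4`) and its midpoint initialization are assumptions, and fixed-point effects are not modeled.

```python
import math

LUT_BITS = 4  # assumed table size: 16 entries covering [0.5, 1)

# Entry i stores the reciprocal of the midpoint of the i-th sub-interval.
RECIP_LUT = [1.0 / (0.5 + (i + 0.5) / 2 ** (LUT_BITS + 1))
             for i in range(2 ** LUT_BITS)]

def nr_reciprocal(x):
    """Approximate 1/x for x > 0 with one Newton-Raphson iteration."""
    if x <= 0:
        raise ValueError("x must be positive")
    x_bar, e = math.frexp(x)            # x = x_bar * 2**e with x_bar in [0.5, 1)
    alpha = -e                          # so that x_bar = 2**alpha * x
    idx = int((x_bar - 0.5) * 2 ** (LUT_BITS + 1))
    y0 = RECIP_LUT[idx]                 # initial guess from the look-up table
    y1 = y0 * (2.0 - x_bar * y0)        # single Newton-Raphson step
    return math.ldexp(y1, alpha)        # y = 2**alpha * y1 approximates 1/x

approx = nr_reciprocal(37.5)
```

Because each NR step squares the relative error of the initial guess, a 16-entry table already yields a relative error below $10^{-3}$ in this model.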

IV. IMPLEMENTATION RESULTS AND COMPARISON

Figs. 5 and 6 show the packet error rate (PER) of our LAMA ASIC in comparison with the linear minimum mean-squared error (MMSE) equalizer and the channel-hardening-exploiting message passing (CHEMP) algorithm [4]. The number of algorithm iterations is indicated after the dash; e.g., LAMA–14 represents LAMA with 14 iterations. Outer iterations over the channel decoder are shown as either none ($I = 0$; solid lines) or one ($I = 1$; dashed lines) iteration. We simulate an LTE-based massive MU-MIMO-OFDM system at $f_c = 2$ GHz with 1200 active subcarriers and per-user convolutional coding with rate $R$. We use two channel models: (a) Rayleigh fading and (b) WINNER II typical urban micro [11] to model a realistic propagation environment. For a typical $256 \times 32$ ($B \times U$) massive MU-MIMO scenario, LAMA achieves the same performance as linear MMSE, but avoids a matrix inversion; CHEMP suffers an error floor above 10% PER. For the challenging $32 \times 32$ system, LAMA significantly outperforms the linear MMSE detector, achieving an SNR improvement of more than 11 dB for the typical urban micro channel; CHEMP fails to successfully detect packets. Extensive numerical simulations have been carried out to determine the ASIC’s fixed-point parameters; the implemented design achieves near-floating-point performance.

A. Implementation Results

Fig. 7 shows a micrograph of the fabricated and fully-functional 28nm CMOS ASIC with the LAMA detector core highlighted. The LAMA ASIC occupies only $0.37 \text{ mm}^2$; the rest of the chip contains unrelated designs. The clock signal was generated by a VLSI test system and directly fed into the ASIC. At a nominal supply of 0.9 V and a temperature of 300 K, the ASIC reaches a maximum measured clock frequency of 400 MHz at 151 mW, which results in $354 \text{ Mb/s}$ for 32 UEs transmitting 256-QAM.
Fig. 8 shows the measured energy efficiency in pJ/bit obtained via voltage-frequency scaling. By reducing the supply close to the threshold voltage, the detector achieves its best energy efficiency: at 0.35 V, it reaches 123 pJ/bit at a throughput of 2.60 Mb/s. If maximum throughput is desired, one can increase the supply to 1.15 V and obtain 511 Mb/s at 670 pJ/bit.

Table I compares LAMA to state-of-the-art massive MU-MIMO data detectors. Our LAMA ASIC achieves more than 4× higher normalized area efficiency than [7], which computes a matrix inversion. Although LAMA achieves lower area efficiency (in Gb/s/mm$^2$) than the detectors in [5], [6], these designs suffer an error floor higher than LTE specifications under realistic channel conditions (cf. Figs. 5 and 6). We note that the nominal energy efficiency is inferior to other designs due to increased arithmetic precision requirements in support of realistic channel conditions and symmetric massive MU-MIMO systems. To the best of our knowledge, the proposed LAMA ASIC is the first silicon prototype of a 32-UE massive MU-MIMO data detector that provides near-optimal error rates under realistic propagation conditions and for symmetric systems. Both of these advantages are critical to BS providers as one can support up to 32 UEs with relatively small ($B \geq 32$) BS antenna arrays under realistic channel conditions.

### Table I: Performance Summary and ASIC Comparison

<table>
<thead>
<tr>
<th></th>
<th>This work</th>
<th></th>
<th></th>
</tr>
</thead>
<tbody>
<tr>
<td>Max UEs</td>
<td>32</td>
<td>32</td>
<td>8</td>
</tr>
<tr>
<td>Algorithm</td>
<td>LAMA</td>
<td>CHEMP</td>
<td>CHEMP</td>
</tr>
<tr>
<td>Modulation</td>
<td>256-QAM</td>
<td>256-QAM</td>
<td>QPSK</td>
</tr>
<tr>
<td>Realistic channels</td>
<td>yes</td>
<td>no</td>
<td>yes</td>
</tr>
<tr>
<td>Technology [nm]</td>
<td>28</td>
<td>40</td>
<td>40</td>
</tr>
<tr>
<td>Supply [V]</td>
<td>0.9</td>
<td>0.9</td>
<td>0.9</td>
</tr>
<tr>
<td>Area [mm$^2$]</td>
<td>0.37</td>
<td>0.58</td>
<td>0.076</td>
</tr>
<tr>
<td>Frequency [MHz]</td>
<td>400</td>
<td>425</td>
<td>500</td>
</tr>
<tr>
<td>Power [mW]</td>
<td>151</td>
<td>220.6</td>
<td>77.9</td>
</tr>
<tr>
<td>Throughput [Gb/s]</td>
<td>0.354</td>
<td>2.76</td>
<td>8</td>
</tr>
<tr>
<td>Energy$^c$ [pJ/b]</td>
<td>426</td>
<td>79.9</td>
<td>9.74</td>
</tr>
<tr>
<td>Area Eff.$^d$ [Gb/s/mm$^2$]</td>
<td>0.95</td>
<td>4.76</td>
<td>105.26</td>
</tr>
<tr>
<td>Norm. Area Eff.$^e$ [Gb/s/mm$^2$]</td>
<td>0.95</td>
<td>13.87</td>
<td>19.18</td>
</tr>
</tbody>
</table>

$^a$expectation-propagation, $^b$soft-output support only, $^c$energy efficiency is power/throughput, $^d$area efficiency is throughput/area, $^e$technology normalized to 28nm, $V_{dd}=0.9$ V, normalized by (U/32)$^2$.
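The derived rows of Table I can be checked with a few lines of Python. Energy per bit and area efficiency follow directly from the footnote definitions. For the normalized area efficiency, the footnote does not state the technology scaling law; the sketch assumes that area efficiency scales with the cube of the feature-size ratio, an assumption that reproduces the tabulated values, combined with the stated $(U/32)^2$ factor. All function names are hypothetical.

```python
def energy_pj_per_bit(power_mw, throughput_gbps):
    """Energy per bit: mW / (Gb/s) equals pJ/bit."""
    return power_mw / throughput_gbps

def area_efficiency(throughput_gbps, area_mm2):
    """Area efficiency in Gb/s/mm^2."""
    return throughput_gbps / area_mm2

def normalized_area_efficiency(throughput_gbps, area_mm2, tech_nm, users,
                               ref_nm=28.0, ref_users=32):
    """Area efficiency scaled to 28nm and 32 UEs (cubic scaling is assumed)."""
    scale = (tech_nm / ref_nm) ** 3          # assumed technology scaling law
    return (area_efficiency(throughput_gbps, area_mm2) * scale
            * (users / ref_users) ** 2)

# Rows of Table I: this work, the 40nm 32-UE design, and the 40nm 8-UE design.
e_lama = energy_pj_per_bit(151.0, 0.354)
n_32ue = normalized_area_efficiency(2.76, 0.58, tech_nm=40, users=32)
n_8ue = normalized_area_efficiency(8.0, 0.076, tech_nm=40, users=8)
```

The normalization matters most for the 8-UE design, whose raw area efficiency shrinks by a factor of 16 once it is referred to a 32-UE system.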

### References


