Abstract—Massive multiuser (MU) multiple-input multiple-output (MIMO) will be a core technology in fifth-generation (5G) wireless systems as it offers significant improvements in spectral efficiency compared to existing multi-antenna technologies. The presence of hundreds of antenna elements at the base station (BS), however, results in excessively high hardware costs and power consumption, and requires high interconnect throughput between the baseband-processing unit and the radio unit. Massive MU-MIMO that uses low-resolution analog-to-digital and digital-to-analog converters (DACs) has the potential to address all these issues. In this paper, we focus on downlink precoding for massive MU-MIMO systems with 1-bit DACs at the BS. The objective is to design precoders that simultaneously mitigate multi-user interference (MUI) and quantization artifacts. We propose two nonlinear 1-bit precoding algorithms and corresponding very-large scale integration (VLSI) designs. Our algorithms rely on biconvex relaxation, which enables the design of efficient 1-bit precoding algorithms that achieve superior error-rate performance compared to that of linear precoding algorithms followed by quantization. To showcase the efficacy of our algorithms, we design VLSI architectures that enable efficient 1-bit precoding for massive MU-MIMO systems in which hundreds of antennas serve tens of user equipments. We present corresponding field-programmable gate array (FPGA) implementations to demonstrate that 1-bit precoding enables robust and high-rate downlink data transmission in practical systems.

Index Terms—Biconvex relaxation, digital-to-analog converter (DAC), field-programmable gate array (FPGA), massive multi-user multiple-input multiple-output (MU-MIMO), precoding, quantization, very large-scale integration (VLSI).

I. INTRODUCTION

Massive multiuser (MU) multiple-input multiple-output (MIMO) is widely believed to be a core technology in fifth-generation (5G) wireless systems as it enables substantial improvements in spectral efficiency and reliability compared to traditional, small-scale MIMO technology [2]–[4]. These advantages are a result of equipping the base station (BS) with hundreds or thousands of antennas, which enables fine-grained beamforming to serve tens of user equipments (UEs) in the same time-frequency resource. However, the large number of antenna elements and radio frequency (RF) chains at the BS results in a significant increase in hardware complexity, system costs, and circuit power consumption. Furthermore, massive MU-MIMO requires high interconnect and chip input/output (I/O) bandwidth between the baseband-processing unit at the BS and the radio units [5]. As a consequence, the successful deployment of this technology in 5G wireless systems requires novel design approaches that jointly reduce system costs, power consumption, and interconnect bandwidth without degrading the spectral efficiency and reliability.

A. Massive MU-MIMO with 1-bit DACs

We consider the massive MU-MIMO downlink in which the BS is equipped with 1-bit digital-to-analog converters (DACs) and transmits data to multiple UEs in the same time-frequency resource. In traditional multi-antenna BSs, each RF port is connected to a pair of high-resolution DACs (e.g., with 10-bit precision). Scaling such architectures to massive MIMO BSs, with hundreds or thousands of antennas would result in prohibitively high power consumption and system costs. The deployment of 1-bit DACs at the BS mitigates this problem. In addition, the use of 1-bit DACs enables one to lower the linearity and noise requirements of the surrounding RF circuitry, which has the potential to additionally reduce the RF circuit power. Another benefit of using 1-bit DACs is the fact that lowering their resolution also reduces the fronthaul bandwidth between the baseband-processing unit and the radio unit. This aspect is of practical relevance for deployment scenarios in which these two units are separated by a large distance.

The key challenges of 1-bit massive MU-MIMO systems are to maintain high spectral efficiency and reliability. As shown in [1], [6]–[9], the use of 1-bit DACs in the downlink enables reliable data transmission if sophisticated precoding algorithms that simultaneously mitigate multi-user interference (MUI) and quantization artifacts are used. While conventional linear precoding methods, such as zero-forcing (ZF) or minimum mean-square error (MMSE) precoding followed by quantization, require high computational complexity [10]–[13], more sophisticated, nonlinear methods are necessary to enable reliable communication at high spectral efficiency. Such precoding methods, however, typically require high computational complexity. As a consequence, a successful deployment of 1-bit massive MU-MIMO calls for the design of novel and efficient precoding algorithms that can be implemented in hardware and reliably achieve high throughput.

O. Castañeda and C. Studer are with the School of Electrical and Computer Engineering, Cornell University, Ithaca, NY (e-mail: ocast6@cornell.edu, studer@cornell.edu).

S. Jacobsson is with Ericsson Research and Chalmers University of Technology, Gothenburg, Sweden (e-mail: sven.jacobsson@ericsson.com).

G. Durisi is with Chalmers University of Technology, Gothenburg, Sweden (e-mail: giuseppe.durisi@chalmers.se).

M. Coldrey is with Ericsson Research, Gothenburg, Sweden (e-mail: mikael.coldrey@ericsson.com).

T. Goldstein is with the Department of Computer Science, University of Maryland, College Park, MD (e-mail: tmg@cs.umd.edu).

The C1PO algorithm implemented in this paper builds upon the 1-bit precoding algorithm to be presented at the IEEE International Conference on Acoustics, Speech, and Signal Processing [1]; in contrast to the algorithm in [1], C1PO directly operates in the complex domain, comes with convergence guarantees, and can be implemented efficiently in VLSI.

A system simulator for 1-bit precoding in massive MU-MIMO systems will be made available on GitHub after (possible) acceptance of the paper.
B. Contributions

In this paper, we develop novel, computationally efficient precoding algorithms for 1-bit massive MU-MIMO systems and corresponding very-large scale integrated (VLSI) designs. Our main contributions can be summarized as follows:

- We use biconvex relaxation (BCR) \textsuperscript{14} to design a non-linear 1-bit precoding algorithm. Our algorithm, referred to as C1PO (short for biConvex 1-bit PrecOding), enables reliable, high-rate downlink transmission in 1-bit massive MU-MIMO systems for medium-sized antenna arrays.
- We propose a scalable and low-complexity algorithm variant, referred to as C2PO, which enables high-performance 1-bit precoding for massive MU-MIMO systems with hundreds or thousands of antenna elements.
- For C1PO and C2PO, which both solve nonconvex problems, we provide analytical convergence guarantees.
- We develop two massively parallel VLSI architectures that implement C1PO and C2PO, and achieve high throughputs in a hardware-efficient manner. Our architectures support various BS and UE antenna configurations.
- We present reference designs on an Xilinx Virtex-7 field-programmable gate array (FPGA) for various antenna configurations that demonstrate the efficacy of our algorithms and VLSI architectures.
- We compare our designs to a baseline precoder that uses maximum ratio combining (MRC-Q), a method that achieves high hardware-efficiency at the cost of poor error-rate performance.
- We study the trade-offs between error-rate performance and hardware efficiency (in terms of throughput per area) for the proposed FPGA designs.

Our results demonstrate the practical feasibility of 1-bit precoding in massive MU-MIMO systems, supporting reliable and high-rate downlink data transmission.

C. Relevant Prior Art

A number of papers have studied the use of low-resolution analog-to-digital converters (ADCs) for the massive MU-MIMO uplink (UEs transmit data to the BS) with particular focus on the 1-bit case; see, e.g., \textsuperscript{[5–9]} and the references therein. All these results have shown that the use of 1-bit ADCs is sufficient for reliable low-rate uplink transmission and that 4-to-6 bits are sufficient to close the gap to the infinite-precision case in most scenarios. In contrast to the uplink, the quantized downlink has gained attention only recently. The results in \textsuperscript{[10–13]} have shown that so-called linear-quantized precoders, which perform traditional linear precoding followed by quantization, enable reliable uplink transmission for very large BS antenna arrays in the high signal-to-noise ratio (SNR) regime, even for systems that use 1-bit DACs. More sophisticated, nonlinear precoding algorithms have been proposed only recently in \textsuperscript{[1, 6–9]} and significantly outperform linear-quantized methods in the case of 1-bit DACs. The computational complexity of these algorithms, however, is typically high, which prevents an efficient implementation in practical systems. In contrast to these precoding methods, we propose two novel nonlinear precoding algorithms and VLSI designs that achieve high throughput in a hardware-efficient manner.

While a large number of VLSI designs for data detection in the massive MU-MIMO uplink have been proposed in the literature (see, e.g., \textsuperscript{[20–23]} and references therein), only a handful of precoder designs for multi-antenna downlink systems exist \textsuperscript{[24–26]}. Reference \textsuperscript{[24]} proposes a VLSI design for vector-perturbation precoding in small-scale MIMO systems with infinite-precision DACs. The papers \textsuperscript{[25]} and \textsuperscript{[26]} discuss hardware implementations for approximate linear and ZF/MRC-based precoding, respectively, for massive MU-MIMO systems with infinite-precision DACs. Unfortunately, both of these publications do not provide detailed FPGA implementation results. Hence, to the best of our knowledge, the VLSI designs proposed in this paper are the first hardware implementations reported in the open literature that are suitable for precoding in 1-bit massive MU-MIMO systems.

D. Notation

Lowercase and uppercase boldface letters designate column vectors and matrices, respectively. For a matrix \textbf{A}, we denote its transpose, Hermitian transpose, conjugate function, and matrix \(\ell_2\)-norm by \(\textbf{A}^T\), \(\textbf{A}^H\), \(\textbf{A}^\ast\), and \(\|\textbf{A}\|_2\), respectively; the entry on the \(k\)-th row and on the \(\ell\)-th column of \(\textbf{A}\) is \([\textbf{A}]_{k,\ell}\). The \(M \times M\) identity matrix is denoted by \(\textbf{I}_M\) and the \(M \times N\) all-zeros matrix is denoted by \(\textbf{0}_{M \times N}\). For a vector \textbf{a}, the \(k\)-th entry is \([\textbf{a}]_k\) and we use \([\|\textbf{a}\|]_2\) to denote the \(\ell_2\)-norm of the vector \textbf{a}. The real and imaginary parts of a complex vector \textbf{a} are \(\Re\{\textbf{a}\}\) and \(\Im\{\textbf{a}\}\), respectively. The signum function \(\text{sgn}(\cdot)\) is defined as \(\text{sgn}(a) = +1\) for \(a \geq 0\) and \(\text{sgn}(a) = -1\) for \(a < 0\) and is applied entry-wise to vectors. The multivariate complex-valued circularly-symmetric Gaussian probability density function (PDF) with covariance matrix \(\textbf{K}\) is denoted by \(\mathcal{CN}(\textbf{0}, \textbf{K})\). We use \(\mathbb{E}_x[\cdot]\) to denote expectation with respect to the random vector \(x\).

E. Paper Outline

The rest of this paper is organized as follows. In Section \textbf{II} we introduce the system model and formulate the precoding problem for systems with 1-bit DACs. In Section \textbf{III} we propose two new 1-bit precoding algorithms, namely C1PO and C2PO. In Section \textbf{IV} and Section \textbf{V} we detail our VLSI architectures for C1PO and C2PO, respectively. In Section \textbf{VI} we show numerical simulations, reference FPGA implementation results, and a comparison with an MRC-based baseline precoder. We conclude the paper in Section \textbf{VII}.

II. System Model and 1-bit Precoding

We start by introducing the downlink system model and then provide the necessary details about optimal precoding in 1-bit massive MU-MIMO systems.

A. Downlink System Model

We focus on the downlink of a single-cell, narrowband massive MU-MIMO system as illustrated in Figure \textbf{1}.

The
system consists of a $B$-antenna BS that serves $U \leq B$ single-antenna UEs simultaneously and in the same frequency band. We use the standard input-output relation $y = Hx + n$ to model the narrowband downlink channel [2]. Here, the vector $y = [y_1, \ldots, y_U]^T$ contains the received signals at all UEs, where $y_u \in \mathbb{C}$ is the signal received at the $u$th UE. The matrix $H \in \mathbb{C}^{U \times B}$ represents the downlink channel. The so-called precoded vector is denoted by $x \in \mathcal{X}^B$, where $\mathcal{X}$ represents the transmit alphabet; this set coincides with the set $\mathbb{C}$ of complex numbers in the case of infinite-precision DACs. In 1-bit massive MU-MIMO systems, the in-phase and quadrature components are quantized separately using a pair of 1-bit DACs running at Nyquist rate and hence, the per-antenna quaternary transmit alphabet is $\mathcal{X} = \{\pm \ell, \ell j\}$ for a given (and fixed) $\ell > 0$ that determines the transmit power. The vector $n \in \mathbb{C}^U$ models i.i.d. circularly-symmetric complex Gaussian noise with variance $N_0$ per complex entry, i.e., $n_u \sim CN(0, N_0)$, for $u = 1, \ldots, U$. In what follows, we assume that the realization of the channel matrix $H$ and the noise variance $N_0$ are perfectly known at the BS.

**B. Precoding Basics**

The main purpose of precoding is to transmit constellation points $s_u \in \mathcal{O}$ to each UE $u = 1, \ldots, U$, where $\mathcal{O}$ is the constellation set (e.g., 16-QAM). The BS uses the available channel state information (CSI) to precode the symbol vector $s = [s_1, \ldots, s_U]^T$ into the precoded vector $x \in \mathcal{X}^B$. Throughout the paper, we assume that the precoded vector $x$ must satisfy an instantaneous power constraint $\|x\|^2_2 = P$; this leads to $\mathcal{X} = \{\pm \ell, \ell j\}$ with $\ell = \sqrt{P/(2B)}$.

Coherent transmission of data using multiple BS antennas leads to an array gain, which depends on the realization of the fading channel and the precoding method. As in [6], [7], we assume that the $u$th UE is able to rescale its received signals $y_u$ by a factor $\beta_u \in \mathbb{C}$ in order to compute an estimate $\hat{s}_u = \beta_u y_u$ for $u = 1, \ldots, U$ of the transmitted symbol $s_u \in \mathcal{O}$.

Since the UEs cannot perform joint processing to recover the transmitted data, precoding must simultaneously reduce MUI and increase signal power at all UEs [27]. To accomplish these goals, there exist multiple formulations of this optimization problem based on different performance metrics, e.g., sum-rate throughput or error rate (see [28] for a survey). As in [6], [7], we will focus exclusively on precoders that minimize the mean-square error (MSE) between the estimated symbol vector $\hat{s} = [\hat{s}_1, \ldots, \hat{s}_U]^T = \beta y$ and the transmitted symbol vector $s$ given by

$$E_n[\|s - \hat{s}\|^2_2] = \|s - \beta Hx\|^2_2 + |\beta|^2 U N_0, \tag{1}$$

where we restrict ourselves to the case in which the precoder results in the same precoding factor $\beta$ for all UEs. Hence, in the remainder of this paper we shall assume that $\beta_u = \beta$ for $u = 1, \ldots, U$. In [7] it is shown that the UEs are able to accurately estimate the precoding factor $\beta$ using pilot-based transmission in block-fading scenarios.

In the infinite-precision case, an MSE-optimal linear precoder multiplies the symbol vector $s$ with a precoding matrix $P \in \mathbb{C}^{B \times U}$ so that [1] is minimized on average over all possible transmit vectors $s$ subject to the power constraint. This problem, which has been studied extensively for the case of finite-precision DACs [29], [30], enables the design of low-complexity linear precoding algorithms [2].

**C. MSE-Optimal 1-bit Precoding Problem**

In the 1-bit case, linear-quantized precoders perform first linear precoding and then quantize the result to the finite transmit set $\mathcal{X}^B$ as

$$x = \sqrt{\frac{P}{2B}} \left(\text{sgn}(\Re\{Ps\}) + j \text{sgn}(\Im\{Ps\})\right)$$

for a given precoding matrix $P$. Linear-quantized precoders can be analyzed theoretically and typically exhibit low complexity [6]. However, as recently shown in [1], [6] - [9], significant performance improvements can be obtained by using sophisticated nonlinear precoding methods.

Fig. 1. Overview of a massive MU-MIMO downlink system with 1-bit DACs. Left: $B$ antenna massive MU-MIMO BS containing a 1-bit precoder that mitigates multi-user interference and quantization artifacts in the 1-bit DACs; Right: $U$ single-antenna UEs.

---

1For simplicity, we focus on single-antenna UEs; the model can easily be expanded to support multi-antenna UEs.

2Knowledge of $H$ is typically acquired via training in the uplink in a time-division duplexing system [2]. As discussed in [6], channel estimation errors incur only a small performance loss. Knowledge of the noise variance $N_0$ at the BS can be obtained by explicit feedback from the UEs to the BS.

3In contrast to references [6], [7], which assumed a real-valued factors $\beta_u$, $u = 1, \ldots, U$, we allow these factors to be complex-valued.
We emphasize that for a fixed value of \( \beta \), the problem (OPP) is a closest vector problem that is known to be NP-hard \([31]–[33]\); this implies that there exists no known algorithm to solve it efficiently for large values of \( B \). In \([6, 7]\), approximate methods for solving (2) using convex relaxation have been proposed, such as the squared-infinity norm Douglas-Rachford splitting (SQUID) algorithm. Such relaxation-based methods, however, still require high computational complexity, which prevents their deployment in practical systems.

III. 1-bit Precoding via Biconvex Relaxation

Since the problem (OPP) is of combinatorial nature, a brute-force search for a solution is intractable in massive MU-MIMO systems with hundreds of BS antennas. We next propose two nonlinear precoding algorithms that yield approximate but accurate solutions at low computational complexity.

A. Approximating (OPP)

To solve (OPP) efficiently, we use the BCR framework put forward in \([14]\), which was initially proposed for solving large semidefinite programs that appear in computer vision. In order to use this framework, we first simplify the objective function of (OPP) by assuming that \( N_0 \to 0 \), i.e., we assume that the system operates in the high-SNR regime. As in \([1, Eq. 3]\), we take a leap of faith with the approximation

\[
\min_{x \in \mathcal{X}^B} \| s - \beta H x \|_2^2 \approx \min_{x, \alpha \in \mathcal{X}^B} \| \alpha s - H x \|_2^2. 
\]

This approximation can be justified by noting that if we can find a precoded vector \( x \in \mathcal{X}^B \) for which \( s = H x \), then both problems in (3) are equivalent. These approximations allow us to rewrite (OPP) as follows:

\[
\text{OPP}^* \quad \hat{x} = \arg \min_{x \in \mathcal{X}^B} \| \alpha s - H x \|_2^2. 
\]

We next get rid of the parameter \( \alpha \) in (OPP*). For a fixed \( x \), the optimal parameter \( \hat{\alpha}(x) \) that minimizes the objective function of (OPP*) is given by

\[
\hat{\alpha}(x) = \arg \min_{\alpha \in \mathcal{C}} \| \alpha s - H x \|_2^2 = \frac{s^H H x}{\| s \|_2^2}. 
\]

By inserting \( \hat{\alpha}(x) \) into the objective function of (OPP*), we obtain

\[
\| \hat{\alpha}(x)s - H x \|_2^2 = \| A x \|_2^2 
\]

with

\[
A = QH \quad \text{and} \quad Q = I_U - \frac{ss^H}{\| s \|_2^2},
\]

where the matrix \( Q \in \mathbb{C}^{U \times U} \) is a projection onto the orthogonal complement of the space spanned by the symbol vector \( s \). Using (4), the problem (OPP*) can be simplified to

\[
\text{OPP}^* \quad \hat{x} = \arg \min_{x \in \mathcal{X}^B} \| A x \|_2^2, 
\]

which remains to be a closest vector problem. Nevertheless, the specific form of (OPP*) enables us to use BCR to efficiently compute approximate but accurate solutions.

B. Biconvex Relaxation (BCR)

To solve (OPP*) using BCR, we first introduce a copy \( z \) of the vector \( x \), and replace (OPP*) with the approximation

\[
\hat{x} = \arg \min_{x \in \mathcal{X}^B, \, z \in \mathcal{C}^B} \| A z \|_2^2 + \gamma \| z - x \|_2^2, 
\]

where \( \gamma > 0 \) is a (fixed) regularization parameter. We next relax the nonconvex alphabet constraint \( x \in \mathcal{X}^B \) to its convex envelope given by

\[
B^B = \left\{ e \in \mathbb{C}^B \mid \| R \{ c_b \} \|_2 \leq \sqrt{\frac{P}{2B}}, \| \Im \{ c_b \} \|_2 \leq \sqrt{\frac{P}{2B}}, \, b = 1, \ldots, B \right\},
\]

which allows us to convexify the precoding problem as follows:

\[
\hat{x} = \arg \min_{x \in B^B, \, z \in \mathcal{C}^B} \| A z \|_2^2 + \gamma \| z - x \|_2^2. 
\]

Unfortunately, solving this problem yields, in general, the all-zeros vector, i.e., \( x = 0_{B \times 1} \). One of the key ideas of BCR is to force the solution of this new problem to satisfy the constraints in (6) with equality. This can be accomplished by including a nonconvex regularization term in the objective that promotes large values of \( x \). As suggested in \([14]\), we use a negative \( \ell_2 \)-norm term to obtain the following biconvex relaxation (BCR) optimization problem:

\[
\text{BCR} \quad \hat{x}^\text{BCR} = \arg \min_{x \in B^B, \, z \in \mathcal{C}^B} \| A z \|_2^2 + \gamma \| z - x \|_2^2 - \delta \| x \|_2^2, 
\]

where \( \delta > 0 \) is a (fixed) regularization parameter. When \( \delta < \gamma \), the formulation (7) is \( \ell_2 \)-bi-convex (i.e., the minimization with respect to \( x \) is convex when \( z \) is fixed, and vice versa). Robust parameter choices are \( \gamma = \| A H A \|_2 \) and \( \gamma / \delta = 2 \); see \([14]\) for more details. In practice, we tune the parameters \( \gamma \) and \( \delta \) to further improve the empirical performance of our algorithms.

C. CIPO: biConvex 1-bit Precoding

We noted above that the BCR problem is biconvex, meaning that minimization with respect to \( x \) alone (with \( z \) fixed) or \( z \) alone (with \( x \) fixed) is convex. Hence, as done in \([14]\), we can solve the BCR problem approximately using alternating minimization. Since the problem is nonconvex, initialization critically affects the performance of our algorithm. We initialize our algorithm with the MRC precoded vector \( x^{(1)} = H s \), which yields excellent performance in practice and can be computed efficiently. Then, we solve for \( z \) while holding \( x \)
which can be computed once during a preprocessing stage, and (ii) the per-iteration matrix-vector multiplication $\mathbf{z}^{(t+1)} = \mathbf{Gx}^{(t)}$ in step (3); the complexity of the projection in step (9) is negligible. Unfortunately, the complexity of the matrix inversion, evaluated in terms of operations\(^6\) scales roughly with $B^3$ and the complexity of the per-iteration matrix-vector product with $B^2$. Both of these tasks are particularly inefficient for massive MU-MIMO systems with a large number of BS antennas. Therefore, we next propose an algorithmic variant that avoids both of these issues and whose complexity scales more favorably with the number of BS antennas.

### D. Fast Algorithm for Very-Large Systems: C2PO

To obtain our alternative algorithm, we start from the BCR formulation in (7) but rather than introducing the auxiliary variable $\mathbf{z}$, we attempt to solve the following nonconvex problem\(^5\)

$$\hat{x} = \arg\min_{x \in B} \frac{1}{2} \|\mathbf{Ax}\|_2^2 - \frac{\delta}{2} \|x\|_2^2. \quad (11)$$

We use forward-backward splitting (FBS)\(^{34-36}\), a computationally efficient method to solve large convex problems. Since the problem in (11) is nonconvex, FBS is not guaranteed to converge to the optimal solution. Nevertheless, as shown in Section VI, the proposed algorithm performs well in practice. FBS is an efficient iterative procedure to solve convex optimization problems of the form

$$\hat{x} = \arg\min_{x \in B} f(x) + g(x),$$

where the function $f$ is smooth and convex, and the function $g$ is convex but not necessarily smooth or bounded. FBS consists of the following iteration\(^{34, 35}\):

$$\mathbf{z}^{(t+1)} = \mathbf{x}^{(t)} - \tau^{(t)} \nabla f(x^{(t)})$$

$$\mathbf{x}^{(t+1)} = \text{prox}_g(\mathbf{z}^{(t+1)}; \tau^{(t)})$$

for $t = 1, 2, \ldots, t_{\text{max}}$ or until convergence. Here, $\nabla f(x)$ is the gradient of the smooth function $f$, and the so-called proximal operator for the function $g$ is defined as follows\(^{37}\):

$$\text{prox}_g(z; \tau) = \arg\min_{x} \left\{ \tau g(z) + \frac{1}{2} \|x - z\|_2^2 \right\}.$$  

The sequence $\{\tau^{(t)} > 0\}$ contains suitably chosen step-size parameters. For the problem (11), we show below that FBS monotonically decreases the objective (11) for any constant step size that satisfies $\tau^{(t)} = \tau < \|\mathbf{A}^\dagger \mathbf{A}\|_{2,2}^{-1}$.

In order to approximately solve (11) using FBS, we set

$$f(x) = \frac{1}{2} \|\mathbf{Ax}\|_2^2 \quad \text{and} \quad g(x) = \chi(x \in B^2) - \frac{\delta}{2} \|x\|_2^2, \quad (12)$$

where $\chi$ is a characteristic function that is zero if the condition $x \in B^2$ is met and infinity otherwise. For this choice of $f$ and $g$, the gradient is given by $\nabla f(x) = \mathbf{A}^\dagger \mathbf{Ax}$ and the proximal operator is given by the expansion-reprojection operation

$$\text{prox}_g(z) = \text{sgn}(\Re\{z\}) \min \left\{ \frac{1}{1 - \tau \delta} |\Re\{z\}|, \sqrt{\frac{P}{2B}} \right\} \quad + \text{j} \text{sgn}(\Im\{z\}) \min \left\{ \frac{1}{1 - \tau \delta} |\Im\{z\}|, \sqrt{\frac{P}{2B}} \right\}. \quad (13)$$

---

\(^6\)For simplicity, we count the number of complex-valued multiplications to characterize the operation count.

\(^5\)To simplify notation, we have divided both terms in the objective function by a factor of two; this scaling does not affect the result.
which is valid for $\tau \delta < 1$ and applied element-wise to vectors. By using FBS with the above-mentioned ingredients, we obtain the following simple algorithm, which we call C2PO.

**Algorithm 2 (C2PO).** Set $A$ as in [3]. Initialize $x^{(1)} = \mathbf{H}^* s$ and fix the parameters $\delta$ and $\tau$ so that $\tau \delta < 1$. Then, for every iteration $t = 1, 2, \ldots, t_{\text{max}}$ compute:

$$
\begin{align*}
    z^{(t+1)} &= x^{(t)} - \tau A^H A x^{(t)} \quad (14) \\
    x^{(t+1)} &= \text{prox}_{g}(z^{(t+1)}; \tau). \quad (15)
\end{align*}
$$

Here, the $\text{prox}_{g}$ operator is given in (13) and is applied element-wise to the vector $z^{(t+1)}$. In the last iteration $t_{\text{max}}$, the output $x^{(t_{\text{max}}+1)}$ of C2PO is quantized to the quaternary alphabet $X$ as in (10).

The following result shows that C2PO is well behaved, provided that the step size is chosen appropriately; the proof is given in Appendix B.

**Theorem 2.** Suppose the step size used in C2PO satisfies $\tau < \|A^H A\|_{2,2}^{-1}$, and $\tau \delta < 1$. Then, C2PO decreases the objective (11) monotonically, and any limit point of the iterates $\{x^{(t)}\}$ is a stationary point.

The most complex operation of C2PO (Algorithm 2) is the matrix-vector multiplication in step (14). In contrast to C1PO (Algorithm 1), however, this step requires a minimal amount of preprocessing and can be computed efficiently, especially for large BS antenna arrays. To see this, we rewrite $A^H A$, where $A$ was given in [3], as follows:

$$
A^H A = \mathbf{H}^H \mathbf{H} - \frac{\mathbf{H}^H \mathbf{s} \mathbf{s}^H \mathbf{H}}{\|\mathbf{s}\|^2_2} = \mathbf{H}^H \mathbf{H} - \mathbf{v} \mathbf{v}^H = \mathbf{H}^H \mathbf{H}.
$$

Here, $\mathbf{v} = \mathbf{H}^H \mathbf{s}/\|\mathbf{s}\|_2$ is a normalized version of the MRC vector and $\mathbf{H} = [\mathbf{H}_b, \mathbf{j} \mathbf{v}^H]$ is an augmented $(U+1) \times B$ matrix. With these definitions, we can now simplify step (14) to

$$
\begin{align*}
    z^{(t+1)} &= x^{(t)} - \tau \mathbf{H}^H \mathbf{H} x^{(t)}, \quad (16)
\end{align*}
$$

where we first compute $w = \mathbf{H}(\tau x^{(t)})$, then $w' = \mathbf{H}^H w$, and finally $z^{(t+1)} = x^{(t)} - w'$. As a result, we conclude that each iteration of C2PO requires only two matrix-vector products with a cost of roughly $2B(U+1)$ operations (in contrast to $B^2$ operations for C1PO). In addition, the preprocessing stage of this algorithm only needs to compute the normalized MRC vector $\mathbf{v}$, which requires roughly $BU$ operations (in contrast to $B^3$ operations for C1PO). Hence, the complexity of C2PO can be significantly lower than that of C1PO, especially since the antenna configurations of typical massive MU-MIMO systems satisfy $U \ll B$. As we will show in Section VI, the hardware efficiency of C2PO is superior to that of C1PO for large BS antenna arrays and the error-rate performance is comparable.

**E. Alternative Derivation of C2PO**

It is interesting to note that there is a strong connection between Algorithm 1 and Algorithm 2. In fact, one can obtain C2PO directly from C1PO using the following well-known series expansion. Let $\|A^H A\|_{2,2} < \gamma$. Then, we have the following Neumann series expansion [38]:

$$
(\mathbf{I}_B + \gamma^{-1} A^H A)^{-1} = \sum_{n=1}^{\infty} (-\gamma^{-1} A^H A)^n.
$$

As suggested in [20], we can approximate the inverse by truncating the series to the two first terms:

$$
(\mathbf{I}_B + \gamma^{-1} A^H A)^{-1} \approx \mathbf{I}_B - \gamma^{-1} A^H A.
$$

By using this approximation in step (8) of Algorithm 1, we immediately obtain Algorithm 2 after setting $\gamma^{-1} = \tau$. Note that the Neumann series expansion is only convergent for $\|A^H A\|_{2,2} < \gamma$, which corresponds to the step size restriction $\tau < \|A^T A\|_{2,2}^{-1}$; this is the step size requirement in Theorem 2.

**IV. VLSI DESIGN FOR C1PO**

We now present a high-throughput VLSI architecture for C1PO as in Algorithm 1. We then discuss the key optimizations performed in our FPGA implementation.

**A. Architecture Overview**

The proposed VLSI architecture that implements C1PO as detailed in Algorithm 1 is shown in Figure 2. Our architecture consists of a linear array of $B$ identical processing elements (PEs) that share a common control unit. The PEs essentially consist of three main building blocks: (i) a $\mathbf{g}_b$-memory, (ii) a complex-valued multiply-accumulate (MAC) unit, and (iii) a projection unit. For the $b$th PE, the $\mathbf{g}_b$-memory stores the $b$th row of the matrix $\mathbf{G} = (\mathbf{I}_B + \gamma^{-1} A^H A)^{-1}$, which we assumed was computed during a preprocessing stage. The complex-valued MAC unit is used by each PE to sequentially compute an entry of the output.
vector \( z^{(t+1)} \) on line (8), while the entries of the vector \( x^{(t)} \) are exchanged between the PEs in a cyclic fashion; this is done to avoid an architecture with a centralized \( x^{(t)} \) memory that would suffer from a high fan-out because the memory’s output has to be distributed to all the PEs. The projection unit implements the expansion-reprojection operator \( \text{proj}(\cdot) \) on line (9) in a hardware-friendly manner. The outputs of the projection unit are also used to generate the quaternary outputs of the 1-bit precoder; to this end, each PE simply takes the sign bits of the complex-valued output vector \( x^{(t+1)} \).

### B. Architecture Operation

We now detail the (rather technical) operation of the C1PO architecture illustrated in Figure 2. In the first iteration (i.e., at \( t = 1 \)), each PE \( b \) is initialized with the \( b \)th entry of the vector \( x^{(1)} \). Furthermore, the entries of the \( g_b \)-memory are stored so that the first memory address corresponds to \( [G]_{b,b} \), the second address to \( [G]_{b,b+1} \), and so forth (addresses wrap around).

In the first clock cycle, each PE \( b \) computes \( [G]_{b,b}[x^{(t)}]_b \) and the result is stored in the accumulator. As shown on the left side of Figure 2 in the same clock cycle the \( b \)th PE passes the value \( [x^{(t)}]_b \) to PE \( (b-1) \), while it receives the value \( [x^{(t)}]_{b+1} \) from PE \( (b+1) \); PE 1 passes its value to PE \( B \). In the second clock cycle, since the exchange operation made the element \( [x^{(t)}]_{b+1} \) available at PE \( b \), each PE computes \( [G]_{b,b+1} \cdot [x^{(t)}]_{b+1} \) and uses the accumulator to add it to the result of the previous cycle. Once again, in the same clock cycle, the \( b \)th PE passes the \( x^{(t)} \) entry that is currently being multiplied on its MAC unit to PE \( (b-1) \); PE 1 passes its value to PE \( B \). Consequently, in the third clock cycle, the \( b \)th PE will use the values \( [G]_{b,b+2} \) and \( [x^{(t)}]_{b+2} \) to continue performing MAC operations. By repeating this procedure \( B \) times, each entry of the vector \( [x^{(t)}] \) circulates through all the PEs exactly once, enabling each PE \( b = 1, 2, \ldots, B \) to compute \( [z^{(t+1)}]_b \). Thus, the matrix-vector multiplication on line (8) is completed. Since the complex-valued MAC unit contains three pipeline stages, two clock cycles are required to flush the pipeline. Hence, the matrix-vector operation requires a latency of \( B + 2 \) clock cycles. After \( B + 2 \) clock cycles, the vector \( z^{(t+1)} \) is available at the outputs of the MAC units.

In the subsequent clock cycle, each PE projects their respective entry of the vector \( z^{(t+1)} \). The choice \( \gamma/\delta = 5 \), which implies that \( \gamma/(\gamma - \delta) = 1.25 \), works well for the considered antenna configurations. Furthermore, to reduce the hardware complexity, we assume \( P = 2B \) so that the clipping threshold of the expansion-reprojection operator \( \text{proj}(\cdot) \) is 1. As a result, the \( \text{proj}(\cdot) \) operator in (9) is implemented by applying the following operations independently to the real and imaginary parts of \([z^{(t+1)}]_b\): we multiply the real (or imaginary) part of \([z^{(t+1)}]_b\) by 1.25; this is accomplished by adding \( [z^{(t+1)}]_b \) value with a 2× right-shifted version of itself. At the same time, the real (or imaginary) part of \([z^{(t+1)}]_b\) is compared to \(-0.8 \) and \(+0.8 \). If the real (or imaginary) part of \([z^{(t+1)}]_b\) is between these two numbers, then the projection unit outputs \( 0.8 \cdot [z^{(t+1)}]_b \); if it is smaller than \(-0.8 \), then the projection unit generates \(-1 \); if it is larger than \(+0.8 \), it generates \(+1 \). The result from this projection is stored as the next iterate \([x^{(t+1)}]_b\) in the input register of the complex-valued MAC unit, which completes one C1PO iteration. Since the projection requires an additional clock cycle, one C1PO iteration is completed in exactly \( B + 3 \) clock cycles.

### C. FPGA Implementation Details

To minimize the FPGA implementation complexity of C1PO, we exclusively use fixed-point arithmetic; see Section VI-A for the fixed-point error-rate performance of C1PO. To represent the entries of the vector \( x^{(t)} \), we use 12-bit fixed-point values with 5 fraction bits. The entries of the \( G \) matrix are represented using 10 bits with 9 fraction bits, and we use FPGA look-up tables (LUTs) as a distributed RAM to store these values. The complex-valued MAC unit uses 18 bits with 11 fraction bits; the projection unit uses 15 bits with 8 fraction bits. In our C1PO design, all adders and multipliers do not saturate, but wrap around; number resizing uses truncation. All complex-valued multipliers consist of four real-valued multipliers and two adders; we use the built-in DSP48 units for these operations.

### V. VLSI Design for C2PO

We now present a high-throughput VLSI architecture for C2PO as in Algorithm 2. We then discuss the key optimizations used in our FPGA implementation.

#### A. Architecture Overview

The proposed VLSI architecture that implements C2PO (Algorithm 2) is shown in Figure 3. In what follows, we assume that \( B \) is a multiple of \( U \) and \( B \gg U \). Our architecture consists of \( B/U \) linear arrays; each array consists of \( U + 1 \) PEs and a control unit. The architecture divides the operation in (16) into two separate matrix-vector products: (i) \( w = H^H(\tilde{x}^{(t)}) \) and (ii) \( w' = \tilde{H}^H w \); see also the discussion at the end of Section III-D. Note that for the first matrix-vector product, the matrix \( H \) has more columns (\( B \)) than rows (\( U + 1 \)); for the second matrix-vector product, the matrix \( H^H \) has more rows than columns. Therefore, we will refer to the first matrix-vector product as the \textit{wide product}, while the second one will be identified as the \textit{tall product}. The final subraction required to compute \( z^{(t+1)} = x^{(t)} - w' \) in (16) is incorporated into the tall-product operation; see Section V-B for more details.

To perform both the wide and tall products in a single computation unit, the matrix \( H \) is divided into \( B/U \) sub-matrices \( H_w \in \mathbb{C}^{(U + 1) \times U} \), \( w = 1, 2, \ldots, B/U \), so that \( H = \begin{bmatrix} H_1 & H_2 & \cdots & H_{B/U} \end{bmatrix} \). Analogously, the vector \( x^{(t)} \) is divided into \( B/U \) sub-vectors \( x_w^{(t)} \in \mathbb{C}^U \), \( w = 1, 2, \ldots, B/U \), so that \( x^{(t)} = \begin{bmatrix} x_1^{(t)} & x_2^{(t)} & \cdots & x_{B/U}^{(t)} \end{bmatrix}^T \). We next outline the architectural principle of the wide and tall products.

**i. Wide Product:** Each linear array takes one sub-matrix \( H_w \) and the associated sub-vector \( x_w^{(t)} \) as its inputs and computes their product in a sequential, column-by-column manner. This operation is analogous to that of the C1PO architecture (cf. Section [V-B]) and within each linear array, the entries of the scaled
sub-vector $\tau \mathbf{x}_w^{(t)}$ are cyclically exchanged among the PEs. The resulting vectors $\mathbf{w}_u = \mathbf{H}_u(\tau \mathbf{x}_w^{(t)})$ are then added to obtain

$$\mathbf{w} = \mathbf{H}(\tau \mathbf{x}^{(t)}) = \sum_{u=1}^{B/U} \mathbf{H}_u(\tau \mathbf{x}_w^{(t)}),$$

which completes the wide product. Each entry $[\mathbf{w}]_u$ of the resulting vector $\mathbf{w}$ is then stored in PE $u$ of all linear arrays.

(ii) Tall Product: With the $\mathbf{w}$ vector available in all the linear arrays, each array now computes $U$ entries of $\mathbf{z}^{(t+1)}$ by implementing $\mathbf{z}_u^{(t+1)} = \mathbf{x}_w^{(t)} - \mathbf{H}_u \mathbf{w}$. Here, however, we use a sequential procedure in which the accumulated results are exchanged between PEs of the same array; see Section V-B for a detailed explanation. As a result, each linear array can project its computed $\mathbf{z}_u^{(t+1)}$ entries to generate the next iterate $\mathbf{x}_w^{(t+1)}$, which are then used by the same linear array to proceed with the next iteration. The sign bits of the new vector $\mathbf{x}_w^{(t+1)}$ correspond to the outputs of the C2PO architecture.

As in the C1PO architecture, each PE $u = 1, 2, \ldots, U + 1$ is formed by three main units. The first unit is an $\mathbf{H}_u$ memory, which stores the $u$th row of the $\mathbf{H}_u$ sub-matrix. The second unit is a complex-valued MAC unit, which supports (i) multiplications of $a \times b$ and $a \times b^*$, (ii) accumulation by addition or subtraction, and (iii) initialization of the accumulator with a non-zero value. The third unit is the projection unit, which is equivalent to the one of C1PO, although it is merged with the accumulator of the MAC unit.

B. Architecture Operation

We now provide the (rather technical) operation details of the C2PO architecture illustrated in Figure 3. Without loss of generality, we focus our description on the $u$th linear array of PEs, which operates on the $\mathbf{H}_u$ sub-matrix and the $\mathbf{x}_w^{(t)}$ sub-vector. In the first iteration (i.e., at $t = 1$), the entry $[\mathbf{X}_w^{(1)}]_u$ and its scaled version $\tau [\mathbf{X}_w^{(1)}]_u$ are stored in PE $u = 1, 2, \ldots, U$ in two different registers: The value $[\mathbf{X}_w^{(1)}]_u$ is stored in the register labeled with "a" in Figure 3 which will later be used to initialize the accumulator in the complex-valued MAC unit; the value $[\mathbf{X}_w^{(1)}]_u$ is stored at the input register of the MAC unit labeled with "b." We restrict the stepsize $\tau$ to be $2^{-\alpha}$ for some fixed $\alpha \in \mathbb{N}^+$, which enables us to acquire $[\mathbf{X}_w^{(1)}]_u$ from a simple arithmetic right-shifted version of $[\mathbf{X}_w^{(1)}]_u$. The $(U + 1)$th PE stores the same value $[\mathbf{X}_w^{(1)}]_u$, as that in PE 1. Similar to the C1PO architecture, the entries of the $\mathbf{H}_u$-memory are stored so that the first memory address contains $[\mathbf{H}_w]_{u,u}$, the second address $[\mathbf{H}_w]_{u,u+1}$, and so forth (addresses wrap around). For the $(U + 1)$th PE, the first address of the $\mathbf{H}_u$-memory contains $[\mathbf{H}_w]_{U+1,1}$, the second address contains $[\mathbf{H}_w]_{U+1,2}$, etc.

(i) Wide Product: In the first clock cycle, each PE $u = 1, 2, \ldots, U$ computes $[\mathbf{H}_u]_{u,u} \cdot [\tau \mathbf{x}_w^{(t)}]_u$ and stores the result in the accumulator. The $(U + 1)$th PE computes $[\mathbf{H}_u]_{U+1,1} \cdot [\tau \mathbf{x}_w^{(t)}]_1$. As shown in the upper left side of Figure 3 in the same clock cycle, the $u$th PE passes the value $[\tau \mathbf{x}_w^{(t)}]_u$ to PE $u - 1$, while it receives the value $[\tau \mathbf{x}_w^{(t)}]_{u+1}$ from PE $(u + 1)$; PE 1 passes its value to PE $U$, while PE $(U + 1)$ does not pass anything. In the second clock cycle, since the cyclic exchange operation made the entry $[\tau \mathbf{x}_w^{(t)}]_{u+1}$ available at PE $u$, each PE computes $[\mathbf{H}_u]_{u,u+1} \cdot [\tau \mathbf{x}_w^{(t)}]_{u+1}$ and uses the accumulator to add it to the result of the previous cycle. The $(U + 1)$th PE uses the same value $[\tau \mathbf{x}_w^{(t)}]_{1}$ as PE 1; hence, it can compute $[\tau \mathbf{x}_w^{(t)}]_2 \cdot [\mathbf{H}_u]_{U+1,2}$. Again, in the same clock cycle, the $u$th PE passes the $[\tau \mathbf{x}_w^{(t)}]_u$ entry that is currently being multiplied on its MAC unit to PE $(u - 1)$; PE 1 passes its value to PE $B$, while PE $(U + 1)$ does not pass anything. Consequently, in the third clock cycle, the $u$th PE will use the values $[\mathbf{H}_u]_{u,u+2}$ and $[\tau \mathbf{x}_w^{(t)}]_{u+2}$ to continue performing MAC operations. During this third cycle, the $(U + 1)$th PE will calculate the product $[\mathbf{H}_u]_{U+1,3} \cdot [\tau \mathbf{x}_w^{(t)}]_3$. By repeating this procedure $U$ times, each entry of the sub-vector $(\tau \mathbf{x}_w^{(t)})$ cycles through all the PEs exactly once, enabling the $u$th linear array of PEs to compute $\mathbf{H}_w(\tau \mathbf{x}_w^{(t)})$. Since the complex-valued MAC unit contains three pipeline stages, two clock cycles are required to flush the pipeline. Hence, the previous matrix-vector operation has a latency of $U + 2$ clock cycles. To complete the wide product, the vectors $\mathbf{H}_w(\tau \mathbf{x}_w^{(t)})$ must be added. We use a
binary adder tree with $\log_2(B/U)$ pipeline stages. Hence, the vector $w$ is computed after $U + \log_2(B/U) + 2$ clock cycles. The $u$th PE in each linear array stores the entry $[w]_u$ in the MAC unit’s input register labeled with “$u$” in Figure 3.

(ii) Tall Product: In the next clock cycle, the computation of the tall-product starts. During the first clock cycle of the tall product computation, the PE $u = 1, 2, \ldots, U$ has available $[w]_u$, as well as $[H^w]_{u,u,u}$, the first entry in its memory. The PE can then compute $[H^w]_{u,u,u} \cdot [w]_u = [H^u]_{u,u,u} \cdot [w]_u$. Using the accumulator, this product is then subtracted from $[x^{(i)}]_u$, which was stored during the initialization phase in the register labeled with “$a$” in Figure 3. During the same clock cycle, the $u$th PE sends its accumulated result to the $(u-1)$th PE; PE 1 sends its accumulated result to PE $U$. Also, in the same clock cycle, the $(U+1)$th PE multiplies the conjugate of the first entry of its memory with its $w$ entry. In words, the product

$$[	ilde{H}^u]_{u,u+1} \cdot [w]_{u+1} = [H^u]_{u,u+1} \cdot [w]_{u+1}$$


is computed after $\tau x^{(i)}_u$ entries. In the second clock cycle, the $(u-1)$th PE multiplies the value $[w]_{u-1}$ with $[H^u]_{u-1,u,u-1}$. The product is then subtracted from the accumulated value received from the $u$th PE during the previous cycle, and the new accumulated value is passed to the $(u-2)$th PE. In the same clock cycle, the $(U+1)$th PE multiplies the value $[w]_{U+1}$ with $[H^u]_{U+1,u+2} = [H^u]_{U+2,u+1}$ and sends the result to the $U$th PE, so it can cycle through the linear array. Furthermore, PE $U$ passes the $[H^u]_{U+1,u+1} \cdot [w]_{U+1}$ (previously received from the $(U+1)$th PE) to PE $(U-1)$. In the third clock cycle, the $(u-2)$th PE calculates $[H^u]_{u,u-2} \cdot [w]_{u-2}$, subtracts it from the accumulated result received on the second cycle from the $(u-1)$th PE and passes the result to the $(u-3)$th PE. In the same clock cycle, the $(U+1)$th PE computes $[H^u]_{U+3,u,u+1} \cdot [w]_{U+1}$ and sends it to PE $U$. Meanwhile, $[H^u]_{U+1,u+1} \cdot [w]_{U+1}$ is passed from PE $U$ to PE $(U-1)$ and $[H^u]_{U+1,u+1} \cdot [w]_{U+1}$ is passed from PE $(U-1)$ to PE $(U-2)$. After repeating this procedure for $U$ clock cycles, each PE $u = 1, 2, \ldots, U$ will contain the accumulated result for the $u$th entry of $\tilde{x}^{(i)}_u - \tilde{H}^w_w$, received from the $(u+1)$th PE during the previous cycle. However, this accumulated result is missing the product $[H^u]_{u,u+1} \cdot [w]_{u+1}$, which was computed and sent by the $(U+1)$th PE. Nonetheless, in the $(U+1)$th cycle of the tall product procedure, the $u$th PE receives the missing $[H^u]_{u,u+1} \cdot [w]_{u+1}$ value from the $(u+1)$th PE. Thus, the $z^{(i+1)} = x^{(i)} - H^w_w$ entries are calculated after $U+1$ cycles. Since the complex MAC unit is used again, two additional clock cycles are required to flush its pipeline. Hence, $U + 3$ cycles are used for the tall product.

Finally, in the subsequent clock cycle after the tall product is completed, the projection operator is applied in a similar fashion as for the C1PO architecture. The only difference is that now the accumulator of the MAC unit is used to multiply the real and imaginary parts of each $z^{(i+1)}$ entry with 1.25, by adding each $z^{(i+1)}$ entry with a $2 \times$ right-shifted version of itself. The result from this projection is stored as the next iterate $[x^{(i+1)}]_u$ in the two initialization locations previously mentioned, completing one C2PO iteration. Since the projection operation requires an additional clock cycle, a full C2PO iteration is completed in exactly $2U + \log_2(B/U) + 6$ clock cycles.

C. FPGA Implementation Details

As for the C1PO FPGA implementation, we exclusively use fixed-point arithmetic for the C2PO FPGA design; see Section VI-A for the fixed-point error-rate performance of C2PO. To represent the entries of the vector $x^{(i)}$, we use 12-bit fixed-point values with 5 fraction bits. For the scaled $\tau x^{(i)}$ values, we use 12-bit fixed-point values with 11 fraction bits. The entries of the $H$ matrix consist of 10 bits with 8 fraction bits, and we use FPGA look-up tables (LUTs) as a distributed RAM to store these values. The complex-valued MAC unit uses 18 bits with 15 fraction bits when doing the wide product and 11 fraction bits when doing the tall product; the projection unit uses 18 bits with 11 fraction bits. The adder tree uses 21 bits with 15 fraction bits. Identical to the C1PO implementation, all adders and multipliers do not saturate, but wrap around; number resizing uses truncation. All complex-valued multipliers are built with four real-valued multipliers and two adders; we use DSP48 units for these operations.

VI. RESULTS

We now provide error-rate performance results for massive MU-MIMO systems and show reference FPGA implementation results for C1PO and C2PO.

A. Simulation Results

Figure 4 shows uncoded bit error rate (BER) curves versus the normalized transmit power $\varrho = P/N_0$ for massive MU-MIMO dowlink systems with $U = 16$ UEs for various precoding algorithms. In Figure 4(a) we consider the case of $B = 32$ BS antennas with BPSK, whereas in Figure 4(b) we consider the case of $B = 128$ BS antennas with 16-QAM. For both systems, we use Gray mapping, generate i.i.d. Rayleigh fading channel matrices, and average the BER over 10,000 Monte-Carlo trials. We compare ZF followed by quantization (ZF-Q), MRC followed by quantization (MRC-Q), the nonlinear SQUID algorithm proposed in [6], as well as C1PO and C2PO, for systems with 1-bit DACs. As a reference, we also include ZF with infinite-precision DACs (Inf. prec. ZF). SQUID runs $t_{\max} = 50$ iterations; C1PO and C2PO both run $t_{\max} = 24$ iterations. For all algorithms, the curves represent MATLAB floating-point performance; for C1PO and C2PO, the markers correspond to fixed-point performance of our hardware designs. Clearly, the fixed-point implementation loss of our hardware designs is negligible, i.e., less than 0.15 dB SNR at 1% uncoded BER for both considered scenarios.

For the $16 \times 32$ system (we use the notation $U \times B$ to refer to a downlink scenario with $U$ users and $B$ BS antennas) with BPSK, Figure 4(a) shows that all nonlinear precoders significantly outperform the linear-quantized precoders (ZF-Q and MRC-Q), which exhibit a high error floor. Furthermore, we see that C1PO and C2PO achieve similar performance as that of SQUID. At low-SNR, SQUID is marginally better, whereas C2PO achieves the best performance at high SNR, closely followed by C1PO and SQUID.
For the $16 \times 128$ system with 16-QAM, Figure 4(b) shows a similar trend, i.e., non-linear precoders significantly outperform linear-quantized precoders. SQUID outperforms C1PO and C2PO (which perform equally well) by about 0.5 dB SNR at 1% BER. However, we note that the complexity (in terms of operation counts) of SQUID is more than $2 \times$ higher than that of C1PO and C2PO, and also involves the sorting of $B$ dimensional vectors which is difficult to implement efficiently in VLSI. We also observe that non-linear precoders enable reliable transmission of higher-order modulation schemes (such as 16-QAM), which is not possible with linear-quantized methods—an observation also made recently in [7].

B. FPGA Implementation Results

To demonstrate the efficacy of C1PO and C2PO, we implemented several FPGA designs for different antenna configurations, namely for 32, 64, 128, and 256 BS antennas; all designs support downlink transmission to 16 UEs for modulation schemes ranging from BPSK to 16-QAM. The FPGA designs were developed on register transfer level (RTL) using Verilog, implemented using Xilinx Vivado Design Suite, and optimized for a Xilinx Virtex-7 XC7VX690T FPGA. Table I shows reference FPGA implementation results for C1PO and C2PO.

We see that the logic area (in terms of slices, logic LUTs, flipflops, and DSP48 units) for all designs increases roughly linearly with the number of BS antennas; this is a result of using a linear array of PEs. The only exception is the memory requirements of C1PO (in terms of memory LUTs), which scales roughly quadratically in the number of BS antennas; this is a result of having to store the entire $B \times B$ matrix $G$ in contrast to storing only the augmented $(U + 1) \times B$ matrix $\bar{H}$ for C2PO. We also see that the logic area for C1PO is 20% to 50% smaller than that of C2PO for all array sizes; the memory area of C1PO, however, is significantly larger for 128 and 256 BS antennas. This is because the architecture for C1PO is slightly simpler than that of C2PO, but the memory requirements of C1PO scale quadratically in $B$ whereas the memory requirements of C2PO only scale linearly in $B$.

The maximum clock frequency for C1PO is slightly higher than that of C2PO, which is due to the slightly simpler architecture of C1PO. As expected, the maximum clock frequency slowly decreases with $B$, since the FPGA routing overhead increases with $B$. In fact, after implementing our designs, the critical paths are typically in interconnect networks. Before mapping our designs to the FPGA, however, the critical path for the C1PO designs is in the real-valued multipliers that form part of the complex multiplier, while for the C2PO designs it is in the adders that form part of the complex multiplier. The latency of one C1PO iteration is significantly larger than that of C2PO for 64, 128, and 256 antennas. This results in significantly higher per-iteration throughput of C2PO for these BS antenna array sizes. Hence, C2PO is more efficient (in terms of throughput per area) for large BS antenna arrays, whereas C1PO is more efficient for small ones.

We finally note that the implementation results in Table I ignore the preprocessing complexity. For C1PO, preprocessing requires a $B \times B$ matrix inversion, which is computationally demanding, especially for large BS antenna arrays [20]. In stark contrast, preprocessing for C2PO only requires the computation of the scaled MRC output, which requires a $B \times U$ matrix-vector multiplication. Hence, C2PO can be considered as the preferred method in practical massive MU-MIMO systems.

C. Comparison with MRC-based Precoding

While the papers [26] and [25] propose hardware designs for precoding in massive MU-MIMO systems with high-precision DACs, neither of them provide detailed FPGA implementation results. Reference [26] describes an FPGA-based testbed that uses MRC and ZF-based precoding but does not report area and clock frequency results; [25] only provides ASIC implementation results. To enable a comparison of conventional precoders
with C1PO and C2PO, we developed a baseline design that implements MRC followed by quantization (MRC-Q).

Our MRC-Q baseline design is essentially a stripped-down and heavily optimized version of C1PO with only the necessary circuitry to implement MRC-Q. More specifically, our architecture corresponds to \( B/U \) linear arrays, each one with \( U \) PEs and a control unit. The arrays and PEs are organized as in Figure 2, with the exception that the projection unit is removed from the PEs. In addition, no multipliers are required as MRC-Q computes \( \mathbf{H} s \) with \( s \in \mathbb{O}^U \), and hence all multiplications are with constants (given by the constellation set \( \mathbb{O} \) and can be implemented with adders and shifters.

The FPGA implementation results for the MRC-Q baseline designs are reported in Table II. Note that these designs do not require any DSP48 units as the multiplication with constants are carried out with conventional logic. In comparison to the 1-bit precoder designs reported in Table I we see that MRC-Q-based precoding is roughly 5× to 6× more efficient than C2PO and up to 30× more efficient than C1PO (in terms of throughput/LUTs). This efficiency advantage comes at a significant loss in terms of error-rate performance (cf. Figure 4). We note, however, that for massive MU-MIMO systems with significantly more BS antennas than UEs (e.g., more than \( 8 \times \)), MRC-Q is a viable low-complexity alternative—a well-known fact in the massive MU-MIMO literature [2]–[4].

### D. Performance–Complexity Trade-Offs

In Figure 5 we provide the performance–complexity trade-offs between C1PO (dashed lines with circle markers) and C2PO (dotted lines with square markers) for various BS antenna array sizes. This trade-off is characterized in terms of the minimum normalized transmit power \( \varrho \) required to achieve 1% uncoded BER for BPSK (as in Figure 4(a)); the complexity is characterized by the throughput per area (in terms of billion symbols per second per FPGA LUT). As a reference, we also show the performance for ZF precoding with infinite-precision DACs (vertical lines). As in Figure 4, we consider a transmission to \( U = 16 \) UEs. We see that for small antenna arrays (i.e., for \( B = 32 \) and \( B = 64 \)), C1PO outperforms C2PO. However, for large antenna arrays (i.e., for \( B = 128 \) and \( B = 256 \)), C2PO significantly outperforms C1PO. We also see that only a small number of iterations are required (e.g., 2 to 4 iterations) for such large antenna arrays to achieve the performance limits of our algorithms.

In Figure 5 we additionally show the trade-off achieved by the MRC-Q baseline design reported in Section VI-C. Clearly, MRC-Q achieves higher throughput per LUT than C1PO and C2PO for large BS arrays (\( B = 128 \) and \( B = 256 \)); this gain comes, however, at the cost of rather poor error-rate

### TABLE I

**IMPLEMENTATION RESULTS FOR C1PO AND C2PO ON A XILINX VIRTEX-7 XC7VX690T FPGA**

<table>
<thead>
<tr>
<th>Algorithm</th>
<th>C1PO</th>
<th>C2PO</th>
</tr>
</thead>
<tbody>
<tr>
<td>BS antennas ( B )</td>
<td>32  64  128  256</td>
<td>32  64  128  256</td>
</tr>
<tr>
<td>Slices</td>
<td>2700  5187  10324  21951</td>
<td>3375  6519  12690  24748</td>
</tr>
<tr>
<td>LUTs</td>
<td>6671  13305  30979  71817</td>
<td>10817  21920  43710  85323</td>
</tr>
<tr>
<td>– LUTs as logic</td>
<td>6031  12025  25939  51897</td>
<td>10069  20424  40718  79339</td>
</tr>
<tr>
<td>– LUTs as memory</td>
<td>640  1280  5040  19920</td>
<td>748  1496  2992  5984</td>
</tr>
<tr>
<td>Flipflops</td>
<td>6830  13624  26683  52175</td>
<td>5677  12461  26083  53409</td>
</tr>
<tr>
<td>DSP48 units</td>
<td>128  256  512  1024</td>
<td>136  272  544  1088</td>
</tr>
<tr>
<td>Max. clock frequency [MHz]</td>
<td>285  264  244  205</td>
<td>222  206  208  193</td>
</tr>
<tr>
<td>Min. latency [clock cycles]</td>
<td>35  67  131  259</td>
<td>39  40  41  42</td>
</tr>
<tr>
<td>Max. throughput( a ) [Msymbols/s]</td>
<td>130  63  30  13</td>
<td>91  82  81  74</td>
</tr>
<tr>
<td>Power consumption( b ) [W]</td>
<td>1.13  1.97  3.43  5.74</td>
<td>1.04  1.70  3.17  5.80</td>
</tr>
<tr>
<td>Max. throughput/LUTs</td>
<td>19 529  4 733  962  177</td>
<td>8 413  3 756  1 853  862</td>
</tr>
</tbody>
</table>

\( a \): The latency and maximum throughput are measured for one algorithm iteration.

\( b \): Statistical power estimation at maximum clock frequency and 1.0 V supply voltage.

### TABLE II

**IMPLEMENTATION RESULTS FOR A MRC-Q-BASED PRECODER ON A XILINX VIRTEX-7 XC7VX690T FPGA**

<table>
<thead>
<tr>
<th>BS antennas ( B )</th>
<th>32  64  128  256</th>
</tr>
</thead>
<tbody>
<tr>
<td>Slices</td>
<td>2543  5097  9444  17630</td>
</tr>
<tr>
<td>LUTs</td>
<td>7842  15617  32476  64446</td>
</tr>
<tr>
<td>– LUTs as logic</td>
<td>7010  13953  29148  57790</td>
</tr>
<tr>
<td>– LUTs as memory</td>
<td>832  1664  3328  6566</td>
</tr>
<tr>
<td>Flipflops</td>
<td>5711  11419  21902  42764</td>
</tr>
<tr>
<td>Clock freq. [MHz]</td>
<td>412  410  388  359</td>
</tr>
<tr>
<td>Latency [cycles]</td>
<td>18  18  18  18</td>
</tr>
<tr>
<td>TP [Msymbols/s]</td>
<td>366  365  345  319</td>
</tr>
<tr>
<td>Power( c ) [W]</td>
<td>0.79  1.25  1.84  3.16</td>
</tr>
<tr>
<td>Throughput/LUTs</td>
<td>46 665  23 356  10 621  4 945</td>
</tr>
</tbody>
</table>

\( c \): Statistical power estimation at max. clock freq. and 1.0 V supply voltage.
performance. For small BS antenna arrays \((B = 32 \text{ and } B = 64)\), MRC-Q is unable to achieve the target BER of 1%. Hence, MRC-Q is only suitable for massive MU-MIMO systems with very high BS-to-UE antenna ratios in which best-in-class error-rate performance is not the main design objective.

VII. Conclusions
We have proposed two nonlinear precoding algorithms, namely C1PO and C2PO, which achieve excellent error-rate performance in 1-bit massive MU-MIMO systems at low computational complexity. To substantiate this claim, we have designed corresponding VLSI architectures—to the best of our knowledge, the first for 1-bit massive MU-MIMO systems—and have presented FPGA reference implementations for a variety of BS antenna array configurations. Our results demonstrate that nonlinear precoding for 1-bit massive MU-MIMO systems is feasible from a hardware implementation perspective, even for antenna arrays with hundreds of BS antennas. As a result, our hardware designs enable BS antenna arrays with 1-bit DACs to reliably transmit high-rate data to multiple UEs, which has the potential to keep the hardware complexity, system costs, and circuit power consumption within manageable limits.

There are many avenues for future work. Besides our convergence results, a theoretical performance analysis of C1PO and C2PO is a challenging open research topic. Implementing precoders for other nonlinear algorithms, such as SQUID \([6]\), which perform better than C1PO and C2PO at low SNR, is left for future work. The design of 1-bit precoding algorithms and hardware accelerators for wideband massive MU-MIMO systems that use orthogonal-frequency division multiplexing (OFDM) is the subject of ongoing work.

APPENDIX A
PROOF OF THEOREM \(7\)
Let \(E(z, x) = \|Ax\|^2_2 + \gamma \|z - x\|^2_2 - \delta \|x\|^2_2\) denote the objective \((11)\) minimized by C1PO. Because \(B^B\) is bounded, the sequence of iterates \(\{(z^{(t)}, x^{(t)})\}\) remains bounded and thus contains a convergent sub-sequence. Denote the limit of this sub-sequence by \((z^*, x^*)\) and set \(E^* = E(z^*, x^*)\). Consider the point \(\hat{z}^* = \arg\min_{x} E(z, x) = (I_B + \gamma^{-1}A^HA)^{-1}x^*\). If \(\hat{z}^* \neq z^*\), then we have the strict inequality
\[
E((\hat{z}^* + z^*)/2, x^*) < \frac{1}{2}E(\hat{z}^*, x^*) + \frac{1}{2}E(z^*, x^*) = E^*
\]
because \(E\) is strongly convex in \(z\). However, this contradicts the fact that \(\hat{z}^* = \arg\min_{x} E(z, x)\), and so it must be the case that \(\hat{z}^* = z^*\). Because \(\delta < \gamma\), \(E\) is strongly convex in \(x\), and a similar argument shows that \(x^* = \arg\min_{x} E(z^*, x)\). Hence, \((z^*, x^*)\) minimizes \(E\) with respect to \(z\) and \(x\) separately; this, combined with the fact that \(E\) is differentiable, and \(B\) coordinate-wise separable, guarantees that \((z^*, x^*)\) satisfies the first-order conditions for \((7)\); see Theorem 2 in \([39]\) and similar arguments in \([40]\).

APPENDIX B
PROOF OF THEOREM \(2\)
Let \(E(z, x) = \|Ax\|^2_2 - \delta \|x\|^2_2\) denote the objective \((11)\) minimized by C2PO. Let \(f\) and \(g\) be defined as in \([12]\). Using the definition of the proximal operator \((13)\) together with \((15)\), the second update \((14)\) of C2PO can be written as
\[
x^{(t+1)} = \arg\min_{x} g(x) + \frac{1}{2\tau}\|x - (x^{(t)} + \tau \nabla f(x^{(t)}))\|^2 \]
\[
= \arg\min_{x} g(x) + f(x^{(t)}) + (x - x^{(t)}, \nabla f(x^{(t)}))\]
\[
+ \frac{1}{2\tau}\|x - x^{(t)}\|^2.
\]
Observe that, whenever \(\tau < \|A^TA\|^{-1}_{2,2}\), the inequality
\[
f(x) \leq f(x^{(t)}) + (x - x^{(t)}, \nabla f(x^{(t)})) + \frac{1}{2\tau}\|x - x^{(t)}\|^2
\]
holds for all \(x\). Using this observation, we can write
\[
E(x^{(t+1)}) = g(x^{(t+1)}) + f(x^{(t+1)})
\]
\[
\leq g(x^{(t+1)}) + f(x^{(t+1)}) + (x^{(t+1)} - x^{(t)}, \nabla f(x^{(t)}))
\]
\[
+ \frac{1}{2\tau}\|x^{(t+1)} - x^{(t)}\|^2
\]
\[
= \min_{x} g(x) + f(x^{(t)}) + (x - x^{(t)}, \nabla f(x^{(t)}))
\]
\[
+ \frac{1}{2\tau}\|x - x^{(t)}\|^2
\]
\[
\leq g(x^{(t)}) + f(x^{(t)}) = E(x^{(t)}).
\]
This shows that the sequence \(\{E(x^{(t)})\}\) is monotonically decreasing. Since the sequence is bounded below, there is some limit \(L = \lim_{t\to\infty} E(x^{(t)})\). Let \(\{x^{(t_k)}\}\) be a convergent sub-sequence of iterates (which must exist because the iterates are bounded) with limit point \(x^*\). Let
\[
x^* = \arg\min_{x} g(x) + f(x^*) + (x - x^*, \nabla f(x^*))
\]
\[
+ \frac{1}{2\tau}\|x - x^*\|^2
\]
be the result of applying the C2PO iteration starting at \(x^*\). Observing that \(E(x^{(t_k+1)}) \leq E(x^{(t_k)}) \leq E(x^{(t_{k-1})})\), and letting \(k \to \infty\), we find that \(E(x^*) = E(x^*) = L\), and
so \( \mathbf{x}^* \) is a minimizer of \( \Phi(f) \). This is only possible if \( 0 \in \partial g(\mathbf{x}^*) + \nabla f(\mathbf{x}^*) \), in which case \( \mathbf{x}^* \) is a stationary point.

ACKNOWLEDGMENTS

The authors thank O. Tirkkonen for insightful discussions on 1-bit precoding. The work of O. Castañeda and C. Studer was supported in part by Xilinx Inc. and by the US National Science Foundation (NSF) under grants ECCS-1408006 and CCF-1535897. The work of S. Jacobsson and G. Durisi was supported by the Swedish Foundation for Strategic Research under grant ID14-0022, and by the Swedish Governmental Agency for Innovation Systems (VINNOVA) within the center ChaseOn. The work of T. Goldstein was supported in part by the US NSF under grant CCF-1535902 and by the US Office of Naval Research under grant N00014-17-1-2078.

REFERENCES


