A New Approach to Automatic Memory Banking using Trace-Based Address Mining

Yuan Zhou · Khalid Al-Hawaj · Zhiru Zhang

School of Electrical and Computer Engineering, Cornell University, Ithaca, NY

{yz882, ka429, zhiruz}@cornell.edu

ABSTRACT

Recent years have seen an increased deployment of FPGAs as programmable accelerators for improving the performance and energy efficiency of compute-intensive applications. A well-known “secret sauce” of achieving highly efficient FPGA acceleration is to create application-specific memory architecture that fully exploits the vast amounts of on-chip memory bandwidth provided by the reconfigurable fabric. In particular, memory banking is widely employed when multiple parallel memory accesses are needed to meet a demanding throughput constraint.

In this paper we propose TraceBanking, a novel and flexible trace-driven address mining algorithm that can automatically generate efficient memory banking schemes by analyzing a stream of memory address bits. Unlike mainstream memory partitioning techniques that are based on static compile-time analysis, TraceBanking only relies on simple source-level instrumentation to provide the memory trace of interest without enforcing any coding restrictions. More importantly, our technique can effectively handle memory traces that exhibit either affine or non-affine access patterns, and produce efficient banking solutions with a reasonable runtime. Furthermore, TraceBanking can be used to process a reduced memory trace with the aid of an SMT prover to verify if the resulting banking scheme is indeed conflict free. Our experiments on Xilinx FPGAs show that TraceBanking achieves competitive performance and resource usage compared to the state-of-the-art across a set of real-life benchmarks with affine memory accesses. We also perform a case study on a face detection algorithm to show that TraceBanking is capable of generating a highly area-efficient memory partitioning based on a sequence of addresses without any obvious access patterns.

1. INTRODUCTION

With the general-purpose CPU performance scaling significantly slowing in the past decade, heterogeneous computer architectures that integrate specialized accelerators are gaining popularity to achieve improved performance and energy efficiency. Along the line, field-programmable gate arrays (FPGAs) have evolved into an attractive option for fulfilling the role of application-specific hardware acceleration, owing to the many recent technological advances on FPGA hardware capabilities as well as the software tooling support for high-level design entrys.

An FPGA-based hardware accelerator is typically highly parallelized and/or deeply pipelined in order to achieve a desirable throughput. As a result, multiple parallel accesses to a single on-chip memory are often required to provide the necessary data bandwidth to sustain the high throughput of the accelerator. However, the embedded memory blocks available on modern FPGA devices (e.g., BRAMs) only provide a very limited number of ports for concurrent reads/writes.1 Simply replicating the memory blocks would not be feasible due to the steep area overhead and potential memory coherence overhead resulting from write operations.

A more viable solution is memory banking, which partitions a memory block into several smaller banks; thus, concurrent memory accesses are distributed to different banks, avoiding or minimizing banking conflicts. Since each memory bank only holds a subset of the original memory contents, memory banking usually yields a significantly lower storage overhead compared to memory duplication. Nevertheless, additional banking logic is still required to orchestrate the data movement between banked memories and compute units in the accelerator. For non-expert FPGA designers, devising a minimum-conflict banking scheme with low hardware overheads is certainly a challenging task. While commercial high-level synthesis (HLS) tools provide some basic support for array partitioning [8], the users remain responsible for manually specifying the banking scheme via vendor-specific pragmas or directives. For this reason, there is an active body of HLS research tackling the problem of automatic array partitioning (i.e., memory banking) given a throughput constraint that is usually specified in terms of pipeline initiation interval (II) [6, 12, 14, 16, 17].

In this paper, we also focus our study on automatic memory banking, and propose TraceBanking, a trace-based banking algorithm that is very different from the existing methods. Specifically, TraceBanking mines a stream of memory address bits to determine a banking scheme that minimizes the number of access conflicts and simplifies the banking logic. Unlike mainstream techniques that are mostly

1Even for ASICs, it is not feasible to have a large number of memory ports due to the excessive area and power overhead [15].
based on static compiler analysis, TraceBanking only relies on simple source-level instrumentation to provide the memory trace of interest without enforcing any coding restrictions (such as static control parts often required by polyhedral analysis [3]). The major technical contributions of our work are threefold:

(1) We offer a fresh look at memory banking, by waiving the requirements of using static compile-time analysis. We show that from a trace of memory addresses, we can identify a set of “interesting” address bits that form the basis of the hardware-efficient memory banking function. In addition, our technique is able to form banking functions that do not belong to the solution space of the existing linear-transformation-based techniques.

(2) We propose a two-step heuristic to solve the trace-based memory banking problem. This heuristic is not only able to exploit regular memory access patterns, but can also generate efficient solutions for applications with irregular memory accesses.

(3) We propose an SMT-based checker that can formally verify if a memory banking solution is free of access conflicts under all possible execution traces. This allows the usage of a reduced (or incomplete) memory trace to significantly speed up TraceBanking, but without the risk of achieving an inferior banking solution. We believe that this formal verification technique is also useful for validating the soundness of existing memory banking algorithms, even though they are designed to be correct by construction.

The rest of this paper is organized as follows: Section 2 presents related work in memory banking; Section 3 formulates the trace-based memory banking problem; Section 4 provides motivational examples to illustrate the intuition behind our work; Section 5 describes our address mining algorithm in detail; Section 6 introduces the SMT-based banking solution checker; Section 7 reports the experimental results on commonly used benchmarks with affine memory accesses, which is followed by a case study on a face detection application with irregular memory accesses in Section 8; and Section 9 concludes this paper with discussion on future work.

2. RELATED WORK

There is a recent line of research that investigates the problem of automatic array partitioning in the context of HLS [20]. Initial efforts focus on one-dimensional arrays and attempt to find a proper cyclic partitioning with optimal scheduling to ensure conflict-free parallel data accesses [7, 11]. More recent proposals such as [16, 17] generalize these results to handle nested loops and multi-dimensional arrays.

Notably, linear transformation is extensively used among the existing array partitioning techniques. For example, the LTB approach [17] searches for a coefficient vector \( \vec{\alpha} \) to construct a cyclic banking function \( bank(X) = (\vec{\alpha} \cdot x) \% N \), given the number of banks \( N \) and the affine memory access pattern. Meng et al. proposed a fast algorithm to generate the LTB coefficient vector \( \vec{\alpha} \) according to the topology of the memory access pattern in multi-dimensional memory space [12]. The GMP approach is a generalization of LTB, which can generate block-cyclic banking function in the form of \( bank(X) = (\vec{\alpha} \cdot x) / B \% N \) [16]. More recently, Cilardo et al. proposed a lattice-based banking algorithm using polyhedron analysis [6].

3. PROBLEM FORMULATION

In this section we provide the definitions and formulate the trace-based memory banking problem. An example of the hardware architecture under discussion, shown in Figure 1, contains a set of compute units and memory banks connected by a crossbar. The number of compute units is denoted as \( K \), and the number of memory banks is denoted as \( N \). For simplicity, we assume that each compute unit only has one memory load port, and each memory bank only has one read port. Our problem formulation as well as the proposed technique can be generalized to handle multi-bank and multi-port memories. In the following, we first define several important concepts before formulating the actual optimization problem.
### Definition 1. Memory Trace

A memory trace $T$ is a sequence of addresses that are grouped into $L$ lists, where all addresses in the same list need to be accessed in parallel. Each of these lists is called a step. Each step contains $K$ addresses which are issued by the compute units. In the following discussions, we refer to the $l$-th step in memory trace $T$ as $T[l]$, and the memory operation requested by the $k$-th compute unit in step $l$ as $T[l][k]$. If compute unit $k$ does not issue any memory request in step $l$, $T[l][k]$ is marked as invalid.

### Definition 2. Memory Banking

A memory banking solution of a trace $T$ consists of a banking function $bank(A)$ and an offset function $offset(A)$. $bank(A)$ maps address $A$ to a memory bank ID, while $offset(A)$ determines the intra-bank position. A memory banking solution can be fully represented by a set of binary variables $\{b_{A,n} \mid A \in T, 0 \leq n < N, n \in \mathbb{N}\}$, where $b_{A,n}$ evaluates to one if and only if address $A$ is mapped to bank $n$, otherwise evaluates to zero.

### Definition 3. Banking Conflict

A banking conflict occurs when two different addresses in the same step are mapped to the same bank.

### Definition 4. Mux Size

The mux size of a memory bank $n$, $M_n$, refers to the number of compute units which access bank $n$ in the memory trace. $M_n$ can be represented by binary variables $\{b_{A,n}\}$ using the following equation:

$$M_n = \sum_{k=0}^{K-1} \sum_{i=0}^{L-1} b_{T[i][k],n}$$

With the above definitions, we can formulate the memory banking problem as an integer linear programming (ILP) problem:

**Problem:** Given a memory trace $T$, find a mapping function $bank(A)$ to optimize the following objective function, which minimizes memory access conflicts as the primary goal and reduces muxing overhead as the secondary goal.

**Objective:**

$$\alpha \cdot Conflicts + \beta \cdot Muxing$$

subject to

$$i, j \in [0, K-1], i \neq j, s_{A_i,A_j} = \sum_{n=0}^{N-1} (b_{A_i,n} \cdot b_{A_j,n}).$$

Here the addresses $A_i$ and $A_j$ refer to the $i$-th and $j$-th address in step $T[l]$, respectively. The binary variable $s_{A_i,A_j}$ equals to one if and only if addresses $A_i$ and $A_j$ are mapped to the same bank, otherwise evaluates to zero (we omit the linearization of non-linear terms due to page limit).

### 4. MOTIVATIONAL EXAMPLES

We use two examples to illustrate the intuition behind TraceBanking. We start with the very simple example in Figure 2, where Figure 2(a) shows a simple loop kernel containing two memory accesses per iteration, and Figure 2(b) shows the associated (truncated) memory trace. From Figure 2(b), it is not difficult to tell that the two addresses in each iteration always differ in the least significant bit (LSB). We refer to such bit as a mask bit. Informally, we define

$$int \ A[SIZE+1];$$

for (int $i=0; i<SIZE; i++$)

for (int $j=1; j<Cols-1; j++)$

foo($A[i], A[i+1]$);

#pragma HLS pipeline II=1
as GMP [16], analyze the symbolic expression of memory accesses and search for appropriate coefficients to construct a banking function in the form of \( \text{bank}(i, j) = [(\alpha_0 i + \alpha_1 j)/B] \% N \). Figure 3(d) shows the resulting 4-bank partitioning scheme, where \( \alpha_0 = 1, \alpha_1 = 2, \) and \( B = 2 \).

Figure 3(e) shows an alternative banking scheme, which is not in the solution space of the GMP approach.\(^2\) By examining the memory trace in Figure 3(c), we can identify two mask bits: the second-to-last bit of \( i \), plus the second-to-last bit of \( j \). These two bits combined can differentiate the four memory accesses belonging to the same iteration. As a result, we can divide the original array into four memory banks according to the values of the two mask bits and arrive at the alternative scheme in Figure 3(e).

These two examples demonstrate the possibility of performing memory banking based on a stream of memory addresses. Although these examples both have affine memory access patterns, TraceBanking is also capable of generating memory partitioning for applications with irregular memory accesses. Regardless of the memory access pattern, it is important to identify the mask bits that form the basis of banking. The value of mask bits is referred to as mask ID. In the following sections, we will discuss how TraceBanking identifies mask bits and derives efficient banking accordingly.

5. TRACEBANKING ALGORITHM

A straightforward method to optimize the objective function formulated in Section 3 is to use an ILP solver. However, ILP solvers are not scalable in general. Therefore, there is a need for heuristics which can find an optimal mapping between addresses and banks with a reasonable execution time.

In this section, we introduce TraceBanking algorithm, a flow of heuristics to solve the problem formulated in Section 3. Specifically, TraceBanking takes the number of available banks as a constraint and finds an optimized mapping by solving two sub-problems: (1) Finding a set of promising address bits to form mask bits, and (2) Finding a mapping between the generated mask IDs and available banks.

The flow of our algorithm is shown in Figure 4. The raw memory trace is first compressed, and a search is performed to find a feasible set of mask bits. Then, a graph will be generated based on the discovered mask bits. The generated graph will be colored such that each color represents a distinct memory bank. Since only the mask bits are used in the banking function, TraceBanking can potentially generate more area-efficient hardware to calculate bank IDs.

5.1 Finding Masks Bits

TraceBanking finds a set of bits from the address to form a mask in the first step. At the beginning, raw trace is preprocessed to remove redundant information, i.e., memory accesses with the same address in the same step; because having multiple accesses with the same address can be satisfied in the same cycle with no overhead. Then, the raw memory trace is compressed by initial-compression, which combines steps with identical accesses into a single step. A weight property is added to each step indicating its frequency in the raw trace. The resultant compressed trace is referred to as \( T_c \). Figure 6(a) shows the compression process for our bicubic example; no compression can be performed since no identical steps exist in the raw trace.

After cleaning up and compressing the trace, TraceBanking performs a multi-objective exhaustive search using the **findMasksBits** algorithm in Figure 5. This algorithm takes the available number of banks, \( N \), as well as the compressed memory trace, \( T_c \), as inputs. It evaluates any candidate mask using two objectives: mask IDs’ conflicts and conflict graph coloring.

The search starts with masks that includes \( \lfloor \log_2(N_A) \rfloor \) bits, where \( N_A \) is the maximum number of memory accesses in all steps in the compressed memory trace. It tries all possible combinations of \( \lfloor \log_2(N_A) \rfloor \) bits; each combination constructs a unique mask which maps addresses to mask IDs. Going through the compressed memory trace, the algorithm evaluates mask IDs conflicts by counting the number of times when two addresses in the same step have the same mask ID.

After finding a mask that has the lowest number of mask IDs conflicts, the algorithm constructs a conflict graph — every node in the graph represents a mask ID; edges between nodes indicate mask IDs that appeared together in at least one step, and the edges’ weights represent the frequency. Thus, the problem is transformed to a graph coloring problem. The algorithm calculates a lower bound for

---

\(^2\)We omit the formal proof due to space limit.
Figure 6: Bicubic example being processed throughout our algorithm — (a) Initial part of the flow where the raw memory trace is compressed and preprocessed; in this example, no compression nor preprocessing is possible. (b) The first heuristic of the flow conducting a search for mask bits; underlining shows the bits considered for masking and conflicts are highlighted with red colored mask bits. For masks with no conflicts, the heuristic checks colorability by max-clique. Then, the heuristic takes the mask with minimum conflicts and maximum max-clique (\#C symbolizes number of conflicts for each search branch). For bicubic, the mask with no-conflicts and a max-clique of at most \(N\), 4, is 010 010. (c) Finally, the second heuristic compresses the trace and colors the generated graph. It is obvious that for bicubic the graph is rather simple to color as it is only a clique of size 4.

Algorithm 2 mapMaskIDsToBanks

Input: \(N\) – number of available banks;
\(T_c\) – Compressed memory trace;
\(mask\) – Initial Mask;

\[
\begin{align*}
&\text{do} \\
&\quad \text{/* Perform mask-compression */} \\
&\quad T_{mc} \leftarrow \text{maskCompression}(T_c, mask) \\
&\quad \text{/* Construct a graph */} \\
&\quad G \leftarrow \text{constructGraph}(T_{mc}) \\
&\quad \text{/* Create a seed using greedy coloring */} \\
&\quad S \leftarrow \text{greedyGraphColoring}(G, N) \\
&\quad \text{/* Color G using evolutionary algorithm */} \\
&\quad \text{numConflicts, mapping} \leftarrow \text{eaGraphColoring}(S, N) \\
&\quad \text{/* Ending conditions */} \\
&\quad \text{if bitsRemaining(mask) = 0 then} \\
&\quad \quad \text{break} \\
&\quad \text{else if numConflicts \neq 0 then} \\
&\quad \quad \text{mask \leftarrow performBestFirstSearch(T_c, mask)} \\
&\quad \text{end while numConflicts \neq 0;} \\
\end{align*}
\]

Figure 7: The second heuristic in our flow to map mask IDs to banks.

The colorability of the conflict graph by finding the maximum clique; where graphs with maximum clique size greater than the number of banks, \(N\), cannot be colored with \(N\) colors. Using bicubic as an example, Figure 6(b) shows the resulting graph; it is obvious that the graph has a maximum clique of 4; therefore, the first step concludes with the mask 010 010.

5.2 Mapping Mask IDs to Banks

The second step, with its algorithm shown in Figure 7, takes the mask found by \textit{findMasksBits} and finds bank assignments for mask IDs such that the number of potential conflicts is minimized. To reduce complexity and redundant work, the algorithm further compresses the trace by applying mask-compression, which is similar to initial-compression explained earlier except that the addresses are replaced with their corresponding mask IDs. After that, the algorithm will construct a conflict graph from the mask-compressed trace.

To find a coloring for the generated graph, the algorithm reduces the problem to maximum coloring.\(^3\) The number of banks represents the number of colors available for coloring. The algorithm then generates a colored seed, \(S\), using multiple order-based greedy heuristics \([2, 10]\). If the seed is not conflict-free, TraceBanking attempts to minimize the number of conflicts using an evolutionary algorithm \([10]\). In each evolutionary step, TraceBanking performs a set of heuristics that showed efficiency in coloring memory-accesses graphs \([10]\). Once a coloring for a conflict graph is found, the evolutionary algorithm concludes with banking function \(\text{bank}(A)\) constructed from the coloring.

If the algorithm cannot find a conflict-free coloring in a bounded number of evolutionary steps, it is assumed that the graph is uncolorable. Then, TraceBanking proceeds to perform a best-first search. The search will modify the mask by adding one extra bit to it. It is reasonable to assume that address bits that are part of the final mask have an additive effect in reducing conflicts when considered; as a result, the best-first search tests the colorability of remaining bits by adding them to the mask in isolation. Then, the search includes the bit that yields a graph with the minimum number of conflicts permanently to the mask. Since this is a rough assumption, TraceBanking might use more bits than theoretically needed to find a feasible banking.

Taking the bicubic example from before, the algorithm will take the mask found by \textit{findMasksBits} and its corresponding conflict graph. Then, it will attempt to color the four-node conflict graph shown in Figure 6(c). Because the

\(^3\)In this paper we strictly target conflict-free solutions. However, TraceBanking is easily extended to adapt conflict-less solutions.
5.3 Offset Generation

After finding the mask bits and generating the banking function bank(A), we need to find an offset function offset(A) to transform an address A to a corresponding intra-bank offset. An intuitive method to generate the offset function is to simply scan every data element in the data domain and assign consecutive integers to data elements in each bank. Without any constraints on the offset function, this integer counting method is effective for both regular and irregular banking solutions. In addition, this method is optimal in terms of storage overhead since the data elements in each bank are guaranteed to have consecutive intra-bank offsets.

5.4 Uncovering Closed-Form Representations

The banking and offset functions obtained from Sections 5.2 and 5.3 are represented in the form of look-up tables by default. For applications with regular memory access patterns, it is possible to convert the look-up tables generated by TraceBanking into equivalent closed-form equations, which essentially uncovers and exploits the regularity in the original application.

Our key idea is to decompose the look-up table into multiple stages of smaller look-up tables, and use a simple search to map the sub-tables into equations. The composition of memory addresses is retrieved from source-level instrumentation. An example is shown in Figure 8. The original 5-bit mask shown in Figure 8(a) is divided into two disjoint sub-masks: i-Mask and j-Mask — according to the corresponding array indices. By grouping the entries with the same i-Mask ID, the original banking solution shown in Figure 8(a) is decomposed into two levels, where the first level is used to determine the look-up table for the second level. Figure 8(b) shows how to index the decomposed look-up tables, where the i-Mask ID is used to index the first-level table and j-Mask ID is needed to index the second-level table and retrieve the actual bank ID. As illustrated in Figure 8(b), each second-level look-up table can be represented by a modulus operation. By searching for coefficients to represent the relationship between i-Mask ID and the constants in the equations (highlighted in bold in Figure 8(b)), we can represent the original banking solution with one closed-form equation shown in Figure 8(b). Clearly, this approach can easily be generalized to arrays with higher dimensions. We also use a similar method to uncover the closed-form equation for an offset function, if such representation exists.

According to our experiments on a set of benchmarks with affine memory accesses, all of the results generated by TraceBanking can be represented by our equation template which is generalized from block-cyclic partitioning. Some of our solutions fall into the category of the cyclic partitioning scheme mentioned in the LTB approach [17]. Other solutions are not in the solution space of block-cyclic partitioning. Nonetheless, they can be efficiently represented with a few number of mask bits (e.g., bicubic solution in Figure 3(e)).

6. SMT-BASED VERIFICATION

The previous discussion in Section 5 assume that the input memory trace to TraceBanking is complete. In other words, the input trace captures all memory accesses from the entire software execution. In this case, our solution is supposed to be sound in terms of guaranteeing no banking conflicts. When the given memory trace is incomplete or input-dependent, it is necessary to have a formal mechanism to verify if the resulting solution remains conflict-free under all possible scenarios.

To this end, we propose an SMT-based checker to validate the soundness of the solution with the aid of a simple compiler analysis. The checker takes the memory banking solution from TraceBanking, and the address expressions of the loop kernel from compiler analysis. With this information, we can formulate the SMT problem as shown in Figure 9(a). The integer variables for the SMT problem correspond to loop induction variables in the original application. We represent the banking solution as a function of array indices, and expressions of array indices as functions of loop induction variables. Then, we specify the iteration...
domain as a constraint. Additionally, we add one constraint of having at least one banking conflict in the whole iteration domain. If the SMT problem is proven to be unsatisfiable, there is no memory access conflict in all iterations and the banking solution is valid.

The example shown in Figure 9(b) illustrates how the SMT-based checker validates the banking solution for bicubic interpolation shown in Figure 3(e): the loop induction variables $i$ and $j$ are used as SMT variables; and the banking function is represented symbolically. The constraints specify boundaries for the SMT variables, and compare every pair of addresses from the same iteration to check for conflicts.

As a formal verification technique, the SMT-based checker is also useful for validating the soundness of existing memory banking algorithms to detect any bugs in compiler analysis and code transformation, even though they are designed to be correct by construction.

Figure 9: SMT formulation of the banking solution checker — (a) General formulation. (b) Example of Bicubic interpolation.

Define loop induction variables as SMT variables:
int $i$

Define banking function:
int $B(idx Ca)$

Define expressions of array indices:
int $[i, idxx(i)]$

Construct iteration domain $D$:
assert ($(i[0] > 0)$ and $(i[1] > 0)$ and ...)

Constraint for having at least one conflict:
assert
$\bigvee_{\forall \vec{i} \in D, a \neq b} B(idx_a(\vec{i})) = B(idx_b(\vec{i}))$

(a) General SMT formulation

(b) Example of Bicubic interpolation

7. EXPERIMENTAL RESULTS

In our experiments, memory traces are generated by source-level instrumentation of the loop kernels. The addresses in the memory traces are constructed by concatenating multidimensional array indices. The core algorithm of TraceBanking processes the memory trace and generates banking and offset functions. This algorithm is implemented in C. We use Vivado Design Suite 2016.2 from Xilinx [19] for high-level synthesis (HLS), logic synthesis and simulation. The target FPGA device is Xilinx Virtex-7. The memory banking flow takes in the memory trace and generates solutions in the form of look-up tables or close-form equations. We use Z3, an SMT theorem prover, to verify the generated solutions [9]. Each verified banking solution as well as the corresponding application are implemented as a synthesizable HLS code.

We adopt six different loop kernels from the GMP work [16]. In addition, we add Stencil3D benchmark from MachSuite [13], which accesses a three-dimensional array, to stress the robustness and scalability of our approach. The memory access patterns of these loop kernels are shown in Figure 10 — the solid dots represent the data elements being accessed in each iteration of the loop kernels. We substitute the processing phase of these loop kernels with a simple summation to better compare the overhead of different memory banking solutions. We also implemented the GMP method [16] as the baseline. All the designs are pipelined with II of one for maximum throughput. The input size of the designs is $64 \times 48$ ($5 \times 64 \times 48$ for Stencil3D), and the data size is 8-bit. We employ efficient algorithms from [18] and implement our own modulus functions to minimize area. These customized modulus functions are used in both the baseline and our approach.

7.1 Area Comparison

Table 1 shows comparison with the baseline, where the minimum number of banks is used. Both GMP and our ap-
Table 1: Timing and resource usage comparison with baseline using GMP [16], where the minimum number of memory banks is used — target clock period = 5ns; BRAM = # of BRAMs; Slice = # of slices; LUT = # of lookup-tables; FF = # of flip-flops; DSP = # of DSPs; CP = achieved clock period.

<table>
<thead>
<tr>
<th>Benchmark</th>
<th># Accesses</th>
<th>Method</th>
<th># Banks</th>
<th>Mask Width</th>
<th>BRAM</th>
<th>Slice</th>
<th>LUT</th>
<th>FF</th>
<th>DSP</th>
<th>CP(ns)</th>
</tr>
</thead>
<tbody>
<tr>
<td>BICUBIC</td>
<td>4</td>
<td>Baseline</td>
<td>4</td>
<td>-</td>
<td>74</td>
<td>217</td>
<td>163</td>
<td>0</td>
<td>3.89</td>
<td></td>
</tr>
<tr>
<td></td>
<td></td>
<td>TraceBanking</td>
<td>4</td>
<td>2</td>
<td>74 (+0.0%)</td>
<td>212 (-2.3%)</td>
<td>184 (+13%)</td>
<td>0 (+0.0%)</td>
<td>3.66</td>
<td></td>
</tr>
<tr>
<td>DECONV</td>
<td>5</td>
<td>Baseline</td>
<td>5</td>
<td>-</td>
<td>185</td>
<td>531</td>
<td>383</td>
<td>10</td>
<td></td>
<td>3.52</td>
</tr>
<tr>
<td></td>
<td></td>
<td>TraceBanking</td>
<td>5</td>
<td>12</td>
<td>182 (-1.6%)</td>
<td>541 (+1.9%)</td>
<td>383 (+0.0%)</td>
<td>10 (+0.0%)</td>
<td>3.37</td>
<td></td>
</tr>
<tr>
<td>DENOISE-UR</td>
<td>8</td>
<td>Baseline</td>
<td>8</td>
<td>-</td>
<td>180</td>
<td>610</td>
<td>391</td>
<td>0</td>
<td>4.15</td>
<td></td>
</tr>
<tr>
<td></td>
<td></td>
<td>TraceBanking</td>
<td>8</td>
<td>4</td>
<td>188 (+4.4%)</td>
<td>623 (+1.1%)</td>
<td>427 (+9.2%)</td>
<td>0 (+0.0%)</td>
<td>3.62</td>
<td></td>
</tr>
<tr>
<td>MOTION_C</td>
<td>4</td>
<td>Baseline</td>
<td>4</td>
<td>-</td>
<td>76</td>
<td>186</td>
<td>153</td>
<td>0</td>
<td>3.58</td>
<td></td>
</tr>
<tr>
<td></td>
<td></td>
<td>TraceBanking</td>
<td>4</td>
<td>2</td>
<td>68 (-11%)</td>
<td>193 (+3.8%)</td>
<td>190 (+24%)</td>
<td>0 (+0.0%)</td>
<td>3.65</td>
<td></td>
</tr>
<tr>
<td>MOTION_LV</td>
<td>6</td>
<td>Baseline</td>
<td>6</td>
<td>-</td>
<td>146</td>
<td>425</td>
<td>392</td>
<td>6</td>
<td></td>
<td>3.31</td>
</tr>
<tr>
<td></td>
<td></td>
<td>TraceBanking</td>
<td>6</td>
<td>6</td>
<td>146 (+0.0%)</td>
<td>425 (+0.0%)</td>
<td>392 (+0.0%)</td>
<td>6 (+0.0%)</td>
<td>3.31</td>
<td></td>
</tr>
<tr>
<td>SOBEL</td>
<td>9</td>
<td>Baseline</td>
<td>9</td>
<td>-</td>
<td>405</td>
<td>1296</td>
<td>692</td>
<td>27</td>
<td></td>
<td>3.93</td>
</tr>
<tr>
<td></td>
<td></td>
<td>TraceBanking</td>
<td>9</td>
<td>12</td>
<td>350 (-14%)</td>
<td>1059 (-18%)</td>
<td>719 (+3.9%)</td>
<td>27 (+0.0%)</td>
<td>3.96</td>
<td></td>
</tr>
<tr>
<td>STENCIL3D</td>
<td>7</td>
<td>Baseline</td>
<td>7</td>
<td>-</td>
<td>32</td>
<td>966</td>
<td>700</td>
<td>7</td>
<td></td>
<td>3.82</td>
</tr>
<tr>
<td></td>
<td></td>
<td>TraceBanking</td>
<td>7</td>
<td>15</td>
<td>308 (-4.3%)</td>
<td>932 (-3.5%)</td>
<td>624 (-11%)</td>
<td>7 (+0.0%)</td>
<td>3.74</td>
<td></td>
</tr>
<tr>
<td>Average</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td>-3.8%</td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td>-2.4%</td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td>+5.6%</td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td>+0.0%</td>
</tr>
</tbody>
</table>

Table 2: Timing and resource usage comparison with baseline using GMP [16], where the number of memory banks is restricted to be a power-of-two — target clock period = 5ns; BRAM = # of BRAMs; Slice = # of slices; LUT = # of lookup-tables; FF = # of flip-flops; DSP = # of DSPs; CP = achieved clock period.

<table>
<thead>
<tr>
<th>Benchmark</th>
<th># Accesses</th>
<th>Method</th>
<th># Banks</th>
<th>Mask Width</th>
<th>BRAM</th>
<th>Slice</th>
<th>LUT</th>
<th>FF</th>
<th>DSP</th>
<th>CP(ns)</th>
</tr>
</thead>
<tbody>
<tr>
<td>DECONV</td>
<td>5</td>
<td>Baseline</td>
<td>8</td>
<td>-</td>
<td>129</td>
<td>418</td>
<td>278</td>
<td>0</td>
<td>3.63</td>
<td></td>
</tr>
<tr>
<td></td>
<td></td>
<td>TraceBanking</td>
<td>8</td>
<td>4</td>
<td>125 (-3.1%)</td>
<td>411 (-1.7%)</td>
<td>302 (+8.6%)</td>
<td>0 (+0.0%)</td>
<td>3.11</td>
<td></td>
</tr>
<tr>
<td>MOTION_LV</td>
<td>6</td>
<td>Baseline</td>
<td>8</td>
<td>-</td>
<td>117</td>
<td>369</td>
<td>237</td>
<td>0</td>
<td>3.56</td>
<td></td>
</tr>
<tr>
<td></td>
<td></td>
<td>TraceBanking</td>
<td>8</td>
<td>3</td>
<td>119 (+1.7%)</td>
<td>391 (+6.0%)</td>
<td>282 (+19%)</td>
<td>0 (+0.0%)</td>
<td>3.77</td>
<td></td>
</tr>
<tr>
<td>SOBEL</td>
<td>9</td>
<td>Baseline</td>
<td>16</td>
<td>-</td>
<td>328</td>
<td>1114</td>
<td>525</td>
<td>0</td>
<td>4.34</td>
<td></td>
</tr>
<tr>
<td></td>
<td></td>
<td>TraceBanking</td>
<td>16</td>
<td>4</td>
<td>340 (+3.7%)</td>
<td>1129 (+1.3%)</td>
<td>472 (-10%)</td>
<td>0 (+0.0%)</td>
<td>3.89</td>
<td></td>
</tr>
<tr>
<td>STENCIL3D</td>
<td>7</td>
<td>Baseline</td>
<td>8</td>
<td>-</td>
<td>195</td>
<td>649</td>
<td>443</td>
<td>0</td>
<td>3.87</td>
<td></td>
</tr>
<tr>
<td></td>
<td></td>
<td>TraceBanking</td>
<td>8</td>
<td>6</td>
<td>201 (+3.1%)</td>
<td>655 (+0.9%)</td>
<td>450 (+1.6%)</td>
<td>0 (+0.0%)</td>
<td>3.70</td>
<td></td>
</tr>
<tr>
<td>Average</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td>+1.4%</td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td>+1.6%</td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td>+4.8%</td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td>+0.0%</td>
</tr>
</tbody>
</table>

Table 3: Execution time of TraceBanking on Motion_LV with different array sizes.

<table>
<thead>
<tr>
<th>Array Size</th>
<th>Runtime (s)</th>
</tr>
</thead>
<tbody>
<tr>
<td>12x12</td>
<td>0.0096</td>
</tr>
<tr>
<td>32x24</td>
<td>2.19</td>
</tr>
<tr>
<td>64x48</td>
<td>4.88</td>
</tr>
<tr>
<td>128x96</td>
<td>6.94</td>
</tr>
<tr>
<td>320x240</td>
<td>12.87</td>
</tr>
<tr>
<td>640x480</td>
<td>33.38</td>
</tr>
</tbody>
</table>

TraceBanking is able to generate competitive memory banking solutions from memory traces. However, using a complete memory trace may be expensive when the memory trace is large. Table 3 shows how the execution time of TraceBanking scales with an increasing array size. For applications with affine memory accesses, we can apply trace reduction to reduce the runtime. The general idea is to use a power-of-two instead of the minimum. Therefore, we conduct this experiment for the four benchmarks whose number of banks is not a power-of-two and compare our results with the baseline. Detailed experiment results are shown in Table 2. Comparing with the corresponding entries in Table 1, the designs in Table 2 generally have less area even though they use more memory banks and a more complex crossbar, because banking functions are significantly simplified when the number of banks is a power-of-two. For GMP designs, multiplication and division become simple shifting operations, while modulus operations are just selecting LSBs. For our designs, the resource saving comes from the reduction in mask width. Compared with baseline, our designs use a negligible 1.4% more slices. In general, the hardware generated by our trace-based memory banking approach is comparable with GMP in terms of area and timing.

7.2 Scalability

TraceBanking is able to generate competitive memory banking solutions from memory traces. However, using a complete memory trace may be expensive when the memory trace is large. Table 3 shows how the execution time of TraceBanking scales with an increasing array size. For applications with affine memory accesses, we can apply trace reduction to reduce the runtime. The general idea is to use a power-of-two instead of the minimum. Therefore, we conduct this experiment for the four benchmarks whose number of banks is not a power-of-two and compare our results with the baseline. Detailed experiment results are shown in Table 2. Comparing with the corresponding entries in Table 1, the designs in Table 2 generally have less area even though they use more memory banks and a more complex crossbar, because banking functions are significantly simplified when the number of banks is a power-of-two. For GMP designs, multiplication and division become simple shifting operations, while modulus operations are just selecting LSBs. For our designs, the resource saving comes from the reduction in mask width. Compared with baseline, our designs use a negligible 1.4% more slices. In general, the hardware generated by our trace-based memory banking approach is comparable with GMP in terms of area and timing.
Table 4: Execution time of TraceBanking with reduced memory trace — Initial mask refers to the mask found by Section 5.1, while the Final mask refers to the mask found by Section 5.2.

<table>
<thead>
<tr>
<th>Benchmark</th>
<th>Reduced Array Size</th>
<th>Mask Width</th>
<th>Runtime (s)</th>
</tr>
</thead>
<tbody>
<tr>
<td>BICUBIC</td>
<td>8 × 8</td>
<td>2</td>
<td>2</td>
</tr>
<tr>
<td>DENOISE</td>
<td>10 × 10</td>
<td>4</td>
<td>8</td>
</tr>
<tr>
<td>DENOISE2</td>
<td>16 × 16</td>
<td>4</td>
<td>4</td>
</tr>
<tr>
<td>MOTION_C</td>
<td>8 × 8</td>
<td>2</td>
<td>2</td>
</tr>
<tr>
<td>MOTION_LV</td>
<td>12 × 12</td>
<td>4</td>
<td>4</td>
</tr>
<tr>
<td>SOBEL</td>
<td>18 × 18</td>
<td>6</td>
<td>10</td>
</tr>
<tr>
<td>STENCIL3D</td>
<td>5 × 14 × 14</td>
<td>6</td>
<td>11</td>
</tr>
</tbody>
</table>

A partial memory trace which covers an adequate number of steps. Because of memory access pattern redundancy in the trace, the generated banking scheme is likely to comply with banking schemes generated from a full trace. Since the banking scheme generated from a partial trace is not guaranteed to be valid, we use the SMT-based checker proposed in Section 6 to validate it. If the validation fails, we revert to using the complete memory trace.

We perform experiments with reduced memory traces for all the benchmarks listed in Table 1. For the size of the reduced trace, we use an empirical value of 2 × #Banks in each dimension of the array. For example, if the loop kernel conducts Sobel edge detection on an VGA image (640 × 480), rather than iterating through the whole image, we execute the loop kernel on an 18 × 18 sub-image and use this reduced trace as the input to TraceBanking. For all the benchmarks listed in Table 1, TraceBanking is able to generate solutions which are proven to be valid using the reduced traces as inputs. Moreover, these solutions are identical to the ones generated from complete traces. The execution time of the SMT-based checker is less than a second. As shown in Table 4, the execution time of TraceBanking is reduced significantly by using partial traces without sacrificing the quality of the solutions.

A critical observation from Table 4 is that, in most benchmarks, the final solution is either in the beginning or at the very end of the search space. TraceBanking exploits the aforementioned observation in pruning the search space by performing two simultaneous searches: forward search and backward search. Forward search starts from the mask with minimum number of bits upward to the mask with maximum number of bits, stopping with the first mask that satisfies no conflicts. On the other hand, backward search starts from the mask with maximum number of bits downward to the mask with minimum number of bits, stopping when no bit can be removed without causing conflicts.

8. CASE STUDY: HAAR FACE DETECTION

In this section, we use Haar face detection [5] as a case study to show the efficacy of TraceBanking on applications with non-affine memory accesses. The Haar algorithm uses cascaded classifiers to detect human faces rapidly and robustly. Thousands of weak classifiers are integrated into a Haar system, and each of them has a distinct memory access pattern. A code snippet of the loop kernel in Haar algorithm is shown in Figure 11. The array window is a 25 × 25 image buffer and is steadily shifted in from the input image. Therefore, it is implemented with discrete registers. In each iteration, the loop kernel reads pixels into the array coord and process them in the function foo(). There are 2913 classifiers in total. The constant arrays rectangles_array[] store the constants needed to compute the array indices in each iteration. There is an if statement inside the loop kernel. When the condition is met, the loop kernel accesses 12 pixels from the window array in that iteration; otherwise, 8 pixels are accessed.

In order to maximize throughput, we need to fully pipeline the CLASSIFIER loop in Figure 11, where each classifier requires eight or 12 parallel accesses to the image buffer. Existing techniques cannot generate an efficient banking solution for this problem due to two reasons: (1) The 2013 classifiers have more than 2000 different memory access patterns in total, and (2) The array indices are non-affine without any linear relationship with the iteration variable filter_no. With TraceBanking, we are able to generate a conflict-free banking solution to partition the image buffer window[25][25] into 28 memory banks using the whole address as mask bits. The execution time is less than a second. Because the window array is a shifting window implemented using discrete registers, in this scenario, the memory banks are actually register banks.

Our baseline is a straightforward design that uses 12 instances of 625-to-1 multiplexer. We compare our design with the baseline and the result is shown in Table 5. The TraceBanking design in Table 5 refers to the memory banking design generated by our approach, and the Full Mux design refers to the baseline. For these two designs, we only extract the loop kernel part shown in Figure 11 to better compare the banking hardware overhead.

Figure 11: Loop kernel of Haar face recognition.
Table 5: Timing and resource usage comparison of the two designs — target clock period = 5ns; BRAM = # of BRAMs; Slice = # of slices; LUT = # of lookup-tables; FF = # of flip-flops; DSP = # of DSPs; CP = achieved clock period; Latency = latency of the loop kernel.

<table>
<thead>
<tr>
<th>Implementation</th>
<th>BRAM</th>
<th>Slice</th>
<th>LUT</th>
<th>FF</th>
<th>DSP</th>
<th>CP(ns)</th>
<th>Latency</th>
</tr>
</thead>
<tbody>
<tr>
<td>TraceBanking</td>
<td>34</td>
<td>4915</td>
<td>8266</td>
<td>12559</td>
<td>6</td>
<td>4.52</td>
<td>2923</td>
</tr>
<tr>
<td>Full Mux</td>
<td>22</td>
<td>21275</td>
<td>53553</td>
<td>23785</td>
<td>3</td>
<td>9.22</td>
<td>2919</td>
</tr>
</tbody>
</table>

Table 5 compares the Full Mux design with the TraceBanking design. Our TraceBanking design reduces Slice, LUT and Flip-Flop usage by 76.9%, 84.6% and 47.2%, respectively. Meanwhile, clock period is improved by 51.0%. BRAM usage increases because of the overhead in storing look-up tables for banking and offset functions. The reduction in logic resource usage results from the simplified muxing network in the TraceBanking design. In the TraceBanking design, two levels of multiplexers are used to connect the registers with the compute units, and each multiplexer has less than 30 inputs. In contrast, the Full Mux design uses 12 instances of 625-to-1 multiplexers, which consumes a lot more area. Even worse, the Full Mux design is extremely hard to route and unable to meet the 5ns timing target. Therefore, even though the Full Mux design has similar latency with the TraceBanking design, the total execution time of the loop kernel is much worse. Clearly, the banking scheme generated by TraceBanking helps improve both area and performance of the design, which contains very irregular memory accesses.

9. CONCLUSION AND FUTURE WORK

In this work, we propose TraceBanking, a memory banking approach using trace-based address mining. By analyzing the input memory trace of an application, we select the important address bits which can guide our banking decision, and apply a graph coloring algorithm to generate efficient banking solutions. We also propose an SMT-based checker to validate our solution. Experiments show that TraceBanking can generate comparable hardware with the state-of-the-art partitioning algorithm for applications with affine memory access patterns. In addition, a case study on Haar face detection demonstrates that TraceBanking can generate valid and efficient banking solutions for applications with non-affine memory accesses.

Our work opens a new path in automatically generating parallel hardware architecture using memory trace analysis, and we prove that this compute-intensive task can be accomplished within a reasonable amount of time. The execution time of our approach can be further reduced by launching parallel threads to exploit the heavy parallelism in the best-first search.

Furthermore, we believe our approach can be extended to generate other specialized parallel architectures, such as data reuse buffers commonly used for stencil-like applications. Additionally, trace analysis provides an opportunity for automatically generating a specialized on-chip caching system for applications with non-affine memory accesses, or even data-driven applications whose memory access pattern cannot be statically determined.

Acknowledgements

This work was supported in part by Intel Corporation under the ISRA Program, NSF Awards #1337240 and #1453378, a DARPA Young Faculty Award, and a research gift from Xilinx, Inc. Khalid Al-Hawaj is supported by King Abdullah Scholarship Program (KASP) and King Fahd University of Petroleum and Minerals (KFUPM).

10. REFERENCES