Heterogeneous computer architectures that make extensive use of hardware accelerators, such as FPGAs, GPUs, and neural processing units, have shown significant potential to deliver orders-of-magnitude improvements in compute efficiency for a broad range of applications. However, system designers exploring these non-traditional architectures generally lack effective design methodologies and tools to swiftly navigate the intricate design trade-offs and achieve rapid design closure. While several heterogeneous computing platforms are becoming commercially available to a wide user base, they remain very difficult to program, especially those with reconfigurable logic. To address these pressing challenges, my research group investigates new applications, programming models, and computer-aided design (CAD) algorithms and tools to enable productive design and implementation of highly efficient application- and domain-specific computer systems. Our cross-cutting research intersects CAD, compilers, and computer architecture at multiple scales, from circuit-level building blocks, to chip-level processor and co-processor cores, to system-level heterogeneous compute nodes. In particular, we are currently tackling the following important and challenging problems:
- Algorithm-Hardware Co-Design for Machine Learning Acceleration
- Multi-Paradigm Programming for Heterogeneous Platforms
- Intelligent High-Level Synthesis
- Scale-Out Design Automation
- Programmable and Polymorphic Hardware Specialization
We are investigating various accelerator architectures for compute-intensive machine learning applications, employing an algorithm-hardware co-design approach to achieve both high performance and low energy consumption. We are among the first to build FPGA and ASIC accelerators for aggressively quantized convolutional neural networks (CNNs), especially those with binarized weights and activations [C35][W3][C39][W4]. Since the dominant computations in these networks are logic operations and their memory requirements are greatly reduced, they are well suited to a wide range of applications in the emerging Internet of intelligent things. In addition, we are developing a new suite of realistic benchmarks for software-defined FPGA-based computing [C41]. Unlike previous efforts, we aim to provide parametrizable benchmarks beyond the simple kernel level by incorporating real applications from emerging domains.
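The hardware savings in binarized networks come from the fact that a multiply-accumulate over {-1, +1} values reduces to XNOR followed by a population count. A minimal software sketch of this kernel follows; the packing convention (LSB-first, bit 1 encoding +1) and the function name are illustrative, not taken from our accelerators:

```c
#include <stdint.h>

/* Binarized dot product: an n-element {-1,+1} vector is packed into a
 * machine word (bit = 1 encodes +1, bit = 0 encodes -1, LSB first).
 * Element-wise multiplication becomes XNOR, and accumulation becomes a
 * population count -- no hardware multipliers are required. */
static int binary_dot(uint32_t a, uint32_t b, int n) {
    uint32_t mask = (n == 32) ? 0xFFFFFFFFu : ((1u << n) - 1u);
    uint32_t xnor = ~(a ^ b) & mask;         /* 1 where signs agree    */
    int matches = __builtin_popcount(xnor);  /* count of +1 products   */
    return 2 * matches - n;                  /* mismatches count as -1 */
}
```

For example, a = [+1, -1, +1, +1] packs to 0b1101 and b = [+1, +1, -1, +1] to 0b1011, and `binary_dot(0xD, 0xB, 4)` returns 0, matching the true dot product 1 - 1 - 1 + 1. (`__builtin_popcount` is a GCC/Clang intrinsic; a portable version would count bits manually.)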
The latest advances in industry have produced highly integrated heterogeneous hardware platforms, such as the CPU+FPGA multi-chip packages by Intel and the GPU- and FPGA-enabled AWS cloud by Amazon. Although these heterogeneous computing platforms are becoming commercially available to a wide user base, they are very difficult to program, especially the FPGAs. As a result, the use of such platforms has been limited to a small subset of programmers with specialized knowledge of low-level hardware details.
To democratize accelerator programming, we are starting a new project in collaboration with Prof. Adrian Sampson of CS and Prof. Jason Cong and Prof. Miryung Kim at UCLA. Our goal is to develop a highly productive multi-paradigm programming infrastructure that explicitly embraces heterogeneity to integrate a variety of programming models into a single, unified programming interface. We will also develop automated compilation from high-level domain-specific languages, novel runtime systems, and debugging support.
Manually creating specialized hardware through traditional register-transfer-level (RTL) design can yield high performance, but it is usually the least productive approach. As specialized accelerators become more integral to achieving the performance and energy goals of future hardware, there is a crucial need for above-RTL design automation to enable productive modeling, rapid exploration, and automatic generation of customized hardware based on high-level languages. Along this line, there has been an increasing use of high-level synthesis (HLS) tools to compile algorithmic descriptions (e.g., C/C++, Python) to RTL designs for quick ASIC or FPGA implementation of hardware accelerators [J5][J4].
While the latest HLS tools have made encouraging progress with much improved quality-of-results (QoR), they still rely heavily on designers to manually restructure source code and insert vendor-specific directives that guide the synthesis tool through a vast and complex solution space. The lack of out-of-the-box QoR guarantees presents a major barrier to non-expert users. To address this challenge, we are developing a new generation of HLS techniques that feature scalable cross-layer synthesis [C25][C26], complexity-effective runtime optimization [C28][J7][C33], and trace-based analysis [C36] to enable a radically accelerated and greatly simplified hardware design experience, while retaining QoR on par with that of "ninja" designers.
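To illustrate the kind of manual guidance today's HLS flows demand, the hypothetical kernel below uses Vivado-HLS-style directives (pragma spellings differ across vendors) to request array partitioning and loop pipelining; without such hints, a typical tool would schedule the loop sequentially through a single memory port. As plain C the pragmas are ignored, so the kernel still compiles and runs in software:

```c
#define N 128

/* Dot-product kernel prepared for HLS: partitioning the input arrays
 * gives the pipelined loop enough memory ports to sustain one loop
 * iteration per clock cycle (II = 1). */
int dot_kernel(const int a[N], const int b[N]) {
#pragma HLS ARRAY_PARTITION variable=a cyclic factor=2
#pragma HLS ARRAY_PARTITION variable=b cyclic factor=2
    int acc = 0;
    for (int i = 0; i < N; i++) {
#pragma HLS PIPELINE II=1
        acc += a[i] * b[i];
    }
    return acc;
}
```

Achieving good QoR in practice requires choosing partition factors, pipeline initiation intervals, and code structure together, which is exactly the expert-driven exploration our techniques aim to automate.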
Modern CAD tools are under immense pressure to cope with heterogeneous system-on-chip devices that continue to scale in capacity and complexity in accordance with Moore's law — a long-standing challenge widely known as "bridging the design productivity gap". Concurrently, compute resources are becoming abundant and inexpensive, especially with the emergence of cloud datacenters. My group is exploring a fresh approach to several long-standing challenges in design automation by mobilizing cloud computing resources to enable vastly distributed stochastic optimization. In this scale-out scheme, a CAD optimization problem is solved in a massively parallel manner, which promises significantly improved QoR with a runtime similar to that of the traditional approach. We have demonstrated the efficacy of our approach by applying it to (1) distributed autotuning of the FPGA compilation flow from RTL to bitstream [C34], (2) improving the quality of FPGA-targeted logic synthesis [C37], and (3) approximate logic synthesis under various error constraints [C40].
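The scale-out idea can be sketched as a best-of-many stochastic search. In the sketch below the workers run sequentially and evaluate a toy cost function; in the actual scheme each worker would be a cloud instance running a full compilation flow in parallel, so wall-clock time stays close to that of a single run while the best-of-many result improves QoR. All names and the cost function are illustrative:

```c
#include <stdlib.h>
#include <limits.h>

/* Toy stand-in for a CAD cost metric (e.g., critical-path delay of one
 * candidate tool configuration).  In the real flow, each evaluation is
 * a full RTL-to-bitstream compilation. */
static int cost(int x) { return (x - 37) * (x - 37); }

/* Scale-out stochastic search: each "worker" explores the configuration
 * space independently from its own random seed, and the best candidate
 * found by any worker is kept. */
static int scale_out_search(int n_workers, int iters_per_worker) {
    int best_x = 0, best_cost = INT_MAX;
    for (int w = 0; w < n_workers; w++) {        /* parallel in practice */
        unsigned seed = (unsigned)(w + 1);
        for (int i = 0; i < iters_per_worker; i++) {
            int x = (int)(rand_r(&seed) % 100);  /* random candidate */
            if (cost(x) < best_cost) {
                best_cost = cost(x);
                best_x = x;
            }
        }
    }
    return best_x;
}
```

Because the workers are independent, adding more of them can only improve (never worsen) the best result found, which is what makes the approach attractive when cloud capacity is cheap. (`rand_r` is POSIX; any per-worker PRNG would do.)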
A major research challenge for integrating heterogeneous accelerators into mainstream computing platforms involves creating clean hardware-software abstractions that are highly programmable, yet still enable efficient execution via specialized microarchitectures. We have been tackling this challenge through close collaboration with Prof. Christopher Batten and his students. One example is the XLOOPS co-processor architecture, which employs a novel hardware specialization approach that elegantly encodes common inter-iteration loop dependence patterns in the instruction set [C24]. We are currently extending XLOOPS to create a new class of easy-to-program reconfigurable overlay architectures.
In addition, we have been working closely with Prof. Batten's team on polymorphic hardware specialization, where we are investigating a new problem of synthesizing template-based software libraries formed by polymorphic algorithms and data structures. Along this line, we have developed a novel HLS methodology in which complex data structures are decoupled from the algorithm using a latency-insensitive interface, enabling overlapped execution of data structure methods and the algorithm code [C31].
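The decoupling idea can be sketched in software as a pair of request/response queues sitting between the algorithm and the data structure; in hardware, the two sides become independently pipelined units that overlap their execution. This is a simplified illustration under assumed names, not the actual interface from [C31]:

```c
#define QCAP 16

/* Tiny FIFO used as the latency-insensitive channel between the
 * algorithm unit and the data-structure unit (names illustrative). */
typedef struct { int buf[QCAP]; int head, tail; } fifo_t;
static void fpush(fifo_t *q, int v)      { q->buf[q->tail++ % QCAP] = v; }
static int  fpop(fifo_t *q)              { return q->buf[q->head++ % QCAP]; }
static int  fcount(const fifo_t *q)      { return q->tail - q->head; }

/* The algorithm issues "method calls" (here: table lookups) into a
 * request queue without stalling on each result; the data-structure
 * side drains requests at its own rate and pushes responses back.
 * Since neither side blocks the other per call, the two pipelines can
 * run concurrently in hardware. */
static int sum_via_channel(const int *table, const int *keys, int n) {
    fifo_t req = {0}, resp = {0};
    int issued = 0, done = 0, sum = 0;
    while (done < n) {
        if (issued < n && fcount(&req) < QCAP)   /* algorithm side      */
            fpush(&req, keys[issued++]);
        if (fcount(&req) > 0)                    /* data-structure side */
            fpush(&resp, table[fpop(&req)]);
        while (fcount(&resp) > 0) {              /* consume results     */
            sum += fpop(&resp);
            done++;
        }
    }
    return sum;
}
```

The key property is that correctness depends only on the order of messages in each queue, not on when they arrive, which is what lets the synthesized data-structure unit take a variable number of cycles per method call.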
Research conducted by my group is currently sponsored by Defense Advanced Research Projects Agency (DARPA), National Science Foundation (NSF), Semiconductor Research Corporation (SRC), Intel Corporation, and Xilinx, Inc. Their support is greatly appreciated.