Lossless data compression

To reduce storage requirements and speed up the gathering, transfer, and processing of the large amounts of data that microprocessor researchers and computational scientists have to deal with, 1) we invented a new type of lossless data compression algorithm for program traces that surpasses other algorithms in both compression ratio and compression speed, 2) we designed hardware to unobtrusively collect and compress execution traces in real time, 3) we created a flexible web-based tool that automatically compiles user-provided trace format specifications into some of the best-performing trace compressors, and 4) we developed a real-time compression algorithm for scientific floating-point data that, at comparable compression ratios, is one to two orders of magnitude faster than other approaches. Researchers at CMU, Intel, MIT, Princeton, and the Universities of Alabama, Arizona, and Colorado use these trace-compression tools and algorithms extensively. The European Centre for Medium-Range Weather Forecasts compresses some of its satellite data with our floating-point compression algorithm. The tracing hardware is collaborative work with Dr. Milena Milenkovic from IBM and Prof. Aleksandar Milenkovic from the University of Alabama in Huntsville.
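
As a rough illustration of why scientific floating-point data can be compressed quickly, the sketch below uses a generic predict-and-XOR scheme: each value is predicted, the prediction is XORed with the actual bit pattern, and only the non-zero bytes of the result are stored. This is not the algorithm described above. The previous-value predictor, the one-byte length header, and the names `compress` and `decompress` are all assumptions made for this example.

```python
import struct


def compress(values):
    """Encode doubles as (significant-byte count, residual bytes) records.

    Assumption for this sketch: the prediction for each value is simply
    the bit pattern of the previous value.
    """
    out = bytearray()
    prev = 0
    for v in values:
        bits = struct.unpack("<Q", struct.pack("<d", v))[0]
        residual = bits ^ prev          # zero when the prediction is exact
        prev = bits
        nbytes = (residual.bit_length() + 7) // 8   # 0..8 significant bytes
        out.append(nbytes)              # one-byte header (hypothetical layout)
        out += residual.to_bytes(nbytes, "little")
    return bytes(out)


def decompress(blob):
    """Invert compress() by replaying the same previous-value prediction."""
    values = []
    prev = 0
    i = 0
    while i < len(blob):
        nbytes = blob[i]
        residual = int.from_bytes(blob[i + 1:i + 1 + nbytes], "little")
        i += 1 + nbytes
        bits = residual ^ prev
        prev = bits
        values.append(struct.unpack("<d", struct.pack("<Q", bits))[0])
    return values


if __name__ == "__main__":
    data = [1.0, 1.0, 1.0, 2.5, 2.5]
    packed = compress(data)
    print(len(packed), "bytes instead of", 8 * len(data))
    print(decompress(packed) == data)
```

The scheme needs only an XOR and a byte count per value, which is what makes this family of approaches fast. How much it saves depends entirely on how well the predictor matches the data.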


Prefetching and caching

To decrease data access latency and make computer hardware simpler and faster, we have devised novel multi-core prefetching and helper-threading schemes and evaluated caching strategies for 3D architectures: 1) we designed a simple yet very effective hardware approach that creates a prefetch thread to accelerate programs, 2) we proposed a lightweight architectural framework for chip multiprocessors to facilitate the writing of helper threads that can emulate complex hardware in software at competitive performance, and 3) we showed that 3D processors perform best when they combine moderately sized caches with on-chip main memory in the remaining chip area, especially when paired with a good prefetcher. The 3D aspect of this project is collaborative work with Prof. Sandip Tiwari from Cornell University and his Ph.D. student Christianto C. Liu.


Parallel computing

Because data are dispersed in parallel systems, writing efficient parallel programs is a challenge, even for experts. To simplify this task, we have developed four innovative approaches to speed up data delivery at the library level, namely 1) an effective way to prefetch pages in software distributed shared memory systems, 2) a transparent, software-based, real-time message compression algorithm for MPI libraries, 3) an approach to reliably prefetch messages in a message-passing system, and 4) a page-protection-based mechanism to safely release blocking receives early. These techniques boost performance without the need to recompile or modify applications. Moreover, they simplify the writing and debugging of parallel programs because they allow the use of simple parallel constructs while meeting or exceeding the performance of complex code, which increases programmer productivity. This project is a collaboration with Dr. Evan Speight from IBM.


Load-value prediction

Because data accesses are slow, latency-hiding speculation hardware is ubiquitous in high-end processors. To improve the speculation accuracy and reduce the complexity and power consumption of such hardware, 1) we created new hybrid load-value predictors that are superior to other designs, 2) we invented a technique to reduce the size of the best hybrid predictors by more than a factor of two without sacrificing performance, 3) we devised several approaches to improve the energy and complexity efficiency of load-value predictors, 4) we demonstrated that more accurate but slower predictors are often inferior to simpler predictors once prediction latency is taken into account, and 5) we originated a new class of self-optimizing hardware that autonomously carries out the operations of a genetic algorithm and developed confidence estimators that use this technique to continuously adapt to the workload. Dr. Benjamin G. Zorn from Microsoft Research collaborated with us on the hybrid predictors.
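
To make the idea of a hybrid load-value predictor concrete, the following sketch models a textbook combination of a last-value component and a stride component, with saturating confidence counters selecting between them. It is a software toy for a single load instruction, not one of the predictor designs described above. The class names, the counter width (`max_conf`), and the tie-breaking rule are assumptions chosen for this example.

```python
class LastValuePredictor:
    """Predicts that a load returns the same value as last time."""

    def __init__(self):
        self.last = 0

    def predict(self):
        return self.last

    def update(self, actual):
        self.last = actual


class StridePredictor:
    """Predicts the last value plus the most recently observed difference."""

    def __init__(self):
        self.last = 0
        self.stride = 0

    def predict(self):
        return self.last + self.stride

    def update(self, actual):
        self.stride = actual - self.last
        self.last = actual


class HybridPredictor:
    """Chooses the component with the highest confidence counter.

    Assumptions: counters saturate at max_conf and ties go to the
    first component (last-value).
    """

    def __init__(self, max_conf=3):
        self.components = [LastValuePredictor(), StridePredictor()]
        self.conf = [0, 0]
        self.max_conf = max_conf

    def predict(self):
        best = max(range(len(self.components)), key=lambda i: self.conf[i])
        return self.components[best].predict()

    def update(self, actual):
        for i, comp in enumerate(self.components):
            if comp.predict() == actual:
                self.conf[i] = min(self.conf[i] + 1, self.max_conf)
            else:
                self.conf[i] = max(self.conf[i] - 1, 0)
            comp.update(actual)


def accuracy(predictor, values):
    """Fraction of values predicted correctly before each update."""
    hits = 0
    for v in values:
        hits += predictor.predict() == v
        predictor.update(v)
    return hits / len(values)


if __name__ == "__main__":
    strided = list(range(0, 80, 8))   # e.g. addresses walking an array
    constant = [5] * 10               # e.g. a loop-invariant load
    for name, seq in (("strided", strided), ("constant", constant)):
        print(name, accuracy(HybridPredictor(), seq))
```

In this toy, the hybrid tracks both the strided and the constant sequence after a short warm-up, whereas the last-value component alone only handles the constant one.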


Performance evaluation and optimization

To facilitate and support the other projects, we have designed our own compiler optimizations, simulators, and metrics, including 1) a method for compilers to statically determine which load instructions are likely to miss in the cache, which of those cache-missing loads are likely to be value predictable, and with what kind of predictor, 2) source code transformations that substantially speed up important bioinformatics programs, 3) a simple approach to synthesize processor simulators that are an order of magnitude faster than generic simulators, and 4) a straightforward metric to evaluate and compare the energy efficiency of processor components. The compiler project is a collaboration with Prof. Amer Diwan from the University of Colorado at Boulder and his Ph.D. student Matthias Hauswirth.


Computational brain-injury modeling

Traumatic brain injuries are the leading cause of death among children and affect two million people annually in the US alone. To help bring this number down, we designed a computational model that 1) advances the state of the art in closed-head injury simulation, 2) demonstrates that nonlinear fluid viscoelastic models are necessary to recreate the key features of diffuse axonal injuries, 3) illustrates that the skull’s geometry is able to amplify the material velocity inside the brain, and 4) shows that head translations and not just rotations can induce dangerous rotational flows that may lead to closed-head injuries. The goal of this bioengineering work is to enable industry to design better helmets, car interiors, and other safety measures to alleviate human suffering and to lower healthcare costs. This project is a collaboration with Prof. Igor Szczyrba from the University of Northern Colorado.