exp-bbv -- a tool to generate Simpoint BBVs (basic block vectors) by Vince Weaver vince _at_ csl.cornell.edu Background ~~~~~~~~~~ The Simpoint algorithm (see http://www.cse.ucsd.edu/~calder/simpoint/ for more information) is a method of speeding up architectural simulations by exploiting the fact that programs have phase behavior, enabling you to extrapolate total performance while only simulating small portions of the program. The general idea depends on the property that most program show phase-based behavior. This means that at various times during program execution you will encounter intervals of time where the code behaves very similarly to an interval that happened already in the past. If you can detect these intervals and group them together, you can approximate the total behavior of the program by only simulating the bare minimum number of intervals, and then scaling these results up to predict what the actual results would have been if you simulated the entire program. This is useful in computer architecture research, as running a benchmark on a cycle-accurate simulator can cause a slowdown of 1000x, making it take days, weeks, or even longer to run until completion. If you utilize Simpoints, you can simulate a small fraction of the program but still get a good estimate of what the behavior would be for the entire program. See Sherwood et al [1] for more info. In order to calculate the phases, you need to collect so-called Basic Block Vectors (BBVs) from the benchmark of interest. These are a list of all Basic Blocks entered [a basic block is a section of code with only one entry point and one exit point], with a count of how many times each block was run. This count is also multiplied by the number of instructions that are in the basic block. The multiplication is meant to weigh the count so that instructions in small Basic Blocks aren't counted as more important than instructions in large Basic Blocks. The Basic Block Vector count is dumped at certain fixed intervals. A common way to do things is to do this every 100 million committed instructions. You can use smaller or larger intervals. Having fewer (bigger) intervals makes analysis go faster, and it also avoids problems with warm-up when simulating (since when using simpoints you tend to fast-forward or otherwise skip to the interval of interest, you are starting from scratch metric wise. If you are looking at things like caches or branch predictors, you are going to get wrong results at first because your structures are empty, rather than full of previous state they would have if you ran the whole way up until now. Eventually the structures "warm up" and start producing the proper results. Smaller intervals mean more of your simulation results will be skewed by the wrong values that are returned before the structures are ready). The actual on-disk format of a BBV files looks something like this: T:45:1024 :189:99343 T:11:78573 :15:1353 :56:1 T:18:45 :12:135353 :56:78 314:4324263 and so on. Each new interval starts with a T. Then there is a colon followed by a value. This is a unique number identifying a basic block. This is followed by another colon, and then followed by a value that is the number of times the block was entered multiplied by the number of instructions in the block. The Simpoint utility takes a BBV file as described above as an input. You run simpoint like this: ./SimPoint.3.2/bin/simpoint -inputVectorsGzipped \ -loadFVFile results.valgrind.bb.gz \ -k 5 -saveSimpoints results.simpts \ -saveSimpointWeights results.weights The Simpoint program does random linear projection using 15-dimensions, then does k-mean clustering to calculate which intervals are the one of interest. In this case we specified we want 5 with the -k 5 option. So as output we will have the "results.simpts" file which will simply list the 5 intervals of interest (which when we multiply by 100 million [or whatever our interval size we chose] will give us the offset in number of instructions from the beginning of the program to start simulating). If we then simulate 100 million instructions from the start of each of those intervals, gather our statistics, then weight them according to the weights printed in the "results.weights" file, we then should have an approximation of how the entire program would have behaved, in a small fraction of the time. Valgrind Implementation ~~~~~~~~~~~~~~~~~~~~~~~ This valgrind plugin generates the BBV output as described above. Luckily valgrind makes it easy to instrument code. Ideally we would instrument at the basic block level, but this is complicated by the fact that valgrind uses super-blocks, which only have one entry point but can have multiple exit points. For this implementation we actually instrument at the super-block level. Having super-blocks can be seen as a benefit; it might give a better indication of phase to the clustering algorithms than plain basic blocks would. Super-block generated BBVs give similar results when compared to real basic-block generated BBVs made by the PIN tool. It is possible to use actual basic blocks by using the "--vex-guest-chase-thresh=0" option to valgrind. Once started, valgrind begins translating the program being run in an as-needed fashion. The binary code is translated into an intermediate language, this is then instrumented, and then the resulting code is compiled into a super-block and cached. This final code is the code that is executed. At initial translation time we instrument every instruction, and have it call our BBV routine. This is a lot of overhead, but it is simpler to implement than instrumenting all of the basic blocks inside of a super-block. On x86 we also note if the instruction is a "rep" or "fldcw" instruction for special-case handling. (See "The REP Prefix" in the next section for why). While executing, our routine is called before where each individual instruction was in the original code. We look up our current superblock in an Ordered Set to find a structure that holds block-specific statistics (we use the entry point address as the index into the hash table). We then incrememnt by one the instruction count for this superblock (assuming it's not a rep instruction). We also update the master instruction counter by one at the same time. If this overflows the interval size (by default 100 million, but this is configurable with a command line option) then we run the BBV generation code. This routine prints the current BBV line to the output file, and resets all the superblock counters to zero. Pitfalls ~~~~~~~~ The REP prefix ~~~~~~~~~~~~~~ This plugin was validated against actual x86/x86_64 machines using performance counters. With valgrind we were getting a much larger retired instruction count than the perf counters or the PIN implementation tool were reporting. It turns out that the "rep" prefix (in conjuction with a string instruction) may repeat an instruction many times, yet the processor only counts this as one committed instruction. This is documented in various Intel manuals. Without special handling, valgrind counts each repetition. Our plugin takes this into account; if we are in a rep instruction we only update the various instruction counts every 4096 repeats, not every repeat. The FLDCW instruction ~~~~~~~~~~~~~~~~~~~~~ The "fldcw" instruction on x86/x86_64 is counted as one committed instruction on most architectures, but on the Pentium 4 "NetBurst" architecture it counts as two. Valgrind only counts it as one, which is fine. We record the number of fldcw instructions so that we can make comparisons with Pentium 4 machines. In the future we might enable counting per-interval these instructions. The art benchmark ~~~~~~~~~~~~~~~~~ The spec2k benchmark suite includes "art" which is a C floating point benchmark. When run under valgrind, the benchmark finished with half of the number of instructions committed than when run on actual hardware. It turns out that valgrind uses 64-bit floating point math for portability reasons, but by default on x86-Linux programs use 80-bit floating point math. The art benchmark unwisely uses the "==" C operator to compare two floating point numbers, and due to rounding errors between the 80-bit and 64-bit versions the 64-bit version can finish early, while still generating the proper expected output from art. This makes it hard to compare simpoint results, because you will get different simpoints if you have only half the number of instructions. For comparison purposes, the following code was added to the beginning of the art benchmark to force it to use 64-bit math. (There are other methods for working around this, including using the Intel compiler which has options to force 64-bit math, or using the -msse2 option of gcc which uses 64-bit sse2 math. Neither of these would work for the Pentium III we were gathering the performance counter data on). #include fpu_control_t cw; _FPU_GETCW(cw); cw &= ~_FPU_EXTENDED; cw |= _FPU_DOUBLE; _FPU_SETCW(cw); The dealII benchmark ~~~~~~~~~~~~~~~~~ The SPEC 2006 benchmark "dealII" has a similar problem to the one described previously for art. In this case, due to the 64-bit floating point being used, a comparison against an error epsilon never becomes true, and the benchmark gets stuck in an infinite loop. You can fix things to work by making: const long double tolerance= 5e-16; In the function QGauss<1>::QGauss (const unsigned int n) in quadrature_lib.cc Validation ~~~~~~~~~~ The code has been tested on x86 and compared against real hardware. This is described in a paper [2]. The code has been compiled and run on PowerPC but has not been in any way validated. Installing with Valgrind 3.4.1 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ * Download valgrind from www.valgrind.org This tools assumes version 3.4.1 * Unpack valgrind * Enter into the valgrind-3.4.1 directory and unpack this plugin (tar -xzvf exp-bbv.tar.gz) * Go back to the main valgrind-3.4.1 directory. * Edit the file "configure". Search for the string none/docs/Makefile and immediately after it put: exp-bbv/Makefile * Now run "./configure" as you would if you were normally installing valgrind. * Run make, then make install of valgrind. * Now enter into the ./exp-bbv directory, and run make and make install. * The simpoint plugin should now be ready... Installing, using Valgrind SVN ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + get SVN version of valgrind svn co svn://svn.valgrind.org/valgrind/trunk valgrind + unpack the exp-bbv tree in the ./valgrind tree + edit the global version of Makefile.am and add exp-bbv to the EXP_TOOLS list + Add exp-bbv/Makefile exp-bbv/docs/Makefile exp-bbv/tests/Makefile exp-bbv/x86/Makefile to the end of the AC_OUTPUT list in configure.in + run ./autogen.sh + run configure as described in the "Installing with Valgrind 3.4.1" section Using ~~~~~ You run the tool something like this: valgrind --tool=exp-bbv --vex-guest-chase-thresh=0 \ --interval_size=100000000 --bbfile=out.bbv /bin/ls The --vex-guest-chase-thresh=0 option forces valgrind to use basic blocks (as opposed to super blocks). The "interval_size" parameter is optional, it defaults to 100 million. The "bbfile" option specifies the output file name. It defaults to out.bb /bin/ls is just a placeholder for any executable you'd like to run. Performance ~~~~~~~~~~~ Using valsim slows down execution by roughly a factor of 50. It depends on the machine being run on, and the benchmark. On the spec2000 benchmarks running on a 3.4GHz Pentium D processor, the slowdown ranges from 24x (mcf) to 340x (vortex.2). References ~~~~~~~~~~ [1] T Sherwood, E Perelman, G. Hamerly, B. Calder. Automatically Characterizing Large Scale Program Behavior. ASPLOS X, 2002. [2] V.M. Weaver, S.A. McKee, "Using Dynamic Binary Instrumentation to Generate Multi-Platform Simpoints: Methodology and Accuracy", 3rd EC International Conference on High Performance Embedded Architectures and Compilers (HiPEAC'08), Göteborg, Sweden, January 2008.