As described in top-down performance counter analysis part 1, top-down analysis is an approach that uses key performance counters to characterize an application and then successively drills down with further refinement. On Intel x86 processors, the first level of refinement characterizes the application by looking at overall usage of the pipeline. Top-down is implemented in the perf(1) tool, and in the previous post I also extended likwid-perfctr with a TOPDOWN performance group for first-level characterization.
In this post, I document the steps I took to add similar top-down support to wspy.

wspy supports performance counters in two modes:
- "--process-counter-model core" measures by core, periodically sampling (by default once per second) the counters on each core.
- "--process-counter-model process" measures by process and aggregates counters across a tree of processes.
Support in the "core" mode was mostly already there because wspy already had an option to specify the counter list. To make this a little easier, I added a "--config" option to name a configuration file, and then defined a configuration file that collects these counters as follows:
```
# configuration file for top-down analysis
command --perfcounter-model core
command --set-counters topdown-slots-retired,topdown-slots-issued,topdown-recovery-bubbles,topdown-fetch-bubbles,cpu-cycles,instructions
```
I then invoke my command as follows to run the Phoronix Test Suite c-ray test and keep track of the performance counters in a zip archive:
```sh
./wspy --perfcounters --zip c-ray2.zip --config topdown.config phoronix-test-suite batch-run c-ray
```
The remaining work is to turn these CSV files into plots using gnuplot. As I do this, one thing I notice is that the absolute counts of instructions and cycles are lower in this comparison even though the ratios are still the same. I still need to narrow this down to make sure I'm not missing events.
Here is a previous plot of IPC
Following is a brief excerpt of the counts showing ~80 million cycles per second.
```
20.00 ,1004913228,835863826,24386,23
21.00 ,992754668,814408353,18401,5
22.00 ,912770140,773296336,21873,3050
23.00 ,1003534847,835868034,21581,669
24.00 ,912624972,758693013,16680,24
```
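The IPC plot is just instructions retired divided by cycles for each sample interval. A minimal sketch of the calculation with made-up counts (the actual wspy CSV column order would need to be checked against the counter list):

```shell
# Hypothetical one-second sample (not taken from the run above)
cycles=1000000000
instructions=830000000

# IPC = instructions retired per cycle
ipc=$(awk -v c="$cycles" -v i="$instructions" 'BEGIN { printf "%.2f", i/c }')
echo "IPC = $ipc"
```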
Here is a new plot of IPC
Following is a brief excerpt of the counts showing ~40 million cycles per second.
```
20.00 ,600956604,427443300,15159161,1358150,696118208,713229417
21.00 ,609612979,417897496,15979474,1421009,624238082,620748846
22.00 ,604022803,417932347,14027413,1274063,621499226,619059478
23.00 ,601231976,417933484,14130253,1396528,624020412,618550534
24.00 ,724758859,501513701,17118096,1665515,717089577,692470819
```
This may be the scale factor of 2, though the instruction counts are similarly low and I used the same "cpu-cycles" event name in both cases. Hence, in the equations below, I also multiply by the "scale" factors where appropriate.
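For reference, the first-level top-down breakdown behind the plot script in this post can be sketched as below. The counts are made up, and the scale factor of 2 on slots and recovery bubbles mirrors what the plot formulas use here; it is not the general definition:

```shell
# Hypothetical counts for one sample interval (not from the real run)
cycles=1000000
fetch_bubbles=150000
recovery_bubbles=10000
slots_issued=1700000
slots_retired=1650000

result=$(awk -v c="$cycles" -v fb="$fetch_bubbles" -v rb="$recovery_bubbles" \
    -v si="$slots_issued" -v sr="$slots_retired" 'BEGIN {
  slots = 2 * c                                   # issue slots, with scale factor 2
  frontend = fb / slots                           # frontend bound
  retiring = sr / slots                           # retiring
  badspec  = (si - sr + 2 * rb) / slots           # bad speculation
  backend  = 1 - (frontend + retiring + badspec)  # backend bound is the remainder
  printf "%.3f %.3f %.3f %.3f\n", frontend, retiring, badspec, backend
}')
echo "frontend retiring badspec backend = $result"
```

The four fractions always sum to 1 by construction, since backend bound is defined as whatever the other three categories do not account for.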
As a sanity check, I tried using perf stat --topdown to run the c-ray test under the Phoronix Test Suite. Here are the metrics it gave (also note that they are per core rather than per virtual thread):
```
 Performance counter stats for 'system wide':

             retiring   bad speculation   frontend bound   backend bound
S0-C0    2      73.9%              0.8%            10.4%           14.9%
S0-C1    2      73.9%              0.8%            10.4%           14.8%
S0-C2    2      73.7%              0.8%            10.4%           15.1%
S0-C3    2      73.5%              0.8%            10.6%           15.0%

      84.129243091 seconds time elapsed
```
I then added the definition below as part of a shell script to plot my CSV file. This produces five different files, including one with all the metrics together. Here is the "topdown.png" file I got by plotting these metrics over time.
Taking this apart a little more, it looks like my fetch-bubbles calculation isn't quite correct. If I sum up the totals and then do the calculations, my retiring fraction is close to perf's (74.2%) and so is my bad speculation (0.8%). However, my frontend-bound fraction, computed from fetch bubbles, is 1.8% when it should be much higher. This in turn makes the subtracted backend-bound fraction too high.
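That 1.8% can be cross-checked from a single sample row of the CSV excerpt above, using the same column layout as the plot script (column 3 = cycles, column 4 = fetch bubbles, with the scale factor of 2 on slots):

```shell
# One sample row from the per-core CSV excerpt above
row="20.00 ,600956604,427443300,15159161,1358150,696118208,713229417"

# frontend bound = fetch_bubbles / (2 * cycles), as in the plot script
pct=$(echo "$row" | awk -F, '{ printf "%.1f", 100 * $4 / (2 * $3) }')
echo "frontend bound = ${pct}%"
```

A single interval gives the same ~1.8% as the whole-run totals, so the problem is systematic rather than a transient in the data.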
Here are the parts of the shell script that generate the topdown plots:
```sh
gnuplot <<PLOTCMD
set terminal png
set output 'frontend0.png'
set title 'Front End - CPU 0'
set datafile separator ","
plot 'perf0.csv' using 1:(\$4/(\$3*2)) title 'CPU 0' with linespoints
PLOTCMD

gnuplot <<PLOTCMD
set terminal png
set output 'retiring0.png'
set title 'Retiring - CPU 0'
set datafile separator ","
plot 'perf0.csv' using 1:(\$7/(\$3*2)) title 'CPU 0' with linespoints
PLOTCMD

gnuplot <<PLOTCMD
set terminal png
set output 'spec0.png'
set title 'Speculation - CPU 0'
set datafile separator ","
plot 'perf0.csv' using 1:((\$6 - \$7 + (\$5*2))/(\$3*2)) title 'CPU0' with linespoints
PLOTCMD

gnuplot <<PLOTCMD
set terminal png
set output 'backend0.png'
set title 'Back End - CPU 0'
set datafile separator ","
plot 'perf0.csv' using 1:(1 -((\$4 +\$6 +(\$5*2))/(\$3*2))) title 'CPU0' with linespoints
PLOTCMD

gnuplot <<PLOTCMD
set terminal png
set output 'topdown0.png'
set title 'CPU 0'
set datafile separator ","
plot 'perf0.csv' using 1:(\$4/(\$3*2)) title 'front end' with linespoints, \
     'perf0.csv' using 1:(\$7/(\$3*2)) title 'retiring' with linespoints, \
     'perf0.csv' using 1:((\$6 - \$7 + (\$5*2))/(\$3*2)) title 'speculation' with linespoints, \
     'perf0.csv' using 1:(1 -((\$4 +\$6 +(\$5*2))/(\$3*2))) title 'back end' with linespoints
PLOTCMD
```
My next step here is to investigate further the counter I am using for topdown-fetch-bubbles, since I also note that the likwid report done in part 1 was inconsistent with perf.
NOTE: After investigating further, I found and fixed three issues:
- I was accidentally measuring all my global counters on core 0 (oops). This meant they were being multiplexed by the kernel. It also means that in a hyper-threaded situation, the frontend cycles are not evenly split across the two hardware threads, and that I need to clean up some of my existing workload charts.
- My multiplexing correction was also not calculated quite correctly. I will fix this.
- While I was at it, I fixed up the scale as well.
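For the multiplexing fix, the usual correction with perf_event is to scale each raw count by the ratio of the time the event was enabled to the time it was actually running on the PMU (the TIME_ENABLED and TIME_RUNNING fields of the read format). A minimal sketch with made-up values:

```shell
# Hypothetical values read back from a perf_event fd (times in nanoseconds)
raw_count=500000
time_enabled=1000000
time_running=250000   # the event only held a PMU slot a quarter of the time

# Estimated count = raw * enabled / running
scaled=$(awk -v r="$raw_count" -v e="$time_enabled" -v t="$time_running" \
  'BEGIN { printf "%d", r * e / t }')
echo "scaled count = $scaled"
```

This is an estimate that assumes the event rate was roughly constant across the interval, which is why heavily multiplexed counters get noisy.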
After fixing these things and improving my plot functions further, I am now able to make plots such as the following of the topdown counters for a workload. Notice how the frontend cycles get attributed to core 0 instead of core 4, its sibling thread on the same physical core (and hence why perf also groups them together).