Top down performance counter analysis (part 2) – wspy – Performance analysis, tools and experiments

As described in top down performance counter analysis part 1, top down analysis is an approach that uses key performance counters to characterize an application and then successively drills down with further refinement. On Intel x86 processors, the first level of refinement characterizes an application by its overall usage of the pipeline. Topdown is implemented in the perf(1) tool, and in the previous post I was also able to extend likwid-perfctr with a TOPDOWN performance group for first-level characterization.

In this post, I document the steps I took to add top down support to wspy.

wspy supports performance counters in two modes:

“--process-counter-model core” measures by core and periodically samples the counters on each core (by default once per second).
“--process-counter-model process” measures by process and aggregates the counters over a tree of processes.
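
To make the per-process mode more concrete, here is a minimal Python sketch of what aggregating counters over a process tree can look like. It is not wspy's implementation: the `parent_of` and `counts` dictionaries, the function name and the pids are all hypothetical and only illustrate the idea of rolling each child's counts up into its ancestors.

```python
from collections import defaultdict


def aggregate_tree(parent_of, counts):
    """Return, for every pid, its own counters plus those of all descendants.

    parent_of maps pid -> parent pid; counts maps pid -> {counter name: value}.
    Both structures are hypothetical and not taken from wspy.
    """
    children = defaultdict(list)
    for pid, ppid in parent_of.items():
        children[ppid].append(pid)

    totals = {}

    def visit(pid):
        # Start from the process's own counts, then add each child's subtree.
        total = dict(counts.get(pid, {}))
        for child in children.get(pid, []):
            for name, value in visit(child).items():
                total[name] = total.get(name, 0) + value
        totals[pid] = total
        return total

    # A root is any process whose parent was not traced.
    for pid, ppid in parent_of.items():
        if ppid not in parent_of:
            visit(pid)
    return totals


# Hypothetical example: a launcher (pid 100) that forks two workers.
parent_of = {100: 1, 101: 100, 102: 100}
counts = {
    100: {"instructions": 10, "cpu-cycles": 20},
    101: {"instructions": 300, "cpu-cycles": 250},
    102: {"instructions": 400, "cpu-cycles": 350},
}
totals = aggregate_tree(parent_of, counts)
print(totals[100])
```

The root entry then carries the counts for the whole workload, while each child keeps its own subtotal.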

Support in the “core” mode was pretty much there because wspy already had an option to specify the counter list. To make this a little easier, I added a “--config” option that names a configuration file, and then wrote the following configuration file to collect these counters:

# configuration file for top-down analysis
command --perfcounter-model core
command --set-counters topdown-slots-retired,topdown-slots-issued,topdown-recovery-bubbles,topdown-fetch-bubbles,cpu-cycles,instructions

I then invoke wspy as follows to run the c-ray test from the Phoronix test suite and keep the performance counters in the zip archive.

./wspy --perfcounters --zip c-ray2.zip --config topdown.config phoronix-test-suite batch-run c-ray

The remaining work is to turn these CSV files into plots using gnuplot. As I did this, I noticed that the absolute counts of instructions and cycles are lower in this run than in my earlier one, even though the ratios are still the same. I still need to narrow this down to make sure I’m not missing events.

Here is a previous plot of IPC

Following is a brief excerpt of the counts showing ~800 million cycles per second.

20.00     ,1004913228,835863826,24386,23
21.00     ,992754668,814408353,18401,5
22.00     ,912770140,773296336,21873,3050
23.00     ,1003534847,835868034,21581,669
24.00     ,912624972,758693013,16680,24

Here is a new plot of IPC

Following is a brief excerpt of the counts showing ~400 million cycles per second.

20.00     ,600956604,427443300,15159161,1358150,696118208,713229417
21.00     ,609612979,417897496,15979474,1421009,624238082,620748846
22.00     ,604022803,417932347,14027413,1274063,621499226,619059478
23.00     ,601231976,417933484,14130253,1396528,624020412,618550534
24.00     ,724758859,501513701,17118096,1665515,717089577,692470819

This may be a scale factor of 2, though the instruction counts are similarly low and I used the same “cpu-cycles” event name in both runs. Hence, in the equations below, I also multiply by a “scale” factor where appropriate.
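
As a quick check on the "factor of 2" guess, the following Python sketch divides the old counts by the new ones, row by row, for the second and third columns of the two excerpts. It assumes those two columns hold the same events in both files, which the post does not state outright, and the two excerpts come from different runs, so the rows only line up by timestamp.

```python
# Rows copied from the two excerpts: (time, second column, third column).
old_rows = [
    (20.0, 1004913228, 835863826),
    (21.0, 992754668, 814408353),
    (22.0, 912770140, 773296336),
    (23.0, 1003534847, 835868034),
    (24.0, 912624972, 758693013),
]
new_rows = [
    (20.0, 600956604, 427443300),
    (21.0, 609612979, 417897496),
    (22.0, 604022803, 417932347),
    (23.0, 601231976, 417933484),
    (24.0, 724758859, 501513701),
]


def column_ratios(old, new, col):
    """Ratio old/new of one column, pairing rows in order."""
    return [o[col] / n[col] for o, n in zip(old, new)]


for col in (1, 2):
    ratios = column_ratios(old_rows, new_rows, col)
    print(f"column {col + 1}: " + ", ".join(f"{r:.2f}" for r in ratios))
```

On these rows the ratios fall between about 1.3 and 2.0 and are not constant, so a single fixed scale factor does not fully explain the difference.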

As a sanity check, I used perf stat --topdown while running the c-ray test under the Phoronix test suite. Here are the metrics it gave (note that they are per core rather than per virtual thread):

 Performance counter stats for 'system wide':

                  retiring             bad speculation      frontend bound       backend bound        
S0-C0           2     73.9%                0.8%               10.4%               14.9%           
S0-C1           2     73.9%                0.8%               10.4%               14.8%           
S0-C2           2     73.7%                0.8%               10.4%               15.1%           
S0-C3           2     73.5%                0.8%               10.6%               15.0%           

      84.129243091 seconds time elapsed

I then added the definitions below to a shell script that plots my CSV file. It produces five different files, including one with all the metrics together. Here is the “topdown.png” file I got by plotting these metrics over time.

Taking this apart a little more, it looks like my calculation using the fetch-bubbles counter isn’t quite correct. If I sum up the totals and then do the calculations, my retiring value is close to perf (74.2%) and so is my bad speculation value (0.8%). However, my frontend bound value is 1.8%, where perf reports about 10.4%. Since backend bound is computed by subtraction, this also makes that value too high.
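
The arithmetic can be reproduced on the five excerpt rows shown earlier with a short Python sketch. The column meanings (instructions, cycles, fetch bubbles, recovery bubbles, slots issued, slots retired) are inferred from the formulas in my plotting script rather than from any wspy documentation, so treat them as an assumption; `SCALE` is the same empirical factor of 2 used in that script.

```python
# Rows from the second excerpt: time followed by six counter columns.
# Assumed order (inferred from the plotting formulas, not documented):
# instructions, cycles, fetch bubbles, recovery bubbles, slots issued, slots retired.
ROWS = [
    (20.0, 600956604, 427443300, 15159161, 1358150, 696118208, 713229417),
    (21.0, 609612979, 417897496, 15979474, 1421009, 624238082, 620748846),
    (22.0, 604022803, 417932347, 14027413, 1274063, 621499226, 619059478),
    (23.0, 601231976, 417933484, 14130253, 1396528, 624020412, 618550534),
    (24.0, 724758859, 501513701, 17118096, 1665515, 717089577, 692470819),
]
SCALE = 2  # empirical slots-per-cycle factor used in the plotting script


def topdown_level1(rows, scale=SCALE):
    """Sum the counters over all rows, then compute the four level-1 fractions."""
    cycles = sum(r[2] for r in rows)
    fetch_bubbles = sum(r[3] for r in rows)
    recovery_bubbles = sum(r[4] for r in rows)
    slots_issued = sum(r[5] for r in rows)
    slots_retired = sum(r[6] for r in rows)

    total_slots = cycles * scale
    frontend = fetch_bubbles / total_slots
    retiring = slots_retired / total_slots
    speculation = (slots_issued - slots_retired + recovery_bubbles * scale) / total_slots
    # Backend bound is whatever is left after the other three categories.
    backend = 1 - (frontend + retiring + speculation)
    return {
        "frontend": frontend,
        "retiring": retiring,
        "speculation": speculation,
        "backend": backend,
    }


metrics = topdown_level1(ROWS)
for name, value in metrics.items():
    print(f"{name}: {value:.1%}")
```

On these five rows this gives roughly 1.8% frontend bound, 74.8% retiring, 0.8% bad speculation and 22.7% backend bound, which shows the same pattern as the totals: retiring and speculation are close to perf, frontend is far too low, and backend absorbs the difference.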

Here are the parts of the shell script for the topdown file as well as other parts.

gnuplot <<PLOTCMD                                                               
set terminal png                                                                
set output 'frontend0.png'                                                      
set title 'Front End - CPU 0'                                                   
set datafile separator ","                                                      
plot 'perf0.csv' using 1:(\$4/(\$3*2)) title 'CPU 0' with linespoints           
PLOTCMD                                                                         
gnuplot <<PLOTCMD                                                            
set terminal png                                                                
set output 'retiring0.png'                                                      
set title 'Retiring - CPU 0'                                                    
set datafile separator ","                                                      
plot 'perf0.csv' using 1:(\$7/(\$3*2)) title 'CPU 0' with linespoints           
PLOTCMD                                                                         
gnuplot <<PLOTCMD                                                             
set terminal png                                                                
set output 'spec0.png'                                                          
set title 'Speculation - CPU 0'                                                 
set datafile separator ","                                                      
plot 'perf0.csv' using 1:((\$6 - \$7 + (\$5*2))/(\$3*2)) title 'CPU0' with linespoints
PLOTCMD                                                                         
gnuplot <<PLOTCMD                                                              
set terminal png                                                                
set output 'backend0.png'                                                       
set title 'Back End - CPU 0'                                                    
set datafile separator ","                                                      
plot 'perf0.csv' using 1:(1 -((\$4 +\$6 +(\$5*2))/(\$3*2))) title 'CPU0' with linespoints
PLOTCMD                                                                         
gnuplot <<PLOTCMD                                                              
set terminal png                                                                
set output 'topdown0.png'                                                       
set title 'CPU 0'                                                               
set datafile separator ","                                                      
plot 'perf0.csv' using 1:(\$4/(\$3*2)) title 'front end' with linespoints, \
     'perf0.csv' using 1:(\$7/(\$3*2)) title 'retiring' with linespoints, \
     'perf0.csv' using 1:((\$6 - \$7 + (\$5*2))/(\$3*2)) title 'speculation' with linespoints, \
     'perf0.csv' using 1:(1 -((\$4 +\$6 +(\$5*2))/(\$3*2))) title 'back end' with linespoints
PLOTCMD
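
The four single-metric blocks differ only in the output name, the title and the expression, so they could also be generated. The following is a sketch in Python rather than part of my actual script: the `METRICS` table and the function names are made up for illustration, the `perf<N>.csv` naming simply follows the script above, and running it for real assumes gnuplot is installed and on the PATH.

```python
import subprocess

# name -> (plot title, gnuplot expression); expressions copied from the script.
# No backslash before '$' is needed here because no shell heredoc is involved.
METRICS = {
    "frontend": ("Front End", "$4/($3*2)"),
    "retiring": ("Retiring", "$7/($3*2)"),
    "spec": ("Speculation", "($6 - $7 + ($5*2))/($3*2)"),
    "backend": ("Back End", "1 - (($4 + $6 + ($5*2))/($3*2))"),
}


def gnuplot_script(cpu, name):
    """Build the gnuplot commands that plot one metric for one CPU."""
    title, expr = METRICS[name]
    lines = [
        "set terminal png",
        f"set output '{name}{cpu}.png'",
        f"set title '{title} - CPU {cpu}'",
        'set datafile separator ","',
        f"plot 'perf{cpu}.csv' using 1:({expr}) title 'CPU {cpu}' with linespoints",
    ]
    return "\n".join(lines) + "\n"


def render(script):
    """Feed a script to gnuplot on stdin (requires gnuplot to be installed)."""
    subprocess.run(["gnuplot"], input=script, text=True, check=True)


script = gnuplot_script(0, "frontend")
print(script)
```

Calling `render(gnuplot_script(cpu, name))` for each CPU and each entry in `METRICS` would then produce one PNG per combination, in the directory that holds the CSV files.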

My next step is to investigate further the counter I am using for topdown-fetch-bubbles, since I also note that the likwid report from part 1 was inconsistent with perf as well.

NOTE: After investigating further, I found and fixed three issues:

I was accidentally measuring all my global counters on core 0 (oops). This meant they were being multiplexed by the kernel. It also means that with hyper-threading, the frontend cycles are not evenly split across the two logical cores that share a physical core, and that I need to clean up some of my existing workload charts.
My scaling for multiplexed counters was also not calculated quite correctly.
While I was at it, I fixed up the scale factor as well.
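
For reference, the usual way to correct a multiplexed count is to scale the raw value by the ratio of the time the event was enabled to the time it was actually running on the hardware, which is what perf does when it reports scaled counts. The Python sketch below shows only that arithmetic. The function and parameter names are made up for illustration, and this is not the code from wspy.

```python
def scale_multiplexed(raw_count, time_enabled, time_running):
    """Estimate the full count of an event that was only scheduled part of the time.

    time_enabled and time_running are the two durations the kernel can report
    alongside a counter value when they are requested at event setup.
    """
    if time_running == 0:
        # The event never got onto the hardware, so there is nothing to scale.
        return None
    return raw_count * time_enabled / time_running


# An event that ran for half of the interval is scaled up by a factor of 2.
print(scale_multiplexed(500, 1000, 500))
```

When an event is never multiplexed, the two times are equal and the raw count is returned unchanged.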

After fixing these things and improving my plot functions further, I am now able to make plots such as the following of the topdown counters for a workload. Notice how the frontend cycles get attributed to core 0 rather than to core 4, which shares the same physical core (this is also why perf groups them together).
