Investigating performance counters related to memory
As part of my investigation to create a page for STREAM, I have tried to reconcile things with underlying performance counters. This page documents some of that work.
To start with, the overall numbers coming from the “data_reads” and “data_writes” counters are consistent with the output of the STREAM benchmark itself. Here is what I see:
mev@popayan$ perf stat -e data_reads -e data_writes ./stream-bin
-------------------------------------------------------------
STREAM version $Revision: 5.10 $
-------------------------------------------------------------
This system uses 8 bytes per array element.
-------------------------------------------------------------
Array size = 100000000 (elements), Offset = 0 (elements)
Memory per array = 762.9 MiB (= 0.7 GiB).
Total memory required = 2288.8 MiB (= 2.2 GiB).
Each kernel will be executed 100 times.
The *best* time for each kernel (excluding the first iteration)
will be used to compute the reported bandwidth.
-------------------------------------------------------------
Number of Threads requested = 8
Number of Threads counted = 8
-------------------------------------------------------------
Your clock granularity/precision appears to be 1 microseconds.
Each test below will take on the order of 84283 microseconds.
(= 84283 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function    Best Rate MB/s  Avg time   Min time   Max time
Copy:           19468.0     0.082407   0.082186   0.085826
Scale:          14227.4     0.112806   0.112459   0.117188
Add:            16084.9     0.149577   0.149208   0.151376
Triad:          16076.8     0.149715   0.149284   0.152794
-------------------------------------------------------------
Solution Validates: avg error less than 1.000000e-13 on all three arrays
-------------------------------------------------------------
Performance counter stats for 'system wide':

        726,753.93 MiB  data_reads
        309,818.72 MiB  data_writes

      50.062565822 seconds time elapsed
Approximately 16,000 MB/s for 50 seconds is comparable to the ~1,036,573 MiB of total reads plus writes, once the extra traffic from the write-allocate behavior of a write-back cache is taken into account. Perf uses the “scale” file to create that output. The file says to multiply the raw counter by 6.103515625e-5 to get MiB, or in other words, if we multiply the MiB figures by 16384 we get back the original counts:

data_reads:  11,907,136,389    data_writes: 5,076,069,908
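That scale factor is itself informative: 6.103515625e-5 is exactly 64/2^20, so each raw count appears to be one 64-byte cache line. Perf picks the scale and unit up from sysfs; on my understanding these counters belong to the integrated memory controller (“uncore”) PMU, so the files would sit somewhere like the paths below (the uncore_imc name is my assumption and can differ by CPU and kernel):

$ cat /sys/bus/event_source/devices/uncore_imc/events/data_reads.scale
6.103515625e-5
$ cat /sys/bus/event_source/devices/uncore_imc/events/data_reads.unit
MiB
$ python3 -c 'print(64 / 2**20)'   # one 64-byte line per count
6.103515625e-05

The 64-byte-line reading also makes the totals easy to check against STREAM’s own accounting. STREAM counts 2 array sweeps each for Copy and Scale and 3 each for Add and Triad, i.e. 10 × 762.9 MiB per iteration, or about 762,900 MiB over the 100 iterations. If every store first pulls the destination line into the cache (write-allocate), the hardware instead sees 14 sweeps per iteration, about 1,068,060 MiB, which is in the same ballpark as the 1,036,573 MiB the counters report.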
That is also consistent with the values I get reading the same counters from wspy, so I think my implementation is correct. Now I want to see how consistent these numbers are with other counters. First, “cache-references” and “cache-misses” as generic events:
9,104,076,314 cache-references
5,755,534,411 cache-misses # 63.219 % of all cache refs
The number of cache-misses is off by a factor of ~3 compared to data reads plus writes: (11,907,136,389 + 5,076,069,908) / 5,755,534,411 ≈ 2.95. The same ratio shows up again with the more specific counters below.
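For reference, those two came from an ordinary counting run, along the lines of:

$ perf stat -e cache-references,cache-misses ./stream-bin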
Breaking this down into the LLC-* events, the numbers are still consistent with what I saw in wspy:
6,144,605,838 LLC-loads (50.00%)
3,814,249,161 LLC-load-misses # 62.07% of all LL-cache hits (50.02%)
2,951,620,357 LLC-stores (50.00%)
1,915,136,145 LLC-store-misses (49.98%)
and once again the misses are off by a factor of ~3 compared to reads and writes.
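A quick sanity check on that factor, using the 64-byte-line interpretation from the scale file above:

$ echo '(3814249161 + 1915136145) * 64 / 2^20' | bc   # LLC misses expressed as MiB
349693

349,693 MiB of missed lines against the 1,036,573 MiB seen at the memory controller is a ratio of about 2.96, so the factor of ~3 holds whether you compare raw counts or byte totals.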
The “node” events seem to be the same as the LLC misses:
3,845,763,854 node-loads (50.00%)
0 node-load-misses (50.01%)
1,917,682,944 node-stores (50.00%)
0 node-store-misses (49.99%)
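As an aside, the “(50.00%)”-style annotations in both listings are perf reporting multiplexing: with this many events competing for a few hardware counters, each event is only scheduled about half the time and the totals are scaled up to compensate. The runs were shaped roughly like the following (the exact event grouping is my guess):

$ perf stat -e LLC-loads,LLC-load-misses,LLC-stores,LLC-store-misses \
            -e node-loads,node-load-misses,node-stores,node-store-misses ./stream-bin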
Not quite sure what the “mem-loads” and “mem-stores” are telling me:
0 mem-loads
10,535,113,642 mem-stores
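My best guess, and it is only a guess, is that mem-loads is wired to the load-latency sampling facility and therefore reads zero in plain counting mode; these events are normally driven through sampling with perf mem instead:

$ perf mem record ./stream-bin
$ perf mem report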
Also not quite sure what to make of the “offcore” events:
7,678,025,687 offcore_requests.all_data_rd
Looking further at likwid-perfctr, I can’t find any metrics that directly measure memory bandwidth. Hence I’m concluding there is no direct linkage between L3 misses and data reads/writes; instead, the metrics seem to point more at the number of cycles spent in various wait states for memory operations.
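For anyone retracing this, likwid-perfctr can at least enumerate the event groups a particular machine supports; group names vary by CPU generation, and the L3 group below is just one example:

$ likwid-perfctr -a                          # list the performance groups this CPU supports
$ likwid-perfctr -C 0-7 -g L3 ./stream-bin   # run STREAM pinned to cores 0-7 with the L3 group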
