Investigating performance counters related to memory
As part of my investigation to create a page for STREAM, I have tried to reconcile things with underlying performance counters. This page documents some of that work.
To start with, the overall numbers coming from the “data_reads” and “data_writes” counters are consistent with the output of the STREAM benchmark itself. Here is what I see:
mev@popayan$ perf stat -e data_reads -e data_writes ./stream-bin
-------------------------------------------------------------
STREAM version $Revision: 5.10 $
-------------------------------------------------------------
This system uses 8 bytes per array element.
-------------------------------------------------------------
Array size = 100000000 (elements), Offset = 0 (elements)
Memory per array = 762.9 MiB (= 0.7 GiB).
Total memory required = 2288.8 MiB (= 2.2 GiB).
Each kernel will be executed 100 times.
The *best* time for each kernel (excluding the first iteration)
will be used to compute the reported bandwidth.
-------------------------------------------------------------
Number of Threads requested = 8
Number of Threads counted = 8
-------------------------------------------------------------
Your clock granularity/precision appears to be 1 microseconds.
Each test below will take on the order of 84283 microseconds.
(= 84283 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function    Best Rate MB/s  Avg time   Min time   Max time
Copy:           19468.0     0.082407   0.082186   0.085826
Scale:          14227.4     0.112806   0.112459   0.117188
Add:            16084.9     0.149577   0.149208   0.151376
Triad:          16076.8     0.149715   0.149284   0.152794
-------------------------------------------------------------
Solution Validates: avg error less than 1.000000e-13 on all three arrays
-------------------------------------------------------------
Performance counter stats for 'system wide':

        726,753.93 MiB  data_reads
        309,818.72 MiB  data_writes

      50.062565822 seconds time elapsed
Approximately 16,000 MB/s for 50 seconds is comparable to the ~1,036,573 MiB of total reads plus writes, once the extra traffic from the write-allocate behavior of a write-back cache is taken into account. Perf uses the “scale” file to create that output. The file says to multiply the raw counter by 6.103515625e-5 to get MiB, or in other words, if we multiply the MiB figures by 16384 we get back the original counts:

data_reads:  11,907,136,389    data_writes: 5,076,069,908
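That scale factor is itself informative: 6.103515625e-5 is exactly 64/2^20, so each raw count appears to be one 64-byte cache line. Perf picks the scale and unit up from sysfs; on my understanding these counters belong to the integrated memory controller (“uncore”) PMU, so the files would sit somewhere like the paths below (the uncore_imc name is my assumption and can differ by CPU and kernel):

$ cat /sys/bus/event_source/devices/uncore_imc/events/data_reads.scale
6.103515625e-5
$ cat /sys/bus/event_source/devices/uncore_imc/events/data_reads.unit
MiB
$ python3 -c 'print(64 / 2**20)'   # one 64-byte line per count
6.103515625e-05

The 64-byte-line reading also makes the totals easy to check against STREAM’s own accounting. STREAM counts 2 array sweeps each for Copy and Scale and 3 each for Add and Triad, i.e. 10 × 762.9 MiB per iteration, or about 762,900 MiB over the 100 iterations. If every store first pulls the destination line into the cache (write-allocate), the hardware instead sees 14 sweeps per iteration, about 1,068,060 MiB, which is in the same ballpark as the 1,036,573 MiB the counters report.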
That is also consistent with the values I get reading the same counters from wspy, so I think my implementation is correct. Now I want to see how consistent these numbers are with other counters. First, “cache-references” and “cache-misses” as generic events:
9,104,076,314 cache-references
5,755,534,411 cache-misses # 63.219 % of all cache refs
The number of cache-misses is off by a factor of ~3 compared to data reads plus writes: (11,907,136,389 + 5,076,069,908) / 5,755,534,411 ≈ 2.95. The same ratio shows up again with the more specific counters below.
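For reference, those two came from an ordinary counting run, along the lines of:

$ perf stat -e cache-references,cache-misses ./stream-bin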
Breaking this down into the LLC-* events, the numbers are still consistent with what I saw in wspy:
6,144,605,838 LLC-loads (50.00%)
3,814,249,161 LLC-load-misses # 62.07% of all LL-cache hits (50.02%)
2,951,620,357 LLC-stores (50.00%)
1,915,136,145 LLC-store-misses (49.98%)
and once again the misses are off by a factor of ~3 compared to reads and writes.
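A quick sanity check on that factor, using the 64-byte-line interpretation from the scale file above:

$ echo '(3814249161 + 1915136145) * 64 / 2^20' | bc   # LLC misses expressed as MiB
349693

349,693 MiB of missed lines against the 1,036,573 MiB seen at the memory controller is a ratio of about 2.96, so the factor of ~3 holds whether you compare raw counts or byte totals.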
The “node” events seem to be the same as the LLC misses:
3,845,763,854 node-loads (50.00%)
0 node-load-misses (50.01%)
1,917,682,944 node-stores (50.00%)
0 node-store-misses (49.99%)
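As an aside, the “(50.00%)”-style annotations in both listings are perf reporting multiplexing: with this many events competing for a few hardware counters, each event is only scheduled about half the time and the totals are scaled up to compensate. The runs were shaped roughly like the following (the exact event grouping is my guess):

$ perf stat -e LLC-loads,LLC-load-misses,LLC-stores,LLC-store-misses \
            -e node-loads,node-load-misses,node-stores,node-store-misses ./stream-bin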
Not quite sure what the “mem-loads” and “mem-stores” are telling me:
0 mem-loads
10,535,113,642 mem-stores
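My best guess, and it is only a guess, is that mem-loads is wired to the load-latency sampling facility and therefore reads zero in plain counting mode; these events are normally driven through sampling with perf mem instead:

$ perf mem record ./stream-bin
$ perf mem report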
Also not quite sure what to make of the “offcore” events:
7,678,025,687 offcore_requests.all_data_rd
Looking further at likwid-perfctr, I can’t find any metrics that directly measure memory bandwidth. Hence I’m concluding there is no direct linkage between L3 misses and data reads/writes; instead, the metrics seem to point more at the number of cycles spent in various wait states for memory operations.
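For anyone retracing this, likwid-perfctr can at least enumerate the event groups a particular machine supports; group names vary by CPU generation, and the L3 group below is just one example:

$ likwid-perfctr -a                          # list the performance groups this CPU supports
$ likwid-perfctr -C 0-7 -g L3 ./stream-bin   # run STREAM pinned to cores 0-7 with the L3 group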
