Investigating performance counters related to memory
As part of my investigation to create a page for STREAM, I have tried to reconcile its results with the underlying performance counters. This page documents some of that work.
To start with, the overall numbers coming from the “data_reads” and “data_writes” counters are consistent with the output of the STREAM benchmark itself. Here is what I see:
mev@popayan$ perf stat -e data_reads -e data_writes ./stream-bin
-------------------------------------------------------------
STREAM version $Revision: 5.10 $
-------------------------------------------------------------
This system uses 8 bytes per array element.
-------------------------------------------------------------
Array size = 100000000 (elements), Offset = 0 (elements)
Memory per array = 762.9 MiB (= 0.7 GiB).
Total memory required = 2288.8 MiB (= 2.2 GiB).
Each kernel will be executed 100 times.
 The *best* time for each kernel (excluding the first iteration)
 will be used to compute the reported bandwidth.
-------------------------------------------------------------
Number of Threads requested = 8
Number of Threads counted = 8
-------------------------------------------------------------
Your clock granularity/precision appears to be 1 microseconds.
Each test below will take on the order of 84283 microseconds.
   (= 84283 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function    Best Rate MB/s  Avg time     Min time     Max time
Copy:           19468.0     0.082407     0.082186     0.085826
Scale:          14227.4     0.112806     0.112459     0.117188
Add:            16084.9     0.149577     0.149208     0.151376
Triad:          16076.8     0.149715     0.149284     0.152794
-------------------------------------------------------------
Solution Validates: avg error less than 1.000000e-13 on all three arrays
-------------------------------------------------------------

 Performance counter stats for 'system wide':

        726,753.93 MiB  data_reads
        309,818.72 MiB  data_writes

      50.062565822 seconds time elapsed
Approximately 16,000 MB/s for 50 seconds is comparable to the roughly 1,036,573 MiB of total reads and writes once the effects of a write-back cache are taken into account. Perf uses the “scale” file to create that output: it says to multiply the raw counter by 6.103515625e-5 to get MiB, or, in other words, multiplying the MiB figures by 16384 recovers the original counts:

data_reads : 11,907,136,389
data_writes:  5,076,069,908
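As a sanity check, the scale arithmetic is easy to reproduce. Here is a minimal Python sketch; the sysfs path assumes these counters are the Intel client uncore_imc free-running events (an assumption on my part — check your own sysfs layout), and the raw counts are the ones above:

from pathlib import Path

# Assumed PMU location; on Intel client parts the free-running
# memory-controller counters show up under uncore_imc.
events = Path("/sys/bus/event_source/devices/uncore_imc/events")
scale = float((events / "data_reads.scale").read_text())   # 6.103515625e-05

raw = {"data_reads": 11_907_136_389, "data_writes": 5_076_069_908}
for name, count in raw.items():
    # count * scale reproduces perf's MiB figures (726,753.93 / 309,818.72)
    print(f"{name}: {count * scale:,.2f} MiB")

# 1/scale = 16384 counts per MiB, i.e. each count is 1 MiB / 16384 = 64 bytes.
print(f"bytes per count: {1024 * 1024 * scale:.0f}")

The factor of 16384 means each count corresponds to one 64-byte cache line transferred at the memory controller, which is what you would expect from this kind of counter.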
Those raw counts are also consistent with the values I get reading the same counters from wspy, so I think my implementation is correct. Now I want to see how consistent they are with other counters, starting with the generic “cache-references” and “cache-misses” events:
     9,104,076,314      cache-references
     5,755,534,411      cache-misses              #   63.219 % of all cache refs
The number of cache-misses is lower than the sum of data reads and writes by a factor of ~3 (about 5.76 billion misses versus 16.98 billion read/write events), and the same ratio shows up with the more specific counters below.
Breaking this down into the LLC-* events, the numbers are still consistent with what I saw in wspy:
     6,144,605,838      LLC-loads                                                  (50.00%)
     3,814,249,161      LLC-load-misses           #   62.07% of all LL-cache hits  (50.02%)
     2,951,620,357      LLC-stores                                                 (50.00%)
     1,915,136,145      LLC-store-misses                                           (49.98%)
and once again the misses are off by a factor of ~3 compared to data reads and writes; the quick check below puts numbers on this.
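Putting actual numbers on that factor of ~3, using the counts above:

reads_writes = 11_907_136_389 + 5_076_069_908       # data_reads + data_writes

generic_misses = 5_755_534_411                      # cache-misses
llc_misses = 3_814_249_161 + 1_915_136_145          # LLC-load-misses + LLC-store-misses

print(reads_writes / generic_misses)                # ~2.95
print(reads_writes / llc_misses)                    # ~2.96

If every 64-byte miss accounted for one 64-byte transfer at the memory controller, the ratio would be close to 1:1, so a factor of ~3 suggests most of the DRAM traffic never registers as a demand miss. One plausible explanation (which I have not confirmed) is hardware prefetching, which moves lines to and from DRAM without counting as a demand miss.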
The “node” events seem to be the same as the LLC misses (the zero miss counts presumably just mean this is a single-node machine, so every access is local):
     3,845,763,854      node-loads                                                 (50.00%)
                 0      node-load-misses                                           (50.01%)
     1,917,682,944      node-stores                                                (50.00%)
                 0      node-store-misses                                          (49.99%)
Not quite sure what the “mem-loads” and “mem-stores” events are telling me; the zero mem-loads count may just mean that event only produces data when sampled:
                 0      mem-loads
    10,535,113,642      mem-stores
Also not quite sure about the “offcore” events:
     7,678,025,687      offcore_requests.all_data_rd
Looking further at likwid-perfctr, I can’t find any metrics that directly measure memory bandwidth. Hence, I’m concluding there is no direct linkage between L3 misses and data reads/writes; instead, the metrics seem to point more at the number of cycles spent in various wait states for memory operations.
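Since the data_reads/data_writes counters do track DRAM traffic directly, a small wrapper around perf stat is enough to turn them into a bandwidth number. A minimal sketch, assuming the same event names as above (perf stat writes its counts to stderr, and -x, selects CSV output):

import subprocess

INTERVAL = 5  # seconds to measure

# Count the scaled (MiB) events system-wide for INTERVAL seconds.
res = subprocess.run(
    ["perf", "stat", "-a", "-x,", "-e", "data_reads,data_writes",
     "--", "sleep", str(INTERVAL)],
    capture_output=True, text=True, check=True,
)

mib = 0.0
for line in res.stderr.splitlines():          # perf stat reports on stderr
    fields = line.split(",")
    if len(fields) > 2 and fields[2] in ("data_reads", "data_writes"):
        if fields[0].startswith("<"):         # <not counted>/<not supported>
            continue
        mib += float(fields[0])               # already scaled to MiB

print(f"~{mib / INTERVAL:,.0f} MiB/s of DRAM traffic")

Run while STREAM is going, this should land in the same ballpark as the rates the benchmark itself reports.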