Investigating Intel performance counters for backend memory costing…
I’ve implemented the first level of top-down performance counter analysis and done an initial analysis of ~15 workloads from a recent Phoronix article. A logical next step is to expand the “backend bound” category: first separate core-bound vs. memory-bound, and then break memory down further into L1 vs. L2 vs. L3 vs. main memory vs. memory stores.
This post looks at some of the counters under consideration.
If I sort the workloads I’ve looked at so far by backend-bound fraction, the following come out particularly high:
stream - 0.955
rodinia CFD solver - 0.580
redis:get - 0.448
redis:set - 0.429
blender:barbershop - 0.262
Some at the other end that are particularly low include:
x264 - 0.001
openssl - 0.006
go-benchmark:json - 0.033
If I throw out redis because it runs for only fractions of a second, this gives me three fairly high and three fairly low benchmarks to work with. The goal is to (1) find counters that separate the backend-bound benchmarks into core vs. memory effects and then break out the memory hierarchy, and (2) sanity-check what, if anything, shows up for the benchmarks that are not backend-bound.
A somewhat obvious set of counters to start with is the “cycle activity” group; here is how they come up in perf(1):
cycle_activity.cycles_l1d_pending  [Cycles with pending L1 cache miss loads]
cycle_activity.cycles_l2_pending   [Cycles with pending L2 cache miss loads Spec update: HSD78]
cycle_activity.cycles_ldm_pending  [Cycles with pending memory loads]
cycle_activity.cycles_no_execute   [Total execution stalls]
cycle_activity.stalls_l1d_pending  [Execution stalls due to L1 data cache misses]
cycle_activity.stalls_l2_pending   [Execution stalls due to L2 cache misses]
cycle_activity.stalls_ldm_pending  [Execution stalls due to memory subsystem]
I assumed the “.cycles” counters give a total number of cycles and the “.stalls” counters give a number of stalls (not quite; see below). Also note that in a previous post I had a likwid-perfctr “CYCLE_ACTIVITY” report including these counters for three of the benchmarks, so I can make comparable runs for the other three as linked below.
I also notice that likwid-perfctr uses the “*stalls*” counters rather than the “*cycles*” counters, so I’m not sure which is correct. As a sanity test, I ran both the likwid-perfctr report and perf on my six targeted workloads.
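For reference, here is a minimal sketch of how such a run can be driven; this is not the exact wrapper I used, the event list is just the one shown above, and the “./stream” workload command is a placeholder:

import subprocess

# Sketch only: run perf stat system-wide around a workload and return the
# raw text perf prints (perf stat writes its counts to stderr).
EVENTS = [
    "cpu-cycles",
    "cycle_activity.cycles_no_execute",
    "cycle_activity.cycles_l1d_pending",
    "cycle_activity.cycles_l2_pending",
    "cycle_activity.cycles_ldm_pending",
    "cycle_activity.stalls_l1d_pending",
    "cycle_activity.stalls_l2_pending",
    "cycle_activity.stalls_ldm_pending",
]

def run_perf(workload_cmd):
    cmd = ["perf", "stat", "-a", "-e", ",".join(EVENTS), "--"] + workload_cmd
    result = subprocess.run(cmd, capture_output=True, text=True)
    return result.stderr

print(run_perf(["./stream"]))   # placeholder workload command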
Following is the perf output for stream:
Performance counter stats for 'system wide':

    6,211,965,807,896   cpu-cycles                          (62.50%)
    5,960,318,134,549   cycle_activity.cycles_no_execute    (62.50%)
    5,507,402,631,196   cycle_activity.cycles_l1d_pending   (62.50%)
    5,831,519,386,880   cycle_activity.cycles_l2_pending    (62.50%)
    6,093,233,122,660   cycle_activity.cycles_ldm_pending   (62.50%)
    5,305,422,833,023   cycle_activity.stalls_l1d_pending   (37.50%)
    5,601,073,092,552   cycle_activity.stalls_l2_pending    (50.00%)
    5,843,574,030,653   cycle_activity.stalls_ldm_pending   (50.00%)

    258.717437314 seconds time elapsed
In contrast, here is the output for x264:
Performance counter stats for 'system wide':

    1,004,373,726,072   cpu-cycles                          (62.49%)
      492,241,067,036   cycle_activity.cycles_no_execute    (62.49%)
      184,764,611,744   cycle_activity.cycles_l1d_pending   (62.49%)
      717,040,841,165   cycle_activity.cycles_l2_pending    (62.50%)
    1,075,694,568,119   cycle_activity.cycles_ldm_pending   (62.51%)
       91,223,511,372   cycle_activity.stalls_l1d_pending   (37.51%)
      349,530,008,257   cycle_activity.stalls_l2_pending    (50.00%)
      417,440,813,108   cycle_activity.stalls_ldm_pending   (50.00%)

    54.806516905 seconds time elapsed
After reading further on the internet, I believe the “cycles” counters are not of interest. As I understand it, they count the number of cycles where no instructions are dispatched *and* there is an outstanding memory request, but that doesn’t really indicate the stall was caused by the memory request. I further found this paper that empirically tested counters, and I now believe I need to set up the following hierarchy, where counter names are in { }:
active_cycles      = {cpu-cycles}
productive_cycles  = active_cycles - {cycle_activity.cycles_no_execute}
stall_cycles       = {cycle_activity.cycles_no_execute}
memory_bound       = {cycle_activity.stalls_l1d_pending}
bandwidth_bound    = {l1d_pend_miss.fb_full} + {offcore_requests_buffer.sq_full}
latency_bound      = memory_bound - bandwidth_bound
other_stall_reason = stall_cycles - memory_bound
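To make the arithmetic concrete, here is a small sketch of how I read that hierarchy, with every category expressed as a fraction of active cycles; this is my interpretation of the counters above, not code from the paper:

def backend_breakdown(c):
    # c is a dict of raw counter values keyed by perf event name.
    active    = c["cpu-cycles"]
    stalls    = c["cycle_activity.cycles_no_execute"]
    mem_bound = c["cycle_activity.stalls_l1d_pending"]
    bandwidth = c["l1d_pend_miss.fb_full"] + c["offcore_requests_buffer.sq_full"]
    return {
        "productive_cycles": (active - stalls) / active,
        "stalled_cycles":    stalls / active,
        "memory_bound":      mem_bound / active,
        "bandwidth":         bandwidth / active,
        "latency":           (mem_bound - bandwidth) / active,
        "other_stalls":      (stalls - mem_bound) / active,
    }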
So I think there are five counters total, or six if I also add {resource_stalls.sb} to cover store-related stalls as well as load stalls. Below I’ve collected these with perf. For stream this is:
Performance counter stats for 'system wide':

    6,219,060,933,176   cpu-cycles                          (83.33%)
    5,968,672,825,601   cycle_activity.cycles_no_execute    (83.33%)
    5,312,257,710,841   cycle_activity.stalls_l1d_pending   (83.33%)
      245,370,423,703   resource_stalls.sb                  (83.33%)
    1,491,679,451,897   l1d_pend_miss.fb_full               (83.33%)
    2,528,175,472,523   offcore_requests_buffer.sq_full     (66.67%)
By the equations above, I believe the percentages for STREAM become:
active_cycles     = 100.0%
productive_cycles =   4.0%
stalled_cycles    =  96.0%
memory_bound      =  85.4%
bandwidth         =  64.6%
latency           =  20.8%
other stalls      =  10.6%
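As a check, feeding the raw STREAM counts from the perf run above through the backend_breakdown() sketch reproduces those numbers:

stream_counts = {
    "cpu-cycles":                        6_219_060_933_176,
    "cycle_activity.cycles_no_execute":  5_968_672_825_601,
    "cycle_activity.stalls_l1d_pending": 5_312_257_710_841,
    "l1d_pend_miss.fb_full":             1_491_679_451_897,
    "offcore_requests_buffer.sq_full":   2_528_175_472_523,
}
for name, frac in backend_breakdown(stream_counts).items():
    print(f"{name:17s} {100.0 * frac:5.1f}%")
# prints roughly: productive_cycles 4.0%, stalled_cycles 96.0%,
# memory_bound 85.4%, bandwidth 64.6%, latency 20.8%, other_stalls 10.6%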
The measurements for x264 are as follows:
    1,006,171,527,484   cpu-cycles                          (83.33%)
      491,458,850,680   cycle_activity.cycles_no_execute    (83.33%)
       91,391,287,054   cycle_activity.stalls_l1d_pending   (83.34%)
       19,777,655,122   resource_stalls.sb                  (83.34%)
       13,236,384,103   l1d_pend_miss.fb_full               (83.34%)
        4,762,008,291   offcore_requests_buffer.sq_full     (66.66%)
active_cycles = 100.0%
productive_cycles = 90.9%
stalled_cycles = 9.1%
memory_bound = 1.9% (based on stores)
bandwidth = 1.9%
other stalls = 7.2%
So the statistics at least point in the right direction.
As next steps, I’ll look at implementing these counters with wspy and a standalone wrapper tool, and then run them across some workloads, particularly those that are heavily backend-bound. A next-level drill-down after that might be to look at factors influencing the memory-bound apps, e.g., cache miss rates and similar metrics.