Investigating Intel performance counters for backend memory costing…
I’ve implemented the first level of top-down performance counter analysis and done an initial analysis of ~15 workloads from a recent Phoronix article. A logical next step is to expand the “backend bound” category: first separate core-bound vs. memory-bound, and then break memory down further into L1 vs. L2 vs. L3 vs. main memory vs. memory stores.
This post looks at some of the counters under consideration.
If I sort the workloads I’ve looked at so far by backend-bound fraction, the following come out particularly high:
stream - 0.955
rodinia CFD solver - 0.580
redis:get - 0.448
redis:set - 0.429
blender:barbershop - 0.262
Some at the other end that are particularly low include:
x264 - 0.001
openssl - 0.006
go-benchmark:json - 0.033
If I throw out redis because it runs for only fractions of a second, this gives me three fairly high and three fairly low benchmarks to work with. The goal is to (1) find counters that separate the backend-bound benchmarks into core vs. memory effects and then break out the memory hierarchy, and (2) sanity-check what, if anything, shows up for the benchmarks that are not backend-bound.
A somewhat obvious set of counters to start with is the “cycle activity” group; here is how they come up in perf(1):
cycle_activity.cycles_l1d_pending  [Cycles with pending L1 cache miss loads]
cycle_activity.cycles_l2_pending   [Cycles with pending L2 cache miss loads Spec update: HSD78]
cycle_activity.cycles_ldm_pending  [Cycles with pending memory loads]
cycle_activity.cycles_no_execute   [Total execution stalls]
cycle_activity.stalls_l1d_pending  [Execution stalls due to L1 data cache misses]
cycle_activity.stalls_l2_pending   [Execution stalls due to L2 cache misses]
cycle_activity.stalls_ldm_pending  [Execution stalls due to memory subsystem]
I assumed the “.cycles” counters give a total number of cycles and the “.stalls” counters give a number of stalls (not quite; see below). Also note that in a previous post I had a likwid-perfctr “CYCLE_ACTIVITY” report including these counters for three of the benchmarks, so I can make comparable runs for the other three as linked below.
I also notice that likwid-perfctr uses the “*stalls*” counters rather than the “*cycles*” counters, so I’m not sure which is correct. As a sanity test, I ran both the likwid-perfctr report and perf on my six targeted workloads.
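For reference, here is a minimal sketch of how such a run can be driven; this is not the exact wrapper I used, the event list is just the one shown above, and the “./stream” workload command is a placeholder:

import subprocess

# Sketch only: run perf stat system-wide around a workload and return the
# raw text perf prints (perf stat writes its counts to stderr).
EVENTS = [
    "cpu-cycles",
    "cycle_activity.cycles_no_execute",
    "cycle_activity.cycles_l1d_pending",
    "cycle_activity.cycles_l2_pending",
    "cycle_activity.cycles_ldm_pending",
    "cycle_activity.stalls_l1d_pending",
    "cycle_activity.stalls_l2_pending",
    "cycle_activity.stalls_ldm_pending",
]

def run_perf(workload_cmd):
    cmd = ["perf", "stat", "-a", "-e", ",".join(EVENTS), "--"] + workload_cmd
    result = subprocess.run(cmd, capture_output=True, text=True)
    return result.stderr

print(run_perf(["./stream"]))   # placeholder workload command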
Following is the perf output for stream:
Performance counter stats for 'system wide':

    6,211,965,807,896   cpu-cycles                          (62.50%)
    5,960,318,134,549   cycle_activity.cycles_no_execute    (62.50%)
    5,507,402,631,196   cycle_activity.cycles_l1d_pending   (62.50%)
    5,831,519,386,880   cycle_activity.cycles_l2_pending    (62.50%)
    6,093,233,122,660   cycle_activity.cycles_ldm_pending   (62.50%)
    5,305,422,833,023   cycle_activity.stalls_l1d_pending   (37.50%)
    5,601,073,092,552   cycle_activity.stalls_l2_pending    (50.00%)
    5,843,574,030,653   cycle_activity.stalls_ldm_pending   (50.00%)

    258.717437314 seconds time elapsed
In contrast, here is the output for x264:
Performance counter stats for 'system wide':

    1,004,373,726,072   cpu-cycles                          (62.49%)
      492,241,067,036   cycle_activity.cycles_no_execute    (62.49%)
      184,764,611,744   cycle_activity.cycles_l1d_pending   (62.49%)
      717,040,841,165   cycle_activity.cycles_l2_pending    (62.50%)
    1,075,694,568,119   cycle_activity.cycles_ldm_pending   (62.51%)
       91,223,511,372   cycle_activity.stalls_l1d_pending   (37.51%)
      349,530,008,257   cycle_activity.stalls_l2_pending    (50.00%)
      417,440,813,108   cycle_activity.stalls_ldm_pending   (50.00%)

    54.806516905 seconds time elapsed
After reading further on the internet, I believe the “cycles” counters are not of interest. As I understand it, they count the number of cycles where no instructions are dispatched *and* there is an outstanding memory request, but that doesn’t really indicate the stall was caused by the memory request. I further found this paper that empirically tested counters, and I now believe I need to set up the following hierarchy, where counter names are in { }:
active_cycles      = {cpu-cycles}
productive_cycles  = active_cycles - {cycle_activity.cycles_no_execute}
stall_cycles       = {cycle_activity.cycles_no_execute}
memory_bound       = {cycle_activity.stalls_l1d_pending}
bandwidth_bound    = {l1d_pend_miss.fb_full} + {offcore_requests_buffer.sq_full}
latency_bound      = memory_bound - bandwidth_bound
other_stall_reason = stall_cycles - memory_bound
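To make the arithmetic concrete, here is a small sketch of how I read that hierarchy, with every category expressed as a fraction of active cycles; this is my interpretation of the counters above, not code from the paper:

def backend_breakdown(c):
    # c is a dict of raw counter values keyed by perf event name.
    active    = c["cpu-cycles"]
    stalls    = c["cycle_activity.cycles_no_execute"]
    mem_bound = c["cycle_activity.stalls_l1d_pending"]
    bandwidth = c["l1d_pend_miss.fb_full"] + c["offcore_requests_buffer.sq_full"]
    return {
        "productive_cycles": (active - stalls) / active,
        "stalled_cycles":    stalls / active,
        "memory_bound":      mem_bound / active,
        "bandwidth":         bandwidth / active,
        "latency":           (mem_bound - bandwidth) / active,
        "other_stalls":      (stalls - mem_bound) / active,
    }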
So I think there are five counters total, or six if I also add {resource_stalls.sb} to cover store-related stalls as well as load stalls. Below I’ve collected these with perf. For stream this is:
Performance counter stats for 'system wide':

    6,219,060,933,176   cpu-cycles                          (83.33%)
    5,968,672,825,601   cycle_activity.cycles_no_execute    (83.33%)
    5,312,257,710,841   cycle_activity.stalls_l1d_pending   (83.33%)
      245,370,423,703   resource_stalls.sb                  (83.33%)
    1,491,679,451,897   l1d_pend_miss.fb_full               (83.33%)
    2,528,175,472,523   offcore_requests_buffer.sq_full     (66.67%)
By the equations above, I believe the percentages for STREAM become:
active_cycles     = 100.0%
productive_cycles =   4.0%
stalled_cycles    =  96.0%
memory_bound      =  85.4%
bandwidth         =  64.6%
latency           =  20.8%
other stalls      =  10.6%
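As a check, feeding the raw STREAM counts from the perf run above through the backend_breakdown() sketch reproduces those numbers:

stream_counts = {
    "cpu-cycles":                        6_219_060_933_176,
    "cycle_activity.cycles_no_execute":  5_968_672_825_601,
    "cycle_activity.stalls_l1d_pending": 5_312_257_710_841,
    "l1d_pend_miss.fb_full":             1_491_679_451_897,
    "offcore_requests_buffer.sq_full":   2_528_175_472_523,
}
for name, frac in backend_breakdown(stream_counts).items():
    print(f"{name:17s} {100.0 * frac:5.1f}%")
# prints roughly: productive_cycles 4.0%, stalled_cycles 96.0%,
# memory_bound 85.4%, bandwidth 64.6%, latency 20.8%, other_stalls 10.6%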
The measurements for x264 are as follows:
    1,006,171,527,484   cpu-cycles                          (83.33%)
      491,458,850,680   cycle_activity.cycles_no_execute    (83.33%)
       91,391,287,054   cycle_activity.stalls_l1d_pending   (83.34%)
       19,777,655,122   resource_stalls.sb                  (83.34%)
       13,236,384,103   l1d_pend_miss.fb_full               (83.34%)
        4,762,008,291   offcore_requests_buffer.sq_full     (66.66%)
active_cycles = 100.0%
productive_cycles = 90.9%
stalled_cycles = 9.1%
memory_bound = 1.9% (based on stores)
bandwidth = 1.9%
other stalls = 7.2%
So the statistics at least point in the right direction.
As next steps, I’ll look at implementing these counters with wspy and a standalone wrapper tool, and then run them across some workloads, particularly those that are heavily backend-bound. A next-level drill-down after that might be to look at factors influencing the memory-bound apps, e.g., cache miss rates and similar metrics.