Performance counters and memory analysis; checking with pmu_tools
After my previous post on backend analysis, I got a chance to look further into breaking out backend stall statistics.
In particular, I found three pieces of information helpful:
- A recent slideset provides an overview of the metrics and methods
- The Topdown Microarchitectural Analysis (TMA) spreadsheet from Intel provides the performance counters and formulas for the analysis
- Intel's PMU tools implement these analyses
Based on this, I can run stream under the PMU tools with a "-l 2" (level 2) analysis and get some useful information. Excluding the output from stream itself, I get the perf command line that the tools run, showing the counters being measured:
perf stat -x\; --no-merge -e '{cpu/event=0x3c,umask=0x0,any=1/,cpu/event=0xe,umask=0x1/,cpu/event=0x9c,umask=0x1/,cpu/event=0xd,umask=0x3,any=1,cmask=1/,cpu/event=0xc2,umask=0x2/},
{cpu/event=0x3c,umask=0x0,any=1/,instructions,cycles,cpu/event=0x9c,umask=0x1/,cpu/event=0x9c,umask=0x1,cmask=4/},
{cpu/event=0x3c,umask=0x0,any=1/,cpu/event=0xe,umask=0x1/,cpu/event=0x79,umask=0x30/,cpu/event=0xc2,umask=0x2/},
{cpu/event=0xc5,umask=0x0/,cpu/event=0xc3,umask=0x1,edge=1,cmask=1/},
{cpu/event=0xa3,umask=0x6,cmask=6/,cpu/event=0xa3,umask=0x4,cmask=4/,cycles,cpu/event=0xa2,umask=0x8/,cpu/event=0xb1,umask=0x2,cmask=1/},
{cpu/event=0xb1,umask=0x2,cmask=2/,cpu/event=0xb1,umask=0x2,cmask=3/,cpu/event=0x9c,umask=0x1,cmask=4/,cpu/event=0x5e,umask=0x1/,instructions}' -A -a ./stream-bin
I also get the processed output from the tools showing the top areas:
C0    BE     Backend_Bound:                95.27 % Slots [ 16.66%]
C0    BE/Mem Backend_Bound.Memory_Bound:   91.36 % Slots [ 16.66%] <==
        This metric represents slots fraction the Memory subsystem within the Backend was a bottleneck...
C0-T0 MUX:   16.66 % PerfMon Event Multiplexing accuracy indicator
C1    BE     Backend_Bound:                95.20 % Slots [ 16.66%]
C1    BE/Mem Backend_Bound.Memory_Bound:   91.30 % Slots [ 16.66%] <==
C1-T0 MUX:   16.66 %
C2    BE     Backend_Bound:                95.28 % Slots [ 16.66%]
C2    BE/Mem Backend_Bound.Memory_Bound:   91.37 % Slots [ 16.66%] <==
C2-T0 MUX:   16.66 %
C3    BE     Backend_Bound:                95.08 % Slots [ 16.66%]
C3    BE/Mem Backend_Bound.Memory_Bound:   91.21 % Slots [ 16.66%] <==
C3-T0 MUX:   16.66 %
C0-T1 MUX:   16.66 %
C1-T1 MUX:   16.66 %
C2-T1 MUX:   16.66 %
C3-T1 MUX:   16.66 %
This shows, for core 0, a backend-bound percentage of 95.27% of slots, with 91.36% of slots memory bound and the balance (3.91%) core bound.
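Since I eventually want to graph this kind of per-core behavior (see the multiplexing discussion at the end), here is a small Python sketch of pulling the percentages out of output like the above. It is my own illustration, keyed to the listing shown here; the exact column layout varies across pmu-tools versions.

import re

# Pull per-core metric percentages out of toplev-style output like the listing
# above. MUX lines and description lines do not match the pattern and are skipped.
METRIC_LINE = re.compile(r"^(C\d+(?:-T\d+)?)\s+\S+\s+(\S+):\s+([\d.]+)\s*%")

def parse_toplev(text):
    results = {}
    for line in text.splitlines():
        m = METRIC_LINE.match(line.strip())
        if m:
            core, metric, pct = m.group(1), m.group(2), float(m.group(3))
            results.setdefault(core, {})[metric] = pct
    return results

# e.g. parse_toplev(open("toplev.log").read())["C0"]["Backend_Bound"] -> 95.27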
Using the TMA spreadsheet, I believe the following equations were in play for computing memory-bound vs. core-bound:
FrontEnd = {topdown-fetch-bubbles} / {topdown-total-slots}
Retire = {topdown-slots-retired} / {topdown-total-slots}
BadSpec = ({topdown-slots-issued} - {topdown-slots-retired} + {topdown-recovery-bubbles})/ {topdown-total-slots}
Backend = 1 - (FrontEnd + Retire + BadSpec)
Event: {topdown-fetch-bubbles}; event=0x9c,umask=0x1
Event: {topdown-total-slots}; event=0x3c,umask=0x0,any=1 * scale=2
Event: {topdown-slots-retired}; event=0xc2,umask=0x2
Event: {topdown-slots-issued}; event=0xe,umask=0x1
Event: {topdown-recovery-bubbles}; event=0xd,umask=0x3,cmask=1,any=1 * scale=2
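To make the level-1 arithmetic concrete, here is a minimal Python sketch (my own illustration, not pmu_tools code) that applies the four formulas above to raw counter readings. The function and variable names are mine, the example values are made up, and total_slots is assumed to already include the scale=2 factor noted above.

# Level-1 topdown fractions from raw counter readings (illustrative).
def topdown_level1(total_slots, slots_issued, slots_retired,
                   fetch_bubbles, recovery_bubbles):
    frontend = fetch_bubbles / total_slots
    retire = slots_retired / total_slots
    bad_spec = (slots_issued - slots_retired + recovery_bubbles) / total_slots
    backend = 1.0 - (frontend + retire + bad_spec)
    return {"FrontEnd": frontend, "Retire": retire,
            "BadSpec": bad_spec, "Backend": backend}

# Made-up counter values, just to show the shape of the result:
print(topdown_level1(total_slots=4_000_000, slots_issued=900_000,
                     slots_retired=850_000, fetch_bubbles=100_000,
                     recovery_bubbles=20_000))

The spreadsheet then breaks Backend down further into memory-bound and core-bound: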
Backend.MemoryBound = #Memory_Bound_Fraction * Backend
Backend.CoreBound = Backend - Backend.MemoryBound
#Memory_Bound_Fraction = (#Stalls_Mem_Any + {RESOURCE_STALLS.SB})/#Backend_Bound_Cycles
#Stalls_Mem_Any = min({CPU_CLK_UNHALTED.THREAD},{CYCLE_ACTIVITY.STALLS_LDM_PENDING})
#Backend_Bound_Cycles = #Stalls_Total + (UOPS_EXECUTED.CORE:c1 - #Few_Uops_Executed_Threshold)/2 - #Frontend_RS_Empty_Cycles
#Stalls_Total = {CYCLE_ACTIVITY.CYCLES_NO_EXECUTE}
#Few_Uops_Executed_Threshold = UOPS_EXECUTED.CORE:c3 if (IPC > 1.8) else UOPS_EXECUTED.CORE:c2
#Frontend_RS_Empty_Cycles = RS_EVENTS.EMPTY_CYCLES if (Frontend_Latency > 0.1) else 0
Event: {RESOURCE_STALLS.SB}; event=0xa2,umask=0x8
Event: {CPU_CLK_UNHALTED.THREAD}; event=0x0,umask=0x2 // not used
Event: {CYCLE_ACTIVITY.STALLS_LDM_PENDING}; event=0xa3,umask=0x6,cmask=0x6
Event: {UOPS_EXECUTED.CORE}:c1 ; event=0xb1,umask=0x2,cmask=0x1
Event: {UOPS_EXECUTED.CORE}:c2 ; event=0xb1,umask=0x2,cmask=0x2
Event: {UOPS_EXECUTED.CORE}:c3 ; event=0xb1,umask=0x2,cmask=0x3
Event: {CYCLE_ACTIVITY.CYCLES_NO_EXECUTE}; event=0xa3,umask=0x4,cmask=0x4
Event: {RS_EVENTS.EMPTY_CYCLES}; event=0x5e,umask=0x1
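As before, purely as an illustration (my own Python sketch, not code from the spreadsheet or pmu_tools), the memory-bound vs. core-bound split follows from those events. The variable names, the SMT-on assumption (hence the /2 on the core-scope uops counter), and the frontend_latency input (taken from the frontend level-2 breakdown) are mine.

# Level-2 memory-bound vs. core-bound split, following the formulas above.
# All counter values are raw per-thread readings; names are illustrative.
def backend_split(backend,               # Backend fraction from level 1
                  clks,                  # CPU_CLK_UNHALTED.THREAD ("cycles")
                  instructions,
                  stalls_ldm_pending,    # CYCLE_ACTIVITY.STALLS_LDM_PENDING
                  cycles_no_execute,     # CYCLE_ACTIVITY.CYCLES_NO_EXECUTE
                  resource_stalls_sb,    # RESOURCE_STALLS.SB
                  uops_exec_c1,          # UOPS_EXECUTED.CORE:c1
                  uops_exec_c2,          # UOPS_EXECUTED.CORE:c2
                  uops_exec_c3,          # UOPS_EXECUTED.CORE:c3
                  rs_empty_cycles,       # RS_EVENTS.EMPTY_CYCLES
                  frontend_latency):     # Frontend_Latency fraction (frontend level 2)
    ipc = instructions / clks
    few_uops_threshold = uops_exec_c3 if ipc > 1.8 else uops_exec_c2
    frontend_rs_empty = rs_empty_cycles if frontend_latency > 0.1 else 0
    stalls_total = cycles_no_execute                   # #Stalls_Total
    stalls_mem_any = min(clks, stalls_ldm_pending)     # #Stalls_Mem_Any
    backend_bound_cycles = (stalls_total
                            + (uops_exec_c1 - few_uops_threshold) / 2  # /2: SMT on
                            - frontend_rs_empty)
    memory_bound_fraction = (stalls_mem_any + resource_stalls_sb) / backend_bound_cycles
    memory_bound = memory_bound_fraction * backend
    core_bound = backend - memory_bound
    return memory_bound, core_bound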
This seems to require five counters to compute the overall "topdown" metrics and an additional seven counters to separate memory-bound from core-bound. For comparison, I can also decode the counters being used in the "perf stat" line above:
Topdown metrics, level #1
{cpu/event=0x3c,umask=0x0,any=1/, # topdown-total-slots
cpu/event=0xe,umask=0x1/, # topdown-slots-issued
cpu/event=0x9c,umask=0x1/, # topdown-fetch-bubbles
cpu/event=0xd,umask=0x3,any=1,cmask=1/, # topdown-recovery-bubbles
cpu/event=0xc2,umask=0x2/}, # topdown-slots-retired
Topdown metrics, level #2, frontend latency vs. frontend bandwidth
{cpu/event=0x3c,umask=0x0,any=1/, # topdown-total-slots
instructions,
cycles,
cpu/event=0x9c,umask=0x1/, # topdown-fetch-bubbles
cpu/event=0x9c,umask=0x1,cmask=4/}, # IDQ_UOPS_NOT_DELIVERED.CYCLES_0_UOPS_DELIV.CORE
Retiring metrics: level #2: microcode sequencer vs. base
{cpu/event=0x3c,umask=0x0,any=1/, # topdown-total-slots
cpu/event=0xe,umask=0x1/, # topdown-slots-issued
cpu/event=0x79,umask=0x30/, # IDQ.MS_UOPS
cpu/event=0xc2,umask=0x2/}, # topdown-slots-retired
Speculation metrics: level #2: branch mispredicts and machine clears
{cpu/event=0xc5,umask=0x0/, # BR_MISP_RETIRED.ALL_BRANCHES
cpu/event=0xc3,umask=0x1,edge=1,cmask=1/}, # MACHINE_CLEARS.COUNT
Backend metrics: level #2: memory-bound vs. core-bound
{cpu/event=0xa3,umask=0x6,cmask=6/, # CYCLE_ACTIVITY.STALLS_LDM_PENDING
cpu/event=0xa3,umask=0x4,cmask=4/, # CYCLE_ACTIVITY.CYCLES_NO_EXECUTE
cycles,
cpu/event=0xa2,umask=0x8/, # RESOURCE_STALLS.SB
cpu/event=0xb1,umask=0x2,cmask=1/}, # UOPS_EXECUTED.CYCLES_GE_1_UOP_EXEC
{cpu/event=0xb1,umask=0x2,cmask=2/, # UOPS_EXECUTED.CYCLES_GE_2_UOPS_EXEC
cpu/event=0xb1,umask=0x2,cmask=3/, # UOPS_EXECUTED.CYCLES_GE_3_UOPS_EXEC
cpu/event=0x9c,umask=0x1,cmask=4/, # IDQ_UOPS_NOT_DELIVERED.CYCLES_0_UOPS_DELIV.CORE
cpu/event=0x5e,umask=0x1/, # RS_EVENTS.EMPTY_CYCLES
instructions}
As a whole, 26 counters are organized into six groups, each group measured together. As annotated above, the first group covers level #1 of the top-level metrics. The remaining groups go down to level #2 for frontend stalls, retiring, bad speculation, and backend stalls, with 10 counters in total devoted to the backend-stall breakdown, split between two groups.
This suggests to me that measuring the five top-level counters plus the two additional groups of five, and plugging them into the formulas above, yields the next-level breakdown of backend stalls into memory-bound vs. core-bound.
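To make that concrete, here is a Python sketch of what such a reduced measurement could look like, using the raw encodings decoded above. This is my own guess at a minimal set (three groups, 15 counters), not something pmu_tools itself emits, and the encodings are model-specific, so they would need checking against the target CPU.

import shlex
import subprocess

# The level-1 group plus the two backend groups decoded above.
LEVEL1 = ["cpu/event=0x3c,umask=0x0,any=1/",         # topdown-total-slots
          "cpu/event=0xe,umask=0x1/",                # topdown-slots-issued
          "cpu/event=0x9c,umask=0x1/",               # topdown-fetch-bubbles
          "cpu/event=0xd,umask=0x3,any=1,cmask=1/",  # topdown-recovery-bubbles
          "cpu/event=0xc2,umask=0x2/"]               # topdown-slots-retired
BACKEND_A = ["cpu/event=0xa3,umask=0x6,cmask=6/",    # CYCLE_ACTIVITY.STALLS_LDM_PENDING
             "cpu/event=0xa3,umask=0x4,cmask=4/",    # CYCLE_ACTIVITY.CYCLES_NO_EXECUTE
             "cycles",
             "cpu/event=0xa2,umask=0x8/",            # RESOURCE_STALLS.SB
             "cpu/event=0xb1,umask=0x2,cmask=1/"]    # UOPS_EXECUTED.CORE:c1
BACKEND_B = ["cpu/event=0xb1,umask=0x2,cmask=2/",    # UOPS_EXECUTED.CORE:c2
             "cpu/event=0xb1,umask=0x2,cmask=3/",    # UOPS_EXECUTED.CORE:c3
             "cpu/event=0x9c,umask=0x1,cmask=4/",    # IDQ_UOPS_NOT_DELIVERED, cmask 4
             "cpu/event=0x5e,umask=0x1/",            # RS_EVENTS.EMPTY_CYCLES
             "instructions"]

def measure(workload):
    # perf stat writes its CSV counter results to stderr; return them as text.
    events = ",".join("{" + ",".join(g) + "}" for g in (LEVEL1, BACKEND_A, BACKEND_B))
    cmd = ["perf", "stat", "-x;", "-a", "-e", events] + shlex.split(workload)
    return subprocess.run(cmd, check=True, capture_output=True, text=True).stderr

# e.g. measure("./stream-bin")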
The larger question is what "perfcounter-model" makes sense for this; in particular, whether to add it to a per-process or sampled-core model, or to measure it only once overall. The problem is that as counters get multiplexed, they also become somewhat less accurate. This is a bigger issue for "volatile" benchmarks that vary widely over their run, which are precisely the ones whose behavior I would want to graph.
I'll probably lean toward implementing this once at the application level and then see.
