Performance counters and memory analysis; checking with pmu_tools
After my previous post on backend analysis, I got a chance to look further into breaking out backend stall statistics.
In particular, I found three pieces of information helpful:
- A recent slideset provides an overview of the metrics and methods
- The Topdown Microarchitectural Analysis (TMA) spreadsheet from Intel provides the performance counters and formulas for the analysis
- Intel's PMU tools implement these analyses
Based on this, I can run stream under the PMU tools with a "-l 2" (level 2) analysis and get some useful information. Excluding the output from stream itself, I get the perf command line that the tools run, showing the counters being measured:
perf stat -x\; --no-merge -e '{cpu/event=0x3c,umask=0x0,any=1/,cpu/event=0xe,umask=0x1/,cpu/event=0x9c,umask=0x1/,cpu/event=0xd,umask=0x3,any=1,cmask=1/,cpu/event=0xc2,umask=0x2/},
{cpu/event=0x3c,umask=0x0,any=1/,instructions,cycles,cpu/event=0x9c,umask=0x1/,cpu/event=0x9c,umask=0x1,cmask=4/},
{cpu/event=0x3c,umask=0x0,any=1/,cpu/event=0xe,umask=0x1/,cpu/event=0x79,umask=0x30/,cpu/event=0xc2,umask=0x2/},
{cpu/event=0xc5,umask=0x0/,cpu/event=0xc3,umask=0x1,edge=1,cmask=1/},
{cpu/event=0xa3,umask=0x6,cmask=6/,cpu/event=0xa3,umask=0x4,cmask=4/,cycles,cpu/event=0xa2,umask=0x8/,cpu/event=0xb1,umask=0x2,cmask=1/},
{cpu/event=0xb1,umask=0x2,cmask=2/,cpu/event=0xb1,umask=0x2,cmask=3/,cpu/event=0x9c,umask=0x1,cmask=4/,cpu/event=0x5e,umask=0x1/,instructions}' -A -a ./stream-bin
I also get the processed output from the tools showing the top areas:
C0    BE     Backend_Bound:                95.27 % Slots [ 16.66%]
C0    BE/Mem Backend_Bound.Memory_Bound:   91.36 % Slots [ 16.66%] <==
        This metric represents slots fraction the Memory subsystem within the Backend was a bottleneck...
C0-T0 MUX:   16.66 % PerfMon Event Multiplexing accuracy indicator
C1    BE     Backend_Bound:                95.20 % Slots [ 16.66%]
C1    BE/Mem Backend_Bound.Memory_Bound:   91.30 % Slots [ 16.66%] <==
C1-T0 MUX:   16.66 %
C2    BE     Backend_Bound:                95.28 % Slots [ 16.66%]
C2    BE/Mem Backend_Bound.Memory_Bound:   91.37 % Slots [ 16.66%] <==
C2-T0 MUX:   16.66 %
C3    BE     Backend_Bound:                95.08 % Slots [ 16.66%]
C3    BE/Mem Backend_Bound.Memory_Bound:   91.21 % Slots [ 16.66%] <==
C3-T0 MUX:   16.66 %
C0-T1 MUX:   16.66 %
C1-T1 MUX:   16.66 %
C2-T1 MUX:   16.66 %
C3-T1 MUX:   16.66 %
This shows, for core 0, a backend-bound percentage of 95.27% of slots, with 91.36% of slots memory bound and the balance (3.91%) core bound.
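Since I eventually want to graph this kind of per-core behavior (see the multiplexing discussion at the end), here is a small Python sketch of pulling the percentages out of output like the above. It is my own illustration, keyed to the listing shown here; the exact column layout varies across pmu-tools versions.

import re

# Pull per-core metric percentages out of toplev-style output like the listing
# above. MUX lines and description lines do not match the pattern and are skipped.
METRIC_LINE = re.compile(r"^(C\d+(?:-T\d+)?)\s+\S+\s+(\S+):\s+([\d.]+)\s*%")

def parse_toplev(text):
    results = {}
    for line in text.splitlines():
        m = METRIC_LINE.match(line.strip())
        if m:
            core, metric, pct = m.group(1), m.group(2), float(m.group(3))
            results.setdefault(core, {})[metric] = pct
    return results

# e.g. parse_toplev(open("toplev.log").read())["C0"]["Backend_Bound"] -> 95.27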
Using the TMA spreadsheet, I believe the following equations were in play for computing memory-bound vs. core-bound:
FrontEnd = {topdown-fetch-bubbles} / {topdown-total-slots}
Retire = {topdown-slots-retired} / {topdown-total-slots}
BadSpec = ({topdown-slots-issued} - {topdown-slots-retired} + {topdown-recovery-bubbles})/ {topdown-total-slots}
Backend = 1 - (FrontEnd + Retire + BadSpec)
Event: {topdown-fetch-bubbles}; event=0x9c,umask=0x1
Event: {topdown-total-slots}; event=0x3c,umask=0x0,any=1 * scale=2
Event: {topdown-slots-retired}; event=0xc2,umask=0x2
Event: {topdown-slots-issued}; event=0xe,umask=0x1
Event: {topdown-recovery-bubbles}; event=0xd,umask=0x3,cmask=1,any=1 * scale=2
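To make the level-1 arithmetic concrete, here is a minimal Python sketch (my own illustration, not pmu_tools code) that applies the four formulas above to raw counter readings. The function and variable names are mine, the example values are made up, and total_slots is assumed to already include the scale=2 factor noted above.

# Level-1 topdown fractions from raw counter readings (illustrative).
def topdown_level1(total_slots, slots_issued, slots_retired,
                   fetch_bubbles, recovery_bubbles):
    frontend = fetch_bubbles / total_slots
    retire = slots_retired / total_slots
    bad_spec = (slots_issued - slots_retired + recovery_bubbles) / total_slots
    backend = 1.0 - (frontend + retire + bad_spec)
    return {"FrontEnd": frontend, "Retire": retire,
            "BadSpec": bad_spec, "Backend": backend}

# Made-up counter values, just to show the shape of the result:
print(topdown_level1(total_slots=4_000_000, slots_issued=900_000,
                     slots_retired=850_000, fetch_bubbles=100_000,
                     recovery_bubbles=20_000))

The spreadsheet then breaks Backend down further into memory-bound and core-bound: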
Backend.MemoryBound = #Memory_Bound_Fraction * Backend
Backend.CoreBound = Backend - Backend.MemoryBound
#Memory_Bound_Fraction = (#Stalls_Mem_Any + {RESOURCE_STALLS.SB})/#Backend_Bound_Cycles
#Stalls_Mem_Any = min({CPU_CLK_UNHALTED.THREAD},{CYCLE_ACTIVITY.STALLS_LDM_PENDING})
#Backend_Bound_Cycles = #Stalls_Total + (UOPS_EXECUTED.CORE:c1 - #Few_Uops_Executed_Threshold)/2 - #Frontend_RS_Empty_Cycles
#Stalls_Total = {CYCLE_ACTIVITY.CYCLES_NO_EXECUTE}
#Few_Uops_Executed_Threshold = UOPS_EXECUTED.CORE:c3 if (IPC > 1.8) else UOPS_EXECUTED.CORE:c2
#Frontend_RS_Empty_Cycles = RS_EVENTS.EMPTY_CYCLES if (Frontend_Latency > 0.1) else 0
Event: {RESOURCE_STALLS.SB}; event=0xa2,umask=0x8
Event: {CPU_CLK_UNHALTED.THREAD}; event=0x0,umask=0x2 // not used
Event: {CYCLE_ACTIVITY.STALLS_LDM_PENDING}; event=0xa3,umask=0x6,cmask=0x6
Event: {UOPS_EXECUTED.CORE}:c1 ; event=0xb1,umask=0x2,cmask=0x1
Event: {UOPS_EXECUTED.CORE}:c2 ; event=0xb1,umask=0x2,cmask=0x2
Event: {UOPS_EXECUTED.CORE}:c3 ; event=0xb1,umask=0x2,cmask=0x3
Event: {CYCLE_ACTIVITY.CYCLES_NO_EXECUTE}; event=0xa3,umask=0x4,cmask=0x4
Event: {RS_EVENTS.EMPTY_CYCLES}; event=0x5e,umask=0x1
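As before, purely as an illustration (my own Python sketch, not code from the spreadsheet or pmu_tools), the memory-bound vs. core-bound split follows from those events. The variable names, the SMT-on assumption (hence the /2 on the core-scope uops counter), and the frontend_latency input (taken from the frontend level-2 breakdown) are mine.

# Level-2 memory-bound vs. core-bound split, following the formulas above.
# All counter values are raw per-thread readings; names are illustrative.
def backend_split(backend,               # Backend fraction from level 1
                  clks,                  # CPU_CLK_UNHALTED.THREAD ("cycles")
                  instructions,
                  stalls_ldm_pending,    # CYCLE_ACTIVITY.STALLS_LDM_PENDING
                  cycles_no_execute,     # CYCLE_ACTIVITY.CYCLES_NO_EXECUTE
                  resource_stalls_sb,    # RESOURCE_STALLS.SB
                  uops_exec_c1,          # UOPS_EXECUTED.CORE:c1
                  uops_exec_c2,          # UOPS_EXECUTED.CORE:c2
                  uops_exec_c3,          # UOPS_EXECUTED.CORE:c3
                  rs_empty_cycles,       # RS_EVENTS.EMPTY_CYCLES
                  frontend_latency):     # Frontend_Latency fraction (frontend level 2)
    ipc = instructions / clks
    few_uops_threshold = uops_exec_c3 if ipc > 1.8 else uops_exec_c2
    frontend_rs_empty = rs_empty_cycles if frontend_latency > 0.1 else 0
    stalls_total = cycles_no_execute                   # #Stalls_Total
    stalls_mem_any = min(clks, stalls_ldm_pending)     # #Stalls_Mem_Any
    backend_bound_cycles = (stalls_total
                            + (uops_exec_c1 - few_uops_threshold) / 2  # /2: SMT on
                            - frontend_rs_empty)
    memory_bound_fraction = (stalls_mem_any + resource_stalls_sb) / backend_bound_cycles
    memory_bound = memory_bound_fraction * backend
    core_bound = backend - memory_bound
    return memory_bound, core_bound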
This seems to require five counters to compute the overall "topdown" metrics and an additional seven counters to separate memory-bound from core-bound. For comparison, I can also decode the counters being used in the "perf stat" line above:
Topdown metrics, level #1
{cpu/event=0x3c,umask=0x0,any=1/, # topdown-total-slots
cpu/event=0xe,umask=0x1/, # topdown-slots-issued
cpu/event=0x9c,umask=0x1/, # topdown-fetch-bubbles
cpu/event=0xd,umask=0x3,any=1,cmask=1/, # topdown-recovery-bubbles
cpu/event=0xc2,umask=0x2/}, # topdown-slots-retired
Topdown metrics, level #2, frontend latency vs. frontend bandwidth
{cpu/event=0x3c,umask=0x0,any=1/, # topdown-total-slots
instructions,
cycles,
cpu/event=0x9c,umask=0x1/, # topdown-fetch-bubbles
cpu/event=0x9c,umask=0x1,cmask=4/}, # IDQ_UOPS_NOT_DELIVERED.CYCLES_0_UOPS_DELIV.CORE
Retiring metrics: level #2: microcode sequencer vs. base
{cpu/event=0x3c,umask=0x0,any=1/, # topdown-total-slots
cpu/event=0xe,umask=0x1/, # topdown-slots-issued
cpu/event=0x79,umask=0x30/, # IDQ.MS_UOPS
cpu/event=0xc2,umask=0x2/}, # topdown-slots-retired
Speculation metrics: level #2: branch mispredicts and machine clears
{cpu/event=0xc5,umask=0x0/, # BR_MISP_RETIRED.ALL_BRANCHES
cpu/event=0xc3,umask=0x1,edge=1,cmask=1/}, # MACHINE_CLEARS.COUNT
Backend metrics: level #2: memory-bound vs. core-bound
{cpu/event=0xa3,umask=0x6,cmask=6/, # CYCLE_ACTIVITY.STALLS_LDM_PENDING
cpu/event=0xa3,umask=0x4,cmask=4/, # CYCLE_ACTIVITY.CYCLES_NO_EXECUTE
cycles,
cpu/event=0xa2,umask=0x8/, # RESOURCE_STALLS.SB
cpu/event=0xb1,umask=0x2,cmask=1/}, # UOPS_EXECUTED.CYCLES_GE_1_UOP_EXEC
{cpu/event=0xb1,umask=0x2,cmask=2/, # UOPS_EXECUTED.CYCLES_GE_2_UOPS_EXEC
cpu/event=0xb1,umask=0x2,cmask=3/, # UOPS_EXECUTED.CYCLES_GE_3_UOPS_EXEC
cpu/event=0x9c,umask=0x1,cmask=4/, # IDQ_UOPS_NOT_DELIVERED.CYCLES_0_UOPS_DELIV.CORE
cpu/event=0x5e,umask=0x1/, # RS_EVENTS.EMPTY_CYCLES
instructions}
As a whole, 26 counters are organized into six groups, each group measured together. As annotated above, the first group covers level #1 of the top-level metrics. The remaining groups go down to level #2 for frontend stalls, retiring, bad speculation, and backend stalls, with 10 counters in total devoted to the backend-stall breakdown, split between two groups.
This suggests to me that measuring the five top-level counters plus the two additional groups of five, and plugging them into the formulas above, yields the next-level breakdown of backend stalls into memory-bound vs. core-bound.
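To make that concrete, here is a Python sketch of what such a reduced measurement could look like, using the raw encodings decoded above. This is my own guess at a minimal set (three groups, 15 counters), not something pmu_tools itself emits, and the encodings are model-specific, so they would need checking against the target CPU.

import shlex
import subprocess

# The level-1 group plus the two backend groups decoded above.
LEVEL1 = ["cpu/event=0x3c,umask=0x0,any=1/",         # topdown-total-slots
          "cpu/event=0xe,umask=0x1/",                # topdown-slots-issued
          "cpu/event=0x9c,umask=0x1/",               # topdown-fetch-bubbles
          "cpu/event=0xd,umask=0x3,any=1,cmask=1/",  # topdown-recovery-bubbles
          "cpu/event=0xc2,umask=0x2/"]               # topdown-slots-retired
BACKEND_A = ["cpu/event=0xa3,umask=0x6,cmask=6/",    # CYCLE_ACTIVITY.STALLS_LDM_PENDING
             "cpu/event=0xa3,umask=0x4,cmask=4/",    # CYCLE_ACTIVITY.CYCLES_NO_EXECUTE
             "cycles",
             "cpu/event=0xa2,umask=0x8/",            # RESOURCE_STALLS.SB
             "cpu/event=0xb1,umask=0x2,cmask=1/"]    # UOPS_EXECUTED.CORE:c1
BACKEND_B = ["cpu/event=0xb1,umask=0x2,cmask=2/",    # UOPS_EXECUTED.CORE:c2
             "cpu/event=0xb1,umask=0x2,cmask=3/",    # UOPS_EXECUTED.CORE:c3
             "cpu/event=0x9c,umask=0x1,cmask=4/",    # IDQ_UOPS_NOT_DELIVERED, cmask 4
             "cpu/event=0x5e,umask=0x1/",            # RS_EVENTS.EMPTY_CYCLES
             "instructions"]

def measure(workload):
    # perf stat writes its CSV counter results to stderr; return them as text.
    events = ",".join("{" + ",".join(g) + "}" for g in (LEVEL1, BACKEND_A, BACKEND_B))
    cmd = ["perf", "stat", "-x;", "-a", "-e", events] + shlex.split(workload)
    return subprocess.run(cmd, check=True, capture_output=True, text=True).stderr

# e.g. measure("./stream-bin")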
The larger question is what "perfcounter-model" makes sense for this; in particular, whether to add it to a per-process or sampled-core model, or to measure it only once overall. The problem is that as counters get multiplexed, they also become somewhat less accurate. This is a bigger issue for "volatile" benchmarks that vary widely over their run, which are precisely the ones whose behavior I would want to graph.
I'll probably lean toward implementing this once at the application level and then see.
