Top down performance counter analysis (part 1) – likwid & perf – Performance analysis, tools and experiments

In this posting I summarize top-down performance counter analysis to evaluate workloads and show how this can be measured on Haswell using likwid-perfctr and perf. In part 2 to follow, I’ll describe how top down metrics have been added to wspy.

Top Down Analysis
The top down analysis approach is based on a paper and slides by Ahmad Yasin. This technique is also described on Intel’s web site.

It also turns out this is implemented in the perf(1) tool for Intel platforms with the --topdown option, including some generic counters that have been added.

Before reading about this technique, my initial plan was to first measure, calculate or research the costs of various events (e.g. cache misses, TLB misses, branch misses, memory accesses) and then use performance counters to measure how frequently these events occur. By combining frequency and cost and comparing across workloads, I might better understand what factors are limiting particular workloads. While I might still take some of this approach later, the top-down method makes a lot of sense as a start.

There are several reasons why first costing events and then adding frequencies can be difficult on a multi-core, superscalar, parallel and speculative micro-architecture:

Finding costs is not always straightforward, since almost by definition the micro-architecture is trying to minimize their effects. Hence, it can be tough to design experiments that actually measure what you expect.
Many operations occur in parallel. This means that forward progress could be stalled for more than one reason. It is also possible that other progress gets made during a stall. For example, in a hyper-threaded program the other thread might take advantage of a stall.
While there is no shortage of events and counters to examine, it can also be difficult to know which of these events are most relevant.

The main idea behind top-down performance counter analysis is to first characterize the workload based on key metrics. These metrics are based on a small number of counters in the Intel architecture that show how well the processor pipelines are being used. With this overall pipeline analysis one can first classify bottlenecks into several categories (front-end, back-end, retiring, speculation). The dominant category then guides which more specific events to analyze next.

Measurement Techniques
Slide #13 of Ahmad Yasin’s slides defines the following event names based on five counters:

TotalSlots = 4 * CPU_CLK_UNHALTED.THREAD
SlotsIssued = UOPS_ISSUED.ANY
SlotsRetired = UOPS_RETIRED.RETIRE_SLOTS
FetchBubbles = IDQ_UOPS_NOT_DELIVERED.CORE
RecoveryBubbles = 4 * INT_MISC.RECOVERY_CYCLES

These events are then used to compute the following metrics:

Frontend Bound = FetchBubbles / TotalSlots
Bad Speculation = (SlotsIssued - SlotsRetired + RecoveryBubbles) / TotalSlots
Retiring = SlotsRetired / TotalSlots
Backend Bound = 1 - (FrontendBound + BadSpeculation + Retiring)
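
As a minimal sketch of the arithmetic above, the following Python function computes the four level 1 metrics from the five raw counter values. The function name, the argument names and the sample counts are made up for illustration; they do not come from the slides or from any tool.

```python
def topdown_level1(clk_unhalted, uops_issued, uops_retired_slots,
                   idq_uops_not_delivered, recovery_cycles, width=4):
    """Return the four top-down level 1 fractions (0..1) as a dict.

    width is the number of issue slots per cycle (4 in the formulas above).
    """
    total_slots = width * clk_unhalted
    if total_slots <= 0:
        raise ValueError("cycle count must be positive")
    recovery_bubbles = width * recovery_cycles

    frontend = idq_uops_not_delivered / total_slots
    bad_spec = (uops_issued - uops_retired_slots + recovery_bubbles) / total_slots
    retiring = uops_retired_slots / total_slots
    # Backend bound is whatever is left once the other three are accounted for.
    backend = 1.0 - (frontend + bad_spec + retiring)
    return {
        "frontend_bound": frontend,
        "bad_speculation": bad_spec,
        "retiring": retiring,
        "backend_bound": backend,
    }


# Hypothetical counter values, chosen only to make the arithmetic easy to follow.
metrics = topdown_level1(
    clk_unhalted=1_000_000,
    uops_issued=1_500_000,
    uops_retired_slots=1_400_000,
    idq_uops_not_delivered=400_000,
    recovery_cycles=25_000,
)
for name, value in metrics.items():
    print(f"{name:16s} {100 * value:5.1f}%")
```

Because Backend Bound is defined as the remainder, the four fractions always sum to one; any measurement error in the other three therefore ends up in the backend number.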

There are four measurement configurations of interest: likwid, perf, wspy periodic timer and wspy process tree.

In a previous blog post about likwid-perfctr I printed the results from likwid-perfctr runs of the benchmarks in a Phoronix CPU suite. None of the existing performance groups provides exactly these metrics, though a few are close. For example, the CYCLE_ACTIVITY group reports the percentage of cycles spent waiting on stalls due to data traffic. So I created a new performance group and placed the file in

/usr/share/likwid/perfgroups/haswell/TOPDOWN.txt

Here are the contents of the new file I created:

SHORT Top down cycle allocation

EVENTSET
FIXC0 INSTR_RETIRED_ANY
FIXC1 CPU_CLK_UNHALTED_CORE
PMC0 UOPS_ISSUED_ANY
PMC1 UOPS_RETIRED_RETIRE_SLOTS
PMC2 IDQ_UOPS_NOT_DELIVERED_CORE
PMC3 INT_MISC_RECOVERY_CYCLES

METRICS
IPC FIXC0/FIXC1
Total Slots 4*FIXC1
Slots Retired PMC1
Fetch Bubbles PMC2
Recovery Bubbles 4*PMC3
Front End [%] PMC2/(4*FIXC1)*100
Speculation [%] (PMC0-PMC1+(4*PMC3))/(4*FIXC1)*100
Retiring [%] PMC1/(4*FIXC1)*100
Back End [%] (1-((PMC2+PMC0+(4*PMC3))/(4*FIXC1)))*100

LONG
Front End [%] = IDQ_UOPS_NOT_DELIVERED_CORE/(4*CPU_CLK_UNHALTED_CORE)*100
Speculation [%] = (UOPS_ISSUED_ANY-UOPS_RETIRED_RETIRE_SLOTS+(4*INT_MISC_RECOVERY_CYCLES))/(4*CPU_CLK_UNHALTED_CORE)*100
Retiring [%] = UOPS_RETIRED_RETIRE_SLOTS/(4*CPU_CLK_UNHALTED_CORE)*100
Back End [%] = (1-((IDQ_UOPS_NOT_DELIVERED_CORE+UOPS_ISSUED_ANY+(4*INT_MISC_RECOVERY_CYCLES))/(4*CPU_CLK_UNHALTED_CORE)))*100
--
This performance group measures cycles to determine percentage of time spent in front end, back end, retiring and speculation.
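
The Back End formula in the group file folds the other three metrics into a single expression, so it is worth checking that it really is their complement. The sketch below evaluates the four percentage formulas from the METRICS section in Python against made-up counter values. The helper name and the counts are assumptions for illustration, and this is not how likwid itself evaluates group files.

```python
# Formulas copied from the METRICS section of the TOPDOWN group file.
FORMULAS = {
    "Front End [%]": "PMC2/(4*FIXC1)*100",
    "Speculation [%]": "(PMC0-PMC1+(4*PMC3))/(4*FIXC1)*100",
    "Retiring [%]": "PMC1/(4*FIXC1)*100",
    "Back End [%]": "(1-((PMC2+PMC0+(4*PMC3))/(4*FIXC1)))*100",
}


def evaluate_group(formulas, counters):
    """Evaluate each formula with the counter names bound to their values."""
    # eval() is tolerable here only because the formulas are fixed strings
    # from our own file; builtins are disabled as a precaution.
    return {
        name: eval(expr, {"__builtins__": {}}, dict(counters))
        for name, expr in formulas.items()
    }


# Hypothetical raw counts, not taken from a real run.
counters = {
    "FIXC0": 1_200_000,   # INSTR_RETIRED_ANY
    "FIXC1": 1_000_000,   # CPU_CLK_UNHALTED_CORE
    "PMC0": 1_500_000,    # UOPS_ISSUED_ANY
    "PMC1": 1_400_000,    # UOPS_RETIRED_RETIRE_SLOTS
    "PMC2": 400_000,      # IDQ_UOPS_NOT_DELIVERED_CORE
    "PMC3": 25_000,       # INT_MISC_RECOVERY_CYCLES
}

results = evaluate_group(FORMULAS, counters)
for name, value in results.items():
    print(f"{name:16s} {value:6.1f}")
print(f"{'Sum':16s} {sum(results.values()):6.1f}")
```

The UOPS_RETIRED_RETIRE_SLOTS term cancels when Speculation and Retiring are added, which is why PMC1 does not appear in the Back End expression.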

The results files of this run across the Phoronix CPU benchmarks are linked below:
padman etqw-demo graphics-magick john-the-ripper ttsiod-renderer compress-pbzip2 compress-7zip encode-mp3 encode-flac x264 ffmpeg openssl himeno pgbench apache c-ray povray smallpt tachyon crafty tscp mafft stream

These are also summarized in the following table. It is also useful to note the sorts of events to investigate further when a particular category is high:

Frontend Bound: fetch latency (iTLB, iCache, Branch Resteers) and fetch bandwidth
Bad Speculation: Branch mispredicts and machine clears
Retiring: Floating point arithmetic, micro-sequencer
Backend Bound: memory bound (L1, L2, L3, external memory), core bound (divider, execution port utilization)

Benchmark	IPC	Front End [%]	Speculation [%]	Retiring [%]	Back End [%]	Threading
apache	0.45	34.5	3.7	12.3	49.5	multiple
c-ray	1.44	5.2	0.4	37.0	57.4	multiple
compress-7zip	0.84	11.8	13.8	19.7	54.7	multiple
compress-pbzip2	0.95	5.0	16.3	23.3	55.4	multiple
crafty	1.45	26.5	16.1	35.6	21.9	single
encode-flac	2.47	4.7	4.6	64.6	26.1	single
encode-mp3	1.90	5.0	11.4	49.3	34.2	single
etqw-demo	1.09	20.1	14.6	27.4	37.9	multiple
ffmpeg	1.19	20.7	16.1	29.9	33.3	multiple
graphics-magick	1.54	6.4	7.7	42.4	43.5	multiple
himeno	0.86	2.5	0.4	28.7	68.4	single
john-the-ripper	1.09	24.6	14.6	27.5	33.2	multiple
mafft	1.31	8.6	10.0	27.6	53.7	multiple
openssl	1.66	3.3	0.3	46.2	50.3	multiple
padman	1.32	22.9	6.5	32.8	37.7	multiple
pgbench	1.20	21.1	16.2	29.9	32.8	multiple
povray	1.07	20.4	17.1	27.3	35.1	multiple
smallpt	1.24	23.0	15.3	31.0	30.7	multiple
stream	0.05	0.8	0.1	1.4	97.7	multiple
tachyon	1.03	9.6	1.9	31.8	56.8	multiple
tscp	1.75	32.1	23.3	37.4	7.3	single
ttsiod-renderer	1.24	17.1	16.3	30.8	35.8	multiple
x264	1.31	12.1	3.6	34.7	49.6	multiple
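
One simple way to read the table is to pick out the largest category per benchmark and use it to decide where to drill down first. The Python sketch below does this for four rows copied from the table; the function name is an assumption, and "largest category" is only a rough first filter rather than part of the top-down method itself.

```python
# (benchmark, front end, speculation, retiring, back end), copied from the table.
ROWS = [
    ("apache", 34.5, 3.7, 12.3, 49.5),
    ("encode-flac", 4.7, 4.6, 64.6, 26.1),
    ("stream", 0.8, 0.1, 1.4, 97.7),
    ("tscp", 32.1, 23.3, 37.4, 7.3),
]
CATEGORIES = ("Front End", "Speculation", "Retiring", "Back End")


def dominant_category(values):
    """Return (category name, percentage) for the largest of the four values."""
    return max(zip(CATEGORIES, values), key=lambda pair: pair[1])


summary = {}
for name, *values in ROWS:
    # The four percentages should add up to about 100, allowing for rounding.
    assert abs(sum(values) - 100.0) < 0.2, name
    summary[name] = dominant_category(values)

for name, (category, pct) in summary.items():
    print(f"{name:12s} {category:12s} {pct:5.1f}%")
```

A high Retiring share, as for encode-flac, means most slots do useful work, whereas stream is almost entirely Back End, which matches its role as a memory bandwidth test.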

After implementing this for likwid-perfctr, I noticed that my Intel platforms actually had some generic events named topdown-fetch-bubbles, topdown-slots-issued, topdown-slots-retired and topdown-total-slots. Looking further, I saw this method had been implemented for perf(1):

       --topdown
           Print top down level 1 metrics if supported by the CPU. This allows
           to determine bottle necks in the CPU pipeline for CPU bound
           workloads, by breaking the cycles consumed down into frontend
           bound, backend bound, bad speculation and retiring.

root@pasto:~# perf stat -a --topdown gcc -o hello hello.c

On an in-order Atom machine (no speculation), I did a quick test:

root@pasto:~# perf stat -a --topdown gcc -o hello hello.c

 Performance counter stats for 'system wide':

                  retiring             frontend bound       backend bound/bad spec 
S0-C0           1     16.3%               50.3%                               
S0-C1           1     10.6%               58.6%                               
S0-C2           1     23.7%               29.9%                               
S0-C3           1     20.8%               21.4%                               
S0-C4           1     19.3%               45.2%                               
S0-C5           1     37.5%               32.5%                               
S0-C6           1     15.3%               54.6%                               
S0-C7           1     21.8%               44.1%                               

       0.116470972 seconds time elapsed
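
To compare runs it can help to reduce the per-core lines to a single number. The Python sketch below parses the per-core rows shown above and averages the two populated columns. The regular expression assumes exactly this layout (core label, a count, then two percentages); other CPUs or perf versions print different columns, so treat it as a starting point only.

```python
import re

# Per-core rows copied from the perf output above.
SAMPLE = """\
S0-C0           1     16.3%               50.3%
S0-C1           1     10.6%               58.6%
S0-C2           1     23.7%               29.9%
S0-C3           1     20.8%               21.4%
S0-C4           1     19.3%               45.2%
S0-C5           1     37.5%               32.5%
S0-C6           1     15.3%               54.6%
S0-C7           1     21.8%               44.1%
"""

# Core label, a count column, then the retiring and frontend bound percentages.
LINE = re.compile(r"^\s*(S\d+-C\d+)\s+\d+\s+([\d.]+)%\s+([\d.]+)%")


def parse_topdown(text):
    """Map each core label to a (retiring, frontend_bound) tuple in percent."""
    rows = {}
    for line in text.splitlines():
        match = LINE.match(line)
        if match:
            rows[match.group(1)] = (float(match.group(2)), float(match.group(3)))
    return rows


rows = parse_topdown(SAMPLE)
avg_retiring = sum(r for r, _ in rows.values()) / len(rows)
avg_frontend = sum(f for _, f in rows.values()) / len(rows)
remainder = 100.0 - avg_retiring - avg_frontend
print(f"cores={len(rows)} retiring={avg_retiring:.1f}% "
      f"frontend={avg_frontend:.1f}% remainder={remainder:.1f}%")
```

On this machine the third column is empty, so the remainder is presumably the combined backend bound and bad speculation share that the header names.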

Excellent, nice to see it here. There is also an LWN article from when this was added to perf, along with some notes about how hyper-threading is treated and a reference to pmu-tools, which provides additional topdown metrics for drilling down further.

As a whole, the technique seems intriguing. However, I also need to calibrate this further against some of the underlying metrics. A further next step is to implement this with wspy. That will let me watch how these metrics vary over time as an application runs. It might also let me see how these values build hierarchically in a tree of processes.

A different avenue is to see if there are some equivalent metrics that might provide a similar high-level overview on AMD or ARM processors.

Comments

mev on 2018-04-16 at 8:09 am said:

After adding this, I also notice a recent tutorial on use of topdown metrics was posted: http://www.cs.technion.ac.il/~erangi/TMA_using_Linux_perf__Ahmad_Yasin.pdf
