Top-down performance counter analysis (part 1) – likwid & perf
In this post I summarize top-down performance counter analysis for evaluating workloads and show how it can be measured on Haswell using likwid-perfctr and perf. In part 2 to follow, I’ll describe how top-down metrics have been added to wspy.
Top Down Analysis
The top-down analysis approach is based on a paper and slides by Ahmad Yasin. The technique is also described on Intel’s web site.
It also turns out this is implemented in the perf(1) tool for Intel platforms via the `--topdown` option, along with some generic events that have been added for it.
Before reading about this technique, my initial plan was to first measure, calculate or research the costs of various events (e.g. cache misses, TLB misses, branch misses, memory accesses) and then use performance counters to measure how frequently these events occur. Combining frequency with cost and comparing across workloads, I might better understand which factors limit particular workloads. While I might still take some of this approach later, the top-down method makes a lot of sense as a starting point.
There are several reasons why first costing events and then weighting them by frequency can be difficult on a multi-core, super-scalar, parallel and speculative micro-architecture:
- Finding costs is not always straightforward, since almost by definition the micro-architecture is trying to minimize their effects. Hence, it can be tough to construct experiments that actually measure what you expect.
- Many operations occur in parallel, so forward progress can be stalled for more than one reason at a time, and other progress may still be made during a stall. For example, with hyper-threading the opposite thread might take advantage of a stall.
- While there is no shortage of events and counters to examine, it can be difficult to know which of them are most relevant.
The main idea behind top-down performance counter analysis is to first characterize the workload based on a few key metrics. These metrics are derived from a small number of counters in the Intel architecture that show how well the processor pipelines are being used. With this overall pipeline view, bottlenecks can first be sorted into four categories (front-end bound, back-end bound, retiring, bad speculation); the dominant category then guides the subsequent, more specific analysis.
Measurement Techniques
Slide #13 of Ahmad Yasin’s slides defines the following event names based on five counters:
```
TotalSlots      = 4 * CPU_CLK_UNHALTED.THREAD
SlotsIssued     = UOPS_ISSUED.ANY
SlotsRetired    = UOPS_RETIRED.RETIRE_SLOTS
FetchBubbles    = IDQ_UOPS_NOT_DELIVERED.CORE
RecoveryBubbles = 4 * INT_MISC.RECOVERY_CYCLES
```
These events are then used to compute the following metrics:
```
Frontend Bound  = FetchBubbles / TotalSlots
Bad Speculation = (SlotsIssued - SlotsRetired + RecoveryBubbles) / TotalSlots
Retiring        = SlotsRetired / TotalSlots
Backend Bound   = 1 - (Frontend Bound + Bad Speculation + Retiring)
```
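To make the arithmetic concrete, here is a minimal Python sketch of these level-1 formulas. The counter values in the example are made up purely for illustration:

```python
def topdown_level1(clk_unhalted, uops_issued, uops_retired,
                   fetch_bubbles, recovery_cycles, width=4):
    """Compute the four level-1 top-down fractions from raw counts.

    width is the pipeline issue width in slots/cycle (4 on Haswell).
    """
    total_slots = width * clk_unhalted          # TotalSlots
    recovery_bubbles = width * recovery_cycles  # RecoveryBubbles

    frontend = fetch_bubbles / total_slots
    bad_spec = (uops_issued - uops_retired + recovery_bubbles) / total_slots
    retiring = uops_retired / total_slots
    backend = 1.0 - (frontend + bad_spec + retiring)
    return frontend, bad_spec, retiring, backend

# Made-up counter readings, for illustration only:
fe, bs, rt, be = topdown_level1(clk_unhalted=1_000_000,
                                uops_issued=2_500_000,
                                uops_retired=2_300_000,
                                fetch_bubbles=600_000,
                                recovery_cycles=25_000)
print(f"frontend {fe:.1%}  bad spec {bs:.1%}  retiring {rt:.1%}  backend {be:.1%}")
```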
There are four measurement configurations of interest: likwid, perf, wspy periodic timer and wspy process tree. The first two are covered below; the two wspy configurations will follow in part 2.
In a previous blog post about likwid-perfctr, I showed results from likwid-perfctr runs of benchmarks from the Phoronix CPU suite. None of the stock performance groups provides exactly this metric, though a few come close; for example, the CYCLE_ACTIVITY group reports the percentage of cycles spent waiting on stalls due to data traffic. So I created a new performance group and placed the file in
```
/usr/share/likwid/perfgroups/haswell/TOPDOWN.txt
```
The following are the contents of the new file I created:
```
SHORT Top down cycle allocation

EVENTSET
FIXC0 INSTR_RETIRED_ANY
FIXC1 CPU_CLK_UNHALTED_CORE
PMC0  UOPS_ISSUED_ANY
PMC1  UOPS_RETIRED_RETIRE_SLOTS
PMC2  IDQ_UOPS_NOT_DELIVERED_CORE
PMC3  INT_MISC_RECOVERY_CYCLES

METRICS
IPC FIXC0/FIXC1
Total Slots 4*FIXC1
Slots Retired PMC1
Fetch Bubbles PMC2
Recovery Bubbles 4*PMC3
Front End [%] PMC2/(4*FIXC1)*100
Speculation [%] (PMC0-PMC1+(4*PMC3))/(4*FIXC1)*100
Retiring [%] PMC1/(4*FIXC1)*100
Back End [%] (1-((PMC2+PMC0+(4*PMC3))/(4*FIXC1)))*100

LONG
Front End [%] = IDQ_UOPS_NOT_DELIVERED_CORE/(4*CPU_CLK_UNHALTED_CORE)*100
Speculation [%] = (UOPS_ISSUED_ANY-UOPS_RETIRED_RETIRE_SLOTS+(4*INT_MISC_RECOVERY_CYCLES))/(4*CPU_CLK_UNHALTED_CORE)*100
Retiring [%] = UOPS_RETIRED_RETIRE_SLOTS/(4*CPU_CLK_UNHALTED_CORE)*100
Back End [%] = (1-((IDQ_UOPS_NOT_DELIVERED_CORE+UOPS_ISSUED_ANY+(4*INT_MISC_RECOVERY_CYCLES))/(4*CPU_CLK_UNHALTED_CORE)))*100
--
This performance group measures cycles to determine the percentage of time spent in front end, back end, retiring and speculation.
```
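With the file in place, the group can be selected like any other with `likwid-perfctr -g TOPDOWN`. As a rough sketch of how runs could be automated (hypothetical; the exact metric-table layout varies between likwid versions, and this only grabs the first value column):

```python
import re
import subprocess

def run_topdown(cmd, cores="0"):
    """Run cmd pinned to the given cores under the TOPDOWN group defined
    above and scrape the four level-1 percentages from likwid's
    '| metric | value |' table."""
    out = subprocess.run(["likwid-perfctr", "-C", cores, "-g", "TOPDOWN"] + cmd,
                         capture_output=True, text=True).stdout
    metrics = {}
    for name in ("Front End", "Speculation", "Retiring", "Back End"):
        m = re.search(r"\|\s*%s \[%%\]\s*\|\s*([\d.]+)" % name, out)
        if m:
            metrics[name] = float(m.group(1))
    return metrics

print(run_topdown(["gcc", "-o", "hello", "hello.c"]))
```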
The result files from this run across the Phoronix CPU benchmarks are linked below:
padman etqw-demo graphics-magick john-the-ripper ttsiod-renderer compress-pbzip2 compress-7zip encode-mp3 encode-flac x264 ffmpeg openssl himeno pgbench apache c-ray povray smallpt tachyon crafty tscp mafft stream
These are also summarized in the table below. It is also useful to note the sorts of events to investigate further when a particular area is high (a small dispatch sketch follows this list):
- Frontend Bound: fetch latency (iTLB, iCache, Branch Resteers) and fetch bandwidth
- Bad Speculation: Branch mispredicts and machine clears
- Retiring: Floating point arithmetic, micro-sequencer
- Backend Bound: memory bound (L1, L2, L3, external memory), core bound (divider, execution port utilization)
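As a small illustration of how the level-1 result steers the next step, here is a hypothetical helper that keys the drill-down areas above off the dominant category; the structure is my own, not part of the method itself:

```python
# Drill-down hints per level-1 category, taken from the list above.
DRILL_DOWN = {
    "Front End":   "fetch latency (iTLB, iCache, branch resteers), fetch bandwidth",
    "Speculation": "branch mispredicts, machine clears",
    "Retiring":    "floating point arithmetic, micro-sequencer",
    "Back End":    "memory bound (L1/L2/L3, external memory), "
                   "core bound (divider, execution port utilization)",
}

def suggest_next(metrics):
    """Given {'Front End': 34.5, ...} percentages, name the dominant
    category and the areas worth examining next."""
    top = max(metrics, key=metrics.get)
    return f"{top} dominates ({metrics[top]:.1f}%): look at {DRILL_DOWN[top]}"

# For example, the apache row from the table below:
print(suggest_next({"Front End": 34.5, "Speculation": 3.7,
                    "Retiring": 12.3, "Back End": 49.5}))
```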
Benchmark | IPC | Front End [%] | Speculation [%] | Retiring [%] | Back End [%] | Threading |
---|---|---|---|---|---|---|
apache | 0.45 | 34.5 | 3.7 | 12.3 | 49.5 | multiple |
c-ray | 1.44 | 5.2 | 0.4 | 37.0 | 57.4 | multiple |
compress-7zip | 0.84 | 11.8 | 13.8 | 19.7 | 54.7 | multiple |
compress-pbzip2 | 0.95 | 5.0 | 16.3 | 23.3 | 55.4 | multiple |
crafty | 1.45 | 26.5 | 16.1 | 35.6 | 21.9 | single |
encode-flac | 2.47 | 4.7 | 4.6 | 64.6 | 26.1 | single |
encode-mp3 | 1.90 | 5.0 | 11.4 | 49.3 | 34.2 | single |
etqw-demo | 1.09 | 20.1 | 14.6 | 27.4 | 37.9 | multiple |
ffmpeg | 1.19 | 20.7 | 16.1 | 29.9 | 33.3 | multiple |
graphics-magick | 1.54 | 6.4 | 7.7 | 42.4 | 43.5 | multiple |
himeno | 0.86 | 2.5 | 0.4 | 28.7 | 68.4 | single |
john-the-ripper | 1.09 | 24.6 | 14.6 | 27.5 | 33.2 | multiple |
mafft | 1.31 | 8.6 | 10.0 | 27.6 | 53.7 | multiple |
openssl | 1.66 | 3.3 | 0.3 | 46.2 | 50.3 | multiple |
padman | 1.32 | 22.9 | 6.5 | 32.8 | 37.7 | multiple |
pgbench | 1.20 | 21.1 | 16.2 | 29.9 | 32.8 | multiple |
povray | 1.07 | 20.4 | 17.1 | 27.3 | 35.1 | multiple |
smallpt | 1.24 | 23.0 | 15.3 | 31.0 | 30.7 | multiple |
stream | 0.05 | 0.8 | 0.1 | 1.4 | 97.7 | multiple |
tachyon | 1.03 | 9.6 | 1.9 | 31.8 | 56.8 | multiple |
tscp | 1.75 | 32.1 | 23.3 | 37.4 | 7.3 | single |
ttsiod-renderer | 1.24 | 17.1 | 16.3 | 30.8 | 35.8 | multiple |
x264 | 1.31 | 12.1 | 3.6 | 34.7 | 49.6 | multiple |
After implementing this for likwid-perfctr, I noticed that my Intel platforms actually expose some generic architectural events named topdown-fetch-bubbles, topdown-slots-issued, topdown-slots-retired and topdown-total-slots. Looking further, I saw this method had also been implemented in perf(1):
```
--topdown
      Print top down level 1 metrics if supported by the CPU. This
      allows to determine bottle necks in the CPU pipeline for CPU
      bound workloads, by breaking the cycles consumed down into
      frontend bound, backend bound, bad speculation and retiring.
```
On an in-order Atom machine (no speculation), I did a quick test:
```
root@pasto:~# perf stat -a --topdown gcc -o hello hello.c

 Performance counter stats for 'system wide':

          retiring   frontend bound   backend bound/bad spec
S0-C0  1    16.3%    50.3%
S0-C1  1    10.6%    58.6%
S0-C2  1    23.7%    29.9%
S0-C3  1    20.8%    21.4%
S0-C4  1    19.3%    45.2%
S0-C5  1    37.5%    32.5%
S0-C6  1    15.3%    54.6%
S0-C7  1    21.8%    44.1%

       0.116470972 seconds time elapsed
```
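Since the generic topdown-* events are exposed, the level-1 metrics can also be computed by hand from a raw `perf stat` run. The sketch below assumes the platform additionally exposes a `topdown-recovery-bubbles` alias alongside the four events named above, and that perf's CSV mode (`-x,`) puts the count in the first field and the event name in the third:

```python
import subprocess

EVENTS = ["topdown-total-slots", "topdown-slots-issued",
          "topdown-slots-retired", "topdown-fetch-bubbles",
          "topdown-recovery-bubbles"]

def perf_topdown(cmd):
    """Run cmd system-wide under perf stat and compute the level-1
    metrics from the raw slot counts (perf stat prints to stderr)."""
    res = subprocess.run(["perf", "stat", "-a", "-x,",
                          "-e", ",".join(EVENTS)] + cmd,
                         capture_output=True, text=True)
    counts = {}
    for line in res.stderr.splitlines():
        fields = line.split(",")
        if len(fields) > 2 and fields[2] in EVENTS:
            try:
                counts[fields[2]] = float(fields[0])
            except ValueError:  # e.g. "<not counted>"
                pass

    slots = counts["topdown-total-slots"]
    frontend = counts["topdown-fetch-bubbles"] / slots
    bad_spec = (counts["topdown-slots-issued"]
                - counts["topdown-slots-retired"]
                + counts["topdown-recovery-bubbles"]) / slots
    retiring = counts["topdown-slots-retired"] / slots
    return frontend, bad_spec, retiring, 1.0 - (frontend + bad_spec + retiring)

print(perf_topdown(["gcc", "-o", "hello", "hello.c"]))
```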
Excellent; it is nice to see it here. Also, here is the LWN article on when this was added to perf, along with some notes about how hyper-threading is treated and a reference to pmu-tools, which provides additional top-down metrics for drilling down further.
As a whole, the technique seems intriguing. However, I still need to calibrate it against some of the underlying metrics. A further next step is to implement this in wspy. That will let me watch how these metrics vary over time as an application runs, and also how the values build up hierarchically in a tree of processes.
A different avenue is to see whether there are equivalent metrics that could provide a similar high-level overview on AMD or ARM processors.
After adding this, I also noticed that a recent tutorial on the use of top-down metrics was posted: http://www.cs.technion.ac.il/~erangi/TMA_using_Linux_perf__Ahmad_Yasin.pdf