likwid-perfctr
The likwid performance monitoring and benchmarking suite has multiple tools including:
- likwid-topology – print thread, cache and NUMA topology
- likwid-pin – pin threaded applications to processors
- likwid-bench – micro-benchmarking application
- likwid-agent – monitoring agent for hardware performance counters
As of this writing, these tools seem to be available for x86 systems from both Intel and AMD, but not for the ARM architecture.
UPDATE: I’m told ARM support was added in January 2018.
This page documents some of my experiments playing with the likwid-perfctr application to examine performance counters.
Getting likwid was pretty easy, as there is a package for Ubuntu 17.10. However, out of the box the program didn't recognize any performance counters. The "-e" option provides a list of supported events:
mev@popayan:~$ likwid-perfctr -e
This architecture has 0 counters.
Counter tags(name, type<, options>):

This architecture has 0 events.
Event tags (tag, id, umask, counters<, options>):

mev@popayan:~$
I was able to fix this by including the msr module in the kernel:
mev@popayan:~$ sudo modprobe msr
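To keep the fix across reboots, the module name can be listed under modules-load.d (a sketch assuming a systemd-based distribution, which Ubuntu 17.10 is; the file name itself is arbitrary):

```shell
# Have systemd load the msr module automatically at every boot
# (any file name ending in .conf under modules-load.d works).
echo msr | sudo tee /etc/modules-load.d/msr.conf

# Verify the module is present after the next boot:
lsmod | grep -w msr
```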
On my Intel i7-4770s box, this now shows 27 counters and 496 events that can be monitored:
mev@popayan:~$ sudo likwid-perfctr -e
This architecture has 27 counters.
Counter tags(name, type<, options>):
FIXC0, Fixed counters, KERNEL|ANYTHREAD
FIXC1, Fixed counters, KERNEL|ANYTHREAD
FIXC2, Fixed counters, KERNEL|ANYTHREAD
PMC0, Core-local general purpose counters, EDGEDETECT|THRESHOLD|INVERT|KERNEL|ANYTHREAD|IN_TRANSACTION
PMC1, Core-local general purpose counters, EDGEDETECT|THRESHOLD|INVERT|KERNEL|ANYTHREAD|IN_TRANSACTION
PMC2, Core-local general purpose counters, EDGEDETECT|THRESHOLD|INVERT|KERNEL|ANYTHREAD|IN_TRANSACTION|IN_TRANSACTION_ABORTED
PMC3, Core-local general purpose counters, EDGEDETECT|THRESHOLD|INVERT|KERNEL|ANYTHREAD|IN_TRANSACTION
TMP0, Thermal
PWR0, Energy/Power counters (RAPL)
PWR1, Energy/Power counters (RAPL)
PWR2, Energy/Power counters (RAPL)
PWR3, Energy/Power counters (RAPL)
CBOX0C0, Caching Agent box 0, EDGEDETECT|THRESHOLD|INVERT
CBOX0C1, Caching Agent box 0, EDGEDETECT|THRESHOLD|INVERT
CBOX1C0, Caching Agent box 1, EDGEDETECT|THRESHOLD|INVERT
CBOX1C1, Caching Agent box 1, EDGEDETECT|THRESHOLD|INVERT
CBOX2C0, Caching Agent box 2, EDGEDETECT|THRESHOLD|INVERT
CBOX2C1, Caching Agent box 2, EDGEDETECT|THRESHOLD|INVERT
CBOX3C0, Caching Agent box 3, EDGEDETECT|THRESHOLD|INVERT
CBOX3C1, Caching Agent box 3, EDGEDETECT|THRESHOLD|INVERT
UBOX0, System Configuration box, EDGEDETECT|THRESHOLD|INVERT
UBOX1, System Configuration box, EDGEDETECT|THRESHOLD|INVERT
UBOXFIX, System Configuration box fixed counter

This architecture has 496 events.
Event tags (tag, id, umask, counters<, options>):
TEMP_CORE, 0x0, 0x0, TMP0
PWR_PKG_ENERGY, 0x0, 0x0, PWR0
On an AMD A10-7850 it shows 4 counters and 605 events (though somehow the counter count is reported as 0):
mev@cuenca:~$ sudo likwid-perfctr -e
This architecture has 0 counters.
Counter tags(name, type<, options>):
UPMC0, Socket-local general/fixed purpose counters
UPMC2, Socket-local general/fixed purpose counters
UPMC1, Socket-local general/fixed purpose counters
UPMC3, Socket-local general/fixed purpose counters

This architecture has 605 events.
Event tags (tag, id, umask, counters<, options>):
UNC_DRAM_ACCESSES_DCT0_PAGE_HIT, 0xE0, 0x1, UPMC
Notice these are all run as root. I wasn't able to get them to work as a non-root user, despite trying two things suggested by web searches:
- Changing permissions on the /dev/cpu/*/msr device files to be 0666
- Running "sudo setcap cap_sys_rawio+ep /usr/bin/likwid-perfctr"
UPDATE: The likwid page documents the setcap method, which should work. However, the AMD A10 client part is not supported, while AMD "Interlagos" processors are. Both Interlagos and the A10 are based on the AMD Family 15h design, though with different models/cores.
So for now, run the experiments as root. Things that don't require access to the performance counters, such as listing the predefined performance groups, can be run as an ordinary user. Here is the list of performance groups on the Intel box (one can also measure individual counters). The groups differ for each CPU; each runs a predefined performance study of a particular aspect of a CPU plus workload.
Group name      Description
--------------------------------------------------------------------------------
UOPS_RETIRE     UOPs retirement
FLOPS_AVX       Packed AVX MFLOP/s
TLB_DATA        L2 data TLB miss rate/ratio
CACHES          Cache bandwidth in MBytes/s
CYCLE_ACTIVITY  Cycle Activities
CLOCK           Power and Energy consumption
L3              L3 cache bandwidth in MBytes/s
BRANCH          Branch prediction miss rate/ratio
UOPS            UOPs execution info
TLB_INSTR       L1 Instruction TLB miss rate/ratio
RECOVERY        Recovery duration
L2CACHE         L2 cache miss rate/ratio
UOPS_ISSUE      UOPs issueing
L2              L2 cache bandwidth in MBytes/s
ENERGY          Power and Energy consumption
FALSE_SHARE     False sharing
DATA            Load to store ratio
L3CACHE         L3 cache miss rate/ratio
ICACHE          Instruction cache miss rate/ratio
UOPS_EXEC       UOPs execution
likwid-perfctr can be run in “stethoscope” mode with the -S option or as a wrapper without this option. When running as a wrapper, one provides a list of CPUs to monitor with either the -c (don’t pin) or -C (pin) options.
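A stethoscope-mode invocation might look like the following sketch: measure the CLOCK group on cores 0-3 for ten seconds with no wrapped workload (flags per the likwid documentation; the guard is only so the snippet degrades gracefully on a machine without likwid installed):

```shell
# Stethoscope mode: sample the CLOCK group on cores 0-3 for 10 seconds.
if command -v likwid-perfctr >/dev/null 2>&1; then
    sudo likwid-perfctr -c 0-3 -g CLOCK -S 10s
else
    echo "likwid-perfctr not installed"
fi
```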
Here is an example output that comes from:
likwid-perfctr -f -c 0-7 -g L2CACHE phoronix-test-suite batch-run c-ray
to look at L2 data cache rates for the Phoronix Test Suite c-ray benchmark. For some reason the --output filename option ran into difficulties, so the combined Phoronix Test Suite and likwid output is shown below. A few notes on what I found:
- Multiple -g options can be given. The groups appear to be measured in round-robin fashion, even when enough counters may have been available to run them together.
- This run happens under a Xen hypervisor, so the counters are accessible even in that environment.
- The --output option has special rules for file naming, e.g. it requires a .txt suffix (or certain other suffixes) to create text files.
- The -O option provides the data in CSV format for potential later batch processing.
- Setting the -t timeline option for periodic measurements seemed to generate a lot of divide-by-zero and NaN errors.
- The data below is itself fishy. For example, from a benchmark perspective there is no reason some cores should run far more instructions than others.
Overall, a useful tool and basic wrapper. For my current use, at least: (a) run as root, and (b) try multiple studies and counters (perhaps defining my own groups).
root@popayan:~# likwid-perfctr -f -c 0-7 -g L2CACHE phoronix-test-suite batch-run c-ray
--------------------------------------------------------------------------------
CPU name:   Intel(R) Core(TM) i7-4770S CPU @ 3.10GHz
CPU type:   Intel Core Haswell processor
CPU clock:  3.09 GHz
--------------------------------------------------------------------------------

Phoronix Test Suite v5.2.1
System Information

Hardware:
Processor: Intel Core i7-4770S @ 3.09GHz (8 Cores), Motherboard: ASUS M11AD, Chipset: Intel 4th Gen Core DRAM, Memory: 2 x 8192 MB DDR3-1600MT/s Kingston, Disk: 2000GB TOSHIBA DT01ACA2, Graphics: Intel Gen7, Audio: Intel Xeon E3-1200 v3/4th, Network: Realtek RTL8111/8168/8411

Software:
OS: Ubuntu 17.10, Kernel: 4.13.0-36-generic (x86_64), Desktop: GNOME Shell 3.26.2, Display Server: X Server 1.19.5, OpenGL: 4.5 Mesa 17.2.8, File-System: ext4, Screen Resolution: 1920x1080, System Layer: Xen 4.9.0 Hypervisor

C-Ray 1.1:
    pts/c-ray-1.1.1
    Test 1 of 1
    Estimated Trial Run Count:    3
    Estimated Time To Completion: 2 Minutes
        Started Run 1 @ 15:08:58
        Started Run 2 @ 15:09:27
        Started Run 3 @ 15:09:55  [Std. Dev: 0.68%]

    Test Results:
        26.432
        26.713
        26.771

    Average: 26.64 Seconds

[NOTICE] Parameter 1 to graphics_event_checker::__post_test_run() expected to be a reference, value given in pts_module_manager:74
--------------------------------------------------------------------------------
Group 1: L2CACHE
+-----------------------+---------+----------+----------+----------+-----------+---------+----------+----------+----------+
|         Event         | Counter |  Core 0  |  Core 1  |  Core 2  |   Core 3  |  Core 4 |  Core 5  |  Core 6  |  Core 7  |
+-----------------------+---------+----------+----------+----------+-----------+---------+----------+----------+----------+
|   INSTR_RETIRED_ANY   |  FIXC0  |  4582154 |  3732198 |  2390500 | 212888339 | 1895719 |  1159032 |  6164391 |  2134122 |
| CPU_CLK_UNHALTED_CORE |  FIXC1  |  7283403 |  5277463 |  6594966 |  84537477 | 2124226 |  3762613 | 13278656 |  4842015 |
|  CPU_CLK_UNHALTED_REF |  FIXC2  | 14821100 | 18926213 | 23058947 |  83465113 | 7209050 | 10615454 | 18522934 | 17412266 |
| L2_TRANS_ALL_REQUESTS |   PMC0  |   663946 |   446758 |   551331 |    493447 |  210080 |   319799 |  1263952 |   252110 |
|     L2_RQSTS_MISS     |   PMC1  |   196746 |   156247 |   209585 |    173079 |   68890 |   125965 |   449283 |    85779 |
+-----------------------+---------+----------+----------+----------+-----------+---------+----------+----------+----------+
+----------------------------+---------+-----------+---------+-----------+--------------+
|            Event           | Counter |    Sum    |   Min   |    Max    |      Avg     |
+----------------------------+---------+-----------+---------+-----------+--------------+
|   INSTR_RETIRED_ANY STAT   |  FIXC0  | 234946455 | 1159032 | 212888339 | 2.936831e+07 |
| CPU_CLK_UNHALTED_CORE STAT |  FIXC1  | 127700819 | 2124226 |  84537477 | 1.596260e+07 |
|  CPU_CLK_UNHALTED_REF STAT |  FIXC2  | 194031077 | 7209050 |  83465113 | 2.425388e+07 |
| L2_TRANS_ALL_REQUESTS STAT |   PMC0  |   4201423 |  210080 |   1263952 |  525177.8750 |
|     L2_RQSTS_MISS STAT     |   PMC1  |   1465574 |   68890 |    449283 |  183196.7500 |
+----------------------------+---------+-----------+---------+-----------+--------------+
+----------------------+-----------+----------+----------+-----------+----------+-----------+-----------+----------+
|        Metric        |   Core 0  |  Core 1  |  Core 2  |   Core 3  |  Core 4  |   Core 5  |   Core 6  |  Core 7  |
+----------------------+-----------+----------+----------+-----------+----------+-----------+-----------+----------+
|  Runtime (RDTSC) [s] |  84.8845  |  84.8845 |  84.8845 |  84.8845  |  84.8845 |  84.8845  |  84.8845  |  84.8845 |
| Runtime unhalted [s] |   0.0024  |  0.0017  |  0.0021  |   0.0273  |  0.0007  |   0.0012  |   0.0043  |  0.0016  |
|      Clock [MHz]     | 1519.9395 | 862.4499 | 884.5971 | 3132.6848 | 911.3708 | 1096.2848 | 2217.2606 | 860.0887 |
|          CPI         |   1.5895  |  1.4140  |  2.7588  |   0.3971  |  1.1205  |   3.2463  |   2.1541  |  2.2689  |
|    L2 request rate   |   0.1449  |  0.1197  |  0.2306  |   0.0023  |  0.1108  |   0.2759  |   0.2050  |  0.1181  |
|     L2 miss rate     |   0.0429  |  0.0419  |  0.0877  |   0.0008  |  0.0363  |   0.1087  |   0.0729  |  0.0402  |
|     L2 miss ratio    |   0.2963  |  0.3497  |  0.3801  |   0.3508  |  0.3279  |   0.3939  |   0.3555  |  0.3402  |
+----------------------+-----------+----------+----------+-----------+----------+-----------+-----------+----------+
+---------------------------+------------+----------+-----------+-----------+
|           Metric          |     Sum    |    Min   |    Max    |    Avg    |
+---------------------------+------------+----------+-----------+-----------+
|  Runtime (RDTSC) [s] STAT |  679.0760  |  84.8845 |  84.8845  |  84.8845  |
| Runtime unhalted [s] STAT |   0.0413   |  0.0007  |   0.0273  |   0.0052  |
|      Clock [MHz] STAT     | 11484.6762 | 860.0887 | 3132.6848 | 1435.5845 |
|          CPI STAT         |   14.9492  |  0.3971  |   3.2463  |   1.8686  |
|    L2 request rate STAT   |   1.2073   |  0.0023  |   0.2759  |   0.1509  |
|     L2 miss rate STAT     |   0.4314   |  0.0008  |   0.1087  |   0.0539  |
|     L2 miss ratio STAT    |   2.7944   |  0.2963  |   0.3939  |   0.3493  |
+---------------------------+------------+----------+-----------+-----------+
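As a quick sanity check, the per-core counts can be re-aggregated by hand. With the -O option the same rows come out as comma-separated values, so an awk one-liner can reproduce the STAT "Sum" column (the CSV line below is hand-copied from the L2_RQSTS_MISS row above; the exact -O field layout is an assumption and should be confirmed against a real run):

```shell
# L2_RQSTS_MISS per-core counts copied from the table, in the comma-separated
# shape that -O emits (field layout assumed; verify against a real -O run).
row='L2_RQSTS_MISS,PMC1,196746,156247,209585,173079,68890,125965,449283,85779'

# Sum fields 3..NF to reproduce the STAT "Sum" value for this event.
echo "$row" | awk -F, '{ s = 0; for (i = 3; i <= NF; i++) s += $i; print s }'
# Matches the L2_RQSTS_MISS STAT sum of 1465574 reported in the table.
```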
I created a simple script to try all of the predefined groups, adding the --output option to save each run to its own file:
#!/bin/bash
# Skip the two header lines of "likwid-perfctr -a", then run the benchmark
# once for each group name in the first column.
likwid-perfctr -a | tail -n +3 | awk '{ print $1 }' | while read group
do
    likwid-perfctr --output cray_${group}.txt -f -c 0-7 -g ${group} phoronix-test-suite batch-run c-ray
done
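The parsing step of that script can be checked even on a machine without likwid, by feeding it a captured fragment of "likwid-perfctr -a" style output (the sample below is abbreviated and hand-written, not the tool's full group list):

```shell
# Abbreviated, hand-written sample in the shape of "likwid-perfctr -a" output.
sample=$(printf 'Group name\tDescription\n%s\nBRANCH\tBranch prediction miss rate/ratio\nL2CACHE\tL2 cache miss rate/ratio\n' \
    '--------------------------------------------------------------------------------')

# Same pipeline as the script: drop the two header lines, keep column one.
printf '%s\n' "$sample" | tail -n +3 | awk '{ print $1 }'
# Prints the two group names, BRANCH and L2CACHE, one per line.
```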
Note: The files below show a rough format for the metrics gathered. However, as above, the actual data seems fishy, since one shouldn't expect such big differences between cores for what is essentially a symmetric benchmark.
- UOPS_RETIRE UOPs retirement
- FLOPS_AVX Packed AVX MFLOP/s
- TLB_DATA L2 data TLB miss rate/ratio
- CACHES Cache bandwidth in MBytes/s
- CYCLE_ACTIVITY Cycle Activities
- CLOCK Power and Energy consumption
- L3 L3 cache bandwidth in MBytes/s
- BRANCH Branch prediction miss rate/ratio
- UOPS UOPs execution info
- TLB_INSTR L1 Instruction TLB miss rate/ratio
- RECOVERY Recovery duration
- L2CACHE L2 cache miss rate/ratio
- UOPS_ISSUE UOPs issueing
- L2 L2 cache bandwidth in MBytes/s
- ENERGY Power and Energy consumption
- FALSE_SHARE False sharing
- DATA Load to store ratio
- L3CACHE L3 cache miss rate/ratio
- ICACHE Instruction cache miss rate/ratio
- UOPS_EXEC UOPs execution
These tests were performed on a virtualized system running the Xen hypervisor.
A followup experiment, http://perf.mvermeulen.com/2018/03/14/experiment-virtualization-and-performance-counters/, shows much better results on bare metal.