The likwid performance monitoring and benchmarking suite has multiple tools including:
- likwid-topology – print thread, cache and NUMA topology
- likwid-pin – pin threaded applications to processors
- likwid-bench – micro-benchmarking application
- likwid-agent – monitoring agent for hardware performance counters
As of this writing, these tools seem to be available for x86 systems (both Intel and AMD), but not for the ARM architecture.
UPDATE: I’m told ARM support was added in January 2018.
This page documents some of my experiments playing with the likwid-perfctr application to examine performance counters.
Getting likwid was pretty easy, as there is a package for Ubuntu 17.10. However, out of the box the program didn’t recognize any performance counters. The “-e” option, which lists the supported counters and events, showed none:
mev@popayan:~$ likwid-perfctr -e
This architecture has 0 counters.
Counter tags(name, type<, options>):
This architecture has 0 events.
Event tags (tag, id, umask, counters<, options>):
mev@popayan:~$
I was able to fix this by loading the msr kernel module:
mev@popayan:~$ sudo modprobe msr
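Before loading the module it can help to check whether it is already there. The following is a minimal sketch, not something likwid provides: the function name msr_module_listed is hypothetical, and it assumes msr is built as a loadable module (a kernel with msr built in would not list it in /proc/modules).

```shell
#!/bin/bash
# Hypothetical helper: succeed if the msr module appears in a modules listing.
# The file argument defaults to /proc/modules; it is a parameter only so the
# check can be tried against a sample file.
msr_module_listed() {
    local modules_file="${1:-/proc/modules}"
    # Each /proc/modules line starts with the module name followed by a space.
    grep -q '^msr ' "$modules_file"
}

# Example use (not run here):
#   msr_module_listed || sudo modprobe msr
#   ls /dev/cpu/0/msr
```

The /dev/cpu/*/msr device files should appear once the module is loaded.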
On my Intel i7-4770S box, likwid-perfctr now shows 27 counters and 496 events that can be monitored (event list truncated here):
mev@popayan:~$ sudo likwid-perfctr -e
This architecture has 27 counters.
Counter tags(name, type<, options>):
FIXC0, Fixed counters, KERNEL|ANYTHREAD
FIXC1, Fixed counters, KERNEL|ANYTHREAD
FIXC2, Fixed counters, KERNEL|ANYTHREAD
PMC0, Core-local general purpose counters, EDGEDETECT|THRESHOLD|INVERT|KERNEL|ANYTHREAD|IN_TRANSACTION
PMC1, Core-local general purpose counters, EDGEDETECT|THRESHOLD|INVERT|KERNEL|ANYTHREAD|IN_TRANSACTION
PMC2, Core-local general purpose counters, EDGEDETECT|THRESHOLD|INVERT|KERNEL|ANYTHREAD|IN_TRANSACTION|IN_TRANSACTION_ABORTED
PMC3, Core-local general purpose counters, EDGEDETECT|THRESHOLD|INVERT|KERNEL|ANYTHREAD|IN_TRANSACTION
TMP0, Thermal
PWR0, Energy/Power counters (RAPL)
PWR1, Energy/Power counters (RAPL)
PWR2, Energy/Power counters (RAPL)
PWR3, Energy/Power counters (RAPL)
CBOX0C0, Caching Agent box 0, EDGEDETECT|THRESHOLD|INVERT
CBOX0C1, Caching Agent box 0, EDGEDETECT|THRESHOLD|INVERT
CBOX1C0, Caching Agent box 1, EDGEDETECT|THRESHOLD|INVERT
CBOX1C1, Caching Agent box 1, EDGEDETECT|THRESHOLD|INVERT
CBOX2C0, Caching Agent box 2, EDGEDETECT|THRESHOLD|INVERT
CBOX2C1, Caching Agent box 2, EDGEDETECT|THRESHOLD|INVERT
CBOX3C0, Caching Agent box 3, EDGEDETECT|THRESHOLD|INVERT
CBOX3C1, Caching Agent box 3, EDGEDETECT|THRESHOLD|INVERT
UBOX0, System Configuration box, EDGEDETECT|THRESHOLD|INVERT
UBOX1, System Configuration box, EDGEDETECT|THRESHOLD|INVERT
UBOXFIX, System Configuration box fixed counter
This architecture has 496 events.
Event tags (tag, id, umask, counters<, options>):
TEMP_CORE, 0x0, 0x0, TMP0
PWR_PKG_ENERGY, 0x0, 0x0, PWR0
On the AMD A10-7850 it lists 4 counters and 605 events (though the header somehow reports 0 counters):
mev@cuenca:~$ sudo likwid-perfctr -e
This architecture has 0 counters.
Counter tags(name, type<, options>):
UPMC0, Socket-local general/fixed purpose counters
UPMC2, Socket-local general/fixed purpose counters
UPMC1, Socket-local general/fixed purpose counters
UPMC3, Socket-local general/fixed purpose counters
This architecture has 605 events.
Event tags (tag, id, umask, counters<, options>):
UNC_DRAM_ACCESSES_DCT0_PAGE_HIT, 0xE0, 0x1, UPMC
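The mismatch between the header and the listed counters can be checked mechanically. This is a rough sketch that assumes the output layout shown in the listings above; check_counter_count is a hypothetical name, not part of likwid.

```shell
#!/bin/bash
# Hypothetical helper: compare the counter count claimed in the header of
# "likwid-perfctr -e" output with the number of counter lines actually listed.
# Reads the listing on stdin.
check_counter_count() {
    awk '
        /^This architecture has [0-9]+ counters\./ { claimed = $4 }
        /^Counter tags/                            { in_counters = 1; next }
        /^This architecture has [0-9]+ events\./   { in_counters = 0 }
        in_counters && NF > 0                      { listed++ }
        END { printf "claimed=%d listed=%d\n", claimed, listed }
    '
}

# Example use (not run here):
#   sudo likwid-perfctr -e | check_counter_count
```

Fed the AMD listing above, this prints "claimed=0 listed=4".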
Notice these are all run as root. I wasn’t able to get them to work as a non-root user, despite trying two things suggested by a web search:
- Changing permissions on the /dev/cpu/*/msr device files to be 0666
- Adding “sudo setcap cap_sys_rawio+ep /usr/bin/likwid-perfctr”
UPDATE: The likwid documentation describes the setcap method, which should work. However, the AMD A10 client processor is not supported, while AMD “Interlagos” processors are. Both Interlagos and the A10 are based on AMD Family 15h, though with different models/cores.
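To confirm that the permission change actually took effect, something like the following could be used. It is only a sketch: msr_files_not_world_rw is a hypothetical name, it assumes GNU stat (as on Ubuntu), and as far as I know recent kernels also require the rawio capability to open these devices, so mode 0666 alone may not be enough.

```shell
#!/bin/bash
# Hypothetical helper: list msr device files whose mode is not 666.
# The directory argument defaults to /dev/cpu; pass another directory to
# try the check on ordinary files.
msr_files_not_world_rw() {
    local dir="${1:-/dev/cpu}"
    local f
    for f in "$dir"/*/msr; do
        [ -e "$f" ] || continue    # skip the literal pattern when nothing matches
        if [ "$(stat -c '%a' "$f")" != "666" ]; then
            echo "$f"
        fi
    done
}

# Example use (not run here):
#   msr_files_not_world_rw
#   getcap /usr/bin/likwid-perfctr    # shows capabilities set via setcap
```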
So for now I run the experiments as root. Things that don’t require access to performance counters, such as listing the predefined performance groups (likwid-perfctr -a), can be run as a regular user. Here is the list of performance groups on the Intel box; the groups differ for each CPU, and one can also measure individual events instead. Each group is a predefined performance study of a particular aspect of a CPU + workload.
Group name      Description
--------------------------------------------------------------------------------
UOPS_RETIRE     UOPs retirement
FLOPS_AVX       Packed AVX MFLOP/s
TLB_DATA        L2 data TLB miss rate/ratio
CACHES          Cache bandwidth in MBytes/s
CYCLE_ACTIVITY  Cycle Activities
CLOCK           Power and Energy consumption
L3              L3 cache bandwidth in MBytes/s
BRANCH          Branch prediction miss rate/ratio
UOPS            UOPs execution info
TLB_INSTR       L1 Instruction TLB miss rate/ratio
RECOVERY        Recovery duration
L2CACHE         L2 cache miss rate/ratio
UOPS_ISSUE      UOPs issueing
L2              L2 cache bandwidth in MBytes/s
ENERGY          Power and Energy consumption
FALSE_SHARE     False sharing
DATA            Load to store ratio
L3CACHE         L3 cache miss rate/ratio
ICACHE          Instruction cache miss rate/ratio
UOPS_EXEC       UOPs execution
likwid-perfctr can be run in “stethoscope” mode with the -S option or as a wrapper without this option. When running as a wrapper, one provides a list of CPUs to monitor with either the -c (don’t pin) or -C (pin) options.
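To make the three variants concrete, here is a small sketch that only prints the command line for each mode rather than running it. The function name perfctr_cmd is hypothetical, and the duration argument to -S (for example 10s) is my assumption about how that option is used.

```shell
#!/bin/bash
# Hypothetical helper: print (not run) the likwid-perfctr command line for
# a given mode, CPU list and performance group.
perfctr_cmd() {
    local mode="$1" cpus="$2" group="$3"
    shift 3
    case "$mode" in
        # Stethoscope: watch the given CPUs for a duration (assumed argument).
        stethoscope) echo "likwid-perfctr -c $cpus -g $group -S $*" ;;
        # Wrapper without pinning: remaining arguments are the command to run.
        wrapper)     echo "likwid-perfctr -c $cpus -g $group $*" ;;
        # Wrapper with pinning to the given CPUs.
        wrapper-pin) echo "likwid-perfctr -C $cpus -g $group $*" ;;
        *) echo "unknown mode: $mode" >&2; return 1 ;;
    esac
}

perfctr_cmd stethoscope 0-3 L2CACHE 10s
perfctr_cmd wrapper 0-7 L2CACHE phoronix-test-suite batch-run c-ray
```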
Here is an example output that comes from:
likwid-perfctr -f -c 0-7 -g L2CACHE phoronix-test-suite batch-run c-ray
This run looks at L2 cache miss rates on the Phoronix Test Suite c-ray benchmark. The --output filename option gave me difficulties, so the listing below includes both the Phoronix Test Suite output and then the likwid output. A few notes:
- Multiple -g options can be given. It appears to run the groups in round-robin fashion, even when enough counters might have been available to measure them together
- This run happens under a Xen hypervisor, and I have access to the counters there
- The --output option has special rules for file naming, e.g. it requires a .txt suffix to create text files, and other suffixes for other formats.
- The -O CSV option provides data in a format suitable for later batch processing.
- Setting the -t timeline option for periodic measurements seemed to generate a lot of divide-by-zero and nan exceptions
- The data below is itself fishy. For example, from a benchmark perspective there is no reason for far more instructions to run on some cores than on others
Overall, a useful tool and basic wrapper, at least for my current use, as long as I (a) run as root and (b) try multiple studies and counters (perhaps defining my own groups).
root@popayan:~# likwid-perfctr -f -c 0-7 -g L2CACHE phoronix-test-suite batch-run c-ray
--------------------------------------------------------------------------------
CPU name: Intel(R) Core(TM) i7-4770S CPU @ 3.10GHz
CPU type: Intel Core Haswell processor
CPU clock: 3.09 GHz
--------------------------------------------------------------------------------
Phoronix Test Suite v5.2.1
System Information
Hardware:
Processor: Intel Core i7-4770S @ 3.09GHz (8 Cores), Motherboard: ASUS M11AD, Chipset: Intel 4th Gen Core DRAM, Memory: 2 x 8192 MB DDR3-1600MT/s Kingston, Disk: 2000GB TOSHIBA DT01ACA2, Graphics: Intel Gen7, Audio: Intel Xeon E3-1200 v3/4th, Network: Realtek RTL8111/8168/8411
Software:
OS: Ubuntu 17.10, Kernel: 4.13.0-36-generic (x86_64), Desktop: GNOME Shell 3.26.2, Display Server: X Server 1.19.5, OpenGL: 4.5 Mesa 17.2.8, File-System: ext4, Screen Resolution: 1920x1080, System Layer: Xen 4.9.0 Hypervisor
C-Ray 1.1:
pts/c-ray-1.1.1
Test 1 of 1
Estimated Trial Run Count: 3
Estimated Time To Completion: 2 Minutes
Started Run 1 @ 15:08:58
Started Run 2 @ 15:09:27
Started Run 3 @ 15:09:55 [Std. Dev: 0.68%]
Test Results:
26.432
26.713
26.771
Average: 26.64 Seconds
[NOTICE] Parameter 1 to graphics_event_checker::__post_test_run() expected to be a reference, value given in pts_module_manager:74
--------------------------------------------------------------------------------
Group 1: L2CACHE
+-----------------------+---------+----------+----------+----------+-----------+---------+----------+----------+----------+
| Event | Counter | Core 0 | Core 1 | Core 2 | Core 3 | Core 4 | Core 5 | Core 6 | Core 7 |
+-----------------------+---------+----------+----------+----------+-----------+---------+----------+----------+----------+
| INSTR_RETIRED_ANY | FIXC0 | 4582154 | 3732198 | 2390500 | 212888339 | 1895719 | 1159032 | 6164391 | 2134122 |
| CPU_CLK_UNHALTED_CORE | FIXC1 | 7283403 | 5277463 | 6594966 | 84537477 | 2124226 | 3762613 | 13278656 | 4842015 |
| CPU_CLK_UNHALTED_REF | FIXC2 | 14821100 | 18926213 | 23058947 | 83465113 | 7209050 | 10615454 | 18522934 | 17412266 |
| L2_TRANS_ALL_REQUESTS | PMC0 | 663946 | 446758 | 551331 | 493447 | 210080 | 319799 | 1263952 | 252110 |
| L2_RQSTS_MISS | PMC1 | 196746 | 156247 | 209585 | 173079 | 68890 | 125965 | 449283 | 85779 |
+-----------------------+---------+----------+----------+----------+-----------+---------+----------+----------+----------+
+----------------------------+---------+-----------+---------+-----------+--------------+
| Event | Counter | Sum | Min | Max | Avg |
+----------------------------+---------+-----------+---------+-----------+--------------+
| INSTR_RETIRED_ANY STAT | FIXC0 | 234946455 | 1159032 | 212888339 | 2.936831e+07 |
| CPU_CLK_UNHALTED_CORE STAT | FIXC1 | 127700819 | 2124226 | 84537477 | 1.596260e+07 |
| CPU_CLK_UNHALTED_REF STAT | FIXC2 | 194031077 | 7209050 | 83465113 | 2.425388e+07 |
| L2_TRANS_ALL_REQUESTS STAT | PMC0 | 4201423 | 210080 | 1263952 | 525177.8750 |
| L2_RQSTS_MISS STAT | PMC1 | 1465574 | 68890 | 449283 | 183196.7500 |
+----------------------------+---------+-----------+---------+-----------+--------------+
+----------------------+-----------+----------+----------+-----------+----------+-----------+-----------+----------+
| Metric | Core 0 | Core 1 | Core 2 | Core 3 | Core 4 | Core 5 | Core 6 | Core 7 |
+----------------------+-----------+----------+----------+-----------+----------+-----------+-----------+----------+
| Runtime (RDTSC) [s] | 84.8845 | 84.8845 | 84.8845 | 84.8845 | 84.8845 | 84.8845 | 84.8845 | 84.8845 |
| Runtime unhalted [s] | 0.0024 | 0.0017 | 0.0021 | 0.0273 | 0.0007 | 0.0012 | 0.0043 | 0.0016 |
| Clock [MHz] | 1519.9395 | 862.4499 | 884.5971 | 3132.6848 | 911.3708 | 1096.2848 | 2217.2606 | 860.0887 |
| CPI | 1.5895 | 1.4140 | 2.7588 | 0.3971 | 1.1205 | 3.2463 | 2.1541 | 2.2689 |
| L2 request rate | 0.1449 | 0.1197 | 0.2306 | 0.0023 | 0.1108 | 0.2759 | 0.2050 | 0.1181 |
| L2 miss rate | 0.0429 | 0.0419 | 0.0877 | 0.0008 | 0.0363 | 0.1087 | 0.0729 | 0.0402 |
| L2 miss ratio | 0.2963 | 0.3497 | 0.3801 | 0.3508 | 0.3279 | 0.3939 | 0.3555 | 0.3402 |
+----------------------+-----------+----------+----------+-----------+----------+-----------+-----------+----------+
+---------------------------+------------+----------+-----------+-----------+
| Metric | Sum | Min | Max | Avg |
+---------------------------+------------+----------+-----------+-----------+
| Runtime (RDTSC) [s] STAT | 679.0760 | 84.8845 | 84.8845 | 84.8845 |
| Runtime unhalted [s] STAT | 0.0413 | 0.0007 | 0.0273 | 0.0052 |
| Clock [MHz] STAT | 11484.6762 | 860.0887 | 3132.6848 | 1435.5845 |
| CPI STAT | 14.9492 | 0.3971 | 3.2463 | 1.8686 |
| L2 request rate STAT | 1.2073 | 0.0023 | 0.2759 | 0.1509 |
| L2 miss rate STAT | 0.4314 | 0.0008 | 0.1087 | 0.0539 |
| L2 miss ratio STAT | 2.7944 | 0.2963 | 0.3939 | 0.3493 |
+---------------------------+------------+----------+-----------+-----------+
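The derived metrics in the tables above can be reproduced from the raw event counts. The formulas in this sketch are inferred from the numbers in the output (they match the table for core 0 to four decimals), not taken from the likwid group definition, and l2cache_metrics is a hypothetical name.

```shell
#!/bin/bash
# Hypothetical helper: recompute the L2CACHE metrics from raw counts.
# Arguments: INSTR_RETIRED_ANY CPU_CLK_UNHALTED_CORE
#            L2_TRANS_ALL_REQUESTS L2_RQSTS_MISS
l2cache_metrics() {
    awk -v instr="$1" -v clk="$2" -v req="$3" -v miss="$4" 'BEGIN {
        printf "CPI=%.4f\n",             clk / instr   # cycles per instruction
        printf "L2 request rate=%.4f\n", req / instr   # L2 requests per instruction
        printf "L2 miss rate=%.4f\n",    miss / instr  # L2 misses per instruction
        printf "L2 miss ratio=%.4f\n",   miss / req    # L2 misses per L2 request
    }'
}

# Core 0 counts from the event table above.
l2cache_metrics 4582154 7283403 663946 196746
```

For core 0 this gives CPI=1.5895, L2 request rate=0.1449, L2 miss rate=0.0429 and L2 miss ratio=0.2963, the same values as in the metric table.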
I created a simple script to try all of the predefined groups. I tried adding the --output option to write each group’s results to a separate file:
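#!/bin/bash
# Run the c-ray benchmark once per predefined performance group.
# "likwid-perfctr -a" prints two header lines, then one group per line.
likwid-perfctr -a | tail -n +3 | awk '{ print $1 }' | while read -r group
do
    # Write each group's results to its own .txt file.
    likwid-perfctr --output "cray_${group}.txt" -f -c 0-7 -g "${group}" phoronix-test-suite batch-run c-ray
done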
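The group-name extraction can be tried on its own before launching a long series of benchmark runs. This sketch assumes the listing format shown earlier (two header lines, then the group name as the first field); list_groups is a hypothetical name.

```shell
#!/bin/bash
# Hypothetical helper: read a "likwid-perfctr -a" style listing on stdin and
# print one group name per line.
list_groups() {
    # Skip the two header lines, then print the first field of each
    # non-empty line.
    tail -n +3 | awk 'NF > 0 { print $1 }'
}

# Sample listing copied from the group list above.
list_groups <<'EOF'
Group name Description
--------------------------------------------------------------------------------
UOPS_RETIRE UOPs retirement
FLOPS_AVX Packed AVX MFLOP/s
L2CACHE L2 cache miss rate/ratio
EOF
```

On a machine with likwid installed, the same function could be fed with "likwid-perfctr -a | list_groups".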
Note: The files below seem to show a rough format for the metrics gathered. However, as above, the actual data seems fishy, since I wouldn’t expect such a big difference between cores for what is essentially a symmetric benchmark.
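One way to put a number on the imbalance is to compute each core’s share of the retired instructions. This is a small sketch with a hypothetical function name (instr_share), using the INSTR_RETIRED_ANY counts from the L2CACHE run above.

```shell
#!/bin/bash
# Hypothetical helper: print each core's percentage share of the total.
# Arguments: one INSTR_RETIRED_ANY count per core, in core order.
instr_share() {
    [ "$#" -gt 0 ] || return 1
    printf '%s\n' "$@" | awk '
        { count[NR - 1] = $1; total += $1 }
        END {
            for (i = 0; i < NR; i++)
                printf "core %d: %.1f%%\n", i, 100 * count[i] / total
        }'
}

# Per-core counts from the L2CACHE event table.
instr_share 4582154 3732198 2390500 212888339 1895719 1159032 6164391 2134122
```

With these counts, core 3 accounts for about 90.6% of all retired instructions, which is not what one would expect from a benchmark that loads all cores evenly.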