likwid-perfctr
The likwid performance monitoring and benchmarking suite has multiple tools including:
- likwid-topology – print thread, cache and NUMA topology
- likwid-pin – pin threaded applications to processors
- likwid-bench – micro-benchmarking application
- likwid-agent – monitoring agent for hardware performance counters
As of this writing, these tools seem to be available for x86 systems from both Intel and AMD, but not for the ARM architecture.
UPDATE: I’m told ARM support was added in January 2018.
This page documents some of my experiments playing with the likwid-perfctr application to examine performance counters.
Getting likwid was pretty easy, as there is a package for Ubuntu 17.10. However, out of the box the program didn't recognize any performance counters. The "-e" option provides a list of supported events:
mev@popayan:~$ likwid-perfctr -e
This architecture has 0 counters.
Counter tags(name, type<, options>):

This architecture has 0 events.
Event tags (tag, id, umask, counters<, options>):

mev@popayan:~$
I was able to fix this by including the msr module in the kernel:
mev@popayan:~$ sudo modprobe msr
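To keep the fix across reboots, the module name can be listed under modules-load.d (a sketch assuming a systemd-based distribution, which Ubuntu 17.10 is; the file name itself is arbitrary):

```shell
# Have systemd load the msr module automatically at every boot
# (any file name ending in .conf under modules-load.d works).
echo msr | sudo tee /etc/modules-load.d/msr.conf

# Verify the module is present after the next boot:
lsmod | grep -w msr
```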
On my Intel i7-4770s box, this now shows 27 counters and 496 events that can be monitored:
mev@popayan:~$ sudo likwid-perfctr -e
This architecture has 27 counters.
Counter tags(name, type<, options>):
FIXC0, Fixed counters, KERNEL|ANYTHREAD
FIXC1, Fixed counters, KERNEL|ANYTHREAD
FIXC2, Fixed counters, KERNEL|ANYTHREAD
PMC0, Core-local general purpose counters, EDGEDETECT|THRESHOLD|INVERT|KERNEL|ANYTHREAD|IN_TRANSACTION
PMC1, Core-local general purpose counters, EDGEDETECT|THRESHOLD|INVERT|KERNEL|ANYTHREAD|IN_TRANSACTION
PMC2, Core-local general purpose counters, EDGEDETECT|THRESHOLD|INVERT|KERNEL|ANYTHREAD|IN_TRANSACTION|IN_TRANSACTION_ABORTED
PMC3, Core-local general purpose counters, EDGEDETECT|THRESHOLD|INVERT|KERNEL|ANYTHREAD|IN_TRANSACTION
TMP0, Thermal
PWR0, Energy/Power counters (RAPL)
PWR1, Energy/Power counters (RAPL)
PWR2, Energy/Power counters (RAPL)
PWR3, Energy/Power counters (RAPL)
CBOX0C0, Caching Agent box 0, EDGEDETECT|THRESHOLD|INVERT
CBOX0C1, Caching Agent box 0, EDGEDETECT|THRESHOLD|INVERT
CBOX1C0, Caching Agent box 1, EDGEDETECT|THRESHOLD|INVERT
CBOX1C1, Caching Agent box 1, EDGEDETECT|THRESHOLD|INVERT
CBOX2C0, Caching Agent box 2, EDGEDETECT|THRESHOLD|INVERT
CBOX2C1, Caching Agent box 2, EDGEDETECT|THRESHOLD|INVERT
CBOX3C0, Caching Agent box 3, EDGEDETECT|THRESHOLD|INVERT
CBOX3C1, Caching Agent box 3, EDGEDETECT|THRESHOLD|INVERT
UBOX0, System Configuration box, EDGEDETECT|THRESHOLD|INVERT
UBOX1, System Configuration box, EDGEDETECT|THRESHOLD|INVERT
UBOXFIX, System Configuration box fixed counter

This architecture has 496 events.
Event tags (tag, id, umask, counters<, options>):
TEMP_CORE, 0x0, 0x0, TMP0
PWR_PKG_ENERGY, 0x0, 0x0, PWR0
On an AMD A10-7850 it shows 4 counters and 605 events (though somehow the counter count is reported as 0):
mev@cuenca:~$ sudo likwid-perfctr -e
This architecture has 0 counters.
Counter tags(name, type<, options>):
UPMC0, Socket-local general/fixed purpose counters
UPMC2, Socket-local general/fixed purpose counters
UPMC1, Socket-local general/fixed purpose counters
UPMC3, Socket-local general/fixed purpose counters

This architecture has 605 events.
Event tags (tag, id, umask, counters<, options>):
UNC_DRAM_ACCESSES_DCT0_PAGE_HIT, 0xE0, 0x1, UPMC
Notice these are all run as root. I wasn't able to get them to work as a non-root user, despite trying two things suggested by web searches:
- Changing permissions on the /dev/cpu/*/msr device files to be 0666
- Running "sudo setcap cap_sys_rawio+ep /usr/bin/likwid-perfctr"
UPDATE: The likwid page documents the setcap method, which should work. However, the AMD A10 client part is not supported, while AMD "Interlagos" processors are. Both Interlagos and the A10 are based on the AMD Family 15h design, though with different models/cores.
So for now, run the experiments as root. Things that don't require access to the performance counters, such as listing the predefined performance groups, can be run as an ordinary user. Here is the list of performance groups on the Intel box (one can also measure individual counters). The groups differ for each CPU; each runs a predefined performance study of a particular aspect of a CPU plus workload.
Group name      Description
--------------------------------------------------------------------------------
UOPS_RETIRE     UOPs retirement
FLOPS_AVX       Packed AVX MFLOP/s
TLB_DATA        L2 data TLB miss rate/ratio
CACHES          Cache bandwidth in MBytes/s
CYCLE_ACTIVITY  Cycle Activities
CLOCK           Power and Energy consumption
L3              L3 cache bandwidth in MBytes/s
BRANCH          Branch prediction miss rate/ratio
UOPS            UOPs execution info
TLB_INSTR       L1 Instruction TLB miss rate/ratio
RECOVERY        Recovery duration
L2CACHE         L2 cache miss rate/ratio
UOPS_ISSUE      UOPs issueing
L2              L2 cache bandwidth in MBytes/s
ENERGY          Power and Energy consumption
FALSE_SHARE     False sharing
DATA            Load to store ratio
L3CACHE         L3 cache miss rate/ratio
ICACHE          Instruction cache miss rate/ratio
UOPS_EXEC       UOPs execution
likwid-perfctr can be run in “stethoscope” mode with the -S option or as a wrapper without this option. When running as a wrapper, one provides a list of CPUs to monitor with either the -c (don’t pin) or -C (pin) options.
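A stethoscope-mode invocation might look like the following sketch: measure the CLOCK group on cores 0-3 for ten seconds with no wrapped workload (flags per the likwid documentation; the guard is only so the snippet degrades gracefully on a machine without likwid installed):

```shell
# Stethoscope mode: sample the CLOCK group on cores 0-3 for 10 seconds.
if command -v likwid-perfctr >/dev/null 2>&1; then
    sudo likwid-perfctr -c 0-3 -g CLOCK -S 10s
else
    echo "likwid-perfctr not installed"
fi
```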
Here is an example output that comes from:
likwid-perfctr -f -c 0-7 -g L2CACHE phoronix-test-suite batch-run c-ray
to look at L2 data cache rates for the Phoronix Test Suite c-ray benchmark. For some reason the --output filename option ran into difficulties, so the combined Phoronix Test Suite and likwid output is shown below. A few notes on what I found:
- Multiple -g options can be given. The groups appear to be measured in round-robin fashion, even when enough counters may have been available to run them together.
- This run happens under a Xen hypervisor, so the counters are accessible even in that environment.
- The --output option has special rules for file naming, e.g. it requires a .txt suffix (or certain other suffixes) to create text files.
- The -O option provides the data in CSV format for potential later batch processing.
- Setting the -t timeline option for periodic measurements seemed to generate a lot of divide-by-zero and NaN errors.
- The data below is itself fishy. For example, from a benchmark perspective there is no reason some cores should run far more instructions than others.
Overall, a useful tool and basic wrapper. For my current use, at least: (a) run as root, and (b) try multiple studies and counters (perhaps defining my own groups).
root@popayan:~# likwid-perfctr -f -c 0-7 -g L2CACHE phoronix-test-suite batch-run c-ray
--------------------------------------------------------------------------------
CPU name:   Intel(R) Core(TM) i7-4770S CPU @ 3.10GHz
CPU type:   Intel Core Haswell processor
CPU clock:  3.09 GHz
--------------------------------------------------------------------------------

Phoronix Test Suite v5.2.1
System Information

Hardware:
Processor: Intel Core i7-4770S @ 3.09GHz (8 Cores), Motherboard: ASUS M11AD, Chipset: Intel 4th Gen Core DRAM, Memory: 2 x 8192 MB DDR3-1600MT/s Kingston, Disk: 2000GB TOSHIBA DT01ACA2, Graphics: Intel Gen7, Audio: Intel Xeon E3-1200 v3/4th, Network: Realtek RTL8111/8168/8411

Software:
OS: Ubuntu 17.10, Kernel: 4.13.0-36-generic (x86_64), Desktop: GNOME Shell 3.26.2, Display Server: X Server 1.19.5, OpenGL: 4.5 Mesa 17.2.8, File-System: ext4, Screen Resolution: 1920x1080, System Layer: Xen 4.9.0 Hypervisor

C-Ray 1.1:
    pts/c-ray-1.1.1
    Test 1 of 1
    Estimated Trial Run Count:    3
    Estimated Time To Completion: 2 Minutes
        Started Run 1 @ 15:08:58
        Started Run 2 @ 15:09:27
        Started Run 3 @ 15:09:55  [Std. Dev: 0.68%]

    Test Results:
        26.432
        26.713
        26.771

    Average: 26.64 Seconds

[NOTICE] Parameter 1 to graphics_event_checker::__post_test_run() expected to be a reference, value given in pts_module_manager:74
--------------------------------------------------------------------------------
Group 1: L2CACHE
+-----------------------+---------+----------+----------+----------+-----------+---------+----------+----------+----------+
|         Event         | Counter |  Core 0  |  Core 1  |  Core 2  |   Core 3  |  Core 4 |  Core 5  |  Core 6  |  Core 7  |
+-----------------------+---------+----------+----------+----------+-----------+---------+----------+----------+----------+
|   INSTR_RETIRED_ANY   |  FIXC0  |  4582154 |  3732198 |  2390500 | 212888339 | 1895719 |  1159032 |  6164391 |  2134122 |
| CPU_CLK_UNHALTED_CORE |  FIXC1  |  7283403 |  5277463 |  6594966 |  84537477 | 2124226 |  3762613 | 13278656 |  4842015 |
|  CPU_CLK_UNHALTED_REF |  FIXC2  | 14821100 | 18926213 | 23058947 |  83465113 | 7209050 | 10615454 | 18522934 | 17412266 |
| L2_TRANS_ALL_REQUESTS |   PMC0  |   663946 |   446758 |   551331 |    493447 |  210080 |   319799 |  1263952 |   252110 |
|     L2_RQSTS_MISS     |   PMC1  |   196746 |   156247 |   209585 |    173079 |   68890 |   125965 |   449283 |    85779 |
+-----------------------+---------+----------+----------+----------+-----------+---------+----------+----------+----------+
+----------------------------+---------+-----------+---------+-----------+--------------+
|            Event           | Counter |    Sum    |   Min   |    Max    |      Avg     |
+----------------------------+---------+-----------+---------+-----------+--------------+
|   INSTR_RETIRED_ANY STAT   |  FIXC0  | 234946455 | 1159032 | 212888339 | 2.936831e+07 |
| CPU_CLK_UNHALTED_CORE STAT |  FIXC1  | 127700819 | 2124226 |  84537477 | 1.596260e+07 |
|  CPU_CLK_UNHALTED_REF STAT |  FIXC2  | 194031077 | 7209050 |  83465113 | 2.425388e+07 |
| L2_TRANS_ALL_REQUESTS STAT |   PMC0  |   4201423 |  210080 |   1263952 |  525177.8750 |
|     L2_RQSTS_MISS STAT     |   PMC1  |   1465574 |   68890 |    449283 |  183196.7500 |
+----------------------------+---------+-----------+---------+-----------+--------------+
+----------------------+-----------+----------+----------+-----------+----------+-----------+-----------+----------+
|        Metric        |   Core 0  |  Core 1  |  Core 2  |   Core 3  |  Core 4  |   Core 5  |   Core 6  |  Core 7  |
+----------------------+-----------+----------+----------+-----------+----------+-----------+-----------+----------+
|  Runtime (RDTSC) [s] |  84.8845  |  84.8845 |  84.8845 |  84.8845  |  84.8845 |  84.8845  |  84.8845  |  84.8845 |
| Runtime unhalted [s] |   0.0024  |  0.0017  |  0.0021  |   0.0273  |  0.0007  |   0.0012  |   0.0043  |  0.0016  |
|      Clock [MHz]     | 1519.9395 | 862.4499 | 884.5971 | 3132.6848 | 911.3708 | 1096.2848 | 2217.2606 | 860.0887 |
|          CPI         |   1.5895  |  1.4140  |  2.7588  |   0.3971  |  1.1205  |   3.2463  |   2.1541  |  2.2689  |
|    L2 request rate   |   0.1449  |  0.1197  |  0.2306  |   0.0023  |  0.1108  |   0.2759  |   0.2050  |  0.1181  |
|     L2 miss rate     |   0.0429  |  0.0419  |  0.0877  |   0.0008  |  0.0363  |   0.1087  |   0.0729  |  0.0402  |
|     L2 miss ratio    |   0.2963  |  0.3497  |  0.3801  |   0.3508  |  0.3279  |   0.3939  |   0.3555  |  0.3402  |
+----------------------+-----------+----------+----------+-----------+----------+-----------+-----------+----------+
+---------------------------+------------+----------+-----------+-----------+
|           Metric          |     Sum    |    Min   |    Max    |    Avg    |
+---------------------------+------------+----------+-----------+-----------+
|  Runtime (RDTSC) [s] STAT |  679.0760  |  84.8845 |  84.8845  |  84.8845  |
| Runtime unhalted [s] STAT |   0.0413   |  0.0007  |   0.0273  |   0.0052  |
|      Clock [MHz] STAT     | 11484.6762 | 860.0887 | 3132.6848 | 1435.5845 |
|          CPI STAT         |   14.9492  |  0.3971  |   3.2463  |   1.8686  |
|    L2 request rate STAT   |   1.2073   |  0.0023  |   0.2759  |   0.1509  |
|     L2 miss rate STAT     |   0.4314   |  0.0008  |   0.1087  |   0.0539  |
|     L2 miss ratio STAT    |   2.7944   |  0.2963  |   0.3939  |   0.3493  |
+---------------------------+------------+----------+-----------+-----------+
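As a quick sanity check, the per-core counts can be re-aggregated by hand. With the -O option the same rows come out as comma-separated values, so an awk one-liner can reproduce the STAT "Sum" column (the CSV line below is hand-copied from the L2_RQSTS_MISS row above; the exact -O field layout is an assumption and should be confirmed against a real run):

```shell
# L2_RQSTS_MISS per-core counts copied from the table, in the comma-separated
# shape that -O emits (field layout assumed; verify against a real -O run).
row='L2_RQSTS_MISS,PMC1,196746,156247,209585,173079,68890,125965,449283,85779'

# Sum fields 3..NF to reproduce the STAT "Sum" value for this event.
echo "$row" | awk -F, '{ s = 0; for (i = 3; i <= NF; i++) s += $i; print s }'
# Matches the L2_RQSTS_MISS STAT sum of 1465574 reported in the table.
```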
I created a simple script to try all of the predefined groups, adding the --output option to save each run to its own file:
#!/bin/bash
# Skip the two header lines of "likwid-perfctr -a", then run the benchmark
# once for each group name in the first column.
likwid-perfctr -a | tail -n +3 | awk '{ print $1 }' | while read group
do
    likwid-perfctr --output cray_${group}.txt -f -c 0-7 -g ${group} phoronix-test-suite batch-run c-ray
done
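The parsing step of that script can be checked even on a machine without likwid, by feeding it a captured fragment of "likwid-perfctr -a" style output (the sample below is abbreviated and hand-written, not the tool's full group list):

```shell
# Abbreviated, hand-written sample in the shape of "likwid-perfctr -a" output.
sample=$(printf 'Group name\tDescription\n%s\nBRANCH\tBranch prediction miss rate/ratio\nL2CACHE\tL2 cache miss rate/ratio\n' \
    '--------------------------------------------------------------------------------')

# Same pipeline as the script: drop the two header lines, keep column one.
printf '%s\n' "$sample" | tail -n +3 | awk '{ print $1 }'
# Prints the two group names, BRANCH and L2CACHE, one per line.
```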
Note: The files below show a rough format for the metrics gathered. However, as above, the actual data seems fishy, since one shouldn't expect such big differences between cores for what is essentially a symmetric benchmark.
- UOPS_RETIRE UOPs retirement
- FLOPS_AVX Packed AVX MFLOP/s
- TLB_DATA L2 data TLB miss rate/ratio
- CACHES Cache bandwidth in MBytes/s
- CYCLE_ACTIVITY Cycle Activities
- CLOCK Power and Energy consumption
- L3 L3 cache bandwidth in MBytes/s
- BRANCH Branch prediction miss rate/ratio
- UOPS UOPs execution info
- TLB_INSTR L1 Instruction TLB miss rate/ratio
- RECOVERY Recovery duration
- L2CACHE L2 cache miss rate/ratio
- UOPS_ISSUE UOPs issueing
- L2 L2 cache bandwidth in MBytes/s
- ENERGY Power and Energy consumption
- FALSE_SHARE False sharing
- DATA Load to store ratio
- L3CACHE L3 cache miss rate/ratio
- ICACHE Instruction cache miss rate/ratio
- UOPS_EXEC UOPs execution
These tests were performed on a virtualized system running the Xen hypervisor.
A followup experiment, http://perf.mvermeulen.com/2018/03/14/experiment-virtualization-and-performance-counters/, shows much better results on bare metal.