Blog – Page 8 – Performance analysis, tools and experiments

Experiment – virtualization and performance counters

Posted on 2018-03-14 by mev

During a recent look at likwid-perfctr the performance counters didn’t look right in several aspects:

Core-to-core differences in what should be a symmetric benchmark, even though wspy results showed the processes balanced across cores
Run-to-run differences, with vastly different numbers of cycles reported
Absolute differences, such as far too few CPU cycles being reported

So while the runs were good at showing the format of the tool, the data just looked wrong.
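The first anomaly can be put into a number. Below is a minimal sketch, assuming the pipe-separated table layout that likwid-perfctr printed in my runs (event name, counter name, then one cell per core). The function name core_imbalance is my own and not part of likwid.

```shell
#!/bin/bash
# core_imbalance EVENT: read likwid-perfctr table output on stdin and print
# the min, max and max/min ratio of the per-core values for EVENT.
# Assumed row layout: | EVENT | COUNTER | core0 | core1 | ... |
core_imbalance() {
    awk -F'|' -v event="$1" '
        {
            name = $2
            gsub(/^ +| +$/, "", name)      # trim padding around the event name
            if (name != event) next        # skip borders, headers, other events
            min = max = ""
            # Field 3 is the counter name; fields 4..NF-1 are per-core values.
            for (i = 4; i < NF; i++) {
                v = $i + 0
                if (min == "" || v < min) min = v
                if (max == "" || v > max) max = v
            }
            if (min > 0)
                printf "min=%d max=%d ratio=%.1f\n", min, max, max / min
            exit                           # only the first matching row
        }'
}
```

Usage would be something like core_imbalance INSTR_RETIRED_ANY < saved_output.txt, where saved_output.txt is a hypothetical file holding the tool's text output. For a symmetric benchmark the ratio should stay close to 1.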

I had a hypothesis that the virtual performance counters were not correctly tabulated via the MSR interface. In that experiment, the system was booted under the Xen hypervisor with the vpmu=1 parameter. This seemed to let the “perf” tool report results, but perhaps not likwid-perfctr, which uses the msr kernel module.

For this experiment, I reran the previous experiment, which ran each predefined group, on the same system booted on bare metal.

#!/bin/bash
# Skip the two header lines of "likwid-perfctr -a", then run c-ray once per group.
likwid-perfctr -a | tail -n +3 | awk '{ print $1 }' | while read -r group
do
    likwid-perfctr --output "cray_2${group}.txt" -f -c 0-7 -g "${group}" phoronix-test-suite batch-run c-ray
done

This time the results make a lot more sense without the anomalies listed above.

UOPS_RETIRE UOPs retirement
FLOPS_AVX Packed AVX MFLOP/s
TLB_DATA L2 data TLB miss rate/ratio
CACHES Cache bandwidth in MBytes/s
CYCLE_ACTIVITY Cycle Activities
CLOCK Power and Energy consumption
L3 L3 cache bandwidth in MBytes/s
BRANCH Branch prediction miss rate/ratio
UOPS UOPs execution info
TLB_INSTR L1 Instruction TLB miss rate/ratio
RECOVERY Recovery duration
L2CACHE L2 cache miss rate/ratio
UOPS_ISSUE UOPs issueing
L2 L2 cache bandwidth in MBytes/s
ENERGY Power and Energy consumption
FALSE_SHARE False sharing
DATA Load to store ratio
L3CACHE L3 cache miss rate/ratio
ICACHE Instruction cache miss rate/ratio
UOPS_EXEC UOPs execution

While further experiments might compare the vpmu interface for perf_event_open(2) based tools like perf, this experiment suggests avoiding msr device based tools like likwid-perfctr on virtualized systems.
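One way to act on this is to have a wrapper script check for a hypervisor before collecting counters. The following is a rough sketch, assuming the kernel exposes /sys/hypervisor/type (it contains "xen" under Xen). Other hypervisors may not populate that file, so a result of "none" is not proof of bare metal. The function names are my own.

```shell
#!/bin/bash
# hypervisor_type [SYSFS_ROOT]: print the hypervisor name the kernel reports,
# or "none" when the type file is absent.
# SYSFS_ROOT defaults to /sys; the parameter exists only to ease testing.
hypervisor_type() {
    local type_file="${1:-/sys}/hypervisor/type"
    if [ -r "$type_file" ]; then
        cat "$type_file"
    else
        echo none
    fi
}

# warn_if_virtualized [SYSFS_ROOT]: warn on stderr and return non-zero when a
# hypervisor is detected, so a caller can refuse to collect MSR-based counters.
warn_if_virtualized() {
    local hv
    hv=$(hypervisor_type "$1")
    if [ "$hv" != "none" ]; then
        echo "warning: running under '$hv'; MSR-based counters may be unreliable" >&2
        return 1
    fi
}
```

A measurement script could then start with: warn_if_virtualized || exit 1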

wspy created

Posted on 2018-03-13 by mev (updated 2018-04-11)

Created a simple monitoring tool named wspy for workload spy. Source is hosted at http://www.github.com/mvermeulen/wspy

This is a wrapper program that collects data as the program runs. The initial program has two data collectors:

trace – uses the ftrace Linux kernel tracer to turn on tracing of scheduler events for fork/exec/exit, and parses these to create and display a process tree.
timer – fires a timer once per second to run a collector for the contents of /proc/stat
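To illustrate what a once-per-second /proc/stat collector can derive, here is a minimal sketch (not wspy's actual code) that turns two samples of the aggregate "cpu" line into a busy percentage. The function name is hypothetical. It sums only the user through steal fields, since the guest fields are already counted in user and nice.

```shell
#!/bin/bash
# cpu_busy_percent PREV CUR: given two aggregate "cpu ..." lines from
# /proc/stat, print the percentage of time the CPUs were busy between them.
# Fields after the label: user nice system idle iowait irq softirq steal ...
cpu_busy_percent() {
    printf '%s\n%s\n' "$1" "$2" | awk '
        {
            total = 0
            for (i = 2; i <= 9 && i <= NF; i++) total += $i   # user..steal
            idle = $5 + $6                                     # idle + iowait
            if (NR == 1) { prev_total = total; prev_idle = idle; next }
            dt = total - prev_total
            di = idle - prev_idle
            if (dt > 0) printf "%.1f\n", 100 * (dt - di) / dt
            else        print "0.0"
        }'
}
```

On a live system the two samples could be taken with prev=$(grep '^cpu ' /proc/stat), a sleep 1, and cur=$(grep '^cpu ' /proc/stat), followed by cpu_busy_percent "$prev" "$cur".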

This program is somewhat different from the “trup” utility I created while at AMD. That program used the “ptrace” capability to trace all processes in a tree and then instrument them. The wspy approach might have the following advantages and disadvantages:

Using capabilities like ftrace is hopefully less intrusive, and hence adds less performance overhead, than ptracing every process.
ftrace seems to require root setup, so the program runs as root, though I’ve added a “-u” option to set the user for the child.
Instrumentation is based on the entire system rather than individual processes. The pro/con of this is less precision for an individual application (particularly a single process) but a bigger picture of the overall system state.
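For readers unfamiliar with ftrace, the scheduler events mentioned above are switched on by writing to files under the tracing directory. This is a rough sketch of that step only, not wspy's own code. It assumes the standard tracepoint names sched_process_fork, sched_process_exec and sched_process_exit, and the helper names are hypothetical.

```shell
#!/bin/bash
# set_sched_events TRACING_DIR VALUE: write VALUE (1 = on, 0 = off) to the
# enable file of the fork, exec and exit scheduler tracepoints.
# On a real system TRACING_DIR is usually /sys/kernel/debug/tracing or
# /sys/kernel/tracing, and writing there needs root.
set_sched_events() {
    local tracing_dir="$1" value="$2" event
    for event in sched_process_fork sched_process_exec sched_process_exit; do
        echo "$value" > "$tracing_dir/events/sched/$event/enable" || return 1
    done
}

enable_sched_events()  { set_sched_events "$1" 1; }
disable_sched_events() { set_sched_events "$1" 0; }
```

With the events enabled, reading the trace_pipe file in the same directory streams the fork/exec/exit records, which a tool can then parse into a process tree.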

wspy code is an initial code base that I can mutate and change for other tracing.

Following is an example of initial usage:

mev@popayan:~/wspy$ ./wspy -?
./wspy: fatal error: usage: ./wspy [CcFf][-r name][-u uid] ...
	-C	turn on CPU usage tracing (default = on)
	-c	turn off CPU usage tracing (default = on)
	-F	turn on kernel scheduler tracing (default = on)
	-f	turn off kernel scheduler tracing (default = on)
	-r	filter for name of process tree root
	-u	run  as user 
mev@popayan:~/wspy$

The following command line runs the c-ray application from the Phoronix Test Suite under wspy:

mev@popayan:~/wspy$ sudo ./wspy -o wspy.output.txt -r ./c-ray -u mev phoronix-test-suite batch-run c-ray

Following is a link to the output file created. The “-r” helps trim auxiliary process tree information.

Clearly more to do both in data output/presentation and maturity of this program. However, also useful to have a code base that runs on x86/ARM to now tweak for other instrumentation.

likwid-perfctr

Posted on 2018-03-13 by mev (updated 2018-06-26)

The likwid performance monitoring and benchmarking suite has multiple tools including:

likwid-topology – print thread, cache and NUMA topology
likwid-pin – pin threaded applications to processors
likwid-bench – micro-benchmarking application
likwid-agent – monitoring agent for hardware performance counters

As of this writing, these tools seem to be available for x86 systems from both Intel and AMD, but not for the ARM architecture.

UPDATE: I’m told ARM support was added in January 2018.

This page documents some of my experiments playing with the likwid-perfctr application to examine performance counters.

Getting likwid was pretty easy as there is a package for Ubuntu 17.10. However, out of the box the program didn’t recognize any performance counters. Results from “-e” option provide a list of events supported:

mev@popayan:~$ likwid-perfctr -e
This architecture has 0 counters.
Counter tags(name, type<, options>):

This architecture has 0 events.
Event tags (tag, id, umask, counters<, options>):
mev@popayan:~$

I was able to fix this by loading the msr kernel module:

mev@popayan:~$ sudo modprobe msr
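A quick way to see whether the module is in effect is to count the per-CPU msr device nodes it creates under /dev/cpu. This is a small sketch with a hypothetical function name. The optional argument only exists so the function can be pointed at a test directory.

```shell
#!/bin/bash
# count_msr_devices [DEV_ROOT]: print how many per-CPU msr device nodes exist.
# DEV_ROOT defaults to /dev, where the msr module creates /dev/cpu/N/msr.
count_msr_devices() {
    local dev_root="${1:-/dev}" node count=0
    for node in "$dev_root"/cpu/*/msr; do
        # An unmatched glob stays literal, so test that the node really exists.
        if [ -e "$node" ]; then
            count=$((count + 1))
        fi
    done
    echo "$count"
}
```

A setup script could use it as: [ "$(count_msr_devices)" -gt 0 ] || sudo modprobe msr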

On my Intel i7-4770S box, this now shows 27 counters and 496 events that can be monitored:

mev@popayan:~$ sudo likwid-perfctr -e
This architecture has 27 counters.
Counter tags(name, type<, options>):
FIXC0, Fixed counters, KERNEL|ANYTHREAD
FIXC1, Fixed counters, KERNEL|ANYTHREAD
FIXC2, Fixed counters, KERNEL|ANYTHREAD
PMC0, Core-local general purpose counters, EDGEDETECT|THRESHOLD|INVERT|KERNEL|ANYTHREAD|IN_TRANSACTION
PMC1, Core-local general purpose counters, EDGEDETECT|THRESHOLD|INVERT|KERNEL|ANYTHREAD|IN_TRANSACTION
PMC2, Core-local general purpose counters, EDGEDETECT|THRESHOLD|INVERT|KERNEL|ANYTHREAD|IN_TRANSACTION|IN_TRANSACTION_ABORTED
PMC3, Core-local general purpose counters, EDGEDETECT|THRESHOLD|INVERT|KERNEL|ANYTHREAD|IN_TRANSACTION
TMP0, Thermal
PWR0, Energy/Power counters (RAPL)
PWR1, Energy/Power counters (RAPL)
PWR2, Energy/Power counters (RAPL)
PWR3, Energy/Power counters (RAPL)
CBOX0C0, Caching Agent box 0, EDGEDETECT|THRESHOLD|INVERT
CBOX0C1, Caching Agent box 0, EDGEDETECT|THRESHOLD|INVERT
CBOX1C0, Caching Agent box 1, EDGEDETECT|THRESHOLD|INVERT
CBOX1C1, Caching Agent box 1, EDGEDETECT|THRESHOLD|INVERT
CBOX2C0, Caching Agent box 2, EDGEDETECT|THRESHOLD|INVERT
CBOX2C1, Caching Agent box 2, EDGEDETECT|THRESHOLD|INVERT
CBOX3C0, Caching Agent box 3, EDGEDETECT|THRESHOLD|INVERT
CBOX3C1, Caching Agent box 3, EDGEDETECT|THRESHOLD|INVERT
UBOX0, System Configuration box, EDGEDETECT|THRESHOLD|INVERT
UBOX1, System Configuration box, EDGEDETECT|THRESHOLD|INVERT
UBOXFIX, System Configuration box fixed counter

This architecture has 496 events.
Event tags (tag, id, umask, counters<, options>):
TEMP_CORE, 0x0, 0x0, TMP0
PWR_PKG_ENERGY, 0x0, 0x0, PWR0

On an AMD A10-7850 it lists 4 counters and 605 events (though somehow the reported counter count is 0):

mev@cuenca:~$ sudo likwid-perfctr -e 
This architecture has 0 counters.
Counter tags(name, type<, options>):
UPMC0, Socket-local general/fixed purpose counters
UPMC2, Socket-local general/fixed purpose counters
UPMC1, Socket-local general/fixed purpose counters
UPMC3, Socket-local general/fixed purpose counters

This architecture has 605 events.
Event tags (tag, id, umask, counters<, options>):
UNC_DRAM_ACCESSES_DCT0_PAGE_HIT, 0xE0, 0x1, UPMC

Notice these are all run as root. I wasn’t able to get them to show as a non-root user, despite trying two things suggested by a web search:

Changing permissions on the /dev/cpu/*/msr device files to be 0666
Adding “sudo setcap cap_sys_rawio+ep /usr/bin/likwid-perfctr”

UPDATE: The likwid page documents the setcap method, which should work. However, the AMD A10 client is not supported. AMD “Interlagos” processors are supported. Both Interlagos and A10 are based on the AMD Family 15h processors, though with different models/cores.

So for now run the experiments as root. Things that don’t require access to performance counters such as showing predefined performance groups can be run as user. Here is a list of performance groups on the Intel box (one can also run individual counters). These are different for each CPU. One can run these as predefined performance studies on a particular aspect of a CPU + workload.

 Group name	Description
--------------------------------------------------------------------------------
UOPS_RETIRE	UOPs retirement
  FLOPS_AVX	Packed AVX MFLOP/s
   TLB_DATA	L2 data TLB miss rate/ratio
     CACHES	Cache bandwidth in MBytes/s
CYCLE_ACTIVITY	Cycle Activities
      CLOCK	Power and Energy consumption
         L3	L3 cache bandwidth in MBytes/s
     BRANCH	Branch prediction miss rate/ratio
       UOPS	UOPs execution info
  TLB_INSTR	L1 Instruction TLB miss rate/ratio
   RECOVERY	Recovery duration
    L2CACHE	L2 cache miss rate/ratio
 UOPS_ISSUE	UOPs issueing
         L2	L2 cache bandwidth in MBytes/s
     ENERGY	Power and Energy consumption
FALSE_SHARE	False sharing
       DATA	Load to store ratio
    L3CACHE	L3 cache miss rate/ratio
     ICACHE	Instruction cache miss rate/ratio
  UOPS_EXEC	UOPs execution

likwid-perfctr can be run in “stethoscope” mode with the -S option or as a wrapper without this option. When running as a wrapper, one provides a list of CPUs to monitor with either the -c (don’t pin) or -C (pin) options.

Here is an example output that comes from:

likwid-perfctr -f  -c 0-7 -g L2CACHE phoronix-test-suite batch-run c-ray

to look at L2 data cache rates on the Phoronix Test Suite c-ray benchmark. Somehow giving the --output filename option seemed to have difficulties, so I included both the Phoronix Test Suite output and then the likwid output (see below). A few notes that I found:

Multiple -g options can be given. It appears to run the groups in round-robin fashion, even if enough counters might have been available to run them together
This run happened under the Xen hypervisor, which gave access to the counters
The --output option has special rules for file naming, e.g. a .txt suffix is required to create text files, and other formats require other names.
The -O CSV option provides data in a format suitable for later batch processing.
Setting the -t timeline option for periodic measurements seemed to generate a lot of divide by zero and nan exceptions
The data below is itself fishy. For example, there is no reason from a benchmark perspective that far more instructions should run on some cores than on others

Overall, a useful tool and basic wrapper, at least for my current use, as long as I (a) run as root and (b) try multiple studies and counters (perhaps defining my own groups).

root@popayan:~# likwid-perfctr -f -c 0-7 -g L2CACHE phoronix-test-suite batch-run c-ray
--------------------------------------------------------------------------------
CPU name:	Intel(R) Core(TM) i7-4770S CPU @ 3.10GHz
CPU type:	Intel Core Haswell processor
CPU clock:	3.09 GHz
--------------------------------------------------------------------------------

Phoronix Test Suite v5.2.1
System Information

Hardware:
Processor: Intel Core i7-4770S @ 3.09GHz (8 Cores), Motherboard: ASUS M11AD, Chipset: Intel 4th Gen Core DRAM, Memory: 2 x 8192 MB DDR3-1600MT/s Kingston, Disk: 2000GB TOSHIBA DT01ACA2, Graphics: Intel Gen7, Audio: Intel Xeon E3-1200 v3/4th, Network: Realtek RTL8111/8168/8411

Software:
OS: Ubuntu 17.10, Kernel: 4.13.0-36-generic (x86_64), Desktop: GNOME Shell 3.26.2, Display Server: X Server 1.19.5, OpenGL: 4.5 Mesa 17.2.8, File-System: ext4, Screen Resolution: 1920x1080, System Layer: Xen 4.9.0 Hypervisor



C-Ray 1.1:
    pts/c-ray-1.1.1
    Test 1 of 1
    Estimated Trial Run Count:    3
    Estimated Time To Completion: 2 Minutes
        Started Run 1 @ 15:08:58
        Started Run 2 @ 15:09:27
        Started Run 3 @ 15:09:55  [Std. Dev: 0.68%]

    Test Results:
        26.432
        26.713
        26.771

    Average: 26.64 Seconds

[NOTICE] Parameter 1 to graphics_event_checker::__post_test_run() expected to be a reference, value given in pts_module_manager:74


--------------------------------------------------------------------------------
Group 1: L2CACHE
+-----------------------+---------+----------+----------+----------+-----------+---------+----------+----------+----------+
|         Event         | Counter |  Core 0  |  Core 1  |  Core 2  |   Core 3  |  Core 4 |  Core 5  |  Core 6  |  Core 7  |
+-----------------------+---------+----------+----------+----------+-----------+---------+----------+----------+----------+
|   INSTR_RETIRED_ANY   |  FIXC0  |  4582154 |  3732198 |  2390500 | 212888339 | 1895719 |  1159032 |  6164391 |  2134122 |
| CPU_CLK_UNHALTED_CORE |  FIXC1  |  7283403 |  5277463 |  6594966 |  84537477 | 2124226 |  3762613 | 13278656 |  4842015 |
|  CPU_CLK_UNHALTED_REF |  FIXC2  | 14821100 | 18926213 | 23058947 |  83465113 | 7209050 | 10615454 | 18522934 | 17412266 |
| L2_TRANS_ALL_REQUESTS |   PMC0  |   663946 |   446758 |   551331 |    493447 |  210080 |   319799 |  1263952 |   252110 |
|     L2_RQSTS_MISS     |   PMC1  |   196746 |   156247 |   209585 |    173079 |   68890 |   125965 |   449283 |    85779 |
+-----------------------+---------+----------+----------+----------+-----------+---------+----------+----------+----------+

+----------------------------+---------+-----------+---------+-----------+--------------+
|            Event           | Counter |    Sum    |   Min   |    Max    |      Avg     |
+----------------------------+---------+-----------+---------+-----------+--------------+
|   INSTR_RETIRED_ANY STAT   |  FIXC0  | 234946455 | 1159032 | 212888339 | 2.936831e+07 |
| CPU_CLK_UNHALTED_CORE STAT |  FIXC1  | 127700819 | 2124226 |  84537477 | 1.596260e+07 |
|  CPU_CLK_UNHALTED_REF STAT |  FIXC2  | 194031077 | 7209050 |  83465113 | 2.425388e+07 |
| L2_TRANS_ALL_REQUESTS STAT |   PMC0  |   4201423 |  210080 |   1263952 |  525177.8750 |
|     L2_RQSTS_MISS STAT     |   PMC1  |   1465574 |   68890 |    449283 |  183196.7500 |
+----------------------------+---------+-----------+---------+-----------+--------------+

+----------------------+-----------+----------+----------+-----------+----------+-----------+-----------+----------+
|        Metric        |   Core 0  |  Core 1  |  Core 2  |   Core 3  |  Core 4  |   Core 5  |   Core 6  |  Core 7  |
+----------------------+-----------+----------+----------+-----------+----------+-----------+-----------+----------+
|  Runtime (RDTSC) [s] |   84.8845 |  84.8845 |  84.8845 |   84.8845 |  84.8845 |   84.8845 |   84.8845 |  84.8845 |
| Runtime unhalted [s] |    0.0024 |   0.0017 |   0.0021 |    0.0273 |   0.0007 |    0.0012 |    0.0043 |   0.0016 |
|      Clock [MHz]     | 1519.9395 | 862.4499 | 884.5971 | 3132.6848 | 911.3708 | 1096.2848 | 2217.2606 | 860.0887 |
|          CPI         |    1.5895 |   1.4140 |   2.7588 |    0.3971 |   1.1205 |    3.2463 |    2.1541 |   2.2689 |
|    L2 request rate   |    0.1449 |   0.1197 |   0.2306 |    0.0023 |   0.1108 |    0.2759 |    0.2050 |   0.1181 |
|     L2 miss rate     |    0.0429 |   0.0419 |   0.0877 |    0.0008 |   0.0363 |    0.1087 |    0.0729 |   0.0402 |
|     L2 miss ratio    |    0.2963 |   0.3497 |   0.3801 |    0.3508 |   0.3279 |    0.3939 |    0.3555 |   0.3402 |
+----------------------+-----------+----------+----------+-----------+----------+-----------+-----------+----------+

+---------------------------+------------+----------+-----------+-----------+
|           Metric          |     Sum    |    Min   |    Max    |    Avg    |
+---------------------------+------------+----------+-----------+-----------+
|  Runtime (RDTSC) [s] STAT |   679.0760 |  84.8845 |   84.8845 |   84.8845 |
| Runtime unhalted [s] STAT |     0.0413 |   0.0007 |    0.0273 |    0.0052 |
|      Clock [MHz] STAT     | 11484.6762 | 860.0887 | 3132.6848 | 1435.5845 |
|          CPI STAT         |    14.9492 |   0.3971 |    3.2463 |    1.8686 |
|    L2 request rate STAT   |     1.2073 |   0.0023 |    0.2759 |    0.1509 |
|     L2 miss rate STAT     |     0.4314 |   0.0008 |    0.1087 |    0.0539 |
|     L2 miss ratio STAT    |     2.7944 |   0.2963 |    0.3939 |    0.3493 |
+---------------------------+------------+----------+-----------+-----------+
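The metric table can be cross-checked against the raw counters. The sketch below assumes these formulas, which I inferred because they reproduce the Core 0 column above to four decimals: CPI = CPU_CLK_UNHALTED_CORE / INSTR_RETIRED_ANY, L2 request rate = L2_TRANS_ALL_REQUESTS / INSTR_RETIRED_ANY, L2 miss rate = L2_RQSTS_MISS / INSTR_RETIRED_ANY, and L2 miss ratio = L2_RQSTS_MISS / L2_TRANS_ALL_REQUESTS. The function name is my own.

```shell
#!/bin/bash
# l2cache_metrics INSTR CLK L2_REQUESTS L2_MISSES: print the derived L2CACHE
# metrics for one core from its raw counter values.
l2cache_metrics() {
    awk -v instr="$1" -v clk="$2" -v req="$3" -v miss="$4" 'BEGIN {
        if (instr <= 0 || req <= 0) {
            print "error: instruction and request counts must be positive" > "/dev/stderr"
            exit 1
        }
        printf "CPI %.4f\n",             clk  / instr
        printf "L2 request rate %.4f\n", req  / instr
        printf "L2 miss rate %.4f\n",    miss / instr
        printf "L2 miss ratio %.4f\n",   miss / req
    }'
}

# Raw counters for Core 0 from the table above.
l2cache_metrics 4582154 7283403 663946 196746
```

So the derived metrics are at least internally consistent with the counters. Whatever is wrong appears to be in the raw counts themselves.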

I created a simple script to try all of the predefined groups. I tried to add the --output option:

#!/bin/bash
# Skip the two header lines of "likwid-perfctr -a", then run c-ray once per group.
likwid-perfctr -a | tail -n +3 | awk '{ print $1 }' | while read -r group
do
    likwid-perfctr --output "cray_${group}.txt" -f -c 0-7 -g "${group}" phoronix-test-suite batch-run c-ray
done

Note: The files below seem to show a rough format for the metrics gathered. However, as above, the actual data seems fishy, since I wouldn’t expect such a big difference between cores for what is essentially a symmetric benchmark.

UOPS_RETIRE UOPs retirement
FLOPS_AVX Packed AVX MFLOP/s
TLB_DATA L2 data TLB miss rate/ratio
CACHES Cache bandwidth in MBytes/s
CYCLE_ACTIVITY Cycle Activities
CLOCK Power and Energy consumption
L3 L3 cache bandwidth in MBytes/s
BRANCH Branch prediction miss rate/ratio
UOPS UOPs execution info
TLB_INSTR L1 Instruction TLB miss rate/ratio
RECOVERY Recovery duration
L2CACHE L2 cache miss rate/ratio
UOPS_ISSUE UOPs issueing
L2 L2 cache bandwidth in MBytes/s
ENERGY Power and Energy consumption
FALSE_SHARE False sharing
DATA Load to store ratio
L3CACHE L3 cache miss rate/ratio
ICACHE Instruction cache miss rate/ratio
UOPS_EXEC UOPs execution