Blog – Performance analysis, tools and experiments

A path to topdown counters for AMD Zen4

Posted on 2023-02-21 by mev

A recent Phoronix article described new performance counters being added to the Linux kernel. This seemed intriguing as a way to get analysis that I had previously only done on Intel. However, it required a few steps.

The first step was getting a Linux 6.2 kernel installed. The procedure was straightforward: I followed these instructions to install the “mainline” tool and then used its graphical UI to install and boot into the new kernel.

However, when I invoked perf after installing the kernel, I got:
WARNING: perf not found for kernel 6.2.0-060200

You may need to install the following packages for this specific kernel: linux-tools-6.2.0-060200-generic linux-cloud-tools-6.2.0-060200-generic

You may also want to install one of the following packages to keep up to date: linux-tools-generic linux-cloud-tools-generic
These packages were not available from the standard repository.

So the next step was finding and following instructions for downloading the kernel sources and building perf. This was straightforward and came down to running “make” in the tools/perf directory.

At this point, trying “perf stat -a --topdown” does not work yet, but if I do a “perf list”, I can see a set of metrics such as these:
backend_bound_group:
  backend_bound_cpu
       [Fraction of dispatch slots that remained unused because of stalls not related to the memory subsystem]
  backend_bound_memory
       [Fraction of dispatch slots that remained unused because of stalls due to the memory subsystem]

Looking at these a little further:
mev@sacramento:~/source/linux-6.2/tools/perf$ ./perf stat --event=backend_bound_cpu
event syntax error: 'backend_bound_cpu'
                     \___ parser error
Run 'perf list' for a list of valid events

 Usage: perf stat [<options>] [<command>]
I can’t select these directly as events because they seem to be metrics within a metric group, which I can see in more detail when I add the “-v --detail” options to perf list:
List of pre-defined events (to be used in -e or -M):

Metric Groups:

PipelineL2:
  backend_bound_cpu
       [backend_bound * (1 - d_ratio(ex_no_retire.load_not_complete, ex_no_retire.not_complete))]

backend_bound_group:
  backend_bound_cpu
       [backend_bound * (1 - d_ratio(ex_no_retire.load_not_complete, ex_no_retire.not_complete))]
However, using the “-M” option to perf stat will give me the underlying counters for this metric:
mev@sacramento:~/source/linux-6.2/tools/perf$ ./perf stat -M backend_bound_cpu ~/hello
hello world

Performance counter stats for '/home/mev/hello':

         1,234,099      ex_no_retire.load_not_complete            #      2.7 %  backend_bound_cpu
         5,366,366      de_no_dispatch_per_slot.backend_stalls
         1,336,714      ex_no_retire.not_complete
         2,551,447      ls_not_halted_cyc


0.001079579 seconds time elapsed

0.001128000 seconds user
0.000000000 seconds sys
So overall this was a success, and I believe I can now get these metrics from an AMD Zen4 core.
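As a sanity check, the 2.7 % figure can be reproduced by hand from the four counters in the perf stat output. The sketch below is my own reading of the metric formula shown by perf list, not code taken from perf. It assumes that backend_bound is the backend stall count divided by the total dispatch slots, and that Zen4 has 6 dispatch slots per unhalted cycle; both assumptions are consistent with the printed value but are not stated in the output itself.

```python
# Counter values copied from the perf stat output above.
counts = {
    "ex_no_retire.load_not_complete": 1_234_099,
    "de_no_dispatch_per_slot.backend_stalls": 5_366_366,
    "ex_no_retire.not_complete": 1_336_714,
    "ls_not_halted_cyc": 2_551_447,
}

# Assumption: 6 dispatch slots per unhalted cycle on Zen4.
DISPATCH_WIDTH = 6


def d_ratio(numerator, denominator):
    """Divide, but return 0 when the denominator is 0 (as perf's d_ratio does)."""
    return numerator / denominator if denominator else 0.0


def backend_bound_cpu(c):
    """backend_bound * (1 - d_ratio(load_not_complete, not_complete))."""
    total_slots = DISPATCH_WIDTH * c["ls_not_halted_cyc"]
    # Assumption: backend_bound = backend stalls / total dispatch slots.
    backend_bound = d_ratio(c["de_no_dispatch_per_slot.backend_stalls"], total_slots)
    # Share of the non-retiring cycles that were waiting on a load.
    memory_share = d_ratio(
        c["ex_no_retire.load_not_complete"], c["ex_no_retire.not_complete"]
    )
    return backend_bound * (1 - memory_share)


print(f"backend_bound_cpu = {100 * backend_bound_cpu(counts):.1f} %")
```

With these counts the script prints 2.7 %, matching the perf output.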

uProf CLI

Posted on 2023-01-31 by mev

A few examples trying the AMDuProfCLI command. This seems to go through a “collect” stage followed by a “report” stage.
There are some pre-defined profiles that can be seen with the “AMDuProfCLI info --list collect-configs” command:

mev@sacramento:~$ AMDuProfCLI info --list collect-configs
/opt/AMDuProf_4.0-341/bin/AMDuProfCLI

List of predefined profiles that can be used with 'collect --config' option:

  tbp          : Time-based Sampling
                 Use this configuration to identify where programs are spending time.

  inst_access  : Investigate Instruction Access
                 Use this configuration to find instruction fetches with poor L1 instruction 
                 cache locality and poor ITLB behavior.
                 [PMU Events: PMCx076, PMCx0C0, PMCx28F, PMCx18E, PMCx060, PMCx064, PMCx084, PMCx085, 
                              PMCx094]

  data_access  : Investigate Data Access
                 Use this configuration to find data access operations with poor L1 data 
                 cache locality and poor DTLB behavior.
                 [PMU Events: PMCx076, PMCx0C0, PMCx029, PMCx060, PMCx043, PMCx047, PMCx045]

  assess_ext   : Assess Performance (Extended)
                 Use this configuration for an overall assessment of performance and to 
                 find the potential issues for further investigation. This has additional 
                 events to monitor than the Assess Performance configuration.
                 [PMU Events: PMCx076, PMCx0C0, PMCx0C2, PMCx0C3, PMCx029, PMCx060, PMCx047, PMCx043, 
                              PMCx024, PMCx052, PMCx00E]

  memory       : Cache Analysis
                 Use this configuration to identify the false cache-line sharing issues. 
                 The profile data will be collected using IBS OP.

  branch       : Investigate Branching
                 Use this configuration to find poorly predicted branches and near returns.
                 [PMU Events: PMCx076, PMCx0C0, PMCx0C2, PMCx0C3, PMCx0C4, PMCx0C5, PMCx0C8, PMCx0C9, 
                              PMCx0CA]

  assess       : Assess Performance
                 Use this configuration to get an overall assessment of performance and 
                 to find potential issues for further investigation.
                 [PMU Events: PMCx076, PMCx0C0, PMCx0C2, PMCx0C3, PMCx029, PMCx060, PMCx043, PMCx047]

  ibs          : Instruction-based Sampling
                 Use this configuration to collect profile data using Instruction Based 
                 Sampling. Samples are attributed to instructions precisely with IBS.

  cpi          : Investigate CPI
                 Basic profile type to analyse the CPI and IPC metrics of the running application 
                 or the entire system.
                 [PMU Events: PMCx076, PMCx0C0]

Picking the “assess” configuration, we can next run this on stockfish. Collection needs to run as root. The “-o stockfish” option names the output directory for the profile.

mev@sacramento:~$ /opt/AMDuProf_4.0-341/bin/AMDuProfCLI collect -o stockfish --config assess phoronix-test-suite batch-run stockfish

The next step is to create a report from the saved profile information. We point the report command at the saved profile directory with the “-i” option:

mev@sacramento:~$ sudo /opt/AMDuProf_4.0-341/bin/AMDuProfCLI report -i stockfish/AMDuProf-phoronix-test-suite-EBP_Jan-31-2023_17-31-15/ --report-output /home/mev/stockfish_out
/opt/AMDuProf_4.0-341/bin/AMDuProfCLI
Report generation started...
Generating report file...

Report generation completed...

Generated report file: /home/mev/stockfish_out/report.csv

Unfortunately, the output report.csv file seems to tell me what is to be measured but doesn’t contain any actual measurements:

mev@sacramento:~$ more stockfish_out/report.csv 

"AMD uProf (Version:4.0.341.0)"
PERFORMANCE ANALYSIS REPORT

EXECUTION
Target Path:,"phoronix-test-suite"
Command Line Arguments:,"batch-run stockfish "
Working Directory:,"/home/mev"
Environment Variables:
CPU Details:,"Family(0x19), Model(0x61), Number of Cores(32)"
Operating System:,"LinuxUbuntu 22.04.1 LTS-64 Kernel:5.18.13-051813-generic"

PROFILE DETAILS
Profile Session Type:,"Assess Performance"
Profile Scope:,"Single Application"
CPU Mask:,"0-31"
CPU Affinity Mask:,"0-31"
Profile Start Time:,"Tue Jan 31 17:31:15 2023"
Profile End Time:,"Tue Jan 31 17:35:53 2023"
Profile Duration:,"277.888 seconds"
Data Folder:,"/home/mev/stockfish/AMDuProf-phoronix-test-suite-EBP_Jan-31-2023_17-31-15"
Virtual Machine:,"No"
Call Stack Sampling:,"False"

MONITORED EVENTS
PMC Events:,Name,Interval,Unitmask,Countmask,Invert Countmask,User,OS,Description
,"CYCLES_NOT_IN_HALT (PMCx076)",250000,0x00,0x00,False,True,True,"The number of cpu cycles when the thread is not in halt state."
,"RETIRED_INST (PMCx0C0)",250000,0x00,0x00,False,True,True,"The number of instructions retired from execution. This count includes exceptions and interrupts. Each exception or interrupt is counted as one instruction."
,"RETIRED_BR_INST (PMCx0C2)",25000,0x00,0x00,False,True,True,"The number of branch instructions retired. This includes all types of architectural control flow changes, including exceptions and interrupts."
,"RETIRED_BR_INST_MISP (PMCx0C3)",25000,0x00,0x00,False,True,True,"The number of retired branch instructions, that were mispredicted.Note that only EX direct mispredicts and indirect target mispredicts are counted."
,"MISALIGNED_LOADS (PMCx047)",25000,0x03,0x00,False,True,True,"The number of misaligned loads. This event counts the 64B (cacheline crossing) and 4K (page crossing) misaligned loads."
,"L1_DC_ACCESSES_ALL (PMCx029)",250000,0x07,0x00,False,True,True,"The number of load and store ops dispatched to LS unit. This counts the dispatch of single op that performs a memory load, dispatch of single op that performs a memory store, dispatch of a single op that performs a load from and store to the same memory address."
,"L1_DEMAND_DC_REFILLS_LOCAL (PMCx043)",25000,0x0F,0x00,False,True,True,"The demand Data Cache fills from L2, L3, CCX and DRAM."
,"L2_CACHE_ACCESS_FROM_L1_DC_MISS (PMCx060)",25000,0xE8,0x00,False,True,True,"The L2 cache access requests due to L1 data cache misses. This also counts hardware and software prefetches"

Overall, I might still be missing something, but I am not finding this tool useful yet.
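To see quickly which events a report says were configured, the MONITORED EVENTS section can be pulled out with a few lines of Python. This is only a sketch based on the layout of the one report.csv shown above: it assumes event rows start with an empty first column followed by the name and the sampling interval, and that a blank line ends a section. The function name `monitored_events` and the shortened sample text are mine. The csv module is used because some description fields are quoted and span more than one line.

```python
import csv
import io

# Shortened excerpt in the same layout as the report.csv shown above.
sample = '''MONITORED EVENTS
PMC Events:,Name,Interval,Unitmask,Countmask,Invert Countmask,User,OS,Description
,"CYCLES_NOT_IN_HALT (PMCx076)",250000,0x00,0x00,False,True,True,"The number of cpu cycles when the thread is not in halt state."
,"RETIRED_BR_INST (PMCx0C2)",25000,0x00,0x00,False,True,True,"The number of branch instructions retired.
            "
'''


def monitored_events(report_text):
    """Return (event name, sampling interval) pairs from MONITORED EVENTS."""
    events = []
    in_section = False
    for row in csv.reader(io.StringIO(report_text)):
        if not row:
            # Assumption: a blank line ends the current section.
            in_section = False
        elif row[0] == "MONITORED EVENTS":
            in_section = True
        elif in_section and row[0] == "" and len(row) > 2:
            # Event rows have an empty first column, then name and interval.
            events.append((row[1], int(row[2])))
    return events


for name, interval in monitored_events(sample):
    print(f"{name}: one sample every {interval} events")
```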

uprof – profiler

Posted on 2023-01-31 by mev

There is now a profiling tool named uProf for AMD CPUs/GPUs.

Looking at the features I see a number of interest:

Support for the newest Zen4 CPUs, which have added topdown performance counters. So with a Linux 6.2 kernel, I should be able to do analysis similar to what I did earlier on Intel CPUs.
A set of additional combinations of counters that come together as settings.
Power profiling
Instruction-based sampling
Stack charts
GPU counters and measurements

There is a Debian package with a click-through license. The package told me to also pick up the BCC library to support OS tracing, so I added that too by installing the “bpftrace” package. After that, everything is installed in /opt/AMDuProf_4.0-341. Invoking the AMDuProf command brings up a screen to accept the license agreement, followed by the GUI. The uProf User Guide has ~200 pages of further information.
Here is what invoking the CLI looks like:
/opt/AMDuProf_4.0-341/bin/AMDuProfCLI

AMDuProfCLI is a command-line tool for AMD uProf Profiler.

Usage: AMDuProfCLI [--version] [--help] COMMAND [] []

Following are the supported COMMANDs:
  collect     Run the given program and collects the profile samples.
  timechart   Collects the system characteristics like power, thermal and frequency.
  report      Process the profile-data file and generates the profile report.
  translate   Process the raw profile-data files and save those into database files.
  info        Displays generic information about system, CPU etc.

PROGRAM The launch application to be profiled.

ARGS The list of arguments for the launch application.

Run 'AMDuProfCLI COMMAND -h' for more information on a specific command.
It will be interesting to explore some of these features and describe how well they work.
Update: Unfortunately, trying to run AMDuProfPcm on a Ryzen 7950x part results in an error message:
AMDuProfPcm -m memory -A system -o td.csv -- phoronix-test-suite batch-run stockfish
Missing configuration file - unsupported processor model.
A further web search suggests the tool isn’t supported on client parts, only on EPYC server parts (problem report).

Ryzen 1950x vs Ryzen 3950x

Posted on 2020-03-05 by mev (2020-03-29)

This blog post provides a comparison of my Ryzen 1950x (Threadripper) and Ryzen 3950x (desktop) CPUs.

Table elements below come from a mixture of wikichip and direct measurements I’ve made with lmbench, STREAM and the Phoronix Test Suite. Given the specs, I’m surprised the benchmarks show such a large change. I wonder whether my Ryzen 1950x is properly configured or whether there is another reason.

Item	Ryzen Threadripper 1950x	Ryzen 3950x	Ryzen 1700x	Notes
Cores	16	16	8
Threads	32	32	16
Base/Boost Clock	3.4 GHz / 4.0 GHz	3.5 GHz / 4.7 GHz	3.4 GHz / 3.8 GHz	Faster boost, expect higher single-threaded performance.
TDP	180W	105W	95W
Memory	2667 MHz DDR4; 4 memory channels; 79.47 GiB/s	2400 MHz DDR4; 2 memory channels; 47.68 GiB/s	2400 MHz DDR4; 2 memory channels; 39.74 GiB/s	Faster memory, expect memory-bound latency to be slightly faster. Fewer memory controllers and memory bandwidth. Check STREAM performance.
Core	Zen	Zen2	Zen
Cache	16 x 64 KiB L1I, 4-way; 16 x 32 KiB L1D, 8-way; 16 x 512 KiB L2, 8-way; 4 x 8 MiB L3	16 x 32 KiB L1I, 8-way; 16 x 32 KiB L1D, 8-way; 16 x 512 KiB L2, 8-way; 4 x 16 MiB L3	8 x 64 KiB L1I, 4-way; 8 x 32 KiB L1D, 8-way; 2 x 8 MiB L3	Less L1I and more L3. Compare across benchmarks.
lmbench	L1 - 4 cycles; L2 - 10 cycles; L3 - 16 cycles; memory - 150 cycles	L1 - 4 cycles; L2 - 10 cycles; L3 - 17 cycles; memory - 113 cycles	L1 - 4 cycles; L2 - 11 cycles; L3 - 17 cycles; memory - 100 cycles
pts/rodinia OpenMP LavaMD	47.27 seconds	38.67 seconds	102.676 seconds
pts/rodinia OpenMP CFD solver	15.082 seconds	12.004 seconds	32.502 seconds
pts/namd	1.41998 days/ns	1.13749 days/ns	2.87945 days/ns
pts/x264	124.37 frames/second	149.44 frames/second	60.71 frames/second
pts/x265	34.70 frames/second	54.76 frames/second	7.06 frames/second
pts/compress-7zip	64379 MIPS	98677 MIPS	31465 MIPS
pts/stockfish	37267071 nodes/second	51242651 nodes/second	18967192 nodes/second
pts/asmfish	34856084 nodes/second	51295444 nodes/second	19168181 nodes/second
pts/gcc compile	978.643 seconds	692.402 seconds	1294.157 seconds
pts/linux kernel compile	50.663 seconds	34.973 seconds	90.695 seconds
pts/povray	32.672 seconds	24.128 seconds	64.075 seconds
pts/radiance Serial	813.706 seconds	588.438 seconds	878.193 seconds
pts/radiance SMP parallel	260.705 seconds	186.469 seconds	319.56 seconds
pts/openssl	3065.2 signs/second	4740.1 signs/second	1368.8 signs/second
pts/ctx-clock	170 clocks	175 clocks	150 clocks
pts/sysbench	30970.0407 events/second	34976.4773 events/second	13378.7779 events/second
pts/blender barbershop	767.63 seconds	532.7 seconds	1472.21 seconds
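To put a number on how large the change is, the table rows can be turned into speedup ratios of the 3950x over the 1950x. The sketch below copies a handful of rows from the table. The higher-is-better flag for each row is my reading of the units (elapsed seconds are better when lower, rates such as frames/second or nodes/second are better when higher), and the geometric mean is just one reasonable way to summarize ratios.

```python
from statistics import geometric_mean

# (benchmark, 1950x result, 3950x result, higher_is_better)
# Values copied from the table above.
results = [
    ("pts/rodinia OpenMP LavaMD", 47.27, 38.67, False),
    ("pts/x264", 124.37, 149.44, True),
    ("pts/x265", 34.70, 54.76, True),
    ("pts/compress-7zip", 64379, 98677, True),
    ("pts/stockfish", 37267071, 51242651, True),
    ("pts/linux kernel compile", 50.663, 34.973, False),
    ("pts/blender barbershop", 767.63, 532.7, False),
]


def speedup(old, new, higher_is_better):
    """Ratio > 1 means the new system is faster."""
    return new / old if higher_is_better else old / new


speedups = {name: speedup(old, new, hib) for name, old, new, hib in results}

for name, ratio in speedups.items():
    print(f"{name:28s} {ratio:.2f}x")
print(f"{'geometric mean':28s} {geometric_mean(speedups.values()):.2f}x")
```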

haswell system freezes

Posted on 2018-08-04 by mev

Not sure what is causing it, but my Haswell system has started to freeze up when running “wspy --config topdown.config”. This started happening after I updated the system and resumed running benchmarks after a month on the road.

Some additional diagnosis and items I’ve tried:

Observed that the hangs also happened in a debugger running single-step so investigated how many single-steps before it hung. That may have been a false lead, as the single step in middle of fopen(3C) suddenly jumps ahead. However, along the way, tightened up my strtok() calls to be strtok_r() to make sure nothing strange was happening with recursive open_config_file()/parse_command_line calls. These were set up to be tail-recursive, so shouldn’t matter but cleaned up anyways.
Next observed that failure seemed to happen in setting up performance counters. Created small test program that made the same performance counters, and it didn’t hang.
Looked through logs in /var/log and didn’t see any smoking guns. The kernel completely locks up, even for other logged in processes – so even if the program is faulty, there is vulnerability to locking the kernel.
Further analysis started with looking at grub to boot an older kernel. In the process, I uncovered that I was running under the Xen hypervisor. Booting on bare metal fixed the problem. I’ve added a check to my 123.sh script that calls wspy. Not sure why this hung the system, but I don’t have virtual counters enabled, so performance counters shouldn’t work under the hypervisor anyway. TODO item to look at more robust error detection in wspy to avoid stumbling into this again.

Conclusion: running on bare metal fixed the issue.
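The kind of check mentioned above can be sketched as follows. This is not the actual check from 123.sh or wspy, just one way to do it, under the assumptions that a Xen system exposes its name in /sys/hypervisor/type and that guests commonly show a "hypervisor" flag in /proc/cpuinfo. The function name `hypervisor_type` is hypothetical.

```python
from pathlib import Path


def hypervisor_type(sys_path="/sys/hypervisor/type", cpuinfo_path="/proc/cpuinfo"):
    """Return the hypervisor name, "unknown" if only the CPU flag is set,
    or None when nothing suggests a hypervisor."""
    sys_file = Path(sys_path)
    if sys_file.is_file():
        name = sys_file.read_text().strip()
        if name:
            return name  # for example "xen"
    cpuinfo = Path(cpuinfo_path)
    if cpuinfo.is_file():
        for line in cpuinfo.read_text().splitlines():
            # The flags line looks like "flags : fpu vme ... hypervisor ...".
            if line.startswith("flags"):
                if "hypervisor" in line.split(":", 1)[-1].split():
                    return "unknown"
    return None


found = hypervisor_type()
if found:
    print(f"warning: running under a hypervisor ({found}); skipping counter setup")
else:
    print("no hypervisor detected")
```

A wrapper script could refuse to start the profiling run whenever the function returns something other than None.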

Phoronix article – benchmarks of high-end Intel/AMD desktops

Posted on 2018-08-03 by mev

Phoronix posted an article comparing Intel and AMD desktops on the Linux 4.18 kernel. The article says 100+ benchmarks were measured, though only half a dozen are displayed as part of the article.

I haven’t run these benchmarks on 4.18, but I can look at my earlier analysis to see what is measured. This posting summarizes the Phoronix conclusions as well as my observations of the benchmarks. It also looks like an opportunity to look at a few new benchmarks. These are described in the table below.

Benchmark	Phoronix observations	My observations	Analysis
indigobench	Ryzen & Threadripper faster than i7 and slower than i9 platforms	On_CPU of 97% with an IPC of 0.65. Many backend stalls and L2/L3 cache misses.	Analysis
hpcc	Threadripper fastest, i9 next followed by Ryzen 7 2700 and Core i7.	Requires specific variables during install, still need to figure these out.
compress-p7zip	i9 fastest followed by threadripper. Ryzen 7 2700 similar to i7.	On_CPU 88% with some I/O to limit scaling. IPC 0.83 with 27% speculation misses (branch prediction).	Analysis
build-linux-kernel	i9 fastest, threadripper close, i7 slowest.	On_CPU 88%, mostly parallel compiles with a sequential period at end. High frontend stalls. # processes less in subsequent runs so might not do thorough "clean".	Analysis
c-ray	Threadripper fastest, i9 next and i7 slowest.	On_CPU almost 100% with moderately high IPC of 1.44. Frontend stalls of 10% and backend of 15%.	Analysis
octave-benchmark	i7 fastest, Ryzen 7 next and i9 after that.	Single-threaded with On_Core of 100%. Six workloads varying slightly but including backend memory stalls.	Analysis
v-ray	i9 fastest, threadripper/ryzen next and i7 slowest.	Installation instructions point to site to register and download the benchmark to place in download cache. Even following these steps had difficulty getting it installed.

TODO list at end of June

Posted on 2018-06-30 by mev (2018-06-25)

As June is coming to a close, it is useful to take stock of what is completed and what still remains.

During June, the following were done:

Phoronix benchmark list: I finished going through ~120 Phoronix benchmarks to at least do a “topdown” run. Approximately 60 have further “analysis” pages. Most of this was done by end of May, but finished the last at start of the month. As a result, when benchmark articles are posted, most of these I’ve already looked at and it is quicker to update the analysis.
Phoronix articles: looked at articles on OS comparisons, CPU comparison and hyper-threading. Updated article based on previous analysis. Hyper-threading was most interesting, showing these smaller benchmarks all benefited unless there was obvious cause, e.g. limited thread scaling. Skipped over some OS-specific articles as I’ve looked at the benchmarks and not sure much more to add.
Installed and analyzed both gromacs and OpenFOAM applications. Nice to see that tools created based on smaller benchmarks can work here.
Added support to wspy for --memstats. This periodically samples /proc/meminfo and creates metrics. Useful for OpenFOAM.
Looked further at OpenSSL differences between AMD and Intel and suggested perhaps MULX instructions were related.
Looked at topdown metrics for AMD, but not much traction here.
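For the --memstats item above, the general idea of sampling /proc/meminfo can be sketched like this. It is not the wspy implementation, only an illustration of the technique. The function names, the sample text and the derived "used" value are my own choices.

```python
import time


def parse_meminfo(text):
    """Map each /proc/meminfo field name to its integer value.

    Most fields are reported in kB; a few (such as HugePages_Total)
    are plain counts.
    """
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        fields = rest.split()
        if sep and fields:
            values[name.strip()] = int(fields[0])
    return values


def sample_meminfo(count=3, interval=1.0, path="/proc/meminfo"):
    """Read and parse the file `count` times, `interval` seconds apart."""
    samples = []
    for i in range(count):
        with open(path) as f:
            samples.append(parse_meminfo(f.read()))
        if i + 1 < count:
            time.sleep(interval)
    return samples


# Small made-up excerpt in the /proc/meminfo layout, so the example
# also runs on systems without that file.
sample_text = (
    "MemTotal:       65798748 kB\n"
    "MemFree:        12345678 kB\n"
    "HugePages_Total:       0\n"
)
info = parse_meminfo(sample_text)
print("used kB:", info["MemTotal"] - info["MemFree"])
```

On a Linux system, `sample_meminfo(count=5, interval=0.5)` would return five such dictionaries that can be turned into per-interval metrics.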

This leaves several areas for further emphasis in the future(*):

Add additional “real world” codes. Top candidates are wrf and namd.
Keep up with incremental phoronix articles as they are published.
Look at Ryzen to create better “topdown” quick tool, e.g. add cache miss rates. It might become more of an overall tool than top down.
Add ARMv8 architecture examples.
Cleanups: take care of nmi timer, add “about this graph”, review test next steps
Implement --netstats, the one remaining “stats” feature. However, I don’t have a motivating case yet.
Look at tools/techniques beyond current measurements, e.g. microbenchmark measurements similar to Agner’s scripts?

I have some extended cycle touring scheduled in July, so it may be a slower month overall. However, the tools and analysis have also reached a general level of maturity, so the remaining work is more about rounding out the edges.
