Performance analysis, tools and experiments

An eclectic collection

A path to topdown counters for AMD Zen4

Posted on 2023-02-21 by mev

A recent Phoronix article described new performance counters being added to the Linux kernel. This seemed intriguing to me as a way to get analysis that I had previously only done on Intel; however, it required a few steps.

The first step was getting a Linux 6.2 kernel installed. The procedure was straightforward: I followed these instructions to install the “mainline” tool and then used its graphical UI to install and boot into a new kernel.

However, after installing the kernel, when I invoked perf I got:

WARNING: perf not found for kernel 6.2.0-060200

You may need to install the following packages for this specific kernel:
linux-tools-6.2.0-060200-generic
linux-cloud-tools-6.2.0-060200-generic


You may also want to install one of the following packages to keep up to date:
linux-tools-generic
linux-cloud-tools-generic

and these packages were not available from the standard repository.

So the next step was finding and following instructions for downloading kernel sources and building perf. This was straightforward and came down to running “make” in the tools/perf directory.

At this point, trying “perf stat -a --topdown” does not work yet, but if I do a “perf list”, I can see a set of metrics such as these:

backend_bound_group:
backend_bound_cpu
[Fraction of dispatch slots that remained unused because of stalls not related to the memory subsystem]
backend_bound_memory
[Fraction of dispatch slots that remained unused because of stalls due to the memory subsystem]

Looking at these a little further:

mev@sacramento:~/source/linux-6.2/tools/perf$ ./perf stat --event=backend_bound_cpu
event syntax error: 'backend_bound_cpu'
\___ parser error
Run 'perf list' for a list of valid events


Usage: perf stat [<options>] [<command>]

I can’t pick these directly with --event because they are metrics that belong to a metric group rather than raw events. This shows up when I add the -v and --detail options to perf list:

List of pre-defined events (to be used in -e or -M):

Metric Groups:

PipelineL2:
backend_bound_cpu
[backend_bound * (1 - d_ratio(ex_no_retire.load_not_complete, ex_no_retire.not_complete))]


backend_bound_group:
backend_bound_cpu
[backend_bound * (1 - d_ratio(ex_no_retire.load_not_complete, ex_no_retire.not_complete))]

However, using the “-M” option to perf stat will give me the underlying counters for this metric:

mev@sacramento:~/source/linux-6.2/tools/perf$ ./perf stat -M backend_bound_cpu ~/hello
hello world

Performance counter stats for '/home/mev/hello':

1,234,099 ex_no_retire.load_not_complete # 2.7 % backend_bound_cpu
5,366,366 de_no_dispatch_per_slot.backend_stalls
1,336,714 ex_no_retire.not_complete
2,551,447 ls_not_halted_cyc

0.001079579 seconds time elapsed

0.001128000 seconds user
0.000000000 seconds sys

So success overall, and I believe I can now get these metrics from an AMD Zen4 core.
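As a cross-check, the 2.7 % figure can be recomputed from the four raw counts in the perf stat output above. This is a minimal sketch, not code taken from perf. It applies the backend_bound_cpu expression shown by perf list, and it assumes that backend_bound is de_no_dispatch_per_slot.backend_stalls divided by the total dispatch slots, with 6 dispatch slots per unhalted cycle. That definition is not shown in the listing above, so treat it as an assumption to verify against the Zen4 metric definitions in the kernel tree.

```python
# Raw counts copied from the perf stat output above.
counts = {
    "ex_no_retire.load_not_complete": 1_234_099,
    "de_no_dispatch_per_slot.backend_stalls": 5_366_366,
    "ex_no_retire.not_complete": 1_336_714,
    "ls_not_halted_cyc": 2_551_447,
}

# Assumption: the metric treats Zen4 as having 6 dispatch slots per cycle.
DISPATCH_SLOTS_PER_CYCLE = 6


def d_ratio(numerator, denominator):
    """Divide, but return 0 when the denominator is 0 (mirrors perf's d_ratio)."""
    return numerator / denominator if denominator else 0.0


def backend_bound_cpu(c):
    """backend_bound * (1 - d_ratio(load_not_complete, not_complete))."""
    total_slots = DISPATCH_SLOTS_PER_CYCLE * c["ls_not_halted_cyc"]
    # Assumed definition of backend_bound: backend stalls over all dispatch slots.
    backend_bound = d_ratio(c["de_no_dispatch_per_slot.backend_stalls"], total_slots)
    memory_share = d_ratio(
        c["ex_no_retire.load_not_complete"], c["ex_no_retire.not_complete"]
    )
    return backend_bound * (1 - memory_share)


if __name__ == "__main__":
    print(f"backend_bound_cpu = {100 * backend_bound_cpu(counts):.1f} %")
```

With these inputs the result rounds to 2.7 %, which agrees with the value perf printed, so the assumed definition is at least consistent with this one run.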

Posted in tools | Tagged top-down

uProf CLI

Posted on 2023-01-31 by mev

A few examples trying the AMDuProfCLI command. This seems to go through a “collect” stage followed by a “report” stage.
There are some pre-defined profiles that can be seen with the “AMDuProfCLI info --list collect-configs” command:

mev@sacramento:~$ AMDuProfCLI info --list collect-configs
/opt/AMDuProf_4.0-341/bin/AMDuProfCLI

List of predefined profiles that can be used with 'collect --config' option:

  tbp          : Time-based Sampling
                 Use this configuration to identify where programs are spending time.

  inst_access  : Investigate Instruction Access
                 Use this configuration to find instruction fetches with poor L1 instruction 
                 cache locality and poor ITLB behavior.
                 [PMU Events: PMCx076, PMCx0C0, PMCx28F, PMCx18E, PMCx060, PMCx064, PMCx084, PMCx085, 
                              PMCx094]

  data_access  : Investigate Data Access
                 Use this configuration to find data access operations with poor L1 data 
                 cache locality and poor DTLB behavior.
                 [PMU Events: PMCx076, PMCx0C0, PMCx029, PMCx060, PMCx043, PMCx047, PMCx045]

  assess_ext   : Assess Performance (Extended)
                 Use this configuration for an overall assessment of performance and to 
                 find the potential issues for further investigation. This has additional 
                 events to monitor than the Assess Performance configuration.
                 [PMU Events: PMCx076, PMCx0C0, PMCx0C2, PMCx0C3, PMCx029, PMCx060, PMCx047, PMCx043, 
                              PMCx024, PMCx052, PMCx00E]

  memory       : Cache Analysis
                 Use this configuration to identify the false cache-line sharing issues. 
                 The profile data will be collected using IBS OP.

  branch       : Investigate Branching
                 Use this configuration to find poorly predicted branches and near returns.
                 [PMU Events: PMCx076, PMCx0C0, PMCx0C2, PMCx0C3, PMCx0C4, PMCx0C5, PMCx0C8, PMCx0C9, 
                              PMCx0CA]

  assess       : Assess Performance
                 Use this configuration to get an overall assessment of performance and 
                 to find potential issues for further investigation.
                 [PMU Events: PMCx076, PMCx0C0, PMCx0C2, PMCx0C3, PMCx029, PMCx060, PMCx043, PMCx047]

  ibs          : Instruction-based Sampling
                 Use this configuration to collect profile data using Instruction Based 
                 Sampling. Samples are attributed to instructions precisely with IBS.

  cpi          : Investigate CPI
                 Basic profile type to analyse the CPI and IPC metrics of the running application 
                 or the entire system.
                 [PMU Events: PMCx076, PMCx0C0]

Picking the “assess” configuration, we can next run this on stockfish. This needs to run as root to collect information. The “-o stockfish” option gives an output directory for the profile.

mev@sacramento:~$ /opt/AMDuProf_4.0-341/bin/AMDuProfCLI collect -o stockfish --config assess phoronix-test-suite batch-run stockfish

The next step is to create a report from the saved profile information. We point the report command at the saved profile directory with the “-i” option:

mev@sacramento:~$ sudo /opt/AMDuProf_4.0-341/bin/AMDuProfCLI report -i stockfish/AMDuProf-phoronix-test-suite-EBP_Jan-31-2023_17-31-15/ --report-output /home/mev/stockfish_out
/opt/AMDuProf_4.0-341/bin/AMDuProfCLI
Report generation started...
Generating report file...

Report generation completed...

Generated report file: /home/mev/stockfish_out/report.csv

Unfortunately, the output report.csv file seems to tell me what is to be measured but doesn’t contain actual measurements:

mev@sacramento:~$ more stockfish_out/report.csv 

"AMD uProf (Version:4.0.341.0)"
PERFORMANCE ANALYSIS REPORT

EXECUTION
Target Path:,"phoronix-test-suite"
Command Line Arguments:,"batch-run stockfish "
Working Directory:,"/home/mev"
Environment Variables:
CPU Details:,"Family(0x19), Model(0x61), Number of Cores(32)"
Operating System:,"LinuxUbuntu 22.04.1 LTS-64 Kernel:5.18.13-051813-generic"

PROFILE DETAILS
Profile Session Type:,"Assess Performance"
Profile Scope:,"Single Application"
CPU Mask:,"0-31"
CPU Affinity Mask:,"0-31"
Profile Start Time:,"Tue Jan 31 17:31:15 2023"
Profile End Time:,"Tue Jan 31 17:35:53 2023"
Profile Duration:,"277.888 seconds"
Data Folder:,"/home/mev/stockfish/AMDuProf-phoronix-test-suite-EBP_Jan-31-2023_17-31-15"
Virtual Machine:,"No"
Call Stack Sampling:,"False"

MONITORED EVENTS
PMC Events:,Name,Interval,Unitmask,Countmask,Invert Countmask,User,OS,Description
,"CYCLES_NOT_IN_HALT (PMCx076)",250000,0x00,0x00,False,True,True,"The number of cpu cycles when the thread is not in halt state."
,"RETIRED_INST (PMCx0C0)",250000,0x00,0x00,False,True,True,"The number of instructions retired from execution. This count includes exceptions and interrupts. Each exception or interrupt is counted as one instruction."
,"RETIRED_BR_INST (PMCx0C2)",25000,0x00,0x00,False,True,True,"The number of branch instructions retired. This includes all types of architectural control flow changes, including exceptions and interrupts."
,"RETIRED_BR_INST_MISP (PMCx0C3)",25000,0x00,0x00,False,True,True,"The number of retired branch instructions, that were mispredicted.Note that only EX direct mispredicts and indirect target mispredicts are counted."
,"MISALIGNED_LOADS (PMCx047)",25000,0x03,0x00,False,True,True,"The number of misaligned loads. This event counts the 64B (cacheline crossing) and 4K (page crossing) misaligned loads."
,"L1_DC_ACCESSES_ALL (PMCx029)",250000,0x07,0x00,False,True,True,"The number of load and store ops dispatched to LS unit. This counts the dispatch of single op that performs a memory load, dispatch of single op that performs a memory store, dispatch of a single op that performs a load from and store to the same memory address."
,"L1_DEMAND_DC_REFILLS_LOCAL (PMCx043)",25000,0x0F,0x00,False,True,True,"The demand Data Cache fills from L2, L3, CCX and DRAM."
,"L2_CACHE_ACCESS_FROM_L1_DC_MISS (PMCx060)",25000,0xE8,0x00,False,True,True,"The L2 cache access requests due to L1 data cache misses. This also counts hardware and software prefetches"

Overall, might still be missing something but not finding this tool useful yet.
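Even without measurements, the MONITORED EVENTS section of report.csv is regular enough to read back programmatically, for example to confirm which events and sampling intervals a configuration used. The following is a minimal sketch using only Python's standard csv module. The function name monitored_events is hypothetical, and the embedded sample is an abbreviated copy of two rows from the report above (descriptions shortened), not a full report.

```python
import csv
import io

# Abbreviated sample in the same layout as the MONITORED EVENTS section above.
SAMPLE = '''MONITORED EVENTS
PMC Events:,Name,Interval,Unitmask,Countmask,Invert Countmask,User,OS,Description
,"CYCLES_NOT_IN_HALT (PMCx076)",250000,0x00,0x00,False,True,True,"The number of cpu cycles when the thread is not in halt state."
,"RETIRED_BR_INST (PMCx0C2)",25000,0x00,0x00,False,True,True,"The number of branch instructions retired."
'''


def monitored_events(text):
    """Return one dict per PMC event row that follows the 'PMC Events:' header."""
    header = None
    events = []
    for row in csv.reader(io.StringIO(text)):
        if row and row[0] == "PMC Events:":
            header = row[1:]  # column names follow the section label
            continue
        if header is not None:
            # Event rows start with an empty first cell; anything else ends the section.
            if not row or row[0] != "":
                break
            events.append(dict(zip(header, row[1:])))
    return events


if __name__ == "__main__":
    for event in monitored_events(SAMPLE):
        print(event["Name"], "interval", int(event["Interval"]))
```

Because csv.reader handles quoted fields that span several lines, the same loop should also cope with descriptions that contain embedded newlines.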

Posted in tools | Tagged uprof

uprof – profiler

Posted on 2023-01-31 by mev

There is now a profiling tool named uprof for AMD CPUs/GPUs.

Looking at the features I see a number of interest:

  • Support for the newest Zen4 CPUs, which have added topdown performance counters. So with a Linux 6.2 kernel, I should be able to do similar analysis as on earlier Intel CPUs.
  • A set of additional combinations of counters that come together as settings.
  • Power profiling
  • Instruction-based sampling
  • Stack charts
  • GPU counters and measurements

There is a Debian package with a click-through license. The package told me to also pick up the BCC library to support OS tracing, so I added that too by installing the “bpftrace” package. After that, everything is installed in /opt/AMDuProf_4.0-341. Invoking the AMDuProf command brings up a screen to accept the license agreement, followed by the GUI. The uProf User Guide has ~200 pages of further information.
Here is what invoking the CLI looks like:

/opt/AMDuProf_4.0-341/bin/AMDuProfCLI

AMDuProfCLI is a command-line tool for AMD uProf Profiler.

Usage: AMDuProfCLI [--version] [--help] COMMAND [] []

Following are the supported COMMANDs:
collect Run the given program and collects the profile samples.
timechart Collects the system characteristics like power, thermal and frequency.
report Process the profile-data file and generates the profile report.
translate Process the raw profile-data files and save those into database files.
info Displays generic information about system, CPU etc.

PROGRAM
The launch application to be profiled.

ARGS
The list of arguments for the launch application.


Run 'AMDuProfCLI COMMAND -h' for more information on a specific command.

It will be interesting to explore some of these features and describe how well they work.
Update: Unfortunately, trying to run AMDuProfPcm on a Ryzen 7950x part results in an error message:

AMDuProfPcm -m memory -A system -o td.csv -- phoronix-test-suite batch-run stockfish
Missing configuration file - unsupported processor model.

A further web search suggests the tool isn’t supported on client parts, only on EPYC server parts (see the problem report).

Posted in tools | Tagged uprof

Ryzen 1950x vs Ryzen 3950x

Posted on 2020-03-05 by mev (2020-03-29)

This blog post provides a comparison of my Ryzen 1950x (Threadripper) and Ryzen 3950x (Desktop) CPUs.

Table elements below come from a mixture of wikichip and direct measurements I’ve made with lmbench, STREAM and Phoronix Test Suite.  Given the specs, I’m surprised the benchmarks show as large of a change.  Wondering if my Ryzen 1950x is properly configured or if there is another reason.

Item | Ryzen Threadripper 1950x | Ryzen 3950x | Ryzen 1700x | Notes
Cores | 16 | 16 | 8 |
Threads | 32 | 32 | 16 |
Base/Boost Clock | 3.4 GHz / 4.0 GHz | 3.5 GHz / 4.7 GHz | 3.4 GHz / 3.8 GHz | Faster boost, expect higher single-threaded performance.
TDP | 180W | 105W | 95W |
Memory | 2667 MHz DDR4; 4 memory channels; 79.47 GiB/s | 2400 MHz DDR4; 2 memory channels; 47.68 GiB/s | 2400 MHz DDR4; 2 memory channels; 39.74 GiB/s | Faster memory, expect memory-bound latency to be slightly faster. Fewer memory controllers and memory bandwidth. Check STREAM performance.
Core | Zen | Zen2 | Zen |
Cache | 16 x 64 KiB L1I, 4-way; 16 x 32 KiB L1D, 8-way; 16 x 512 KiB L2, 8-way; 4 x 8 MiB L3 | 16 x 32 KiB L1I, 8-way; 16 x 32 KiB L1D, 8-way; 16 x 512 KiB L2, 8-way; 4 x 16 MiB L3 | 8 x 64 KiB L1I, 4-way; 8 x 32 KiB L1D, 8-way; 2 x 8 MiB L3 | Less L1i and more L3. Compare across benchmarks.
lmbench | L1 - 4 cycles; L2 - 10 cycles; L3 - 16 cycles; memory - 150 cycles | L1 - 4 cycles; L2 - 10 cycles; L3 - 17 cycles; memory - 113 cycles | L1 - 4 cycles; L2 - 11 cycles; L3 - 17 cycles; memory - 100 cycles |
pts/rodinia OpenMP LavaMD | 47.27 seconds | 38.67 seconds | 102.676 seconds |
pts/rodinia OpenMP CFD solver | 15.082 seconds | 12.004 seconds | 32.502 seconds |
pts/namd | 1.41998 days/ns | 1.13749 days/ns | 2.87945 days/ns |
pts/x264 | 124.37 frames/second | 149.44 frames/second | 60.71 frames/second |
pts/x265 | 34.70 frames/second | 54.76 frames/second | 7.06 frames/second |
pts/compress-7zip | 64379 MIPS | 98677 MIPS | 31465 MIPS |
pts/stockfish | 37267071 nodes/second | 51242651 nodes/second | 18967192 nodes/second |
pts/asmfish | 34856084 nodes/second | 51295444 nodes/second | 19168181 nodes/second |
pts/gcc compile | 978.643 seconds | 692.402 seconds | 1294.157 seconds |
pts/linux kernel compile | 50.663 seconds | 34.973 seconds | 90.695 seconds |
pts/povray | 32.672 seconds | 24.128 seconds | 64.075 seconds |
pts/radiance Serial | 813.706 seconds | 588.438 seconds | 878.193 seconds |
pts/radiance SMP parallel | 260.705 seconds | 186.469 seconds | 319.56 seconds |
pts/openssl | 3065.2 signs/second | 4740.1 signs/second | 1368.8 signs/second |
pts/ctx-clock | 170 clocks | 175 clocks | 150 clocks |
pts/sysbench | 30970.0407 events/second | 34976.4773 events/second | 13378.7779 events/second |
pts/blender barbershop | 767.63 seconds | 532.7 seconds | 1472.21 seconds |
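
One way to put a number on how large the change is: express each benchmark as a 3950x-over-1950x speedup, inverting the ratio for results where lower is better. The sketch below does this for a handful of rows copied from the table above. The selection of rows and the helper name speedup are my own choices for illustration.

```python
# (1950x result, 3950x result, higher_is_better), copied from the table above.
results = {
    "pts/rodinia OpenMP LavaMD (seconds)": (47.27, 38.67, False),
    "pts/linux kernel compile (seconds)": (50.663, 34.973, False),
    "pts/x264 (frames/second)": (124.37, 149.44, True),
    "pts/x265 (frames/second)": (34.70, 54.76, True),
    "pts/compress-7zip (MIPS)": (64379, 98677, True),
    "pts/stockfish (nodes/second)": (37267071, 51242651, True),
}


def speedup(old, new, higher_is_better):
    """Return how many times faster 'new' is than 'old' (>1 means faster)."""
    return new / old if higher_is_better else old / new


speedups = {name: speedup(*values) for name, values in results.items()}

if __name__ == "__main__":
    # Print the largest gains first.
    for name, ratio in sorted(speedups.items(), key=lambda item: -item[1]):
        print(f"{ratio:5.2f}x  {name}")
```

Every ratio in this subset comes out above 1, with a spread from roughly 1.2x to a little under 1.6x.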
Posted in hardware

haswell system freezes

Posted on 2018-08-04 by mev

Not sure what is causing it, but my Haswell system has started to freeze up when running “wspy --config topdown.config”. This started happening after I updated the system and started running benchmarks after a month on the road.

Some additional diagnosis and items I’ve tried:

  1. Observed that the hangs also happened in a debugger running single-step so investigated how many single-steps before it hung. That may have been a false lead, as the single step in middle of fopen(3C) suddenly jumps ahead. However, along the way, tightened up my strtok() calls to be strtok_r() to make sure nothing strange was happening with recursive open_config_file()/parse_command_line calls. These were set up to be tail-recursive, so shouldn’t matter but cleaned up anyways.
  2. Next observed that failure seemed to happen in setting up performance counters. Created small test program that made the same performance counters, and it didn’t hang.
  3. Looked through logs in /var/log and didn’t see any smoking guns. The kernel completely locks up, even for other logged in processes – so even if the program is faulty, there is vulnerability to locking the kernel.
  4. Further analysis started looking at grub to boot to an older kernel. In the process, uncovered that I was running under Xen hypervisor. Booting into bare metal fixed the problem. I’ve added a check to my 123.sh script that calls wspy. Not sure why this hung the system, but I don’t have virtual counters enabled, so this shouldn’t work. TODO item to look at more robust error detection in wspy to avoid stumbling into this again.

Conclusion: running under bare metal fixed the issue.
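
The kind of guard described in step 4 can be sketched as follows. This is not the check from the 123.sh script (which is a shell script and is not shown here); it is a Python stand-in that assumes the kernel exposes /sys/hypervisor/type, which Xen-enabled kernels normally populate with the string "xen". The function names are hypothetical.

```python
from pathlib import Path

# Assumption: on Xen this sysfs file exists and contains "xen";
# on bare metal it is normally absent.
HYPERVISOR_TYPE_PATH = "/sys/hypervisor/type"


def hypervisor_type(path=HYPERVISOR_TYPE_PATH):
    """Return the hypervisor name reported by sysfs, or None if there is none."""
    try:
        text = Path(path).read_text().strip()
    except OSError:
        return None  # file missing or unreadable: treat as bare metal
    return text or None


def running_under_xen(path=HYPERVISOR_TYPE_PATH):
    """True when sysfs reports Xen, so counter setup should be skipped."""
    return hypervisor_type(path) == "xen"


if __name__ == "__main__":
    if running_under_xen():
        raise SystemExit("Xen hypervisor detected; refusing to set up performance counters")
    print("no Xen hypervisor reported; continuing")
```

A check like this only catches Xen. Other hypervisors would need their own detection, which is one reason more robust error handling inside wspy itself is still worth the TODO.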

Posted in analysis | Tagged system hang, virtualization

Phoronix article – benchmarks of high-end Intel/AMD desktops

Posted on 2018-08-03 by mev

Phoronix posted an article comparing Intel and AMD desktops on the Linux 4.18 kernel. The article says 100+ benchmarks were measured, though only half a dozen are displayed as part of the article.

I haven’t done these benchmarks on 4.18, but can look at analysis to see what is measured. This posting summarizes the Phoronix conclusions as well as my observations of the benchmarks. Looks like an opportunity to look at a few new benchmarks. These are described in the table below.

Benchmark | Phoronix observations | My observations | Analysis
indigobench | Ryzen & Threadripper faster than i7 and slower than i9 platforms | On_CPU of 97% with an IPC of 0.65. Many backend stalls and L2/L3 cache misses. | Analysis
hpcc | Threadripper fastest, i9 next followed by Ryzen 7 2700 and Core i7. | Requires specific variables during install, still need to figure these out. |
compress-p7zip | i9 fastest followed by threadripper. Ryzen 7 2700 similar to i7. | On_CPU 88% with some I/O to limit scaling. IPC 0.83 with 27% speculation misses (branch prediction). | Analysis
build-linux-kernel | i9 fastest, threadripper close, i7 slowest. | On_CPU 88%, mostly parallel compiles with a sequential period at end. High frontend stalls. # processes less in subsequent runs so might not do thorough "clean". | Analysis
c-ray | Threadripper fastest, i9 next and i7 slowest. | On_CPU almost 100% with moderately high IPC of 1.44. Frontend stalls of 10% and backend of 15%. | Analysis
octave-benchmark | i7 fastest, Ryzen 7 next and i9 after that. | Single-threaded with On_Core of 100%. Six workloads varying slightly but including backend memory stalls. | Analysis
v-ray | i9 fastest, threadripper/ryzen next and i7 slowest. | Installation instructions point to site to register and download the benchmark to place in download cache. Even following these steps had difficulty getting it installed. |

Posted in analysis | Tagged phoronix benchmark article

TODO list at end of June

Posted on 2018-06-30 by mev (2018-06-25)

As June is coming to a close, useful to take stock of what is completed and what still remains.

During June, the following were done:

  1. Phoronix benchmark list: I finished going through ~120 Phoronix benchmarks to at least do a “topdown” run. Approximately 60 have further “analysis” pages. Most of this was done by end of May, but finished the last at start of the month. As a result, when benchmark articles are posted, most of these I’ve already looked at and it is quicker to update the analysis.
  2. Phoronix articles: looked at articles on OS comparisons, CPU comparison and hyper-threading. Updated article based on previous analysis. Hyper-threading was most interesting, showing these smaller benchmarks all benefited unless there was obvious cause, e.g. limited thread scaling. Skipped over some OS-specific articles as I’ve looked at the benchmarks and not sure much more to add.
  3. Installed and analyzed both gromacs and OpenFoam applications. Nice to see tools created based on smaller benchmarks can work here.
  4. Added support to wspy for --memstats. This periodically samples /proc/meminfo and creates metrics. Useful for OpenFOAM.
  5. Looked further at OpenSSL differences between AMD and Intel and suggested perhaps MULX instructions were related.
  6. Looked at topdown metrics for AMD, but not much traction here.
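
For the /proc/meminfo sampling mentioned in item 4, the core of such a feature is turning the "Key: value kB" lines into numbers and deriving a metric from them. The sketch below illustrates that idea in Python; it is not wspy's implementation, the function names are hypothetical, and the sample text uses made-up values purely to show the file layout.

```python
# Illustrative sample in /proc/meminfo layout; the numbers are made up.
SAMPLE = """MemTotal:       32768000 kB
MemFree:         8192000 kB
MemAvailable:   16384000 kB
HugePages_Total:       0
"""


def parse_meminfo(text):
    """Map each key to its integer value (kB for most keys, a bare count for some)."""
    values = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        if not sep:
            continue  # skip anything that is not "Key: value"
        fields = rest.split()
        if fields:
            values[key.strip()] = int(fields[0])
    return values


def used_fraction(values):
    """Fraction of memory not available for new allocations."""
    return 1 - values["MemAvailable"] / values["MemTotal"]


def sample(path="/proc/meminfo"):
    """Read and parse the live file; call this periodically to build a time series."""
    with open(path) as handle:
        return parse_meminfo(handle.read())


if __name__ == "__main__":
    print(f"used fraction in sample: {used_fraction(parse_meminfo(SAMPLE)):.2f}")
```

Calling sample() from a timer loop and logging used_fraction alongside a timestamp would give the kind of memory-over-time trace that is useful for a memory-heavy workload such as OpenFOAM.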

This leaves several areas for further emphasis in the future(*):

  1. Add additional “real world” codes. Top candidates are wrf and namd.
  2. Keep up with incremental phoronix articles as they are published.
  3. Look at Ryzen to create better “topdown” quick tool, e.g. add cache miss rates. It might become more of an overall tool than top down.
  4. Add ARMv8 architecture examples.
  5. Cleanups: take care of nmi timer, add “about this graph”, review test next steps
  6. Implement --netstats, the one remaining “stats” feature. However, don’t have a motivating case yet.
  7. Look at tools/techniques beyond current measurements, e.g. microbenchmark measurements similar to Agner’s scripts?

I have some extended cycle touring scheduled in July, so it may be a slower month overall. However, I’ve also reached a general level of maturity on tools and analysis, so what remains is more about rounding out edges.

Posted in tools | Tagged progress report

Phoronix article – POWER9, Xeon and AMD comparison (2018-06-25)

Posted on 2018-06-25 by mev

Phoronix posted an article comparing POWER vs x86 on CPU benchmarks. This post looks at some of the workloads and adds comments.

Posted in analysis | Tagged phoronix benchmark article

openssl – AMD vs Intel

Posted on 2018-06-24 by mev

The openssl Phoronix benchmark is interesting because the IPC on the Intel Haswell system (1.66) is considerably higher than the IPC on AMD Ryzen (1.12). In this post, I’ll look for causes.

Posted in analysis, featured | Tagged analysis technique

Phoronix article – hyperthreading (2018-06-20)

Posted on 2018-06-21 by mev

Phoronix posted an article comparing hyperthreading on/off on an Intel i7. This post reviews some of the workloads and adds comments.

Posted in analysis | Tagged hyperthreading, phoronix benchmark article
