I have added support to wspy for the --memstats option.
OpenFOAM summary
I’ve analyzed the OpenFOAM CFD application using the motorbike tutorial and created an analysis page for the results, which I’ve also added to the overall workload summary.
This post describes a few high-level takeaways I have from the analysis:
- OpenFOAM is sensitive to the number of threads used to run it, with consistent improvement until the number of threads equals the number of physical cores. Beyond that, raising the thread count to the number of hyperthreads (logical CPUs) is slightly slower on my Intel system and slightly faster on my AMD system. In this range the percentage of system time jumps dramatically, and the top routines appear to be related to memory management and scheduling.
- OpenFOAM appears to be limited by backend stalls and memory, with the overall composite application run showing a 40% L2 miss ratio and a 50% L3 miss ratio. The iTLB miss ratio is also surprisingly high at 0.011. I’m not sure whether this is related to the recent kernel Spectre/Meltdown mitigations or to something else.
- IPC and backend stalls vary somewhat as the application runs, but overall IPC is slightly less than 1 on Intel and slightly higher than 1 on AMD. There doesn’t seem to be a particular bias between my AMD and Intel systems.
- Overall, my sense is that when running the application, keeping track of memory management and of threads, including their affinity, is particularly important.
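The ratios quoted above are simple quotients of raw counter totals. As a minimal sketch, assuming "miss ratio" means misses divided by accesses at that cache level (the exact events wspy reads are not stated here), the derived metrics could be computed as follows. The names `CounterTotals` and `derived_metrics` are hypothetical, and the counter values in the example are made up for illustration, not measurements.

```python
from dataclasses import dataclass


@dataclass
class CounterTotals:
    """Hypothetical container for raw counter totals from one run."""
    cycles: int
    instructions: int
    l2_accesses: int
    l2_misses: int
    l3_accesses: int
    l3_misses: int


def ratio(numerator: int, denominator: int) -> float:
    """Return numerator / denominator, or 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def derived_metrics(c: CounterTotals) -> dict[str, float]:
    """Compute IPC and per-level miss ratios.

    Assumption: miss ratio = misses / accesses for that cache level.
    """
    return {
        "ipc": ratio(c.instructions, c.cycles),
        "l2_miss_ratio": ratio(c.l2_misses, c.l2_accesses),
        "l3_miss_ratio": ratio(c.l3_misses, c.l3_accesses),
    }


if __name__ == "__main__":
    # Made-up counts for illustration only.
    totals = CounterTotals(cycles=1_000_000, instructions=950_000,
                           l2_accesses=50_000, l2_misses=20_000,
                           l3_accesses=20_000, l3_misses=10_000)
    print(derived_metrics(totals))
```

Returning 0.0 for an empty denominator keeps a short or idle sampling interval from raising an exception when metrics are computed per interval rather than only for the whole run.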
OpenFOAM notes
OpenFOAM is free CFD software. While one can build it from source, there are also prebuilt packages in Ubuntu repositories, which I used in my testing.
gromacs – summary
I’ve analyzed the gromacs computational chemistry application using three sample workloads and created an analysis page for the results, which I’ve also added to the overall workload summary.
This post describes a few higher-level takeaways I have from the analysis:
- Gromacs seems to have a sophisticated topology configuration, using both MPI and OpenMP, and configures itself to take advantage of the NUMA architecture. Some additional tuning is undoubtedly helpful, but the out-of-the-box defaults are also useful.
- Gromacs has an intermediate IPC of roughly 1 on my AMD and Intel reference systems. The largest bottlenecks seem to be memory related, with the different molecule simulations spending 30-40% of cycles in backend stalls. Frontend stalls are ~10% and come more from bandwidth (inefficient packing that fails to use all 4 uop slots) than from latency (iTLB or icache misses). Branch misses and bad speculation also don’t seem to be a big factor. Overall, the uop cache is used 70-90% of the time, depending on the workload.
- AMD does proportionally worse in my runs using gromacs 5.1.5 than using gromacs 2018. Gromacs 5.1.5 did not come with a good default configuration for Ryzen, since it assumed the XOP and FMA4 ISAs for AMD platforms, and these were dropped in Ryzen. Hence, some of the gap may come from my inefficient build. In general, for peak performance on AMD I’d encourage using a recent build. Left unmeasured is whether the default package on my Ubuntu systems (built for a lowest common denominator) shows as large a gap.
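The frontend observations above (bandwidth vs. latency, uop cache usage) correspond to ratios in the style of Intel's topdown level 2. The following is a minimal sketch, assuming a 4-wide Intel core and raw totals of the Intel events named in the comments; whether the 70-90% and ~10% figures were derived exactly this way is an assumption, and the function names are hypothetical.

```python
PIPELINE_WIDTH = 4  # assumption: 4 issue slots per cycle


def uop_cache_coverage(dsb_uops: int, mite_uops: int, ms_uops: int) -> float:
    """Fraction of uops delivered by the uop cache (IDQ.DSB_UOPS) rather than
    the legacy decoders (IDQ.MITE_UOPS) or microcode sequencer (IDQ.MS_UOPS)."""
    total = dsb_uops + mite_uops + ms_uops
    return dsb_uops / total if total else 0.0


def frontend_split(uops_not_delivered: int, cycles_0_uops: int,
                   clocks: int) -> tuple[float, float]:
    """Split frontend-bound slots into (latency, bandwidth) fractions.

    uops_not_delivered: IDQ_UOPS_NOT_DELIVERED.CORE
    cycles_0_uops:      IDQ_UOPS_NOT_DELIVERED.CYCLES_0_UOPS_DELIV.CORE
    clocks:             CPU_CLK_UNHALTED.THREAD
    """
    if clocks <= 0:
        raise ValueError("clocks must be positive")
    slots = PIPELINE_WIDTH * clocks
    fe_bound = uops_not_delivered / slots
    # A cycle that delivers zero uops wastes every slot: treat it as latency.
    fe_latency = PIPELINE_WIDTH * cycles_0_uops / slots
    # Remaining undelivered slots come from partially filled cycles.
    fe_bandwidth = fe_bound - fe_latency
    return fe_latency, fe_bandwidth


if __name__ == "__main__":
    # Arbitrary illustrative counts, not measurements.
    print(uop_cache_coverage(80, 15, 5))
    print(frontend_split(400, 25, 1000))
```

Counting whole zero-delivery cycles as latency and the rest as bandwidth mirrors the "inefficient packing" interpretation: a frontend that delivers some uops every cycle, but fewer than 4, shows up as bandwidth-bound.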
gromacs 5.1.5, Ryzen build issues (use -DGMX_SIMD)
I downloaded and built gromacs from www.gromacs.org and used the 5.1.5 version because some of the tutorials/benchmarks still refer to this older version. This post documents steps I used to run gromacs 5.1.5 on a Ryzen processor.
gromacs – setup notes
This post documents steps I used to build, install and run gromacs.
gromacs is a molecular dynamics package for simulating proteins, lipids and nucleic acids. It seems to be one of the more common HPC applications. My interest is less in the computational chemistry and more in characterizing how the program uses microprocessor resources.
topdown equivalent metrics for AMD Zen?
Intel processors provide a useful set of performance counters for doing topdown analysis, particularly at the first topdown level. This first level asks questions in the following hierarchy:
- Is a uop dispatched?
  - If yes, is it retired?
    - If yes --> retire
    - If no --> bad speculation
  - If no, is it stalled in the front end?
    - If yes --> front end stall
    - If no --> backend stall
Looking over the AMD Open Source Register Reference, I don’t see equivalent counters for making these same choices, particularly when also looking at the microarchitectural diagram for a Zen processor.
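For reference, the Intel side of this comparison can be written down compactly. The sketch below first mirrors the per-slot decision tree as a function, then aggregates it using the level-1 formulas from Intel's topdown method as I understand them for 4-wide Skylake-class cores. The event names are Intel's, the slot width of 4 is an assumption about the core, the function names are hypothetical, and nothing here implies an equivalent mapping exists on Zen.

```python
SLOTS_PER_CYCLE = 4  # assumption: 4 issue slots per cycle


def classify_slot(dispatched: bool, retired: bool, frontend_stalled: bool) -> str:
    """Walk the level-1 decision tree for a single issue slot."""
    if dispatched:
        return "retire" if retired else "bad speculation"
    return "front end stall" if frontend_stalled else "backend stall"


def topdown_level1(clocks: int, uops_issued: int, retire_slots: int,
                   recovery_cycles: int, uops_not_delivered: int) -> dict[str, float]:
    """Aggregate level-1 fractions of all issue slots from raw event totals.

    clocks             CPU_CLK_UNHALTED.THREAD
    uops_issued        UOPS_ISSUED.ANY
    retire_slots       UOPS_RETIRED.RETIRE_SLOTS
    recovery_cycles    INT_MISC.RECOVERY_CYCLES
    uops_not_delivered IDQ_UOPS_NOT_DELIVERED.CORE
    """
    if clocks <= 0:
        raise ValueError("clocks must be positive")
    slots = SLOTS_PER_CYCLE * clocks
    frontend = uops_not_delivered / slots
    # Issued-but-not-retired uops, plus slots lost while recovering from
    # mispredictions, count as bad speculation.
    bad_speculation = (uops_issued - retire_slots
                       + SLOTS_PER_CYCLE * recovery_cycles) / slots
    retiring = retire_slots / slots
    # Backend bound is whatever fraction of slots remains.
    backend = 1.0 - (frontend + bad_speculation + retiring)
    return {
        "frontend": frontend,
        "bad_speculation": bad_speculation,
        "retiring": retiring,
        "backend": backend,
    }


if __name__ == "__main__":
    # Arbitrary illustrative counts, not measurements.
    print(classify_slot(dispatched=False, retired=False, frontend_stalled=True))
    print(topdown_level1(1000, 2200, 2000, 50, 400))
```

Computing backend bound as the remainder is what makes the four categories sum to 1; it is also why the question for Zen is whether counters exist for the other three.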
Phoronix article – CPU comparison (2018-06-08)
Phoronix posted an article comparing 28 desktop CPUs. This post reviews some of the workloads that were used.
100+ Phoronix benchmarks listed
On the upper left of the web site is now an item labeled “phoronix”. Clicking it returns a table with a long list of Phoronix benchmarks. For each, I’ve recorded the description, scores on my Intel reference system, topdown metrics and, where available, the analysis done.
Phoronix article – OS comparison (2018-05-31)
Phoronix posted an article comparing 15 different OS versions. This post doesn’t reproduce the measurements; instead it looks at the overall summaries as well as the benchmarks used for the comparisons.