This post describes a few higher-level takeaways I have from the analysis:
- Gromacs has sophisticated topology configuration using both MPI and OpenMP, and it configures the system to take advantage of the NUMA architecture. Additional tuning is undoubtedly helpful, but the out-of-the-box defaults are also workable.
- Gromacs has an intermediate IPC of roughly 1 on my AMD and Intel reference systems. The largest bottlenecks seemed to be memory related, with different molecule simulations spending 30-40% of cycles in backend stalls. Frontend stalls are ~10% and come mostly from bandwidth (inefficient packing that fails to fill all 4 uop slots) rather than latency (iTLB or icache misses). Branch misses and bad speculation also don't seem to be a big factor. Overall, the uop cache gets used 70-90% of the time, depending on the workload.
- AMD does proportionally worse on the runs I did using Gromacs 5.1.5 than using Gromacs 2018. The earlier Gromacs 5.1.5 did not ship with a good default configuration for modern AMD chips: it assumed the XOP and FMA4 ISA extensions on AMD platforms, and these were dropped in Ryzen. Some of the gap may therefore come from my inefficient build. In general, for peak performance on AMD, I'd encourage a recent build. Left unmeasured is whether the default version on my Ubuntu systems (built for a lowest common denominator) shows gaps as large.
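To make the topology point concrete, here is a hedged sketch of overriding the MPI/OpenMP split by hand rather than relying on the defaults. The rank/thread counts are placeholders for a hypothetical 16-core, two-NUMA-node machine, and `topol` is a stand-in for your actual run name; `-ntmpi`, `-ntomp`, and `-pin` are standard `gmx mdrun` options.

```shell
# One thread-MPI rank per NUMA node, 8 OpenMP threads each, pinned to cores.
# Numbers here are illustrative; match them to your own socket/NUMA layout.
gmx mdrun -ntmpi 2 -ntomp 8 -pin on -deffnm topol
```

In practice it is worth comparing this against the defaults, since mdrun's own heuristics already account for the detected hardware layout.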
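The IPC and stall numbers above can be reproduced with generic hardware counters. A rough sketch using `perf stat` (the generic event aliases shown here are not available on every CPU, and `gmx mdrun -deffnm topol` is a placeholder invocation):

```shell
# IPC = instructions / cycles; backend-stall share = stalled-cycles-backend / cycles.
# On CPUs without the generic stall aliases, use the vendor-specific
# top-down events instead.
perf stat -e cycles,instructions,branch-misses \
          -e stalled-cycles-frontend,stalled-cycles-backend \
          gmx mdrun -deffnm topol
```

Finer-grained frontend breakdowns (uop cache hit rate, iTLB/icache misses) require the CPU-specific raw events rather than these portable aliases.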
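For the build point, a minimal sketch of configuring a recent Gromacs source tree with the SIMD level pinned explicitly, so an AMD Zen machine gets AVX2 kernels instead of falling back to a legacy XOP/FMA4 assumption. `GMX_SIMD` is a real Gromacs CMake option; the value and paths here are assumptions for illustration.

```shell
# From an unpacked Gromacs source directory; AVX2_256 suits Zen-based CPUs.
mkdir build && cd build
cmake .. -DGMX_SIMD=AVX2_256 -DCMAKE_BUILD_TYPE=Release
make -j"$(nproc)"
```

Leaving `GMX_SIMD` unset and letting CMake auto-detect the host is usually fine on recent releases; the explicit flag mainly matters when building for a different machine than the build host.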