This post describes a few higher-level takeaways I have from the analysis:
- Gromacs has sophisticated topology configuration using both MPI and OpenMP, and it configures the system to take advantage of the NUMA architecture. Additional tuning is undoubtedly helpful, but the out-of-the-box defaults are also workable.
- Gromacs has an intermediate IPC of roughly 1 on my AMD and Intel reference systems. The largest bottlenecks seemed to be memory related, with different molecule simulations spending 30-40% of cycles in backend stalls. Frontend stalls are ~10% and come mostly from bandwidth (inefficient packing that fails to fill all 4 uop slots) rather than latency (iTLB or icache misses). Branch misses and bad speculation also don't seem to be a big factor. Overall, the uop cache gets used 70-90% of the time, depending on the workload.
- AMD does proportionally worse on the runs I did using Gromacs 5.1.5 than using Gromacs 2018. The earlier Gromacs 5.1.5 did not ship with a good default configuration for modern AMD chips: it assumed the XOP and FMA4 ISA extensions on AMD platforms, and these were dropped in Ryzen. Some of the gap may therefore come from my inefficient build. In general, for peak performance on AMD, I'd encourage a recent build. Left unmeasured is whether the default version on my Ubuntu systems (built for a lowest common denominator) shows gaps as large.
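To make the topology point concrete, here is a hedged sketch of overriding the MPI/OpenMP split by hand rather than relying on the defaults. The rank/thread counts are placeholders for a hypothetical 16-core, two-NUMA-node machine, and `topol` is a stand-in for your actual run name; `-ntmpi`, `-ntomp`, and `-pin` are standard `gmx mdrun` options.

```shell
# One thread-MPI rank per NUMA node, 8 OpenMP threads each, pinned to cores.
# Numbers here are illustrative; match them to your own socket/NUMA layout.
gmx mdrun -ntmpi 2 -ntomp 8 -pin on -deffnm topol
```

In practice it is worth comparing this against the defaults, since mdrun's own heuristics already account for the detected hardware layout.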
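The IPC and stall numbers above can be reproduced with generic hardware counters. A rough sketch using `perf stat` (the generic event aliases shown here are not available on every CPU, and `gmx mdrun -deffnm topol` is a placeholder invocation):

```shell
# IPC = instructions / cycles; backend-stall share = stalled-cycles-backend / cycles.
# On CPUs without the generic stall aliases, use the vendor-specific
# top-down events instead.
perf stat -e cycles,instructions,branch-misses \
          -e stalled-cycles-frontend,stalled-cycles-backend \
          gmx mdrun -deffnm topol
```

Finer-grained frontend breakdowns (uop cache hit rate, iTLB/icache misses) require the CPU-specific raw events rather than these portable aliases.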
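For the build point, a minimal sketch of configuring a recent Gromacs source tree with the SIMD level pinned explicitly, so an AMD Zen machine gets AVX2 kernels instead of falling back to a legacy XOP/FMA4 assumption. `GMX_SIMD` is a real Gromacs CMake option; the value and paths here are assumptions for illustration.

```shell
# From an unpacked Gromacs source directory; AVX2_256 suits Zen-based CPUs.
mkdir build && cd build
cmake .. -DGMX_SIMD=AVX2_256 -DCMAKE_BUILD_TYPE=Release
make -j"$(nproc)"
```

Leaving `GMX_SIMD` unset and letting CMake auto-detect the host is usually fine on recent releases; the explicit flag mainly matters when building for a different machine than the build host.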