As a followup to this post, I’ve implemented per-process capture of backend counters. I can now create a memory report from a process tree.
For example, here is the report for STREAM:
sh - pid 24788
  cycles       (100.0%) 826576248325
  productive   (  4.2%) 34668836965
  stalls       ( 95.8%) 791907411360
  other stall  ( 10.3%) 84868643116
  memory       ( 85.5%) 707038768244
  read_bw      ( 64.8%) 535687104571
  read_lat     ( 20.7%) 171351663673
  write        (  3.9%) 32528655492
I implemented these metrics as described in this paper. The numbers line up with previous calculations. However, as I’ve gone further with this, I have some doubts whether the paper is correct. For example, why should STREAM even have as much as 10% stalls not due to memory? I’ve found an Excel sheet from Intel that suggests slightly different metrics. Not sure if all the counters are there before Skylake, but want to investigate just a bit further…
Based on reviewing the Excel sheet, I believe the memory stall count should use CYCLE_ACTIVITY.STALLS_LDM_PENDING, for any memory operation, rather than CYCLE_ACTIVITY.STALLS_L1D_PENDING. That change reduces the other stall to something that makes more sense:
sh - pid 28480
  cycles       (100.0%) 827599363217
  productive   (  4.2%) 35066714270
  stalls       ( 95.8%) 792532648947
  other stall  (  1.9%) 15573606224
  memory       ( 93.9%) 776959042723
  read_bw      ( 64.6%) 535029290366
  read_lat     ( 29.2%) 241929752357
  write        (  3.9%) 32574641248
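The top two levels of this report are plain subtractions, which makes it easy to sanity-check. Here is a minimal Python sketch of that arithmetic using the numbers from the pid 28480 run. The function name and parameter names are made up for illustration and are not from my tool; `memory_stall_cycles` is assumed to be the CYCLE_ACTIVITY.STALLS_LDM_PENDING count, and the sketch does not say which raw event supplies the total stall count.

```python
def backend_breakdown(cycles, stall_cycles, memory_stall_cycles):
    """Split cycles into productive, memory-stall and other-stall buckets.

    All names here are hypothetical. memory_stall_cycles is assumed to be
    the CYCLE_ACTIVITY.STALLS_LDM_PENDING count for the same run.
    """
    productive = cycles - stall_cycles               # cycles that were not stalled
    other_stall = stall_cycles - memory_stall_cycles  # stalls not attributed to memory
    rows = [
        ("cycles", cycles),
        ("productive", productive),
        ("stalls", stall_cycles),
        ("other stall", other_stall),
        ("memory", memory_stall_cycles),
    ]
    # Every percentage is relative to total cycles, as in the report.
    return [(name, count, round(100.0 * count / cycles, 1)) for name, count in rows]


# Counts taken from the pid 28480 report.
report = backend_breakdown(827599363217, 792532648947, 776959042723)
for name, count, pct in report:
    print(f"{name:<12} ({pct:5.1f}%) {count}")
```

The derived productive and other stall rows match the report, so at least those two levels are internally consistent.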
I also checked Andi Kleen’s pmu-tools and see that they use STALLS_LDM_PENDING to separate memory from other backend stalls. The next step when I get back will be to implement the next level of the reporting hierarchy using these formulas.
Two other observations as one dives into these top-down reports:
- Going to the next levels will require multiplexing more counters, which always has the potential for less accuracy, though I should also get some clues about the variation by looking at time series. Other than multiplexing, one might instead do multiple runs (same issue, but correlating across more than one run rather than across parts of the same program).
- Not too much later, the “top down” approach leads back to looking at the events and how often they occur; it has just steered one into first identifying those that matter most.
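To make the multiplexing concern concrete: when a counter is only scheduled for part of the run, the usual approach is to extrapolate the raw count by the ratio of time enabled to time running. Linux perf_event_open can report both times alongside the count, and that is what the sketch below assumes. The function name and the example numbers are made up for illustration, and my tool may end up handling this differently.

```python
def scale_multiplexed(raw_count, time_enabled, time_running):
    """Extrapolate a multiplexed counter to the full enabled interval.

    Hypothetical helper. Returns (estimate, fraction_running); the estimate
    is None if the counter was never scheduled onto the hardware.
    """
    if time_running == 0:
        return None, 0.0
    fraction_running = time_running / time_enabled
    estimate = raw_count * time_enabled / time_running
    return estimate, fraction_running


# Made-up example: the counter was live for a quarter of the interval.
estimate, fraction = scale_multiplexed(1000, 200, 50)
print(estimate, fraction)  # 4000.0 0.25
```

The smaller the running fraction, the more the estimate depends on the program behaving uniformly over time, which is the same reason the time series is worth looking at.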