wspy – memory analysis for processes; need to sanity check the metrics… – Performance analysis, tools and experiments

As a followup to this post, I’ve implemented per-process capture of backend counters. I can now create a memory report from a process tree.

For example, here is the report for STREAM:

sh - pid 24788
	cycles              (100.0%)	826576248325
	  productive        (  4.2%)	34668836965
	  stalls            ( 95.8%)	791907411360
	    other stall     ( 10.3%)	84868643116
	    memory          ( 85.5%)	707038768244
	      read_bw       ( 64.8%)	535687104571
	      read_lat      ( 20.7%)	171351663673
	      write         (  3.9%)	32528655492
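The rows of this report are consistent with simple subtraction: productive is cycles minus stalls, other stall is stalls minus memory, and read_lat is memory minus read_bw. A minimal Python sketch of that arithmetic follows, using the counts copied from the report. Which rows wspy measures directly and which it derives is an assumption here, as are the variable names and the `pct` helper.

```python
# Cycle counts copied from the STREAM report above (pid 24788).
cycles = 826_576_248_325
stalls = 791_907_411_360
memory = 707_038_768_244
read_bw = 535_687_104_571
write = 32_528_655_492

# Assumption: these rows are derived by subtraction rather than measured.
productive = cycles - stalls      # cycles that were not stalled
other_stall = stalls - memory     # stalls not attributed to memory
read_lat = memory - read_bw       # memory stalls not attributed to bandwidth


def pct(count, total=cycles):
    """Return count as a percentage of total cycles."""
    return 100.0 * count / total


# (indent depth, label, count), in the same order as the report.
rows = [
    (0, "cycles", cycles),
    (1, "productive", productive),
    (1, "stalls", stalls),
    (2, "other stall", other_stall),
    (2, "memory", memory),
    (3, "read_bw", read_bw),
    (3, "read_lat", read_lat),
    (3, "write", write),
]

for depth, label, count in rows:
    name = "  " * depth + label
    print(f"\t{name:<20}({pct(count):5.1f}%)\t{count}")
```

One thing this makes visible: read_bw and read_lat already add up to the memory total, so the write row is not a third slice of the same total.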

I implemented these metrics as described in this paper. The numbers line up with previous calculations. However, as I’ve gone further with this, I have some doubts about whether the paper is correct. For example, why should STREAM have as much as 10% of its cycles stalled on something other than memory? I’ve found an Excel sheet from Intel that suggests slightly different metrics. I’m not sure all of the counters exist before Skylake, but I want to investigate a bit further…

Based on reviewing the Excel sheet, I believe the counter for stalls on any memory operation should be CYCLE_ACTIVITY.STALLS_LDM_PENDING rather than CYCLE_ACTIVITY.STALLS_L1D_PENDING. Making that change reduces the other stall to something that makes more sense:

sh - pid 28480
	cycles              (100.0%)	827599363217
	  productive        (  4.2%)	35066714270
	  stalls            ( 95.8%)	792532648947
	    other stall     (  1.9%)	15573606224
	    memory          ( 93.9%)	776959042723
	      read_bw       ( 64.6%)	535029290366
	      read_lat      ( 29.2%)	241929752357
	      write         (  3.9%)	32574641248
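As a quick sanity check, the two reports can be compared row by row to see where the cycles moved. The sketch below uses only the counts printed above; the dictionary layout and the `shares` helper are illustrative assumptions. The two reports come from separate runs, so small differences are expected.

```python
# Counts copied from the two reports above, keyed by the memory-stall counter used.
runs = {
    "STALLS_L1D_PENDING": {
        "cycles": 826_576_248_325,
        "other_stall": 84_868_643_116,
        "read_bw": 535_687_104_571,
        "read_lat": 171_351_663_673,
    },
    "STALLS_LDM_PENDING": {
        "cycles": 827_599_363_217,
        "other_stall": 15_573_606_224,
        "read_bw": 535_029_290_366,
        "read_lat": 241_929_752_357,
    },
}


def shares(run):
    """Express every row except 'cycles' as a percentage of that run's cycles."""
    total = run["cycles"]
    return {name: 100.0 * count / total
            for name, count in run.items() if name != "cycles"}


before = shares(runs["STALLS_L1D_PENDING"])
after = shares(runs["STALLS_LDM_PENDING"])

# Change in percentage points for each row.
delta = {name: after[name] - before[name] for name in before}

for name, change in delta.items():
    print(f"{name:<12} {before[name]:5.1f}% -> {after[name]:5.1f}%  ({change:+.1f} pts)")
```

The roughly 8.4 points that leave other stall reappear almost entirely under read_lat, while read_bw barely moves. In other words, the counter change only reclassifies stalls that were previously counted as non-memory.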

I also checked Andi Kleen’s pmu-tools and see that they use STALLS_LDM_PENDING to separate memory from other backend stalls. The next step when I get back will be to implement the next level of the reporting hierarchy using these formulas.

Two other observations as one dives into these top-down metrics:

Going to the next levels will require multiplexing more counters. That always carries a potential for less accuracy, though looking at the time series should also give me some clues about the variation. Other than multiplexing, one might instead do multiple runs (same issue, but correlating across more than one run rather than across parts of the same program).
Not too much later, the “top down” approach leads back to looking at the individual events and how often they occur; it has just steered one first toward identifying those that matter most.
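On the multiplexing point: the Linux perf_event_open interface can report, alongside each count, how long the event was enabled and how long it was actually running on a hardware counter (the PERF_FORMAT_TOTAL_TIME_ENABLED and PERF_FORMAT_TOTAL_TIME_RUNNING read formats). The usual estimate scales the raw count by the ratio of the two. Below is a minimal sketch of that scaling; the function name and the sample values are hypothetical, and whether wspy reads these fields is an assumption.

```python
def scale_count(raw_count, time_enabled, time_running):
    """Estimate the full-run count of a multiplexed counter.

    Returns (estimate, coverage), where coverage is the fraction of the
    enabled time during which the counter was actually scheduled.
    """
    if time_running == 0:
        # The event never got a hardware counter; no estimate is possible.
        return None, 0.0
    coverage = time_running / time_enabled
    return raw_count / coverage, coverage


# Hypothetical sample: the event was scheduled for a quarter of the run.
estimate, coverage = scale_count(raw_count=1_000, time_enabled=400, time_running=100)
print(f"estimated count {estimate:.0f}, coverage {coverage:.0%}")
```

A low coverage value is a hint that the scaled number deserves less trust, which is the accuracy concern raised above. The estimate also assumes the event rate was uniform over the run.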
