himeno – Performance analysis, tools and experiments

Description - phoronix/himeno

The Himeno benchmark is a linear solver for the pressure Poisson equation using a point-Jacobi method.
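As a rough illustration of what a point-Jacobi solver does, the sketch below runs Jacobi sweeps for a Poisson problem on a tiny cubic grid. It is not the benchmark's code: it assumes a plain 7-point stencil with unit coefficients, and the names `make_grid`, `jacobi_sweep` and `solve` are made up for this example. The returned `gosa` is the sum of squared updates in a sweep, playing a role similar to the "Gosa" figure in the benchmark output.

```python
def make_grid(n, interior=0.0, boundary=0.0):
    """Build an n*n*n grid with one value inside and another on the faces."""
    grid = [[[interior] * n for _ in range(n)] for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for k in range(n):
                if 0 in (i, j, k) or (n - 1) in (i, j, k):
                    grid[i][j][k] = boundary
    return grid


def jacobi_sweep(p, rhs, h):
    """One point-Jacobi sweep for lap(p) = rhs on the interior points.

    Every new value is computed from the OLD grid only, which is what
    distinguishes Jacobi from Gauss-Seidel.
    """
    n = len(p)
    new = [[row[:] for row in plane] for plane in p]
    gosa = 0.0
    for i in range(1, n - 1):
        for j in range(1, n - 1):
            for k in range(1, n - 1):
                neighbours = (p[i + 1][j][k] + p[i - 1][j][k]
                              + p[i][j + 1][k] + p[i][j - 1][k]
                              + p[i][j][k + 1] + p[i][j][k - 1])
                s = (neighbours - h * h * rhs[i][j][k]) / 6.0
                gosa += (s - p[i][j][k]) ** 2
                new[i][j][k] = s
    return new, gosa


def solve(p, rhs, h, sweeps):
    """Run a fixed number of sweeps and record gosa after each one."""
    history = []
    for _ in range(sweeps):
        p, gosa = jacobi_sweep(p, rhs, h)
        history.append(gosa)
    return p, history


if __name__ == "__main__":
    n = 6
    p0 = make_grid(n, interior=0.0, boundary=1.0)   # faces held at 1.0
    rhs = make_grid(n)                              # zero source term
    p, history = solve(p0, rhs, h=1.0, sweeps=200)
    print("first gosa:", history[0], "last gosa:", history[-1])
    print("centre value:", p[2][2][2])
```

With a zero source term and all faces held at 1.0 the exact solution is 1.0 everywhere, so the interior values drift towards 1.0 and `gosa` shrinks with each sweep.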

The overall benchmark is described here and in these slides.

The code has five different predefined grid sizes:

 Grid-size= XS (32x32x64)
	    S  (64x64x128)
	    M  (128x128x256)
	    L  (256x256x512)
	    XL (512x512x1024)

and the Phoronix program runs with the “s” grid size, producing output of the following form:

mimax = 64 mjmax = 64 mkmax = 128
imax = 63 jmax = 63 kmax =127
 Start rehearsal measurement process.
 Measure the performance in 3 times.

 MFLOPS: 1830.141187 time(s): 0.025923 3.288628e-03

 Now, start the actual measurement process.
 The loop will be excuted in 6943 times
 This will take about one minute.
 Wait for a while

 Loop executed for 6943 times
 Gosa : 6.113893e-08 
 MFLOPS measured : 1871.809420	cpu : 58.658930
 Score based on Pentium III 600MHz using Fortran 77: 22.826944
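The figures in this output can be cross-checked with a little arithmetic. The sketch below is an assumption-laden reconstruction, not the benchmark's source: it assumes 34 floating-point operations per interior grid point per sweep, a 60 second target for the measured run (suggested by "This will take about one minute"), and a score divisor of 82, which is simply the ratio of the two printed numbers. Under those assumptions it reproduces the rehearsal MFLOPS, the loop count of 6943, the measured MFLOPS and the score.

```python
FLOP_PER_POINT = 34      # assumption: chosen because it matches the printed MFLOPS
TARGET_SECONDS = 60.0    # assumption: "This will take about one minute."
SCORE_DIVISOR = 82.0     # assumption: ratio of printed MFLOPS to printed score


def flop_per_sweep(imax, jmax, kmax):
    """Floating-point operations in one sweep over the interior points."""
    return (imax - 2) * (jmax - 2) * (kmax - 2) * FLOP_PER_POINT


def mflops(loops, seconds, flop):
    """Millions of floating-point operations per second."""
    return flop * loops / seconds * 1e-6


# Values copied from the output above.
flop = flop_per_sweep(63, 63, 127)
rehearsal_loops, rehearsal_seconds = 3, 0.025923
measured_loops, measured_seconds = 6943, 58.658930

rehearsal_mflops = mflops(rehearsal_loops, rehearsal_seconds, flop)
planned_loops = int(TARGET_SECONDS / (rehearsal_seconds / rehearsal_loops))
measured_mflops = mflops(measured_loops, measured_seconds, flop)
score = measured_mflops / SCORE_DIVISOR

print(f"rehearsal MFLOPS ~ {rehearsal_mflops:.2f}")
print(f"planned loops    = {planned_loops}")
print(f"measured MFLOPS  ~ {measured_mflops:.2f}")
print(f"score            ~ {score:.4f}")
```

One consequence worth keeping in mind: because the loop count is derived from a three-iteration rehearsal, the amount of work in the measured phase can differ from run to run, so elapsed time alone is not a direct measure of speed.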

While parallel versions are available, the Phoronix variant uses a single-threaded C implementation and hence doesn’t really exercise either cores or caches.

For these runs, all testing was done with the process pinned to core 1.
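The notes do not say how the pinning was done. One common approach on Linux is the `taskset` utility; the sketch below only builds such a command line and prints it. The helper name `pinned_command` and the `s` argument passed to the binary are assumptions for illustration.

```python
import shlex


def pinned_command(core, argv):
    """Prefix a command with taskset so it is restricted to one CPU."""
    return ["taskset", "-c", str(core)] + list(argv)


# Binary name taken from the process tree; the "s" argument is an assumption.
cmd = pinned_command(1, ["./himenobmtxpa", "s"])
print(shlex.join(cmd))
```

The resulting list could be passed to `subprocess.run(cmd)` to launch the benchmark on core 1.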

Metrics (Intel) - phoronix/himeno

An interesting aspect of the benchmark is that the amount of backend stalls is noisy from run to run. For example, below are two successive runs:

sh - pid 8559
	On_CPU   0.125
	On_Core  1.000
	IPC      0.979
	Retire   0.325	(32.5%)
	FrontEnd 0.028	(2.8%)
	Spec     0.003	(0.3%)
	Backend  0.645	(64.5%)
	Elapsed  47.92
	Procs    3
	Maxrss   29K
	Minflt   7405
	Majflt   0
	Inblock  0
	Oublock  8
	Msgsnd   0
	Msgrcv   0
	Nsignals 0
	Nvcsw    18	(40.0%)
	Nivcsw   27
	Utime    47.911952
	Stime    0.008337
	Start    120898.16
	Finish   120946.08

and

sh - pid 8564
	On_CPU   0.125
	On_Core  1.000
	IPC      0.765
	Retire   0.254	(25.4%)
	FrontEnd 0.036	(3.6%)
	Spec     0.020	(2.0%)
	Backend  0.689	(68.9%)
	Elapsed  74.30
	Procs    3
	Maxrss   29K
	Minflt   7405
	Majflt   0
	Inblock  0
	Oublock  8
	Msgsnd   0
	Msgrcv   0
	Nsignals 0
	Nvcsw    18	(36.7%)
	Nivcsw   31
	Utime    74.278622
	Stime    0.013664
	Start    120947.08
	Finish   121021.38

Overall, this code is 100% On_CPU (On_Core 1.000) and limited by backend stalls. Some factor (cache placement?) causes noise between runs, which affects backend stalls and hence IPC.
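To put numbers on the run-to-run noise, the sketch below takes the top-down fractions from the two Intel runs above, checks that they sum to roughly 1, picks the dominant category and computes the relative change between the runs. The dictionary layout and helper names are made up for this example; only the values come from the output.

```python
CATEGORIES = ("Retire", "FrontEnd", "Spec", "Backend")

# Values copied from the two Intel runs above.
intel_runs = {
    8559: {"IPC": 0.979, "Retire": 0.325, "FrontEnd": 0.028,
           "Spec": 0.003, "Backend": 0.645, "Elapsed": 47.92},
    8564: {"IPC": 0.765, "Retire": 0.254, "FrontEnd": 0.036,
           "Spec": 0.020, "Backend": 0.689, "Elapsed": 74.30},
}


def category_sum(run):
    """The four top-down fractions should add up to about 1.0."""
    return sum(run[name] for name in CATEGORIES)


def dominant_category(run):
    """Name of the category that accounts for the largest fraction."""
    return max(CATEGORIES, key=lambda name: run[name])


def relative_change(first, second, key):
    """Fractional change of one metric from the first run to the second."""
    return (second[key] - first[key]) / first[key]


first, second = intel_runs[8559], intel_runs[8564]
for pid, run in intel_runs.items():
    print(pid, "sum =", round(category_sum(run), 3),
          "dominant =", dominant_category(run))
for key in ("IPC", "Backend", "Elapsed"):
    print(f"{key:8s} change: {relative_change(first, second, key):+.1%}")
```

Both runs are dominated by the Backend category. The comparison quantifies the noise but does not explain it; the cause is still an open question.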

Metrics (AMD) - phoronix/himeno

sh - pid 19774
	On_CPU   0.062
	On_Core  0.999
	IPC      0.548
	FrontCyc 0.035	(3.5%)
	BackCyc  0.011	(1.1%)
	Elapsed  59.67
	Procs    3
	Maxrss   29K
	Minflt   7407
	Majflt   0
	Inblock  0
	Oublock  8
	Msgsnd   0
	Msgrcv   0
	Nsignals 0
	Nvcsw    18	(0.3%)
	Nivcsw   5742
	Utime    59.604901
	Stime    0.011988
	Start    138173.41
	Finish   138233.08

AMD runs show somewhat less variation than Intel, and the IPC is a fair amount lower (0.548 versus 0.765 to 0.979). The Phoronix benchmark run saw a degradation on AMD between Ubuntu 16.04 and 18.04 but not on Intel (gcc changes?).

Process Tree - phoronix/himeno
The process tree is simple:

    8559) sh elapsed=47.92 start=2.66 finish=50.58
      8560) himeno elapsed=47.92 start=2.66 finish=50.58
        8561) himenobmtxpa elapsed=47.92 start=2.66 finish=50.58

On_CPU is 100% overall.
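The "100%" figure can be reconciled with the On_CPU values of 0.125 and 0.062 in the tables. The sketch below assumes that On_Core is (Utime + Stime) / Elapsed and that On_CPU is On_Core divided by the number of logical CPUs. Neither definition is stated in these notes, and the CPU counts of 8 (Intel) and 16 (AMD) are inferred only because they make the numbers match.

```python
# Values copied from the metrics tables; logical_cpus is an inferred assumption.
runs = {
    "Intel pid 8559": {"utime": 47.911952, "stime": 0.008337,
                       "elapsed": 47.92, "logical_cpus": 8},
    "Intel pid 8564": {"utime": 74.278622, "stime": 0.013664,
                       "elapsed": 74.30, "logical_cpus": 8},
    "AMD pid 19774": {"utime": 59.604901, "stime": 0.011988,
                      "elapsed": 59.67, "logical_cpus": 16},
}


def on_core(run):
    """Assumed definition: CPU time consumed per second of wall-clock time."""
    return (run["utime"] + run["stime"]) / run["elapsed"]


def on_cpu(run):
    """Assumed definition: on_core spread over every logical CPU."""
    return on_core(run) / run["logical_cpus"]


for name, run in runs.items():
    print(f"{name}: On_Core ~ {on_core(run):.3f}, On_CPU ~ {on_cpu(run):.3f}")
```

Under these assumptions a single-threaded process that never waits shows On_Core close to 1.000 while On_CPU is capped at one over the CPU count, which is consistent with the reported values.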

IPC varies somewhat from run to run but stays below 1 (0.765 to 0.979 on Intel, 0.548 on AMD).

Backend stalls are the primary limiter. The graphs come from different runs, and they also show some run-to-run variation.

Next steps: Drill into the backend stalls and understand the factors behind the run-to-run variation. Also understand the IPC gap between AMD and Intel, as well as the more recent AMD drop from Ubuntu 16.04 to 18.04.