Description - phoronix/polybench-c

PolyBench-C is a polyhedral benchmark suite written in C, developed at the Ohio State University.

A link to the polybench-c page is here.

Phoronix runs these benchmarks with the LARGE dataset size, so the working sets do not fit in the L3 cache and the workloads are memory-bound overall. The code is single-threaded, and the tests below were pinned to one core. Three workloads are run, in order:

  1. covariance
  2. correlation
  3. matrix multiplication (the 3mm benchmark)
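
The source does not say how the runs were pinned. As a minimal sketch, assuming Linux, one way to restrict a child process to a single core is `os.sched_setaffinity`. The probe command below is only a stand-in for a benchmark binary; it prints how many cores the child is allowed to use.

```python
import os
import subprocess
import sys


def run_pinned(cmd, core=None):
    """Run cmd with its CPU affinity restricted to one core (Linux only)."""
    if core is None:
        # Pick a core this process is already allowed to run on.
        core = min(os.sched_getaffinity(0))

    def pin():
        # Executed in the child between fork and exec; pid 0 means "this process".
        os.sched_setaffinity(0, {core})

    return subprocess.run(cmd, preexec_fn=pin, capture_output=True, text=True)


if __name__ == "__main__":
    # Stand-in for a benchmark: report the size of the child's affinity mask.
    probe = [sys.executable, "-c",
             "import os; print(len(os.sched_getaffinity(0)))"]
    print(run_pinned(probe).stdout.strip())
```

Pinning keeps the process from migrating between cores, which is consistent with the On_Core values near 1.0 reported below.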

Metrics (Intel) - phoronix/polybench-c

Metrics for the workloads:

sh - pid 14672 // covariance
	On_CPU   0.125
	On_Core  0.999
	IPC      0.149
	Retire   0.030	(3.0%)
	FrontEnd 0.006	(0.6%)
	Spec     0.008	(0.8%)
	Backend  0.956	(95.6%)
	Elapsed  10.66
	Procs    3
	Maxrss   46K
	Minflt   14527
	Majflt   0
	Inblock  0
	Oublock  16
	Msgsnd   0
	Msgrcv   0
	Nsignals 0
	Nvcsw    18	(45.0%)
	Nivcsw   22
	Utime    10.637998
	Stime    0.012624
	Start    85156.99
	Finish   85167.65
sh - pid 14683 // correlation
	On_CPU   0.125
	On_Core  1.000
	IPC      0.149
	Retire   0.030	(3.0%)
	FrontEnd 0.006	(0.6%)
	Spec     0.008	(0.8%)
	Backend  0.956	(95.6%)
	Elapsed  10.66
	Procs    3
	Maxrss   46K
	Minflt   14538
	Majflt   0
	Inblock  0
	Oublock  16
	Msgsnd   0
	Msgrcv   0
	Nsignals 0
	Nvcsw    18	(50.0%)
	Nivcsw   18
	Utime    10.647670
	Stime    0.015996
	Start    85199.00
	Finish   85209.66
sh - pid 14696 // matrix multiply
	On_CPU   0.125
	On_Core  1.000
	IPC      0.473
	Retire   0.081	(8.1%)
	FrontEnd 0.002	(0.2%)
	Spec     0.021	(2.1%)
	Backend  0.895	(89.5%)
	Elapsed  10.38
	Procs    3
	Maxrss   64K
	Minflt   21502
	Majflt   0
	Inblock  0
	Oublock  16
	Msgsnd   0
	Msgrcv   0
	Nsignals 0
	Nvcsw    18	(52.9%)
	Nivcsw   16
	Utime    10.350641
	Stime    0.031998
	Start    85241.09
	Finish   85251.47

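Overall, On_Core is essentially 100%, IPC is low (0.149 to 0.473), and all three workloads are heavily backend (memory) bound, with Backend between 89.5% and 95.6%. Matrix multiply is the least memory-bound of the three.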
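
As a quick sanity check on the Intel numbers, the four top-level categories (Retire, FrontEnd, Spec, Backend) should add up to about 1.0 for each run. The sketch below copies the values from the listings above and reports the sum and the dominant category; the dictionary layout and the `summarize` helper are illustrative only.

```python
# Top-level fractions copied from the Intel metrics above.
intel = {
    "covariance":      {"Retire": 0.030, "FrontEnd": 0.006, "Spec": 0.008, "Backend": 0.956},
    "correlation":     {"Retire": 0.030, "FrontEnd": 0.006, "Spec": 0.008, "Backend": 0.956},
    "matrix multiply": {"Retire": 0.081, "FrontEnd": 0.002, "Spec": 0.021, "Backend": 0.895},
}


def summarize(fractions):
    """Return (sum of the fractions rounded to 3 places, name of the largest one)."""
    total = round(sum(fractions.values()), 3)
    dominant = max(fractions, key=fractions.get)
    return total, dominant


for name, fractions in intel.items():
    total, dominant = summarize(fractions)
    print(f"{name:16s} sum={total:.3f} dominant={dominant} ({fractions[dominant]:.1%})")
```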

Metrics (AMD) - phoronix/polybench-c
sh - pid 4917 // covariance
	On_CPU   0.062
	On_Core  1.000
	IPC      0.426
	FrontCyc 0.001	(0.1%)
	BackCyc  0.071	(7.1%)
	Elapsed   3.86
	Procs    3
	Maxrss   45K
	Minflt   14530
	Majflt   0
	Inblock  0
	Oublock  16
	Msgsnd   0
	Msgrcv   0
	Nsignals 0
	Nvcsw    18	(4.5%)
	Nivcsw   385
	Utime    3.847715
	Stime    0.012120
	Start    154067.80
	Finish   154071.66
sh - pid 4928 // correlation
	On_CPU   0.062
	On_Core  0.999
	IPC      0.424
	FrontCyc 0.001	(0.1%)
	BackCyc  0.070	(7.0%)
	Elapsed   3.89
	Procs    3
	Maxrss   46K
	Minflt   14541
	Majflt   0
	Inblock  0
	Oublock  16
	Msgsnd   0
	Msgrcv   0
	Nsignals 0
	Nvcsw    18	(4.4%)
	Nivcsw   387
	Utime    3.864962
	Stime    0.020848
	Start    154089.34
	Finish   154093.23
sh - pid 4939 // matrix multiply
	On_CPU   0.062
	On_Core  0.998
	IPC      1.202
	FrontCyc 0.001	(0.1%)
	BackCyc  0.171	(17.1%)
	Elapsed   4.25
	Procs    3
	Maxrss   64K
	Minflt   21506
	Majflt   0
	Inblock  0
	Oublock  16
	Msgsnd   0
	Msgrcv   0
	Nsignals 0
	Nvcsw    18	(4.0%)
	Nivcsw   431
	Utime    4.219678
	Stime    0.021351
	Start    154110.93
	Finish   154115.18
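
The percentage printed next to Nvcsw appears to be the voluntary share of all context switches, Nvcsw / (Nvcsw + Nivcsw). That is an inference from the numbers, not something the tool output states. A short check against the values in the listings above:

```python
def voluntary_share(nvcsw, nivcsw):
    """Fraction of context switches that were voluntary."""
    return nvcsw / (nvcsw + nivcsw)


# (label, Nvcsw, Nivcsw, percentage as printed in the listings)
runs = [
    ("Intel covariance",      18,  22, "45.0%"),
    ("Intel correlation",     18,  18, "50.0%"),
    ("Intel matrix multiply", 18,  16, "52.9%"),
    ("AMD covariance",        18, 385, "4.5%"),
    ("AMD correlation",       18, 387, "4.4%"),
    ("AMD matrix multiply",   18, 431, "4.0%"),
]

for label, nvcsw, nivcsw, printed in runs:
    computed = f"{voluntary_share(nvcsw, nivcsw):.1%}"
    print(f"{label:22s} computed={computed:>6s} printed={printed:>6s}")
```

Under this reading, the much lower percentages on AMD come from the higher involuntary switch counts (385 to 431, against 16 to 22 on Intel), since Nvcsw is 18 in every run.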

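IPC on AMD is roughly 2.5 to 2.9 times higher than on Intel (for example, 0.426 vs 0.149 for covariance and 1.202 vs 0.473 for matrix multiply), and elapsed times are correspondingly shorter. The cause was not investigated; a different instruction mix is one possibility.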
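
To put numbers on the Intel/AMD comparison, the sketch below computes the IPC ratio and the elapsed-time ratio for each workload from the values in the listings above. The two ratios need not match, since clock frequency and instruction counts may differ between the two machines; neither is reported in the source.

```python
# (IPC, elapsed seconds) copied from the metrics listings above.
intel = {
    "covariance":      (0.149, 10.66),
    "correlation":     (0.149, 10.66),
    "matrix multiply": (0.473, 10.38),
}
amd = {
    "covariance":      (0.426, 3.86),
    "correlation":     (0.424, 3.89),
    "matrix multiply": (1.202, 4.25),
}


def ratios(intel_run, amd_run):
    """Return (AMD IPC / Intel IPC, Intel elapsed / AMD elapsed)."""
    ipc_ratio = amd_run[0] / intel_run[0]
    speedup = intel_run[1] / amd_run[1]
    return ipc_ratio, speedup


for name in intel:
    ipc_ratio, speedup = ratios(intel[name], amd[name])
    print(f"{name:16s} IPC ratio={ipc_ratio:.2f} elapsed ratio={speedup:.2f}")
```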

Process Tree - phoronix/polybench-c
The process trees are simple: each sh launches the polybench-c wrapper, which runs a single benchmark binary.

    14672) sh
      14673) polybench-c
        14674) covariance_benc

    14683) sh
      14684) polybench-c
        14685) correlation_ben

    14696) sh
      14697) polybench-c
        14698) 3mm_bench


Overall, On_Core is 100%, IPC is very low, and backend stalls are the key issue.

Topdown (Intel)
retire         0.061
    ms_uops                0.003
speculation    0.002
    branch_misses          16.32%
    machine_clears         83.68%
frontend       0.011
    idq_uops_delivered_0   0.003
        icache_stall               0.001
        itlb_misses                0.000
    idq_uops_delivered_1   0.005
    idq_uops_delivered_2   0.006
    idq_uops_delivered_3   0.008
        dsb_ops                    5.26%
backend        0.926
    resource_stalls.sb     0.002
    stalls_ldm_pending     0.921

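Overall, the stalls are almost entirely memory-read related: stalls_ldm_pending (0.921) accounts for nearly all of the backend figure (0.926), while store-buffer stalls (resource_stalls.sb, 0.002) are negligible.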
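
The arithmetic behind that conclusion can be sketched as follows, assuming the sub-metrics under backend are expressed in the same units as the backend figure itself (the listing does not state this). The `share_of_parent` helper is illustrative only.

```python
# Values copied from the Topdown (Intel) listing above.
backend = 0.926
children = {"resource_stalls.sb": 0.002, "stalls_ldm_pending": 0.921}
top_level = {"retire": 0.061, "speculation": 0.002, "frontend": 0.011, "backend": backend}


def share_of_parent(child, parent):
    """Fraction of the parent category attributed to one sub-metric."""
    return child / parent


for name, value in children.items():
    print(f"{name:20s} {share_of_parent(value, backend):.1%} of backend")

# The four top-level categories should account for roughly everything.
print(f"top-level sum = {sum(top_level.values()):.3f}")
```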

Next steps: None