parboil – Performance analysis, tools and experiments

This page is a home for analysis of the parboil benchmark as part of the Phoronix test suite. Description of this suite is:

The Parboil Benchmarks from the IMPACT Research Group at University of Illinois are a set of throughput computing applications for looking at computing architecture and compilers. Parboil test-cases support OpenMP, OpenCL, and CUDA multi-processing environments. However, at this time the test profile is just making use of the OpenMP and OpenCL test workloads.

The link to the IMPACT research group for these benchmarks is here.

This test has 10 workloads:

Parboil 2.5:
pts/parboil-1.1.2
Processor Test Configuration
1: OpenMP CUTCP
2: OpenMP MRI-Q
3: OpenMP MRI Gridding
4: OpenMP Stencil
5: OpenMP LBM
6: OpenCL BFS
7: OpenCL TPACF
8: OpenCL LBM
9: OpenCL MRI Gridding
10: OpenCL Histo
11: Test All Options
Test:

I don’t have OpenCL on my test system and hence these benchmarks do not run. In addition, the MRI-Q benchmark executable is missing references to ComputeQCPU either because the file didn’t compile or link properly (not yet further diagnosed). The remaining four tests build and run correctly.

The remaining parboil benchmarks have the following descriptions

LBM – A fluid dynamics simulation of an enclosed, lid-driven cavity, using the Lattice-Boltzmann Method.

CUTCP – Computes the short-range component of Coulombic potential at each grid point over a 3D grid containing point charges representing an explicit-water biomolecular model.

Stencil – An iterative Jacobi stencil operation on a regular 3-D grid.

These benchmarks are run in the following order: LBM, CUTCP, Stencil. All are OpenMP programs running one thread per core.

Metrics (Intel) - phoronix/parboil

sh - pid 31043
	On_CPU   0.958
	On_Core  7.665
	IPC      1.152
	Retire   0.493	(49.3%)
	FrontEnd 0.229	(22.9%)
	Spec     0.128	(12.8%)
	Backend  0.150	(15.0%)
	Elapsed  209.30
	Procs    14
	Maxrss   1211K
	Minflt   418088
	Majflt   0
	Inblock  0
	Oublock  50648
	Msgsnd   0
	Msgrcv   0
	Nsignals 0
	Nvcsw    20190	(76.8%)
	Nivcsw   6102
	Utime    1603.134343
	Stime    1.079020
	Start    131779.24
	Finish   131988.54

LBM runs the longest at 209 seconds (and tends to dominate the graphs visually). It is On_CPU at 96% and has more frontend stalls than backend stalls. It also has a moderate level of bad speculation. At ~75% of the total, voluntary context switches are jumping between threads.

sh - pid 31096
	On_CPU   0.919
	On_Core  7.353
	IPC      0.842
	Retire   0.432	(43.2%)
	FrontEnd 0.142	(14.2%)
	Spec     0.144	(14.4%)
	Backend  0.282	(28.2%)
	Elapsed  14.73
	Procs    14
	Maxrss   35K
	Minflt   24364
	Majflt   0
	Inblock  0
	Oublock  656
	Msgsnd   0
	Msgrcv   0
	Nsignals 0
	Nvcsw    97	(15.4%)
	Nivcsw   531
	Utime    108.265967
	Stime    0.047483
	Start    132411.73
	Finish   132426.46

LBM runs less than 10% of the time of CUTCP. The IPC is somewhat lower at 0.84 and backend stalls are higher at over 25%.

sh - pid 31140
	On_CPU   0.908
	On_Core  7.267
	IPC      0.280
	Retire   0.117	(11.7%)
	FrontEnd 0.054	(5.4%)
	Spec     0.046	(4.6%)
	Backend  0.783	(78.3%)
	Elapsed  33.43
	Procs    14
	Maxrss   265K
	Minflt   103024
	Majflt   0
	Inblock  0
	Oublock  131096
	Msgsnd   0
	Msgrcv   0
	Nsignals 0
	Nvcsw    721	(43.0%)
	Nivcsw   955
	Utime    242.750514
	Stime    0.185575
	Start    132465.53
	Finish   132498.96

Stencil has slightly longer runtime. It has the lowest IPC and a large number of backend stalls.

Metrics (AMD) - phoronix/parboil

sh - pid 31937 // LBM
	On_CPU   0.970
	On_Core  15.514
	IPC      0.905
	FrontCyc 0.000	(0.0%)
	BackCyc  0.000	(0.0%)
	Elapsed  145.64
	Procs    22
	Maxrss   1211K
	Minflt   418097
	Majflt   0
	Inblock  8
	Oublock  50648
	Msgsnd   0
	Msgrcv   0
	Nsignals 0
	Nvcsw    30062	(12.1%)
	Nivcsw   219315
	Utime    2258.300102
	Stime    1.089136
	Start    696877.29
	Finish   697022.93
sh - pid 32042 // CUTCP
	On_CPU   0.890
	On_Core  14.240
	IPC      1.068
	FrontCyc 0.000	(0.0%)
	BackCyc  0.000	(0.0%)
	Elapsed   6.55
	Procs    22
	Maxrss   35K
	Minflt   24428
	Majflt   0
	Inblock  0
	Oublock  656
	Msgsnd   0
	Msgrcv   0
	Nsignals 0
	Nvcsw    174	(1.9%)
	Nivcsw   9049
	Utime    93.116885
	Stime    0.156270
	Start    697324.42
	Finish   697330.97
sh - pid 32111 // Stencil
	On_CPU   0.926
	On_Core  14.812
	IPC      0.168
	FrontCyc 0.000	(0.0%)
	BackCyc  0.000	(0.0%)
	Elapsed  30.50
	Procs    22
	Maxrss   264K
	Minflt   103038
	Majflt   0
	Inblock  0
	Oublock  131096
	Msgsnd   0
	Msgrcv   0
	Nsignals 0
	Nvcsw    1394	(3.1%)
	Nivcsw   43984
	Utime    451.568290
	Stime    0.195263
	Start    697353.89
	Finish   697384.39

On_CPU for AMD is roughly similar. The IPC for Stencil is considerably lower and the elapsed time almost same as Intel but using twice as many threads.

Process Tree - phoronix/parboil
Process Tree
The process trees are simple. All essentially a variation of the below.

    31140) sh
      31141) parboil
        31142) python
          31143) make
          31144) make
            31145) stencil
            31146) stencil
            31147) stencil
            31148) stencil
            31149) stencil
            31150) stencil
            31151) stencil
            31152) stencil
          31153) python

^{About this graph}
On_CPU is highest for LBM and this dominates the averages. It is closer to 90% for CUTCP and Stencil.

IPC is consistent across the workload. LBM has highest IPC, CUTCP is next highest and Stencil is very low.

^{About this graph}
Topdown metrics show effects of backend stalls for Stencil and to lesser extent CUTCP. For LBM, frontend stalls are higher issue.

Topdown (Intel)

on_cpu         0.915
elapsed        892.974
utime          6530.158
stime          3.373
nvcsw          52873 (63.21%)
nivcsw         30776 (36.79%)
inblock        40
onblock        941288
retire         0.500
ms_uops                0.007
speculation    0.009
branch_misses          2.52%
machine_clears         97.48%
frontend       0.184
idq_uops_delivered_0   0.024
icache_stall               0.004
itlb_misses                0.000
idq_uops_delivered_1   0.079
idq_uops_delivered_2   0.116
idq_uops_delivered_3   0.149
dsb_ops                    37.69%
backend        0.307
resource_stalls.sb     0.032
stalls_ldm_pending     0.365
l2_refs                    0.022
l2_misses                  0.008
l2_miss_ratio              38.79%
l3_refs                    0.004
l3_misses                  0.002
l3_miss_ratio              50.26%

Overall topdown metrics are an aggregate of all three. However, a reasonable percentage of backend stalls and cache misses (particularly stencil). The frontend stalls seem to be more poor allocation than icache/itlb misses and 38% of the uops come from the cache. Bad speculation is not high, but interestingly coming from machine clears.