rodinia – Performance analysis, tools and experiments

Description - phoronix/rodinia

Rodinia is a suite focused upon accelerating compute-intensive applications with accelerators. CUDA, OpenMP, and OpenCL parallel models are supported by the included applications. This profile utilizes the OpenCL and OpenMP test binaries at the moment.

As described in the IISWC paper, Rodinia was designed to compare implementations of classic algorithms between GPU and CPU. I don’t make that comparison below; instead I characterize the OpenMP implementations of Rodinia in the Phoronix test suite as they might be used to compare processors.

The Phoronix test suite has four OpenMP workloads, but one of them (leukocyte) does not compile correctly out of the box. I expect Phoronix to eventually fix this, so I haven’t diagnosed it further to make a local fix. The remaining three benchmarks and their runtimes on my systems are:

Runtime (s)    Intel            AMD
LavaMD         215.00           95.57
CFD Solver      65.53           32.67
Streamcluster   33.16           25.00
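
To put the table in relative terms, the speedup of the AMD system over the Intel system can be computed directly from the runtimes. The following is a minimal Python sketch; the dictionary and function names are illustrative, the values are copied from the table above, and the runtimes are assumed to be in seconds.

```python
# Runtimes copied from the table above (assumed to be seconds, lower is better).
RUNTIMES = {
    "LavaMD": {"intel": 215.00, "amd": 95.57},
    "CFD Solver": {"intel": 65.53, "amd": 32.67},
    "Streamcluster": {"intel": 33.16, "amd": 25.00},
}


def speedup(slower: float, faster: float) -> float:
    """Return how many times shorter the second runtime is than the first."""
    return slower / faster


def amd_speedups(runtimes=RUNTIMES) -> dict:
    """Speedup of the AMD system over the Intel system, per benchmark."""
    return {name: speedup(t["intel"], t["amd"]) for name, t in runtimes.items()}


if __name__ == "__main__":
    for name, ratio in amd_speedups().items():
        print(f"{name:<14} {ratio:.2f}x")
```

LavaMD and CFD Solver come out at roughly 2.2x and 2.0x, while Streamcluster gains only about 1.3x. The first two are at least consistent with the roughly doubled On_Core values in the AMD metrics further down.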

In the plots below they are run in this order and each has different characteristics.

Metrics (Intel) - phoronix/rodinia

sh - pid 10911 // LavaMD
	On_CPU   0.980
	On_Core  7.841
	IPC      1.459
	Retire   0.723	(72.3%)
	FrontEnd 0.150	(15.0%)
	Spec     0.006	(0.6%)
	Backend  0.121	(12.1%)
	Elapsed  215.65
	Procs    10
	Minflt   212459
	Majflt   0
	Utime    1690.77 	(100.0%)
	Stime    0.21    	(0.0%)
	Start    415189.98
	Finish   415405.63
sh - pid 11000 // CFD Solver
	On_CPU   0.974
	On_Core  7.790
	IPC      0.641
	Retire   0.323	(32.3%)
	FrontEnd 0.093	(9.3%)
	Spec     0.003	(0.3%)
	Backend  0.580	(58.0%)
	Elapsed  72.33
	Procs    10
	Minflt   14414
	Majflt   0
	Utime    562.78  	(99.9%)
	Stime    0.69    	(0.1%)
	Start    415855.61
	Finish   415927.94
sh - pid 11064 // Streamcluster
	On_CPU   0.973
	On_Core  7.782
	IPC      0.909
	Retire   0.459	(45.9%)
	FrontEnd 0.097	(9.7%)
	Spec     0.007	(0.7%)
	Backend  0.437	(43.7%)
	Elapsed  33.28
	Procs    10
	Minflt   33772
	Majflt   0
	Utime    258.93  	(100.0%)
	Stime    0.04    	(0.0%)
	Start    416291.28
	Finish   416324.56

The metrics show similarities: all benchmarks run almost 100% on the CPU, and all have a basic structure of one parent process delegating work to children on all the cores. The differences are that CFD Solver and, to a lesser extent, Streamcluster are more backend bound than LavaMD and thus also have lower IPC.
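
The On_Core and On_CPU figures appear to be derived from the other fields. In each Intel block above, On_Core matches (Utime + Stime) / Elapsed to three decimals, and On_CPU matches On_Core divided by 8. That relationship is inferred from the numbers rather than taken from the tool's documentation, and the count of 8 logical CPUs is likewise an assumption. A short Python sketch with hypothetical function names:

```python
# (Utime, Stime, Elapsed) copied from the Intel metrics above.
INTEL = {
    "LavaMD": (1690.77, 0.21, 215.65),
    "CFD Solver": (562.78, 0.69, 72.33),
    "Streamcluster": (258.93, 0.04, 33.28),
}

# Assumption: 8 logical CPUs, inferred from the On_Core / On_CPU ratio.
ASSUMED_CPUS = 8


def on_core(utime: float, stime: float, elapsed: float) -> float:
    """Average number of cores kept busy: CPU seconds per wall-clock second."""
    return (utime + stime) / elapsed


def on_cpu(utime: float, stime: float, elapsed: float, ncpus: int = ASSUMED_CPUS) -> float:
    """Fraction of the total CPU capacity that was in use."""
    return on_core(utime, stime, elapsed) / ncpus


if __name__ == "__main__":
    for name, fields in INTEL.items():
        print(f"{name:<14} On_Core={on_core(*fields):.3f} On_CPU={on_cpu(*fields):.3f}")
```

Reading On_Core this way makes the "almost 100% on the CPU" observation concrete: each benchmark keeps nearly all of the assumed 8 CPUs busy for the whole run.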

Metrics (AMD) - phoronix/rodinia

sh - pid 12067 // LavaMD
	On_CPU   0.973
	On_Core  15.565
	IPC      1.835
	FrontCyc 0.016	(1.6%)
	BackCyc  0.175	(17.5%)
	Elapsed  95.55
	Procs    18
	Minflt   212510
	Majflt   0
	Utime    1487.08 	(100.0%)
	Stime    0.17    	(0.0%)
	Start    66011.18
	Finish   66106.73
sh - pid 12166 // CFD Solver
	On_CPU   0.929
	On_Core  14.869
	IPC      0.824
	FrontCyc 0.001	(0.1%)
	BackCyc  0.034	(3.4%)
	Elapsed  32.78
	Procs    18
	Minflt   14435
	Majflt   0
	Utime    486.68  	(99.9%)
	Stime    0.72    	(0.1%)
	Start    66317.66
	Finish   66350.44
sh - pid 12224 // Streamcluster
	On_CPU   0.955
	On_Core  15.283
	IPC      0.708
	FrontCyc 0.001	(0.1%)
	BackCyc  0.107	(10.7%)
	Elapsed  25.05
	Procs    18
	Minflt   33801
	Majflt   0
	Utime    382.81  	(100.0%)
	Stime    0.02    	(0.0%)
	Start    66425.54
	Finish   66450.59

IPCs for LavaMD and CFD Solver are higher for AMD than Intel while Streamcluster is lower.

Process Tree - phoronix/rodinia
The process trees for the three benchmarks are similar and simple:

    10911) sh elapsed=215.65 start=0.00 finish=215.65
      10912) rodinia elapsed=215.65 start=0.00 finish=215.65
        10913) lavaMD elapsed=215.65 start=0.00 finish=215.65
        10914) lavaMD elapsed=215.01 start=0.64 finish=215.65
        10915) lavaMD elapsed=215.01 start=0.64 finish=215.65
        10916) lavaMD elapsed=215.01 start=0.64 finish=215.65
        10917) lavaMD elapsed=215.01 start=0.64 finish=215.65
        10918) lavaMD elapsed=215.01 start=0.64 finish=215.65
        10919) lavaMD elapsed=215.01 start=0.64 finish=215.65
        10920) lavaMD elapsed=215.01 start=0.64 finish=215.65

    11000) sh elapsed=72.33 start=0.00 finish=72.33
      11001) rodinia elapsed=72.33 start=0.00 finish=72.33
        11002) euler3d_cpu_dou elapsed=72.32 start=0.01 finish=72.33
        11003) euler3d_cpu_dou elapsed=71.16 start=1.17 finish=72.33
        11004) euler3d_cpu_dou elapsed=71.16 start=1.17 finish=72.33
        11005) euler3d_cpu_dou elapsed=71.16 start=1.17 finish=72.33
        11006) euler3d_cpu_dou elapsed=71.16 start=1.17 finish=72.33
        11007) euler3d_cpu_dou elapsed=71.16 start=1.17 finish=72.33
        11008) euler3d_cpu_dou elapsed=71.16 start=1.17 finish=72.33
        11009) euler3d_cpu_dou elapsed=71.16 start=1.17 finish=72.33

    11064) sh elapsed=33.28 start=0.00 finish=33.28
      11065) rodinia elapsed=33.27 start=0.01 finish=33.28
        11066) sc_omp elapsed=33.26 start=0.01 finish=33.27
        11067) sc_omp elapsed=32.30 start=0.97 finish=33.27
        11068) sc_omp elapsed=32.30 start=0.97 finish=33.27
        11069) sc_omp elapsed=32.30 start=0.97 finish=33.27
        11070) sc_omp elapsed=32.30 start=0.97 finish=33.27
        11071) sc_omp elapsed=32.30 start=0.97 finish=33.27
        11072) sc_omp elapsed=32.30 start=0.97 finish=33.27
        11073) sc_omp elapsed=32.30 start=0.97 finish=33.27
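
The tree lines have a regular shape (pid, command, then key=value timings), so they are easy to check mechanically. Below is a minimal Python sketch that parses such lines and counts the entries for a given command; the regular expression and function names are my own, written against the output shown above rather than any documented format.

```python
import re

# Pattern written against the lines shown above; not a documented format.
LINE_RE = re.compile(
    r"^(?P<indent>\s*)(?P<pid>\d+)\)\s+(?P<cmd>\S+)\s+"
    r"elapsed=(?P<elapsed>[\d.]+)\s+start=(?P<start>[\d.]+)\s+"
    r"finish=(?P<finish>[\d.]+)\s*$"
)


def parse_tree(text: str) -> list:
    """Parse process-tree lines into dictionaries, skipping unmatched lines."""
    procs = []
    for line in text.splitlines():
        m = LINE_RE.match(line)
        if m is None:
            continue
        procs.append({
            "depth": len(m.group("indent")),  # indentation width as nesting level
            "pid": int(m.group("pid")),
            "cmd": m.group("cmd"),
            "elapsed": float(m.group("elapsed")),
            "start": float(m.group("start")),
            "finish": float(m.group("finish")),
        })
    return procs


def count_cmd(procs: list, cmd: str) -> int:
    """Number of tree entries whose command name equals cmd."""
    return sum(1 for p in procs if p["cmd"] == cmd)


# A few lines copied from the Streamcluster tree above.
SAMPLE = """\
    11064) sh elapsed=33.28 start=0.00 finish=33.28
      11065) rodinia elapsed=33.27 start=0.01 finish=33.28
        11066) sc_omp elapsed=33.26 start=0.01 finish=33.27
        11067) sc_omp elapsed=32.30 start=0.97 finish=33.27
"""

if __name__ == "__main__":
    for p in parse_tree(SAMPLE):
        print(p["pid"], p["cmd"], p["start"], p["finish"])
```

In the full trees above, each benchmark shows eight entries for the worker command: the first starts together with the wrapper, and the other seven start between roughly 0.6 and 1.2 seconds later and finish at the same time.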

Almost all CPUs are busy all the time. As an aside, the Phoronix test suite runs each benchmark three times. If the variation is too large, it may run the benchmark one or more additional times. This is why in this instance one sees the first workload four times and in other graphs only three times.

The IPC of LavaMD is the highest. CFD Solver ran six times with a low IPC, and Streamcluster is slightly higher.

LavaMD has a relatively high retire rate and lower front-end and back-end stalls. CFD Solver has a high amount of backend stalls, which is worth investigating further (memory/cache or core?). Streamcluster also has a higher amount of backend stalls.

The next level of topdown analysis shows that backend stalls in CFD Solver are split more evenly between memory and other causes, while the stall issue for Streamcluster is predominantly memory.

I also decided to print the amount of memory traffic. As expected, LavaMD has relatively little traffic. CFD Solver has the lowest IPC but a medium amount of memory traffic, and Streamcluster has the highest external memory traffic.

As an interesting aside, the Phoronix article compares the Ryzen 2700X and Ryzen 1700 on both LavaMD and CFD Solver, with the following ratios:

              LavaMD   CFD Solver
Ryzen 2700x    88.16     26.65 
Ryzen 1700    102.19     30.31
Ratio          +15%       +13%

Two of the larger differences between these processors are increased frequencies (3.0 GHz base/3.7 GHz boost to 3.7 GHz base/4.3 GHz boost) and decreased cache latencies. I would expect the cache improvement to primarily help the backend-bound CFD Solver and the increased frequency to perhaps help LavaMD, which doesn’t have to wait on memory as much. It is difficult to tease these two effects apart without having the systems.
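
As a rough sanity check on that expectation, the observed gains can be set against the clock uplift. This is a minimal Python sketch using only the figures quoted above; it assumes the table entries are runtimes where lower is better, and the names are illustrative.

```python
# Runtimes from the Phoronix comparison table above (lower assumed better).
RYZEN_RUNTIMES = {
    "LavaMD": {"1700": 102.19, "2700x": 88.16},
    "CFD Solver": {"1700": 30.31, "2700x": 26.65},
}

# Clock speeds in GHz as quoted in the text.
CLOCKS_GHZ = {
    "1700": {"base": 3.0, "boost": 3.7},
    "2700x": {"base": 3.7, "boost": 4.3},
}


def runtime_gain(old_runtime: float, new_runtime: float) -> float:
    """Percent throughput gain implied by a shorter runtime."""
    return (old_runtime / new_runtime - 1.0) * 100.0


def clock_gain(old_clock: float, new_clock: float) -> float:
    """Percent increase in clock frequency."""
    return (new_clock / old_clock - 1.0) * 100.0


if __name__ == "__main__":
    for name, t in RYZEN_RUNTIMES.items():
        print(f"{name:<11} runtime gain {runtime_gain(t['1700'], t['2700x']):.1f}%")
    for kind in ("base", "boost"):
        gain = clock_gain(CLOCKS_GHZ["1700"][kind], CLOCKS_GHZ["2700x"][kind])
        print(f"{kind:<11} clock gain   {gain:.1f}%")
```

The runtime gains come out at roughly 15.9% and 13.7% (shown as +15% and +13% in the table), against clock uplifts of about 23% at base and 16% at boost. Both benchmarks gain less than the base-clock ratio, which is one reason the frequency and cache effects are hard to separate from these numbers alone.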

Next Steps: Why the Intel/AMD IPC differences for LavaMD? Dig further into the backend stall issues for CFD Solver.