Blender is an open-source 3D creation software project. This test is of Blender’s Cycles benchmark with various sample files. GPU computing via OpenCL or CUDA is supported
The blender workload has five input files and three modes: CPU-Only, OpenCL and CUDA. Following are the elapsed time in seconds I see on my Intel and AMD systems for CPU-only mode. The Phoronix article and workloads page include only Barbershop. However, my analysis below uses all five input files.
Runtimes are considerably longer than for other Phoronix workloads. My AMD box continues to get scores roughly twice of my Intel box.
Intel AMD BMW27 544.61 289.46 Classroom 1302.18 628.91 Fishy Cat 786.25 408.10 Barbershop 2635.12 1468.05 Pabellon Barcelona 1587.06 804.58
Metrics (Intel) - phoronix/blender
Metrics for all five workloads separately
sh - pid 520 // BMW27 On_CPU 0.996 On_Core 7.965 IPC 0.929 Retire 0.480 (48.0%) FrontEnd 0.205 (20.5%) Spec 0.086 (8.6%) Backend 0.229 (22.9%) Elapsed 549.33 Procs 56 Minflt 178840 Majflt 0 Utime 4374.90 (100.0%) Stime 0.26 (0.0%) Start 332948.28 Finish 333497.61 sh - pid 585 // Classroom On_CPU 0.999 On_Core 7.989 IPC 1.029 Retire 0.536 (53.6%) FrontEnd 0.234 (23.4%) Spec 0.072 (7.2%) Backend 0.158 (15.8%) Elapsed 1308.55 Procs 48 Minflt 188838 Majflt 0 Utime 10452.72 (100.0%) Stime 0.65 (0.0%) Start 333505.73 Finish 334814.28 sh - pid 692 // Fishy Cat On_CPU 0.995 On_Core 7.957 IPC 0.995 Retire 0.514 (51.4%) FrontEnd 0.182 (18.2%) Spec 0.112 (11.2%) Backend 0.193 (19.3%) Elapsed 790.29 Procs 48 Minflt 647075 Majflt 0 Utime 6287.66 (100.0%) Stime 0.88 (0.0%) Start 334822.41 Finish 335612.70 sh - pid 795 // Barbershop On_CPU 0.990 On_Core 7.923 IPC 0.800 Retire 0.413 (41.3%) FrontEnd 0.206 (20.6%) Spec 0.119 (11.9%) Backend 0.262 (26.2%) Elapsed 2631.79 Procs 69 Minflt 2845658 Majflt 0 Utime 20848.75 (100.0%) Stime 4.09 (0.0%) Start 335620.95 Finish 338252.74 sh - pid 1033 // Pabellon Barcelona On_CPU 0.999 On_Core 7.989 IPC 0.868 Retire 0.449 (44.9%) FrontEnd 0.193 (19.3%) Spec 0.110 (11.0%) Backend 0.248 (24.8%) Elapsed 1597.22 Procs 63 Minflt 144019 Majflt 0 Utime 12759.58 (100.0%) Stime 1.07 (0.0%) Start 338260.86 Finish 339858.08
A few things to note: first the On_CPU numbers are extremely high so the workload remains scheduled on the CPU at almost 100%. Second, the retiring percentage is slightly lower with front end and back end stalls perhaps contributing similar and wasted speculation also slightly higher than average. The IPC varies slightly by input files.
The resource chart below is also consistent with 100% On_CPU as number of involuntary context switches is low.
utime: 54469.605827 stime: 6.352188 maxrss: 7015K minflt: 4068215 majflt: 5 nswap: 0 inblock: 144704 oublock: 10904 msgsnd: 0 msgrcv: 0 nsignals: 0 nvcsw: 20559 nivcsw: 439295
Metrics (AMD) - phoronix/blender
Metrics for all five workloads separately
sh - pid 21985 // BMW27 On_CPU 0.983 On_Core 15.723 IPC 0.973 FrontCyc 0.041 (4.1%) BackCyc 0.082 (8.2%) Elapsed 291.18 Procs 101 Minflt 189299 Majflt 0 Utime 4577.98 (100.0%) Stime 0.16 (0.0%) Start 306221.12 Finish 306512.30 sh - pid 22099 // Classroom On_CPU 0.996 On_Core 15.934 IPC 1.164 FrontCyc 0.048 (4.8%) BackCyc 0.075 (7.5%) Elapsed 630.14 Procs 85 Minflt 195835 Majflt 0 Utime 10040.15 (100.0%) Stime 0.30 (0.0%) Start 306520.42 Finish 307150.56 sh - pid 22242 // Fishy Cat On_CPU 0.988 On_Core 15.809 IPC 1.060 FrontCyc 0.051 (5.1%) BackCyc 0.075 (7.5%) Elapsed 408.31 Procs 85 Minflt 656453 Majflt 0 Utime 6454.05 (100.0%) Stime 0.79 (0.0%) Start 307159.04 Finish 307567.35 sh - pid 22349 // Barbershop On_CPU 0.980 On_Core 15.684 IPC 0.794 FrontCyc 0.058 (5.8%) BackCyc 0.062 (6.2%) Elapsed 1472.31 Procs 130 Minflt 2862890 Majflt 0 Utime 23087.91 (100.0%) Stime 4.00 (0.0%) Start 307575.47 Finish 309047.78 sh - pid 22575 // Pabellon Barcelona On_CPU 0.996 On_Core 15.943 IPC 0.946 FrontCyc 0.049 (4.9%) BackCyc 0.074 (7.4%) Elapsed 804.45 Procs 131 Minflt 151736 Majflt 0 Utime 12825.16 (100.0%) Stime 0.32 (0.0%) Start 309056.05 Finish 309860.50
Process Tree - phoronix/blender
Process Tree
The process tree shows a similar pattern. Operations are spawned with two threads on each core. There are a few very small <1 second operations and in the middle one long-running operation for over 40 minutes. Below is the processtree for barbershop, but others are similar.
795) sh elapsed=2631.79 start=0.00 finish=2631.79 797) blender elapsed=2631.79 start=0.00 finish=2631.79 803) blender elapsed=0.18 start=0.00 finish=0.18 804) blender elapsed=0.15 start=0.02 finish=0.17 805) blender elapsed=0.15 start=0.02 finish=0.17 806) blender elapsed=0.15 start=0.02 finish=0.17 807) blender elapsed=0.15 start=0.02 finish=0.17 808) blender elapsed=0.15 start=0.02 finish=0.17 809) blender elapsed=0.15 start=0.02 finish=0.17 810) blender elapsed=0.15 start=0.02 finish=0.17 811) blender elapsed=0.15 start=0.02 finish=0.17 812) threaded-ml elapsed=0.00 start=0.05 finish=0.05 813) threaded-ml elapsed=0.12 start=0.05 finish=0.17 814) blender elapsed=0.12 start=0.05 finish=0.17 815) blender elapsed=0.00 start=0.05 finish=0.05 817) blender elapsed=0.02 start=0.14 finish=0.16 819) blender elapsed=0.02 start=0.14 finish=0.16 820) blender elapsed=0.02 start=0.14 finish=0.16 821) blender elapsed=0.02 start=0.14 finish=0.16 822) blender elapsed=0.02 start=0.14 finish=0.16 824) blender elapsed=0.02 start=0.14 finish=0.16 826) blender elapsed=0.02 start=0.14 finish=0.16 827) blender elapsed=2631.42 start=0.18 finish=2631.60 829) blender elapsed=2631.41 start=0.19 finish=2631.60 830) blender elapsed=2631.41 start=0.19 finish=2631.60 832) blender elapsed=2631.41 start=0.19 finish=2631.60 833) blender elapsed=2631.41 start=0.19 finish=2631.60 834) blender elapsed=2631.41 start=0.19 finish=2631.60 835) blender elapsed=2631.41 start=0.19 finish=2631.60 836) blender elapsed=2631.41 start=0.19 finish=2631.60 837) blender elapsed=2631.41 start=0.19 finish=2631.60 838) blender elapsed=2630.89 start=0.71 finish=2631.60 842) blender elapsed=2630.89 start=0.71 finish=2631.60 843) blender elapsed=2630.89 start=0.71 finish=2631.60 844) blender elapsed=2630.89 start=0.71 finish=2631.60 846) blender elapsed=2630.89 start=0.71 finish=2631.60 847) blender elapsed=2630.89 start=0.71 finish=2631.60 848) blender elapsed=2630.89 start=0.71 finish=2631.60 849) blender elapsed=2630.61 start=0.99 finish=2631.60 850) blender elapsed=2630.61 start=0.99 finish=2631.60 851) blender elapsed=2630.61 start=0.99 finish=2631.60 852) blender elapsed=2630.61 start=0.99 finish=2631.60 853) blender elapsed=2630.61 start=0.99 finish=2631.60 854) blender elapsed=2630.60 start=1.00 finish=2631.60 855) blender elapsed=2630.60 start=1.00 finish=2631.60 859) blender elapsed=2630.25 start=1.35 finish=2631.60 860) blender elapsed=2630.25 start=1.35 finish=2631.60 864) blender elapsed=2630.25 start=1.35 finish=2631.60 865) blender elapsed=2630.25 start=1.35 finish=2631.60 866) blender elapsed=2630.25 start=1.35 finish=2631.60 869) blender elapsed=2630.25 start=1.35 finish=2631.60 870) blender elapsed=2630.25 start=1.35 finish=2631.60 871) blender elapsed=2629.35 start=2.12 finish=2631.47
About this graph
Overall CPU usage almost at 100% with gaps between workloads visible as well as their relative durations.
IPC numbers are slightly more heavy, partially because there are so many data points. However, it does show the third workload slightly more chaotic and the second and fourth ones more similar.
Somewhat more memory read traffic in the third workload correlates with a higher IPC. However, it will also be useful to correlate this “top down” with memory stall measurements.
About this graph
Topdown metrics show varying amounts of backend, as well as correlation with memory traffic.
Next steps: Drill down on next level metrics, e.g. how much of backend is memory-bound vs. core-bound. At lower priority, anything we can tell about the speculation misses?