y-cruncher – Performance analysis, tools and experiments

Description - phoronix/y-cruncher

Y-Cruncher is a multi-threaded Pi benchmark.

This benchmark is also described here. It claims a record for computing the most digits of pi.

Metrics (Intel) - phoronix/y-cruncher

sh - pid 18152
	On_CPU   0.897
	On_Core  7.173
	IPC      1.178
	Retire   0.480	(48.0%)
	FrontEnd 0.138	(13.8%)
	Spec     0.122	(12.2%)
	Backend  0.260	(26.0%)
	Elapsed  65.10
	Procs    21
	Maxrss   2565K
	Minflt   661484
	Majflt   0
	Inblock  0
	Oublock  976608
	Msgsnd   0
	Msgrcv   0
	Nsignals 0
	Nvcsw    31375	(94.5%)
	Nivcsw   1820
	Utime    465.602228
	Stime    1.364009
	Start    90263.07
	Finish   90328.17

About 94.5% of the program's context switches are voluntary and it writes blocks out (Oublock 976608), so there is some I/O. Otherwise the main limiter is backend stalls (26.0%), resulting in an IPC slightly over 1 and a low retirement rate (48.0%).
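The derived figures in the listing can be reproduced from the raw counters. The sketch below assumes that On_Core is total CPU time (Utime + Stime) divided by Elapsed, that On_CPU is On_Core divided by the number of virtual cores, and that the percentage beside Nvcsw is the voluntary share of all context switches. These definitions are inferred from the numbers, not taken from the tool's documentation, and the function name, field names and the `ncpus=8` value are hypothetical.

```python
def derived_metrics(utime, stime, elapsed, nvcsw, nivcsw, ncpus):
    """Recompute summary ratios from rusage-style counters (assumed definitions)."""
    on_core = (utime + stime) / elapsed    # average number of busy cores
    on_cpu = on_core / ncpus               # fraction of the whole machine in use
    vol_share = nvcsw / (nvcsw + nivcsw)   # voluntary share of context switches
    return {"on_core": on_core, "on_cpu": on_cpu, "vol_share": vol_share}


# Intel run from the listing; ncpus=8 is an assumption inferred from On_Core / On_CPU.
intel = derived_metrics(465.602228, 1.364009, 65.10, 31375, 1820, ncpus=8)
print({name: round(value, 3) for name, value in intel.items()})
```

Under these assumptions the rounded values agree with the On_CPU, On_Core and Nvcsw percentage shown in the listing.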

Metrics (AMD) - phoronix/y-cruncher

sh - pid 6382
	On_CPU   0.761
	On_Core  12.168
	IPC      0.842
	FrontCyc 0.008	(0.8%)
	BackCyc  0.014	(1.4%)
	Elapsed  64.46
	Procs    37
	Maxrss   2563K
	Minflt   661224
	Majflt   0
	Inblock  32
	Oublock  976608
	Msgsnd   0
	Msgrcv   0
	Nsignals 0
	Nvcsw    68925	(96.9%)
	Nivcsw   2198
	Utime    781.762266
	Stime    2.612581
	Start    159119.22
	Finish   159183.68

AMD IPC is somewhat lower (0.842 versus 1.178 on Intel).

Process Tree - phoronix/y-cruncher
The program runs two processes per virtual core.

    18152) sh
      18153) y-cruncher
        18154) y-cruncher
          18156) sh
            18157) 13-HSW ~ Airi
            18158) 13-HSW ~ Airi
            18159) 13-HSW ~ Airi
            18160) 13-HSW ~ Airi
            18161) 13-HSW ~ Airi
            18162) 13-HSW ~ Airi
            18163) 13-HSW ~ Airi
            18164) 13-HSW ~ Airi
            18165) 13-HSW ~ Airi
            18166) 13-HSW ~ Airi
            18167) 13-HSW ~ Airi
            18168) 13-HSW ~ Airi
            18169) 13-HSW ~ Airi
            18170) 13-HSW ~ Airi
            18171) 13-HSW ~ Airi
            18172) 13-HSW ~ Airi
        18155) sed
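The "two processes per virtual core" figure can be checked by counting the entries in the tree. This is a minimal sketch: the process names are copied from the tree above, while `VIRTUAL_CORES = 8` is an assumption inferred from the ratio of On_Core to On_CPU in the Intel metrics, not something the tree itself states.

```python
from collections import Counter

# Process names from the tree above; the 16 worker entries have PIDs 18157-18172.
names = ["sh", "y-cruncher", "y-cruncher", "sh", "sed"] + ["13-HSW ~ Airi"] * 16

VIRTUAL_CORES = 8  # assumption: inferred from On_Core / On_CPU, not reported directly

counts = Counter(names)
workers = counts["13-HSW ~ Airi"]

print("total processes:", len(names))
print("workers:", workers)
print("workers per virtual core:", workers / VIRTUAL_CORES)
```

The total of 21 entries matches the Procs value in the Intel metrics, and 16 workers on an assumed 8 virtual cores gives the stated ratio of two.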

Some noise as these processes are scheduled on all cores.

The overall IPC is consistent and slightly over 1.

Backend stalls are the largest limiter.

Topdown (Intel)

retire         0.588
	ms_uops                0.001
speculation    0.004
	branch_misses          5.28%
	machine_clears         94.72%
frontend       0.137
	idq_uops_delivered_0   0.054
		icache_stall               0.009
		itlb_misses                0.000
	idq_uops_delivered_1   0.060
	idq_uops_delivered_2   0.071
	idq_uops_delivered_3   0.088
		dsb_ops                    55.17%
backend        0.271
	resource_stalls.sb     0.033
	stalls_ldm_pending     0.214

The overall retirement rate here (0.588) is higher than in the summary metrics above (0.480), which also seems more consistent with the IPC. The breakdown shows a few frontend stalls (branch resteers?) and some memory stalls (stalls_ldm_pending at 0.214).
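To see where the two Intel measurements differ, the four top-level categories can be compared directly. This is a minimal sketch using only the values from the two listings; the dictionary names are hypothetical, and the comparison assumes both runs report the same four categories as fractions of the same total.

```python
# Top-level fractions copied from the two Intel listings.
summary = {"retire": 0.480, "frontend": 0.138, "speculation": 0.122, "backend": 0.260}
topdown = {"retire": 0.588, "frontend": 0.137, "speculation": 0.004, "backend": 0.271}

# Each set of fractions should account for (approximately) all pipeline slots.
for label, run in (("summary", summary), ("topdown", topdown)):
    print(f"{label}: total = {sum(run.values()):.3f}")

# Per-category shift between the two measurements.
delta = {name: round(topdown[name] - summary[name], 3) for name in summary}
print(delta)
```

Both sets sum to 1.000. The frontend and backend fractions barely move between the runs, so the higher retirement rate is offset almost entirely by the lower speculation figure.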

Next steps: None