tensorflow – Performance analysis, tools and experiments

Description - phoronix/tensorflow

This is a benchmark of the TensorFlow deep learning framework using the CIFAR-10 data set.

Metrics (Intel) - phoronix/tensorflow

sh - pid 20011
	On_CPU   0.873
	On_Core  6.987
	IPC      1.062
	Retire   0.559	(55.9%)
	FrontEnd 0.101	(10.1%)
	Spec     0.025	(2.5%)
	Backend  0.315	(31.5%)
	Elapsed  91.25
	Procs    50
	Maxrss   623K
	Minflt   20401674
	Majflt   0
	Inblock  16
	Oublock  397768
	Msgsnd   0
	Msgrcv   0
	Nsignals 0
	Nvcsw    1043029	(83.0%)
	Nivcsw   213430
	Utime    591.999773
	Stime    45.550762
	Start    753175.79
	Finish   753267.04

The program spends a steady amount of time in the kernel as well as in user code, and does some output I/O, leaving On_CPU at 87%. The largest limiter is backend stalls, at 31.5%.
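The derived figures in the listing can be reproduced from the raw counters. The sketch below assumes that On_Core is (Utime + Stime) / Elapsed, that On_CPU is On_Core divided by the number of logical CPUs, and that the percentage next to Nvcsw is its share of all context switches. The CPU count of 8 is inferred from the ratio of the two reported values and is not stated in the source; the function names are illustrative.

```python
def on_core(utime, stime, elapsed):
    """Average number of cores kept busy: CPU seconds per wall-clock second."""
    return (utime + stime) / elapsed


def voluntary_fraction(nvcsw, nivcsw):
    """Share of context switches that were voluntary."""
    return nvcsw / (nvcsw + nivcsw)


# Values copied from the Intel listing above.
utime, stime, elapsed = 591.999773, 45.550762, 91.25
nvcsw, nivcsw = 1043029, 213430
ncpus = 8  # assumption: inferred from On_Core / On_CPU, not reported directly

cores_busy = on_core(utime, stime, elapsed)
print(f"On_Core     {cores_busy:.3f}")
print(f"On_CPU      {cores_busy / ncpus:.3f}")
print(f"Nvcsw share {voluntary_fraction(nvcsw, nivcsw):.1%}")
```

Under these assumptions the three printed values agree with the On_Core, On_CPU and Nvcsw percentage shown in the listing.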

Metrics (AMD) - phoronix/tensorflow

sh - pid 16175
	On_CPU   0.722
	On_Core  11.546
	IPC      0.891
	FrontCyc 0.017	(1.7%)
	BackCyc  0.050	(5.0%)
	Elapsed  80.41
	Procs    214
	Maxrss   842K
	Minflt   20562855
	Majflt   0
	Inblock  8
	Oublock  397680
	Msgsnd   0
	Msgrcv   0
	Nsignals 0
	Nvcsw    1631884	(85.8%)
	Nivcsw   269298
	Utime    851.494691
	Stime    76.885744
	Start    2520.93
	Finish   2601.34

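IPC on AMD is about 16% lower than on Intel (0.891 versus 1.062). The AMD run still finishes sooner (80.41 versus 91.25 seconds) while keeping more cores busy (On_Core 11.546 versus 6.987).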
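The comparison between the two listings is simple arithmetic on the reported values. A minimal sketch, with an illustrative helper name:

```python
def relative_gap(value, baseline):
    """Fraction by which value falls below baseline."""
    return 1.0 - value / baseline


# Values copied from the Intel and AMD listings above.
intel_ipc, amd_ipc = 1.062, 0.891
intel_elapsed, amd_elapsed = 91.25, 80.41

ipc_gap = relative_gap(amd_ipc, intel_ipc)
elapsed_gap = relative_gap(amd_elapsed, intel_elapsed)

print(f"AMD IPC is {ipc_gap:.1%} lower than Intel")
print(f"AMD elapsed time is {elapsed_gap:.1%} shorter than Intel")
```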

Process Tree - phoronix/tensorflow
The tree shows 45 python3 processes started under tensorflow, several per core.


   20011) sh
      20012) tensorflow
        20013) python3
          20021) ldconfig.real
          20022) sh
            20023) uname
        20014) python3
        20015) python3
        20016) python3
        20017) python3
        20018) python3
        20019) python3
        20020) python3
        20024) python3
        20025) python3
        20026) python3
        20027) python3
        20028) python3
        20029) python3
        20030) python3
        20031) python3
        20032) python3
        20033) python3
        20034) python3
        20035) python3
        20036) python3
        20037) python3
        20038) python3
        20039) python3
        20040) python3
        20041) python3
        20042) python3
        20043) python3
        20044) python3
        20045) python3
        20046) python3
        20047) python3
        20048) python3
        20049) python3
        20050) python3
        20051) python3
        20052) python3
        20053) python3
        20054) python3
        20055) python3
        20056) python3
        20057) python3
        20058) python3
        20059) python3
        20060) python3
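Counting the entries by command name is a quick way to check the tree against the Procs value in the metrics. A minimal sketch, assuming each line has the form `pid) command` as in the listing; only a short excerpt of the tree is embedded here, and the variable names are illustrative.

```python
import re
from collections import Counter

# Short excerpt of the tree above; the full listing parses the same way.
tree = """\
   20011) sh
      20012) tensorflow
        20013) python3
          20021) ldconfig.real
          20022) sh
            20023) uname
        20014) python3
        20015) python3
"""

# Each line is "<indent><pid>) <command>".
entry = re.compile(r"^\s*(\d+)\)\s+(\S+)$")

counts = Counter()
for line in tree.splitlines():
    match = entry.match(line)
    if match:
        counts[match.group(2)] += 1

for command, n in counts.most_common():
    print(f"{command:14s} {n}")
```

Applied to the full tree, the same count gives 45 python3 entries plus sh (twice), tensorflow, ldconfig.real and uname, which matches the Procs value of 50 in the Intel listing.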

Summing the times across all CPUs shows a small amount of system time.

Overall, the work is scheduled across all cores.

IPC is consistently around 1.

The top-down breakdown shows that backend stalls are the largest issue.

retire         0.526
	ms_uops                0.016
speculation    0.024
	branch_misses          16.70%
	machine_clears         83.30%
frontend       0.099
	idq_uops_delivered_0   0.035
	idq_uops_delivered_1   0.042
	idq_uops_delivered_2   0.052
	idq_uops_delivered_3   0.067
backend        0.351
	resource_stalls.sb     0.109
	stalls_ldm_pending     0.319

Backend stalls tend to dominate.
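The four top-level categories in the listing above should account for all issue slots, which gives a simple sanity check before drilling further. A minimal sketch using the reported values; the dictionary layout and variable names are illustrative.

```python
# Top-level top-down fractions copied from the listing above.
topdown = {
    "retire": 0.526,
    "speculation": 0.024,
    "frontend": 0.099,
    "backend": 0.351,
}

# The categories should sum to 1.0 if the breakdown is complete.
total = sum(topdown.values())

# Everything except retirement counts as lost work; find the largest loss.
stalls = {name: frac for name, frac in topdown.items() if name != "retire"}
largest = max(stalls, key=stalls.get)

print(f"total   {total:.3f}")
print(f"largest {largest} ({stalls[largest]:.1%})")
```

Within the backend category, the reported stalls_ldm_pending value (0.319) is about three times resource_stalls.sb (0.109), which points at pending loads as the first thing to examine.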

Next steps: dig deeper into the backend stalls.