luajit – Performance analysis, tools and experiments

Description - phoronix/luajit

This test profile is a collection of Lua scripts/benchmarks run against a locally-built copy of LuaJIT upstream.

This is a single-threaded workload, and all tests below were pinned to core 1. The test runs “scimark.lua”, which goes through algorithms similar to those in scimark2 and java-scimark2. In the order they are run, these are:

Composite
Monte Carlo
Fast Fourier Transform
Sparse Matrix Multiply
Dense LU Matrix Factorization
Jacobi Successive Over-Relaxation

They all run within a single process so the overall metrics reported are a composite. However, in the time sequences below, one can see slightly different characteristics as the program goes through each workload.
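The report does not say how the run was pinned to core 1. As one possible approach on Linux, the sketch below builds a command line that uses the util-linux `taskset -c` option to restrict a child process to a single CPU. The helper name `pinned_command` and the bare `luajit scimark.lua` invocation are assumptions for illustration; the actual test profile may pass different arguments.

```python
def pinned_command(core, argv):
    """Prefix argv with `taskset -c <core>` so the child is restricted to that CPU."""
    return ["taskset", "-c", str(core)] + list(argv)


# Hypothetical invocation: the real benchmark command line is not given in the report.
cmd = pinned_command(1, ["luajit", "scimark.lua"])
print(" ".join(cmd))
```

The resulting list can be handed to `subprocess.run` on a machine where both `taskset` and `luajit` are installed.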

Metrics (Intel) - phoronix/luajit

sh - pid 19626
	On_CPU   0.125
	On_Core  1.000
	IPC      2.026
	Retire   0.385	(38.5%)
	FrontEnd 0.025	(2.5%)
	Spec     0.136	(13.6%)
	Backend  0.454	(45.4%)
	Elapsed  30.37
	Procs    3
	Minflt   13347
	Majflt   0
	Utime    30.35   	(100.0%)
	Stime    0.01    	(0.0%)
	Start    99616.02
	Finish   99646.39

Overall, the workload is on-core essentially 100% of the time (On_Core 1.000). On average, the primary limiter seems to be backend stalls at 45.4%, well ahead of speculation (13.6%) and the front end (2.5%).
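As a quick sanity check on the Intel numbers, the sketch below parses the metric lines and confirms that the four breakdown categories account for all of the measured slots, that the backend is the largest stall category, and that Elapsed matches Finish minus Start. It assumes that Retire, FrontEnd, Spec and Backend are fractions of one common total (as in a level-1 top-down breakdown); the function names are illustrative and not part of any tool used in the report.

```python
import re

# Intel metric lines copied from the report (process-count and fault lines omitted).
INTEL_METRICS = """\
On_CPU   0.125
On_Core  1.000
IPC      2.026
Retire   0.385 (38.5%)
FrontEnd 0.025 (2.5%)
Spec     0.136 (13.6%)
Backend  0.454 (45.4%)
Elapsed  30.37
Utime    30.35 (100.0%)
Stime    0.01 (0.0%)
Start    99616.02
Finish   99646.39
"""


def parse_metrics(text):
    """Map each metric name to its first numeric value; the parenthesised percentage is ignored."""
    metrics = {}
    for line in text.splitlines():
        match = re.match(r"\s*(\w+)\s+([0-9.]+)", line)
        if match:
            metrics[match.group(1)] = float(match.group(2))
    return metrics


def breakdown(metrics):
    """Return the four category fractions and the name of the largest non-retiring category."""
    slots = {name: metrics[name] for name in ("Retire", "FrontEnd", "Spec", "Backend")}
    stalls = {name: value for name, value in slots.items() if name != "Retire"}
    limiter = max(stalls, key=stalls.get)
    return slots, limiter


metrics = parse_metrics(INTEL_METRICS)
slots, limiter = breakdown(metrics)
print("category fractions sum to", round(sum(slots.values()), 3))
print("largest stall category:", limiter)
print("finish - start =", round(metrics["Finish"] - metrics["Start"], 2))
```

The AMD block reports only FrontCyc and BackCyc, so the same sum-to-one check does not apply there.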

Metrics (AMD) - phoronix/luajit

sh - pid 3564
	On_CPU   0.062
	On_Core  0.999
	IPC      1.931
	FrontCyc 0.012	(1.2%)
	BackCyc  0.245	(24.5%)
	Elapsed  33.12
	Procs    3
	Minflt   13343
	Majflt   0
	Utime    33.08   	(100.0%)
	Stime    0.01    	(0.0%)
	Start    405832.45
	Finish   405865.57

The AMD metrics show only a slightly lower composite IPC (1.931 versus 2.026 on Intel).
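To put a number on "slightly lower", the following sketch computes the relative difference in IPC and elapsed time from the two metric blocks. The values are copied from the report; `relative_change` is a helper introduced here. Clock frequencies are not reported, so the elapsed-time gap cannot be attributed to IPC alone.

```python
# Composite figures copied from the Intel and AMD metric blocks.
intel = {"IPC": 2.026, "Elapsed": 30.37}
amd = {"IPC": 1.931, "Elapsed": 33.12}


def relative_change(new, base):
    """Fractional change of `new` relative to `base`."""
    return (new - base) / base


ipc_delta = relative_change(amd["IPC"], intel["IPC"])
elapsed_delta = relative_change(amd["Elapsed"], intel["Elapsed"])
print(f"IPC, AMD vs Intel:     {ipc_delta:+.1%}")      # -4.7%
print(f"Elapsed, AMD vs Intel: {elapsed_delta:+.1%}")  # +9.1%
```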

Process Tree - phoronix/luajit
The process tree is simple.

   19626) sh elapsed=30.37 start=0.00 finish=30.37
      19627) luajit elapsed=30.37 start=0.00 finish=30.37
        19628) luajit elapsed=30.37 start=0.00 finish=30.37
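For deeper or less regular trees, the listing format above can be turned into parent/child records. This is a minimal sketch that assumes nesting is conveyed purely by indentation and that each line has the form `pid) command key=value ...`; the parser is illustrative and is not the tool that produced the listing.

```python
import re

# Process tree lines copied from the report.
TREE = """\
   19626) sh elapsed=30.37 start=0.00 finish=30.37
      19627) luajit elapsed=30.37 start=0.00 finish=30.37
        19628) luajit elapsed=30.37 start=0.00 finish=30.37
"""

LINE = re.compile(r"^(\s*)(\d+)\) (\S+) (.*)$")


def parse_tree(text):
    """Return one dict per process with pid, cmd, timing fields, and parent pid."""
    nodes = []
    stack = []  # (indent, pid) pairs for the ancestors of the current line
    for line in text.splitlines():
        match = LINE.match(line)
        if not match:
            continue
        indent = len(match.group(1))
        pid = int(match.group(2))
        cmd = match.group(3)
        fields = {}
        for item in match.group(4).split():
            key, value = item.split("=")
            fields[key] = float(value)
        # Drop entries that are not ancestors (same or deeper indentation).
        while stack and stack[-1][0] >= indent:
            stack.pop()
        parent = stack[-1][1] if stack else None
        nodes.append({"pid": pid, "cmd": cmd, "parent": parent, **fields})
        stack.append((indent, pid))
    return nodes


nodes = parse_tree(TREE)
for node in nodes:
    print(node["pid"], node["cmd"], "parent:", node["parent"], "elapsed:", node["elapsed"])
```

The three records agree with the "Procs 3" line in the metrics: one shell and two luajit processes, each spanning the full 30.37 seconds.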

Core 1 is kept scheduled while the test runs.

IPC varies from less than 1 to over 3.5 across the workloads. Running them separately would sort out the specifics.

IPC is almost inversely proportional to backend stalls: as backend stalls rise, IPC declines. One workload shows a particularly high rate of speculation misses.

Next steps: separate out the workloads and drill deeper into speculation misses, the IPC differences between AMD and Intel, and the likely causes of backend stalls (core versus memory). Note that there is a fair overlap among scimark2, java-scimark2 and luajit, so all three can show similarly patterned results.