scimark2 – Performance analysis, tools and experiments

Description - phoronix/scimark2

This test runs the ANSI C version of SciMark 2.0, a benchmark for scientific and numerical computing developed by programmers at the National Institute of Standards and Technology. It is made up of Fast Fourier Transform, Jacobi Successive Over-relaxation, Monte Carlo, Sparse Matrix Multiply, and dense LU matrix factorization benchmarks.

scimark2 and java-scimark2 were developed around 1999 by NIST: NIST java SciMark 2.0 page. There have been GCC bugs related to scimark2 (54073, 53397 and perhaps others), so it is useful to check that the compiler is generating the code you expect.

scimark is single-threaded and was designed at a time when caches were smaller, though it is run with the -large option for larger data sets. All tests below were run pinned to core 1. The test runs all five workloads below in a single process and reports both individual scores and a composite. As you’ll see in the graphs, these workloads have somewhat different characteristics and can be told apart on the graphs, so overall metrics like IPC are also a composite.

FFT
SOR
MonteCarlo
Sparse matmul
LU

The workloads are described in more detail here.
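As a rough illustration of running a benchmark pinned to one core, here is a minimal Python sketch for Linux. It is not the harness used for these measurements; the helper name run_pinned is hypothetical, and any binary path passed to it is an assumption about where scimark2 was built.

```python
import os
import subprocess


def run_pinned(cmd, cpu):
    """Run cmd as a child process restricted to one CPU (Linux only).

    run_pinned is a hypothetical helper for illustration, not part of
    scimark2 or the Phoronix Test Suite.
    """
    def pin():
        # Executed in the child between fork and exec; pid 0 means
        # "the calling process", so only the child is pinned.
        os.sched_setaffinity(0, {cpu})

    return subprocess.run(
        cmd,
        preexec_fn=pin,
        capture_output=True,
        text=True,
        check=True,
    )
```

A call such as run_pinned(["./scimark2", "-large"], cpu=1) would then keep the whole run on core 1, assuming the binary sits in the current directory.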

Metrics (Intel) - phoronix/scimark2

sh - pid 3669
	On_CPU   0.125
	On_Core  1.000
	IPC      1.950
	Retire   0.392	(39.2%)
	FrontEnd 0.025	(2.5%)
	Spec     0.150	(15.0%)
	Backend  0.433	(43.3%)
	Elapsed  26.66
	Procs    3
	Minflt   8261
	Majflt   0
	Utime    26.65   	(100.0%)
	Stime    0.00    	(0.0%)
	Start    3794.14
	Finish   3820.80

The metrics above were adjusted to account for the process being single-threaded: twice as many slots are available, whereas my tool assumed the slots were split across two processes. The front-end fraction is very small, so the code fits in the iCache and TLB. Speculation is somewhat high and backend is higher still; both come from specific workloads, as broken out below.
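To make the slot accounting concrete, here is a small Python sketch of the commonly published level-1 topdown formulas for Intel cores. It is an assumption that the tool used here computes its fractions this way. The 4-wide slot count and the counter values in the usage note are hypothetical, chosen only so the result matches the fractions in the table above; they are not measured counts.

```python
def topdown_level1(cycles, retire_slots, uops_issued,
                   idq_not_delivered, recovery_cycles, width=4):
    """Level-1 topdown fractions from raw counter values.

    Arguments correspond to CPU_CLK_UNHALTED.THREAD,
    UOPS_RETIRED.RETIRE_SLOTS, UOPS_ISSUED.ANY,
    IDQ_UOPS_NOT_DELIVERED.CORE and INT_MISC.RECOVERY_CYCLES.
    width is the number of issue slots per cycle (assumed to be 4).
    """
    # Every fraction shares this denominator, so any assumption about
    # how many slots the thread can use rescales all of them.
    slots = width * cycles
    retire = retire_slots / slots
    frontend = idq_not_delivered / slots
    spec = (uops_issued - retire_slots + width * recovery_cycles) / slots
    # Backend is whatever is left once the other three are accounted for.
    backend = 1.0 - retire - frontend - spec
    return {"Retire": retire, "FrontEnd": frontend,
            "Spec": spec, "Backend": backend}
```

With the made-up inputs cycles=1000, retire_slots=1568, uops_issued=1968, idq_not_delivered=100 and recovery_cycles=50, the function returns approximately 0.392, 0.025, 0.150 and 0.433, the same split as the Intel table.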

Metrics (AMD) - phoronix/scimark2

sh - pid 28072
	On_CPU   0.062
	On_Core  0.999
	IPC      1.704
	FrontCyc 0.977	(97.7%)
	BackCyc  0.062	(6.2%)
	Elapsed  26.37
	Procs    3
	Minflt   8263
	Majflt   0
	Utime    26.35   	(100.0%)
	Stime    0.00    	(0.0%)
	Start    310324.42
	Finish   310350.79

AMD metrics show slightly lower IPC.

Process Tree - phoronix/scimark2
The process tree is simple:

   3669) sh elapsed=26.66 start=0.00 finish=26.66
      3670) scimark2 elapsed=26.66 start=0.00 finish=26.66
        3671) scimark2 elapsed=26.66 start=0.00 finish=26.66

Processor core 1 is kept scheduled almost 100% of the time.

The IPCs of the five workloads can be seen separately: FFT (~1), SOR (~2.5), MonteCarlo (~1), Sparse Matmul (~2.5) and LU (~2.7), which together produce the composite IPC of 1.98.
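A composite IPC is total instructions divided by total cycles, which works out to a cycle-weighted average of the per-phase IPCs. The Python sketch below shows the arithmetic using the approximate per-workload values quoted above. Giving every phase the same number of cycles is an assumption made purely for illustration; the real phases need not take equal time.

```python
def composite_ipc(phases):
    """Combine per-phase IPCs into one overall IPC.

    phases is a list of (ipc, cycles) pairs. The composite is total
    instructions over total cycles, i.e. a cycle-weighted mean.
    """
    total_cycles = sum(cycles for _, cycles in phases)
    total_instructions = sum(ipc * cycles for ipc, cycles in phases)
    return total_instructions / total_cycles


# Approximate IPCs read off the graphs; equal cycle weights are assumed.
approx_phases = [
    (1.0, 1.0),  # FFT
    (2.5, 1.0),  # SOR
    (1.0, 1.0),  # MonteCarlo
    (2.5, 1.0),  # Sparse matmul
    (2.7, 1.0),  # LU
]
print(round(composite_ipc(approx_phases), 2))
```

Under the equal-weight assumption this comes out at about 1.94, in the same neighbourhood as the measured composite. The remaining gap is plausible given that the per-phase values are rough readings and the phases do not necessarily run for the same number of cycles.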

The topdown metrics also show variations between the five workloads:

FFT – is backend bound
SOR – has a smaller backend issue and retires more slots
MonteCarlo – has a particularly high amount of speculation misses
SparseMatmul – is similar to SOR with slightly higher back end and lower front end stalls
LU – has highest retire rates

Overall a next level of analysis could tease these apart to characterize them separately.

Next steps: understand the AMD/Intel IPC gap, the speculation misses in Monte Carlo, and the backend misses in SOR by separating out the workloads.