Description - phoronix/scimark2

This test runs the ANSI C version of SciMark 2.0, a benchmark for scientific and numerical computing developed by programmers at the National Institute of Standards and Technology. It is made up of Fast Fourier Transform, Jacobi Successive Over-relaxation, Monte Carlo, Sparse Matrix Multiply, and dense LU matrix factorization benchmarks.

scimark2 and java-scimark2 were developed at NIST around 1999 (see the NIST Java SciMark 2.0 page). GCC has had bugs related to scimark2, including 54073 and 53397 and perhaps others, so it is worth checking that the compiler is generating the code you expect.

scimark is single-threaded and was designed at a time when caches were smaller, though here it is run with the -large option for larger data sets. All tests below were run pinned to core 1. The test runs all five workloads below in a single process and reports both individual scores and a composite. As the graphs show, these workloads have somewhat different characteristics and can be told apart on the graphs, so overall metrics such as IPC are also a composite.

  • FFT
  • SOR
  • MonteCarlo
  • Sparse matmul
  • LU

These workloads are described in more detail here.
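Pinning to a single core, as was done for these runs, can be achieved in several ways, and the source does not say which tool was used. As a minimal sketch, assuming Linux and Python's `os.sched_setaffinity`, the helper below (the name `pin_to_cpu` is hypothetical) restricts the current process to one CPU so that any benchmark launched from it inherits the same mask:

```python
import os


def pin_to_cpu(cpu: int) -> set:
    """Restrict the calling process (and its future children) to one CPU.

    Linux-only: os.sched_setaffinity is not available on every platform.
    """
    allowed = os.sched_getaffinity(0)  # CPUs this process may currently use
    if cpu not in allowed:
        raise ValueError(f"CPU {cpu} is not in the allowed set {sorted(allowed)}")
    os.sched_setaffinity(0, {cpu})  # pid 0 means "the calling process"
    return os.sched_getaffinity(0)


# Usage (assumption: core 1 exists and is allowed on this machine):
#   pin_to_cpu(1)
#   ...then launch the benchmark, e.g. via subprocess; children inherit the mask.
```

Checking the allowed set first gives a clear error on machines or containers where the requested core is not available.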

    Metrics (Intel) - phoronix/scimark2
    sh - pid 3669
    	On_CPU   0.125
    	On_Core  1.000
    	IPC      1.950
    	Retire   0.392	(39.2%)
    	FrontEnd 0.025	(2.5%)
    	Spec     0.150	(15.0%)
    	Backend  0.433	(43.3%)
    	Elapsed  26.66
    	Procs    3
    	Minflt   8261
    	Majflt   0
    	Utime    26.65   	(100.0%)
    	Stime    0.00    	(0.0%)
    	Start    3794.14
    	Finish   3820.80
    

    The metrics above were adjusted to account for the process being single-threaded: twice as many slots are available to it, whereas my tool assumed the slots were spread across two processes. Front-end stalls are very small (2.5%), so the code fits in the instruction cache and TLB. Speculation is somewhat high (15.0%) and backend stalls are higher still (43.3%); both can be traced to specific workloads below.
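The four topdown fractions in the Intel metrics partition the available issue slots, so they sum to 1.0 (0.392 + 0.025 + 0.150 + 0.433). As a rough sketch of how such level-1 fractions are commonly derived from raw counters, the function below follows the widely used topdown level-1 formulation for a 4-wide core. It is not necessarily what the tool used here does: the function name, the parameter names and the counter values are all illustrative assumptions, and the values were chosen only so that the output matches the fractions reported above.

```python
def topdown_level1(cycles, uops_not_delivered, uops_issued,
                   uops_retired_slots, recovery_cycles, width=4):
    """Return level-1 topdown fractions from raw counter values.

    width is the number of issue slots per cycle (4 is an assumption
    that fits many Intel cores, not a property stated in the source).
    """
    slots = width * cycles
    frontend = uops_not_delivered / slots
    # Slots wasted on uops that were issued but never retired, plus
    # slots lost while recovering from mispredictions.
    spec = (uops_issued - uops_retired_slots + width * recovery_cycles) / slots
    retire = uops_retired_slots / slots
    # Whatever remains is attributed to backend stalls.
    backend = 1.0 - (frontend + spec + retire)
    return {"Retire": retire, "FrontEnd": frontend,
            "Spec": spec, "Backend": backend}


# Hypothetical counter values, scaled to 1000 cycles for readability.
fractions = topdown_level1(cycles=1000, uops_not_delivered=100,
                           uops_issued=1968, uops_retired_slots=1568,
                           recovery_cycles=50)
for name, value in fractions.items():
    print(f"{name:9s}{value:.3f}")
```

Because backend is computed as the remainder, the four fractions always sum to 1.0, which is a quick sanity check for any reported breakdown.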

    Metrics (AMD) - phoronix/scimark2
    sh - pid 28072
    	On_CPU   0.062
    	On_Core  0.999
    	IPC      1.704
    	FrontCyc 0.977	(97.7%)
    	BackCyc  0.062	(6.2%)
    	Elapsed  26.37
    	Procs    3
    	Minflt   8263
    	Majflt   0
    	Utime    26.35   	(100.0%)
    	Stime    0.00    	(0.0%)
    	Start    310324.42
    	Finish   310350.79
    

    The AMD metrics show a slightly lower IPC (1.704 versus 1.950 on Intel).

    Process Tree - phoronix/scimark2
    The process tree is simple:

       3669) sh elapsed=26.66 start=0.00 finish=26.66
          3670) scimark2 elapsed=26.66 start=0.00 finish=26.66
            3671) scimark2 elapsed=26.66 start=0.00 finish=26.66
    


    Processor core 1 is kept scheduled almost 100% of the time.


    The IPC of each of the five workloads is visible in the graph: FFT (~1), SOR (~2.5), MonteCarlo (~1), Sparse Matmul (~2.5) and LU (~2.7), which together produce the composite IPC of 1.98.
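A composite IPC is total instructions divided by total cycles, which is the same as a cycle-weighted average of the per-workload IPCs. The sketch below illustrates that arithmetic with the approximate IPC values read off the graph. The equal cycle weights are an assumption, since the source does not give the cycle split between workloads, and `composite_ipc` is a hypothetical helper name.

```python
def composite_ipc(phases):
    """phases: list of (ipc, cycles) pairs, one per workload."""
    total_cycles = sum(cycles for _, cycles in phases)
    total_instructions = sum(ipc * cycles for ipc, cycles in phases)
    return total_instructions / total_cycles


# Approximate per-workload IPCs: FFT, SOR, MonteCarlo, Sparse Matmul, LU.
# Each workload is given the same cycle weight (an assumption).
approx = [(1.0, 1.0), (2.5, 1.0), (1.0, 1.0), (2.5, 1.0), (2.7, 1.0)]
print(f"{composite_ipc(approx):.2f}")
```

With equal weights this gives about 1.94, in the same range as the measured composite. The exact value depends on how many cycles each workload actually runs for.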


    The topdown metrics also show variations between the five workloads:

    • FFT – is backend bound
    • SOR – has a smaller backend issue and retires more slots
    • MonteCarlo – has a particularly high rate of speculation misses
    • SparseMatmul – is similar to SOR, with slightly higher backend and lower frontend stalls
    • LU – has highest retire rates

    Overall a next level of analysis could tease these apart to characterize them separately.

    Next steps: separate out the workloads to understand the AMD/Intel IPC gap, the speculation misses in MonteCarlo, and the backend misses in SOR.