Phoronix article – hyperthreading (2018-06-20) – Performance analysis, tools and experiments

Phoronix posted an article comparing hyperthreading on/off on an Intel i7. This post reviews some of the workloads and adds comments.

The Phoronix article is interesting because whether to enable hyper-threading is an important choice. Apparently BSD has chosen to default to hyper-threading off. My uninformed hypotheses are that (a) hyper-threading should help where limited natural parallelism restricts use of the processor, and (b) hyper-threading should hurt where it exacerbates a natural bottleneck such as backend memory stalls. Otherwise, hyper-threading should likely be neutral, neither helping nor hurting, including for single-threaded programs.

So what I’ll do first is look at my existing metrics and comments for the workloads Phoronix tested, and then see if I can take some measurements of the more interesting workloads that emerge. The following table summarizes the benchmarks measured and my past analysis of some of them. Overall, Phoronix saw improvements on almost all the benchmarks, disputing the original hyper-threading hypothesis for BSD.

Benchmark	Phoronix observations	My observations	Analysis
blender	Hyperthreading improved ~30% over non-hyperthreading	On_CPU close to 100% with ~20% stalls in backend and ~20% stalls in front end and bad speculation of ~10% due to branch misses. An overall IPC of 0.80. Blender runs more than one thread at a time, so interesting that this doesn't result in gains.	Analysis
parboil: cutcp	Hyperthreading improves ~15%	On_CPU of 92%, IPC of 0.84. Some backend stalls as largest issue.	Analysis
parboil: stencil	Hyperthreading improves ~10%	On_CPU of 91%, IPC of 0.28 and considerably lower on AMD. Backend stalls account for 78% with many memory misses.	Analysis
rodinia: cfd solver	Hyperthreading improves ~8%	CFD solver: parallel, On_CPU 97%. IPC of 0.64 with a large number of memory stalls. L2 miss rate 62% and L3 miss rate 69% so cache/memory plays big part in overall performance.	Analysis
rodinia: streamcluster	Hyperthreading improves ~8%	streamcluster: parallel, On_CPU 97%. IPC of 0.91 with a moderate number of memory stalls. L2 miss rate 62% and L3 miss rate 69% so cache/memory plays big part in overall performance.	Analysis
hmmer	Hyperthreading improves ~10%	On_CPU of 90% with an IPC of 1.30. Some I/O with 90% voluntary context switches. Overall some backend stalls but high retirement rate.	Analysis
ttsiod-renderer	Hyperthreading improves 30%	On_CPU 98%. IPC 0.85. Frontend/backend stalls similar at 21%/22%. L2 miss rate 67% and L3 miss rate of 10% so cache sizes likely play a factor. Frontend inefficiencies seem to come more from allocation (uop cache 44%) than from icache or itlb misses.	Analysis
vpxenc	Hyperthreading decreases ~2%	On_CPU 30%, with only 4 threads run. IPC is over 2 and 99.5% voluntary context switches. Some backend memory stalls.	Analysis
x264	Hyperthreading improves 15%	On_CPU 71%, many voluntary context switches and I/O read input. IPC 1.3	Analysis
graphics-magick: blur, sharpen, resize	Hyperthreading makes no change	Benchmark has five operations; overall On_CPU 30%. A pool of backend processing threads but not always busy. Backend stalls are largest issue.	Analysis
compress-p7zip	Hyperthreading improves ~30%	On_CPU 88% with some I/O to limit scaling. IPC 0.83 with 27% speculation misses (branch prediction).	Analysis
stockfish	Hyperthreading improves ~30%	On_CPU 100%, IPC of 1.0 with frontend stalls and bad speculation.	Analysis
asmfish	Hyperthreading improves ~30%	On_CPU 100% with IPC 0.96. Frontend stalls of 27% and speculation of 19% (all branch misses)	Analysis
build-linux-kernel	Hyperthreading improves ~20%	On_CPU 88%, mostly parallel compiles with a sequential period at end. High frontend stalls. Fewer processes in subsequent runs, so the benchmark might not do a thorough "clean".	Analysis
build-php	Hyperthreading improves ~15%	On_CPU 82% so less parallel times than build-linux-kernel. Frontend stalls high. Many small short-lived processes.	Analysis
c-ray	Hyperthreading improves ~7%	On_CPU almost 100% with moderately high IPC of 1.44. Frontend stalls of 10% and backend of 15%.	Analysis
povray	Hyperthreading improves 20%	On_CPU almost 100% with IPC of 1.30. More frontend stalls than backend stalls.	Analysis
n-queens	Hyperthreading improves 20%	OpenMP with On_CPU almost 100% and high number of speculative branch misses.	Analysis

The lack of improvement for graphics-magick and vpxenc makes sense if one considers that the On_CPU percentages are 30% and 42% respectively. These workloads cannot keep even the 8 hyper-threads of my processor fully scheduled, so the jump in the Phoronix tests from 6 physical cores to 12 virtual cores likely doesn’t help either.
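The On_CPU reasoning above is simple arithmetic. Here is a minimal sketch, assuming On_CPU is the share of total capacity across all hardware threads, so that 100% means every hardware thread is busy. The function name `busy_threads` is made up for illustration.

```python
def busy_threads(on_cpu_pct: float, hw_threads: int) -> float:
    """Average number of hardware threads kept busy.

    Assumption: on_cpu_pct is a percentage of total capacity across
    all hw_threads, not of a single thread.
    """
    return on_cpu_pct / 100.0 * hw_threads


# On_CPU figures quoted in the paragraph above, on an 8-thread machine.
for name, pct in [("graphics-magick", 30), ("vpxenc", 42)]:
    print(f"{name}: ~{busy_threads(pct, 8):.1f} of 8 hardware threads busy")
```

Under that assumption, both workloads keep roughly 2.4 and 3.4 threads busy on average, which is below the 6 physical cores available even with hyper-threading off.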

What is interesting is that every other benchmark shows gains, sometimes up to 30%. This suggests that throughput-type applications have natural limits to parallelism and cannot always take advantage of all the parallel units without hyper-threading.
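One rough way to line the table up against hypothesis (b) is to group benchmarks by their dominant bottleneck and average the Phoronix gains. The sketch below does that in Python. The gains are copied from the table, but the "backend" and "frontend/speculation" labels are an assumed reading of the "My observations" column, not measurements. Rows with low On_CPU or no clear dominant stall are left out.

```python
from statistics import mean

# (benchmark, Phoronix gain in %, dominant bottleneck).
# The bottleneck labels are an assumption for illustration only.
ROWS = [
    ("parboil: cutcp", 15, "backend"),
    ("parboil: stencil", 10, "backend"),
    ("rodinia: cfd solver", 8, "backend"),
    ("rodinia: streamcluster", 8, "backend"),
    ("compress-p7zip", 30, "frontend/speculation"),
    ("stockfish", 30, "frontend/speculation"),
    ("asmfish", 30, "frontend/speculation"),
    ("build-linux-kernel", 20, "frontend/speculation"),
    ("build-php", 15, "frontend/speculation"),
    ("povray", 20, "frontend/speculation"),
    ("n-queens", 20, "frontend/speculation"),
]


def mean_gain_by_bottleneck(rows):
    """Return {bottleneck label: mean hyper-threading gain in %}."""
    groups = {}
    for _name, gain, bottleneck in rows:
        groups.setdefault(bottleneck, []).append(gain)
    return {label: mean(gains) for label, gains in groups.items()}


for label, gain in mean_gain_by_bottleneck(ROWS).items():
    print(f"{label}: mean gain {gain:.1f}%")
```

With this grouping the backend-bound benchmarks gain less on average than the frontend/speculation-bound ones, but none of them actually slows down. The split is only as good as the labels, so treat it as a prompt for further measurement rather than a result.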
