openssl – AMD vs Intel
The openssl Phoronix benchmark is interesting because the IPC on the Intel Haswell system (1.66) is considerably higher than the IPC on the AMD Ryzen (1.12). In this post, I'll look for the causes.
On Intel, the topdown metrics show a benchmark with a very high retirement rate and few stalls:
Retire 0.923 Frontend 0.065 Spec 0.006 Backend 0.006
This is consistent with a high IPC. Overall, the benchmark is On_CPU 100%. So what might be slowing things down on my AMD processor? The benchmark runs one copy per virtual core (16 vs. 8), but it doesn’t look like scaling is a factor here.
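For reference, the IPC figure is just retired instructions divided by cycles. The perf stat output below reports it directly, but the same two hardware events can also be read programmatically; the following is a minimal sketch of my own (not part of the benchmark) using the raw perf_event_open syscall on Linux, with a made-up dummy loop standing in for the code under test.

/* Minimal sketch: count cycles and retired instructions around a region
 * of code and print the resulting IPC (my own illustration, not part of
 * the Phoronix test). */
#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <linux/perf_event.h>

static int open_counter(uint64_t config, int group_fd)
{
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_HARDWARE;
    attr.size = sizeof(attr);
    attr.config = config;
    attr.disabled = (group_fd == -1);   /* only the group leader starts disabled */
    attr.exclude_kernel = 1;
    attr.exclude_hv = 1;
    return syscall(__NR_perf_event_open, &attr, 0, -1, group_fd, 0);
}

int main(void)
{
    int cycles = open_counter(PERF_COUNT_HW_CPU_CYCLES, -1);
    int insns  = open_counter(PERF_COUNT_HW_INSTRUCTIONS, cycles);
    if (cycles < 0 || insns < 0) { perror("perf_event_open"); return 1; }

    ioctl(cycles, PERF_EVENT_IOC_RESET, PERF_IOC_FLAG_GROUP);
    ioctl(cycles, PERF_EVENT_IOC_ENABLE, PERF_IOC_FLAG_GROUP);

    volatile uint64_t sink = 0;                 /* dummy workload */
    for (uint64_t i = 0; i < 100000000ULL; i++)
        sink += i * i;

    ioctl(cycles, PERF_EVENT_IOC_DISABLE, PERF_IOC_FLAG_GROUP);

    uint64_t c = 0, n = 0;
    if (read(cycles, &c, sizeof(c)) != sizeof(c) ||
        read(insns, &n, sizeof(n)) != sizeof(n))
        return 1;
    printf("cycles=%llu instructions=%llu IPC=%.2f\n",
           (unsigned long long)c, (unsigned long long)n, (double)n / c);
    return 0;
}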
First some overall “perf stat” statistics:
On Intel:
Performance counter stats for 'env NUM_CPU_CORES=8 LOG_FILE=/tmp/log ./openssl':

      160192.830894      task-clock (msec)         #    7.995 CPUs utilized
              1,169      context-switches          #    0.007 K/sec
                  3      cpu-migrations            #    0.000 K/sec
              1,464      page-faults               #    0.009 K/sec
    559,227,741,354      cycles                    #    3.491 GHz
    926,242,195,324      instructions              #    1.66  insn per cycle
     23,580,127,013      branches                  #  147.198 M/sec
        138,052,656      branch-misses             #    0.59% of all branches

       20.037475592 seconds time elapsed
On AMD:
Performance counter stats for 'env NUM_CPU_CORES=16 LOG_FILE=/tmp/log ./openssl':

      320206.627994      task-clock (msec)         #   15.981 CPUs utilized
             31,645      context-switches          #    0.099 K/sec
                  7      cpu-migrations            #    0.000 K/sec
              2,579      page-faults               #    0.008 K/sec
  1,021,027,927,002      cycles                    #    3.189 GHz                      (83.33%)
        837,657,758      stalled-cycles-frontend   #    0.08% frontend cycles idle     (83.33%)
     38,648,026,069      stalled-cycles-backend    #    3.79% backend cycles idle      (83.33%)
  1,139,590,717,319      instructions              #    1.12  insn per cycle
                                                   #    0.03  stalled cycles per insn  (83.33%)
     44,400,734,333      branches                  #  138.663 M/sec                    (83.34%)
        353,891,753      branch-misses             #    0.80% of all branches          (83.34%)

       20.036814531 seconds time elapsed
The AMD run retires more total instructions and many more branches, even when I run only 8 copies on AMD. While the number of branch misses is not large in itself, it does suggest that somewhat different code is being compiled or linked in, so I'll look at these components first.
Here are the shared libraries:
root@popayan:/var/lib/phoronix-test-suite/installed-tests/pts/openssl-1.10.0# env LD_LIBRARY_PATH=openssl-1.1.0f/:$LD_LIBRARY_PATH ldd ./openssl-1.1.0f/apps/openssl
        linux-vdso.so.1 (0x00007fff4b571000)
        libssl.so.1.1 => openssl-1.1.0f/libssl.so.1.1 (0x00007fdce7f9c000)
        libcrypto.so.1.1 => openssl-1.1.0f/libcrypto.so.1.1 (0x00007fdce7b0f000)
        libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007fdce78f0000)
        libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007fdce74ff000)
        libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007fdce72fb000)
        /lib64/ld-linux-x86-64.so.2 (0x00007fdce84b4000)
libssl.so.1.1 and libcrypto.so.1.1 are included with the openssl build; the others come from the system. From "perf record", it looks like libcrypto is where almost all of the time goes:
    48.05%  openssl  libcrypto.so.1.1  [.] __bn_sqr8x_reduction
    26.27%  openssl  libcrypto.so.1.1  [.] bn_sqr8x_internal
    10.52%  openssl  libcrypto.so.1.1  [.] mul4x_internal
     8.95%  openssl  libcrypto.so.1.1  [.] bn_mul4x_mont
     1.91%  openssl  libcrypto.so.1.1  [.] bn_mul_add_words
     0.98%  openssl  libcrypto.so.1.1  [.] __bn_post4x_internal
     0.61%  openssl  libcrypto.so.1.1  [.] bn_sqr8x_mont
     0.47%  openssl  libcrypto.so.1.1  [.] BN_bn2bin
     0.30%  openssl  libcrypto.so.1.1  [.] bn_sub_words
     0.19%  openssl  libcrypto.so.1.1  [.] BN_bin2bn
     0.18%  openssl  libcrypto.so.1.1  [.] OPENSSL_cleanse
     0.17%  openssl  libcrypto.so.1.1  [.] bn_mul_words
     0.14%  openssl  libcrypto.so.1.1  [.] RSA_padding_check_PKCS1_type_1
However, comparing across to AMD, the overall distribution looks similar:
    47.42%  openssl  libcrypto.so.1.1  [.] __bn_sqrx8x_reduction
    24.73%  openssl  libcrypto.so.1.1  [.] bn_sqrx8x_internal
    11.29%  openssl  libcrypto.so.1.1  [.] mulx4x_internal
    10.02%  openssl  libcrypto.so.1.1  [.] bn_mulx4x_mont
     2.25%  openssl  libcrypto.so.1.1  [.] bn_mul_add_words
     1.00%  openssl  libcrypto.so.1.1  [.] __bn_postx4x_internal
     0.61%  openssl  libcrypto.so.1.1  [.] bn_sqr8x_mont
     0.49%  openssl  libcrypto.so.1.1  [.] BN_bn2bin
     0.22%  openssl  libcrypto.so.1.1  [.] OPENSSL_cleanse
     0.21%  openssl  libcrypto.so.1.1  [.] BN_bin2bn
     0.18%  openssl  libcrypto.so.1.1  [.] bn_mul_words
     0.17%  openssl  libcrypto.so.1.1  [.] RSA_padding_check_PKCS1_type_1
     0.14%  openssl  libcrypto.so.1.1  [.] bn_sub_words
     0.12%  openssl  libcrypto.so.1.1  [.] bn_powerx5
     0.12%  openssl  libcrypto.so.1.1  [.] BN_from_montgomery_word
Running nm(1) on the two libcrypto.so.1.1 files gives identical output, so I believe compiling the library produces the same result on AMD and Intel. Looking at the particular file, it appears the hot code is written in assembly, generated by a perl script. Here are the comments from that script:
######################################################################
# Montgomery reduction part, "word-by-word" algorithm.
#
# This new path is inspired by multiple submissions from Intel, by
# Shay Gueron, Vlad Krasnov, Erdinc Ozturk, James Guilford,
# Vinodh Gopal...
{
On AMD, the hottest part of the code is the section that starts with:
.align  32
.Lsqrx8x_reduce:
        mov     %r8, %rbx
        mulx    8*0($nptr),%rax,%r8     # n[0]
        adcx    %rbx,%rax               # discarded    // ~8%
        adox    %r9,%r8

        mulx    8*1($nptr),%rbx,%r9     # n[1]
        adcx    %rbx,%r8                               // ~6%
        adox    %r10,%r9

        mulx    8*2($nptr),%rbx,%r10
        adcx    %rbx,%r9                               // ~5%
        adox    %r11,%r10

        mulx    8*3($nptr),%rbx,%r11
        adcx    %rbx,%r10
        adox    %r12,%r11
The instructions I noted above account for almost 20% of the total time. There can be some skid, so the samples may not land on exactly these instructions, but the attribution seems about right. Curiously, the disassembly perf shows on Intel is slightly different, though similar code is hot; the samples are also "smeared" a bit more across adjacent instructions.
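To make the data flow in that loop concrete, here is a rough C rendering of the mulx/adcx/adox pattern using the BMI2/ADX compiler intrinsics. This is my own sketch, not OpenSSL's code; the function and names are invented, and whether a compiler emits the exact adcx/adox interleave is up to the compiler.

/* Rough sketch of the mulx/adcx/adox pattern (not OpenSSL's code).
 * mulx computes a 64x64 -> 128-bit product without touching the flags,
 * so two independent carry chains can be interleaved with the
 * multiplies: adcx propagates carries through CF, adox through OF.
 * Compile with gcc/clang -O2 -mbmi2 -madx. */
#include <immintrin.h>

/* r[0..4] = r[0..3] + a[0..3] * b, assuming r[4] starts at zero. */
static void mul_1x4_acc(unsigned long long r[5],
                        const unsigned long long a[4],
                        unsigned long long b)
{
    unsigned long long lo, hi;
    unsigned char cf = 0, of = 0;              /* the two carry chains */

    for (int i = 0; i < 4; i++) {
        lo = _mulx_u64(a[i], b, &hi);                      /* mulx              */
        cf = _addcarryx_u64(cf, r[i], lo, &r[i]);          /* adcx: low halves  */
        of = _addcarryx_u64(of, r[i + 1], hi, &r[i + 1]);  /* adox: high halves */
    }
    /* Fold the CF chain's final carry into the top word.  The OF chain
     * cannot carry out of r[4] because r[4] started at zero and the
     * high half of a 64x64 product is at most 2^64 - 2. */
    r[4] += cf;
}

The point of the two chains is that each adcx or adox depends only on the previous add in its own chain, so carry propagation is cheap and, in steady state, the loop should be limited by how quickly the multiplies can issue.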
The AMD Software Optimization Guide for Ryzen gives the ADCX instruction with reg/reg operands a latency of 1 and a throughput of 1 per cycle, using the ALU pipe, while the MULX instruction has a latency of 3 and a throughput of 0.5 per cycle and is restricted to ALU1. Hence, despite where perf puts the samples, I expect this is really a bottleneck on the MULX instructions.
Meanwhile, the Intel Software optimization guide for Haswell says there is 1 “slow int” execution unit used for MULX instructions. The latency is 4 and the throughput is 1.
Hence, for openssl, I will hypothesize that AMD is slower than Intel because:
- Openssl uses hand-coded assembly routines taking advantage of the MULX instruction
- Overall throughput for MULX is lower on AMD than Intel
Agner Fog’s instruction tables likewise show Ryzen with half the throughput (twice the reciprocal throughput) for MULX compared to Haswell.
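One way to check the hypothesis on a particular machine is a small throughput microbenchmark. The sketch below is my own and only approximate (the loop overhead and the difference between TSC and core frequency both blur the number), but it should be enough to show whether there is roughly a 2x per-MULX gap between the two processors, as the optimization guides suggest.

/* Crude MULX throughput estimate (my own sketch, not from the benchmark).
 * Each iteration issues four independent mulx operations whose results
 * feed separate accumulators, so throughput rather than latency should
 * limit the loop.  Compile with gcc/clang -O2 -mbmi2. */
#include <stdio.h>
#include <immintrin.h>
#include <x86intrin.h>

int main(void)
{
    const unsigned long long iters = 100000000ULL;
    unsigned long long a0 = 0, a1 = 0, a2 = 0, a3 = 0, lo, hi;

    unsigned long long start = __rdtsc();
    for (unsigned long long i = 1; i <= iters; i++) {
        lo = _mulx_u64(i, i + 1, &hi);  a0 ^= lo ^ hi;
        lo = _mulx_u64(i, i + 3, &hi);  a1 ^= lo ^ hi;
        lo = _mulx_u64(i, i + 5, &hi);  a2 ^= lo ^ hi;
        lo = _mulx_u64(i, i + 7, &hi);  a3 ^= lo ^ hi;
    }
    unsigned long long cycles = __rdtsc() - start;

    /* The checksum keeps the compiler from deleting the multiplies. */
    printf("checksum=%llx  ~%.2f TSC cycles per mulx\n",
           a0 ^ a1 ^ a2 ^ a3, (double)cycles / (4.0 * iters));
    return 0;
}

Comparing the cycles-per-mulx figure from the two machines would give a direct per-instruction number to set against the roughly 1.5x IPC difference seen in the benchmark.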