openssl – AMD vs Intel
The openssl Phoronix benchmark is interesting because the IPC on the Intel Haswell system (1.66) is considerably higher than the IPC on the AMD Ryzen (1.12). In this post, I'll look for the causes.
On Intel, the topdown metrics show a benchmark with a very high retirement rate and few stalls:
Retire 0.923 Frontend 0.065 Spec 0.006 Backend 0.006
This is consistent with a high IPC. Overall, the benchmark is On_CPU 100%. So what might be slowing things down on my AMD processor? The benchmark runs one copy per virtual core (16 vs. 8), but it doesn’t look like scaling is a factor here.
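For reference, the IPC figure is just retired instructions divided by cycles. The perf stat output below reports it directly, but the same two hardware events can also be read programmatically; the following is a minimal sketch of my own (not part of the benchmark) using the raw perf_event_open syscall on Linux, with a made-up dummy loop standing in for the code under test.

/* Minimal sketch: count cycles and retired instructions around a region
 * of code and print the resulting IPC (my own illustration, not part of
 * the Phoronix test). */
#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <linux/perf_event.h>

static int open_counter(uint64_t config, int group_fd)
{
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_HARDWARE;
    attr.size = sizeof(attr);
    attr.config = config;
    attr.disabled = (group_fd == -1);   /* only the group leader starts disabled */
    attr.exclude_kernel = 1;
    attr.exclude_hv = 1;
    return syscall(__NR_perf_event_open, &attr, 0, -1, group_fd, 0);
}

int main(void)
{
    int cycles = open_counter(PERF_COUNT_HW_CPU_CYCLES, -1);
    int insns  = open_counter(PERF_COUNT_HW_INSTRUCTIONS, cycles);
    if (cycles < 0 || insns < 0) { perror("perf_event_open"); return 1; }

    ioctl(cycles, PERF_EVENT_IOC_RESET, PERF_IOC_FLAG_GROUP);
    ioctl(cycles, PERF_EVENT_IOC_ENABLE, PERF_IOC_FLAG_GROUP);

    volatile uint64_t sink = 0;                 /* dummy workload */
    for (uint64_t i = 0; i < 100000000ULL; i++)
        sink += i * i;

    ioctl(cycles, PERF_EVENT_IOC_DISABLE, PERF_IOC_FLAG_GROUP);

    uint64_t c = 0, n = 0;
    if (read(cycles, &c, sizeof(c)) != sizeof(c) ||
        read(insns, &n, sizeof(n)) != sizeof(n))
        return 1;
    printf("cycles=%llu instructions=%llu IPC=%.2f\n",
           (unsigned long long)c, (unsigned long long)n, (double)n / c);
    return 0;
}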
First some overall “perf stat” statistics:
On Intel:
Performance counter stats for 'env NUM_CPU_CORES=8 LOG_FILE=/tmp/log ./openssl':

      160192.830894      task-clock (msec)         #    7.995 CPUs utilized
              1,169      context-switches          #    0.007 K/sec
                  3      cpu-migrations            #    0.000 K/sec
              1,464      page-faults               #    0.009 K/sec
    559,227,741,354      cycles                    #    3.491 GHz
    926,242,195,324      instructions              #    1.66  insn per cycle
     23,580,127,013      branches                  #  147.198 M/sec
        138,052,656      branch-misses             #    0.59% of all branches

       20.037475592 seconds time elapsed
On AMD:
Performance counter stats for 'env NUM_CPU_CORES=16 LOG_FILE=/tmp/log ./openssl':

      320206.627994      task-clock (msec)         #   15.981 CPUs utilized
             31,645      context-switches          #    0.099 K/sec
                  7      cpu-migrations            #    0.000 K/sec
              2,579      page-faults               #    0.008 K/sec
  1,021,027,927,002      cycles                    #    3.189 GHz                      (83.33%)
        837,657,758      stalled-cycles-frontend   #    0.08% frontend cycles idle     (83.33%)
     38,648,026,069      stalled-cycles-backend    #    3.79% backend cycles idle      (83.33%)
  1,139,590,717,319      instructions              #    1.12  insn per cycle
                                                   #    0.03  stalled cycles per insn  (83.33%)
     44,400,734,333      branches                  #  138.663 M/sec                    (83.34%)
        353,891,753      branch-misses             #    0.80% of all branches          (83.34%)

       20.036814531 seconds time elapsed
The AMD run retires more total instructions and many more branches, even when I run only 8 copies on AMD. While the number of branch misses is not large in itself, it does suggest that somewhat different code is being compiled or linked in, so I'll look at these components first.
Here are the shared libraries:
root@popayan:/var/lib/phoronix-test-suite/installed-tests/pts/openssl-1.10.0# env LD_LIBRARY_PATH=openssl-1.1.0f/:$LD_LIBRARY_PATH ldd ./openssl-1.1.0f/apps/openssl
        linux-vdso.so.1 (0x00007fff4b571000)
        libssl.so.1.1 => openssl-1.1.0f/libssl.so.1.1 (0x00007fdce7f9c000)
        libcrypto.so.1.1 => openssl-1.1.0f/libcrypto.so.1.1 (0x00007fdce7b0f000)
        libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007fdce78f0000)
        libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007fdce74ff000)
        libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007fdce72fb000)
        /lib64/ld-linux-x86-64.so.2 (0x00007fdce84b4000)
libssl.so.1.1 and libcrypto.so.1.1 are included with the openssl build; the others come from the system. From "perf record", it looks like libcrypto is where almost all of the time goes:
    48.05%  openssl  libcrypto.so.1.1  [.] __bn_sqr8x_reduction
    26.27%  openssl  libcrypto.so.1.1  [.] bn_sqr8x_internal
    10.52%  openssl  libcrypto.so.1.1  [.] mul4x_internal
     8.95%  openssl  libcrypto.so.1.1  [.] bn_mul4x_mont
     1.91%  openssl  libcrypto.so.1.1  [.] bn_mul_add_words
     0.98%  openssl  libcrypto.so.1.1  [.] __bn_post4x_internal
     0.61%  openssl  libcrypto.so.1.1  [.] bn_sqr8x_mont
     0.47%  openssl  libcrypto.so.1.1  [.] BN_bn2bin
     0.30%  openssl  libcrypto.so.1.1  [.] bn_sub_words
     0.19%  openssl  libcrypto.so.1.1  [.] BN_bin2bn
     0.18%  openssl  libcrypto.so.1.1  [.] OPENSSL_cleanse
     0.17%  openssl  libcrypto.so.1.1  [.] bn_mul_words
     0.14%  openssl  libcrypto.so.1.1  [.] RSA_padding_check_PKCS1_type_1
However, comparing across to AMD, the overall distribution looks similar:
    47.42%  openssl  libcrypto.so.1.1  [.] __bn_sqrx8x_reduction
    24.73%  openssl  libcrypto.so.1.1  [.] bn_sqrx8x_internal
    11.29%  openssl  libcrypto.so.1.1  [.] mulx4x_internal
    10.02%  openssl  libcrypto.so.1.1  [.] bn_mulx4x_mont
     2.25%  openssl  libcrypto.so.1.1  [.] bn_mul_add_words
     1.00%  openssl  libcrypto.so.1.1  [.] __bn_postx4x_internal
     0.61%  openssl  libcrypto.so.1.1  [.] bn_sqr8x_mont
     0.49%  openssl  libcrypto.so.1.1  [.] BN_bn2bin
     0.22%  openssl  libcrypto.so.1.1  [.] OPENSSL_cleanse
     0.21%  openssl  libcrypto.so.1.1  [.] BN_bin2bn
     0.18%  openssl  libcrypto.so.1.1  [.] bn_mul_words
     0.17%  openssl  libcrypto.so.1.1  [.] RSA_padding_check_PKCS1_type_1
     0.14%  openssl  libcrypto.so.1.1  [.] bn_sub_words
     0.12%  openssl  libcrypto.so.1.1  [.] bn_powerx5
     0.12%  openssl  libcrypto.so.1.1  [.] BN_from_montgomery_word
Running nm(1) on the two libcrypto.so.1.1 files gives identical output, so I believe compiling the library produces the same result on AMD and Intel. Looking at the particular file, it appears the hot code is written in assembly, generated by a perl script. Here are the comments from that script:
######################################################################
# Montgomery reduction part, "word-by-word" algorithm.
#
# This new path is inspired by multiple submissions from Intel, by
# Shay Gueron, Vlad Krasnov, Erdinc Ozturk, James Guilford,
# Vinodh Gopal...
{
On AMD, the hottest part of the code is the section that starts with:
.align  32
.Lsqrx8x_reduce:
        mov     %r8, %rbx
        mulx    8*0($nptr),%rax,%r8     # n[0]
        adcx    %rbx,%rax               # discarded    // ~8%
        adox    %r9,%r8

        mulx    8*1($nptr),%rbx,%r9     # n[1]
        adcx    %rbx,%r8                               // ~6%
        adox    %r10,%r9

        mulx    8*2($nptr),%rbx,%r10
        adcx    %rbx,%r9                               // ~5%
        adox    %r11,%r10

        mulx    8*3($nptr),%rbx,%r11
        adcx    %rbx,%r10
        adox    %r12,%r11
The instructions I noted above account for almost 20% of the total time. There can be some skid, so the samples may not land on exactly these instructions, but the attribution seems about right. Curiously, the disassembly perf shows on Intel is slightly different, though similar code is hot; the samples are also "smeared" a bit more across adjacent instructions.
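To make the data flow in that loop concrete, here is a rough C rendering of the mulx/adcx/adox pattern using the BMI2/ADX compiler intrinsics. This is my own sketch, not OpenSSL's code; the function and names are invented, and whether a compiler emits the exact adcx/adox interleave is up to the compiler.

/* Rough sketch of the mulx/adcx/adox pattern (not OpenSSL's code).
 * mulx computes a 64x64 -> 128-bit product without touching the flags,
 * so two independent carry chains can be interleaved with the
 * multiplies: adcx propagates carries through CF, adox through OF.
 * Compile with gcc/clang -O2 -mbmi2 -madx. */
#include <immintrin.h>

/* r[0..4] = r[0..3] + a[0..3] * b, assuming r[4] starts at zero. */
static void mul_1x4_acc(unsigned long long r[5],
                        const unsigned long long a[4],
                        unsigned long long b)
{
    unsigned long long lo, hi;
    unsigned char cf = 0, of = 0;              /* the two carry chains */

    for (int i = 0; i < 4; i++) {
        lo = _mulx_u64(a[i], b, &hi);                      /* mulx              */
        cf = _addcarryx_u64(cf, r[i], lo, &r[i]);          /* adcx: low halves  */
        of = _addcarryx_u64(of, r[i + 1], hi, &r[i + 1]);  /* adox: high halves */
    }
    /* Fold the CF chain's final carry into the top word.  The OF chain
     * cannot carry out of r[4] because r[4] started at zero and the
     * high half of a 64x64 product is at most 2^64 - 2. */
    r[4] += cf;
}

The point of the two chains is that each adcx or adox depends only on the previous add in its own chain, so carry propagation is cheap and, in steady state, the loop should be limited by how quickly the multiplies can issue.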
The AMD Software Optimization Guide for Ryzen gives the ADCX instruction with reg/reg operands a latency of 1 and a throughput of 1 per cycle, using the ALU pipe, while the MULX instruction has a latency of 3 and a throughput of 0.5 per cycle and is restricted to ALU1. Hence, despite where perf puts the samples, I expect this is really a bottleneck on the MULX instructions.
Meanwhile, the Intel Software optimization guide for Haswell says there is 1 “slow int” execution unit used for MULX instructions. The latency is 4 and the throughput is 1.
Hence, for openssl, I will hypothesize that AMD is slower than Intel because:
- Openssl uses hand-coded assembly routines taking advantage of the MULX instruction
- Overall throughput for MULX is lower on AMD than Intel
Agner Fog’s instruction tables likewise show Ryzen with half the throughput (twice the reciprocal throughput) for MULX compared to Haswell.
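One way to check the hypothesis on a particular machine is a small throughput microbenchmark. The sketch below is my own and only approximate (the loop overhead and the difference between TSC and core frequency both blur the number), but it should be enough to show whether there is roughly a 2x per-MULX gap between the two processors, as the optimization guides suggest.

/* Crude MULX throughput estimate (my own sketch, not from the benchmark).
 * Each iteration issues four independent mulx operations whose results
 * feed separate accumulators, so throughput rather than latency should
 * limit the loop.  Compile with gcc/clang -O2 -mbmi2. */
#include <stdio.h>
#include <immintrin.h>
#include <x86intrin.h>

int main(void)
{
    const unsigned long long iters = 100000000ULL;
    unsigned long long a0 = 0, a1 = 0, a2 = 0, a3 = 0, lo, hi;

    unsigned long long start = __rdtsc();
    for (unsigned long long i = 1; i <= iters; i++) {
        lo = _mulx_u64(i, i + 1, &hi);  a0 ^= lo ^ hi;
        lo = _mulx_u64(i, i + 3, &hi);  a1 ^= lo ^ hi;
        lo = _mulx_u64(i, i + 5, &hi);  a2 ^= lo ^ hi;
        lo = _mulx_u64(i, i + 7, &hi);  a3 ^= lo ^ hi;
    }
    unsigned long long cycles = __rdtsc() - start;

    /* The checksum keeps the compiler from deleting the multiplies. */
    printf("checksum=%llx  ~%.2f TSC cycles per mulx\n",
           a0 ^ a1 ^ a2 ^ a3, (double)cycles / (4.0 * iters));
    return 0;
}

Comparing the cycles-per-mulx figure from the two machines would give a direct per-instruction number to set against the roughly 1.5x IPC difference seen in the benchmark.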