AES-NI, Gotta Go Fast!

Datetime: 2016-08-22 21:47:01          Topic: Encryption and Decryption, OpenSSL

The use of encryption has changed drastically over the years. A vast number of encryption ciphers have been deemed weak (e.g., RC4), while others are considered industry standard (e.g., AES-128). Encryption is a predominant method of securing data at rest and in transit, as well as communications between clients and servers. Traditionally, encryption ciphers are selected based on the type of data secured and the performance requirements of the system or application. The security of an algorithm is not always at the forefront of the selection process. Can a secure algorithm, such as AES, perform as fast as, or faster than, an algorithm known for performance, such as RC4?

Over the years, encryption has become more widespread, and Hypertext Transfer Protocol Secure (HTTPS) is a prime example: a vast number of web servers use HTTPS to secure communications with clients. On these sites, all communication between the web server and the browser is secured using HTTPS, not only authentication traffic as in the past. Since 2014, major companies such as Facebook, WordPress, and Tumblr have all secured communication with HTTPS by default. Further, Google and Yahoo have deployed HTTPS for search queries and results. Another example of more commonplace encryption is the full disk encryption that modern operating systems offer for data security. Apple’s Mac OS X (since 10.7) [2] and Microsoft’s Windows 10 both use XTS-AES-128 encryption. Windows 7 and Windows 8 both use AES with an optional diffuser, configurable for 128-bit or 256-bit keys.

Performance and Encryption Algorithm Selection

Traditionally, performance was a big concern for developers when selecting an encryption cipher; RC4 performs at a high rate and was the basis of many protocols (e.g., RDP, WEP, and WPA). RC4 outperformed AES in CBC, ECB, and CFB modes at key sizes of 128, 192, and 256 bits.

While RC4 is extremely fast, the algorithm is no longer considered secure. In a 2013 security advisory, Microsoft recommended discontinuing use of RC4 due to significant issues within the Key Scheduling Algorithm (KSA), whose output is processed by the Pseudo-Random Generation Algorithm (PRGA) to produce the keystream. The first 257 bytes of output were shown to be significantly biased. In addition, RC4 is vulnerable to plaintext recovery attacks.
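To make the two stages named above concrete, here is a minimal Python sketch of the textbook RC4 construction: the KSA permutes a 256-entry state under control of the key, and the PRGA then walks that state to emit keystream bytes. The function names are illustrative only, this is not OpenSSL's implementation, and RC4 should not be used to protect real data.

```python
def rc4_ksa(key: bytes) -> list:
    """Key Scheduling Algorithm: permute the values 0..255 using the key."""
    if not 1 <= len(key) <= 256:
        raise ValueError("RC4 keys are 1 to 256 bytes long")
    state = list(range(256))
    j = 0
    for i in range(256):
        j = (j + state[i] + key[i % len(key)]) % 256
        state[i], state[j] = state[j], state[i]
    return state


def rc4_prga(state: list, length: int) -> bytes:
    """Pseudo-Random Generation Algorithm: emit `length` keystream bytes."""
    state = state.copy()  # leave the caller's KSA output untouched
    i = j = 0
    out = bytearray()
    for _ in range(length):
        i = (i + 1) % 256
        j = (j + state[i]) % 256
        state[i], state[j] = state[j], state[i]
        out.append(state[(state[i] + state[j]) % 256])
    return bytes(out)


def rc4_crypt(key: bytes, data: bytes) -> bytes:
    """XOR data with the keystream; the same call encrypts and decrypts."""
    keystream = rc4_prga(rc4_ksa(key), len(data))
    return bytes(d ^ k for d, k in zip(data, keystream))
```

The inner loop is a handful of byte swaps and additions per output byte, which is why RC4 is fast in software. The biased bytes mentioned above are the first ones returned by rc4_prga.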

AES New Instructions

To improve encryption/decryption performance, Intel and AMD added special x86 instructions to implement AES within hardware. Table 1 lists the new instructions and their descriptions.

These instructions have been included in processors dating back to 2010, with Westmere as Intel’s first microarchitecture to feature them. Of the Ivy Bridge processors, all i5, i7, Xeon, and i3-2115C support AES-NI instructions. Celeron, Pentium, and i3-4000M processors from the Haswell family do not support the new instructions, while all others do. A full list of processors can be found on Intel’s website. AMD has taken a full-coverage approach and has implemented these instructions in all of its processors since the Bulldozer-based processors of 2011.

Users of Linux and Mac OS X can quickly use shell commands to determine if their processors support AES-NI. Listed below are the commands and corresponding output.

For Linux: grep aes /proc/cpuinfo
root@kali:~# grep aes /proc/cpuinfo 
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts mmx fxsr sse sse2 ss syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts nopl xtopology tsc_reliable nonstop_tsc aperfmperf eagerfpu pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm ida arat epb pln pts dtherm fsgsbase tsc_adjust bmi1 avx2 smep bmi2 invpcid xsaveopt

Figure 1. Sample output for a processor that supports AES (Linux).

For Mac OS X: sysctl machdep.cpu.features
MacBook-Pro:~$ sysctl machdep.cpu.features 
machdep.cpu.features: FPU VME DE PSE TSC MSR PAE MCE CX8 APIC SEP MTRR PGE MCA CMOV PAT PSE36 CLFSH DS ACPI MMX FXSR SSE SSE2 SS HTT TM PBE SSE3 PCLMULQDQ DTES64 MON DSCPL VMX SMX EST TM2 SSSE3 FMA CX16 TPR PDCM SSE4.1 SSE4.2 x2APIC MOVBE POPCNT AES PCID XSAVE OSXSAVE SEGLIM64 TSCTMR AVX1.0 RDRAND F16C

Figure 2. Sample output for a processor that supports AES (Mac).
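The same check can be scripted. The sketch below is a hypothetical helper, not part of any tool mentioned here, that takes the text printed by either command and looks for aes as a whole token rather than as a substring.

```python
def supports_aes(flag_text: str) -> bool:
    """Return True if 'aes' is listed as a CPU feature in the given text.

    Accepts the contents of /proc/cpuinfo (Linux) or the output of
    `sysctl machdep.cpu.features` (Mac OS X); both put the feature
    list after the first colon on the line.
    """
    for line in flag_text.splitlines():
        _, _, features = line.partition(":")
        # Compare whole, case-folded tokens so that a flag which merely
        # contains the letters "aes" is not counted as a match.
        if "aes" in features.lower().split():
            return True
    return False
```

On Linux, for example, the helper could be fed the file directly with supports_aes(open("/proc/cpuinfo").read()).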

Performance

The test results shown below were obtained on a Mid 2015 MacBook Pro with an Intel Core i7-4870HQ @ 2.5GHz and 16GB of RAM. Performance results depend on the computer system and on which OpenSSL library is used for testing. For performance testing, OpenSSL provides a command, speed, which measures the speed of various encryption algorithms. To ensure AES-NI is utilized during the test, testers must use the -evp flag. Functions within the EVP library automatically use AES-NI if it is available.

Below are results from several tests. The tests were run under varying conditions, such as AES-NI enabled and AES-NI disabled. As shown in Figure 3, OpenSSL version 1.0.2f processed AES-128-CBC at 786 MB/s at a block size of 8192 bytes when AES-NI was enabled.

MacBook-Pro:bin $ ./openssl speed -elapsed -evp aes-128-cbc
You have chosen to measure elapsed time instead of user CPU time.
Doing aes-128-cbc for 3s on 16 size blocks: 127081126 aes-128-cbc's in 3.00s
Doing aes-128-cbc for 3s on 64 size blocks: 35936898 aes-128-cbc's in 3.00s
Doing aes-128-cbc for 3s on 256 size blocks: 9220403 aes-128-cbc's in 3.00s
Doing aes-128-cbc for 3s on 1024 size blocks: 2303418 aes-128-cbc's in 3.00s
Doing aes-128-cbc for 3s on 8192 size blocks: 288105 aes-128-cbc's in 3.00s
OpenSSL 1.0.2f 28 Jan 2016
built on: reproducible build, date unspecified
options:bn(64,64) rc4(ptr,int) des(idx,cisc,16,int) aes(partial) idea(int) blowfish(idx)
compiler: clang -I. -I.. -I../include -fPIC -fno-common -DOPENSSL_PIC -DZLIB_SHARED -DZLIB -DOPENSSL_THREADS -D_REENTRANT -DDSO_DLFCN -DHAVE_DLFCN_H -arch x86_64 -O3 -DL_ENDIAN -Wall -DOPENSSL_IA32_SSE2 -DOPENSSL_BN_ASM_MONT -DOPENSSL_BN_ASM_MONT5 -DOPENSSL_BN_ASM_GF2m -DSHA1_ASM -DSHA256_ASM -DSHA512_ASM -DMD5_ASM -DAES_ASM -DVPAES_ASM -DBSAES_ASM -DWHIRLPOOL_ASM -DGHASH_ASM -DECP_NISTZ256_ASM
The 'numbers' are in 1000s of bytes per second processed.
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
aes-128-cbc     677766.01k   766653.82k   786807.72k   786233.34k   786718.72k

Figure 3. Speed test of AES-128-CBC using OpenSSL 1.0.2f
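The summary row in Figure 3 follows directly from the "Doing ..." lines: operations completed, times block size, divided by elapsed seconds, reported in thousands of bytes per second. A small Python sketch of that arithmetic (the function name is ours, and the elapsed time is taken as the rounded value the tool prints):

```python
def throughput_k(operations: int, block_size: int, seconds: float) -> float:
    """Throughput in 1000s of bytes per second, as in the summary row."""
    return operations * block_size / seconds / 1000.0


# (block size in bytes, operations completed) copied from Figure 3
figure_3_runs = [
    (16, 127081126),
    (64, 35936898),
    (256, 9220403),
    (1024, 2303418),
    (8192, 288105),
]

for block_size, operations in figure_3_runs:
    rate = throughput_k(operations, block_size, 3.00)
    print(f"{block_size:>5} bytes: {rate:.2f}k")
```

For the 8192-byte run this gives 288105 x 8192 / 3.00 = 786,718,720 bytes per second, which is the 786718.72k in the last column and the 786 MB/s quoted in the text.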

To keep the comparison uniform, the same functions within the EVP library are used with AES-NI disabled. Because these functions use AES-NI automatically when it is available, the environment variable OPENSSL_ia32cap="~0x200000200000000" is set to make OpenSSL ignore the processor's AES-NI capability. With AES-NI disabled, AES-128-CBC clocked in at a drastically lower rate of 369 MB/s at a block size of 8192 bytes.

MacBook-Pro:bin $ OPENSSL_ia32cap="~0x200000200000000" ./openssl speed -elapsed -evp aes-128-cbc
You have chosen to measure elapsed time instead of user CPU time.
Doing aes-128-cbc for 3s on 16 size blocks: 58707174 aes-128-cbc's in 3.01s
Doing aes-128-cbc for 3s on 64 size blocks: 14667713 aes-128-cbc's in 3.00s
Doing aes-128-cbc for 3s on 256 size blocks: 3920894 aes-128-cbc's in 3.00s
Doing aes-128-cbc for 3s on 1024 size blocks: 1097018 aes-128-cbc's in 3.00s
Doing aes-128-cbc for 3s on 8192 size blocks: 135340 aes-128-cbc's in 3.00s
OpenSSL 1.0.2f 28 Jan 2016
built on: reproducible build, date unspecified
options:bn(64,64) rc4(ptr,int) des(idx,cisc,16,int) aes(partial) idea(int) blowfish(idx)
compiler: clang -I. -I.. -I../include -fPIC -fno-common -DOPENSSL_PIC -DZLIB_SHARED -DZLIB -DOPENSSL_THREADS -D_REENTRANT -DDSO_DLFCN -DHAVE_DLFCN_H -arch x86_64 -O3 -DL_ENDIAN -Wall -DOPENSSL_IA32_SSE2 -DOPENSSL_BN_ASM_MONT -DOPENSSL_BN_ASM_MONT5 -DOPENSSL_BN_ASM_GF2m -DSHA1_ASM -DSHA256_ASM -DSHA512_ASM -DMD5_ASM -DAES_ASM -DVPAES_ASM -DBSAES_ASM -DWHIRLPOOL_ASM -DGHASH_ASM -DECP_NISTZ256_ASM
The 'numbers' are in 1000s of bytes per second processed.
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
aes-128-cbc     312064.71k   312911.21k   334582.95k   374448.81k   369568.43k

Figure 4. Speed test of AES-128-CBC using OpenSSL 1.0.2f with AES-NI disabled
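The value passed to OPENSSL_ia32cap is a bit mask, and the leading ~ tells OpenSSL to clear the listed bits from the capabilities it detected. The Python sketch below decodes the mask used in Figure 4. It assumes the bit layout given in OpenSSL's documentation for this variable, where bit 57 denotes AES-NI and bit 33 denotes PCLMULQDQ; the table and function names are ours and cover only those two bits.

```python
# Assumed bit positions in OpenSSL's x86 capability vector
# (taken from the OPENSSL_ia32cap documentation; only two bits listed).
CAPABILITY_BITS = {
    33: "PCLMULQDQ",  # carry-less multiplication
    57: "AES-NI",     # AES instruction set extension
}


def masked_features(setting: str) -> list:
    """Name the known capability bits that a '~0x...' setting clears."""
    if not setting.startswith("~"):
        raise ValueError("expected a '~'-prefixed mask such as ~0x200000200000000")
    mask = int(setting[1:], 16)  # base 16 accepts the leading "0x"
    return [
        name
        for bit, name in sorted(CAPABILITY_BITS.items())
        if (mask >> bit) & 1
    ]


print(masked_features("~0x200000200000000"))
```

Under that assumption the mask clears two bits, so the run in Figure 4 disables PCLMULQDQ as well as AES-NI. CBC mode does not use carry-less multiplication, so the extra bit should not affect this particular measurement.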

RC4 is well known for its speed but, unlike AES, not for its security. To determine how AES-128 in CBC mode holds up when using the new AES instructions, we ran an RC4 speed test and found that RC4 is slower than AES-128-CBC with the new instructions. RC4 clocked in at 621 MB/s at a block size of 8192 bytes, as seen in Figure 5, while AES-128-CBC clocked in at 786 MB/s. However, the use of AES-NI is critical: without the new instructions, RC4 is about 1.7 times as fast as AES at the 8192-byte block size (621 MB/s versus 369 MB/s) and roughly twice as fast at 64-byte and 256-byte blocks.

MacBook-Pro:bin $ ./openssl speed -elapsed -evp rc4
You have chosen to measure elapsed time instead of user CPU time.
Doing rc4 for 3s on 16 size blocks: 98100856 rc4's in 3.00s
Doing rc4 for 3s on 64 size blocks: 31242136 rc4's in 3.00s
Doing rc4 for 3s on 256 size blocks: 8148845 rc4's in 3.00s
Doing rc4 for 3s on 1024 size blocks: 1972899 rc4's in 3.01s
Doing rc4 for 3s on 8192 size blocks: 227644 rc4's in 3.00s
OpenSSL 1.0.2f 28 Jan 2016
built on: reproducible build, date unspecified
options:bn(64,64) rc4(ptr,int) des(idx,cisc,16,int) aes(partial) idea(int) blowfish(idx)
compiler: clang -I. -I.. -I../include -fPIC -fno-common -DOPENSSL_PIC -DZLIB_SHARED -DZLIB -DOPENSSL_THREADS -D_REENTRANT -DDSO_DLFCN -DHAVE_DLFCN_H -arch x86_64 -O3 -DL_ENDIAN -Wall -DOPENSSL_IA32_SSE2 -DOPENSSL_BN_ASM_MONT -DOPENSSL_BN_ASM_MONT5 -DOPENSSL_BN_ASM_GF2m -DSHA1_ASM -DSHA256_ASM -DSHA512_ASM -DMD5_ASM -DAES_ASM -DVPAES_ASM -DBSAES_ASM -DWHIRLPOOL_ASM -DGHASH_ASM -DECP_NISTZ256_ASM
The 'numbers' are in 1000s of bytes per second processed.
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
rc4             523204.57k   666498.90k   695368.11k   671178.93k   621619.88k

Figure 5. Speed test of RC4 using OpenSSL 1.0.2f
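Putting the 8192-byte columns of Figures 3, 4, and 5 side by side makes the comparison explicit. The following Python sketch only divides the figures already reported above; the dictionary labels are ours.

```python
# 8192-byte throughput, in 1000s of bytes per second, from Figures 3-5
RESULTS_K = {
    "aes-128-cbc, AES-NI enabled": 786718.72,
    "aes-128-cbc, AES-NI disabled": 369568.43,
    "rc4": 621619.88,
}


def speedup(faster: str, slower: str) -> float:
    """How many times faster one measured configuration is than another."""
    return RESULTS_K[faster] / RESULTS_K[slower]


pairs = [
    ("aes-128-cbc, AES-NI enabled", "aes-128-cbc, AES-NI disabled"),
    ("aes-128-cbc, AES-NI enabled", "rc4"),
    ("rc4", "aes-128-cbc, AES-NI disabled"),
]
for faster, slower in pairs:
    print(f"{faster} vs {slower}: {speedup(faster, slower):.2f}x")
```

On this machine AES-NI slightly more than doubles AES-128-CBC throughput, which is enough to move AES from well behind RC4 to ahead of it.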

Drew Branch is an Analyst at Independent Security Evaluators.

Twitter: @ISESecurity




