mbedTLS AES code, intrinsics vs. assembly, alignment #5593
Comments
Upstream is also considering dropping the assembly in favor of intrinsics: Mbed-TLS/mbedtls#8231 |
Experiment 1: I tried adding This is without usage of VEX yet, even though most of the rest of Experiment 2: On top of the above, added also |
Looks the same for other subdirectories we have with third-party crypto code that we build into |
This is for x86 AES-NI, intrinsics version. Closes openwall#5593
For legacy makes, perhaps we could just add |
I've just checked - Intel Sandy Bridge - the first microarch to introduce AVX - already had both of these as well. So I was tempted to say yes. But then it gets interesting - it was also a time of paranoia about possible backdoors in AES-NI, which is probably why Intel added a way to disable AES-NI from the firmware (including in a way that this can't be re-enabled later on a live system), and many systems shipped with it default-disabled (and default-locked)! I still have a server like this where I cannot easily re-enable AES-NI remotely. I wonder what this means not only for our |
Provided the mbedTLS code checks cpuid sufficiently, it's more a matter of the compiler supporting these flags. A compiler supporting AVX should also support -maes. |
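As an illustration of the kind of runtime check referred to above, here is a minimal sketch of querying CPUID for AES-NI. It is not the Mbed TLS code. It assumes a GCC- or Clang-style `<cpuid.h>` providing `__get_cpuid()`; the function name `cpu_has_aesni` is hypothetical. The AES-NI feature flag is bit 25 of ECX in CPUID leaf 1.

```c
#include <stdbool.h>
#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>   /* GCC/Clang helper header; an assumption of this sketch */
#endif

/* Hypothetical helper: returns true if the CPU reports AES-NI at run time. */
static bool cpu_has_aesni(void)
{
#if defined(__x86_64__) || defined(__i386__)
    unsigned int eax, ebx, ecx, edx;

    /* __get_cpuid() returns 0 if the requested leaf is not supported */
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return false;
    return ((ecx >> 25) & 1u) != 0;   /* CPUID.01H:ECX bit 25 = AES-NI */
#else
    return false;                     /* not an x86 build */
#endif
}
```

A build that enables `-maes` unconditionally would still need a check like this before calling the AES-NI code path, so that it can fall back to the generic C implementation when the instructions are unavailable.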
It would already be supported by the -native targets. See openwall#5593
After merging these changes and doing a clean build, I checked whether the same speedup for keepass appeared - yes, it did. (Also the expected slight slowdown for o5logon.) However, here's what I did not expect: On top of all the merged changes, I went into the It could be that these performance differences we're seeing are not so much asm vs. intrinsics, but e.g. whether something just happens to be well-aligned in a given build/run or not. |
Tried this:
+++ b/src/keepass_fmt_plug.c
@@ -176,7 +176,7 @@ static void set_salt(void *salt)
static int transform_key(char *masterkey, unsigned char *final_key)
{
SHA256_CTX ctx;
- unsigned char hash[32];
+ unsigned char hash[32] JTR_ALIGN(16);
int ret = 0;
// First, hash the masterkey
The speeds look unchanged - still +25% with intrinsics, +25%+14% with re-enabled asm. So I guess the asm is actually faster, and I guess that previously the alignment just happened to be wrong. |
I've just tried deliberately misaligning Anyway, we should probably ensure |
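For reference, a minimal sketch of how a 16-byte alignment experiment like the one above can be checked at run time. It uses C11 `_Alignas` as a stand-in for the project's `JTR_ALIGN` macro; the names `is_aligned_16` and `struct hash_buf` are hypothetical and not taken from the tree.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical helper: nonzero if p sits on a 16-byte boundary. */
static int is_aligned_16(const void *p)
{
    return ((uintptr_t)p & 0xfu) == 0;
}

/* Stand-in for "unsigned char hash[32] JTR_ALIGN(16);" using C11 syntax. */
struct hash_buf {
    _Alignas(16) unsigned char hash[32];
};
```

Printing or asserting `is_aligned_16()` on the buffers actually passed to the AES code is one way to tell whether a speed difference between two builds coincides with a change in alignment.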
I didn't run valgrind for a while, perhaps time for that. OTOH I consistently get 1-7% better speed with intrinsics on my Intel MacBook. BTW could that be a data point? What CPU(s) are you testing? Also, did you try different compilers? |
I'd use
The above is with Tiger Lake, gcc 11, and in a VM. I didn't try anything else yet. I wasn't planning on benchmarking this much. Of course, need a different setup for proper benchmarking, but I was surprised by the relative speed differences here. |
Here are my test results on an i9-13900KF. Speed is for KeePass Original speed with no AES-NI:
MbedTLS using asm, 10.3x from the above, and baseline for all below
MbedTLS as a lib (trivial patch never committed), +22%
Current bleeding-jumbo (b1d063a), +5% for a total of +29%
MbedTLS intrinsics + @solardiz unroll + alignment -2% (more like ±0 on repeated runs)
MbedTLS intrinsics + @solardiz unroll ±0%
It's puzzling you get better speeds with asm, while I get worse. But here's another random data point: Using OpenSSL EVP (trivial patch never committed) This is +29% over current bleeding-jumbo 😢 and +67% over baseline
The EVP test was with as little code as I could get away with (mostly init/update/final). Back when we dropped all/most of EVP from Jumbo, EVP needed lots of silly calls to stuff for enabling AES-NI at all, and also was not thread safe unless you added callbacks. I'm not sure if callbacks are still needed but AES-NI apparently just works now. I haven't looked at the OpenSSL code yet to see why it's so much faster. |
The thing is, I was also getting worse speeds with asm (and still am, using the old binary I saved) before the switch to intrinsics. I get better speeds with asm when I remove just the |
Ah, my "MbedTLS using asm" baseline is actually f88688e meaning it did get |
That's not what I meant. Sure please do create a better baseline, but also please literally try the revert-to-asm approach where I observed the puzzling speedup - that is, starting from the latest commit, edit the generated mbedtls/Makefile to remove the |
But that must be exactly the same as "the commit before that" (which translates to 40752fe). It is the git history's version of mbedtls with no Verifying... yep, while aes.a does differ in md5sum (should be some timestamp, size is same) |
While I still don't know why for me re-enabling of asm is a lot faster than it was before being disabled, I now managed to achieve even better speeds with intrinsics by disabling
This says that Mbed TLS takes care of the alignment when creating the round keys (I guess placing them accordingly within the larger context?) and I hope we're not copying them to differently-aligned context structs (if we are, we'll need to fix that). BTW, I found |
Mbed TLS takes care of the alignment when creating the round keys (placing them accordingly within the larger context struct) and we hope we're not copying them to differently-aligned context structs (if we are, we'll need to fix that). See openwall#5593
FWIW, completely disabling |
Pushed the above change to this repo now (passed testing in my fork). I also have the below, not pushed since no measurable speedup (but it probably does speed up the key setup a tiny bit by skipping a call to a function provided by
+++ b/src/mbedtls/aes.c
@@ -535,7 +535,12 @@ void mbedtls_aes_xts_free(mbedtls_aes_xts_context *ctx)
MBEDTLS_MAYBE_UNUSED static unsigned mbedtls_aes_rk_offset(uint32_t *buf)
{
#if defined(MAY_NEED_TO_ALIGN)
- int align_16_bytes = 0;
+ static int align_16_bytes = 0;
+
+ if (align_16_bytes > 0)
+ goto align_16_bytes;
+ if (align_16_bytes < 0)
+ return 0;
#if defined(MBEDTLS_VIA_PADLOCK_HAVE_CODE)
if (aes_padlock_ace == -1) {
@@ -553,6 +558,7 @@ MBEDTLS_MAYBE_UNUSED static unsigned mbedtls_aes_rk_offset(uint32_t *buf)
#endif
if (align_16_bytes) {
+align_16_bytes:
/* These implementations needs 16-byte alignment
* for the round key array. */
unsigned delta = ((uintptr_t) buf & 0x0000000fU) / 4;
@@ -561,6 +567,8 @@ MBEDTLS_MAYBE_UNUSED static unsigned mbedtls_aes_rk_offset(uint32_t *buf)
} else {
return 4 - delta; // 16 bytes = 4 uint32_t
}
+ } else {
+ align_16_bytes = -1;
}
#else /* MAY_NEED_TO_ALIGN */
(void) buf; |
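The offset arithmetic that the patch above jumps into can be sketched in isolation as follows. This only mirrors the `delta` computation visible in the diff context (offsets are counted in `uint32_t` words); the function name `rk_offset_words` is hypothetical, and the static caching added by the patch is left out.

```c
#include <stdint.h>

/* Hypothetical stand-alone version of the alignment arithmetic:
 * how many uint32_t words to skip so that the round keys start on a
 * 16-byte boundary. Assumes buf is at least 4-byte aligned. */
static unsigned rk_offset_words(const uint32_t *buf)
{
    /* distance past the previous 16-byte boundary, in 4-byte words (0..3) */
    unsigned delta = (unsigned)(((uintptr_t)buf & 0x0000000fU) / 4);

    if (delta == 0)
        return 0;           /* already aligned */
    return 4 - delta;       /* 16 bytes = 4 uint32_t */
}
```

For a buffer that starts one word past a 16-byte boundary, this returns 3, so the round keys begin at the next boundary.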
Good stuff. I remember seeing that realigning comment and made a mental note to examine it later, then immediately forgot about it 🙄
Yes, I'm currently revising the OpenCL AES code (using |
@magnumripper Maybe you can benchmark this? My "no measurable speedup" may be simply because my system isn't idle. |
Will do |
Tried benchmarking the |
And here's maybe why not: for me, |
Indeed, with |
Loop unrolling also helps. Now doing it in the same way as |
Pushed the unrolling. Another way to do it could be:
+++ b/src/mbedtls/aesni.c
@@ -96,16 +96,6 @@ int mbedtls_aesni_crypt_ecb(mbedtls_aes_context *ctx,
#if !defined(MBEDTLS_BLOCK_CIPHER_NO_DECRYPT)
if (mode == MBEDTLS_AES_DECRYPT) {
- if (nr == 10)
- goto rounds_10_dec;
- if (nr == 12)
- goto rounds_12_dec;
- state = _mm_aesdec_si128(state, *++rk);
- state = _mm_aesdec_si128(state, *++rk);
-rounds_12_dec:
- state = _mm_aesdec_si128(state, *++rk);
- state = _mm_aesdec_si128(state, *++rk);
-rounds_10_dec:
state = _mm_aesdec_si128(state, *++rk);
state = _mm_aesdec_si128(state, *++rk);
state = _mm_aesdec_si128(state, *++rk);
@@ -115,22 +105,20 @@ rounds_10_dec:
state = _mm_aesdec_si128(state, *++rk);
state = _mm_aesdec_si128(state, *++rk);
state = _mm_aesdec_si128(state, *++rk);
+ if (__builtin_expect(nr > 10, 1)) {
+ state = _mm_aesdec_si128(state, *++rk);
+ state = _mm_aesdec_si128(state, *++rk);
+ if (__builtin_expect(nr > 12, 1)) {
+ state = _mm_aesdec_si128(state, *++rk);
+ state = _mm_aesdec_si128(state, *++rk);
+ }
+ }
state = _mm_aesdeclast_si128(state, *++rk);
} else
#else
(void) mode;
#endif
{
- if (nr == 10)
- goto rounds_10_enc;
- if (nr == 12)
- goto rounds_12_enc;
- state = _mm_aesenc_si128(state, *++rk);
- state = _mm_aesenc_si128(state, *++rk);
-rounds_12_enc:
- state = _mm_aesenc_si128(state, *++rk);
- state = _mm_aesenc_si128(state, *++rk);
-rounds_10_enc:
state = _mm_aesenc_si128(state, *++rk);
state = _mm_aesenc_si128(state, *++rk);
state = _mm_aesenc_si128(state, *++rk);
@@ -140,6 +128,14 @@ rounds_10_enc:
state = _mm_aesenc_si128(state, *++rk);
state = _mm_aesenc_si128(state, *++rk);
state = _mm_aesenc_si128(state, *++rk);
+ if (__builtin_expect(nr > 10, 1)) {
+ state = _mm_aesenc_si128(state, *++rk);
+ state = _mm_aesenc_si128(state, *++rk);
+ if (__builtin_expect(nr > 12, 1)) {
+ state = _mm_aesenc_si128(state, *++rk);
+ state = _mm_aesenc_si128(state, *++rk);
+ }
+ }
state = _mm_aesenclast_si128(state, *++rk);
}
For me, this other way results in pre-loading of round keys into registers, which is probably unneeded, and it hurts a little bit. |
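To make the control flow of the two unrolling variants easier to compare, here is a portable sketch. It is not AES: `ROUND` is a hypothetical, order-sensitive stand-in for `_mm_aesenc_si128()` so that the code compiles without `-maes`, and all function names are made up for illustration. Both variants consume the round keys in the same order and run `nr - 1` middle rounds for `nr` of 10, 12, or 14. The `__builtin_expect` hints from the diff are omitted.

```c
#include <stdint.h>

/* Stand-in for _mm_aesenc_si128(): NOT AES, only an order-sensitive mix. */
#define ROUND(s, k) ((s) = ((s) ^ (k)) * 0x9E3779B97F4A7C15ULL + 1)

/* Reference: plain loop over the nr - 1 middle rounds. */
static uint64_t rounds_loop(uint64_t state, const uint64_t *rk, int nr)
{
    state ^= rk[0];                        /* initial key addition */
    for (int i = 1; i < nr; i++)
        ROUND(state, rk[i]);
    return state ^ rk[nr];                 /* stand-in for the last round */
}

/* Variant A: jump into a fully unrolled sequence (goto-based). */
static uint64_t rounds_goto(uint64_t state, const uint64_t *rk, int nr)
{
    state ^= *rk;
    if (nr == 10)
        goto rounds_10;
    if (nr == 12)
        goto rounds_12;
    ROUND(state, *++rk);
    ROUND(state, *++rk);
rounds_12:
    ROUND(state, *++rk);
    ROUND(state, *++rk);
rounds_10:
    ROUND(state, *++rk); ROUND(state, *++rk); ROUND(state, *++rk);
    ROUND(state, *++rk); ROUND(state, *++rk); ROUND(state, *++rk);
    ROUND(state, *++rk); ROUND(state, *++rk); ROUND(state, *++rk);
    return state ^ *++rk;
}

/* Variant B: nine unconditional rounds, then nested conditionals. */
static uint64_t rounds_nested_if(uint64_t state, const uint64_t *rk, int nr)
{
    state ^= *rk;
    ROUND(state, *++rk); ROUND(state, *++rk); ROUND(state, *++rk);
    ROUND(state, *++rk); ROUND(state, *++rk); ROUND(state, *++rk);
    ROUND(state, *++rk); ROUND(state, *++rk); ROUND(state, *++rk);
    if (nr > 10) {
        ROUND(state, *++rk);
        ROUND(state, *++rk);
        if (nr > 12) {
            ROUND(state, *++rk);
            ROUND(state, *++rk);
        }
    }
    return state ^ *++rk;
}
```

The variants are functionally equivalent; any difference in speed comes from how the compiler schedules the round-key loads, which is what the observation above is about.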
I observed that on older gcc (such as We might want to come up with and switch to using an API that accepts a guaranteed-aligned buffer, but then we'd deviate from upstream Mbed-TLS much more. |
Could we add an intrinsic for |
Yes, I think that's |
I've just implemented this. Testing. |
I pushed the load/store intrinsics usage and it seems to work well so far. |
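The names of the intrinsics discussed above did not survive in this copy of the thread. Purely as an illustration of what load/store intrinsics for possibly unaligned buffers look like, here is a sketch using the SSE2 `_mm_loadu_si128()` and `_mm_storeu_si128()`; this choice is an assumption, not a statement of what was committed. The function name `xor_block16` is hypothetical, and a plain C fallback is included for non-SSE2 builds.

```c
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>   /* SSE2 intrinsics */
#endif

/* Hypothetical helper: out = in XOR key for one 16-byte block.
 * None of the pointers needs to be 16-byte aligned. */
static void xor_block16(uint8_t *out, const uint8_t *in, const uint8_t *key)
{
#if defined(__SSE2__)
    /* Unaligned loads and store: valid for any address */
    __m128i s = _mm_loadu_si128((const __m128i *)in);
    __m128i k = _mm_loadu_si128((const __m128i *)key);
    _mm_storeu_si128((__m128i *)out, _mm_xor_si128(s, k));
#else
    for (int i = 0; i < 16; i++)
        out[i] = in[i] ^ key[i];
#endif
}
```

The aligned counterparts, `_mm_load_si128()` and `_mm_store_si128()`, require 16-byte-aligned addresses, which is why the alignment of the caller's buffers matters for this part of the discussion.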
Some other observations are:
|
For future reference: Many relevant comments also in #5591 |
I believe it builds asm for now. Brief tests with intrinsics show a slight performance drop. From PR comments:
@solardiz said:
@magnumripper said: