Use the host native instructions to implement the AES instructions
exposed by the emulated target. The mapping is not 1:1, so it requires a
bit of fiddling to get the right result.

This is still RFC material - the current approach feels too ad-hoc, but
given the non-1:1 correspondence, doing a proper abstraction is rather
difficult.

Changes since v1/RFC:
- add second patch to implement x86 AES instructions on ARM hosts - this
  helps illustrate what an abstraction should cover.
- use cpuinfo framework to detect host support for AES instructions.
- implement ARM aesimc using x86 aesimc directly

Patch #1 produces a 1.5-2x speedup in tests using the Linux kernel's
tcrypt benchmark (mode=500)

Patch #2 produces a 2-3x speedup. The discrepancy is most likely due to
the fact that ARM uses two instructions to implement a single AES round,
whereas x86 only uses one.

Note that using the ARM intrinsics is fiddly with Clang, as it does not
declare the prototypes unless some builtin CPP macro (__ARM_FEATURE_AES)
is defined, which will be set by the compiler based on the command line
arch/cpu options. However, setting this globally for a compilation unit
is dubious, given that we test cpuinfo for AES support, and only emit
the instructions conditionally. So I used inline asm() instead.

As for the design of an abstraction: I imagine we could introduce a
host/aes.h API that implements some building blocks that the TCG helper
implementation could use. Quoting from my reply to Richard:

Using the primitive operations defined in the AES paper, we basically
perform the following transformation for n rounds of AES (for n in
{10, 12, 14})

    for (n-1 rounds) {
        AddRoundKey
        ShiftRows
        SubBytes
        MixColumns
    }
    AddRoundKey
    ShiftRows
    SubBytes
    AddRoundKey

AddRoundKey is just XOR, but it is incorporated into the instructions
that combine a couple of these steps.
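The round loop above can be sketched in C. This is only a skeleton (my
sketch, not code from the patches): the byte-level transforms are
stubbed out as no-ops so it compiles and runs standalone, and only
AddRoundKey (plain XOR) is real. The point is the control flow - note
that n rounds consume n+1 round keys.

```c
#include <stdint.h>

typedef struct { uint8_t b[16]; } AESState;

static void add_round_key(AESState *st, const AESState *rk)
{
    /* AddRoundKey is a plain byte-wise XOR with the round key. */
    for (int i = 0; i < 16; i++) {
        st->b[i] ^= rk->b[i];
    }
}

/* Stubs: placeholders for the real byte-level transforms, which a real
 * implementation backs with host instructions or table lookups. */
static void shift_rows(AESState *st)  { (void)st; }
static void sub_bytes(AESState *st)   { (void)st; }
static void mix_columns(AESState *st) { (void)st; }

/* n rounds of AES encryption (n in {10, 12, 14}), ARM-style operation
 * order as described above; rk[] holds the n+1 expanded round keys. */
static void aes_encrypt(AESState *st, const AESState rk[], int n)
{
    for (int r = 0; r < n - 1; r++) {
        add_round_key(st, &rk[r]);
        shift_rows(st);
        sub_bytes(st);
        mix_columns(st);
    }
    add_round_key(st, &rk[n - 1]);
    shift_rows(st);
    sub_bytes(st);
    add_round_key(st, &rk[n]);
}
```

With the transforms stubbed to identity, the whole function degenerates
to XOR-ing all n+1 round keys into the state, which makes the skeleton
easy to sanity-check.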
So on x86, we have

    aesenc:     ShiftRows SubBytes MixColumns AddRoundKey
    aesenclast: ShiftRows SubBytes AddRoundKey

and on ARM we have

    aese:  AddRoundKey ShiftRows SubBytes
    aesmc: MixColumns

So a generic routine that does only ShiftRows+SubBytes could be backed
by x86's aesenclast and ARM's aese, using a NULL round key argument in
each case. Then, it would be up to the TCG helper code for either ARM or
x86 to incorporate those routines in the right way.

I suppose it really depends on whether there is a third host
architecture that could make use of this, and how its AES instructions
map onto the primitive AES ops above.

Cc: Peter Maydell <peter.maydell@linaro.org>
Cc: Alex Bennée <alex.bennee@linaro.org>
Cc: Richard Henderson <richard.henderson@linaro.org>
Cc: Philippe Mathieu-Daudé <f4bug@amsat.org>

Ard Biesheuvel (2):
  target/arm: use x86 intrinsics to implement AES instructions
  target/i386: Implement AES instructions using AArch64 counterparts

 host/include/aarch64/host/cpuinfo.h |  1 +
 host/include/i386/host/cpuinfo.h    |  1 +
 target/arm/tcg/crypto_helper.c      | 37 ++++++++++-
 target/i386/ops_sse.h               | 69 ++++++++++++++++++++
 util/cpuinfo-aarch64.c              |  1 +
 util/cpuinfo-i386.c                 |  1 +
 6 files changed, 107 insertions(+), 3 deletions(-)

-- 
2.39.2
On 5/31/23 04:22, Ard Biesheuvel wrote:
> Use the host native instructions to implement the AES instructions
> exposed by the emulated target. The mapping is not 1:1, so it requires a
> bit of fiddling to get the right result.
>
> This is still RFC material - the current approach feels too ad-hoc, but
> given the non-1:1 correspondence, doing a proper abstraction is rather
> difficult.
>
> Changes since v1/RFC:
> - add second patch to implement x86 AES instructions on ARM hosts - this
>   helps illustrate what an abstraction should cover.
> - use cpuinfo framework to detect host support for AES instructions.
> - implement ARM aesimc using x86 aesimc directly
>
> Patch #1 produces a 1.5-2x speedup in tests using the Linux kernel's
> tcrypt benchmark (mode=500)
>
> Patch #2 produces a 2-3x speedup. The discrepancy is most likely due to
> the fact that ARM uses two instructions to implement a single AES round,
> whereas x86 only uses one.

Thanks. I spent some time yesterday looking at this, with an encrypted
disk test case and could only measure 0.6% and 0.5% for total overhead
of decrypt and encrypt respectively.

> As for the design of an abstraction: I imagine we could introduce a
> host/aes.h API that implements some building blocks that the TCG helper
> implementation could use.

Indeed. I was considering interfaces like

    /* Perform SubBytes + ShiftRows on state. */
    Int128 aesenc_SB_SR(Int128 state);

    /* Perform MixColumns on state. */
    Int128 aesenc_MC(Int128 state);

    /* Perform SubBytes + ShiftRows + MixColumns on state. */
    Int128 aesenc_SB_SR_MC(Int128 state);

    /* Perform SubBytes + ShiftRows + MixColumns + AddRoundKey. */
    Int128 aesenc_SB_SR_MC_AK(Int128 state, Int128 roundkey);

and so forth for aesdec as well. All but aesenc_MC should be
implementable on x86 and Power7, and all of them on aarch64.
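A sketch of how target helpers might sit on top of such an interface
(hypothetical helper names; the primitive bodies here are identity/XOR
stubs purely so the composition is visible and runnable - a real build
would back them with host instructions). The point is the key-ordering
difference: ARM's aese folds AddRoundKey in *before* SubBytes/ShiftRows,
x86's aesenc applies it *after* MixColumns.

```c
#include <stdint.h>

/* Placeholder 128-bit state; QEMU's real Int128 type differs. */
typedef struct { uint8_t b[16]; } I128;

static I128 i128_xor(I128 a, I128 b)
{
    for (int i = 0; i < 16; i++) {
        a.b[i] ^= b.b[i];
    }
    return a;
}

/* Stub: would be aesenclast with a zero key (x86) or aese with a zero
 * key (aarch64). Identity here, for illustration only. */
static I128 aesenc_SB_SR(I128 state)
{
    return state;
}

/* Stub: would be a single x86 aesenc; modeled as SB_SR plus key so the
 * sketch runs (MixColumns omitted from the stub). */
static I128 aesenc_SB_SR_MC_AK(I128 state, I128 roundkey)
{
    return i128_xor(aesenc_SB_SR(state), roundkey);
}

/* ARM aese: round key is XOR-ed in before the byte transforms. */
static I128 helper_arm_aese(I128 state, I128 key)
{
    return aesenc_SB_SR(i128_xor(state, key));
}

/* x86 aesenc: round key is XOR-ed in after MixColumns. */
static I128 helper_x86_aesenc(I128 state, I128 key)
{
    return aesenc_SB_SR_MC_AK(state, key);
}
```

This is only one possible shape for the glue; the thread leaves open how
the final abstraction should look.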
> I suppose it really depends on whether there is a third host
> architecture that could make use of this, and how its AES instructions
> map onto the primitive AES ops above.

There is Power6 (v{,n}cipher{,last}) and RISC-V Zkn
(aes64{es,esm,ds,dsm,im})

What I got hung up on yesterday was understanding the different endian
requirements of x86 vs Power.

ppc64:

    asm("lxvd2x 32,0,%1;"
        "lxvd2x 33,0,%2;"
        "vcipher 0,0,1;"
        "stxvd2x 32,0,%0"
        : : "r"(o), "r"(i), "r"(k) : "memory", "v0", "v1", "v2");

ppc64le:

    unsigned char le[16] = {8,9,10,11,12,13,14,15,0,1,2,3,4,5,6,7};
    asm("lxvd2x 32,0,%1;"
        "lxvd2x 33,0,%2;"
        "lxvd2x 34,0,%3;"
        "vperm 0,0,0,2;"
        "vperm 1,1,1,2;"
        "vcipher 0,0,1;"
        "vperm 0,0,0,2;"
        "stxvd2x 32,0,%0"
        : : "r"(o), "r"(i), "r"(k), "r"(le) : "memory", "v0", "v1", "v2");

There are also differences in their AES_Te* based C routines as well,
which made me wonder if we are handling host endianness differences
correctly in emulation right now. I think I should most definitely add
some generic-ish tests for this...

r~
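A plain-C model of what the vperm fix-up in the ppc64le sequence above
does (my sketch, using flat big-endian byte numbering and a single
source vector - the real vperm indexes the concatenation of two source
registers): the permutation vector {8..15,0..7} swaps the two 8-byte
doublewords of a 16-byte value, compensating for the doubleword order
lxvd2x produces on a little-endian host.

```c
#include <stdint.h>

/* Select out[i] = in[sel[i]] for a 16-byte vector, the way the vperm
 * in the asm above uses the le[] table. */
static void vperm_bytes(uint8_t out[16], const uint8_t in[16],
                        const uint8_t sel[16])
{
    for (int i = 0; i < 16; i++) {
        out[i] = in[sel[i]];
    }
}
```

Since the swap is an involution, applying the same vperm again after
vcipher (as the asm does on the result) restores the original byte
order.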
On Wed, 31 May 2023 at 18:33, Richard Henderson
<richard.henderson@linaro.org> wrote:
>
> On 5/31/23 04:22, Ard Biesheuvel wrote:
> > Use the host native instructions to implement the AES instructions
> > exposed by the emulated target. The mapping is not 1:1, so it requires a
> > bit of fiddling to get the right result.
> >
> > This is still RFC material - the current approach feels too ad-hoc, but
> > given the non-1:1 correspondence, doing a proper abstraction is rather
> > difficult.
> >
> > Changes since v1/RFC:
> > - add second patch to implement x86 AES instructions on ARM hosts - this
> >   helps illustrate what an abstraction should cover.
> > - use cpuinfo framework to detect host support for AES instructions.
> > - implement ARM aesimc using x86 aesimc directly
> >
> > Patch #1 produces a 1.5-2x speedup in tests using the Linux kernel's
> > tcrypt benchmark (mode=500)
> >
> > Patch #2 produces a 2-3x speedup. The discrepancy is most likely due to
> > the fact that ARM uses two instructions to implement a single AES round,
> > whereas x86 only uses one.
>
> Thanks. I spent some time yesterday looking at this, with an encrypted disk test case and
> could only measure 0.6% and 0.5% for total overhead of decrypt and encrypt respectively.
>

I don't understand what 'overhead' means in this context. Are you
saying you saw barely any improvement?

> > As for the design of an abstraction: I imagine we could introduce a
> > host/aes.h API that implements some building blocks that the TCG helper
> > implementation could use.
>
> Indeed. I was considering interfaces like
>
>     /* Perform SubBytes + ShiftRows on state. */
>     Int128 aesenc_SB_SR(Int128 state);
>
>     /* Perform MixColumns on state. */
>     Int128 aesenc_MC(Int128 state);
>
>     /* Perform SubBytes + ShiftRows + MixColumns on state. */
>     Int128 aesenc_SB_SR_MC(Int128 state);
>
>     /* Perform SubBytes + ShiftRows + MixColumns + AddRoundKey.
>     */
>     Int128 aesenc_SB_SR_MC_AK(Int128 state, Int128 roundkey);
>
> and so forth for aesdec as well. All but aesenc_MC should be implementable on x86 and
> Power7, and all of them on aarch64.
>

aesenc_MC() can be implemented on x86 the way I did in patch #1, using
aesdeclast+aesenc

> > I suppose it really depends on whether there is a third host
> > architecture that could make use of this, and how its AES instructions
> > map onto the primitive AES ops above.
>
> There is Power6 (v{,n}cipher{,last}) and RISC-V Zkn (aes64{es,esm,ds,dsm,im})
>
> What I got hung up on yesterday was understanding the different endian
> requirements of x86 vs Power.
>
> ppc64:
>
>     asm("lxvd2x 32,0,%1;"
>         "lxvd2x 33,0,%2;"
>         "vcipher 0,0,1;"
>         "stxvd2x 32,0,%0"
>         : : "r"(o), "r"(i), "r"(k) : "memory", "v0", "v1", "v2");
>
> ppc64le:
>
>     unsigned char le[16] = {8,9,10,11,12,13,14,15,0,1,2,3,4,5,6,7};
>     asm("lxvd2x 32,0,%1;"
>         "lxvd2x 33,0,%2;"
>         "lxvd2x 34,0,%3;"
>         "vperm 0,0,0,2;"
>         "vperm 1,1,1,2;"
>         "vcipher 0,0,1;"
>         "vperm 0,0,0,2;"
>         "stxvd2x 32,0,%0"
>         : : "r"(o), "r"(i), "r"(k), "r"(le) : "memory", "v0", "v1", "v2");
>
> There are also differences in their AES_Te* based C routines as well, which made me wonder
> if we are handling host endianness differences correctly in emulation right now. I think
> I should most definitely add some generic-ish tests for this...
>

The above kind of sums it up, no? Or isn't this working code?
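The identity behind the aesdeclast+aesenc trick can be spelled out
(notation mine, not taken from the patches; zero round keys throughout,
so AddRoundKey vanishes): aesdeclast(x, 0) = InvSubBytes(InvShiftRows(x))
and aesenc(y, 0) = MixColumns(SubBytes(ShiftRows(y))). Since SubBytes
acts byte-wise and ShiftRows only permutes bytes, the two commute, and
the composition collapses:

```latex
\begin{align*}
\mathrm{aesenc}(\mathrm{aesdeclast}(x, 0),\, 0)
  &= \mathrm{MC}\bigl(\mathrm{SB}(\mathrm{SR}(\mathrm{SB}^{-1}(\mathrm{SR}^{-1}(x))))\bigr) \\
  &= \mathrm{MC}\bigl(\mathrm{SB}(\mathrm{SB}^{-1}(\mathrm{SR}(\mathrm{SR}^{-1}(x))))\bigr)
     && \text{SB and SR commute} \\
  &= \mathrm{MC}(x)
\end{align*}
```

which is exactly the standalone MixColumns that aesenc_MC() needs.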
On 5/31/23 09:47, Ard Biesheuvel wrote:
> On Wed, 31 May 2023 at 18:33, Richard Henderson
> <richard.henderson@linaro.org> wrote:
>>
>> On 5/31/23 04:22, Ard Biesheuvel wrote:
>>> Use the host native instructions to implement the AES instructions
>>> exposed by the emulated target. The mapping is not 1:1, so it requires a
>>> bit of fiddling to get the right result.
>>>
>>> This is still RFC material - the current approach feels too ad-hoc, but
>>> given the non-1:1 correspondence, doing a proper abstraction is rather
>>> difficult.
>>>
>>> Changes since v1/RFC:
>>> - add second patch to implement x86 AES instructions on ARM hosts - this
>>>   helps illustrate what an abstraction should cover.
>>> - use cpuinfo framework to detect host support for AES instructions.
>>> - implement ARM aesimc using x86 aesimc directly
>>>
>>> Patch #1 produces a 1.5-2x speedup in tests using the Linux kernel's
>>> tcrypt benchmark (mode=500)
>>>
>>> Patch #2 produces a 2-3x speedup. The discrepancy is most likely due to
>>> the fact that ARM uses two instructions to implement a single AES round,
>>> whereas x86 only uses one.
>>
>> Thanks. I spent some time yesterday looking at this, with an encrypted disk test case and
>> could only measure 0.6% and 0.5% for total overhead of decrypt and encrypt respectively.
>>
>
> I don't understand what 'overhead' means in this context. Are you
> saying you saw barely any improvement?

I saw, without changes, just over 1% of total system emulation time was
devoted to aes, which gives an upper limit to the runtime improvement
possible there. But I'll have a look at tcrypt.

> aesenc_MC() can be implemented on x86 the way I did in patch #1, using
> aesdeclast+aesenc

Oh, nice. I have not read the actual patches yet.
>> ppc64:
>>
>>     asm("lxvd2x 32,0,%1;"
>>         "lxvd2x 33,0,%2;"
>>         "vcipher 0,0,1;"
>>         "stxvd2x 32,0,%0"
>>         : : "r"(o), "r"(i), "r"(k) : "memory", "v0", "v1", "v2");
>>
>> ppc64le:
>>
>>     unsigned char le[16] = {8,9,10,11,12,13,14,15,0,1,2,3,4,5,6,7};
>>     asm("lxvd2x 32,0,%1;"
>>         "lxvd2x 33,0,%2;"
>>         "lxvd2x 34,0,%3;"
>>         "vperm 0,0,0,2;"
>>         "vperm 1,1,1,2;"
>>         "vcipher 0,0,1;"
>>         "vperm 0,0,0,2;"
>>         "stxvd2x 32,0,%0"
>>         : : "r"(o), "r"(i), "r"(k), "r"(le) : "memory", "v0", "v1", "v2");
>>
>> There are also differences in their AES_Te* based C routines as well, which made me wonder
>> if we are handling host endianness differences correctly in emulation right now. I think
>> I should most definitely add some generic-ish tests for this...
>>
>
> The above kind of sums it up, no? Or isn't this working code?

It sums up the problem. It works to produce the same output as the x86
instructions, with input bytes in the same order. It shows that we have
to be extra careful emulating vcipher etc, and should have unit tests.

r~
On 5/31/23 10:08, Richard Henderson wrote:
> On 5/31/23 09:47, Ard Biesheuvel wrote:
>> On Wed, 31 May 2023 at 18:33, Richard Henderson
>>> Thanks. I spent some time yesterday looking at this, with an encrypted disk test case and
>>> could only measure 0.6% and 0.5% for total overhead of decrypt and encrypt respectively.
>>>
>>
>> I don't understand what 'overhead' means in this context. Are you
>> saying you saw barely any improvement?
>
> I saw, without changes, just over 1% of total system emulation time was devoted to aes,
> which gives an upper limit to the runtime improvement possible there. But I'll have a
> look at tcrypt.

Using

  # insmod /lib/modules/5.10.0-21-arm64/kernel/crypto/tcrypt.ko mode=600 sec=10

I see

  25.50%  qemu-system-aar  qemu-system-aarch64  [.] helper_crypto_aese
  25.36%  qemu-system-aar  qemu-system-aarch64  [.] helper_crypto_aesmc
   6.66%  qemu-system-aar  qemu-system-aarch64  [.] rebuild_hflags_a64
   3.25%  qemu-system-aar  qemu-system-aarch64  [.] tb_lookup
   2.52%  qemu-system-aar  qemu-system-aarch64  [.] fp_exception_el
   2.35%  qemu-system-aar  qemu-system-aarch64  [.] helper_lookup_tb_ptr

Obviously a crypto-heavy test, but 51% of runtime is certainly worth
more work.

r~