[PATCH v2 0/2] Implement AES on ARM using x86 instructions and vv

Ard Biesheuvel posted 2 patches 10 months ago
git fetch https://github.com/patchew-project/qemu tags/patchew/20230531112239.3164777-1-ardb@kernel.org
Maintainers: Richard Henderson <richard.henderson@linaro.org>, Paolo Bonzini <pbonzini@redhat.com>, Peter Maydell <peter.maydell@linaro.org>
[PATCH v2 0/2] Implement AES on ARM using x86 instructions and vv
Posted by Ard Biesheuvel 10 months ago
Use the host native instructions to implement the AES instructions
exposed by the emulated target. The mapping is not 1:1, so it requires a
bit of fiddling to get the right result.

This is still RFC material - the current approach feels too ad-hoc, but
given the non-1:1 correspondence, doing a proper abstraction is rather
difficult.

Changes since v1/RFC:
- add second patch to implement x86 AES instructions on ARM hosts - this
  helps illustrate what an abstraction should cover.
- use cpuinfo framework to detect host support for AES instructions.
- implement ARM aesimc using x86 aesimc directly

Patch #1 produces a 1.5-2x speedup in tests using the Linux kernel's
tcrypt benchmark (mode=500)

Patch #2 produces a 2-3x speedup. The discrepancy is most likely due to
the fact that ARM uses two instructions to implement a single AES round,
whereas x86 only uses one.

Note that using the ARM intrinsics is fiddly with Clang, as it does not
declare the prototypes unless some builtin CPP macro (__ARM_FEATURE_AES)
is defined, which will be set by the compiler based on the command line
arch/cpu options. However, setting this globally for a compilation unit
is dubious, given that we test cpuinfo for AES support, and only emit
the instructions conditionally. So I used inline asm() instead.

As for the design of an abstraction: I imagine we could introduce a
host/aes.h API that implements some building blocks that the TCG helper
implementation could use.

Quoting from my reply to Richard:

Using the primitive operations defined in the AES paper, we basically
perform the following transformation for n rounds of AES (for n in {10,
12, 14})

for (n-1 rounds) {
  AddRoundKey
  ShiftRows
  SubBytes
  MixColumns
}
AddRoundKey
ShiftRows
SubBytes
AddRoundKey

AddRoundKey is just XOR, but it is incorporated into the instructions
that combine a couple of these steps.

So on x86, we have

aesenc:
  ShiftRows
  SubBytes
  MixColumns
  AddRoundKey

aesenclast:
  ShiftRows
  SubBytes
  AddRoundKey

and on ARM we have

aese:
  AddRoundKey
  ShiftRows
  SubBytes

aesmc:
  MixColumns

So a generic routine that does only ShiftRows+SubBytes could be backed by
x86's aesenclast and ARM's aese, using a NULL round key argument in each
case. Then, it would be up to the TCG helper code for either ARM or x86
to incorporate those routines in the right way.
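
To make the groupings above concrete, here is a minimal software model in
C. It is only a sketch and not the code from the patches: the type
aes_state, the model_* functions and check_mappings() are hypothetical
names, the state is assumed to be in column-major byte order, and the
"NULL" round key is modelled as an all-zero key. It checks that ARM's
aese can be expressed through x86's aesenclast, and x86's aesenc through
ARM's aese + aesmc.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* 16-byte AES state, column-major: byte i is row i % 4 of column i / 4. */
typedef struct { uint8_t b[16]; } aes_state;

static uint8_t sbox[256];

static uint8_t rotl8(uint8_t x, int n)
{
    return (uint8_t)((x << n) | (x >> (8 - n)));
}

/* Build the S-box from its definition: GF(2^8) inverse, then affine map. */
static void init_sbox(void)
{
    uint8_t p = 1, q = 1;
    do {
        p = (uint8_t)(p ^ (p << 1) ^ ((p & 0x80) ? 0x1b : 0)); /* p *= 3 */
        q ^= (uint8_t)(q << 1);                                /* q /= 3 */
        q ^= (uint8_t)(q << 2);
        q ^= (uint8_t)(q << 4);
        q ^= (q & 0x80) ? 0x09 : 0;
        sbox[p] = (uint8_t)(q ^ rotl8(q, 1) ^ rotl8(q, 2) ^
                            rotl8(q, 3) ^ rotl8(q, 4) ^ 0x63);
    } while (p != 1);
    sbox[0] = 0x63; /* 0 has no inverse, so it is set separately */
}

static aes_state sub_bytes(aes_state s)
{
    for (int i = 0; i < 16; i++)
        s.b[i] = sbox[s.b[i]];
    return s;
}

static aes_state shift_rows(aes_state s)
{
    aes_state o;
    for (int c = 0; c < 4; c++)
        for (int r = 0; r < 4; r++)
            o.b[r + 4 * c] = s.b[r + 4 * ((c + r) % 4)]; /* row r rotates left by r */
    return o;
}

static uint8_t xtime(uint8_t x)
{
    return (uint8_t)((x << 1) ^ ((x & 0x80) ? 0x1b : 0));
}

static aes_state mix_columns(aes_state s)
{
    aes_state o;
    for (int c = 0; c < 4; c++) {
        const uint8_t *a = &s.b[4 * c];
        for (int r = 0; r < 4; r++) {
            uint8_t a0 = a[r], a1 = a[(r + 1) % 4];
            uint8_t a2 = a[(r + 2) % 4], a3 = a[(r + 3) % 4];
            /* 2*a0 + 3*a1 + a2 + a3 in GF(2^8) */
            o.b[4 * c + r] = (uint8_t)(xtime(a0) ^ xtime(a1) ^ a1 ^ a2 ^ a3);
        }
    }
    return o;
}

static aes_state add_round_key(aes_state s, aes_state k)
{
    for (int i = 0; i < 16; i++)
        s.b[i] ^= k.b[i];
    return s;
}

/* Software models of the instruction groupings listed above. */
static aes_state model_x86_aesenc(aes_state s, aes_state k)
{
    return add_round_key(mix_columns(sub_bytes(shift_rows(s))), k);
}

static aes_state model_x86_aesenclast(aes_state s, aes_state k)
{
    return add_round_key(sub_bytes(shift_rows(s)), k);
}

static aes_state model_arm_aese(aes_state s, aes_state k)
{
    return sub_bytes(shift_rows(add_round_key(s, k)));
}

static aes_state model_arm_aesmc(aes_state s)
{
    return mix_columns(s);
}

/* Checks both guest-on-host compositions on one sample input. */
int check_mappings(void)
{
    aes_state s, k, zero;

    init_sbox();
    memset(&zero, 0, sizeof(zero));
    for (int i = 0; i < 16; i++) {
        s.b[i] = (uint8_t)(i * 17);
        k.b[i] = (uint8_t)(0xa5 ^ i);
    }

    /* ARM guest on x86 host: aese(s, k) == aesenclast(s ^ k, 0). */
    aes_state arm = model_arm_aese(s, k);
    aes_state via_x86 = model_x86_aesenclast(add_round_key(s, k), zero);
    assert(memcmp(&arm, &via_x86, sizeof(arm)) == 0);

    /* x86 guest on ARM host: aesenc(s, k) == aesmc(aese(s, 0)) ^ k. */
    aes_state x86 = model_x86_aesenc(s, k);
    aes_state via_arm = add_round_key(model_arm_aesmc(model_arm_aese(s, zero)), k);
    assert(memcmp(&x86, &via_arm, sizeof(x86)) == 0);
    return 1;
}
```

Calling check_mappings() from a test harness exercises both directions.
In a real helper the model_* bodies would be replaced by the host
instructions, with the extra XOR done around them.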

I suppose it really depends on whether there is a third host
architecture that could make use of this, and how its AES instructions
map onto the primitive AES ops above.

Cc: Peter Maydell <peter.maydell@linaro.org>
Cc: Alex Bennée <alex.bennee@linaro.org>
Cc: Richard Henderson <richard.henderson@linaro.org>
Cc: Philippe Mathieu-Daudé <f4bug@amsat.org>

Ard Biesheuvel (2):
  target/arm: use x86 intrinsics to implement AES instructions
  target/i386: Implement AES instructions using AArch64 counterparts

 host/include/aarch64/host/cpuinfo.h |  1 +
 host/include/i386/host/cpuinfo.h    |  1 +
 target/arm/tcg/crypto_helper.c      | 37 ++++++++++-
 target/i386/ops_sse.h               | 69 ++++++++++++++++++++
 util/cpuinfo-aarch64.c              |  1 +
 util/cpuinfo-i386.c                 |  1 +
 6 files changed, 107 insertions(+), 3 deletions(-)

-- 
2.39.2


Re: [PATCH v2 0/2] Implement AES on ARM using x86 instructions and vv
Posted by Richard Henderson 10 months ago
On 5/31/23 04:22, Ard Biesheuvel wrote:
> Use the host native instructions to implement the AES instructions
> exposed by the emulated target. The mapping is not 1:1, so it requires a
> bit of fiddling to get the right result.
> 
> This is still RFC material - the current approach feels too ad-hoc, but
> given the non-1:1 correspondence, doing a proper abstraction is rather
> difficult.
> 
> Changes since v1/RFC:
> - add second patch to implement x86 AES instructions on ARM hosts - this
>    helps illustrate what an abstraction should cover.
> - use cpuinfo framework to detect host support for AES instructions.
> - implement ARM aesimc using x86 aesimc directly
> 
> Patch #1 produces a 1.5-2x speedup in tests using the Linux kernel's
> tcrypt benchmark (mode=500)
> 
> Patch #2 produces a 2-3x speedup. The discrepancy is most likely due to
> the fact that ARM uses two instructions to implement a single AES round,
> whereas x86 only uses one.

Thanks.  I spent some time yesterday looking at this, with an encrypted disk test case and 
could only measure 0.6% and 0.5% for total overhead of decrypt and encrypt respectively.

> As for the design of an abstraction: I imagine we could introduce a
> host/aes.h API that implements some building blocks that the TCG helper
> implementation could use.

Indeed.  I was considering interfaces like

/* Perform SubBytes + ShiftRows on state. */
Int128 aesenc_SB_SR(Int128 state);

/* Perform MixColumns on state. */
Int128 aesenc_MC(Int128 state);

/* Perform SubBytes + ShiftRows + MixColumns on state. */
Int128 aesenc_SB_SR_MC(Int128 state);

/* Perform SubBytes + ShiftRows + MixColumns + AddRoundKey. */
Int128 aesenc_SB_SR_MC_AK(Int128 state, Int128 roundkey);

and so forth for aesdec as well.  All but aesenc_MC should be implementable on x86 and 
Power7, and all of them on aarch64.

> I suppose it really depends on whether there is a third host
> architecture that could make use of this, and how its AES instructions
> map onto the primitive AES ops above.

There is Power6 (v{,n}cipher{,last}) and RISC-V Zkn (aes64{es,esm,ds,dsm,im})

Where I got hung up yesterday was understanding the different endian requirements of x86 vs Power.

ppc64:

     asm("lxvd2x 32,0,%1;"
         "lxvd2x 33,0,%2;"
         "vcipher 0,0,1;"
         "stxvd2x 32,0,%0"
         : : "r"(o), "r"(i), "r"(k) : "memory", "v0", "v1", "v2");

ppc64le:

     unsigned char le[16] = {8,9,10,11,12,13,14,15,0,1,2,3,4,5,6,7};
     asm("lxvd2x 32,0,%1;"
         "lxvd2x 33,0,%2;"
         "lxvd2x 34,0,%3;"
         "vperm 0,0,0,2;"
         "vperm 1,1,1,2;"
         "vcipher 0,0,1;"
         "vperm 0,0,0,2;"
         "stxvd2x 32,0,%0"
         : : "r"(o), "r"(i), "r"(k), "r"(le) : "memory", "v0", "v1", "v2");

There are also differences in their AES_Te* based C routines as well, which made me wonder 
if we are handling host endianness differences correctly in emulation right now.  I think 
I should most definitely add some generic-ish tests for this...


r~
Re: [PATCH v2 0/2] Implement AES on ARM using x86 instructions and vv
Posted by Ard Biesheuvel 10 months ago
On Wed, 31 May 2023 at 18:33, Richard Henderson
<richard.henderson@linaro.org> wrote:
>
> On 5/31/23 04:22, Ard Biesheuvel wrote:
> > Use the host native instructions to implement the AES instructions
> > exposed by the emulated target. The mapping is not 1:1, so it requires a
> > bit of fiddling to get the right result.
> >
> > This is still RFC material - the current approach feels too ad-hoc, but
> > given the non-1:1 correspondence, doing a proper abstraction is rather
> > difficult.
> >
> > Changes since v1/RFC:
> > - add second patch to implement x86 AES instructions on ARM hosts - this
> >    helps illustrate what an abstraction should cover.
> > - use cpuinfo framework to detect host support for AES instructions.
> > - implement ARM aesimc using x86 aesimc directly
> >
> > Patch #1 produces a 1.5-2x speedup in tests using the Linux kernel's
> > tcrypt benchmark (mode=500)
> >
> > Patch #2 produces a 2-3x speedup. The discrepancy is most likely due to
> > the fact that ARM uses two instructions to implement a single AES round,
> > whereas x86 only uses one.
>
> Thanks.  I spent some time yesterday looking at this, with an encrypted disk test case and
> could only measure 0.6% and 0.5% for total overhead of decrypt and encrypt respectively.
>

I don't understand what 'overhead' means in this context. Are you
saying you saw barely any improvement?

> > As for the design of an abstraction: I imagine we could introduce a
> > host/aes.h API that implements some building blocks that the TCG helper
> > implementation could use.
>
> Indeed.  I was considering interfaces like
>
> /* Perform SubBytes + ShiftRows on state. */
> Int128 aesenc_SB_SR(Int128 state);
>
> /* Perform MixColumns on state. */
> Int128 aesenc_MC(Int128 state);
>
> /* Perform SubBytes + ShiftRows + MixColumns on state. */
> Int128 aesenc_SB_SR_MC(Int128 state);
>
> /* Perform SubBytes + ShiftRows + MixColumns + AddRoundKey. */
> Int128 aesenc_SB_SR_MC_AK(Int128 state, Int128 roundkey);
>
> and so forth for aesdec as well.  All but aesenc_MC should be implementable on x86 and
> Power7, and all of them on aarch64.
>

aesenc_MC() can be implemented on x86 the way I did in patch #1, using
aesdeclast+aesenc
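
The reason this works: aesdeclast with an all-zero round key performs
InvShiftRows + InvSubBytes, and the ShiftRows + SubBytes at the start of
aesenc undoes exactly that, so only MixColumns remains. The following C
sketch models the two instructions in software to show the cancellation.
It is an illustration under assumptions, not the patch code: aes_state,
the model_* functions and mix_columns_via_x86() are hypothetical names,
and the state is assumed to be in column-major byte order.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

typedef struct { uint8_t b[16]; } aes_state; /* column-major AES state */

static uint8_t sbox[256], inv_sbox[256];

static uint8_t rotl8(uint8_t x, int n)
{
    return (uint8_t)((x << n) | (x >> (8 - n)));
}

/* Build the S-box (GF(2^8) inverse + affine map) and its inverse table. */
static void init_sboxes(void)
{
    uint8_t p = 1, q = 1;
    do {
        p = (uint8_t)(p ^ (p << 1) ^ ((p & 0x80) ? 0x1b : 0)); /* p *= 3 */
        q ^= (uint8_t)(q << 1);                                /* q /= 3 */
        q ^= (uint8_t)(q << 2);
        q ^= (uint8_t)(q << 4);
        q ^= (q & 0x80) ? 0x09 : 0;
        sbox[p] = (uint8_t)(q ^ rotl8(q, 1) ^ rotl8(q, 2) ^
                            rotl8(q, 3) ^ rotl8(q, 4) ^ 0x63);
    } while (p != 1);
    sbox[0] = 0x63;
    for (int i = 0; i < 256; i++)
        inv_sbox[sbox[i]] = (uint8_t)i;
}

static uint8_t xtime(uint8_t x)
{
    return (uint8_t)((x << 1) ^ ((x & 0x80) ? 0x1b : 0));
}

/* Reference MixColumns, used both inside the aesenc model and as the oracle. */
static aes_state mix_columns(aes_state s)
{
    aes_state o;
    for (int c = 0; c < 4; c++) {
        const uint8_t *a = &s.b[4 * c];
        for (int r = 0; r < 4; r++) {
            uint8_t a0 = a[r], a1 = a[(r + 1) % 4];
            uint8_t a2 = a[(r + 2) % 4], a3 = a[(r + 3) % 4];
            o.b[4 * c + r] = (uint8_t)(xtime(a0) ^ xtime(a1) ^ a1 ^ a2 ^ a3);
        }
    }
    return o;
}

/* Model of x86 aesdeclast with a zero key: InvShiftRows + InvSubBytes. */
static aes_state model_aesdeclast_zero_key(aes_state s)
{
    aes_state o;
    for (int c = 0; c < 4; c++)
        for (int r = 0; r < 4; r++)
            o.b[r + 4 * ((c + r) % 4)] = inv_sbox[s.b[r + 4 * c]];
    return o;
}

/* Model of x86 aesenc with a zero key: ShiftRows + SubBytes + MixColumns. */
static aes_state model_aesenc_zero_key(aes_state s)
{
    aes_state o;
    for (int c = 0; c < 4; c++)
        for (int r = 0; r < 4; r++)
            o.b[r + 4 * c] = sbox[s.b[r + 4 * ((c + r) % 4)]];
    return mix_columns(o);
}

/* MixColumns alone, obtained by chaining the two modelled instructions. */
aes_state mix_columns_via_x86(aes_state s)
{
    return model_aesenc_zero_key(model_aesdeclast_zero_key(s));
}

/* Compares the chained form against the reference on one sample input. */
int check_mix_columns_trick(void)
{
    aes_state s;

    init_sboxes();
    for (int i = 0; i < 16; i++)
        s.b[i] = (uint8_t)(i * 29 + 7);

    aes_state ref = mix_columns(s);
    aes_state got = mix_columns_via_x86(s);
    assert(memcmp(&ref, &got, sizeof(ref)) == 0);
    return 1;
}
```

The same argument does not depend on the particular S-box values: any
bytewise substitution followed by its inverse cancels, because SubBytes
and ShiftRows commute.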


> > I suppose it really depends on whether there is a third host
> > architecture that could make use of this, and how its AES instructions
> > map onto the primitive AES ops above.
>
> There is Power6 (v{,n}cipher{,last}) and RISC-V Zkn (aes64{es,esm,ds,dsm,im})
>
> Where I got hung up yesterday was understanding the different endian requirements of x86 vs Power.
>
> ppc64:
>
>      asm("lxvd2x 32,0,%1;"
>          "lxvd2x 33,0,%2;"
>          "vcipher 0,0,1;"
>          "stxvd2x 32,0,%0"
>          : : "r"(o), "r"(i), "r"(k) : "memory", "v0", "v1", "v2");
>
> ppc64le:
>
>      unsigned char le[16] = {8,9,10,11,12,13,14,15,0,1,2,3,4,5,6,7};
>      asm("lxvd2x 32,0,%1;"
>          "lxvd2x 33,0,%2;"
>          "lxvd2x 34,0,%3;"
>          "vperm 0,0,0,2;"
>          "vperm 1,1,1,2;"
>          "vcipher 0,0,1;"
>          "vperm 0,0,0,2;"
>          "stxvd2x 32,0,%0"
>          : : "r"(o), "r"(i), "r"(k), "r"(le) : "memory", "v0", "v1", "v2");
>
> There are also differences in their AES_Te* based C routines as well, which made me wonder
> if we are handling host endianness differences correctly in emulation right now.  I think
> I should most definitely add some generic-ish tests for this...
>

The above kind of sums it up, no? Or isn't this working code?
Re: [PATCH v2 0/2] Implement AES on ARM using x86 instructions and vv
Posted by Richard Henderson 10 months ago
On 5/31/23 09:47, Ard Biesheuvel wrote:
> On Wed, 31 May 2023 at 18:33, Richard Henderson
> <richard.henderson@linaro.org> wrote:
>>
>> On 5/31/23 04:22, Ard Biesheuvel wrote:
>>> Use the host native instructions to implement the AES instructions
>>> exposed by the emulated target. The mapping is not 1:1, so it requires a
>>> bit of fiddling to get the right result.
>>>
>>> This is still RFC material - the current approach feels too ad-hoc, but
>>> given the non-1:1 correspondence, doing a proper abstraction is rather
>>> difficult.
>>>
>>> Changes since v1/RFC:
>>> - add second patch to implement x86 AES instructions on ARM hosts - this
>>>     helps illustrate what an abstraction should cover.
>>> - use cpuinfo framework to detect host support for AES instructions.
>>> - implement ARM aesimc using x86 aesimc directly
>>>
>>> Patch #1 produces a 1.5-2x speedup in tests using the Linux kernel's
>>> tcrypt benchmark (mode=500)
>>>
>>> Patch #2 produces a 2-3x speedup. The discrepancy is most likely due to
>>> the fact that ARM uses two instructions to implement a single AES round,
>>> whereas x86 only uses one.
>>
>> Thanks.  I spent some time yesterday looking at this, with an encrypted disk test case and
>> could only measure 0.6% and 0.5% for total overhead of decrypt and encrypt respectively.
>>
> 
> I don't understand what 'overhead' means in this context. Are you
> saying you saw barely any improvement?

I saw, without changes, just over 1% of total system emulation time was devoted to aes, 
which gives an upper limit to the runtime improvement possible there.  But I'll have a 
look at tcrypt.

> aesenc_MC() can be implemented on x86 the way I did in patch #1, using
> aesdeclast+aesenc

Oh, nice.  I have not read the actual patches yet.

>> ppc64:
>>
>>       asm("lxvd2x 32,0,%1;"
>>           "lxvd2x 33,0,%2;"
>>           "vcipher 0,0,1;"
>>           "stxvd2x 32,0,%0"
>>           : : "r"(o), "r"(i), "r"(k) : "memory", "v0", "v1", "v2");
>>
>> ppc64le:
>>
>>       unsigned char le[16] = {8,9,10,11,12,13,14,15,0,1,2,3,4,5,6,7};
>>       asm("lxvd2x 32,0,%1;"
>>           "lxvd2x 33,0,%2;"
>>           "lxvd2x 34,0,%3;"
>>           "vperm 0,0,0,2;"
>>           "vperm 1,1,1,2;"
>>           "vcipher 0,0,1;"
>>           "vperm 0,0,0,2;"
>>           "stxvd2x 32,0,%0"
>>           : : "r"(o), "r"(i), "r"(k), "r"(le) : "memory", "v0", "v1", "v2");
>>
>> There are also differences in their AES_Te* based C routines as well, which made me wonder
>> if we are handling host endianness differences correctly in emulation right now.  I think
>> I should most definitely add some generic-ish tests for this...
>>
> 
> The above kind of sums it up, no? Or isn't this working code?

It sums up the problem.  It works to produce the same output as the x86 instructions, with 
input bytes in the same order.  It shows that we have to be extra careful emulating vcipher 
etc, and should have unit tests.


r~
Re: [PATCH v2 0/2] Implement AES on ARM using x86 instructions and vv
Posted by Richard Henderson 10 months ago
On 5/31/23 10:08, Richard Henderson wrote:
> On 5/31/23 09:47, Ard Biesheuvel wrote:
>> On Wed, 31 May 2023 at 18:33, Richard Henderson
>>> Thanks.  I spent some time yesterday looking at this, with an encrypted disk test case and
>>> could only measure 0.6% and 0.5% for total overhead of decrypt and encrypt respectively.
>>>
>>
>> I don't understand what 'overhead' means in this context. Are you
>> saying you saw barely any improvement?
> 
> I saw, without changes, just over 1% of total system emulation time was devoted to aes, 
> which gives an upper limit to the runtime improvement possible there.  But I'll have a 
> look at tcrypt.

Using

# insmod /lib/modules/5.10.0-21-arm64/kernel/crypto/tcrypt.ko mode=600 sec=10

I see

   25.50%  qemu-system-aar  qemu-system-aarch64      [.] helper_crypto_aese
   25.36%  qemu-system-aar  qemu-system-aarch64      [.] helper_crypto_aesmc
    6.66%  qemu-system-aar  qemu-system-aarch64      [.] rebuild_hflags_a64
    3.25%  qemu-system-aar  qemu-system-aarch64      [.] tb_lookup
    2.52%  qemu-system-aar  qemu-system-aarch64      [.] fp_exception_el
    2.35%  qemu-system-aar  qemu-system-aarch64      [.] helper_lookup_tb_ptr

Obviously a crypto-heavy test, but 51% of runtime is certainly worth more work.
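
As a rough way to turn such profile shares into an upper bound, Amdahl's
law can be applied. The C sketch below is only illustrative: the function
names are hypothetical, and it assumes that the share of perf samples in
the two helpers approximates their share of runtime.

```c
#include <assert.h>

/*
 * Amdahl's law: if a fraction p of runtime is spent in code that becomes
 * s times faster, the whole run speeds up by 1 / ((1 - p) + p / s).
 */
double overall_speedup(double p, double s)
{
    assert(p >= 0.0 && p <= 1.0 && s > 0.0);
    return 1.0 / ((1.0 - p) + p / s);
}

/* Limit as s grows without bound: 1 / (1 - p). Requires p < 1. */
double speedup_ceiling(double p)
{
    assert(p >= 0.0 && p < 1.0);
    return 1.0 / (1.0 - p);
}
```

With p = 0.5086 (25.50% + 25.36% from the profile above),
speedup_ceiling(p) is roughly 2.0x, and overall_speedup(p, 2.0) is
roughly 1.34x if the helpers themselves become twice as fast. With p
around 0.01, as in the encrypted disk test, the ceiling is only about
1.01x.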


r~