[v1] RE: [PATCH 0/6] Faster AES-XTS on modern x86_64 CPUs

Posted by David Laight 1 year, 10 months ago

From: Eric Biggers
> Sent: 05 April 2024 20:19
...
> I did some tests on Sapphire Rapids using a system call that I customized to do
> nothing except possibly a kernel_fpu_begin / kernel_fpu_end pair.
> 
> On average the bare syscall took 70 ns.  The syscall with the kernel_fpu_begin /
> kernel_fpu_end pair took 160 ns if the userspace program used xmm only, 340 ns
> if it used ymm, or 360 ns if it used zmm...
> 
> Note that without the kernel_fpu_begin / kernel_fpu_end pair, AES-NI
> instructions cannot be used and the alternative would be xts(ecb(aes-generic)).
> On the same CPU, encrypting a single 512-byte sector with xts(ecb(aes-generic))
> takes about 2235 ns.  With xts-aes-vaes-avx10_512 it takes 75 ns...

So most of the cost of encrypting a single 512-byte sector is the
kernel_fpu_begin().
But every alternative is so much slower that this is still the fastest option.

	David


Re: [PATCH 0/6] Faster AES-XTS on modern x86_64 CPUs

Posted by Eric Biggers 1 year, 10 months ago

On Mon, Apr 08, 2024 at 07:41:44AM +0000, David Laight wrote:
> From: Eric Biggers
> > Sent: 05 April 2024 20:19
> ...
> > I did some tests on Sapphire Rapids using a system call that I customized to do
> > nothing except possibly a kernel_fpu_begin / kernel_fpu_end pair.
> > 
> > On average the bare syscall took 70 ns.  The syscall with the kernel_fpu_begin /
> > kernel_fpu_end pair took 160 ns if the userspace program used xmm only, 340 ns
> > if it used ymm, or 360 ns if it used zmm...
> > 
> > Note that without the kernel_fpu_begin / kernel_fpu_end pair, AES-NI
> > instructions cannot be used and the alternative would be xts(ecb(aes-generic)).
> > On the same CPU, encrypting a single 512-byte sector with xts(ecb(aes-generic))
> > takes about 2235 ns.  With xts-aes-vaes-avx10_512 it takes 75 ns...
> 
> So most of the cost of encrypting a single 512-byte sector is the
> kernel_fpu_begin().
> But every alternative is so much slower that this is still the fastest option.
> 

Yes.  To clarify, the 75 ns I mentioned for a 512-byte sector is the average
over repeated calls, which amortizes the XSAVE and XRSTOR.  A real single
512-byte sector has to absorb the entire cost of the XSAVE and XRSTOR by itself.
If all state is in use, that should come to about 75 + (360 - 70) = 365 ns
(based on the syscall benchmarks I did), with the XSAVE and XRSTOR accounting
for about 80% of that time.  But yes, that's still over 6 times faster than the
scalar alternative.

- Eric
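
The arithmetic in the reply above can be reproduced with a short script. This is
only a sketch built from the figures quoted in the thread; the function and
variable names are hypothetical. Only the "zmm" case (all state in use) is the
estimate given in the reply. Applying the same formula to the "xmm" and "ymm"
syscall figures is an extrapolation, not something measured in the thread.

```python
# Figures quoted in the thread (Sapphire Rapids), all in nanoseconds.
BARE_SYSCALL_NS = 70
# Syscall containing a kernel_fpu_begin / kernel_fpu_end pair, keyed by the
# widest register state the userspace program was using.
SYSCALL_WITH_FPU_NS = {"xmm": 160, "ymm": 340, "zmm": 360}
AMORTIZED_VAES_NS = 75   # xts-aes-vaes-avx10_512, averaged over repeated calls
GENERIC_NS = 2235        # xts(ecb(aes-generic)), one 512-byte sector


def single_sector_estimate(user_state):
    """Estimate the cost of one isolated 512-byte sector (hypothetical helper).

    The save/restore overhead is taken as the difference between the syscall
    with and without the kernel_fpu_begin / kernel_fpu_end pair.
    """
    fpu_overhead_ns = SYSCALL_WITH_FPU_NS[user_state] - BARE_SYSCALL_NS
    total_ns = AMORTIZED_VAES_NS + fpu_overhead_ns
    return {
        "fpu_overhead_ns": fpu_overhead_ns,
        "total_ns": total_ns,
        "fpu_share": fpu_overhead_ns / total_ns,
        "speedup_vs_generic": GENERIC_NS / total_ns,
    }


est = single_sector_estimate("zmm")
print(f"total: {est['total_ns']} ns")                       # total: 365 ns
print(f"save/restore share: {est['fpu_share']:.0%}")        # 79%
print(f"speedup vs generic: {est['speedup_vs_generic']:.1f}x")  # 6.1x
```

For the all-state-in-use case this gives 365 ns, a save/restore share of roughly
80%, and a speedup of just over 6x, matching the numbers in the reply.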