From: Jason A. Donenfeld
> Sent: 18 January 2022 11:43
>
> On 1/18/22, Herbert Xu <herbert@gondor.apana.org.au> wrote:
> > As the patches that triggered this weren't part of the crypto
> > tree, this will have to go through the random tree if you want
> > them for 5.17.
>
> Sure, will do.

I've rammed the code through godbolt... https://godbolt.org/z/Wv64z9zG8

Some things I've noticed;

1) There is no point having all the inline functions.
   Far better to have real functions to do the work.
   Given the cost of hashing 64 bytes of data the extra
   function call won't matter.
   Indeed for repeated calls it will help because the required
   code will be in the I-cache.

2) The compilers I tried do manage to remove the blake2_sigma[][]
   lookups when unrolling everything - which is a slight gain for
   the full unroll. But I doubt it is that significant if the access
   can get sensibly optimised.
   For non-x86 that might require all the values to be multiplied by 4.

3) Although G() is a massive register dependency chain the compiler
   knows that G(,[0-3],) are independent and can execute in parallel.
   This does help execution time on multi-issue CPUs (like x86).
   With care it ought to be possible to use the same code for G(,[4-7],)
   without stopping the compiler interleaving all the instructions.

4) I strongly suspect that using a loop for the rounds will have
   minimal impact on performance - especially if the first call is
   'cold cache'.
   But I've not got time to test the code.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)
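
For readers following points 1, 3 and 4, here is a minimal sketch of what a "rolled" BLAKE2s compression function could look like: G() is a real function, the ten rounds are a loop indexed through the sigma table, and the column and diagonal steps share one call pattern. This is not the code from the godbolt link or from the kernel. The names `blake2s_g`, `blake2s_compress_rolled`, `blake2s_one_block` and `load_le32` are hypothetical, the constants are taken from RFC 7693, and nothing here has been benchmarked.

```c
#include <stdint.h>
#include <string.h>

static const uint32_t blake2s_iv[8] = {
	0x6A09E667, 0xBB67AE85, 0x3C6EF372, 0xA54FF53A,
	0x510E527F, 0x9B05688C, 0x1F83D9AB, 0x5BE0CD19
};

/* Message word schedule from RFC 7693, one row per round. */
static const uint8_t blake2s_sigma[10][16] = {
	{ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15 },
	{ 14, 10, 4, 8, 9, 15, 13, 6, 1, 12, 0, 2, 11, 7, 5, 3 },
	{ 11, 8, 12, 0, 5, 2, 15, 13, 10, 14, 3, 6, 7, 1, 9, 4 },
	{ 7, 9, 3, 1, 13, 12, 11, 14, 2, 6, 5, 10, 4, 0, 15, 8 },
	{ 9, 0, 5, 7, 2, 4, 10, 15, 14, 1, 11, 12, 6, 8, 3, 13 },
	{ 2, 12, 6, 10, 0, 11, 8, 3, 4, 13, 7, 5, 15, 14, 1, 9 },
	{ 12, 5, 1, 15, 14, 13, 4, 10, 0, 7, 6, 3, 9, 2, 8, 11 },
	{ 13, 11, 7, 14, 12, 1, 3, 9, 5, 0, 15, 4, 8, 6, 2, 10 },
	{ 6, 15, 14, 9, 11, 3, 0, 8, 12, 2, 13, 7, 1, 4, 10, 5 },
	{ 10, 2, 8, 4, 7, 6, 1, 5, 15, 11, 9, 14, 3, 12, 13, 0 },
};

static uint32_t ror32(uint32_t x, unsigned int n)
{
	return (x >> n) | (x << (32 - n));	/* n is never 0 here */
}

static uint32_t load_le32(const uint8_t *p)
{
	return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
	       ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* The mixing function G as a real function, not a macro (point 1). */
static void blake2s_g(uint32_t *a, uint32_t *b, uint32_t *c, uint32_t *d,
		      uint32_t x, uint32_t y)
{
	*a += *b + x; *d = ror32(*d ^ *a, 16);
	*c += *d;     *b = ror32(*b ^ *c, 12);
	*a += *b + y; *d = ror32(*d ^ *a, 8);
	*c += *d;     *b = ror32(*b ^ *c, 7);
}

/* One compression with the rounds as a loop (point 4). */
static void blake2s_compress_rolled(uint32_t h[8], const uint8_t block[64],
				    uint32_t t0, uint32_t t1, uint32_t f0)
{
	uint32_t m[16], v[16];
	int i, r;

	for (i = 0; i < 16; i++)
		m[i] = load_le32(block + 4 * i);
	for (i = 0; i < 8; i++) {
		v[i] = h[i];
		v[i + 8] = blake2s_iv[i];
	}
	v[12] ^= t0;
	v[13] ^= t1;
	v[14] ^= f0;

	for (r = 0; r < 10; r++) {
		const uint8_t *s = blake2s_sigma[r];

		/* Columns: four independent G calls (point 3). */
		for (i = 0; i < 4; i++)
			blake2s_g(&v[i], &v[4 + i], &v[8 + i], &v[12 + i],
				  m[s[2 * i]], m[s[2 * i + 1]]);
		/* Diagonals: same call, indices rotated by 1, 2 and 3. */
		for (i = 0; i < 4; i++)
			blake2s_g(&v[i], &v[4 + ((i + 1) & 3)],
				  &v[8 + ((i + 2) & 3)], &v[12 + ((i + 3) & 3)],
				  m[s[8 + 2 * i]], m[s[9 + 2 * i]]);
	}
	for (i = 0; i < 8; i++)
		h[i] ^= v[i] ^ v[i + 8];
}

/* Hypothetical helper: unkeyed BLAKE2s-256 of a message of at most 64 bytes. */
static void blake2s_one_block(uint8_t out[32], const uint8_t *in, size_t inlen)
{
	uint32_t h[8];
	uint8_t block[64] = { 0 };
	int i;

	memcpy(h, blake2s_iv, sizeof(h));
	h[0] ^= 0x01010000 ^ 32;	/* fanout 1, depth 1, no key, 32-byte digest */
	memcpy(block, in, inlen);
	blake2s_compress_rolled(h, block, (uint32_t)inlen, 0, 0xFFFFFFFF);
	for (i = 0; i < 32; i++)
		out[i] = (uint8_t)(h[i / 4] >> (8 * (i % 4)));
}
```

Whether the compiler still interleaves the four G calls in each inner loop, and whether the sigma lookups cost anything measurable, depends on the compiler and target; that is exactly what would need benchmarking before drawing conclusions.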
On Tue, Jan 18, 2022 at 1:45 PM David Laight <David.Laight@aculab.com> wrote:
> I've rammed the code through godbolt... https://godbolt.org/z/Wv64z9zG8
>
> Some things I've noticed;

It seems like you've done a lot of work here but...

> But I've not got time to test the code.

But you're not going to take it all the way. So it unfortunately
amounts to mailing list armchair optimization. That's too bad because
it really seems like you might be onto something worth seeing through.
As I've mentioned a few times now, I've dropped the blake2s
optimization patch, and I won't be developing that further. But it
appears as though you've really been captured by it, so I urge you:
please send a real patch with benchmarks on various platforms! (And CC
me on the patch.) Faster reference code would really be terrific.

Jason