[PATCH next 0/3] lib: Implement mul_u64_u64_div_u64_roundup()

David Laight posted 3 patches 10 months, 1 week ago
[PATCH next 0/3] lib: Implement mul_u64_u64_div_u64_roundup()
Posted by David Laight 10 months, 1 week ago
The pwm-stm32.c code wants a 'rounding up' version of mul_u64_u64_div_u64().
This can be done simply by adding 'divisor - 1' to the 128bit product.
Implement mul_u64_add_u64_div_u64(a, b, c, d) = (a * b + c)/d based on the
existing code.
Define mul_u64_u64_div_u64(a, b, d) as mul_u64_add_u64_div_u64(a, b, 0, d) and
mul_u64_u64_div_u64_roundup(a, b, d) as mul_u64_add_u64_div_u64(a, b, d-1, d).

Only x86-64 has an optimised (asm) version of the function.
That is optimised to avoid the 'add c' when c is known to be zero.
In all other cases the extra code will be noise compared to the software
divide code.

I've updated the test module to test mul_u64_u64_div_u64_roundup() and
also enhanced it to verify the C division code on x86-64.

Note that the code generated by gcc (eg for 32bit x86) just for the multiply
is rather more horrid than one would expect (clang does better).
I dread to think how long the divide loop takes.
And I'm not at all sure the call in kernel/sched/cputime.c isn't in a
relatively common path (rather than just hardware initialisation).

David Laight (3):
  lib: Add mul_u64_add_u64_div_u64() and mul_u64_u64_div_u64_roundup()
  lib: Add tests for mul_u64_u64_div_u64_roundup()
  lib: Update the muldiv64 tests to verify the C on x86-64

 arch/x86/include/asm/div64.h        |  19 ++--
 include/linux/math64.h              |  44 ++++++++-
 lib/math/div64.c                    |  57 ++++++++----
 lib/math/test_mul_u64_u64_div_u64.c | 136 +++++++++++++++++-----------
 4 files changed, 179 insertions(+), 77 deletions(-)

-- 
2.39.5
Re: [PATCH next 0/3] lib: Implement mul_u64_u64_div_u64_roundup()
Posted by Uwe Kleine-König 8 months, 3 weeks ago
Hello David,

On Sat, Apr 05, 2025 at 09:45:27PM +0100, David Laight wrote:
> The pwm-stm32.c code wants a 'rounding up' version of mul_u64_u64_div_u64().
> This can be done simply by adding 'divisor - 1' to the 128bit product.
> Implement mul_u64_add_u64_div_u64(a, b, c, d) = (a * b + c)/d based on the
> existing code.
> Define mul_u64_u64_div_u64(a, b, d) as mul_u64_add_u64_div_u64(a, b, 0, d) and
> mul_u64_u64_div_u64_roundup(a, b, d) as mul_u64_add_u64_div_u64(a, b, d-1, d).
> 
> Only x86-64 has an optimised (asm) version of the function.
> That is optimised to avoid the 'add c' when c is known to be zero.
> In all other cases the extra code will be noise compared to the software
> divide code.
> 
> I've updated the test module to test mul_u64_u64_div_u64_roundup() and
> also enhanced it to verify the C division code on x86-64.
> 
> Note that the code generated by gcc (eg for 32bit x86) just for the multiply
> is rather more horrid than one would expect (clang does better).
> I dread to think how long the divide loop takes.
> And I'm not at all sure the call in kernel/sched/cputime.c isn't in a
> relatively common path (rather than just hardware initialisation).
> 
> David Laight (3):
>   lib: Add mul_u64_add_u64_div_u64() and mul_u64_u64_div_u64_roundup()
>   lib: Add tests for mul_u64_u64_div_u64_roundup()
>   lib: Update the muldiv64 tests to verify the C on x86-64

I wonder what happened to this series. I'd like to make use of
mul_u64_u64_div_u64_roundup() so I'd be interested to get this into the
mainline.

Best regards
Uwe
Re: [PATCH next 0/3] lib: Implement mul_u64_u64_div_u64_roundup()
Posted by David Laight 8 months, 3 weeks ago
On Fri, 16 May 2025 11:47:58 +0200
Uwe Kleine-König <u.kleine-koenig@baylibre.com> wrote:

> Hello David,
> 
> On Sat, Apr 05, 2025 at 09:45:27PM +0100, David Laight wrote:
> > The pwm-stm32.c code wants a 'rounding up' version of mul_u64_u64_div_u64().
> > This can be done simply by adding 'divisor - 1' to the 128bit product.
> > Implement mul_u64_add_u64_div_u64(a, b, c, d) = (a * b + c)/d based on the
> > existing code.
> > Define mul_u64_u64_div_u64(a, b, d) as mul_u64_add_u64_div_u64(a, b, 0, d) and
> > mul_u64_u64_div_u64_roundup(a, b, d) as mul_u64_add_u64_div_u64(a, b, d-1, d).
> > 
> > Only x86-64 has an optimised (asm) version of the function.
> > That is optimised to avoid the 'add c' when c is known to be zero.
> > In all other cases the extra code will be noise compared to the software
> > divide code.
> > 
> > I've updated the test module to test mul_u64_u64_div_u64_roundup() and
> > also enhanced it to verify the C division code on x86-64.
> > 
> > Note that the code generated by gcc (eg for 32bit x86) just for the multiply
> > is rather more horrid than one would expect (clang does better).
> > I dread to think how long the divide loop takes.
> > And I'm not at all sure the call in kernel/sched/cputime.c isn't in a
> > relatively common path (rather than just hardware initialisation).
> > 
> > David Laight (3):
> >   lib: Add mul_u64_add_u64_div_u64() and mul_u64_u64_div_u64_roundup()
> >   lib: Add tests for mul_u64_u64_div_u64_roundup()
> >   lib: Update the muldiv64 tests to verify the C on x86-64  
> 
> I wonder what happened to this series. I'd like to make use of
> mul_u64_u64_div_u64_roundup() so I'd be interested to get this into the
> mainline.

I've a WIP rewrite of the divide code, which speeds it up considerably for
anything that isn't amd64.

IIRC (the test machine is powered off) the test cases with random data
go down from over 900 clocks to below 150 for x86-32.
The 64bit code (on x86-64 but skipping the divide instruction) takes ~70
clocks rather than nearer 180.
Both those clock counts are almost data independent.

The code relies on the cpu having a hardware 'unsigned long / unsigned long'
divide instruction.

I got reasonable code for 32bit x86 running gcc 12.2 from Debian.
Then copied it to godbolt and found gcc before 15.0 making a pig's
breakfast of it (all clang versions are fine).
So some rework is needed.

I still need to actually plumb it into the kernel sources.

I do want to test the divide loop on x86-64, at least for 64bit.
Probably by including the code directly into the test module.

	David
Re: [PATCH next 0/3] lib: Implement mul_u64_u64_div_u64_roundup()
Posted by Nicolas Pitre 8 months, 3 weeks ago
On Fri, 16 May 2025, David Laight wrote:

> On Fri, 16 May 2025 11:47:58 +0200
> Uwe Kleine-König <u.kleine-koenig@baylibre.com> wrote:
> 
> > Hello David,
> > 
> > On Sat, Apr 05, 2025 at 09:45:27PM +0100, David Laight wrote:
> > > The pwm-stm32.c code wants a 'rounding up' version of mul_u64_u64_div_u64().
> > > This can be done simply by adding 'divisor - 1' to the 128bit product.
> > > Implement mul_u64_add_u64_div_u64(a, b, c, d) = (a * b + c)/d based on the
> > > existing code.
> > > Define mul_u64_u64_div_u64(a, b, d) as mul_u64_add_u64_div_u64(a, b, 0, d) and
> > > mul_u64_u64_div_u64_roundup(a, b, d) as mul_u64_add_u64_div_u64(a, b, d-1, d).
> > > 
> > > Only x86-64 has an optimised (asm) version of the function.
> > > That is optimised to avoid the 'add c' when c is known to be zero.
> > > In all other cases the extra code will be noise compared to the software
> > > divide code.
> > > 
> > > I've updated the test module to test mul_u64_u64_div_u64_roundup() and
> > > also enhanced it to verify the C division code on x86-64.
> > > 
> > > Note that the code generated by gcc (eg for 32bit x86) just for the multiply
> > > is rather more horrid than one would expect (clang does better).
> > > I dread to think how long the divide loop takes.
> > > And I'm not at all sure the call in kernel/sched/cputime.c isn't in a
> > > relatively common path (rather than just hardware initialisation).
> > > 
> > > David Laight (3):
> > >   lib: Add mul_u64_add_u64_div_u64() and mul_u64_u64_div_u64_roundup()
> > >   lib: Add tests for mul_u64_u64_div_u64_roundup()
> > >   lib: Update the muldiv64 tests to verify the C on x86-64  
> > 
> > I wonder what happened to this series. I'd like to make use of
> > mul_u64_u64_div_u64_roundup() so I'd be interested to get this into the
> > mainline.
> 
> I've a WIP rewrite of the divide code, which speeds it up considerably for
> anything that isn't amd64.

May I suggest you simply submit the new API now (addressing my latest 
comments if possible) and submit the divide optimization later?


Nicolas
Re: [PATCH next 0/3] lib: Implement mul_u64_u64_div_u64_roundup()
Posted by David Laight 8 months, 3 weeks ago
On Fri, 16 May 2025 11:49:44 -0400 (EDT)
Nicolas Pitre <npitre@baylibre.com> wrote:

> On Fri, 16 May 2025, David Laight wrote:
> 
> > On Fri, 16 May 2025 11:47:58 +0200
> > Uwe Kleine-König <u.kleine-koenig@baylibre.com> wrote:
> >   
> > > Hello David,
> > > 
> > > On Sat, Apr 05, 2025 at 09:45:27PM +0100, David Laight wrote:  
> > > > The pwm-stm32.c code wants a 'rounding up' version of mul_u64_u64_div_u64().
> > > > This can be done simply by adding 'divisor - 1' to the 128bit product.
> > > > Implement mul_u64_add_u64_div_u64(a, b, c, d) = (a * b + c)/d based on the
> > > > existing code.
> > > > Define mul_u64_u64_div_u64(a, b, d) as mul_u64_add_u64_div_u64(a, b, 0, d) and
> > > > mul_u64_u64_div_u64_roundup(a, b, d) as mul_u64_add_u64_div_u64(a, b, d-1, d).
> > > > 
> > > > Only x86-64 has an optimised (asm) version of the function.
> > > > That is optimised to avoid the 'add c' when c is known to be zero.
> > > > In all other cases the extra code will be noise compared to the software
> > > > divide code.
> > > > 
> > > > I've updated the test module to test mul_u64_u64_div_u64_roundup() and
> > > > also enhanced it to verify the C division code on x86-64.
> > > > 
> > > > Note that the code generated by gcc (eg for 32bit x86) just for the multiply
> > > > is rather more horrid than one would expect (clang does better).
> > > > I dread to think how long the divide loop takes.
> > > > And I'm not at all sure the call in kernel/sched/cputime.c isn't in a
> > > > relatively common path (rather than just hardware initialisation).
...
> > > I wonder what happened to this series. I'd like to make use of
> > > mul_u64_u64_div_u64_roundup() so I'd be interested to get this into the
> > > mainline.  
> > 
> > I've a WIP rewrite of the divide code, which speeds it up considerably for
> > anything that isn't amd64.
> 
> May I suggest you simply submit the new API now (addressing my latest 
> comments if possible) and submit the divide optimization later?

I've just 'lobbed in' a 'v2' that excludes the last patch (extra changes
to the test module) and replaces 'if (!divisor) 1/0;' with BUG_ON(!divisor).

I'll do the speedup changes on top.

	David