[PATCH v3 next 00/10] Implement mul_u64_u64_div_u64_roundup()

Posted by David Laight 3 months, 3 weeks ago
The pwm-stm32.c code wants a 'rounding up' version of mul_u64_u64_div_u64().
This can be done simply by adding 'divisor - 1' to the 128bit product.
Implement mul_u64_add_u64_div_u64(a, b, c, d) = (a * b + c)/d based on the
existing code.
Define mul_u64_u64_div_u64(a, b, d) as mul_u64_add_u64_div_u64(a, b, 0, d) and
mul_u64_u64_div_u64_roundup(a, b, d) as mul_u64_add_u64_div_u64(a, b, d-1, d).
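
For reference, the two wrappers reduce to the usual identity
roundup(x/d) = (x + d - 1)/d; roughly (a minimal sketch, not the
exact patch text):

	/* Truncating divide: c == 0 adds nothing to the product. */
	static inline u64 mul_u64_u64_div_u64(u64 a, u64 b, u64 d)
	{
		return mul_u64_add_u64_div_u64(a, b, 0, d);
	}

	/* (a * b + d - 1) / d: any non-zero remainder rounds up. */
	static inline u64 mul_u64_u64_div_u64_roundup(u64 a, u64 b, u64 d)
	{
		return mul_u64_add_u64_div_u64(a, b, d - 1, d);
	}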

Only x86-64 has an optimised (asm) version of the function.
It is optimised to avoid the 'add c' when c is known to be zero
(see the sketch below).
In all other cases the extra code will be noise compared to the software
divide code.
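
A hedged sketch of how the x86-64 asm version can elide the add (the
real patch may differ; overflow and zero-divisor handling omitted):

	static inline u64 mul_u64_add_u64_div_u64(u64 a, u64 b, u64 c, u64 d)
	{
		u64 q;

		if (__builtin_constant_p(c) && !c) {
			/* 'add c' elided when the compiler proves c == 0 */
			asm ("mulq %2; divq %3"
			     : "=a" (q)
			     : "a" (a), "rm" (b), "rm" (d)
			     : "rdx");
		} else {
			/* fold c into the 128bit product in rdx:rax */
			asm ("mulq %2; addq %3, %%rax; adcq $0, %%rdx; divq %4"
			     : "=a" (q)
			     : "a" (a), "rm" (b), "rm" (c), "rm" (d)
			     : "rdx");
		}
		return q;
	}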

The test module has been updated to test mul_u64_u64_div_u64_roundup() and
also enhanced to verify the C division code on x86-64 and the 32bit
division code on 64bit systems.

Changes for v2:
- Rename the 'divisor' parameter from 'c' to 'd'.
- Add an extra patch to use BUG_ON() to trap zero divisors.
- Remove the last patch that ran the C code on x86-64
  (I've a plan to do that differently).

Changes for v3:
- Replace the BUG_ON() (or panic in the original version) for zero
  divisors with a WARN_ONCE() and return zero.
- Remove the 'pre-multiply' check for small products; it is
  completely non-trivial on 32bit systems.
- Use mul_u32_u32() and the new add_u64_u32() to stop gcc generating
  pretty much pessimal code for x86 with lots of register spills
  (both helpers are sketched after this list).
- Replace the 'bit at a time' divide with one that generates 16 bits
  per iteration on 32bit systems and 32 bits per iteration on 64bit.
  Massively faster, the tests run in under 1/3 the time.
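
Sketch of the two helpers referred to above: mul_u32_u32() is the
existing widening multiply from include/linux/math64.h, while
add_u64_u32() is new in this series, so its generic form below is an
assumption based on its use (x86 gets asm versions so gcc tracks the
32bit halves instead of spilling 64bit register pairs):

	/* Widening 32x32 => 64 multiply (existing helper). */
	static inline u64 mul_u32_u32(u32 a, u32 b)
	{
		return (u64)a * b;
	}

	/* 64 + 32 add; keeping 'b' typed as u32 helps 32bit codegen. */
	static inline u64 add_u64_u32(u64 a, u32 b)
	{
		return a + b;
	}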

David Laight (10):
  lib: mul_u64_u64_div_u64() rename parameter 'c' to 'd'
  lib: mul_u64_u64_div_u64() Use WARN_ONCE() for divide errors.
  lib: mul_u64_u64_div_u64() simplify check for a 64bit product
  lib: Add mul_u64_add_u64_div_u64() and mul_u64_u64_div_u64_roundup()
  lib: Add tests for mul_u64_u64_div_u64_roundup()
  lib: test_mul_u64_u64_div_u64: Test both generic and arch versions
  lib: mul_u64_u64_div_u64() optimise multiply on 32bit x86
  lib: mul_u64_u64_div_u64() Separate multiply to a helper for clarity
  lib: mul_u64_u64_div_u64() Optimise the divide code
  lib: test_mul_u64_u64_div_u64: Test the 32bit code on 64bit

 arch/x86/include/asm/div64.h        |  33 ++++-
 include/linux/math64.h              |  56 ++++++++-
 lib/math/div64.c                    | 189 +++++++++++++++++++---------
 lib/math/test_mul_u64_u64_div_u64.c | 187 +++++++++++++++++++--------
 4 files changed, 349 insertions(+), 116 deletions(-)

-- 
2.39.5
Re: [PATCH v3 next 00/10] Implement mul_u64_u64_div_u64_roundup()
Posted by Peter Zijlstra 3 months, 3 weeks ago
On Sat, Jun 14, 2025 at 10:53:36AM +0100, David Laight wrote:
> The pwm-stm32.c code wants a 'rounding up' version of mul_u64_u64_div_u64().
> This can be done simply by adding 'divisor - 1' to the 128bit product.
> Implement mul_u64_add_u64_div_u64(a, b, c, d) = (a * b + c)/d based on the
> existing code.
> Define mul_u64_u64_div_u64(a, b, d) as mul_u64_add_u64_div_u64(a, b, 0, d) and
> mul_u64_u64_div_u64_roundup(a, b, d) as mul_u64_add_u64_div_u64(a, b, d-1, d).

Another way to achieve the same would be to expose the remainder of
mul_u64_u64_div_u64(), no?
Re: [PATCH v3 next 00/10] Implement mul_u64_u64_div_u64_roundup()
Posted by David Laight 3 months, 3 weeks ago
On Sat, 14 Jun 2025 12:27:17 +0200
Peter Zijlstra <peterz@infradead.org> wrote:

> On Sat, Jun 14, 2025 at 10:53:36AM +0100, David Laight wrote:
> > The pwm-stm32.c code wants a 'rounding up' version of mul_u64_u64_div_u64().
> > This can be done simply by adding 'divisor - 1' to the 128bit product.
> > Implement mul_u64_add_u64_div_u64(a, b, c, d) = (a * b + c)/d based on the
> > existing code.
> > Define mul_u64_u64_div_u64(a, b, d) as mul_u64_add_u64_div_u64(a, b, 0, d) and
> > mul_u64_u64_div_u64_roundup(a, b, d) as mul_u64_add_u64_div_u64(a, b, d-1, d).  
> 
> Another way to achieve the same would be to expose the remainder of
> mul_u64_u64_div_u64(), no?

The optimised paths for small products don't give the remainder.
And you don't want to do the multiply again to generate it.

As well as the conditional increment you'd also need another test
for the quotient overflowing.

Adding 'c' to the product is cheap.
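
To illustrate (mul_u64_u64_div_u64_rem() here is hypothetical, just to
show the shape of the remainder-based alternative):

	/* Round up via an exposed remainder: two extra branches. */
	u64 roundup_via_remainder(u64 a, u64 b, u64 d)
	{
		u64 rem;
		u64 q = mul_u64_u64_div_u64_rem(a, b, d, &rem);

		if (rem) {
			if (q == U64_MAX)	/* extra overflow test */
				return U64_MAX;
			q++;			/* conditional increment */
		}
		return q;
	}

	/* Versus folding 'd - 1' into the 128bit product up front. */
	u64 roundup_via_add(u64 a, u64 b, u64 d)
	{
		return mul_u64_u64_div_u64_roundup(a, b, d);
	}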

	David