[PATCH v4 next 0/9] Implement mul_u64_u64_div_u64_roundup()
Posted by David Laight 1 month, 2 weeks ago
The pwm-stm32.c code wants a 'rounding up' version of mul_u64_u64_div_u64().
This can be done simply by adding 'divisor - 1' to the 128bit product.
Implement mul_u64_add_u64_div_u64(a, b, c, d) = (a * b + c)/d based on the
existing code.
Define mul_u64_u64_div_u64(a, b, d) as mul_u64_add_u64_div_u64(a, b, 0, d) and
mul_u64_u64_div_u64_roundup(a, b, d) as mul_u64_add_u64_div_u64(a, b, d-1, d).
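
For reference, the two wrappers are trivial; a minimal sketch (the actual
definitions in math64.h may differ in detail):

static inline u64 mul_u64_u64_div_u64(u64 a, u64 b, u64 d)
{
	return mul_u64_add_u64_div_u64(a, b, 0, d);
}

static inline u64 mul_u64_u64_div_u64_roundup(u64 a, u64 b, u64 d)
{
	/* adding d - 1 before dividing rounds any non-zero remainder up */
	return mul_u64_add_u64_div_u64(a, b, d - 1, d);
}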

Only x86-64 has an optimised (asm) version of the function.
It is optimised to avoid the 'add c' when c is known to be zero.
In all other cases the extra add is noise compared to the software
divide code.

The test module has been updated to test mul_u64_u64_div_u64_roundup() and
also enhanced to verify the C division code on x86-64 and the 32bit
division code on 64bit.
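
For anyone wanting a quick sanity check outside the kernel, the rounding-up
variant can be compared against a 128bit reference; a standalone userspace
sketch (not the kernel test module) using the compiler's unsigned __int128:

#include <assert.h>

typedef unsigned long long u64;

u64 mul_u64_u64_div_u64_roundup(u64 a, u64 b, u64 d);	/* function under test */

/* Only valid when d is non-zero and the true quotient fits in 64 bits */
static void check_roundup(u64 a, u64 b, u64 d)
{
	unsigned __int128 ref = ((unsigned __int128)a * b + (d - 1)) / d;

	if (ref <= ~0ull)
		assert(mul_u64_u64_div_u64_roundup(a, b, d) == (u64)ref);
}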

Changes for v2:
- Rename the 'divisor' parameter from 'c' to 'd'.
- Add an extra patch to use BUG_ON() to trap zero divisors.
- Remove the last patch that ran the C code on x86-64
  (I've a plan to do that differently).

Changes for v3:
- Replace the BUG_ON() (or panic in the original version) for zero
  divisors with a WARN_ONCE() and return zero.
- Remove the 'pre-multiply' check for small products.
  It is completely non-trivial on 32bit systems.
- Use mul_u32_u32() and the new add_u64_u32() to stop gcc generating
  pretty much pessimal code for x86 with lots of register spills
  (a portable sketch of the multiply step follows this list).
- Replace the 'bit at a time' divide with one that generates 16 bits
  per iteration on 32bit systems and 32 bits per iteration on 64bit.
  Massively faster: the tests run in under 1/3 of the time.
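
As promised above, a portable sketch of the multiply step, building the
128bit product (plus the 64bit addend) from 32bit partial products.
This is only an illustration of the technique: the in-tree helper differs
(asm on some architectures, add_u64_u32(), etc.), and u32/u64 are the usual
kernel types from <linux/types.h>.

static inline u64 mul_u32_u32(u32 a, u32 b)
{
	return (u64)a * b;
}

/* *n_lo/return = low/high 64 bits of a * b + c */
static u64 mul_u64_u64_add_u64(u64 *n_lo, u64 a, u64 b, u64 c)
{
	u32 a_lo = a, a_hi = a >> 32, b_lo = b, b_hi = b >> 32;
	u64 p_lo = mul_u32_u32(a_lo, b_lo);
	u64 p_m1 = mul_u32_u32(a_hi, b_lo);
	u64 p_m2 = mul_u32_u32(a_lo, b_hi);
	u64 p_hi = mul_u32_u32(a_hi, b_hi);
	u64 mid, lo;

	mid = p_m1 + p_m2;
	if (mid < p_m1)			/* carry out of the middle sum */
		p_hi += 1ull << 32;
	p_hi += mid >> 32;

	lo = p_lo + (mid << 32);
	if (lo < p_lo)			/* carry into the high word */
		p_hi++;

	lo += c;			/* fold in the 64bit addend */
	if (lo < c)
		p_hi++;

	*n_lo = lo;
	return p_hi;
}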

Changes for v4:
No significant code changes.
- Rebase on 6.18-rc2
- Don't change the existing behaviour on overflow (return ~0ull) or
  divide by zero (evaluate ~0ul/0 to trigger the runtime exception).
- Merge patches 8 and 9 to avoid bisection issues.
- Fix build of 32bit test cases on non-x86.
- Fix shell script that verifies test cases.

I've left the x86-64 version faulting on both overflow and divide by zero.
The patch to add an exception table entry to return ~0 for both
doesn't seem to have been merged.
If it is merged it would make sense for the C version to return ~0 for both.
Callers can check for a result of ~0 and then check the divisor if
they care about overflow (etc).
(A valid quotient of ~0 is pretty unlikely, and marginal changes to the
input values are likely to generate a real overflow.)
Code that faulted on overflow was about to get invalid results anyway,
because one of its 64bit inputs would itself wrap very soon.
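
For illustration, a caller under that (unmerged) ~0-return behaviour might
look like this; do_scale() and the error codes are hypothetical, not
current kernel behaviour:

static int do_scale(u64 a, u64 b, u64 d, u64 *res)
{
	u64 q = mul_u64_u64_div_u64(a, b, d);

	if (unlikely(q == ~0ull)) {
		if (d == 0)
			return -EINVAL;	/* divide by zero */
		/* overflow - or, rarely, a genuine quotient of ~0 */
		return -ERANGE;
	}

	*res = q;
	return 0;
}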

David Laight (9):
  lib: mul_u64_u64_div_u64() rename parameter 'c' to 'd'
  lib: mul_u64_u64_div_u64() Combine overflow and divide by zero checks
  lib: mul_u64_u64_div_u64() simplify check for a 64bit product
  lib: Add mul_u64_add_u64_div_u64() and mul_u64_u64_div_u64_roundup()
  lib: Add tests for mul_u64_u64_div_u64_roundup()
  lib: test_mul_u64_u64_div_u64: Test both generic and arch versions
  lib: mul_u64_u64_div_u64() optimise multiply on 32bit x86
  lib: mul_u64_u64_div_u64() Optimise the divide code
  lib: test_mul_u64_u64_div_u64: Test the 32bit code on 64bit

 arch/x86/include/asm/div64.h        |  39 ++++--
 include/linux/math64.h              |  59 ++++++++-
 lib/math/div64.c                    | 183 ++++++++++++++++++---------
 lib/math/test_mul_u64_u64_div_u64.c | 190 ++++++++++++++++++++--------
 4 files changed, 352 insertions(+), 119 deletions(-)

-- 
2.39.5
Re: [PATCH v4 next 0/9] Implement mul_u64_u64_div_u64_roundup()
Posted by Oleg Nesterov 1 month, 1 week ago
On 10/29, David Laight wrote:
>
> The pwm-stm32.c code wants a 'rounding up' version of mul_u64_u64_div_u64().
> This can be done simply by adding 'divisor - 1' to the 128bit product.
> Implement mul_u64_add_u64_div_u64(a, b, c, d) = (a * b + c)/d based on the
> existing code.
> Define mul_u64_u64_div_u64(a, b, d) as mul_u64_add_u64_div_u64(a, b, 0, d) and
> mul_u64_u64_div_u64_roundup(a, b, d) as mul_u64_add_u64_div_u64(a, b, d-1, d).

Sorry for the noise. Can't review due to the lack of knowledge.
But this reminds me... What about

	[PATCH v3 1/2] x86/math64: handle #DE in mul_u64_u64_div_u64()
	https://lore.kernel.org/all/20250815164055.GA13444@redhat.com

?

Should I resend it once again, or do we not care?

Oleg.
Re: [PATCH v4 next 0/9] Implement mul_u64_u64_div_u64_roundup()
Posted by David Laight 1 month, 1 week ago
On Fri, 31 Oct 2025 14:52:56 +0100
Oleg Nesterov <oleg@redhat.com> wrote:

> On 10/29, David Laight wrote:
> >
> > The pwm-stm32.c code wants a 'rounding up' version of mul_u64_u64_div_u64().
> > This can be done simply by adding 'divisor - 1' to the 128bit product.
> > Implement mul_u64_add_u64_div_u64(a, b, c, d) = (a * b + c)/d based on the
> > existing code.
> > Define mul_u64_u64_div_u64(a, b, d) as mul_u64_add_u64_div_u64(a, b, 0, d) and
> > mul_u64_u64_div_u64_roundup(a, b, d) as mul_u64_add_u64_div_u64(a, b, d-1, d).  
> 
> Sorry for the noise. Can't review due to the lack of knowledge.
> But this reminds me... What about
> 
> 	[PATCH v3 1/2] x86/math64: handle #DE in mul_u64_u64_div_u64()
> 	https://lore.kernel.org/all/20250815164055.GA13444@redhat.com
> 
> ?
> 
> Should I resend it once again, or do we not care?

I'd remembered that as well.
I'll do a new version of that on top as a separate patch.
Probably when this set has been merged further.

	David
Re: [PATCH v4 next 0/9] Implement mul_u64_u64_div_u64_roundup()
Posted by Andrew Morton 1 month, 2 weeks ago
On Wed, 29 Oct 2025 17:38:19 +0000 David Laight <david.laight.linux@gmail.com> wrote:

> The pwm-stm32.c code wants a 'rounding up' version of mul_u64_u64_div_u64().
> This can be done simply by adding 'divisor - 1' to the 128bit product.
> Implement mul_u64_add_u64_div_u64(a, b, c, d) = (a * b + c)/d based on the
> existing code.
> Define mul_u64_u64_div_u64(a, b, d) as mul_u64_add_u64_div_u64(a, b, 0, d) and
> mul_u64_u64_div_u64_roundup(a, b, d) as mul_u64_add_u64_div_u64(a, b, d-1, d).
> 
> Only x86-64 has an optimised (asm) version of the function.
> It is optimised to avoid the 'add c' when c is known to be zero.
> In all other cases the extra add is noise compared to the software
> divide code.
> 
> The test module has been updated to test mul_u64_u64_div_u64_roundup() and
> also enhanced to verify the C division code on x86-64 and the 32bit
> division code on 64bit.

Thanks, I added this to mm.git's mm-nonmm-unstable branch for some
linux-next exposure.  I have a note that [3/9] may be updated in
response to Nicolas's comment.
Re: [PATCH v4 next 0/9] Implement mul_u64_u64_div_u64_roundup()
Posted by Nicolas Pitre 1 month, 1 week ago
On Thu, 30 Oct 2025, Andrew Morton wrote:

> Thanks, I added this to mm.git's mm-nonmm-unstable branch for some
> linux-next exposure.  I have a note that [3/9] may be updated in
> response to Nicolas's comment.

This is the change I'd like to see:

----- >8
From: Nicolas Pitre <npitre@baylibre.com>
Subject: lib: mul_u64_u64_div_u64(): optimize quick path for small numbers

If the 128-bit product is small enough (n_hi == 0) we should branch to
div64_u64() right away. This saves one test on this quick path, which is
more prevalent than the divide-by-0 case, and div64_u64() can deal with
the (theoretically undefined behavior) zero divisor just fine too. The
cost remains the same for regular cases.

Signed-off-by: Nicolas Pitre <npitre@baylibre.com>
---
diff --git a/lib/math/div64.c b/lib/math/div64.c
index 4e4e962261c3..d1e92ea24fce 100644
--- a/lib/math/div64.c
+++ b/lib/math/div64.c
@@ -247,6 +247,9 @@ u64 mul_u64_add_u64_div_u64(u64 a, u64 b, u64 c, u64 d)
 
 	n_hi = mul_u64_u64_add_u64(&n_lo, a, b, c);
 
+	if (!n_hi)
+		return div64_u64(n_lo, d);
+
 	if (unlikely(n_hi >= d)) {
 		/* trigger runtime exception if divisor is zero */
 		if (d == 0) {
@@ -259,9 +262,6 @@ u64 mul_u64_add_u64_div_u64(u64 a, u64 b, u64 c, u64 d)
 		return ~0ULL;
 	}
 
-	if (!n_hi)
-		return div64_u64(n_lo, d);
-
 	/* Left align the divisor, shifting the dividend to match */
 	d_z_hi = __builtin_clzll(d);
 	if (d_z_hi) {