This is needed (at least) in the pwm-stm32 driver. Currently the
pwm-stm32 driver implements this function itself. This private
implementation can be dropped as a followup of this patch.
Signed-off-by: Uwe Kleine-König <u.kleine-koenig@baylibre.com>
---
include/linux/math64.h | 1 +
lib/math/div64.c | 15 +++++++++++++++
2 files changed, 16 insertions(+)
diff --git a/include/linux/math64.h b/include/linux/math64.h
index 6aaccc1626ab..0c545b3ddaa5 100644
--- a/include/linux/math64.h
+++ b/include/linux/math64.h
@@ -283,6 +283,7 @@ static inline u64 mul_u64_u32_div(u64 a, u32 mul, u32 divisor)
#endif /* mul_u64_u32_div */
u64 mul_u64_u64_div_u64(u64 a, u64 mul, u64 div);
+u64 mul_u64_u64_div_u64_roundup(u64 a, u64 mul, u64 div);
/**
* DIV64_U64_ROUND_UP - unsigned 64bit divide with 64bit divisor rounded up
diff --git a/lib/math/div64.c b/lib/math/div64.c
index 5faa29208bdb..66beb669992d 100644
--- a/lib/math/div64.c
+++ b/lib/math/div64.c
@@ -267,3 +267,18 @@ u64 mul_u64_u64_div_u64(u64 a, u64 b, u64 c)
}
EXPORT_SYMBOL(mul_u64_u64_div_u64);
#endif
+
+#ifndef mul_u64_u64_div_u64_roundup
+u64 mul_u64_u64_div_u64_roundup(u64 a, u64 b, u64 c)
+{
+ u64 res = mul_u64_u64_div_u64(a, b, c);
+ /* Those multiplications might overflow but it doesn't matter */
+ u64 rem = a * b - c * res;
+
+ if (rem)
+ res += 1;
+
+ return res;
+}
+EXPORT_SYMBOL(mul_u64_u64_div_u64_roundup);
+#endif
base-commit: 2014c95afecee3e76ca4a56956a936e23283f05b
--
2.47.1
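
For illustration, a minimal user-space sketch (not kernel code; it assumes
a host compiler with __int128 support) of why the wrap-around remainder in
the patch above is exact: the true remainder is smaller than c and hence
fits in 64 bits, so computing a * b - c * res modulo 2^64 recovers it even
though both products overflow.

#include <stdint.h>
#include <stdio.h>

static uint64_t roundup_div_ref(uint64_t a, uint64_t b, uint64_t c)
{
	/* reference result using a full 128-bit intermediate */
	unsigned __int128 p = (unsigned __int128)a * b;

	return (uint64_t)((p + c - 1) / c);
}

static uint64_t roundup_div_trick(uint64_t a, uint64_t b, uint64_t c)
{
	/* stand-in for mul_u64_u64_div_u64(): truncating 128/64 divide */
	uint64_t res = (uint64_t)(((unsigned __int128)a * b) / c);
	/* both products wrap modulo 2^64, the difference is still exact */
	uint64_t rem = a * b - c * res;

	return rem ? res + 1 : res;
}

int main(void)
{
	uint64_t a = 0x123456789abcdefULL, b = 0xfedcba987ULL, c = 1000000007ULL;

	/* prints the same quotient twice */
	printf("%llu %llu\n",
	       (unsigned long long)roundup_div_ref(a, b, c),
	       (unsigned long long)roundup_div_trick(a, b, c));
	return 0;
}
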
On Wed, 19 Mar 2025 18:14:25 +0100
Uwe Kleine-König <u.kleine-koenig@baylibre.com> wrote:
> This is needed (at least) in the pwm-stm32 driver. Currently the
> pwm-stm32 driver implements this function itself. This private
> implementation can be dropped as a followup of this patch.
>
> Signed-off-by: Uwe Kleine-König <u.kleine-koenig@baylibre.com>
> ---
> include/linux/math64.h | 1 +
> lib/math/div64.c | 15 +++++++++++++++
> 2 files changed, 16 insertions(+)
>
> diff --git a/include/linux/math64.h b/include/linux/math64.h
> index 6aaccc1626ab..0c545b3ddaa5 100644
> --- a/include/linux/math64.h
> +++ b/include/linux/math64.h
> @@ -283,6 +283,7 @@ static inline u64 mul_u64_u32_div(u64 a, u32 mul, u32 divisor)
> #endif /* mul_u64_u32_div */
>
> u64 mul_u64_u64_div_u64(u64 a, u64 mul, u64 div);
> +u64 mul_u64_u64_div_u64_roundup(u64 a, u64 mul, u64 div);
>
> /**
> * DIV64_U64_ROUND_UP - unsigned 64bit divide with 64bit divisor rounded up
> diff --git a/lib/math/div64.c b/lib/math/div64.c
> index 5faa29208bdb..66beb669992d 100644
> --- a/lib/math/div64.c
> +++ b/lib/math/div64.c
> @@ -267,3 +267,18 @@ u64 mul_u64_u64_div_u64(u64 a, u64 b, u64 c)
> }
> EXPORT_SYMBOL(mul_u64_u64_div_u64);
> #endif
> +
> +#ifndef mul_u64_u64_div_u64_roundup
> +u64 mul_u64_u64_div_u64_roundup(u64 a, u64 b, u64 c)
> +{
> + u64 res = mul_u64_u64_div_u64(a, b, c);
> + /* Those multiplications might overflow but it doesn't matter */
> + u64 rem = a * b - c * res;
> +
> + if (rem)
> + res += 1;
Ugg...
return (((unsigned __int128_t)a * b) + (c - 1)) / c;
nearly works (on 64bit) but needs a u64 div_128_64()
David
> +
> + return res;
> +}
> +EXPORT_SYMBOL(mul_u64_u64_div_u64_roundup);
> +#endif
>
> base-commit: 2014c95afecee3e76ca4a56956a936e23283f05b
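
The one-liner above indeed needs a 128-by-64 division helper. Purely for
illustration (no such helper exists in the kernel; kernel-style u64 is used
as elsewhere in the thread), a rough sketch of a plain restoring binary
long division, one quotient bit per iteration. It assumes n_hi < c so that
the quotient fits in 64 bits, which is the same assumption
mul_u64_u64_div_u64() makes about its result.

static u64 div_128_64(u64 n_hi, u64 n_lo, u64 c)
{
	u64 q = 0;
	int i;

	for (i = 0; i < 64; i++) {
		/* shift the 128-bit numerator n_hi:n_lo left by one bit */
		int carry = n_hi >> 63;

		n_hi = (n_hi << 1) | (n_lo >> 63);
		n_lo <<= 1;
		q <<= 1;
		/* n_hi acts as the running remainder; keep it below c */
		if (carry || n_hi >= c) {
			n_hi -= c;
			q |= 1;
		}
	}

	/* the final n_hi would be the remainder; only the quotient is returned */
	return q;
}
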
Hello,
On Fri, Mar 21, 2025 at 01:18:13PM +0000, David Laight wrote:
> On Wed, 19 Mar 2025 18:14:25 +0100
> Uwe Kleine-König <u.kleine-koenig@baylibre.com> wrote:
>
> > This is needed (at least) in the pwm-stm32 driver. Currently the
> > pwm-stm32 driver implements this function itself. This private
> > implementation can be dropped as a followup of this patch.
> >
> > Signed-off-by: Uwe Kleine-König <u.kleine-koenig@baylibre.com>
> > ---
> > include/linux/math64.h | 1 +
> > lib/math/div64.c | 15 +++++++++++++++
> > 2 files changed, 16 insertions(+)
> >
> > diff --git a/include/linux/math64.h b/include/linux/math64.h
> > index 6aaccc1626ab..0c545b3ddaa5 100644
> > --- a/include/linux/math64.h
> > +++ b/include/linux/math64.h
> > @@ -283,6 +283,7 @@ static inline u64 mul_u64_u32_div(u64 a, u32 mul, u32 divisor)
> > #endif /* mul_u64_u32_div */
> >
> > u64 mul_u64_u64_div_u64(u64 a, u64 mul, u64 div);
> > +u64 mul_u64_u64_div_u64_roundup(u64 a, u64 mul, u64 div);
> >
> > /**
> > * DIV64_U64_ROUND_UP - unsigned 64bit divide with 64bit divisor rounded up
> > diff --git a/lib/math/div64.c b/lib/math/div64.c
> > index 5faa29208bdb..66beb669992d 100644
> > --- a/lib/math/div64.c
> > +++ b/lib/math/div64.c
> > @@ -267,3 +267,18 @@ u64 mul_u64_u64_div_u64(u64 a, u64 b, u64 c)
> > }
> > EXPORT_SYMBOL(mul_u64_u64_div_u64);
> > #endif
> > +
> > +#ifndef mul_u64_u64_div_u64_roundup
> > +u64 mul_u64_u64_div_u64_roundup(u64 a, u64 b, u64 c)
> > +{
> > + u64 res = mul_u64_u64_div_u64(a, b, c);
> > + /* Those multiplications might overflow but it doesn't matter */
> > + u64 rem = a * b - c * res;
> > +
> > + if (rem)
> > + res += 1;
>
> Ugg...
> return (((unsigned __int128_t)a * b) + (c - 1)) / c;
> nearly works (on 64bit) but needs a u64 div_128_64()
Both mul_u64_u64_div_u64_roundup() and mul_u64_u64_div_u64() would not
be needed if we had a 128 bit type and a corresponding division on all
supported architectures.
So given we're not in that situation, I wonder if Andrew is still
considering this patch or if someone else would pick up this patch?
Best regards
Uwe
On Mon, 31 Mar 2025 18:14:29 +0200
Uwe Kleine-König <u.kleine-koenig@baylibre.com> wrote:
> Hello,
>
> On Fri, Mar 21, 2025 at 01:18:13PM +0000, David Laight wrote:
> > On Wed, 19 Mar 2025 18:14:25 +0100
> > Uwe Kleine-König <u.kleine-koenig@baylibre.com> wrote:
> >
> > > This is needed (at least) in the pwm-stm32 driver. Currently the
> > > pwm-stm32 driver implements this function itself. This private
> > > implementation can be dropped as a followup of this patch.
> > >
> > > Signed-off-by: Uwe Kleine-König <u.kleine-koenig@baylibre.com>
> > > ---
> > > include/linux/math64.h | 1 +
> > > lib/math/div64.c | 15 +++++++++++++++
> > > 2 files changed, 16 insertions(+)
> > >
> > > diff --git a/include/linux/math64.h b/include/linux/math64.h
> > > index 6aaccc1626ab..0c545b3ddaa5 100644
> > > --- a/include/linux/math64.h
> > > +++ b/include/linux/math64.h
> > > @@ -283,6 +283,7 @@ static inline u64 mul_u64_u32_div(u64 a, u32 mul, u32 divisor)
> > > #endif /* mul_u64_u32_div */
> > >
> > > u64 mul_u64_u64_div_u64(u64 a, u64 mul, u64 div);
> > > +u64 mul_u64_u64_div_u64_roundup(u64 a, u64 mul, u64 div);
> > >
> > > /**
> > > * DIV64_U64_ROUND_UP - unsigned 64bit divide with 64bit divisor rounded up
> > > diff --git a/lib/math/div64.c b/lib/math/div64.c
> > > index 5faa29208bdb..66beb669992d 100644
> > > --- a/lib/math/div64.c
> > > +++ b/lib/math/div64.c
> > > @@ -267,3 +267,18 @@ u64 mul_u64_u64_div_u64(u64 a, u64 b, u64 c)
> > > }
> > > EXPORT_SYMBOL(mul_u64_u64_div_u64);
> > > #endif
> > > +
> > > +#ifndef mul_u64_u64_div_u64_roundup
> > > +u64 mul_u64_u64_div_u64_roundup(u64 a, u64 b, u64 c)
> > > +{
> > > + u64 res = mul_u64_u64_div_u64(a, b, c);
> > > + /* Those multiplications might overflow but it doesn't matter */
> > > + u64 rem = a * b - c * res;
> > > +
> > > + if (rem)
> > > + res += 1;
> >
> > Ugg...
> > return (((unsigned __int128_t)a * b) + (c - 1)) / c;
> > nearly works (on 64bit) but needs a u64 div_128_64()
>
> Both mul_u64_u64_div_u64_roundup() and mul_u64_u64_div_u64() would not
> be needed if we had a 128 bit type and a corresponding division on all
> supported architectures.
True, but the compiler would be doing a 128 by 128 divide - which isn't
needed here.
But you can rework the code to add in the offset between the multiply
and divide - just needs a 'tweak' to mul_u64_u64_div_u64().
David
>
> So given we're not in that situation, I wonder if Andrew is still
> considering this patch or if someone else would pick up this patch?
>
> Best regards
> Uwe
Hello David,
On Mon, Mar 31, 2025 at 07:53:57PM +0100, David Laight wrote:
> On Mon, 31 Mar 2025 18:14:29 +0200
> Uwe Kleine-König <u.kleine-koenig@baylibre.com> wrote:
> > On Fri, Mar 21, 2025 at 01:18:13PM +0000, David Laight wrote:
> > > On Wed, 19 Mar 2025 18:14:25 +0100
> > > Uwe Kleine-König <u.kleine-koenig@baylibre.com> wrote:
> > >
> > > > This is needed (at least) in the pwm-stm32 driver. Currently the
> > > > pwm-stm32 driver implements this function itself. This private
> > > > implementation can be dropped as a followup of this patch.
> > > >
> > > > Signed-off-by: Uwe Kleine-König <u.kleine-koenig@baylibre.com>
> > > > ---
> > > > include/linux/math64.h | 1 +
> > > > lib/math/div64.c | 15 +++++++++++++++
> > > > 2 files changed, 16 insertions(+)
> > > >
> > > > diff --git a/include/linux/math64.h b/include/linux/math64.h
> > > > index 6aaccc1626ab..0c545b3ddaa5 100644
> > > > --- a/include/linux/math64.h
> > > > +++ b/include/linux/math64.h
> > > > @@ -283,6 +283,7 @@ static inline u64 mul_u64_u32_div(u64 a, u32 mul, u32 divisor)
> > > > #endif /* mul_u64_u32_div */
> > > >
> > > > u64 mul_u64_u64_div_u64(u64 a, u64 mul, u64 div);
> > > > +u64 mul_u64_u64_div_u64_roundup(u64 a, u64 mul, u64 div);
> > > >
> > > > /**
> > > > * DIV64_U64_ROUND_UP - unsigned 64bit divide with 64bit divisor rounded up
> > > > diff --git a/lib/math/div64.c b/lib/math/div64.c
> > > > index 5faa29208bdb..66beb669992d 100644
> > > > --- a/lib/math/div64.c
> > > > +++ b/lib/math/div64.c
> > > > @@ -267,3 +267,18 @@ u64 mul_u64_u64_div_u64(u64 a, u64 b, u64 c)
> > > > }
> > > > EXPORT_SYMBOL(mul_u64_u64_div_u64);
> > > > #endif
> > > > +
> > > > +#ifndef mul_u64_u64_div_u64_roundup
> > > > +u64 mul_u64_u64_div_u64_roundup(u64 a, u64 b, u64 c)
> > > > +{
> > > > + u64 res = mul_u64_u64_div_u64(a, b, c);
> > > > + /* Those multiplications might overflow but it doesn't matter */
> > > > + u64 rem = a * b - c * res;
> > > > +
> > > > + if (rem)
> > > > + res += 1;
> > >
> > > Ugg...
> > > return (((unsigned __int128_t)a * b) + (c - 1)) / c;
> > > nearly works (on 64bit) but needs a u64 div_128_64()
> >
> > Both mul_u64_u64_div_u64_roundup() and mul_u64_u64_div_u64() would not
> > be needed if we had a 128 bit type and a corresponding division on all
> > supported architectures.
>
> True, but the compiler would be doing a 128 by 128 divide - which isn't
> needed here.
>
> But you can rework the code to add in the offset between the multiply
> and divide - just needs a 'tweak' to mul_u64_u64_div_u64().
Yes, that would be a possibility, but I'm not convinced this gives an
advantage. Yes it simplifies mul_u64_u64_div_u64_roundup() a bit, in
return for making mul_u64_u64_div_u64() a bit more complicated (which is
quite complicated already).
With this patch applied and drivers/pwm/pwm-stm32.c making use of it we
have:
linux$ git grep '\<mul_u64_u64_div_u64\>' | wc -l
56
linux$ git grep '\<mul_u64_u64_div_u64_roundup\>' | wc -l
7
where 13 of the former and 4 of the latter are matches of the respective
implementation or in comments and tests, so ~14 times more users of the
downrounding variant and I don't want to penalize these.
Best regards
Uwe
On Tue, 1 Apr 2025 09:25:17 +0200
Uwe Kleine-König <u.kleine-koenig@baylibre.com> wrote:
> Hello David,
>
> On Mon, Mar 31, 2025 at 07:53:57PM +0100, David Laight wrote:
> > On Mon, 31 Mar 2025 18:14:29 +0200
> > Uwe Kleine-König <u.kleine-koenig@baylibre.com> wrote:
> > > On Fri, Mar 21, 2025 at 01:18:13PM +0000, David Laight wrote:
> > > > On Wed, 19 Mar 2025 18:14:25 +0100
> > > > Uwe Kleine-König <u.kleine-koenig@baylibre.com> wrote:
> > > >
> > > > > This is needed (at least) in the pwm-stm32 driver. Currently the
> > > > > pwm-stm32 driver implements this function itself. This private
> > > > > implementation can be dropped as a followup of this patch.
> > > > >
> > > > > Signed-off-by: Uwe Kleine-König <u.kleine-koenig@baylibre.com>
> > > > > ---
> > > > > include/linux/math64.h | 1 +
> > > > > lib/math/div64.c | 15 +++++++++++++++
> > > > > 2 files changed, 16 insertions(+)
> > > > >
> > > > > diff --git a/include/linux/math64.h b/include/linux/math64.h
> > > > > index 6aaccc1626ab..0c545b3ddaa5 100644
> > > > > --- a/include/linux/math64.h
> > > > > +++ b/include/linux/math64.h
> > > > > @@ -283,6 +283,7 @@ static inline u64 mul_u64_u32_div(u64 a, u32 mul, u32 divisor)
> > > > > #endif /* mul_u64_u32_div */
> > > > >
> > > > > u64 mul_u64_u64_div_u64(u64 a, u64 mul, u64 div);
> > > > > +u64 mul_u64_u64_div_u64_roundup(u64 a, u64 mul, u64 div);
> > > > >
> > > > > /**
> > > > > * DIV64_U64_ROUND_UP - unsigned 64bit divide with 64bit divisor rounded up
> > > > > diff --git a/lib/math/div64.c b/lib/math/div64.c
> > > > > index 5faa29208bdb..66beb669992d 100644
> > > > > --- a/lib/math/div64.c
> > > > > +++ b/lib/math/div64.c
> > > > > @@ -267,3 +267,18 @@ u64 mul_u64_u64_div_u64(u64 a, u64 b, u64 c)
> > > > > }
> > > > > EXPORT_SYMBOL(mul_u64_u64_div_u64);
> > > > > #endif
> > > > > +
> > > > > +#ifndef mul_u64_u64_div_u64_roundup
> > > > > +u64 mul_u64_u64_div_u64_roundup(u64 a, u64 b, u64 c)
> > > > > +{
> > > > > + u64 res = mul_u64_u64_div_u64(a, b, c);
> > > > > + /* Those multiplications might overflow but it doesn't matter */
> > > > > + u64 rem = a * b - c * res;
> > > > > +
> > > > > + if (rem)
> > > > > + res += 1;
> > > >
> > > > Ugg...
> > > > return (((unsigned __int128_t)a * b) + (c - 1)) / c;
> > > > nearly works (on 64bit) but needs a u64 div_128_64()
> > >
> > > Both mul_u64_u64_div_u64_roundup() and mul_u64_u64_div_u64() would not
> > > be needed if we had a 128 bit type and a corresponding division on all
> > > supported architectures.
> >
> > True, but the compiler would be doing a 128 by 128 divide - which isn't
> > needed here.
> >
> > But you can rework the code to add in the offset between the multiply
> > and divide - just needs a 'tweak' to mul_u64_u64_div_u64().
>
> Yes, that would be a possibility, but I'm not convinced this gives an
> advantage. Yes it simplifies mul_u64_u64_div_u64_roundup() a bit, in
> return to making mul_u64_u64_div_u64() a bit more complicated (which is
> quite complicated already).
Adding in a 64bit offset isn't that much extra.
On most cpu it is an 'add' 'adc' pair.
Clearly it could be optimised away if a constant zero, but that will
be noise except for the x86-64 asm version.
Even there the extra 2 clocks might not be noticeable, but a separate
version for 'constant zero' wouldn't be that bad.
Looking at the C version, I wonder if the two ilog2() calls are needed.
They may not be cheap, and are the same as checking 'n_hi == 0'.
David
>
> With this patch applied and drivers/pwm/pwm-stm32.c making use of it we
> have:
>
> linux$ git grep '\<mul_u64_u64_div_u64\>' | wc -l
> 56
> linux$ git grep '\<mul_u64_u64_div_u64_roundup\>' | wc -l
> 7
>
> where 13 of the former and 4 of the latter are matches of the respective
> implementation or in comments and tests, so ~14 times more users of the
> downrounding variant and I don't want to penalize these.
>
> Best regards
> Uwe
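
For reference, in portable C the "add in the offset between the multiply
and divide" idea boils down to a single 64-bit add plus a carry into the
high word of the 128-bit product, i.e. the 'add'/'adc' pair mentioned
above. A tiny sketch (kernel-style u64 assumed; the helper name is made
up for illustration):

static inline void add_u64_to_u128(u64 *n_hi, u64 *n_lo, u64 add)
{
	u64 lo = *n_lo + add;

	/* carry out of the low 64 bits propagates into the high word */
	*n_hi += (lo < *n_lo);
	*n_lo = lo;
}
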
On Tue, 1 Apr 2025, David Laight wrote:

> On Tue, 1 Apr 2025 09:25:17 +0200
> Uwe Kleine-König <u.kleine-koenig@baylibre.com> wrote:
>
> > On Mon, Mar 31, 2025 at 07:53:57PM +0100, David Laight wrote:
> >
> > > But you can rework the code to add in the offset between the multiply
> > > and divide - just needs a 'tweak' to mul_u64_u64_div_u64().
> >
> > Yes, that would be a possibility, but I'm not convinced this gives an
> > advantage. Yes it simplifies mul_u64_u64_div_u64_roundup() a bit, in
> > return to making mul_u64_u64_div_u64() a bit more complicated (which is
> > quite complicated already).
>
> Adding in a 64bit offset isn't that much extra.
> On most cpu it is an 'add' 'adc' pair.
> Clearly it could be optimised away if a constant zero, but that will
> be noise except for the x86-64 asm version.

You still have to ask:

- How many users do need that version?

- For the zero case to be optimized away, you need some inlining which
  those functions aren't. This means either passing the (albeit tiny)
  overhead to everybody or duplicating the core division code meaning
  bigger footprint.

- And this is not only about the extra 2 clocks. You need to account
  for passing an extra 64-bit argument which is most likely to be
  spilled to the stack especially on 32-bits systems as the 64x64=128
  multiply has to be performed before adding the extra argument.

Hence the proposed compromise.

> Looking at the C version, I wonder if the two ilog2() calls are needed.
> They may not be cheap, and are the same as checking 'n_hi == 0'.

Which two calls? I see only one.

And please explain how it can be the same as checking 'n_hi == 0'.


Nicolas
On Tue, 1 Apr 2025, Nicolas Pitre wrote:

> On Tue, 1 Apr 2025, David Laight wrote:
>
> > Looking at the C version, I wonder if the two ilog2() calls are needed.
> > They may not be cheap, and are the same as checking 'n_hi == 0'.
>
> Which two calls? I see only one.

Hmmm, sorry. If by ilog2() you mean the clz's then those are cheap. Most
CPUs have a dedicated instruction for that. The ctz, though, is more
expensive (unless it is ARMv7 and above with an RBIT instruction).

> And please explain how it can be the same as checking 'n_hi == 0'.

This question still stands.


Nicolas
On Tue, 1 Apr 2025 16:30:34 -0400 (EDT)
Nicolas Pitre <nico@fluxnic.net> wrote:
> On Tue, 1 Apr 2025, Nicolas Pitre wrote:
>
> > On Tue, 1 Apr 2025, David Laight wrote:
> >
> > > Looking at the C version, I wonder if the two ilog2() calls are needed.
> > > They may not be cheap, and are the same as checking 'n_hi == 0'.
> >
> > Which two calls? I see only one.
>
> Hmmm, sorry. If by ilog2() you mean the clz's then those are cheap. Most
> CPUs have a dedicated instruction for that. The ctz, though, is more
> expensive (unless it is ARMv7 and above with an RBIT instruction).
The code (6.14-rc6) starts:
u64 mul_u64_u64_div_u64(u64 a, u64 b, u64 c)
{
	if (ilog2(a) + ilog2(b) <= 62)
		return div64_u64(a * b, c);
Now ilog2() is (probably) much the same as clz - but not all cpu have it.
Indeed clz(a) + clz(b) >= 64 would be a more obvious test.
On 32bit a check for a >> 32 | b >> 32 before the multiply might be sane.
> > And please explain how it can be the same as checking 'n_hi == 0'.
>
> This question still stands.
'n_hi == 0' is exactly the condition under which you can do a 64 bit divide.
Since it is when 'a * b' fits in 64 bits.
If you assume that most calls need the 128bit product (or why use the function)
then there is little point adding code to optimise for the unusual case.
David
>
>
> Nicolas
On Tue, Apr 01, 2025 at 10:37:31PM +0100, David Laight wrote:
> On Tue, 1 Apr 2025 16:30:34 -0400 (EDT)
> Nicolas Pitre <nico@fluxnic.net> wrote:
>
> > On Tue, 1 Apr 2025, Nicolas Pitre wrote:
> >
> > > On Tue, 1 Apr 2025, David Laight wrote:
> > >
> > > > Looking at the C version, I wonder if the two ilog2() calls are needed.
> > > > They may not be cheap, and are the same as checking 'n_hi == 0'.
> > >
> > > Which two calls? I see only one.
> >
> > Hmmm, sorry. If by ilog2() you mean the clz's then those are cheap. Most
> > CPUs have a dedicated instruction for that. The ctz, though, is more
> > expensive (unless it is ARMv7 and above with an RBIT instruction).
>
> The code (6.14-rc6) starts:
>
> u64 mul_u64_u64_div_u64(u64 a, u64 b, u64 c)
> {
> if (ilog2(a) + ilog2(b) <= 62)
> return div64_u64(a * b, c);
>
> Now ilog2() is (probably) much the same as clz - but not all cpu have it.
> Indeed clz(a) + clz(b) >= 64 would be a more obvious test.
Ack, the more intuitive check might be
	if (fls64(a) + fls64(b) <= 64)
		return div64_u64(a * b, c);
then, but maybe "intuitive" is subjective here?
> On 32bit a check for a >> 32 | b >> 32 before the multiply might be sane.
Not 100% sure I'm following. `a >> 32 | b >> 32` being zero is just one
indicator that a * b fits into a u64, and in that case the check above is
the better one as it also catches e.g. a = (1ULL << 32) and b = 42.
> > > And please explain how it can be the same as checking 'n_hi == 0'.
> >
> > This question still stands.
>
> 'ni_hi == 0' is exactly the condition under which you can do a 64 bit divide.
> Since it is when 'a * b' fits in 64 bits.
>
> If you assume that most calls need the 128bit product (or why use the function)
> then there is little point adding code to optimise for the unusual case.
n_hi won't be zero when the branch
	if (ilog2(a) + ilog2(b) <= 62)
		return div64_u64(a * b, c);
wasn't taken?
Best regards
Uwe
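
As a quick user-space illustration of the difference between the two checks
(not kernel code; fls64() is open-coded via __builtin_clzll purely for this
demo), take the a = 1ULL << 32, b = 42 example from above: the product
42 << 32 fits in a u64, the fls64-based test sees that, the cheap shift
test does not.

#include <stdint.h>
#include <stdio.h>

static int fls64(uint64_t x)
{
	return x ? 64 - __builtin_clzll(x) : 0;
}

int main(void)
{
	uint64_t a = 1ULL << 32, b = 42;

	/* shift test: non-zero means "take the slow 128-bit path" */
	printf("a >> 32 | b >> 32         -> %s\n",
	       (a >> 32 | b >> 32) ? "slow path" : "fast path");
	/* fls64 test: a sum <= 64 guarantees a * b fits in 64 bits */
	printf("fls64(a) + fls64(b) <= 64 -> %s\n",
	       fls64(a) + fls64(b) <= 64 ? "fast path" : "slow path");
	return 0;
}
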
On Wed, 2 Apr 2025 10:16:19 +0200
Uwe Kleine-König <u.kleine-koenig@baylibre.com> wrote:
> On Tue, Apr 01, 2025 at 10:37:31PM +0100, David Laight wrote:
> > On Tue, 1 Apr 2025 16:30:34 -0400 (EDT)
> > Nicolas Pitre <nico@fluxnic.net> wrote:
> >
> > > On Tue, 1 Apr 2025, Nicolas Pitre wrote:
> > >
> > > > On Tue, 1 Apr 2025, David Laight wrote:
> > > >
> > > > > Looking at the C version, I wonder if the two ilog2() calls are needed.
> > > > > They may not be cheap, and are the same as checking 'n_hi == 0'.
> > > >
> > > > Which two calls? I see only one.
> > >
> > > Hmmm, sorry. If by ilog2() you mean the clz's then those are cheap. Most
> > > CPUs have a dedicated instruction for that. The ctz, though, is more
> > > expensive (unless it is ARMv7 and above with an RBIT instruction).
> >
> > The code (6.14-rc6) starts:
> >
> > u64 mul_u64_u64_div_u64(u64 a, u64 b, u64 c)
> > {
> > if (ilog2(a) + ilog2(b) <= 62)
> > return div64_u64(a * b, c);
> >
> > Now ilog2() is (probably) much the same as clz - but not all cpu have it.
> > Indeed clz(a) + clz(b) >= 64 would be a more obvious test.
>
> Ack, the more intuitive check might be
>
> if (fls64(a) + fls64(b) <= 64)
> return div64_u64(a * b, c);
>
> then, but maybe "intuitive" is subjective here?
Depends on whether you grok 'clz' or 'fls' :-)
>
> > On 32bit a check for a >> 32 | b >> 32 before the multiply might be sane.
>
> Not 100% sure I'm following. `a >> 32 | b >> 32` is just an indicator
> that a * b fits into an u64 and in that case the above check is the
> better one as this also catches e.g. a = (1ULL << 32) and b = 42.
That is to optimise the multiply as well as the divide.
It is also a very cheap test.
>
> > > > And please explain how it can be the same as checking 'n_hi == 0'.
> > >
> > > This question still stands.
> >
> > 'ni_hi == 0' is exactly the condition under which you can do a 64 bit divide.
> > Since it is when 'a * b' fits in 64 bits.
> >
> > If you assume that most calls need the 128bit product (or why use the function)
> > then there is little point adding code to optimise for the unusual case.
>
> n_hi won't be zero when the branch
>
> if (ilog2(a) + ilog2(b) <= 62)
> return div64_u64(a * b, c);
>
> wasn't taken?
Right, but you can remove that test and check n_hi instead.
The multiplies are cheap compared to the divide loop.
(Especially when uint128 exists.)
I also think you can do a much better divide loop (for many cpu).
Shift 'c' so that clz(c) is 32.
Shift 'n_hi/n_lo' so that clz(n_hi) is 1.
Do a 64 by 32 divide and remainder (probably about 32 or 64 clocks).
If bits of 'c' were discarded, multiply and subtract (with overflow check).
'Rinse and repeat' with the remainder.
Build the quotient out of all the partial values.
David
>
> Best regards
> Uwe
Hello David,
On Wed, Apr 02, 2025 at 01:52:19PM +0100, David Laight wrote:
> On Wed, 2 Apr 2025 10:16:19 +0200
> Uwe Kleine-König <u.kleine-koenig@baylibre.com> wrote:
>
> > On Tue, Apr 01, 2025 at 10:37:31PM +0100, David Laight wrote:
> > > On Tue, 1 Apr 2025 16:30:34 -0400 (EDT)
> > > Nicolas Pitre <nico@fluxnic.net> wrote:
> > >
> > > > On Tue, 1 Apr 2025, Nicolas Pitre wrote:
> > > >
> > > > > On Tue, 1 Apr 2025, David Laight wrote:
> > > > >
> > > > > > Looking at the C version, I wonder if the two ilog2() calls are needed.
> > > > > > They may not be cheap, and are the same as checking 'n_hi == 0'.
> > > > >
> > > > > Which two calls? I see only one.
> > > >
> > > > Hmmm, sorry. If by ilog2() you mean the clz's then those are cheap. Most
> > > > CPUs have a dedicated instruction for that. The ctz, though, is more
> > > > expensive (unless it is ARMv7 and above with an RBIT instruction).
> > >
> > > The code (6.14-rc6) starts:
> > >
> > > u64 mul_u64_u64_div_u64(u64 a, u64 b, u64 c)
> > > {
> > > if (ilog2(a) + ilog2(b) <= 62)
> > > return div64_u64(a * b, c);
> > >
> > > Now ilog2() is (probably) much the same as clz - but not all cpu have it.
> > > Indeed clz(a) + clz(b) >= 64 would be a more obvious test.
> >
> > Ack, the more intuitive check might be
> >
> > if (fls64(a) + fls64(b) <= 64)
> > return div64_u64(a * b, c);
> >
> > then, but maybe "intuitive" is subjective here?
>
> Depends on whether you grok 'clz' or 'fls' :-)
And it also depends on the availability of the respective functions.
There is a fls64 function provided by include/asm-generic/bitops/fls64.h
and in several arch-specific arch/*/include/asm/bitops.h, but I didn't
find a clz function apart from one for arch=arc.
> > > On 32bit a check for a >> 32 | b >> 32 before the multiply might be sane.
> >
> > Not 100% sure I'm following. `a >> 32 | b >> 32` is just an indicator
> > that a * b fits into an u64 and in that case the above check is the
> > better one as this also catches e.g. a = (1ULL << 32) and b = 42.
>
> That is to optimise the multiple as well as the divide.
> It is also a very cheap test.
OK, so we have:
                               | `a >> 32 | b >> 32` | `fls64(a) + fls64(b) <= 64`
cheap                          | ✓                   | ✓
allows to skip multiplication  | ✓                   | ✓
allows to skip 128bit division | ✓                   | ✓
catches all skip possibilities | x                   | ✓
> > > > > And please explain how it can be the same as checking 'n_hi == 0'.
> > > >
> > > > This question still stands.
> > >
> > > 'ni_hi == 0' is exactly the condition under which you can do a 64 bit divide.
> > > Since it is when 'a * b' fits in 64 bits.
> > >
> > > If you assume that most calls need the 128bit product (or why use the function)
> > > then there is little point adding code to optimise for the unusual case.
> >
> > n_hi won't be zero when the branch
> >
> > if (ilog2(a) + ilog2(b) <= 62)
> > return div64_u64(a * b, c);
> >
> > wasn't taken?
>
> Right, but you can remove that test and check n_hi instead.
> The multiplies are cheap compared to the divide loop.
> (Especially when uint128 exists.)
But you can check n_hi only after you completed the multiplication, so
checking `ilog2(a) + ilog2(b) <= 62` (or `fls64(a) + fls64(b) <= 64` or
`clz(a) + clz(b) >= 64`) sounds like a good idea to me.
> I also think you can do a much better divide loop (for many cpu).
> Shift 'c' so that clz(c) is 32.
> Shift 'n_hi/n_lo' so that clz(n_hi) is 1.
> Do a 64 by 32 divide and remainder (probably about 32 or 64 clocks).
> If bits of 'c' were discarded multiple and subtract (with overflow check).
> 'Rinse and repeat' with the remainder.
> Build the quotient out of all the partial values.
I'll leave this for Nicolas to reply to. I guess he is much more into these
details than me.
Best regards
Uwe
On Wed, 2 Apr 2025 17:01:49 +0200
Uwe Kleine-König <u.kleine-koenig@baylibre.com> wrote:
How about (tab damaged):
Compile tested only, on x86-64 (once with the local definitions removed).
Looking at the object code, if u128 is supported then checking n_hi
is always going to be better than a pre-check.
Remember multiply is cheap.
David
diff --git a/arch/x86/include/asm/div64.h b/arch/x86/include/asm/div64.h
index 9931e4c7d73f..6115f3fcb975 100644
--- a/arch/x86/include/asm/div64.h
+++ b/arch/x86/include/asm/div64.h
@@ -84,21 +84,28 @@ static inline u64 mul_u32_u32(u32 a, u32 b)
* Will generate an #DE when the result doesn't fit u64, could fix with an
* __ex_table[] entry when it becomes an issue.
*/
-static inline u64 mul_u64_u64_div_u64(u64 a, u64 mul, u64 div)
+static inline u64 mul_u64_add_u64_div_u64(u64 a, u64 mul, u64 add, u64 div)
{
u64 q;
- asm ("mulq %2; divq %3" : "=a" (q)
- : "a" (a), "rm" (mul), "rm" (div)
- : "rdx");
+ if (statically_true(!add)) {
+ asm ("mulq %2; divq %3" : "=a" (q)
+ : "a" (a), "rm" (mul), "rm" (div)
+ : "rdx");
+ } else {
+ asm ("mulq %2; addq %4,%%rax; adcq $0,%%rdx; divq %3"
+ : "=a" (q)
+ : "a" (a), "rm" (mul), "rm" (div), "rm" (add)
+ : "rdx");
+ }
return q;
}
-#define mul_u64_u64_div_u64 mul_u64_u64_div_u64
+#define mul_u64_add_u64_div_u64 mul_u64_add_u64_div_u64
static inline u64 mul_u64_u32_div(u64 a, u32 mul, u32 div)
{
- return mul_u64_u64_div_u64(a, mul, div);
+ return mul_u64_add_u64_div_u64(a, mul, 0, div);
}
#define mul_u64_u32_div mul_u64_u32_div
diff --git a/include/linux/math64.h b/include/linux/math64.h
index 6aaccc1626ab..1544dc37e317 100644
--- a/include/linux/math64.h
+++ b/include/linux/math64.h
@@ -282,7 +282,10 @@ static inline u64 mul_u64_u32_div(u64 a, u32 mul, u32 divisor)
}
#endif /* mul_u64_u32_div */
-u64 mul_u64_u64_div_u64(u64 a, u64 mul, u64 div);
+u64 mul_u64_add_u64_div_u64(u64 a, u64 mul, u64 add, u64 div);
+#define mul_u64_u64_div_u64(a, mul, div) mul_u64_add_u64_div_u64(a, mul, 0, div)
+#define mul_u64_u64_div_u64_roundup(a, mul, div) \
+ ({ u64 _tmp = (div); mul_u64_add_u64_div_u64(a, mul, _tmp - 1, _tmp); })
/**
* DIV64_U64_ROUND_UP - unsigned 64bit divide with 64bit divisor rounded up
diff --git a/lib/math/div64.c b/lib/math/div64.c
index 5faa29208bdb..efcc8d729c74 100644
--- a/lib/math/div64.c
+++ b/lib/math/div64.c
@@ -183,16 +183,13 @@ u32 iter_div_u64_rem(u64 dividend, u32 divisor, u64 *remainder)
}
EXPORT_SYMBOL(iter_div_u64_rem);
-#ifndef mul_u64_u64_div_u64
-u64 mul_u64_u64_div_u64(u64 a, u64 b, u64 c)
+#ifndef mul_u64_add_u64_div_u64
+u64 mul_u64_add_u64_div_u64(u64 a, u64 b, u64 add, u64 c)
{
- if (ilog2(a) + ilog2(b) <= 62)
- return div64_u64(a * b, c);
-
#if defined(__SIZEOF_INT128__)
/* native 64x64=128 bits multiplication */
- u128 prod = (u128)a * b;
+ u128 prod = (u128)a * b + add;
u64 n_lo = prod, n_hi = prod >> 64;
#else
@@ -201,6 +198,11 @@ u64 mul_u64_u64_div_u64(u64 a, u64 b, u64 c)
u32 a_lo = a, a_hi = a >> 32, b_lo = b, b_hi = b >> 32;
u64 x, y, z;
+#if BITS_PER_LONG == 32
+ if (!(a_hi | b_hi))
+ return div64_u64((u64)a_lo * b_lo + add, c);
+#endif
+
x = (u64)a_lo * b_lo;
y = (u64)a_lo * b_hi + (u32)(x >> 32);
z = (u64)a_hi * b_hi + (u32)(y >> 32);
@@ -208,10 +210,13 @@ u64 mul_u64_u64_div_u64(u64 a, u64 b, u64 c)
z += (u32)(y >> 32);
x = (y << 32) + (u32)x;
- u64 n_lo = x, n_hi = z;
+ u64 n_lo = x + add, n_hi = z + (n_lo < x);
#endif
+ if (!n_hi)
+ return div64_u64(n_lo, c);
+
/* make sure c is not zero, trigger exception otherwise */
#pragma GCC diagnostic push
#pragma GCC diagnostic ignored "-Wdiv-by-zero"
@@ -265,5 +270,5 @@ u64 mul_u64_u64_div_u64(u64 a, u64 b, u64 c)
return res;
}
-EXPORT_SYMBOL(mul_u64_u64_div_u64);
+EXPORT_SYMBOL(mul_u64_add_u64_div_u64);
#endif
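
As a hedged usage sketch (the values and context are made up, this is not
taken from the pwm-stm32 driver), the round-up variant -- whether Uwe's
function or the macro above -- would typically be used to convert a
duration in nanoseconds to clock ticks such that the programmed duration
is never shorter than requested:

#define NSEC_PER_SEC	1000000000ULL	/* nanoseconds per second */

u64 clk_rate = 208000000;	/* hypothetical timer clock in Hz */
u64 duration_ns = 1000000;	/* requested duration: 1 ms */
u64 ticks = mul_u64_u64_div_u64_roundup(clk_rate, duration_ns, NSEC_PER_SEC);
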
On Wed, 2 Apr 2025 17:01:49 +0200
Uwe Kleine-König <u.kleine-koenig@baylibre.com> wrote:
> Hello David,
>
> On Wed, Apr 02, 2025 at 01:52:19PM +0100, David Laight wrote:
> > On Wed, 2 Apr 2025 10:16:19 +0200
> > Uwe Kleine-König <u.kleine-koenig@baylibre.com> wrote:
> >
> > > On Tue, Apr 01, 2025 at 10:37:31PM +0100, David Laight wrote:
> > > > On Tue, 1 Apr 2025 16:30:34 -0400 (EDT)
> > > > Nicolas Pitre <nico@fluxnic.net> wrote:
> > > >
> > > > > On Tue, 1 Apr 2025, Nicolas Pitre wrote:
> > > > >
> > > > > > On Tue, 1 Apr 2025, David Laight wrote:
> > > > > >
> > > > > > > Looking at the C version, I wonder if the two ilog2() calls are needed.
> > > > > > > They may not be cheap, and are the same as checking 'n_hi == 0'.
> > > > > >
> > > > > > Which two calls? I see only one.
> > > > >
> > > > > Hmmm, sorry. If by ilog2() you mean the clz's then those are cheap. Most
> > > > > CPUs have a dedicated instruction for that. The ctz, though, is more
> > > > > expensive (unless it is ARMv7 and above with an RBIT instruction).
> > > >
> > > > The code (6.14-rc6) starts:
> > > >
> > > > u64 mul_u64_u64_div_u64(u64 a, u64 b, u64 c)
> > > > {
> > > > if (ilog2(a) + ilog2(b) <= 62)
> > > > return div64_u64(a * b, c);
> > > >
> > > > Now ilog2() is (probably) much the same as clz - but not all cpu have it.
> > > > Indeed clz(a) + clz(b) >= 64 would be a more obvious test.
> > >
> > > Ack, the more intuitive check might be
> > >
> > > if (fls64(a) + fls64(b) <= 64)
> > > return div64_u64(a * b, c);
> > >
> > > then, but maybe "intuitive" is subjective here?
> >
> > Depends on whether you grok 'clz' or 'fls' :-)
>
> And it also depends on the availability of the respective functions.
> There is a fls64 function provided by include/asm-generic/bitops/fls64.h
> and in several arch-specific arch/*/include/asm/bitops.h, but I didn't
> find a clz function apart from one for arch=arc.
>
> > > > On 32bit a check for a >> 32 | b >> 32 before the multiply might be sane.
> > >
> > > Not 100% sure I'm following. `a >> 32 | b >> 32` is just an indicator
> > > that a * b fits into an u64 and in that case the above check is the
> > > better one as this also catches e.g. a = (1ULL << 32) and b = 42.
> >
> > That is to optimise the multiple as well as the divide.
> > It is also a very cheap test.
>
> OK, so we have:
>
> `a >> 32 | b >> 32` | `fls64(a) + fls64(b) <= 64`
> cheap | ✓ | ✓
fls() isn't always cheap.
x86 has had bsr since the 386, but I don't remember seeing it in arm32 or ppc.
The 'a >> 32 | b >> 32' is very cheap on 32bit.
> allows to skip multiplication | ✓ | ✓
> allows to skip 128bit division | ✓ | ✓
> catches all skip possibilities | x | ✓
>
> > > > > > And please explain how it can be the same as checking 'n_hi == 0'.
> > > > >
> > > > > This question still stands.
> > > >
> > > > 'ni_hi == 0' is exactly the condition under which you can do a 64 bit divide.
> > > > Since it is when 'a * b' fits in 64 bits.
> > > >
> > > > If you assume that most calls need the 128bit product (or why use the function)
> > > > then there is little point adding code to optimise for the unusual case.
> > >
> > > n_hi won't be zero when the branch
> > >
> > > if (ilog2(a) + ilog2(b) <= 62)
> > > return div64_u64(a * b, c);
> > >
> > > wasn't taken?
> >
> > Right, but you can remove that test and check n_hi instead.
> > The multiplies are cheap compared to the divide loop.
> > (Especially when uint128 exists.)
>
> But you can check n_hi only after you completed the multiplication, so
> checking `ilog2(a) + ilog2(b) <= 62` (or `fls64(a) + fls64(b) <= 64` or
> `clz(a) + clz(b) >= 64`) sounds like a good idea to me.
Depends on how much of the time you think the multiply is needed.
If it is needed most of the time you want to do it first.
If it is hardly ever needed then the initial check is likely to be a gain.
David
>
> > I also think you can do a much better divide loop (for many cpu).
> > Shift 'c' so that clz(c) is 32.
> > Shift 'n_hi/n_lo' so that clz(n_hi) is 1.
> > Do a 64 by 32 divide and remainder (probably about 32 or 64 clocks).
> > If bits of 'c' were discarded multiple and subtract (with overflow check).
> > 'Rinse and repeat' with the remainder.
> > Build the quotient out of all the partial values.
>
> I let this for Nicolas to reply. I guess he is much more into these
> details than me.
>
> Best regards
> Uwe
On Wed, Apr 02, 2025 at 09:59:19PM +0100, David Laight wrote:
> On Wed, 2 Apr 2025 17:01:49 +0200
> Uwe Kleine-König <u.kleine-koenig@baylibre.com> wrote:
>
> > Hello David,
> >
> > On Wed, Apr 02, 2025 at 01:52:19PM +0100, David Laight wrote:
> > > On Wed, 2 Apr 2025 10:16:19 +0200
> > > Uwe Kleine-König <u.kleine-koenig@baylibre.com> wrote:
> > >
> > > > On Tue, Apr 01, 2025 at 10:37:31PM +0100, David Laight wrote:
> > > > > On Tue, 1 Apr 2025 16:30:34 -0400 (EDT)
> > > > > Nicolas Pitre <nico@fluxnic.net> wrote:
> > > > >
> > > > > > On Tue, 1 Apr 2025, Nicolas Pitre wrote:
> > > > > >
> > > > > > > On Tue, 1 Apr 2025, David Laight wrote:
> > > > > > >
> > > > > > > > Looking at the C version, I wonder if the two ilog2() calls are needed.
> > > > > > > > They may not be cheap, and are the same as checking 'n_hi == 0'.
> > > > > > >
> > > > > > > Which two calls? I see only one.
> > > > > >
> > > > > > Hmmm, sorry. If by ilog2() you mean the clz's then those are cheap. Most
> > > > > > CPUs have a dedicated instruction for that. The ctz, though, is more
> > > > > > expensive (unless it is ARMv7 and above with an RBIT instruction).
> > > > >
> > > > > The code (6.14-rc6) starts:
> > > > >
> > > > > u64 mul_u64_u64_div_u64(u64 a, u64 b, u64 c)
> > > > > {
> > > > > if (ilog2(a) + ilog2(b) <= 62)
> > > > > return div64_u64(a * b, c);
> > > > >
> > > > > Now ilog2() is (probably) much the same as clz - but not all cpu have it.
> > > > > Indeed clz(a) + clz(b) >= 64 would be a more obvious test.
> > > >
> > > > Ack, the more intuitive check might be
> > > >
> > > > if (fls64(a) + fls64(b) <= 64)
> > > > return div64_u64(a * b, c);
> > > >
> > > > then, but maybe "intuitive" is subjective here?
> > >
> > > Depends on whether you grok 'clz' or 'fls' :-)
> >
> > And it also depends on the availability of the respective functions.
> > There is a fls64 function provided by include/asm-generic/bitops/fls64.h
> > and in several arch-specific arch/*/include/asm/bitops.h, but I didn't
> > find a clz function apart from one for arch=arc.
> >
> > > > > On 32bit a check for a >> 32 | b >> 32 before the multiply might be sane.
> > > >
> > > > Not 100% sure I'm following. `a >> 32 | b >> 32` is just an indicator
> > > > that a * b fits into an u64 and in that case the above check is the
> > > > better one as this also catches e.g. a = (1ULL << 32) and b = 42.
> > >
> > > That is to optimise the multiple as well as the divide.
> > > It is also a very cheap test.
> >
> > OK, so we have:
> >
> > `a >> 32 | b >> 32` | `fls64(a) + fls64(b) <= 64`
> > cheap | ✓ | ✓
>
> fls() isn't always cheap.
> x86 has had bsr since the 386, but I don't remember seeing it in arm32 or ppc.
I don't know about ppc, but arm32 has a clz instruction. With the
definition of fls64 for BITS_PER_LONG == 32 in
include/asm-generic/bitops/fls64.h:
static __always_inline int fls64(__u64 x)
{
	__u32 h = x >> 32;
	if (h)
		return fls(h) + 32;
	return fls(x);
}
and fls from include/asm-generic/bitops/builtin-fls.h:
static __always_inline int fls(unsigned int x)
{
	return x ? sizeof(x) * 8 - __builtin_clz(x) : 0;
}
I'd claim this qualifies as "cheap". Agreed, it's still more expensive
than a >> 32 | b >> 32, but its result is also better. I guess we cannot
judge without a deeper analysis, profiling, and an idea of the likely
values of a and b.
[Side note: gcc also has a __builtin_clzll which might be a good fit to
implement fls64.]
> The 'a >> 32 | b >> 32' is very cheap on 32bit.
>
> > allows to skip multiplication | ✓ | ✓
> > allows to skip 128bit division | ✓ | ✓
> > catches all skip possibilities | x | ✓
> >
> > > > > > > And please explain how it can be the same as checking 'n_hi == 0'.
> > > > > >
> > > > > > This question still stands.
> > > > >
> > > > > 'ni_hi == 0' is exactly the condition under which you can do a 64 bit divide.
> > > > > Since it is when 'a * b' fits in 64 bits.
> > > > >
> > > > > If you assume that most calls need the 128bit product (or why use the function)
> > > > > then there is little point adding code to optimise for the unusual case.
> > > >
> > > > n_hi won't be zero when the branch
> > > >
> > > > if (ilog2(a) + ilog2(b) <= 62)
> > > > return div64_u64(a * b, c);
> > > >
> > > > wasn't taken?
> > >
> > > Right, but you can remove that test and check n_hi instead.
> > > The multiplies are cheap compared to the divide loop.
> > > (Especially when uint128 exists.)
> >
> > But you can check n_hi only after you completed the multiplication, so
> > checking `ilog2(a) + ilog2(b) <= 62` (or `fls64(a) + fls64(b) <= 64` or
> > `clz(a) + clz(b) >= 64`) sounds like a good idea to me.
>
> Depends on how much of the time you think the multiply is needed.
> If it is needed most of the time you want to do it first.
> If it is hardly ever needed then the initial check is likely to be a gain.
It also depends on the costs of the two checks. If they are similar,
doing the early check is better.
Anyhow, this discussion is quite a bit away from what I want to achieve.
My objective is to drop the local implementation of
mul_u64_u64_div_u64_roundup() from the pwm-stm32 driver. So I suggest we
either add my suggested mul_u64_u64_div_u64_roundup() now, or you take
care of adding an optimized variant of it (in a timely manner would be
welcome).
Best regards
Uwe
On Tue, 1 Apr 2025, David Laight wrote:
> On Tue, 1 Apr 2025 16:30:34 -0400 (EDT)
> Nicolas Pitre <nico@fluxnic.net> wrote:
>
> > On Tue, 1 Apr 2025, Nicolas Pitre wrote:
> >
> > > On Tue, 1 Apr 2025, David Laight wrote:
> > >
> > > > Looking at the C version, I wonder if the two ilog2() calls are needed.
> > > > They may not be cheap, and are the same as checking 'n_hi == 0'.
> > >
> > > Which two calls? I see only one.
> >
> > Hmmm, sorry. If by ilog2() you mean the clz's then those are cheap. Most
> > CPUs have a dedicated instruction for that. The ctz, though, is more
> > expensive (unless it is ARMv7 and above with an RBIT instruction).
>
> The code (6.14-rc6) starts:
>
> u64 mul_u64_u64_div_u64(u64 a, u64 b, u64 c)
> {
> if (ilog2(a) + ilog2(b) <= 62)
> return div64_u64(a * b, c);
Oh! those... yeah. They are carry-overs from the previous
implementation. They kinda fell outside my consciousness.
> If you assume that most calls need the 128bit product (or why use the function)
> then there is little point adding code to optimise for the unusual case.
It's not that you "assume most calls". It is usually that you "may need
128 bits" but don't know in advance. I'd even argue that most of the
time, 128 bits might be needed but often enough isn't.
Nicolas
On Wed, 19 Mar 2025, Uwe Kleine-König wrote:
> This is needed (at least) in the pwm-stm32 driver. Currently the
> pwm-stm32 driver implements this function itself. This private
> implementation can be dropped as a followup of this patch.
>
> Signed-off-by: Uwe Kleine-König <u.kleine-koenig@baylibre.com>
> ---
[...]
> +#ifndef mul_u64_u64_div_u64_roundup
> +u64 mul_u64_u64_div_u64_roundup(u64 a, u64 b, u64 c)
> +{
> + u64 res = mul_u64_u64_div_u64(a, b, c);
> + /* Those multiplications might overflow but it doesn't matter */
> + u64 rem = a * b - c * res;
> +
> + if (rem)
> + res += 1;
> +
> + return res;
> +}
> +EXPORT_SYMBOL(mul_u64_u64_div_u64_roundup);
> +#endif
Might there ever be a need for a _rem variant similar to
div64_u64_rem()? If so the _roundup could then be a simple wrapper.
Otherwise:
Reviewed-by: Nicolas Pitre <npitre@baylibre.com>
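
Purely as a hypothetical sketch of what Nicolas hints at (the _rem
prototype below does not exist; it just mirrors the div64_u64_rem()
signature), the round-up variant would then become a trivial wrapper:

u64 mul_u64_u64_div_u64_rem(u64 a, u64 b, u64 c, u64 *rem);	/* hypothetical */

static inline u64 mul_u64_u64_div_u64_roundup(u64 a, u64 b, u64 c)
{
	u64 rem;
	u64 res = mul_u64_u64_div_u64_rem(a, b, c, &rem);

	return rem ? res + 1 : res;
}
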
Hello Nico,
On Wed, Mar 19, 2025 at 03:38:03PM -0400, Nicolas Pitre wrote:
> On Wed, 19 Mar 2025, Uwe Kleine-König wrote:
> [...]
> > +#ifndef mul_u64_u64_div_u64_roundup
> > +u64 mul_u64_u64_div_u64_roundup(u64 a, u64 b, u64 c)
> > +{
> > + u64 res = mul_u64_u64_div_u64(a, b, c);
> > + /* Those multiplications might overflow but it doesn't matter */
> > + u64 rem = a * b - c * res;
> > +
> > + if (rem)
> > + res += 1;
> > +
> > + return res;
> > +}
> > +EXPORT_SYMBOL(mul_u64_u64_div_u64_roundup);
> > +#endif
>
> Might there ever be a need for a _rem variant similar to
> div64_u64_rem()? If so the _roundup could then be a simple wrapper.
If such a need pops up I hope this synergy is taken care of. I wouldn't
proactively create a _rem variant now though.
> Reviewed-by: Nicolas Pitre <npitre@baylibre.com>
Thanks!
Best regards
Uwe