clocksource: Replace loop within clocks_calc_mult_shift() with clz-based calculation for sftacc

[PATCH v2] clocksource: Replace loop within clocks_calc_mult_shift() with clz-based calculation for sftacc

Posted by I Hsin Cheng 8 months ago

Utilize (32 - __builtin_clz(tmp)) in replacement of while loop counting
for the decremenet of "sftacc". They're equivalent in computation result
but the former is more effective.

Using 32 instead of 31 here because MSB position is 0-index, and the bit
width will be (MSB position + 1).

The case for "tmp == 0" should be handled separately since
__builtin_clz(0) has undefined behavior.

This change is tested against a test script [1].
Run the test 10 times for each version of implementation and take the
average. The result shown that with this change, the operation overhead
of "clocks_calc_mult_shift()" can be reduced around 99.7% .

-----------------------------
| old version | new version |
-----------------------------
|  11500.6 ns |       44 ns |
-----------------------------

Signed-off-by: I Hsin Cheng <richard120310@gmail.com>
---
Changelog:

v1 -> v2:
	- Refine commit message to explain more about "why"
	- Check the frequency of "clocks_calc_mult_shift()" get called,
	  it's not in hotpath on my machine, refine the commit message
      to avoid overselling it
	- Add comments for the code to explain the implementation in
	  more detail
	- Handle case for "tmp == 0" to avoid undefined behavior
    - Experiment using ffs() but it's used for finding the LSB of
      a number, which isn't MSB as we want. It would still need
      looping when intended to use ffs() in this scenario

[1]:
static int __init test_init(void)
{
    u32 mult, shift;
    u32 from, to, maxsec;
    ktime_t start_time, end_time, total_time;
    pr_info("Starting clocks_calc_mult_shift simple test\n");

    start_time = ktime_get();
    // Test with parameters from 1 to 1000
    for (from = 1; from <= 1000; from += 100) {
        for (to = 1; to <= 1000; to += 100) {
            for (maxsec = 1; maxsec <= 10; maxsec++) {

                clocks_calc_mult_shift(&mult, &shift, from, to, maxsec);
            }
        }
    }

    end_time = ktime_get();
    total_time = ktime_to_ns(ktime_sub(end_time, start_time));

    pr_info("Test completed\n");
    pr_info("Total execution time: %lld ns \n", total_time);
    return 0;
}

The test is running in the form of kernel module.
The test machine is running ubuntu 24.04 on x86_64 machine with kernel
version of v6.14.0, CPU type is AMD Ryzen 7 5700X3D 8-Core Processor.

Best regards,
I Hsin Cheng.
---
 kernel/time/clocksource.c | 18 ++++++++++++++----
 1 file changed, 14 insertions(+), 4 deletions(-)

diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c
index 2a7802ec480c..59145d762f1e 100644
--- a/kernel/time/clocksource.c
+++ b/kernel/time/clocksource.c
@@ -66,10 +66,20 @@ clocks_calc_mult_shift(u32 *mult, u32 *shift, u32 from, u32 to, u32 maxsec)
 	 * range:
 	 */
 	tmp = ((u64)maxsec * from) >> 32;
-	while (tmp) {
-		tmp >>=1;
-		sftacc--;
-	}
+
+	/*
+	 * Decrement "sftacc" by the number of bits needed to represent "tmp".
+	 * Using (32 - __builtin_clz(tmp)) to ge the bit width:
+	 * - __builtin_clz(tmp) returns the count of leading zero bits
+	 * - (32 - __builtin_clz(tmp)) gives us the number of significant bits
+	 *
+	 * Note: It use 32 instead of 31 because we want bit width (0-index MSB
+	 * position + 1), not just the MSB position itself.
+	 *
+	 * Handle tmp == 0 separately since __builtin_clz(0) has undefined behavior.
+	 */
+	if (tmp)
+		sftacc -= (32 - __builtin_clz(tmp));
 
 	/*
 	 * Find the conversion shift/mult pair which has the best
-- 
2.43.0

Re: [PATCH v2] clocksource: Replace loop within clocks_calc_mult_shift() with clz-based calculation for sftacc

Posted by John Stultz 8 months ago

On Tue, Jun 10, 2025 at 8:30 PM I Hsin Cheng <richard120310@gmail.com> wrote:
>
> v1 -> v2:
>         - Refine commit message to explain more about "why"
>         - Check the frequency of "clocks_calc_mult_shift()" get called,
>           it's not in hotpath on my machine, refine the commit message
>       to avoid overselling it
>         - Add comments for the code to explain the implementation in
>           more detail
>         - Handle case for "tmp == 0" to avoid undefined behavior
>     - Experiment using ffs() but it's used for finding the LSB of
>       a number, which isn't MSB as we want. It would still need
>       looping when intended to use ffs() in this scenario

Oh, apologies for mixing that up and leading you astray!


> diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c
> index 2a7802ec480c..59145d762f1e 100644
> --- a/kernel/time/clocksource.c
> +++ b/kernel/time/clocksource.c
> @@ -66,10 +66,20 @@ clocks_calc_mult_shift(u32 *mult, u32 *shift, u32 from, u32 to, u32 maxsec)
>          * range:
>          */
>         tmp = ((u64)maxsec * from) >> 32;
> -       while (tmp) {
> -               tmp >>=1;
> -               sftacc--;
> -       }
> +
> +       /*
> +        * Decrement "sftacc" by the number of bits needed to represent "tmp".
> +        * Using (32 - __builtin_clz(tmp)) to ge the bit width:
> +        * - __builtin_clz(tmp) returns the count of leading zero bits
> +        * - (32 - __builtin_clz(tmp)) gives us the number of significant bits
> +        *
> +        * Note: It use 32 instead of 31 because we want bit width (0-index MSB
> +        * position + 1), not just the MSB position itself.
> +        *
> +        * Handle tmp == 0 separately since __builtin_clz(0) has undefined behavior.
> +        */
> +       if (tmp)
> +               sftacc -= (32 - __builtin_clz(tmp));

Still think the detail that __builtin_clz only works on the bottom
32bits is good to highlight in the comment.
Though maybe, would explicitly casting tmp to a u32 help it be more clear here?

thanks
-john

Re: [PATCH v2] clocksource: Replace loop within clocks_calc_mult_shift() with clz-based calculation for sftacc

Posted by I Hsin Cheng 8 months ago

On Tue, Jun 10, 2025 at 09:31:49PM -0700, John Stultz wrote:
> On Tue, Jun 10, 2025 at 8:30 PM I Hsin Cheng <richard120310@gmail.com> wrote:
> >
> > v1 -> v2:
> >         - Refine commit message to explain more about "why"
> >         - Check the frequency of "clocks_calc_mult_shift()" get called,
> >           it's not in hotpath on my machine, refine the commit message
> >       to avoid overselling it
> >         - Add comments for the code to explain the implementation in
> >           more detail
> >         - Handle case for "tmp == 0" to avoid undefined behavior
> >     - Experiment using ffs() but it's used for finding the LSB of
> >       a number, which isn't MSB as we want. It would still need
> >       looping when intended to use ffs() in this scenario
> 
> Oh, apologies for mixing that up and leading you astray!
> 
> 
> > diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c
> > index 2a7802ec480c..59145d762f1e 100644
> > --- a/kernel/time/clocksource.c
> > +++ b/kernel/time/clocksource.c
> > @@ -66,10 +66,20 @@ clocks_calc_mult_shift(u32 *mult, u32 *shift, u32 from, u32 to, u32 maxsec)
> >          * range:
> >          */
> >         tmp = ((u64)maxsec * from) >> 32;
> > -       while (tmp) {
> > -               tmp >>=1;
> > -               sftacc--;
> > -       }
> > +
> > +       /*
> > +        * Decrement "sftacc" by the number of bits needed to represent "tmp".
> > +        * Using (32 - __builtin_clz(tmp)) to ge the bit width:
> > +        * - __builtin_clz(tmp) returns the count of leading zero bits
> > +        * - (32 - __builtin_clz(tmp)) gives us the number of significant bits
> > +        *
> > +        * Note: It use 32 instead of 31 because we want bit width (0-index MSB
> > +        * position + 1), not just the MSB position itself.
> > +        *
> > +        * Handle tmp == 0 separately since __builtin_clz(0) has undefined behavior.
> > +        */
> > +       if (tmp)
> > +               sftacc -= (32 - __builtin_clz(tmp));
> 
> Still think the detail that __builtin_clz only works on the bottom
> 32bits is good to highlight in the comment.
> Though maybe, would explicitly casting tmp to a u32 help it be more clear here?
> 
> thanks
> -john
>

Hi John,

Sure thing, sorry I missed the part to mention it onlys works on the
lower 32 bits of "tmp". I'll amend that ASAP.

Btw I just realized it's more appropriate to use "find_last_bit()"
instead of __builtin_clz(). Most of the in-tree use cases utilize
find_last_bit() as it support more optimization. I'll cc to bitmap's
maintainer in the next version and see what suggestion he gives so we
can make sure we consider everything as much as possible.

Thanks !

Best regards,
I Hsin Cheng