Previously, counting the value of "sftacc" within
"clocks_calc_mult_shift()" used a while loop to determine how many
times to decrement "sftacc".
It's equivalent to the position of the MSB of "tmp", which we can
derive from (32 - __builtin_clz(tmp)). Use 32 instead of 31 here
because 1UL should be treated as an amount of 1, not the index 0, and
even though "tmp" is of type u64, since it has already been shifted
right by 32, only the lowest 32 bits can hold a non-zero value.
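
For illustration only (not part of the patch), a minimal userspace
sketch of this equivalence; note that __builtin_clz(0) is undefined,
so the identity only holds for a non-zero "tmp":

	#include <assert.h>
	#include <stdint.h>

	/* Number of iterations the original while loop performs. */
	static int shift_count(uint64_t tmp)
	{
		int n = 0;

		while (tmp) {
			tmp >>= 1;
			n++;
		}
		return n;
	}

	int main(void)
	{
		/*
		 * "tmp" has already been shifted right by 32, so only the
		 * lowest 32 bits can be non-zero; __builtin_clz(0) is
		 * undefined, hence the loop starts from 1.
		 */
		for (uint32_t tmp = 1; tmp <= 1000000; tmp++)
			assert(shift_count(tmp) == 32 - __builtin_clz(tmp));
		return 0;
	}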
This change was tested with a test script [1].
The results show that it saves a significant amount of execution
overhead for this function; since the function is invoked often, the
whole system would likely benefit from it.
-----------------------------
| old version | new version |
-----------------------------
| 11500.6 ns  | 44 ns       |
-----------------------------
The execution time is reduced by around 99.7%.
Signed-off-by: I Hsin Cheng <richard120310@gmail.com>
---
[1]:

/* Minimal module glue and includes added so the snippet builds standalone. */
#include <linux/module.h>
#include <linux/ktime.h>
#include <linux/timekeeping.h>
#include <linux/clocksource.h>

static int __init test_init(void)
{
	u32 mult, shift;
	u32 from, to, maxsec;
	ktime_t start_time, end_time, total_time;

	pr_info("Starting clocks_calc_mult_shift simple test\n");
	start_time = ktime_get();

	/* Test with parameters from 1 to 1000 */
	for (from = 1; from <= 1000; from += 100) {
		for (to = 1; to <= 1000; to += 100) {
			for (maxsec = 1; maxsec <= 10; maxsec++)
				clocks_calc_mult_shift(&mult, &shift, from, to, maxsec);
		}
	}

	end_time = ktime_get();
	total_time = ktime_to_ns(ktime_sub(end_time, start_time));
	pr_info("Test completed\n");
	pr_info("Total execution time: %lld ns\n", total_time);
	return 0;
}
module_init(test_init);
MODULE_LICENSE("GPL");
The test runs as a kernel module.
The test machine runs Ubuntu 24.04 on x86_64 with kernel version
v6.14.0; the CPU is an AMD Ryzen 7 5700X3D 8-Core Processor.
Best regards,
I Hsin Cheng.
---
kernel/time/clocksource.c | 5 +----
1 file changed, 1 insertion(+), 4 deletions(-)
diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c
index 2a7802ec480c..b4f8fffae847 100644
--- a/kernel/time/clocksource.c
+++ b/kernel/time/clocksource.c
@@ -66,10 +66,7 @@ clocks_calc_mult_shift(u32 *mult, u32 *shift, u32 from, u32 to, u32 maxsec)
* range:
*/
tmp = ((u64)maxsec * from) >> 32;
- while (tmp) {
- tmp >>=1;
- sftacc--;
- }
+ sftacc -= (32 - __builtin_clz(tmp));
/*
* Find the conversion shift/mult pair which has the best
--
2.43.0
Thanks so much for sending this optimization in! It looks promising,
but a few comments below:

On Mon, Jun 9, 2025 at 12:46 PM I Hsin Cheng <richard120310@gmail.com> wrote:
>
> Previously, counting the value of "sftacc" within
> "clocks_calc_mult_shift()" used a while loop to determine how many
> times to decrement "sftacc".

This is just a direct translation of the code into English, without
any context or explanation as to *why* the code is doing what it is
doing.

It might be better to explain first *why* the calculation is being
done, and then get into the limitations of the current implementation.

> It's equivalent to the position of the MSB of "tmp", which we can
> derive from (32 - __builtin_clz(tmp)). Use 32 instead of 31 here
> because 1UL should be treated as an amount of 1, not the index 0, and
> even though "tmp" is of type u64, since it has already been shifted
> right by 32, only the lowest 32 bits can hold a non-zero value.

I was having a bit of a hard time parsing this paragraph, so it
probably needs some more work.

On first skim, this also was a bit confusing, as __builtin_clz(u64)
*seems* like it could return greater than 32 (but that would be
__builtin_clzll). So the (32 - clz(u64)) bit might confuse folks, as
it's pretty subtle that __builtin_clz() only works on the bottom 32
bits. The code should definitely get a clear comment on this.

And thinking more, would using ffs() be more straightforward here for
what is wanted?

> This change was tested with a test script [1].
> The results show that it saves a significant amount of execution
> overhead for this function; since the function is invoked often, the
> whole system would likely benefit from it.
>
> -----------------------------
> | old version | new version |
> -----------------------------
> | 11500.6 ns  | 44 ns       |
> -----------------------------
>
> The execution time is reduced by around 99.7%.

That is impressive, but should probably be contextualized around how
often one might expect clocks_calc_mult_shift() to be called.
It still looks like a good optimization, but we shouldn't over-sell it
unless there's a concrete impact to users.

Thanks again for sending this along! I'd just take another pass at the
commit message and the comments and see if this can be made easier to
understand when someone 5 years from now is looking at the code.

thanks
-john
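
For reference, a purely illustrative sketch of a helper-based variant
(the kernel's fls64() returns 0 for a zero argument and the 1-based
position of the most significant set bit otherwise, so it matches the
number of iterations of the original loop, including the tmp == 0
case; the ffs() family mentioned above counts from the least
significant bit, so the "find last set" flavour is the one whose
semantics fit here):

	/* Illustration only, not the submitted patch. */
	tmp = ((u64)maxsec * from) >> 32;
	sftacc -= fls64(tmp);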
On Mon, Jun 09, 2025 at 05:32:05PM -0700, John Stultz wrote:
> Thanks so much for sending this optimization in! It looks promising,
> but a few comments below:
>
> On Mon, Jun 9, 2025 at 12:46 PM I Hsin Cheng <richard120310@gmail.com> wrote:
> >
> > Previously, counting the value of "sftacc" within
> > "clocks_calc_mult_shift()" used a while loop to determine how many
> > times to decrement "sftacc".
>
> This is just a direct translation of the code into English, without
> any context or explanation as to *why* the code is doing what it is
> doing.
>
> It might be better to explain first *why* the calculation is being
> done, and then get into the limitations of the current implementation.
>

Sure thing, I'll change this part.

> > It's equivalent to the position of the MSB of "tmp", which we can
> > derive from (32 - __builtin_clz(tmp)). Use 32 instead of 31 here
> > because 1UL should be treated as an amount of 1, not the index 0, and
> > even though "tmp" is of type u64, since it has already been shifted
> > right by 32, only the lowest 32 bits can hold a non-zero value.
>
> I was having a bit of a hard time parsing this paragraph, so it
> probably needs some more work.
>
> On first skim, this also was a bit confusing, as __builtin_clz(u64)
> *seems* like it could return greater than 32 (but that would be
> __builtin_clzll). So the (32 - clz(u64)) bit might confuse folks, as
> it's pretty subtle that __builtin_clz() only works on the bottom 32
> bits. The code should definitely get a clear comment on this.
>
> And thinking more, would using ffs() be more straightforward here for
> what is wanted?
>

This is a nice heads-up, thanks so much! I'll switch to ffs() and see
how it goes.

> > This change was tested with a test script [1].
> > The results show that it saves a significant amount of execution
> > overhead for this function; since the function is invoked often, the
> > whole system would likely benefit from it.
> >
> > -----------------------------
> > | old version | new version |
> > -----------------------------
> > | 11500.6 ns  | 44 ns       |
> > -----------------------------
> >
> > The execution time is reduced by around 99.7%.
>
> That is impressive, but should probably be contextualized around how
> often one might expect clocks_calc_mult_shift() to be called.
> It still looks like a good optimization, but we shouldn't over-sell it
> unless there's a concrete impact to users.
>

No problem, I'll try bpf or a debug-print approach to measure how often
this function is invoked, what do you think? Maybe there's a better and
more elegant approach for such profiling?

Btw, I ran each test 10 times and took the average; I should amend this
in the commit message.

> Thanks again for sending this along! I'd just take another pass at the
> commit message and the comments and see if this can be made easier to
> understand when someone 5 years from now is looking at the code.
>
> thanks
> -john

Hi John,

Thanks for your review and these suggestions! I'll refine the commit
message and run more tests (e.g. use ffs()) to see if it can be better.
I agree the commit message might be too confusing; I'll work on that
more. Thanks for your time!

Best regards,
I Hsin Cheng
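
A minimal sketch of the call-frequency measurement idea mentioned
above, using a kprobe-based counter module (illustrative only; the
module and function names are placeholders, and it assumes
"clocks_calc_mult_shift" can be probed on the running kernel):

	#include <linux/module.h>
	#include <linux/kprobes.h>
	#include <linux/atomic.h>

	static atomic64_t hits = ATOMIC64_INIT(0);

	/* Runs right before every invocation of the probed function. */
	static int count_pre(struct kprobe *p, struct pt_regs *regs)
	{
		atomic64_inc(&hits);
		return 0;
	}

	static struct kprobe kp = {
		.symbol_name	= "clocks_calc_mult_shift",
		.pre_handler	= count_pre,
	};

	static int __init count_init(void)
	{
		int ret = register_kprobe(&kp);

		if (ret < 0)
			pr_err("register_kprobe failed: %d\n", ret);
		return ret;
	}

	static void __exit count_exit(void)
	{
		unregister_kprobe(&kp);
		pr_info("clocks_calc_mult_shift called %lld times\n",
			(long long)atomic64_read(&hits));
	}

	module_init(count_init);
	module_exit(count_exit);
	MODULE_LICENSE("GPL");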