The pwm-stm32.c code wants a 'rounding up' version of mul_u64_u64_div_u64().
This can be done simply by adding 'divisor - 1' to the 128bit product.
Implement mul_u64_add_u64_div_u64(a, b, c, d) = (a * b + c)/d based on the
existing code.
Define mul_u64_u64_div_u64(a, b, d) as mul_u64_add_u64_div_u64(a, b, 0, d) and
mul_u64_u64_div_u64_roundup(a, b, d) as mul_u64_add_u64_div_u64(a, b, d-1, d).

Only x86-64 has an optimised (asm) version of the function.
That is optimised to avoid the 'add c' when c is known to be zero.
In all other cases the extra code will be noise compared to the software
divide code.

I've updated the test module to test mul_u64_u64_div_u64_roundup() and
also enhanced it to verify the C division code on x86-64.

Note that the code generated by gcc (e.g. for 32bit x86) just for the
multiply is rather more horrid than one would expect (clang does better).
I dread to think how long the divide loop takes.
And I'm not at all sure the call in kernel/sched/cputime.c isn't in a
relatively common path (rather than just hardware initialisation).

David Laight (3):
  lib: Add mul_u64_add_u64_div_u64() and mul_u64_u64_div_u64_roundup()
  lib: Add tests for mul_u64_u64_div_u64_roundup()
  lib: Update the muldiv64 tests to verify the C on x86-64

 arch/x86/include/asm/div64.h        |  19 ++--
 include/linux/math64.h              |  44 ++++++++-
 lib/math/div64.c                    |  57 ++++++++----
 lib/math/test_mul_u64_u64_div_u64.c | 136 +++++++++++++++++-----------
 4 files changed, 179 insertions(+), 77 deletions(-)

-- 
2.39.5
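A minimal sketch of how the two entry points reduce to the new primitive,
following the definitions in the cover letter (the static inline form is an
assumption about how include/linux/math64.h would express them, not the
literal patch):

	#include <linux/types.h>	/* u64 */

	/* The one primitive the series implements:
	 * (a * b + c) / d with a full 128bit intermediate product. */
	u64 mul_u64_add_u64_div_u64(u64 a, u64 b, u64 c, u64 d);

	/* (a * b) / d rounded down: add nothing to the product. */
	static inline u64 mul_u64_u64_div_u64(u64 a, u64 b, u64 d)
	{
		return mul_u64_add_u64_div_u64(a, b, 0, d);
	}

	/* (a * b) / d rounded up: adding 'd - 1' before the divide
	 * carries any non-zero remainder into the quotient. */
	static inline u64 mul_u64_u64_div_u64_roundup(u64 a, u64 b, u64 d)
	{
		return mul_u64_add_u64_div_u64(a, b, d - 1, d);
	}

The rounding identity is the usual one: for x >= 0 and d > 0,
(x + d - 1) / d == DIV_ROUND_UP(x, d), here applied to the 128bit product
before it is narrowed back to 64 bits.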
Hello David,

On Sat, Apr 05, 2025 at 09:45:27PM +0100, David Laight wrote:
> The pwm-stm32.c code wants a 'rounding up' version of mul_u64_u64_div_u64().
> This can be done simply by adding 'divisor - 1' to the 128bit product.
> Implement mul_u64_add_u64_div_u64(a, b, c, d) = (a * b + c)/d based on the
> existing code.
> Define mul_u64_u64_div_u64(a, b, d) as mul_u64_add_u64_div_u64(a, b, 0, d) and
> mul_u64_u64_div_u64_roundup(a, b, d) as mul_u64_add_u64_div_u64(a, b, d-1, d).
>
> Only x86-64 has an optimised (asm) version of the function.
> That is optimised to avoid the 'add c' when c is known to be zero.
> In all other cases the extra code will be noise compared to the software
> divide code.
>
> I've updated the test module to test mul_u64_u64_div_u64_roundup() and
> also enhanced it to verify the C division code on x86-64.
>
> Note that the code generated by gcc (e.g. for 32bit x86) just for the
> multiply is rather more horrid than one would expect (clang does better).
> I dread to think how long the divide loop takes.
> And I'm not at all sure the call in kernel/sched/cputime.c isn't in a
> relatively common path (rather than just hardware initialisation).
>
> David Laight (3):
>   lib: Add mul_u64_add_u64_div_u64() and mul_u64_u64_div_u64_roundup()
>   lib: Add tests for mul_u64_u64_div_u64_roundup()
>   lib: Update the muldiv64 tests to verify the C on x86-64

I wonder what happened to this series. I'd like to make use of
mul_u64_u64_div_u64_roundup() so I'd be interested to get it into
mainline.

Best regards
Uwe
On Fri, 16 May 2025 11:47:58 +0200
Uwe Kleine-König <u.kleine-koenig@baylibre.com> wrote:

> Hello David,
>
> On Sat, Apr 05, 2025 at 09:45:27PM +0100, David Laight wrote:
> > The pwm-stm32.c code wants a 'rounding up' version of mul_u64_u64_div_u64().
> > This can be done simply by adding 'divisor - 1' to the 128bit product.
> > Implement mul_u64_add_u64_div_u64(a, b, c, d) = (a * b + c)/d based on the
> > existing code.
> > Define mul_u64_u64_div_u64(a, b, d) as mul_u64_add_u64_div_u64(a, b, 0, d) and
> > mul_u64_u64_div_u64_roundup(a, b, d) as mul_u64_add_u64_div_u64(a, b, d-1, d).
> >
> > Only x86-64 has an optimised (asm) version of the function.
> > That is optimised to avoid the 'add c' when c is known to be zero.
> > In all other cases the extra code will be noise compared to the software
> > divide code.
> >
> > I've updated the test module to test mul_u64_u64_div_u64_roundup() and
> > also enhanced it to verify the C division code on x86-64.
> >
> > Note that the code generated by gcc (e.g. for 32bit x86) just for the
> > multiply is rather more horrid than one would expect (clang does better).
> > I dread to think how long the divide loop takes.
> > And I'm not at all sure the call in kernel/sched/cputime.c isn't in a
> > relatively common path (rather than just hardware initialisation).
> >
> > David Laight (3):
> >   lib: Add mul_u64_add_u64_div_u64() and mul_u64_u64_div_u64_roundup()
> >   lib: Add tests for mul_u64_u64_div_u64_roundup()
> >   lib: Update the muldiv64 tests to verify the C on x86-64
>
> I wonder what happened to this series. I'd like to make use of
> mul_u64_u64_div_u64_roundup() so I'd be interested to get it into
> mainline.

I've a WIP rewrite of the divide code; it speeds things up considerably
for 'not amd64'.
IIRC (the test machine is powered off) the test cases with random data
go down from over 900 clocks to below 150 for x86-32.
The 64bit code (on x86-64 but skipping the divide instruction) is ~70
clocks rather than nearer 180.
Both those clock counts are almost data independent.
The code relies on the cpu having an 'unsigned long / unsigned long'
divide instruction.

I got reasonable code for 32bit x86 with gcc 12.2 from debian.
Then I copied it to godbolt and found gcc before 15.0 making a pig's
breakfast of it (all clang versions are fine). So some rework is needed.
I still need to actually plumb it into the kernel sources.

I do want to test the divide loop on x86-64, at least for 64bit,
probably by including the code directly in the test module.

	David
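For readers wondering what 'the divide loop' refers to: the generic C path
has to long-divide the 128bit product by a 64bit divisor. Below is a
textbook restoring-division sketch of that kind of loop, purely for
illustration — it is not David's rewrite (whose algorithm is not shown in
this thread), and it assumes the high half of the product has already been
reduced below the divisor so the quotient fits in 64 bits:

	#include <linux/types.h>

	/*
	 * Schematic 128/64 restoring division: one quotient bit per
	 * iteration.  Requires n_hi < d on entry.
	 */
	static u64 div128_by_64_sketch(u64 n_hi, u64 n_lo, u64 d)
	{
		u64 q = 0;
		int i;

		for (i = 0; i < 64; i++) {
			/* Shift the 128bit remainder left one bit,
			 * keeping the bit shifted out of the top. */
			unsigned int carry = n_hi >> 63;

			n_hi = (n_hi << 1) | (n_lo >> 63);
			n_lo <<= 1;
			q <<= 1;

			/* Subtract the divisor when it fits (the u64
			 * wrap when 'carry' is set is intentional). */
			if (carry || n_hi >= d) {
				n_hi -= d;
				q |= 1;
			}
		}
		return q;
	}

Sixty-four dependent shift/compare/subtract iterations is why clock counts
in the hundreds, largely data independent, are plausible for plain C here.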
On Fri, 16 May 2025, David Laight wrote:

> On Fri, 16 May 2025 11:47:58 +0200
> Uwe Kleine-König <u.kleine-koenig@baylibre.com> wrote:
>
> > Hello David,
> >
> > On Sat, Apr 05, 2025 at 09:45:27PM +0100, David Laight wrote:
> > > The pwm-stm32.c code wants a 'rounding up' version of mul_u64_u64_div_u64().
> > > This can be done simply by adding 'divisor - 1' to the 128bit product.
> > > Implement mul_u64_add_u64_div_u64(a, b, c, d) = (a * b + c)/d based on the
> > > existing code.
> > > Define mul_u64_u64_div_u64(a, b, d) as mul_u64_add_u64_div_u64(a, b, 0, d) and
> > > mul_u64_u64_div_u64_roundup(a, b, d) as mul_u64_add_u64_div_u64(a, b, d-1, d).
> > >
> > > Only x86-64 has an optimised (asm) version of the function.
> > > That is optimised to avoid the 'add c' when c is known to be zero.
> > > In all other cases the extra code will be noise compared to the software
> > > divide code.
> > >
> > > I've updated the test module to test mul_u64_u64_div_u64_roundup() and
> > > also enhanced it to verify the C division code on x86-64.
> > >
> > > Note that the code generated by gcc (e.g. for 32bit x86) just for the
> > > multiply is rather more horrid than one would expect (clang does better).
> > > I dread to think how long the divide loop takes.
> > > And I'm not at all sure the call in kernel/sched/cputime.c isn't in a
> > > relatively common path (rather than just hardware initialisation).
> > >
> > > David Laight (3):
> > >   lib: Add mul_u64_add_u64_div_u64() and mul_u64_u64_div_u64_roundup()
> > >   lib: Add tests for mul_u64_u64_div_u64_roundup()
> > >   lib: Update the muldiv64 tests to verify the C on x86-64
> >
> > I wonder what happened to this series. I'd like to make use of
> > mul_u64_u64_div_u64_roundup() so I'd be interested to get it into
> > mainline.
>
> I've a WIP rewrite of the divide code; it speeds things up considerably
> for 'not amd64'.

May I suggest you simply submit the new API now (addressing my latest
comments if possible) and submit the divide optimization later?


Nicolas
On Fri, 16 May 2025 11:49:44 -0400 (EDT)
Nicolas Pitre <npitre@baylibre.com> wrote:

> On Fri, 16 May 2025, David Laight wrote:
>
> > On Fri, 16 May 2025 11:47:58 +0200
> > Uwe Kleine-König <u.kleine-koenig@baylibre.com> wrote:
> >
> > > Hello David,
> > >
> > > On Sat, Apr 05, 2025 at 09:45:27PM +0100, David Laight wrote:
> > > > The pwm-stm32.c code wants a 'rounding up' version of mul_u64_u64_div_u64().
> > > > This can be done simply by adding 'divisor - 1' to the 128bit product.
> > > > Implement mul_u64_add_u64_div_u64(a, b, c, d) = (a * b + c)/d based on the
> > > > existing code.
> > > > Define mul_u64_u64_div_u64(a, b, d) as mul_u64_add_u64_div_u64(a, b, 0, d) and
> > > > mul_u64_u64_div_u64_roundup(a, b, d) as mul_u64_add_u64_div_u64(a, b, d-1, d).
> > > >
> > > > Only x86-64 has an optimised (asm) version of the function.
> > > > That is optimised to avoid the 'add c' when c is known to be zero.
> > > > In all other cases the extra code will be noise compared to the software
> > > > divide code.
> > > >
> > > > I've updated the test module to test mul_u64_u64_div_u64_roundup() and
> > > > also enhanced it to verify the C division code on x86-64.
> > > >
> > > > Note that the code generated by gcc (e.g. for 32bit x86) just for the
> > > > multiply is rather more horrid than one would expect (clang does better).
> > > > I dread to think how long the divide loop takes.
> > > > And I'm not at all sure the call in kernel/sched/cputime.c isn't in a
> > > > relatively common path (rather than just hardware initialisation).
...
> > > I wonder what happened to this series. I'd like to make use of
> > > mul_u64_u64_div_u64_roundup() so I'd be interested to get it into
> > > mainline.
> >
> > I've a WIP rewrite of the divide code; it speeds things up considerably
> > for 'not amd64'.
>
> May I suggest you simply submit the new API now (addressing my latest
> comments if possible) and submit the divide optimization later?

I've just 'lobbed in' a 'v2' that excludes the last patch (extra changes
to the test module) and replaces 'if (!divisor) 1/0;' with
BUG_ON(!divisor).
I'll do the speedup changes on top.

	David
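Schematically, the zero-divisor change in that v2 is the contrast below.
This is not the literal hunk: the v1 line is as David quotes it, and
BUG_ON() is the real kernel macro from <linux/bug.h>.

	/* v1: provoke a divide-by-zero trap at runtime. */
	if (!divisor)
		1/0;

	/* v2: report the broken caller explicitly instead. */
	BUG_ON(!divisor);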