From: Nicolas Pitre <npitre@baylibre.com>
Recent gcc versions started not systematically inline __arch_xprod64()
and that has performance implications. Give the compiler the freedom to
decide only when optimizing for size.
Signed-off-by: Nicolas Pitre <npitre@baylibre.com>
---
arch/arm/include/asm/div64.h | 7 ++++++-
include/asm-generic/div64.h | 7 ++++++-
2 files changed, 12 insertions(+), 2 deletions(-)
diff --git a/arch/arm/include/asm/div64.h b/arch/arm/include/asm/div64.h
index 562d5376ae..3912bf4ce9 100644
--- a/arch/arm/include/asm/div64.h
+++ b/arch/arm/include/asm/div64.h
@@ -52,7 +52,12 @@ static inline uint32_t __div64_32(uint64_t *n, uint32_t base)
#else
-static inline uint64_t __arch_xprod_64(uint64_t m, uint64_t n, bool bias)
+#ifdef CONFIG_CC_OPTIMIZE_FOR_PERFORMANCE
+static __always_inline
+#else
+static inline
+#endif
+__arch_xprod_64(uint64_t m, uint64_t n, bool bias)
{
unsigned long long res;
register unsigned int tmp asm("ip") = 0;
diff --git a/include/asm-generic/div64.h b/include/asm-generic/div64.h
index 5d59cf7e73..25e7b4b58d 100644
--- a/include/asm-generic/div64.h
+++ b/include/asm-generic/div64.h
@@ -134,7 +134,12 @@
* Hoping for compile-time optimization of conditional code.
* Architectures may provide their own optimized assembly implementation.
*/
-static inline uint64_t __arch_xprod_64(const uint64_t m, uint64_t n, bool bias)
+#ifdef CONFIG_CC_OPTIMIZE_FOR_PERFORMANCE
+static __always_inline
+#else
+static inline
+#endif
+uint64_t __arch_xprod_64(const uint64_t m, uint64_t n, bool bias)
{
uint32_t m_lo = m;
uint32_t m_hi = m >> 32;
--
2.45.2
Hi Nicolas,
kernel test robot noticed the following build errors:
[auto build test ERROR on arnd-asm-generic/master]
[also build test ERROR on arm/for-next arm/fixes soc/for-next linus/master v6.10-rc6 next-20240703]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]
url: https://github.com/intel-lab-lkp/linux/commits/Nicolas-Pitre/lib-math-test_div64-add-some-edge-cases-relevant-to-__div64_const32/20240708-013344
base: https://git.kernel.org/pub/scm/linux/kernel/git/arnd/asm-generic.git master
patch link: https://lore.kernel.org/r/20240707171919.1951895-5-nico%40fluxnic.net
patch subject: [PATCH v2 4/4] __arch_xprod64(): make __always_inline when optimizing for performance
config: arm-randconfig-002-20240708 (https://download.01.org/0day-ci/archive/20240708/202407080355.VUEmeBsv-lkp@intel.com/config)
compiler: arm-linux-gnueabi-gcc (GCC) 13.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20240708/202407080355.VUEmeBsv-lkp@intel.com/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202407080355.VUEmeBsv-lkp@intel.com/
All errors (new ones prefixed by >>):
scripts/genksyms/parse.y: warning: 9 shift/reduce conflicts [-Wconflicts-sr]
scripts/genksyms/parse.y: warning: 5 reduce/reduce conflicts [-Wconflicts-rr]
scripts/genksyms/parse.y: note: rerun with option '-Wcounterexamples' to generate conflict counterexamples
In file included from include/linux/math.h:6,
from include/linux/kernel.h:27,
from include/linux/cpumask.h:11,
from include/linux/sched.h:16,
from arch/arm/kernel/asm-offsets.c:11:
>> arch/arm/include/asm/div64.h:60:1: error: return type defaults to 'int' [-Werror=implicit-int]
60 | __arch_xprod_64(uint64_t m, uint64_t n, bool bias)
| ^~~~~~~~~~~~~~~
cc1: some warnings being treated as errors
make[3]: *** [scripts/Makefile.build:117: arch/arm/kernel/asm-offsets.s] Error 1 shuffle=2335528022
make[3]: Target 'prepare' not remade because of errors.
make[2]: *** [Makefile:1208: prepare0] Error 2 shuffle=2335528022
make[2]: Target 'prepare' not remade because of errors.
make[1]: *** [Makefile:240: __sub-make] Error 2 shuffle=2335528022
make[1]: Target 'prepare' not remade because of errors.
make: *** [Makefile:240: __sub-make] Error 2 shuffle=2335528022
make: Target 'prepare' not remade because of errors.
vim +/int +60 arch/arm/include/asm/div64.h
54
55 #ifdef CONFIG_CC_OPTIMIZE_FOR_PERFORMANCE
56 static __always_inline
57 #else
58 static inline
59 #endif
> 60 __arch_xprod_64(uint64_t m, uint64_t n, bool bias)
61 {
62 unsigned long long res;
63 register unsigned int tmp asm("ip") = 0;
64 bool no_ovf = __builtin_constant_p(m) &&
65 ((m >> 32) + (m & 0xffffffff) < 0x100000000);
66
67 if (!bias) {
68 asm ( "umull %Q0, %R0, %Q1, %Q2\n\t"
69 "mov %Q0, #0"
70 : "=&r" (res)
71 : "r" (m), "r" (n)
72 : "cc");
73 } else if (no_ovf) {
74 res = m;
75 asm ( "umlal %Q0, %R0, %Q1, %Q2\n\t"
76 "mov %Q0, #0"
77 : "+&r" (res)
78 : "r" (m), "r" (n)
79 : "cc");
80 } else {
81 asm ( "umull %Q0, %R0, %Q2, %Q3\n\t"
82 "cmn %Q0, %Q2\n\t"
83 "adcs %R0, %R0, %R2\n\t"
84 "adc %Q0, %1, #0"
85 : "=&r" (res), "+&r" (tmp)
86 : "r" (m), "r" (n)
87 : "cc");
88 }
89
90 if (no_ovf) {
91 asm ( "umlal %R0, %Q0, %R1, %Q2\n\t"
92 "umlal %R0, %Q0, %Q1, %R2\n\t"
93 "mov %R0, #0\n\t"
94 "umlal %Q0, %R0, %R1, %R2"
95 : "+&r" (res)
96 : "r" (m), "r" (n)
97 : "cc");
98 } else {
99 asm ( "umlal %R0, %Q0, %R2, %Q3\n\t"
100 "umlal %R0, %1, %Q2, %R3\n\t"
101 "mov %R0, #0\n\t"
102 "adds %Q0, %1, %Q0\n\t"
103 "adc %R0, %R0, #0\n\t"
104 "umlal %Q0, %R0, %R2, %R3"
105 : "+&r" (res), "+&r" (tmp)
106 : "r" (m), "r" (n)
107 : "cc");
108 }
109
110 return res;
111 }
112 #define __arch_xprod_64 __arch_xprod_64
113
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
Hi Nicolas,
kernel test robot noticed the following build errors:
[auto build test ERROR on arnd-asm-generic/master]
[also build test ERROR on arm/for-next arm/fixes soc/for-next linus/master v6.10-rc6 next-20240703]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]
url: https://github.com/intel-lab-lkp/linux/commits/Nicolas-Pitre/lib-math-test_div64-add-some-edge-cases-relevant-to-__div64_const32/20240708-013344
base: https://git.kernel.org/pub/scm/linux/kernel/git/arnd/asm-generic.git master
patch link: https://lore.kernel.org/r/20240707171919.1951895-5-nico%40fluxnic.net
patch subject: [PATCH v2 4/4] __arch_xprod64(): make __always_inline when optimizing for performance
config: arm-randconfig-003-20240708 (https://download.01.org/0day-ci/archive/20240708/202407080326.fBdpm1Tq-lkp@intel.com/config)
compiler: clang version 16.0.6 (https://github.com/llvm/llvm-project 7cbf1a2591520c2491aa35339f227775f4d3adf6)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20240708/202407080326.fBdpm1Tq-lkp@intel.com/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202407080326.fBdpm1Tq-lkp@intel.com/
All errors (new ones prefixed by >>):
In file included from arch/arm/kernel/asm-offsets.c:11:
In file included from include/linux/sched.h:16:
In file included from include/linux/cpumask.h:11:
In file included from include/linux/kernel.h:27:
In file included from include/linux/math.h:6:
>> arch/arm/include/asm/div64.h:60:1: error: type specifier missing, defaults to 'int'; ISO C99 and later do not support implicit int [-Wimplicit-int]
__arch_xprod_64(uint64_t m, uint64_t n, bool bias)
^
In file included from arch/arm/kernel/asm-offsets.c:12:
In file included from include/linux/mm.h:1120:
In file included from include/linux/huge_mm.h:8:
In file included from include/linux/fs.h:33:
In file included from include/linux/percpu-rwsem.h:7:
In file included from include/linux/rcuwait.h:6:
In file included from include/linux/sched/signal.h:6:
include/linux/signal.h:98:11: warning: array index 3 is past the end of the array (that has type 'unsigned long[2]') [-Warray-bounds]
return (set->sig[3] | set->sig[2] |
^ ~
arch/arm/include/asm/signal.h:17:2: note: array 'sig' declared here
unsigned long sig[_NSIG_WORDS];
^
In file included from arch/arm/kernel/asm-offsets.c:12:
In file included from include/linux/mm.h:1120:
In file included from include/linux/huge_mm.h:8:
In file included from include/linux/fs.h:33:
In file included from include/linux/percpu-rwsem.h:7:
In file included from include/linux/rcuwait.h:6:
In file included from include/linux/sched/signal.h:6:
include/linux/signal.h:98:25: warning: array index 2 is past the end of the array (that has type 'unsigned long[2]') [-Warray-bounds]
return (set->sig[3] | set->sig[2] |
^ ~
arch/arm/include/asm/signal.h:17:2: note: array 'sig' declared here
unsigned long sig[_NSIG_WORDS];
^
In file included from arch/arm/kernel/asm-offsets.c:12:
In file included from include/linux/mm.h:1120:
In file included from include/linux/huge_mm.h:8:
In file included from include/linux/fs.h:33:
In file included from include/linux/percpu-rwsem.h:7:
In file included from include/linux/rcuwait.h:6:
In file included from include/linux/sched/signal.h:6:
include/linux/signal.h:114:11: warning: array index 3 is past the end of the array (that has type 'const unsigned long[2]') [-Warray-bounds]
return (set1->sig[3] == set2->sig[3]) &&
^ ~
arch/arm/include/asm/signal.h:17:2: note: array 'sig' declared here
unsigned long sig[_NSIG_WORDS];
^
In file included from arch/arm/kernel/asm-offsets.c:12:
In file included from include/linux/mm.h:1120:
In file included from include/linux/huge_mm.h:8:
In file included from include/linux/fs.h:33:
In file included from include/linux/percpu-rwsem.h:7:
In file included from include/linux/rcuwait.h:6:
In file included from include/linux/sched/signal.h:6:
include/linux/signal.h:114:27: warning: array index 3 is past the end of the array (that has type 'const unsigned long[2]') [-Warray-bounds]
return (set1->sig[3] == set2->sig[3]) &&
^ ~
arch/arm/include/asm/signal.h:17:2: note: array 'sig' declared here
unsigned long sig[_NSIG_WORDS];
^
In file included from arch/arm/kernel/asm-offsets.c:12:
In file included from include/linux/mm.h:1120:
In file included from include/linux/huge_mm.h:8:
In file included from include/linux/fs.h:33:
In file included from include/linux/percpu-rwsem.h:7:
In file included from include/linux/rcuwait.h:6:
In file included from include/linux/sched/signal.h:6:
include/linux/signal.h:115:5: warning: array index 2 is past the end of the array (that has type 'const unsigned long[2]') [-Warray-bounds]
(set1->sig[2] == set2->sig[2]) &&
^ ~
arch/arm/include/asm/signal.h:17:2: note: array 'sig' declared here
unsigned long sig[_NSIG_WORDS];
^
In file included from arch/arm/kernel/asm-offsets.c:12:
In file included from include/linux/mm.h:1120:
In file included from include/linux/huge_mm.h:8:
In file included from include/linux/fs.h:33:
In file included from include/linux/percpu-rwsem.h:7:
In file included from include/linux/rcuwait.h:6:
In file included from include/linux/sched/signal.h:6:
include/linux/signal.h:115:21: warning: array index 2 is past the end of the array (that has type 'const unsigned long[2]') [-Warray-bounds]
(set1->sig[2] == set2->sig[2]) &&
^ ~
arch/arm/include/asm/signal.h:17:2: note: array 'sig' declared here
unsigned long sig[_NSIG_WORDS];
^
In file included from arch/arm/kernel/asm-offsets.c:12:
In file included from include/linux/mm.h:1120:
In file included from include/linux/huge_mm.h:8:
In file included from include/linux/fs.h:33:
In file included from include/linux/percpu-rwsem.h:7:
In file included from include/linux/rcuwait.h:6:
In file included from include/linux/sched/signal.h:6:
include/linux/signal.h:157:1: warning: array index 3 is past the end of the array (that has type 'const unsigned long[2]') [-Warray-bounds]
_SIG_SET_BINOP(sigorsets, _sig_or)
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
include/linux/signal.h:138:8: note: expanded from macro '_SIG_SET_BINOP'
a3 = a->sig[3]; a2 = a->sig[2]; \
^ ~
arch/arm/include/asm/signal.h:17:2: note: array 'sig' declared here
unsigned long sig[_NSIG_WORDS];
^
In file included from arch/arm/kernel/asm-offsets.c:12:
In file included from include/linux/mm.h:1120:
In file included from include/linux/huge_mm.h:8:
In file included from include/linux/fs.h:33:
vim +/int +60 arch/arm/include/asm/div64.h
54
55 #ifdef CONFIG_CC_OPTIMIZE_FOR_PERFORMANCE
56 static __always_inline
57 #else
58 static inline
59 #endif
> 60 __arch_xprod_64(uint64_t m, uint64_t n, bool bias)
61 {
62 unsigned long long res;
63 register unsigned int tmp asm("ip") = 0;
64 bool no_ovf = __builtin_constant_p(m) &&
65 ((m >> 32) + (m & 0xffffffff) < 0x100000000);
66
67 if (!bias) {
68 asm ( "umull %Q0, %R0, %Q1, %Q2\n\t"
69 "mov %Q0, #0"
70 : "=&r" (res)
71 : "r" (m), "r" (n)
72 : "cc");
73 } else if (no_ovf) {
74 res = m;
75 asm ( "umlal %Q0, %R0, %Q1, %Q2\n\t"
76 "mov %Q0, #0"
77 : "+&r" (res)
78 : "r" (m), "r" (n)
79 : "cc");
80 } else {
81 asm ( "umull %Q0, %R0, %Q2, %Q3\n\t"
82 "cmn %Q0, %Q2\n\t"
83 "adcs %R0, %R0, %R2\n\t"
84 "adc %Q0, %1, #0"
85 : "=&r" (res), "+&r" (tmp)
86 : "r" (m), "r" (n)
87 : "cc");
88 }
89
90 if (no_ovf) {
91 asm ( "umlal %R0, %Q0, %R1, %Q2\n\t"
92 "umlal %R0, %Q0, %Q1, %R2\n\t"
93 "mov %R0, #0\n\t"
94 "umlal %Q0, %R0, %R1, %R2"
95 : "+&r" (res)
96 : "r" (m), "r" (n)
97 : "cc");
98 } else {
99 asm ( "umlal %R0, %Q0, %R2, %Q3\n\t"
100 "umlal %R0, %1, %Q2, %R3\n\t"
101 "mov %R0, #0\n\t"
102 "adds %Q0, %1, %Q0\n\t"
103 "adc %R0, %R0, #0\n\t"
104 "umlal %Q0, %R0, %R2, %R3"
105 : "+&r" (res), "+&r" (tmp)
106 : "r" (m), "r" (n)
107 : "cc");
108 }
109
110 return res;
111 }
112 #define __arch_xprod_64 __arch_xprod_64
113
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
On Sun, Jul 7, 2024, at 19:17, Nicolas Pitre wrote:
> From: Nicolas Pitre <npitre@baylibre.com>
>
> Recent gcc versions started not systematically inline __arch_xprod64()
> and that has performance implications. Give the compiler the freedom to
> decide only when optimizing for size.
>
> Signed-off-by: Nicolas Pitre <npitre@baylibre.com>
Seems reasonable. Just to make sure: do you know if the non-inline
version of xprod_64 ends up producing a more efficient division
result than the __do_div64() code path on arch/arm?
Arnd
On Sun, 7 Jul 2024, Arnd Bergmann wrote:

> On Sun, Jul 7, 2024, at 19:17, Nicolas Pitre wrote:
> > From: Nicolas Pitre <npitre@baylibre.com>
> >
> > Recent gcc versions started not systematically inline __arch_xprod64()
> > and that has performance implications. Give the compiler the freedom to
> > decide only when optimizing for size.
> >
> > Signed-off-by: Nicolas Pitre <npitre@baylibre.com>
>
> Seems reasonable. Just to make sure: do you know if the non-inline
> version of xprod_64 ends up producing a more effecient division
> result than the __do_div64() code path on arch/arm?

__arch_xprod_64() is part of the __do_div64() code path. So I'm not sure
of your question.

Obviously, having __arch_xprod_64() inlined is faster but it increases
binary size.

Nicolas
On Sun, Jul 7, 2024, at 21:14, Nicolas Pitre wrote:
> On Sun, 7 Jul 2024, Arnd Bergmann wrote:
>
>> On Sun, Jul 7, 2024, at 19:17, Nicolas Pitre wrote:
>> > From: Nicolas Pitre <npitre@baylibre.com>
>> >
>> > Recent gcc versions started not systematically inline __arch_xprod64()
>> > and that has performance implications. Give the compiler the freedom to
>> > decide only when optimizing for size.
>> >
>> > Signed-off-by: Nicolas Pitre <npitre@baylibre.com>
>>
>> Seems reasonable. Just to make sure: do you know if the non-inline
>> version of xprod_64 ends up producing a more effecient division
>> result than the __do_div64() code path on arch/arm?
>
> __arch_xprod_64() is part of the __do_div64() code path. So I'm not sure
> of your question.
>
> Obviously, having __arch_xprod_64() inlined is faster but it increases
> binary size.
I meant whether calling __div64_const32->__arch_xprod_64() is
still faster for a constant base when the new __arch_xprod_64()
is out of line, compared to the __div64_32->__do_div64()
assembly code path we take for a non-constant base.
Arnd
On Sun, 7 Jul 2024, Arnd Bergmann wrote:

> On Sun, Jul 7, 2024, at 21:14, Nicolas Pitre wrote:
> > On Sun, 7 Jul 2024, Arnd Bergmann wrote:
> >
> >> On Sun, Jul 7, 2024, at 19:17, Nicolas Pitre wrote:
> >> > From: Nicolas Pitre <npitre@baylibre.com>
> >> >
> >> > Recent gcc versions started not systematically inline __arch_xprod64()
> >> > and that has performance implications. Give the compiler the freedom to
> >> > decide only when optimizing for size.
> >> >
> >> > Signed-off-by: Nicolas Pitre <npitre@baylibre.com>
> >>
> >> Seems reasonable. Just to make sure: do you know if the non-inline
> >> version of xprod_64 ends up producing a more effecient division
> >> result than the __do_div64() code path on arch/arm?
> >
> > __arch_xprod_64() is part of the __do_div64() code path. So I'm not sure
> > of your question.
> >
> > Obviously, having __arch_xprod_64() inlined is faster but it increases
> > binary size.
>
> I meant whether calling __div64_const32->__arch_xprod_64() is
> still faster for a constant base when the new __arch_xprod_64()
> is out of line, compared to the __div64_32->__do_div64()
> assembly code path we take for a non-constant base.

Oh, most likely yes. The non-constant base has to go through the whole
one-bit-at-a-time division loop whereas the constant base with
__div64_const32 results in 4 64-bits multiply and add. Moving
__arch_xprod_64() out of line adds the argument shuffling overhead and
it can't skip overflow handling, but still.

Here are some numbers. With latest patches using __always_inline:

test_div64: Starting 64bit/32bit division and modulo test
test_div64: Completed 64bit/32bit division and modulo test, 0.048285584s elapsed

Latest patches but __always_inline left out:

test_div64: Starting 64bit/32bit division and modulo test
test_div64: Completed 64bit/32bit division and modulo test, 0.053023584s elapsed

Forcing both constant and non-constant base through the same path:

test_div64: Starting 64bit/32bit division and modulo test
test_div64: Completed 64bit/32bit division and modulo test, 0.103263776s elapsed

It is worth noting that test_div64 does half the test with non constant
divisors already so the impact is greater than what those numbers show.

And for what it is worth, those numbers were obtained using QEMU. The
gcc version is 14.1.0.

Nicolas
On Mon, Jul 8, 2024, at 03:21, Nicolas Pitre wrote:
> On Sun, 7 Jul 2024, Arnd Bergmann wrote:
>> On Sun, Jul 7, 2024, at 21:14, Nicolas Pitre wrote:
>
> Oh, most likely yes. The non-constant base has to go through the whole
> one-bit-at-a-time division loop whereas the constant base with
> __div64_const32 results in 4 64-bits multiply and add. Moving
> __arch_xprod_64() out of line adds the argument shuffling overhead and
> it can't skip overflow handling, but still.
>
> Here's some numbers. With latest patches using __always_inline:
>
> test_div64: Starting 64bit/32bit division and modulo test
> test_div64: Completed 64bit/32bit division and modulo test, 0.048285584s elapsed
>
> Latest patches but __always_inline left out:
>
> test_div64: Starting 64bit/32bit division and modulo test
> test_div64: Completed 64bit/32bit division and modulo test, 0.053023584s elapsed
>
> Forcing both constant and non-constant base through the same path:
>
> test_div64: Starting 64bit/32bit division and modulo test
> test_div64: Completed 64bit/32bit division and modulo test, 0.103263776s elapsed
>
> It is worth noting that test_div64 does half the test with non constant
> divisors already so the impact is greater than what those numbers show.
>
> And for what it is worth, those numbers were obtained using QEMU. The
> gcc version is 14.1.0.
Right, so with the numbers in qemu matching your explanation,
that seems reasonable to assume it will behave the same way
across a wide range of physical CPUs.
Arnd
© 2016 - 2026 Red Hat, Inc.