[PATCH 2/2] asm-generic/div64: reimplement __arch_xprod64()

Nicolas Pitre posted 2 patches 1 year, 7 months ago
There is a newer version of this series
Posted by Nicolas Pitre 1 year, 7 months ago
From: Nicolas Pitre <npitre@baylibre.com>

Several years later I just realized that this code could be optimized
and more importantly simplified even further. With some reordering, it
is possible to dispense with overflow handling entirely and still have
optimal code.

There is also no longer a reason to have the possibility for
architectures to override the generic version. Only ARM did it and these
days the compiler does a better job than the hand-crafted assembly
version anyway.

The kernel binary also gets slightly smaller. Using ARM's
versatile_defconfig plus CONFIG_TEST_DIV64=y:

Before this patch:

   text    data     bss     dec     hex filename
9644668 2743926  193424 12582018         bffc82 vmlinux

With this patch:

   text    data     bss     dec     hex filename
9643572 2743926  193424 12580922         bff83a vmlinux

Signed-off-by: Nicolas Pitre <npitre@baylibre.com>
---
 include/asm-generic/div64.h | 105 +++++++++++-------------------------
 1 file changed, 31 insertions(+), 74 deletions(-)

diff --git a/include/asm-generic/div64.h b/include/asm-generic/div64.h
index 13f5aa68a4..0741c2b003 100644
--- a/include/asm-generic/div64.h
+++ b/include/asm-generic/div64.h
@@ -116,98 +116,55 @@
 		___m = (~0ULL / ___b) * ___p;				\
 		___m += ((~0ULL % ___b + 1) * ___p) / ___b;		\
 	} else {							\
-		/*							\
-		 * Reduce m / p, and try to clear bit 31 of m when	\
-		 * possible, otherwise that'll need extra overflow	\
-		 * handling later.					\
-		 */							\
-		uint32_t ___bits = -(___m & -___m);			\
-		___bits |= ___m >> 32;					\
-		___bits = (~___bits) << 1;				\
-		/*							\
-		 * If ___bits == 0 then setting bit 31 is  unavoidable.	\
-		 * Simply apply the maximum possible reduction in that	\
-		 * case. Otherwise the MSB of ___bits indicates the	\
-		 * best reduction we should apply.			\
-		 */							\
-		if (!___bits) {						\
-			___p /= (___m & -___m);				\
-			___m /= (___m & -___m);				\
-		} else {						\
-			___p >>= ilog2(___bits);			\
-			___m >>= ilog2(___bits);			\
-		}							\
+		/* Reduce m / p */					\
+		___p /= (___m & -___m);					\
+		___m /= (___m & -___m);					\
 		/* No bias needed. */					\
 		___bias = 0;						\
 	}								\
 									\
 	/*								\
-	 * Now we have a combination of 2 conditions:			\
-	 *								\
-	 * 1) whether or not we need to apply a bias, and		\
-	 *								\
-	 * 2) whether or not there might be an overflow in the cross	\
-	 *    product determined by (___m & ((1 << 63) | (1 << 31))).	\
-	 *								\
-	 * Select the best way to do (m_bias + m * n) / (1 << 64).	\
+	 * Perform (m_bias + m * n) / (1 << 64).			\
 	 * From now on there will be actual runtime code generated.	\
 	 */								\
-	___res = __arch_xprod_64(___m, ___n, ___bias);			\
+	___res = __xprod_64(___m, ___n, ___bias);			\
 									\
 	___res /= ___p;							\
 })
 
-#ifndef __arch_xprod_64
 /*
- * Default C implementation for __arch_xprod_64()
- *
- * Prototype: uint64_t __arch_xprod_64(const uint64_t m, uint64_t n, bool bias)
  * Semantic:  retval = ((bias ? m : 0) + m * n) >> 64
  *
  * The product is a 128-bit value, scaled down to 64 bits.
- * Assuming constant propagation to optimize away unused conditional code.
- * Architectures may provide their own optimized assembly implementation.
  */
-static inline uint64_t __arch_xprod_64(const uint64_t m, uint64_t n, bool bias)
+static inline uint64_t __xprod_64(const uint64_t m, uint64_t n, bool bias)
 {
-	uint32_t m_lo = m;
-	uint32_t m_hi = m >> 32;
-	uint32_t n_lo = n;
-	uint32_t n_hi = n >> 32;
-	uint64_t res;
-	uint32_t res_lo, res_hi, tmp;
-
-	if (!bias) {
-		res = ((uint64_t)m_lo * n_lo) >> 32;
-	} else if (!(m & ((1ULL << 63) | (1ULL << 31)))) {
-		/* there can't be any overflow here */
-		res = (m + (uint64_t)m_lo * n_lo) >> 32;
-	} else {
-		res = m + (uint64_t)m_lo * n_lo;
-		res_lo = res >> 32;
-		res_hi = (res_lo < m_hi);
-		res = res_lo | ((uint64_t)res_hi << 32);
-	}
-
-	if (!(m & ((1ULL << 63) | (1ULL << 31)))) {
-		/* there can't be any overflow here */
-		res += (uint64_t)m_lo * n_hi;
-		res += (uint64_t)m_hi * n_lo;
-		res >>= 32;
-	} else {
-		res += (uint64_t)m_lo * n_hi;
-		tmp = res >> 32;
-		res += (uint64_t)m_hi * n_lo;
-		res_lo = res >> 32;
-		res_hi = (res_lo < tmp);
-		res = res_lo | ((uint64_t)res_hi << 32);
-	}
-
-	res += (uint64_t)m_hi * n_hi;
-
-	return res;
-}
+	union {
+		uint64_t v;
+		struct {
+#if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
+			uint32_t l;
+			uint32_t h;
+#elif __BYTE_ORDER__ == __ORDER_BIG_ENDIAN__
+			uint32_t h;
+			uint32_t l;
+#else
+#error "unknown endianness"
 #endif
+		};
+	} A, B, X, Y, Z;
+
+	A.v = m;
+	B.v = n;
+
+	X.v = (uint64_t)A.l * B.l + (bias ? m : 0);
+	Y.v = (uint64_t)A.l * B.h + X.h;
+	Z.v = (uint64_t)A.h * B.h + Y.h;
+	Y.v = (uint64_t)A.h * B.l + Y.l;
+	Z.v += Y.h;
+
+	return Z.v;
+}
 
 #ifndef __div64_32
 extern uint32_t __div64_32(uint64_t *dividend, uint32_t divisor);
-- 
2.45.2
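
For readers following along, here is a minimal user-space sketch of the
word-by-word scheme used by __xprod_64() above, checked against a
compiler-provided 128-bit type. It is illustrative only and not part of
the patch: the names below are made up for the example, explicit shifts
and casts are used instead of the byte-order union for brevity, and the
reference needs a compiler with unsigned __int128 (any 64-bit gcc or
clang).

/*
 * Illustrative sketch only, not part of the patch: check that the
 * 32x32 cross-product sequence matches the stated semantic
 * ((bias ? m : 0) + m * n) >> 64 computed with a native 128-bit type.
 * Build with e.g.: gcc -O2 -o xprod_demo xprod_demo.c && ./xprod_demo
 */
#include <stdint.h>
#include <stdbool.h>
#include <stdio.h>

static uint64_t xprod_64_sketch(uint64_t m, uint64_t n, bool bias)
{
	uint64_t a_l = (uint32_t)m, a_h = m >> 32;
	uint64_t b_l = (uint32_t)n, b_h = n >> 32;
	uint64_t x, y, z;

	x = a_l * b_l + (bias ? m : 0);	/* partial sum, LSB at bit 0   */
	y = a_l * b_h + (x >> 32);	/* partial sum, LSB at bit 32  */
	z = a_h * b_h + (y >> 32);	/* partial sum, LSB at bit 64  */
	y = a_h * b_l + (uint32_t)y;	/* partial sum, LSB at bit 32  */
	return z + (y >> 32);		/* high 64 bits of the product */
}

static uint64_t xprod_64_ref(uint64_t m, uint64_t n, bool bias)
{
	unsigned __int128 p = (unsigned __int128)m * n + (bias ? m : 0);

	return (uint64_t)(p >> 64);
}

int main(void)
{
	/* arbitrary test values; bias only exercised with a small m here */
	uint64_t m = 0xaaaaaaaaaaaaaaabULL, n = 0x0123456789abcdefULL;

	printf("no bias: %s\n",
	       xprod_64_sketch(m, n, false) == xprod_64_ref(m, n, false) ?
	       "ok" : "MISMATCH");
	printf("bias:    %s\n",
	       xprod_64_sketch(1000000007ULL, n, true) ==
	       xprod_64_ref(1000000007ULL, n, true) ? "ok" : "MISMATCH");
	return 0;
}
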
Re: [PATCH 2/2] asm-generic/div64: reimplement __arch_xprod64()
Posted by kernel test robot 1 year, 7 months ago
Hi Nicolas,

kernel test robot noticed the following build warnings:

[auto build test WARNING on arnd-asm-generic/master]
[also build test WARNING on arm/for-next arm/fixes soc/for-next linus/master v6.10-rc6 next-20240703]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Nicolas-Pitre/ARM-div64-remove-custom-__arch_xprod_64/20240705-195905
base:   https://git.kernel.org/pub/scm/linux/kernel/git/arnd/asm-generic.git master
patch link:    https://lore.kernel.org/r/20240705022334.1378363-3-nico%40fluxnic.net
patch subject: [PATCH 2/2] asm-generic/div64: reimplement __arch_xprod64()
config: parisc-defconfig (https://download.01.org/0day-ci/archive/20240706/202407061823.qXqNlq8o-lkp@intel.com/config)
compiler: hppa-linux-gcc (GCC) 13.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20240706/202407061823.qXqNlq8o-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202407061823.qXqNlq8o-lkp@intel.com/

All warnings (new ones prefixed by >>):

   security/keys/proc.c: In function 'proc_keys_show':
   security/keys/proc.c:215:45: warning: 'sprintf' may write a terminating nul past the end of the destination [-Wformat-overflow=]
     215 |                         sprintf(xbuf, "%llud", div_u64(timo, 60 * 60 * 24));
         |                                             ^
   security/keys/proc.c:215:25: note: 'sprintf' output between 3 and 17 bytes into a destination of size 16
     215 |                         sprintf(xbuf, "%llud", div_u64(timo, 60 * 60 * 24));
         |                         ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
   security/keys/proc.c:213:44: warning: 'h' directive writing 1 byte into a region of size between 0 and 15 [-Wformat-overflow=]
     213 |                         sprintf(xbuf, "%lluh", div_u64(timo, 60 * 60));
         |                                            ^
   security/keys/proc.c:213:25: note: 'sprintf' output between 3 and 18 bytes into a destination of size 16
     213 |                         sprintf(xbuf, "%lluh", div_u64(timo, 60 * 60));
         |                         ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>> security/keys/proc.c:211:40: warning: '%llu' directive writing between 1 and 19 bytes into a region of size 16 [-Wformat-overflow=]
     211 |                         sprintf(xbuf, "%llum", div_u64(timo, 60));
         |                                        ^~~~
   security/keys/proc.c:211:39: note: directive argument in the range [0, 1229782946264575725]
     211 |                         sprintf(xbuf, "%llum", div_u64(timo, 60));
         |                                       ^~~~~~~
   security/keys/proc.c:211:25: note: 'sprintf' output between 3 and 21 bytes into a destination of size 16
     211 |                         sprintf(xbuf, "%llum", div_u64(timo, 60));
         |                         ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~


vim +211 security/keys/proc.c

^1da177e4c3f415 Linus Torvalds 2005-04-16  152  
^1da177e4c3f415 Linus Torvalds 2005-04-16  153  static int proc_keys_show(struct seq_file *m, void *v)
^1da177e4c3f415 Linus Torvalds 2005-04-16  154  {
^1da177e4c3f415 Linus Torvalds 2005-04-16  155  	struct rb_node *_p = v;
^1da177e4c3f415 Linus Torvalds 2005-04-16  156  	struct key *key = rb_entry(_p, struct key, serial_node);
ab5c69f01313c80 Eric Biggers   2017-09-27  157  	unsigned long flags;
927942aabbbe506 David Howells  2010-06-11  158  	key_ref_t key_ref, skey_ref;
074d58989569b39 Baolin Wang    2017-11-15  159  	time64_t now, expiry;
03dab869b7b239c David Howells  2016-10-26  160  	char xbuf[16];
363b02dab09b322 David Howells  2017-10-04  161  	short state;
074d58989569b39 Baolin Wang    2017-11-15  162  	u64 timo;
06ec7be557a1259 Michael LeMay  2006-06-26  163  	int rc;
06ec7be557a1259 Michael LeMay  2006-06-26  164  
4bdf0bc30031414 David Howells  2013-09-24  165  	struct keyring_search_context ctx = {
ede0fa98a900e65 Eric Biggers   2019-02-22  166  		.index_key		= key->index_key,
4aa68e07d845562 Eric Biggers   2017-09-18  167  		.cred			= m->file->f_cred,
462919591a1791e David Howells  2014-09-16  168  		.match_data.cmp		= lookup_user_key_possessed,
462919591a1791e David Howells  2014-09-16  169  		.match_data.raw_data	= key,
462919591a1791e David Howells  2014-09-16  170  		.match_data.lookup_type	= KEYRING_SEARCH_LOOKUP_DIRECT,
dcf49dbc8077e27 David Howells  2019-06-26  171  		.flags			= (KEYRING_SEARCH_NO_STATE_CHECK |
dcf49dbc8077e27 David Howells  2019-06-26  172  					   KEYRING_SEARCH_RECURSE),
4bdf0bc30031414 David Howells  2013-09-24  173  	};
4bdf0bc30031414 David Howells  2013-09-24  174  
028db3e290f15ac Linus Torvalds 2019-07-10  175  	key_ref = make_key_ref(key, 0);
927942aabbbe506 David Howells  2010-06-11  176  
927942aabbbe506 David Howells  2010-06-11  177  	/* determine if the key is possessed by this process (a test we can
927942aabbbe506 David Howells  2010-06-11  178  	 * skip if the key does not indicate the possessor can view it
927942aabbbe506 David Howells  2010-06-11  179  	 */
028db3e290f15ac Linus Torvalds 2019-07-10  180  	if (key->perm & KEY_POS_VIEW) {
028db3e290f15ac Linus Torvalds 2019-07-10  181  		rcu_read_lock();
e59428f721ee096 David Howells  2019-06-19  182  		skey_ref = search_cred_keyrings_rcu(&ctx);
028db3e290f15ac Linus Torvalds 2019-07-10  183  		rcu_read_unlock();
927942aabbbe506 David Howells  2010-06-11  184  		if (!IS_ERR(skey_ref)) {
927942aabbbe506 David Howells  2010-06-11  185  			key_ref_put(skey_ref);
927942aabbbe506 David Howells  2010-06-11  186  			key_ref = make_key_ref(key, 1);
927942aabbbe506 David Howells  2010-06-11  187  		}
927942aabbbe506 David Howells  2010-06-11  188  	}
927942aabbbe506 David Howells  2010-06-11  189  
4aa68e07d845562 Eric Biggers   2017-09-18  190  	/* check whether the current task is allowed to view the key */
f5895943d91b41b David Howells  2014-03-14  191  	rc = key_task_permission(key_ref, ctx.cred, KEY_NEED_VIEW);
06ec7be557a1259 Michael LeMay  2006-06-26  192  	if (rc < 0)
028db3e290f15ac Linus Torvalds 2019-07-10  193  		return 0;
^1da177e4c3f415 Linus Torvalds 2005-04-16  194  
074d58989569b39 Baolin Wang    2017-11-15  195  	now = ktime_get_real_seconds();
^1da177e4c3f415 Linus Torvalds 2005-04-16  196  
028db3e290f15ac Linus Torvalds 2019-07-10  197  	rcu_read_lock();
028db3e290f15ac Linus Torvalds 2019-07-10  198  
^1da177e4c3f415 Linus Torvalds 2005-04-16  199  	/* come up with a suitable timeout value */
ab5c69f01313c80 Eric Biggers   2017-09-27  200  	expiry = READ_ONCE(key->expiry);
39299bdd2546688 David Howells  2023-12-09  201  	if (expiry == TIME64_MAX) {
^1da177e4c3f415 Linus Torvalds 2005-04-16  202  		memcpy(xbuf, "perm", 5);
074d58989569b39 Baolin Wang    2017-11-15  203  	} else if (now >= expiry) {
^1da177e4c3f415 Linus Torvalds 2005-04-16  204  		memcpy(xbuf, "expd", 5);
7b1b9164598286f David Howells  2009-09-02  205  	} else {
074d58989569b39 Baolin Wang    2017-11-15  206  		timo = expiry - now;
^1da177e4c3f415 Linus Torvalds 2005-04-16  207  
^1da177e4c3f415 Linus Torvalds 2005-04-16  208  		if (timo < 60)
074d58989569b39 Baolin Wang    2017-11-15  209  			sprintf(xbuf, "%llus", timo);
^1da177e4c3f415 Linus Torvalds 2005-04-16  210  		else if (timo < 60*60)
074d58989569b39 Baolin Wang    2017-11-15 @211  			sprintf(xbuf, "%llum", div_u64(timo, 60));
^1da177e4c3f415 Linus Torvalds 2005-04-16  212  		else if (timo < 60*60*24)
074d58989569b39 Baolin Wang    2017-11-15  213  			sprintf(xbuf, "%lluh", div_u64(timo, 60 * 60));
^1da177e4c3f415 Linus Torvalds 2005-04-16  214  		else if (timo < 60*60*24*7)
074d58989569b39 Baolin Wang    2017-11-15 @215  			sprintf(xbuf, "%llud", div_u64(timo, 60 * 60 * 24));
^1da177e4c3f415 Linus Torvalds 2005-04-16  216  		else
074d58989569b39 Baolin Wang    2017-11-15  217  			sprintf(xbuf, "%lluw", div_u64(timo, 60 * 60 * 24 * 7));
^1da177e4c3f415 Linus Torvalds 2005-04-16  218  	}
^1da177e4c3f415 Linus Torvalds 2005-04-16  219  
363b02dab09b322 David Howells  2017-10-04  220  	state = key_read_state(key);
363b02dab09b322 David Howells  2017-10-04  221  

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
Re: [PATCH 2/2] asm-generic/div64: reimplement __arch_xprod64()
Posted by Arnd Bergmann 1 year, 7 months ago
On Fri, Jul 5, 2024, at 04:20, Nicolas Pitre wrote:
> From: Nicolas Pitre <npitre@baylibre.com>
>
> Several years later I just realized that this code could be optimized
> and more importantly simplified even further. With some reordering, it
> is possible to dispense with overflow handling entirely and still have
> optimal code.
>
> There is also no longer a reason to have the possibility for
> architectures to override the generic version. Only ARM did it and these
> days the compiler does a better job than the hand-crafted assembly
> version anyway.
>
> The kernel binary also gets slightly smaller. Using ARM's
> versatile_defconfig plus CONFIG_TEST_DIV64=y:
>
> Before this patch:
>
>    text    data     bss     dec     hex filename
> 9644668 2743926  193424 12582018         bffc82 vmlinux
>
> With this patch:
>
>    text    data     bss     dec     hex filename
> 9643572 2743926  193424 12580922         bff83a vmlinux
>
> Signed-off-by: Nicolas Pitre <npitre@baylibre.com>

This looks really nice, thanks for the work!

I've tried reproducing your finding to see what compiler
version started being good enough to benefit from the
new version. Looking at just the vmlinux size as you did
above, I can confirm that the generated code is noticeably
smaller in gcc-11 and above, slightly smaller in gcc-10
but larger in gcc-9 and below. With gcc-10 being 4 years
old now and already in debian 'oldstable', that should be
good enough.

Unfortunately, I see that clang-19 still produces smaller
arm code with the old version, so this is likely missing
some optimization that went into gcc. Specifically these
are the numbers I see for an armv7 defconfig with many
drivers disabled for faster builds, comparing the current
upstream version with inline asm, the upstream C version
(patch 1/2 applied) and the new C version (both applied):

text	   data	    bss	    dec	    hex	filename
6332190	2577094	 257664	9166948	 8be064	vmlinux-old-asm
6334518	2577158	 257664	9169340	 8be9bc	vmlinux-old-C
6333366	2577158	 257664	9168188	 8be53c	vmlinux-new-C

The results for clang-14 are very similar. Adding Nathan
and the llvm linux mailing list to see if anyone there
thinks we need to dig deeper on whether llvm should handle
this better.

I also checked a few other 32-bit targets with gcc-14
and found that mips and powerpc get slightly worse with
your new version, while x86 doesn't use this code and
is unaffected.

With all this said, I think we still want your patch
or something very close to it because the new version
is so much more readable and better on the one 32-bit
config that users care about in practice (armv7 with
modern gcc), but it would be nice if we could find
a way to not make it worse for the other configurations.

       Arnd
Re: [PATCH 2/2] asm-generic/div64: reimplement __arch_xprod64()
Posted by Nicolas Pitre 1 year, 7 months ago
On Fri, 5 Jul 2024, Arnd Bergmann wrote:

> On Fri, Jul 5, 2024, at 04:20, Nicolas Pitre wrote:
> > From: Nicolas Pitre <npitre@baylibre.com>
> >
> > Several years later I just realized that this code could be optimized
> > and more importantly simplified even further. With some reordering, it
> > is possible to dispense with overflow handling entirely and still have
> > optimal code.
> >
> > There is also no longer a reason to have the possibility for
> > architectures to override the generic version. Only ARM did it and these
> > days the compiler does a better job than the hand-crafted assembly
> > version anyway.
> >
> > The kernel binary also gets slightly smaller. Using ARM's
> > versatile_defconfig plus CONFIG_TEST_DIV64=y:
> >
> > Before this patch:
> >
> >    text    data     bss     dec     hex filename
> > 9644668 2743926  193424 12582018         bffc82 vmlinux
> >
> > With this patch:
> >
> >    text    data     bss     dec     hex filename
> > 9643572 2743926  193424 12580922         bff83a vmlinux
> >
> > Signed-off-by: Nicolas Pitre <npitre@baylibre.com>
> 
> This looks really nice, thanks for the work!
> 
> I've tried reproducing your finding to see what compiler
> version started being good enough to benefit from the
> new version. Looking at just the vmlinux size as you did
> above, I can confirm that the generated code is noticeably
> smaller in gcc-11 and above, slightly smaller in gcc-10
> but larger in gcc-9 and below.

Well well... Turns out that binary size is a bad metric here. The main
reason the compiled code gets smaller is that gcc decides _not_ to
inline __arch_xprod_64(). That makes the kernel smaller, but a bunch of
conditionals in there were really meant to be resolved at compile time
in order to generate the best code for each instance. With a
non-inlined version, those conditionals are no longer based on
constants and the compiler emits code to determine at runtime whether 2
or 3 instructions can be saved, which completely defeats the purpose in
addition to making performance worse.
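
(For illustration only, and not the actual rework: the effect described
above can be seen with a toy helper like the one below. Every call site
generated by __div64_const32() passes a compile-time constant 'bias',
so the conditional is meant to vanish -- but only if the helper really
is inlined, e.g. by forcing it with gcc's always_inline attribute; an
out-of-line copy has to test 'bias' at run time on every call.)

/*
 * Toy sketch, not the kernel helper: 'bias' is constant at every call
 * site, so with guaranteed inlining the '(bias ? m : 0)' term folds
 * away and each caller gets the minimal instruction sequence; without
 * inlining, a runtime test on 'bias' is emitted instead.
 */
#include <stdint.h>
#include <stdbool.h>

static inline __attribute__((always_inline))
uint64_t mul_hi32_sketch(uint32_t m, uint32_t n, bool bias)
{
	return ((uint64_t)m * n + (bias ? m : 0)) >> 32;
}

/* arbitrary multiplier, just to give the compiler something constant */
uint64_t scale_no_bias(uint32_t n)   { return mul_hi32_sketch(0xaaaaaaab, n, false); }
uint64_t scale_with_bias(uint32_t n) { return mul_hi32_sketch(0xaaaaaaab, n, true); }
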

So I've reworked it all again, this time taking into account the 
possibility for the compiler not to inline that code sometimes. Plus 
some more simplifications.


Nicolas