Following the improvements to Xen's bitops, fls() does constant propagation in
all cases. Use it, and drop the local opencoded helper.
No functional change.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
CC: Stefano Stabellini <sstabellini@kernel.org>
CC: Julien Grall <julien@xen.org>
CC: Volodymyr Babchuk <Volodymyr_Babchuk@epam.com>
CC: Bertrand Marquis <bertrand.marquis@arm.com>
CC: Michal Orzel <michal.orzel@amd.com>
ARM32 gets a very minor code generation improvement:
xen.git/xen$ ../scripts/bloat-o-meter xen-syms-arm32-{before,after}
add/remove: 0/0 grow/shrink: 0/6 up/down: 0/-48 (-48)
Function old new delta
wallclock_time 288 280 -8
printk_start_of_line 560 552 -8
domain_vtimer_init 472 464 -8
do_settime 376 368 -8
burn_credits 760 752 -8
__printk_ratelimit 424 416 -8
But it's just a couple of operations improvement and no real change in code
structure. I expect that the constant propagation being done through
__builtin_clz(), rather than pure C, is giving the optimiser a bit more
information to work with.
This file also has an __GNUC__ < 4 case which seems ripe for removing...
---
xen/arch/arm/include/asm/div64.h | 18 +++---------------
1 file changed, 3 insertions(+), 15 deletions(-)
diff --git a/xen/arch/arm/include/asm/div64.h b/xen/arch/arm/include/asm/div64.h
index 0459d5cc0122..da1f1fcbd503 100644
--- a/xen/arch/arm/include/asm/div64.h
+++ b/xen/arch/arm/include/asm/div64.h
@@ -102,7 +102,7 @@
/* preserve low part of n for reminder computation */ \
__r = __n; \
/* determine number of bits to represent __b */ \
- __p = 1 << __div64_fls(__b); \
+ __p = 1 << fls(__b); \
/* compute __m = ((__p << 64) + __b - 1) / __b */ \
__m = (~0ULL / __b) * __p; \
__m += (((~0ULL % __b + 1) * __p) + __b - 1) / __b; \
@@ -150,8 +150,8 @@
__p /= (__m & -__m); \
__m /= (__m & -__m); \
} else { \
- __p >>= __div64_fls(__bits); \
- __m >>= __div64_fls(__bits); \
+ __p >>= fls(__bits); \
+ __m >>= fls(__bits); \
} \
/* No correction needed. */ \
__c = 0; \
@@ -217,18 +217,6 @@
__r; \
})
-/* our own fls implementation to make sure constant propagation is fine */
-#define __div64_fls(bits) \
-({ \
- unsigned int __left = (bits), __nr = 0; \
- if (__left & 0xffff0000) __nr += 16, __left >>= 16; \
- if (__left & 0x0000ff00) __nr += 8, __left >>= 8; \
- if (__left & 0x000000f0) __nr += 4, __left >>= 4; \
- if (__left & 0x0000000c) __nr += 2, __left >>= 2; \
- if (__left & 0x00000002) __nr += 1; \
- __nr; \
-})
-
#endif /* GCC version */
#endif /* BITS_PER_LONG */
--
2.39.2
On 02/09/2024 12:03, Andrew Cooper wrote: > > > Following the improvements to Xen's bitops, fls() does constant propagation in > all cases. Use it, and drop the local opencoded helper. > > No functional change. > > Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Michal Orzel <michal.orzel@amd.com> > --- > CC: Stefano Stabellini <sstabellini@kernel.org> > CC: Julien Grall <julien@xen.org> > CC: Volodymyr Babchuk <Volodymyr_Babchuk@epam.com> > CC: Bertrand Marquis <bertrand.marquis@arm.com> > CC: Michal Orzel <michal.orzel@amd.com> > > ARM32 gets a very minor code generation improvement: > > xen.git/xen$ ../scripts/bloat-o-meter xen-syms-arm32-{before,after} > add/remove: 0/0 grow/shrink: 0/6 up/down: 0/-48 (-48) > Function old new delta > wallclock_time 288 280 -8 > printk_start_of_line 560 552 -8 > domain_vtimer_init 472 464 -8 > do_settime 376 368 -8 > burn_credits 760 752 -8 > __printk_ratelimit 424 416 -8 > > But it's just a couple of operations improvement and no real change in code > structure. I expect that the constant propagation being done through > __builtin_clz(), rather than pure C, is giving the optimiser a bit more > information to work with. > > This file also has an __GNUC__ < 4 case which seems ripe for removing... Agree and noted. ~Michal
© 2016 - 2025 Red Hat, Inc.