Rework arm64 LTO __READ_ONCE() to improve code generation as follows:
1. Replace _Generic-based __unqual_scalar_typeof() with more complete
__rwonce_typeof_unqual(). This strips qualifiers from all types, not
just integer types, which is required to be able to assign (must be
non-const) to __u.__val in the non-atomic case (required for #2).
Once our minimum compiler versions are bumped, this just becomes
TYPEOF_UNQUAL() (or typeof_unqual() should we decide to adopt C23
naming). Sadly the fallback version of __rwonce_typeof_unqual() cannot
be used as a general TYPEOF_UNQUAL() fallback (see code comments).
One subtle point here is that non-integer types of __val could be const
or volatile within the union with the old __unqual_scalar_typeof(), if
the passed variable is const or volatile. This would then result in a
forced load from the stack if __u.__val is volatile; in the case of
const, it does look odd if the underlying storage changes, but the
compiler is told said member is "const" -- it smells like UB.
2. Eliminate the atomic flag and ternary conditional expression. Move
the fallback volatile load into the default case of the switch,
ensuring __u is unconditionally initialized across all paths.
The statement expression now unconditionally returns __u.__val.
This refactoring appears to help the compiler improve (or fix) code
generation.
With a defconfig + LTO + debug options builds, we observe different
codegen for the following functions:
btrfs_reclaim_sweep (708 -> 1032 bytes)
btrfs_sinfo_bg_reclaim_threshold_store (200 -> 204 bytes)
check_mem_access (3652 -> 3692 bytes) [inlined bpf_map_is_rdonly]
console_flush_all (1268 -> 1264 bytes)
console_lock_spinning_disable_and_check (180 -> 176 bytes)
igb_add_filter (640 -> 636 bytes)
igb_config_tx_modes (2404 -> 2400 bytes)
kvm_vcpu_on_spin (480 -> 476 bytes)
map_freeze (376 -> 380 bytes)
netlink_bind (1664 -> 1656 bytes)
nmi_cpu_backtrace (404 -> 400 bytes)
set_rps_cpu (516 -> 520 bytes)
swap_cluster_readahead (944 -> 932 bytes)
tcp_accecn_third_ack (328 -> 336 bytes)
tcp_create_openreq_child (1764 -> 1772 bytes)
tcp_data_queue (5784 -> 5892 bytes)
tcp_ecn_rcv_synack (620 -> 628 bytes)
xen_manage_runstate_time (944 -> 896 bytes)
xen_steal_clock (340 -> 296 bytes)
Increase of some functions are due to more aggressive inlining due to
better codegen (in this build, e.g. bpf_map_is_rdonly is no longer
present due to being inlined completely).
Signed-off-by: Marco Elver <elver@google.com>
---
v3:
* Comment.
v2:
* Add __rwonce_typeof_unqual() as fallback for old compilers.
---
arch/arm64/include/asm/rwonce.h | 21 +++++++++++++++++----
1 file changed, 17 insertions(+), 4 deletions(-)
diff --git a/arch/arm64/include/asm/rwonce.h b/arch/arm64/include/asm/rwonce.h
index fc0fb42b0b64..42c9e8429274 100644
--- a/arch/arm64/include/asm/rwonce.h
+++ b/arch/arm64/include/asm/rwonce.h
@@ -19,6 +19,20 @@
"ldapr" #sfx "\t" #regs, \
ARM64_HAS_LDAPR)
+#ifdef USE_TYPEOF_UNQUAL
+#define __rwonce_typeof_unqual(x) TYPEOF_UNQUAL(x)
+#else
+/*
+ * Fallback for older compilers (Clang < 19).
+ *
+ * Uses the fact that, for all supported Clang versions, 'auto' correctly drops
+ * qualifiers. Unlike typeof_unqual(), the type must be completely defined, i.e.
+ * no forward-declared struct pointer dereferences. The array-to-pointer decay
+ * case does not matter for usage in READ_ONCE() either.
+ */
+#define __rwonce_typeof_unqual(x) typeof(({ auto ____t = (x); ____t; }))
+#endif
+
/*
* When building with LTO, there is an increased risk of the compiler
* converting an address dependency headed by a READ_ONCE() invocation
@@ -32,8 +46,7 @@
#define __READ_ONCE(x) \
({ \
typeof(&(x)) __x = &(x); \
- int atomic = 1; \
- union { __unqual_scalar_typeof(*__x) __val; char __c[1]; } __u; \
+ union { __rwonce_typeof_unqual(*__x) __val; char __c[1]; } __u; \
switch (sizeof(x)) { \
case 1: \
asm volatile(__LOAD_RCPC(b, %w0, %1) \
@@ -56,9 +69,9 @@
: "Q" (*__x) : "memory"); \
break; \
default: \
- atomic = 0; \
+ __u.__val = *(volatile typeof(*__x) *)__x; \
} \
- atomic ? (typeof(*__x))__u.__val : (*(volatile typeof(*__x) *)__x);\
+ __u.__val; \
})
#endif /* !BUILD_VDSO */
--
2.53.0.rc1.225.gd81095ad13-goog
On Fri, Jan 30, 2026 at 02:28:25PM +0100, Marco Elver wrote:
> Rework arm64 LTO __READ_ONCE() to improve code generation as follows:
>
> 1. Replace _Generic-based __unqual_scalar_typeof() with more complete
> __rwonce_typeof_unqual(). This strips qualifiers from all types, not
> just integer types, which is required to be able to assign (must be
> non-const) to __u.__val in the non-atomic case (required for #2).
>
> Once our minimum compiler versions are bumped, this just becomes
> TYPEOF_UNQUAL() (or typeof_unqual() should we decide to adopt C23
> naming). Sadly the fallback version of __rwonce_typeof_unqual() cannot
> be used as a general TYPEOF_UNQUAL() fallback (see code comments).
>
> One subtle point here is that non-integer types of __val could be const
> or volatile within the union with the old __unqual_scalar_typeof(), if
> the passed variable is const or volatile. This would then result in a
> forced load from the stack if __u.__val is volatile; in the case of
> const, it does look odd if the underlying storage changes, but the
> compiler is told said member is "const" -- it smells like UB.
>
> 2. Eliminate the atomic flag and ternary conditional expression. Move
> the fallback volatile load into the default case of the switch,
> ensuring __u is unconditionally initialized across all paths.
> The statement expression now unconditionally returns __u.__val.
>
> This refactoring appears to help the compiler improve (or fix) code
> generation.
>
> With a defconfig + LTO + debug options builds, we observe different
> codegen for the following functions:
>
> btrfs_reclaim_sweep (708 -> 1032 bytes)
> btrfs_sinfo_bg_reclaim_threshold_store (200 -> 204 bytes)
> check_mem_access (3652 -> 3692 bytes) [inlined bpf_map_is_rdonly]
> console_flush_all (1268 -> 1264 bytes)
> console_lock_spinning_disable_and_check (180 -> 176 bytes)
> igb_add_filter (640 -> 636 bytes)
> igb_config_tx_modes (2404 -> 2400 bytes)
> kvm_vcpu_on_spin (480 -> 476 bytes)
> map_freeze (376 -> 380 bytes)
> netlink_bind (1664 -> 1656 bytes)
> nmi_cpu_backtrace (404 -> 400 bytes)
> set_rps_cpu (516 -> 520 bytes)
> swap_cluster_readahead (944 -> 932 bytes)
> tcp_accecn_third_ack (328 -> 336 bytes)
> tcp_create_openreq_child (1764 -> 1772 bytes)
> tcp_data_queue (5784 -> 5892 bytes)
> tcp_ecn_rcv_synack (620 -> 628 bytes)
> xen_manage_runstate_time (944 -> 896 bytes)
> xen_steal_clock (340 -> 296 bytes)
>
> Increase of some functions are due to more aggressive inlining due to
> better codegen (in this build, e.g. bpf_map_is_rdonly is no longer
> present due to being inlined completely).
>
> Signed-off-by: Marco Elver <elver@google.com>
> ---
> v3:
> * Comment.
>
> v2:
> * Add __rwonce_typeof_unqual() as fallback for old compilers.
> ---
> arch/arm64/include/asm/rwonce.h | 21 +++++++++++++++++----
> 1 file changed, 17 insertions(+), 4 deletions(-)
>
> diff --git a/arch/arm64/include/asm/rwonce.h b/arch/arm64/include/asm/rwonce.h
> index fc0fb42b0b64..42c9e8429274 100644
> --- a/arch/arm64/include/asm/rwonce.h
> +++ b/arch/arm64/include/asm/rwonce.h
> @@ -19,6 +19,20 @@
> "ldapr" #sfx "\t" #regs, \
> ARM64_HAS_LDAPR)
>
> +#ifdef USE_TYPEOF_UNQUAL
> +#define __rwonce_typeof_unqual(x) TYPEOF_UNQUAL(x)
> +#else
> +/*
> + * Fallback for older compilers (Clang < 19).
> + *
> + * Uses the fact that, for all supported Clang versions, 'auto' correctly drops
> + * qualifiers. Unlike typeof_unqual(), the type must be completely defined, i.e.
> + * no forward-declared struct pointer dereferences. The array-to-pointer decay
> + * case does not matter for usage in READ_ONCE() either.
> + */
> +#define __rwonce_typeof_unqual(x) typeof(({ auto ____t = (x); ____t; }))
> +#endif
I know that CONFIG_LTO practically depends on Clang, but it's a bit
grotty relying on that assumption here. Ideally, it would be
straightforward to enable the strong READ_ONCE() semantics on arm64
regardless of the compiler.
> /*
> * When building with LTO, there is an increased risk of the compiler
> * converting an address dependency headed by a READ_ONCE() invocation
> @@ -32,8 +46,7 @@
> #define __READ_ONCE(x) \
> ({ \
> typeof(&(x)) __x = &(x); \
> - int atomic = 1; \
> - union { __unqual_scalar_typeof(*__x) __val; char __c[1]; } __u; \
> + union { __rwonce_typeof_unqual(*__x) __val; char __c[1]; } __u; \
> switch (sizeof(x)) { \
> case 1: \
> asm volatile(__LOAD_RCPC(b, %w0, %1) \
> @@ -56,9 +69,9 @@
> : "Q" (*__x) : "memory"); \
> break; \
> default: \
> - atomic = 0; \
> + __u.__val = *(volatile typeof(*__x) *)__x; \
Since we're not providing acquire semantics for the non-atomic case,
what we really want is the generic definition of __READ_ONCE() from
include/asm-generic/rwonce.h here. The header inclusion mess prevents
that, but why can't we just inline that definition here for the
'default' case? If TYPEOF_UNQUAL() leads to better codegen, shouldn't
we use that to implement __unqual_scalar_typeof() when it is available?
I fear I'm missing something here, but it just feels like we're
optimising a pretty niche case (arm64 + LTO + non-atomic __READ_ONCE())
in a way that looks more generally applicable.
Will
On Mon, 2 Feb 2026 15:36:40 +0000 Will Deacon <will@kernel.org> wrote: > On Fri, Jan 30, 2026 at 02:28:25PM +0100, Marco Elver wrote: > > Rework arm64 LTO __READ_ONCE() to improve code generation as follows: ... \ > > default: \ > > - atomic = 0; \ > > + __u.__val = *(volatile typeof(*__x) *)__x; \ > > Since we're not providing acquire semantics for the non-atomic case, > what we really want is the generic definition of __READ_ONCE() from > include/asm-generic/rwonce.h here. The header inclusion mess prevents > that, but why can't we just inline that definition here for the > 'default' case? If TYPEOF_UNQUAL() leads to better codegen, shouldn't > we use that to implement __unqual_scalar_typeof() when it is available? > > I fear I'm missing something here, but it just feels like we're > optimising a pretty niche case (arm64 + LTO + non-atomic __READ_ONCE()) > in a way that looks more generally applicable. Is that path even needed? I'm sure I've built an x86-64 allmodconfig with it being an error. If you look back in the history it used to be an error. Anything to simplify READ_ONCE() will noticeably speed up build times. Even on x86 just removing the check that the size is 1, 2, 4 or 8 makes a measurable difference - and that check doesn't need to be done on every compile. I'm not setup to do arm builds - never mind LTO ones. (Yes, I know, it 'just' involves downloading the toolchain.) David
On Mon, Feb 02, 2026 at 03:36:40PM +0000, Will Deacon wrote:
> Since we're not providing acquire semantics for the non-atomic case,
> what we really want is the generic definition of __READ_ONCE() from
> include/asm-generic/rwonce.h here. The header inclusion mess prevents
> that, but why can't we just inline that definition here for the
> 'default' case? If TYPEOF_UNQUAL() leads to better codegen, shouldn't
> we use that to implement __unqual_scalar_typeof() when it is available?
We are?
---
commit fd69b2f7d5f4e1d89cea4cdfa6f15e7fa53d8358
Author: Peter Zijlstra <peterz@infradead.org>
Date: Fri Jan 16 19:18:16 2026 +0100
compiler: Use __typeof_unqual__() for __unqual_scalar_typeof()
The recent changes to get_unaligned() resulted in a new sparse warning:
net/rds/ib_cm.c:96:35: sparse: sparse: incorrect type in argument 1 (different modifiers) @@ expected void * @@ got restricted __be64 const * @@
net/rds/ib_cm.c:96:35: sparse: expected void *
net/rds/ib_cm.c:96:35: sparse: got restricted __be64 const *
The updated get_unaligned_t() uses __unqual_scalar_typeof() to get an
unqualified type. This works correctly for the compilers, but fails for
sparse when the data type is __be64 (or any other __beNN variant).
On sparse runs (C=[12]) __beNN types are annotated with
__attribute__((bitwise)).
That annotation allows sparse to detect incompatible operations on __beNN
variables, but it also prevents sparse from evaluating the _Generic() in
__unqual_scalar_typeof() and map __beNN to a unqualified scalar type, so it
ends up with the default, i.e. the original qualified type of a 'const
__beNN' pointer. That then ends up as the first pointer argument to
builtin_memcpy(), which obviously causes the above sparse warnings.
The sparse git tree supports typeof_unqual() now, which allows to use it
instead of the _Generic() based __unqual_scalar_typeof(). With that sparse
correctly evaluates the unqualified type and keeps the __beNN logic intact.
The downside is that this requires a top of tree sparse build and an old
sparse version will emit a metric ton of incomprehensible error messages
before it dies with a segfault.
Therefore implement a sanity check which validates that the checker is
available and capable of handling typeof_unqual(). Emit a warning if not so
the user can take informed action.
[ tglx: Move the evaluation of USE_TYPEOF_UNQUAL to compiler_types.h so it is
set before use and implement the sanity checker ]
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Acked-by: Ian Rogers <irogers@google.com>
Link: https://patch.msgid.link/87ecnp2zh3.ffs@tglx
Closes: https://lore.kernel.org/oe-kbuild-all/202601150001.sKSN644a-lkp@intel.com/
diff --git a/Makefile b/Makefile
index 9d38125263fb..179c9d9a56dd 100644
--- a/Makefile
+++ b/Makefile
@@ -1187,6 +1187,14 @@ CHECKFLAGS += $(if $(CONFIG_CPU_BIG_ENDIAN),-mbig-endian,-mlittle-endian)
# the checker needs the correct machine size
CHECKFLAGS += $(if $(CONFIG_64BIT),-m64,-m32)
+# Validate the checker is available and functional
+ifneq ($(KBUILD_CHECKSRC), 0)
+ ifneq ($(shell $(srctree)/scripts/checker-valid.sh $(CHECK) $(CHECKFLAGS)), 1)
+ $(warning C=$(KBUILD_CHECKSRC) specified, but $(CHECK) is not available or not up to date)
+ KBUILD_CHECKSRC = 0
+ endif
+endif
+
# Default kernel image to build when no specific target is given.
# KBUILD_IMAGE may be overruled on the command line or
# set in the environment
diff --git a/include/linux/compiler.h b/include/linux/compiler.h
index 04487c9bd751..c601222b495a 100644
--- a/include/linux/compiler.h
+++ b/include/linux/compiler.h
@@ -230,16 +230,6 @@ void ftrace_likely_update(struct ftrace_likely_data *f, int val,
__BUILD_BUG_ON_ZERO_MSG(!__is_noncstr(p), \
"must be non-C-string (not NUL-terminated)")
-/*
- * Use __typeof_unqual__() when available.
- *
- * XXX: Remove test for __CHECKER__ once
- * sparse learns about __typeof_unqual__().
- */
-#if CC_HAS_TYPEOF_UNQUAL && !defined(__CHECKER__)
-# define USE_TYPEOF_UNQUAL 1
-#endif
-
/*
* Define TYPEOF_UNQUAL() to use __typeof_unqual__() as typeof
* operator when available, to return an unqualified type of the exp.
diff --git a/include/linux/compiler_types.h b/include/linux/compiler_types.h
index d3318a3c2577..377df1e64096 100644
--- a/include/linux/compiler_types.h
+++ b/include/linux/compiler_types.h
@@ -562,6 +562,14 @@ struct ftrace_likely_data {
#define asm_inline asm
#endif
+#ifndef __ASSEMBLY__
+/*
+ * Use __typeof_unqual__() when available.
+ */
+#if CC_HAS_TYPEOF_UNQUAL || defined(__CHECKER__)
+# define USE_TYPEOF_UNQUAL 1
+#endif
+
/* Are two types/vars the same type (ignoring qualifiers)? */
#define __same_type(a, b) __builtin_types_compatible_p(typeof(a), typeof(b))
@@ -569,6 +577,7 @@ struct ftrace_likely_data {
* __unqual_scalar_typeof(x) - Declare an unqualified scalar type, leaving
* non-scalar types unchanged.
*/
+#ifndef USE_TYPEOF_UNQUAL
/*
* Prefer C11 _Generic for better compile-times and simpler code. Note: 'char'
* is not type-compatible with 'signed char', and we define a separate case.
@@ -586,6 +595,10 @@ struct ftrace_likely_data {
__scalar_type_to_expr_cases(long), \
__scalar_type_to_expr_cases(long long), \
default: (x)))
+#else
+#define __unqual_scalar_typeof(x) __typeof_unqual__(x)
+#endif
+#endif /* !__ASSEMBLY__ */
/* Is this type a native word size -- useful for atomic operations */
#define __native_word(t) \
diff --git a/scripts/checker-valid.sh b/scripts/checker-valid.sh
new file mode 100755
index 000000000000..625a789ed1c8
--- /dev/null
+++ b/scripts/checker-valid.sh
@@ -0,0 +1,19 @@
+#!/bin/sh -eu
+# SPDX-License-Identifier: GPL-2.0
+
+[ ! -x "$(command -v "$1")" ] && exit 1
+
+tmp_file=$(mktemp)
+trap "rm -f $tmp_file" EXIT
+
+cat << EOF >$tmp_file
+static inline int u(const int *q)
+{
+ __typeof_unqual__(*q) v = *q;
+ return v;
+}
+EOF
+
+# sparse happily exits with 0 on error so validate
+# there is none on stderr. Use awk as grep is a pain with sh -e
+$@ $tmp_file 2>&1 | awk -v c=1 '/error/{c=0}END{print c}'
On Mon, Feb 02, 2026 at 05:01:39PM +0100, Peter Zijlstra wrote: > On Mon, Feb 02, 2026 at 03:36:40PM +0000, Will Deacon wrote: > > > Since we're not providing acquire semantics for the non-atomic case, > > what we really want is the generic definition of __READ_ONCE() from > > include/asm-generic/rwonce.h here. The header inclusion mess prevents > > that, but why can't we just inline that definition here for the > > 'default' case? If TYPEOF_UNQUAL() leads to better codegen, shouldn't > > we use that to implement __unqual_scalar_typeof() when it is available? > > We are? Great! Then I don't grok why we need to choose between __unqual_scalar_typeof() and __typeof_unqual__() in the arch code. We should just use the former and it will DTRT. Will
On Mon, Feb 02, 2026 at 03:36:40PM +0000, Will Deacon wrote:
> I know that CONFIG_LTO practically depends on Clang, but it's a bit
> grotty relying on that assumption here. Ideally, it would be
> straightforward to enable the strong READ_ONCE() semantics on arm64
> regardless of the compiler.
Does it matter for GCC versions that do not support LTO? Because I'm
quite sure that if, one day, we add support for GCC LTO, that GCC
version will be new enough that it'll just take the __typeof_unqual__()
version and it'll "just work".
The problem with older GCC versions was that their __auto_type did not
actually strip qualifiers (which it should have) -- this was fixed at
some point.
On Mon, Feb 02, 2026 at 04:05PM +0000, Will Deacon wrote:
> On Mon, Feb 02, 2026 at 05:01:39PM +0100, Peter Zijlstra wrote:
> > On Mon, Feb 02, 2026 at 03:36:40PM +0000, Will Deacon wrote:
> >
> > > Since we're not providing acquire semantics for the non-atomic case,
> > > what we really want is the generic definition of __READ_ONCE() from
> > > include/asm-generic/rwonce.h here. The header inclusion mess prevents
> > > that, but why can't we just inline that definition here for the
> > > 'default' case? If TYPEOF_UNQUAL() leads to better codegen, shouldn't
> > > we use that to implement __unqual_scalar_typeof() when it is available?
> >
> > We are?
>
> Great! Then I don't grok why we need to choose between
> __unqual_scalar_typeof() and __typeof_unqual__() in the arch code. We
> should just use the former and it will DTRT.
The old __unqual_scalar_typeof() is still broken where
__typeof_unqual__() is unavailable - for the arm64 + LTO case that'd be
Clang <= 18, which we still have to support.
We could probably just ignore the performance issue ('volatile' reload
from stack, rare enough though given volatile variables are not usually
allowed) for these older versions and just say "use the newer compiler
to get better perf", but the 'const' issue will break the build:
| --- a/arch/arm64/include/asm/rwonce.h
| +++ b/arch/arm64/include/asm/rwonce.h
| @@ -46,7 +46,7 @@
| #define __READ_ONCE(x) \
| ({ \
| auto __x = &(x); \
| - auto __ret = (__rwonce_typeof_unqual(*__x) *)__x; \
| + auto __ret = (__unqual_scalar_typeof(*__x) *)__x; \
| /* Hides alias reassignment from Clang's -Wthread-safety. */ \
| auto __retp = &__ret; \
| union { typeof(*__ret) __val; char __c[1]; } __u; \
Results in:
| In file included from arch/arm64/kernel/asm-offsets.c:11:
| In file included from ./include/linux/arm_sdei.h:8:
| In file included from ./include/acpi/ghes.h:5:
| In file included from ./include/acpi/apei.h:9:
| In file included from ./include/linux/acpi.h:15:
| In file included from ./include/linux/device.h:32:
| In file included from ./include/linux/device/driver.h:21:
| In file included from ./include/linux/module.h:20:
| In file included from ./include/linux/elf.h:6:
| In file included from ./arch/arm64/include/asm/elf.h:141:
| ./include/linux/fs.h:1344:9: error: cannot assign to non-static data member '__val' with const-qualified type 'typeof (*__ret)' (aka 'struct fown_struct *const')
| 1344 | return READ_ONCE(file->f_owner);
| | ^~~~~~~~~~~~~~~~~~~~~~~~
| ./include/asm-generic/rwonce.h:50:2: note: expanded from macro 'READ_ONCE'
| 50 | __READ_ONCE(x); \
| | ^~~~~~~~~~~~~~
| ./arch/arm64/include/asm/rwonce.h:76:13: note: expanded from macro '__READ_ONCE'
| 76 | __u.__val = *(volatile typeof(*__x) *)__x; \
| | ~~~~~~~~~ ^
| ./include/linux/fs.h:1344:9: note: non-static data member '__val' declared const here
| 1344 | return READ_ONCE(file->f_owner);
| | ^~~~~~~~~~~~~~~~~~~~~~~~
| ./include/asm-generic/rwonce.h:50:2: note: expanded from macro 'READ_ONCE'
| 50 | __READ_ONCE(x); \
| | ^~~~~~~~~~~~~~
| ./arch/arm64/include/asm/rwonce.h:52:25: note: expanded from macro '__READ_ONCE'
| 52 | union { typeof(*__ret) __val; char __c[1]; } __u; \
| | ~~~~~~~~~~~~~~~^~~~~
... and many many more such errors.
It's an unfortunate mess today, but I hope sooner than later we bump the
minimum compiler versions that we can just unconditionally use
__typeof_unqual__() and delete __unqual_scalar_typeof(),
__rwonce_typeof_unqual() workaround and all the other code that appears
to be conditional on USE_TYPEOF_UNQUAL:
% git grep USE_TYPEOF_UNQUAL
arch/x86/include/asm/percpu.h:#if defined(CONFIG_USE_X86_SEG_SUPPORT) && defined(USE_TYPEOF_UNQUAL)
On Fri, 30 Jan 2026 14:28:25 +0100
Marco Elver <elver@google.com> wrote:
> Rework arm64 LTO __READ_ONCE() to improve code generation as follows:
>
> 1. Replace _Generic-based __unqual_scalar_typeof() with more complete
> __rwonce_typeof_unqual(). This strips qualifiers from all types, not
> just integer types, which is required to be able to assign (must be
> non-const) to __u.__val in the non-atomic case (required for #2).
>
> Once our minimum compiler versions are bumped, this just becomes
> TYPEOF_UNQUAL() (or typeof_unqual() should we decide to adopt C23
> naming). Sadly the fallback version of __rwonce_typeof_unqual() cannot
> be used as a general TYPEOF_UNQUAL() fallback (see code comments).
>
> One subtle point here is that non-integer types of __val could be const
> or volatile within the union with the old __unqual_scalar_typeof(), if
> the passed variable is const or volatile. This would then result in a
> forced load from the stack if __u.__val is volatile; in the case of
> const, it does look odd if the underlying storage changes, but the
> compiler is told said member is "const" -- it smells like UB.
>
> 2. Eliminate the atomic flag and ternary conditional expression. Move
> the fallback volatile load into the default case of the switch,
> ensuring __u is unconditionally initialized across all paths.
> The statement expression now unconditionally returns __u.__val.
>
> This refactoring appears to help the compiler improve (or fix) code
> generation.
>
> With a defconfig + LTO + debug options builds, we observe different
> codegen for the following functions:
>
> btrfs_reclaim_sweep (708 -> 1032 bytes)
> btrfs_sinfo_bg_reclaim_threshold_store (200 -> 204 bytes)
> check_mem_access (3652 -> 3692 bytes) [inlined bpf_map_is_rdonly]
> console_flush_all (1268 -> 1264 bytes)
> console_lock_spinning_disable_and_check (180 -> 176 bytes)
> igb_add_filter (640 -> 636 bytes)
> igb_config_tx_modes (2404 -> 2400 bytes)
> kvm_vcpu_on_spin (480 -> 476 bytes)
> map_freeze (376 -> 380 bytes)
> netlink_bind (1664 -> 1656 bytes)
> nmi_cpu_backtrace (404 -> 400 bytes)
> set_rps_cpu (516 -> 520 bytes)
> swap_cluster_readahead (944 -> 932 bytes)
> tcp_accecn_third_ack (328 -> 336 bytes)
> tcp_create_openreq_child (1764 -> 1772 bytes)
> tcp_data_queue (5784 -> 5892 bytes)
> tcp_ecn_rcv_synack (620 -> 628 bytes)
> xen_manage_runstate_time (944 -> 896 bytes)
> xen_steal_clock (340 -> 296 bytes)
>
> Increase of some functions are due to more aggressive inlining due to
> better codegen (in this build, e.g. bpf_map_is_rdonly is no longer
> present due to being inlined completely).
>
> Signed-off-by: Marco Elver <elver@google.com>
Having most of the comment in the commit message and a short one in the
code look good.
I think it will also fix a 'bleat' from min() about a signed v unsigned compare.
The ?: causes 'u8' to be promoted to 'int' with the expected outcome.
Reviewed-by: David Laight <david.laight.linux>@gmail.com
> ---
> v3:
> * Comment.
>
> v2:
> * Add __rwonce_typeof_unqual() as fallback for old compilers.
> ---
> arch/arm64/include/asm/rwonce.h | 21 +++++++++++++++++----
> 1 file changed, 17 insertions(+), 4 deletions(-)
>
> diff --git a/arch/arm64/include/asm/rwonce.h b/arch/arm64/include/asm/rwonce.h
> index fc0fb42b0b64..42c9e8429274 100644
> --- a/arch/arm64/include/asm/rwonce.h
> +++ b/arch/arm64/include/asm/rwonce.h
> @@ -19,6 +19,20 @@
> "ldapr" #sfx "\t" #regs, \
> ARM64_HAS_LDAPR)
>
> +#ifdef USE_TYPEOF_UNQUAL
> +#define __rwonce_typeof_unqual(x) TYPEOF_UNQUAL(x)
> +#else
> +/*
> + * Fallback for older compilers (Clang < 19).
> + *
> + * Uses the fact that, for all supported Clang versions, 'auto' correctly drops
> + * qualifiers. Unlike typeof_unqual(), the type must be completely defined, i.e.
> + * no forward-declared struct pointer dereferences. The array-to-pointer decay
> + * case does not matter for usage in READ_ONCE() either.
> + */
> +#define __rwonce_typeof_unqual(x) typeof(({ auto ____t = (x); ____t; }))
> +#endif
> +
> /*
> * When building with LTO, there is an increased risk of the compiler
> * converting an address dependency headed by a READ_ONCE() invocation
> @@ -32,8 +46,7 @@
> #define __READ_ONCE(x) \
> ({ \
> typeof(&(x)) __x = &(x); \
> - int atomic = 1; \
> - union { __unqual_scalar_typeof(*__x) __val; char __c[1]; } __u; \
> + union { __rwonce_typeof_unqual(*__x) __val; char __c[1]; } __u; \
> switch (sizeof(x)) { \
> case 1: \
> asm volatile(__LOAD_RCPC(b, %w0, %1) \
> @@ -56,9 +69,9 @@
> : "Q" (*__x) : "memory"); \
> break; \
> default: \
> - atomic = 0; \
> + __u.__val = *(volatile typeof(*__x) *)__x; \
> } \
> - atomic ? (typeof(*__x))__u.__val : (*(volatile typeof(*__x) *)__x);\
> + __u.__val; \
> })
>
> #endif /* !BUILD_VDSO */
© 2016 - 2026 Red Hat, Inc.