[PATCH 2/2] x86/irqflags: Use ASM_OUTPUT_RM in native_save_fl()

Eric Dumazet posted 2 patches 1 month, 3 weeks ago
[PATCH 2/2] x86/irqflags: Use ASM_OUTPUT_RM in native_save_fl()
Posted by Eric Dumazet 1 month, 3 weeks ago
clang is generating very inefficient code for native_save_fl() which
is used for local_irq_save() in critical spots.

Allowing the "pop %0" to use memory:

1) forces the compiler to add annoying stack canaries when
   CONFIG_STACKPROTECTOR_STRONG=y in many places.

2) Almost always is followed by an immediate "move memory,register"

One good example is _raw_spin_lock_irqsave, with 8 extra instructions

ffffffff82067a30 <_raw_spin_lock_irqsave>:
ffffffff82067a30:		...
ffffffff82067a39:		53						push   %rbx

// Three instructions to ajust the stack, read the per-cpu canary
// and copy it to 8(%rsp)
ffffffff82067a3a:		48 83 ec 10 			sub    $0x10,%rsp
ffffffff82067a3e:		65 48 8b 05 da 15 45 02 mov    %gs:0x24515da(%rip),%rax 	   # <__stack_chk_guard>
ffffffff82067a46:		48 89 44 24 08			mov    %rax,0x8(%rsp)

ffffffff82067a4b:		9c						pushf

// instead of pop %rbx, compiler uses 2 instructions.
ffffffff82067a4c:		8f 04 24				pop    (%rsp)
ffffffff82067a4f:		48 8b 1c 24 			mov    (%rsp),%rbx

ffffffff82067a53:		fa						cli
ffffffff82067a54:		b9 01 00 00 00			mov    $0x1,%ecx
ffffffff82067a59:		31 c0					xor    %eax,%eax
ffffffff82067a5b:		f0 0f b1 0f 			lock cmpxchg %ecx,(%rdi)
ffffffff82067a5f:		75 1d					jne    ffffffff82067a7e <_raw_spin_lock_irqsave+0x4e>

// three instructions to check the stack canary
ffffffff82067a61:		65 48 8b 05 b7 15 45 02 mov    %gs:0x24515b7(%rip),%rax 	   # <__stack_chk_guard>
ffffffff82067a69:		48 3b 44 24 08			cmp    0x8(%rsp),%rax
ffffffff82067a6e:		75 17					jne    ffffffff82067a87

...

// One extra instruction to adjust the stack.
ffffffff82067a73:		48 83 c4 10 			add    $0x10,%rsp
...

// One more instruction in case the stack was mangled.
ffffffff82067a87:		e8 a4 35 ff ff			call   ffffffff8205b030 <__stack_chk_fail>

This patch changes nothing for gcc, but for clang saves ~20000 bytes of text
even though more functions are inlined.

$ size vmlinux.gcc.before vmlinux.gcc.after vmlinux.clang.before vmlinux.clang.after
   text	   data		bss		dec		hex	filename
45565821	25005462	4704800	75276083	47c9f33	vmlinux.gcc.before
45565821	25005462	4704800	75276083	47c9f33	vmlinux.gcc.after
45121072	24638617	5533040	75292729	47ce039	vmlinux.clang.before
45093887	24638633	5536808	75269328	47c84d0	vmlinux.clang.after

$ scripts/bloat-o-meter -t vmlinux.clang.before vmlinux.clang.after
add/remove: 1/2 grow/shrink: 21/533 up/down: 2250/-22112 (-19862)
Function									 old	 new   delta
wakeup_cpu_via_vmgexit						1002	1447	+445
rcu_tasks_trace_pregp_step					1052	1454	+402
snp_kexec_finish							1290	1527	+237
check_all_holdout_tasks_trace				 909	1106	+197
x2apic_send_IPI_mask_allbutself 			  38	 198	+160
hpet_set_rtc_irq_bit						 118	 265	+147
x2apic_send_IPI_mask						  38	 184	+146
ring_buffer_poll_wait						 261	 405	+144
rb_watermark_hit							 253	 386	+133
__unlikely_text_end 						 368	 416	 +48
printk_trigger_flush						 262	 298	 +36
__softirqentry_text_end 					   -	  32	 +32
pstore_dump 								1145	1164	 +19
printk_legacy_allow_panic_sync				 159	 178	 +19
netlink_insert								 979	 995	 +16
console_try_replay_all						 268	 283	 +15
do_flush_tlb_all							 151	 165	 +14
__flush_tlb_all 							 151	 165	 +14
synchronize_rcu_expedited					2248	2259	 +11
...
tcp_wfree									 402	 332	 -70
stacktrace_trigger							 133	  62	 -71
w1_touch_bit								 418	 343	 -75
w1_triplet									 446	 370	 -76
link_create 								 980	 902	 -78
drain_dead_softirq_workfn					 425	 347	 -78
kcryptd_queue_crypt 						 253	 174	 -79
perf_event_aux_pause						 448	 368	 -80
idle_worker_timeout 						 320	 240	 -80
srcu_funnel_exp_start						 418	 333	 -85
call_rcu									 751	 666	 -85
enable_IR_x2apic							 279	 191	 -88
bpf_link_free								 432	 342	 -90
synchronize_rcu 							 497	 403	 -94
identify_cpu								2665	2569	 -96
ftrace_modify_all_code						 355	 258	 -97
load_gs_index								 212	 104	-108
verity_end_io								 369	 257	-112
bpf_prog_detach 							 672	 555	-117
__x2apic_send_IPI_mask						 552	 275	-277
snp_cleanup_vmsa							 284	   -	-284
__noinstr_text_start						3072	1920   -1152
Total: Before=28577936, After=28558074, chg -0.07%

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Uros Bizjak <ubizjak@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
---
v2: use ASM_OUTPUT_RM (Uros Bizjak)
v1: https://lore.kernel.org/lkml/CANn89iJ+HKXRn7qF4KrT6gghw6CwWcsvoj8Scw17CkCqhGbk=A@mail.gmail.com/T/#mc2322d458f07118580eca7c5fa1f0bc931c32d30

 arch/x86/include/asm/irqflags.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/irqflags.h b/arch/x86/include/asm/irqflags.h
index b30e5474c18e1be63b7c69354c26ae6a6cb02731..a1193e9d65f2000d6de88468bee58f2dae9c6cd5 100644
--- a/arch/x86/include/asm/irqflags.h
+++ b/arch/x86/include/asm/irqflags.h
@@ -25,7 +25,7 @@ extern __always_inline unsigned long native_save_fl(void)
	 */
	asm volatile("# __raw_save_flags\n\t"
			 "pushf ; pop %0"
-			 : "=rm" (flags)
+			 : ASM_OUTPUT_RM (flags)
			 : /* no input */
			 : "memory");

--
2.52.0.322.g1dd061c0dc-goog
Re: [PATCH 2/2] x86/irqflags: Use ASM_OUTPUT_RM in native_save_fl()
Posted by H. Peter Anvin 1 month, 2 weeks ago
On December 19, 2025 3:20:07 AM PST, Eric Dumazet <edumazet@google.com> wrote:
>clang is generating very inefficient code for native_save_fl() which
>is used for local_irq_save() in critical spots.
>
>Allowing the "pop %0" to use memory:
>
>1) forces the compiler to add annoying stack canaries when
>   CONFIG_STACKPROTECTOR_STRONG=y in many places.
>
>2) Almost always is followed by an immediate "move memory,register"
>
>One good example is _raw_spin_lock_irqsave, with 8 extra instructions
>
>ffffffff82067a30 <_raw_spin_lock_irqsave>:
>ffffffff82067a30:		...
>ffffffff82067a39:		53						push   %rbx
>
>// Three instructions to ajust the stack, read the per-cpu canary
>// and copy it to 8(%rsp)
>ffffffff82067a3a:		48 83 ec 10 			sub    $0x10,%rsp
>ffffffff82067a3e:		65 48 8b 05 da 15 45 02 mov    %gs:0x24515da(%rip),%rax 	   # <__stack_chk_guard>
>ffffffff82067a46:		48 89 44 24 08			mov    %rax,0x8(%rsp)
>
>ffffffff82067a4b:		9c						pushf
>
>// instead of pop %rbx, compiler uses 2 instructions.
>ffffffff82067a4c:		8f 04 24				pop    (%rsp)
>ffffffff82067a4f:		48 8b 1c 24 			mov    (%rsp),%rbx
>
>ffffffff82067a53:		fa						cli
>ffffffff82067a54:		b9 01 00 00 00			mov    $0x1,%ecx
>ffffffff82067a59:		31 c0					xor    %eax,%eax
>ffffffff82067a5b:		f0 0f b1 0f 			lock cmpxchg %ecx,(%rdi)
>ffffffff82067a5f:		75 1d					jne    ffffffff82067a7e <_raw_spin_lock_irqsave+0x4e>
>
>// three instructions to check the stack canary
>ffffffff82067a61:		65 48 8b 05 b7 15 45 02 mov    %gs:0x24515b7(%rip),%rax 	   # <__stack_chk_guard>
>ffffffff82067a69:		48 3b 44 24 08			cmp    0x8(%rsp),%rax
>ffffffff82067a6e:		75 17					jne    ffffffff82067a87
>
>...
>
>// One extra instruction to adjust the stack.
>ffffffff82067a73:		48 83 c4 10 			add    $0x10,%rsp
>...
>
>// One more instruction in case the stack was mangled.
>ffffffff82067a87:		e8 a4 35 ff ff			call   ffffffff8205b030 <__stack_chk_fail>
>
>This patch changes nothing for gcc, but for clang saves ~20000 bytes of text
>even though more functions are inlined.
>
>$ size vmlinux.gcc.before vmlinux.gcc.after vmlinux.clang.before vmlinux.clang.after
>   text	   data		bss		dec		hex	filename
>45565821	25005462	4704800	75276083	47c9f33	vmlinux.gcc.before
>45565821	25005462	4704800	75276083	47c9f33	vmlinux.gcc.after
>45121072	24638617	5533040	75292729	47ce039	vmlinux.clang.before
>45093887	24638633	5536808	75269328	47c84d0	vmlinux.clang.after
>
>$ scripts/bloat-o-meter -t vmlinux.clang.before vmlinux.clang.after
>add/remove: 1/2 grow/shrink: 21/533 up/down: 2250/-22112 (-19862)
>Function									 old	 new   delta
>wakeup_cpu_via_vmgexit						1002	1447	+445
>rcu_tasks_trace_pregp_step					1052	1454	+402
>snp_kexec_finish							1290	1527	+237
>check_all_holdout_tasks_trace				 909	1106	+197
>x2apic_send_IPI_mask_allbutself 			  38	 198	+160
>hpet_set_rtc_irq_bit						 118	 265	+147
>x2apic_send_IPI_mask						  38	 184	+146
>ring_buffer_poll_wait						 261	 405	+144
>rb_watermark_hit							 253	 386	+133
>__unlikely_text_end 						 368	 416	 +48
>printk_trigger_flush						 262	 298	 +36
>__softirqentry_text_end 					   -	  32	 +32
>pstore_dump 								1145	1164	 +19
>printk_legacy_allow_panic_sync				 159	 178	 +19
>netlink_insert								 979	 995	 +16
>console_try_replay_all						 268	 283	 +15
>do_flush_tlb_all							 151	 165	 +14
>__flush_tlb_all 							 151	 165	 +14
>synchronize_rcu_expedited					2248	2259	 +11
>...
>tcp_wfree									 402	 332	 -70
>stacktrace_trigger							 133	  62	 -71
>w1_touch_bit								 418	 343	 -75
>w1_triplet									 446	 370	 -76
>link_create 								 980	 902	 -78
>drain_dead_softirq_workfn					 425	 347	 -78
>kcryptd_queue_crypt 						 253	 174	 -79
>perf_event_aux_pause						 448	 368	 -80
>idle_worker_timeout 						 320	 240	 -80
>srcu_funnel_exp_start						 418	 333	 -85
>call_rcu									 751	 666	 -85
>enable_IR_x2apic							 279	 191	 -88
>bpf_link_free								 432	 342	 -90
>synchronize_rcu 							 497	 403	 -94
>identify_cpu								2665	2569	 -96
>ftrace_modify_all_code						 355	 258	 -97
>load_gs_index								 212	 104	-108
>verity_end_io								 369	 257	-112
>bpf_prog_detach 							 672	 555	-117
>__x2apic_send_IPI_mask						 552	 275	-277
>snp_cleanup_vmsa							 284	   -	-284
>__noinstr_text_start						3072	1920   -1152
>Total: Before=28577936, After=28558074, chg -0.07%
>
>Signed-off-by: Eric Dumazet <edumazet@google.com>
>Cc: Uros Bizjak <ubizjak@gmail.com>
>Cc: Linus Torvalds <torvalds@linux-foundation.org>
>---
>v2: use ASM_OUTPUT_RM (Uros Bizjak)
>v1: https://lore.kernel.org/lkml/CANn89iJ+HKXRn7qF4KrT6gghw6CwWcsvoj8Scw17CkCqhGbk=A@mail.gmail.com/T/#mc2322d458f07118580eca7c5fa1f0bc931c32d30
>
> arch/x86/include/asm/irqflags.h | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
>diff --git a/arch/x86/include/asm/irqflags.h b/arch/x86/include/asm/irqflags.h
>index b30e5474c18e1be63b7c69354c26ae6a6cb02731..a1193e9d65f2000d6de88468bee58f2dae9c6cd5 100644
>--- a/arch/x86/include/asm/irqflags.h
>+++ b/arch/x86/include/asm/irqflags.h
>@@ -25,7 +25,7 @@ extern __always_inline unsigned long native_save_fl(void)
>	 */
>	asm volatile("# __raw_save_flags\n\t"
>			 "pushf ; pop %0"
>-			 : "=rm" (flags)
>+			 : ASM_OUTPUT_RM (flags)
>			 : /* no input */
>			 : "memory");
>
>--
>2.52.0.322.g1dd061c0dc-goog
>

Maybe they should provide an intrinsic for this operation?
Re: [PATCH 2/2] x86/irqflags: Use ASM_OUTPUT_RM in native_save_fl()
Posted by Eric Dumazet 1 month, 2 weeks ago
On Sun, Dec 21, 2025 at 10:40 PM H. Peter Anvin <hpa@zytor.com> wrote:
>
> On December 19, 2025 3:20:07 AM PST, Eric Dumazet <edumazet@google.com> wrote:
> >clang is generating very inefficient code for native_save_fl() which
> >is used for local_irq_save() in critical spots.
> >
> >Allowing the "pop %0" to use memory:
> >
> >1) forces the compiler to add annoying stack canaries when
> >   CONFIG_STACKPROTECTOR_STRONG=y in many places.
> >
> >2) Almost always is followed by an immediate "move memory,register"
> >
> >One good example is _raw_spin_lock_irqsave, with 8 extra instructions
> >
> >ffffffff82067a30 <_raw_spin_lock_irqsave>:
> >ffffffff82067a30:              ...
> >ffffffff82067a39:              53                                              push   %rbx
> >
> >// Three instructions to ajust the stack, read the per-cpu canary
> >// and copy it to 8(%rsp)
> >ffffffff82067a3a:              48 83 ec 10                     sub    $0x10,%rsp
> >ffffffff82067a3e:              65 48 8b 05 da 15 45 02 mov    %gs:0x24515da(%rip),%rax            # <__stack_chk_guard>
> >ffffffff82067a46:              48 89 44 24 08                  mov    %rax,0x8(%rsp)
> >
> >ffffffff82067a4b:              9c                                              pushf
> >
> >// instead of pop %rbx, compiler uses 2 instructions.
> >ffffffff82067a4c:              8f 04 24                                pop    (%rsp)
> >ffffffff82067a4f:              48 8b 1c 24                     mov    (%rsp),%rbx
> >
> >ffffffff82067a53:              fa                                              cli
> >ffffffff82067a54:              b9 01 00 00 00                  mov    $0x1,%ecx
> >ffffffff82067a59:              31 c0                                   xor    %eax,%eax
> >ffffffff82067a5b:              f0 0f b1 0f                     lock cmpxchg %ecx,(%rdi)
> >ffffffff82067a5f:              75 1d                                   jne    ffffffff82067a7e <_raw_spin_lock_irqsave+0x4e>
> >
> >// three instructions to check the stack canary
> >ffffffff82067a61:              65 48 8b 05 b7 15 45 02 mov    %gs:0x24515b7(%rip),%rax            # <__stack_chk_guard>
> >ffffffff82067a69:              48 3b 44 24 08                  cmp    0x8(%rsp),%rax
> >ffffffff82067a6e:              75 17                                   jne    ffffffff82067a87
> >
> >...
> >
> >// One extra instruction to adjust the stack.
> >ffffffff82067a73:              48 83 c4 10                     add    $0x10,%rsp
> >...
> >
> >// One more instruction in case the stack was mangled.
> >ffffffff82067a87:              e8 a4 35 ff ff                  call   ffffffff8205b030 <__stack_chk_fail>
> >
> >This patch changes nothing for gcc, but for clang saves ~20000 bytes of text
> >even though more functions are inlined.
> >
> >$ size vmlinux.gcc.before vmlinux.gcc.after vmlinux.clang.before vmlinux.clang.after
> >   text           data         bss             dec             hex     filename
> >45565821       25005462        4704800 75276083        47c9f33 vmlinux.gcc.before
> >45565821       25005462        4704800 75276083        47c9f33 vmlinux.gcc.after
> >45121072       24638617        5533040 75292729        47ce039 vmlinux.clang.before
> >45093887       24638633        5536808 75269328        47c84d0 vmlinux.clang.after
> >
> >$ scripts/bloat-o-meter -t vmlinux.clang.before vmlinux.clang.after
> >add/remove: 1/2 grow/shrink: 21/533 up/down: 2250/-22112 (-19862)
> >Function                                                                        old     new   delta
> >wakeup_cpu_via_vmgexit                                         1002    1447    +445
> >rcu_tasks_trace_pregp_step                                     1052    1454    +402
> >snp_kexec_finish                                                       1290    1527    +237
> >check_all_holdout_tasks_trace                           909    1106    +197
> >x2apic_send_IPI_mask_allbutself                          38     198    +160
> >hpet_set_rtc_irq_bit                                            118     265    +147
> >x2apic_send_IPI_mask                                             38     184    +146
> >ring_buffer_poll_wait                                           261     405    +144
> >rb_watermark_hit                                                        253     386    +133
> >__unlikely_text_end                                             368     416     +48
> >printk_trigger_flush                                            262     298     +36
> >__softirqentry_text_end                                           -      32     +32
> >pstore_dump                                                            1145    1164     +19
> >printk_legacy_allow_panic_sync                          159     178     +19
> >netlink_insert                                                          979     995     +16
> >console_try_replay_all                                          268     283     +15
> >do_flush_tlb_all                                                        151     165     +14
> >__flush_tlb_all                                                         151     165     +14
> >synchronize_rcu_expedited                                      2248    2259     +11
> >...
> >tcp_wfree                                                                       402     332     -70
> >stacktrace_trigger                                                      133      62     -71
> >w1_touch_bit                                                            418     343     -75
> >w1_triplet                                                                      446     370     -76
> >link_create                                                             980     902     -78
> >drain_dead_softirq_workfn                                       425     347     -78
> >kcryptd_queue_crypt                                             253     174     -79
> >perf_event_aux_pause                                            448     368     -80
> >idle_worker_timeout                                             320     240     -80
> >srcu_funnel_exp_start                                           418     333     -85
> >call_rcu                                                                        751     666     -85
> >enable_IR_x2apic                                                        279     191     -88
> >bpf_link_free                                                           432     342     -90
> >synchronize_rcu                                                         497     403     -94
> >identify_cpu                                                           2665    2569     -96
> >ftrace_modify_all_code                                          355     258     -97
> >load_gs_index                                                           212     104    -108
> >verity_end_io                                                           369     257    -112
> >bpf_prog_detach                                                         672     555    -117
> >__x2apic_send_IPI_mask                                          552     275    -277
> >snp_cleanup_vmsa                                                        284       -    -284
> >__noinstr_text_start                                           3072    1920   -1152
> >Total: Before=28577936, After=28558074, chg -0.07%
> >
> >Signed-off-by: Eric Dumazet <edumazet@google.com>
> >Cc: Uros Bizjak <ubizjak@gmail.com>
> >Cc: Linus Torvalds <torvalds@linux-foundation.org>
> >---
> >v2: use ASM_OUTPUT_RM (Uros Bizjak)
> >v1: https://lore.kernel.org/lkml/CANn89iJ+HKXRn7qF4KrT6gghw6CwWcsvoj8Scw17CkCqhGbk=A@mail.gmail.com/T/#mc2322d458f07118580eca7c5fa1f0bc931c32d30
> >
> > arch/x86/include/asm/irqflags.h | 2 +-
> > 1 file changed, 1 insertion(+), 1 deletion(-)
> >
> >diff --git a/arch/x86/include/asm/irqflags.h b/arch/x86/include/asm/irqflags.h
> >index b30e5474c18e1be63b7c69354c26ae6a6cb02731..a1193e9d65f2000d6de88468bee58f2dae9c6cd5 100644
> >--- a/arch/x86/include/asm/irqflags.h
> >+++ b/arch/x86/include/asm/irqflags.h
> >@@ -25,7 +25,7 @@ extern __always_inline unsigned long native_save_fl(void)
> >        */
> >       asm volatile("# __raw_save_flags\n\t"
> >                        "pushf ; pop %0"
> >-                       : "=rm" (flags)
> >+                       : ASM_OUTPUT_RM (flags)
> >                        : /* no input */
> >                        : "memory");
> >
> >--
> >2.52.0.322.g1dd061c0dc-goog
> >
>
> Maybe they should provide an intrinsic for this operation?

They already do provide it, I mentioned this in V1 discussion:

https://lore.kernel.org/lkml/CANn89i+3F6O30Tg5kh-+O2nFa3xJMsverm1bTh-K1G0vk=-gig@mail.gmail.com/

Unfortunately I was told it was not better, some clang versions had
bugs with it.
Re: [PATCH 2/2] x86/irqflags: Use ASM_OUTPUT_RM in native_save_fl()
Posted by Linus Torvalds 1 month, 2 weeks ago
On Sun, 21 Dec 2025 at 13:47, Eric Dumazet <edumazet@google.com> wrote:
>
> Unfortunately I was told it was not better, some clang versions had
> bugs with it.

The whole "builtin is worse than inline asm" is actually a thing even
when not actively buggy. We've had things like that a few times
before.

It's why we have out own variable_ffs() implementation, for example,
and do things like

   #define ffs(x) (__builtin_constant_p(x) ? __builtin_ffs(x) : variable_ffs(x))

because __builtin_ffs() generates pointlessly silly code on at least
some compiler versions:

    https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106471

So then just using inline asm gets us at least reliable results, and
we can - and do - tweak it for specific known configs.

             Linus
Re: [PATCH 2/2] x86/irqflags: Use ASM_OUTPUT_RM in native_save_fl()
Posted by H. Peter Anvin 1 month, 2 weeks ago
On December 21, 2025 3:20:43 PM PST, Linus Torvalds <torvalds@linux-foundation.org> wrote:
>On Sun, 21 Dec 2025 at 13:47, Eric Dumazet <edumazet@google.com> wrote:
>>
>> Unfortunately I was told it was not better, some clang versions had
>> bugs with it.
>
>The whole "builtin is worse than inline asm" is actually a thing even
>when not actively buggy. We've had things like that a few times
>before.
>
>It's why we have out own variable_ffs() implementation, for example,
>and do things like
>
>   #define ffs(x) (__builtin_constant_p(x) ? __builtin_ffs(x) : variable_ffs(x))
>
>because __builtin_ffs() generates pointlessly silly code on at least
>some compiler versions:
>
>    https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106471
>
>So then just using inline asm gets us at least reliable results, and
>we can - and do - tweak it for specific known configs.
>
>             Linus

Sigh.

It's just disappointing to see this kind of stuff.
Re: [PATCH 2/2] x86/irqflags: Use ASM_OUTPUT_RM in native_save_fl()
Posted by Linus Torvalds 1 month, 2 weeks ago
On Sun, 21 Dec 2025 at 15:44, H. Peter Anvin <hpa@zytor.com> wrote:
>
> Sigh.
>
> It's just disappointing to see this kind of stuff.

I actually find all these builtins existing in the first place more
disappointing.

Inline asm is the *generic* answer, and the one that works for pretty
much any situation. It's required in the general case, and it's the
thing that works with older compilers.

Having a builtin for things like this - without any appreciable
improvement in code - is just pointless and stupid.

Now, generic cross-architecture builtins are one thing: then you can
use them "portably". So having a builtin for 'ffs()' is not wrong.

But something like __builtin_ia32_readeflags_u64() is just entirely
pointless. What is the supposed advantage over just doing inline
assembly?

                 Linus
Re: [PATCH 2/2] x86/irqflags: Use ASM_OUTPUT_RM in native_save_fl()
Posted by H. Peter Anvin 1 month, 2 weeks ago
On December 21, 2025 5:38:24 PM PST, Linus Torvalds <torvalds@linux-foundation.org> wrote:
>On Sun, 21 Dec 2025 at 15:44, H. Peter Anvin <hpa@zytor.com> wrote:
>>
>> Sigh.
>>
>> It's just disappointing to see this kind of stuff.
>
>I actually find all these builtins existing in the first place more
>disappointing.
>
>Inline asm is the *generic* answer, and the one that works for pretty
>much any situation. It's required in the general case, and it's the
>thing that works with older compilers.
>
>Having a builtin for things like this - without any appreciable
>improvement in code - is just pointless and stupid.
>
>Now, generic cross-architecture builtins are one thing: then you can
>use them "portably". So having a builtin for 'ffs()' is not wrong.
>
>But something like __builtin_ia32_readeflags_u64() is just entirely
>pointless. What is the supposed advantage over just doing inline
>assembly?
>
>                 Linus

Unless they are doing something like actually leaving it on the stack... yeah.
Re: [PATCH 2/2] x86/irqflags: Use ASM_OUTPUT_RM in native_save_fl()
Posted by Andrew Cooper 1 month, 2 weeks ago
>> But something like __builtin_ia32_readeflags_u64() is just entirely
>> pointless. What is the supposed advantage over just doing inline
>> assembly?
>
> Unless they are doing something like actually leaving it on the stack... yeah.

It's not just pointless, it's actively dangerous. 
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104971

At least with inline assembly, it's the programmer's fault when making a
mistake like this.

~Andrew