arch/x86/include/asm/irqflags.h | 19 +++++-------------- 1 file changed, 5 insertions(+), 14 deletions(-)
GCC and Clang both have builtins to read and write the EFLAGS register.
This allows the compiler to determine the best way to generate this
code, which can improve code generation.
This issue arose due to Clang's issue with the "=rm" constraint. Clang
chooses to be conservative in these situations, and so uses memory
instead of registers. This is a known issue, which is currently being
addressed.
However, using builtins is beneficial in general, because it removes the
burden of determining what's the way to read the flags register from the
programmer and places it on to the compiler, which has the information
needed to make that decision. Indeed, this piece of code has had several
changes over the years, some of which were pinging back and forth to
determine the correct constraints to use.
With this change, Clang generates better code:
Original code:
movq $0, -48(%rbp)
#APP
# __raw_save_flags
pushfq
popq -48(%rbp)
#NO_APP
movq -48(%rbp), %rbx
New code:
pushfq
popq %rbx
#APP
Note that the stack slot in the original code is no longer needed in the
new code, saving a small amount of stack space.
There is no change to GCC's output:
Original code:
# __raw_save_flags
pushf ; pop %r13 # flags
New code:
pushfq
popq %r13 # _23
Signed-off-by: Bill Wendling <morbo@google.com>
---
v4: - Clang now no longer generates stack frames when using these builtins.
- Corrected misspellings.
v3: - Add blurb indicating that GCC's output hasn't changed.
v2: - Kept the original function to retain the out-of-line symbol.
- Improved the commit message.
- Note that I couldn't use Nick's suggestion of
return IS_ENABLED(CONFIG_X86_64) ? ...
because Clang complains about using __builtin_ia32_readeflags_u32 in
64-bit mode.
---
arch/x86/include/asm/irqflags.h | 19 +++++--------------
1 file changed, 5 insertions(+), 14 deletions(-)
diff --git a/arch/x86/include/asm/irqflags.h b/arch/x86/include/asm/irqflags.h
index 87761396e8cc..f31a035f3c6a 100644
--- a/arch/x86/include/asm/irqflags.h
+++ b/arch/x86/include/asm/irqflags.h
@@ -19,20 +19,11 @@
extern inline unsigned long native_save_fl(void);
extern __always_inline unsigned long native_save_fl(void)
{
- unsigned long flags;
-
- /*
- * "=rm" is safe here, because "pop" adjusts the stack before
- * it evaluates its effective address -- this is part of the
- * documented behavior of the "pop" instruction.
- */
- asm volatile("# __raw_save_flags\n\t"
- "pushf ; pop %0"
- : "=rm" (flags)
- : /* no input */
- : "memory");
-
- return flags;
+#ifdef CONFIG_X86_64
+ return __builtin_ia32_readeflags_u64();
+#else
+ return __builtin_ia32_readeflags_u32();
+#endif
}
static __always_inline void native_irq_disable(void)
--
2.35.1.265.g69c8d7142f-goog
From: Bill Wendling > Sent: 10 February 2022 22:32 > > GCC and Clang both have builtins to read and write the EFLAGS register. > This allows the compiler to determine the best way to generate this > code, which can improve code generation. > > This issue arose due to Clang's issue with the "=rm" constraint. Clang > chooses to be conservative in these situations, and so uses memory > instead of registers. This is a known issue, which is currently being > addressed. > > However, using builtins is beneficial in general, because it removes the > burden of determining what's the way to read the flags register from the > programmer and places it on to the compiler, which has the information > needed to make that decision. Except that neither gcc nor clang attempt to make that decision. They always do pushf; pop ax; ... > v4: - Clang now no longer generates stack frames when using these builtins. > - Corrected misspellings. While clang 'head' has been fixed, it seems a bit premature to say it is 'fixed' enough for all clang builds to use the builtin. Seems better to change it (back) to "=r" and comment that this is currently as good as __builtin_ia32_readeflags_u64() and that clang makes a 'pigs breakfast' of "=rm" - which has only marginal benefit. Changing to __builtin_ia32_readeflags_u64() may be worth while if/when the compilers will generate pushf; pop mem; for it. David - Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK Registration No: 1397386 (Wales)
On Fri, Feb 11, 2022 at 8:40 AM David Laight <David.Laight@aculab.com> wrote:
> From: Bill Wendling
> > Sent: 10 February 2022 22:32
> >
> > GCC and Clang both have builtins to read and write the EFLAGS register.
> > This allows the compiler to determine the best way to generate this
> > code, which can improve code generation.
> >
> > This issue arose due to Clang's issue with the "=rm" constraint. Clang
> > chooses to be conservative in these situations, and so uses memory
> > instead of registers. This is a known issue, which is currently being
> > addressed.
> >
> > However, using builtins is beneficial in general, because it removes the
> > burden of determining what's the way to read the flags register from the
> > programmer and places it on to the compiler, which has the information
> > needed to make that decision.
>
> Except that neither gcc nor clang attempt to make that decision.
> They always do pushf; pop ax;
>
It looks like both GCC and Clang pop into virtual registers. The
register allocator is then able to determine if it can allocate a
physical register or if a stack slot is required.
> ...
> > v4: - Clang now no longer generates stack frames when using these builtins.
> > - Corrected misspellings.
>
> While clang 'head' has been fixed, it seems a bit premature to say
> it is 'fixed' enough for all clang builds to use the builtin.
>
True, but it's been cherry-picked into the clang 14.0.0 branch, which
is scheduled for release in March.
> Seems better to change it (back) to "=r" and comment that this
> is currently as good as __builtin_ia32_readeflags_u64() and that
> clang makes a 'pigs breakfast' of "=rm" - which has only marginal
> benefit.
>
That would be okay as far as code generation is concerned, but it does
place the burden of correctness back on the programmer. Also, it was
that at some point, but was changed to "=rm" here. :-)
commit ab94fcf528d127fcb490175512a8910f37e5b346
Author: H. Peter Anvin <hpa@zytor.com>
Date: Tue Aug 25 16:47:16 2009 -0700
x86: allow "=rm" in native_save_fl()
This is a partial revert of f1f029c7bfbf4ee1918b90a431ab823bed812504.
"=rm" is allowed in this context, because "pop" is explicitly defined
to adjust the stack pointer *before* it evaluates its effective
address, if it has one. Thus, we do end up writing to the correct
address even if we use an on-stack memory argument.
The original reporter for f1f029c7bfbf4ee1918b90a431ab823bed812504 was
apparently using a broken x86 simulator.
[ Impact: performance ]
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Cc: Gabe Black <spamforgabe@umich.edu>
> Changing to __builtin_ia32_readeflags_u64() may be worth while
> if/when the compilers will generate pushf; pop mem; for it.
>
I was able to come up with an example where GCC generates "pushf ; pop mem":
https://godbolt.org/z/9rocjdoaK
(Clang generates a variation of "pop mem," and is horrible code, but
it's meant for demonstration purposes only.) One interesting thing
about the use of the builtins is that if at all possible, the "pop"
instruction may be moved away from the "pushf" if it's safe and would
reduce register pressure.
-bw
From: Bill Wendling > Sent: 11 February 2022 19:26
>
> On Fri, Feb 11, 2022 at 8:40 AM David Laight <David.Laight@aculab.com> wrote:
> > From: Bill Wendling
> > > Sent: 10 February 2022 22:32
> > >
> > > GCC and Clang both have builtins to read and write the EFLAGS register.
> > > This allows the compiler to determine the best way to generate this
> > > code, which can improve code generation.
> > >
> > > This issue arose due to Clang's issue with the "=rm" constraint. Clang
> > > chooses to be conservative in these situations, and so uses memory
> > > instead of registers. This is a known issue, which is currently being
> > > addressed.
> > >
> > > However, using builtins is beneficial in general, because it removes the
> > > burden of determining what's the way to read the flags register from the
> > > programmer and places it on to the compiler, which has the information
> > > needed to make that decision.
> >
> > Except that neither gcc nor clang attempt to make that decision.
> > They always do pushf; pop ax;
> >
> It looks like both GCC and Clang pop into virtual registers. The
> register allocator is then able to determine if it can allocate a
> physical register or if a stack slot is required.
Doing:
int fl;
void f(void) { fl = __builtin_ia32_readeflags_u64(); }
Seems to use register.
If it pops to a virtual register it will probably never pop
into a real target location.
See https://godbolt.org/z/8aY8o8rhe
But performance wise the pop+mov is just one byte longer.
Instruction decode time might be higher for two instruction, but since
'pop mem' generates 2 uops (intel) it may be constrained to the first
decoder (I can't rememberthe exact details), but the separate pop+mov
can be decoded in parallel - so could end up faster.
Actual execution time (if that makes any sense) is really the same.
Two operations, one pop and one memory write.
I bet you'd be hard pressed to find a piece of code where it even made
a consistent difference.
> > ...
> > > v4: - Clang now no longer generates stack frames when using these builtins.
> > > - Corrected misspellings.
> >
> > While clang 'head' has been fixed, it seems a bit premature to say
> > it is 'fixed' enough for all clang builds to use the builtin.
> >
> True, but it's been cherry-picked into the clang 14.0.0 branch, which
> is scheduled for release in March.
>
> > Seems better to change it (back) to "=r" and comment that this
> > is currently as good as __builtin_ia32_readeflags_u64() and that
> > clang makes a 'pigs breakfast' of "=rm" - which has only marginal
> > benefit.
> >
> That would be okay as far as code generation is concerned, but it does
> place the burden of correctness back on the programmer. Also, it was
> that at some point, but was changed to "=rm" here. :-)
As I said, a comment should stop the bounce.
...
> I was able to come up with an example where GCC generates "pushf ; pop mem":
>
> https://godbolt.org/z/9rocjdoaK
>
> (Clang generates a variation of "pop mem," and is horrible code, but
> it's meant for demonstration purposes only.) One interesting thing
> about the use of the builtins is that if at all possible, the "pop"
> instruction may be moved away from the "pushf" if it's safe and would
> reduce register pressure.
I wouldn't trust the compiler to get stack pointer relative accesses
right if it does move them apart.
Definitely scope for horrid bugs ;-)
David
-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)
On Fri, Feb 11, 2022 at 2:10 PM David Laight <David.Laight@aculab.com> wrote:
>
> From: Bill Wendling > Sent: 11 February 2022 19:26
> >
> > On Fri, Feb 11, 2022 at 8:40 AM David Laight <David.Laight@aculab.com> wrote:
> > > From: Bill Wendling
> > > > Sent: 10 February 2022 22:32
> > > >
> > > > GCC and Clang both have builtins to read and write the EFLAGS register.
> > > > This allows the compiler to determine the best way to generate this
> > > > code, which can improve code generation.
> > > >
> > > > This issue arose due to Clang's issue with the "=rm" constraint. Clang
> > > > chooses to be conservative in these situations, and so uses memory
> > > > instead of registers. This is a known issue, which is currently being
> > > > addressed.
> > > >
> > > > However, using builtins is beneficial in general, because it removes the
> > > > burden of determining what's the way to read the flags register from the
> > > > programmer and places it on to the compiler, which has the information
> > > > needed to make that decision.
> > >
> > > Except that neither gcc nor clang attempt to make that decision.
> > > They always do pushf; pop ax;
> > >
> > It looks like both GCC and Clang pop into virtual registers. The
> > register allocator is then able to determine if it can allocate a
> > physical register or if a stack slot is required.
>
> Doing:
> int fl;
> void f(void) { fl = __builtin_ia32_readeflags_u64(); }
> Seems to use register.
> If it pops to a virtual register it will probably never pop
> into a real target location.
>
> See https://godbolt.org/z/8aY8o8rhe
>
Yes, it does produce the appropriate code. What I meant was that,
internal to the compiler, the code that's generated before register
allocation contains "virtual" registers. I.e., registers that would
exist if the machine had an infinite number of registers to use. It's
the job of the register allocator to replace those virtual registers
with physical ones, or with spills to memory if no registers are
available. My (completely made up) example was to show that in an
extreme case where no registers are available, and no amount of code
motion can alleviate the register pressure, then both clang and gcc
will produce the appropriate spills to memory.
> But performance wise the pop+mov is just one byte longer.
> Instruction decode time might be higher for two instruction, but since
> 'pop mem' generates 2 uops (intel) it may be constrained to the first
> decoder (I can't rememberthe exact details), but the separate pop+mov
> can be decoded in parallel - so could end up faster.
>
It's the spill to memory that I'm trying to avoid here. I'm not
concerned about a "pop ; mov" combination (though even that is one
instruction too many except in the most extreme cases).
> Actual execution time (if that makes any sense) is really the same.
> Two operations, one pop and one memory write.
>
> I bet you'd be hard pressed to find a piece of code where it even made
> a consistent difference.
>
> > > ...
> > > > v4: - Clang now no longer generates stack frames when using these builtins.
> > > > - Corrected misspellings.
> > >
> > > While clang 'head' has been fixed, it seems a bit premature to say
> > > it is 'fixed' enough for all clang builds to use the builtin.
> > >
> > True, but it's been cherry-picked into the clang 14.0.0 branch, which
> > is scheduled for release in March.
> >
> > > Seems better to change it (back) to "=r" and comment that this
> > > is currently as good as __builtin_ia32_readeflags_u64() and that
> > > clang makes a 'pigs breakfast' of "=rm" - which has only marginal
> > > benefit.
> > >
> > That would be okay as far as code generation is concerned, but it does
> > place the burden of correctness back on the programmer. Also, it was
> > that at some point, but was changed to "=rm" here. :-)
>
> As I said, a comment should stop the bounce.
>
The constraints on this piece of code went from "=g" to "=r" to "=rm".
Which is the correct one and why? The last will apparently work in all
situations, because the compiler's left to determine the output
destination: register or memory. The "=r" constraint may fail if a
register cannot be allocated for some reason.
> ...
> > I was able to come up with an example where GCC generates "pushf ; pop mem":
> >
> > https://godbolt.org/z/9rocjdoaK
> >
> > (Clang generates a variation of "pop mem," and is horrible code, but
> > it's meant for demonstration purposes only.) One interesting thing
> > about the use of the builtins is that if at all possible, the "pop"
> > instruction may be moved away from the "pushf" if it's safe and would
> > reduce register pressure.
>
> I wouldn't trust the compiler to get stack pointer relative accesses
> right if it does move them apart.
> Definitely scope for horrid bugs ;-)
>
Compilers are pretty good at knowing such things and doing it correctly.
-bw
On Fri, Feb 11, 2022 at 11:25 AM Bill Wendling <morbo@google.com> wrote: > > On Fri, Feb 11, 2022 at 8:40 AM David Laight <David.Laight@aculab.com> wrote: > > From: Bill Wendling > > > Sent: 10 February 2022 22:32 > > > > > > v4: - Clang now no longer generates stack frames when using these builtins. > > > - Corrected misspellings. > > > > While clang 'head' has been fixed, it seems a bit premature to say > > it is 'fixed' enough for all clang builds to use the builtin. > > > True, but it's been cherry-picked into the clang 14.0.0 branch, which > is scheduled for release in March. While I've requested a cherry pick, the auto PR has not yet been merged into the release/14.x branch. https://github.com/llvm/llvm-project/issues/46875#issuecomment-1035262333 https://github.com/llvmbot/llvm-project/pull/42 I share David's concern. We currently support clang-11+, see Documentation/process/changes.rst Users external to Google do not necessarily live at ToT LLVM; we MUST support them. There are also multiple unwinders for x86 selectable at kconfig time; this patch has implications for those different choices. I would like to discuss this further off list with you Bill before you resend this patch again. Please do not resend this patch again until we've had the chance to speak more about it. I'm generally in favor of the intent of this patch, but let's make sure this patch is the best it can be for everyone, regardless of their compiler, compiler version, or unwinder. -- Thanks, ~Nick Desaulniers
On Fri, Feb 11, 2022 at 4:24 PM Nick Desaulniers <ndesaulniers@google.com> wrote: > I would like to discuss this further off list with you Bill before you > resend this patch again. That won't be necessary. -bw
On Sat, Feb 12, 2022 at 1:23 AM Bill Wendling <morbo@google.com> wrote: > > On Fri, Feb 11, 2022 at 4:24 PM Nick Desaulniers > <ndesaulniers@google.com> wrote: > > I would like to discuss this further off list with you Bill before you > > resend this patch again. > > That won't be necessary. Bill, I'm sorry; I could have worded that better+differently. In code review, I tend to use phrases like "consider doing X" for suggestions that are merely suggestions; things I don't care about. I should have been more explicit that my concern regarding released versions of clang and CONFIG_UNWINDER_ORC=y was not one of those cases. I like the intent of the patch; with the following fixups, I would gladly sign off on it and encourage the x86 maintainers to consider it further. 1. handle the case of released versions of clang and CONFIG_UNWINDER_ORC=y. Something with a grep'able comment for clang-14 so we can clean that up when the minimally supported version of clang is bumped. 2. update the commit message to refer to previous commits that modified these flags, as per Thomas in response to v1. 3. Note no change in generated code from GCC, as per Thomas in response to v2. Perhaps test a few more configs, since I only checked kmem_cache_free yet native_save_fl and local_irq_save have MANY more call sites. 4. Add `Link: https://github.com/llvm/llvm-project/issues/46875` to commit message. Optionally: 5. Add notes to commit message about the intent of this patch improving profiles for kmem_cache_free for CONFIG_SLAB=y. Feel free to use numbers I cited from our internal bug and https://lore.kernel.org/lkml/CAKwvOdmXxmYgqEOT4vSbN60smSkQRcc3B5duQAhaaYoaDo=Riw@mail.gmail.com/. 6. Add note to commit message about memory operands from Agner Fog's instruction tables in the above lore link. 7. Add note to commit that compilers can schedule the generated instructions from the builtins among instructions from surrounding C statements, which they cannot do for inline asm, as demonstrated by https://godbolt.org/z/EsP1KTjxa. Actually 7 is probably more precise than 3. Since there's no change generally in terms of operands chosen AFAICT, but GCC (and clang) could now better schedule instructions. It's still a good patch; please stick with it! Let me know how I can help. -- Thanks, ~Nick Desaulniers
This issue arose due to Clang's issue with the "=rm" constraint. Clang
chooses to be conservative in these situations and always uses memory
instead of registers, resulting in sub-optimal code. (This is a known
issue, which is currently being addressed.)
This function has gone through numerous changes over the years:
- The original version of this function used the "=g" constraint,
which has the following description:
Any register, memory or immediate integer operand is allowed,
except for registers that are not general registers.
- This was changed to "=r" in commit f1f029c7bfbf ("x86: fix assembly
constraints in native_save_fl()"), because someone noticed that:
the offset of the flags variable from the stack pointer will
change when the pushf is performed. gcc doesn't attempt to
understand that fact, and address used for pop will still be the
same. It will write to somewhere near flags on the stack but not
actually into it and overwrite some other value.
- However, commit f1f029c7bfbf ("x86: fix assembly constraints in
native_save_fl()") was partially reverted in commit ab94fcf528d1
("x86: allow "=rm" in native_save_fl()"), because the original
reporter of the issue was using a broken x86 simulator. The
justification for this change was:
"=rm" is allowed in this context, because "pop" is explicitly
defined to adjust the stack pointer *before* it evaluates its
effective address, if it has one. Thus, we do end up writing to
the correct address even if we use an on-stack memory argument.
Clang generates good code when the builtins are used. On one benchmark,
a hotspot in kmem_cache_free went from using 5.18% of cycles popping to
a memory address to 0.13% popping to a register. This benefit is
magnified given that this code is inlined in numerous places in the
kernel.
The builtins also help GCC. It allows GCC (and Clang) to reduce register
pressure and, consequently, register spills by rescheduling
instructions. It can't happen with instructions in inline assembly,
because compilers view inline assembly blocks as "black boxes," whose
instructions can't be rescheduled.
Another benefit of builtins over asm blocks is that compilers are able
to make more precise inlining decisions, since they no longer need to
rely on imprecise measures based on newline counts.
A trivial example demonstrates this code motion.
void y(void);
unsigned long x(void) {
unsigned long v = __builtin_ia32_readeflags_u64();
y();
return v;
}
GCC at -O1:
pushq %rbx
pushfq
popq %rbx
movl $0, %eax
call y
movq %rbx, %rax
popq %rbx
ret
GCC at -O2:
pushq %r12
pushfq
xorl %eax, %eax
popq %r12
call y
movq %r12, %rax
popq %r12
ret
Link: https://gist.github.com/nickdesaulniers/b4d0f6e26f8cbef0ae4c5352cfeaca67
Link: https://github.com/llvm/llvm-project/issues/20571
Link: https://gcc.gnu.org/onlinedocs/gcc/Simple-Constraints.html#Simple-Constraints
Link: https://godbolt.org/z/5n3Eov1xT
Signed-off-by: Bill Wendling <morbo@google.com>
Reviewed-by: Nick Desaulniers <ndesaulniers@google.com>
---
v5: - Incorporate Nick's suggestion to limit the change to Clang >= 14.0 and
GCC.
v4: - Clang now no longer generates stack frames when using these builtins.
- Corrected misspellings.
v3: - Add blurb indicating that GCC's output hasn't changed.
v2: - Kept the original function to retain the out-of-line symbol.
- Improved the commit message.
- Note that I couldn't use Nick's suggestion of
return IS_ENABLED(CONFIG_X86_64) ? ...
because Clang complains about using __builtin_ia32_readeflags_u32 in
64-bit mode.
---
arch/x86/include/asm/irqflags.h | 10 ++++++++++
1 file changed, 10 insertions(+)
diff --git a/arch/x86/include/asm/irqflags.h b/arch/x86/include/asm/irqflags.h
index 87761396e8cc..2eded855f6ab 100644
--- a/arch/x86/include/asm/irqflags.h
+++ b/arch/x86/include/asm/irqflags.h
@@ -19,6 +19,11 @@
extern inline unsigned long native_save_fl(void);
extern __always_inline unsigned long native_save_fl(void)
{
+#if defined(CC_IS_CLANG) && defined(UNWINDER_ORC) && CLANG_VERSION < 140000
+ /*
+ * Clang forced frame pointers via the builtins until Clang-14. Use
+ * this as a fall-back until the minimum Clang version is >= 14.0.
+ */
unsigned long flags;
/*
@@ -33,6 +38,11 @@ extern __always_inline unsigned long native_save_fl(void)
: "memory");
return flags;
+#elif defined(CONFIG_X86_64)
+ return __builtin_ia32_readeflags_u64();
+#else
+ return __builtin_ia32_readeflags_u32();
+#endif
}
static __always_inline void native_irq_disable(void)
--
2.35.1.574.g5d30c73bfb-goog
On Tue, Mar 1, 2022 at 12:19 PM Bill Wendling <morbo@google.com> wrote:
>
Bump for review.
-bw
> This issue arose due to Clang's issue with the "=rm" constraint. Clang
> chooses to be conservative in these situations and always uses memory
> instead of registers, resulting in sub-optimal code. (This is a known
> issue, which is currently being addressed.)
>
> This function has gone through numerous changes over the years:
>
> - The original version of this function used the "=g" constraint,
> which has the following description:
>
> Any register, memory or immediate integer operand is allowed,
> except for registers that are not general registers.
>
> - This was changed to "=r" in commit f1f029c7bfbf ("x86: fix assembly
> constraints in native_save_fl()"), because someone noticed that:
>
> the offset of the flags variable from the stack pointer will
> change when the pushf is performed. gcc doesn't attempt to
> understand that fact, and address used for pop will still be the
> same. It will write to somewhere near flags on the stack but not
> actually into it and overwrite some other value.
>
> - However, commit f1f029c7bfbf ("x86: fix assembly constraints in
> native_save_fl()") was partially reverted in commit ab94fcf528d1
> ("x86: allow "=rm" in native_save_fl()"), because the original
> reporter of the issue was using a broken x86 simulator. The
> justification for this change was:
>
> "=rm" is allowed in this context, because "pop" is explicitly
> defined to adjust the stack pointer *before* it evaluates its
> effective address, if it has one. Thus, we do end up writing to
> the correct address even if we use an on-stack memory argument.
>
> Clang generates good code when the builtins are used. On one benchmark,
> a hotspot in kmem_cache_free went from using 5.18% of cycles popping to
> a memory address to 0.13% popping to a register. This benefit is
> magnified given that this code is inlined in numerous places in the
> kernel.
>
> The builtins also help GCC. It allows GCC (and Clang) to reduce register
> pressure and, consequently, register spills by rescheduling
> instructions. It can't happen with instructions in inline assembly,
> because compilers view inline assembly blocks as "black boxes," whose
> instructions can't be rescheduled.
>
> Another benefit of builtins over asm blocks is that compilers are able
> to make more precise inlining decisions, since they no longer need to
> rely on imprecise measures based on newline counts.
>
> A trivial example demonstrates this code motion.
>
> void y(void);
> unsigned long x(void) {
> unsigned long v = __builtin_ia32_readeflags_u64();
> y();
> return v;
> }
>
> GCC at -O1:
> pushq %rbx
> pushfq
> popq %rbx
> movl $0, %eax
> call y
> movq %rbx, %rax
> popq %rbx
> ret
>
> GCC at -O2:
> pushq %r12
> pushfq
> xorl %eax, %eax
> popq %r12
> call y
> movq %r12, %rax
> popq %r12
> ret
>
> Link: https://gist.github.com/nickdesaulniers/b4d0f6e26f8cbef0ae4c5352cfeaca67
> Link: https://github.com/llvm/llvm-project/issues/20571
> Link: https://gcc.gnu.org/onlinedocs/gcc/Simple-Constraints.html#Simple-Constraints
> Link: https://godbolt.org/z/5n3Eov1xT
> Signed-off-by: Bill Wendling <morbo@google.com>
> Reviewed-by: Nick Desaulniers <ndesaulniers@google.com>
> ---
> v5: - Incorporate Nick's suggestion to limit the change to Clang >= 14.0 and
> GCC.
> v4: - Clang now no longer generates stack frames when using these builtins.
> - Corrected misspellings.
> v3: - Add blurb indicating that GCC's output hasn't changed.
> v2: - Kept the original function to retain the out-of-line symbol.
> - Improved the commit message.
> - Note that I couldn't use Nick's suggestion of
>
> return IS_ENABLED(CONFIG_X86_64) ? ...
>
> because Clang complains about using __builtin_ia32_readeflags_u32 in
> 64-bit mode.
> ---
> arch/x86/include/asm/irqflags.h | 10 ++++++++++
> 1 file changed, 10 insertions(+)
>
> diff --git a/arch/x86/include/asm/irqflags.h b/arch/x86/include/asm/irqflags.h
> index 87761396e8cc..2eded855f6ab 100644
> --- a/arch/x86/include/asm/irqflags.h
> +++ b/arch/x86/include/asm/irqflags.h
> @@ -19,6 +19,11 @@
> extern inline unsigned long native_save_fl(void);
> extern __always_inline unsigned long native_save_fl(void)
> {
> +#if defined(CC_IS_CLANG) && defined(UNWINDER_ORC) && CLANG_VERSION < 140000
> + /*
> + * Clang forced frame pointers via the builtins until Clang-14. Use
> + * this as a fall-back until the minimum Clang version is >= 14.0.
> + */
> unsigned long flags;
>
> /*
> @@ -33,6 +38,11 @@ extern __always_inline unsigned long native_save_fl(void)
> : "memory");
>
> return flags;
> +#elif defined(CONFIG_X86_64)
> + return __builtin_ia32_readeflags_u64();
> +#else
> + return __builtin_ia32_readeflags_u32();
> +#endif
> }
>
> static __always_inline void native_irq_disable(void)
> --
> 2.35.1.574.g5d30c73bfb-goog
>
© 2016 - 2026 Red Hat, Inc.