[v3] x86: use builtins to read eflags

[PATCH v3] x86: use builtins to read eflags

Posted by Bill Wendling 4 years, 4 months ago

GCC and Clang both have builtins to read and write the EFLAGS register.
This allows the compiler to determine the best way to generate this
code, which can improve code generation.

This issue arose due to Clang's issue with the "=rm" constraint.  Clang
chooses to be conservative in these situations, and so uses memory
instead of registers. This is a known issue, which is currently being
addressed.

However, using builtins is benefiical in general, because it removes the
burden of determining what's the way to read the flags register from the
programmer and places it on to the compiler, which has the information
needed to make that decision. Indeed, this piece of code has had several
changes over the years, some of which were pinging back and forth to
determine the correct constraints to use.

With this change, Clang generates better code:

Original code:
        movq    $0, -48(%rbp)
        #APP
        # __raw_save_flags
        pushfq
        popq    -48(%rbp)
        #NO_APP
        movq    -48(%rbp), %rbx

New code:
        pushfq
        popq    %rbx
        #APP

Note that the stack slot in the original code is no longer needed in the
new code, saving a small amount of stack space.

There is no change to GCC's ouput:

Original code:

        # __raw_save_flags
        pushf ; pop %r13        # flags

New code:

	pushfq
        popq    %r13    # _23

Signed-off-by: Bill Wendling <morbo@google.com>
---
v3: - Add blurb indicating that GCC's output hasn't changed.
v2: - Kept the original function to retain the out-of-line symbol.
    - Improved the commit message.
    - Note that I couldn't use Nick's suggestion of

        return IS_ENABLED(CONFIG_X86_64) ? ...

      because Clang complains about using __builtin_ia32_readeflags_u32 in
      64-bit mode.
---
 arch/x86/include/asm/irqflags.h | 19 +++++--------------
 1 file changed, 5 insertions(+), 14 deletions(-)

diff --git a/arch/x86/include/asm/irqflags.h b/arch/x86/include/asm/irqflags.h
index 87761396e8cc0..f31a035f3c6a9 100644
--- a/arch/x86/include/asm/irqflags.h
+++ b/arch/x86/include/asm/irqflags.h
@@ -19,20 +19,11 @@
 extern inline unsigned long native_save_fl(void);
 extern __always_inline unsigned long native_save_fl(void)
 {
-	unsigned long flags;
-
-	/*
-	 * "=rm" is safe here, because "pop" adjusts the stack before
-	 * it evaluates its effective address -- this is part of the
-	 * documented behavior of the "pop" instruction.
-	 */
-	asm volatile("# __raw_save_flags\n\t"
-		     "pushf ; pop %0"
-		     : "=rm" (flags)
-		     : /* no input */
-		     : "memory");
-
-	return flags;
+#ifdef CONFIG_X86_64
+	return __builtin_ia32_readeflags_u64();
+#else
+	return __builtin_ia32_readeflags_u32();
+#endif
 }
 
 static __always_inline void native_irq_disable(void)
-- 
2.35.0.263.gb82422642f-goog

Re: [PATCH v3] x86: use builtins to read eflags

Posted by Nick Desaulniers 4 years, 4 months ago

On Thu, Feb 3, 2022 at 4:57 PM Bill Wendling <morbo@google.com> wrote:
>
> GCC and Clang both have builtins to read and write the EFLAGS register.
> This allows the compiler to determine the best way to generate this
> code, which can improve code generation.
>
> This issue arose due to Clang's issue with the "=rm" constraint.  Clang
> chooses to be conservative in these situations, and so uses memory
> instead of registers. This is a known issue, which is currently being
> addressed.
>
> However, using builtins is benefiical in general, because it removes the

s/benefiical/beneficial/

> burden of determining what's the way to read the flags register from the
> programmer and places it on to the compiler, which has the information
> needed to make that decision. Indeed, this piece of code has had several
> changes over the years, some of which were pinging back and forth to
> determine the correct constraints to use.
>
> With this change, Clang generates better code:
>
> Original code:
>         movq    $0, -48(%rbp)
>         #APP
>         # __raw_save_flags
>         pushfq
>         popq    -48(%rbp)
>         #NO_APP
>         movq    -48(%rbp), %rbx
>
> New code:
>         pushfq
>         popq    %rbx
>         #APP

But it also forces frame pointers due to another bug in LLVM.
https://godbolt.org/z/6badWaGjo
For x86_64, we default to CONFIG_UNWINDER_ORC=y, not
CONFIG_UNWINDER_FRAME_POINTER=y.  So this change would make us use
registers instead of stack slots (improvement), but then force frame
pointers when we probably didn't need or want them (deterioration) for
all released versions of clang.

I think we should fix https://reviews.llvm.org/D92695 first before I'd
be comfortable signing off on this kernel change.  Again, I think we
should test out Phoebe's recommendation
https://reviews.llvm.org/D92695#inline-1086936
or do you already have a fix that I haven't yet been cc'ed on, perhaps?

>
> Note that the stack slot in the original code is no longer needed in the
> new code, saving a small amount of stack space.
>
> There is no change to GCC's ouput:
>
> Original code:
>
>         # __raw_save_flags
>         pushf ; pop %r13        # flags
>
> New code:
>
>         pushfq
>         popq    %r13    # _23
>
> Signed-off-by: Bill Wendling <morbo@google.com>
> ---
> v3: - Add blurb indicating that GCC's output hasn't changed.
> v2: - Kept the original function to retain the out-of-line symbol.
>     - Improved the commit message.
>     - Note that I couldn't use Nick's suggestion of
>
>         return IS_ENABLED(CONFIG_X86_64) ? ...
>
>       because Clang complains about using __builtin_ia32_readeflags_u32 in
>       64-bit mode.
> ---
>  arch/x86/include/asm/irqflags.h | 19 +++++--------------
>  1 file changed, 5 insertions(+), 14 deletions(-)
>
> diff --git a/arch/x86/include/asm/irqflags.h b/arch/x86/include/asm/irqflags.h
> index 87761396e8cc0..f31a035f3c6a9 100644
> --- a/arch/x86/include/asm/irqflags.h
> +++ b/arch/x86/include/asm/irqflags.h
> @@ -19,20 +19,11 @@
>  extern inline unsigned long native_save_fl(void);
>  extern __always_inline unsigned long native_save_fl(void)
>  {
> -       unsigned long flags;
> -
> -       /*
> -        * "=rm" is safe here, because "pop" adjusts the stack before
> -        * it evaluates its effective address -- this is part of the
> -        * documented behavior of the "pop" instruction.
> -        */
> -       asm volatile("# __raw_save_flags\n\t"
> -                    "pushf ; pop %0"
> -                    : "=rm" (flags)
> -                    : /* no input */
> -                    : "memory");
> -
> -       return flags;
> +#ifdef CONFIG_X86_64
> +       return __builtin_ia32_readeflags_u64();
> +#else
> +       return __builtin_ia32_readeflags_u32();
> +#endif
>  }
>
>  static __always_inline void native_irq_disable(void)
> --
> 2.35.0.263.gb82422642f-goog
>


-- 
Thanks,
~Nick Desaulniers

RE: [PATCH v3] x86: use builtins to read eflags

Posted by David Laight 4 years, 4 months ago

From: Nick Desaulniers
> Sent: 07 February 2022 22:12
> 
> On Thu, Feb 3, 2022 at 4:57 PM Bill Wendling <morbo@google.com> wrote:
> >
> > GCC and Clang both have builtins to read and write the EFLAGS register.
> > This allows the compiler to determine the best way to generate this
> > code, which can improve code generation.
> >
> > This issue arose due to Clang's issue with the "=rm" constraint.  Clang
> > chooses to be conservative in these situations, and so uses memory
> > instead of registers. This is a known issue, which is currently being
> > addressed.

How much performance would be lost by just using "=r"?

You get two instructions if the actual target is memory.
This might be a marginal code size increase - but not much,
It might also slow things down if the execution is limited
by the instruction decoder.

But on Intel cpu 'pop memory' is 2 uops, exactly the same
as 'pop register' 'store register' (and I think amd is similar).
So the actual execution time is exactly the same for both.

Also it looks like clang's builtin is effectively "=r".
Compiling:
long fl;
void native_save_fl(void) {
       fl = __builtin_ia32_readeflags_u64();
}
Not only generates a stack frame, it also generates:
pushf; pop %rax; mov mem, %rax.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)

Re: [PATCH v3] x86: use builtins to read eflags

Posted by Bill Wendling 4 years, 4 months ago

On Tue, Feb 8, 2022 at 1:14 AM David Laight <David.Laight@aculab.com> wrote:
>
> From: Nick Desaulniers
> > Sent: 07 February 2022 22:12
> >
> > On Thu, Feb 3, 2022 at 4:57 PM Bill Wendling <morbo@google.com> wrote:
> > >
> > > GCC and Clang both have builtins to read and write the EFLAGS register.
> > > This allows the compiler to determine the best way to generate this
> > > code, which can improve code generation.
> > >
> > > This issue arose due to Clang's issue with the "=rm" constraint.  Clang
> > > chooses to be conservative in these situations, and so uses memory
> > > instead of registers. This is a known issue, which is currently being
> > > addressed.
>
> How much performance would be lost by just using "=r"?
>
> You get two instructions if the actual target is memory.
> This might be a marginal code size increase - but not much,
> It might also slow things down if the execution is limited
> by the instruction decoder.
>
> But on Intel cpu 'pop memory' is 2 uops, exactly the same
> as 'pop register' 'store register' (and I think amd is similar).
> So the actual execution time is exactly the same for both.
>
> Also it looks like clang's builtin is effectively "=r".
> Compiling:
> long fl;
> void native_save_fl(void) {
>        fl = __builtin_ia32_readeflags_u64();
> }
> Not only generates a stack frame, it also generates:
> pushf; pop %rax; mov mem, %rax.
>
It used to be "=r" (see f1f029c7bfbf4e), but was changed back to "=rm"
in ab94fcf528d127. This pinging back and forth is another reason to
use the builtins and be done with it.

-bw

Re: [PATCH v3] x86: use builtins to read eflags

Posted by Nick Desaulniers 4 years, 4 months ago

On Tue, Feb 8, 2022 at 1:14 AM David Laight <David.Laight@aculab.com> wrote:
>
> From: Nick Desaulniers
> > Sent: 07 February 2022 22:12
> >
> > On Thu, Feb 3, 2022 at 4:57 PM Bill Wendling <morbo@google.com> wrote:
> > >
> > > GCC and Clang both have builtins to read and write the EFLAGS register.
> > > This allows the compiler to determine the best way to generate this
> > > code, which can improve code generation.
> > >
> > > This issue arose due to Clang's issue with the "=rm" constraint.  Clang
> > > chooses to be conservative in these situations, and so uses memory
> > > instead of registers. This is a known issue, which is currently being
> > > addressed.
>
> How much performance would be lost by just using "=r"?

It adds undue register pressure.  When native_save_fl() is inlined
into callers, requiring an operand to be in a register clobbers it, so
we may need to insert stack spills + reloads for values produced
before but consumed after the asm stmt if we run out of registers to
allocate for the live range.

Looking at the disassembly of native_save_fl() or calls to
__builtin_ia32_readeflags_u64() in isolation are less interesting than
what happens once native_save_fl() starts getting inlined into
callers.

>
> You get two instructions if the actual target is memory.
> This might be a marginal code size increase - but not much,
> It might also slow things down if the execution is limited
> by the instruction decoder.

"=rm" vs "=r" is more so about whether the operands to the asm are
required to be in memory or not, rather than number of instructions
generated.

>
> But on Intel cpu 'pop memory' is 2 uops, exactly the same
> as 'pop register' 'store register' (and I think amd is similar).
> So the actual execution time is exactly the same for both.

Page 10 of Agner Fog's
https://www.agner.org/optimize/instruction_tables.pdf
shows that a pop to memory is 3 "ops" while pop to register is 2. Page
9 defines "ops" as

    Number of macro-operations issued from instruction decoder to
schedulers. Instructions with more than 2 macro-operations use
microcode.

Internally, it looks like we have benchmarks that show that
kmem_cache_free spends 5.18% of cycles in that pop-to-memory
instruction (looks like this is CONFIG_SLAB).  Further,
native_save_fl() is called by local_irq_save() all over the kernel.
After this patch, the hot spot disappears from the profile (the
push+pop becomes 0.003% of samples from 5.18%).

So I think with those above two notes, that should answer the question
"why is any of this important?"

Further, page 10 also notes that push to memory is 2 "ops" and push to
register is 1, but that doesn't make a difference in the profiles.

Additionally, I compare the disassembly of kmem_cache_free
(CONFIG_SLAB) before+after this patch. There is no difference for GCC,
but you can see quite an improvement for clang:
https://gist.github.com/nickdesaulniers/7f143bae821d86603294bf00b8bd5074
This is the kind of info I would give Thomas when he asks about
changes to GCC codegen.

To the point I made last Friday about support for actually released
versions of clang, here's how v4 of the patch makes clang codegen
worse. This is ToT llvm with f3481f43bbe2c8a24e74210d310cf3be291bf52d
reverted. (That's the commit that fixed the forcibly inserted stack
frame).
https://gist.github.com/nickdesaulniers/b4d0f6e26f8cbef0ae4c5352cfeaca67

Specifically, CONFIG_UNWINDER_FRAME_POINTER=y is improved;
CONFIG_UNWINDER_ORC=y is made worse (for all released versions of
clang, since they don't yet have
f3481f43bbe2c8a24e74210d310cf3be291bf52d).

Therefore, v5 of this patch should have something like:

diff --git a/arch/x86/include/asm/irqflags.h b/arch/x86/include/asm/irqflags.h
index f31a035f3c6a..2d2d4d8ab823 100644
--- a/arch/x86/include/asm/irqflags.h
+++ b/arch/x86/include/asm/irqflags.h
@@ -19,7 +19,12 @@
 extern inline unsigned long native_save_fl(void);
 extern __always_inline unsigned long native_save_fl(void)
 {
-#ifdef CONFIG_X86_64
+/* clang would force frame pointers via the builtins until clang-14 */
+#if defined(CC_IS_CLANG) && defined(UNWINDER_ORC) && CLANG_VERSION < 140000
+       unsigned long flags;
+       asm volatile ("pushf; pop %0" : "=rm"(flags));
+       return flags;
+#elif defined(CONFIG_X86_64)
        return __builtin_ia32_readeflags_u64();
 #else
        return __builtin_ia32_readeflags_u32();

(though I would keep the existing code and comment for the complex
case, rather than blindly applying my full diff above; and I would
wait for https://github.com/llvmbot/llvm-project/pull/42 to actually
be merged before claiming clang was fixed. oh, looks like it just was.
I'd add `Link: https://github.com/llvm/llvm-project/issues/46875`
explicitly to the commit message.)

Then once clang-14 is the minimally supported version of clang
according to Documentation/process/changes.rst we can drop the
existing inline asm and simply retain the builtins.

Thomas also asked in review of v1 that additional info about previous
commits be included in the commit message.
https://lore.kernel.org/llvm/87o85faum5.ffs@tglx/

Finally, if we need exhaustive proof that the builtins are better for
codegen, even with GCC, I think this case demonstrates so perfectly:
https://godbolt.org/z/5n3Eov1xT

Because compilers treat inline asm a black boxes, they can't schedule
instructions in the inline asm with instructions lowered from the
surrounding C statements.  With the compiler builtins, they now can,
because the assembler isn't involved.  Perhaps noting that in the
commit message would help improve the first paragraph of the commit
message of v4.

>
> Also it looks like clang's builtin is effectively "=r".

It's not; it's closer to how "=rm" should work; add register pressure
and you may find a memory operand rather than a register operand.  But
the builtin is resolved by different machinery than the inline asm.
The inline asm is conservative in the presence of both r and m and
picks m to avoid undue register pressure; r and m together is treated
as "r or m" with a priority for m.  In this case, we'd prefer r's
codegen when possible.

I read "=rm" as "r" or "m"; we'd prefer if llvm chose "r" for most
cases, but it will choose "m" conservatively which is legal by the
"or". "r" would be more precise for what we want. I'm not sure if
there are configs that could exhaust the register allocator though if
"m" was not specified as an option.

AFAICT
commit ab94fcf528d1 ("x86: allow "=rm" in native_save_fl()")
does not cite register exhaustion as a reason to permit "m", so
perhaps the version from
commit f1f029c7bfbf ("x86: fix assembly constraints in native_save_fl()")
would be preferred.  That said, who knows how "r" will affect codegen
elsewhere, which is why my suggested diff above retains "=rm" as the
constraint for the fallback code. "No functional change" for released
versions of clang.

> Compiling:
> long fl;
> void native_save_fl(void) {
>        fl = __builtin_ia32_readeflags_u64();
> }
> Not only generates a stack frame, it also generates:
> pushf; pop %rax; mov mem, %rax.

https://github.com/llvm/llvm-project/issues/46875

--
Thanks,
~Nick Desaulniers

[PATCH v4] x86: use builtins to read eflags

Posted by Bill Wendling 4 years, 4 months ago

GCC and Clang both have builtins to read and write the EFLAGS register.
This allows the compiler to determine the best way to generate this
code, which can improve code generation.

This issue arose due to Clang's issue with the "=rm" constraint.  Clang
chooses to be conservative in these situations, and so uses memory
instead of registers. This is a known issue, which is currently being
addressed.

However, using builtins is beneficial in general, because it removes the
burden of determining what's the way to read the flags register from the
programmer and places it on to the compiler, which has the information
needed to make that decision. Indeed, this piece of code has had several
changes over the years, some of which were pinging back and forth to
determine the correct constraints to use.

With this change, Clang generates better code:

Original code:
        movq    $0, -48(%rbp)
        #APP
        # __raw_save_flags
        pushfq
        popq    -48(%rbp)
        #NO_APP
        movq    -48(%rbp), %rbx

New code:
        pushfq
        popq    %rbx
        #APP

Note that the stack slot in the original code is no longer needed in the
new code, saving a small amount of stack space.

There is no change to GCC's output:

Original code:

        # __raw_save_flags
        pushf ; pop %r13        # flags

New code:

        pushfq
        popq    %r13    # _23

Signed-off-by: Bill Wendling <morbo@google.com>
---
v4: - Clang now no longer generates stack frames when using these builtins.
    - Corrected misspellings.
v3: - Add blurb indicating that GCC's output hasn't changed.
v2: - Kept the original function to retain the out-of-line symbol.
    - Improved the commit message.
    - Note that I couldn't use Nick's suggestion of

        return IS_ENABLED(CONFIG_X86_64) ? ...

      because Clang complains about using __builtin_ia32_readeflags_u32 in
      64-bit mode.
---
 arch/x86/include/asm/irqflags.h | 19 +++++--------------
 1 file changed, 5 insertions(+), 14 deletions(-)

diff --git a/arch/x86/include/asm/irqflags.h b/arch/x86/include/asm/irqflags.h
index 87761396e8cc..f31a035f3c6a 100644
--- a/arch/x86/include/asm/irqflags.h
+++ b/arch/x86/include/asm/irqflags.h
@@ -19,20 +19,11 @@
 extern inline unsigned long native_save_fl(void);
 extern __always_inline unsigned long native_save_fl(void)
 {
-	unsigned long flags;
-
-	/*
-	 * "=rm" is safe here, because "pop" adjusts the stack before
-	 * it evaluates its effective address -- this is part of the
-	 * documented behavior of the "pop" instruction.
-	 */
-	asm volatile("# __raw_save_flags\n\t"
-		     "pushf ; pop %0"
-		     : "=rm" (flags)
-		     : /* no input */
-		     : "memory");
-
-	return flags;
+#ifdef CONFIG_X86_64
+	return __builtin_ia32_readeflags_u64();
+#else
+	return __builtin_ia32_readeflags_u32();
+#endif
 }
 
 static __always_inline void native_irq_disable(void)
-- 
2.35.1.265.g69c8d7142f-goog

RE: [PATCH v4] x86: use builtins to read eflags

Posted by David Laight 4 years, 4 months ago

From: Bill Wendling
> Sent: 10 February 2022 22:32
> 
> GCC and Clang both have builtins to read and write the EFLAGS register.
> This allows the compiler to determine the best way to generate this
> code, which can improve code generation.
> 
> This issue arose due to Clang's issue with the "=rm" constraint.  Clang
> chooses to be conservative in these situations, and so uses memory
> instead of registers. This is a known issue, which is currently being
> addressed.
> 
> However, using builtins is beneficial in general, because it removes the
> burden of determining what's the way to read the flags register from the
> programmer and places it on to the compiler, which has the information
> needed to make that decision.

Except that neither gcc nor clang attempt to make that decision.
They always do pushf; pop ax;

...
> v4: - Clang now no longer generates stack frames when using these builtins.
>     - Corrected misspellings.

While clang 'head' has been fixed, it seems a bit premature to say
it is 'fixed' enough for all clang builds to use the builtin.

Seems better to change it (back) to "=r" and comment that this
is currently as good as __builtin_ia32_readeflags_u64() and that
clang makes a 'pigs breakfast' of "=rm" - which has only marginal
benefit.

Changing to __builtin_ia32_readeflags_u64() may be worth while
if/when the compilers will generate pushf; pop mem; for it.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)

Re: [PATCH v4] x86: use builtins to read eflags

Posted by Bill Wendling 4 years, 4 months ago

On Fri, Feb 11, 2022 at 8:40 AM David Laight <David.Laight@aculab.com> wrote:
> From: Bill Wendling
> > Sent: 10 February 2022 22:32
> >
> > GCC and Clang both have builtins to read and write the EFLAGS register.
> > This allows the compiler to determine the best way to generate this
> > code, which can improve code generation.
> >
> > This issue arose due to Clang's issue with the "=rm" constraint.  Clang
> > chooses to be conservative in these situations, and so uses memory
> > instead of registers. This is a known issue, which is currently being
> > addressed.
> >
> > However, using builtins is beneficial in general, because it removes the
> > burden of determining what's the way to read the flags register from the
> > programmer and places it on to the compiler, which has the information
> > needed to make that decision.
>
> Except that neither gcc nor clang attempt to make that decision.
> They always do pushf; pop ax;
>
It looks like both GCC and Clang pop into virtual registers. The
register allocator is then able to determine if it can allocate a
physical register or if a stack slot is required.

> ...
> > v4: - Clang now no longer generates stack frames when using these builtins.
> >     - Corrected misspellings.
>
> While clang 'head' has been fixed, it seems a bit premature to say
> it is 'fixed' enough for all clang builds to use the builtin.
>
True, but it's been cherry-picked into the clang 14.0.0 branch, which
is scheduled for release in March.

> Seems better to change it (back) to "=r" and comment that this
> is currently as good as __builtin_ia32_readeflags_u64() and that
> clang makes a 'pigs breakfast' of "=rm" - which has only marginal
> benefit.
>
That would be okay as far as code generation is concerned, but it does
place the burden of correctness back on the programmer. Also, it was
that at some point, but was changed to "=rm" here. :-)

commit ab94fcf528d127fcb490175512a8910f37e5b346
Author: H. Peter Anvin <hpa@zytor.com>
Date:   Tue Aug 25 16:47:16 2009 -0700

    x86: allow "=rm" in native_save_fl()

    This is a partial revert of f1f029c7bfbf4ee1918b90a431ab823bed812504.

    "=rm" is allowed in this context, because "pop" is explicitly defined
    to adjust the stack pointer *before* it evaluates its effective
    address, if it has one.  Thus, we do end up writing to the correct
    address even if we use an on-stack memory argument.

    The original reporter for f1f029c7bfbf4ee1918b90a431ab823bed812504 was
    apparently using a broken x86 simulator.

    [ Impact: performance ]

    Signed-off-by: H. Peter Anvin <hpa@zytor.com>
    Cc: Gabe Black <spamforgabe@umich.edu>


> Changing to __builtin_ia32_readeflags_u64() may be worth while
> if/when the compilers will generate pushf; pop mem; for it.
>
I was able to come up with an example where GCC generates "pushf ; pop mem":

  https://godbolt.org/z/9rocjdoaK

(Clang generates a variation of "pop mem," and is horrible code, but
it's meant for demonstration purposes only.) One interesting thing
about the use of the builtins is that if at all possible, the "pop"
instruction may be moved away from the "pushf" if it's safe and would
reduce register pressure.

-bw

RE: [PATCH v4] x86: use builtins to read eflags

Posted by David Laight 4 years, 4 months ago

From: Bill Wendling > Sent: 11 February 2022 19:26
> 
> On Fri, Feb 11, 2022 at 8:40 AM David Laight <David.Laight@aculab.com> wrote:
> > From: Bill Wendling
> > > Sent: 10 February 2022 22:32
> > >
> > > GCC and Clang both have builtins to read and write the EFLAGS register.
> > > This allows the compiler to determine the best way to generate this
> > > code, which can improve code generation.
> > >
> > > This issue arose due to Clang's issue with the "=rm" constraint.  Clang
> > > chooses to be conservative in these situations, and so uses memory
> > > instead of registers. This is a known issue, which is currently being
> > > addressed.
> > >
> > > However, using builtins is beneficial in general, because it removes the
> > > burden of determining what's the way to read the flags register from the
> > > programmer and places it on to the compiler, which has the information
> > > needed to make that decision.
> >
> > Except that neither gcc nor clang attempt to make that decision.
> > They always do pushf; pop ax;
> >
> It looks like both GCC and Clang pop into virtual registers. The
> register allocator is then able to determine if it can allocate a
> physical register or if a stack slot is required.

Doing:
	int fl;
	void f(void) { fl = __builtin_ia32_readeflags_u64(); }
Seems to use register.
If it pops to a virtual register it will probably never pop
into a real target location.

See https://godbolt.org/z/8aY8o8rhe

But performance wise the pop+mov is just one byte longer.
Instruction decode time might be higher for two instruction, but since
'pop mem' generates 2 uops (intel) it may be constrained to the first
decoder (I can't rememberthe exact details), but the separate pop+mov
can be decoded in parallel - so could end up faster.

Actual execution time (if that makes any sense) is really the same.
Two operations, one pop and one memory write.

I bet you'd be hard pressed to find a piece of code where it even made
a consistent difference.

> > ...
> > > v4: - Clang now no longer generates stack frames when using these builtins.
> > >     - Corrected misspellings.
> >
> > While clang 'head' has been fixed, it seems a bit premature to say
> > it is 'fixed' enough for all clang builds to use the builtin.
> >
> True, but it's been cherry-picked into the clang 14.0.0 branch, which
> is scheduled for release in March.
> 
> > Seems better to change it (back) to "=r" and comment that this
> > is currently as good as __builtin_ia32_readeflags_u64() and that
> > clang makes a 'pigs breakfast' of "=rm" - which has only marginal
> > benefit.
> >
> That would be okay as far as code generation is concerned, but it does
> place the burden of correctness back on the programmer. Also, it was
> that at some point, but was changed to "=rm" here. :-)

As I said, a comment should stop the bounce.

...
> I was able to come up with an example where GCC generates "pushf ; pop mem":
> 
>   https://godbolt.org/z/9rocjdoaK
> 
> (Clang generates a variation of "pop mem," and is horrible code, but
> it's meant for demonstration purposes only.) One interesting thing
> about the use of the builtins is that if at all possible, the "pop"
> instruction may be moved away from the "pushf" if it's safe and would
> reduce register pressure.

I wouldn't trust the compiler to get stack pointer relative accesses
right if it does move them apart.
Definitely scope for horrid bugs ;-)

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)

Re: [PATCH v4] x86: use builtins to read eflags

Posted by Bill Wendling 4 years, 4 months ago

On Fri, Feb 11, 2022 at 2:10 PM David Laight <David.Laight@aculab.com> wrote:
>
> From: Bill Wendling > Sent: 11 February 2022 19:26
> >
> > On Fri, Feb 11, 2022 at 8:40 AM David Laight <David.Laight@aculab.com> wrote:
> > > From: Bill Wendling
> > > > Sent: 10 February 2022 22:32
> > > >
> > > > GCC and Clang both have builtins to read and write the EFLAGS register.
> > > > This allows the compiler to determine the best way to generate this
> > > > code, which can improve code generation.
> > > >
> > > > This issue arose due to Clang's issue with the "=rm" constraint.  Clang
> > > > chooses to be conservative in these situations, and so uses memory
> > > > instead of registers. This is a known issue, which is currently being
> > > > addressed.
> > > >
> > > > However, using builtins is beneficial in general, because it removes the
> > > > burden of determining what's the way to read the flags register from the
> > > > programmer and places it on to the compiler, which has the information
> > > > needed to make that decision.
> > >
> > > Except that neither gcc nor clang attempt to make that decision.
> > > They always do pushf; pop ax;
> > >
> > It looks like both GCC and Clang pop into virtual registers. The
> > register allocator is then able to determine if it can allocate a
> > physical register or if a stack slot is required.
>
> Doing:
>         int fl;
>         void f(void) { fl = __builtin_ia32_readeflags_u64(); }
> Seems to use register.
> If it pops to a virtual register it will probably never pop
> into a real target location.
>
> See https://godbolt.org/z/8aY8o8rhe
>
Yes, it does produce the appropriate code. What I meant was that,
internal to the compiler, the code that's generated before register
allocation contains "virtual" registers. I.e., registers that would
exist if the machine had an infinite number of registers to use. It's
the job of the register allocator to replace those virtual registers
with physical ones, or with spills to memory if no registers are
available. My (completely made up) example was to show that in an
extreme case where no registers are available, and no amount of code
motion can alleviate the register pressure, then both clang and gcc
will produce the appropriate spills to memory.

> But performance wise the pop+mov is just one byte longer.
> Instruction decode time might be higher for two instruction, but since
> 'pop mem' generates 2 uops (intel) it may be constrained to the first
> decoder (I can't rememberthe exact details), but the separate pop+mov
> can be decoded in parallel - so could end up faster.
>
It's the spill to memory that I'm trying to avoid here. I'm not
concerned about a "pop ; mov" combination (though even that is one
instruction too many except in the most extreme cases).

> Actual execution time (if that makes any sense) is really the same.
> Two operations, one pop and one memory write.
>
> I bet you'd be hard pressed to find a piece of code where it even made
> a consistent difference.
>
> > > ...
> > > > v4: - Clang now no longer generates stack frames when using these builtins.
> > > >     - Corrected misspellings.
> > >
> > > While clang 'head' has been fixed, it seems a bit premature to say
> > > it is 'fixed' enough for all clang builds to use the builtin.
> > >
> > True, but it's been cherry-picked into the clang 14.0.0 branch, which
> > is scheduled for release in March.
> >
> > > Seems better to change it (back) to "=r" and comment that this
> > > is currently as good as __builtin_ia32_readeflags_u64() and that
> > > clang makes a 'pigs breakfast' of "=rm" - which has only marginal
> > > benefit.
> > >
> > That would be okay as far as code generation is concerned, but it does
> > place the burden of correctness back on the programmer. Also, it was
> > that at some point, but was changed to "=rm" here. :-)
>
> As I said, a comment should stop the bounce.
>
The constraints on this piece of code went from "=g" to "=r" to "=rm".
Which is the correct one and why? The last will apparently work in all
situations, because the compiler's left to determine the output
destination: register or memory. The "=r" constraint may fail if a
register cannot be allocated for some reason.

> ...
> > I was able to come up with an example where GCC generates "pushf ; pop mem":
> >
> >   https://godbolt.org/z/9rocjdoaK
> >
> > (Clang generates a variation of "pop mem," and is horrible code, but
> > it's meant for demonstration purposes only.) One interesting thing
> > about the use of the builtins is that if at all possible, the "pop"
> > instruction may be moved away from the "pushf" if it's safe and would
> > reduce register pressure.
>
> I wouldn't trust the compiler to get stack pointer relative accesses
> right if it does move them apart.
> Definitely scope for horrid bugs ;-)
>
Compilers are pretty good at knowing such things and doing it correctly.

-bw

Re: [PATCH v4] x86: use builtins to read eflags

Posted by Nick Desaulniers 4 years, 4 months ago

On Fri, Feb 11, 2022 at 11:25 AM Bill Wendling <morbo@google.com> wrote:
>
> On Fri, Feb 11, 2022 at 8:40 AM David Laight <David.Laight@aculab.com> wrote:
> > From: Bill Wendling
> > > Sent: 10 February 2022 22:32
> > >
> > > v4: - Clang now no longer generates stack frames when using these builtins.
> > >     - Corrected misspellings.
> >
> > While clang 'head' has been fixed, it seems a bit premature to say
> > it is 'fixed' enough for all clang builds to use the builtin.
> >
> True, but it's been cherry-picked into the clang 14.0.0 branch, which
> is scheduled for release in March.

While I've requested a cherry pick, the auto PR has not yet been
merged into the release/14.x branch.
https://github.com/llvm/llvm-project/issues/46875#issuecomment-1035262333
https://github.com/llvmbot/llvm-project/pull/42

I share David's concern.  We currently support clang-11+, see
Documentation/process/changes.rst

Users external to Google do not necessarily live at ToT LLVM; we MUST
support them.  There are also multiple unwinders for x86 selectable at
kconfig time; this patch has implications for those different choices.

I would like to discuss this further off list with you Bill before you
resend this patch again.  Please do not resend this patch again until
we've had the chance to speak more about it.  I'm generally in favor
of the intent of this patch, but let's make sure this patch is the
best it can be for everyone, regardless of their compiler, compiler
version, or unwinder.
-- 
Thanks,
~Nick Desaulniers

Re: [PATCH v4] x86: use builtins to read eflags

Posted by Bill Wendling 4 years, 4 months ago

On Fri, Feb 11, 2022 at 4:24 PM Nick Desaulniers
<ndesaulniers@google.com> wrote:
> I would like to discuss this further off list with you Bill before you
> resend this patch again.

That won't be necessary.

-bw

Re: [PATCH v4] x86: use builtins to read eflags

Posted by Nick Desaulniers 4 years, 4 months ago

On Sat, Feb 12, 2022 at 1:23 AM Bill Wendling <morbo@google.com> wrote:
>
> On Fri, Feb 11, 2022 at 4:24 PM Nick Desaulniers
> <ndesaulniers@google.com> wrote:
> > I would like to discuss this further off list with you Bill before you
> > resend this patch again.
>
> That won't be necessary.

Bill,
I'm sorry; I could have worded that better+differently.

In code review, I tend to use phrases like "consider doing X" for
suggestions that are merely suggestions; things I don't care about.

I should have been more explicit that my concern regarding released
versions of clang and CONFIG_UNWINDER_ORC=y was not one of those
cases.

I like the intent of the patch; with the following fixups, I would
gladly sign off on it and encourage the x86 maintainers to consider it
further.
1. handle the case of released versions of clang and
CONFIG_UNWINDER_ORC=y. Something with a grep'able comment for clang-14
so we can clean that up when the minimally supported version of clang
is bumped.
2. update the commit message to refer to previous commits that
modified these flags, as per Thomas in response to v1.
3. Note no change in generated code from GCC, as per Thomas in
response to v2. Perhaps test a few more configs, since I only checked
kmem_cache_free yet native_save_fl and local_irq_save have MANY more
call sites.
4. Add `Link: https://github.com/llvm/llvm-project/issues/46875` to
commit message.

Optionally:
5. Add notes to commit message about the intent of this patch
improving profiles for kmem_cache_free for CONFIG_SLAB=y. Feel free to
use numbers I cited from our internal bug and
https://lore.kernel.org/lkml/CAKwvOdmXxmYgqEOT4vSbN60smSkQRcc3B5duQAhaaYoaDo=Riw@mail.gmail.com/.
6. Add note to commit message about memory operands from Agner Fog's
instruction tables in the above lore link.
7. Add note to commit that compilers can schedule the generated
instructions from the builtins among instructions from surrounding C
statements, which they cannot do for inline asm, as demonstrated by
https://godbolt.org/z/EsP1KTjxa.

Actually 7 is probably more precise than 3. Since there's no change
generally in terms of operands chosen AFAICT, but GCC (and clang)
could now better schedule instructions.

It's still a good patch; please stick with it!  Let me know how I can help.
-- 
Thanks,
~Nick Desaulniers

[PATCH v5] x86: use builtins to read eflags

Posted by Bill Wendling 4 years, 3 months ago

This issue arose due to Clang's issue with the "=rm" constraint. Clang
chooses to be conservative in these situations and always uses memory
instead of registers, resulting in sub-optimal code. (This is a known
issue, which is currently being addressed.)

This function has gone through numerous changes over the years:

  - The original version of this function used the "=g" constraint,
    which has the following description:

      Any register, memory or immediate integer operand is allowed,
      except for registers that are not general registers.

  - This was changed to "=r" in commit f1f029c7bfbf ("x86: fix assembly
    constraints in native_save_fl()"), because someone noticed that:

      the offset of the flags variable from the stack pointer will
      change when the pushf is performed. gcc doesn't attempt to
      understand that fact, and address used for pop will still be the
      same. It will write to somewhere near flags on the stack but not
      actually into it and overwrite some other value.

  - However, commit f1f029c7bfbf ("x86: fix assembly constraints in
    native_save_fl()") was partially reverted in commit ab94fcf528d1
    ("x86: allow "=rm" in native_save_fl()"), because the original
    reporter of the issue was using a broken x86 simulator. The
    justification for this change was:

      "=rm" is allowed in this context, because "pop" is explicitly
      defined to adjust the stack pointer *before* it evaluates its
      effective address, if it has one.  Thus, we do end up writing to
      the correct address even if we use an on-stack memory argument.

Clang generates good code when the builtins are used. On one benchmark,
a hotspot in kmem_cache_free went from using 5.18% of cycles popping to
a memory address to 0.13% popping to a register. This benefit is
magnified given that this code is inlined in numerous places in the
kernel.

The builtins also help GCC. It allows GCC (and Clang) to reduce register
pressure and, consequently, register spills by rescheduling
instructions. It can't happen with instructions in inline assembly,
because compilers view inline assembly blocks as "black boxes," whose
instructions can't be rescheduled.

Another benefit of builtins over asm blocks is that compilers are able
to make more precise inlining decisions, since they no longer need to
rely on imprecise measures based on newline counts.

A trivial example demonstrates this code motion.

	void y(void);
	unsigned long x(void) {
		unsigned long v = __builtin_ia32_readeflags_u64();
		y();
		return v;
	}

GCC at -O1:
	pushq   %rbx
	pushfq
	popq    %rbx
	movl    $0, %eax
	call    y
	movq    %rbx, %rax
	popq    %rbx
	ret

GCC at -O2:
	pushq   %r12
	pushfq
	xorl    %eax, %eax
	popq    %r12
	call    y
	movq    %r12, %rax
	popq    %r12
	ret

Link: https://gist.github.com/nickdesaulniers/b4d0f6e26f8cbef0ae4c5352cfeaca67
Link: https://github.com/llvm/llvm-project/issues/20571
Link: https://gcc.gnu.org/onlinedocs/gcc/Simple-Constraints.html#Simple-Constraints
Link: https://godbolt.org/z/5n3Eov1xT
Signed-off-by: Bill Wendling <morbo@google.com>
Reviewed-by: Nick Desaulniers <ndesaulniers@google.com>
---
v5: - Incorporate Nick's suggestion to limit the change to Clang >= 14.0 and
      GCC.
v4: - Clang now no longer generates stack frames when using these builtins.
    - Corrected misspellings.
v3: - Add blurb indicating that GCC's output hasn't changed.
v2: - Kept the original function to retain the out-of-line symbol.
    - Improved the commit message.
    - Note that I couldn't use Nick's suggestion of

        return IS_ENABLED(CONFIG_X86_64) ? ...

      because Clang complains about using __builtin_ia32_readeflags_u32 in
      64-bit mode.
---
 arch/x86/include/asm/irqflags.h | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/arch/x86/include/asm/irqflags.h b/arch/x86/include/asm/irqflags.h
index 87761396e8cc..2eded855f6ab 100644
--- a/arch/x86/include/asm/irqflags.h
+++ b/arch/x86/include/asm/irqflags.h
@@ -19,6 +19,11 @@
 extern inline unsigned long native_save_fl(void);
 extern __always_inline unsigned long native_save_fl(void)
 {
+#if defined(CC_IS_CLANG) && defined(UNWINDER_ORC) && CLANG_VERSION < 140000
+	/*
+	 * Clang forced frame pointers via the builtins until Clang-14. Use
+	 * this as a fall-back until the minimum Clang version is >= 14.0.
+	 */
 	unsigned long flags;
 
 	/*
@@ -33,6 +38,11 @@ extern __always_inline unsigned long native_save_fl(void)
 		     : "memory");
 
 	return flags;
+#elif defined(CONFIG_X86_64)
+	return __builtin_ia32_readeflags_u64();
+#else
+	return __builtin_ia32_readeflags_u32();
+#endif
 }
 
 static __always_inline void native_irq_disable(void)
-- 
2.35.1.574.g5d30c73bfb-goog

Re: [PATCH v5] x86: use builtins to read eflags

Posted by Bill Wendling 4 years, 3 months ago

On Tue, Mar 1, 2022 at 12:19 PM Bill Wendling <morbo@google.com> wrote:
>
Bump for review.

-bw

> This issue arose due to Clang's issue with the "=rm" constraint. Clang
> chooses to be conservative in these situations and always uses memory
> instead of registers, resulting in sub-optimal code. (This is a known
> issue, which is currently being addressed.)
>
> This function has gone through numerous changes over the years:
>
>   - The original version of this function used the "=g" constraint,
>     which has the following description:
>
>       Any register, memory or immediate integer operand is allowed,
>       except for registers that are not general registers.
>
>   - This was changed to "=r" in commit f1f029c7bfbf ("x86: fix assembly
>     constraints in native_save_fl()"), because someone noticed that:
>
>       the offset of the flags variable from the stack pointer will
>       change when the pushf is performed. gcc doesn't attempt to
>       understand that fact, and address used for pop will still be the
>       same. It will write to somewhere near flags on the stack but not
>       actually into it and overwrite some other value.
>
>   - However, commit f1f029c7bfbf ("x86: fix assembly constraints in
>     native_save_fl()") was partially reverted in commit ab94fcf528d1
>     ("x86: allow "=rm" in native_save_fl()"), because the original
>     reporter of the issue was using a broken x86 simulator. The
>     justification for this change was:
>
>       "=rm" is allowed in this context, because "pop" is explicitly
>       defined to adjust the stack pointer *before* it evaluates its
>       effective address, if it has one.  Thus, we do end up writing to
>       the correct address even if we use an on-stack memory argument.
>
> Clang generates good code when the builtins are used. On one benchmark,
> a hotspot in kmem_cache_free went from using 5.18% of cycles popping to
> a memory address to 0.13% popping to a register. This benefit is
> magnified given that this code is inlined in numerous places in the
> kernel.
>
> The builtins also help GCC. It allows GCC (and Clang) to reduce register
> pressure and, consequently, register spills by rescheduling
> instructions. It can't happen with instructions in inline assembly,
> because compilers view inline assembly blocks as "black boxes," whose
> instructions can't be rescheduled.
>
> Another benefit of builtins over asm blocks is that compilers are able
> to make more precise inlining decisions, since they no longer need to
> rely on imprecise measures based on newline counts.
>
> A trivial example demonstrates this code motion.
>
>         void y(void);
>         unsigned long x(void) {
>                 unsigned long v = __builtin_ia32_readeflags_u64();
>                 y();
>                 return v;
>         }
>
> GCC at -O1:
>         pushq   %rbx
>         pushfq
>         popq    %rbx
>         movl    $0, %eax
>         call    y
>         movq    %rbx, %rax
>         popq    %rbx
>         ret
>
> GCC at -O2:
>         pushq   %r12
>         pushfq
>         xorl    %eax, %eax
>         popq    %r12
>         call    y
>         movq    %r12, %rax
>         popq    %r12
>         ret
>
> Link: https://gist.github.com/nickdesaulniers/b4d0f6e26f8cbef0ae4c5352cfeaca67
> Link: https://github.com/llvm/llvm-project/issues/20571
> Link: https://gcc.gnu.org/onlinedocs/gcc/Simple-Constraints.html#Simple-Constraints
> Link: https://godbolt.org/z/5n3Eov1xT
> Signed-off-by: Bill Wendling <morbo@google.com>
> Reviewed-by: Nick Desaulniers <ndesaulniers@google.com>
> ---
> v5: - Incorporate Nick's suggestion to limit the change to Clang >= 14.0 and
>       GCC.
> v4: - Clang now no longer generates stack frames when using these builtins.
>     - Corrected misspellings.
> v3: - Add blurb indicating that GCC's output hasn't changed.
> v2: - Kept the original function to retain the out-of-line symbol.
>     - Improved the commit message.
>     - Note that I couldn't use Nick's suggestion of
>
>         return IS_ENABLED(CONFIG_X86_64) ? ...
>
>       because Clang complains about using __builtin_ia32_readeflags_u32 in
>       64-bit mode.
> ---
>  arch/x86/include/asm/irqflags.h | 10 ++++++++++
>  1 file changed, 10 insertions(+)
>
> diff --git a/arch/x86/include/asm/irqflags.h b/arch/x86/include/asm/irqflags.h
> index 87761396e8cc..2eded855f6ab 100644
> --- a/arch/x86/include/asm/irqflags.h
> +++ b/arch/x86/include/asm/irqflags.h
> @@ -19,6 +19,11 @@
>  extern inline unsigned long native_save_fl(void);
>  extern __always_inline unsigned long native_save_fl(void)
>  {
> +#if defined(CC_IS_CLANG) && defined(UNWINDER_ORC) && CLANG_VERSION < 140000
> +       /*
> +        * Clang forced frame pointers via the builtins until Clang-14. Use
> +        * this as a fall-back until the minimum Clang version is >= 14.0.
> +        */
>         unsigned long flags;
>
>         /*
> @@ -33,6 +38,11 @@ extern __always_inline unsigned long native_save_fl(void)
>                      : "memory");
>
>         return flags;
> +#elif defined(CONFIG_X86_64)
> +       return __builtin_ia32_readeflags_u64();
> +#else
> +       return __builtin_ia32_readeflags_u32();
> +#endif
>  }
>
>  static __always_inline void native_irq_disable(void)
> --
> 2.35.1.574.g5d30c73bfb-goog
>