x86/asm: Use asm_inline() instead of asm() in amd_clear_divider()

[PATCH] x86/asm: Use asm_inline() instead of asm() in amd_clear_divider()

Posted by Uros Bizjak 11 months ago

Use asm_inline() to instruct the compiler that the size of asm()
is the minimum size of one instruction, ignoring how many instructions
the compiler thinks it is. ALTERNATIVE macro that expands to several
pseudo directives causes instruction length estimate to count
more than 20 instructions.

bloat-o-meter reports no code size changes.

Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
---
 arch/x86/include/asm/processor.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index 5d2f7e5aff26..06e499ba4fe8 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -707,7 +707,7 @@ static inline u32 per_cpu_l2c_id(unsigned int cpu)
  */
 static __always_inline void amd_clear_divider(void)
 {
-	asm volatile(ALTERNATIVE("", "div %2\n\t", X86_BUG_DIV0)
+	asm_inline volatile(ALTERNATIVE("", "div %2", X86_BUG_DIV0)
 		     :: "a" (0), "d" (0), "r" (1));
 }
 
-- 
2.42.0

Re: [PATCH] x86/asm: Use asm_inline() instead of asm() in amd_clear_divider()

Posted by Borislav Petkov 11 months ago

On March 13, 2025 8:18:09 PM GMT+01:00, Uros Bizjak <ubizjak@gmail.com> wrote:
>Use asm_inline() to instruct the compiler that the size of asm()
>is the minimum size of one instruction, ignoring how many instructions
>the compiler thinks it is. ALTERNATIVE macro that expands to several
>pseudo directives causes instruction length estimate to count
>more than 20 instructions.
>
>bloat-o-meter reports no code size changes.
>
>Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
>Cc: Thomas Gleixner <tglx@linutronix.de>
>Cc: Ingo Molnar <mingo@kernel.org>
>Cc: Borislav Petkov <bp@alien8.de>
>Cc: Dave Hansen <dave.hansen@linux.intel.com>
>Cc: "H. Peter Anvin" <hpa@zytor.com>
>---
> arch/x86/include/asm/processor.h | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
>diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
>index 5d2f7e5aff26..06e499ba4fe8 100644
>--- a/arch/x86/include/asm/processor.h
>+++ b/arch/x86/include/asm/processor.h
>@@ -707,7 +707,7 @@ static inline u32 per_cpu_l2c_id(unsigned int cpu)
>  */
> static __always_inline void amd_clear_divider(void)
> {
>-	asm volatile(ALTERNATIVE("", "div %2\n\t", X86_BUG_DIV0)
>+	asm_inline volatile(ALTERNATIVE("", "div %2", X86_BUG_DIV0)
> 		     :: "a" (0), "d" (0), "r" (1));
> }
> 

So there's no point for this one...
-- 
Sent from a small device: formatting sucks and brevity is inevitable.

Re: [PATCH] x86/asm: Use asm_inline() instead of asm() in amd_clear_divider()

Posted by Uros Bizjak 11 months ago

On Thu, Mar 13, 2025 at 8:26 PM Borislav Petkov <bp@alien8.de> wrote:
>
> On March 13, 2025 8:18:09 PM GMT+01:00, Uros Bizjak <ubizjak@gmail.com> wrote:
> >Use asm_inline() to instruct the compiler that the size of asm()
> >is the minimum size of one instruction, ignoring how many instructions
> >the compiler thinks it is. ALTERNATIVE macro that expands to several
> >pseudo directives causes instruction length estimate to count
> >more than 20 instructions.
> >
> >bloat-o-meter reports no code size changes.
> >
> >Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
> >Cc: Thomas Gleixner <tglx@linutronix.de>
> >Cc: Ingo Molnar <mingo@kernel.org>
> >Cc: Borislav Petkov <bp@alien8.de>
> >Cc: Dave Hansen <dave.hansen@linux.intel.com>
> >Cc: "H. Peter Anvin" <hpa@zytor.com>
> >---
> > arch/x86/include/asm/processor.h | 2 +-
> > 1 file changed, 1 insertion(+), 1 deletion(-)
> >
> >diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
> >index 5d2f7e5aff26..06e499ba4fe8 100644
> >--- a/arch/x86/include/asm/processor.h
> >+++ b/arch/x86/include/asm/processor.h
> >@@ -707,7 +707,7 @@ static inline u32 per_cpu_l2c_id(unsigned int cpu)
> >  */
> > static __always_inline void amd_clear_divider(void)
> > {
> >-      asm volatile(ALTERNATIVE("", "div %2\n\t", X86_BUG_DIV0)
> >+      asm_inline volatile(ALTERNATIVE("", "div %2", X86_BUG_DIV0)
> >                    :: "a" (0), "d" (0), "r" (1));
> > }
> >
>
> So there's no point for this one...

Not at its current usage sites, but considering that
amd_clear_divider() and two levels of small functions that include it
( amd_clear_divider(), arch_exit_to_user_mode() and
exit_to_user_mode() ) all need to be decorated with __always_inline
indicates that the issue is not that benign.

It also triggers my search scripts, so I thought I should convert this
one as well and put the issue at rest.

Uros.

Re: [PATCH] x86/asm: Use asm_inline() instead of asm() in amd_clear_divider()

Posted by Borislav Petkov 11 months ago

On March 13, 2025 8:50:24 PM GMT+01:00, Uros Bizjak <ubizjak@gmail.com> wrote:
>On Thu, Mar 13, 2025 at 8:26 PM Borislav Petkov <bp@alien8.de> wrote:
>>
>> On March 13, 2025 8:18:09 PM GMT+01:00, Uros Bizjak <ubizjak@gmail.com> wrote:
>> >Use asm_inline() to instruct the compiler that the size of asm()
>> >is the minimum size of one instruction, ignoring how many instructions
>> >the compiler thinks it is. ALTERNATIVE macro that expands to several
>> >pseudo directives causes instruction length estimate to count
>> >more than 20 instructions.
>> >
>> >bloat-o-meter reports no code size changes.
>> >
>> >Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
>> >Cc: Thomas Gleixner <tglx@linutronix.de>
>> >Cc: Ingo Molnar <mingo@kernel.org>
>> >Cc: Borislav Petkov <bp@alien8.de>
>> >Cc: Dave Hansen <dave.hansen@linux.intel.com>
>> >Cc: "H. Peter Anvin" <hpa@zytor.com>
>> >---
>> > arch/x86/include/asm/processor.h | 2 +-
>> > 1 file changed, 1 insertion(+), 1 deletion(-)
>> >
>> >diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
>> >index 5d2f7e5aff26..06e499ba4fe8 100644
>> >--- a/arch/x86/include/asm/processor.h
>> >+++ b/arch/x86/include/asm/processor.h
>> >@@ -707,7 +707,7 @@ static inline u32 per_cpu_l2c_id(unsigned int cpu)
>> >  */
>> > static __always_inline void amd_clear_divider(void)
>> > {
>> >-      asm volatile(ALTERNATIVE("", "div %2\n\t", X86_BUG_DIV0)
>> >+      asm_inline volatile(ALTERNATIVE("", "div %2", X86_BUG_DIV0)
>> >                    :: "a" (0), "d" (0), "r" (1));
>> > }
>> >
>>
>> So there's no point for this one...
>
>Not at its current usage sites, but considering that
>amd_clear_divider() and two levels of small functions that include it
>( amd_clear_divider(), arch_exit_to_user_mode() and
>exit_to_user_mode() ) all need to be decorated with __always_inline
>indicates that the issue is not that benign.
>
>It also triggers my search scripts, so I thought I should convert this
>one as well and put the issue at rest.
>
>Uros.

Sorry but this doesn't justify this churn. There's nothing quantifyingly palpable here to warrant this.
-- 
Sent from a small device: formatting sucks and brevity is inevitable.

Re: [PATCH] x86/asm: Use asm_inline() instead of asm() in amd_clear_divider()

Posted by Ingo Molnar 11 months ago

* Borislav Petkov <bp@alien8.de> wrote:

> Sorry but this doesn't justify this churn. There's nothing 
> quantifyingly palpable here to warrant this.

I disagree, asm() is a known-bad inlining interface for fundamentally 
single-instruction inlines like this one, and there's various 
performance benefits to cleaning this up, as evidenced by the benchmark 
numbers and analysis in this pending commit:

  9628d19e91f1 ("x86/locking/atomic: Improve performance by using asm_inline() for atomic locking instructions")

asm_inline() was implemented by the GCC folks *at our request* to fix 
such issues.

So these efforts are not "churn", at all - on the contrary.

Not merging such fixes/annotations would be similar to keeping build 
warnings about unclean code because they don't cause problems right 
now. While most build warnings are benign with no runtime effect, most 
of the time they point out an underlying problem.

We also asked Uros to submit careful, finegrained patches that might 
bloat the kernel, and this patch is the result of that request.

Thanks,

	Ingo

Re: [PATCH] x86/asm: Use asm_inline() instead of asm() in amd_clear_divider()

Posted by Uros Bizjak 11 months ago

On Fri, Mar 14, 2025 at 11:15 AM Ingo Molnar <mingo@kernel.org> wrote:
>
>
> * Borislav Petkov <bp@alien8.de> wrote:
>
> > Sorry but this doesn't justify this churn. There's nothing
> > quantifyingly palpable here to warrant this.
>
> I disagree, asm() is a known-bad inlining interface for fundamentally
> single-instruction inlines like this one, ...

Please note the new patch [1] that uses alternative_input(), where "no
code changes" actually supports the change. alternative_input() uses
asm_inline internally.

[1] https://lore.kernel.org/lkml/20250314081453.565859-1-ubizjak@gmail.com/

Uros.

Re: [PATCH] x86/asm: Use asm_inline() instead of asm() in amd_clear_divider()

Posted by Ingo Molnar 11 months ago

* Ingo Molnar <mingo@kernel.org> wrote:

> 
> * Borislav Petkov <bp@alien8.de> wrote:
> 
> > Sorry but this doesn't justify this churn. There's nothing 
> > quantifyingly palpable here to warrant this.
> 
> I disagree, asm() is a known-bad inlining interface for fundamentally 
> single-instruction inlines like this one, and there's various 
> performance benefits to cleaning this up, as evidenced by the benchmark 
> numbers and analysis in this pending commit:
> 
>   9628d19e91f1 ("x86/locking/atomic: Improve performance by using asm_inline() for atomic locking instructions")

Here's a link for those who'd like to view this via the web:

  https://lore.kernel.org/all/174188884263.14745.1542926632284353047.tip-bot2@tip-bot2/

Thanks,

	Ingo

Re: [PATCH] x86/asm: Use asm_inline() instead of asm() in amd_clear_divider()

Posted by Borislav Petkov 11 months ago

On Fri, Mar 14, 2025 at 11:17:43AM +0100, Ingo Molnar wrote:
> Here's a link for those who'd like to view this via the web:
> 
>   https://lore.kernel.org/all/174188884263.14745.1542926632284353047.tip-bot2@tip-bot2/

This is a perf measuring method I got from you, actually, from a long time
ago:

:-)

./tools/perf/perf stat -a --repeat 5 --sync --pre ~/bin/pre-build-kernel.sh -- make -s -j33 bzImage

* tip/master fdebf9c0efe4 ("Merge branch into tip/master: 'x86/sev'")

 Performance counter stats for 'system wide' (5 runs):

      4,144,101.54 msec cpu-clock                        #   32.000 CPUs utilized               ( +-  0.10% )
           812,478      context-switches                 #  196.056 /sec                        ( +-  0.15% )
            67,201      cpu-migrations                   #   16.216 /sec                        ( +-  0.22% )
        48,228,560      page-faults                      #   11.638 K/sec                       ( +-  0.01% )
 9,473,229,339,058      instructions                     #    1.12  insn per cycle            
                                                  #    0.21  stalled cycles per insn     ( +-  0.00% )
 8,476,070,185,458      cycles                           #    2.045 GHz                         ( +-  0.12% )
 1,988,775,653,131      stalled-cycles-frontend          #   23.46% frontend cycles idle        ( +-  0.14% )
 2,128,585,400,027      branches                         #  513.642 M/sec                       ( +-  0.00% )
    66,681,861,375      branch-misses                    #    3.13% of all branches             ( +-  0.03% )

           129.504 +- 0.127 seconds time elapsed  ( +-  0.10% )

* tip/master with 9628d19e91f1 reverted

 Performance counter stats for 'system wide' (5 runs):

      4,141,057.45 msec cpu-clock                        #   32.000 CPUs utilized               ( +-  0.15% )
           811,299      context-switches                 #  195.916 /sec                        ( +-  0.08% )
            67,644      cpu-migrations                   #   16.335 /sec                        ( +-  0.24% )
        48,209,829      page-faults                      #   11.642 K/sec                       ( +-  0.00% )
 9,465,299,000,193      instructions                     #    1.12  insn per cycle            
                                                  #    0.21  stalled cycles per insn     ( +-  0.00% )
 8,487,239,564,102      cycles                           #    2.050 GHz                         ( +-  0.21% )
 1,992,414,836,889      stalled-cycles-frontend          #   23.48% frontend cycles idle        ( +-  0.08% )
 2,127,019,426,911      branches                         #  513.642 M/sec                       ( +-  0.00% )
    66,698,031,504      branch-misses                    #    3.14% of all branches             ( +-  0.02% )

           129.408 +- 0.195 seconds time elapsed  ( +-  0.15% )

This is all within the noise.

Or maybe building the kernel even with those "optimized" inlining decisions
due the asm being of length 1 for atomic locking insns simply doesn't matter.

Or maybe I need a different benchmark.

At least it ain't breaking anything...

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

Re: [PATCH] x86/asm: Use asm_inline() instead of asm() in amd_clear_divider()

Posted by Ingo Molnar 10 months, 4 weeks ago

* Borislav Petkov <bp@alien8.de> wrote:

> On Fri, Mar 14, 2025 at 11:17:43AM +0100, Ingo Molnar wrote:
> > Here's a link for those who'd like to view this via the web:
> > 
> >   https://lore.kernel.org/all/174188884263.14745.1542926632284353047.tip-bot2@tip-bot2/
> 
> This is a perf measuring method I got from you, actually, from a long time
> ago:
> 
> :-)
> 
> ./tools/perf/perf stat -a --repeat 5 --sync --pre ~/bin/pre-build-kernel.sh -- make -s -j33 bzImage

Yeah, so that's a suboptimal test for these particular changes really: 
why would a simple CPU-saturated kernel build with low levels of kernel 
use of the affected areas show measurable changes with this commit?

> This is all within the noise.

Should we expect anything else from this test?

Also, see the other figures & analysis within the commit, in particular 
the reduction in the number of function calls, when we have high 
per-function mitigation overhead that is often the top entry in kernel 
profiles:

 Overhead  Shared Object               Symbol
   4.57%  [kernel]                    [k] retbleed_return_thunk
   4.40%  [kernel]                    [k] unmap_page_range
   4.31%  [kernel]                    [k] _copy_to_iter
   2.46%  [kernel]                    [k] memset_orig
   2.31%  libc.so.6                   [.] __cxa_finalize

Each eliminated function call from when GCC's inliner was formerly 
confused by Linux's asm() statements is a win.

I did a test too, with a pipe-scheduling workload of 'perf bench sched 
pipe' locked down to a single CPU, with CPU frequencies fixed and 
nested perf stat instances:

  kepler:~> taskset -c 2 perf stat --null --repeat 5 perf stat --null --repeat 5 perf bench sched pipe

  [ -vanilla ]           19.5514 +- 0.0235 seconds time elapsed  ( +-  0.12% )
  [ +Uros's commit: ]    19.3972 +- 0.0207 seconds time elapsed  ( +-  0.11% )

Notes:

  - Best of 3 runs

  - "+Uros's commit" is the aforementioned one from -tip that you measured too:

      9628d19e91f1 ("x86/locking/atomic: Improve performance by using asm_inline() for atomic locking instructions") applied: ]
      # https://lore.kernel.org/all/174188884263.14745.1542926632284353047.tip-bot2@tip-bot2/

  - The nested perf stat instances allowed further reduction in 
    measurement stddev, while keeping the internal steps easily
    observable, verifiable while the total runtime is still reasonable.

So on my system there appears to be a measurable improvement in 
performance on this particular benchmark on the order of magnitude of 
around ~0.8%, which is outside the measurement noise of around ~0.2%.

Thanks,

	Ingo

Re: [PATCH] x86/asm: Use asm_inline() instead of asm() in amd_clear_divider()

Posted by Borislav Petkov 10 months, 4 weeks ago

On Sun, Mar 16, 2025 at 01:37:23AM +0100, Ingo Molnar wrote:
> Yeah, so that's a suboptimal test for these particular changes really: 
> why would a simple CPU-saturated kernel build with low levels of kernel 
> use of the affected areas show measurable changes with this commit?

Thus "Or maybe I need a different benchmark." :)

OTOH, it still tells me that if there are no negative changes in *that*
benchmark, then I should not worry.

> So on my system there appears to be a measurable improvement in 
> performance on this particular benchmark on the order of magnitude of 
> around ~0.8%, which is outside the measurement noise of around ~0.2%.

That's fine. It won't make me fall off my horse but sure, there are some small
improvements. Just don't let code readability suffer along the way with those
exercises.

:-)

Thx.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

Re: [PATCH] x86/asm: Use asm_inline() instead of asm() in amd_clear_divider()

Posted by Uros Bizjak 11 months ago

On Thu, Mar 13, 2025 at 9:07 PM Borislav Petkov <bp@alien8.de> wrote:

> >> > static __always_inline void amd_clear_divider(void)
> >> > {
> >> >-      asm volatile(ALTERNATIVE("", "div %2\n\t", X86_BUG_DIV0)
> >> >+      asm_inline volatile(ALTERNATIVE("", "div %2", X86_BUG_DIV0)
> >> >                    :: "a" (0), "d" (0), "r" (1));
> >> > }
> >> >
> >>
> >> So there's no point for this one...
> >
> >Not at its current usage sites, but considering that
> >amd_clear_divider() and two levels of small functions that include it
> >( amd_clear_divider(), arch_exit_to_user_mode() and
> >exit_to_user_mode() ) all need to be decorated with __always_inline
> >indicates that the issue is not that benign.
> >
> >It also triggers my search scripts, so I thought I should convert this
> >one as well and put the issue at rest.
>
> Sorry but this doesn't justify this churn. There's nothing quantifyingly palpable here to warrant this.

OK. This is not a hill I'm willing to die on.

Thanks,
Uros.