[v1] x86/asm: Use asm_inline() instead of asm() in __untagged_addr()

[PATCH] x86/asm: Use asm_inline() instead of asm() in __untagged_addr()

Posted by Uros Bizjak 11 months ago

Use asm_inline() to instruct the compiler that the size of asm()
is the minimum size of one instruction, ignoring how many instructions
the compiler thinks it is. The ALTERNATIVE macro, which expands to
several pseudo directives, causes the instruction length estimate to
count more than 20 instructions.

bloat-o-meter reports a minimal code size increase
(x86_64 defconfig with CONFIG_ADDRESS_MASKING, gcc-14.2.1):

  add/remove: 2/2 grow/shrink: 5/1 up/down: 2365/-1995 (370)

	Function                          old     new   delta
	-----------------------------------------------------
	do_get_mempolicy                    -    1449   +1449
	copy_nodes_to_user                  -     226    +226
	__x64_sys_get_mempolicy            35     213    +178
	syscall_user_dispatch_set_config  157     332    +175
	__ia32_sys_get_mempolicy           31     206    +175
	set_syscall_user_dispatch          29     181    +152
	__do_sys_mremap                  2073    2083     +10
	sp_insert                         133     117     -16
	task_set_syscall_user_dispatch    172       -    -172
	kernel_get_mempolicy             1807       -   -1807

  Total: Before=21423151, After=21423521, chg +0.00%

The code size increase is due to the compiler inlining
more functions that inline untagged_addr().

Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
---
 arch/x86/include/asm/uaccess_64.h | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/uaccess_64.h b/arch/x86/include/asm/uaccess_64.h
index c52f0133425b..3c1bec3a0405 100644
--- a/arch/x86/include/asm/uaccess_64.h
+++ b/arch/x86/include/asm/uaccess_64.h
@@ -26,8 +26,8 @@ extern unsigned long USER_PTR_MAX;
  */
 static inline unsigned long __untagged_addr(unsigned long addr)
 {
-	asm (ALTERNATIVE("",
-			 "and " __percpu_arg([mask]) ", %[addr]", X86_FEATURE_LAM)
+	asm_inline (ALTERNATIVE("", "and " __percpu_arg([mask]) ", %[addr]",
+				X86_FEATURE_LAM)
 	     : [addr] "+r" (addr)
 	     : [mask] "m" (__my_cpu_var(tlbstate_untag_mask)));
 
-- 
2.48.1
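
For readers unfamiliar with the qualifier, here is a minimal user-space
sketch of the same idea. It is not the kernel code: untag_demo(), the
.demo_meta section and the version check are assumptions made up for
illustration, and the asm_inline fallback only loosely follows the
pattern the kernel uses when the compiler lacks support. The template
holds one real instruction plus three directive lines, so without the
inline qualifier the compiler's size estimate treats it as four
instructions.

```c
/*
 * Assumption: GCC 9 or newer accepts the "inline" asm qualifier.
 * Anything else falls back to a plain asm statement.
 */
#if defined(__GNUC__) && !defined(__clang__) && (__GNUC__ >= 9)
# define asm_inline __asm__ __inline
#else
# define asm_inline __asm__
#endif

/*
 * Hypothetical stand-in for __untagged_addr(): one AND instruction
 * followed by directive-only lines that emit metadata into a separate
 * section, similar in spirit to what ALTERNATIVE does.
 */
static inline unsigned long untag_demo(unsigned long addr, unsigned long mask)
{
#if defined(__x86_64__) && defined(__ELF__)
	asm_inline ("and %[mask], %[addr]\n\t"
		    ".pushsection .demo_meta,\"a\"\n\t"
		    ".byte 0\n\t"
		    ".popsection"
		    : [addr] "+r" (addr)
		    : [mask] "r" (mask)
		    : "cc");
#else
	/* Portable fallback so the sketch builds on other targets. */
	addr &= mask;
#endif
	return addr;
}
```

With the qualifier present, the statement is costed as a single
instruction for inlining purposes, whatever the number of lines in the
template. The generated code for the asm itself is the same either way.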

Re: [PATCH] x86/asm: Use asm_inline() instead of asm() in __untagged_addr()

Posted by Borislav Petkov 11 months ago

On Fri, Mar 14, 2025 at 10:30:55AM +0100, Uros Bizjak wrote:
> Use asm_inline() to instruct the compiler that the size of asm()
> is the minimum size of one instruction, ignoring how many instructions
> the compiler thinks it is. ALTERNATIVE macro that expands to several
> pseudo directives causes instruction length estimate to count
> more than 20 instructions.
> 
> bloat-o-meter reports minimal code size increase

If you see an increase and *no* *other* *palpable* improvement, you don't send
it. It is that simple.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

Re: [PATCH] x86/asm: Use asm_inline() instead of asm() in __untagged_addr()

Posted by Ingo Molnar 10 months, 3 weeks ago

* Borislav Petkov <bp@alien8.de> wrote:

> On Fri, Mar 14, 2025 at 10:30:55AM +0100, Uros Bizjak wrote:
> > Use asm_inline() to instruct the compiler that the size of asm()
> > is the minimum size of one instruction, ignoring how many instructions
> > the compiler thinks it is. ALTERNATIVE macro that expands to several
> > pseudo directives causes instruction length estimate to count
> > more than 20 instructions.
> > 
> > bloat-o-meter reports minimal code size increase
> 
> If you see an increase and *no* *other* *palpable* improvement, you 
> don't send it. It is that simple.

Sorry, but you wouldn't be saying that eliminating function calls is 
not a 'palpable improvement', had you ever profiled a recent kernel on 
a real system, on modern CPUs ... :-/

The sad reality is that the top profile is dominated by function call + 
return overhead due to CPU bug mitigation workarounds that create per 
function call overhead:

 Overhead  Shared Object               Symbol
   4.57%  [kernel]                    [k] retbleed_return_thunk <============= !!!!!!!!
   4.40%  [kernel]                    [k] unmap_page_range
   4.31%  [kernel]                    [k] _copy_to_iter
   2.46%  [kernel]                    [k] memset_orig
   2.31%  libc.so.6                   [.] __cxa_finalize

That retbleed_return_thunk overhead gets avoided every time we inline a 
simple enough function.

But GCC cannot always make proper inlining decisions due to our 
complicated ALTERNATIVE macro constructs confusing the GCC inliner:

  > > ALTERNATIVE macro that expands to several pseudo directives causes 
  > > instruction length estimate to count more than 20 instructions.
                                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Note how the asm_inline() compiler feature was added by GCC at the 
kernel community's request to address such issues. (!)

So for those reasons, in my book, eliminating a function call for 
really simple single instruction inlines is an unconditional 
improvement that doesn't require futile performance measurements - it 
'only' requires assembly level code generation analysis in the 
changelog.

The reason is that requiring measurable effects for really small 
inlining changes is pretty much impossible in practice. I know, because 
I tried, and I'm good at measuring such things and I have the hardware 
to do it. Yet the per function call overhead demonstrated above in the 
profile is very much real and should not be handwaved away.

Note that this policy doesn't apply to other inlining decisions, only 
to single-instruction inline functions.

Also, having said all that, for this particular patch I'd still like to 
see a bit more GCC code generation analysis in this particular 
changelog: could you please cite a single relevant, representative 
example before/after assembly code section that demonstrates the 
effects of the inlined asm versus function call version, including the 
function that gets called?

I'm asking for that because sometimes single instructions can still 
have a halo of half a dozen instructions that set them up or 
transform their results, so sometimes having a function call is the 
better option. Not all single-instruction asm() statements are 'simple' 
in practice - but looking at the code generation will very much tell us 
whether they are.

Thanks,

	Ingo

Re: [PATCH] x86/asm: Use asm_inline() instead of asm() in __untagged_addr()

Posted by H. Peter Anvin 10 months, 3 weeks ago

On March 17, 2025 2:01:12 AM PDT, Ingo Molnar <mingo@kernel.org> wrote:
>
>* Borislav Petkov <bp@alien8.de> wrote:
>
>> On Fri, Mar 14, 2025 at 10:30:55AM +0100, Uros Bizjak wrote:
>> > Use asm_inline() to instruct the compiler that the size of asm()
>> > is the minimum size of one instruction, ignoring how many instructions
>> > the compiler thinks it is. ALTERNATIVE macro that expands to several
>> > pseudo directives causes instruction length estimate to count
>> > more than 20 instructions.
>> > 
>> > bloat-o-meter reports minimal code size increase
>> 
>> If you see an increase and *no* *other* *palpable* improvement, you 
>> don't send it. It is that simple.
>
>Sorry, but you wouldn't be saying that eliminating function calls is 
>not a 'palpable improvement', had you ever profiled a recent kernel on 
>a real system, on modern CPUs ... :-/
>
>The sad reality is that the top profile is dominated by function call + 
>return overhead due to CPU bug mitigation workarounds that create per 
>function call overhead:
>
> Overhead  Shared Object               Symbol
>   4.57%  [kernel]                    [k] retbleed_return_thunk <============= !!!!!!!!
>   4.40%  [kernel]                    [k] unmap_page_range
>   4.31%  [kernel]                    [k] _copy_to_iter
>   2.46%  [kernel]                    [k] memset_orig
>   2.31%  libc.so.6                   [.] __cxa_finalize
>
>That retbleed_return_thunk overhead gets avoided every time we inline a 
>simple enough function.
>
>But GCC cannot always do proper inlining decisions due to our 
>complicated ALTERNATIVE macro constructs confusing the GCC inliner:
>
>  > > ALTERNATIVE macro that expands to several pseudo directives causes 
>  > > instruction length estimate to count more than 20 instructions.
>                                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>
>Note how the asm_inline() compiler feature was added by GCC at the 
>kernel community's request to address such issues. (!)
>
>So for those reasons, in my book, eliminating a function call for 
>really simple single instruction inlines is an unconditional 
>improvement that doesn't require futile performance measurements - it 
>'only' requires assembly level code generation analysis in the 
>changelog.
>
>The reason is that requiring measurable effects for really small 
>inlining changes is pretty much impossible in practice. I know, because 
>I tried, and I'm good at measuring such things and I have the hardware 
>to do it. Yet the per function call overhead demonstrated above in the 
>profile is very much real and should not be handwaved away.
>
>Note that this policy doesn't apply to other inlining decisions, only 
>to single-instruction inline functions.
>
>Also, having said all that, for this particular patch I'd still like to 
>see a bit more GCC code generation analysis in this particular 
>changelog: could you please cite a single relevant, representative 
>example before/after assembly code section that demonstrates the 
>effects of the inlined asm versus function call version, including the 
>function that gets called?
>
>I'm asking for that because sometimes single instructions can still 
>have a halo of half a dozen of instructions that set them up or 
>transform their results, so sometimes having a function call is the 
>better option. Not all single-instruction asm() statements are 'simple' 
>in praxis - but looking at the code generation will very much tell us 
>whether it is.
>
>Thanks,
>
>	Ingo

I would like to repeat that I would like to see us at least try to #define asm __asm__ __inline__ tree-wide (with a possible opt-out) and run a benchmark on it. Since this is a central knob, we could even make it a Kconfig option that architectures can opt in or out of, or be overridden for specific compilers should it ever be necessary.

It is simply much closer to how we actually use asm() in the Linux kernel, *and* what performance characteristics we tend to care about. More often than not if we have a large hunk of assembly source it is because of metadata and/or directives.

It doesn't hurt that inline duplicating kernel code can occasionally bring about huge improvements in terms of branch eliminations because often (but far from always, of course) the difference in call context allows the compiler to eliminate dead paths. 
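
A rough sketch of what such a central knob could look like, purely as
an assumption for discussion: CONFIG_ASM_DEFAULT_INLINE, asm_noinline
and mask_low_byte() are hypothetical names, not existing kernel symbols,
and the compiler version check stands in for a proper Kconfig probe.

```c
/* Hypothetical Kconfig-style switch; not an existing kernel option. */
#define CONFIG_ASM_DEFAULT_INLINE 1

/*
 * Make every plain asm() an "asm inline" when the knob is on and the
 * compiler is assumed to support the qualifier (GCC 9 or newer here).
 */
#if CONFIG_ASM_DEFAULT_INLINE && defined(__GNUC__) && \
    !defined(__clang__) && (__GNUC__ >= 9)
# define asm __asm__ __inline__
#else
# define asm __asm__
#endif

/* Hypothetical opt-out for statements that really are large. */
#define asm_noinline __asm__

/* An ordinary-looking asm() user picks up the new default unchanged. */
static inline unsigned long mask_low_byte(unsigned long v)
{
#if defined(__x86_64__)
	asm ("and $0xff, %[v]" : [v] "+r" (v) : : "cc");
#else
	/* Portable fallback so the sketch builds on other targets. */
	v &= 0xff;
#endif
	return v;
}
```

Existing asm() call sites would not need to change; only the statements
that are genuinely large would have to be converted to the opt-out form.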

    -hpa

Re: [PATCH] x86/asm: Use asm_inline() instead of asm() in __untagged_addr()

Posted by Ingo Molnar 10 months, 3 weeks ago

* Ingo Molnar <mingo@kernel.org> wrote:

> But GCC cannot always do proper inlining decisions due to our 
> complicated ALTERNATIVE macro constructs confusing the GCC inliner:
> 
>   > > ALTERNATIVE macro that expands to several pseudo directives causes 
>   > > instruction length estimate to count more than 20 instructions.
>                                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> 
> Note how the asm_inline() compiler feature was added by GCC at the 
> kernel community's request to address such issues. (!)
> 
> So for those reasons, in my book, eliminating a function call for 
> really simple single instruction inlines is an unconditional 
> improvement that doesn't require futile performance measurements - it 
> 'only' requires assembly level code generation analysis in the 
> changelog.

Note that at least in part this is a weakness of GCC: the compiler 
isn't looking at the asm() closely enough and the 20-instruction count 
vastly overestimates the true footprint of these statements.

Yet GCC is also giving us a tool: "asm __inline", which tells the 
compiler that this asm() statement is small. A tool that was 
created in response to the kernel community's complaints about this 
issue. :-/

asm_inline() is functionally similar to __always_inline - which we 
regularly apply if it has code generation benefits.

So I really don't see the harm in these patches - they have benefits in 
terms of GCC code generation quality, documentation and performance:

 - It documents small asm() statements by annotating them asm_inline().

 - It sometimes avoids function call overhead, improving performance.

And because single-function inlining changes are next to impossible to 
measure in practice in most cases, I'd suggest we skip the performance 
measurement requirement if the code generation advantages on a recent 
GCC version are unambiguous.

Thanks,

	Ingo

Re: [PATCH] x86/asm: Use asm_inline() instead of asm() in __untagged_addr()

Posted by Uros Bizjak 11 months ago

On Fri, Mar 14, 2025 at 12:25 PM Borislav Petkov <bp@alien8.de> wrote:
>
> On Fri, Mar 14, 2025 at 10:30:55AM +0100, Uros Bizjak wrote:
> > Use asm_inline() to instruct the compiler that the size of asm()
> > is the minimum size of one instruction, ignoring how many instructions
> > the compiler thinks it is. ALTERNATIVE macro that expands to several
> > pseudo directives causes instruction length estimate to count
> > more than 20 instructions.
> >
> > bloat-o-meter reports minimal code size increase
>
> If you see an increase and *no* *other* *palpable* improvement, you don't send
> it. It is that simple.

Do you see the removed functions? These are now fully inlined and
optimized in their inlined places. The program does not have to set up
a function call, invoke call/ret insns and create a frame in the called
function. The well-tuned compiler heuristics make trade-offs between
performance and code growth, and the compiler chose it that way. The
heuristics are only as good as the data the programmer provides, and
choking the compiler with incorrect data, as provided by the asm()
interface, certainly does no good. The asm() code in the patch declares
*one* instruction, not 23. Please count it.
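
To make the counting concrete, here is a simplified model of the
estimate as the GCC documentation describes it: every newline or
statement separator in the template is taken as the end of an
instruction. The function name estimate_asm_insns() is made up for this
sketch, and treating ';' as the only separator is an assumption that
holds for typical x86 assembler syntax.

```c
#include <stddef.h>

/*
 * Simplified model of the documented heuristic: start at one
 * instruction and add one for each newline or ';' in the template.
 * Directives and labels count exactly like real instructions.
 */
static size_t estimate_asm_insns(const char *tmpl)
{
	size_t count = 1;

	for (; *tmpl; tmpl++)
		if (*tmpl == '\n' || *tmpl == ';')
			count++;

	return count;
}
```

A bare "and %1, %0" counts as one, while a template that wraps the same
instruction in labels and section directives spread over five lines
counts as five, even though it still emits a single instruction into
the text section.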

Code size is not the right metric with -O2. It is that simple.

BR,
Uros.

Re: [PATCH] x86/asm: Use asm_inline() instead of asm() in __untagged_addr()

Posted by Borislav Petkov 11 months ago

On Fri, Mar 14, 2025 at 02:22:45PM +0100, Uros Bizjak wrote:
> Do you see the removed functions?

I'd actually wanna see real benchmarks which show any performance improvement.
Like this one here, although it shows only within-the-noise differences:

https://lore.kernel.org/all/20250314132306.GDZ9QtukcVVtDmW1V1@fat_crate.local

But hey, apparently it doesn't cause any slowdowns either and apparently Ingo
thinks all that churn makes sense so whatever...

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette