include/linux/instrumentation.h | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-)
Use asm_inline() in the instrumentation begin/end macros to prevent the
compiler from making poor inlining decisions based on the length of the
objtool annotations.
Without the objtool annotations, each macro resolves to a single NOP.
Using inline_asm() seems obviously correct here as it accurately
communicates the actual code size to the compiler.
These macros are used by WARN() and lockdep, so this change can affect a
lot of functions.
For a defconfig kernel built with GCC 14.2.1, bloat-o-meter reports a
0.17% increase in text size:
add/remove: 74/352 grow/shrink: 914/353 up/down: 80747/-47120 (33627)
Total: Before=19460272, After=19493899, chg +0.17%
The text growth is presumably due to increased inlining. A net total of
278 functions were removed (+74 / -352). Each of the removed functions
is likely inlined at multiple sites which explains the somewhat
significant code growth.
One example from Uros:
$ grep "<encode_string>" objdump.old
00000000004506e0 <encode_string>:
45113c: e8 9f f5 ff ff call 4506e0 <encode_string>
452bcb: e9 10 db ff ff jmp 4506e0 <encode_string>
453d33: e8 a8 c9 ff ff call 4506e0 <encode_string>
453ef7: e8 e4 c7 ff ff call 4506e0 <encode_string>
45549f: e8 3c b2 ff ff call 4506e0 <encode_string>
455843: e8 98 ae ff ff call 4506e0 <encode_string>
455b37: e8 a4 ab ff ff call 4506e0 <encode_string>
455b47: e8 94 ab ff ff call 4506e0 <encode_string>
4564fa: e8 e1 a1 ff ff call 4506e0 <encode_string>
456669: e8 72 a0 ff ff call 4506e0 <encode_string>
456691: e8 4a a0 ff ff call 4506e0 <encode_string>
4566a0: e8 3b a0 ff ff call 4506e0 <encode_string>
4569aa: e8 31 9d ff ff call 4506e0 <encode_string>
456e79: e9 62 98 ff ff jmp 4506e0 <encode_string>
456efe: e9 dd 97 ff ff jmp 4506e0 <encode_string>
All these are calls now inline:
encode_string 58 - -58
... where for example encode_putfh() grows by:
encode_putfh 70 118 +48
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
---
v2:
- improve patch description
- extract from https://lore.kernel.org/cover.1744098446.git.jpoimboe@kernel.org
include/linux/instrumentation.h | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/include/linux/instrumentation.h b/include/linux/instrumentation.h
index bf675a8aef8a..b1007407d272 100644
--- a/include/linux/instrumentation.h
+++ b/include/linux/instrumentation.h
@@ -9,7 +9,7 @@
/* Begin/end of an instrumentation safe region */
#define __instrumentation_begin(c) ({ \
- asm volatile(__stringify(c) ": nop\n\t" \
+ asm_inline volatile(__stringify(c) ": nop\n\t" \
ANNOTATE_INSTR_BEGIN(__ASM_BREF(c)) \
: : "i" (c)); \
})
@@ -47,7 +47,7 @@
* part of the condition block and does not escape.
*/
#define __instrumentation_end(c) ({ \
- asm volatile(__stringify(c) ": nop\n\t" \
+ asm_inline volatile(__stringify(c) ": nop\n\t" \
ANNOTATE_INSTR_END(__ASM_BREF(c)) \
: : "i" (c)); \
})
--
2.49.0
* Josh Poimboeuf <jpoimboe@kernel.org> wrote: > Use asm_inline() in the instrumentation begin/end macros to prevent the > compiler from making poor inlining decisions based on the length of the > objtool annotations. > > Without the objtool annotations, each macro resolves to a single NOP. > Using inline_asm() seems obviously correct here as it accurately > communicates the actual code size to the compiler. s/inline_asm /asm_inline > > These macros are used by WARN() and lockdep, so this change can affect a > lot of functions. > > For a defconfig kernel built with GCC 14.2.1, bloat-o-meter reports a > 0.17% increase in text size: > > add/remove: 74/352 grow/shrink: 914/353 up/down: 80747/-47120 (33627) > Total: Before=19460272, After=19493899, chg +0.17% > > The text growth is presumably due to increased inlining. A net total of > 278 functions were removed (+74 / -352). Each of the removed functions > is likely inlined at multiple sites which explains the somewhat > significant code growth. So: - 353 function shrunk by 47120 bytes, that's -133 bytes per function affected. - 914 functions grew by 80747 bytes, that's +88 bytes per function, but there's 3x of them. That's a lot of net text growth, isn't it? It's certainly not just a single instruction or two per inlining, as asm_inline() would suggest. > One example from Uros: > > $ grep "<encode_string>" objdump.old > > 00000000004506e0 <encode_string>: > 45113c: e8 9f f5 ff ff call 4506e0 <encode_string> > 452bcb: e9 10 db ff ff jmp 4506e0 <encode_string> > 453d33: e8 a8 c9 ff ff call 4506e0 <encode_string> > 453ef7: e8 e4 c7 ff ff call 4506e0 <encode_string> > 45549f: e8 3c b2 ff ff call 4506e0 <encode_string> > 455843: e8 98 ae ff ff call 4506e0 <encode_string> > 455b37: e8 a4 ab ff ff call 4506e0 <encode_string> > 455b47: e8 94 ab ff ff call 4506e0 <encode_string> > 4564fa: e8 e1 a1 ff ff call 4506e0 <encode_string> > 456669: e8 72 a0 ff ff call 4506e0 <encode_string> > 456691: e8 4a a0 ff ff call 4506e0 <encode_string> > 4566a0: e8 3b a0 ff ff call 4506e0 <encode_string> > 4569aa: e8 31 9d ff ff call 4506e0 <encode_string> > 456e79: e9 62 98 ff ff jmp 4506e0 <encode_string> > 456efe: e9 dd 97 ff ff jmp 4506e0 <encode_string> > > All these are calls now inline: > > encode_string 58 - -58 > > ... where for example encode_putfh() grows by: > > encode_putfh 70 118 +48 That still doesn't make it clear where the apparently ~10 instructions per inlining come from, right? Thanks, Ingo
On Tue, Apr 22, 2025 at 2:02 PM Ingo Molnar <mingo@kernel.org> wrote: > > > * Josh Poimboeuf <jpoimboe@kernel.org> wrote: > > > Use asm_inline() in the instrumentation begin/end macros to prevent the > > compiler from making poor inlining decisions based on the length of the > > objtool annotations. > > > > Without the objtool annotations, each macro resolves to a single NOP. > > Using inline_asm() seems obviously correct here as it accurately > > communicates the actual code size to the compiler. > > s/inline_asm > /asm_inline > > > > > These macros are used by WARN() and lockdep, so this change can affect a > > lot of functions. > > > > For a defconfig kernel built with GCC 14.2.1, bloat-o-meter reports a > > 0.17% increase in text size: > > > > add/remove: 74/352 grow/shrink: 914/353 up/down: 80747/-47120 (33627) > > Total: Before=19460272, After=19493899, chg +0.17% > > > > The text growth is presumably due to increased inlining. A net total of > > 278 functions were removed (+74 / -352). Each of the removed functions > > is likely inlined at multiple sites which explains the somewhat > > significant code growth. > > So: > > - 353 function shrunk by 47120 bytes, that's -133 bytes per function > affected. > > - 914 functions grew by 80747 bytes, that's +88 bytes per function, > but there's 3x of them. > > That's a lot of net text growth, isn't it? It's certainly not just a > single instruction or two per inlining, as asm_inline() would suggest. > > > One example from Uros: > > > > $ grep "<encode_string>" objdump.old > > > > 00000000004506e0 <encode_string>: > > 45113c: e8 9f f5 ff ff call 4506e0 <encode_string> > > 452bcb: e9 10 db ff ff jmp 4506e0 <encode_string> > > 453d33: e8 a8 c9 ff ff call 4506e0 <encode_string> > > 453ef7: e8 e4 c7 ff ff call 4506e0 <encode_string> > > 45549f: e8 3c b2 ff ff call 4506e0 <encode_string> > > 455843: e8 98 ae ff ff call 4506e0 <encode_string> > > 455b37: e8 a4 ab ff ff call 4506e0 <encode_string> > > 455b47: e8 94 ab ff ff call 4506e0 <encode_string> > > 4564fa: e8 e1 a1 ff ff call 4506e0 <encode_string> > > 456669: e8 72 a0 ff ff call 4506e0 <encode_string> > > 456691: e8 4a a0 ff ff call 4506e0 <encode_string> > > 4566a0: e8 3b a0 ff ff call 4506e0 <encode_string> > > 4569aa: e8 31 9d ff ff call 4506e0 <encode_string> > > 456e79: e9 62 98 ff ff jmp 4506e0 <encode_string> > > 456efe: e9 dd 97 ff ff jmp 4506e0 <encode_string> > > > > All these are calls now inline: > > > > encode_string 58 - -58 > > > > ... where for example encode_putfh() grows by: > > > > encode_putfh 70 118 +48 > > That still doesn't make it clear where the apparently ~10 instructions > per inlining come from, right? The growth is actually from different inlining decisions, that cover not only inlining of small functions, but other code blocks (hot vs. cold, tail duplication, etc) too. The compiler uses certain thresholds to estimate inlining gain (thresholds are different for -Os and -O2). Artificially bloated functions that don't use asm_inline() fall under this threshold (IOW, the inlining would increase size too much), so they are not inlined; code blocks that enclose unfixed asm clauses are treated differently than when they use asm_inline() instead of asm(). When asm_inline() is introduced, the size of the function (and consequently inlining gain) is estimated more accurately, the estimated size is lower, so there is more inlining happening. I'd again remark that the code size is not the right metric when compiling with -O2. Uros.
* Uros Bizjak <ubizjak@gmail.com> wrote: > > That still doesn't make it clear where the apparently ~10 > > instructions per inlining come from, right? > > The growth is actually from different inlining decisions, that cover > not only inlining of small functions, but other code blocks (hot vs. > cold, tail duplication, etc) too. The compiler uses certain > thresholds to estimate inlining gain (thresholds are different for > -Os and -O2). Artificially bloated functions that don't use > asm_inline() fall under this threshold (IOW, the inlining would > increase size too much), so they are not inlined; code blocks that > enclose unfixed asm clauses are treated differently than when they > use asm_inline() instead of asm(). When asm_inline() is introduced, > the size of the function (and consequently inlining gain) is > estimated more accurately, the estimated size is lower, so there is > more inlining happening. > > I'd again remark that the code size is not the right metric when > compiling with -O2. Understood, but still we somehow have to be able to measure whether the marking of these primitives with asm_inline() is beneficial in isolation - even if on a real build the noise of GCC's overall inlining decisions obscure the results - and may even reverse them. Is there a way to coax GCC into a mode of build where such changes can be measured in a better fashion? For example would setting -finline-limit=1000 or -finline-limit=10 (or some other well-chosen inlining threshold value, or tweaking any of the inliner parameters via --param values?), just for the sake of measurement, give us more representative .text size change values? Because, ideally, if we do these decisions correctly at the asm() level, compilers will, eventually, after a few decades, catch up and do the right thing as well. ;-) Thanks, Ingo
On Tue, Apr 22, 2025 at 10:05 PM Ingo Molnar <mingo@kernel.org> wrote:
>
>
> * Uros Bizjak <ubizjak@gmail.com> wrote:
>
> > > That still doesn't make it clear where the apparently ~10
> > > instructions per inlining come from, right?
> >
> > The growth is actually from different inlining decisions, that cover
> > not only inlining of small functions, but other code blocks (hot vs.
> > cold, tail duplication, etc) too. The compiler uses certain
> > thresholds to estimate inlining gain (thresholds are different for
> > -Os and -O2). Artificially bloated functions that don't use
> > asm_inline() fall under this threshold (IOW, the inlining would
> > increase size too much), so they are not inlined; code blocks that
> > enclose unfixed asm clauses are treated differently than when they
> > use asm_inline() instead of asm(). When asm_inline() is introduced,
> > the size of the function (and consequently inlining gain) is
> > estimated more accurately, the estimated size is lower, so there is
> > more inlining happening.
> >
> > I'd again remark that the code size is not the right metric when
> > compiling with -O2.
>
> Understood, but still we somehow have to be able to measure whether the
> marking of these primitives with asm_inline() is beneficial in
> isolation - even if on a real build the noise of GCC's overall inlining
> decisions obscure the results - and may even reverse them.
>
> Is there a way to coax GCC into a mode of build where such changes can
> be measured in a better fashion?
There are several debug options that report details of inliner
decisions. You can add -fdump-ipa-inline or -fdump-ipa-inline-details
[1] to generate a debug file for interprocedural inlining.
[1] https://gcc.gnu.org/onlinedocs/gcc/Developer-Options.html
> For example would setting -finline-limit=1000 or -finline-limit=10 (or
> some other well-chosen inlining threshold value, or tweaking any of the
> inliner parameters via --param values?), just for the sake of
> measurement, give us more representative .text size change values?
I don't think so, because inliner uses pseudo instructions [2] where:
_Note:_ pseudo instruction represents, in this particular context,
an abstract measurement of function's size. In no way does it
represent a count of assembly instructions and as such its exact
meaning might change from one release to an another.
[2] https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html#index-finline-limit
OTOH, there are plenty of --param choices to play with the inliner
besides -finline-limit= option. Please see [3]
[3] https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html#index-param
In the dump mentioned above, you will get e.g.:
IPA function summary for pfmemalloc_match/5736 inlinable
global time: 8.200000
self size: 11
global size: 11
min size: 7
self stack: 0
global stack: 0
size:4.000000, time:4.000000
size:3.000000, time:2.000000, executed if:(not inlined)
size:0.500000, time:0.500000, executed if:(op0 not sra candidate)
&& (not inlined)
size:0.500000, time:0.500000, executed if:(op0 not sra candidate)
...
The estimator estimates size and execution time and decides how to
(and if) inline the function.
> Because, ideally, if we do these decisions correctly at the asm()
> level, compilers will, eventually, after a few decades, catch up
> and do the right thing as well. ;-)
We (as in gcc developers) are eagerly waiting for better tuning
parameters that would satisfy everyone's needs. Rest assured that many
have tried to fine-tune the heuristics, with various levels of success
;)
Jokes aside, it is important to feed the estimator correct data, as
precise as possible. There are limitations with asm(), because the
compiler doesn't know what is inside the asm template. It estimates
one instruction for every instruction delimiter, where the size of
"one instruction" is estimated to 16 bytes. In case of __ASM_ANNOTATE,
the estimator estimates 5 instructions and 80 bytes total vs. one
instruction when asm_inline() is used. Based on this fact, I think
that changing asm() to asm_inline() for insns with __ASM_ANNOTATE is
beneficial.
Thanks,
Uros.
On Mon, Apr 14, 2025 at 05:27:11PM -0700, Josh Poimboeuf wrote: > Use asm_inline() in the instrumentation begin/end macros to prevent the > compiler from making poor inlining decisions based on the length of the > objtool annotations. > > Without the objtool annotations, each macro resolves to a single NOP. > Using inline_asm() seems obviously correct here as it accurately > communicates the actual code size to the compiler. > > These macros are used by WARN() and lockdep, so this change can affect a > lot of functions. > > For a defconfig kernel built with GCC 14.2.1, bloat-o-meter reports a > 0.17% increase in text size: > > add/remove: 74/352 grow/shrink: 914/353 up/down: 80747/-47120 (33627) > Total: Before=19460272, After=19493899, chg +0.17% Hmm, I was surprised that defconfig was affected at all. Why does defconfig have DEBUG_ENTRY on?
On Tue, Apr 15, 2025 at 09:48:49AM +0200, Peter Zijlstra wrote:
> On Mon, Apr 14, 2025 at 05:27:11PM -0700, Josh Poimboeuf wrote:
> > Use asm_inline() in the instrumentation begin/end macros to prevent the
> > compiler from making poor inlining decisions based on the length of the
> > objtool annotations.
> >
> > Without the objtool annotations, each macro resolves to a single NOP.
> > Using inline_asm() seems obviously correct here as it accurately
> > communicates the actual code size to the compiler.
> >
> > These macros are used by WARN() and lockdep, so this change can affect a
> > lot of functions.
> >
> > For a defconfig kernel built with GCC 14.2.1, bloat-o-meter reports a
> > 0.17% increase in text size:
> >
> > add/remove: 74/352 grow/shrink: 914/353 up/down: 80747/-47120 (33627)
> > Total: Before=19460272, After=19493899, chg +0.17%
>
> Hmm, I was surprised that defconfig was affected at all. Why does
> defconfig have DEBUG_ENTRY on?
Hm, looks like Ingo enabled that with:
70c8dc910427 ("x86/defconfig: Enable CONFIG_DEBUG_ENTRY=y")
Though, is there any reason NOINSTR validation needs to depend on
DEBUG_ENTRY?
--
Josh
© 2016 - 2025 Red Hat, Inc.