Stop the compiler from inlining non-trivial memset() and memcpy() (for
memset() examples see e.g. map_vcpu_info() or kimage_load_segments()).
This way we even keep the compiler from using REP STOSQ / REP MOVSQ
when we'd prefer REP STOSB / REP MOVSB (when ERMS is available).
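
As a purely illustrative sketch (hypothetical code, not taken from the
Xen tree and not part of the patch), the distinction is between call
sites like these:

  /*
   * With -mmemset-strategy=unrolled_loop:16:noalign,libcall:-1:noalign
   * the first call (constant, small size) can still be expanded inline,
   * e.g. as a couple of stores or a short unrolled loop, while the
   * second one (size not known at compile time) is expected to become an
   * actual call to memset(), leaving the REP STOSB vs. REP STOSQ choice
   * to the library implementation.
   */
  #include <string.h>

  struct small_obj { unsigned long a, b; };   /* hypothetical 16-byte object */

  void clear_small(struct small_obj *o)
  {
      memset(o, 0, sizeof(*o));   /* trivial: constant, small size */
  }

  void clear_variable(void *p, size_t n)
  {
      memset(p, 0, n);            /* non-trivial: variable size */
  }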
With gcc10 this yields a modest .text size reduction (release build) of
around 2k.
Unfortunately these options aren't understood by the clang versions I
have readily available for testing; I'm not aware of any equivalents.
Note also that using cc-option-add is not an option here, or at least I
couldn't make things work with it for the case where the option is not
supported by the compiler: the embedded comma in the option appears to
get in the way.
Requested-by: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v3: Re-base.
v2: New.
---
The boundary values are of course up for discussion - I wasn't really
certain whether to use 16 or 32; I'd be less certain about using yet
larger values.
Similarly, whether to permit the compiler to emit REP STOSQ / REP MOVSQ
for known-size, properly aligned blocks is up for discussion.
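
For the latter point, a hypothetical example of such a call site (again
not from the Xen tree): a fixed-size, naturally aligned buffer which,
absent these options, the compiler may well clear with REP STOSQ, but
which under the proposed strategy is expected to become a memset() call
(which can then use REP STOSB where ERMS is available):

  #include <string.h>

  /* Hypothetical example: 512 bytes, 8-byte aligned, size known at
   * compile time. */
  static unsigned long scratch[64];

  void clear_scratch(void)
  {
      memset(scratch, 0, sizeof(scratch));
  }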
--- a/xen/arch/x86/arch.mk
+++ b/xen/arch/x86/arch.mk
@@ -58,6 +58,9 @@ endif
 $(call cc-option-add,CFLAGS_stack_boundary,CC,-mpreferred-stack-boundary=3)
 export CFLAGS_stack_boundary
 
+CFLAGS += $(call cc-option,$(CC),-mmemcpy-strategy=unrolled_loop:16:noalign$(comma)libcall:-1:noalign)
+CFLAGS += $(call cc-option,$(CC),-mmemset-strategy=unrolled_loop:16:noalign$(comma)libcall:-1:noalign)
+
 ifeq ($(CONFIG_UBSAN),y)
 # Don't enable alignment sanitisation.  x86 has efficient unaligned accesses,
 # and various things (ACPI tables, hypercall pages, stubs, etc) are wont-fix.

On 05/06/2025 at 12:28, Jan Beulich wrote:
> Stop the compiler from inlining non-trivial memset() and memcpy() (for
> memset() examples see e.g. map_vcpu_info() or kimage_load_segments()).
> This way we even keep the compiler from using REP STOSQ / REP MOVSQ
> when we'd prefer REP STOSB / REP MOVSB (when ERMS is available).

If the size is known and constant, and the compiler is able to generate
a trivial rep movs/stos (usually with a mov $x, %ecx before it), I don't
see a reason to prevent that, or to force a function call instead: I
suppose it is very likely that the plain inline rep movs/stos will
perform better than a function call (even if it is not the preferred
rep movsb/stosb), possibly also being smaller.

I wonder if it is possible to only generate inline rep movs/stos for
"trivial cases" (i.e. preceded by a plain mov $x, %ecx), and rely on
either inline movs or a function call in other cases (non-trivial ones).

[...]

Teddy

Teddy Astie | Vates XCP-ng Developer
XCP-ng & Xen Orchestra - Vates solutions
web: https://vates.tech
On 05.06.2025 19:34, Teddy Astie wrote:
> On 05/06/2025 at 12:28, Jan Beulich wrote:
>> Stop the compiler from inlining non-trivial memset() and memcpy() (for
>> memset() examples see e.g. map_vcpu_info() or kimage_load_segments()).
>> This way we even keep the compiler from using REP STOSQ / REP MOVSQ
>> when we'd prefer REP STOSB / REP MOVSB (when ERMS is available).
>
> If the size is known and constant, and the compiler is able to generate
> a trivial rep movs/stos (usually with a mov $x, %ecx before it), I don't
> see a reason to prevent that, or to force a function call instead: I
> suppose it is very likely that the plain inline rep movs/stos will
> perform better than a function call (even if it is not the preferred
> rep movsb/stosb), possibly also being smaller.
>
> I wonder if it is possible to only generate inline rep movs/stos for
> "trivial cases" (i.e. preceded by a plain mov $x, %ecx), and rely on
> either inline movs or a function call in other cases (non-trivial ones).

Note how the description starts with "Stop the compiler from inlining
non-trivial ..."; trivial cases indeed remain unaffected, according to
my observations (back at the time).

Jan
On 06/06/2025 at 11:21, Jan Beulich wrote:
> On 05.06.2025 19:34, Teddy Astie wrote:
>> [...]
>
> Note how the description starts with "Stop the compiler from inlining
> non-trivial ..."; trivial cases indeed remain unaffected, according to
> my observations (back at the time).

Yes, at least that is what appears to happen when testing with GCC
14.2.0, where non-trivial memset/memcpy are replaced with explicit
function calls, and some trivial ones still use rep movsb/l/q.

Though,
> unrolled_loop:16:noalign,libcall:-1:noalign
to me sounds like:
- use an inline unrolled loop for memcpy/memset of up to 16 bytes
- call memset/memcpy in all other cases (thus no rep prefix would be
  used)

The align/noalign meaning is only vaguely documented in the GCC
documentation, so it's unclear whether it only affects "non-aligned"
copies, or potentially all of them.

Teddy

Teddy Astie | Vates XCP-ng Developer
XCP-ng & Xen Orchestra - Vates solutions
web: https://vates.tech