[PATCH v5 4/6] x86: control memset() and memcpy() inlining

Posted by Jan Beulich 4 months, 3 weeks ago
Stop the compiler from inlining non-trivial memset() and memcpy() (for
memset() see e.g. map_vcpu_info() or kimage_load_segments() for
examples). This way we even keep the compiler from using REP STOSQ /
REP MOVSQ when we'd prefer REP STOSB / REP MOVSB (when ERMS is
available).
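
For illustration (a made-up sketch, not the actual code of either
function named above):

    /* Sketch only: a memset() of fixed, non-trivial size. */
    #include <string.h>

    void clear_buf_sketch(void *p)
    {
        /*
         * Without the options added below, the compiler may expand
         * this inline, e.g. as REP STOSQ.  With them, sizes beyond
         * the small-size threshold become a real call to memset(),
         * whose implementation can use REP STOSB where ERMS makes
         * that preferable.
         */
        memset(p, 0, 256);
    }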

With gcc10 this yields a modest .text size reduction (release build) of
around 2k.

Unfortunately these options aren't understood by the clang versions I
have readily available for testing with; I'm unaware of equivalents.

Note also that using cc-option-add is not an option here, or at least I
couldn't make things work with it (in case the option was not supported
by the compiler): The embedded comma in the option looks to be getting
in the way.
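
For reference, the make-level hazard looks like this (sketch only; the
$(comma) helper is assumed to be defined by the build system):

    # Inside $(call ...) a literal comma separates arguments, so it
    # has to be hidden behind a variable to remain part of the option:
    comma := ,
    CFLAGS += $(call cc-option,$(CC),-mmemcpy-strategy=unrolled_loop:16:noalign$(comma)libcall:-1:noalign)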

Requested-by: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v3: Re-base.
v2: New.
---
The boundary values are of course up for discussion - I wasn't really
certain whether to use 16 or 32; I'd be less certain about using yet
larger values.

Similarly whether to permit the compiler to emit REP STOSQ / REP MOVSQ
for known size, properly aligned blocks is up for discussion.
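
If we wanted to permit that, an (untested) variant along these lines
ought to express it, using gcc's rep_8byte algorithm for medium-size
blocks with known destination alignment:

    # Untested sketch: also allow inline REP STOSQ for aligned blocks
    # of up to 256 bytes, with the library call beyond that.
    CFLAGS += $(call cc-option,$(CC),-mmemset-strategy=unrolled_loop:16:noalign$(comma)rep_8byte:256:align$(comma)libcall:-1:noalign)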

--- a/xen/arch/x86/arch.mk
+++ b/xen/arch/x86/arch.mk
@@ -58,6 +58,9 @@ endif
 $(call cc-option-add,CFLAGS_stack_boundary,CC,-mpreferred-stack-boundary=3)
 export CFLAGS_stack_boundary
 
+CFLAGS += $(call cc-option,$(CC),-mmemcpy-strategy=unrolled_loop:16:noalign$(comma)libcall:-1:noalign)
+CFLAGS += $(call cc-option,$(CC),-mmemset-strategy=unrolled_loop:16:noalign$(comma)libcall:-1:noalign)
+
 ifeq ($(CONFIG_UBSAN),y)
 # Don't enable alignment sanitisation.  x86 has efficient unaligned accesses,
 # and various things (ACPI tables, hypercall pages, stubs, etc) are wont-fix.
Re: [PATCH v5 4/6] x86: control memset() and memcpy() inlining
Posted by Teddy Astie 4 months, 3 weeks ago
On 05.06.2025 12:28, Jan Beulich wrote:
> Stop the compiler from inlining non-trivial memset() and memcpy() (for
> memset() see e.g. map_vcpu_info() or kimage_load_segments() for
> examples). This way we even keep the compiler from using REP STOSQ /
> REP MOVSQ when we'd prefer REP STOSB / REP MOVSB (when ERMS is
> available).
> 

If the size is known and constant, the compiler is able to generate a
trivial rep movs/stos (usually preceded by a mov $x, %ecx). I don't
see a reason to prevent that, or to force a function call, as I
suppose the plain inline rep movs/stos is very likely to perform
better than a function call (even if it is not the preferred rep
movsb/stosb), possibly also being smaller.

I wonder whether it is possible to generate inline rep movs/stos only
for "trivial" cases (i.e. ones preceded by a plain mov $x, %ecx), and
rely on either inline movs or a function call in the other,
non-trivial cases.
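
For example (an illustrative sketch, not code from Xen), a case I'd
consider trivial:

    #include <string.h>

    struct pkt { unsigned char data[64]; };

    void copy_pkt(struct pkt *dst, const struct pkt *src)
    {
        /*
         * Constant, known size: compilers commonly emit something
         * like "mov $8, %ecx; rep movsq" here, which plausibly beats
         * the call/return overhead of an out-of-line memcpy().
         */
        memcpy(dst, src, sizeof(*dst));
    }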

> With gcc10 this yields a modest .text size reduction (release build) of
> around 2k.
> 
> Unfortunately these options aren't understood by the clang versions I
> have readily available for testing with; I'm unaware of equivalents.
> 
> Note also that using cc-option-add is not an option here, or at least I
> couldn't make things work with it (in case the option was not supported
> by the compiler): The embedded comma in the option looks to be getting
> in the way.
> 
> Requested-by: Andrew Cooper <andrew.cooper3@citrix.com>
> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> ---
> v3: Re-base.
> v2: New.
> ---
> The boundary values are of course up for discussion - I wasn't really
> certain whether to use 16 or 32; I'd be less certain about using yet
> larger values.
> 
> Similarly whether to permit the compiler to emit REP STOSQ / REP MOVSQ
> for known size, properly aligned blocks is up for discussion.
> 
> --- a/xen/arch/x86/arch.mk
> +++ b/xen/arch/x86/arch.mk
> @@ -58,6 +58,9 @@ endif
>   $(call cc-option-add,CFLAGS_stack_boundary,CC,-mpreferred-stack-boundary=3)
>   export CFLAGS_stack_boundary
>   
> +CFLAGS += $(call cc-option,$(CC),-mmemcpy-strategy=unrolled_loop:16:noalign$(comma)libcall:-1:noalign)
> +CFLAGS += $(call cc-option,$(CC),-mmemset-strategy=unrolled_loop:16:noalign$(comma)libcall:-1:noalign)
> +
>   ifeq ($(CONFIG_UBSAN),y)
>   # Don't enable alignment sanitisation.  x86 has efficient unaligned accesses,
>   # and various things (ACPI tables, hypercall pages, stubs, etc) are wont-fix.

Teddy


Teddy Astie | Vates XCP-ng Developer

XCP-ng & Xen Orchestra - Vates solutions

web: https://vates.tech
Re: [PATCH v5 4/6] x86: control memset() and memcpy() inlining
Posted by Jan Beulich 4 months, 3 weeks ago
On 05.06.2025 19:34, Teddy Astie wrote:
> On 05.06.2025 12:28, Jan Beulich wrote:
>> Stop the compiler from inlining non-trivial memset() and memcpy() (for
>> memset() see e.g. map_vcpu_info() or kimage_load_segments() for
>> examples). This way we even keep the compiler from using REP STOSQ /
>> REP MOVSQ when we'd prefer REP STOSB / REP MOVSB (when ERMS is
>> available).
>>
> 
> If the size is known and constant, the compiler is able to generate a
> trivial rep movs/stos (usually preceded by a mov $x, %ecx). I don't
> see a reason to prevent that, or to force a function call, as I
> suppose the plain inline rep movs/stos is very likely to perform
> better than a function call (even if it is not the preferred rep
> movsb/stosb), possibly also being smaller.
> 
> I wonder whether it is possible to generate inline rep movs/stos only
> for "trivial" cases (i.e. ones preceded by a plain mov $x, %ecx), and
> rely on either inline movs or a function call in the other,
> non-trivial cases.

Note how the description starts with "Stop the compiler from inlining
non-trivial ..."; trivial cases indeed remained unaffected, according
to my observations (back at the time).

Jan

Re: [PATCH v5 4/6] x86: control memset() and memcpy() inlining
Posted by Teddy Astie 4 months, 3 weeks ago
On 06.06.2025 11:21, Jan Beulich wrote:
> On 05.06.2025 19:34, Teddy Astie wrote:
>> On 05.06.2025 12:28, Jan Beulich wrote:
>>> Stop the compiler from inlining non-trivial memset() and memcpy() (for
>>> memset() see e.g. map_vcpu_info() or kimage_load_segments() for
>>> examples). This way we even keep the compiler from using REP STOSQ /
>>> REP MOVSQ when we'd prefer REP STOSB / REP MOVSB (when ERMS is
>>> available).
>>>
>>
>> If the size is known and constant, the compiler is able to generate a
>> trivial rep movs/stos (usually preceded by a mov $x, %ecx). I don't
>> see a reason to prevent that, or to force a function call, as I
>> suppose the plain inline rep movs/stos is very likely to perform
>> better than a function call (even if it is not the preferred rep
>> movsb/stosb), possibly also being smaller.
>>
>> I wonder whether it is possible to generate inline rep movs/stos only
>> for "trivial" cases (i.e. ones preceded by a plain mov $x, %ecx), and
>> rely on either inline movs or a function call in the other,
>> non-trivial cases.
> 
> Note how the description starts with "Stop the compiler from inlining
> non-trivial ...", which indeed remain unaffected according to my
> observations (back at the time).

Yes, at least that is what appears to happen when testing with GCC
14.2.0, where non-trivial memset()/memcpy() get replaced with explicit
function calls, while some trivial ones still use rep movsb/l/q.

Though,

> unrolled_loop:16:noalign,libcall:-1:noalign

to me sounds like:
- use an inline unrolled loop for memcpy()/memset() of 0-16 bytes,
- call memset()/memcpy() in all other cases
(thus no rep prefix would be used at all).

The meaning of align/noalign is only vaguely documented in the GCC
manual, so it's unclear whether it affects only unaligned copies, or
potentially all of them.

> 
> Jan
> 

Teddy

