Move the function to its own assembly file. Having it in C just for the
entire body to be an asm() isn't really helpful. Then have two flavors:
A "basic" version using qword steps for the bulk of the operation, and an
ERMS version for modern hardware, to be substituted in via alternatives
patching.
Alternatives patching, however, requires an extra precaution: It uses
memcpy() itself, and hence the function may patch itself. Luckily the
patched-in code only replaces the prolog of the original function. Make
sure this remains this way.
Additionally alternatives patching, while supposedly safe via enforcing
a control flow change when modifying already prefetched code, may not
really be. Afaict a request is pending to drop the first of the two
options in the SDM's "Handling Self- and Cross-Modifying Code" section.
Insert a serializing instruction there.
Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
We may want to consider branching over the REP MOVSQ as well, if the
number of qwords turns out to be zero.
We may also want to consider using non-REP MOVS{L,W,B} for the tail.
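A rough, untested sketch (purely illustrative, not part of this patch) of what
such a variant of the "basic" flavor might look like, branching over REP MOVSQ
when there are no full qwords and handling the tail with individual MOVS{L,W,B}:

        mov     %rdx, %rcx
        mov     %rdi, %rax      /* return value */
        and     $7, %edx        /* tail byte count */
        shr     $3, %rcx        /* qword count */
        jz      1f              /* no full qwords - skip REP MOVSQ */
        rep movsq
1:
        test    $4, %edx
        jz      2f
        movsl
2:
        test    $2, %edx
        jz      3f
        movsw
3:
        test    $1, %edx
        jz      4f
        movsb
4:
        RET
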
TBD: We may further need a workaround similar to Linux'es 8ca97812c3c8
     ("x86/mce: Work around an erratum on fast string copy
     instructions").
TBD: Some older AMD CPUs have an issue with REP MOVS when source and
     destination are misaligned with one another (modulo 32?), which may
     require a separate memcpy() flavor.
---
v5: Re-base.
v4: Use CR2 write as serializing insn, and limit its use to boot time.
v3: Re-base.
--- a/xen/arch/x86/Makefile
+++ b/xen/arch/x86/Makefile
@@ -47,6 +47,7 @@ obj-$(CONFIG_RETURN_THUNK) += indirect-t
 obj-$(CONFIG_PV) += ioport_emulate.o
 obj-y += irq.o
 obj-$(CONFIG_KEXEC) += machine_kexec.o
+obj-y += memcpy.o
 obj-y += memset.o
 obj-y += mm.o x86_64/mm.o
 obj-$(CONFIG_VM_EVENT) += monitor.o
--- a/xen/arch/x86/alternative.c
+++ b/xen/arch/x86/alternative.c
@@ -195,12 +195,16 @@ void *place_ret(void *ptr)
  * executing.
  *
  * "noinline" to cause control flow change and thus invalidate I$ and
- * cause refetch after modification.
+ * cause refetch after modification.  While the SDM continues to suggest this
+ * is sufficient, it may not be - issue a serializing insn afterwards as well,
+ * unless this is for live-patching.
  */
 static void init_or_livepatch noinline
 text_poke(void *addr, const void *opcode, size_t len)
 {
     memcpy(addr, opcode, len);
+    if ( system_state < SYS_STATE_active )
+        asm volatile ( "mov %%rax, %%cr2" ::: "memory" );
 }
 
 extern void *const __initdata_cf_clobber_start[];
--- /dev/null
+++ b/xen/arch/x86/memcpy.S
@@ -0,0 +1,20 @@
+#include <asm/asm_defns.h>
+
+FUNC(memcpy)
+        mov     %rdx, %rcx
+        mov     %rdi, %rax
+        /*
+         * We need to be careful here: memcpy() is involved in alternatives
+         * patching, so the code doing the actual copying (i.e. past setting
+         * up registers) may not be subject to patching (unless further
+         * precautions were taken).
+         */
+        ALTERNATIVE "and $7, %edx; shr $3, %rcx", \
+                    STR(rep movsb; RET), X86_FEATURE_ERMS
+        rep movsq
+        or      %edx, %ecx
+        jz      1f
+        rep movsb
+1:
+        RET
+END(memcpy)
--- a/xen/arch/x86/string.c
+++ b/xen/arch/x86/string.c
@@ -7,21 +7,6 @@
 
 #include <xen/lib.h>
 
-void *(memcpy)(void *dest, const void *src, size_t n)
-{
-    long d0, d1, d2;
-
-    asm volatile (
-        "   rep ; movs"__OS" ; "
-        "   mov %k4,%k3      ; "
-        "   rep ; movsb        "
-        : "=&c" (d0), "=&D" (d1), "=&S" (d2)
-        : "0" (n/BYTES_PER_LONG), "r" (n%BYTES_PER_LONG), "1" (dest), "2" (src)
-        : "memory" );
-
-    return dest;
-}
-
 void *(memmove)(void *dest, const void *src, size_t n)
 {
    long d0, d1, d2;

On 05/06/2025 11:25 am, Jan Beulich wrote:
> --- a/xen/arch/x86/alternative.c
> +++ b/xen/arch/x86/alternative.c
> @@ -195,12 +195,16 @@ void *place_ret(void *ptr)
>   * executing.
>   *
>   * "noinline" to cause control flow change and thus invalidate I$ and
> - * cause refetch after modification.
> + * cause refetch after modification.  While the SDM continues to suggest this
> + * is sufficient, it may not be - issue a serializing insn afterwards as well,
> + * unless this is for live-patching.
>   */
>  static void init_or_livepatch noinline
>  text_poke(void *addr, const void *opcode, size_t len)
>  {
>      memcpy(addr, opcode, len);
> +    if ( system_state < SYS_STATE_active )
> +        asm volatile ( "mov %%rax, %%cr2" ::: "memory" );
>  }
This hunk wants pulling out separately.  It's not really related to
memcpy(), and probably ought to be backported.
Architecturally, both the APM and SDM say you're ok doing nothing on
64bit capable CPUs.
However, there are errata, and at least one recent AMD CPU needs
serialisation for self modifying code.  (Not sure if the rev guide has
been updated yet, and I can't remember offhand which CPU it is.)
You should also discuss the choice of serialising instruction in a
comment.  Mov to %cr2 is serialising on everything more modern than the
486, and least likely to be intercepted under virt (== most performant).
You also need to explain that we only do true SMC during boot.  After
boot for livepatch, it's prior to addr going live.
Finally, you're loading rubble into %cr2.  It's not even reliably 'addr'
because of the transformations the compiler is permitted to make on
memcpy().  I really think you should be reliably feeding in 0.  We're
doing self-modifying code here with serialisation; an extra xor won't be
measurable.
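
One possible shape of that (a sketch of the suggestion, not a tested change)
would be to feed a constant zero in via an input operand, which the compiler
will typically materialise with exactly that XOR:

    memcpy(addr, opcode, len);
    if ( system_state < SYS_STATE_active )
        /* Write a well-defined value (0) to %cr2, rather than rubble. */
        asm volatile ( "mov %0, %%cr2" :: "r" (0UL) : "memory" );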
~Andrew
P.S. We should drop the noinline.  That's only applicable to the P5(?)
and earlier where you had to use branches to flush the prefetch queue. 
It's irrelevant on 64bit capable CPUs.
                
On 05/06/2025 12:27, Jan Beulich wrote:
> Move the function to its own assembly file. Having it in C just for the
> entire body to be an asm() isn't really helpful. Then have two flavors:
> A "basic" version using qword steps for the bulk of the operation, and an
> ERMS version for modern hardware, to be substituted in via alternatives
> patching.
> 
> Alternatives patching, however, requires an extra precaution: It uses
> memcpy() itself, and hence the function may patch itself. Luckily the
> patched-in code only replaces the prolog of the original function. Make
> sure this remains this way.
> 
We can probably work around that by using a separate memcpy for
alternatives patching, so it wouldn't end up patching itself.
> Additionally alternatives patching, while supposedly safe via enforcing
> a control flow change when modifying already prefetched code, may not
> really be. Afaict a request is pending to drop the first of the two
> options in the SDM's "Handling Self- and Cross-Modifying Code" section.
> Insert a serializing instruction there.
> 
> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> ---
> We may want to consider branching over the REP MOVSQ as well, if the
> number of qwords turns out to be zero.
> We may also want to consider using non-REP MOVS{L,W,B} for the tail.
> 
> TBD: We may further need a workaround similar to Linux'es 8ca97812c3c8
>       ("x86/mce: Work around an erratum on fast string copy
>       instructions").
> 
> TBD: Some older AMD CPUs have an issue with REP MOVS when source and
>       destination are misaligned with one another (modulo 32?), which may
>       require a separate memcpy() flavor.
> ---
> v5: Re-base.
> v4: Use CR2 write as serializing insn, and limit its use to boot time.
> v3: Re-base.
> 
> --- a/xen/arch/x86/Makefile
> +++ b/xen/arch/x86/Makefile
> @@ -47,6 +47,7 @@ obj-$(CONFIG_RETURN_THUNK) += indirect-t
>   obj-$(CONFIG_PV) += ioport_emulate.o
>   obj-y += irq.o
>   obj-$(CONFIG_KEXEC) += machine_kexec.o
> +obj-y += memcpy.o
>   obj-y += memset.o
>   obj-y += mm.o x86_64/mm.o
>   obj-$(CONFIG_VM_EVENT) += monitor.o
> --- a/xen/arch/x86/alternative.c
> +++ b/xen/arch/x86/alternative.c
> @@ -195,12 +195,16 @@ void *place_ret(void *ptr)
>    * executing.
>    *
>    * "noinline" to cause control flow change and thus invalidate I$ and
> - * cause refetch after modification.
> + * cause refetch after modification.  While the SDM continues to suggest this
> + * is sufficient, it may not be - issue a serializing insn afterwards as well,
> + * unless this is for live-patching.
>    */
>   static void init_or_livepatch noinline
>   text_poke(void *addr, const void *opcode, size_t len)
>   {
>       memcpy(addr, opcode, len);
> +    if ( system_state < SYS_STATE_active )
> +        asm volatile ( "mov %%rax, %%cr2" ::: "memory" );
>   }
>   
>   extern void *const __initdata_cf_clobber_start[];
> --- /dev/null
> +++ b/xen/arch/x86/memcpy.S
> @@ -0,0 +1,20 @@
> +#include <asm/asm_defns.h>
> +
> +FUNC(memcpy)
> +        mov     %rdx, %rcx
> +        mov     %rdi, %rax
> +        /*
> +         * We need to be careful here: memcpy() is involved in alternatives
> +         * patching, so the code doing the actual copying (i.e. past setting
> +         * up registers) may not be subject to patching (unless further
> +         * precautions were taken).
> +         */
> +        ALTERNATIVE "and $7, %edx; shr $3, %rcx", \
> +                    STR(rep movsb; RET), X86_FEATURE_ERMS
> +        rep movsq
> +        or      %edx, %ecx
> +        jz      1f
> +        rep movsb
> +1:
> +        RET
> +END(memcpy)
> --- a/xen/arch/x86/string.c
> +++ b/xen/arch/x86/string.c
> @@ -7,21 +7,6 @@
>   
>   #include <xen/lib.h>
>   
> -void *(memcpy)(void *dest, const void *src, size_t n)
> -{
> -    long d0, d1, d2;
> -
> -    asm volatile (
> -        "   rep ; movs"__OS" ; "
> -        "   mov %k4,%k3      ; "
> -        "   rep ; movsb        "
> -        : "=&c" (d0), "=&D" (d1), "=&S" (d2)
> -        : "0" (n/BYTES_PER_LONG), "r" (n%BYTES_PER_LONG), "1" (dest), "2" (src)
> -        : "memory" );
> -
> -    return dest;
> -}
> -
>   void *(memmove)(void *dest, const void *src, size_t n)
>   {
>       long d0, d1, d2;
> 
> 
Aside from that:
Reviewed-by: Teddy Astie <teddy.astie@vates.tech>
Teddy
Teddy Astie | Vates XCP-ng Developer
XCP-ng & Xen Orchestra - Vates solutions
web: https://vates.tech
                
On 05.06.2025 19:06, Teddy Astie wrote:
> On 05/06/2025 12:27, Jan Beulich wrote:
>> Move the function to its own assembly file. Having it in C just for the
>> entire body to be an asm() isn't really helpful. Then have two flavors:
>> A "basic" version using qword steps for the bulk of the operation, and an
>> ERMS version for modern hardware, to be substituted in via alternatives
>> patching.
>>
>> Alternatives patching, however, requires an extra precaution: It uses
>> memcpy() itself, and hence the function may patch itself. Luckily the
>> patched-in code only replaces the prolog of the original function. Make
>> sure this remains this way.
>
> We can probably work around that by using a separate memcpy for
> alternatives patching, so it wouldn't end up patching itself.

We could, yes, but imo we better wouldn't.

> Aside from that:
> Reviewed-by: Teddy Astie <teddy.astie@vates.tech>

Please clarify whether this applies without that suggestion of yours taken
care of.

Jan
On 06/06/2025 11:13, Jan Beulich wrote:
> On 05.06.2025 19:06, Teddy Astie wrote:
>> On 05/06/2025 12:27, Jan Beulich wrote:
>>> Move the function to its own assembly file. Having it in C just for the
>>> entire body to be an asm() isn't really helpful. Then have two flavors:
>>> A "basic" version using qword steps for the bulk of the operation, and an
>>> ERMS version for modern hardware, to be substituted in via alternatives
>>> patching.
>>>
>>> Alternatives patching, however, requires an extra precaution: It uses
>>> memcpy() itself, and hence the function may patch itself. Luckily the
>>> patched-in code only replaces the prolog of the original function. Make
>>> sure this remains this way.
>>
>> We can probably work around that by using a separate memcpy for
>> alternatives patching, so it wouldn't end up patching itself.
>
> We could, yes, but imo we better wouldn't.

As Andrew pointed out, it's not that simple to use a separate memcpy, so we
should probably keep the current approach.

>> Aside from that:
>> Reviewed-by: Teddy Astie <teddy.astie@vates.tech>
>
> Please clarify whether this applies without that suggestion of yours taken
> care of.

Yes.

> Jan

Teddy

Teddy Astie | Vates XCP-ng Developer
XCP-ng & Xen Orchestra - Vates solutions
web: https://vates.tech
On 05/06/2025 6:06 pm, Teddy Astie wrote:
> Le 05/06/2025 à 12:27, Jan Beulich a écrit :
>> Move the function to its own assembly file. Having it in C just for the
>> entire body to be an asm() isn't really helpful. Then have two flavors:
>> A "basic" version using qword steps for the bulk of the operation, and an
>> ERMS version for modern hardware, to be substituted in via alternatives
>> patching.
>>
>> Alternatives patching, however, requires an extra precaution: It uses
>> memcpy() itself, and hence the function may patch itself. Luckily the
>> patched-in code only replaces the prolog of the original function. Make
>> sure this remains this way.
>>
> We can probably workaround that by using a separate memcpy for 
> alternatives patching. So it wouldn't end up patching itself.
Not memcpy() you can't.
The compiler is permitted to invent calls to memset()/memcpy() out of
thin air, e.g. struct big_foo = {}; as a local on the stack.
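
For illustration (the names here are made up), something as simple as this can
end up as a call to memset() or memcpy() even though neither is written
anywhere in the source:

    struct big_foo { char buf[512]; };

    void consume(struct big_foo *p);

    void demo(void)
    {
        struct big_foo foo = {};    /* may well compile to "call memset" */

        consume(&foo);
    }

    void copy_demo(struct big_foo *dst, const struct big_foo *src)
    {
        *dst = *src;                /* may well compile to "call memcpy" */
    }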
This is the same reason why it's impossible to do speculation safety in
C; you cannot guarantee to get protections ahead of the first RET.
~Andrew