[PATCH] x86: handle the tail in rep_movs_alternative() with an overlapping store

Mateusz Guzik posted 1 patch 9 months ago
arch/x86/lib/copy_user_64.S | 7 ++++++-
1 file changed, 6 insertions(+), 1 deletion(-)
[PATCH] x86: handle the tail in rep_movs_alternative() with an overlapping store
Posted by Mateusz Guzik 9 months ago
Sizes in the <8,64> range are copied 8 bytes at a time, with a jump out
to a byte-at-a-time loop to handle the tail.

This is trivially avoidable with overlapping stores, which is the
standard technique for this kind of routine (in fact memcpy() already
uses it in a more extensive form).
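
In C terms the idea for the <8,64> range is roughly the following
(sketch only; a __builtin_memcpy of constant size 8 stands in for a
single movq):

static void copy_8_to_64(char *dst, const char *src, unsigned long len)
{
	unsigned long i;

	/* whole 8-byte words */
	for (i = 0; i + 8 <= len; i += 8)
		__builtin_memcpy(dst + i, src + i, 8);

	/* 1-7 byte tail: one 8-byte access overlapping the last word */
	if (i != len)
		__builtin_memcpy(dst + len - 8, src + len - 8, 8);
}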

I traced calls in this range during a kernel build and found that 65% of
all of them had a tail to take care of.

Distribution of size & 7 in that range is as follows:
@:
[0, 1)           3282178 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[1, 2)            949627 |@@@@@@@@@@@@@@@                                     |
[2, 3)           1078842 |@@@@@@@@@@@@@@@@@                                   |
[3, 4)            998995 |@@@@@@@@@@@@@@@                                     |
[4, 5)            744515 |@@@@@@@@@@@                                         |
[5, 6)            925437 |@@@@@@@@@@@@@@                                      |
[6, 7)            663305 |@@@@@@@@@@                                          |
[7, ...)          786007 |@@@@@@@@@@@@                                        |

@stats[notail]: 3282178
@stats[tail]: 6146728

Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
---

I added a call to a custom copy_user_probe() routine to
copy_user_generic() so that I could attach to it with bpftrace.
Available upon request. The one-liner is:

bpftrace -e 'kprobe:copy_user_probe /arg2 >= 8 && arg2 <= 64/ \
{ @ = lhist(arg2 & 7, 0, 7, 1); @stats[arg2 & 7 ? "tail" : "notail"] = count(); }'
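
For attaching, something as simple as an empty noinline hook taking the
same (to, from, len) arguments would do, e.g. (just a sketch, not the
actual probe):

noinline void copy_user_probe(void *to, const void *from, unsigned long len)
{
	/* empty on purpose: only here to give kprobes a stable attach point */
}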

Anyhow, I don't have any means to benchmark this at the moment as the
only hw I have access to is in fact a little too modern (FSRM), but
this being the standard technique I don't think it will require much
convincing. If anything, the question is why this differs from memcpy,
which *does* use overlapping stores.

Tested by forcing the kernel to use the routine; the kernel built just
fine with the change.

That said, absent my own bench results I'm not going to strongly argue
for the change, but I do think it is an OK tidy-up until someone(tm)
puts more effort in here.

 arch/x86/lib/copy_user_64.S | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/arch/x86/lib/copy_user_64.S b/arch/x86/lib/copy_user_64.S
index aa8c341b2441..2d5f42c521b5 100644
--- a/arch/x86/lib/copy_user_64.S
+++ b/arch/x86/lib/copy_user_64.S
@@ -65,10 +65,15 @@ SYM_FUNC_START(rep_movs_alternative)
 	je .Lexit
 	cmp $8,%ecx
 	jae .Lword
-	jmp .Lcopy_user_tail
+4:	movq -8(%rsi,%rcx),%rax
+5:	movq %rax,-8(%rdi,%rcx)
+	xorl %ecx,%ecx
+	RET
 
 	_ASM_EXTABLE_UA( 2b, .Lcopy_user_tail)
 	_ASM_EXTABLE_UA( 3b, .Lcopy_user_tail)
+	_ASM_EXTABLE_UA( 4b, .Lcopy_user_tail)
+	_ASM_EXTABLE_UA( 5b, .Lcopy_user_tail)
 
 .Llarge:
 0:	ALTERNATIVE "jmp .Llarge_movsq", "rep movsb", X86_FEATURE_ERMS
-- 
2.43.0
Re: [PATCH] x86: handle the tail in rep_movs_alternative() with an overlapping store
Posted by Linus Torvalds 9 months ago
On Thu, 20 Mar 2025 at 12:06, Mateusz Guzik <mjguzik@gmail.com> wrote:
>
> Sizes ranged <8,64> are copied 8 bytes at a time with a jump out to a
> 1 byte at a time loop to handle the tail.

I definitely do not mind this patch, but I think it doesn't go far enough.

It gets rid of the byte-at-a-time loop at the end, but only for the
short-copy case of 8-63 bytes.

The .Llarge_movsq ends up still doing

        testl %ecx,%ecx
        jne .Lcopy_user_tail
        RET

and while that is only triggered by the non-ERMS case, that's what
most older AMD CPUs will trigger, afaik.
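
In C terms that path currently amounts to roughly this (model only,
fault handling omitted):

static void large_movsq_model(char *dst, const char *src, unsigned long len)
{
	unsigned long words = len >> 3;
	unsigned long tail = len & 7;

	__builtin_memcpy(dst, src, words * 8);	/* rep movsq */
	dst += words * 8;
	src += words * 8;

	while (tail--)				/* .Lcopy_user_tail, byte at a time */
		*dst++ = *src++;
}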

So I think that if we do this, we should do it properly.

               Linus
Re: [PATCH] x86: handle the tail in rep_movs_alternative() with an overlapping store
Posted by Mateusz Guzik 9 months ago
On Thu, Mar 20, 2025 at 8:23 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Thu, 20 Mar 2025 at 12:06, Mateusz Guzik <mjguzik@gmail.com> wrote:
> >
> > Sizes ranged <8,64> are copied 8 bytes at a time with a jump out to a
> > 1 byte at a time loop to handle the tail.
>
> I definitely do not mind this patch, but I think it doesn't go far enough.
>
> It gets rid of the byte-at-a-time loop at the end, but only for the
> short-copy case of 8-63 bytes.
>

This bit I can vouch for.

> The .Llarge_movsq ends up still doing
>
>         testl %ecx,%ecx
>         jne .Lcopy_user_tail
>         RET
>
> and while that is only triggered by the non-ERMS case, that's what
> most older AMD CPU's will trigger, afaik.
>

This bit I can't.

Per my other e-mail it has been several years since I was seriously
digging in the area (around 7 by now I think) and details are rather
fuzzy.

I have a recollection that handling the tail after rep movsq with an
overlapping store was suffering a penalty big enough to warrant a
"normal" copy instead, avoiding the just written to area. I see my old
routine $elsewhere makes sure to do it. I don't have sensible hw to
bench this on either at the moment.

That said, if you insist on it, I'll repost v2 with the change (I'm
going to *test* it of course, just not bench. :>)
-- 
Mateusz Guzik <mjguzik gmail.com>
Re: [PATCH] x86: handle the tail in rep_movs_alternative() with an overlapping store
Posted by Linus Torvalds 9 months ago
On Thu, 20 Mar 2025 at 12:33, Mateusz Guzik <mjguzik@gmail.com> wrote:
>
> I have a recollection that handling the tail after rep movsq with an
> overlapping store was suffering a penalty big enough to warrant a
> "normal" copy instead, avoiding the just written to area.

Ahh. Good point. The rep movsq might indeed end up having odd effects
with subsequent aliasing memory operations.

Consider myself convinced.

           Linus
Re: [PATCH] x86: handle the tail in rep_movs_alternative() with an overlapping store
Posted by Linus Torvalds 9 months ago
On Thu, 20 Mar 2025 at 14:17, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Thu, 20 Mar 2025 at 12:33, Mateusz Guzik <mjguzik@gmail.com> wrote:
> >
> > I have a recollection that handling the tail after rep movsq with an
> > overlapping store was suffering a penalty big enough to warrant a
> > "normal" copy instead, avoiding the just written to area.
>
> Ahh. Good point. The rep movsq might indeed end up having odd effects
> with subsequent aliasing memory operations.
>
> Consider myself convinced.

Actually, I think there's a solution for this.

Do not do the last 0-7 bytes as a word that overlaps with the tail of
the 'rep movs'

Do the last 8-15 bytes *non-overlapping* (well, they overlap each
other, but not the 'rep movs')

Something UNTESTED like the appended, in other words. The large case
then ends up without any conditionals, looking something like this:

        mov    %rcx,%rax
        shr    $0x3,%rcx
        dec    %rcx
        and    $0x7,%eax
        rep movsq %ds:(%rsi),%es:(%rdi)
        mov    (%rsi),%rcx
        mov    %rcx,(%rdi)
        mov    (%rsi,%rax,1),%rcx
        mov    %rcx,(%rdi,%rax,1)
        xor    %ecx,%ecx
        ret

with some added complexity - but not a lot - in the exception fixup cases.

This is once again intentionally whitespace-damaged, because I don't
want people applying this mindlessly. Somebody needs to double-check
my logic, and verify that this also avoids the cost from the aliasing
with the rep movs.
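
In C terms the intended coverage is roughly (model only, len >= 64 so
len/8 >= 8, fault handling omitted):

static void large_tail_model(char *dst, const char *src, unsigned long len)
{
	unsigned long rem = len & 7;
	unsigned long words = (len >> 3) - 1;

	__builtin_memcpy(dst, src, words * 8);	/* rep movsq */
	dst += words * 8;
	src += words * 8;

	/*
	 * 8 + rem bytes left: two words that overlap each other
	 * (by 8 - rem bytes) but not the rep movsq area.
	 */
	__builtin_memcpy(dst, src, 8);
	__builtin_memcpy(dst + rem, src + rem, 8);
}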

                   Linus

---
  --- a/arch/x86/lib/copy_user_64.S
  +++ b/arch/x86/lib/copy_user_64.S
  @@ -76,16 +76,38 @@ SYM_FUNC_START(rep_movs_alternative)
   .Llarge_movsq:
        movq %rcx,%rax
        shrq $3,%rcx
  +     decq %rcx
        andl $7,%eax
  +
  +     /* 8*%rcx + 8 + %rax bytes: do the 8*%rcx */
   0:   rep movsq
  -     movl %eax,%ecx
  -     testl %ecx,%ecx
  -     jne .Lcopy_user_tail
  +
  +     /* We now have 8 + %rax bytes left */
  +1:   movq (%rsi),%rcx
  +2:   movq %rcx,(%rdi)
  +
  +     /* %rax bytes left - do it as one overlapping word */
  +3:   movq (%rsi,%rax),%rcx
  +4:   movq %rcx,(%rdi,%rax)
  +
  +     xorl %ecx,%ecx
        RET

  -1:   leaq (%rax,%rcx,8),%rcx
  +10:  leaq 8(%rax,%rcx,8),%rcx
        jmp .Lcopy_user_tail

  -     _ASM_EXTABLE_UA( 0b, 1b)
  +11:  leaq 8(%rax),%rcx
  +     jmp .Lcopy_user_tail
  +
  +12:  addq $8,%rsi
  +     addq $8,%rdi
  +     movl %eax,%ecx
  +     jmp .Lcopy_user_tail
  +
  +     _ASM_EXTABLE_UA( 0b, 10b)
  +     _ASM_EXTABLE_UA( 1b, 11b)
  +     _ASM_EXTABLE_UA( 2b, 11b)
  +     _ASM_EXTABLE_UA( 3b, 12b)
  +     _ASM_EXTABLE_UA( 4b, 12b)
   SYM_FUNC_END(rep_movs_alternative)
   EXPORT_SYMBOL(rep_movs_alternative)
Re: [PATCH] x86: handle the tail in rep_movs_alternative() with an overlapping store
Posted by David Laight 9 months ago
On Thu, 20 Mar 2025 16:53:32 -0700
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Thu, 20 Mar 2025 at 14:17, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > On Thu, 20 Mar 2025 at 12:33, Mateusz Guzik <mjguzik@gmail.com> wrote:  
> > >
> > > I have a recollection that handling the tail after rep movsq with an
> > > overlapping store was suffering a penalty big enough to warrant a
> > > "normal" copy instead, avoiding the just written to area.  
> >
> > Ahh. Good point. The rep movsq might indeed end up having odd effects
> > with subsequent aliasing memory operations.
> >
> > Consider myself convinced.  
> 
> Actually, I think there's a solution for this.
> 
> Do not do the last 0-7 bytes as a word that overlaps with the tail of
> the 'rep movs'
> 
> Do the last 8-15 bytes *non-overlapping* (well, they overlap each
> other, but not the 'rep movs')
> 
> Something UNTESTED like the appended, in other words. The large case
> then ends up without any conditionals, looking something like this:
> 
>         mov    %rcx,%rax
>         shr    $0x3,%rcx
>         dec    %rcx
>         and    $0x7,%eax
>         rep movsq %ds:(%rsi),%es:(%rdi)
>         mov    (%rsi),%rcx
>         mov    %rcx,(%rdi)
>         mov    (%rsi,%rax,1),%rcx
>         mov    %rcx,(%rdi,%rax,1)
>         xor    %ecx,%ecx
>         ret

I think you can save the 'tail end' copying the same 8 bytes twice by doing:
	sub	$9,%rcx
	mov	%rcx,%rax
	shr	$3,%rcx
	and	$7,%rax
	inc	%rax
before the 'rep movsq'.

	David
	
> 
> with some added complexity - but not a lot - in the exception fixup cases.
> 
> This is once again intentionally whitespace-damaged, because I don't
> want people applying this mindlessly. Somebody needs to double-check
> my logic, and verify that this also avoids the cost from the aliasing
> with the rep movs.
> 
>                    Linus
...
Re: [PATCH] x86: handle the tail in rep_movs_alternative() with an overlapping store
Posted by Herton Krzesinski 8 months, 3 weeks ago
On Fri, Mar 21, 2025 at 5:47 PM David Laight
<david.laight.linux@gmail.com> wrote:
>
> On Thu, 20 Mar 2025 16:53:32 -0700
> Linus Torvalds <torvalds@linux-foundation.org> wrote:
>
> > On Thu, 20 Mar 2025 at 14:17, Linus Torvalds
> > <torvalds@linux-foundation.org> wrote:
> > >
> > > On Thu, 20 Mar 2025 at 12:33, Mateusz Guzik <mjguzik@gmail.com> wrote:
> > > >
> > > > I have a recollection that handling the tail after rep movsq with an
> > > > overlapping store was suffering a penalty big enough to warrant a
> > > > "normal" copy instead, avoiding the just written to area.
> > >
> > > Ahh. Good point. The rep movsq might indeed end up having odd effects
> > > with subsequent aliasing memory operations.
> > >
> > > Consider myself convinced.
> >
> > Actually, I think there's a solution for this.
> >
> > Do not do the last 0-7 bytes as a word that overlaps with the tail of
> > the 'rep movs'
> >
> > Do the last 8-15 bytes *non-overlapping* (well, they overlap each
> > other, but not the 'rep movs')
> >
> > Something UNTESTED like the appended, in other words. The large case
> > then ends up without any conditionals, looking something like this:
> >
> >         mov    %rcx,%rax
> >         shr    $0x3,%rcx
> >         dec    %rcx
> >         and    $0x7,%eax
> >         rep movsq %ds:(%rsi),%es:(%rdi)
> >         mov    (%rsi),%rcx
> >         mov    %rcx,(%rdi)
> >         mov    (%rsi,%rax,1),%rcx
> >         mov    %rcx,(%rdi,%rax,1)
> >         xor    %ecx,%ecx
> >         ret
>
> I think you can save the 'tail end' copying the same 8 bytes twice by doing:
>         sub     $9,%rcx
>         mov     %rcx,%rax
>         shr     $3,%rcx
>         and     $7,%rax
>         inc     %rax
> before the 'rep movsq'.

Not sure how the above will work out for handling the remainder in %rax?

Anyway, another version may be like this, to avoid the rep movs
penalty? Not sure if doing the tail copy before the rep movsq would be OK?

index fc9fb5d06174..a0f9655e364c 100644
--- a/arch/x86/lib/copy_user_64.S
+++ b/arch/x86/lib/copy_user_64.S
@@ -62,10 +62,15 @@ SYM_FUNC_START(rep_movs_alternative)
        je .Lexit
        cmp $8,%ecx
        jae .Lword
-       jmp .Lcopy_user_tail
+4:     movq -8(%rsi,%rcx),%rax
+5:     movq %rax,-8(%rdi,%rcx)
+       xorl %ecx,%ecx
+       RET

        _ASM_EXTABLE_UA( 2b, .Lcopy_user_tail)
        _ASM_EXTABLE_UA( 3b, .Lcopy_user_tail)
+       _ASM_EXTABLE_UA( 4b, .Lcopy_user_tail)
+       _ASM_EXTABLE_UA( 5b, .Lcopy_user_tail)

 .Llarge:
 0:     ALTERNATIVE "jmp .Llarge_movsq", "rep movsb", X86_FEATURE_ERMS
@@ -74,18 +79,20 @@ SYM_FUNC_START(rep_movs_alternative)
        _ASM_EXTABLE_UA( 0b, 1b)

 .Llarge_movsq:
+       /* copy tail byte first, to avoid overlapping
+          penalty with rep movsq */
+0:     movq -8(%rsi,%rcx),%rax
+1:     movq %rax,-8(%rdi,%rcx)
        movq %rcx,%rax
        shrq $3,%rcx
-       andl $7,%eax
-0:     rep movsq
-       movl %eax,%ecx
-       testl %ecx,%ecx
-       jne .Lcopy_user_tail
+2:     rep movsq
+       xorl %ecx,%ecx
        RET
-
-1:     leaq (%rax,%rcx,8),%rcx
+3:     movq %rax,%rcx
        jmp .Lcopy_user_tail

-       _ASM_EXTABLE_UA( 0b, 1b)
+       _ASM_EXTABLE_UA( 0b, .Lcopy_user_tail)
+       _ASM_EXTABLE_UA( 1b, .Lcopy_user_tail)
+       _ASM_EXTABLE_UA( 2b, 3b)
 SYM_FUNC_END(rep_movs_alternative)
 EXPORT_SYMBOL(rep_movs_alternative)



I have also been trying to measure the impact of changes like the
above; however, it seems I don't get an improvement, or it's limited
due to the impact of profiling. I tried to uninline/move
copy_user_generic() like this:

diff --git a/arch/x86/include/asm/uaccess_64.h
b/arch/x86/include/asm/uaccess_64.h
index c52f0133425b..2ae442c8a4b5 100644
--- a/arch/x86/include/asm/uaccess_64.h
+++ b/arch/x86/include/asm/uaccess_64.h
@@ -115,25 +115,8 @@ static inline bool __access_ok(const void __user
*ptr, unsigned long size)
 __must_check unsigned long
 rep_movs_alternative(void *to, const void *from, unsigned len);

-static __always_inline __must_check unsigned long
-copy_user_generic(void *to, const void *from, unsigned long len)
-{
-       stac();
-       /*
-        * If CPU has FSRM feature, use 'rep movs'.
-        * Otherwise, use rep_movs_alternative.
-        */
-       asm volatile(
-               "1:\n\t"
-               ALTERNATIVE("rep movsb",
-                           "call rep_movs_alternative",
ALT_NOT(X86_FEATURE_FSRM))
-               "2:\n"
-               _ASM_EXTABLE_UA(1b, 2b)
-               :"+c" (len), "+D" (to), "+S" (from), ASM_CALL_CONSTRAINT
-               : : "memory", "rax");
-       clac();
-       return len;
-}
+__must_check unsigned long
+copy_user_generic(void *to, const void *from, unsigned long len);

 static __always_inline __must_check unsigned long
 raw_copy_from_user(void *dst, const void __user *src, unsigned long size)
diff --git a/arch/x86/lib/usercopy_64.c b/arch/x86/lib/usercopy_64.c
index e9251b89a9e9..4585349f8f33 100644
--- a/arch/x86/lib/usercopy_64.c
+++ b/arch/x86/lib/usercopy_64.c
@@ -142,3 +142,24 @@ void __memcpy_flushcache(void *_dst, const void
*_src, size_t size)
 }
 EXPORT_SYMBOL_GPL(__memcpy_flushcache);
 #endif
+
+__must_check unsigned long
+copy_user_generic(void *to, const void *from, unsigned long len)
+{
+       stac();
+       /*
+        * If CPU has FSRM feature, use 'rep movs'.
+        * Otherwise, use rep_movs_alternative.
+        */
+       asm volatile(
+               "1:\n\t"
+               ALTERNATIVE("rep movsb",
+                           "call rep_movs_alternative",
ALT_NOT(X86_FEATURE_FSRM))
+               "2:\n"
+               _ASM_EXTABLE_UA(1b, 2b)
+               :"+c" (len), "+D" (to), "+S" (from), ASM_CALL_CONSTRAINT
+               : : "memory", "rax");
+       clac();
+       return len;
+}
+EXPORT_SYMBOL_GPL(copy_user_generic);


And then, using bpftrace with this script to try to measure execution times:

#############################
fentry:copy_user_generic
/strcontains(comm,"iperf3")/
{
        /*printf("start %ul %p\n", args.len, kptr(args.to));*/
        @start[kptr(args.to),args.len] = nsecs;
}

fexit:copy_user_generic
/strcontains(comm,"iperf3") && @start[kptr(args.to)-args.len,args.len]/
{
        /*printf("end %ul %p\n", args.len, kptr(args.to)-args.len);*/

        $len = args.len;
        $len >>= 1;
        $log_len = 0;
        while ($len) {
                $len >>= 1;
                $log_len++;
        }
        $log1 = 1;
        $log1 <<= $log_len;
        $log2 = $log1;
        $log2 <<= 1;
        $dalign = (uint64)(kptr(args.to) - args.len);
        $dalign &= 0x7;

        @us[$dalign,$log1,$log2] = hist((nsecs -
@start[kptr(args.to)-args.len,args.len]));
        delete(@start, (kptr(args.to)-args.len,args.len))
}

END
{
        clear(@start);
}
#############################

But the results are mixed, at least in the case of this change; I can't
prove an improvement with it.

>
>         David
>
> >
> > with some added complexity - but not a lot - in the exception fixup cases.
> >
> > This is once again intentionally whitespace-damaged, because I don't
> > want people applying this mindlessly. Somebody needs to double-check
> > my logic, and verify that this also avoids the cost from the aliasing
> > with the rep movs.
> >
> >                    Linus
> ...
>
Re: [PATCH] x86: handle the tail in rep_movs_alternative() with an overlapping store
Posted by Mateusz Guzik 8 months, 3 weeks ago
On Tue, Mar 25, 2025 at 11:42 PM Herton Krzesinski <hkrzesin@redhat.com> wrote:
>
> On Fri, Mar 21, 2025 at 5:47 PM David Laight
> <david.laight.linux@gmail.com> wrote:
> >
> > On Thu, 20 Mar 2025 16:53:32 -0700
> > Linus Torvalds <torvalds@linux-foundation.org> wrote:
> >
> > > On Thu, 20 Mar 2025 at 14:17, Linus Torvalds
> > > <torvalds@linux-foundation.org> wrote:
> > > >
> > > > On Thu, 20 Mar 2025 at 12:33, Mateusz Guzik <mjguzik@gmail.com> wrote:
> > > > >
> > > > > I have a recollection that handling the tail after rep movsq with an
> > > > > overlapping store was suffering a penalty big enough to warrant a
> > > > > "normal" copy instead, avoiding the just written to area.
> > > >
> > > > Ahh. Good point. The rep movsq might indeed end up having odd effects
> > > > with subsequent aliasing memory operations.
> > > >
> > > > Consider myself convinced.
> > >
> > > Actually, I think there's a solution for this.
> > >
> > > Do not do the last 0-7 bytes as a word that overlaps with the tail of
> > > the 'rep movs'
> > >
> > > Do the last 8-15 bytes *non-overlapping* (well, they overlap each
> > > other, but not the 'rep movs')
> > >
> > > Something UNTESTED like the appended, in other words. The large case
> > > then ends up without any conditionals, looking something like this:
> > >
> > >         mov    %rcx,%rax
> > >         shr    $0x3,%rcx
> > >         dec    %rcx
> > >         and    $0x7,%eax
> > >         rep movsq %ds:(%rsi),%es:(%rdi)
> > >         mov    (%rsi),%rcx
> > >         mov    %rcx,(%rdi)
> > >         mov    (%rsi,%rax,1),%rcx
> > >         mov    %rcx,(%rdi,%rax,1)
> > >         xor    %ecx,%ecx
> > >         ret
> >
> > I think you can save the 'tail end' copying the same 8 bytes twice by doing:
> >         sub     $9,%rcx
> >         mov     %rcx,%rax
> >         shr     $3,%rcx
> >         and     $7,%rax
> >         inc     %rax
> > before the 'rep movsq'.
>
> Not sure how above will work handling the remaining in %rax?
>
> Anyway, another version may be like this to avoid
> the rep movs penalty? Not sure if doing it before would be ok?
>
> index fc9fb5d06174..a0f9655e364c 100644
> --- a/arch/x86/lib/copy_user_64.S
> +++ b/arch/x86/lib/copy_user_64.S
> @@ -62,10 +62,15 @@ SYM_FUNC_START(rep_movs_alternative)
>         je .Lexit
>         cmp $8,%ecx
>         jae .Lword
> -       jmp .Lcopy_user_tail
> +4:     movq -8(%rsi,%rcx),%rax
> +5:     movq %rax,-8(%rdi,%rcx)
> +       xorl %ecx,%ecx
> +       RET
>
>         _ASM_EXTABLE_UA( 2b, .Lcopy_user_tail)
>         _ASM_EXTABLE_UA( 3b, .Lcopy_user_tail)
> +       _ASM_EXTABLE_UA( 4b, .Lcopy_user_tail)
> +       _ASM_EXTABLE_UA( 5b, .Lcopy_user_tail)
>
>  .Llarge:
>  0:     ALTERNATIVE "jmp .Llarge_movsq", "rep movsb", X86_FEATURE_ERMS
> @@ -74,18 +79,20 @@ SYM_FUNC_START(rep_movs_alternative)
>         _ASM_EXTABLE_UA( 0b, 1b)
>
>  .Llarge_movsq:
> +       /* copy tail byte first, to avoid overlapping
> +          penalty with rep movsq */
> +0:     movq -8(%rsi,%rcx),%rax
> +1:     movq %rax,-8(%rdi,%rcx)
>         movq %rcx,%rax
>         shrq $3,%rcx
> -       andl $7,%eax
> -0:     rep movsq
> -       movl %eax,%ecx
> -       testl %ecx,%ecx
> -       jne .Lcopy_user_tail
> +2:     rep movsq
> +       xorl %ecx,%ecx
>         RET
> -
> -1:     leaq (%rax,%rcx,8),%rcx
> +3:     movq %rax,%rcx
>         jmp .Lcopy_user_tail
>
> -       _ASM_EXTABLE_UA( 0b, 1b)
> +       _ASM_EXTABLE_UA( 0b, .Lcopy_user_tail)
> +       _ASM_EXTABLE_UA( 1b, .Lcopy_user_tail)
> +       _ASM_EXTABLE_UA( 2b, 3b)
>  SYM_FUNC_END(rep_movs_alternative)
>  EXPORT_SYMBOL(rep_movs_alternative)
>
>
>
> I have been trying to also measure the impact of changes like above, however,
> it seems I don't get improvement or it's limited due impact of
> profiling, I tried
> to uninline/move copy_user_generic() like this:
>
> diff --git a/arch/x86/include/asm/uaccess_64.h
> b/arch/x86/include/asm/uaccess_64.h
> index c52f0133425b..2ae442c8a4b5 100644
> --- a/arch/x86/include/asm/uaccess_64.h
> +++ b/arch/x86/include/asm/uaccess_64.h
> @@ -115,25 +115,8 @@ static inline bool __access_ok(const void __user
> *ptr, unsigned long size)
>  __must_check unsigned long
>  rep_movs_alternative(void *to, const void *from, unsigned len);
>
> -static __always_inline __must_check unsigned long
> -copy_user_generic(void *to, const void *from, unsigned long len)
> -{
> -       stac();
> -       /*
> -        * If CPU has FSRM feature, use 'rep movs'.
> -        * Otherwise, use rep_movs_alternative.
> -        */
> -       asm volatile(
> -               "1:\n\t"
> -               ALTERNATIVE("rep movsb",
> -                           "call rep_movs_alternative",
> ALT_NOT(X86_FEATURE_FSRM))
> -               "2:\n"
> -               _ASM_EXTABLE_UA(1b, 2b)
> -               :"+c" (len), "+D" (to), "+S" (from), ASM_CALL_CONSTRAINT
> -               : : "memory", "rax");
> -       clac();
> -       return len;
> -}
> +__must_check unsigned long
> +copy_user_generic(void *to, const void *from, unsigned long len);
>
>  static __always_inline __must_check unsigned long
>  raw_copy_from_user(void *dst, const void __user *src, unsigned long size)
> diff --git a/arch/x86/lib/usercopy_64.c b/arch/x86/lib/usercopy_64.c
> index e9251b89a9e9..4585349f8f33 100644
> --- a/arch/x86/lib/usercopy_64.c
> +++ b/arch/x86/lib/usercopy_64.c
> @@ -142,3 +142,24 @@ void __memcpy_flushcache(void *_dst, const void
> *_src, size_t size)
>  }
>  EXPORT_SYMBOL_GPL(__memcpy_flushcache);
>  #endif
> +
> +__must_check unsigned long
> +copy_user_generic(void *to, const void *from, unsigned long len)
> +{
> +       stac();
> +       /*
> +        * If CPU has FSRM feature, use 'rep movs'.
> +        * Otherwise, use rep_movs_alternative.
> +        */
> +       asm volatile(
> +               "1:\n\t"
> +               ALTERNATIVE("rep movsb",
> +                           "call rep_movs_alternative",
> ALT_NOT(X86_FEATURE_FSRM))
> +               "2:\n"
> +               _ASM_EXTABLE_UA(1b, 2b)
> +               :"+c" (len), "+D" (to), "+S" (from), ASM_CALL_CONSTRAINT
> +               : : "memory", "rax");
> +       clac();
> +       return len;
> +}
> +EXPORT_SYMBOL_GPL(copy_user_generic);
>
>
> And then, using bpftrace with this script to try to measure execution times:
>
> #############################
> fentry:copy_user_generic
> /strcontains(comm,"iperf3")/
> {
>         /*printf("start %ul %p\n", args.len, kptr(args.to));*/
>         @start[kptr(args.to),args.len] = nsecs;
> }
>
> fexit:copy_user_generic
> /strcontains(comm,"iperf3") && @start[kptr(args.to)-args.len,args.len]/
> {
>         /*printf("end %ul %p\n", args.len, kptr(args.to)-args.len);*/
>
>         $len = args.len;
>         $len >>= 1;
>         $log_len = 0;
>         while ($len) {
>                 $len >>= 1;
>                 $log_len++;
>         }
>         $log1 = 1;
>         $log1 <<= $log_len;
>         $log2 = $log1;
>         $log2 <<= 1;
>         $dalign = (uint64)(kptr(args.to) - args.len);
>         $dalign &= 0x7;
>
>         @us[$dalign,$log1,$log2] = hist((nsecs -
> @start[kptr(args.to)-args.len,args.len]));
>         delete(@start, (kptr(args.to)-args.len,args.len))
> }
>
> END
> {
>         clear(@start);
> }
> #############################
>
> But the result is mixed at least in case of this change, I can't prove
> an improvement
> with it.
>

I suspect going to ebpf here has enough probe effect to overshadow any impact.

You may get a better shot issuing rdtscp before and after and then
calling a dummy probe with the difference -- that way all ebpf
overhead is incurred outside of the measured area. But this does not
account for migrating between cpus.
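
Something along these lines (a sketch; copy_time_probe() is a made-up
name for the dummy hook):

static inline unsigned long long rdtscp_cycles(void)
{
	unsigned int lo, hi, aux;

	asm volatile("rdtscp" : "=a" (lo), "=d" (hi), "=c" (aux));
	return ((unsigned long long)hi << 32) | lo;
}

noinline void copy_time_probe(unsigned long long cycles, unsigned long len)
{
	/* empty: attach bpftrace here and histogram arg0, keyed by arg1 */
}

/* around the copy being measured: */
t0 = rdtscp_cycles();
ret = copy_user_generic(to, from, len);
copy_time_probe(rdtscp_cycles() - t0, len);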

However, all of this convinced me to dig up an (unfinished, but close)
test jig I had for these routines.

While it neglects cache effects (as in, it lets things remain cache-hot),
it issues ops based on a random seed, making sure to screw with the
branch predictor. This is not perfect by any means, but should be good
enough to justify some of the changes (namely sorting out memset and
this guy not using overlapping stores). I can't promise any specific
timeline for the sucker though, apart from this being a matter of
weeks (best case this weekend).

-- 
Mateusz Guzik <mjguzik gmail.com>
Re: [PATCH] x86: handle the tail in rep_movs_alternative() with an overlapping store
Posted by David Laight 8 months, 3 weeks ago
On Tue, 25 Mar 2025 19:42:09 -0300
Herton Krzesinski <hkrzesin@redhat.com> wrote:

...
> I have been trying to also measure the impact of changes like above, however,
> it seems I don't get improvement or it's limited due impact of
> profiling, I tried
> to uninline/move copy_user_generic() like this:

If you use the PERF_COUNT_HW_CPU_CYCLES counter bracketed by 'mfence'
you can get reasonably consistent cycle counts for short sequences.

The problem here is that you need the specific cpu that is causing issues.
Probably zen2 or zen3.

Benchmarking 'rep movsb' on a zen5 can be summarised:
Test overhead: 195 clocks ('rep movsb' asm with a 'nop') subtracted from the
other values.
  length    clocks
       0       7
   1..3f       5
      40       4
  41..7f       5
  80..1ff     39 (except 16c which is 4 clocks faster!)
      200     38
 201..23f     40
      240     38
 241..27f     41
      280     39
The pattern then continues much the same, increasing by 1 clock every 64 bytes
with the multiple of 64 being a bit cheaper.

With a 'sailing wind' a copy loop should do 8 bytes/clock.
(Faster if the CPU supports more than one write/clock.)
So a copy loop might be faster for lengths between 128 and ~256.

Misaligning the addresses doesn't usually make any difference.
(There is a small penalty for destinations in the last cache line of a page.)

But there is a strange oddity.
If (dest - src) % 4096 is between 1 and 63 then short copies are 55 clocks,
jumping to 75 at 128 bytes and then increasing slowly.
(I think that matches what I've seen.)

	David
Re: [PATCH] x86: handle the tail in rep_movs_alternative() with an overlapping store
Posted by Linus Torvalds 8 months, 3 weeks ago
On Tue, 25 Mar 2025 at 15:42, Herton Krzesinski <hkrzesin@redhat.com> wrote:
>
> I have been trying to also measure the impact of changes like above, however,
> it seems I don't get improvement or it's limited due impact of
> profiling,

In my experience, profiling overhead - assuming you have half-way
modern hardware that has support for it - is simply not an issue.

So I think that if we don't have performance numbers on some hardware
where it can be shown to matter, let's just leave things be.

            Linus
Re: [PATCH] x86: handle the tail in rep_movs_alternative() with an overlapping store
Posted by Mateusz Guzik 9 months ago
On Fri, Mar 21, 2025 at 12:53 AM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Thu, 20 Mar 2025 at 14:17, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > On Thu, 20 Mar 2025 at 12:33, Mateusz Guzik <mjguzik@gmail.com> wrote:
> > >
> > > I have a recollection that handling the tail after rep movsq with an
> > > overlapping store was suffering a penalty big enough to warrant a
> > > "normal" copy instead, avoiding the just written to area.
> >
> > Ahh. Good point. The rep movsq might indeed end up having odd effects
> > with subsequent aliasing memory operations.
> >
> > Consider myself convinced.
>
> Actually, I think there's a solution for this.
>
> Do not do the last 0-7 bytes as a word that overlaps with the tail of
> the 'rep movs'
>
> Do the last 8-15 bytes *non-overlapping* (well, they overlap each
> other, but not the 'rep movs')
>
> Something UNTESTED like the appended, in other words. The large case
> then ends up without any conditionals, looking something like this:
>
>         mov    %rcx,%rax
>         shr    $0x3,%rcx
>         dec    %rcx
>         and    $0x7,%eax
>         rep movsq %ds:(%rsi),%es:(%rdi)
>         mov    (%rsi),%rcx
>         mov    %rcx,(%rdi)
>         mov    (%rsi,%rax,1),%rcx
>         mov    %rcx,(%rdi,%rax,1)
>         xor    %ecx,%ecx
>         ret
>
> with some added complexity - but not a lot - in the exception fixup cases.
>
> This is once again intentionally whitespace-damaged, because I don't
> want people applying this mindlessly. Somebody needs to double-check
> my logic, and verify that this also avoids the cost from the aliasing
> with the rep movs.
>

I think the code works, but I find the idea to be questionable.

I'm worried about what happens when the consumer has a nicely aligned
target buffer with a nice size (say, a multiple of a cache line) which
this is now unconditionally messing with -- are some uarchs going to
suffer?

That aside, this executes for any size >= 64 bytes. The startup latency
is pretty high and the 64 byte limit is probably too low even without
the patch. For size 64 in particular there will be only 48 bytes
handled by rep movsq, followed by 16 bytes of regular stores, largely
defeating the point of avoiding the regular mov loop. If going this
way, I would bump the threshold for punting to rep movsq by at least 16.

I would bench it myself, but don't have appropriate hw to do this sucker.

All in all, for the time being I stand by not messing with this bit
(i.e., sticking to just my patch). Serious benchmarking effort should
instead accompany rewriting the routine to begin with (along with
sorting out memset and memcpy to both use overlapping stores and rep
as needed).

However, if you are looking to commit this thing anyway, you can feel
free to roll up my change into it without credit -- it is kind of
stock standard for this kind of work.
-- 
Mateusz Guzik <mjguzik gmail.com>
Re: [PATCH] x86: handle the tail in rep_movs_alternative() with an overlapping store
Posted by Mateusz Guzik 9 months ago
On Thu, Mar 20, 2025 at 8:33 PM Mateusz Guzik <mjguzik@gmail.com> wrote:
>
> On Thu, Mar 20, 2025 at 8:23 PM Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > On Thu, 20 Mar 2025 at 12:06, Mateusz Guzik <mjguzik@gmail.com> wrote:
> > >
> > > Sizes ranged <8,64> are copied 8 bytes at a time with a jump out to a
> > > 1 byte at a time loop to handle the tail.
> >
> > I definitely do not mind this patch, but I think it doesn't go far enough.
> >
> > It gets rid of the byte-at-a-time loop at the end, but only for the
> > short-copy case of 8-63 bytes.
> >
>
> This bit I can vouch for.
>
> > The .Llarge_movsq ends up still doing
> >
> >         testl %ecx,%ecx
> >         jne .Lcopy_user_tail
> >         RET
> >
> > and while that is only triggered by the non-ERMS case, that's what
> > most older AMD CPU's will trigger, afaik.
> >
>
> This bit I can't.
>
> Per my other e-mail it has been several years since I was seriously
> digging in the area (around 7 by now I think) and details are rather
> fuzzy.
>
> I have a recollection that handling the tail after rep movsq with an
> overlapping store was suffering a penalty big enough to warrant a
> "normal" copy instead, avoiding the just written to area. I see my old
> routine $elsewhere makes sure to do it. I don't have sensible hw to
> bench this on either at the moment.
>

So I did some testing on Sapphire Rapids vs movsq (it is an
FSRM-enabled sucker so I had to force it to use it) and the penalty I
remembered is still there.

I patched read1 from will-it-scale to do 128 and 129 byte reads.

On the stock kernel I get about a 5% throughput drop when adding
just the one byte.

On a kernel patched to overlap these with prior movsq stores I get a 10% drop.

While a CPU without ERMS/FSRM could have a different penalty, this
very much lines up with my prior experiments back in the day, so I'm
gonna have to assume this is still a problem for others.

So I stand by only patching the short range for the time being.

Note that should someone(tm) rework this to use overlapping stores
with bigger ranges (like memcpy), this will become less of an issue as
there will be no per-byte copying.
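
With bigger ranges the same trick applies; e.g. a 33..64 byte copy
becomes just two 32-byte chunks which may overlap in the middle, with
no tail handling whatsoever (sketch):

static void copy_33_to_64(char *dst, const char *src, unsigned long len)
{
	__builtin_memcpy(dst, src, 32);
	__builtin_memcpy(dst + len - 32, src + len - 32, 32);
}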
-- 
Mateusz Guzik <mjguzik gmail.com>
Re: [PATCH] x86: handle the tail in rep_movs_alternative() with an overlapping store
Posted by David Laight 8 months, 4 weeks ago
On Thu, 20 Mar 2025 21:24:38 +0100
Mateusz Guzik <mjguzik@gmail.com> wrote:

> On Thu, Mar 20, 2025 at 8:33 PM Mateusz Guzik <mjguzik@gmail.com> wrote:
> >
> > On Thu, Mar 20, 2025 at 8:23 PM Linus Torvalds
> > <torvalds@linux-foundation.org> wrote:  
> > >
> > > On Thu, 20 Mar 2025 at 12:06, Mateusz Guzik <mjguzik@gmail.com> wrote:  
> > > >
> > > > Sizes ranged <8,64> are copied 8 bytes at a time with a jump out to a
> > > > 1 byte at a time loop to handle the tail.  
> > >
> > > I definitely do not mind this patch, but I think it doesn't go far enough.
> > >
> > > It gets rid of the byte-at-a-time loop at the end, but only for the
> > > short-copy case of 8-63 bytes.
> > >  
> >
> > This bit I can vouch for.
> >  
> > > The .Llarge_movsq ends up still doing
> > >
> > >         testl %ecx,%ecx
> > >         jne .Lcopy_user_tail
> > >         RET
> > >
> > > and while that is only triggered by the non-ERMS case, that's what
> > > most older AMD CPU's will trigger, afaik.
> > >  
> >
> > This bit I can't.
> >
> > Per my other e-mail it has been several years since I was seriously
> > digging in the area (around 7 by now I think) and details are rather
> > fuzzy.
> >
> > I have a recollection that handling the tail after rep movsq with an
> > overlapping store was suffering a penalty big enough to warrant a
> > "normal" copy instead, avoiding the just written to area. I see my old
> > routine $elsewhere makes sure to do it. I don't have sensible hw to
> > bench this on either at the moment.
> >  
> 
> So I did some testing on Sapphire Rapids vs movsq (it is an
> FSRM-enabled sucker so I had to force it to use it) and the penalty I
> remembered is still there.
> 
> I patched read1 from will-it-scale to do 128 and 129 byte reads.
> 
> On the stock kernel I get about 5% throughput drop rate when adding
> just the one byte.
> 
> On a kernel patched to overlap these with prior movsq stores I get a 10% drop.
> 
> While a CPU without ERMS/FSRM could have a different penalty, this
> very much lines up with my prior experiments back in the day, so I'm
> gonna have to assume this still is still a problem for others.

A different penalty for ERMS/FSRM wouldn't surprise me.
If you tested on an Intel cpu it probably supported ERMS/FSRM
(you have to go back to 'core-2' to not have the support).

So I suspect you'd need to have tested an AMD cpu for it to matter.
And the report of slow misaligned writes was for a fairly recent
AMD cpu - I think one that didn't report ERMS/FSRM support.
I've only got a zen-5, although I might 'borrow' a piledriver from work.
Definitely no access to anything amd in between.

I do wonder if the 'medium length' copy loop is needed at all.
The non-ERMS/FSRM case needs 'short copy' and aligned 'rep movsq'
copies, but is the setup cost for 'rep movsq' significant enough
for the 'copy loop' to be worthwhile?

I also suspect the copy loop is 'over unrolled'.
The max throughput is 8 bytes/clock and Intel CPUs will execute
2-clock loops, so if you can reduce the loop control (etc.) instructions
to ones that will run in two clocks you can do 16 bytes per iteration.
That should be achievable using negative offsets from the end of the
buffer - the loop control becomes 'add $16,%rcx; js 10b', as in the
sketch below.
(I've mentioned that before ...)
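
A sketch of that loop shape in C (assuming the length is a nonzero
multiple of 16; the index runs from -len up to 0, so the loop control
maps onto a single add + js):

static void copy_backwards_indexed(char *dst, const char *src, unsigned long len)
{
	long i = -(long)len;

	src += len;
	dst += len;
	do {
		__builtin_memcpy(dst + i, src + i, 8);
		__builtin_memcpy(dst + i + 8, src + i + 8, 8);
		i += 16;		/* add $16,... */
	} while (i < 0);		/* js ... */
}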

I've just done some more searching and found a comment that FSRM is
very slow on zen3 for misaligned destinations.
(Possibly to the level of dropping back to byte copies.)
That means that the 'rep movsq' also needs to be aligned.
Which is what started all this.

I'm going to have to see if my Sandy bridge system still works.

	David

> 
> So I stand by only patching the short range for the time being.
> 
> Note that should someone(tm) rework this to use overlapping stores
> with bigger ranges (like memcpy), this will become less of an issue as
> there will be no per-byte copying.