[v2] x86: bring back rep movsq for user access on CPUs without ERMS

[PATCH v2] x86: bring back rep movsq for user access on CPUs without ERMS

Posted by Mateusz Guzik 2 years, 5 months ago

I hacked up this crapper:

#include <sys/mman.h>
#include <fcntl.h>
#include <unistd.h>
#include <stdio.h>

int main(void)
{
        char *buf;
        int fd, n;

        buf = mmap((void *)0xAA0000, 4096, PROT_READ | PROT_WRITE, MAP_ANON | MAP_PRIVATE | MAP_FIXED, -1, 0);
        if (buf == MAP_FAILED) {
                perror("mmap");
                return 1;
        }

        fd = open("/tmp/out", O_RDWR | O_CREAT, 0644);
        if (fd == -1) {
                perror("open");
                return 1;
        }
        n = write(fd, &buf[4096 - 66], 130);
        printf("%d\n", n);
}

Then I modified the unrolled loop to have the following fixup:
.Lfallback:
	ud2

	_ASM_EXTABLE_UA(10b, .Lfallback)
	_ASM_EXTABLE_UA(11b, .Lfallback)
[and so on]

Similarly, the movsq implementation:
1:     leaq (%rax,%rcx,8),%rcx
       ud2
       jmp .Lcopy_user_tail

Then I compared regdumps from both results and they match up:
stock:  RCX: 0000000000000042 RSI: 0000000000aa0ffe
movsq:  RCS: 0000000000000042 RSI: 0000000000aa0ffe

[note MAP_FIXED with 0xAA0000]

v1 gives a bogus result.

Finally write returns 66 on both stock and patched kernel.

So I think we are fine here.

================ cut here ================

Intel CPUs ship with ERMS for over a decade, but this is not true for
AMD.  In particular one reasonably recent uarch (EPYC 7R13) does not
have it (or at least the bit is inactive when running on the Amazon
EC2 cloud -- I found rather conflicting information about AMD CPUs vs the
extension).

Hand-rolled mov loops executing in this case are quite pessimal compared
to rep movsq for bigger sizes. While the upper limit depends on uarch,
everyone is well south of 1KB AFAICS and sizes bigger than that are
common.

While technically ancient CPUs may be suffering from rep usage, gcc has
been emitting it for years all over kernel code, so I don't think this
is a legitimate concern.

Sample result from read1_processes from will-it-scale (4KB reads/s):
before:	1507021
after:	1721828 (+14%)

Note that the cutoff point for rep usage is set to 64 bytes, which is
way too conservative but I'm sticking to what was done in 47ee3f1dd93b
("x86: re-introduce support for ERMS copies for user space accesses").
That is to say *some* copies will now go slower, which is fixable but
beyond the scope of this patch.

v2:
- correct fixup handling
- use 0/1 labels, stop messing with ones already put there for erms
[the _ASM_EXTABLE_UA line is still modified because it was indented with
spaces]
- removu now unneded clobbers on r8-r11
- add a note about removal of the unrolled loop

Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
---
 arch/x86/include/asm/uaccess_64.h |  2 +-
 arch/x86/lib/copy_user_64.S       | 57 +++++++------------------------
 2 files changed, 14 insertions(+), 45 deletions(-)

diff --git a/arch/x86/include/asm/uaccess_64.h b/arch/x86/include/asm/uaccess_64.h
index 81b826d3b753..f2c02e4469cc 100644
--- a/arch/x86/include/asm/uaccess_64.h
+++ b/arch/x86/include/asm/uaccess_64.h
@@ -116,7 +116,7 @@ copy_user_generic(void *to, const void *from, unsigned long len)
 		"2:\n"
 		_ASM_EXTABLE_UA(1b, 2b)
 		:"+c" (len), "+D" (to), "+S" (from), ASM_CALL_CONSTRAINT
-		: : "memory", "rax", "r8", "r9", "r10", "r11");
+		: : "memory", "rax");
 	clac();
 	return len;
 }
diff --git a/arch/x86/lib/copy_user_64.S b/arch/x86/lib/copy_user_64.S
index 01c5de4c279b..0a81aafed7f8 100644
--- a/arch/x86/lib/copy_user_64.S
+++ b/arch/x86/lib/copy_user_64.S
@@ -27,7 +27,7 @@
  * NOTE! The calling convention is very intentionally the same as
  * for 'rep movs', so that we can rewrite the function call with
  * just a plain 'rep movs' on machines that have FSRM.  But to make
- * it simpler for us, we can clobber rsi/rdi and rax/r8-r11 freely.
+ * it simpler for us, we can clobber rsi/rdi and rax freely.
  */
 SYM_FUNC_START(rep_movs_alternative)
 	cmpq $64,%rcx
@@ -68,55 +68,24 @@ SYM_FUNC_START(rep_movs_alternative)
 	_ASM_EXTABLE_UA( 3b, .Lcopy_user_tail)
 
 .Llarge:
-0:	ALTERNATIVE "jmp .Lunrolled", "rep movsb", X86_FEATURE_ERMS
+0:	ALTERNATIVE "jmp .Llarge_movsq", "rep movsb", X86_FEATURE_ERMS
 1:	RET
 
-        _ASM_EXTABLE_UA( 0b, 1b)
+	_ASM_EXTABLE_UA( 0b, 1b)
 
-	.p2align 4
-.Lunrolled:
-10:	movq (%rsi),%r8
-11:	movq 8(%rsi),%r9
-12:	movq 16(%rsi),%r10
-13:	movq 24(%rsi),%r11
-14:	movq %r8,(%rdi)
-15:	movq %r9,8(%rdi)
-16:	movq %r10,16(%rdi)
-17:	movq %r11,24(%rdi)
-20:	movq 32(%rsi),%r8
-21:	movq 40(%rsi),%r9
-22:	movq 48(%rsi),%r10
-23:	movq 56(%rsi),%r11
-24:	movq %r8,32(%rdi)
-25:	movq %r9,40(%rdi)
-26:	movq %r10,48(%rdi)
-27:	movq %r11,56(%rdi)
-	addq $64,%rsi
-	addq $64,%rdi
-	subq $64,%rcx
-	cmpq $64,%rcx
-	jae .Lunrolled
-	cmpl $8,%ecx
-	jae .Lword
+.Llarge_movsq:
+	movq %rcx,%rax
+	shrq $3,%rcx
+	andl $7,%eax
+0:	rep movsq
+	movl %eax,%ecx
 	testl %ecx,%ecx
 	jne .Lcopy_user_tail
 	RET
 
-	_ASM_EXTABLE_UA(10b, .Lcopy_user_tail)
-	_ASM_EXTABLE_UA(11b, .Lcopy_user_tail)
-	_ASM_EXTABLE_UA(12b, .Lcopy_user_tail)
-	_ASM_EXTABLE_UA(13b, .Lcopy_user_tail)
-	_ASM_EXTABLE_UA(14b, .Lcopy_user_tail)
-	_ASM_EXTABLE_UA(15b, .Lcopy_user_tail)
-	_ASM_EXTABLE_UA(16b, .Lcopy_user_tail)
-	_ASM_EXTABLE_UA(17b, .Lcopy_user_tail)
-	_ASM_EXTABLE_UA(20b, .Lcopy_user_tail)
-	_ASM_EXTABLE_UA(21b, .Lcopy_user_tail)
-	_ASM_EXTABLE_UA(22b, .Lcopy_user_tail)
-	_ASM_EXTABLE_UA(23b, .Lcopy_user_tail)
-	_ASM_EXTABLE_UA(24b, .Lcopy_user_tail)
-	_ASM_EXTABLE_UA(25b, .Lcopy_user_tail)
-	_ASM_EXTABLE_UA(26b, .Lcopy_user_tail)
-	_ASM_EXTABLE_UA(27b, .Lcopy_user_tail)
+1:	leaq (%rax,%rcx,8),%rcx
+	jmp .Lcopy_user_tail
+
+	_ASM_EXTABLE_UA( 0b, 1b)
 SYM_FUNC_END(rep_movs_alternative)
 EXPORT_SYMBOL(rep_movs_alternative)
-- 
2.39.2

RE: [PATCH v2] x86: bring back rep movsq for user access on CPUs without ERMS

Posted by David Laight 2 years, 5 months ago

From: Mateusz Guzik
> Sent: 30 August 2023 15:03
...
> Hand-rolled mov loops executing in this case are quite pessimal compared
> to rep movsq for bigger sizes. While the upper limit depends on uarch,
> everyone is well south of 1KB AFAICS and sizes bigger than that are
> common.

That unrolled loop is pretty pessimal and very much 1980s.

It should be pretty easy to write a code loop that runs
at one copy (8 bytes) per clock on modern desktop x86.
I think that matches 'rep movsq'.
(It will run slower on Atom based cpu.)

A very simple copy loop needs (using negative offsets
from the end of the buffer):
	A memory read
	A memory write
	An increment
	A jnz
Doing all of those every clock is well with the cpu's capabilities.
However I've never managed a 1 clock loop.
So you need to unroll once (and only once) to copy 8 bytes/clock.

So for copies that are multiples of 16 bytes something like:
	# dst in %rdi, src in %rsi, len in %rdx
	add	%rdi, %rdx
	add	%rsi, %rdx
	neg	%rdx
1:
	mov	%rcx,0(%rsi, %rdx)
	mov	0(%rdi, %rdx), %rcx
	add	#16, %rdx
	mov	%rcx, -8(%rsi, %rdx)
	mov	-8(%rdi, %rdx), %rcx
	jnz	1b

Is likely to execute an iteration every two clocks.
The memory read/write all get queued up and will happen at
some point - so memory latency doesn't matter at all.

For copies (over 16 bytes) that aren't multiples of
16 it is probably just worth copying the first 16 bytes
and then doing 16 bytes copies that align with the end
of the buffer - copying some bytes twice.
(Or even copy the last 16 bytes first and copy aligned
with the start.)

I'm also not at all sure how much it is worth optimising
mis-aligned transfers.
An IP-Checksum function (which only does reads) is only
just measurable slower for mis-aligned buffers.
Less than 1 clock per cache line.

I think you can get an idea of what happens from looking
at the PCIe TLP generated for misaligned transfers and
assuming that memory requests get much the same treatment.

Last time I checked (on an i7-7700) misaligned transfers
were done in 8-byte chunks (SSE/AVX) and accesses that
crossed cache-line boundaries split into two.
Since the cpu can issue two reads/clock not all of the
split reads (to cache) will take an extra clock.
(which sort of matches what we see.)
OTOH misaligned writes that cross a cache-line boundary
probably always take a 1 clock penalty.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)