[PATCH v2] x86: bring back rep movsq for user access on CPUs without ERMS
Posted by Mateusz Guzik 2 years, 3 months ago
I hacked up this crapper:

#include <sys/mman.h>
#include <fcntl.h>
#include <unistd.h>
#include <stdio.h>

int main(void)
{
        char *buf;
        int fd, n;

        /* Map a single page at a fixed address so the end of the valid
         * range is known (0xAA1000). */
        buf = mmap((void *)0xAA0000, 4096, PROT_READ | PROT_WRITE, MAP_ANON | MAP_PRIVATE | MAP_FIXED, -1, 0);
        if (buf == MAP_FAILED) {
                perror("mmap");
                return 1;
        }

        fd = open("/tmp/out", O_RDWR | O_CREAT, 0644);
        if (fd == -1) {
                perror("open");
                return 1;
        }
        /* The write source starts 66 bytes before the end of the mapping,
         * so only 66 of the 130 requested bytes are readable and the copy
         * from user space faults partway through. */
        n = write(fd, &buf[4096 - 66], 130);
        printf("%d\n", n);
        return 0;
}

Then I modified the unrolled loop to have the following fixup:
.Lfallback:
	ud2

	_ASM_EXTABLE_UA(10b, .Lfallback)
	_ASM_EXTABLE_UA(11b, .Lfallback)
[and so on]

Similarly, the movsq implementation:
1:     leaq (%rax,%rcx,8),%rcx
       ud2
       jmp .Lcopy_user_tail
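
For clarity, the arithmetic that fixup performs, as a C sketch (register
roles as in the patch below: RAX holds the byte tail, RCX the qword count;
the helper name is made up):

/*
 * Sketch of the .Llarge_movsq fault fixup: before "rep movsq" the byte
 * tail (len % 8) is stashed in RAX and RCX holds the qword count.  On a
 * fault RCX still holds the qwords left to copy, so the byte count handed
 * to .Lcopy_user_tail is tail + 8 * qwords_left, i.e. leaq (%rax,%rcx,8),%rcx.
 */
static unsigned long movsq_fixup(unsigned long tail, unsigned long qwords_left)
{
	return tail + 8 * qwords_left;
}

/*
 * For the test program above: 130 bytes requested, 66 readable, so the
 * copy faults after 8 full qwords: tail = 130 % 8 = 2, qwords_left =
 * 16 - 8 = 8, giving 2 + 8 * 8 = 66 = 0x42 -- matching RCX in the dumps.
 */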

Then I compared regdumps from both cases and they match up:
stock:  RCX: 0000000000000042 RSI: 0000000000aa0ffe
movsq:  RCX: 0000000000000042 RSI: 0000000000aa0ffe

[note MAP_FIXED with 0xAA0000]

v1 gives a bogus result.

Finally, write() returns 66 on both the stock and the patched kernel.

So I think we are fine here.

================ cut here ================

Intel CPUs have shipped with ERMS for over a decade, but this is not true for
AMD.  In particular one reasonably recent uarch (EPYC 7R13) does not
have it (or at least the bit is inactive when running on the Amazon
EC2 cloud -- I found rather conflicting information about AMD CPUs vs the
extension).

Hand-rolled mov loops executing in this case are quite pessimal compared
to rep movsq for bigger sizes. While the upper limit depends on uarch,
everyone is well south of 1KB AFAICS and sizes bigger than that are
common.

While technically ancient CPUs may be suffering from rep usage, gcc has
been emitting it for years all over kernel code, so I don't think this
is a legitimate concern.

Sample result from read1_processes from will-it-scale (4KB reads/s):
before:	1507021
after:	1721828 (+14%)

Note that the cutoff point for rep usage is set to 64 bytes, which is
way too conservative but I'm sticking to what was done in 47ee3f1dd93b
("x86: re-introduce support for ERMS copies for user space accesses").
That is to say *some* copies will now go slower, which is fixable but
beyond the scope of this patch.

v2:
- correct fixup handling
- use 0/1 labels, stop messing with ones already put there for erms
[the _ASM_EXTABLE_UA line is still modified because it was indented with
spaces]
- remove now unneeded clobbers on r8-r11
- add a note about removal of the unrolled loop

Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
---
 arch/x86/include/asm/uaccess_64.h |  2 +-
 arch/x86/lib/copy_user_64.S       | 57 +++++++------------------------
 2 files changed, 14 insertions(+), 45 deletions(-)

diff --git a/arch/x86/include/asm/uaccess_64.h b/arch/x86/include/asm/uaccess_64.h
index 81b826d3b753..f2c02e4469cc 100644
--- a/arch/x86/include/asm/uaccess_64.h
+++ b/arch/x86/include/asm/uaccess_64.h
@@ -116,7 +116,7 @@ copy_user_generic(void *to, const void *from, unsigned long len)
 		"2:\n"
 		_ASM_EXTABLE_UA(1b, 2b)
 		:"+c" (len), "+D" (to), "+S" (from), ASM_CALL_CONSTRAINT
-		: : "memory", "rax", "r8", "r9", "r10", "r11");
+		: : "memory", "rax");
 	clac();
 	return len;
 }
diff --git a/arch/x86/lib/copy_user_64.S b/arch/x86/lib/copy_user_64.S
index 01c5de4c279b..0a81aafed7f8 100644
--- a/arch/x86/lib/copy_user_64.S
+++ b/arch/x86/lib/copy_user_64.S
@@ -27,7 +27,7 @@
  * NOTE! The calling convention is very intentionally the same as
  * for 'rep movs', so that we can rewrite the function call with
  * just a plain 'rep movs' on machines that have FSRM.  But to make
- * it simpler for us, we can clobber rsi/rdi and rax/r8-r11 freely.
+ * it simpler for us, we can clobber rsi/rdi and rax freely.
  */
 SYM_FUNC_START(rep_movs_alternative)
 	cmpq $64,%rcx
@@ -68,55 +68,24 @@ SYM_FUNC_START(rep_movs_alternative)
 	_ASM_EXTABLE_UA( 3b, .Lcopy_user_tail)
 
 .Llarge:
-0:	ALTERNATIVE "jmp .Lunrolled", "rep movsb", X86_FEATURE_ERMS
+0:	ALTERNATIVE "jmp .Llarge_movsq", "rep movsb", X86_FEATURE_ERMS
 1:	RET
 
-        _ASM_EXTABLE_UA( 0b, 1b)
+	_ASM_EXTABLE_UA( 0b, 1b)
 
-	.p2align 4
-.Lunrolled:
-10:	movq (%rsi),%r8
-11:	movq 8(%rsi),%r9
-12:	movq 16(%rsi),%r10
-13:	movq 24(%rsi),%r11
-14:	movq %r8,(%rdi)
-15:	movq %r9,8(%rdi)
-16:	movq %r10,16(%rdi)
-17:	movq %r11,24(%rdi)
-20:	movq 32(%rsi),%r8
-21:	movq 40(%rsi),%r9
-22:	movq 48(%rsi),%r10
-23:	movq 56(%rsi),%r11
-24:	movq %r8,32(%rdi)
-25:	movq %r9,40(%rdi)
-26:	movq %r10,48(%rdi)
-27:	movq %r11,56(%rdi)
-	addq $64,%rsi
-	addq $64,%rdi
-	subq $64,%rcx
-	cmpq $64,%rcx
-	jae .Lunrolled
-	cmpl $8,%ecx
-	jae .Lword
+.Llarge_movsq:
+	movq %rcx,%rax
+	shrq $3,%rcx
+	andl $7,%eax
+0:	rep movsq
+	movl %eax,%ecx
 	testl %ecx,%ecx
 	jne .Lcopy_user_tail
 	RET
 
-	_ASM_EXTABLE_UA(10b, .Lcopy_user_tail)
-	_ASM_EXTABLE_UA(11b, .Lcopy_user_tail)
-	_ASM_EXTABLE_UA(12b, .Lcopy_user_tail)
-	_ASM_EXTABLE_UA(13b, .Lcopy_user_tail)
-	_ASM_EXTABLE_UA(14b, .Lcopy_user_tail)
-	_ASM_EXTABLE_UA(15b, .Lcopy_user_tail)
-	_ASM_EXTABLE_UA(16b, .Lcopy_user_tail)
-	_ASM_EXTABLE_UA(17b, .Lcopy_user_tail)
-	_ASM_EXTABLE_UA(20b, .Lcopy_user_tail)
-	_ASM_EXTABLE_UA(21b, .Lcopy_user_tail)
-	_ASM_EXTABLE_UA(22b, .Lcopy_user_tail)
-	_ASM_EXTABLE_UA(23b, .Lcopy_user_tail)
-	_ASM_EXTABLE_UA(24b, .Lcopy_user_tail)
-	_ASM_EXTABLE_UA(25b, .Lcopy_user_tail)
-	_ASM_EXTABLE_UA(26b, .Lcopy_user_tail)
-	_ASM_EXTABLE_UA(27b, .Lcopy_user_tail)
+1:	leaq (%rax,%rcx,8),%rcx
+	jmp .Lcopy_user_tail
+
+	_ASM_EXTABLE_UA( 0b, 1b)
 SYM_FUNC_END(rep_movs_alternative)
 EXPORT_SYMBOL(rep_movs_alternative)
-- 
2.39.2
RE: [PATCH v2] x86: bring back rep movsq for user access on CPUs without ERMS
Posted by David Laight 2 years, 3 months ago
From: Mateusz Guzik
> Sent: 30 August 2023 15:03
...
> Hand-rolled mov loops executing in this case are quite pessimal compared
> to rep movsq for bigger sizes. While the upper limit depends on uarch,
> everyone is well south of 1KB AFAICS and sizes bigger than that are
> common.

That unrolled loop is pretty pessimal and very much 1980s.

It should be pretty easy to write a code loop that runs
at one copy (8 bytes) per clock on modern desktop x86.
I think that matches 'rep movsq'.
(It will run slower on Atom-based CPUs.)

A very simple copy loop needs (using negative offsets
from the end of the buffer):
	A memory read
	A memory write
	An increment
	A jnz
Doing all of those every clock is well within the cpu's capabilities.
However I've never managed a 1 clock loop.
So you need to unroll once (and only once) to copy 8 bytes/clock.

So for copies that are multiples of 16 bytes something like:
	# dst in %rdi, src in %rsi, len in %rdx (multiple of 16)
	add	%rdx, %rdi
	add	%rdx, %rsi
	neg	%rdx
1:
	mov	0(%rsi, %rdx), %rcx
	mov	%rcx, 0(%rdi, %rdx)
	add	$16, %rdx
	mov	-8(%rsi, %rdx), %rcx
	mov	%rcx, -8(%rdi, %rdx)
	jnz	1b

Is likely to execute an iteration every two clocks.
The memory read/write all get queued up and will happen at
some point - so memory latency doesn't matter at all.

For copies (over 16 bytes) that aren't multiples of
16 it is probably just worth copying the first 16 bytes
and then doing 16 bytes copies that align with the end
of the buffer - copying some bytes twice.
(Or even copy the last 16 bytes first and copy aligned
with the start.)

I'm also not at all sure how much it is worth optimising
mis-aligned transfers.
An IP-Checksum function (which only does reads) is only
just measurably slower for mis-aligned buffers.
Less than 1 clock per cache line.

I think you can get an idea of what happens from looking
at the PCIe TLP generated for misaligned transfers and
assuming that memory requests get much the same treatment.

Last time I checked (on an i7-7700) misaligned transfers
were done in 8-byte chunks (SSE/AVX) and accesses that
crossed cache-line boundaries split into two.
Since the cpu can issue two reads/clock not all of the
split reads (to cache) will take an extra clock.
(which sort of matches what we see.)
OTOH misaligned writes that cross a cache-line boundary
probably always take a 1 clock penalty.

	David

Re: [PATCH v2] x86: bring back rep movsq for user access on CPUs without ERMS
Posted by Mateusz Guzik 2 years, 3 months ago
On 9/1/23, David Laight <David.Laight@aculab.com> wrote:
> From: Mateusz Guzik
>> Sent: 30 August 2023 15:03
> ...
>> Hand-rolled mov loops executing in this case are quite pessimal compared
>> to rep movsq for bigger sizes. While the upper limit depends on uarch,
>> everyone is well south of 1KB AFAICS and sizes bigger than that are
>> common.
>
> That unrolled loop is pretty pessimal and very much 1980s.
>
> It should be pretty easy to write a code loop that runs
> at one copy (8 bytes) per clock on modern desktop x86.
> I think that matches 'rep movsq'.
> (It will run slower on Atom based cpu.)
>
> A very simple copy loop needs (using negative offsets
> from the end of the buffer):
> 	A memory read
> 	A memory write
> 	An increment
> 	A jnz
> Doing all of those every clock is well within the cpu's capabilities.
> However I've never managed a 1 clock loop.
> So you need to unroll once (and only once) to copy 8 bytes/clock.
>

When I was playing with this stuff about 5 years ago I found 32-byte
loops to be optimal for uarchs of the period (Skylake, Broadwell,
Haswell and so on), but only up to a point where rep wins.
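
A 32-byte loop here means something along these lines -- an illustrative C
sketch only, not the exact code benchmarked back then (the fixed-size
memcpy()s compile down to single 8-byte loads and stores):

#include <stdint.h>
#include <string.h>

/*
 * Copy 32 bytes per iteration with plain 8-byte loads/stores; the removed
 * kernel loop does the same thing at a 64-byte unroll.  Tail handled byte
 * by byte for simplicity.
 */
static void copy32(unsigned char *dst, const unsigned char *src, size_t len)
{
	while (len >= 32) {
		uint64_t a, b, c, d;

		memcpy(&a, src,      8);
		memcpy(&b, src +  8, 8);
		memcpy(&c, src + 16, 8);
		memcpy(&d, src + 24, 8);
		memcpy(dst,      &a, 8);
		memcpy(dst +  8, &b, 8);
		memcpy(dst + 16, &c, 8);
		memcpy(dst + 24, &d, 8);
		src += 32;
		dst += 32;
		len -= 32;
	}
	while (len--)
		*dst++ = *src++;
}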

> So for copies that are multiples of 16 bytes something like:
> 	# dst in %rdi, src in %rsi, len in %rdx (multiple of 16)
> 	add	%rdx, %rdi
> 	add	%rdx, %rsi
> 	neg	%rdx
> 1:
> 	mov	0(%rsi, %rdx), %rcx
> 	mov	%rcx, 0(%rdi, %rdx)
> 	add	$16, %rdx
> 	mov	-8(%rsi, %rdx), %rcx
> 	mov	%rcx, -8(%rdi, %rdx)
> 	jnz	1b
>
> Is likely to execute an iteration every two clocks.
> The memory read/write all get queued up and will happen at
> some point - so memory latency doesn't matter at all.
>
> For copies (over 16 bytes) that aren't multiples of
> 16 it is probably just worth copying the first 16 bytes
> and then doing 16 bytes copies that align with the end
> of the buffer - copying some bytes twice.
> (Or even copy the last 16 bytes first and copy aligned
> with the start.)
>

It would definitely be beneficial to align the target buffer in this
routine (as in, non-FSRM), but it is unclear to me if you are
suggesting that for mov loops or rep.

I never tested reps for really big sizes and misaligned targets; for
the sizes where hand-rolled movs used to win against rep, spending time
aligning was more expensive than suffering the misaligned (and
possibly overlapped) stores.

If anything I find it rather surprising how inconsistent the string
ops are -- why is memcpy using overlapping stores, while memset does
not? Someone(tm) should transplant it, along with slapping rep past a
certain size on both.
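
For readers unfamiliar with the trick: "overlapping stores" here means
covering an arbitrary length with two fixed-size accesses that may overlap,
instead of a byte loop.  A minimal C sketch for the 8..16 byte case (made-up
helper name, for illustration only):

#include <stdint.h>
#include <string.h>

/*
 * Copy 8 <= len <= 16 bytes with one load/store pair anchored at the start
 * and one anchored at the end; the two accesses may overlap, which is fine
 * as long as src and dst themselves do not.
 */
static void copy_8_to_16(void *dst, const void *src, size_t len)
{
	uint64_t head, tail;

	memcpy(&head, src, 8);
	memcpy(&tail, (const unsigned char *)src + len - 8, 8);
	memcpy(dst, &head, 8);
	memcpy((unsigned char *)dst + len - 8, &tail, 8);
}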

-- 
Mateusz Guzik <mjguzik gmail.com>
RE: [PATCH v2] x86: bring back rep movsq for user access on CPUs without ERMS
Posted by David Laight 2 years, 3 months ago
...
> When I was playing with this stuff about 5 years ago I found 32-byte
> loops to be optimal for uarchs of the period (Skylake, Broadwell,
> Haswell and so on), but only up to a point where rep wins.

Does the 'rep movsq' ever actually win?
(Unless you find one of the ERMS (or similar) versions.)
IIRC it only ever does one iteration per clock - and you
should be able to match that with a carefully constructed loop.

Many years ago I got my Athlon-700 to execute a copy loop
as fast as 'rep movs' - but the setup times were longer.

The killer for 'rep movs' setup was always P4-netburst - over 40 clocks.
But I think some of the more recent cpus are still in double figures
(apart from some optimised copies).
So I'm not actually sure you should ever need to switch
to a 'rep movsq' loop - but I've not tried to write it.

I did have to unroll the ip-cksum loop 4 times (as):
+       asm(    "       bt    $4, %[len]\n"
+               "       jnc   10f\n"
+               "       add   (%[buff], %[len]), %[sum_0]\n"
+               "       adc   8(%[buff], %[len]), %[sum_1]\n"
+               "       lea   16(%[len]), %[len]\n"
+               "10:    jecxz 20f\n"   // %[len] is %rcx
+               "       adc   (%[buff], %[len]), %[sum_0]\n"
+               "       adc   8(%[buff], %[len]), %[sum_1]\n"
+               "       lea   32(%[len]), %[len_tmp]\n"
+               "       adc   16(%[buff], %[len]), %[sum_0]\n"
+               "       adc   24(%[buff], %[len]), %[sum_1]\n"
+               "       mov   %[len_tmp], %[len]\n"
+               "       jmp   10b\n"
+               "20:    adc   %[sum_0], %[sum]\n"
+               "       adc   %[sum_1], %[sum]\n"
+               "       adc   $0, %[sum]\n"
In order to get one adc every clock.
But only because of the strange loop required to 'loop carry' the
carry flag (the 'loop' instruction is OK on AMD cpu, but not on Intel.)
A similar loop using adox and adcx will beat one read/clock
provided it is unrolled again.
(IIRC I got to about 12 bytes/clock.)

	David

Re: [PATCH v2] x86: bring back rep movsq for user access on CPUs without ERMS
Posted by Mateusz Guzik 2 years, 3 months ago
On 9/3/23, David Laight <David.Laight@aculab.com> wrote:
> ...
>> When I was playing with this stuff about 5 years ago I found 32-byte
>> loops to be optimal for uarchs of the period (Skylake, Broadwell,
>> Haswell and so on), but only up to a point where rep wins.
>
> Does the 'rep movsq' ever actually win?
> (Unless you find one of the ERMS (or similar) versions.)
> IIRC it only ever does one iteration per clock - and you
> should be able to match that with a carefully constructed loop.
>

Sorry for late reply, I missed your e-mail due to all the unrelated
traffic in the thread and using gmail client. ;)

I am somewhat confused by the question though. In this very patch I'm
showing numbers from an ERMS-less uarch getting a win from switching
from hand-rolled mov loop to rep movsq, while doing 4KB copies.

Now, one can definitely try to make a case the loop is implemented in
a suboptimal manner and a better one would outperform rep. I did note
myself that such loops *do* beat rep up to a point, last I played with
this it was south of 1KB. It may be higher than that today.

Normally I don't have access to this particular hw, but I can get it again.

I can't stress enough that one *can* beat rep movsq as plopped in right
here, but only up to a certain size.

Since you are questioning whether movsq wins at /any/ size vs an
optimal loop, I suggest you hack it up, show the win on 4KB on your
uarch of choice and then I'll be happy to get access to the hw I
tested my patch on to bench your variant.

That said, as CPUs get better they execute these loops faster;
interestingly, rep is not improving at an equivalent rate, which is
rather funny. Even then, there is a limit past which rep wins AFAICS.

-- 
Mateusz Guzik <mjguzik gmail.com>
RE: [PATCH v2] x86: bring back rep movsq for user access on CPUs without ERMS
Posted by David Laight 2 years, 3 months ago
From: Mateusz Guzik
> Sent: 10 September 2023 11:54
> 
> On 9/3/23, David Laight <David.Laight@aculab.com> wrote:
> > ...
> >> When I was playing with this stuff about 5 years ago I found 32-byte
> >> loops to be optimal for uarchs of the period (Skylake, Broadwell,
> >> Haswell and so on), but only up to a point where rep wins.
> >
> > Does the 'rep movsq' ever actually win?
> > (Unless you find one of the ERMS (or similar) versions.)
> > IIRC it only ever does one iteration per clock - and you
> > should be able to match that with a carefully constructed loop.
> >
> 
> Sorry for late reply, I missed your e-mail due to all the unrelated
> traffic in the thread and using gmail client. ;)
> 
> I am somewhat confused by the question though. In this very patch I'm
> showing numbers from an ERMS-less uarch getting a win from switching
> from hand-rolled mov loop to rep movsq, while doing 4KB copies.

I've just done some measurements on an i7-7700.
That does have ERMS (fast 'rep movsb') but shows some interesting info.

The overhead of 'rep movsb' is about 36 clocks, 'rep movsq' only 16.
(except it has just changed its mind...)
'rep movsb' will copy (about) 32 bytes/clock provided the
destination buffer is 32-byte aligned, but only 16 bytes/clock
otherwise. The source buffer alignment doesn't seem to matter.

On this system 'rep movsq' seems to behave the same way.

So that is faster than a copy loop - limited to one register
write per clock.

Test program attached.

	David

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <string.h>
#include <errno.h>

#include <linux/perf_event.h>
#include <sys/mman.h>
#include <sys/syscall.h>

static int init_pmc(void)
{
	static struct perf_event_attr perf_attr = {
		.type = PERF_TYPE_HARDWARE,
		.config = PERF_COUNT_HW_CPU_CYCLES,
		.pinned = 1,
	};
	struct perf_event_mmap_page *pc;

	int perf_fd;
	perf_fd = syscall(__NR_perf_event_open, &perf_attr, 0, -1, -1, 0);
	if (perf_fd < 0) {
		fprintf(stderr, "perf_event_open failed: errno %d\n", errno);
		exit(1);
	}
	pc = mmap(NULL, 4096, PROT_READ, MAP_SHARED, perf_fd, 0);
	if (pc == MAP_FAILED) {
		fprintf(stderr, "perf_event mmap() failed: errno %d\n", errno);
		exit(1);
	}
	return pc->index - 1;
}

static inline unsigned int rdpmc(unsigned int counter)
{
        unsigned int low, high;

	// asm volatile("rdtsc" : "=a" (low), "=d" (high));
	asm volatile("lfence");
	asm volatile("rdpmc" : "=a" (low), "=d" (high) : "c" (counter));

	// return low bits, counter might be 32 or 40 bits wide.
	return low;
}

#ifndef MODE
#define MODE 0
#endif

__attribute__((noinline))
void memcpy_perf(unsigned char *d_buff, const unsigned char *s_buff, unsigned long len)
{

#if MODE == -1
// 'No copy' loop for baseline overhead
	asm volatile("	nop\n"
		: "+&D" (d_buff),  "+&S" (s_buff),  "+&c" (len)
		: : "memory");
#endif

#if MODE == 0
// Simple 'rep movs' loop
	asm volatile("	rep movsb\n"
		: "+&D" (d_buff),  "+&S" (s_buff),  "+&c" (len)
		: : "memory");
#endif

#if MODE == 1
// Simple 'rep movsq' loop; assumes len is a multiple of 8
// (any tail bytes would be dropped by the division below)
	len /= 8;
	asm volatile("	rep movsq\n"
		: "+&D" (d_buff),  "+&S" (s_buff),  "+&c" (len)
		: : "memory");
#endif

}

unsigned char s_buff[8192] __attribute__((aligned(4096)));
unsigned char d_buff[8192 + 1] __attribute__((aligned(4096)));

#ifndef PASSES
#define PASSES 5
#endif

#ifndef OFFSET
#define OFFSET 0
#endif

int main(int argc, char **argv)
{
	unsigned int tick;
	unsigned int ticks[PASSES];
	unsigned int len, s_off = 0, d_off = 0;
	unsigned int i;
	unsigned int id = init_pmc();
	unsigned int offset = OFFSET;

	len = sizeof s_buff;
	for (;;) {
		switch (getopt(argc, argv, "l:s:d:o:")) {
		case -1:
			break;
		default:
			exit(1);
		case 'l': len = atoi(optarg); continue;
		case 's': s_off = atoi(optarg); continue;
		case 'd': d_off = atoi(optarg); continue;
		case 'o': offset = atoi(optarg); continue;
		}
		break;
	}

	if (s_off + len > sizeof s_buff || d_off + len > sizeof d_buff - 1) {
		fprintf(stderr, "too long\n");
		exit(1);
	}

	for (i = 0; i < len; i++)
		s_buff[i] = rand();

	for (i = 0; i < PASSES; i++) {
		tick = rdpmc(id);
		memcpy_perf(d_buff + d_off, s_buff + s_off, len);
		ticks[i] = rdpmc(id) - tick;
	}

	printf("   ticks    rate mode %d\n", MODE);
	for (i = 0; i < PASSES; i++)
		printf(" %7u %7u\n", ticks[i], 100 * len / (ticks[i] - offset));

	if (memcmp(d_buff + d_off, s_buff + s_off, len) || d_buff[d_off + len]) {
		fprintf(stderr, "copy mismatch\n");
		exit(1);
	}
	return 0;
}

Re: [PATCH v2] x86: bring back rep movsq for user access on CPUs without ERMS
Posted by Linus Torvalds 2 years, 3 months ago
On Mon, 11 Sept 2023 at 03:38, David Laight <David.Laight@aculab.com> wrote:
>
> The overhead of 'rep movsb' is about 36 clocks, 'rep movsq' only 16.

Note that the hard case for 'rep movsq' is when the stores cross a
cacheline (or worse yet, a page) boundary.

That is what makes 'rep movsb' fundamentally simpler in theory. The
natural reaction is "but movsq does things 8 bytes at a time", but
once you start doing any kind of optimizations that are actually based
on bigger areas, the byte counts are actually simpler. You can always
do them as masked writes up to whatever boundary you like, and just
restart. There are never any "what about the straddling bytes" issues.

That's one of the dangers with benchmarking. Do you benchmark the
unaligned cases? How much do they matter in real life? Do they even
happen?

And that's entirely ignoring any "cold vs hot caches" etc issues, or
the "what is the cost of access _after_ the memcpy/memsert".

Or, in the case of the kernel, our issues with "function calls can now
be surprisingly expensive, and if we can inline things it can win back
20 cycles from a forced mispredict".

(And yes, I mean _any_ function calls. The indirect function calls are
even worse and more widely horrific, but sadly, with the return
prediction issues, even a perfectly regular function call is no longer
"a cycle or two")

So beware microbenchmarks. That's true in general, but it's
_particularly_ true of memset/memcpy.

                  Linus
RE: [PATCH v2] x86: bring back rep movsq for user access on CPUs without ERMS
Posted by David Laight 2 years, 3 months ago
From: Linus Torvalds
> Sent: 12 September 2023 19:49
> 
> On Mon, 11 Sept 2023 at 03:38, David Laight <David.Laight@aculab.com> wrote:
> >
> > The overhead of 'rep movsb' is about 36 clocks, 'rep movsq' only 16.

Interestingly exactly the same test changed its mind for no reason!
While I got repeatable clock counts (+/-1 in the low hundreds) when
looping the test (ie the 'hot cache' cases), I couldn't decide on
the exact overhead to get an accurate bytes/clock.
OTOH it was between 30 and 35 - so pretty much likely to be 32.

> Note that the hard case for 'rep movsq' is when the stores cross a
> cacheline (or worse yet, a page) boundary.

Page faults on misaligned transfers are what makes them hard to implement.
OTOH that is a 'hardware problem' not specifically worth optimising
for - since unlikely.

> That is what makes 'rep movsb' fundamentally simpler in theory. The
> natural reaction is "but movsq does things 8 bytes at a time", but
> once you start doing any kind of optimizations that are actually based
> on bigger areas, the byte counts are actually simpler. You can always
> do them as masked writes up to whatever boundary you like, and just
> restart. There are never any "what about the straddling bytes" issues.

What I found seemed to imply that 'rep movsq' used the same internal
logic as 'rep movsb' (pretty easy to do in hardware) even though I
don't think the ERMS docs say anything about that.
It may well be cpu specific - but I'd expect later ones to be the same.
(AMD cpu will be different, and I don't have access to anything new.)

> That's one of the dangers with benchmarking. Do you benchmark the
> unaligned cases? How much do they matter in real life? Do they even
> happen?

A microbenchmark can tell you how bad they are.
Real life will almost certainly be different.

I did some buffer offset measurements (the buffers should have been
8k aligned).
For reasonable length copies (1024 bytes) the source alignment made
almost no difference.
What did matter was the destination alignment, anything other than
a multiple of 32 halved the transfer speed (regardless of the
source alignment).
So 32n+1, 32n+8 and 32n+16 were all equally bad.
Some values reduced it further, possibly writes affecting the
read prefetches - who knows.

Anyway it seemed like there is a pipelined barrel shifter on
the read side (so it could read 32 bytes/clock) but the write
side was having to do two writes of each misaligned block.

> And that's entirely ignoring any "cold vs hot caches" etc issues, or
> the "what is the cost of access _after_ the memcpy/memset".

I count the clocks for 5 iterations.
The first 'cold cache' is massively slower.
The rest are pretty much identical - so 5 is plenty.
For microbenchmarks you really want to assume 'hot cache'.
The cache loads/invalidates will (usually) be needed sometime
so no point destroying a benchmark by including them.
Especially when looking a short code loops.

I'm definitely against running 10000 iterations and measuring wall time.
It really doesn't tell you anything useful.
I can't even use rdtsc - I can't lock the cpu clock frequency.

> Or, in the case of the kernel, our issues with "function calls can now
> be surprisingly expensive, and if we can inline things it can win back
> 20 cycles from a forced mispredict".

And then glibc probably adds a pile of wrappers to convert open()
into openat(O_EMPTY_PATH,....) for no good reason.
Undoing all the good work.

And then there is the code I've written for an fpga soft-core
which has a limited number of clock to do its work.
Required careful analysis of static branch prediction (having
found the hidden menu to disable the dynamic one) to ensure
that the worst case paths were fast enough - rather than making
the common paths fastest. 

...
> So beware microbenchmarks. That's true in general, but it's
> _particularly_ true of memset/memcpy.

I see the problem benchmarks as being the ones that are massively unrolled
and run fast(ish) in a benchmark but just destroy the i-cache.

If I'm microbenchmarking I'm trying to see if I can get the
cpu to execute as many instructions in parallel as it should
so that the code loop is limited by (eg) the number of reads.
If you can get a short loop to run 'as fast as possible' there
is no point doing a 1980's unroll.

If anyone had done a microbenchmark of the 'adc' list in the ip
checksum code (eg on Intel core-2) they'd have panicked about
each one taking two clocks.
Even with more recent cpu you need to adc to alternate registers.

	David

Re: [PATCH v2] x86: bring back rep movsq for user access on CPUs without ERMS
Posted by Linus Torvalds 2 years, 3 months ago
On Tue, 12 Sept 2023 at 12:41, David Laight <David.Laight@aculab.com> wrote:
>
> What I found seemed to imply that 'rep movsq' used the same internal
> logic as 'rep movsb' (pretty easy to do in hardware)

Christ.

I told you. It's pretty easy in hardware  AS LONG AS IT'S ALIGNED.

And if it's unaligned, "rep movsq" is FUNDAMENTALLY HARDER.

Really.

                    Linus
RE: [PATCH v2] x86: bring back rep movsq for user access on CPUs without ERMS
Posted by David Laight 2 years, 3 months ago
From: Linus Torvalds
> Sent: 12 September 2023 21:48
> 
> On Tue, 12 Sept 2023 at 12:41, David Laight <David.Laight@aculab.com> wrote:
> >
> > What I found seemed to imply that 'rep movsq' used the same internal
> > logic as 'rep movsb' (pretty easy to do in hardware)
> 
> Christ.
> 
> I told you. It's pretty easy in hardware  AS LONG AS IT'S ALIGNED.
> 
> And if it's unaligned, "rep movsq" is FUNDAMENTALLY HARDER.

For cached memory it only has to appear to have used 8 byte
accesses.
So in the same way that 'rep movsb' could be optimised to do
cache line sized reads and writes even if the address are
completely misaligned 'rep movsq' could use exactly the same
hardware logic with a byte count that is 8 times larger.

The only subtlety is that the read length would need masking
to a multiple of 8 if there is a page fault on a misaligned
read side (so that a multiple of 8 bytes would be written).
That wouldn't really be hard.

I definitely saw exactly the same number of bytes/clock
for 'rep movsb' and 'rep movsq' when the destination was
misaligned.
The alignment made no difference except that a multiple
of 32 ran (about) twice as fast.
I even double-checked the disassembly to make sure I was
running the right code.

So it looks like the Intel hardware engineers have solved
the 'FUNDAMENTALLY HARDER' problem.

	David

Re: [PATCH v2] x86: bring back rep movsq for user access on CPUs without ERMS
Posted by Linus Torvalds 2 years, 3 months ago
Side note, if you want to play around with the user copy routines (or
maybe Borislav wants to), I have a patch that handles a couple of
common cases statically.

It requires that we inline copy_to/from_user() in order to get
constant size information, but almost all other architectures do that
anyway, and it's not as horrid as it used to be with the current
access_ok() that doesn't need to do that nasty dynamic task size
check.

In particular, it should help with copying structures - notably the
'stat' structure in cp_new_stat().

The attached patch is entirely untested, except for me checking code
generation for some superficial sanity in a couple of places.

I'm not convinced that

    len >= 64 && !(len & 7)

is necessarily the "correct" option, but I resurrected an older patch
for this, and decided to use that as the "this is what
rep_movs_alternative would do anyway" test.
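
(The attached patch is not reproduced in this thread.  Purely to illustrate
the idea being described -- a constant-size fast path keyed on that test --
here is a hedged sketch; copy_user_const() and copy_user_generic_call() are
made-up names and the fault handling / exception table entries are elided:)

/*
 * Sketch only: if the copy length is a compile-time constant that is
 * >= 64 and a multiple of 8, emit an inline "rep movsq" instead of
 * calling out to rep_movs_alternative.  copy_user_generic_call() stands
 * in for the existing out-of-line path.
 */
extern unsigned long copy_user_generic_call(void *to, const void *from,
					    unsigned long len);

static inline unsigned long
copy_user_const(void *to, const void *from, unsigned long len)
{
	if (__builtin_constant_p(len) && len >= 64 && !(len & 7)) {
		unsigned long qwords = len >> 3;

		asm volatile("rep movsq"
			     : "+D" (to), "+S" (from), "+c" (qwords)
			     : : "memory");
		return 0;	/* fault handling deliberately omitted */
	}
	return copy_user_generic_call(to, from, len);
}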

And obviously I expect that FSRM also does ok with "rep movsq", even
if technically "movsb" is the simpler case (because it doesn't have
the alignment issues that "rep movsq" has).

                 Linus
Re: [PATCH v2] x86: bring back rep movsq for user access on CPUs without ERMS
Posted by Mateusz Guzik 2 years, 3 months ago
On 8/30/23, Linus Torvalds <torvalds@linux-foundation.org> wrote:
> Side note, if you want to play around with the user copy routines (or
> maybe Borislav wants to), I have a patch that handles a couple of
> common cases statically.
>

No thanks mate, I'm good ;) (but see below)

> In particular, it should help with copying structures - notably the
> 'stat' structure in cp_new_stat().
>

So my first side note is that I agree this is going to help a bunch of
cases, but I disagree about this one.

cp_new_stat and the counterpart for statx can dodge this rep mov by
filling user memory directly.

I know for a fact statx already had that implemented, but it was
before (or around the time) SMAP was introduced -- afterwards all the
__put_user were shafting performance big time. As the kernel did not
have support for delineating several accesses in a row, code got
reworked to the usual massage locally and copyout. I can't be arsed to
check but I suspect it was the same story with cp_new_stat. The
necessary machinery is implemented for quite some time now but nobody
revisited the area.

I'm going to patch this, but first I want to address the bigger
problem of glibc implementing fstat as newfstatat, demolishing perf of
that op. In their defense currently they have no choice as this is the
only exporter of the "new" struct stat. I'll be sending a long email
to fsdevel soon(tm) with a proposed fix.

> The attached patch is entirely untested, except for me checking code
> generation for some superficial sanity in a couple of places.
>
> I'm not convinced that
>
>     len >= 64 && !(len & 7)
>
> is necessarily the "correct" option, but I resurrected an older patch
> for this, and decided to use that as the "this is what
> rep_movs_alternative would do anyway" test.
>
> And obviously I expect that FSRM also does ok with "rep movsq", even
> if technically "movsb" is the simpler case (because it doesn't have
> the alignment issues that "rep movsq" has).
>

So I was wondering if rep movsq is any worse than ERMS'ed rep movsb
when there is no tail to handle and the buffer is aligned to a page,
or more to the point if clear_page gets any benefit for going with
movsb.
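
(Concretely, the comparison is between something like the following two
variants; for page clearing the instruction is rep stos rather than rep
movs, but the ERMS question is the same.  A user-space-style C sketch for
illustration only, not the kernel's actual clear_page_64.S:)

/* Qword-count variant: 512 stores of a zero qword. */
static inline void clear_page_stosq(void *page)
{
	unsigned long cnt = 4096 / 8;

	asm volatile("rep stosq"
		     : "+D" (page), "+c" (cnt)
		     : "a" (0UL)
		     : "memory");
}

/* Byte-count ERMS variant: 4096 as a byte count, rep stosb. */
static inline void clear_page_stosb(void *page)
{
	unsigned long cnt = 4096;

	asm volatile("rep stosb"
		     : "+D" (page), "+c" (cnt)
		     : "a" (0UL)
		     : "memory");
}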

The routine got patched to use it in e365c9df2f2f ("x86, mem:
clear_page_64.S: Support clear_page() with enhanced REP MOVSB/STOSB"),
which was a part of a larger patchset sprinkling ERMS over string ops.
The commit does not come with any numbers for it though.

I did a quick test with page_fault1 from will-it-scale with both
variants on Sapphire Rapids and got the same result, but then again
that's just one arch -- maybe something really old suffers here?

Not having to mess with alternatives would be nice here. That said I'm
just throwing it out there, I have no intention of pursuing it at the
moment.

As a heads up what I *do* intend to do later is add back rep to memcpy
and memset for sizes >= 64.

Gathering stats over kernel build: bpftrace -e
'kprobe:memcpy,kprobe:memset  { @ = lhist(arg2, 64, 16384, 512); }'

@[kprobe:memcpy]:
(..., 64)        3477477 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[64, 576)        1245601 |@@@@@@@@@@@@@@@@@@                                  |
[576, 1088)        44061 |                                                    |
[1088, 1600)          11 |                                                    |
[1600, 2112)           9 |                                                    |
[2624, 3136)           1 |                                                    |
[3136, 3648)           2 |                                                    |
[3648, 4160)         148 |                                                    |
[4160, 4672)           0 |                                                    |

@[kprobe:memset]:
(..., 64)       20094711 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[64, 576)       11141718 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@                        |
[576, 1088)        28948 |                                                    |
[1088, 1600)       59981 |                                                    |
[1600, 2112)       11362 |                                                    |
[2112, 2624)       13218 |                                                    |
[2624, 3136)       14781 |                                                    |
[3136, 3648)       15839 |                                                    |
[3648, 4160)       48670 |                                                    |
[7744, 8256)           1 |                                                    |
[16K, ...)         32491 |                                                    |


[I removed rows with no hits; also I had a crap hack so that bpftrace
can attach to these]

This collects sizes bigger than 64, with a 512 byte step. While the vast
majority of calls are for sizes which are very small, multikilobyte
uses keep showing up, making it worthwhile.

I'm going to try to work out alternative_jump which avoids the current
extra nops and which works on both gcc and clang (or in the worst case
I'll massage what you had for gcc and let clang eat the nops, with the
macro hiding the ugliness).

I make no commitments when I get to it, but probably early September.

btw these string ops would probably benefit from overlapping stores
instead of 1 or 8 byte loops, but I'm not submitting any patches on
that front for now (and in fact would appreciate if someone else
sorted it out) :)

Cheers,
-- 
Mateusz Guzik <mjguzik gmail.com>
Re: [PATCH v2] x86: bring back rep movsq for user access on CPUs without ERMS
Posted by Linus Torvalds 2 years, 3 months ago
On Fri, 1 Sept 2023 at 08:20, Mateusz Guzik <mjguzik@gmail.com> wrote:
>
> I'm going to patch this, but first I want to address the bigger
> problem of glibc implementing fstat as newfstatat, demolishing perf of
> that op. In their defense currently they have no choice as this is the
> only exporter of the "new" struct stat. I'll be sending a long email
> to fsdevel soon(tm) with a proposed fix.

This ended up nagging me, and I looked it up.

You are correct that glibc seems to implement 'fstat()' as
'newfstatat(..AT_EMPTY_PATH)', which is a damn shame. We could
short-circuit that, but we really shouldn't need to.

Because you are incorrect when you claim that it's their only choice.
We have a native 'newfstat()' that glibc *should* be using.

We've had 'newfstat()' *forever*. It predates not just the git tree,
but the BK tree before it.

In fact, our 'newfstat()' system call goes back to Linux-0.97 (August
1, 1992), and yes, back then it really took a 'struct new_stat' as the
argument.

So I have no idea why you claim that "currently they have no choice".
glibc is simply being incredibly stupid, and using newfstatat() for no
good reason.

How very very annoying.

I guess we can - and now probably should - short-circuit the "empty
path with AT_EMPTY_PATH" case, but dammit, we shouldn't _need_ to.

Only because of excessive stupidity in user space does it look like we
should do it.

Why _is_ glibc doing that thing?

              Linus

PS. On 32-bit x86, we also have 'stat64', but that has native
'fstat64()' support too, so that's no excuse. And it actually uses our
'unsafe_put_user()' infrastructure to avoid having to double copy,
mainly because it doesn't need to worry about the padding fields like
the generic 'cp_new_stat' does.
Re: [PATCH v2] x86: bring back rep movsq for user access on CPUs without ERMS
Posted by Linus Torvalds 2 years, 3 months ago
On Sun, 3 Sept 2023 at 11:49, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> So I have no idea why you claim that "currently they have no choice".
> glibc is simply being incredibly stupid, and using newfstatat() for no
> good reason.

Do you have any good benchmark that shows the effects of this?

And if you do, does the attached patch (ENTIRELY UNTESTED!) fix the
silly glibc mis-feature?

               Linus
Re: [PATCH v2] x86: bring back rep movsq for user access on CPUs without ERMS
Posted by Mateusz Guzik 2 years, 3 months ago
On Sun, Sep 03, 2023 at 01:08:03PM -0700, Linus Torvalds wrote:
> On Sun, 3 Sept 2023 at 11:49, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > So I have no idea why you claim that "currently they have no choice".
> > glibc is simply being incredibly stupid, and using newfstatat() for no
> > good reason.
> 
> Do you have any good benchmark that shows the effects of this?
> 
> And if you do, does the attached patch (ENTIRELY UNTESTED!) fix the
> silly glibc mis-feature?
> 

"real fstat" is syscall(5, fd, &sb).

Sapphire Rapids, will-it-scale, ops/s

stock fstat	5088199
patched fstat	7625244	(+49%)
real fstat	8540383	(+67% / +12%)

It dodges lockref et al, but it does not dodge SMAP which accounts for
the difference.
Re: [PATCH v2] x86: bring back rep movsq for user access on CPUs without ERMS
Posted by Linus Torvalds 2 years, 3 months ago
On Sun, 3 Sept 2023 at 13:49, Mateusz Guzik <mjguzik@gmail.com> wrote:
>
> "real fstat" is syscall(5, fd, &sb).
>
> Sapphire Rapids, will-it-scale, ops/s
>
> stock fstat     5088199
> patched fstat   7625244 (+49%)
> real fstat      8540383 (+67% / +12%)
>
> It dodges lockref et al, but it does not dodge SMAP which accounts for
> the difference.

Side note, since I was looking at this, I hacked up a quick way for
architectures to do their own optimized cp_new_stat() that avoids the
double-buffering.

Sadly it *is* architecture-specific due to padding and
architecture-specific field sizes (and thus EOVERFLOW rules), but it
is what it is.

I don't know how much it matters, but it might make a difference. And
'stat()' is most certainly worth optimizing for, even if glibc has
made our life more difficult.

Want to try out another entirely untested patch? Attached.

                Linus
Re: [PATCH v2] x86: bring back rep movsq for user access on CPUs without ERMS
Posted by Ingo Molnar 2 years, 3 months ago
* Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Sun, 3 Sept 2023 at 13:49, Mateusz Guzik <mjguzik@gmail.com> wrote:
> >
> > "real fstat" is syscall(5, fd, &sb).
> >
> > Sapphire Rapids, will-it-scale, ops/s
> >
> > stock fstat     5088199
> > patched fstat   7625244 (+49%)
> > real fstat      8540383 (+67% / +12%)
> >
> > It dodges lockref et al, but it does not dodge SMAP which accounts for
> > the difference.
> 
> Side note, since I was looking at this, I hacked up a quick way for
> architectures to do their own optimized cp_new_stat() that avoids the
> double-buffering.
> 
> Sadly it *is* architecture-specific due to padding and
> architecture-specific field sizes (and thus EOVERFLOW rules), but it
> is what it is.
> 
> I don't know how much it matters, but it might make a difference. And
> 'stat()' is most certainly worth optimizing for, even if glibc has
> made our life more difficult.
> 
> Want to try out another entirely untested patch? Attached.
> 
>                 Linus

>  arch/x86/kernel/sys_x86_64.c | 44 ++++++++++++++++++++++++++++++++++++++++++++
>  fs/stat.c                    |  2 +-
>  include/linux/stat.h         |  2 ++
>  3 files changed, 47 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/x86/kernel/sys_x86_64.c b/arch/x86/kernel/sys_x86_64.c
> index c783aeb37dce..fca647f61bc1 100644
> --- a/arch/x86/kernel/sys_x86_64.c
> +++ b/arch/x86/kernel/sys_x86_64.c
> @@ -22,6 +22,50 @@
>  #include <asm/elf.h>
>  #include <asm/ia32.h>
>  
> +int cp_new_stat(struct kstat *stat, struct stat __user *ubuf)
> +{
> +	typeof(ubuf->st_uid) uid;
> +	typeof(ubuf->st_gid) gid;
> +	typeof(ubuf->st_dev) dev = new_encode_dev(stat->dev);
> +	typeof(ubuf->st_rdev) rdev = new_encode_dev(stat->rdev);
> +
> +	SET_UID(uid, from_kuid_munged(current_user_ns(), stat->uid));
> +	SET_GID(gid, from_kgid_munged(current_user_ns(), stat->gid));
> +
> +	if (!user_write_access_begin(ubuf, sizeof(struct stat)))
> +		return -EFAULT;
> +
> +	unsafe_put_user(dev,			&ubuf->st_dev,		Efault);
> +	unsafe_put_user(stat->ino,		&ubuf->st_ino,		Efault);
> +	unsafe_put_user(stat->nlink,		&ubuf->st_nlink,	Efault);
> +
> +	unsafe_put_user(stat->mode,		&ubuf->st_mode,		Efault);
> +	unsafe_put_user(uid,			&ubuf->st_uid,		Efault);
> +	unsafe_put_user(gid,			&ubuf->st_gid,		Efault);
> +	unsafe_put_user(0,			&ubuf->__pad0,		Efault);
> +	unsafe_put_user(rdev,			&ubuf->st_rdev,		Efault);
> +	unsafe_put_user(stat->size,		&ubuf->st_size,		Efault);
> +	unsafe_put_user(stat->blksize,		&ubuf->st_blksize,	Efault);
> +	unsafe_put_user(stat->blocks,		&ubuf->st_blocks,	Efault);
> +
> +	unsafe_put_user(stat->atime.tv_sec,	&ubuf->st_atime,	Efault);
> +	unsafe_put_user(stat->atime.tv_nsec,	&ubuf->st_atime_nsec,	Efault);
> +	unsafe_put_user(stat->mtime.tv_sec,	&ubuf->st_mtime,	Efault);
> +	unsafe_put_user(stat->mtime.tv_nsec,	&ubuf->st_mtime_nsec,	Efault);
> +	unsafe_put_user(stat->ctime.tv_sec,	&ubuf->st_ctime,	Efault);
> +	unsafe_put_user(stat->ctime.tv_nsec,	&ubuf->st_ctime_nsec,	Efault);
> +	unsafe_put_user(0,			&ubuf->__unused[0],	Efault);
> +	unsafe_put_user(0,			&ubuf->__unused[1],	Efault);
> +	unsafe_put_user(0,			&ubuf->__unused[2],	Efault);

/me performs happy dance at seeing proper use of vertical alignment in 
bulk-assignments.

If measurements support it then this looks like a nice optimization.

Thanks,

	Ingo
Re: [PATCH v2] x86: bring back rep movsq for user access on CPUs without ERMS
Posted by Linus Torvalds 2 years, 3 months ago
On Sun, 3 Sept 2023 at 14:48, Ingo Molnar <mingo@kernel.org> wrote:
>
> If measurements support it then this looks like a nice optimization.

Well, it seems to work, but when I profile it to see if the end result
looks reasonable, the profile data is swamped by the return
mispredicts from CPU errata workarounds, and to a smaller degree by
the clac/stac overhead of SMAP.

So it does seem to work - at least it boots here and everything looks
normal - and it does seem to generate good code, but the profiles look
kind of sad.

I also note that we do a lot of stupid pointless 'statx' work that is
then entirely thrown away for a regular stat() system call.

Part of it is actual extra work to set the statx fields.

But a lot of it is that even if we didn't do that, the 'statx' code
has made 'struct kstat' much bigger, and made our code footprints much
worse.

Of course, even without the useless statx overhead, 'struct kstat'
itself ends up having a lot of padding because of how 'struct
timespec64' looks. It might actually be good to split it explicitly
into seconds and nanoseconds just for padding.

Because that all blows 'struct kstat' up to 160 bytes here.

And to make it all worse, the statx code has caused all the
filesystems to have their own 'getattr()' code just to fill in that
worthless garbage, when it used to be that you could rely on
'generic_fillattr()'.

I'm looking at ext4_getattr(), for example, and I think *all* of it is
due to statx - that to a close approximation nobody cares about, and
is a specialty system call for a couple of users

And again - the indirect branches have gone from being "a cycle or
two" to being pipeline stalls and mispredicts. So not using just a
plain 'generic_fillattr()' is *expensive*.

Sad. Because the *normal* stat() family of system calls are some of
the most important ones out there. Very much unlike statx().

              Linus
Re: [PATCH v2] x86: bring back rep movsq for user access on CPUs without ERMS
Posted by Mateusz Guzik 2 years, 3 months ago
On 9/4/23, Linus Torvalds <torvalds@linux-foundation.org> wrote:
> On Sun, 3 Sept 2023 at 14:48, Ingo Molnar <mingo@kernel.org> wrote:
>>
>> If measurements support it then this looks like a nice optimization.
>
> Well, it seems to work, but when I profile it to see if the end result
> looks reasonable, the profile data is swamped by the return
> mispredicts from CPU errata workarounds, and to a smaller degree by
> the clac/stac overhead of SMAP.
>
> So it does seem to work - at least it boots here and everything looks
> normal - and it does seem to generate good code, but the profiles look
> kind of sad.
>
> I also note that we do a lot of stupid pointless 'statx' work that is
> then entirely thrown away for a regular stat() system call.
>
> Part of it is actual extra work to set the statx fields.
>
> But a lot of it is that even if we didn't do that, the 'statx' code
> has made 'struct kstat' much bigger, and made our code footprints much
> worse.
>
> Of course, even without the useless statx overhead, 'struct kstat'
> itself ends up having a lot of padding because of how 'struct
> timespec64' looks. It might actually be good to split it explicitly
> into seconds and nanoseconds just for padding.
>
> Because that all blows 'struct kstat' up to 160 bytes here.
>
> And to make it all worse, the statx code has caused all the
> filesystems to have their own 'getattr()' code just to fill in that
> worthless garbage, when it used to be that you could rely on
> 'generic_fillattr()'.
>
> I'm looking at ext4_getattr(), for example, and I think *all* of it is
> due to statx - that to a close approximation nobody cares about, and
> is a specialty system call for a couple of users
>
> And again - the indirect branches have gone from being "a cycle or
> two" to being pipeline stalls and mispredicts. So not using just a
> plain 'generic_fillattr()' is *expensive*.
>
> Sad. Because the *normal* stat() family of system calls are some of
> the most important ones out there. Very much unlike statx().
>

First a note that the patch has a side-effect of dodging hardened
usercopy, but I'm testing on a kernel without it.

bad news: virtually no difference for stat(2) (aka newfstatat)
before:  3898139
after: 3922714 (+0%)

ok news: a modest win for the actual fstat system call (aka newfstat,
*not* the libc wrapper calling newfstatat)
before: 8511283
after: 8746829 (+2%)

That is rather disappointing, but I would check it in for the time
when glibc fixes fstat, on some conditions.

The patch as proposed is kind of crap -- moving all that handling out
of fs/stat.c is quite weird and trivially avoidable, with a way to do
it already present in the file.

Following the current example of INIT_STRUCT_STAT_PADDING you could
add these to an x86-64 header:
#define INIT_STRUCT_STAT_PAD0	unsafe_put_user(0, &ubuf->__pad0, Efault);
#define INIT_STRUCT_STAT_UNUSED	{ unsafe_put_user(0, ....); unsafe_put_user(0, ....) }

the new func placed in fs/stat.c would get the same code gen as it
does with your patch, but other archs could easily join and there
would be no stat logic escaping the file (or getting duplicated again
if another arch wants to dodge the temp buffer)

Then I think it is a perfectly tolerable thing to do, but again given
the modest win it is perhaps too early.

Someone(tm) should probably sort out stat performance as is. ;)

-- 
Mateusz Guzik <mjguzik gmail.com>
Re: [PATCH v2] x86: bring back rep movsq for user access on CPUs without ERMS
Posted by Linus Torvalds 2 years, 3 months ago
On Sun, 3 Sept 2023 at 16:15, Mateusz Guzik <mjguzik@gmail.com> wrote:
>
> bad news: virtually no difference for stat(2) (aka newfstatat)
> before:  3898139
> after: 3922714 (+0%)

Yeah, the path lookup is way more expensive than a slight win in a L1
copy, and the pathname lookup will also have gone much deeper on the
stack.

> ok news: a modest win for the actual fstat system call (aka newfstat,
> *not* the libc wrapper calling newfstatat)
> before: 8511283
> after: 8746829 (+2%)

We might end up with slightly less deep stack, so potentially a
smaller cache footprint, and there is much less other stuff going on..

> The patch as proposed is kind of crap -- moving all that handling out
> of fs/stat.c is quite weird and trivially avoidable, with a way to do
> it already present in the file.

So you say. I claim that you're full of crap.

Try it and you'll see it is not even *remotely* as easy as you claim.
Not when you have to deal with random sizes and padding of 20+
different architectures.

             Linus
Re: [PATCH v2] x86: bring back rep movsq for user access on CPUs without ERMS
Posted by Linus Torvalds 2 years, 3 months ago
On Sun, 3 Sept 2023 at 20:07, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> Try it and you'll see it is not even *remotely* as easy as you claim.
> Not when you have to deal with random sizes and padding of 20+
> different architectures.

Perhaps more importantly, nobody actually seems to have the energy to care.

As shown by the fact that even the current really simple "just define
INIT_STRUCT_STAT_PADDING to avoid a pointless memset in a hot path"
has been taken up by exactly zero other architectures than x86.

I wasn't going to fight that kind of history of apathy.

              Linus
Re: [PATCH v2] x86: bring back rep movsq for user access on CPUs without ERMS
Posted by Mateusz Guzik 2 years, 3 months ago
On 9/4/23, Linus Torvalds <torvalds@linux-foundation.org> wrote:
> On Sun, 3 Sept 2023 at 20:07, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
>>
>> Try it and you'll see it is not even *remotely* as easy as you claim.
>> Not when you have to deal with random sizes and padding of 20+
>> different architectures.
>
> Perhaps more importantly, nobody actually seems to have the energy to care.
>

Perhaps in this context I should have explicitly mentioned that only
64-bit archs would be considered, as in placing the func you wrote in
fs/stat.c and it is explicitly 64-bit. This should whack all archs
which would never bother with their implementation anyway.

Worst case if the 64 bit structs differ one can settle for
user-accessing INIT_STRUCT_STAT_PADDING.

My point though was that code filling the struct escaping to
arch-dependent file is rather regrettable, even if x86-64 would be the
only case. And it definitely can be avoided, even if would require
tolerating that some of the stores are reordered. They are already
reordered for dodging memset.

-- 
Mateusz Guzik <mjguzik gmail.com>
Re: [PATCH v2] x86: bring back rep movsq for user access on CPUs without ERMS
Posted by Linus Torvalds 2 years, 3 months ago
On Sun, 3 Sept 2023 at 23:03, Mateusz Guzik <mjguzik@gmail.com> wrote:
>
> Worst case if the 64 bit structs differ one can settle for
> user-accessing INIT_STRUCT_STAT_PADDING.

As I said, try it. I think you'll find that you are wrong.  It's
_hard_ to get the padding right. The "use a temporary" model of the
current code makes the fallback easy - just clear it before copying
the fields. Without that, you have to get every architecture padding
right manually.

You almost inevitably end up with "one function for the fallback case,
a completely different function for the unsafe_put_user() case, and
fairly painful macro for architectures that get converted".

And even then, you'll get non-optimal code, because you won't get the
order of the stores right to get nice contiguous stores. That
admittedly only matters for architectures with bad store coalescing,
which is hopefully not any of the ones we care about (and those kinds
of microarchitectures usually also want the loads done first, so
ld-ld-ld-ld-st-st-st-st patterns).

But just giving up on that, and using a weak fallback function, and
then an optimal one for the (single) architecture that anybody will do
this for, makes it all much simpler.

Feel free to send me a patch to prove your point. Because without a
patch, I claim you are just blowing hot air.

               Linus
Re: [PATCH v2] x86: bring back rep movsq for user access on CPUs without ERMS
Posted by Mateusz Guzik 2 years, 3 months ago
On 9/4/23, Linus Torvalds <torvalds@linux-foundation.org> wrote:
> On Sun, 3 Sept 2023 at 23:03, Mateusz Guzik <mjguzik@gmail.com> wrote:
>>
>> Worst case if the 64 bit structs differ one can settle for
>> user-accessing INIT_STRUCT_STAT_PADDING.
>
> As I said, try it. I think you'll find that you are wrong.  It's
> _hard_ to get the padding right. The "use a temporary" model of the
> current code makes the fallback easy - just clear it before copying
> the fields. Without that, you have to get every architecture padding
> right manually.
>

See below for an unrelated patch. ;)

In the worst case user-accessing INIT_STRUCT_STAT_PADDING would simply
have these padding assignments spelled out, like you had to do anyway
in your patch (among other things) and which already happens with
existing code for x86-64. The code would not be fully optimized due to
"bad" ordering, but the implementation would avoid escaping fs/stat.c
which imo justifies it.

I said my $0.03 on the matter twice; I don't think this convo is
getting anywhere or that it is particularly important.

With that out of the way:

Earlier you attached a patch to check for an empty path early; that
was a nice win but can result in duplicated smap trips for nasty
software.

Then you mentioned doing the check after the name is fully copied in,
which I incorrectly dismissed from the get go -- mentally I had a
state from $elsewhere where it would require patching namei. But in
Linux this is not the case:

        name = getname_flags(filename, getname_statx_lookup_flags(statx_flags), NULL);
        ret = vfs_statx(dfd, name, statx_flags, stat, STATX_BASIC_STATS);
        putname(name);

So something like this:

@@ -312,7 +314,15 @@ int vfs_fstatat(int dfd, const char __user *filename,
        struct filename *name;

        name = getname_flags(filename, getname_statx_lookup_flags(statx_flags), NULL);
-       ret = vfs_statx(dfd, name, statx_flags, stat, STATX_BASIC_STATS);
+       /*
+        * Hack: ugliness below damage controls glibc which reimplemented fstat
+        * on top of newfstatat(fd, "", buf, AT_EMPTY_PATH). We still pay for
+        * kmalloc and user access, but elide lockref.
+        */
+       if (name->name[0] == '\0' && flags == AT_EMPTY_PATH && dfd >= 0)
+               ret = vfs_fstat(dfd, stat);
+       else
+               ret = vfs_statx(dfd, name, statx_flags, stat, STATX_BASIC_STATS);
        putname(name);

        return ret;

Previous numbers:
> stock fstat     5088199
> patched fstat   7625244 (+49%)
> real fstat      8540383 (+67% / +12%)

while moving the check lower:
patchedv2 fstat 6253694 (+22%)

So quite a bit slower than the hackiest variant, but avoids the double
smap problem for cases passing AT_EMPTY_PATH and a non-"" name.

I think this is a justifiable patch, can submit properly formatted
(this is copy pasted into gmail, sorry). Alternatively since it is
trivial and it was your idea feel free to take it or reimpement or
whatever in similar vein.

-- 
Mateusz Guzik <mjguzik gmail.com>
Re: [PATCH v2] x86: bring back rep movsq for user access on CPUs without ERMS
Posted by Linus Torvalds 2 years, 3 months ago
On Tue, 5 Sept 2023 at 13:41, Mateusz Guzik <mjguzik@gmail.com> wrote:
>
> @@ -312,7 +314,15 @@ int vfs_fstatat(int dfd, const char __user *filename,
>         struct filename *name;
>
>         name = getname_flags(filename, getname_statx_lookup_flags(statx_flags), NULL);
> -       ret = vfs_statx(dfd, name, statx_flags, stat, STATX_BASIC_STATS);
> +       /*
> +        * Hack: the ugliness below damage-controls glibc, which reimplemented
> +        * fstat on top of newfstatat(fd, "", buf, AT_EMPTY_PATH). We still pay
> +        * for kmalloc and user access, but elide lockref.
> +        */
> +       if (name->name[0] == '\0' && flags == AT_EMPTY_PATH && dfd >= 0)
> +               ret = vfs_fstat(dfd, stat);
> +       else
> +               ret = vfs_statx(dfd, name, statx_flags, stat, STATX_BASIC_STATS);
>         putname(name);

I actually think I might prefer the earlier hacky thing, because it
avoids the whole nasty pathname allocation thing (ie the __getname()
dance in getname_flags(), and the addition of the pathname to the
audit records etc).

I suspect your "there are no real loads that combine AT_EMPTY_PATH
with a path" comment is true.

So if we have this short-circuit in the code, let's go whole hog on it,
and avoid not just the RCU lookup (with lockref etc) but the pathname
allocation too.
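
Roughly this shape, purely as an untested sketch of the idea (not
necessarily how my earlier attached patch did it): peek at the first
byte of the user string before getname() gets involved, and fall back
to vfs_fstat() for the glibc case.

int vfs_fstatat(int dfd, const char __user *filename,
		struct kstat *stat, int flags)
{
	int statx_flags = flags | AT_NO_AUTOMOUNT;
	struct filename *name;
	int ret;

	/*
	 * Short-circuit the glibc fstat() emulation before getname() runs,
	 * skipping both the __getname() allocation and the path walk. The
	 * cost is one extra STAC/CLAC pair for the peek.
	 */
	if (flags == AT_EMPTY_PATH && dfd >= 0) {
		char c;

		if (get_user(c, filename))
			return -EFAULT;
		if (c == '\0')
			return vfs_fstat(dfd, stat);
	}

	name = getname_flags(filename, getname_statx_lookup_flags(statx_flags), NULL);
	ret = vfs_statx(dfd, name, statx_flags, stat, STATX_BASIC_STATS);
	putname(name);

	return ret;
}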

                  Linus
Re: [PATCH v2] x86: bring back rep movsq for user access on CPUs without ERMS
Posted by Mateusz Guzik 2 years, 3 months ago
On 9/6/23, Linus Torvalds <torvalds@linux-foundation.org> wrote:
> On Tue, 5 Sept 2023 at 13:41, Mateusz Guzik <mjguzik@gmail.com> wrote:
>>
>> @@ -312,7 +314,15 @@ int vfs_fstatat(int dfd, const char __user *filename,
>>         struct filename *name;
>>
>>         name = getname_flags(filename, getname_statx_lookup_flags(statx_flags), NULL);
>> -       ret = vfs_statx(dfd, name, statx_flags, stat, STATX_BASIC_STATS);
>> +       /*
>> +        * Hack: the ugliness below damage-controls glibc, which reimplemented
>> +        * fstat on top of newfstatat(fd, "", buf, AT_EMPTY_PATH). We still pay
>> +        * for kmalloc and user access, but elide lockref.
>> +        */
>> +       if (name->name[0] == '\0' && flags == AT_EMPTY_PATH && dfd >= 0)
>> +               ret = vfs_fstat(dfd, stat);
>> +       else
>> +               ret = vfs_statx(dfd, name, statx_flags, stat, STATX_BASIC_STATS);
>>         putname(name);
>
> I actually think I might prefer the earlier hacky thing, because it
> avoids the whole nasty pathname allocation thing (ie the __getname()
> dance in getname_flags(), and the addition of the pathname to the
> audit records etc).
>
> I suspect your "there are no real loads that combine AT_EMPTY_PATH
> with a path" comment is true.
>
> So if we have this short-circuit of the code, let's go all hog on it,
> and avoid not just the RCU lookup (with lockref etc), but the pathname
> allocation too.
>

Ok, I'm buggering off the subject.

-- 
Mateusz Guzik <mjguzik gmail.com>
Re: [PATCH v2] x86: bring back rep movsq for user access on CPUs without ERMS
Posted by Linus Torvalds 2 years, 3 months ago
On Sun, 3 Sept 2023 at 13:49, Mateusz Guzik <mjguzik@gmail.com> wrote:
>
> It dodges lockref et al, but it does not dodge SMAP which accounts for
> the difference.

Yeah, just doing that "check if the name is empty" is not free. The
CLAC/STAC overhead is real.

I see no way of avoiding that cost, though - the only true fix is to
fix the glibc braindamage.

In fact, I suspect that my stupid patch is unacceptable exactly
because it actually makes a *real* fstatat() with a real path much
worse due to the "check if the path is empty".

We could possibly move that check into the getname() path, and at
least avoid the extra overhead.

           Linus
Re: [PATCH v2] x86: bring back rep movsq for user access on CPUs without ERMS
Posted by Mateusz Guzik 2 years, 3 months ago
On 9/3/23, Linus Torvalds <torvalds@linux-foundation.org> wrote:
> On Sun, 3 Sept 2023 at 13:49, Mateusz Guzik <mjguzik@gmail.com> wrote:
>>
>> It dodges lockref et al, but it does not dodge SMAP which accounts for
>> the difference.
>
> Yeah, just doing that "check if the name is empty" is not free. The
> CLAC/STAC overhead is real.
>
> I see no way of avoiding that cost, though - the only true fix is to
> fix the glibc braindamage.
>
> In fact, I suspect that my stupid patch is unacceptable exactly
> because it actually makes a *real* fstatat() with a real path much
> worse due to the "check if the path is empty".
>

I don't think it is *that* bad. I did a quick sanity check on that
front by rolling with bpftrace and watching for cases which pass
AT_EMPTY_PATH *and* provide a path. Apart from my test prog, which
deliberately did that, nothing popped up in casual testing. But there
is definitely funny code out there which passes both for reasons other
than testing, so there is that.

> We could possibly move that check into the getname() path, and at
> least avoid the extra overhead.
>

Complexity aside, that would eat some of the win, as kmalloc/kfree are
not particularly cheap, and that is ignoring the INIT_ON_ALLOC problem.

So I would not go for it, but that's my $0.03.

That said, I'm going to engage the glibc people to sort this out on
their end. Arguably this is something everyone will be able to backport
to their distro, no matter how LTSy (not that I expect it to happen ;)).

-- 
Mateusz Guzik <mjguzik gmail.com>
Re: [PATCH v2] x86: bring back rep movsq for user access on CPUs without ERMS
Posted by Linus Torvalds 2 years, 3 months ago
On Sun, 3 Sept 2023 at 14:06, Mateusz Guzik <mjguzik@gmail.com> wrote:
>
> I don't think it is *that* bad. I did a quick sanity check on that
> front by rolling with bpftrace on cases which pass AT_EMPTY_PATH *and*
> provide a path.

I guess you are right - nobody sane would use AT_EMPTY_PATH except if
they don't have a path.

Of course, the only reason we're discussing this in the first place is
because people are doing insane things, which makes _that_ particular
argument very weak indeed...

                     Linus
Re: [PATCH v2] x86: bring back rep movsq for user access on CPUs without ERMS
Posted by Mateusz Guzik 2 years, 3 months ago
On 9/3/23, Linus Torvalds <torvalds@linux-foundation.org> wrote:
> On Sun, 3 Sept 2023 at 14:06, Mateusz Guzik <mjguzik@gmail.com> wrote:
>>
>> I don't think it is *that* bad. I did a quick sanity check on that
>> front by rolling with bpftrace on cases which pass AT_EMPTY_PATH *and*
>> provide a path.
>
> I guess you are right - nobody sane would use AT_EMPTY_PATH except if
> they don't have a path.
>
> Of course, the only reason we're discussing this in the first place is
> because people are doing insane things, which makes _that_ particular
> argument very weak indeed...
>

I put blame on whoever allowed non-NULL path and AT_EMPTY_PATH as a
valid combination, forcing the user buf to be accessed no matter what.
But I'm not going to go digging for names. ;)
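
For reference, a tiny demo of what I mean (behaviour as seen on kernels
of this vintage): the only accepted spelling of "no path" is an actual
"" string, NULL gets bounced, so the user buffer is always read.

#define _GNU_SOURCE
#include <sys/syscall.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <stdio.h>
#include <errno.h>
#include <string.h>

int main(void)
{
        struct stat sb;
        int fd;

        fd = open("/tmp", O_RDONLY);
        if (fd == -1) {
                perror("open");
                return 1;
        }

        /* the combination glibc uses to implement fstat() */
        if (syscall(SYS_newfstatat, fd, "", &sb, AT_EMPTY_PATH) == 0)
                printf("empty string: ok\n");

        /* NULL is not accepted, so there is no way to skip the user access */
        if (syscall(SYS_newfstatat, fd, (void *)0, &sb, AT_EMPTY_PATH) == -1)
                printf("NULL: %s\n", strerror(errno));

        return 0;
}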

-- 
Mateusz Guzik <mjguzik gmail.com>
Re: [PATCH v2] x86: bring back rep movsq for user access on CPUs without ERMS
Posted by Al Viro 2 years, 3 months ago
On Sun, Sep 03, 2023 at 11:18:44PM +0200, Mateusz Guzik wrote:
> On 9/3/23, Linus Torvalds <torvalds@linux-foundation.org> wrote:
> > On Sun, 3 Sept 2023 at 14:06, Mateusz Guzik <mjguzik@gmail.com> wrote:
> >>
> >> I don't think it is *that* bad. I did a quick sanity check on that
> >> front by rolling with bpftrace on cases which pass AT_EMPTY_PATH *and*
> >> provide a path.
> >
> > I guess you are right - nobody sane would use AT_EMPTY_PATH except if
> > they don't have a path.
> >
> > Of course, the only reason we're discussing this in the first place is
> > because people are doing insane things, which makes _that_ particular
> > argument very weak indeed...
> >
> 
> I put blame on whoever allowed non-NULL path and AT_EMPTY_PATH as a
> valid combination, forcing the user buf to be accessed no matter what.
> But I'm not going to go digging for names. ;)

ITYM s/allowed/mandated/ - AT_EMPTY_PATH with NULL is -EFAULT.
Re: [PATCH v2] x86: bring back rep movsq for user access on CPUs without ERMS
Posted by Mateusz Guzik 2 years, 3 months ago
On 9/3/23, Mateusz Guzik <mjguzik@gmail.com> wrote:
> On Sun, Sep 03, 2023 at 01:08:03PM -0700, Linus Torvalds wrote:
>> On Sun, 3 Sept 2023 at 11:49, Linus Torvalds
>> <torvalds@linux-foundation.org> wrote:
>> >
>> > So I have no idea why you claim that "currently they have no choice".
>> > glibc is simply being incredibly stupid, and using newfstatat() for no
>> > good reason.
>>
>> Do you have any good benchmark that shows the effects of this?
>>
>> And if you do, does the attached patch (ENTIRELY UNTESTED!) fix the
>> silly glibc mis-feature?
>>
>
> "real fstat" is syscall(5, fd, &sb).
>
> Sapphire Rapids, will-it-scale, ops/s
>
> stock fstat	5088199
> patched fstat	7625244	(+49%)
> real fstat	8540383	(+67% / +12%)
>
> It dodges lockref et al, but it does not dodge SMAP which accounts for
> the difference.
>

I'm going to ask the glibc people later what's up with the change.

I'll note statx does not have a dedicated entry point which elides the
path lookup (I checked again ;)). It *did* have a way, but it got
removed in 1e2f82d1e9d1 ("statx: Kill fd-with-NULL-path support in
favour of AT_EMPTY_PATH").

One frequent consumer I know of is Rust.

Would be nice to have a non-hacky way to sort it out.

-- 
Mateusz Guzik <mjguzik gmail.com>
Re: [PATCH v2] x86: bring back rep movsq for user access on CPUs without ERMS
Posted by Linus Torvalds 2 years, 3 months ago
On Sun, 3 Sept 2023 at 11:49, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> Why _is_ glibc doing that thing?

Oh, I think I may see where the confusion may come from.

The glibc people found a "__NR_newfstatat", and thought it was a newer
version of 'stat', and didn't find any new versions of the basic
stat/fstat/lstat functions at all. So they thought that everything
else was implemented using that 'newfstatat()' entrypoint.

But on x86-64 (and most half-way newer architectures), the regular
__NR_stat *is* that "new" stat.

After all, it was only "new" in 1992, long before x86-64 even existed.

So maybe somebody noticed that "__NR_newfstatat" is the only system
call number that has that "new" in the stat name, and didn't realize
that that was just an odd naming thing for strange historical reasons.

The relevant system call numbers are

  #define __NR_stat 4
  #define __NR_fstat 5
  #define __NR_lstat 6
  #define __NR_newfstatat 262
  #define __NR_statx 332

and the numbering very much is about when they were added.

And yes, that "new" in "newfstatat" is misleading, it's the same
'struct stat' as those stat/fstat/lstat system calls (but not the
'statx' one, that has 'struct statx', of course).
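
IOW, on x86-64 a libc fstat() wrapper really could just issue the plain
fstat system call; that is the "real fstat" case from the numbers
earlier in the thread. A minimal sketch:

#include <sys/syscall.h>
#include <sys/stat.h>
#include <unistd.h>

/* __NR_fstat (5) already fills the very same 'struct stat' layout on
 * x86-64, no newfstatat(fd, "", buf, AT_EMPTY_PATH) detour needed. */
static int raw_fstat(int fd, struct stat *sb)
{
        return syscall(SYS_fstat, fd, sb);
}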

On x86-32, which has that extra decade of history, you end up with

  #define __NR_oldstat 18
  #define __NR_oldfstat 28
  #define __NR_oldlstat 84
  #define __NR_stat 106
  #define __NR_lstat 107
  #define __NR_fstat 108
  #define __NR_stat64 195
  #define __NR_lstat64 196
  #define __NR_fstat64 197
  #define __NR_fstatat64 300
  #define __NR_statx 383

where 106-108 are those "new" stat calls.  And then the stat64 ones
are the "new new" ones and the only version that was relevant by the
time we got the 'at()' version.

              Linus
Re: [PATCH v2] x86: bring back rep movsq for user access on CPUs without ERMS
Posted by Linus Torvalds 2 years, 3 months ago
On Fri, 1 Sept 2023 at 08:20, Mateusz Guzik <mjguzik@gmail.com> wrote:
>
> cp_new_stat and the counterpart for statx can dodge this rep mov by
> filling user memory directly.

Yeah, they could be made to use the "unsafe_put_user()" machinery
these days, and we could go back to the good old days of avoiding the
extra temp buffer.
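
Something along these lines, as a rough sketch only (untested, most
fields elided, and the explicit padding handling is the part that needs
care, as discussed above):

static int cp_new_stat_direct(struct kstat *stat, struct stat __user *statbuf)
{
	if (!user_access_begin(statbuf, sizeof(*statbuf)))
		return -EFAULT;

	unsafe_put_user(new_encode_dev(stat->dev), &statbuf->st_dev, Efault);
	unsafe_put_user(stat->ino, &statbuf->st_ino, Efault);
	unsafe_put_user(stat->mode, &statbuf->st_mode, Efault);
	unsafe_put_user(stat->nlink, &statbuf->st_nlink, Efault);
	unsafe_put_user(stat->size, &statbuf->st_size, Efault);
	/* ... remaining fields, plus explicit zeroing of any padding ... */

	user_access_end();
	return 0;
Efault:
	user_access_end();
	return -EFAULT;
}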

> I'm going to patch this, but first I want to address the bigger
> problem of glibc implementing fstat as newfstatat, demolishing perf of
> that op. In their defense currently they have no choice as this is the
> only exporter of the "new" struct stat. I'll be sending a long email
> to fsdevel soon(tm) with a proposed fix.

I wouldn't mind reinstating the "copy directly to user space rather
than go through a temporary buffer" approach for the stat family of
functions, so please do.

> So I was wondering if rep movsq is any worse than ERMS'ed rep movsb
> when there is no tail to handle and the buffer is aligned to a page,
> or more to the point if clear_page gets any benefit for going with
> movsb.

Hard to tell. 'movsq' is *historically* better, and likely still
better on all current microarchitectures.

But 'movsb' is actually in many ways easier for the CPU to optimize,
because there's no question of sub-chunking if anything is not aligned
just right.
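
To make the difference concrete, a plain C sketch (not the kernel's
copy routines): with 'movsb' the byte count goes straight into RCX,
while 'movsq' only moves whole quadwords and leaves a 0-7 byte tail to
be handled separately.

static void copy_movsb(void *dst, const void *src, unsigned long len)
{
	/* byte count goes straight into RCX, no tail to deal with */
	asm volatile("rep movsb"
		     : "+D" (dst), "+S" (src), "+c" (len)
		     : : "memory");
}

static void copy_movsq(void *dst, const void *src, unsigned long len)
{
	unsigned long qwords = len >> 3;
	unsigned char *d = dst;
	const unsigned char *s = src;

	/* bulk copy in 8-byte chunks; the CPU advances RSI/RDI as it goes */
	asm volatile("rep movsq"
		     : "+D" (d), "+S" (s), "+c" (qwords)
		     : : "memory");

	/* the 0-7 byte tail has to be copied by hand */
	len &= 7;
	while (len--)
		*d++ = *s++;
}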

             Linus
Re: [PATCH v2] x86: bring back rep movsq for user access on CPUs without ERMS
Posted by Linus Torvalds 2 years, 3 months ago
On Wed, 30 Aug 2023 at 07:03, Mateusz Guzik <mjguzik@gmail.com> wrote:
>
> Hand-rolled mov loops executing in this case are quite pessimal compared
> to rep movsq for bigger sizes. While the upper limit depends on uarch,
> everyone is well south of 1KB AFAICS and sizes bigger than that are
> common.
>
> While technically ancient CPUs may be suffering from rep usage, gcc has
> been emitting it for years all over kernel code, so I don't think this
> is a legitimate concern.
>
> Sample result from read1_processes from will-it-scale (4KB reads/s):
> before: 1507021
> after:  1721828 (+14%)

Ok, patch looks fine to me now.

So I applied this directly to my tree, since I was the one doing the
x86 memcpy cleanups that removed the REP_GOOD hackery anyway.

              Linus