[PATCH] x86: write aligned to 8 bytes in copy_user_generic (when without FSRM/ERMS)

Posted by Herton R. Krzesinski 9 months ago
Since the upstream series with user copy updates was merged with
commit a5624566431d ("Merge branch 'x86-rep-insns': x86 user copy
clarifications"), copy_user_generic() on x86_64 stopped aligning
writes to the destination to an 8-byte boundary for the non-FSRM
case. Previously, this was done through the ALIGN_DESTINATION macro
used in the now removed copy_user_generic_unrolled() function.

It turns out this may cause some loss of performance/throughput in some
use cases on specific CPUs/platforms without FSRM and ERMS. Recently I
got two reports of performance/throughput issues after a RHEL 9 kernel
pulled the same upstream series with updates to the user copy
functions. Both reports consisted of running specific networking/TCP
related testing using iperf3. The first report was about Linux bridge
testing using VMs on a specific machine with an AMD CPU (EPYC 7402),
and after a brief investigation it turned out that the later change
through commit ca96b162bfd2 ("x86: bring back rep movsq for user access
on CPUs without ERMS") helped/fixed the performance issue.

However, after that later commit/fix was applied, I got another
regression report from a multistream TCP test on a 100Gbit mlx5 NIC,
also running on an AMD based platform (AMD EPYC 7302 CPU) and again
using iperf3 to run the test. That regression showed up even with the
later fix/commit applied, so that fix alone didn't tell the whole story.

So I narrowed down the second regression use case by running it
without traffic through a NIC, on localhost, trying to isolate CPU
usage and avoid being limited by other factors like network bandwidth.
I used another system, also with an AMD CPU (AMD EPYC 7742). Basically,
I ran iperf3 in server and client mode on the same system, for example:

- Start the server binding it to CPU core/thread 19:
$ taskset -c 19 iperf3 -D -s -B 127.0.0.1 -p 12000

- Start the client always binding/running on CPU core/thread 17, using
perf to get statistics:
$ perf stat -o stat.txt taskset -c 17 iperf3 -c 127.0.0.1 -b 0/1000 -V \
    -n 50G --repeating-payload -l 16384 -p 12000 --cport 12001 2>&1 \
    > stat-19.txt

The client was always running/pinned to CPU 17, while the iperf3 server
was run pinned to CPU 19, 21 or 23, or not pinned to any specific CPU.
So it basically consisted of four runs of the same commands, just
changing the CPU to which the server is pinned, or without pinning by
removing the taskset call before the server command. The CPUs were
chosen based on the NUMA node they were on; this is the relevant output
of lscpu on the system:

$ lscpu
...
  Model name:             AMD EPYC 7742 64-Core Processor
...
Caches (sum of all):
  L1d:                    2 MiB (64 instances)
  L1i:                    2 MiB (64 instances)
  L2:                     32 MiB (64 instances)
  L3:                     256 MiB (16 instances)
NUMA:
  NUMA node(s):           4
  NUMA node0 CPU(s):      0,1,8,9,16,17,24,25,32,33,40,41,48,49,56,57,64,65,72,73,80,81,88,89,96,97,104,105,112,113,120,121
  NUMA node1 CPU(s):      2,3,10,11,18,19,26,27,34,35,42,43,50,51,58,59,66,67,74,75,82,83,90,91,98,99,106,107,114,115,122,123
  NUMA node2 CPU(s):      4,5,12,13,20,21,28,29,36,37,44,45,52,53,60,61,68,69,76,77,84,85,92,93,100,101,108,109,116,117,124,125
  NUMA node3 CPU(s):      6,7,14,15,22,23,30,31,38,39,46,47,54,55,62,63,70,71,78,79,86,87,94,95,102,103,110,111,118,119,126,127
...

For the server runs, when picking a CPU, I chose CPUs that were not on
the same NUMA node as the client. The reason is that this way I was
able to measure relevant performance differences when changing the
alignment of the writes to the destination in copy_user_generic().
The tables below summarize an example set of results I got:

* No alignment case:
             CPU      RATE          SYS          TIME     sender-receiver
Server bind   19: 13.0Gbits/sec 28.371851000 33.233499566 86.9%-70.8%
Server bind   21: 12.9Gbits/sec 28.283381000 33.586486621 85.8%-69.9%
Server bind   23: 11.1Gbits/sec 33.660190000 39.012243176 87.7%-64.5%
Server bind none: 18.9Gbits/sec 19.215339000 22.875117865 86.0%-80.5%

* With this patch (aligning writes in the non-ERMS/FSRM case):
             CPU      RATE          SYS          TIME     sender-receiver
Server bind   19: 20.8Gbits/sec 14.897284000 20.811101382 75.7%-89.0%
Server bind   21: 20.4Gbits/sec 15.205055000 21.263165909 75.4%-89.7%
Server bind   23: 20.2Gbits/sec 15.433801000 21.456175000 75.5%-89.8%
Server bind none: 26.1Gbits/sec 12.534022000 16.632447315 79.8%-89.6%

So I consistently got better results when aligning the write. The
results above were gathered on 6.14.0-rc6/rc7 based kernels. SYS is the
sys time, TIME the total time to run/transfer 50G of data, and the last
field the CPU usage of the sender/receiver iperf3 processes. It's also
worth noting that each pair of iperf3 runs may give slightly different
results on each run, but I always got consistently higher results with
the write alignment for this specific test of running the processes
on CPUs in different NUMA nodes.

Linus Torvalds helped/provided this version of the patch. Initially I
proposed a version which aligned writes for all cases in
rep_movs_alternative; however, it used two extra registers, so Linus
provided an enhanced version that only aligns the write in the
large_movsq case. That is sufficient, since the problem happens only
on those AMD CPUs without ERMS/FSRM like the ones mentioned above, and
it also doesn't require using extra registers. I validated that
aligning only in the large_movsq case is really enough to get the
performance back. I also tested this patch on an old Intel based system
without ERMS/FSRM (with a Xeon E5-2667 - Sandy Bridge based) and didn't
see any problems (no performance enhancement, but no regression either,
using the same iperf3 based benchmark). Newer Intel processors after
Sandy Bridge usually have ERMS and should not be affected by this
change.
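
For illustration, the destination-alignment step in the patch below is
roughly equivalent to this C sketch (illustrative only, not kernel
code; the helper name is made up):

  #include <stddef.h>
  #include <stdint.h>
  #include <string.h>

  /*
   * Sketch of what .Llarge_movsq now does before the bulk copy: copy
   * one possibly unaligned 8-byte word, then advance both pointers so
   * that the destination becomes 8-byte aligned. Assumes len >= 8,
   * which holds on this path since it is only taken for large copies.
   */
  static size_t align_dst_to_8(unsigned char **dst,
                               const unsigned char **src, size_t len)
  {
          uint64_t word;

          /* 0:/1: in the asm - the first, possibly unaligned, word */
          memcpy(&word, *src, 8);
          memcpy(*dst, &word, 8);

          /* leaq/andq/subq: bytes until *dst is 8-byte aligned (1..8) */
          size_t off = (((uintptr_t)*dst + 8) & ~(uintptr_t)7)
                       - (uintptr_t)*dst;

          /* addq/addq/subq: advance pointers, shrink remaining count */
          *dst += off;
          *src += off;
          return len - off;
  }

Note that when the destination is already aligned the offset comes out
as 8, so the word copied up front is simply not copied again by the
following rep movsq.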

Fixes: ca96b162bfd2 ("x86: bring back rep movsq for user access on CPUs without ERMS")
Fixes: 034ff37d3407 ("x86: rewrite '__copy_user_nocache' function")
Reported-by: Ondrej Lichtner <olichtne@redhat.com>
Co-developed-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Herton R. Krzesinski <herton@redhat.com>
---
 arch/x86/lib/copy_user_64.S | 18 ++++++++++++++++++
 1 file changed, 18 insertions(+)

diff --git a/arch/x86/lib/copy_user_64.S b/arch/x86/lib/copy_user_64.S
index fc9fb5d06174..b8f74d80f35c 100644
--- a/arch/x86/lib/copy_user_64.S
+++ b/arch/x86/lib/copy_user_64.S
@@ -74,6 +74,24 @@ SYM_FUNC_START(rep_movs_alternative)
 	_ASM_EXTABLE_UA( 0b, 1b)
 
 .Llarge_movsq:
+	/* Do the first possibly unaligned word */
+0:	movq (%rsi),%rax
+1:	movq %rax,(%rdi)
+
+	_ASM_EXTABLE_UA( 0b, .Lcopy_user_tail)
+	_ASM_EXTABLE_UA( 1b, .Lcopy_user_tail)
+
+	/* What would be the offset to the aligned destination? */
+	leaq 8(%rdi),%rax
+	andq $-8,%rax
+	subq %rdi,%rax
+
+	/* .. and update pointers and count to match */
+	addq %rax,%rdi
+	addq %rax,%rsi
+	subq %rax,%rcx
+
+	/* make %rcx contain the number of words, %rax the remainder */
 	movq %rcx,%rax
 	shrq $3,%rcx
 	andl $7,%eax
-- 
2.47.1
Re: [PATCH] x86: write aligned to 8 bytes in copy_user_generic (when without FSRM/ERMS)
Posted by Nicholas Sielicki 7 months, 2 weeks ago
On Thu, Mar 20, 2025 at 3:22 PM Herton R. Krzesinski <herton@redhat.com> wrote:
> History of the performance regression:
> ======================================
> 
> Since the following series of user copy updates were merged upstream
> ~2 years ago via:
> 
>   a5624566431d ("Merge branch 'x86-rep-insns': x86 user copy clarifications")
> 
> .. copy_user_generic() on x86_64 stopped doing alignment of the
> writes to the destination to a 8 byte boundary for the non FSRM case.
> 
> Previously, this was done through the ALIGN_DESTINATION macro that
> was used in the now removed copy_user_generic_unrolled function.
> 
> Turns out this change causes some loss of performance/throughput on
> some use cases and specific CPU/platforms without FSRM and ERMS.
> 
> Lately I got two reports of performance/throughput issues after a
> RHEL 9 kernel pulled the same upstream series with updates to user
> copy functions. Both reports consisted of running specific
> networking/TCP related testing using iperf3.
> 
> Partial upstream fix
> ====================
> 
> The first report was related to a Linux Bridge testing using VMs on a
> specific machine with an AMD CPU (EPYC 7402), and after a brief
> investigation it turned out that the later change via:
> 
>   ca96b162bfd2 ("x86: bring back rep movsq for user access on CPUs without ERMS")
> 
> ... helped/fixed the performance issue.
> 
> However, after the later commit/fix was applied, then I got another
> regression reported in a multistream TCP test on a 100Gbit mlx5 nic, also
> running on an AMD based platform (AMD EPYC 7302 CPU), again that was using
> iperf3 to run the test. That regression was after applying the later
> fix/commit, but only this didn't help in telling the whole history.

I went down the rabbit-hole of this issue in late 2022 at $work, which
we hit while running the same iperf3 single-flow workload described
above on a Milan system. It took me some time (much longer
than I'd like to admit), but eventually I started asking questions about
FSRM and ERMS, and stumbled across the lkml threads surrounding the
+FSRM/-ERMS alternative.

Before arriving at that root cause, I had noticed that tx-nocache-copy /
NETIF_F_NOCACHE_COPY was able to considerably improve perf compared to
baseline. Is that interesting to anyone?

I did a bit of research on the history of tx-nocache-copy, and how it
might relate to znver3, and walked away believing the following story to
be mostly true:

1. tx-nocache-copy was introduced in 2011 and was initially enabled by
default on all non-loopback interfaces. It was tested on an AMD
bulldozer-like system, where it showed a significant improvement in
tail latency and a 5%-10% improvement in transactions per second.

2. A year later, for products released in 2012, Intel introduced
something called DDIO. My entire understanding of DDIO comes from a
single PDF [1], so take this with a massive grain of salt, but from what
I understand it was intended to largely solve the same problem that
tx-nocache-copy was optimizing for, and I think it did so in a way that
broke many of the underlying assumptions that tx-nocache-copy was
relying on.

In other words, it didn't just make tx-nocache-copy unnecessary, but
made its usage actively harmful, for two reasons:

+ DDIO operates atop dedicated cache ways. So if that reserved cache
  space is not used, it doesn't result in any less cache contention for
  other uses; it just means wasting that cache space. Remote reads from
  the NIC of data held in cache stopped resulting in a write-back to main
  memory from the cache, which you might otherwise expect to occur under
  most coherency protocols; their state machine grew a carve-out for this
  specific flow.

  So if your motivation for issuing a non-temporal write is any of:

   a) to avoid your write from evicting other useful data from the cache
   b) to avoid coherency traffic triggered by the remote read.
   c) you anticipate a significant amount of time and/or cache churn to
      occur between now and when the remote read takes place, and you feel
      it's reasonable to suspect that the data will have been evicted from
      cache into main memory by then, anyway.

  all of them make less sense on a system with DDIO.
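
  (For reference, a non-temporal store - the kind of cache-bypassing
  write tx-nocache-copy relies on - looks roughly like this C sketch;
  illustrative only, not the kernel's actual code path:

    #include <emmintrin.h>  /* _mm_stream_si64(), _mm_sfence() */

    /* Store 'val' to *dst with a non-temporal hint, bypassing cache. */
    static void nt_store64(long long *dst, long long val)
    {
            _mm_stream_si64(dst, val);
            _mm_sfence();   /* order/flush the streaming store */
    }
  )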

+ Because reads by the NIC are expected to usually be directly serviced
  from cache, Intel also stopped issuing speculative reads against main
  memory as early as they otherwise could, under the assumption that it
  would be especially rare for it to be used.

  This means that on systems with DDIO, if you elect to use the non-temporal
  hints, those remote reads become extra slow, because they are
  serialized with the read attempt against the cache.

Putting these two together, tx-nocache-copy stopped making sense, and
the best thing the kernel could do was to play dumb.

3. By 2017, after many complaints, and in a world where almost everyone
in the high-performance networking/server space was using an Intel
platform with DDIO, tx-nocache-copy was changed to be disabled by default.

I didn't see DDIO mentioned in any of the on-list discussions of
tx-nocache-copy that I could find, which led me to wonder if this is the
real story behind tx-nocache-copy: a reasonable feature, introduced with
unlucky timing, which ultimately fell victim to a hardware monoculture
without anyone fully realizing it.

If that story is true, then it might suggest that tx-nocache-copy still
has merit on current Zen systems (which have nothing similar to DDIO,
as far as I'm aware). At least in late 2022, under the original
unpatched kernels that were available at the time, I can report that it
did. I no longer work at $work, so I have no ability to retest myself,
but I'd be curious to hear the results from anyone who finds this
interesting enough to look into.

[1]: https://www.intel.com/content/dam/www/public/us/en/documents/technology-briefs/data-direct-i-o-technology-brief.pdf
The relevant section for tx-nocache-copy is "2.2".
Re: [PATCH] x86: write aligned to 8 bytes in copy_user_generic (when without FSRM/ERMS)
Posted by Mateusz Guzik 9 months ago
On Thu, Mar 20, 2025 at 3:22 PM Herton R. Krzesinski <herton@redhat.com> wrote:
> diff --git a/arch/x86/lib/copy_user_64.S b/arch/x86/lib/copy_user_64.S
> index fc9fb5d06174..b8f74d80f35c 100644
> --- a/arch/x86/lib/copy_user_64.S
> +++ b/arch/x86/lib/copy_user_64.S
> @@ -74,6 +74,24 @@ SYM_FUNC_START(rep_movs_alternative)
>         _ASM_EXTABLE_UA( 0b, 1b)
>
>  .Llarge_movsq:
> +       /* Do the first possibly unaligned word */
> +0:     movq (%rsi),%rax
> +1:     movq %rax,(%rdi)
> +
> +       _ASM_EXTABLE_UA( 0b, .Lcopy_user_tail)
> +       _ASM_EXTABLE_UA( 1b, .Lcopy_user_tail)
> +
> +       /* What would be the offset to the aligned destination? */
> +       leaq 8(%rdi),%rax
> +       andq $-8,%rax
> +       subq %rdi,%rax
> +
> +       /* .. and update pointers and count to match */
> +       addq %rax,%rdi
> +       addq %rax,%rsi
> +       subq %rax,%rcx
> +
> +       /* make %rcx contain the number of words, %rax the remainder */
>         movq %rcx,%rax
>         shrq $3,%rcx
>         andl $7,%eax

The patch looks fine to me, but there is more to do if you are up for it.

It has been quite some time since I last seriously played with this
area and I don't remember all the details; on top of that, the
realities of uarchs have probably improved.

That said, have you experimented with aligning the target to 16 or
more bytes?

Moreover, I have some recollection that there were uarchs with ERMS
which also liked the target to be aligned -- as in perhaps this should
be done regardless of FSRM?

And most importantly, memset, memcpy and clear_user could all use a
revamp, and they are missing rep handling for bigger sizes (I verified
they *do* show up). Not only that, but memcpy uses overlapping stores
while memset just loops over stuff.
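
To illustrate what I mean by "overlapping stores" - a rough C sketch of
the general idea for an 8..16 byte copy, not the kernel's actual code:

  #include <stddef.h>
  #include <stdint.h>
  #include <string.h>

  /* Two 8-byte accesses that may overlap in the middle, instead of a
   * byte loop; valid for 8 <= len <= 16. */
  static void copy_8_to_16(void *dst, const void *src, size_t len)
  {
          uint64_t head, tail;

          memcpy(&head, src, 8);
          memcpy(&tail, (const unsigned char *)src + len - 8, 8);
          memcpy(dst, &head, 8);
          memcpy((unsigned char *)dst + len - 8, &tail, 8);
  }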

I intended to sort this out a long time ago and maybe will find some
time now that I got reminded of it, but I would be delighted if it got
picked up.

Hacking this up is just some screwing around; the real time-consuming
part is the benchmarking, so I completely understand if you are not
interested.

-- 
Mateusz Guzik <mjguzik gmail.com>
Re: [PATCH] x86: write aligned to 8 bytes in copy_user_generic (when without FSRM/ERMS)
Posted by Herton Krzesinski 9 months ago
On Thu, Mar 20, 2025 at 11:36 AM Mateusz Guzik <mjguzik@gmail.com> wrote:
>
> On Thu, Mar 20, 2025 at 3:22 PM Herton R. Krzesinski <herton@redhat.com> wrote:
> > diff --git a/arch/x86/lib/copy_user_64.S b/arch/x86/lib/copy_user_64.S
> > index fc9fb5d06174..b8f74d80f35c 100644
> > --- a/arch/x86/lib/copy_user_64.S
> > +++ b/arch/x86/lib/copy_user_64.S
> > @@ -74,6 +74,24 @@ SYM_FUNC_START(rep_movs_alternative)
> >         _ASM_EXTABLE_UA( 0b, 1b)
> >
> >  .Llarge_movsq:
> > +       /* Do the first possibly unaligned word */
> > +0:     movq (%rsi),%rax
> > +1:     movq %rax,(%rdi)
> > +
> > +       _ASM_EXTABLE_UA( 0b, .Lcopy_user_tail)
> > +       _ASM_EXTABLE_UA( 1b, .Lcopy_user_tail)
> > +
> > +       /* What would be the offset to the aligned destination? */
> > +       leaq 8(%rdi),%rax
> > +       andq $-8,%rax
> > +       subq %rdi,%rax
> > +
> > +       /* .. and update pointers and count to match */
> > +       addq %rax,%rdi
> > +       addq %rax,%rsi
> > +       subq %rax,%rcx
> > +
> > +       /* make %rcx contain the number of words, %rax the remainder */
> >         movq %rcx,%rax
> >         shrq $3,%rcx
> >         andl $7,%eax
>
> The patch looks fine to me, but there is more to do if you are up for it.
>
> It was quite some time since I last seriously played with the area and
> I don't remember all the details, on top of that realities of uarchs
> probably improved.
>
> That said, have you experimented with aligning the target to 16 bytes
> or more bytes?

Yes, I tried 32-byte write alignment on an old Xeon (Sandy Bridge based)
and got no improvement, at least in the specific benchmark I'm running
here. Also, after your question, I tried 16-byte/32-byte alignment on
the AMD CPU as well and got no difference from the 8-byte alignment,
with the same benchmark. I also tried 8-byte alignment for the ERMS
case on Intel and got no difference on the systems I tested. I'm not
saying it can't improve some other case, just that in my specific
testing I couldn't measure any improvement.

>
> Moreover, I have some recollection that there were uarchs with ERMS
> which also liked the target to be aligned -- as in perhaps this should
> be done regardless of FSRM?

Where I tested I didn't see improvements; maybe there is some case that
benefits, but I didn't have one.

>
> And most importantly memset, memcpy and clear_user would all use a
> revamp and they are missing rep handling for bigger sizes (I verified
> they *do* show up). Not only that, but memcpy uses overlapping stores
> while memset just loops over stuff.
>
> I intended to sort it out long time ago and maybe will find some time
> now that I got reminded of it, but I would be deligthed if it got
> picked up.
>
> Hacking this up is just some screwing around, the real time consuming
> part is the benchmarking so I completely understand if you are not
> interested.

Yes, most of the time is spent on benchmarking. Maybe later I can try
to take a look, but I won't make any promises.

>
> --
> Mateusz Guzik <mjguzik gmail.com>
>
Re: [PATCH] x86: write aligned to 8 bytes in copy_user_generic (when without FSRM/ERMS)
Posted by Mateusz Guzik 9 months ago
On Thu, Mar 20, 2025 at 6:51 PM Herton Krzesinski <hkrzesin@redhat.com> wrote:
>
> On Thu, Mar 20, 2025 at 11:36 AM Mateusz Guzik <mjguzik@gmail.com> wrote:
> >
> > On Thu, Mar 20, 2025 at 3:22 PM Herton R. Krzesinski <herton@redhat.com> wrote:
> > > diff --git a/arch/x86/lib/copy_user_64.S b/arch/x86/lib/copy_user_64.S
> > > index fc9fb5d06174..b8f74d80f35c 100644
> > > --- a/arch/x86/lib/copy_user_64.S
> > > +++ b/arch/x86/lib/copy_user_64.S
> > > @@ -74,6 +74,24 @@ SYM_FUNC_START(rep_movs_alternative)
> > >         _ASM_EXTABLE_UA( 0b, 1b)
> > >
> > >  .Llarge_movsq:
> > > +       /* Do the first possibly unaligned word */
> > > +0:     movq (%rsi),%rax
> > > +1:     movq %rax,(%rdi)
> > > +
> > > +       _ASM_EXTABLE_UA( 0b, .Lcopy_user_tail)
> > > +       _ASM_EXTABLE_UA( 1b, .Lcopy_user_tail)
> > > +
> > > +       /* What would be the offset to the aligned destination? */
> > > +       leaq 8(%rdi),%rax
> > > +       andq $-8,%rax
> > > +       subq %rdi,%rax
> > > +
> > > +       /* .. and update pointers and count to match */
> > > +       addq %rax,%rdi
> > > +       addq %rax,%rsi
> > > +       subq %rax,%rcx
> > > +
> > > +       /* make %rcx contain the number of words, %rax the remainder */
> > >         movq %rcx,%rax
> > >         shrq $3,%rcx
> > >         andl $7,%eax
> >
> > The patch looks fine to me, but there is more to do if you are up for it.
> >
> > It was quite some time since I last seriously played with the area and
> > I don't remember all the details, on top of that realities of uarchs
> > probably improved.
> >
> > That said, have you experimented with aligning the target to 16 bytes
> > or more bytes?
>
> Yes I tried to do 32-byte write aligned on an old Xeon (Sandy Bridge based)
> and got no improvement at least in the specific benchmark I'm doing here.
> Also after your question here I tried 16-byte/32-byte on the AMD cpu as
> well and got no difference from the 8-byte alignment, same bench as well.
> I tried to do 8-byte alignment for the ERMS case on Intel and got no
> difference on the systems I tested. I'm not saying it may not improve in
> some other case, just that in my specific testing I couldn't tell/measure
> any improvement.
>

oof, I would not go as far back as Sandy Bridge. ;)

I think Skylake is the oldest yeller to worry about, if one insists on it.

That said, if memory serves right, these bufs like to be misaligned to
weird extents; it very well may be that in your tests aligning to 8 had
the side effect of aligning to 16 as well.

> >
> > Moreover, I have some recollection that there were uarchs with ERMS
> > which also liked the target to be aligned -- as in perhaps this should
> > be done regardless of FSRM?
>
> Where I tested I didn't see improvements but may be there is some case,
> but I didn't have any.
>
> >
> > And most importantly memset, memcpy and clear_user would all use a
> > revamp and they are missing rep handling for bigger sizes (I verified
> > they *do* show up). Not only that, but memcpy uses overlapping stores
> > while memset just loops over stuff.
> >
> > I intended to sort it out long time ago and maybe will find some time
> > now that I got reminded of it, but I would be deligthed if it got
> > picked up.
> >
> > Hacking this up is just some screwing around, the real time consuming
> > part is the benchmarking so I completely understand if you are not
> > interested.
>
> Yes, the most time you spend is on benchmarking. May be later I could
> try to take a look but will not put any promises on it.
>

Now I'm curious enough about what's up here. If I don't run out of
steam, I'm gonna cover memset and memcpy myself.

-- 
Mateusz Guzik <mjguzik gmail.com>
Re: [PATCH] x86: write aligned to 8 bytes in copy_user_generic (when without FSRM/ERMS)
Posted by David Laight 9 months ago
On Thu, 20 Mar 2025 19:02:21 +0100
Mateusz Guzik <mjguzik@gmail.com> wrote:

> On Thu, Mar 20, 2025 at 6:51 PM Herton Krzesinski <hkrzesin@redhat.com> wrote:
> >
> > On Thu, Mar 20, 2025 at 11:36 AM Mateusz Guzik <mjguzik@gmail.com> wrote:  
...
> > > That said, have you experimented with aligning the target to 16 bytes
> > > or more bytes?  
> >
> > Yes I tried to do 32-byte write aligned on an old Xeon (Sandy Bridge based)
> > and got no improvement at least in the specific benchmark I'm doing here.
> > Also after your question here I tried 16-byte/32-byte on the AMD cpu as
> > well and got no difference from the 8-byte alignment, same bench as well.
> > I tried to do 8-byte alignment for the ERMS case on Intel and got no
> > difference on the systems I tested. I'm not saying it may not improve in
> > some other case, just that in my specific testing I couldn't tell/measure
> > any improvement.
> >  
> 
> oof, I would not got as far back as Sandy Bridge. ;)

It is a boundary point.
Agner's tables (fairly reliable) have:

Sandy Bridge
Page 222
MOVS 5 4
REP MOVS 2n 1.5 n  worst case
REP MOVS 3/16B 1/16B best case

which is the same as Ivy Bridge - which you'd sort of expect since
Ivy Bridge is a minor update; Agner's tables have the same values for it.
Haswell jumps to 1/32B.

I didn't test Sandy Bridge (I've got one, powered off), but did test
Ivy Bridge. Neither the source nor destination alignment made any
difference at all.

As I said earlier, the only alignment that made any difference was
32-byte aligning the destination on Haswell (and later). That is needed
to get 32 bytes/clock rather than 16 bytes/clock.

> 
> I think Skylake is the oldest yeller to worry about, if one insists on it.
> 
> That said, if memory serves right these bufs like to be misaligned to
> weird extents, it very well may be in your tests aligning to 8 had a
> side effect of aligning it to 16 even.
> 
> > >
> > > Moreover, I have some recollection that there were uarchs with ERMS
> > > which also liked the target to be aligned -- as in perhaps this should
> > > be done regardless of FSRM?  

Dunno, the only report is some AMD CPU being slow with misaligned writes.
But that is the copy loop, not 'rep movsq'.
I don't have one to test.

> >
> > Where I tested I didn't see improvements but may be there is some case,
> > but I didn't have any.
> >  
> > >
> > > And most importantly memset, memcpy and clear_user would all use a
> > > revamp and they are missing rep handling for bigger sizes (I verified
> > > they *do* show up). Not only that, but memcpy uses overlapping stores
> > > while memset just loops over stuff.
> > >
> > > I intended to sort it out long time ago and maybe will find some time
> > > now that I got reminded of it, but I would be deligthed if it got
> > > picked up.
> > >
> > > Hacking this up is just some screwing around, the real time consuming
> > > part is the benchmarking so I completely understand if you are not
> > > interested.  
> >
> > Yes, the most time you spend is on benchmarking. May be later I could
> > try to take a look but will not put any promises on it.

I found I needed to use the performance counter to get a proper cycle
count, but then directly read the register to avoid all the 'library'
overhead, and add lfence/mfence on both sides of the cycle count read.
After subtracting the overhead of a 'null function' I could measure the
number of clocks each operation took, so I could tell when I was
actually getting 32 bytes copied per clock.

(Or, testing the IP checksum code, the number of bytes/clock - that can
get to 12.)
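
A minimal sketch of that kind of measurement harness, using RDTSC via
compiler intrinsics as a stand-in for a directly-read cycle PMC (the
TSC counts reference cycles rather than core clocks - illustrative
only):

  #include <stdint.h>
  #include <stdio.h>
  #include <x86intrin.h>          /* __rdtsc(), _mm_lfence() */

  /* Fence, read the counter, fence again, so the read can't move. */
  static inline uint64_t cycles(void)
  {
          _mm_lfence();
          uint64_t t = __rdtsc();
          _mm_lfence();
          return t;
  }

  static void null_fn(void) { }

  /* Time one call of fn(), minus previously measured timing overhead. */
  static uint64_t measure(void (*fn)(void), uint64_t overhead)
  {
          uint64_t t0 = cycles();
          fn();
          uint64_t t1 = cycles();
          return (t1 - t0) - overhead;
  }

  int main(void)
  {
          /* Calibrate against a 'null function' first. */
          uint64_t overhead = measure(null_fn, 0);
          printf("timing overhead: %llu cycles\n",
                 (unsigned long long)overhead);
          return 0;
  }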

	David

> >  
> 
> Now I'm curious enough what's up here. If I don't run out of steam,
> I'm gonna cover memset and memcpy myself.
> 
Re: [PATCH] x86: write aligned to 8 bytes in copy_user_generic (when without FSRM/ERMS)
Posted by Herton Krzesinski 9 months ago
On Thu, Mar 20, 2025 at 3:02 PM Mateusz Guzik <mjguzik@gmail.com> wrote:
>
> On Thu, Mar 20, 2025 at 6:51 PM Herton Krzesinski <hkrzesin@redhat.com> wrote:
> >
> > On Thu, Mar 20, 2025 at 11:36 AM Mateusz Guzik <mjguzik@gmail.com> wrote:
> > >
> > > On Thu, Mar 20, 2025 at 3:22 PM Herton R. Krzesinski <herton@redhat.com> wrote:
> > > > diff --git a/arch/x86/lib/copy_user_64.S b/arch/x86/lib/copy_user_64.S
> > > > index fc9fb5d06174..b8f74d80f35c 100644
> > > > --- a/arch/x86/lib/copy_user_64.S
> > > > +++ b/arch/x86/lib/copy_user_64.S
> > > > @@ -74,6 +74,24 @@ SYM_FUNC_START(rep_movs_alternative)
> > > >         _ASM_EXTABLE_UA( 0b, 1b)
> > > >
> > > >  .Llarge_movsq:
> > > > +       /* Do the first possibly unaligned word */
> > > > +0:     movq (%rsi),%rax
> > > > +1:     movq %rax,(%rdi)
> > > > +
> > > > +       _ASM_EXTABLE_UA( 0b, .Lcopy_user_tail)
> > > > +       _ASM_EXTABLE_UA( 1b, .Lcopy_user_tail)
> > > > +
> > > > +       /* What would be the offset to the aligned destination? */
> > > > +       leaq 8(%rdi),%rax
> > > > +       andq $-8,%rax
> > > > +       subq %rdi,%rax
> > > > +
> > > > +       /* .. and update pointers and count to match */
> > > > +       addq %rax,%rdi
> > > > +       addq %rax,%rsi
> > > > +       subq %rax,%rcx
> > > > +
> > > > +       /* make %rcx contain the number of words, %rax the remainder */
> > > >         movq %rcx,%rax
> > > >         shrq $3,%rcx
> > > >         andl $7,%eax
> > >
> > > The patch looks fine to me, but there is more to do if you are up for it.
> > >
> > > It was quite some time since I last seriously played with the area and
> > > I don't remember all the details, on top of that realities of uarchs
> > > probably improved.
> > >
> > > That said, have you experimented with aligning the target to 16 bytes
> > > or more bytes?
> >
> > Yes I tried to do 32-byte write aligned on an old Xeon (Sandy Bridge based)
> > and got no improvement at least in the specific benchmark I'm doing here.
> > Also after your question here I tried 16-byte/32-byte on the AMD cpu as
> > well and got no difference from the 8-byte alignment, same bench as well.
> > I tried to do 8-byte alignment for the ERMS case on Intel and got no
> > difference on the systems I tested. I'm not saying it may not improve in
> > some other case, just that in my specific testing I couldn't tell/measure
> > any improvement.
> >
>
> oof, I would not got as far back as Sandy Bridge. ;)
>
> I think Skylake is the oldest yeller to worry about, if one insists on it.
>
> That said, if memory serves right these bufs like to be misaligned to
> weird extents, it very well may be in your tests aligning to 8 had a
> side effect of aligning it to 16 even.

I went back to Sandy Bridge because it was the first one I found, going
from newer to older on the Intel line, that did not have ERMS (and of
course no FSRM). I did change the patch to do a 32-byte alignment
before rep movsb in the ERMS case and tested with a Xeon 8280
(Cascade Lake, which is enhanced Skylake, so hopefully that's valid
here), but I did not get improvements in my testing. You may be right
that when aligning to 8 bytes it might end up 16-byte aligned as well,
but in practice it made no difference in my testing.

>
> > >
> > > Moreover, I have some recollection that there were uarchs with ERMS
> > > which also liked the target to be aligned -- as in perhaps this should
> > > be done regardless of FSRM?
> >
> > Where I tested I didn't see improvements but may be there is some case,
> > but I didn't have any.
> >
> > >
> > > And most importantly memset, memcpy and clear_user would all use a
> > > revamp and they are missing rep handling for bigger sizes (I verified
> > > they *do* show up). Not only that, but memcpy uses overlapping stores
> > > while memset just loops over stuff.
> > >
> > > I intended to sort it out long time ago and maybe will find some time
> > > now that I got reminded of it, but I would be deligthed if it got
> > > picked up.
> > >
> > > Hacking this up is just some screwing around, the real time consuming
> > > part is the benchmarking so I completely understand if you are not
> > > interested.
> >
> > Yes, the most time you spend is on benchmarking. May be later I could
> > try to take a look but will not put any promises on it.
> >
>
> Now I'm curious enough what's up here. If I don't run out of steam,
> I'm gonna cover memset and memcpy myself.
>
> --
> Mateusz Guzik <mjguzik gmail.com>
>
[tip: x86/urgent] x86/uaccess: Improve performance by aligning writes to 8 bytes in copy_user_generic(), on non-FSRM/ERMS CPUs
Posted by tip-bot2 for Herton R. Krzesinski 8 months, 3 weeks ago
The following commit has been merged into the x86/urgent branch of tip:

Commit-ID:     b5322b6ec06a6c58650f52abcd2492000396363b
Gitweb:        https://git.kernel.org/tip/b5322b6ec06a6c58650f52abcd2492000396363b
Author:        Herton R. Krzesinski <herton@redhat.com>
AuthorDate:    Thu, 20 Mar 2025 11:22:13 -03:00
Committer:     Ingo Molnar <mingo@kernel.org>
CommitterDate: Fri, 28 Mar 2025 22:57:44 +01:00

x86/uaccess: Improve performance by aligning writes to 8 bytes in copy_user_generic(), on non-FSRM/ERMS CPUs

History of the performance regression:
======================================

Since the following series of user copy updates were merged upstream
~2 years ago via:

  a5624566431d ("Merge branch 'x86-rep-insns': x86 user copy clarifications")

.. copy_user_generic() on x86_64 stopped doing alignment of the
writes to the destination to a 8 byte boundary for the non FSRM case.

Previously, this was done through the ALIGN_DESTINATION macro that
was used in the now removed copy_user_generic_unrolled function.

Turns out this change causes some loss of performance/throughput on
some use cases and specific CPU/platforms without FSRM and ERMS.

Lately I got two reports of performance/throughput issues after a
RHEL 9 kernel pulled the same upstream series with updates to user
copy functions. Both reports consisted of running specific
networking/TCP related testing using iperf3.

Partial upstream fix
====================

The first report was related to a Linux Bridge testing using VMs on a
specific machine with an AMD CPU (EPYC 7402), and after a brief
investigation it turned out that the later change via:

  ca96b162bfd2 ("x86: bring back rep movsq for user access on CPUs without ERMS")

... helped/fixed the performance issue.

However, after the later commit/fix was applied, then I got another
regression reported in a multistream TCP test on a 100Gbit mlx5 nic, also
running on an AMD based platform (AMD EPYC 7302 CPU), again that was using
iperf3 to run the test. That regression was after applying the later
fix/commit, but only this didn't help in telling the whole history.

Testing performed to pinpoint residual regression
=================================================

So I narrowed down the second regression use case, but running it
without traffic through a NIC, on localhost, in trying to narrow down
CPU usage and not being limited by other factor like network bandwidth.
I used another system also with an AMD CPU (AMD EPYC 7742). Basically,
I run iperf3 in server and client mode in the same system, for example:

 - Start the server binding it to CPU core/thread 19:
   $ taskset -c 19 iperf3 -D -s -B 127.0.0.1 -p 12000

 - Start the client always binding/running on CPU core/thread 17, using
   perf to get statistics:
   $ perf stat -o stat.txt taskset -c 17 iperf3 -c 127.0.0.1 -b 0/1000 -V \
       -n 50G --repeating-payload -l 16384 -p 12000 --cport 12001 2>&1 \
       > stat-19.txt

For the client, always running/pinned to CPU 17. But for the iperf3 in
server mode, I did test runs using CPUs 19, 21, 23 or not pinned to any
specific CPU. So it basically consisted with four runs of the same
commands, just changing the CPU which the server is pinned, or without
pinning by removing the taskset call before the server command. The CPUs
were chosen based on NUMA node they were on, this is the relevant output
of lscpu on the system:

  $ lscpu
  ...
    Model name:             AMD EPYC 7742 64-Core Processor
  ...
  Caches (sum of all):
    L1d:                    2 MiB (64 instances)
    L1i:                    2 MiB (64 instances)
    L2:                     32 MiB (64 instances)
    L3:                     256 MiB (16 instances)
  NUMA:
    NUMA node(s):           4
    NUMA node0 CPU(s):      0,1,8,9,16,17,24,25,32,33,40,41,48,49,56,57,64,65,72,73,80,81,88,89,96,97,104,105,112,113,120,121
    NUMA node1 CPU(s):      2,3,10,11,18,19,26,27,34,35,42,43,50,51,58,59,66,67,74,75,82,83,90,91,98,99,106,107,114,115,122,123
    NUMA node2 CPU(s):      4,5,12,13,20,21,28,29,36,37,44,45,52,53,60,61,68,69,76,77,84,85,92,93,100,101,108,109,116,117,124,125
    NUMA node3 CPU(s):      6,7,14,15,22,23,30,31,38,39,46,47,54,55,62,63,70,71,78,79,86,87,94,95,102,103,110,111,118,119,126,127
  ...

So for the server run, when picking a CPU, I chose CPUs to be not on the same
node. The reason is with that I was able to get/measure relevant
performance differences when changing the alignment of the writes to the
destination in copy_user_generic.

Testing shows up to +81% performance improvement under iperf3
=============================================================

Here's a summary of the iperf3 runs:

  # Vanilla upstream alignment:

		     CPU      RATE          SYS          TIME     sender-receiver
	Server bind   19: 13.0Gbits/sec 28.371851000 33.233499566 86.9%-70.8%
	Server bind   21: 12.9Gbits/sec 28.283381000 33.586486621 85.8%-69.9%
	Server bind   23: 11.1Gbits/sec 33.660190000 39.012243176 87.7%-64.5%
	Server bind none: 18.9Gbits/sec 19.215339000 22.875117865 86.0%-80.5%

  # With the attached patch (aligning writes in non ERMS/FSRM case):

		     CPU      RATE          SYS          TIME     sender-receiver
	Server bind   19: 20.8Gbits/sec 14.897284000 20.811101382 75.7%-89.0%
	Server bind   21: 20.4Gbits/sec 15.205055000 21.263165909 75.4%-89.7%
	Server bind   23: 20.2Gbits/sec 15.433801000 21.456175000 75.5%-89.8%
	Server bind none: 26.1Gbits/sec 12.534022000 16.632447315 79.8%-89.6%

So I consistently got better results when aligning the write. The
results above were run on 6.14.0-rc6/rc7 based kernels. The sys is sys
time and then the total time to run/transfer 50G of data. The last
field is the CPU usage of sender/receiver iperf3 process. It's also
worth to note that each pair of iperf3 runs may get slightly different
results on each run, but I always got consistent higher results with
the write alignment for this specific test of running the processes
on CPUs in different NUMA nodes.

Linus Torvalds helped/provided this version of the patch. Initially I
proposed a version which aligned writes for all cases in
rep_movs_alternative, however it used two extra registers and thus
Linus provided an enhanced version that only aligns the write on the
large_movsq case, which is sufficient since the problem happens only
on those AMD CPUs like ones mentioned above without ERMS/FSRM, and
also doesn't require using extra registers. Also, I validated that
aligning only on large_movsq case is really enough for getting the
performance back.

I also tested this patch on an old Intel based non-ERMS/FRMS system
(with Xeon E5-2667 - Sandy Bridge based) and didn't get any problems:
no performance enhancement but also no regression either, using the
same iperf3 based benchmark. Also newer Intel processors after
Sandy Bridge usually have ERMS and should not be affected by this change.

[ mingo: Updated the changelog. ]

Fixes: ca96b162bfd2 ("x86: bring back rep movsq for user access on CPUs without ERMS")
Fixes: 034ff37d3407 ("x86: rewrite '__copy_user_nocache' function")
Reported-by: Ondrej Lichtner <olichtne@redhat.com>
Co-developed-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Herton R. Krzesinski <herton@redhat.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20250320142213.2623518-1-herton@redhat.com
---
 arch/x86/lib/copy_user_64.S | 18 ++++++++++++++++++
 1 file changed, 18 insertions(+)

diff --git a/arch/x86/lib/copy_user_64.S b/arch/x86/lib/copy_user_64.S
index aa8c341..06296eb 100644
--- a/arch/x86/lib/copy_user_64.S
+++ b/arch/x86/lib/copy_user_64.S
@@ -77,6 +77,24 @@ SYM_FUNC_START(rep_movs_alternative)
 	_ASM_EXTABLE_UA( 0b, 1b)
 
 .Llarge_movsq:
+	/* Do the first possibly unaligned word */
+0:	movq (%rsi),%rax
+1:	movq %rax,(%rdi)
+
+	_ASM_EXTABLE_UA( 0b, .Lcopy_user_tail)
+	_ASM_EXTABLE_UA( 1b, .Lcopy_user_tail)
+
+	/* What would be the offset to the aligned destination? */
+	leaq 8(%rdi),%rax
+	andq $-8,%rax
+	subq %rdi,%rax
+
+	/* .. and update pointers and count to match */
+	addq %rax,%rdi
+	addq %rax,%rsi
+	subq %rax,%rcx
+
+	/* make %rcx contain the number of words, %rax the remainder */
 	movq %rcx,%rax
 	shrq $3,%rcx
 	andl $7,%eax
Re: [tip: x86/urgent] x86/uaccess: Improve performance by aligning writes to 8 bytes in copy_user_generic(), on non-FSRM/ERMS CPUs
Posted by Ingo Molnar 8 months, 3 weeks ago
Linus,

* tip-bot2 for Herton R. Krzesinski <tip-bot2@linutronix.de> wrote:

> The following commit has been merged into the x86/urgent branch of tip:
> 
> Commit-ID:     b5322b6ec06a6c58650f52abcd2492000396363b
> Gitweb:        https://git.kernel.org/tip/b5322b6ec06a6c58650f52abcd2492000396363b
> Author:        Herton R. Krzesinski <herton@redhat.com>
> AuthorDate:    Thu, 20 Mar 2025 11:22:13 -03:00
> Committer:     Ingo Molnar <mingo@kernel.org>
> CommitterDate: Fri, 28 Mar 2025 22:57:44 +01:00
> 
> x86/uaccess: Improve performance by aligning writes to 8 bytes in copy_user_generic(), on non-FSRM/ERMS CPUs
...

> [ mingo: Updated the changelog. ]
> 
> Fixes: ca96b162bfd2 ("x86: bring back rep movsq for user access on CPUs without ERMS")
> Fixes: 034ff37d3407 ("x86: rewrite '__copy_user_nocache' function")
> Reported-by: Ondrej Lichtner <olichtne@redhat.com>
> Co-developed-by: Linus Torvalds <torvalds@linux-foundation.org>
> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
> Signed-off-by: Herton R. Krzesinski <herton@redhat.com>
> Signed-off-by: Ingo Molnar <mingo@kernel.org>
> Link: https://lore.kernel.org/r/20250320142213.2623518-1-herton@redhat.com

I have added in Linus's Signed-off-by tag, to make this SOB chain 
valid. Let me know if that's not OK.

Thanks,

	Ingo
[tip: x86/urgent] x86/uaccess: Improve performance by aligning writes to 8 bytes in copy_user_generic(), on non-FSRM/ERMS CPUs
Posted by tip-bot2 for Herton R. Krzesinski 8 months, 3 weeks ago
The following commit has been merged into the x86/urgent branch of tip:

Commit-ID:     3f7176e19a0f723c5b2abcc633ad0616c67d1f6d
Gitweb:        https://git.kernel.org/tip/3f7176e19a0f723c5b2abcc633ad0616c67d1f6d
Author:        Herton R. Krzesinski <herton@redhat.com>
AuthorDate:    Thu, 20 Mar 2025 11:22:13 -03:00
Committer:     Ingo Molnar <mingo@kernel.org>
CommitterDate: Fri, 28 Mar 2025 22:45:25 +01:00

x86/uaccess: Improve performance by aligning writes to 8 bytes in copy_user_generic(), on non-FSRM/ERMS CPUs

History of the performance regression:
======================================

Since the following series of user copy updates were merged upstream
~2 years ago via:

  a5624566431d ("Merge branch 'x86-rep-insns': x86 user copy clarifications")

.. copy_user_generic() on x86_64 stopped doing alignment of the
writes to the destination to a 8 byte boundary for the non FSRM case.

Previously, this was done through the ALIGN_DESTINATION macro that
was used in the now removed copy_user_generic_unrolled function.

Turns out this change causes some loss of performance/throughput on
some use cases and specific CPU/platforms without FSRM and ERMS.

Lately I got two reports of performance/throughput issues after a
RHEL 9 kernel pulled the same upstream series with updates to user
copy functions. Both reports consisted of running specific
networking/TCP related testing using iperf3.

Partial upstream fix
====================

The first report was related to a Linux Bridge testing using VMs on a
specific machine with an AMD CPU (EPYC 7402), and after a brief
investigation it turned out that the later change via:

  ca96b162bfd2 ("x86: bring back rep movsq for user access on CPUs without ERMS")

... helped/fixed the performance issue.

However, after the later commit/fix was applied, then I got another
regression reported in a multistream TCP test on a 100Gbit mlx5 nic, also
running on an AMD based platform (AMD EPYC 7302 CPU), again that was using
iperf3 to run the test. That regression was after applying the later
fix/commit, but only this didn't help in telling the whole history.

Testing performed to pinpoint residual regression
=================================================

So I narrowed down the second regression use case, but running it
without traffic through a NIC, on localhost, in trying to narrow down
CPU usage and not being limited by other factor like network bandwidth.
I used another system also with an AMD CPU (AMD EPYC 7742). Basically,
I run iperf3 in server and client mode in the same system, for example:

 - Start the server binding it to CPU core/thread 19:
   $ taskset -c 19 iperf3 -D -s -B 127.0.0.1 -p 12000

 - Start the client always binding/running on CPU core/thread 17, using
   perf to get statistics:
   $ perf stat -o stat.txt taskset -c 17 iperf3 -c 127.0.0.1 -b 0/1000 -V \
       -n 50G --repeating-payload -l 16384 -p 12000 --cport 12001 2>&1 \
       > stat-19.txt

For the client, always running/pinned to CPU 17. But for the iperf3 in
server mode, I did test runs using CPUs 19, 21, 23 or not pinned to any
specific CPU. So it basically consisted with four runs of the same
commands, just changing the CPU which the server is pinned, or without
pinning by removing the taskset call before the server command. The CPUs
were chosen based on NUMA node they were on, this is the relevant output
of lscpu on the system:

  $ lscpu
  ...
    Model name:             AMD EPYC 7742 64-Core Processor
  ...
  Caches (sum of all):
    L1d:                    2 MiB (64 instances)
    L1i:                    2 MiB (64 instances)
    L2:                     32 MiB (64 instances)
    L3:                     256 MiB (16 instances)
  NUMA:
    NUMA node(s):           4
    NUMA node0 CPU(s):      0,1,8,9,16,17,24,25,32,33,40,41,48,49,56,57,64,65,72,73,80,81,88,89,96,97,104,105,112,113,120,121
    NUMA node1 CPU(s):      2,3,10,11,18,19,26,27,34,35,42,43,50,51,58,59,66,67,74,75,82,83,90,91,98,99,106,107,114,115,122,123
    NUMA node2 CPU(s):      4,5,12,13,20,21,28,29,36,37,44,45,52,53,60,61,68,69,76,77,84,85,92,93,100,101,108,109,116,117,124,125
    NUMA node3 CPU(s):      6,7,14,15,22,23,30,31,38,39,46,47,54,55,62,63,70,71,78,79,86,87,94,95,102,103,110,111,118,119,126,127
  ...

So for the server run, when picking a CPU, I chose CPUs to be not on the same
node. The reason is with that I was able to get/measure relevant
performance differences when changing the alignment of the writes to the
destination in copy_user_generic.

Testing shows up to +81% performance improvement under iperf3
=============================================================

Here's a summary of the iperf3 runs:

  # Vanilla upstream alignment:

		     CPU      RATE          SYS          TIME     sender-receiver
	Server bind   19: 13.0Gbits/sec 28.371851000 33.233499566 86.9%-70.8%
	Server bind   21: 12.9Gbits/sec 28.283381000 33.586486621 85.8%-69.9%
	Server bind   23: 11.1Gbits/sec 33.660190000 39.012243176 87.7%-64.5%
	Server bind none: 18.9Gbits/sec 19.215339000 22.875117865 86.0%-80.5%

  # With the attached patch (aligning writes in non ERMS/FSRM case):

		     CPU      RATE          SYS          TIME     sender-receiver
	Server bind   19: 20.8Gbits/sec 14.897284000 20.811101382 75.7%-89.0%
	Server bind   21: 20.4Gbits/sec 15.205055000 21.263165909 75.4%-89.7%
	Server bind   23: 20.2Gbits/sec 15.433801000 21.456175000 75.5%-89.8%
	Server bind none: 26.1Gbits/sec 12.534022000 16.632447315 79.8%-89.6%

So I consistently got better results when aligning the write. The
results above were run on 6.14.0-rc6/rc7 based kernels. The sys is sys
time and then the total time to run/transfer 50G of data. The last
field is the CPU usage of sender/receiver iperf3 process. It's also
worth to note that each pair of iperf3 runs may get slightly different
results on each run, but I always got consistent higher results with
the write alignment for this specific test of running the processes
on CPUs in different NUMA nodes.

Linus Torvalds helped/provided this version of the patch. Initially I
proposed a version which aligned writes for all cases in
rep_movs_alternative, however it used two extra registers and thus
Linus provided an enhanced version that only aligns the write on the
large_movsq case, which is sufficient since the problem happens only
on those AMD CPUs like ones mentioned above without ERMS/FSRM, and
also doesn't require using extra registers. Also, I validated that
aligning only on large_movsq case is really enough for getting the
performance back.

I also tested this patch on an old Intel based non-ERMS/FRMS system
(with Xeon E5-2667 - Sandy Bridge based) and didn't get any problems:
no performance enhancement but also no regression either, using the
same iperf3 based benchmark. Also newer Intel processors after
Sandy Bridge usually have ERMS and should not be affected by this change.

[ mingo: Updated the changelog. ]

Fixes: ca96b162bfd2 ("x86: bring back rep movsq for user access on CPUs without ERMS")
Fixes: 034ff37d3407 ("x86: rewrite '__copy_user_nocache' function")
Reported-by: Ondrej Lichtner <olichtne@redhat.com>
Co-developed-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Herton R. Krzesinski <herton@redhat.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20250320142213.2623518-1-herton@redhat.com
---
 arch/x86/lib/copy_user_64.S | 18 ++++++++++++++++++
 1 file changed, 18 insertions(+)

diff --git a/arch/x86/lib/copy_user_64.S b/arch/x86/lib/copy_user_64.S
index aa8c341..06296eb 100644
--- a/arch/x86/lib/copy_user_64.S
+++ b/arch/x86/lib/copy_user_64.S
@@ -77,6 +77,24 @@ SYM_FUNC_START(rep_movs_alternative)
 	_ASM_EXTABLE_UA( 0b, 1b)
 
 .Llarge_movsq:
+	/* Do the first possibly unaligned word */
+0:	movq (%rsi),%rax
+1:	movq %rax,(%rdi)
+
+	_ASM_EXTABLE_UA( 0b, .Lcopy_user_tail)
+	_ASM_EXTABLE_UA( 1b, .Lcopy_user_tail)
+
+	/* What would be the offset to the aligned destination? */
+	leaq 8(%rdi),%rax
+	andq $-8,%rax
+	subq %rdi,%rax
+
+	/* .. and update pointers and count to match */
+	addq %rax,%rdi
+	addq %rax,%rsi
+	subq %rax,%rcx
+
+	/* make %rcx contain the number of words, %rax the remainder */
 	movq %rcx,%rax
 	shrq $3,%rcx
 	andl $7,%eax
[tip: x86/mm] x86/uaccess: Improve performance by aligning writes to 8 bytes in copy_user_generic(), on non-FSRM/ERMS CPUs
Posted by tip-bot2 for Herton R. Krzesinski 8 months, 3 weeks ago
The following commit has been merged into the x86/mm branch of tip:

Commit-ID:     411d2763b3ba0e3a99cd27ce813738d530b2320d
Gitweb:        https://git.kernel.org/tip/411d2763b3ba0e3a99cd27ce813738d530b2320d
Author:        Herton R. Krzesinski <herton@redhat.com>
AuthorDate:    Thu, 20 Mar 2025 11:22:13 -03:00
Committer:     Ingo Molnar <mingo@kernel.org>
CommitterDate: Fri, 28 Mar 2025 21:56:49 +01:00

x86/uaccess: Improve performance by aligning writes to 8 bytes in copy_user_generic(), on non-FSRM/ERMS CPUs

History of the performance regression:
======================================

Since the a series of user copy updates were merged upstream
~2 years ago via:

  a5624566431d ("Merge branch 'x86-rep-insns': x86 user copy clarifications")

.. copy_user_generic() on x86_64 stopped doing alignment of the
writes to the destination to a 8 byte boundary for the non FSRM case.

Previously, this was done through the ALIGN_DESTINATION macro that
was used in the now removed copy_user_generic_unrolled function.

Turns out this change causes some loss of performance/throughput on
some use cases and specific CPU/platforms without FSRM and ERMS.

Lately I got two reports of performance/throughput issues after a
RHEL 9 kernel pulled the same upstream series with updates to user
copy functions. Both reports consisted of running specific
networking/TCP related testing using iperf3.

Partial upstream fix
====================

The first report was related to a Linux Bridge testing using VMs on a
specific machine with an AMD CPU (EPYC 7402), and after a brief
investigation it turned out that the later change via:

  ca96b162bfd2 ("x86: bring back rep movsq for user access on CPUs without ERMS")

... helped/fixed the performance issue.

However, after the later commit/fix was applied, then I got another
regression reported in a multistream TCP test on a 100Gbit mlx5 nic, also
running on an AMD based platform (AMD EPYC 7302 CPU), again that was using
iperf3 to run the test. That regression was after applying the later
fix/commit, but only this didn't help in telling the whole history.

Testing performed to pinpoint residual regression
=================================================

So I narrowed down the second regression use case, but running it
without traffic through a NIC, on localhost, in trying to narrow down
CPU usage and not being limited by other factor like network bandwidth.
I used another system also with an AMD CPU (AMD EPYC 7742). Basically,
I run iperf3 in server and client mode in the same system, for example:

 - Start the server binding it to CPU core/thread 19:
   $ taskset -c 19 iperf3 -D -s -B 127.0.0.1 -p 12000

 - Start the client always binding/running on CPU core/thread 17, using
   perf to get statistics:
   $ perf stat -o stat.txt taskset -c 17 iperf3 -c 127.0.0.1 -b 0/1000 -V \
       -n 50G --repeating-payload -l 16384 -p 12000 --cport 12001 2>&1 \
       > stat-19.txt

For the client, always running/pinned to CPU 17. But for the iperf3 in
server mode, I did test runs using CPUs 19, 21, 23 or not pinned to any
specific CPU. So it basically consisted with four runs of the same
commands, just changing the CPU which the server is pinned, or without
pinning by removing the taskset call before the server command. The CPUs
were chosen based on NUMA node they were on, this is the relevant output
of lscpu on the system:

  $ lscpu
  ...
    Model name:             AMD EPYC 7742 64-Core Processor
  ...
  Caches (sum of all):
    L1d:                    2 MiB (64 instances)
    L1i:                    2 MiB (64 instances)
    L2:                     32 MiB (64 instances)
    L3:                     256 MiB (16 instances)
  NUMA:
    NUMA node(s):           4
    NUMA node0 CPU(s):      0,1,8,9,16,17,24,25,32,33,40,41,48,49,56,57,64,65,72,73,80,81,88,89,96,97,104,105,112,113,120,121
    NUMA node1 CPU(s):      2,3,10,11,18,19,26,27,34,35,42,43,50,51,58,59,66,67,74,75,82,83,90,91,98,99,106,107,114,115,122,123
    NUMA node2 CPU(s):      4,5,12,13,20,21,28,29,36,37,44,45,52,53,60,61,68,69,76,77,84,85,92,93,100,101,108,109,116,117,124,125
    NUMA node3 CPU(s):      6,7,14,15,22,23,30,31,38,39,46,47,54,55,62,63,70,71,78,79,86,87,94,95,102,103,110,111,118,119,126,127
  ...

So for the server run, when picking a CPU, I chose CPUs to be not on the same
node. The reason is with that I was able to get/measure relevant
performance differences when changing the alignment of the writes to the
destination in copy_user_generic.

Testing shows up to +81% performance improvement under iperf3
=============================================================

Here's a summary of the iperf3 runs:

  # Vanilla upstream alignment:

		     CPU      RATE          SYS          TIME     sender-receiver
	Server bind   19: 13.0Gbits/sec 28.371851000 33.233499566 86.9%-70.8%
	Server bind   21: 12.9Gbits/sec 28.283381000 33.586486621 85.8%-69.9%
	Server bind   23: 11.1Gbits/sec 33.660190000 39.012243176 87.7%-64.5%
	Server bind none: 18.9Gbits/sec 19.215339000 22.875117865 86.0%-80.5%

  # With the attached patch (aligning writes in non ERMS/FSRM case):

		     CPU      RATE          SYS          TIME     sender-receiver
	Server bind   19: 20.8Gbits/sec 14.897284000 20.811101382 75.7%-89.0%
	Server bind   21: 20.4Gbits/sec 15.205055000 21.263165909 75.4%-89.7%
	Server bind   23: 20.2Gbits/sec 15.433801000 21.456175000 75.5%-89.8%
	Server bind none: 26.1Gbits/sec 12.534022000 16.632447315 79.8%-89.6%

So I consistently got better results when aligning the write. The
results above were run on 6.14.0-rc6/rc7 based kernels. The sys is sys
time and then the total time to run/transfer 50G of data. The last
field is the CPU usage of sender/receiver iperf3 process. It's also
worth to note that each pair of iperf3 runs may get slightly different
results on each run, but I always got consistent higher results with
the write alignment for this specific test of running the processes
on CPUs in different NUMA nodes.

Linus Torvalds helped/provided this version of the patch. Initially I
proposed a version which aligned writes for all cases in
rep_movs_alternative, however it used two extra registers and thus
Linus provided an enhanced version that only aligns the write on the
large_movsq case, which is sufficient since the problem happens only
on those AMD CPUs like ones mentioned above without ERMS/FSRM, and
also doesn't require using extra registers. Also, I validated that
aligning only on large_movsq case is really enough for getting the
performance back.

I also tested this patch on an old Intel based non-ERMS/FRMS system
(with Xeon E5-2667 - Sandy Bridge based) and didn't get any problems:
no performance enhancement but also no regression either, using the
same iperf3 based benchmark. Also newer Intel processors after
Sandy Bridge usually have ERMS and should not be affected by this change.

[ mingo: Updated the changelog. ]

Fixes: ca96b162bfd2 ("x86: bring back rep movsq for user access on CPUs without ERMS")
Fixes: 034ff37d3407 ("x86: rewrite '__copy_user_nocache' function")
Reported-by: Ondrej Lichtner <olichtne@redhat.com>
Co-developed-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Herton R. Krzesinski <herton@redhat.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20250320142213.2623518-1-herton@redhat.com
---
 arch/x86/lib/copy_user_64.S | 18 ++++++++++++++++++
 1 file changed, 18 insertions(+)

diff --git a/arch/x86/lib/copy_user_64.S b/arch/x86/lib/copy_user_64.S
index aa8c341..06296eb 100644
--- a/arch/x86/lib/copy_user_64.S
+++ b/arch/x86/lib/copy_user_64.S
@@ -77,6 +77,24 @@ SYM_FUNC_START(rep_movs_alternative)
 	_ASM_EXTABLE_UA( 0b, 1b)
 
 .Llarge_movsq:
+	/* Do the first possibly unaligned word */
+0:	movq (%rsi),%rax
+1:	movq %rax,(%rdi)
+
+	_ASM_EXTABLE_UA( 0b, .Lcopy_user_tail)
+	_ASM_EXTABLE_UA( 1b, .Lcopy_user_tail)
+
+	/* What would be the offset to the aligned destination? */
+	leaq 8(%rdi),%rax
+	andq $-8,%rax
+	subq %rdi,%rax
+
+	/* .. and update pointers and count to match */
+	addq %rax,%rdi
+	addq %rax,%rsi
+	subq %rax,%rcx
+
+	/* make %rcx contain the number of words, %rax the remainder */
 	movq %rcx,%rax
 	shrq $3,%rcx
 	andl $7,%eax
Re: [tip: x86/mm] x86/uaccess: Improve performance by aligning writes to 8 bytes in copy_user_generic(), on non-FSRM/ERMS CPUs
Posted by Ingo Molnar 8 months, 3 weeks ago
* tip-bot2 for Herton R. Krzesinski <tip-bot2@linutronix.de> wrote:

> The following commit has been merged into the x86/mm branch of tip:
> 
> Commit-ID:     411d2763b3ba0e3a99cd27ce813738d530b2320d
> Gitweb:        https://git.kernel.org/tip/411d2763b3ba0e3a99cd27ce813738d530b2320d
> Author:        Herton R. Krzesinski <herton@redhat.com>
> AuthorDate:    Thu, 20 Mar 2025 11:22:13 -03:00
> Committer:     Ingo Molnar <mingo@kernel.org>
> CommitterDate: Fri, 28 Mar 2025 21:56:49 +01:00
> 
> x86/uaccess: Improve performance by aligning writes to 8 bytes in copy_user_generic(), on non-FSRM/ERMS CPUs
> 
> History of the performance regression:
> ======================================
> 
> Since the a series of user copy updates were merged upstream
        ^^^^^
> ~2 years ago via:

That was my mistake in a last-minute edit to the changelog.
I force-pushed a fix:

  |  Since the following series of user copy updates were merged upstream
  |  ~2 years ago via:

Thanks,

	Ingo