[PATCH v2] epoll: Use user_write_access_begin() and unsafe_put_user() in epoll_put_uevent().
Posted by Kuniyuki Iwashima 3 months, 1 week ago
epoll_put_uevent() calls __put_user() twice, and on x86 each call
expands into a call to an out-of-line helper, __put_user_nocheck_4()
and __put_user_nocheck_8() respectively.

Both helpers wrap a single mov with a stac/clac pair, which is expensive
on the AMD EPYC 7B12 (Zen 2) 64-Core Processor platform.

  __put_user_nocheck_4  /proc/kcore [Percent: local period]
  Percent │
    89.91 │      stac
     0.19 │      mov  %eax,(%rcx)
     0.15 │      xor  %ecx,%ecx
     9.69 │      clac
     0.06 │    ← retq
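
In C terms, each helper opens and closes its own SMAP window around a
single store, roughly like the sketch below.  (Illustrative only: the
real helpers are hand-written asm in arch/x86/lib/putuser.S with a
custom register calling convention; stac()/clac() are the inlines from
arch/x86/include/asm/smap.h; the _sketch name is made up.)

  /* Illustrative shape only, NOT the real implementation. */
  static int put_user_nocheck_4_sketch(u32 val, u32 __user *ptr)
  {
          stac();                      /* open the SMAP window        */
          *(__force u32 *)ptr = val;   /* the one store we care about */
          clac();                      /* close the SMAP window       */
          return 0;                    /* fault handling not shown    */
  }

Two __put_user() calls therefore pay for two such windows.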

This overhead stood out while testing neper/udp_rr with 1000 flows per
thread:

  Overhead  Shared O  Symbol
    10.08%  [kernel]  [k] _copy_to_iter
     7.12%  [kernel]  [k] ip6_output
     6.40%  [kernel]  [k] sock_poll
     5.71%  [kernel]  [k] move_addr_to_user
     4.39%  [kernel]  [k] __put_user_nocheck_4
     ...
     1.06%  [kernel]  [k] ep_try_send_events
     ...                  ^- epoll_put_uevent() was inlined
     0.78%  [kernel]  [k] __put_user_nocheck_8

Let's use user_write_access_begin() and unsafe_put_user() in
epoll_put_uevent().

We saw 2% more pps with udp_rr by saving a stac/clac pair.

Before:

  # nstat > /dev/null; sleep 10; nstat | grep -i udp
  Udp6InDatagrams                 2184011            0.0

  @ep_try_send_events_ns:
  [256, 512)       2796601 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
  [512, 1K)         627863 |@@@@@@@@@@@                                         |
  [1K, 2K)          166403 |@@@                                                 |
  [2K, 4K)           10437 |                                                    |
  [4K, 8K)            1396 |                                                    |
  [8K, 16K)            116 |                                                    |

After:

  # nstat > /dev/null; sleep 10; nstat | grep -i udp
  Udp6InDatagrams                 2232730            0.0

  @ep_try_send_events_ns:
  [256, 512)       2900655 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
  [512, 1K)         622045 |@@@@@@@@@@@                                         |
  [1K, 2K)          172831 |@@@                                                 |
  [2K, 4K)           17687 |                                                    |
  [4K, 8K)            1103 |                                                    |
  [8K, 16K)            174 |                                                    |

Another option would be to use can_do_masked_user_access()
and masked_user_access_begin(), but we saw a 3% regression (see Link).

Link: https://lore.kernel.org/lkml/20251028053330.2391078-1-kuniyu@google.com/
Suggested-by: Eric Dumazet <edumazet@google.com>
Suggested-by: Dave Hansen <dave.hansen@intel.com>
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
---
v2:
  * Drop patch 1
  * Use user_write_access_begin() instead of a bare stac (Dave Hansen)

v1: https://lore.kernel.org/lkml/20251023000535.2897002-1-kuniyu@google.com/
---
 include/linux/eventpoll.h | 12 ++++++++----
 1 file changed, 8 insertions(+), 4 deletions(-)

diff --git a/include/linux/eventpoll.h b/include/linux/eventpoll.h
index ccb478eb174b..31a1b11e4ddf 100644
--- a/include/linux/eventpoll.h
+++ b/include/linux/eventpoll.h
@@ -82,11 +82,15 @@ static inline struct epoll_event __user *
 epoll_put_uevent(__poll_t revents, __u64 data,
 		 struct epoll_event __user *uevent)
 {
-	if (__put_user(revents, &uevent->events) ||
-	    __put_user(data, &uevent->data))
+	if (!user_write_access_begin(uevent, sizeof(*uevent)))
 		return NULL;
-
-	return uevent+1;
+	unsafe_put_user(revents, &uevent->events, efault);
+	unsafe_put_user(data, &uevent->data, efault);
+	user_access_end();
+	return uevent + 1;
+efault:
+	user_access_end();
+	return NULL;
 }
 #endif
 
-- 
2.51.1.851.g4ebd6896fd-goog
Re: [PATCH v2] epoll: Use user_write_access_begin() and unsafe_put_user() in epoll_put_uevent().
Posted by Linus Torvalds 3 months, 1 week ago
On Tue, 28 Oct 2025 at 10:56, Kuniyuki Iwashima <kuniyu@google.com> wrote:
>
> Let's use user_write_access_begin() and unsafe_put_user() in
> epoll_put_uevent().
>
> We saw 2% more pps with udp_rr by saving a stac/clac pair.

This patch looks fine to me. Simple and targeted.

> Another option would be to use can_do_masked_user_access()
> and masked_user_access_begin(), but we saw a 3% regression (see Link).

So I find this intriguing, because generally,
masked_user_access_begin() should _never_ be slower than
user_write_access_begin().

user_write_access_begin() ends up doing a __uaccess_begin_nospec() on
x86, which is not just the STAC instruction, but also a barrier.

In contrast, masked_user_access_begin() obviously also has the STAC,
but it avoids the barrier and only uses a simple conditional mask.
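
Roughly, paraphrased from arch/x86/include/asm/uaccess.h and
uaccess_64.h (not verbatim, and the details vary by kernel version):

---8<---
/* Paraphrased sketch, not the literal kernel source. */
#define __uaccess_begin()		stac()
#define __uaccess_begin_nospec()	\
({					\
	stac();				\
	barrier_nospec();	/* lfence on x86 - the expensive part */ \
})

/*
 * The masked variant clamps the pointer against USER_PTR_MAX with
 * cmp+cmova instead of serializing, so an out-of-range pointer just
 * faults and no barrier is needed:
 */
#define masked_user_access_begin(x) ({			\
	__auto_type __masked_ptr = (x);			\
	__masked_ptr = mask_user_address(__masked_ptr);	\
	__uaccess_begin();				\
	__masked_ptr; })
---8<---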

So I get the feeling that you did something wrong. In particular,
following your link, I see you describe that case (2) as

  2) masked_user_access_begin() + masked_user_access_begin()
  97% pps compared to 1).
  96% calls of ep_try_send_events().

and you mention masked_user_access_begin() *twice*.

Which would certainly explain why it's slower.

Can you show what the patch you used is?

Because I think the proper patch should look something like the
attached.. For me, that generates


        movabs $0x123456789abcdef,%rcx
        cmp    %rcx,%r15
        cmova  %rcx,%r15
        stac
        mov    %r12d,(%r15)
        mov    %rax,0x4(%r15)
        clac

which honestly should be pretty much optimal.

(That 0x123456789abcdef is just a placeholder for the USER_PTR_MAX
value - it gets rewritten at boot to the right value).
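
The masking itself is roughly the following (again paraphrased, from
arch/x86/include/asm/uaccess_64.h; the movabs immediate comes from
runtime_const_ptr(USER_PTR_MAX) and is what gets patched at boot):

---8<---
/* Paraphrased sketch, not the literal kernel source. */
static inline void __user *mask_user_address(const void __user *ptr)
{
	void __user *ret;

	asm("cmp %1,%0\n\t"
	    "cmova %1,%0"	/* clamp anything above USER_PTR_MAX */
	    : "=r" (ret)
	    : "r" (runtime_const_ptr(USER_PTR_MAX)), "0" (ptr));
	return ret;
}
---8<---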

NOTE! The attached patch has absolutely not been tested. I only used
it to verify the code generation visually, so you should definitely
check it yourself.

               Linus
Re: [PATCH v2] epoll: Use user_write_access_begin() and unsafe_put_user() in epoll_put_uevent().
Posted by Kuniyuki Iwashima 3 months, 1 week ago
From: Linus Torvalds <torvalds@linux-foundation.org>
Date: Tue, 28 Oct 2025 12:02:33 -0700
> On Tue, 28 Oct 2025 at 10:56, Kuniyuki Iwashima <kuniyu@google.com> wrote:
> >
> > Let's use user_write_access_begin() and unsafe_put_user() in
> > epoll_put_uevent().
> >
> > We saw 2% more pps with udp_rr by saving a stac/clac pair.
> 
> This patch looks fine to me. Simple and targeted.
> 
> > Another option would be to use can_do_masked_user_access()
> > and masked_user_access_begin(), but we saw a 3% regression (see Link).
> 
> So I find this intriguing, because generally,
> masked_user_access_begin() should _never_ be slower than
> user_write_access_begin().
> 
> user_write_access_begin() ends up doing a __uaccess_begin_nospec() on
> x86, which is not just the STAC instruction, but also a barrier.
> 
> In contrast, masked_user_access_begin() obviously also has the STAC,
> but it avoids the barrier and only uses a simple conditional mask.
> 
> So I get the feeling that you did something wrong. In particular,
> following your link, I see you describe that case (2) as
> 
>   2) masked_user_access_begin() + masked_user_access_begin()
>   97% pps compared to 1).
>   96% calls of ep_try_send_events().
> 
> and you mention masked_user_access_begin() *twice*.

Oh sorry, this was my copy-and-paste mistake, and it should have
been can_do_masked_user_access() + masked_user_access_begin().


> 
> Which would certainly explain why it's slower.
> 
> Can you show what the patch you used is?

I used the diff below as the masked version.

I noticed user_access_begin() is used in the "else if ()" branch, but I
think it should not matter as long as it is tested on an x86_64 platform,
where can_do_masked_user_access() is true.

So, I think the diff is basically the same as the attached one.

---8<---
diff --git a/include/linux/eventpoll.h b/include/linux/eventpoll.h
index ccb478eb174b..c530d83f89d0 100644
--- a/include/linux/eventpoll.h
+++ b/include/linux/eventpoll.h
@@ -82,11 +82,23 @@ static inline struct epoll_event __user *
 epoll_put_uevent(__poll_t revents, __u64 data,
                 struct epoll_event __user *uevent)
 {
-       if (__put_user(revents, &uevent->events) ||
-           __put_user(data, &uevent->data))
+       if (can_do_masked_user_access()) {
+               uevent = masked_user_access_begin(uevent);
+       } else if (!user_access_begin(uevent, sizeof(*uevent))) {
+               /*
+                * dead branch for __must_check.
+                * ep_check_params() already checked access_ok().
+                */
                return NULL;
-
-       return uevent+1;
+       }
+
+       unsafe_put_user(revents, &uevent->events, efault);
+       unsafe_put_user(data, &uevent->data, efault);
+       user_access_end();
+       return uevent + 1;
+efault:
+       user_access_end();
+       return NULL;
 }
 #endif
 
---8<---


> 
> Because I think the proper patch should look something like the
> attached.. For me, that generates
> 
> 
>         movabs $0x123456789abcdef,%rcx
>         cmp    %rcx,%r15
>         cmova  %rcx,%r15
>         stac
>         mov    %r12d,(%r15)
>         mov    %rax,0x4(%r15)
>         clac
> 
> which honestly should be pretty much optimal.

I got this with the masked version, taken from perf report + 'a' (annotate).

---8<---
        │       movabs $0x7ffffffff000,%rcx
   0.05 │       cmp    %rcx,%r15
   0.18 │       cmova  %rcx,%r15
  72.69 │       stac
   0.08 │       lea    0x28(%rsp),%r12
   0.09 │       mov    %r14d,(%r15)
   0.06 │       mov    %rax,0x4(%r15)
   6.42 │       clac
---8<---

One possibility that Eric mentioned 2 weeks ago is that cmov
might be expensive on the Zen 2 platform.


> 
> (That 0x123456789abcdef is just a placeholder for the USER_PTR_MAX
> value - it gets rewritten at boot to the right value).
> 
> NOTE! The attached patch has absolutely not been tested. I only used
> it to verify the code generation visually, so you should definitely
> check it yourself.
> 
Re: [PATCH v2] epoll: Use user_write_access_begin() and unsafe_put_user() in epoll_put_uevent().
Posted by Linus Torvalds 3 months, 1 week ago
On Tue, 28 Oct 2025 at 12:27, Kuniyuki Iwashima <kuniyu@google.com> wrote:
>
> Oh sorry, this was my copy-and-paste mistake, and it should have
> been can_do_masked_user_access() + masked_user_access_begin().

Ok.

> I used the diff below as the masked version.

That looks fine.

> I noticed user_access_begin() is used in the "else if ()" branch, but I
> think it should not matter as long as it is tested on an x86_64 platform,
> where can_do_masked_user_access() is true.

Correct. It's designed to be just optimized away statically based on
whether the architecture supports address masking or not.
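
The generic side is roughly this (paraphrased from
include/linux/uaccess.h), so on x86-64 can_do_masked_user_access() is a
compile-time 1 and the user_access_begin() branch is dead code:

---8<---
/* Paraphrased sketch, not the literal kernel source. */
#ifdef masked_user_access_begin
 #define can_do_masked_user_access() 1
#else
 #define can_do_masked_user_access() 0
 #define masked_user_access_begin(src) NULL
 #define mask_user_address(src) (src)
#endif
---8<---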

> So, I think the diff is basically the same as the attached one.

Yes, looks fine.

> I got this with the masked version, taken from perf report + 'a' (annotate).
>
> ---8<---
>         │       movabs $0x7ffffffff000,%rcx
>    0.05 │       cmp    %rcx,%r15
>    0.18 │       cmova  %rcx,%r15
>   72.69 │       stac
>    0.08 │       lea    0x28(%rsp),%r12
>    0.09 │       mov    %r14d,(%r15)
>    0.06 │       mov    %rax,0x4(%r15)
>    6.42 │       clac
> ---8<---
>
> One possibility that Eric mentioned 2 weeks ago is that cmov
> might be expensive on the Zen 2 platform.

Hmm. I have a Zen 2 myself, and I haven't seen that being particularly
problematic, but admittedly I did most of my profiling with the
original address masking (back when it just used a plain 'or'
instruction to set all bits in the address).

Without the user address masking, you should be seeing a 'lfence'
(from the user_write_access_begin) instead, which has typically been
*much* more expensive.

I wonder if it's some random instruction scheduling thing.

               Linus
Re: [PATCH v2] epoll: Use user_write_access_begin() and unsafe_put_user() in epoll_put_uevent().
Posted by Linus Torvalds 3 months, 1 week ago
On Tue, 28 Oct 2025 at 12:02, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> Because I think the proper patch should look something like the
> attached.. For me, that generates
>
>         movabs $0x123456789abcdef,%rcx
>         cmp    %rcx,%r15
>         cmova  %rcx,%r15
>         stac
>         mov    %r12d,(%r15)
>         mov    %rax,0x4(%r15)
>         clac
>
> which honestly should be pretty much optimal.

Side note: when I say "pretty much optimal", the truly optimal situation
would be to do the address masking entirely outside the loop in
ep_send_events().  But we don't expose that interface, and we have had
absolutely horrible experiences with doing user address checking
separately from the actual code that does the access, so I'm really
loath to add that kind of logic.

So the above is *not* optimal in the sense that the

        movabs $0x123456789abcdef,%rcx
        cmp    %rcx,%r15
        cmova  %rcx,%r15

part certainly *could* be done outside the list_for_each_entry_safe()
loop in ep_send_events(), but I do not think that three simple
ALU-only instructions should make a difference here.

              Linus