[v1] rseq: Optimize exit to user space

[patch 00/11] rseq: Optimize exit to user space

Posted by Thomas Gleixner 1 month, 3 weeks ago

With the more wide spread usage of rseq in glibc, rseq is not longer a
niche use case for special applications.

While working on a sane implementation of a rseq based time slice extension
mechanism, I noticed several shortcomings of the current rseq code:

  1) task::rseq_event_mask is a pointless bitfield despite the fact that
     the ABI flags it was meant to support have been deprecated and
     functionally disabled three years ago.

  2) task::rseq_event_mask is accumulating bits unless there is a critical
     section discovered in the user space rseq memory. This results in
     pointless invocations of the rseq user space exit handler even if
     there had nothing changed. As a matter of correctness these bits have
     to be clear when exiting to user space and therefore pristine when
     coming back into the kernel. Aside of correctness, this also avoids
     pointless evaluation of the user space memory, which is a performance
     benefit.

  3) The evaluation of critical sections does not differentiate between
     syscall and interrupt/exception exits. The current implementation
     silently fixes up critical sections which invoked a syscall unless
     CONFIG_DEBUG_RSEQ is enabled.

     That's just wrong. If user space does that on a production kernel it
     can keep the pieces. The kernel is not there to proliferate mindless
     user space programming and letting everyone pay the performance
     penalty.

This series addresses these issues and on top converts parts of the user
space access over to the new masked access model, which lowers the overhead
of Spectre-V1 mitigations significantly on architectures which support it
(x86 as of today). This is especially noticable in the access to the
rseq_cs field in struct rseq, which is the first quick check to figure out
whether a critical section is installed or not.

It survives the kernels rseq selftests, but I did not any performance tests
vs. rseq because I have no idea how to use the gazillion of undocumented
command line parameters of the benchmark. I leave that to people who are so
familiar with them, that they assume everyone else is too :)

The performance gain on regular workloads is clearly measurable and the
consistent event flag state allows now to build the time slice extension
mechanism on top. The first POC I implemented:

   https://lore.kernel.org/lkml/87o6smb3a0.ffs@tglx/

suffered badly from the stale eventmask bits and the cleaned up version
brought a whopping 25+% performance gain.

The series depends on the seperately posted rseq bugfix:

   https://lore.kernel.org/lkml/87o6sj6z95.ffs@tglx/

and the uaccess generic helper series:

   https://lore.kernel.org/lkml/20250813150610.521355442@linutronix.de/

and a related futex fix in

   git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git locking/urgent

For your conveniance it is also available as a conglomorate from git:

    git://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git rseq/perf

Thanks,

	tglx
---
 drivers/hv/mshv_common.c         |    2 
 include/linux/entry-common.h     |   12 +--
 include/linux/irq-entry-common.h |   31 ++++++--
 include/linux/resume_user_mode.h |   40 +++++++---
 include/linux/rseq.h             |  115 +++++++++----------------------
 include/linux/sched.h            |   10 +-
 include/uapi/linux/rseq.h        |   21 +----
 io_uring/io_uring.h              |    2 
 kernel/entry/common.c            |    9 +-
 kernel/entry/kvm.c               |    2 
 kernel/rseq.c                    |  142 +++++++++++++++++++++++++--------------
 kernel/sched/core.c              |    8 +-
 kernel/sched/membarrier.c        |    8 +-
 13 files changed, 213 insertions(+), 189 deletions(-)

Re: [patch 00/11] rseq: Optimize exit to user space

Posted by Mathieu Desnoyers 1 month, 2 weeks ago

On 2025-08-13 12:29, Thomas Gleixner wrote:
> With the more wide spread usage of rseq in glibc, rseq is not longer a
> niche use case for special applications.
> 
> While working on a sane implementation of a rseq based time slice extension
> mechanism, I noticed several shortcomings of the current rseq code:
> 
>    1) task::rseq_event_mask is a pointless bitfield despite the fact that
>       the ABI flags it was meant to support have been deprecated and
>       functionally disabled three years ago.

Correct. We could replace this mask by a single flag, set when the
events happen, test-and-cleared on return to userspace. This would
lessen the synchronization requirements because bitops are not needed
anymore.

> 
>    2) task::rseq_event_mask is accumulating bits unless there is a critical
>       section discovered in the user space rseq memory. This results in
>       pointless invocations of the rseq user space exit handler even if
>       there had nothing changed. As a matter of correctness these bits have
>       to be clear when exiting to user space and therefore pristine when
>       coming back into the kernel. Aside of correctness, this also avoids
>       pointless evaluation of the user space memory, which is a performance
>       benefit.

Ouch. Yes. The intent here is indeed to clear those bits in all cases,
whether there is an rseq critical section ongoing or not. Good catch!

> 
>    3) The evaluation of critical sections does not differentiate between
>       syscall and interrupt/exception exits. The current implementation
>       silently fixes up critical sections which invoked a syscall unless
>       CONFIG_DEBUG_RSEQ is enabled.
> 
>       That's just wrong. If user space does that on a production kernel it
>       can keep the pieces. The kernel is not there to proliferate mindless
>       user space programming and letting everyone pay the performance
>       penalty.

Agreed. There is no need to fixup the rseq critical section, however we
still need to update other rseq fields (e.g. current CPU number) when
returning from a system call.

> 
> This series addresses these issues and on top converts parts of the user
> space access over to the new masked access model, which lowers the overhead
> of Spectre-V1 mitigations significantly on architectures which support it
> (x86 as of today). This is especially noticable in the access to the
> rseq_cs field in struct rseq, which is the first quick check to figure out
> whether a critical section is installed or not.

Nice!

I'll try to have a closer look at the series soon.

Thanks,

Mathieu

> 
> It survives the kernels rseq selftests, but I did not any performance tests
> vs. rseq because I have no idea how to use the gazillion of undocumented
> command line parameters of the benchmark. I leave that to people who are so
> familiar with them, that they assume everyone else is too :)
> 
> The performance gain on regular workloads is clearly measurable and the
> consistent event flag state allows now to build the time slice extension
> mechanism on top. The first POC I implemented:
> 
>     https://lore.kernel.org/lkml/87o6smb3a0.ffs@tglx/
> 
> suffered badly from the stale eventmask bits and the cleaned up version
> brought a whopping 25+% performance gain.
> 
> The series depends on the seperately posted rseq bugfix:
> 
>     https://lore.kernel.org/lkml/87o6sj6z95.ffs@tglx/
> 
> and the uaccess generic helper series:
> 
>     https://lore.kernel.org/lkml/20250813150610.521355442@linutronix.de/
> 
> and a related futex fix in
> 
>     git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git locking/urgent
> 
> For your conveniance it is also available as a conglomorate from git:
> 
>      git://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git rseq/perf
> 
> Thanks,
> 
> 	tglx
> ---
>   drivers/hv/mshv_common.c         |    2
>   include/linux/entry-common.h     |   12 +--
>   include/linux/irq-entry-common.h |   31 ++++++--
>   include/linux/resume_user_mode.h |   40 +++++++---
>   include/linux/rseq.h             |  115 +++++++++----------------------
>   include/linux/sched.h            |   10 +-
>   include/uapi/linux/rseq.h        |   21 +----
>   io_uring/io_uring.h              |    2
>   kernel/entry/common.c            |    9 +-
>   kernel/entry/kvm.c               |    2
>   kernel/rseq.c                    |  142 +++++++++++++++++++++++++--------------
>   kernel/sched/core.c              |    8 +-
>   kernel/sched/membarrier.c        |    8 +-
>   13 files changed, 213 insertions(+), 189 deletions(-)
> 


-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com

Re: [patch 00/11] rseq: Optimize exit to user space

Posted by Jens Axboe 1 month, 3 weeks ago

On 8/13/25 10:29 AM, Thomas Gleixner wrote:
> With the more wide spread usage of rseq in glibc, rseq is not longer a
> niche use case for special applications.
> 
> While working on a sane implementation of a rseq based time slice extension
> mechanism, I noticed several shortcomings of the current rseq code:
> 
>   1) task::rseq_event_mask is a pointless bitfield despite the fact that
>      the ABI flags it was meant to support have been deprecated and
>      functionally disabled three years ago.
> 
>   2) task::rseq_event_mask is accumulating bits unless there is a critical
>      section discovered in the user space rseq memory. This results in
>      pointless invocations of the rseq user space exit handler even if
>      there had nothing changed. As a matter of correctness these bits have
>      to be clear when exiting to user space and therefore pristine when
>      coming back into the kernel. Aside of correctness, this also avoids
>      pointless evaluation of the user space memory, which is a performance
>      benefit.
> 
>   3) The evaluation of critical sections does not differentiate between
>      syscall and interrupt/exception exits. The current implementation
>      silently fixes up critical sections which invoked a syscall unless
>      CONFIG_DEBUG_RSEQ is enabled.
> 
>      That's just wrong. If user space does that on a production kernel it
>      can keep the pieces. The kernel is not there to proliferate mindless
>      user space programming and letting everyone pay the performance
>      penalty.
> 
> This series addresses these issues and on top converts parts of the user
> space access over to the new masked access model, which lowers the overhead
> of Spectre-V1 mitigations significantly on architectures which support it
> (x86 as of today). This is especially noticable in the access to the
> rseq_cs field in struct rseq, which is the first quick check to figure out
> whether a critical section is installed or not.
> 
> It survives the kernels rseq selftests, but I did not any performance tests
> vs. rseq because I have no idea how to use the gazillion of undocumented
> command line parameters of the benchmark. I leave that to people who are so
> familiar with them, that they assume everyone else is too :)
> 
> The performance gain on regular workloads is clearly measurable and the
> consistent event flag state allows now to build the time slice extension
> mechanism on top. The first POC I implemented:
> 
>    https://lore.kernel.org/lkml/87o6smb3a0.ffs@tglx/
> 
> suffered badly from the stale eventmask bits and the cleaned up version
> brought a whopping 25+% performance gain.

Thanks for doing this work, it's been on my list to take a look at rseq
as it's quite the pig currently and enabled by default (with what I
assume is from a newer libc).

-- 
Jens Axboe

Re: [patch 00/11] rseq: Optimize exit to user space

Posted by Thomas Gleixner 1 month, 3 weeks ago

On Wed, Aug 13 2025 at 11:45, Jens Axboe wrote:
> On 8/13/25 10:29 AM, Thomas Gleixner wrote:
>> 
>> suffered badly from the stale eventmask bits and the cleaned up version
>> brought a whopping 25+% performance gain.
>
> Thanks for doing this work, it's been on my list to take a look at rseq
> as it's quite the pig currently and enabled by default (with what I
> assume is from a newer libc).

Yes, recent glibc uses it.

Could you give it a test ride to see whether this makes a difference in
your environment?

Thanks

        tglx

Re: [patch 00/11] rseq: Optimize exit to user space

Posted by Jens Axboe 1 month, 3 weeks ago

On 8/13/25 3:32 PM, Thomas Gleixner wrote:
> On Wed, Aug 13 2025 at 11:45, Jens Axboe wrote:
>> On 8/13/25 10:29 AM, Thomas Gleixner wrote:
>>>
>>> suffered badly from the stale eventmask bits and the cleaned up version
>>> brought a whopping 25+% performance gain.
>>
>> Thanks for doing this work, it's been on my list to take a look at rseq
>> as it's quite the pig currently and enabled by default (with what I
>> assume is from a newer libc).
> 
> Yes, recent glibc uses it.
> 
> Could you give it a test ride to see whether this makes a difference in
> your environment?

Yep, I'll give a spin. Won't be before Monday as I'm out of town thu/fri
this week, just as a heads-up.

If I recall correct, I'd see rseq at 2-3% of overall CPU time last I
tested. I'll try and do some pre/post numbers for some of the stuff I
usually run.

-- 
Jens Axboe

Re: [patch 00/11] rseq: Optimize exit to user space

Posted by Thomas Gleixner 1 month, 3 weeks ago

On Wed, Aug 13 2025 at 15:36, Jens Axboe wrote:
> On 8/13/25 3:32 PM, Thomas Gleixner wrote:
>> Could you give it a test ride to see whether this makes a difference in
>> your environment?
>
> Yep, I'll give a spin.

Appreciated.

> If I recall correct, I'd see rseq at 2-3% of overall CPU time last I

I'm not surprised.

> tested. I'll try and do some pre/post numbers for some of the stuff I
> usually run.

Cool. Thanks

      tglx

Re: [patch 00/11] rseq: Optimize exit to user space

Posted by Thomas Gleixner 1 month, 2 weeks ago

On Thu, Aug 14 2025 at 00:08, Thomas Gleixner wrote:
> On Wed, Aug 13 2025 at 15:36, Jens Axboe wrote:
>> On 8/13/25 3:32 PM, Thomas Gleixner wrote:
>>> Could you give it a test ride to see whether this makes a difference in
>>> your environment?
>>
>> Yep, I'll give a spin.
>
> Appreciated.

Please do not use the git branch I had in the cover letter. I did some
more analysis of this and it's even worse than I thought. Use

  git://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git rseq/wip

instead.

I've rewritten the whole pile by now and made it a real fast path
without the TIF_NOTIFY horror show, unless the fast path, which runs
_after_ the TIF work loop faults. So far that happens once for each
fork() as that has to fault in the copy of the user space rseq region.

There are lightweight per CPU stats for the various events, which can be
accessed via

      /sys/kernel/debug/rseq/stat

which exposes a racy sum of them. Here is the output after a reboot and
a full kernel recompile

exit:           85703905    // Total invocations
signal:            34635    // Invocations from signal delivery
slowp:               134    // Slow path via TIF_NOTIFY_RESUME
ids:               70052    // Updates of CPU and MM CID
cs:                    0    // Critical section analysis
clear:                 0    // Clearing of critical section
fixup:                 0    // Fixup of critical section (abort)

Before the rewrite this took more than a million of ID updates and
critical section evaluations even when completely pointless.

So on any syscall or interrupt heavy workload this should be clearly
visible as a difference in the profile.

I'm still not happy about the exit to user fast path decision as it's
two conditionals instead of one, but all attempts to do that lightweight
somewhere else turned out to make stuff worse as I just burdened other
fast path operations, i.e. the scheduler with pointless conditionals.

I'll think about that more, but nevertheless this is way better than the
current horror show.

I also have no real good plan yet how to gradually convert this over,
but I'm way too tired to think about that now.

It survives the self test suite after I wasted a day to figure out why
the selftests reliably segfault on a machine which has debian trixie
installed. The fix is in the branch.

Michael, can you please run your librseq tests against that too? They
have the same segfault problem as the kernel and they lack a run script,
so I couldn't be bothered to test against them. See commit 2bff3a0e5998
in that branch. I'll send out a patch with a proper change log later.

Thanks,

        tglx

Re: [patch 00/11] rseq: Optimize exit to user space

Posted by Mathieu Desnoyers 1 month, 2 weeks ago

On 2025-08-17 17:23, Thomas Gleixner wrote:
> On Thu, Aug 14 2025 at 00:08, Thomas Gleixner wrote:
>> On Wed, Aug 13 2025 at 15:36, Jens Axboe wrote:
>>> On 8/13/25 3:32 PM, Thomas Gleixner wrote:
>>>> Could you give it a test ride to see whether this makes a difference in
>>>> your environment?
>>>
>>> Yep, I'll give a spin.
>>
>> Appreciated.
> 
> Please do not use the git branch I had in the cover letter. I did some
> more analysis of this and it's even worse than I thought. Use
> 
>    git://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git rseq/wip
> 
> instead.
> 
> I've rewritten the whole pile by now and made it a real fast path
> without the TIF_NOTIFY horror show, unless the fast path, which runs
> _after_ the TIF work loop faults. So far that happens once for each
> fork() as that has to fault in the copy of the user space rseq region.
> 

OK. I'll stop reviewing this version of the series and wait for an
updated version.

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com

Re: [patch 00/11] rseq: Optimize exit to user space

Posted by Michael Jeanson 1 month, 2 weeks ago

On 2025-08-17 17:23, Thomas Gleixner wrote:
> Michael, can you please run your librseq tests against that too? They
> have the same segfault problem as the kernel and they lack a run script,
> so I couldn't be bothered to test against them. See commit 2bff3a0e5998
> in that branch. I'll send out a patch with a proper change log later.

I ran the librseq test suite on the new branch on a Debian Trixie amd64 
system and it succeeds, here are the rseq stats before and after.

Before:

exit:             746809
signal:                3
slowp:                99
ids:                1053
cs:                    0
clear:                 0
fixup:                 0

After:

exit:          229294046
signal:               11
slowp:              4570
ids:              615950
cs:              2493682
clear:            194637
fixup:           2299044


And I also ran the same test suite in a 32bit chroot on the same system 
which also succeeds with the following rseq stats.

Before:

exit:             717945
signal:                1
slowp:               102
ids:                1039
cs:                    0
clear:                 0
fixup:                 0

After:

exit:          201051038
signal:                9
slowp:              4909
ids:              793551
cs:                28887
clear:             12871
fixup:             16016


If you want to run the librseq tests on your system, just do the regular 
autotools dance and then run 'make check'.

Regards,

Michael

Re: [patch 00/11] rseq: Optimize exit to user space

Posted by Thomas Gleixner 1 month, 2 weeks ago

On Mon, Aug 18 2025 at 13:38, Michael Jeanson wrote:

> On 2025-08-17 17:23, Thomas Gleixner wrote:
>> Michael, can you please run your librseq tests against that too? They
>> have the same segfault problem as the kernel and they lack a run script,
>> so I couldn't be bothered to test against them. See commit 2bff3a0e5998
>> in that branch. I'll send out a patch with a proper change log later.
>
> I ran the librseq test suite on the new branch on a Debian Trixie amd64 
> system and it succeeds, here are the rseq stats before and after.

Thanks!

> Before:
>
> exit:             746809
> signal:                3
> slowp:                99
> ids:                1053
> cs:                    0
> clear:                 0
> fixup:                 0
>
> After:
>
> exit:          229294046
> signal:               11
> slowp:              4570
> ids:              615950
> cs:              2493682
> clear:            194637
> fixup:           2299044

That looks about right. Can you reset the branch to

     commit 85b61b265635 ("rseq: Expose stats")

which is just adding primitive stats on top of the current mainline
code, and provide numbers for that too?

That gives you 'notify: , cpuid:, fixup:' numbers, which are not 1:1
mappable to the final ones, but that should give some interesting
insight.

> If you want to run the librseq tests on your system, just do the regular 
> autotools dance and then run 'make check'.

Might be useful to put such instructions into README.md, no?

Thanks,

        tglx

Re: [patch 00/11] rseq: Optimize exit to user space

Posted by Michael Jeanson 1 month, 2 weeks ago

On 2025-08-18 16:21, Thomas Gleixner wrote:
> That looks about right. Can you reset the branch to
> 
>       commit 85b61b265635 ("rseq: Expose stats")
> 
> which is just adding primitive stats on top of the current mainline
> code, and provide numbers for that too?
> 
> That gives you 'notify: , cpuid:, fixup:' numbers, which are not 1:1
> mappable to the final ones, but that should give some interesting
> insight.

For amd64 kernel and userspace.

Before:

notify:             12467
fixup:              12467
cpuid:              12467


After:

notify:         123669528
fixup:          123669528
cpuid:          123669528


For amd64 kernel, i386 userspace.

Before:

notify:             12857
fixup:              12857
cpuid:              12857


After:

notify:         120621210
fixup:          120621210
cpuid:          120621210


> Might be useful to put such instructions into README.md, no?

Will do.

Regards,

Michael

Re: [patch 00/11] rseq: Optimize exit to user space

Posted by Thomas Gleixner 1 month, 2 weeks ago

On Mon, Aug 18 2025 at 17:29, Michael Jeanson wrote:

> On 2025-08-18 16:21, Thomas Gleixner wrote:
>> That looks about right. Can you reset the branch to
>> 
>>       commit 85b61b265635 ("rseq: Expose stats")
>> 
>> which is just adding primitive stats on top of the current mainline
>> code, and provide numbers for that too?
>> 
>> That gives you 'notify: , cpuid:, fixup:' numbers, which are not 1:1
>> mappable to the final ones, but that should give some interesting
>> insight.
>
> For amd64 kernel and userspace.
>
> Before:
>
> notify:             12467
> fixup:              12467
> cpuid:              12467
>
>
> After:
>
> notify:         123669528
> fixup:          123669528
> cpuid:          123669528

That's insane compared to this:

> exit:          229294046
> signal:               11
> slowp:              4570
> ids:              615950
> cs:              2493682
> clear:            194637
> fixup:           2299044

You can assume that the number of exits (to user) is roughly the same,
i.e. ~23M, so 12M (> 50%) take the TIF_NOTIFY dance on the way out and
most of them for no good reason.

While with the rework only 4.5K go into the NOTIFY slow path and 2.5M
(10 %) do the critical section evaluation and 600k (~ 3%) update CPU/MM
CID.

No suprise that Jens is seeing this in his profiles...

Thanks,

        tglx

BUG: rseq selftests and librseq vs. glibc fail

Posted by Thomas Gleixner 1 month, 2 weeks ago

On Sun, Aug 17 2025 at 23:23, Thomas Gleixner wrote:
> It survives the self test suite after I wasted a day to figure out why
> the selftests reliably segfault on a machine which has debian trixie
> installed. The fix is in the branch.

That's glibc 2.41 FWIW. glibc 2.36 from Debian 12 does not have this
problem.

The fix unfortunately only works with a dynamically linked libc,
statically linked libc fails. The fix is basically a revert of

   3bcbc20942db ("selftests/rseq: Play nice with binaries statically linked
                  against glibc 2.35+")

which introduced these weak libc symbols to make static libc linking work.

I have no idea why this creates havoc, but in GDB I saw that libc
manages to overwrite the TLS of the pthread at some place, but I gave up
decoding it further. If no pthread is created it just works. Removing
this weak muck makes it work too.

It's trivial to reproduce. All it needs is to have in the source:

__weak ptrdiff_t __rseq_offset;

w/o even being referenced and creating a pthread. Reproducer below.

TBH, this interface is a horrible hack. libc should expose a proper
function for querying whether it has registered rseq and return the
parameters in a struct. But well...

Build reproducer with:

# gcc -O2 -Wall -o t test.c
# ./t
Segmentation fault

# gcc -O2 -Wall -o t test.c -static
# ./t
#

Removing the weak __rseq_offset makes the dynamic case work too.

Yours sufficiently grumpy

      tglx

----
#define _GNU_SOURCE
#include <pthread.h>
#include <stddef.h>

#define __weak  __attribute__((__weak__))
__weak ptrdiff_t __rseq_offset;

static void *foo(void *arg)
{
	return NULL;
}

int main(int argc, char **argv)
{
	pthread_t t;

	pthread_create(&t, NULL, foo, NULL);
	pthread_join(t, NULL);
	return 0;
}