drivers/hv/mshv_common.c | 2 include/linux/entry-common.h | 12 +-- include/linux/irq-entry-common.h | 31 ++++++-- include/linux/resume_user_mode.h | 40 +++++++--- include/linux/rseq.h | 115 +++++++++---------------------- include/linux/sched.h | 10 +- include/uapi/linux/rseq.h | 21 +---- io_uring/io_uring.h | 2 kernel/entry/common.c | 9 +- kernel/entry/kvm.c | 2 kernel/rseq.c | 142 +++++++++++++++++++++++++-------------- kernel/sched/core.c | 8 +- kernel/sched/membarrier.c | 8 +- 13 files changed, 213 insertions(+), 189 deletions(-)
With the more wide spread usage of rseq in glibc, rseq is not longer a niche use case for special applications. While working on a sane implementation of a rseq based time slice extension mechanism, I noticed several shortcomings of the current rseq code: 1) task::rseq_event_mask is a pointless bitfield despite the fact that the ABI flags it was meant to support have been deprecated and functionally disabled three years ago. 2) task::rseq_event_mask is accumulating bits unless there is a critical section discovered in the user space rseq memory. This results in pointless invocations of the rseq user space exit handler even if there had nothing changed. As a matter of correctness these bits have to be clear when exiting to user space and therefore pristine when coming back into the kernel. Aside of correctness, this also avoids pointless evaluation of the user space memory, which is a performance benefit. 3) The evaluation of critical sections does not differentiate between syscall and interrupt/exception exits. The current implementation silently fixes up critical sections which invoked a syscall unless CONFIG_DEBUG_RSEQ is enabled. That's just wrong. If user space does that on a production kernel it can keep the pieces. The kernel is not there to proliferate mindless user space programming and letting everyone pay the performance penalty. This series addresses these issues and on top converts parts of the user space access over to the new masked access model, which lowers the overhead of Spectre-V1 mitigations significantly on architectures which support it (x86 as of today). This is especially noticable in the access to the rseq_cs field in struct rseq, which is the first quick check to figure out whether a critical section is installed or not. It survives the kernels rseq selftests, but I did not any performance tests vs. rseq because I have no idea how to use the gazillion of undocumented command line parameters of the benchmark. I leave that to people who are so familiar with them, that they assume everyone else is too :) The performance gain on regular workloads is clearly measurable and the consistent event flag state allows now to build the time slice extension mechanism on top. The first POC I implemented: https://lore.kernel.org/lkml/87o6smb3a0.ffs@tglx/ suffered badly from the stale eventmask bits and the cleaned up version brought a whopping 25+% performance gain. The series depends on the seperately posted rseq bugfix: https://lore.kernel.org/lkml/87o6sj6z95.ffs@tglx/ and the uaccess generic helper series: https://lore.kernel.org/lkml/20250813150610.521355442@linutronix.de/ and a related futex fix in git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git locking/urgent For your conveniance it is also available as a conglomorate from git: git://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git rseq/perf Thanks, tglx --- drivers/hv/mshv_common.c | 2 include/linux/entry-common.h | 12 +-- include/linux/irq-entry-common.h | 31 ++++++-- include/linux/resume_user_mode.h | 40 +++++++--- include/linux/rseq.h | 115 +++++++++---------------------- include/linux/sched.h | 10 +- include/uapi/linux/rseq.h | 21 +---- io_uring/io_uring.h | 2 kernel/entry/common.c | 9 +- kernel/entry/kvm.c | 2 kernel/rseq.c | 142 +++++++++++++++++++++++++-------------- kernel/sched/core.c | 8 +- kernel/sched/membarrier.c | 8 +- 13 files changed, 213 insertions(+), 189 deletions(-)
On 2025-08-13 12:29, Thomas Gleixner wrote: > With the more wide spread usage of rseq in glibc, rseq is not longer a > niche use case for special applications. > > While working on a sane implementation of a rseq based time slice extension > mechanism, I noticed several shortcomings of the current rseq code: > > 1) task::rseq_event_mask is a pointless bitfield despite the fact that > the ABI flags it was meant to support have been deprecated and > functionally disabled three years ago. Correct. We could replace this mask by a single flag, set when the events happen, test-and-cleared on return to userspace. This would lessen the synchronization requirements because bitops are not needed anymore. > > 2) task::rseq_event_mask is accumulating bits unless there is a critical > section discovered in the user space rseq memory. This results in > pointless invocations of the rseq user space exit handler even if > there had nothing changed. As a matter of correctness these bits have > to be clear when exiting to user space and therefore pristine when > coming back into the kernel. Aside of correctness, this also avoids > pointless evaluation of the user space memory, which is a performance > benefit. Ouch. Yes. The intent here is indeed to clear those bits in all cases, whether there is an rseq critical section ongoing or not. Good catch! > > 3) The evaluation of critical sections does not differentiate between > syscall and interrupt/exception exits. The current implementation > silently fixes up critical sections which invoked a syscall unless > CONFIG_DEBUG_RSEQ is enabled. > > That's just wrong. If user space does that on a production kernel it > can keep the pieces. The kernel is not there to proliferate mindless > user space programming and letting everyone pay the performance > penalty. Agreed. There is no need to fixup the rseq critical section, however we still need to update other rseq fields (e.g. current CPU number) when returning from a system call. > > This series addresses these issues and on top converts parts of the user > space access over to the new masked access model, which lowers the overhead > of Spectre-V1 mitigations significantly on architectures which support it > (x86 as of today). This is especially noticable in the access to the > rseq_cs field in struct rseq, which is the first quick check to figure out > whether a critical section is installed or not. Nice! I'll try to have a closer look at the series soon. Thanks, Mathieu > > It survives the kernels rseq selftests, but I did not any performance tests > vs. rseq because I have no idea how to use the gazillion of undocumented > command line parameters of the benchmark. I leave that to people who are so > familiar with them, that they assume everyone else is too :) > > The performance gain on regular workloads is clearly measurable and the > consistent event flag state allows now to build the time slice extension > mechanism on top. The first POC I implemented: > > https://lore.kernel.org/lkml/87o6smb3a0.ffs@tglx/ > > suffered badly from the stale eventmask bits and the cleaned up version > brought a whopping 25+% performance gain. > > The series depends on the seperately posted rseq bugfix: > > https://lore.kernel.org/lkml/87o6sj6z95.ffs@tglx/ > > and the uaccess generic helper series: > > https://lore.kernel.org/lkml/20250813150610.521355442@linutronix.de/ > > and a related futex fix in > > git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git locking/urgent > > For your conveniance it is also available as a conglomorate from git: > > git://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git rseq/perf > > Thanks, > > tglx > --- > drivers/hv/mshv_common.c | 2 > include/linux/entry-common.h | 12 +-- > include/linux/irq-entry-common.h | 31 ++++++-- > include/linux/resume_user_mode.h | 40 +++++++--- > include/linux/rseq.h | 115 +++++++++---------------------- > include/linux/sched.h | 10 +- > include/uapi/linux/rseq.h | 21 +---- > io_uring/io_uring.h | 2 > kernel/entry/common.c | 9 +- > kernel/entry/kvm.c | 2 > kernel/rseq.c | 142 +++++++++++++++++++++++++-------------- > kernel/sched/core.c | 8 +- > kernel/sched/membarrier.c | 8 +- > 13 files changed, 213 insertions(+), 189 deletions(-) > -- Mathieu Desnoyers EfficiOS Inc. https://www.efficios.com
On 8/13/25 10:29 AM, Thomas Gleixner wrote: > With the more wide spread usage of rseq in glibc, rseq is not longer a > niche use case for special applications. > > While working on a sane implementation of a rseq based time slice extension > mechanism, I noticed several shortcomings of the current rseq code: > > 1) task::rseq_event_mask is a pointless bitfield despite the fact that > the ABI flags it was meant to support have been deprecated and > functionally disabled three years ago. > > 2) task::rseq_event_mask is accumulating bits unless there is a critical > section discovered in the user space rseq memory. This results in > pointless invocations of the rseq user space exit handler even if > there had nothing changed. As a matter of correctness these bits have > to be clear when exiting to user space and therefore pristine when > coming back into the kernel. Aside of correctness, this also avoids > pointless evaluation of the user space memory, which is a performance > benefit. > > 3) The evaluation of critical sections does not differentiate between > syscall and interrupt/exception exits. The current implementation > silently fixes up critical sections which invoked a syscall unless > CONFIG_DEBUG_RSEQ is enabled. > > That's just wrong. If user space does that on a production kernel it > can keep the pieces. The kernel is not there to proliferate mindless > user space programming and letting everyone pay the performance > penalty. > > This series addresses these issues and on top converts parts of the user > space access over to the new masked access model, which lowers the overhead > of Spectre-V1 mitigations significantly on architectures which support it > (x86 as of today). This is especially noticable in the access to the > rseq_cs field in struct rseq, which is the first quick check to figure out > whether a critical section is installed or not. > > It survives the kernels rseq selftests, but I did not any performance tests > vs. rseq because I have no idea how to use the gazillion of undocumented > command line parameters of the benchmark. I leave that to people who are so > familiar with them, that they assume everyone else is too :) > > The performance gain on regular workloads is clearly measurable and the > consistent event flag state allows now to build the time slice extension > mechanism on top. The first POC I implemented: > > https://lore.kernel.org/lkml/87o6smb3a0.ffs@tglx/ > > suffered badly from the stale eventmask bits and the cleaned up version > brought a whopping 25+% performance gain. Thanks for doing this work, it's been on my list to take a look at rseq as it's quite the pig currently and enabled by default (with what I assume is from a newer libc). -- Jens Axboe
On Wed, Aug 13 2025 at 11:45, Jens Axboe wrote: > On 8/13/25 10:29 AM, Thomas Gleixner wrote: >> >> suffered badly from the stale eventmask bits and the cleaned up version >> brought a whopping 25+% performance gain. > > Thanks for doing this work, it's been on my list to take a look at rseq > as it's quite the pig currently and enabled by default (with what I > assume is from a newer libc). Yes, recent glibc uses it. Could you give it a test ride to see whether this makes a difference in your environment? Thanks tglx
On 8/13/25 3:32 PM, Thomas Gleixner wrote: > On Wed, Aug 13 2025 at 11:45, Jens Axboe wrote: >> On 8/13/25 10:29 AM, Thomas Gleixner wrote: >>> >>> suffered badly from the stale eventmask bits and the cleaned up version >>> brought a whopping 25+% performance gain. >> >> Thanks for doing this work, it's been on my list to take a look at rseq >> as it's quite the pig currently and enabled by default (with what I >> assume is from a newer libc). > > Yes, recent glibc uses it. > > Could you give it a test ride to see whether this makes a difference in > your environment? Yep, I'll give a spin. Won't be before Monday as I'm out of town thu/fri this week, just as a heads-up. If I recall correct, I'd see rseq at 2-3% of overall CPU time last I tested. I'll try and do some pre/post numbers for some of the stuff I usually run. -- Jens Axboe
On Wed, Aug 13 2025 at 15:36, Jens Axboe wrote: > On 8/13/25 3:32 PM, Thomas Gleixner wrote: >> Could you give it a test ride to see whether this makes a difference in >> your environment? > > Yep, I'll give a spin. Appreciated. > If I recall correct, I'd see rseq at 2-3% of overall CPU time last I I'm not surprised. > tested. I'll try and do some pre/post numbers for some of the stuff I > usually run. Cool. Thanks tglx
On Thu, Aug 14 2025 at 00:08, Thomas Gleixner wrote: > On Wed, Aug 13 2025 at 15:36, Jens Axboe wrote: >> On 8/13/25 3:32 PM, Thomas Gleixner wrote: >>> Could you give it a test ride to see whether this makes a difference in >>> your environment? >> >> Yep, I'll give a spin. > > Appreciated. Please do not use the git branch I had in the cover letter. I did some more analysis of this and it's even worse than I thought. Use git://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git rseq/wip instead. I've rewritten the whole pile by now and made it a real fast path without the TIF_NOTIFY horror show, unless the fast path, which runs _after_ the TIF work loop faults. So far that happens once for each fork() as that has to fault in the copy of the user space rseq region. There are lightweight per CPU stats for the various events, which can be accessed via /sys/kernel/debug/rseq/stat which exposes a racy sum of them. Here is the output after a reboot and a full kernel recompile exit: 85703905 // Total invocations signal: 34635 // Invocations from signal delivery slowp: 134 // Slow path via TIF_NOTIFY_RESUME ids: 70052 // Updates of CPU and MM CID cs: 0 // Critical section analysis clear: 0 // Clearing of critical section fixup: 0 // Fixup of critical section (abort) Before the rewrite this took more than a million of ID updates and critical section evaluations even when completely pointless. So on any syscall or interrupt heavy workload this should be clearly visible as a difference in the profile. I'm still not happy about the exit to user fast path decision as it's two conditionals instead of one, but all attempts to do that lightweight somewhere else turned out to make stuff worse as I just burdened other fast path operations, i.e. the scheduler with pointless conditionals. I'll think about that more, but nevertheless this is way better than the current horror show. I also have no real good plan yet how to gradually convert this over, but I'm way too tired to think about that now. It survives the self test suite after I wasted a day to figure out why the selftests reliably segfault on a machine which has debian trixie installed. The fix is in the branch. Michael, can you please run your librseq tests against that too? They have the same segfault problem as the kernel and they lack a run script, so I couldn't be bothered to test against them. See commit 2bff3a0e5998 in that branch. I'll send out a patch with a proper change log later. Thanks, tglx
On 2025-08-17 17:23, Thomas Gleixner wrote: > On Thu, Aug 14 2025 at 00:08, Thomas Gleixner wrote: >> On Wed, Aug 13 2025 at 15:36, Jens Axboe wrote: >>> On 8/13/25 3:32 PM, Thomas Gleixner wrote: >>>> Could you give it a test ride to see whether this makes a difference in >>>> your environment? >>> >>> Yep, I'll give a spin. >> >> Appreciated. > > Please do not use the git branch I had in the cover letter. I did some > more analysis of this and it's even worse than I thought. Use > > git://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git rseq/wip > > instead. > > I've rewritten the whole pile by now and made it a real fast path > without the TIF_NOTIFY horror show, unless the fast path, which runs > _after_ the TIF work loop faults. So far that happens once for each > fork() as that has to fault in the copy of the user space rseq region. > OK. I'll stop reviewing this version of the series and wait for an updated version. Thanks, Mathieu -- Mathieu Desnoyers EfficiOS Inc. https://www.efficios.com
On 2025-08-17 17:23, Thomas Gleixner wrote: > Michael, can you please run your librseq tests against that too? They > have the same segfault problem as the kernel and they lack a run script, > so I couldn't be bothered to test against them. See commit 2bff3a0e5998 > in that branch. I'll send out a patch with a proper change log later. I ran the librseq test suite on the new branch on a Debian Trixie amd64 system and it succeeds, here are the rseq stats before and after. Before: exit: 746809 signal: 3 slowp: 99 ids: 1053 cs: 0 clear: 0 fixup: 0 After: exit: 229294046 signal: 11 slowp: 4570 ids: 615950 cs: 2493682 clear: 194637 fixup: 2299044 And I also ran the same test suite in a 32bit chroot on the same system which also succeeds with the following rseq stats. Before: exit: 717945 signal: 1 slowp: 102 ids: 1039 cs: 0 clear: 0 fixup: 0 After: exit: 201051038 signal: 9 slowp: 4909 ids: 793551 cs: 28887 clear: 12871 fixup: 16016 If you want to run the librseq tests on your system, just do the regular autotools dance and then run 'make check'. Regards, Michael
On Mon, Aug 18 2025 at 13:38, Michael Jeanson wrote: > On 2025-08-17 17:23, Thomas Gleixner wrote: >> Michael, can you please run your librseq tests against that too? They >> have the same segfault problem as the kernel and they lack a run script, >> so I couldn't be bothered to test against them. See commit 2bff3a0e5998 >> in that branch. I'll send out a patch with a proper change log later. > > I ran the librseq test suite on the new branch on a Debian Trixie amd64 > system and it succeeds, here are the rseq stats before and after. Thanks! > Before: > > exit: 746809 > signal: 3 > slowp: 99 > ids: 1053 > cs: 0 > clear: 0 > fixup: 0 > > After: > > exit: 229294046 > signal: 11 > slowp: 4570 > ids: 615950 > cs: 2493682 > clear: 194637 > fixup: 2299044 That looks about right. Can you reset the branch to commit 85b61b265635 ("rseq: Expose stats") which is just adding primitive stats on top of the current mainline code, and provide numbers for that too? That gives you 'notify: , cpuid:, fixup:' numbers, which are not 1:1 mappable to the final ones, but that should give some interesting insight. > If you want to run the librseq tests on your system, just do the regular > autotools dance and then run 'make check'. Might be useful to put such instructions into README.md, no? Thanks, tglx
On 2025-08-18 16:21, Thomas Gleixner wrote: > That looks about right. Can you reset the branch to > > commit 85b61b265635 ("rseq: Expose stats") > > which is just adding primitive stats on top of the current mainline > code, and provide numbers for that too? > > That gives you 'notify: , cpuid:, fixup:' numbers, which are not 1:1 > mappable to the final ones, but that should give some interesting > insight. For amd64 kernel and userspace. Before: notify: 12467 fixup: 12467 cpuid: 12467 After: notify: 123669528 fixup: 123669528 cpuid: 123669528 For amd64 kernel, i386 userspace. Before: notify: 12857 fixup: 12857 cpuid: 12857 After: notify: 120621210 fixup: 120621210 cpuid: 120621210 > Might be useful to put such instructions into README.md, no? Will do. Regards, Michael
On Mon, Aug 18 2025 at 17:29, Michael Jeanson wrote: > On 2025-08-18 16:21, Thomas Gleixner wrote: >> That looks about right. Can you reset the branch to >> >> commit 85b61b265635 ("rseq: Expose stats") >> >> which is just adding primitive stats on top of the current mainline >> code, and provide numbers for that too? >> >> That gives you 'notify: , cpuid:, fixup:' numbers, which are not 1:1 >> mappable to the final ones, but that should give some interesting >> insight. > > For amd64 kernel and userspace. > > Before: > > notify: 12467 > fixup: 12467 > cpuid: 12467 > > > After: > > notify: 123669528 > fixup: 123669528 > cpuid: 123669528 That's insane compared to this: > exit: 229294046 > signal: 11 > slowp: 4570 > ids: 615950 > cs: 2493682 > clear: 194637 > fixup: 2299044 You can assume that the number of exits (to user) is roughly the same, i.e. ~23M, so 12M (> 50%) take the TIF_NOTIFY dance on the way out and most of them for no good reason. While with the rework only 4.5K go into the NOTIFY slow path and 2.5M (10 %) do the critical section evaluation and 600k (~ 3%) update CPU/MM CID. No suprise that Jens is seeing this in his profiles... Thanks, tglx
On Sun, Aug 17 2025 at 23:23, Thomas Gleixner wrote: > It survives the self test suite after I wasted a day to figure out why > the selftests reliably segfault on a machine which has debian trixie > installed. The fix is in the branch. That's glibc 2.41 FWIW. glibc 2.36 from Debian 12 does not have this problem. The fix unfortunately only works with a dynamically linked libc, statically linked libc fails. The fix is basically a revert of 3bcbc20942db ("selftests/rseq: Play nice with binaries statically linked against glibc 2.35+") which introduced these weak libc symbols to make static libc linking work. I have no idea why this creates havoc, but in GDB I saw that libc manages to overwrite the TLS of the pthread at some place, but I gave up decoding it further. If no pthread is created it just works. Removing this weak muck makes it work too. It's trivial to reproduce. All it needs is to have in the source: __weak ptrdiff_t __rseq_offset; w/o even being referenced and creating a pthread. Reproducer below. TBH, this interface is a horrible hack. libc should expose a proper function for querying whether it has registered rseq and return the parameters in a struct. But well... Build reproducer with: # gcc -O2 -Wall -o t test.c # ./t Segmentation fault # gcc -O2 -Wall -o t test.c -static # ./t # Removing the weak __rseq_offset makes the dynamic case work too. Yours sufficiently grumpy tglx ---- #define _GNU_SOURCE #include <pthread.h> #include <stddef.h> #define __weak __attribute__((__weak__)) __weak ptrdiff_t __rseq_offset; static void *foo(void *arg) { return NULL; } int main(int argc, char **argv) { pthread_t t; pthread_create(&t, NULL, foo, NULL); pthread_join(t, NULL); return 0; }
* Thomas Gleixner: > On Sun, Aug 17 2025 at 23:23, Thomas Gleixner wrote: >> It survives the self test suite after I wasted a day to figure out why >> the selftests reliably segfault on a machine which has debian trixie >> installed. The fix is in the branch. > > That's glibc 2.41 FWIW. glibc 2.36 from Debian 12 does not have this > problem. > > The fix unfortunately only works with a dynamically linked libc, > statically linked libc fails. The fix is basically a revert of > > 3bcbc20942db ("selftests/rseq: Play nice with binaries statically linked > against glibc 2.35+") > > which introduced these weak libc symbols to make static libc linking work. > > I have no idea why this creates havoc, but in GDB I saw that libc > manages to overwrite the TLS of the pthread at some place, but I gave up > decoding it further. If no pthread is created it just works. Removing > this weak muck makes it work too. > > It's trivial to reproduce. All it needs is to have in the source: > > __weak ptrdiff_t __rseq_offset; > > w/o even being referenced and creating a pthread. Reproducer below. Well, that's sort of expected. You can't define glibc symbols that are not intended for interposition and expect things to work. It's kind of like writing: int _rtld_global; That's going to fail rather spectaculary, too. We make an exception for symbols that are not reserved (you can build in ISO C mode and define open, close, etc., at least as long as you link to glibc only). But __rseq_offset is a reserved name, so that is not applicable here. The real change here is GCC changing from -fcommon (which made a lot of these things work in the past) to -fno-common. Thanks, Florian
On Mon, Aug 18 2025 at 16:15, Florian Weimer wrote: > * Thomas Gleixner: >> It's trivial to reproduce. All it needs is to have in the source: >> >> __weak ptrdiff_t __rseq_offset; >> >> w/o even being referenced and creating a pthread. Reproducer below. > > Well, that's sort of expected. You can't define glibc symbols that are > not intended for interposition and expect things to work. It's kind of > like writing: > > int _rtld_global; > > That's going to fail rather spectaculary, too. We make an exception for > symbols that are not reserved (you can build in ISO C mode and define > open, close, etc., at least as long as you link to glibc only). But > __rseq_offset is a reserved name, so that is not applicable here. > > The real change here is GCC changing from -fcommon (which made a lot of > these things work in the past) to -fno-common. Thanks for the explanation! So the only way to make this actually work is to revert that commit and the folks who want to link that statically need to come up with: #ifdef _BUILD_STATICALLY extern .... #else ptr = dlsym(...); #endif or something daft like that. A proper function interface would avoid all that nonsense, but we can't have nice things or can we? Thanks, tglx
> On Aug 18, 2025, at 10:13 AM, Thomas Gleixner <tglx@linutronix.de> wrote: > > On Mon, Aug 18 2025 at 16:15, Florian Weimer wrote: >> * Thomas Gleixner: >>> It's trivial to reproduce. All it needs is to have in the source: >>> >>> __weak ptrdiff_t __rseq_offset; >>> >>> w/o even being referenced and creating a pthread. Reproducer below. >> >> Well, that's sort of expected. You can't define glibc symbols that are >> not intended for interposition and expect things to work. It's kind of >> like writing: >> >> int _rtld_global; >> >> That's going to fail rather spectaculary, too. We make an exception for >> symbols that are not reserved (you can build in ISO C mode and define >> open, close, etc., at least as long as you link to glibc only). But >> __rseq_offset is a reserved name, so that is not applicable here. >> >> The real change here is GCC changing from -fcommon (which made a lot of >> these things work in the past) to -fno-common. > > Thanks for the explanation! > > So the only way to make this actually work is to revert that commit and > the folks who want to link that statically need to come up with: > > #ifdef _BUILD_STATICALLY > extern .... > > #else > ptr = dlsym(...); > #endif > > or something daft like that. A proper function interface would avoid all > that nonsense, but we can't have nice things or can we? Could the rseq(2) syscall itself return the already registered rseq structure address? Perhaps a new flag argument to the rseq(2) syscall to query the registered rseq address or return the address of the already registered rseq structure when it fails to register a new one. Application can call it when the call to register a rseq structure fails. -Prakash > > Thanks, > > tglx >
On 2025-08-29 14:44, Prakash Sangappa wrote: > > >> On Aug 18, 2025, at 10:13 AM, Thomas Gleixner <tglx@linutronix.de> wrote: >> >> On Mon, Aug 18 2025 at 16:15, Florian Weimer wrote: >>> * Thomas Gleixner: >>>> It's trivial to reproduce. All it needs is to have in the source: >>>> >>>> __weak ptrdiff_t __rseq_offset; >>>> >>>> w/o even being referenced and creating a pthread. Reproducer below. >>> >>> Well, that's sort of expected. You can't define glibc symbols that are >>> not intended for interposition and expect things to work. It's kind of >>> like writing: >>> >>> int _rtld_global; >>> >>> That's going to fail rather spectaculary, too. We make an exception for >>> symbols that are not reserved (you can build in ISO C mode and define >>> open, close, etc., at least as long as you link to glibc only). But >>> __rseq_offset is a reserved name, so that is not applicable here. >>> >>> The real change here is GCC changing from -fcommon (which made a lot of >>> these things work in the past) to -fno-common. >> >> Thanks for the explanation! >> >> So the only way to make this actually work is to revert that commit and >> the folks who want to link that statically need to come up with: >> >> #ifdef _BUILD_STATICALLY >> extern .... >> >> #else >> ptr = dlsym(...); >> #endif >> >> or something daft like that. A proper function interface would avoid all >> that nonsense, but we can't have nice things or can we? > > > Could the rseq(2) syscall itself return the already registered rseq structure address? > Perhaps a new flag argument to the rseq(2) syscall to query the registered rseq address > or return the address of the already registered rseq structure when it fails to register a new one. > > Application can call it when the call to register a rseq structure fails. There is a ptrace(2) PTRACE_GET_RSEQ_CONFIGURATION to achieve something similar. I don't know if a dependency on ptrace would be acceptable for that use-case though. Thanks, Mathieu > > -Prakash > >> >> Thanks, >> >> tglx >> > -- Mathieu Desnoyers EfficiOS Inc. https://www.efficios.com
> On Aug 29, 2025, at 11:50 AM, Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote: > > On 2025-08-29 14:44, Prakash Sangappa wrote: >>> On Aug 18, 2025, at 10:13 AM, Thomas Gleixner <tglx@linutronix.de> wrote: >>> >>> On Mon, Aug 18 2025 at 16:15, Florian Weimer wrote: >>>> * Thomas Gleixner: >>>>> It's trivial to reproduce. All it needs is to have in the source: >>>>> >>>>> __weak ptrdiff_t __rseq_offset; >>>>> >>>>> w/o even being referenced and creating a pthread. Reproducer below. >>>> >>>> Well, that's sort of expected. You can't define glibc symbols that are >>>> not intended for interposition and expect things to work. It's kind of >>>> like writing: >>>> >>>> int _rtld_global; >>>> >>>> That's going to fail rather spectaculary, too. We make an exception for >>>> symbols that are not reserved (you can build in ISO C mode and define >>>> open, close, etc., at least as long as you link to glibc only). But >>>> __rseq_offset is a reserved name, so that is not applicable here. >>>> >>>> The real change here is GCC changing from -fcommon (which made a lot of >>>> these things work in the past) to -fno-common. >>> >>> Thanks for the explanation! >>> >>> So the only way to make this actually work is to revert that commit and >>> the folks who want to link that statically need to come up with: >>> >>> #ifdef _BUILD_STATICALLY >>> extern .... >>> >>> #else >>> ptr = dlsym(...); >>> #endif >>> >>> or something daft like that. A proper function interface would avoid all >>> that nonsense, but we can't have nice things or can we? >> Could the rseq(2) syscall itself return the already registered rseq structure address? >> Perhaps a new flag argument to the rseq(2) syscall to query the registered rseq address >> or return the address of the already registered rseq structure when it fails to register a new one. >> Application can call it when the call to register a rseq structure fails. > > There is a ptrace(2) PTRACE_GET_RSEQ_CONFIGURATION to achieve > something similar. I don't know if a dependency on ptrace would > be acceptable for that use-case though. Can a thread call ptrace(PTRACE_GET_RSEQ_CONFIGURATION,..) on itself? May be something similar can be added to rseq(2) . Thanks, -Prakash. > > Thanks, > > Mathieu > > > >> -Prakash >>> >>> Thanks, >>> >>> tglx >>> > > > -- > Mathieu Desnoyers > EfficiOS Inc. > https://www.efficios.com
* Thomas Gleixner: > On Mon, Aug 18 2025 at 16:15, Florian Weimer wrote: >> * Thomas Gleixner: >>> It's trivial to reproduce. All it needs is to have in the source: >>> >>> __weak ptrdiff_t __rseq_offset; >>> >>> w/o even being referenced and creating a pthread. Reproducer below. >> >> Well, that's sort of expected. You can't define glibc symbols that are >> not intended for interposition and expect things to work. It's kind of >> like writing: >> >> int _rtld_global; >> >> That's going to fail rather spectaculary, too. We make an exception for >> symbols that are not reserved (you can build in ISO C mode and define >> open, close, etc., at least as long as you link to glibc only). But >> __rseq_offset is a reserved name, so that is not applicable here. >> >> The real change here is GCC changing from -fcommon (which made a lot of >> these things work in the past) to -fno-common. > > Thanks for the explanation! > > So the only way to make this actually work is to revert that commit and > the folks who want to link that statically need to come up with: > > #ifdef _BUILD_STATICALLY > extern .... > > #else > ptr = dlsym(...); > #endif > > or something daft like that. A proper function interface would avoid all > that nonsense, but we can't have nice things or can we? I don't understand why a function would be different. Well, a function *declaration* would be implicitly extern, in a way a variable declaration is not (without -fcommon). Maybe it's just about the missing extern keyword? You could add the extern keyword and check &__rseq_offset for NULL if you want to probe for the availability of the signal? Or use: #if __has_include(<sys/rseq.h>) #include <sys/rseq.h> /* Code that depends on glibc's rseq support goes here. */ #endif Thanks, Florian
On Mon, Aug 18, 2025, Florian Weimer wrote: > * Thomas Gleixner: > > > On Mon, Aug 18 2025 at 16:15, Florian Weimer wrote: > >> * Thomas Gleixner: > >>> It's trivial to reproduce. All it needs is to have in the source: > >>> > >>> __weak ptrdiff_t __rseq_offset; > >>> > >>> w/o even being referenced and creating a pthread. Reproducer below. > >> > >> Well, that's sort of expected. You can't define glibc symbols that are > >> not intended for interposition and expect things to work. It's kind of > >> like writing: > >> > >> int _rtld_global; > >> > >> That's going to fail rather spectaculary, too. We make an exception for > >> symbols that are not reserved (you can build in ISO C mode and define > >> open, close, etc., at least as long as you link to glibc only). But > >> __rseq_offset is a reserved name, so that is not applicable here. > >> > >> The real change here is GCC changing from -fcommon (which made a lot of > >> these things work in the past) to -fno-common. > > > > Thanks for the explanation! > > > > So the only way to make this actually work is to revert that commit and > > the folks who want to link that statically need to come up with: > > > > #ifdef _BUILD_STATICALLY > > extern .... > > > > #else > > ptr = dlsym(...); > > #endif > > > > or something daft like that. A proper function interface would avoid all > > that nonsense, but we can't have nice things or can we? > > I don't understand why a function would be different. Well, a function > *declaration* would be implicitly extern, in a way a variable > declaration is not (without -fcommon). Maybe it's just about the > missing extern keyword? > > You could add the extern keyword and check &__rseq_offset for NULL if > you want to probe for the availability of the signal? That will fail to link if the glibc version doesn't support rseq in any capacity, which is why I added the __weak crud. >Or use: > > #if __has_include(<sys/rseq.h>) > #include <sys/rseq.h> > /* Code that depends on glibc's rseq support goes here. */ FWIW, the code in question doesn't depend on rseq per se, rather the problem is that attempting to register a restartable sequence fails if glibc has already "claimed" rseq. What about something horrific like this? Or if __has_include(<sys/rseq.h>) is preferrable to checking the glibc version, go with that. The idea with checking LIBC_RSEQ_STATIC_LINK is to give folks a way to force static linking if their libc registers rseq, but doesn't satisfy the glibc version checks for whatever reason. diff --git a/tools/testing/selftests/rseq/rseq.c b/tools/testing/selftests/rseq/rseq.c index 663a9cef1952..1a88352fcff3 100644 --- a/tools/testing/selftests/rseq/rseq.c +++ b/tools/testing/selftests/rseq/rseq.c @@ -36,17 +36,18 @@ #include "../kselftest.h" #include "rseq.h" -/* - * Define weak versions to play nice with binaries that are statically linked - * against a libc that doesn't support registering its own rseq. - */ -__weak ptrdiff_t __rseq_offset; -__weak unsigned int __rseq_size; -__weak unsigned int __rseq_flags; +#if defined(LIBC_RSEQ_STATIC_LINK) || __GLIBC__ > 2 || (__GLIBC__ == 2 && __GLIBC_MINOR__ >= 35) +extern ptrdiff_t __rseq_offset; +extern unsigned int __rseq_size; +extern unsigned int __rseq_flags; +#define GLIBC_RSEQ_PTR(x) &x +#else +#define GLIBC_RSEQ_PTR(x) NULL +#endif -static const ptrdiff_t *libc_rseq_offset_p = &__rseq_offset; -static const unsigned int *libc_rseq_size_p = &__rseq_size; -static const unsigned int *libc_rseq_flags_p = &__rseq_flags; +static const ptrdiff_t *libc_rseq_offset_p = GLIBC_RSEQ_PTR(__rseq_offset); +static const unsigned int *libc_rseq_size_p = GLIBC_RSEQ_PTR(__rseq_size); +static const unsigned int *libc_rseq_flags_p = GLIBC_RSEQ_PTR(__rseq_flags); /* Offset from the thread pointer to the rseq area. */ ptrdiff_t rseq_offset; @@ -209,7 +210,7 @@ void rseq_init(void) * libc not having registered a restartable sequence. Try to find the * symbols if that's the case. */ - if (!*libc_rseq_size_p) { + if (!libc_rseq_size_p || !*libc_rseq_size_p) { libc_rseq_offset_p = dlsym(RTLD_NEXT, "__rseq_offset"); libc_rseq_size_p = dlsym(RTLD_NEXT, "__rseq_size"); libc_rseq_flags_p = dlsym(RTLD_NEXT, "__rseq_flags");
* Sean Christopherson: > On Mon, Aug 18, 2025, Florian Weimer wrote: >> You could add the extern keyword and check &__rseq_offset for NULL if >> you want to probe for the availability of the signal? > > That will fail to link if the glibc version doesn't support rseq in > any capacity, which is why I added the __weak crud. You need both (extern and weak) to get a weak symbol reference instead of a weak symbol definition. You still need to check &__rseq_offset, of course. >>Or use: >> >> #if __has_include(<sys/rseq.h>) >> #include <sys/rseq.h> >> /* Code that depends on glibc's rseq support goes here. */ > > FWIW, the code in question doesn't depend on rseq per se, rather the problem is > that attempting to register a restartable sequence fails if glibc has already > "claimed" rseq. You can set GLIBC_TUNABLES=glibc.pthread.rseq=0 as an environment variable to prevent glibc from registering it. For a test that's probably okay? It won't help with other libcs that might use rseq eventually. > What about something horrific like this? Or if __has_include(<sys/rseq.h>) is > preferrable to checking the glibc version, go with that. The idea with checking > LIBC_RSEQ_STATIC_LINK is to give folks a way to force static linking if their > libc registers rseq, but doesn't satisfy the glibc version checks for whatever > reason. > > diff --git a/tools/testing/selftests/rseq/rseq.c b/tools/testing/selftests/rseq/rseq.c > index 663a9cef1952..1a88352fcff3 100644 > --- a/tools/testing/selftests/rseq/rseq.c > +++ b/tools/testing/selftests/rseq/rseq.c > @@ -36,17 +36,18 @@ > #include "../kselftest.h" > #include "rseq.h" > > -/* > - * Define weak versions to play nice with binaries that are statically linked > - * against a libc that doesn't support registering its own rseq. > - */ > -__weak ptrdiff_t __rseq_offset; > -__weak unsigned int __rseq_size; > -__weak unsigned int __rseq_flags; > +#if defined(LIBC_RSEQ_STATIC_LINK) || __GLIBC__ > 2 || (__GLIBC__ == 2 && __GLIBC_MINOR__ >= 35) > +extern ptrdiff_t __rseq_offset; > +extern unsigned int __rseq_size; > +extern unsigned int __rseq_flags; > +#define GLIBC_RSEQ_PTR(x) &x > +#else > +#define GLIBC_RSEQ_PTR(x) NULL > +#endif That doesn't work for a couple of distributions nowadays that are nominally based on glibc 2.34. The AArch64 performance improvement for sched_getcpu was just too sweet to give up, so these distributions have __rseq_offset@@GLIBC_2.35, too. We added these symbols specifically so that applications that need to interact with rseq can do so safely, even though glibc claimed it for the sched_getcpu acceleration. Thanks, Florian
On Mon, Aug 18, 2025, Florian Weimer wrote: > * Sean Christopherson: > > > On Mon, Aug 18, 2025, Florian Weimer wrote: > >> You could add the extern keyword and check &__rseq_offset for NULL if > >> you want to probe for the availability of the signal? > > > > That will fail to link if the glibc version doesn't support rseq in > > any capacity, which is why I added the __weak crud. > > You need both (extern and weak) to get a weak symbol reference instead > of a weak symbol definition. You still need to check &__rseq_offset, of > course. Ooh, you're saying add "extern" to the existing __weak symbol, not replace it. Huh, TIL weak symbol references are a thing. This works with static and dynamic linking, with and without an rseq-aware glibc. Thomas, does this fix the problem you were seeing? diff --git a/tools/testing/selftests/rseq/rseq.c b/tools/testing/selftests/rseq/rseq.c index 663a9cef1952..d17ded120d48 100644 --- a/tools/testing/selftests/rseq/rseq.c +++ b/tools/testing/selftests/rseq/rseq.c @@ -40,9 +40,9 @@ * Define weak versions to play nice with binaries that are statically linked * against a libc that doesn't support registering its own rseq. */ -__weak ptrdiff_t __rseq_offset; -__weak unsigned int __rseq_size; -__weak unsigned int __rseq_flags; +extern __weak ptrdiff_t __rseq_offset; +extern __weak unsigned int __rseq_size; +extern __weak unsigned int __rseq_flags; static const ptrdiff_t *libc_rseq_offset_p = &__rseq_offset; static const unsigned int *libc_rseq_size_p = &__rseq_size; @@ -209,7 +209,7 @@ void rseq_init(void) * libc not having registered a restartable sequence. Try to find the * symbols if that's the case. */ - if (!*libc_rseq_size_p) { + if (!libc_rseq_offset_p || !*libc_rseq_size_p) { libc_rseq_offset_p = dlsym(RTLD_NEXT, "__rseq_offset"); libc_rseq_size_p = dlsym(RTLD_NEXT, "__rseq_size"); libc_rseq_flags_p = dlsym(RTLD_NEXT, "__rseq_flags"); > >>Or use: > >> > >> #if __has_include(<sys/rseq.h>) > >> #include <sys/rseq.h> > >> /* Code that depends on glibc's rseq support goes here. */ > > > > FWIW, the code in question doesn't depend on rseq per se, rather the problem is > > that attempting to register a restartable sequence fails if glibc has already > > "claimed" rseq. > > You can set GLIBC_TUNABLES=glibc.pthread.rseq=0 as an environment > variable to prevent glibc from registering it. For a test that's > probably okay? It won't help with other libcs that might use rseq > eventually. I vaguely recall exploring that option when trying to get static linking working. I think I shied away from it because I wanted to have something that Just Worked for all users, and didn't like the idea of hardcoding that into the KVM selftests makefile.
On Mon, Aug 18 2025 at 13:27, Sean Christopherson wrote: > On Mon, Aug 18, 2025, Florian Weimer wrote: >> You need both (extern and weak) to get a weak symbol reference instead >> of a weak symbol definition. You still need to check &__rseq_offset, of >> course. > > Ooh, you're saying add "extern" to the existing __weak symbol, not replace it. > Huh, TIL weak symbol references are a thing. > > This works with static and dynamic linking, with and without an rseq-aware glibc. > > Thomas, does this fix the problem you were seeing? > > diff --git a/tools/testing/selftests/rseq/rseq.c b/tools/testing/selftests/rseq/rseq.c > index 663a9cef1952..d17ded120d48 100644 > --- a/tools/testing/selftests/rseq/rseq.c > +++ b/tools/testing/selftests/rseq/rseq.c > @@ -40,9 +40,9 @@ > * Define weak versions to play nice with binaries that are statically linked > * against a libc that doesn't support registering its own rseq. > */ > -__weak ptrdiff_t __rseq_offset; > -__weak unsigned int __rseq_size; > -__weak unsigned int __rseq_flags; > +extern __weak ptrdiff_t __rseq_offset; > +extern __weak unsigned int __rseq_size; > +extern __weak unsigned int __rseq_flags; > > static const ptrdiff_t *libc_rseq_offset_p = &__rseq_offset; > static const unsigned int *libc_rseq_size_p = &__rseq_size; > @@ -209,7 +209,7 @@ void rseq_init(void) > * libc not having registered a restartable sequence. Try to find the > * symbols if that's the case. > */ > - if (!*libc_rseq_size_p) { > + if (!libc_rseq_offset_p || !*libc_rseq_size_p) { If I make that: + if (!*libc_rseq_offset_p || !*libc_rseq_size_p) { then it makes sense and actually works. The pointer can hardly be NULL, even when statically linked, no? Thanks, tglx
On Tue, Aug 19, 2025, Thomas Gleixner wrote: > On Mon, Aug 18 2025 at 13:27, Sean Christopherson wrote: > > On Mon, Aug 18, 2025, Florian Weimer wrote: > >> You need both (extern and weak) to get a weak symbol reference instead > >> of a weak symbol definition. You still need to check &__rseq_offset, of > >> course. > > > > Ooh, you're saying add "extern" to the existing __weak symbol, not replace it. > > Huh, TIL weak symbol references are a thing. > > > > This works with static and dynamic linking, with and without an rseq-aware glibc. > > > > Thomas, does this fix the problem you were seeing? > > > > diff --git a/tools/testing/selftests/rseq/rseq.c b/tools/testing/selftests/rseq/rseq.c > > index 663a9cef1952..d17ded120d48 100644 > > --- a/tools/testing/selftests/rseq/rseq.c > > +++ b/tools/testing/selftests/rseq/rseq.c > > @@ -40,9 +40,9 @@ > > * Define weak versions to play nice with binaries that are statically linked > > * against a libc that doesn't support registering its own rseq. > > */ > > -__weak ptrdiff_t __rseq_offset; > > -__weak unsigned int __rseq_size; > > -__weak unsigned int __rseq_flags; > > +extern __weak ptrdiff_t __rseq_offset; > > +extern __weak unsigned int __rseq_size; > > +extern __weak unsigned int __rseq_flags; > > > > static const ptrdiff_t *libc_rseq_offset_p = &__rseq_offset; > > static const unsigned int *libc_rseq_size_p = &__rseq_size; > > @@ -209,7 +209,7 @@ void rseq_init(void) > > * libc not having registered a restartable sequence. Try to find the > > * symbols if that's the case. > > */ > > - if (!*libc_rseq_size_p) { > > + if (!libc_rseq_offset_p || !*libc_rseq_size_p) { Doh, I meant to check libc_rseq_size_p for NULL, i.e. if (!libc_rseq_size_p || !*libc_rseq_size_p) { > > If I make that: > > + if (!*libc_rseq_offset_p || !*libc_rseq_size_p) { > > then it makes sense and actually works. The pointer can hardly be NULL, > even when statically linked, no? IIUC, it is indeed the pointers that are set to NULL/0, because for unresolved symbols, the symbol itself, not its value, is set to '0'. Which makes sense, because if there is no symbol, then it can't have a value. I.e. the address of the symbol is '0', and its value is undefined. E.g. statically linking this against glibc without rseq support: diff --git a/tools/testing/selftests/rseq/rseq.c b/tools/testing/selftests/rseq/rseq.c index 663a9cef1952..959bdcb32e96 100644 --- a/tools/testing/selftests/rseq/rseq.c +++ b/tools/testing/selftests/rseq/rseq.c @@ -40,9 +40,9 @@ * Define weak versions to play nice with binaries that are statically linked * against a libc that doesn't support registering its own rseq. */ -__weak ptrdiff_t __rseq_offset; -__weak unsigned int __rseq_size; -__weak unsigned int __rseq_flags; +extern __weak ptrdiff_t __rseq_offset; +extern __weak unsigned int __rseq_size; +extern __weak unsigned int __rseq_flags; static const ptrdiff_t *libc_rseq_offset_p = &__rseq_offset; static const unsigned int *libc_rseq_size_p = &__rseq_size; @@ -209,7 +209,12 @@ void rseq_init(void) * libc not having registered a restartable sequence. Try to find the * symbols if that's the case. */ - if (!*libc_rseq_size_p) { + printf("libc_rseq_offset_p = %lx (%lx), libc_rseq_size_p = %lx (%lx)\n", + (unsigned long)libc_rseq_offset_p, (unsigned long)libc_rseq_size_p, + (unsigned long)&__rseq_offset, (unsigned long)&__rseq_size); + printf("__rseq_size = %u\n", __rseq_size); + + if (!libc_rseq_size_p || !*libc_rseq_size_p) { libc_rseq_offset_p = dlsym(RTLD_NEXT, "__rseq_offset"); libc_rseq_size_p = dlsym(RTLD_NEXT, "__rseq_size"); libc_rseq_flags_p = dlsym(RTLD_NEXT, "__rseq_flags"); Generates this output: $ ./rseq_test libc_rseq_offset_p = 0 (0), libc_rseq_size_p = 0 (0) Segmentation fault Because trying to dereference __rseq_size hits NULL/0.
* Sean Christopherson: >> If I make that: >> >> + if (!*libc_rseq_offset_p || !*libc_rseq_size_p) { >> >> then it makes sense and actually works. The pointer can hardly be NULL, >> even when statically linked, no? > > IIUC, it is indeed the pointers that are set to NULL/0, because for unresolved > symbols, the symbol itself, not its value, is set to '0'. Which makes sense, > because if there is no symbol, then it can't have a value. Right, that's how weak symbol references work. Thanks, Florian
© 2016 - 2025 Red Hat, Inc.