This is a follow up on the V2 version:
https://lore.kernel.org/20251022110646.839870156@linutronix.de
V1 contains a detailed explanation:
https://lore.kernel.org/20250908225709.144709889@linutronix.de
TLDR: Time slice extensions are an attempt to provide opportunistic
priority ceiling without the overhead of an actual priority ceiling
protocol, but also without the guarantees such a protocol provides.
The intent is to avoid situations where a user space thread is interrupted
in a critical section and scheduled out while holding a resource on which
the preempting thread or other threads in the system might block. In the
worst case that obviously prevents those threads from making progress for
at least a full time slice. This is especially problematic for user space
spinlocks, which are a patently bad idea to begin with, but it also applies
to other mechanisms.
This series uses the existing RSEQ user memory to implement it.
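A rough sketch of the intended usage pattern from the user space side is
shown below. The field and constant names used here (slice_ctrl,
RSEQ_SLICE_EXT_REQUEST, RSEQ_SLICE_EXT_GRANTED, the syscall number) are
illustrative placeholders only and not the actual UAPI; only the
rseq_slice_yield() syscall and the enabling prctl() are named as such in
the series. The idea: enable the feature once via the new prctl(), set a
request flag in the rseq area around a short critical section, and if the
kernel granted a short extension instead of preempting, hand the CPU back
promptly via the new syscall:

#include <stdatomic.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Placeholder names and numbers for illustration only */
#define RSEQ_SLICE_EXT_REQUEST	0x01
#define RSEQ_SLICE_EXT_GRANTED	0x02
#define __NR_rseq_slice_yield	470	/* made up syscall number */

/* Stand-in for the slice control word in the registered rseq area */
static _Atomic unsigned int slice_ctrl;

static void short_critical_section(void)
{
	/* Ask the kernel to briefly defer preemption */
	atomic_store_explicit(&slice_ctrl, RSEQ_SLICE_EXT_REQUEST,
			      memory_order_relaxed);

	/* ... do a few microseconds of lock-held work ... */

	/* Drop the request and check whether an extension was granted */
	unsigned int ctrl = atomic_exchange_explicit(&slice_ctrl, 0,
						     memory_order_relaxed);
	if (ctrl & RSEQ_SLICE_EXT_GRANTED)
		syscall(__NR_rseq_slice_yield);	/* return the CPU right away */
}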
Changes vs. V2:
- Rebase on the newest RSEQ and uaccess changes
- Document the command line parameter - Sebastian
- Use ENOTSUPP in the stub inline to be consistent - Sebastian
- Add sysctl documentation - Sebastian
- Simplify timer cancelation - Sebastian
- Restore the dropped 'From: Peter...' line in patch 1 - Sebastian
- More documentation/comment fixes - Randy
The uaccess and RSEQ modifications on which this series is based can be
found here:
https://lore.kernel.org/20251029123717.886619142@linutronix.de
and in git:
git://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git rseq/cid
For your convenience all of it is also available as a conglomerate from
git:
git://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git rseq/slice
Thanks,
tglx
---
Peter Zijlstra (1):
sched: Provide and use set_need_resched_current()
Thomas Gleixner (11):
rseq: Add fields and constants for time slice extension
rseq: Provide static branch for time slice extensions
rseq: Add statistics for time slice extensions
rseq: Add prctl() to enable time slice extensions
rseq: Implement sys_rseq_slice_yield()
rseq: Implement syscall entry work for time slice extensions
rseq: Implement time slice extension enforcement timer
rseq: Reset slice extension when scheduled
rseq: Implement rseq_grant_slice_extension()
entry: Hook up rseq time slice extension
selftests/rseq: Implement time slice extension test
Documentation/admin-guide/kernel-parameters.txt | 5
Documentation/admin-guide/sysctl/kernel.rst | 6
Documentation/userspace-api/index.rst | 1
Documentation/userspace-api/rseq.rst | 118 +++++++++
arch/alpha/kernel/syscalls/syscall.tbl | 1
arch/arm/tools/syscall.tbl | 1
arch/arm64/tools/syscall_32.tbl | 1
arch/m68k/kernel/syscalls/syscall.tbl | 1
arch/microblaze/kernel/syscalls/syscall.tbl | 1
arch/mips/kernel/syscalls/syscall_n32.tbl | 1
arch/mips/kernel/syscalls/syscall_n64.tbl | 1
arch/mips/kernel/syscalls/syscall_o32.tbl | 1
arch/parisc/kernel/syscalls/syscall.tbl | 1
arch/powerpc/kernel/syscalls/syscall.tbl | 1
arch/s390/kernel/syscalls/syscall.tbl | 1
arch/s390/mm/pfault.c | 3
arch/sh/kernel/syscalls/syscall.tbl | 1
arch/sparc/kernel/syscalls/syscall.tbl | 1
arch/x86/entry/syscalls/syscall_32.tbl | 1
arch/x86/entry/syscalls/syscall_64.tbl | 1
arch/xtensa/kernel/syscalls/syscall.tbl | 1
include/linux/entry-common.h | 2
include/linux/rseq.h | 11
include/linux/rseq_entry.h | 191 ++++++++++++++-
include/linux/rseq_types.h | 30 ++
include/linux/sched.h | 7
include/linux/syscalls.h | 1
include/linux/thread_info.h | 16 -
include/uapi/asm-generic/unistd.h | 5
include/uapi/linux/prctl.h | 10
include/uapi/linux/rseq.h | 38 +++
init/Kconfig | 12
kernel/entry/common.c | 14 -
kernel/entry/syscall-common.c | 11
kernel/rcu/tiny.c | 8
kernel/rcu/tree.c | 14 -
kernel/rcu/tree_exp.h | 3
kernel/rcu/tree_plugin.h | 9
kernel/rcu/tree_stall.h | 3
kernel/rseq.c | 299 ++++++++++++++++++++++++
kernel/sys.c | 6
kernel/sys_ni.c | 1
scripts/syscall.tbl | 1
tools/testing/selftests/rseq/.gitignore | 1
tools/testing/selftests/rseq/Makefile | 5
tools/testing/selftests/rseq/rseq-abi.h | 27 ++
tools/testing/selftests/rseq/slice_test.c | 198 +++++++++++++++
47 files changed, 1019 insertions(+), 53 deletions(-)
On 2025-10-29 14:22:11 [+0100], Thomas Gleixner wrote:
> Changes vs. V2:
>
> - Rebase on the newest RSEQ and uaccess changes
…
> and in git:
>
> git://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git rseq/cid
>
> For your convenience all of it is also available as a conglomerate from
> git:
>
> git://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git rseq/slice

rseq/slice is older than rseq/cid. rseq/slice has the
__put_kernel_nofault typo. rseq/cid looks correct.

> Thanks,
>
>         tglx

Sebastian
On Wed, 29 Oct 2025 16:10:55 +0100
Sebastian Andrzej Siewior <bigeasy@linutronix.de> wrote:

> > and in git:
> >
> > git://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git rseq/cid
> >
> > For your convenience all of it is also available as a conglomerate from
> > git:
> >
> > git://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git rseq/slice
>
> rseq/slice is older than rseq/cid. rseq/slice has the
> __put_kernel_nofault typo. rseq/cid looks correct.

Yeah, I started looking at both too, and checking out rseq/slice and trying
to do a rebase on top of rseq/cid causes a bunch of conflicts.

I'm continuing the rebase and just skipping the changed commits.

-- Steve
On Wed, Oct 29 2025 at 11:40, Steven Rostedt wrote:
> On Wed, 29 Oct 2025 16:10:55 +0100
> Sebastian Andrzej Siewior <bigeasy@linutronix.de> wrote:
>
>> > and in git:
>> >
>> > git://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git rseq/cid
>> >
>> > For your convenience all of it is also available as a conglomerate from
>> > git:
>> >
>> > git://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git rseq/slice
>>
>> rseq/slice is older than rseq/cid. rseq/slice has the
>> __put_kernel_nofault typo. rseq/cid looks correct.
>
> Yeah, I started looking at both too, and checking out rseq/slice and trying
> to do a rebase on top of rseq/cid causes a bunch of conflicts.
>
> I'm continuing the rebase and just skipping the changed commits.
Forgot to push the updated branch out....
Fixed now.
Thanks,
tglx
> On Oct 29, 2025, at 6:22 AM, Thomas Gleixner <tglx@linutronix.de> wrote:
>
> This is a follow up on the V2 version:
>
> https://lore.kernel.org/20251022110646.839870156@linutronix.de
>
> V1 contains a detailed explanation:
>
> https://lore.kernel.org/20250908225709.144709889@linutronix.de
>
> TLDR: Time slice extensions are an attempt to provide opportunistic
> priority ceiling without the overhead of an actual priority ceiling
> protocol, but also without the guarantees such a protocol provides.

[…]

>
> The uaccess and RSEQ modifications on which this series is based can be
> found here:
>
> https://lore.kernel.org/20251029123717.886619142@linutronix.de
>
> and in git:
>
> git://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git rseq/cid
>
> For your convenience all of it is also available as a conglomerate from
> git:
>
> git://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git rseq/slice
>

Hit this watchdog panic.

Using following tree. Assume this Is the latest.
https://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git/ rseq/slice

Appears to be spinning in mm_get_cid(). Must be the mm cid changes.
https://lore.kernel.org/all/20251029123717.886619142@linutronix.de/

-Prakash

-------------------------------------------------------

watchdog: CPU152: Watchdog detected hard LOCKUP on cpu 152
..
[   93.093858] RIP: 0010:mm_get_cid+0x7e/0xd0
[   93.093866] Code: 4c eb 63 f3 90 8b 05 f1 6a 66 02 8b 35 d7 bc 8e 01 83 c0 3f 48 89 f5 c1 e8 03 25 f8 ff ff 1f 48 8d 3c 43 e8 24 ce 62 00 89 c1 <39> e8 73 d5 8b 35 c8 6a 66 02 89 c0 8d 56 3f c1 ea 03 81 e2 f8 ff
[   93.093867] RSP: 0018:ff734c4591c6bc38 EFLAGS: 00000046
[   93.093869] RAX: 0000000000000180 RBX: ff3c42cea15ec2c0 RCX: 0000000000000180
[   93.093871] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
[   93.093872] RBP: 0000000000000180 R08: 0000000000000000 R09: 0000000000000000
[   93.093873] R10: 0000000000000000 R11: 00000000fffffff4 R12: ff3c42cea15ebd30
[   93.093874] R13: ffa54c453ba41640 R14: ff3c42cea15ebd28 R15: ff3c42cea15ebd27
[   93.093875] FS:  00007f92b1482740(0000) GS:ff3c43e8d55ef000(0000) knlGS:0000000000000000
[   93.093876] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   93.093877] CR2: 00007f8ebe7fbfb8 CR3: 00000126c9f61004 CR4: 0000000000f71ef0
[   93.093878] PKRU: 55555554
[   93.093879] Call Trace:
[   93.093882]  <TASK>
[   93.093887]  sched_mm_cid_fork+0x22d/0x300
[   93.093895]  copy_process+0x92a/0x1670
[   93.093902]  kernel_clone+0xbc/0x490
[   93.093903]  ? srso_alias_return_thunk+0x5/0xfbef5
[   93.093907]  ? __lruvec_stat_mod_folio+0x83/0xd0
[   93.093911]  __do_sys_clone+0x65/0xa0
[   93.093916]  do_syscall_64+0x7f/0x8a0

> Thanks,
>
> tglx
>
On 2025-11-06 12:28, Prakash Sangappa wrote:
[...]
> Hit this watchdog panic.
>
> Using following tree. Assume this Is the latest.
> https://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git/ rseq/slice
>
> Appears to be spinning in mm_get_cid(). Must be the mm cid changes.
> https://lore.kernel.org/all/20251029123717.886619142@linutronix.de/
When this happened during the development of the "complex" mm_cid
scheme, this was typically caused by a stale "mm_cid" being kept around
by a task even though it was not actually scheduled, thus causing
over-reservation of concurrency IDs beyond the max_cids threshold. This
ends up looping in:
static inline unsigned int mm_get_cid(struct mm_struct *mm)
{
	unsigned int cid = __mm_get_cid(mm, READ_ONCE(mm->mm_cid.max_cids));

	while (cid == MM_CID_UNSET) {
		cpu_relax();
		cid = __mm_get_cid(mm, num_possible_cpus());
	}
	return cid;
}
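To make that failure mode concrete: as far as I understand it,
__mm_get_cid() is essentially a bounded bitmap allocator over the per-mm
concurrency ID mask. The stand-alone sketch below is not the kernel code
(names and details are assumed); it only illustrates why an over-reserved
mask makes the retry loop above spin forever:

#include <stdatomic.h>

#define DEMO_CID_UNSET	(~0U)
#define DEMO_NR_CIDS	64U

static _Atomic unsigned long demo_cidmask;	/* one bit per concurrency ID */

/* Claim the first free ID below max_cids, or fail with DEMO_CID_UNSET */
static unsigned int demo_get_cid(unsigned int max_cids)
{
	for (unsigned int cid = 0; cid < max_cids && cid < DEMO_NR_CIDS; cid++) {
		unsigned long bit = 1UL << cid;
		unsigned long old = atomic_fetch_or_explicit(&demo_cidmask, bit,
							     memory_order_relaxed);
		if (!(old & bit))
			return cid;	/* this thread set the bit first: ID is ours */
	}
	/*
	 * All IDs below max_cids are taken. If stale owners never release
	 * their IDs, a caller which retries in a loop (as mm_get_cid() does)
	 * spins forever, which is what the watchdog caught.
	 */
	return DEMO_CID_UNSET;
}

/* Release an ID so it can be claimed again */
static void demo_put_cid(unsigned int cid)
{
	atomic_fetch_and_explicit(&demo_cidmask, ~(1UL << cid),
				  memory_order_relaxed);
}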
Based on the stacktrace you provided, it seems to happen within
sched_mm_cid_fork() within copy_process, so perhaps it's simply an
initialization issue in fork, or an issue when cloning a new thread ?
Thanks,
Mathieu
--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
On 2025-11-10 09:23, Mathieu Desnoyers wrote:
> On 2025-11-06 12:28, Prakash Sangappa wrote:
> [...]
>> Hit this watchdog panic.
>>
>> Using following tree. Assume this Is the latest.
>> https://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git/ rseq/
>> slice
>>
>> Appears to be spinning in mm_get_cid(). Must be the mm cid changes.
>> https://lore.kernel.org/all/20251029123717.886619142@linutronix.de/
>
> When this happened during the development of the "complex" mm_cid
> scheme, this was typically caused by a stale "mm_cid" being kept around
> by a task even though it was not actually scheduled, thus causing
> over-reservation of concurrency IDs beyond the max_cids threshold. This
> ends up looping in:
>
> static inline unsigned int mm_get_cid(struct mm_struct *mm)
> {
> 	unsigned int cid = __mm_get_cid(mm, READ_ONCE(mm->mm_cid.max_cids));
>
> 	while (cid == MM_CID_UNSET) {
> 		cpu_relax();
> 		cid = __mm_get_cid(mm, num_possible_cpus());
> 	}
> 	return cid;
> }
>
> Based on the stacktrace you provided, it seems to happen within
> sched_mm_cid_fork() within copy_process, so perhaps it's simply an
> initialization issue in fork, or an issue when cloning a new thread ?
I've spent some time digging through Thomas' implementation of
mm_cid management. I've spotted something which may explain
the watchdog panic. Here is the scenario:
1) A process is constrained to a subset of the possible CPUs,
and has enough threads to swap from per-thread to per-cpu mm_cid
mode. It runs happily in that per-cpu mode.
2) The number of allowed CPUs is increased for a process, thus invoking
mm_update_cpus_allowed. This switches the mode back to per-thread,
but delays invocation of mm_cid_work_fn to some point in the future,
in thread context, through irq_work + schedule_work.
At that point, because only __mm_update_max_cids was called by
mm_update_cpus_allowed, the max_cids is updated, but mc->transit
is still zero.
Also, until mm_cid_fixup_cpus_to_tasks is invoked by either the
scheduled work or near the end of sched_mm_cid_fork, or by
sched_mm_cid_exit, we are in a state where mm_cids are still
owned by CPUs, but we are now in per-thread mm_cid mode, which
means that the mc->max_cids value depends on the number of threads.
3) At that point, a new thread is cloned, thus invoking
sched_mm_cid_fork. Calling sched_mm_cid_add_user increases the user
count and invokes mm_update_max_cids, which updates the mc->max_cids
limit, but does not set the mc->transit flag because this call does not
swap from per-cpu to per-task mode (the mode is already per-task).
Immediately after the call to sched_mm_cid_add_user, sched_mm_cid_fork()
attempts to call mm_get_cid while the mm_cid mutex and mm_cid lock
are held, and loops forever because the mm_cid mask has all
the max_cids IDs reserved because of the stale per-cpu CIDs.
I see two possible issues here:
A) mm_update_cpus_allowed can transition from per-cpu to per-task mm_cid
mode without setting the mc->transit flag.
> B) sched_mm_cid_fork calls mm_get_cid() before invoking
> mm_cid_fixup_cpus_to_tasks(), which would reclaim stale per-cpu
> mm_cids and make them available to mm_get_cid().
Thoughts ?
Thanks,
Mathieu
--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
On Tue, Nov 11 2025 at 11:42, Mathieu Desnoyers wrote:
> On 2025-11-10 09:23, Mathieu Desnoyers wrote:
> I've spent some time digging through Thomas' implementation of
> mm_cid management. I've spotted something which may explain
> the watchdog panic. Here is the scenario:
>
> 1) A process is constrained to a subset of the possible CPUs,
> and has enough threads to swap from per-thread to per-cpu mm_cid
> mode. It runs happily in that per-cpu mode.
>
> 2) The number of allowed CPUs is increased for a process, thus invoking
> mm_update_cpus_allowed. This switches the mode back to per-thread,
> but delays invocation of mm_cid_work_fn to some point in the future,
> in thread context, through irq_work + schedule_work.
>
> At that point, because only __mm_update_max_cids was called by
> mm_update_cpus_allowed, the max_cids is updated, but mc->transit
> is still zero.
>
> Also, until mm_cid_fixup_cpus_to_tasks is invoked by either the
> scheduled work or near the end of sched_mm_cid_fork, or by
> sched_mm_cid_exit, we are in a state where mm_cids are still
> owned by CPUs, but we are now in per-thread mm_cid mode, which
> means that the mc->max_cids value depends on the number of threads.
No. It stays in per CPU mode. The mode switch itself happens either in
the worker or on fork/exit whatever comes first.
> 3) At that point, a new thread is cloned, thus invoking
> sched_mm_cid_fork. Calling sched_mm_cid_add_user increases the user
> count and invokes mm_update_max_cids, which updates the mc->max_cids
> limit, but does not set the mc->transit flag because this call does not
> swap from per-cpu to per-task mode (the mode is already per-task).
No. mm::mm_cid::percpu is still set. So mm::mm_cid::transit is irrelevant.
> Immediately after the call to sched_mm_cid_add_user, sched_mm_cid_fork()
> attempts to call mm_get_cid while the mm_cid mutex and mm_cid lock
> are held, and loops forever because the mm_cid mask has all
> the max_cids IDs reserved because of the stale per-cpu CIDs.
Definitely not. sched_mm_cid_add_user() invokes mm_update_max_cids()
which does the mode switch in mm_cid, sets transit and returns true,
which means that fork() goes and does the transition game and allocates
the CID for the new task after that completed.
There was an issue in V3 with the uninitialized transit member and an
off-by-one in one of the transition functions. It's fixed in the git
tree, but I haven't posted it yet because I was AFK for a week.
I did not notice the V3 issue because tests passed on a small machine,
but after I did a rebase to the tip rseq and uaccess bits, I noticed the
failure because I tested on a larger box.
Thanks,
tglx
On 2025-11-12 15:31, Thomas Gleixner wrote:
> On Tue, Nov 11 2025 at 11:42, Mathieu Desnoyers wrote:
>> On 2025-11-10 09:23, Mathieu Desnoyers wrote:
>> I've spent some time digging through Thomas' implementation of
>> mm_cid management. I've spotted something which may explain
>> the watchdog panic. Here is the scenario:
>>
>> 1) A process is constrained to a subset of the possible CPUs,
>> and has enough threads to swap from per-thread to per-cpu mm_cid
>> mode. It runs happily in that per-cpu mode.
>>
>> 2) The number of allowed CPUs is increased for a process, thus invoking
>> mm_update_cpus_allowed. This switches the mode back to per-thread,
>> but delays invocation of mm_cid_work_fn to some point in the future,
>> in thread context, through irq_work + schedule_work.
>>
>> At that point, because only __mm_update_max_cids was called by
>> mm_update_cpus_allowed, the max_cids is updated, but mc->transit
>> is still zero.
>>
>> Also, until mm_cid_fixup_cpus_to_tasks is invoked by either the
>> scheduled work or near the end of sched_mm_cid_fork, or by
>> sched_mm_cid_exit, we are in a state where mm_cids are still
>> owned by CPUs, but we are now in per-thread mm_cid mode, which
>> means that the mc->max_cids value depends on the number of threads.
>
> No. It stays in per CPU mode. The mode switch itself happens either in
> the worker or on fork/exit whatever comes first.

Ah, that's what I missed. All good then.

[...]

>
> There was an issue in V3 with the uninitialized transit member and an
> off-by-one in one of the transition functions. It's fixed in the git
> tree, but I haven't posted it yet because I was AFK for a week.
>
> I did not notice the V3 issue because tests passed on a small machine,
> but after I did a rebase to the tip rseq and uaccess bits, I noticed the
> failure because I tested on a larger box.

Good ! We'll see if this fixes the issue observed by Prakash. If not,
I'm curious to validate that num_possible_cpus() is always set to its
final value before _any_ mm is created.

Thanks,

Mathieu

--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
On Wed, Nov 12 2025 at 15:46, Mathieu Desnoyers wrote:
> On 2025-11-12 15:31, Thomas Gleixner wrote:
>> I did not notice the V3 issue because tests passed on a small machine,
>> but after I did a rebase to the tip rseq and uaccess bits, I noticed the
>> failure because I tested on a larger box.
>
> Good ! We'll see if this fixes the issue observed by Prakash. If not,
> I'm curious to validate that num_possible_cpus() is always set to its
> final value before _any_ mm is created.

It _is_ set to its final value in start_kernel() before
setup_per_cpu_areas() is invoked. Otherwise the kernel would not work at
all.
> On Nov 11, 2025, at 8:42 AM, Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:
>
> On 2025-11-10 09:23, Mathieu Desnoyers wrote:
>> On 2025-11-06 12:28, Prakash Sangappa wrote:
>> [...]
>>> Hit this watchdog panic.
>>>
>>> Using following tree. Assume this Is the latest.
>>> https://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git/ rseq/ slice
>>>
>>> Appears to be spinning in mm_get_cid(). Must be the mm cid changes.
>>> https://lore.kernel.org/all/20251029123717.886619142@linutronix.de/
>> When this happened during the development of the "complex" mm_cid
>> scheme, this was typically caused by a stale "mm_cid" being kept around
>> by a task even though it was not actually scheduled, thus causing
>> over-reservation of concurrency IDs beyond the max_cids threshold. This
>> ends up looping in:
>> static inline unsigned int mm_get_cid(struct mm_struct *mm)
>> {
>> 	unsigned int cid = __mm_get_cid(mm, READ_ONCE(mm->mm_cid.max_cids));
>> 	while (cid == MM_CID_UNSET) {
>> 		cpu_relax();
>> 		cid = __mm_get_cid(mm, num_possible_cpus());
>> 	}
>> 	return cid;
>> }
>> Based on the stacktrace you provided, it seems to happen within
>> sched_mm_cid_fork() within copy_process, so perhaps it's simply an
>> initialization issue in fork, or an issue when cloning a new thread ?
>
> I've spent some time digging through Thomas' implementation of
> mm_cid management. I've spotted something which may explain
> the watchdog panic. Here is the scenario:
[..]
> I see two possible issues here:
>
> A) mm_update_cpus_allowed can transition from per-cpu to per-task mm_cid
> mode without setting the mc->transit flag.
>
> B) sched_mm_cid_fork calls mm_get_cpu() before invoking
> mm_cid_fixup_cpus_to_tasks() which would reclaim stale per-cpu
> mm_cids and make them available for mm_get_cpu().
>
> Thoughts ?
The problem reproduces on a 2 socket AMD(384 cpus) bare metal system.
It occurs soon after system boot up. Does not reproduce on a 64cpu VM.
Managed to grep the ‘mksquashfs’ command that was executing, which triggers the panic.
#ps -ef |grep mksquash.
root 16614 10829 0 05:55 ? 00:00:00 mksquashfs /dev/null /var/tmp/dracut.iLs0z0/.squash-test.img -no-progress -comp xz
I added the following printk’s to mm_get_cid():
static inline unsigned int mm_get_cid(struct mm_struct *mm)
{
	unsigned int cid = __mm_get_cid(mm, READ_ONCE(mm->mm_cid.max_cids));
+	int max_cids = READ_ONCE(mm->mm_cid.max_cids);
+	long *addr = mm_cidmask(mm);
+
+	if (cid == MM_CID_UNSET) {
+		printk(KERN_INFO "pid %d, exec %s, maxcids %d percpu %d pcputhr %d, users %d nrcpus_allwd %d\n",
+			mm->owner->pid, mm->owner->comm,
+			max_cids,
+			mm->mm_cid.percpu,
+			mm->mm_cid.pcpu_thrs,
+			mm->mm_cid.users,
+			mm->mm_cid.nr_cpus_allowed);
+		printk(KERN_INFO "cid bitmask %lx %lx %lx %lx %lx %lx\n",
+			addr[0], addr[1], addr[2], addr[3], addr[4], addr[5]);
+	}

	while (cid == MM_CID_UNSET) {
		cpu_relax();
Got following trace(trimmed).
[ 65.139543] pid 16614, exec mksquashfs, maxcids 82 percpu 0 pcputhr 0, users 66 nrcpus_allwd 384
[ 65.139544] cid bitmask ffffffffffffffff ffffffffffffffff ffffffffffffffff ffffffffffffffff 494e495f43455357 44455a494c414954
[ 65.139597] pid 16614, exec mksquashfs, maxcids 83 percpu 0 pcputhr 0, users 67 nrcpus_allwd 384
[ 65.139599] cid bitmask ffffffffffffffff ffffffffffffffff ffffffffffffffff ffffffffffffffff 494e495f4345535f 44455a494c414954
..
[ 65.142665] cid bitmask ffffffffffffffff ffffffffffffffff ffffffffffffffff ffffffffffffffff ffffffffffffffff 44455a5fffffffff
[ 65.142750] pid 16614, exec mksquashfs, maxcids 155 percpu 0 pcputhr 0, users 124 nrcpus_allwd 384
[ 65.142752] cid bitmask ffffffffffffffff ffffffffffffffff ffffffffffffffff ffffffffffffffff ffffffffffffffff 44455a7fffffffff
..
[ 65.143712] cid bitmask ffffffffffffffff ffffffffffffffff ffffffffffffffff ffffffffffffffff ffffffffffffffff 7fffffffffffffff
[ 65.143767] pid 16614, exec mksquashfs, maxcids 175 percpu 0 pcputhr 0, users 140 nrcpus_allwd 384
[ 65.143769] cid bitmask ffffffffffffffff ffffffffffffffff ffffffffffffffff ffffffffffffffff ffffffffffffffff ffffffffffffffff
Followed by the panic.
[ 99.979256] watchdog: CPU114: Watchdog detected hard LOCKUP on cpu 114
..
[   99.979340] RIP: 0010:mm_get_cid+0xf5/0x150
[ 99.979346] Code: 4d 8b 44 24 18 48 c7 c7 e0 07 86 b6 49 8b 4c 24 10 49 8b 54 24 08 41 ff 74 24 28 49 8b 34 24 e8 c1 b7 04 00 48 83 c4 18 f3 90 <8b> 05 65 ae ec 01 8b 35 eb e0 68 01 83 c0 3f 48 89 f5 c1 e8 03 25
[ 99.979348] RSP: 0018:ff75650cf9717d20 EFLAGS: 00000046
[ 99.979349] RAX: 0000000000000180 RBX: ff424236e5d55c40 RCX: 0000000000000180
[ 99.979351] RDX: 0000000000000000 RSI: 0000000000000180 RDI: ff424236e5d55cd0
[ 99.979352] RBP: 0000000000000180 R08: 0000000000000180 R09: c0000000fffdffff
[ 99.979352] R10: 0000000000000001 R11: ff75650cf9717a80 R12: ff424236e5d55ca0
[ 99.979353] R13: ff424236e5d55668 R14: ffa7650cba2841c0 R15: ff42423881a5aa80
[ 99.979355] FS: 00007f469ed6b740(0000) GS:ff424351c24d6000(0000) knlGS:0000000000000000
[ 99.979356] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 99.979357] CR2: 00007f443b7fdfb8 CR3: 0000012724555006 CR4: 0000000000771ef0
[ 99.979358] PKRU: 55555554
[ 99.979359] Call Trace:
[ 99.979361] <TASK>
[ 99.979364] sched_mm_cid_fork+0x3fb/0x590
[ 99.979369] copy_process+0xd1a/0x2130
[ 99.979375] kernel_clone+0x9d/0x3b0
[ 99.979379] __do_sys_clone+0x65/0x90
[ 99.979384] do_syscall_64+0x64/0x670
[ 99.979388] entry_SYSCALL_64_after_hwframe+0x76/0x7e
[ 99.979391] RIP: 0033:0x7f469d77d8c5
As you can see, at least when it cannot find available cid’s it is in per-task mm cid mode.
Perhaps it is taking longer to drop used cid’s? I have not delved into the mm cid management.
Hopefully you can make out something from the above trace.
Let me know if you want me to add more tracing.
-Prakash
>
> Thanks,
>
> Mathieu
>
> --
> Mathieu Desnoyers
> EfficiOS Inc.
> https://www.efficios.com
On Wed, Nov 12 2025 at 06:30, Prakash Sangappa wrote:
> The problem reproduces on a 2 socket AMD(384 cpus) bare metal system.
> It occurs soon after system boot up. Does not reproduce on a 64cpu VM.
Can you verify that the top most commit of the rseq/slice branch is:
d2eb5c9c0693 ("selftests/rseq: Implement time slice extension test")
Thanks,
tglx
> On Nov 12, 2025, at 1:57 PM, Thomas Gleixner <tglx@linutronix.de> wrote:
>
> On Wed, Nov 12 2025 at 06:30, Prakash Sangappa wrote:
>> The problem reproduces on a 2 socket AMD(384 cpus) bare metal system.
>> It occurs soon after system boot up. Does not reproduce on a 64cpu VM.
>
> Can you verify that the top most commit of the rseq/slice branch is:
>
> d2eb5c9c0693 ("selftests/rseq: Implement time slice extension test")
No it is
c46f12a1166058764da8e84a215a6b66cae2fe0a
selftests/rseq: Implement time slice extension test
I can refresh and try.
-Prakash
>
> Thanks,
>
> tglx
> On Nov 12, 2025, at 3:17 PM, Prakash Sangappa <prakash.sangappa@oracle.com> wrote:
>
>
>
>> On Nov 12, 2025, at 1:57 PM, Thomas Gleixner <tglx@linutronix.de> wrote:
>>
>> On Wed, Nov 12 2025 at 06:30, Prakash Sangappa wrote:
>>> The problem reproduces on a 2 socket AMD(384 cpus) bare metal system.
>>> It occurs soon after system boot up. Does not reproduce on a 64cpu VM.
>>
>> Can you verify that the top most commit of the rseq/slice branch is:
>>
>> d2eb5c9c0693 ("selftests/rseq: Implement time slice extension test")
>
> No it is
>
> c46f12a1166058764da8e84a215a6b66cae2fe0a
> selftests/rseq: Implement time slice extension test
>
Tested the latest from rseq/slice with the top most commit you mentioned above and the
watchdog panic does not reproduce anymore.
Thanks,
-Prakash
> I can refresh and try.
> -Prakash
>
>>
>> Thanks,
>>
>> tglx
>
On Thu, Nov 13 2025 at 02:34, Prakash Sangappa wrote:
>> On Nov 12, 2025, at 3:17 PM, Prakash Sangappa <prakash.sangappa@oracle.com> wrote:
>>> On Nov 12, 2025, at 1:57 PM, Thomas Gleixner <tglx@linutronix.de> wrote:
>>> Can you verify that the top most commit of the rseq/slice branch is:
>>>
>>> d2eb5c9c0693 ("selftests/rseq: Implement time slice extension test")
>>
>> No it is
>>
>> c46f12a1166058764da8e84a215a6b66cae2fe0a
>> selftests/rseq: Implement time slice extension test
>>
>
> Tested the latest from rseq/slice with the top most commit you mentioned above and the
> watchdog panic does not reproduce anymore.
Thanks for checking. I'll post a V4 soonish.
Thanks,
tglx
On 2025-11-12 01:30, Prakash Sangappa wrote:
[...]
>
> The problem reproduces on a 2 socket AMD(384 cpus) bare metal system.
> It occurs soon after system boot up. Does not reproduce on a 64cpu VM.
>
> Managed to grep the ‘mksquashfs’ command that was executing, which triggers the panic.
>
> #ps -ef |grep mksquash.
> root 16614 10829 0 05:55 ? 00:00:00 mksquashfs /dev/null /var/tmp/dracut.iLs0z0/.squash-test.img -no-progress -comp xz
>
[...]
> ..
> [ 65.143712] cid bitmask ffffffffffffffff ffffffffffffffff ffffffffffffffff ffffffffffffffff ffffffffffffffff 7fffffffffffffff
> [ 65.143767] pid 16614, exec mksquashfs, maxcids 175 percpu 0 pcputhr 0, users 140 nrcpus_allwd 384
> [ 65.143769] cid bitmask ffffffffffffffff ffffffffffffffff ffffffffffffffff ffffffffffffffff ffffffffffffffff ffffffffffffffff
>

It's weird that the cid bitmask is all f values (all 1). Aren't those
zeroed on mm init ?

> Followed by the panic.
> [ 99.979256] watchdog: CPU114: Watchdog detected hard LOCKUP on cpu 114
> ..
[...]
>
> As you can see, at least when it cannot find available cid's it is in per-task mm cid mode.
> Perhaps it is taking longer to drop used cid's? I have not delved into the mm cid management.
> Hopefully you can make out something from the above trace.
>
> Let me know if you want me to add more tracing.

How soon is that after boot up ? I'm starting to wonder if the
num_possible_cpus() value used in mm_cid_size() and mm_init_cid used
respectively for mm allocation and initialization may be read before it
is initialized by the boot up sequence ? That's far fetched, but it
would be good if we can double-check that those are never called before
the last call to init_cpu_possible and set_cpu_possible().

Thanks,

Mathieu

--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
On 2025-11-10 09:23, Mathieu Desnoyers wrote:
> On 2025-11-06 12:28, Prakash Sangappa wrote:
> [...]
>> Hit this watchdog panic.
>>
>> Using following tree. Assume this Is the latest.
>> https://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git/ rseq/
>> slice
>>
>> Appears to be spinning in mm_get_cid(). Must be the mm cid changes.
>> https://lore.kernel.org/all/20251029123717.886619142@linutronix.de/
>
> When this happened during the development of the "complex" mm_cid
> scheme, this was typically caused by a stale "mm_cid" being kept around
> by a task even though it was not actually scheduled, thus causing
> over-reservation of concurrency IDs beyond the max_cids threshold. This
> ends up looping in:
>
> static inline unsigned int mm_get_cid(struct mm_struct *mm)
> {
> 	unsigned int cid = __mm_get_cid(mm, READ_ONCE(mm->mm_cid.max_cids));
>
> 	while (cid == MM_CID_UNSET) {
> 		cpu_relax();
> 		cid = __mm_get_cid(mm, num_possible_cpus());
> 	}
> 	return cid;
> }
>
> Based on the stacktrace you provided, it seems to happen within
> sched_mm_cid_fork() within copy_process, so perhaps it's simply an
> initialization issue in fork, or an issue when cloning a new thread ?
One possible issue here: I note that kernel/sched/core.c:mm_init_cid()
misses the following initialization:
mm->mm_cid.transit = 0;
Thanks,
Mathieu
>
> Thanks,
>
> Mathieu
>
--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com