[patch V3 00/12] rseq: Implement time slice extension mechanism
Posted by Thomas Gleixner 3 months, 1 week ago
This is a follow-up to the V2 version:

     https://lore.kernel.org/20251022110646.839870156@linutronix.de

V1 contains a detailed explanation:

     https://lore.kernel.org/20250908225709.144709889@linutronix.de

TLDR: Time slice extensions are an attempt to provide opportunistic
priority ceiling without the overhead of an actual priority ceiling
protocol, but also without the guarantees such a protocol provides.

The intent is to avoid situations where a user space thread is interrupted
in a critical section and scheduled out while holding a resource on which
the preempting thread or other threads in the system might block. That
obviously prevents those threads from making progress, in the worst case
for at least a full time slice. This is especially harmful for user space
spinlocks, which are a patently bad idea to begin with, but it is also
true for other mechanisms.

This series uses the existing RSEQ user memory to implement it.
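
As a rough illustration of the intended flow, pieced together from the
patch titles below: a task opts in once via prctl(), sets a request flag
in its rseq area before a critical section, and yields via the new
syscall when the kernel granted an extension. The constant and field
names in this sketch (slice_ctrl, the REQUEST/GRANTED bits, the prctl
arguments) are assumptions; the authoritative definitions live in the
uapi headers of this series.

    #include <sys/prctl.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    extern struct rseq *rseq_area;      /* registered via sys_rseq() */

    static void critical_section_with_slice_ext(void)
    {
            /* Opt in once (prctl argument names hypothetical) */
            prctl(PR_RSEQ_SLICE_EXTENSION, PR_RSEQ_SLICE_EXT_ENABLE, 0, 0, 0);

            /* Ask for a slice extension before taking the user space lock */
            __atomic_store_n(&rseq_area->slice_ctrl, RSEQ_SLICE_EXT_REQUEST,
                             __ATOMIC_RELAXED);

            /* ... short critical section ... */

            /* If the kernel granted an extension, yield the CPU promptly */
            if (__atomic_exchange_n(&rseq_area->slice_ctrl, 0,
                                    __ATOMIC_RELAXED) & RSEQ_SLICE_EXT_GRANTED)
                    syscall(__NR_rseq_slice_yield);
    }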

Changes vs. V2:

   - Rebase on the newest RSEQ and uaccess changes

   - Document the command line parameter - Sebastian

   - Use ENOTSUPP in the stub inline to be consistent - Sebastian

   - Add sysctl documentation - Sebastian

   - Simplify timer cancelation - Sebastian

   - Restore the dropped 'From: Peter...' line in patch 1 - Sebastian

   - More documentation/comment fixes - Randy


The uaccess and RSEQ modifications on which this series is based can be
found here:

    https://lore.kernel.org/20251029123717.886619142@linutronix.de

and in git:

    git://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git rseq/cid

For your convenience all of it is also available as a conglomerate from
git:

    git://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git rseq/slice

Thanks,

	tglx

---
Peter Zijlstra (1):
      sched: Provide and use set_need_resched_current()

Thomas Gleixner (11):
      rseq: Add fields and constants for time slice extension
      rseq: Provide static branch for time slice extensions
      rseq: Add statistics for time slice extensions
      rseq: Add prctl() to enable time slice extensions
      rseq: Implement sys_rseq_slice_yield()
      rseq: Implement syscall entry work for time slice extensions
      rseq: Implement time slice extension enforcement timer
      rseq: Reset slice extension when scheduled
      rseq: Implement rseq_grant_slice_extension()
      entry: Hook up rseq time slice extension
      selftests/rseq: Implement time slice extension test

 Documentation/admin-guide/kernel-parameters.txt |    5 
 Documentation/admin-guide/sysctl/kernel.rst     |    6 
 Documentation/userspace-api/index.rst           |    1 
 Documentation/userspace-api/rseq.rst            |  118 +++++++++
 arch/alpha/kernel/syscalls/syscall.tbl          |    1 
 arch/arm/tools/syscall.tbl                      |    1 
 arch/arm64/tools/syscall_32.tbl                 |    1 
 arch/m68k/kernel/syscalls/syscall.tbl           |    1 
 arch/microblaze/kernel/syscalls/syscall.tbl     |    1 
 arch/mips/kernel/syscalls/syscall_n32.tbl       |    1 
 arch/mips/kernel/syscalls/syscall_n64.tbl       |    1 
 arch/mips/kernel/syscalls/syscall_o32.tbl       |    1 
 arch/parisc/kernel/syscalls/syscall.tbl         |    1 
 arch/powerpc/kernel/syscalls/syscall.tbl        |    1 
 arch/s390/kernel/syscalls/syscall.tbl           |    1 
 arch/s390/mm/pfault.c                           |    3 
 arch/sh/kernel/syscalls/syscall.tbl             |    1 
 arch/sparc/kernel/syscalls/syscall.tbl          |    1 
 arch/x86/entry/syscalls/syscall_32.tbl          |    1 
 arch/x86/entry/syscalls/syscall_64.tbl          |    1 
 arch/xtensa/kernel/syscalls/syscall.tbl         |    1 
 include/linux/entry-common.h                    |    2 
 include/linux/rseq.h                            |   11 
 include/linux/rseq_entry.h                      |  191 ++++++++++++++-
 include/linux/rseq_types.h                      |   30 ++
 include/linux/sched.h                           |    7 
 include/linux/syscalls.h                        |    1 
 include/linux/thread_info.h                     |   16 -
 include/uapi/asm-generic/unistd.h               |    5 
 include/uapi/linux/prctl.h                      |   10 
 include/uapi/linux/rseq.h                       |   38 +++
 init/Kconfig                                    |   12 
 kernel/entry/common.c                           |   14 -
 kernel/entry/syscall-common.c                   |   11 
 kernel/rcu/tiny.c                               |    8 
 kernel/rcu/tree.c                               |   14 -
 kernel/rcu/tree_exp.h                           |    3 
 kernel/rcu/tree_plugin.h                        |    9 
 kernel/rcu/tree_stall.h                         |    3 
 kernel/rseq.c                                   |  299 ++++++++++++++++++++++++
 kernel/sys.c                                    |    6 
 kernel/sys_ni.c                                 |    1 
 scripts/syscall.tbl                             |    1 
 tools/testing/selftests/rseq/.gitignore         |    1 
 tools/testing/selftests/rseq/Makefile           |    5 
 tools/testing/selftests/rseq/rseq-abi.h         |   27 ++
 tools/testing/selftests/rseq/slice_test.c       |  198 +++++++++++++++
 47 files changed, 1019 insertions(+), 53 deletions(-)
Re: [patch V3 00/12] rseq: Implement time slice extension mechanism
Posted by Sebastian Andrzej Siewior 3 months, 1 week ago
On 2025-10-29 14:22:11 [+0100], Thomas Gleixner wrote:
> Changes vs. V2:
> 
>    - Rebase on the newest RSEQ and uaccess changes
…
> and in git:
> 
>     git://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git rseq/cid
> 
> For your convenience all of it is also available as a conglomerate from
> git:
> 
>     git://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git rseq/slice

rseq/slice is older than rseq/cid. rseq/slice has the
__put_kernel_nofault typo. rseq/cid looks correct.

> Thanks,
> 
> 	tglx

Sebastian
Re: [patch V3 00/12] rseq: Implement time slice extension mechanism
Posted by Steven Rostedt 3 months, 1 week ago
On Wed, 29 Oct 2025 16:10:55 +0100
Sebastian Andrzej Siewior <bigeasy@linutronix.de> wrote:

> > and in git:
> > 
> >     git://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git rseq/cid
> > 
> > For your convenience all of it is also available as a conglomerate from
> > git:
> > 
> >     git://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git rseq/slice  
> 
> rseq/slice is older than rseq/cid. rseq/slice has the
> __put_kernel_nofault typo. rseq/cid looks correct.

Yeah, I started looking at both too, and checking out rseq/slice and trying
to do a rebase on top of rseq/cid causes a bunch of conflicts.

I'm continuing the rebase and just skipping the changed commits.

-- Steve
Re: [patch V3 00/12] rseq: Implement time slice extension mechanism
Posted by Thomas Gleixner 3 months, 1 week ago
On Wed, Oct 29 2025 at 11:40, Steven Rostedt wrote:
> On Wed, 29 Oct 2025 16:10:55 +0100
> Sebastian Andrzej Siewior <bigeasy@linutronix.de> wrote:
>
>> > and in git:
>> > 
>> >     git://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git rseq/cid
>> > 
>> > For your convenience all of it is also available as a conglomerate from
>> > git:
>> > 
>> >     git://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git rseq/slice  
>> 
>> rseq/slice is older than rseq/cid. rseq/slice has the
>> __put_kernel_nofault typo. rseq/cid looks correct.
>
> Yeah, I started looking at both too, and checking out req/slice and trying
> to do a rebase on top of rseq/cid causes a bunch of conflicts.
>
> I'm continuing the rebase and just skipping the changed commits.

Forgot to push the updated branch out....

Fixed now.

Thanks,

        tglx
Re: [patch V3 00/12] rseq: Implement time slice extension mechanism
Posted by Prakash Sangappa 3 months ago

> On Oct 29, 2025, at 6:22 AM, Thomas Gleixner <tglx@linutronix.de> wrote:
> 
> This is a follow up on the V2 version:
> 
>     https://lore.kernel.org/20251022110646.839870156@linutronix.de
> 
> V1 contains a detailed explanation:
> 
>     https://lore.kernel.org/20250908225709.144709889@linutronix.de
> 
> TLDR: Time slice extensions are an attempt to provide opportunistic
> priority ceiling without the overhead of an actual priority ceiling
> protocol, but also without the guarantees such a protocol provides.

[…]
> 
> 
> The uaccess and RSEQ modifications on which this series is based can be
> found here:
> 
>    https://lore.kernel.org/20251029123717.886619142@linutronix.de
> 
> and in git:
> 
>    git://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git rseq/cid
> 
> For your convenience all of it is also available as a conglomerate from
> git:
> 
>    git://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git rseq/slice
> 

Hit this watchdog panic.

Using the following tree. I assume this is the latest.
https://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git/ rseq/slice

Appears to be spinning in mm_get_cid(). Must be the mm cid changes.
https://lore.kernel.org/all/20251029123717.886619142@linutronix.de/

-Prakash
-------------------------------------------------------
watchdog: CPU152: Watchdog detected hard LOCKUP on cpu 152
..  

[   93.093858] RIP: 0010:mm_get_cid+0x7e/0xd0
[   93.093866] Code: 4c eb 63 f3 90 8b 05 f1 6a 66 02 8b 35 d7 bc 8e 01 83 c0 3f 48 89 f5 c1 e8 03 25 f8 ff ff 1f 48 8d 3c 43 e8 24 ce 62 00 89 c1 <39> e8 73 d5 8b 35 c8 6a 66 02 89 c0 8d 56 3f c1 ea 03 81 e2 f8 ff
[   93.093867] RSP: 0018:ff734c4591c6bc38 EFLAGS: 00000046
[   93.093869] RAX: 0000000000000180 RBX: ff3c42cea15ec2c0 RCX: 0000000000000180
[   93.093871] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
[   93.093872] RBP: 0000000000000180 R08: 0000000000000000 R09: 0000000000000000
[   93.093873] R10: 0000000000000000 R11: 00000000fffffff4 R12: ff3c42cea15ebd30
[   93.093874] R13: ffa54c453ba41640 R14: ff3c42cea15ebd28 R15: ff3c42cea15ebd27
[   93.093875] FS:  00007f92b1482740(0000) GS:ff3c43e8d55ef000(0000) knlGS:0000000000000000
[   93.093876] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   93.093877] CR2: 00007f8ebe7fbfb8 CR3: 00000126c9f61004 CR4: 0000000000f71ef0
[   93.093878] PKRU: 55555554
[   93.093879] Call Trace:
[   93.093882]  <TASK>
[   93.093887]  sched_mm_cid_fork+0x22d/0x300
[   93.093895]  copy_process+0x92a/0x1670
[   93.093902]  kernel_clone+0xbc/0x490
[   93.093903]  ? srso_alias_return_thunk+0x5/0xfbef5
[   93.093907]  ? __lruvec_stat_mod_folio+0x83/0xd0
[   93.093911]  __do_sys_clone+0x65/0xa0
[   93.093916]  do_syscall_64+0x7f/0x8a0

> Thanks,
> 
> tglx
> 

Re: [patch V3 00/12] rseq: Implement time slice extension mechanism
Posted by Mathieu Desnoyers 2 months, 4 weeks ago
On 2025-11-06 12:28, Prakash Sangappa wrote:
[...]
> Hit this watchdog panic.
> 
> Using the following tree. I assume this is the latest.
> https://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git/ rseq/slice
> 
> Appears to be spinning in mm_get_cid(). Must be the mm cid changes.
> https://lore.kernel.org/all/20251029123717.886619142@linutronix.de/

When this happened during the development of the "complex" mm_cid
scheme, this was typically caused by a stale "mm_cid" being kept around
by a task even though it was not actually scheduled, thus causing
over-reservation of concurrency IDs beyond the max_cids threshold. This
ends up looping in:

static inline unsigned int mm_get_cid(struct mm_struct *mm)
{
         unsigned int cid = __mm_get_cid(mm, READ_ONCE(mm->mm_cid.max_cids));

         while (cid == MM_CID_UNSET) {
                 cpu_relax();
                 cid = __mm_get_cid(mm, num_possible_cpus());
         }
         return cid;
}
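
For context, __mm_get_cid() is essentially a bounded search for a free
bit in the per-mm CID bitmap. A minimal sketch, assuming a plain
test-and-set scan over mm_cidmask() (the real implementation may well
differ):

static inline unsigned int __mm_get_cid(struct mm_struct *mm,
                                        unsigned int max_cids)
{
        unsigned long *cidmask = (unsigned long *)mm_cidmask(mm);
        unsigned int cid;

        for (cid = 0; cid < max_cids; cid++) {
                /* Claim the first free concurrency ID below the limit */
                if (!test_and_set_bit(cid, cidmask))
                        return cid;
        }
        /* Every ID below the limit is taken */
        return MM_CID_UNSET;
}

With the bitmap completely full, as in the bitmask dumps later in this
thread, the retry loop in mm_get_cid() can never terminate.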

Based on the stacktrace you provided, it seems to happen within
sched_mm_cid_fork() within copy_process, so perhaps it's simply an
initialization issue in fork, or an issue when cloning a new thread?

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
Re: [patch V3 00/12] rseq: Implement time slice extension mechanism
Posted by Mathieu Desnoyers 2 months, 4 weeks ago
On 2025-11-10 09:23, Mathieu Desnoyers wrote:
> On 2025-11-06 12:28, Prakash Sangappa wrote:
> [...]
>> Hit this watchdog panic.
>>
>> Using the following tree. I assume this is the latest.
>> https://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git/ rseq/slice
>>
>> Appears to be spinning in mm_get_cid(). Must be the mm cid changes.
>> https://lore.kernel.org/all/20251029123717.886619142@linutronix.de/
> 
> When this happened during the development of the "complex" mm_cid
> scheme, this was typically caused by a stale "mm_cid" being kept around
> by a task even though it was not actually scheduled, thus causing
> over-reservation of concurrency IDs beyond the max_cids threshold. This
> ends up looping in:
> 
> static inline unsigned int mm_get_cid(struct mm_struct *mm)
> {
>          unsigned int cid = __mm_get_cid(mm, READ_ONCE(mm->mm_cid.max_cids));
> 
>          while (cid == MM_CID_UNSET) {
>                  cpu_relax();
>                  cid = __mm_get_cid(mm, num_possible_cpus());
>          }
>          return cid;
> }
> 
> Based on the stacktrace you provided, it seems to happen within
> sched_mm_cid_fork() within copy_process, so perhaps it's simply an
>     initialization issue in fork, or an issue when cloning a new thread?

I've spent some time digging through Thomas' implementation of
mm_cid management. I've spotted something which may explain
the watchdog panic. Here is the scenario:

1) A process is constrained to a subset of the possible CPUs,
    and has enough threads to swap from per-thread to per-cpu mm_cid
    mode. It runs happily in that per-cpu mode.

2) The number of allowed CPUs is increased for a process, thus invoking
    mm_update_cpus_allowed. This switches the mode back to per-thread,
    but delays invocation of mm_cid_work_fn to some point in the future,
    in thread context, through irq_work + schedule_work.

    At that point, because only __mm_update_max_cids was called by
    mm_update_cpus_allowed, the max_cids is updated, but mc->transit
    is still zero.

    Also, until mm_cid_fixup_cpus_to_tasks is invoked by either the
    scheduled work or near the end of sched_mm_cid_fork, or by
    sched_mm_cid_exit, we are in a state where mm_cids are still
    owned by CPUs, but we are now in per-thread mm_cid mode, which
    means that the mc->max_cids value depends on the number of threads.

3) At that point, a new thread is cloned, thus invoking
    sched_mm_cid_fork. Calling sched_mm_cid_add_user increases the user
    count and invokes mm_update_max_cids, which updates the mc->max_cids
    limit, but does not set the mc->transit flag because this call does not
    swap from per-cpu to per-task mode (the mode is already per-task).

    Immediately after the call to sched_mm_cid_add_user, sched_mm_cid_fork()
    attempts to call mm_get_cid while the mm_cid mutex and mm_cid lock
    are held, and loops forever because all of the max_cids IDs in the
    mm_cid mask are still reserved by the stale per-cpu CIDs.

I see two possible issues here:

A) mm_update_cpus_allowed can transition from per-cpu to per-task mm_cid
    mode without setting the mc->transit flag.

B) sched_mm_cid_fork calls mm_get_cid() before invoking
    mm_cid_fixup_cpus_to_tasks(), which would reclaim stale per-cpu
    mm_cids and make them available for mm_get_cid().

Thoughts?

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
Re: [patch V3 00/12] rseq: Implement time slice extension mechanism
Posted by Thomas Gleixner 2 months, 3 weeks ago
On Tue, Nov 11 2025 at 11:42, Mathieu Desnoyers wrote:
> On 2025-11-10 09:23, Mathieu Desnoyers wrote:
> I've spent some time digging through Thomas' implementation of
> mm_cid management. I've spotted something which may explain
> the watchdog panic. Here is the scenario:
>
> 1) A process is constrained to a subset of the possible CPUs,
>     and has enough threads to swap from per-thread to per-cpu mm_cid
>     mode. It runs happily in that per-cpu mode.
>
> 2) The number of allowed CPUs is increased for a process, thus invoking
>     mm_update_cpus_allowed. This switches the mode back to per-thread,
>     but delays invocation of mm_cid_work_fn to some point in the future,
>     in thread context, through irq_work + schedule_work.
>
>     At that point, because only __mm_update_max_cids was called by
>     mm_update_cpus_allowed, the max_cids is updated, but mc->transit
>     is still zero.
>
>     Also, until mm_cid_fixup_cpus_to_tasks is invoked by either the
>     scheduled work or near the end of sched_mm_cid_fork, or by
>     sched_mm_cid_exit, we are in a state where mm_cids are still
>     owned by CPUs, but we are now in per-thread mm_cid mode, which
>     means that the mc->max_cids value depends on the number of threads.

No. It stays in per CPU mode. The mode switch itself happens either in
the worker or on fork/exit, whichever comes first.

> 3) At that point, a new thread is cloned, thus invoking
>     sched_mm_cid_fork. Calling sched_mm_cid_add_user increases the user
>     count and invokes mm_update_max_cids, which updates the mc->max_cids
>     limit, but does not set the mc->transit flag because this call does not
>     swap from per-cpu to per-task mode (the mode is already per-task).

No. mm::mm_cid::percpu is still set. So mm::mm_cid::transit is irrelevant.

>     Immediately after the call to sched_mm_cid_add_user, sched_mm_cid_fork()
>     attempts to call mm_get_cid while the mm_cid mutex and mm_cid lock
>     are held, and loops forever because the mm_cid mask has all
>     the max_cids IDs reserved because of the stale per-cpu CIDs.

Definitely not. sched_mm_cid_add_user() invokes mm_update_max_cids()
which does the mode switch in mm_cid, sets transit and returns true,
which means that fork() goes and does the transition game and allocates
the CID for the new task after that completed.
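
A rough sketch of that ordering in the fork path (function names as used
in this thread; signatures, the per task CID storage and all other
details are assumptions):

        void sched_mm_cid_fork(struct task_struct *t)
        {
                struct mm_struct *mm = t->mm;

                /*
                 * Bumps mm_cid.users and invokes mm_update_max_cids(),
                 * which performs the mode switch, sets transit and
                 * returns true when a transition is required.
                 */
                if (sched_mm_cid_add_user(mm))
                        /* The "transition game": reclaim CPU owned CIDs */
                        mm_cid_fixup_cpus_to_tasks(mm);

                /* Only then allocate the CID for the new task */
                t->mm_cid.cid = mm_get_cid(mm); /* storage name assumed */
        }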

There was an issue in V3 with the not-initialized transit member and an
off-by-one in one of the transition functions. It's fixed in the git
tree, but I haven't posted it yet because I was AFK for a week.

I did not notice the V3 issue because tests passed on a small machine,
but after I did a rebase to the tip rseq and uaccess bits, I noticed the
failure because I tested on a larger box.

Thanks,

        tglx
Re: [patch V3 00/12] rseq: Implement time slice extension mechanism
Posted by Mathieu Desnoyers 2 months, 3 weeks ago
On 2025-11-12 15:31, Thomas Gleixner wrote:
> On Tue, Nov 11 2025 at 11:42, Mathieu Desnoyers wrote:
>> On 2025-11-10 09:23, Mathieu Desnoyers wrote:
>> I've spent some time digging through Thomas' implementation of
>> mm_cid management. I've spotted something which may explain
>> the watchdog panic. Here is the scenario:
>>
>> 1) A process is constrained to a subset of the possible CPUs,
>>      and has enough threads to swap from per-thread to per-cpu mm_cid
>>      mode. It runs happily in that per-cpu mode.
>>
>> 2) The number of allowed CPUs is increased for a process, thus invoking
>>      mm_update_cpus_allowed. This switches the mode back to per-thread,
>>      but delays invocation of mm_cid_work_fn to some point in the future,
>>      in thread context, through irq_work + schedule_work.
>>
>>      At that point, because only __mm_update_max_cids was called by
>>      mm_update_cpus_allowed, the max_cids is updated, but mc->transit
>>      is still zero.
>>
>>      Also, until mm_cid_fixup_cpus_to_tasks is invoked by either the
>>      scheduled work or near the end of sched_mm_cid_fork, or by
>>      sched_mm_cid_exit, we are in a state where mm_cids are still
>>      owned by CPUs, but we are now in per-thread mm_cid mode, which
>>      means that the mc->max_cids value depends on the number of threads.
> 
> No. It stays in per CPU mode. The mode switch itself happens either in
> the worker or on fork/exit whatever comes first.

Ah, that's what I missed. All good then.

[...]

> 
> There was an issue in V3 with the not-initialized transit member and a
> off by one in one of the transition functions. It's fixed in the git
> tree, but I haven't posted it yet because I was AFK for a week.
> 
> I did not notice the V3 issue because tests passed on a small machine,
> but after I did a rebase to the tip rseq and uaccess bits, I noticed the
> failure because I tested on a larger box.

Good! We'll see if this fixes the issue observed by Prakash. If not,
I'm curious to validate that num_possible_cpus() is always set to its
final value before _any_ mm is created.

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
Re: [patch V3 00/12] rseq: Implement time slice extension mechanism
Posted by Thomas Gleixner 2 months, 3 weeks ago
On Wed, Nov 12 2025 at 15:46, Mathieu Desnoyers wrote:
> On 2025-11-12 15:31, Thomas Gleixner wrote:
>> I did not notice the V3 issue because tests passed on a small machine,
>> but after I did a rebase to the tip rseq and uaccess bits, I noticed the
>> failure because I tested on a larger box.
>
> Good ! We'll see if this fixes the issue observed by Prakash. If not,
> I'm curious to validate that num_possible_cpus() is always set to its
> final value before _any_ mm is created.

It _is_ set to its final value in start_kernel() before
setup_per_cpu_areas() is invoked. Otherwise the kernel would not work at
all.
Re: [patch V3 00/12] rseq: Implement time slice extension mechanism
Posted by Prakash Sangappa 2 months, 3 weeks ago

> On Nov 11, 2025, at 8:42 AM, Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:
> 
> On 2025-11-10 09:23, Mathieu Desnoyers wrote:
>> On 2025-11-06 12:28, Prakash Sangappa wrote:
>> [...]
>>> Hit this watchdog panic.
>>> 
>>> Using the following tree. I assume this is the latest.
>>> https://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git/ rseq/slice
>>> 
>>> Appears to be spinning in mm_get_cid(). Must be the mm cid changes.
>>> https://lore.kernel.org/all/20251029123717.886619142@linutronix.de/
>> When this happened during the development of the "complex" mm_cid
>> scheme, this was typically caused by a stale "mm_cid" being kept around
>> by a task even though it was not actually scheduled, thus causing
>> over-reservation of concurrency IDs beyond the max_cids threshold. This
>> ends up looping in:
>> static inline unsigned int mm_get_cid(struct mm_struct *mm)
>> {
>>         unsigned int cid = __mm_get_cid(mm, READ_ONCE(mm->mm_cid.max_cids));
>>         while (cid == MM_CID_UNSET) {
>>                 cpu_relax();
>>                 cid = __mm_get_cid(mm, num_possible_cpus());
>>         }
>>         return cid;
>> }
>> Based on the stacktrace you provided, it seems to happen within
>> sched_mm_cid_fork() within copy_process, so perhaps it's simply an
>> initialization issue in fork, or an issue when cloning a new thread?
> 
> I've spent some time digging through Thomas' implementation of
> mm_cid management. I've spotted something which may explain
> the watchdog panic. Here is the scenario:

[..]
> I see two possible issues here:
> 
> A) mm_update_cpus_allowed can transition from per-cpu to per-task mm_cid
>   mode without setting the mc->transit flag.
> 
> B) sched_mm_cid_fork calls mm_get_cid() before invoking
>   mm_cid_fixup_cpus_to_tasks(), which would reclaim stale per-cpu
>   mm_cids and make them available for mm_get_cid().
> 
> Thoughts?

The problem reproduces on a 2-socket AMD (384 CPUs) bare metal system.
It occurs soon after system boot. It does not reproduce on a 64-CPU VM.

I managed to grep the ‘mksquashfs’ command that was executing, which triggers the panic.

#ps -ef |grep mksquash.
root       16614   10829  0 05:55 ?        00:00:00 mksquashfs /dev/null /var/tmp/dracut.iLs0z0/.squash-test.img -no-progress -comp xz


I added the following printk’s to mm_get_cid():

static inline unsigned int mm_get_cid(struct mm_struct *mm)
 {
        unsigned int cid = __mm_get_cid(mm, READ_ONCE(mm->mm_cid.max_cids));
+       int max_cids = READ_ONCE(mm->mm_cid.max_cids);
+       long *addr = mm_cidmask(mm);
+
+       if (cid == MM_CID_UNSET) {
+               printk(KERN_INFO "pid %d, exec %s, maxcids %d percpu %d pcputhr %d, users %d nrcpus_allwd %d\n",
+                               mm->owner->pid, mm->owner->comm,
+                               max_cids,
+                               mm->mm_cid.percpu,
+                               mm->mm_cid.pcpu_thrs,
+                               mm->mm_cid.users,
+                               mm->mm_cid.nr_cpus_allowed);
+               printk(KERN_INFO "cid bitmask %lx %lx %lx %lx %lx %lx\n",
+                       addr[0], addr[1], addr[2], addr[3], addr[4], addr[5]);
 
+       }
        while (cid == MM_CID_UNSET) {
                cpu_relax();

Got the following trace (trimmed).

[   65.139543] pid 16614, exec mksquashfs, maxcids 82 percpu 0 pcputhr 0, users 66 nrcpus_allwd 384
[   65.139544] cid bitmask ffffffffffffffff ffffffffffffffff ffffffffffffffff ffffffffffffffff 494e495f43455357 44455a494c414954
[   65.139597] pid 16614, exec mksquashfs, maxcids 83 percpu 0 pcputhr 0, users 67 nrcpus_allwd 384
[   65.139599] cid bitmask ffffffffffffffff ffffffffffffffff ffffffffffffffff ffffffffffffffff 494e495f4345535f 44455a494c414954
..
[   65.142665] cid bitmask ffffffffffffffff ffffffffffffffff ffffffffffffffff ffffffffffffffff ffffffffffffffff 44455a5fffffffff
[   65.142750] pid 16614, exec mksquashfs, maxcids 155 percpu 0 pcputhr 0, users 124 nrcpus_allwd 384
[   65.142752] cid bitmask ffffffffffffffff ffffffffffffffff ffffffffffffffff ffffffffffffffff ffffffffffffffff 44455a7fffffffff
..
[   65.143712] cid bitmask ffffffffffffffff ffffffffffffffff ffffffffffffffff ffffffffffffffff ffffffffffffffff 7fffffffffffffff
[   65.143767] pid 16614, exec mksquashfs, maxcids 175 percpu 0 pcputhr 0, users 140 nrcpus_allwd 384
[   65.143769] cid bitmask ffffffffffffffff ffffffffffffffff ffffffffffffffff ffffffffffffffff ffffffffffffffff ffffffffffffffff

Followed by the panic.
[   99.979256] watchdog: CPU114: Watchdog detected hard LOCKUP on cpu 114
.. 
[   99.979340] RIP: 0010:mm_get_cid+0xf5/0x150
[   99.979346] Code: 4d 8b 44 24 18 48 c7 c7 e0 07 86 b6 49 8b 4c 24 10 49 8b 54 24 08 41 ff 74 24 28 49 8b 34 24 e8 c1 b7 04 00 48 83 c4 18 f3 90 <8b> 05 65 ae ec 01 8b 35 eb e0 68 01 83 c0 3f 48 89 f5 c1 e8 03 25
[   99.979348] RSP: 0018:ff75650cf9717d20 EFLAGS: 00000046
[   99.979349] RAX: 0000000000000180 RBX: ff424236e5d55c40 RCX: 0000000000000180
[   99.979351] RDX: 0000000000000000 RSI: 0000000000000180 RDI: ff424236e5d55cd0
[   99.979352] RBP: 0000000000000180 R08: 0000000000000180 R09: c0000000fffdffff
[   99.979352] R10: 0000000000000001 R11: ff75650cf9717a80 R12: ff424236e5d55ca0
[   99.979353] R13: ff424236e5d55668 R14: ffa7650cba2841c0 R15: ff42423881a5aa80
[   99.979355] FS:  00007f469ed6b740(0000) GS:ff424351c24d6000(0000) knlGS:0000000000000000
[   99.979356] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   99.979357] CR2: 00007f443b7fdfb8 CR3: 0000012724555006 CR4: 0000000000771ef0
[   99.979358] PKRU: 55555554
[   99.979359] Call Trace:
[   99.979361]  <TASK>
[   99.979364]  sched_mm_cid_fork+0x3fb/0x590
[   99.979369]  copy_process+0xd1a/0x2130
[   99.979375]  kernel_clone+0x9d/0x3b0
[   99.979379]  __do_sys_clone+0x65/0x90
[   99.979384]  do_syscall_64+0x64/0x670
[   99.979388]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
[   99.979391] RIP: 0033:0x7f469d77d8c5


As you can see, at least when it cannot find available cids, it is in per-task mm cid mode.
Perhaps it is taking longer to drop used cids? I have not delved into the mm cid management.
Hopefully you can make out something from the above trace.

Let me know if you want me to add more tracing. 

-Prakash


> 
> Thanks,
> 
> Mathieu
> 
> -- 
> Mathieu Desnoyers
> EfficiOS Inc.
> https://www.efficios.com


Re: [patch V3 00/12] rseq: Implement time slice extension mechanism
Posted by Thomas Gleixner 2 months, 3 weeks ago
On Wed, Nov 12 2025 at 06:30, Prakash Sangappa wrote:
> The problem reproduces on a 2 socket AMD(384 cpus) bare metal system.
> It occurs soon after system boot up.  Does not reproduce on a 64cpu VM.

Can you verify that the topmost commit of the rseq/slice branch is:

    d2eb5c9c0693 ("selftests/rseq: Implement time slice extension test")

Thanks,

        tglx
Re: [patch V3 00/12] rseq: Implement time slice extension mechanism
Posted by Prakash Sangappa 2 months, 3 weeks ago

> On Nov 12, 2025, at 1:57 PM, Thomas Gleixner <tglx@linutronix.de> wrote:
> 
> On Wed, Nov 12 2025 at 06:30, Prakash Sangappa wrote:
>> The problem reproduces on a 2 socket AMD(384 cpus) bare metal system.
>> It occurs soon after system boot up.  Does not reproduce on a 64cpu VM.
> 
> Can you verify that the top most commit of the rseq/slice branch is:
> 
>    d2eb5c9c0693 ("selftests/rseq: Implement time slice extension test")

No, it is:

c46f12a1166058764da8e84a215a6b66cae2fe0a
    selftests/rseq: Implement time slice extension test

I can refresh and try.
-Prakash

> 
> Thanks,
> 
>        tglx

Re: [patch V3 00/12] rseq: Implement time slice extension mechanism
Posted by Prakash Sangappa 2 months, 3 weeks ago

> On Nov 12, 2025, at 3:17 PM, Prakash Sangappa <prakash.sangappa@oracle.com> wrote:
> 
> 
> 
>> On Nov 12, 2025, at 1:57 PM, Thomas Gleixner <tglx@linutronix.de> wrote:
>> 
>> On Wed, Nov 12 2025 at 06:30, Prakash Sangappa wrote:
>>> The problem reproduces on a 2 socket AMD(384 cpus) bare metal system.
>>> It occurs soon after system boot up.  Does not reproduce on a 64cpu VM.
>> 
>> Can you verify that the top most commit of the rseq/slice branch is:
>> 
>>   d2eb5c9c0693 ("selftests/rseq: Implement time slice extension test")
> 
> No it is
> 
> c46f12a1166058764da8e84a215a6b66cae2fe0a
>    selftests/rseq: Implement time slice extension test
> 

I tested the latest rseq/slice with the topmost commit you mentioned above, and the
watchdog panic does not reproduce anymore.

Thanks,
-Prakash

> I can refresh and try.
> -Prakash
> 
>> 
>> Thanks,
>> 
>>       tglx
> 

Re: [patch V3 00/12] rseq: Implement time slice extension mechanism
Posted by Thomas Gleixner 2 months, 3 weeks ago
On Thu, Nov 13 2025 at 02:34, Prakash Sangappa wrote:
>> On Nov 12, 2025, at 3:17 PM, Prakash Sangappa <prakash.sangappa@oracle.com> wrote:
>>> On Nov 12, 2025, at 1:57 PM, Thomas Gleixner <tglx@linutronix.de> wrote:
>>> Can you verify that the top most commit of the rseq/slice branch is:
>>> 
>>>   d2eb5c9c0693 ("selftests/rseq: Implement time slice extension test")
>> 
>> No it is
>> 
>> c46f12a1166058764da8e84a215a6b66cae2fe0a
>>    selftests/rseq: Implement time slice extension test
>> 
>
> Tested the latest from rseq/slice with the top most commit you mentioned above and the 
> watchdog panic does not reproduce anymore.

Thanks for checking. I'll post a V4 soonish.

Thanks,

        tglx
Re: [patch V3 00/12] rseq: Implement time slice extension mechanism
Posted by Mathieu Desnoyers 2 months, 3 weeks ago
On 2025-11-12 01:30, Prakash Sangappa wrote:
[...]
> 
> The problem reproduces on a 2 socket AMD(384 cpus) bare metal system.
> It occurs soon after system boot up.  Does not reproduce on a 64cpu VM.
> 
> Managed to grep the ‘mksquashfs’ command that was executing, which  triggers the panic.
> 
> #ps -ef |grep mksquash.
> root       16614   10829  0 05:55 ?        00:00:00 mksquashfs /dev/null /var/tmp/dracut.iLs0z0/.squash-test.img -no-progress -comp xz
> 
> 

[...]

> ..
> [   65.143712] cid bitmask ffffffffffffffff ffffffffffffffff ffffffffffffffff ffffffffffffffff ffffffffffffffff 7fffffffffffffff
> [   65.143767] pid 16614, exec mksquashfs, maxcids 175 percpu 0 pcputhr 0, users 140 nrcpus_allwd 384
> [   65.143769] cid bitmask ffffffffffffffff ffffffffffffffff ffffffffffffffff ffffffffffffffff ffffffffffffffff ffffffffffffffff
> 

It's weird that the cid bitmask is all f values (all ones). Aren't those
zeroed on mm init?

> Followed by the panic.
> [   99.979256] watchdog: CPU114: Watchdog detected hard LOCKUP on cpu 114
> ..
[...]
> 
> As you can see, at least when it cannot find available cid’s it is in per-task mm cid mode.
> Perhaps it is taking longer to drop used cid’s? I have not delved into the mm cid management.
> Hopeful you can make out something from the above trace.
> 
> Let me know if you want me to add more tracing.

How soon is that after boot-up?

I'm starting to wonder whether the num_possible_cpus() value used in
mm_cid_size() and mm_init_cid(), for mm allocation and initialization
respectively, may be read before it is initialized by the boot-up
sequence.

That's far-fetched, but it would be good if we could double-check that
those are never called before the last calls to init_cpu_possible() and
set_cpu_possible().
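
As a minimal sketch, such a double-check could sit at the end of
mm_init_cid(); it would also catch a mask that was never zeroed, which
the bitmask dumps above hint at (helper names as used earlier in this
thread, placement and message illustrative only):

        /* Hypothetical debug check: the CID bitmap must start out empty */
        WARN_ONCE(!bitmap_empty((unsigned long *)mm_cidmask(mm),
                                num_possible_cpus()),
                  "mm_cid bitmask not zeroed for %u possible CPUs\n",
                  num_possible_cpus());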

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
Re: [patch V3 00/12] rseq: Implement time slice extension mechanism
Posted by Mathieu Desnoyers 2 months, 4 weeks ago
On 2025-11-10 09:23, Mathieu Desnoyers wrote:
> On 2025-11-06 12:28, Prakash Sangappa wrote:
> [...]
>> Hit this watchdog panic.
>>
>> Using the following tree. I assume this is the latest.
>> https://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git/ rseq/slice
>>
>> Appears to be spinning in mm_get_cid(). Must be the mm cid changes.
>> https://lore.kernel.org/all/20251029123717.886619142@linutronix.de/
> 
> When this happened during the development of the "complex" mm_cid
> scheme, this was typically caused by a stale "mm_cid" being kept around
> by a task even though it was not actually scheduled, thus causing
> over-reservation of concurrency IDs beyond the max_cids threshold. This
> ends up looping in:
> 
> static inline unsigned int mm_get_cid(struct mm_struct *mm)
> {
>          unsigned int cid = __mm_get_cid(mm, READ_ONCE(mm->mm_cid.max_cids));
> 
>          while (cid == MM_CID_UNSET) {
>                  cpu_relax();
>                  cid = __mm_get_cid(mm, num_possible_cpus());
>          }
>          return cid;
> }
> 
> Based on the stacktrace you provided, it seems to happen within
> sched_mm_cid_fork() within copy_process, so perhaps it's simply an
> initialization issue in fork, or an issue when cloning a new thread?

One possible issue here: I note that kernel/sched/core.c:mm_init_cid()
misses the following initialization:

   mm->mm_cid.transit = 0;
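
A minimal sketch of the fixed initialization (the sibling fields are
taken from the instrumentation earlier in this thread; initial values
and everything else are assumptions):

static inline void mm_init_cid(struct mm_struct *mm, struct task_struct *p)
{
        mm->mm_cid.percpu   = 0;
        mm->mm_cid.transit  = 0;        /* the missing initialization */
        mm->mm_cid.users    = 0;
        mm->mm_cid.max_cids = 1;        /* single user until clone/fork */
        cpumask_clear(mm_cidmask(mm));  /* CID bitmap starts out empty */
}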

Thanks,

Mathieu


> 
> Thanks,
> 
> Mathieu
> 


-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com