This is a follow-up to the V2 series, which can be found here:
https://lore.kernel.org/20251022104005.907410538@linutronix.de
The V1 cover letter contains a detailed analysis of the issues:
https://lore.kernel.org/20251015164952.694882104@linutronix.de
TLDR: The CID management is way too complex and adds significant overhead
to scheduler hotpaths.
The series rewrites MM CID management in a simpler way which focuses on
low overhead in the scheduler while maintaining per task CIDs as long as
the number of threads does not exceed the number of possible CPUs.
The series is based on the V6 series of the rseq rewrite:
https://lore.kernel.org/20251027084220.785525188@linutronix.de
which is also available from git:
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git core/rseq
The series on top of the tip core/rseq branch is available from git as
well:
git://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git rseq/cid
Changes vs. V2:
- Rename to cpumask/bitmap_weighted_or() - Yury
- Zero the bitmap with length of bitmap_size(nr_possible_cpus()) -
Shrikanth
- Move cpu_relax() out of for() as that fails to build when cpu_relax()
is a macro. - Shrikanth
- Picked up Reviewed/Acked-by tags where appropriate
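
The cpu_relax() build failure mentioned above can be illustrated with a
minimal userspace sketch. It is not the code from the series: the
cpu_relax() definition below is a hypothetical stand-in for an architecture
that provides it as a statement-like macro, and wait_for_clear() is an
invented helper name.

```c
#include <assert.h>
#include <stdatomic.h>

/*
 * Hypothetical stand-in: a statement-like macro, as opposed to an inline
 * function. The fence only keeps the compiler from reordering around it.
 */
#define cpu_relax() do { atomic_signal_fence(memory_order_seq_cst); } while (0)

/*
 * With the macro above this form does not compile, because a
 * do { } while (0) statement is not an expression and therefore cannot
 * appear in the for() header:
 *
 *	for (; atomic_load(busy); cpu_relax())
 *		;
 */

/*
 * Portable form: the relax hint lives in the loop body.
 * Returns the number of spins performed, bounded by max_spins.
 */
static unsigned int wait_for_clear(atomic_int *busy, unsigned int max_spins)
{
	unsigned int spins = 0;

	assert(busy);
	while (atomic_load_explicit(busy, memory_order_acquire) &&
	       spins < max_spins) {
		cpu_relax();
		spins++;
	}
	return spins;
}
```

Placing the call in the loop body works whether cpu_relax() is a function
or a macro, which is presumably why the loop was restructured that way.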
Thanks,
tglx
---
Thomas Gleixner (20):
sched/mmcid: Revert the complex CID management
sched/mmcid: Use proper data structures
sched/mmcid: Cacheline align MM CID storage
sched: Fixup whitespace damage
sched/mmcid: Move scheduler code out of global header
sched/mmcid: Prevent pointless work in mm_update_cpus_allowed()
cpumask: Introduce cpumask_weighted_or()
sched/mmcid: Use cpumask_weighted_or()
cpumask: Cache num_possible_cpus()
sched/mmcid: Convert mm CID mask to a bitmap
signal: Move MMCID exit out of sighand lock
sched/mmcid: Move initialization out of line
sched/mmcid: Provide precomputed maximal value
sched/mmcid: Serialize sched_mm_cid_fork()/exit() with a mutex
sched/mmcid: Introduce per task/CPU ownership infrastrcuture
sched/mmcid: Provide new scheduler CID mechanism
sched/mmcid: Provide CID ownership mode fixup functions
irqwork: Move data struct to a types header
sched/mmcid: Implement deferred mode change
sched/mmcid: Switch over to the new mechanism
include/linux/bitmap.h | 15
include/linux/cpumask.h | 26 +
include/linux/irq_work.h | 9
include/linux/irq_work_types.h | 14
include/linux/mm_types.h | 125 ------
include/linux/rseq.h | 27 -
include/linux/rseq_types.h | 71 +++
include/linux/sched.h | 19
init/init_task.c | 3
kernel/cpu.c | 15
kernel/exit.c | 1
kernel/fork.c | 7
kernel/sched/core.c | 815 +++++++++++++++++++----------------------
kernel/sched/sched.h | 395 ++++++++-----------
kernel/signal.c | 2
lib/bitmap.c | 6
16 files changed, 727 insertions(+), 823 deletions(-)
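
For readers unfamiliar with the "weighted or" idea behind
cpumask_weighted_or()/bitmap_weighted_or(), the following userspace sketch
shows the concept: OR two bitmaps and compute the weight (number of set
bits) of the result in the same pass. It is an illustration only. The names
weighted_or(), bitmap_bytes() and BITS_PER_LONG_SKETCH are invented for
this sketch and do not claim to match the kernel's signatures or
implementation.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

#define BITS_PER_LONG_SKETCH (sizeof(unsigned long) * CHAR_BIT)

/* Bytes needed for a bitmap of nbits bits, rounded up to whole longs */
static size_t bitmap_bytes(size_t nbits)
{
	size_t longs = (nbits + BITS_PER_LONG_SKETCH - 1) / BITS_PER_LONG_SKETCH;

	return longs * sizeof(unsigned long);
}

/* Portable population count: clears the lowest set bit per iteration */
static unsigned int popcount_long(unsigned long w)
{
	unsigned int n = 0;

	while (w) {
		w &= w - 1;
		n++;
	}
	return n;
}

/*
 * dst = a | b over the first nbits bits. Returns the number of set bits
 * in the result, so callers need no second pass to compute the weight.
 */
static unsigned int weighted_or(unsigned long *dst, const unsigned long *a,
				const unsigned long *b, size_t nbits)
{
	size_t full = nbits / BITS_PER_LONG_SKETCH, i;
	size_t tail = nbits % BITS_PER_LONG_SKETCH;
	unsigned int weight = 0;

	assert(dst && a && b);
	for (i = 0; i < full; i++) {
		dst[i] = a[i] | b[i];
		weight += popcount_long(dst[i]);
	}
	if (tail) {
		/* Only the low 'tail' bits of the last word are valid */
		unsigned long mask = (1UL << tail) - 1;

		dst[i] = (a[i] | b[i]) & mask;
		weight += popcount_long(dst[i]);
	}
	return weight;
}
```

With NR_CPUS=8192, as in the crash report further down, such a bitmap
occupies bitmap_bytes(8192) = 1024 bytes when sized for the compile-time
maximum, which is why sizing and zeroing by the number of possible CPUs
instead matters.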
Hi Thomas.
On 10/29/25 6:38 PM, Thomas Gleixner wrote:
> [...]
I am running into a crash at boot on power10 pseries.
I thought I'd report it here first while I try to figure out why.
I am using your tree:
git://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git
commit 789ff6e7cc5aa423473eb135f94812fe77b8aeab (HEAD -> rseq/cid, origin/rseq/cid)
Author: Thomas Gleixner <tglx@linutronix.de>
Date: Tue Oct 14 10:51:04 2025 +0200
sched/mmcid: Switch over to the new mechanism
Oops: Kernel access of bad area, sig: 7 [#3]
LE PAGE_SIZE=64K MMU=Radix SMP NR_CPUS=8192 NUMA pSeries
Modules linked in: drm drm_panel_orientation_quirks xfs sd_mod sg ibmvscsi ibmveth scsi_transport_srp pseries_wdt dm_mirror dm_region_hash dm_log dm_mod fuse
CPU: 96 UID: 0 PID: 0 Comm: swapper/96 Tainted: G D W 6.18.0-rc3+ #4 PREEMPT(lazy)
Tainted: [D]=DIE, [W]=WARN
NIP [c0000000001b5c10] mm_cid_switch_to+0x58/0x52c
LR [c000000001117c84] __schedule+0x4bc/0x760
Call Trace:
[c00000668367fde0] [c0000000001b53c8] __pick_next_task+0x60/0x2ac (unreliable)
[c00000668367fe40] [c000000001117a14] __schedule+0x24c/0x760
[c00000668367fee0] [c0000000011183d0] schedule_idle+0x3c/0x64
[c00000668367ff10] [c0000000001f2470] do_idle+0x15c/0x1ac
[c00000668367ff60] [c0000000001f2788] cpu_startup_entry+0x4c/0x50
[c00000668367ff90] [c00000000005ef20] start_secondary+0x284/0x288
[c00000668367ffe0] [c00000000000e158] start_secondary_prolog+0x10/0x14
On 10/30/25 10:30 AM, Shrikanth Hegde wrote:
> I am running into crash at boot on power10 pseries.
> Thought of putting it here first. Me trying to figure out why.
[...]
> NIP [c0000000001b5c10] mm_cid_switch_to+0x58/0x52c
> LR [c000000001117c84] __schedule+0x4bc/0x760
[...]
Issue happens with NR_CPUS=8192. System boots fine with NR_CPUS=2048
On Thu, Oct 30 2025 at 12:10, Shrikanth Hegde wrote:
>> I am running into crash at boot on power10 pseries.
>> Thought of putting it here first. Me trying to figure out why.
>>
>> I am using your tree.
>> git://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git
Can you update and revalidate? There are a couple of fixes there though
I don't know how they would be related.
>> Oops: Kernel access of bad area, sig: 7 [#3]
>> LE PAGE_SIZE=64K MMU=Radix SMP NR_CPUS=8192 NUMA pSeries
>> Modules linked in: drm drm_panel_orientation_quirks xfs sd_mod sg
>> ibmvscsi ibmveth scsi_transport_srp pseries_wdt dm_mirror dm_region_hash
>> dm_log dm_mod fuse
>> CPU: 96 UID: 0 PID: 0 Comm: swapper/96 Tainted: G D W
>> 6.18.0-rc3+ #4 PREEMPT(lazy)
>> Tainted: [D]=DIE, [W]=WARN
>> NIP [c0000000001b5c10] mm_cid_switch_to+0x58/0x52c
If it happens again, can you decode the source line?
>> LR [c000000001117c84] __schedule+0x4bc/0x760
>> Call Trace:
>> [c00000668367fde0] [c0000000001b53c8] __pick_next_task+0x60/0x2ac
>> (unreliable)
>> [c00000668367fe40] [c000000001117a14] __schedule+0x24c/0x760
>> [c00000668367fee0] [c0000000011183d0] schedule_idle+0x3c/0x64
>> [c00000668367ff10] [c0000000001f2470] do_idle+0x15c/0x1ac
>> [c00000668367ff60] [c0000000001f2788] cpu_startup_entry+0x4c/0x50
>> [c00000668367ff90] [c00000000005ef20] start_secondary+0x284/0x288
>> [c00000668367ffe0] [c00000000000e158] start_secondary_prolog+0x10/0x14
>>
>
> Issue happens with NR_CPUS=8192. System boots fine with NR_CPUS=2048
Hmm. Let me build a kernel with 8K and throw it at a VM then.
Thanks,
tglx
On 11/1/25 1:06 AM, Thomas Gleixner wrote:
> On Thu, Oct 30 2025 at 12:10, Shrikanth Hegde wrote:
>>> I am running into crash at boot on power10 pseries.
>>> Thought of putting it here first. Me trying to figure out why.
>>>
>>> I am using your tree.
>>> git://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git
>
> Can you update and revalidate? There are a couple of fixes there though
> I don't know how they would be related.
Tried with latest. It boots fine with NR_CPUS=8192.
at commit:
commit 870c1793316eddb6f8c9814f830f237e6e1c40ee (origin/rseq/cid)
Author: Thomas Gleixner <tglx@linutronix.de>
Date: Tue Oct 14 10:51:04 2025 +0200
sched/mmcid: Switch over to the new mechanism
On Sat, Nov 01 2025 at 13:26, Shrikanth Hegde wrote:
> On 11/1/25 1:06 AM, Thomas Gleixner wrote:
>> Can you update and revalidate? There are a couple of fixes there though
>> I don't know how they would be related.
>
> Tried with latest. It boots fine with NR_CPUS=8192.
Thanks a lot!
2025-10-29T13:09:04Z Thomas Gleixner <tglx@linutronix.de>:
> This is a follow up on V2 series which can be found here:
>
I confirm this series passes the selftest in [1] consistently and the
observed latency spikes caused by task_mm_cid_work are gone.
Tested-by: Gabriele Monaco <gmonaco@redhat.com>
Thanks,
Gabriele
[1] - https://lore.kernel.org/lkml/20250929114225.36172-5-gmonaco@redhat.com
> https://lore.kernel.org/20251022104005.907410538@linutronix.de
> [...]