[patch V3 00/20] sched: Rewrite MM CID management

Thomas Gleixner posted 20 patches 3 months, 1 week ago
There is a newer version of this series
[patch V3 00/20] sched: Rewrite MM CID management
Posted by Thomas Gleixner 3 months, 1 week ago
This is a follow-up to the V2 series, which can be found here:

    https://lore.kernel.org/20251022104005.907410538@linutronix.de

The V1 cover letter contains a detailed analysis of the issues:

    https://lore.kernel.org/20251015164952.694882104@linutronix.de

TLDR: The CID management is way too complex and adds significant overhead
to scheduler hotpaths.

The series rewrites MM CID management in a simpler way which focuses on
low overhead in the scheduler while maintaining per task CIDs as long as
the number of threads does not exceed the number of possible CPUs.
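
The rule stated above can be illustrated with a minimal userspace sketch.
The names cid_mode, CID_PER_TASK, CID_PER_CPU and pick_cid_mode() are
hypothetical and not taken from the series. Only the condition "threads do
not exceed possible CPUs" comes from the text; the fallback to per-CPU
ownership is an assumption inferred from the patch titles listed further
down.

```c
/* Hypothetical names, for illustration only; not the kernel's. */
enum cid_mode {
	CID_PER_TASK,	/* every thread keeps its own CID */
	CID_PER_CPU,	/* assumed fallback: CIDs are owned by CPUs */
};

/*
 * Per-task ownership is possible while each thread can hold a CID of
 * its own, i.e. while the thread count does not exceed the number of
 * possible CPUs.
 */
static enum cid_mode pick_cid_mode(unsigned int nr_threads,
				   unsigned int nr_possible_cpus)
{
	return nr_threads <= nr_possible_cpus ? CID_PER_TASK : CID_PER_CPU;
}
```

With 8 possible CPUs, a process with 8 threads would stay in the per-task
mode under this sketch, and a ninth thread would tip it over.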

The series is based on the V6 series of the rseq rewrite:

    https://lore.kernel.org/20251027084220.785525188@linutronix.de

which is also available from git:

    git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git core/rseq

The series on top of the tip core/rseq branch is available from git as
well:

    git://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git rseq/cid

Changes vs. V2:

   - Rename to cpumask/bitmap_weighted_or() - Yury

   - Zero the bitmap with length of bitmap_size(nr_possible_cpus()) -
     Shrikanth
   
   - Move cpu_relax() out of for() as that fails to build when cpu_relax()
     is a macro. - Shrikanth

   - Picked up Reviewed/Acked-by tags where appropriate
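
For readers unfamiliar with the helpers named in the list above, here is a
rough userspace sketch of the two ideas: sizing and zeroing a bitmap in
whole longs, and an OR that also returns the weight (number of set bits)
of the result. All identifiers carry a _sketch suffix to mark them as
hypothetical; this is not the kernel implementation, and it relies on the
GCC/Clang builtin __builtin_popcountl().

```c
#include <limits.h>
#include <stddef.h>
#include <string.h>

#define BITS_PER_LONG_SKETCH	(sizeof(unsigned long) * CHAR_BIT)

/* Number of unsigned longs needed to hold nbits bits. */
static size_t bitmap_longs_sketch(size_t nbits)
{
	return (nbits + BITS_PER_LONG_SKETCH - 1) / BITS_PER_LONG_SKETCH;
}

/* Byte size of a bitmap of nbits bits, rounded up to whole longs. */
static size_t bitmap_size_sketch(size_t nbits)
{
	return bitmap_longs_sketch(nbits) * sizeof(unsigned long);
}

/* Zero exactly the storage that a bitmap of nbits bits occupies. */
static void bitmap_zero_sketch(unsigned long *map, size_t nbits)
{
	memset(map, 0, bitmap_size_sketch(nbits));
}

/*
 * dst = src1 | src2, returning the number of set bits in dst.
 * Assumes that bits beyond nbits in the last word of both sources
 * are zero, so that they do not inflate the weight.
 */
static unsigned int weighted_or_sketch(unsigned long *dst,
				       const unsigned long *src1,
				       const unsigned long *src2,
				       size_t nbits)
{
	size_t i, longs = bitmap_longs_sketch(nbits);
	unsigned int weight = 0;

	for (i = 0; i < longs; i++) {
		dst[i] = src1[i] | src2[i];
		weight += (unsigned int)__builtin_popcountl(dst[i]);
	}
	return weight;
}
```

Combining the OR and the weight in one pass avoids walking the result a
second time just to count bits, which is presumably the point of having a
dedicated helper.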

Thanks,

	tglx
---
Thomas Gleixner (20):
      sched/mmcid: Revert the complex CID management
      sched/mmcid: Use proper data structures
      sched/mmcid: Cacheline align MM CID storage
      sched: Fixup whitespace damage
      sched/mmcid: Move scheduler code out of global header
      sched/mmcid: Prevent pointless work in mm_update_cpus_allowed()
      cpumask: Introduce cpumask_weighted_or()
      sched/mmcid: Use cpumask_weighted_or()
      cpumask: Cache num_possible_cpus()
      sched/mmcid: Convert mm CID mask to a bitmap
      signal: Move MMCID exit out of sighand lock
      sched/mmcid: Move initialization out of line
      sched/mmcid: Provide precomputed maximal value
      sched/mmcid: Serialize sched_mm_cid_fork()/exit() with a mutex
      sched/mmcid: Introduce per task/CPU ownership infrastructure
      sched/mmcid: Provide new scheduler CID mechanism
      sched/mmcid: Provide CID ownership mode fixup functions
      irqwork: Move data struct to a types header
      sched/mmcid: Implement deferred mode change
      sched/mmcid: Switch over to the new mechanism

 include/linux/bitmap.h         |   15 
 include/linux/cpumask.h        |   26 +
 include/linux/irq_work.h       |    9 
 include/linux/irq_work_types.h |   14 
 include/linux/mm_types.h       |  125 ------
 include/linux/rseq.h           |   27 -
 include/linux/rseq_types.h     |   71 +++
 include/linux/sched.h          |   19 
 init/init_task.c               |    3 
 kernel/cpu.c                   |   15 
 kernel/exit.c                  |    1 
 kernel/fork.c                  |    7 
 kernel/sched/core.c            |  815 +++++++++++++++++++----------------------
 kernel/sched/sched.h           |  395 ++++++++-----------
 kernel/signal.c                |    2 
 lib/bitmap.c                   |    6 
 16 files changed, 727 insertions(+), 823 deletions(-)
Re: [patch V3 00/20] sched: Rewrite MM CID management
Posted by Shrikanth Hegde 3 months, 1 week ago
Hi Thomas.

I am running into a crash at boot on power10 pseries.
I thought I would report it here first; I am still trying to figure out why.

I am using your tree.
git://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git

commit 789ff6e7cc5aa423473eb135f94812fe77b8aeab (HEAD -> rseq/cid, origin/rseq/cid)
Author: Thomas Gleixner <tglx@linutronix.de>
Date:   Tue Oct 14 10:51:04 2025 +0200

     sched/mmcid: Switch over to the new mechanism


Oops: Kernel access of bad area, sig: 7 [#3]
LE PAGE_SIZE=64K MMU=Radix  SMP NR_CPUS=8192 NUMA pSeries
Modules linked in: drm drm_panel_orientation_quirks xfs sd_mod sg ibmvscsi ibmveth scsi_transport_srp pseries_wdt dm_mirror dm_region_hash dm_log dm_mod fuse
CPU: 96 UID: 0 PID: 0 Comm: swapper/96 Tainted: G      D W           6.18.0-rc3+ #4 PREEMPT(lazy)
Tainted: [D]=DIE, [W]=WARN
NIP [c0000000001b5c10] mm_cid_switch_to+0x58/0x52c
LR [c000000001117c84] __schedule+0x4bc/0x760
Call Trace:
[c00000668367fde0] [c0000000001b53c8] __pick_next_task+0x60/0x2ac (unreliable)
[c00000668367fe40] [c000000001117a14] __schedule+0x24c/0x760
[c00000668367fee0] [c0000000011183d0] schedule_idle+0x3c/0x64
[c00000668367ff10] [c0000000001f2470] do_idle+0x15c/0x1ac
[c00000668367ff60] [c0000000001f2788] cpu_startup_entry+0x4c/0x50
[c00000668367ff90] [c00000000005ef20] start_secondary+0x284/0x288
[c00000668367ffe0] [c00000000000e158] start_secondary_prolog+0x10/0x14
Re: [patch V3 00/20] sched: Rewrite MM CID management
Posted by Shrikanth Hegde 3 months, 1 week ago

On 10/30/25 10:30 AM, Shrikanth Hegde wrote:
> Hi Thomas.

> 
> I am running into crash at boot on power10 pseries.
> Thought of putting it here first. Me trying to figure out why.
> 
> I am using your tree.
> git://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git
> 
> commit 789ff6e7cc5aa423473eb135f94812fe77b8aeab (HEAD -> rseq/cid, 
> origin/rseq/cid)
> Author: Thomas Gleixner <tglx@linutronix.de>
> Date:   Tue Oct 14 10:51:04 2025 +0200
> 
>      sched/mmcid: Switch over to the new mechanism
> 
> 
> Oops: Kernel access of bad area, sig: 7 [#3]
> LE PAGE_SIZE=64K MMU=Radix  SMP NR_CPUS=8192 NUMA pSeries
> Modules linked in: drm drm_panel_orientation_quirks xfs sd_mod sg 
> ibmvscsi ibmveth scsi_transport_srp pseries_wdt dm_mirror dm_region_hash 
> dm_log dm_mod fuse
> CPU: 96 UID: 0 PID: 0 Comm: swapper/96 Tainted: G      D W           
> 6.18.0-rc3+ #4 PREEMPT(lazy)
> Tainted: [D]=DIE, [W]=WARN
> NIP [c0000000001b5c10] mm_cid_switch_to+0x58/0x52c
> LR [c000000001117c84] __schedule+0x4bc/0x760
> Call Trace:
> [c00000668367fde0] [c0000000001b53c8] __pick_next_task+0x60/0x2ac 
> (unreliable)
> [c00000668367fe40] [c000000001117a14] __schedule+0x24c/0x760
> [c00000668367fee0] [c0000000011183d0] schedule_idle+0x3c/0x64
> [c00000668367ff10] [c0000000001f2470] do_idle+0x15c/0x1ac
> [c00000668367ff60] [c0000000001f2788] cpu_startup_entry+0x4c/0x50
> [c00000668367ff90] [c00000000005ef20] start_secondary+0x284/0x288
> [c00000668367ffe0] [c00000000000e158] start_secondary_prolog+0x10/0x14
> 

The issue happens with NR_CPUS=8192. The system boots fine with NR_CPUS=2048.
Re: [patch V3 00/20] sched: Rewrite MM CID management
Posted by Thomas Gleixner 3 months, 1 week ago
On Thu, Oct 30 2025 at 12:10, Shrikanth Hegde wrote:
>> I am running into crash at boot on power10 pseries.
>> Thought of putting it here first. Me trying to figure out why.
>> 
>> I am using your tree.
>> git://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git

Can you update and revalidate? There are a couple of fixes there though
I don't know how they would be related.

>> Oops: Kernel access of bad area, sig: 7 [#3]
>> LE PAGE_SIZE=64K MMU=Radix  SMP NR_CPUS=8192 NUMA pSeries
>> Modules linked in: drm drm_panel_orientation_quirks xfs sd_mod sg 
>> ibmvscsi ibmveth scsi_transport_srp pseries_wdt dm_mirror dm_region_hash 
>> dm_log dm_mod fuse
>> CPU: 96 UID: 0 PID: 0 Comm: swapper/96 Tainted: G      D W           
>> 6.18.0-rc3+ #4 PREEMPT(lazy)
>> Tainted: [D]=DIE, [W]=WARN
>> NIP [c0000000001b5c10] mm_cid_switch_to+0x58/0x52c

If it happens again, can you decode the source line?

>> LR [c000000001117c84] __schedule+0x4bc/0x760
>> Call Trace:
>> [c00000668367fde0] [c0000000001b53c8] __pick_next_task+0x60/0x2ac 
>> (unreliable)
>> [c00000668367fe40] [c000000001117a14] __schedule+0x24c/0x760
>> [c00000668367fee0] [c0000000011183d0] schedule_idle+0x3c/0x64
>> [c00000668367ff10] [c0000000001f2470] do_idle+0x15c/0x1ac
>> [c00000668367ff60] [c0000000001f2788] cpu_startup_entry+0x4c/0x50
>> [c00000668367ff90] [c00000000005ef20] start_secondary+0x284/0x288
>> [c00000668367ffe0] [c00000000000e158] start_secondary_prolog+0x10/0x14
>> 
>
> Issue happens with NR_CPUS=8192. System boots fine with NR_CPUS=2048

Hmm. Let me build a kernel with 8K and throw it at a VM then.

Thanks,

        tglx
Re: [patch V3 00/20] sched: Rewrite MM CID management
Posted by Shrikanth Hegde 3 months, 1 week ago

On 11/1/25 1:06 AM, Thomas Gleixner wrote:
> On Thu, Oct 30 2025 at 12:10, Shrikanth Hegde wrote:
>>> I am running into crash at boot on power10 pseries.
>>> Thought of putting it here first. Me trying to figure out why.
>>>
>>> I am using your tree.
>>> git://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git
> 
> Can you update and revalidate? There are a couple of fixes there though
> I don't know how they would be related.

Tried with latest. It boots fine with NR_CPUS=8192.

at commit:
commit 870c1793316eddb6f8c9814f830f237e6e1c40ee (origin/rseq/cid)
Author: Thomas Gleixner <tglx@linutronix.de>
Date:   Tue Oct 14 10:51:04 2025 +0200

     sched/mmcid: Switch over to the new mechanism
Re: [patch V3 00/20] sched: Rewrite MM CID management
Posted by Thomas Gleixner 3 months, 1 week ago
On Sat, Nov 01 2025 at 13:26, Shrikanth Hegde wrote:
> On 11/1/25 1:06 AM, Thomas Gleixner wrote:
>> On Thu, Oct 30 2025 at 12:10, Shrikanth Hegde wrote:
>>>> I am running into crash at boot on power10 pseries.
>>>> Thought of putting it here first. Me trying to figure out why.
>>>>
>>>> I am using your tree.
>>>> git://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git
>> 
>> Can you update and revalidate? There are a couple of fixes there though
>> I don't know how they would be related.
>
> Tried with latest. It boots fine with NR_CPUS=8192.

Thanks a lot!
Re: [patch V3 00/20] sched: Rewrite MM CID management
Posted by Gabriele Monaco 2 months, 4 weeks ago
2025-10-29T13:09:04Z Thomas Gleixner <tglx@linutronix.de>:

> This is a follow up on V2 series which can be found here:
>

I confirm this series passes the selftest in [1] consistently and the observed latency spikes caused by task_mm_cid_work are gone.

Tested-by: Gabriele Monaco <gmonaco@redhat.com>

Thanks,
Gabriele

[1] - https://lore.kernel.org/lkml/20250929114225.36172-5-gmonaco@redhat.com
