[v1] sched/mmcid: Cure mode transition woes

[patch 4/4] sched/mmcid: Optimize transitional CIDs when scheduling out

Posted by Thomas Gleixner 1 week, 1 day ago

During the investigation of the various transition mode issues,
instrumentation revealed that the number of bitmap operations can be
significantly reduced when a task with a transitional CID schedules out
after the fixup function has completed and disabled the transition mode.

At that point the mode is stable and therefore it is not required to drop
the transitional CID back into the pool. As the fixup is complete, the
potential exhaustion of the CID pool is no longer possible, so the CID can
be transferred to the scheduling out task or to the CPU depending on the
current ownership mode. This is now possible because mm_cid::mode contains
both the ownership state and the transition bit, so the racy snapshot is
valid under all circumstances because a subsequent modification of the
mode is serialized by the corresponding runqueue lock.

Assigning the ownership right there not only spares the bitmap access for
dropping the CID, it also avoids the bitmap access when the task is
scheduled back in, as it directly hits the fast path in both modes when the
CID is within the optimal range. If it's outside the range, the next
schedule in will need to converge, so dropping it right away is sensible.
In the good case this also allows the next schedule in operation to take
the fast path.

With a thread pool benchmark which is configured to cross the mode switch
boundaries frequently this reduces the number of bitmap operations by about
30% and increases the fastpath utilization in the low single digit
percentage range.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 kernel/sched/sched.h |   24 ++++++++++++++++++++++--
 1 file changed, 22 insertions(+), 2 deletions(-)

--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -3902,12 +3902,32 @@ static __always_inline void mm_cid_sched
 
 static __always_inline void mm_cid_schedout(struct task_struct *prev)
 {
+	struct mm_struct *mm = prev->mm;
+	unsigned int mode, cid;
+
 	/* During mode transitions CIDs are temporary and need to be dropped */
 	if (likely(!cid_in_transit(prev->mm_cid.cid)))
 		return;
 
-	mm_drop_cid(prev->mm, cid_from_transit_cid(prev->mm_cid.cid));
-	prev->mm_cid.cid = MM_CID_UNSET;
+	mode = READ_ONCE(mm->mm_cid.mode);
+	cid = cid_from_transit_cid(prev->mm_cid.cid);
+
+	/*
+	 * If transition mode is done, transfer ownership when the CID is
+	 * within the convergion range. Otherwise the next schedule in will
+	 * have to allocate or converge
+	 */
+	if (!cid_in_transit(mode) && cid < READ_ONCE(mm->mm_cid.max_cids)) {
+		if (cid_on_cpu(mode))
+			cid = cid_to_cpu_cid(cid);
+
+		/* Update both so that the next schedule in goes into the fast path */
+		mm_cid_update_pcpu_cid(mm, cid);
+		prev->mm_cid.cid = cid;
+	} else {
+		mm_drop_cid(mm, cid);
+		prev->mm_cid.cid = MM_CID_UNSET;
+	}
 }
 
 static inline void mm_cid_switch_to(struct task_struct *prev, struct task_struct *next)
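
The schedule-out decision in the hunk above can be modelled in user space
to see when a bitmap operation is saved. This is only a sketch: the sk_*
names, the bit layout, the 64-bit bitmap and the bitmap_ops counter are
assumptions made for illustration and are not the kernel's definitions.
The READ_ONCE() snapshots are replaced by plain loads because the model is
single threaded.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bit layout, chosen for this sketch only. */
#define SK_CID_UNSET	0x80000000u
#define SK_CID_TRANSIT	0x40000000u
#define SK_CID_ONCPU	0x20000000u

struct sk_mm {
	unsigned int	mode;		/* ownership bit plus transition bit */
	unsigned int	max_cids;	/* upper bound of the optimal range */
	uint64_t	bitmap;		/* toy CID pool, one bit per CID */
	unsigned int	pcpu_cid;	/* stand-in for the per CPU CID slot */
	unsigned long	bitmap_ops;	/* counts modelled bitmap accesses */
};

struct sk_task {
	unsigned int	cid;
};

static bool sk_in_transit(unsigned int val)
{
	return val & SK_CID_TRANSIT;
}

static bool sk_on_cpu(unsigned int val)
{
	return val & SK_CID_ONCPU;
}

/* Return a CID to the toy pool and account for the bitmap access. */
static void sk_drop_cid(struct sk_mm *mm, unsigned int cid)
{
	assert(cid < 64);
	mm->bitmap &= ~(1ull << cid);
	mm->bitmap_ops++;
}

static void sk_schedout(struct sk_mm *mm, struct sk_task *prev)
{
	unsigned int mode, cid;

	/* Only transitional CIDs need any treatment on schedule out */
	if (!sk_in_transit(prev->cid))
		return;

	mode = mm->mode;
	cid = prev->cid & ~SK_CID_TRANSIT;

	if (!sk_in_transit(mode) && cid < mm->max_cids) {
		/* Transition is over: keep the CID, tagged for the current mode */
		if (sk_on_cpu(mode))
			cid |= SK_CID_ONCPU;
		mm->pcpu_cid = cid;
		prev->cid = cid;
	} else {
		/* Still in transition or out of range: give the CID back */
		sk_drop_cid(mm, cid);
		prev->cid = SK_CID_UNSET;
	}
}
```

In this model a task holding CID 2 in transit, with a stable mode and
max_cids of 4, keeps its CID and bitmap_ops stays at zero. The same task
scheduling out while the mode still has the transition bit set, or while
holding a CID at or above max_cids, costs one bitmap operation and ends up
with the CID unset.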

Re: [patch 4/4] sched/mmcid: Optimize transitional CIDs when scheduling out

Posted by Mathieu Desnoyers 1 week ago

On 2026-01-29 16:20, Thomas Gleixner wrote:
> During the investigation of the various transition mode issues
> instrumentation revealed that the amount of bitmap operations can be
> significantly reduced when a task with a transitional CID schedules out
> after the fixup function completed and disabled the transition mode.
> 
> At that point the mode is stable and therefore it is not required to drop
> the transitional CID back into the pool. As the fixup is complete the
> potential exhaustion of the CID pool is no longer possible, so the CID can
> be transferred to the scheduling out task or to the CPU depending on the
> current ownership mode. This is now possible because mm_cid::mode contains
> both the ownership state and the transition bit so the racy snapshot is
> valid under all circumstances because a subsequent modification of the
> mode is serialized by the corresponding runqueue lock.

AFAIU the mc->mode updates are serialized by the mm->mm_cid.lock
and not the runqueue locks. What am I missing ?

[...]

> +	/*
> +	 * If transition mode is done, transfer ownership when the CID is
> +	 * within the convergion range. Otherwise the next schedule in will

convergence

> +	 * have to allocate or converge

add final ".".

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com

Re: [patch 4/4] sched/mmcid: Optimize transitional CIDs when scheduling out

Posted by Thomas Gleixner 1 week ago

On Fri, Jan 30 2026 at 10:50, Mathieu Desnoyers wrote:
> On 2026-01-29 16:20, Thomas Gleixner wrote:
>> During the investigation of the various transition mode issues
>> instrumentation revealed that the amount of bitmap operations can be
>> significantly reduced when a task with a transitional CID schedules out
>> after the fixup function completed and disabled the transition mode.
>> 
>> At that point the mode is stable and therefore it is not required to drop
>> the transitional CID back into the pool. As the fixup is complete the
>> potential exhaustion of the CID pool is no longer possible, so the CID can
>> be transferred to the scheduling out task or to the CPU depending on the
>> current ownership mode. This is now possible because mm_cid::mode contains
>> both the ownership state and the transition bit so the racy snapshot is
>> valid under all circumstances because a subsequent modification of the
>> mode is serialized by the corresponding runqueue lock.
>
> AFAIU the mc->mode updates are serialized by the mm->mm_cid.lock
> and not the runqueue locks. What am I missing ?

Actually the mode updates are serialized by the mutex. They happen under
the lock as well, but the lock is not a serialization requirement for
mode changes.

What I meant to write with tired brain is:

  The racy snapshot is valid under runqueue lock even when there is a
  concurrent mode update going on because the subsequent fixup function
  is serialized with runqueue lock. That means in the following
  scenario:

  CPU0                  CPU1
  clear TRANSIT
  ....
                        lock(rq)
                        sched_out()
                          CID has TRANSIT set
                          ...
                          // observes TRANSIT=0
                          localmode = READ_ONCE(...mode);
  // sets TRANSIT
  switch mode
                          transfer CID according to localmode
  fixup()
    lock(rq)    <- Blocked until the schedule on CPU1 is complete

So both sched_out() and fixup() observe consistent state and everything
just works.

Thanks,

        tglx

Re: [patch 4/4] sched/mmcid: Optimize transitional CIDs when scheduling out

Posted by Mathieu Desnoyers 1 week ago

On 2026-01-30 11:13, Thomas Gleixner wrote:
> On Fri, Jan 30 2026 at 10:50, Mathieu Desnoyers wrote:
>> On 2026-01-29 16:20, Thomas Gleixner wrote:
>>> During the investigation of the various transition mode issues
>>> instrumentation revealed that the amount of bitmap operations can be
>>> significantly reduced when a task with a transitional CID schedules out
>>> after the fixup function completed and disabled the transition mode.
>>>
>>> At that point the mode is stable and therefore it is not required to drop
>>> the transitional CID back into the pool. As the fixup is complete the
>>> potential exhaustion of the CID pool is no longer possible, so the CID can
>>> be transferred to the scheduling out task or to the CPU depending on the
>>> current ownership mode. This is now possible because mm_cid::mode contains
>>> both the ownership state and the transition bit so the racy snapshot is
>>> valid under all circumstances because a subsequent modification of the
>>> mode is serialized by the corresponding runqueue lock.
>>
>> AFAIU the mc->mode updates are serialized by the mm->mm_cid.lock
>> and not the runqueue locks. What am I missing ?
> 
> Actually the mode updates are serialized by the mutex. They happen under
> the lock as well, but the lock is not a serialization requirement for
> mode changes.

Right, I meant the mutex but got mixed up with the raw spinlock.

> 
> What I meant to write with tired brain is:
> 
>    The racy snapshot is valid under runqueue lock even when there is a
>    concurrent mode update going on because the subsequent fixup function
>    is serialized with runqueue lock. That means in the following
>    scenario:
> 
>    CPU0                  CPU1
>    clear TRANSIT
>    ....
>    			lock(rq)
>                          sched_out()
>    		          CID has TRANSIT set
>                            ...
>                            // observes TRANSIT=0
>                            localmode = READ_ONCE(...mode);
>    // sets TRANSIT
>    switch mode
>                            transfer CID according to localmode
>    fixup()
>      lock(rq)    <- Blocked until the schedule on CPU1 is complete
> 
> So both sched_out() and fixup() observe consistent state and everything
> just works.

There is still one detail I'm concerned about here.

I would be tempted to add explicit memory barriers between:

store to mm->mm_cid.mode (set TRANSIT)
smp_mb();	/* Order store to mode before rq locks */
mm_cid_fixup_cpus_to_tasks() / mm_cid_fixup_tasks_to_cpus()
smp_mb();	/* Order rq unlocks before store to mode. */
store to mm->mm_cid.mode (clear TRANSIT)

because AFAIU the rq locks taken within the fixups are the
only serialization between the scheduler and the fixup, but
the mode stores performed by the mode transition are done
outside of the rq locks, which means those can be reordered
within the fixup rq lock critical sections. Locks are
semi-permeable barriers only, unless there is something
special about the rq lock ?

AFAIU, having the transit state cleared while performing the
fixup is a state we don't want.
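
To make the ordering concern concrete, here is a user-space C11 sketch of
the sequence above. Everything in it is an assumption for illustration:
the sk_* names, the two mode bits, and a toy acquire/release spinlock
standing in for the rq lock. atomic_thread_fence(memory_order_seq_cst) is
used as a rough stand-in for smp_mb(). It shows where the two barriers
would sit, not whether the kernel actually needs them.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical mode bits for this sketch only. */
#define SK_MODE_ONCPU	0x1u
#define SK_MODE_TRANSIT	0x2u

struct sk_rq {
	atomic_flag	lock;		/* toy rq lock: acquire/release only */
	unsigned int	seen_mode;	/* mode observed inside the critical section */
};

static _Atomic unsigned int sk_mode;

/* Acquire semantics: later accesses cannot move above the lock ... */
static void sk_rq_lock(struct sk_rq *rq)
{
	while (atomic_flag_test_and_set_explicit(&rq->lock, memory_order_acquire))
		;
}

/* ... and release semantics: earlier accesses cannot move below the unlock. */
static void sk_rq_unlock(struct sk_rq *rq)
{
	atomic_flag_clear_explicit(&rq->lock, memory_order_release);
}

/* Stand-in for the fixup walk: visit every rq under its lock. */
static void sk_fixup(struct sk_rq *rqs, int nr)
{
	for (int i = 0; i < nr; i++) {
		sk_rq_lock(&rqs[i]);
		rqs[i].seen_mode = atomic_load_explicit(&sk_mode, memory_order_relaxed);
		sk_rq_unlock(&rqs[i]);
	}
}

static void sk_switch_mode(struct sk_rq *rqs, int nr, unsigned int newmode)
{
	/* Plain store outside of any rq lock, as in the mode transition */
	atomic_store_explicit(&sk_mode, newmode | SK_MODE_TRANSIT, memory_order_relaxed);

	/* Order store to mode before rq locks */
	atomic_thread_fence(memory_order_seq_cst);

	sk_fixup(rqs, nr);

	/* Order rq unlocks before store to mode */
	atomic_thread_fence(memory_order_seq_cst);

	atomic_store_explicit(&sk_mode, newmode, memory_order_relaxed);
}
```

Without the two fences, the relaxed stores are only constrained by the
acquire on lock and the release on unlock. Neither of those prevents the
first store from sinking into the first critical section or the final
store from being hoisted into the last one, which is the window described
above. A single-threaded run of this sketch only confirms the sequencing
(each rq sees TRANSIT set during the fixup); it cannot demonstrate the
reordering itself.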

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com