[PATCH] sched/fair: properly serialize the cfs_rq h_load calculation
Posted by Daniel Vacek 1 year, 2 months ago
Make sure the given cfs_rq's h_load is always correctly updated. This
prevents a race where multiple CPUs concurrently update the
h_load_next pointers of the same hierarchy.

Signed-off-by: Daniel Vacek <neelx@suse.com>
---
 kernel/sched/fair.c | 25 ++++++++++++++++++++-----
 1 file changed, 20 insertions(+), 5 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 2d16c8545c71..50794ba0db75 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9786,6 +9786,8 @@ static bool __update_blocked_fair(struct rq *rq, bool *done)
 	return decayed;
 }
 
+static DEFINE_PER_CPU(raw_spinlock_t, h_load_lock);
+
 /*
  * Compute the hierarchical load factor for cfs_rq and all its ascendants.
  * This needs to be done in a top-down fashion because the load of a child
@@ -9793,18 +9795,26 @@ static bool __update_blocked_fair(struct rq *rq, bool *done)
  */
 static void update_cfs_rq_h_load(struct cfs_rq *cfs_rq)
 {
-	struct rq *rq = rq_of(cfs_rq);
-	struct sched_entity *se = cfs_rq->tg->se[cpu_of(rq)];
+	int cpu = cpu_of(rq_of(cfs_rq));
+	struct sched_entity *se = cfs_rq->tg->se[cpu];
+	raw_spinlock_t *lock;
 	unsigned long now = jiffies;
 	unsigned long load;
 
 	if (cfs_rq->last_h_load_update == now)
 		return;
 
-	WRITE_ONCE(cfs_rq->h_load_next, NULL);
+	/* Protects cfs_rq->h_load_next and cfs_rq->last_h_load_update */
+	raw_spin_lock(lock = &per_cpu(h_load_lock, cpu));
+
+	now = jiffies;
+	if (cfs_rq->last_h_load_update == now)
+		goto unlock;
+
+	cfs_rq->h_load_next = NULL;
 	for_each_sched_entity(se) {
 		cfs_rq = cfs_rq_of(se);
-		WRITE_ONCE(cfs_rq->h_load_next, se);
+		cfs_rq->h_load_next = se;
 		if (cfs_rq->last_h_load_update == now)
 			break;
 	}
@@ -9814,7 +9824,7 @@ static void update_cfs_rq_h_load(struct cfs_rq *cfs_rq)
 		cfs_rq->last_h_load_update = now;
 	}
 
-	while ((se = READ_ONCE(cfs_rq->h_load_next)) != NULL) {
+	while ((se = cfs_rq->h_load_next) != NULL) {
 		load = cfs_rq->h_load;
 		load = div64_ul(load * se->avg.load_avg,
 			cfs_rq_load_avg(cfs_rq) + 1);
@@ -9822,6 +9832,8 @@ static void update_cfs_rq_h_load(struct cfs_rq *cfs_rq)
 		cfs_rq->h_load = load;
 		cfs_rq->last_h_load_update = now;
 	}
+unlock:
+	raw_spin_unlock(lock);
 }
 
 static unsigned long task_h_load(struct task_struct *p)
@@ -13665,6 +13677,9 @@ __init void init_sched_fair_class(void)
 		zalloc_cpumask_var_node(&per_cpu(should_we_balance_tmpmask, i),
 					GFP_KERNEL, cpu_to_node(i));
 
+#ifdef CONFIG_FAIR_GROUP_SCHED
+		raw_spin_lock_init(&per_cpu(h_load_lock, i));
+#endif
 #ifdef CONFIG_CFS_BANDWIDTH
 		INIT_CSD(&cpu_rq(i)->cfsb_csd, __cfsb_csd_unthrottle, cpu_rq(i));
 		INIT_LIST_HEAD(&cpu_rq(i)->cfsb_csd_list);
-- 
2.45.2
Re: [PATCH] sched/fair: properly serialize the cfs_rq h_load calculation
Posted by Peter Zijlstra 1 year, 2 months ago
On Fri, Nov 22, 2024 at 04:28:55PM +0100, Daniel Vacek wrote:
> Make sure the given cfs_rq's h_load is always correctly updated. This
> prevents a race where multiple CPUs concurrently update the
> h_load_next pointers of the same hierarchy.

Is there an actual problem observed?
Re: [PATCH] sched/fair: properly serialize the cfs_rq h_load calculation
Posted by Daniel Vacek 1 year, 2 months ago
On Fri, Nov 22, 2024 at 4:42 PM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Fri, Nov 22, 2024 at 04:28:55PM +0100, Daniel Vacek wrote:
> > Make sure the given cfs_rq's h_load is always correctly updated. This
> > prevents a race where multiple CPUs concurrently update the
> > h_load_next pointers of the same hierarchy.
>
> Is there an actual problem observed?

Well, that depends. Do we care about correct (exact) load calculation
every time?
If it's not a big deal, we may just drop this patch.
I am not sure what problems (if any real ones) this can cause. I did
not observe any that I'm aware of. Actually I should have labeled this
[RFC], but I forgot :-/

This is being called from `try_to_wake_up` => `select_task_rq_fair`.
If two (or more) CPUs race to wake up (=> `task_h_load()`) *different*
tasks on the same rq (I mean the same target CPU), they may get a
wrong result if the tasks are in different cgroups. Wrong in the sense
that `cfs_rq->h_load` may not be up to date, and the old, former value
is used for all but one of the racing cgroups (cfs_rqs).
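
Something like this interleaving, in terms of the current code (a
simplified sketch; A and B stand for the cfs_rqs of two sibling
cgroups under the target CPU's root cfs_rq, and the names are just for
illustration):

    CPU0: task_h_load(task in A)      CPU1: task_h_load(task in B)
    ----------------------------      ----------------------------
    A->h_load_next = NULL;
    root->h_load_next = A_se;
                                      B->h_load_next = NULL;
                                      root->h_load_next = B_se;
    /* CPU0's down-walk from root now
       follows B_se, so B's h_load is
       recomputed while A keeps its
       stale h_load */

So the loser of the race ends up computing `task_h_load()` from the
old `h_load` of its cfs_rq.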

I could detect the race collisions almost every minute on my lightly
loaded laptop (using bpftrace, which admittedly opened the window a
bit, but it can certainly happen without it). Though I am not sure if
it's a big deal.
The `cfs_rq->h_load` will get updated the next time the race does not
happen again, so very likely right the next time.
And we may be pretty fine eventually using the old value from time to
time. The question is whether we are fine with that or not. I guess we
are, and this patch can be dropped, right?

It almost looks like the function is specifically designed this way,
as if we really do not care about unlikely failures because the worst
that can happen is that a slightly older value is kept in `h_load`. It
may not even be that different from the correct value, I guess, and it
will (most) definitely get fixed/updated the next time.

If that is really the intention of the current design, let's just drop
this patch.

I understand that this is adding another lock into the scheduler,
which always has to be well considered. But on the other hand, the
lock is taken at most once per jiffy for a given cfs_rq; otherwise the
unlocked check bails out early. By the nature of this race the
contention should be unlikely most of the time. With that in mind I
was considering just using the rq lock, but a dedicated one actually
looked simpler to me after all. Also the scope of the lock is clear
this way: it serves only this one purpose. Which we may not need or
care about after all.
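
For comparison, the rq lock variant I was weighing would have looked
roughly like this (only a sketch, not tested; note that e.g.
`detach_tasks()` already calls `task_h_load()` with the rq lock held,
which is another reason the dedicated lock looked simpler to me):

	struct rq *rq = rq_of(cfs_rq);
	struct rq_flags rf;

	rq_lock_irqsave(rq, &rf);
	/* recheck cfs_rq->last_h_load_update, then rebuild the
	 * h_load_next chain and walk it back down, as in the
	 * patch above */
	rq_unlock_irqrestore(rq, &rf);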

Hence I'm wondering what your opinion is with regard to this.
Would we benefit from a guaranteed correct calculation every time, in
exchange for a little overhead?
Perhaps you can suggest a stress test, benchmark, or any other
workload which heavily exercises task wakeups, so that I can try to
quantify the added overhead?
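
(I was thinking along the lines of `hackbench` or `perf bench sched
messaging` / `perf bench sched pipe`, but perhaps there is something
more representative of this wakeup path?)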

--nX
Re: [PATCH] sched/fair: properly serialize the cfs_rq h_load calculation
Posted by Peter Zijlstra 1 year, 2 months ago
On Fri, Nov 22, 2024 at 06:33:31PM +0100, Daniel Vacek wrote:
> On Fri, Nov 22, 2024 at 4:42 PM Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > On Fri, Nov 22, 2024 at 04:28:55PM +0100, Daniel Vacek wrote:
> > > Make sure the given cfs_rq's h_load is always correctly updated. This
> > > prevents a race where multiple CPUs concurrently update the
> > > h_load_next pointers of the same hierarchy.
> >
> > Is there an actual problem observed?
> 
> Well, that depends. Do we care about correct (exact) load calculation
> every time?

The whole load balancer is full of races. And typically it all more or
less works out.

I mean, the worst case is typically a spurious migration, which will get
'fixed' up the next round.

Only if behaviour gets to be really bad/stupid do we tend to try and fix
this.

Now your patch didn't look awful :-), but it would make a stronger case
if you'd done it because you observed it doing stupid and it now no
longer does stupid and your workload improves.


Re: [PATCH] sched/fair: properly serialize the cfs_rq h_load calculation
Posted by Daniel Vacek 1 year, 2 months ago
On Mon, Nov 25, 2024 at 11:01 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Fri, Nov 22, 2024 at 06:33:31PM +0100, Daniel Vacek wrote:
> > On Fri, Nov 22, 2024 at 4:42 PM Peter Zijlstra <peterz@infradead.org> wrote:
> > >
> > > On Fri, Nov 22, 2024 at 04:28:55PM +0100, Daniel Vacek wrote:
> > > > Make sure the given cfs_rq's h_load is always correctly updated. This
> > > > prevents a race where multiple CPUs concurrently update the
> > > > h_load_next pointers of the same hierarchy.
> > >
> > > Is there an actual problem observed?
> >
> > Well, that depends. Do we care about correct (exact) load calculation
> > every time?
>
> The whole load balancer is full of races. And typically it all more or
> less works out.
>
> I mean, the worst case is typically a spurious migration, which will get
> 'fixed' up the next round.
>
> Only if behaviour gets to be really bad/stupid do we tend to try and fix
> this.
>
> Now your patch didn't look awful :-), but it would make a stronger case
> if you'd done it because you observed it doing stupid and it now no
> longer does stupid and your workload improves.

Well, the original motivation was the crashes reported on s390 before
commit 0e9f02450da0 ("sched/fair: Do not re-read ->h_load_next during
hierarchical load calculation").
That commit addresses the crashes but not the failed load calculation.
This patch addresses both issues. As a result it makes the scheduler
more correct, more deterministic, and less racy. The question is
whether we strictly need that, and whether we are happy with the
price. And the price looks quite fair to me, as the lock is taken at
most once per jiffy for a given cfs_rq.

Note that the load calculation fails more often than the observed
crashes. A crash is just a special case of the failure, depending on
the actual cgroup hierarchy and on where within that hierarchy the
racing tasks being woken are located.