Make sure the given cfs_rq's h_load is always correctly updated. This
prevents a race where multiple CPUs concurrently update the same
hierarchy of h_load_next return pointers.
Signed-off-by: Daniel Vacek <neelx@suse.com>
---
kernel/sched/fair.c | 25 ++++++++++++++++++++-----
1 file changed, 20 insertions(+), 5 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 2d16c8545c71..50794ba0db75 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9786,6 +9786,8 @@ static bool __update_blocked_fair(struct rq *rq, bool *done)
return decayed;
}
+static DEFINE_PER_CPU(raw_spinlock_t, h_load_lock);
+
/*
* Compute the hierarchical load factor for cfs_rq and all its ascendants.
* This needs to be done in a top-down fashion because the load of a child
@@ -9793,18 +9795,26 @@ static bool __update_blocked_fair(struct rq *rq, bool *done)
*/
static void update_cfs_rq_h_load(struct cfs_rq *cfs_rq)
{
- struct rq *rq = rq_of(cfs_rq);
- struct sched_entity *se = cfs_rq->tg->se[cpu_of(rq)];
+ int cpu = cpu_of(rq_of(cfs_rq));
+ struct sched_entity *se = cfs_rq->tg->se[cpu];
+ raw_spinlock_t *lock;
unsigned long now = jiffies;
unsigned long load;
if (cfs_rq->last_h_load_update == now)
return;
- WRITE_ONCE(cfs_rq->h_load_next, NULL);
+ /* Protects cfs_rq->h_load_next and cfs_rq->last_h_load_update */
+ raw_spin_lock(lock = &per_cpu(h_load_lock, cpu));
+
+ now = jiffies;
+ if (cfs_rq->last_h_load_update == now)
+ goto unlock;
+
+ cfs_rq->h_load_next = NULL;
for_each_sched_entity(se) {
cfs_rq = cfs_rq_of(se);
- WRITE_ONCE(cfs_rq->h_load_next, se);
+ cfs_rq->h_load_next = se;
if (cfs_rq->last_h_load_update == now)
break;
}
@@ -9814,7 +9824,7 @@ static void update_cfs_rq_h_load(struct cfs_rq *cfs_rq)
cfs_rq->last_h_load_update = now;
}
- while ((se = READ_ONCE(cfs_rq->h_load_next)) != NULL) {
+ while ((se = cfs_rq->h_load_next) != NULL) {
load = cfs_rq->h_load;
load = div64_ul(load * se->avg.load_avg,
cfs_rq_load_avg(cfs_rq) + 1);
@@ -9822,6 +9832,8 @@ static void update_cfs_rq_h_load(struct cfs_rq *cfs_rq)
cfs_rq->h_load = load;
cfs_rq->last_h_load_update = now;
}
+unlock:
+ raw_spin_unlock(lock);
}
static unsigned long task_h_load(struct task_struct *p)
@@ -13665,6 +13677,9 @@ __init void init_sched_fair_class(void)
zalloc_cpumask_var_node(&per_cpu(should_we_balance_tmpmask, i),
GFP_KERNEL, cpu_to_node(i));
+#ifdef CONFIG_FAIR_GROUP_SCHED
+ raw_spin_lock_init(&per_cpu(h_load_lock, i));
+#endif
#ifdef CONFIG_CFS_BANDWIDTH
INIT_CSD(&cpu_rq(i)->cfsb_csd, __cfsb_csd_unthrottle, cpu_rq(i));
INIT_LIST_HEAD(&cpu_rq(i)->cfsb_csd_list);
--
2.45.2
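
For readability, here is roughly how update_cfs_rq_h_load() would read with
the patch applied. This is reconstructed from the hunks above together with
the unchanged surrounding fair.c context (the `if (!se)` fallback and the
group_cfs_rq() step are the existing mainline lines that the hunks only show
in part), and the comments are added for illustration; treat it as a sketch,
not the authoritative patch:

static DEFINE_PER_CPU(raw_spinlock_t, h_load_lock);

static void update_cfs_rq_h_load(struct cfs_rq *cfs_rq)
{
	int cpu = cpu_of(rq_of(cfs_rq));
	struct sched_entity *se = cfs_rq->tg->se[cpu];
	raw_spinlock_t *lock;
	unsigned long now = jiffies;
	unsigned long load;

	/* Fast path: this cfs_rq was already refreshed during this jiffy. */
	if (cfs_rq->last_h_load_update == now)
		return;

	/* Protects cfs_rq->h_load_next and cfs_rq->last_h_load_update */
	raw_spin_lock(lock = &per_cpu(h_load_lock, cpu));

	/* Re-check under the lock; another CPU may have just done the walk. */
	now = jiffies;
	if (cfs_rq->last_h_load_update == now)
		goto unlock;

	/* Walk up, recording the path so it can be replayed top-down. */
	cfs_rq->h_load_next = NULL;
	for_each_sched_entity(se) {
		cfs_rq = cfs_rq_of(se);
		cfs_rq->h_load_next = se;
		if (cfs_rq->last_h_load_update == now)
			break;
	}

	/* Reached the root: seed h_load with the root cfs_rq's load. */
	if (!se) {
		cfs_rq->h_load = cfs_rq_load_avg(cfs_rq);
		cfs_rq->last_h_load_update = now;
	}

	/* Replay the recorded path top-down, scaling the load per level. */
	while ((se = cfs_rq->h_load_next) != NULL) {
		load = cfs_rq->h_load;
		load = div64_ul(load * se->avg.load_avg,
				cfs_rq_load_avg(cfs_rq) + 1);
		cfs_rq = group_cfs_rq(se);
		cfs_rq->h_load = load;
		cfs_rq->last_h_load_update = now;
	}
unlock:
	raw_spin_unlock(lock);
}

The re-check of last_h_load_update after taking the lock is what lets later
callers in the same jiffy bail out without redoing the walk.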
On Fri, Nov 22, 2024 at 04:28:55PM +0100, Daniel Vacek wrote:
> Make sure the given cfs_rq's h_load is always correctly updated. This
> prevents a race where multiple CPUs concurrently update the same
> hierarchy of h_load_next return pointers.

Is there an actual problem observed?
On Fri, Nov 22, 2024 at 4:42 PM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Fri, Nov 22, 2024 at 04:28:55PM +0100, Daniel Vacek wrote:
> > Make sure the given cfs_rq's h_load is always correctly updated. This
> > prevents a race where multiple CPUs concurrently update the same
> > hierarchy of h_load_next return pointers.
>
> Is there an actual problem observed?

Well, that depends. Do we care about correct (exact) load calculation
every time? If it's not a big deal we may just drop this patch. I am not
sure what (if any real) problems this can cause. I did not observe any
I'm aware of. Actually I should have labeled this [RFC], but I forgot :-/

This is being called from `try_to_wake_up` => `select_task_rq_fair`.
If two (or more) CPUs race to wake up => `task_h_load()` *different*
tasks on the same rq (I mean the same target CPU), they may get a wrong
result if the tasks are in different cgroups. Wrong in the sense that
`cfs_rq->h_load` may not be up to date, and the old, former value is
used for all but one of the racing cgroups (cfs_rqs).

I could detect the race collisions almost every minute on my lightly
loaded laptop (using bpftrace, which admittedly opened the window a bit,
but for sure it can happen).

Though I am not sure if it's a big deal? The `cfs_rq->h_load` will get
updated the next time the race does not happen again, so very likely
right the next time. And we may be fine with eventually using the old
value from time to time. The question is whether we are fine with that
or not. I guess we are and this patch can be dropped, right?

It almost looks like the function is specifically designed this way,
because we really do not care about unlikely failures: the worst that
can happen is that a slightly older value is kept in `h_load`. It may
not even be that different from the correct value, and it will (most)
definitely get fixed/updated the next time. If that is really the
intention of the current design, let's just drop this patch.

I understand that this is adding another lock into the scheduler, which
always has to be well considered. But on the other hand the race is
limited to once per jiffy for a given CPU; otherwise the condition bails
out early. By the nature of this race the contention should be unlikely
most of the time.

In that respect I was considering just using the rq lock, but using a
dedicated one actually looked simpler to me after all. Also the scope of
the lock is clear this way: it serves only this one purpose. Which we
may not need or care about after all.

Hence I'm wondering what your opinion is on this. Would we benefit from
a guaranteed correct calculation every time in exchange for a little
overhead?

Perhaps you can suggest a stress test, benchmark or any workload which
heavily exercises task wake-ups, so that I can try to quantify the added
overhead?

--nX
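
To make the race window concrete, below is the current (pre-patch) mainline
walk with comments marking where a second walker on another CPU can
interfere. The code matches the removed/context lines of the diff above; the
comments are annotations added here for illustration and are not in mainline:

static void update_cfs_rq_h_load(struct cfs_rq *cfs_rq)
{
	struct rq *rq = rq_of(cfs_rq);
	struct sched_entity *se = cfs_rq->tg->se[cpu_of(rq)];
	unsigned long now = jiffies;
	unsigned long load;

	if (cfs_rq->last_h_load_update == now)
		return;

	WRITE_ONCE(cfs_rq->h_load_next, NULL);
	for_each_sched_entity(se) {
		cfs_rq = cfs_rq_of(se);
		/*
		 * CPU A records its path bottom-up here. A second CPU doing
		 * the same walk for a different cgroup of the same target
		 * CPU overwrites h_load_next on every ancestor cfs_rq the
		 * two paths share, redirecting it towards its own leaf.
		 */
		WRITE_ONCE(cfs_rq->h_load_next, se);
		if (cfs_rq->last_h_load_update == now)
			break;
	}

	if (!se) {
		cfs_rq->h_load = cfs_rq_load_avg(cfs_rq);
		cfs_rq->last_h_load_update = now;
	}

	while ((se = READ_ONCE(cfs_rq->h_load_next)) != NULL) {
		/*
		 * The top-down replay follows whatever h_load_next points to
		 * now; if an ancestor was redirected above, the replay walks
		 * down the other CPU's branch and never reaches the leaf
		 * cfs_rq this call started from, so that leaf's h_load stays
		 * stale for the task_h_load() caller.
		 */
		load = cfs_rq->h_load;
		load = div64_ul(load * se->avg.load_avg,
			cfs_rq_load_avg(cfs_rq) + 1);
		cfs_rq = group_cfs_rq(se);
		cfs_rq->h_load = load;
		cfs_rq->last_h_load_update = now;
	}
}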
On Fri, Nov 22, 2024 at 06:33:31PM +0100, Daniel Vacek wrote:
> On Fri, Nov 22, 2024 at 4:42 PM Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > On Fri, Nov 22, 2024 at 04:28:55PM +0100, Daniel Vacek wrote:
> > > Make sure the given cfs_rq's h_load is always correctly updated. This
> > > prevents a race where multiple CPUs concurrently update the same
> > > hierarchy of h_load_next return pointers.
> >
> > Is there an actual problem observed?
>
> Well, that depends. Do we care about correct (exact) load calculation
> every time?

The whole load balancer is full of races. And typically it all more or
less works out.

I mean, the worst case is typically a spurious migration, which will get
'fixed' up the next round.

Only if behaviour gets to be really bad/stupid do we tend to try and fix
this.

Now your patch didn't look awful :-), but it would make a stronger case
if you'd done it because you observed it doing stupid and it now no
longer does stupid and your workload improves.
On Mon, Nov 25, 2024 at 11:01 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Fri, Nov 22, 2024 at 06:33:31PM +0100, Daniel Vacek wrote:
> > On Fri, Nov 22, 2024 at 4:42 PM Peter Zijlstra <peterz@infradead.org> wrote:
> > >
> > > On Fri, Nov 22, 2024 at 04:28:55PM +0100, Daniel Vacek wrote:
> > > > Make sure the given cfs_rq's h_load is always correctly updated. This
> > > > prevents a race where multiple CPUs concurrently update the same
> > > > hierarchy of h_load_next return pointers.
> > >
> > > Is there an actual problem observed?
> >
> > Well, that depends. Do we care about correct (exact) load calculation
> > every time?
>
> The whole load balancer is full of races. And typically it all more or
> less works out.
>
> I mean, the worst case is typically a spurious migration, which will get
> 'fixed' up the next round.
>
> Only if behaviour gets to be really bad/stupid do we tend to try and fix
> this.
>
> Now your patch didn't look awful :-), but it would make a stronger case
> if you'd done it because you observed it doing stupid and it now no
> longer does stupid and your workload improves.

Well, the original motivation was crashes reported on s390 before commit
0e9f02450da0. That commit addresses the crashes but not the incorrect
load calculation. This patch addresses both issues. As a result it makes
the scheduler more correct and deterministic, and less racy. The question
is whether we strictly need that, or whether we are happy with the
current behaviour for the price. And the price looks quite fair to me, as
the lock could be acquired only once per jiffy.

Note that the load calculation fails more often than the observed
crashes. A crash is just a special case of the failure, depending on the
actual cgroup hierarchy and on where within that hierarchy the tasks
racing to be woken sit.
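
For context, the consumer of this value is task_h_load() (as in mainline
fair.c under CONFIG_FAIR_GROUP_SCHED); when the top-down walk is clobbered by
a concurrent walker, the cfs_rq->h_load read here is the stale value that
feeds the wakeup-time load estimate:

static unsigned long task_h_load(struct task_struct *p)
{
	struct cfs_rq *cfs_rq = task_cfs_rq(p);

	/* May leave cfs_rq->h_load stale if another CPU races the walk. */
	update_cfs_rq_h_load(cfs_rq);
	return div64_ul(p->se.avg.load_avg * cfs_rq->h_load,
			cfs_rq_load_avg(cfs_rq) + 1);
}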