sched/fair: Fix inaccurate h_nr_runnable accounting with delayed dequeue
set_delayed() adjusts cfs_rq->h_nr_runnable for the hierarchy when an
entity is delayed irrespective of whether the entity corresponds to a
task or a cfs_rq.
Consider the following scenario:
root
/ \
A B (*) delayed since B is no longer eligible on root
| |
Task0 Task1 <--- dequeue_task_fair() - task blocks
When Task1 blocks (dequeue_entity() for task's se returns true),
dequeue_entities() will continue adjusting cfs_rq->h_nr_* for the
hierarchy of Task1. However, when the sched_entity corresponding to
cfs_rq B is delayed, set_delayed() will adjust the h_nr_runnable for the
hierarchy too, leading to both dequeue_entity() and set_delayed()
decrementing h_nr_runnable for the dequeue of the same task.
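To make the double accounting concrete, the dequeue path for Task1 in
the scenario above looks roughly like this (a simplified sketch, not the
exact code; arguments and intermediate steps are elided):

  dequeue_task_fair(Task1)
    dequeue_entities(rq, &task1_se, flags)
      dequeue_entity(B, task1_se)     /* true: Task1 fully dequeued        */
      B->h_nr_runnable -= 1;          /* first loop accounts cfs_rq B      */
      dequeue_entity(root, B_se)      /* B_se is not eligible on root ...  */
        set_delayed(B_se)             /* ... so DELAY_DEQUEUE keeps it     */
          root->h_nr_runnable -= 1;   /* hierarchy walk in set_delayed()   */
      /* dequeue_entity() returned false; the remaining h_nr_* loop
       * resumes at B_se: */
      root->h_nr_runnable -= 1;       /* same task accounted a second time */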
A SCHED_WARN_ON() to inspect h_nr_runnable post its update in
dequeue_entities() like below:
cfs_rq->h_nr_runnable -= h_nr_runnable;
SCHED_WARN_ON(((int) cfs_rq->h_nr_runnable) < 0);
is consistently tripped when running wakeup intensive workloads like
hackbench in a cgroup.
This error is self-correcting since cfs_rqs are per-CPU and cannot
migrate. The entity is either picked for a full dequeue or is requeued
when a task wakes up below it. Both those paths call clear_delayed()
which again increments h_nr_runnable of the hierarchy without
considering whether the entity corresponds to a task or not.
h_nr_runnable will eventually reflect the correct value; however, in
the interim, the incorrect values can still influence the PELT
calculations that use se->runnable_weight or cfs_rq->h_nr_runnable.
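For reference, the runnable part of a cfs_rq's PELT signal is driven
directly by this count; roughly (paraphrased from the cfs_rq PELT update
on tip:sched/core at the time, not an exact quote):

  /* A transiently wrong h_nr_runnable skews cfs_rq runnable_avg: */
  ___update_load_sum(now, &cfs_rq->avg,
                     scale_load_down(cfs_rq->load.weight),
                     cfs_rq->h_nr_runnable,          /* runnable */
                     cfs_rq->curr != NULL);

Similarly, se->runnable_weight of a group entity is refreshed from its
group cfs_rq's h_nr_runnable (via se_update_runnable()), so the error
can propagate up the hierarchy's runnable averages until the count
self-corrects.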
Since only delayed tasks take the early return path in
dequeue_entities() and enqueue_task_fair(), adjust the
h_nr_runnable in {set,clear}_delayed() only when a task is delayed as
this path skips the h_nr_* update loops and returns early.
For entities corresponding to cfs_rq, the h_nr_* update loop in the
caller will do the right thing.
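The early return paths referenced above are, roughly (paraphrased; only
the relevant branches are shown):

  /* enqueue_task_fair(): a delayed task is only requeued, so the
   * h_nr_* update loops further down are never reached. */
  if (flags & ENQUEUE_DELAYED) {
          requeue_delayed_entity(se);   /* -> clear_delayed() */
          return;
  }

  /* dequeue_entities(): if the task's se turned into a delayed dequeue,
   * bail out before the h_nr_* update loops run for it. */
  if (!dequeue_entity(cfs_rq, se, flags)) {
          if (p && &p->se == se)
                  return -1;
          ...
  }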
Fixes: 76f2f783294d ("sched/eevdf: More PELT vs DELAYED_DEQUEUE")
Reviewed-by: Gautham R. Shenoy <gautham.shenoy@amd.com>
Tested-by: Swapnil Sapkal <swapnil.sapkal@amd.com>
Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
---
kernel/sched/fair.c | 19 +++++++++++++++++++
1 file changed, 19 insertions(+)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 98ac49ce78ea..0fe6c6e65e55 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5481,6 +5481,15 @@ static __always_inline void return_cfs_rq_runtime(struct cfs_rq *cfs_rq);
static void set_delayed(struct sched_entity *se)
{
se->sched_delayed = 1;
+
+ /*
+ * Delayed se of cfs_rq have no tasks queued on them.
+ * Do not adjust h_nr_runnable since dequeue_entities()
+ * will account it for blocked tasks.
+ */
+ if (!entity_is_task(se))
+ return;
+
for_each_sched_entity(se) {
struct cfs_rq *cfs_rq = cfs_rq_of(se);
@@ -5493,6 +5502,16 @@ static void set_delayed(struct sched_entity *se)
static void clear_delayed(struct sched_entity *se)
{
se->sched_delayed = 0;
+
+ /*
+ * Delayed se of cfs_rq have no tasks queued on them.
+ * Do not adjust h_nr_runnable since a dequeue has
+ * already accounted for it or an enqueue of a task
+ * below it will account for it in enqueue_task_fair().
+ */
+ if (!entity_is_task(se))
+ return;
+
for_each_sched_entity(se) {
struct cfs_rq *cfs_rq = cfs_rq_of(se);
base-commit: 7d9da040575b343085287686fa902a5b2d43c7ca
--
2.34.1
Hi Prateek,

> A SCHED_WARN_ON() to inspect h_nr_runnable post its update in
> dequeue_entities() like below:
>
> cfs_rq->h_nr_runnable -= h_nr_runnable;
> SCHED_WARN_ON(((int) cfs_rq->h_nr_runnable) < 0);
>
> is consistently tripped when running wakeup intensive workloads like
> hackbench in a cgroup.

I observed that the WARN_ON is triggered during the boot process without
the patch, and the patch resolves the issue.

However, I was unable to trigger the WARN_ON by running hackbench in a
cgroup without the patch. Could you please share the specific test
scenario or configuration you used to reproduce it?

For the boot process scenario:
Tested-by: Madadi Vineeth Reddy <vineethr@linux.ibm.com>

Thanks,
Madadi Vineeth Reddy
Hello Madadi Vineeth Reddy,

Thank you for the review and test.

On 1/20/2025 10:36 AM, Madadi Vineeth Reddy wrote:
> Hi Prateek,
>
>> A SCHED_WARN_ON() to inspect h_nr_runnable post its update in
>> dequeue_entities() like below:
>>
>> cfs_rq->h_nr_runnable -= h_nr_runnable;
>> SCHED_WARN_ON(((int) cfs_rq->h_nr_runnable) < 0);
>>
>> is consistently tripped when running wakeup intensive workloads like
>> hackbench in a cgroup.
>
> I observed that the WARN_ON is triggered during the boot process without
> the patch, and the patch resolves the issue.
>
> However, I was unable to trigger the WARN_ON by running hackbench in a
> cgroup without the patch. Could you please share the specific test
> scenario or configuration you used to reproduce it?

Can you try converting the SCHED_WARN_ON() to a WARN_ON() and try again.
I can consistently hit it to a point that it floods my console. With
autogroup enabled on Ubuntu, I believe it is trivial to hit this issue.

Following is the exact diff I'm using on top of tip:sched/core that
floods my console:

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 98ac49ce78ea..7bc2c57601b6 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7160,6 +7160,7 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
cfs_rq->h_nr_runnable -= h_nr_runnable;
cfs_rq->h_nr_queued -= h_nr_queued;
+ WARN_ON(((int) cfs_rq->h_nr_runnable) < 0);
cfs_rq->h_nr_idle -= h_nr_idle;
if (cfs_rq_is_idle(cfs_rq))
@@ -7199,6 +7200,7 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
cfs_rq->h_nr_runnable -= h_nr_runnable;
cfs_rq->h_nr_queued -= h_nr_queued;
+ WARN_ON(((int) cfs_rq->h_nr_runnable) < 0);
cfs_rq->h_nr_idle -= h_nr_idle;
if (cfs_rq_is_idle(cfs_rq))
--

I tested this on a 32 vCPU VM and I get similar floods.

> For the boot process scenario:
> Tested-by: Madadi Vineeth Reddy <vineethr@linux.ibm.com>

Thanks a ton for testing!

> Thanks,
> Madadi Vineeth Reddy

--
Thanks and Regards,
Prateek
> However, I was unable to trigger the WARN_ON by running hackbench in a
> cgroup without the patch. Could you please share the specific test
> scenario or configuration you used to reproduce it?

My mistake: SCHED_WARN_ON is a WARN_ONCE, so I had to clear it before
running hackbench to ensure the WARN_ON could be triggered again.

Now I am able to trigger the WARN_ON in the unpatched kernel while
running hackbench after executing `echo 1 > /sys/kernel/debug/clear_warn_once`
following the initial warning during the boot process. I have verified
that the patched kernel fixes the issue.

Thanks,
Madadi Vineeth Reddy
On 1/20/2025 2:14 PM, Madadi Vineeth Reddy wrote:
>> However, I was unable to trigger the WARN_ON by running hackbench in a
>> cgroup without the patch. Could you please share the specific test
>> scenario or configuration you used to reproduce it?
>
> My mistake: SCHED_WARN_ON is a WARN_ONCE, so I had to clear it before
> running hackbench to ensure the WARN_ON could be triggered again.

Yup! A WARN_ON() is hit very consistently.

> Now I am able to trigger the WARN_ON in the unpatched kernel while
> running hackbench after executing `echo 1 > /sys/kernel/debug/clear_warn_once`
> following the initial warning during the boot process. I have verified
> that the patched kernel fixes the issue.

Thank you again for testing the fix. Much appreciated!

> Thanks,
> Madadi Vineeth Reddy

--
Thanks and Regards,
Prateek
Hi Prateek,
On Fri, 17 Jan 2025 at 11:59, K Prateek Nayak <kprateek.nayak@amd.com> wrote:
>
> [..snip..]
>
> Fixes: 76f2f783294d ("sched/eevdf: More PELT vs DELAYED_DEQUEUE")
You probably mean c2a295bffeaf ("sched/fair: Add new cfs_rq.h_nr_runnable")
Before we were tracking the opposite, h_nr_delayed. Did you see the
problem only on tip/sched/core, or also before the rework which added
h_nr_runnable and removed h_nr_delayed?
I'm going to have a closer look
> [..snip..]
Hello Vincent,
On 1/17/2025 6:55 PM, Vincent Guittot wrote:
> Hi Prateek,
>
> On Fri, 17 Jan 2025 at 11:59, K Prateek Nayak <kprateek.nayak@amd.com> wrote:
>>
>> [..snip..]
>>
>> Fixes: 76f2f783294d ("sched/eevdf: More PELT vs DELAYED_DEQUEUE")
>
> You probably mean c2a295bffeaf ("sched/fair: Add new cfs_rq.h_nr_runnable")
You are right! I had done a git blame on set_delayed() and landed at
commit 76f2f783294d, but it should indeed be c2a295bffeaf
("sched/fair: Add new cfs_rq.h_nr_runnable"), when the accounting was
inverted to account runnable tasks. Thank you for pointing that out.
> Before we were tracking the opposite h_nr_delayed. Did you see the
> problem only on tip/sched/core or also before the rework which added
> h_nr_runnable and removed h_nr_delayed
The problem is on tip:sched/core. I did not encounter any anomalies on
76f2f783294d ("sched/eevdf: More PELT vs DELAYED_DEQUEUE").
"h_nr_delayed" was only adjusted in dequeue_entities() for "!sleep &&
!delayed", which would imply a migration or a save + restore type
operation, and the whole "h_nr_delayed" adjustment was contained in
{set,clear}_delayed() for delayed dequeue, finishing a delayed dequeue,
and requeue.
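For context, the hierarchy walk simply flipped direction across that
rework; roughly (paraphrased, not the exact code):

  /* 76f2f783294d (h_nr_delayed era): */
  set_delayed(se):   for_each_sched_entity(se) cfs_rq_of(se)->h_nr_delayed++;
  clear_delayed(se): for_each_sched_entity(se) cfs_rq_of(se)->h_nr_delayed--;

  /* after c2a295bffeaf (h_nr_runnable): */
  set_delayed(se):   for_each_sched_entity(se) cfs_rq_of(se)->h_nr_runnable--;
  clear_delayed(se): for_each_sched_entity(se) cfs_rq_of(se)->h_nr_runnable++;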
>
> I'm going to have a closer look
Thank you!
>
>
>> Reviewed-by: Gautham R. Shenoy <gautham.shenoy@amd.com>
>> Tested-by: Swapnil Sapkal <swapnil.sapkal@amd.com>
>> Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
>> ---
>>
>> [..snip..]
>>
--
Thanks and Regards,
Prateek
On Fri, 17 Jan 2025 at 16:59, K Prateek Nayak <kprateek.nayak@amd.com> wrote:
>
> Hello Vincent,
>
> On 1/17/2025 6:55 PM, Vincent Guittot wrote:
> > Hi Prateek,
> >
> > On Fri, 17 Jan 2025 at 11:59, K Prateek Nayak <kprateek.nayak@amd.com> wrote:
> >>
> >> [..snip..]
> >>
> >> Fixes: 76f2f783294d ("sched/eevdf: More PELT vs DELAYED_DEQUEUE")
> >
> > You probably mean c2a295bffeaf ("sched/fair: Add new cfs_rq.h_nr_runnable")
>
> You are right! I had done a git blame on set_delayed() ad landed at
> commit 76f2f783294d but you are right, it should be c2a295bffeaf
> ("sched/fair: Add new cfs_rq.h_nr_runnable") when the accounting was
> inverted to account runnable tasks. Thank you for pointing that out.
>
> > Before we were tracking the opposite h_nr_delayed. Did you see the
> > problem only on tip/sched/core or also before the rework which added
> > h_nr_runnable and removed h_nr_delayed
>
> The problem is on tip:sched/core. I did not encounter any anomalies on
> 76f2f783294d ("sched/eevdf: More PELT vs DELAYED_DEQUEUE")
>
> "h_nr_delayed" was only adjusted in dequeue_entities() for "!seep &&
> !delayed" which would imply migration or a save + restore type operation
> and the whole "h_nr_delayed" adjusting was contained in
> {set,clear}_delayed() for delayed dequeue, finish delayed dequeue, and
> requeue.
>
> >
> > I'm going to have a closer look
Your fix looks good to me. I also ran some tests after re-adding
h_nr_delayed and checking that h_nr_queued = h_nr_runnable +
h_nr_delayed after each update, and I didn't get any warning with your
patch, whereas I got one during boot without it (but none after that
during my tests).
Thanks for catching this
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
>
> Thank you!
>
> >
> >
> >> Reviewed-by: Gautham R. Shenoy <gautham.shenoy@amd.com>
> >> Tested-by: Swapnil Sapkal <swapnil.sapkal@amd.com>
> >> Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
> >> ---
> >>
> >> [..snip..]
> >>
>
> --
> Thanks and Regards,
> Prateek
>
Hello Vincent,
On 1/18/2025 2:00 PM, Vincent Guittot wrote:
> On Fri, 17 Jan 2025 at 16:59, K Prateek Nayak <kprateek.nayak@amd.com> wrote:
>>
>> Hello Vincent,
>>
>> On 1/17/2025 6:55 PM, Vincent Guittot wrote:
>>> Hi Prateek,
>>>
>>> On Fri, 17 Jan 2025 at 11:59, K Prateek Nayak <kprateek.nayak@amd.com> wrote:
>>>>
>>>> [..snip..]
>>>>
>>>> Fixes: 76f2f783294d ("sched/eevdf: More PELT vs DELAYED_DEQUEUE")
>>>
>>> You probably mean c2a295bffeaf ("sched/fair: Add new cfs_rq.h_nr_runnable")
>>
>> You are right! I had done a git blame on set_delayed() ad landed at
>> commit 76f2f783294d but you are right, it should be c2a295bffeaf
>> ("sched/fair: Add new cfs_rq.h_nr_runnable") when the accounting was
>> inverted to account runnable tasks. Thank you for pointing that out.
>>
>>> Before we were tracking the opposite h_nr_delayed. Did you see the
>>> problem only on tip/sched/core or also before the rework which added
>>> h_nr_runnable and removed h_nr_delayed
>>
>> The problem is on tip:sched/core. I did not encounter any anomalies on
>> 76f2f783294d ("sched/eevdf: More PELT vs DELAYED_DEQUEUE")
>>
>> "h_nr_delayed" was only adjusted in dequeue_entities() for "!seep &&
>> !delayed" which would imply migration or a save + restore type operation
>> and the whole "h_nr_delayed" adjusting was contained in
>> {set,clear}_delayed() for delayed dequeue, finish delayed dequeue, and
>> requeue.
So I was looking at it wrong when I was investigating on commit
76f2f783294d ("sched/eevdf: More PELT vs DELAYED_DEQUEUE"). h_nr_delayed
can never be larger than h_nr_running (h_nr_queued upstream), since the
number of delayed tasks can never exceed the number of tasks queued
below the given cfs_rq, but with the following:
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 97ee48c8bf5e..8e713f241483 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7145,6 +7145,7 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
cfs_rq->h_nr_running -= h_nr_running;
cfs_rq->idle_h_nr_running -= idle_h_nr_running;
cfs_rq->h_nr_delayed -= h_nr_delayed;
+ SCHED_WARN_ON(cfs_rq->h_nr_delayed > cfs_rq->h_nr_running);
if (cfs_rq_is_idle(cfs_rq))
idle_h_nr_running = h_nr_running;
@@ -7185,6 +7186,8 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
cfs_rq->idle_h_nr_running -= idle_h_nr_running;
cfs_rq->h_nr_delayed -= h_nr_delayed;
+ SCHED_WARN_ON(cfs_rq->h_nr_delayed > cfs_rq->h_nr_running);
+
if (cfs_rq_is_idle(cfs_rq))
idle_h_nr_running = h_nr_running;
--
I can again consistently hit the warning without the fix on 76f2f783294d
("sched/eevdf: More PELT vs DELAYED_DEQUEUE").
I think that the original "Fixes:" tag is indeed right.
>>
>>>
>>> I'm going to have a closer look
>
> Your fix looks good to me. I also run some tests after re-adding
> h_nr_delayed and checking that h_nr_queued = h_nr_runnable +
> h_nr_delayed after each update and I didn't get any warning with your
> patch whereas I got one during boot without it (but none after that
> during my tests)
Could it be the case that h_nr_delayed counts a tiny bit higher than
the actual number and h_nr_runnable counts a tiny bit lower by the
same amount and they both correct each other to give the correct
h_nr_queued?
>
> Thanks for catching this
>
> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Thank you for reviewing the patch!
>
>>
>> Thank you!
>>
>>>
>>>
>>>> Reviewed-by: Gautham R. Shenoy <gautham.shenoy@amd.com>
>>>> Tested-by: Swapnil Sapkal <swapnil.sapkal@amd.com>
>>>> Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
>>>> ---
>>>>
>>>> [..snip..]
>>>>
>>
>> --
>> Thanks and Regards,
>> Prateek
>>
--
Thanks and Regards,
Prateek
On Tue, 21 Jan 2025 at 09:09, K Prateek Nayak <kprateek.nayak@amd.com> wrote:
>
> Hello Vincent,
>
> On 1/18/2025 2:00 PM, Vincent Guittot wrote:
> > On Fri, 17 Jan 2025 at 16:59, K Prateek Nayak <kprateek.nayak@amd.com> wrote:
> >>
> >> Hello Vincent,
> >>
> >> On 1/17/2025 6:55 PM, Vincent Guittot wrote:
> >>> Hi Prateek,
> >>>
> >>> On Fri, 17 Jan 2025 at 11:59, K Prateek Nayak <kprateek.nayak@amd.com> wrote:
> >>>>
> >>>> [..snip..]
> >>>>
> >>>> Fixes: 76f2f783294d ("sched/eevdf: More PELT vs DELAYED_DEQUEUE")
> >>>
> >>> You probably mean c2a295bffeaf ("sched/fair: Add new cfs_rq.h_nr_runnable")
> >>
> >> You are right! I had done a git blame on set_delayed() ad landed at
> >> commit 76f2f783294d but you are right, it should be c2a295bffeaf
> >> ("sched/fair: Add new cfs_rq.h_nr_runnable") when the accounting was
> >> inverted to account runnable tasks. Thank you for pointing that out.
> >>
> >>> Before we were tracking the opposite h_nr_delayed. Did you see the
> >>> problem only on tip/sched/core or also before the rework which added
> >>> h_nr_runnable and removed h_nr_delayed
> >>
> >> The problem is on tip:sched/core. I did not encounter any anomalies on
> >> 76f2f783294d ("sched/eevdf: More PELT vs DELAYED_DEQUEUE")
> >>
> >> "h_nr_delayed" was only adjusted in dequeue_entities() for "!seep &&
> >> !delayed" which would imply migration or a save + restore type operation
> >> and the whole "h_nr_delayed" adjusting was contained in
> >> {set,clear}_delayed() for delayed dequeue, finish delayed dequeue, and
> >> requeue.
>
> So I was looking at it wrong when I was investigating on commit
> 76f2f783294d ("sched/eevdf: More PELT vs DELAYED_DEQUEUE") h_nr_delayed
> can never be larger than h_nr_running (h_nr_queued upstream) since the
> number of delayed tasks can never cross number of tasks queued below
> the given cfs_rq but with the following:
>
> [..snip..]
>
> I can again consistently hit the warning without the fix on 76f2f783294d
> ("sched/eevdf: More PELT vs DELAYED_DEQUEUE")
I updated my warning conditions with
h_nr_queued != h_nr_delayed + h_nr_runnable
h_nr_delayed < 0 or > h_nr_queued
h_nr_runnable < 0 or > h_nr_queued
In addition to h_nr_runnable < 0, h_nr_delayed > h_nr_queued has also
triggered a warning, so I can confirm that the Fixes: target is indeed
76f2f783294d ("sched/eevdf: More PELT vs DELAYED_DEQUEUE").
No warnings are triggered with your fix.
>
> I think that the original "Fixes:" tag is indeed right.
>
> >>
> >>>
> >>> I'm going to have a closer look
> >
> > Your fix looks good to me. I also run some tests after re-adding
> > h_nr_delayed and checking that h_nr_queued = h_nr_runnable +
> > h_nr_delayed after each update and I didn't get any warning with your
> > patch whereas I got one during boot without it (but none after that
> > during my tests)
>
> Could it be the case that h_nr_delayed counts a tiny bit higher than
> the actual number and h_nr_runnable counts a tiny bit lower by the
> same amount and they both correct each other to give the correct
> h_nr_queued?
>
> >
> > Thanks for catching this
> >
> > Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
>
> Thank you for reviewing the patch!
>
> >
> >>
> >> Thank you!
> >>
> >>>
> >>>
> >>>> Reviewed-by: Gautham R. Shenoy <gautham.shenoy@amd.com>
> >>>> Tested-by: Swapnil Sapkal <swapnil.sapkal@amd.com>
> >>>> Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
> >>>> ---
> >>>>
> >>>> [..snip..]
> >>>>
> >>
> >> --
> >> Thanks and Regards,
> >> Prateek
> >>
>
> --
> Thanks and Regards,
> Prateek
>
The following commit has been merged into the sched/urgent branch of tip:
Commit-ID: 3429dd57f0deb1a602c2624a1dd7c4c11b6c4734
Gitweb: https://git.kernel.org/tip/3429dd57f0deb1a602c2624a1dd7c4c11b6c4734
Author: K Prateek Nayak <kprateek.nayak@amd.com>
AuthorDate: Fri, 17 Jan 2025 10:58:52
Committer: Peter Zijlstra <peterz@infradead.org>
CommitterDate: Tue, 21 Jan 2025 13:13:36 +01:00
sched/fair: Fix inaccurate h_nr_runnable accounting with delayed dequeue
set_delayed() adjusts cfs_rq->h_nr_runnable for the hierarchy when an
entity is delayed irrespective of whether the entity corresponds to a
task or a cfs_rq.
Consider the following scenario:
root
/ \
A B (*) delayed since B is no longer eligible on root
| |
Task0 Task1 <--- dequeue_task_fair() - task blocks
When Task1 blocks (dequeue_entity() for task's se returns true),
dequeue_entities() will continue adjusting cfs_rq->h_nr_* for the
hierarchy of Task1. However, when the sched_entity corresponding to
cfs_rq B is delayed, set_delayed() will adjust the h_nr_runnable for the
hierarchy too, leading to both dequeue_entity() and set_delayed()
decrementing h_nr_runnable for the dequeue of the same task.
A SCHED_WARN_ON() to inspect h_nr_runnable post its update in
dequeue_entities() like below:
cfs_rq->h_nr_runnable -= h_nr_runnable;
SCHED_WARN_ON(((int) cfs_rq->h_nr_runnable) < 0);
is consistently tripped when running wakeup intensive workloads like
hackbench in a cgroup.
This error is self-correcting since cfs_rqs are per-CPU and cannot
migrate. The entity is either picked for a full dequeue or is requeued
when a task wakes up below it. Both those paths call clear_delayed()
which again increments h_nr_runnable of the hierarchy without
considering whether the entity corresponds to a task or not.
h_nr_runnable will eventually reflect the correct value; however, in
the interim, the incorrect values can still influence the PELT
calculations that use se->runnable_weight or cfs_rq->h_nr_runnable.
Since only delayed tasks take the early return path in
dequeue_entities() and enqueue_task_fair(), adjust the
h_nr_runnable in {set,clear}_delayed() only when a task is delayed as
this path skips the h_nr_* update loops and returns early.
For entities corresponding to cfs_rq, the h_nr_* update loop in the
caller will do the right thing.
Fixes: 76f2f783294d ("sched/eevdf: More PELT vs DELAYED_DEQUEUE")
Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Gautham R. Shenoy <gautham.shenoy@amd.com>
Tested-by: Swapnil Sapkal <swapnil.sapkal@amd.com>
Link: https://lkml.kernel.org/r/20250117105852.23908-1-kprateek.nayak@amd.com
---
kernel/sched/fair.c | 19 +++++++++++++++++++
1 file changed, 19 insertions(+)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 2695843..f4e4d3e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5372,6 +5372,15 @@ static __always_inline void return_cfs_rq_runtime(struct cfs_rq *cfs_rq);
static void set_delayed(struct sched_entity *se)
{
se->sched_delayed = 1;
+
+ /*
+ * Delayed se of cfs_rq have no tasks queued on them.
+ * Do not adjust h_nr_runnable since dequeue_entities()
+ * will account it for blocked tasks.
+ */
+ if (!entity_is_task(se))
+ return;
+
for_each_sched_entity(se) {
struct cfs_rq *cfs_rq = cfs_rq_of(se);
@@ -5384,6 +5393,16 @@ static void set_delayed(struct sched_entity *se)
static void clear_delayed(struct sched_entity *se)
{
se->sched_delayed = 0;
+
+ /*
+ * Delayed se of cfs_rq have no tasks queued on them.
+ * Do not adjust h_nr_runnable since a dequeue has
+ * already accounted for it or an enqueue of a task
+ * below it will account for it in enqueue_task_fair().
+ */
+ if (!entity_is_task(se))
+ return;
+
for_each_sched_entity(se) {
struct cfs_rq *cfs_rq = cfs_rq_of(se);
© 2016 - 2026 Red Hat, Inc.