[PATCH] sched/fair: Fix inaccurate h_nr_runnable accounting with delayed dequeue
Posted by K Prateek Nayak 1 year ago
set_delayed() adjusts cfs_rq->h_nr_runnable for the hierarchy when an
entity is delayed irrespective of whether the entity corresponds to a
task or a cfs_rq.

Consider the following scenario:

        root
       /    \
      A      B (*) delayed since B is no longer eligible on root
      |      |
    Task0  Task1 <--- dequeue_task_fair() - task blocks

When Task1 blocks (dequeue_entity() for task's se returns true),
dequeue_entities() will continue adjusting cfs_rq->h_nr_* for the
hierarchy of Task1. However, when the sched_entity corresponding to
cfs_rq B is delayed, set_delayed() will adjust the h_nr_runnable for the
hierarchy too, leading to both dequeue_entity() and set_delayed()
decrementing h_nr_runnable for the dequeue of the same task.
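
Spelled out as a rough trace of the scenario above (a sketch of the
flow, not verbatim kernel code):

    dequeue_task_fair(Task1)
      dequeue_entities()
        dequeue_entity(se of Task1)  /* returns true: task fully dequeued */
        dequeue_entity(se of B)      /* B is no longer eligible on root */
          set_delayed(se of B)       /* h_nr_runnable-- up the hierarchy:
                                        first decrement on root */
        ...
        /* The h_nr_* update loop then subtracts the blocked Task1 from
           every cfs_rq up to root: second decrement on root */
        cfs_rq->h_nr_runnable -= h_nr_runnable;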

A SCHED_WARN_ON() to inspect h_nr_runnable after its update in
dequeue_entities() like below:

    cfs_rq->h_nr_runnable -= h_nr_runnable;
    SCHED_WARN_ON(((int) cfs_rq->h_nr_runnable) < 0);

is consistently tripped when running wakeup intensive workloads like
hackbench in a cgroup.

This error is self-correcting since cfs_rqs are per-CPU and cannot
migrate. The entity is either picked for full dequeue or is requeued
when a task wakes up below it. Both of those paths call clear_delayed(),
which again increments h_nr_runnable of the hierarchy without
considering whether the entity corresponds to a task or not.

h_nr_runnable will eventually reflect the correct value; however, in
the interim, the incorrect values can still influence PELT calculations
which use se->runnable_weight or cfs_rq->h_nr_runnable.

Since only delayed tasks take the early-return path in
dequeue_entities() and enqueue_task_fair(), adjust h_nr_runnable in
{set,clear}_delayed() only when a task is delayed, as this path skips
the h_nr_* update loops and returns early.

For entities corresponding to cfs_rq, the h_nr_* update loop in the
caller will do the right thing.

Fixes: 76f2f783294d ("sched/eevdf: More PELT vs DELAYED_DEQUEUE")
Reviewed-by: Gautham R. Shenoy <gautham.shenoy@amd.com>
Tested-by: Swapnil Sapkal <swapnil.sapkal@amd.com>
Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
---
 kernel/sched/fair.c | 19 +++++++++++++++++++
 1 file changed, 19 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 98ac49ce78ea..0fe6c6e65e55 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5481,6 +5481,15 @@ static __always_inline void return_cfs_rq_runtime(struct cfs_rq *cfs_rq);
 static void set_delayed(struct sched_entity *se)
 {
 	se->sched_delayed = 1;
+
+	/*
+	 * Delayed se of cfs_rq have no tasks queued on them.
+	 * Do not adjust h_nr_runnable since dequeue_entities()
+	 * will account it for blocked tasks.
+	 */
+	if (!entity_is_task(se))
+		return;
+
 	for_each_sched_entity(se) {
 		struct cfs_rq *cfs_rq = cfs_rq_of(se);
 
@@ -5493,6 +5502,16 @@ static void set_delayed(struct sched_entity *se)
 static void clear_delayed(struct sched_entity *se)
 {
 	se->sched_delayed = 0;
+
+	/*
+	 * Delayed se of cfs_rq have no tasks queued on them.
+	 * Do not adjust h_nr_runnable since a dequeue has
+	 * already accounted for it or an enqueue of a task
+	 * below it will account for it in enqueue_task_fair().
+	 */
+	if (!entity_is_task(se))
+		return;
+
 	for_each_sched_entity(se) {
 		struct cfs_rq *cfs_rq = cfs_rq_of(se);
 

base-commit: 7d9da040575b343085287686fa902a5b2d43c7ca
-- 
2.34.1
Re: [PATCH] sched/fair: Fix inaccurate h_nr_runnable accounting with delayed dequeue
Posted by Madadi Vineeth Reddy 1 year ago
Hi Prateek,

> A SCHED_WARN_ON() to inspect h_nr_runnable post its update in
> dequeue_entities() like below:
>
>     cfs_rq->h_nr_runnable -= h_nr_runnable;
>     SCHED_WARN_ON(((int) cfs_rq->h_nr_runnable) < 0);
>
> is consistently tripped when running wakeup intensive workloads like
> hackbench in a cgroup.

I observed that the WARN_ON is triggered during the boot process without
the patch, and the patch resolves the issue.

However, I was unable to trigger the WARN_ON by running hackbench in a
cgroup without the patch. Could you please share the specific test
scenario or configuration you used to reproduce it?

For the boot process scenario:
Tested-by: Madadi Vineeth Reddy <vineethr@linux.ibm.com>

Thanks,
Madadi Vineeth Reddy
Re: [PATCH] sched/fair: Fix inaccurate h_nr_runnable accounting with delayed dequeue
Posted by K Prateek Nayak 1 year ago
Hello Madadi Vineeth Reddy,

Thank you for the review and test.

On 1/20/2025 10:36 AM, Madadi Vineeth Reddy wrote:
> Hi Prateek,
> 
>> A SCHED_WARN_ON() to inspect h_nr_runnable post its update in
>> dequeue_entities() like below:
>>
>>     cfs_rq->h_nr_runnable -= h_nr_runnable;
>>     SCHED_WARN_ON(((int) cfs_rq->h_nr_runnable) < 0);
>>
>> is consistently tripped when running wakeup intensive workloads like
>> hackbench in a cgroup.
> 
> I observed that the WARN_ON is triggered during the boot process without
> the patch, and the patch resolves the issue.
> 
> However, I was unable to trigger the WARN_ON by running hackbench in a
> cgroup without the patch. Could you please share the specific test
> scenario or configuration you used to reproduce it?

Can you try converting the SCHED_WARN_ON() to a WARN_ON() and trying
again? I can consistently hit it, to the point that it floods my
console. With autogroup enabled on Ubuntu, I believe it is trivial to
hit this issue.

Following is the exact diff I'm using on top of tip:sched/core that
floods my console:

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 98ac49ce78ea..7bc2c57601b6 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7160,6 +7160,7 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
  
  		cfs_rq->h_nr_runnable -= h_nr_runnable;
  		cfs_rq->h_nr_queued -= h_nr_queued;
+		WARN_ON(((int) cfs_rq->h_nr_runnable) < 0);
  		cfs_rq->h_nr_idle -= h_nr_idle;
  
  		if (cfs_rq_is_idle(cfs_rq))
@@ -7199,6 +7200,7 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
  
  		cfs_rq->h_nr_runnable -= h_nr_runnable;
  		cfs_rq->h_nr_queued -= h_nr_queued;
+		WARN_ON(((int) cfs_rq->h_nr_runnable) < 0);
  		cfs_rq->h_nr_idle -= h_nr_idle;
  
  		if (cfs_rq_is_idle(cfs_rq))
--

I tested this on a 32 vCPU VM and I get similar floods.

> 
> For the boot process scenario:
> Tested-by: Madadi Vineeth Reddy <vineethr@linux.ibm.com>

Thanks a ton for testing!

> 
> Thanks,
> Madadi Vineeth Reddy

-- 
Thanks and Regards,
Prateek
Re: [PATCH] sched/fair: Fix inaccurate h_nr_runnable accounting with delayed dequeue
Posted by Madadi Vineeth Reddy 1 year ago
> However, I was unable to trigger the WARN_ON by running hackbench in a
> cgroup without the patch. Could you please share the specific test
> scenario or configuration you used to reproduce it?

My mistake: SCHED_WARN_ON is a WARN_ONCE, so I had to clear it before
running hackbench to ensure the WARN_ON could be triggered again.

Now I am able to trigger the WARN_ON in the unpatched kernel while
running hackbench after executing `echo 1 > /sys/kernel/debug/clear_warn_once`
following the initial warning during the boot process. I have verified
that the patched kernel fixes the issue.

Thanks,
Madadi Vineeth Reddy
Re: [PATCH] sched/fair: Fix inaccurate h_nr_runnable accounting with delayed dequeue
Posted by K Prateek Nayak 1 year ago
On 1/20/2025 2:14 PM, Madadi Vineeth Reddy wrote:
>> However, I was unable to trigger the WARN_ON by running hackbench in a
>> cgroup without the patch. Could you please share the specific test
>> scenario or configuration you used to reproduce it?
> 
> My mistake: SCHED_WARN_ON is a WARN_ONCE, so I had to clear it before
> running hackbench to ensure the WARN_ON could be triggered again.

Yup! A WARN_ON() is hit very consistently.

> 
> Now I am able to trigger the WARN_ON in the unpatched kernel while
> running hackbench after executing `echo 1 > /sys/kernel/debug/clear_warn_once`
> following the initial warning during the boot process. I have verified
> that the patched kernel fixes the issue.

Thank you again for testing the fix. Much appreciated!

> 
> Thanks,
> Madadi Vineeth Reddy

-- 
Thanks and Regards,
Prateek
Re: [PATCH] sched/fair: Fix inaccurate h_nr_runnable accounting with delayed dequeue
Posted by Vincent Guittot 1 year ago
Hi Prateek,

On Fri, 17 Jan 2025 at 11:59, K Prateek Nayak <kprateek.nayak@amd.com> wrote:
>
> set_delayed() adjusts cfs_rq->h_nr_runnable for the hierarchy when an
> entity is delayed irrespective of whether the entity corresponds to a
> task or a cfs_rq.
>
> [..snip..]
>
> Fixes: 76f2f783294d ("sched/eevdf: More PELT vs DELAYED_DEQUEUE")

You probably mean c2a295bffeaf ("sched/fair: Add new cfs_rq.h_nr_runnable").
Before, we were tracking the opposite, h_nr_delayed. Did you see the
problem only on tip/sched/core, or also before the rework which added
h_nr_runnable and removed h_nr_delayed?

I'm going to have a closer look


> [..snip..]
Re: [PATCH] sched/fair: Fix inaccurate h_nr_runnable accounting with delayed dequeue
Posted by K Prateek Nayak 1 year ago
Hello Vincent,

On 1/17/2025 6:55 PM, Vincent Guittot wrote:
> Hi Prateek,
> 
> On Fri, 17 Jan 2025 at 11:59, K Prateek Nayak <kprateek.nayak@amd.com> wrote:
>>
>> [..snip..]
>>
>> Fixes: 76f2f783294d ("sched/eevdf: More PELT vs DELAYED_DEQUEUE")
> 
> You probably mean  c2a295bffeaf ("sched/fair: Add new cfs_rq.h_nr_runnable")

You are right! I had done a git blame on set_delayed() and landed at
commit 76f2f783294d, but it should be c2a295bffeaf
("sched/fair: Add new cfs_rq.h_nr_runnable"), where the accounting was
inverted to count runnable tasks. Thank you for pointing that out.

> Before we were tracking the opposite h_nr_delayed. Did you see the
> problem only on tip/sched/core or also before the rework which added
> h_nr_runnable and removed h_nr_delayed

The problem is on tip:sched/core. I did not encounter any anomalies on
76f2f783294d ("sched/eevdf: More PELT vs DELAYED_DEQUEUE").

"h_nr_delayed" was only adjusted in dequeue_entities() for "!seep &&
!delayed" which would imply migration or a save + restore type operation
and the whole "h_nr_delayed" adjusting was contained in
{set,clear}_delayed() for delayed dequeue, finish delayed dequeue, and
requeue.
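
Roughly, the pre-rework helper had the following shape (a sketch from
memory of 76f2f783294d, not an exact quote of that commit):

static void set_delayed(struct sched_entity *se)
{
	se->sched_delayed = 1;
	for_each_sched_entity(se) {
		struct cfs_rq *cfs_rq = cfs_rq_of(se);

		/* count the delayed entity at each level of the hierarchy */
		cfs_rq->h_nr_delayed++;
		if (cfs_rq_throttled(cfs_rq))
			break;
	}
}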

> 
> I'm going to have a closer look

Thank you!

> 
> 
>> Reviewed-by: Gautham R. Shenoy <gautham.shenoy@amd.com>
>> Tested-by: Swapnil Sapkal <swapnil.sapkal@amd.com>
>> Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
>> ---
>>
>> [..snip..]
>>

-- 
Thanks and Regards,
Prateek
Re: [PATCH] sched/fair: Fix inaccurate h_nr_runnable accounting with delayed dequeue
Posted by Vincent Guittot 1 year ago
On Fri, 17 Jan 2025 at 16:59, K Prateek Nayak <kprateek.nayak@amd.com> wrote:
>
> Hello Vincent,
>
> On 1/17/2025 6:55 PM, Vincent Guittot wrote:
> > Hi Prateek,
> >
> > On Fri, 17 Jan 2025 at 11:59, K Prateek Nayak <kprateek.nayak@amd.com> wrote:
> >>
> >> [..snip..]
> >>
> >> Fixes: 76f2f783294d ("sched/eevdf: More PELT vs DELAYED_DEQUEUE")
> >
> > You probably mean  c2a295bffeaf ("sched/fair: Add new cfs_rq.h_nr_runnable")
>
> > You are right! I had done a git blame on set_delayed() and landed at
> commit 76f2f783294d but you are right, it should be c2a295bffeaf
> ("sched/fair: Add new cfs_rq.h_nr_runnable") when the accounting was
> inverted to account runnable tasks. Thank you for pointing that out.
>
> > Before we were tracking the opposite h_nr_delayed. Did you see the
> > problem only on tip/sched/core or also before the rework which added
> > h_nr_runnable and removed h_nr_delayed
>
> The problem is on tip:sched/core. I did not encounter any anomalies on
> 76f2f783294d ("sched/eevdf: More PELT vs DELAYED_DEQUEUE")
>
> "h_nr_delayed" was only adjusted in dequeue_entities() for "!seep &&
> !delayed" which would imply migration or a save + restore type operation
> and the whole "h_nr_delayed" adjusting was contained in
> {set,clear}_delayed() for delayed dequeue, finish delayed dequeue, and
> requeue.
>
> >
> > I'm going to have a closer look

Your fix looks good to me. I also ran some tests after re-adding
h_nr_delayed and checking that h_nr_queued = h_nr_runnable +
h_nr_delayed after each update. I didn't get any warning with your
patch, whereas I got one during boot without it (but none after that
during my tests).

Thanks for catching this

Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>

>
> Thank you!
>
> >
> >
> >> Reviewed-by: Gautham R. Shenoy <gautham.shenoy@amd.com>
> >> Tested-by: Swapnil Sapkal <swapnil.sapkal@amd.com>
> >> Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
> >> ---
> >>
> >> [..snip..]
> >>
>
> --
> Thanks and Regards,
> Prateek
>
Re: [PATCH] sched/fair: Fix inaccurate h_nr_runnable accounting with delayed dequeue
Posted by K Prateek Nayak 1 year ago
Hello Vincent,

On 1/18/2025 2:00 PM, Vincent Guittot wrote:
> On Fri, 17 Jan 2025 at 16:59, K Prateek Nayak <kprateek.nayak@amd.com> wrote:
>>
>> Hello Vincent,
>>
>> On 1/17/2025 6:55 PM, Vincent Guittot wrote:
>>> Hi Prateek,
>>>
>>> On Fri, 17 Jan 2025 at 11:59, K Prateek Nayak <kprateek.nayak@amd.com> wrote:
>>>>
>>>> [..snip..]
>>>>
>>>> Fixes: 76f2f783294d ("sched/eevdf: More PELT vs DELAYED_DEQUEUE")
>>>
>>> You probably mean  c2a295bffeaf ("sched/fair: Add new cfs_rq.h_nr_runnable")
>>
>> You are right! I had done a git blame on set_delayed() and landed at
>> commit 76f2f783294d but you are right, it should be c2a295bffeaf
>> ("sched/fair: Add new cfs_rq.h_nr_runnable") when the accounting was
>> inverted to account runnable tasks. Thank you for pointing that out.
>>
>>> Before we were tracking the opposite h_nr_delayed. Did you see the
>>> problem only on tip/sched/core or also before the rework which added
>>> h_nr_runnable and removed h_nr_delayed
>>
>> The problem is on tip:sched/core. I did not encounter any anomalies on
>> 76f2f783294d ("sched/eevdf: More PELT vs DELAYED_DEQUEUE")
>>
>> "h_nr_delayed" was only adjusted in dequeue_entities() for "!seep &&
>> !delayed" which would imply migration or a save + restore type operation
>> and the whole "h_nr_delayed" adjusting was contained in
>> {set,clear}_delayed() for delayed dequeue, finish delayed dequeue, and
>> requeue.

So I was looking at it wrong when I was investigating commit
76f2f783294d ("sched/eevdf: More PELT vs DELAYED_DEQUEUE"): h_nr_delayed
can never be larger than h_nr_running (h_nr_queued upstream), since the
number of delayed tasks can never exceed the number of tasks queued
below the given cfs_rq. But with the following:

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 97ee48c8bf5e..8e713f241483 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7145,6 +7145,7 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
  		cfs_rq->h_nr_running -= h_nr_running;
  		cfs_rq->idle_h_nr_running -= idle_h_nr_running;
  		cfs_rq->h_nr_delayed -= h_nr_delayed;
+		SCHED_WARN_ON(cfs_rq->h_nr_delayed > cfs_rq->h_nr_running);
  
  		if (cfs_rq_is_idle(cfs_rq))
  			idle_h_nr_running = h_nr_running;
@@ -7185,6 +7186,8 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
  		cfs_rq->idle_h_nr_running -= idle_h_nr_running;
  		cfs_rq->h_nr_delayed -= h_nr_delayed;
  
+		SCHED_WARN_ON(cfs_rq->h_nr_delayed > cfs_rq->h_nr_running);
+
  		if (cfs_rq_is_idle(cfs_rq))
  			idle_h_nr_running = h_nr_running;
  
--

I can again consistently hit the warning without the fix on 76f2f783294d
("sched/eevdf: More PELT vs DELAYED_DEQUEUE")

I think that the original "Fixes:" tag is indeed right.

>>
>>>
>>> I'm going to have a closer look
> 
> Your fix looks good to me. I also ran some tests after re-adding
> h_nr_delayed and checking that h_nr_queued = h_nr_runnable +
> h_nr_delayed after each update. I didn't get any warning with your
> patch, whereas I got one during boot without it (but none after that
> during my tests).

Could it be the case that h_nr_delayed counts a tiny bit higher than
the actual number and h_nr_runnable counts a tiny bit lower by the
same amount, so that they correct each other and still give the
correct h_nr_queued?
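
For instance, with hypothetical numbers on root for the buggy case:

	/*
	 * actual state:   h_nr_queued = 2, one task runnable, one delayed
	 * correct counts: h_nr_runnable = 1, h_nr_delayed = 1
	 * buggy counts:   h_nr_runnable = 0, h_nr_delayed = 2
	 *                 (the delayed se of cfs_rq B is double counted)
	 *
	 * h_nr_runnable + h_nr_delayed = 2 = h_nr_queued still holds,
	 * so a check on the sum alone would mask both errors.
	 */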

> 
> Thanks for catching this
> 
> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>

Thank you for reviewing the patch!

> 
>>
>> Thank you!
>>
>>>
>>>
>>>> Reviewed-by: Gautham R. Shenoy <gautham.shenoy@amd.com>
>>>> Tested-by: Swapnil Sapkal <swapnil.sapkal@amd.com>
>>>> Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
>>>> ---
>>>>
>>>> [..snip..]
>>>>
>>
>> --
>> Thanks and Regards,
>> Prateek
>>

-- 
Thanks and Regards,
Prateek
Re: [PATCH] sched/fair: Fix inaccurate h_nr_runnable accounting with delayed dequeue
Posted by Vincent Guittot 1 year ago
On Tue, 21 Jan 2025 at 09:09, K Prateek Nayak <kprateek.nayak@amd.com> wrote:
>
> Hello Vincent,
>
> On 1/18/2025 2:00 PM, Vincent Guittot wrote:
> > On Fri, 17 Jan 2025 at 16:59, K Prateek Nayak <kprateek.nayak@amd.com> wrote:
> >>
> >> Hello Vincent,
> >>
> >> On 1/17/2025 6:55 PM, Vincent Guittot wrote:
> >>> Hi Prateek,
> >>>
> >>> On Fri, 17 Jan 2025 at 11:59, K Prateek Nayak <kprateek.nayak@amd.com> wrote:
> >>>>
> >>>> [..snip..]
> >>>>
> >>>> Fixes: 76f2f783294d ("sched/eevdf: More PELT vs DELAYED_DEQUEUE")
> >>>
> >>> You probably mean  c2a295bffeaf ("sched/fair: Add new cfs_rq.h_nr_runnable")
> >>
> >> You are right! I had done a git blame on set_delayed() and landed at
> >> commit 76f2f783294d but you are right, it should be c2a295bffeaf
> >> ("sched/fair: Add new cfs_rq.h_nr_runnable") when the accounting was
> >> inverted to account runnable tasks. Thank you for pointing that out.
> >>
> >>> Before we were tracking the opposite h_nr_delayed. Did you see the
> >>> problem only on tip/sched/core or also before the rework which added
> >>> h_nr_runnable and removed h_nr_delayed
> >>
> >> The problem is on tip:sched/core. I did not encounter any anomalies on
> >> 76f2f783294d ("sched/eevdf: More PELT vs DELAYED_DEQUEUE")
> >>
> >> "h_nr_delayed" was only adjusted in dequeue_entities() for "!seep &&
> >> !delayed" which would imply migration or a save + restore type operation
> >> and the whole "h_nr_delayed" adjusting was contained in
> >> {set,clear}_delayed() for delayed dequeue, finish delayed dequeue, and
> >> requeue.
>
> So I was looking at it wrong when I was investigating commit
> 76f2f783294d ("sched/eevdf: More PELT vs DELAYED_DEQUEUE"): h_nr_delayed
> can never be larger than h_nr_running (h_nr_queued upstream), since the
> number of delayed tasks can never exceed the number of tasks queued
> below the given cfs_rq. But with the following:
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 97ee48c8bf5e..8e713f241483 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -7145,6 +7145,7 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
>                 cfs_rq->h_nr_running -= h_nr_running;
>                 cfs_rq->idle_h_nr_running -= idle_h_nr_running;
>                 cfs_rq->h_nr_delayed -= h_nr_delayed;
> +               SCHED_WARN_ON(cfs_rq->h_nr_delayed > cfs_rq->h_nr_running);
>
>                 if (cfs_rq_is_idle(cfs_rq))
>                         idle_h_nr_running = h_nr_running;
> @@ -7185,6 +7186,8 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
>                 cfs_rq->idle_h_nr_running -= idle_h_nr_running;
>                 cfs_rq->h_nr_delayed -= h_nr_delayed;
>
> +               SCHED_WARN_ON(cfs_rq->h_nr_delayed > cfs_rq->h_nr_running);
> +
>                 if (cfs_rq_is_idle(cfs_rq))
>                         idle_h_nr_running = h_nr_running;
>
> --
>
> I can again consistently hit the warning without the fix on 76f2f783294d
> ("sched/eevdf: More PELT vs DELAYED_DEQUEUE")

I updated my warning conditions with (see the sketch below):
  h_nr_queued != h_nr_delayed + h_nr_runnable
  h_nr_delayed < 0 or > h_nr_queued
  h_nr_runnable < 0 or > h_nr_queued

In addition to h_nr_runnable < 0, h_nr_delayed > h_nr_queued has also
triggered a warning, so I confirm that the Fixes: tag pointing at
76f2f783294d ("sched/eevdf: More PELT vs DELAYED_DEQUEUE") is right.

No warnings are triggered with your fix.
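
Roughly, the checks above look like this as code, placed after each
h_nr_* update (a sketch, not the exact debug diff I used):

	SCHED_WARN_ON(cfs_rq->h_nr_queued !=
		      cfs_rq->h_nr_runnable + cfs_rq->h_nr_delayed);
	SCHED_WARN_ON(((int) cfs_rq->h_nr_delayed) < 0 ||
		      cfs_rq->h_nr_delayed > cfs_rq->h_nr_queued);
	SCHED_WARN_ON(((int) cfs_rq->h_nr_runnable) < 0 ||
		      cfs_rq->h_nr_runnable > cfs_rq->h_nr_queued);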

>
> I think that the original "Fixes:" tag is indeed right.
>
> >>
> >>>
> >>> I'm going to have a closer look
> >
> > Your fix looks good to me. I also ran some tests after re-adding
> > h_nr_delayed and checking that h_nr_queued = h_nr_runnable +
> > h_nr_delayed after each update. I didn't get any warning with your
> > patch, whereas I got one during boot without it (but none after that
> > during my tests).
>
> Could it be the case that h_nr_delayed counts a tiny bit higher than
> the actual number and h_nr_runnable counts a tiny bit lower by the
> same amount, so that they correct each other and still give the
> correct h_nr_queued?
>
> >
> > Thanks for catching this
> >
> > Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
>
> Thank you for reviewing the patch!
>
> >
> >>
> >> Thank you!
> >>
> >>>
> >>>
> >>>> Reviewed-by: Gautham R. Shenoy <gautham.shenoy@amd.com>
> >>>> Tested-by: Swapnil Sapkal <swapnil.sapkal@amd.com>
> >>>> Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
> >>>> ---
> >>>>
> >>>> [..snip..]
> >>>>
> >>
> >> --
> >> Thanks and Regards,
> >> Prateek
> >>
>
> --
> Thanks and Regards,
> Prateek
>
[tip: sched/urgent] sched/fair: Fix inaccurate h_nr_runnable accounting with delayed dequeue
Posted by tip-bot2 for K Prateek Nayak 1 year ago
The following commit has been merged into the sched/urgent branch of tip:

Commit-ID:     3429dd57f0deb1a602c2624a1dd7c4c11b6c4734
Gitweb:        https://git.kernel.org/tip/3429dd57f0deb1a602c2624a1dd7c4c11b6c4734
Author:        K Prateek Nayak <kprateek.nayak@amd.com>
AuthorDate:    Fri, 17 Jan 2025 10:58:52 
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Tue, 21 Jan 2025 13:13:36 +01:00

sched/fair: Fix inaccurate h_nr_runnable accounting with delayed dequeue

set_delayed() adjusts cfs_rq->h_nr_runnable for the hierarchy when an
entity is delayed irrespective of whether the entity corresponds to a
task or a cfs_rq.

Consider the following scenario:

        root
       /    \
      A      B (*) delayed since B is no longer eligible on root
      |      |
    Task0  Task1 <--- dequeue_task_fair() - task blocks

When Task1 blocks (dequeue_entity() for task's se returns true),
dequeue_entities() will continue adjusting cfs_rq->h_nr_* for the
hierarchy of Task1. However, when the sched_entity corresponding to
cfs_rq B is delayed, set_delayed() will adjust the h_nr_runnable for the
hierarchy too, leading to both dequeue_entity() and set_delayed()
decrementing h_nr_runnable for the dequeue of the same task.

A SCHED_WARN_ON() to inspect h_nr_runnable after its update in
dequeue_entities() like below:

    cfs_rq->h_nr_runnable -= h_nr_runnable;
    SCHED_WARN_ON(((int) cfs_rq->h_nr_runnable) < 0);

is consistently tripped when running wakeup intensive workloads like
hackbench in a cgroup.

This error is self-correcting since cfs_rqs are per-CPU and cannot
migrate. The entity is either picked for full dequeue or is requeued
when a task wakes up below it. Both of those paths call clear_delayed(),
which again increments h_nr_runnable of the hierarchy without
considering whether the entity corresponds to a task or not.

h_nr_runnable will eventually reflect the correct value; however, in
the interim, the incorrect values can still influence PELT calculations
which use se->runnable_weight or cfs_rq->h_nr_runnable.

Since only delayed tasks take the early-return path in
dequeue_entities() and enqueue_task_fair(), adjust h_nr_runnable in
{set,clear}_delayed() only when a task is delayed, as this path skips
the h_nr_* update loops and returns early.

For entities corresponding to cfs_rq, the h_nr_* update loop in the
caller will do the right thing.

Fixes: 76f2f783294d ("sched/eevdf: More PELT vs DELAYED_DEQUEUE")
Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Gautham R. Shenoy <gautham.shenoy@amd.com>
Tested-by: Swapnil Sapkal <swapnil.sapkal@amd.com>
Link: https://lkml.kernel.org/r/20250117105852.23908-1-kprateek.nayak@amd.com
---
 kernel/sched/fair.c | 19 +++++++++++++++++++
 1 file changed, 19 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 2695843..f4e4d3e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5372,6 +5372,15 @@ static __always_inline void return_cfs_rq_runtime(struct cfs_rq *cfs_rq);
 static void set_delayed(struct sched_entity *se)
 {
 	se->sched_delayed = 1;
+
+	/*
+	 * Delayed se of cfs_rq have no tasks queued on them.
+	 * Do not adjust h_nr_runnable since dequeue_entities()
+	 * will account it for blocked tasks.
+	 */
+	if (!entity_is_task(se))
+		return;
+
 	for_each_sched_entity(se) {
 		struct cfs_rq *cfs_rq = cfs_rq_of(se);
 
@@ -5384,6 +5393,16 @@ static void set_delayed(struct sched_entity *se)
 static void clear_delayed(struct sched_entity *se)
 {
 	se->sched_delayed = 0;
+
+	/*
+	 * Delayed se of cfs_rq have no tasks queued on them.
+	 * Do not adjust h_nr_runnable since a dequeue has
+	 * already accounted for it or an enqueue of a task
+	 * below it will account for it in enqueue_task_fair().
+	 */
+	if (!entity_is_task(se))
+		return;
+
 	for_each_sched_entity(se) {
 		struct cfs_rq *cfs_rq = cfs_rq_of(se);