[PATCH] sched/deadline: Skip overflow check if 0 capacity

Waiman Long posted 1 patch 2 weeks, 1 day ago
kernel/sched/deadline.c | 8 +++++++-
1 file changed, 7 insertions(+), 1 deletion(-)
[PATCH] sched/deadline: Skip overflow check if 0 capacity
Posted by Waiman Long 2 weeks, 1 day ago
By properly setting up a 1-cpu sched domain (partition) with no
task, it was found that offlining that particular CPU failed because
dl_bw_check_overflow() in cpuset_cpu_inactive() returned -EBUSY. This
is due to the fact that dl_bw_capacity() return 0 as the sched domain
has no active CPU causing a false positive in the overflow check.

Fix this corner case by skipping the __dl_overflow() check in
dl_bw_manage() when the returned capacity is 0.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 kernel/sched/deadline.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index be1b917dc8ce..0195f350d6d3 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -3479,7 +3479,13 @@ static int dl_bw_manage(enum dl_bw_request req, int cpu, u64 dl_bw)
 	} else {
 		unsigned long cap = dl_bw_capacity(cpu);
 
-		overflow = __dl_overflow(dl_b, cap, 0, dl_bw);
+		/*
+		 * In the unlikely case of 0 capacity (e.g. a sched domain
+		 * with no active CPUs), skip the overflow check as it will
+		 * always return a false positive.
+		 */
+		if (likely(cap))
+			overflow = __dl_overflow(dl_b, cap, 0, dl_bw);
 
 		if (req == dl_bw_req_alloc && !overflow) {
 			/*
-- 
2.47.0
Re: [PATCH] sched/deadline: Skip overflow check if 0 capacity
Posted by Juri Lelli 2 weeks, 1 day ago
On 07/11/24 23:29, Waiman Long wrote:
> By properly setting up a 1-cpu sched domain (partition) with no
> task, it was found that offlining that particular CPU failed because
> dl_bw_check_overflow() in cpuset_cpu_inactive() returned -EBUSY. This
> is due to the fact that dl_bw_capacity() return 0 as the sched domain
> has no active CPU causing a false positive in the overflow check.
> 
> Fix this corner case by skipping the __dl_overflow() check in
> dl_bw_manage() when the returned capacity is 0.
> 
> Signed-off-by: Waiman Long <longman@redhat.com>
> ---
>  kernel/sched/deadline.c | 8 +++++++-
>  1 file changed, 7 insertions(+), 1 deletion(-)
> 
> diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
> index be1b917dc8ce..0195f350d6d3 100644
> --- a/kernel/sched/deadline.c
> +++ b/kernel/sched/deadline.c
> @@ -3479,7 +3479,13 @@ static int dl_bw_manage(enum dl_bw_request req, int cpu, u64 dl_bw)
>  	} else {
>  		unsigned long cap = dl_bw_capacity(cpu);
>  
> -		overflow = __dl_overflow(dl_b, cap, 0, dl_bw);
> +		/*
> +		 * In the unlikely case of 0 capacity (e.g. a sched domain
> +		 * with no active CPUs), skip the overflow check as it will
> +		 * always return a false positive.
> +		 */
> +		if (likely(cap))
> +			overflow = __dl_overflow(dl_b, cap, 0, dl_bw);

The remaining total_bw that make this check fail should be the one
relative to the dl_server on the cpu that is going offline. Wonder if we
shouldn't rather clean that up (remove dl_server contribution) before we
get to this point during an hotplug operation. Need to think about it a
little more.

Thanks,
Juri
Re: [PATCH] sched/deadline: Skip overflow check if 0 capacity
Posted by Waiman Long 2 weeks, 1 day ago
On 11/8/24 8:25 AM, Juri Lelli wrote:
> On 07/11/24 23:29, Waiman Long wrote:
>> By properly setting up a 1-cpu sched domain (partition) with no
>> task, it was found that offlining that particular CPU failed because
>> dl_bw_check_overflow() in cpuset_cpu_inactive() returned -EBUSY. This
>> is due to the fact that dl_bw_capacity() return 0 as the sched domain
>> has no active CPU causing a false positive in the overflow check.
>>
>> Fix this corner case by skipping the __dl_overflow() check in
>> dl_bw_manage() when the returned capacity is 0.
>>
>> Signed-off-by: Waiman Long <longman@redhat.com>
>> ---
>>   kernel/sched/deadline.c | 8 +++++++-
>>   1 file changed, 7 insertions(+), 1 deletion(-)
>>
>> diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
>> index be1b917dc8ce..0195f350d6d3 100644
>> --- a/kernel/sched/deadline.c
>> +++ b/kernel/sched/deadline.c
>> @@ -3479,7 +3479,13 @@ static int dl_bw_manage(enum dl_bw_request req, int cpu, u64 dl_bw)
>>   	} else {
>>   		unsigned long cap = dl_bw_capacity(cpu);
>>   
>> -		overflow = __dl_overflow(dl_b, cap, 0, dl_bw);
>> +		/*
>> +		 * In the unlikely case of 0 capacity (e.g. a sched domain
>> +		 * with no active CPUs), skip the overflow check as it will
>> +		 * always return a false positive.
>> +		 */
>> +		if (likely(cap))
>> +			overflow = __dl_overflow(dl_b, cap, 0, dl_bw);
> The remaining total_bw that make this check fail should be the one
> relative to the dl_server on the cpu that is going offline. Wonder if we
> shouldn't rather clean that up (remove dl_server contribution) before we
> get to this point during an hotplug operation. Need to think about it a
> little more.
static inline bool
__dl_overflow(struct dl_bw *dl_b, unsigned long cap, u64 old_bw, u64 new_bw)
{
         return dl_b->bw != -1 &&
                cap_scale(dl_b->bw, cap) < dl_b->total_bw - old_bw + new_bw;
}

With a 0 cap, cap_scale(dl_b->bw, cap) will always be 0. As long as 
total_bw isn't 0 and bw isn't -1, the condition will be true.

Cheers,
Longman

Re: [PATCH] sched/deadline: Skip overflow check if 0 capacity
Posted by Juri Lelli 2 weeks, 1 day ago
On 08/11/24 08:39, Waiman Long wrote:
> On 11/8/24 8:25 AM, Juri Lelli wrote:
> > On 07/11/24 23:29, Waiman Long wrote:
> > > By properly setting up a 1-cpu sched domain (partition) with no
> > > task, it was found that offlining that particular CPU failed because
> > > dl_bw_check_overflow() in cpuset_cpu_inactive() returned -EBUSY. This
> > > is due to the fact that dl_bw_capacity() return 0 as the sched domain
> > > has no active CPU causing a false positive in the overflow check.
> > > 
> > > Fix this corner case by skipping the __dl_overflow() check in
> > > dl_bw_manage() when the returned capacity is 0.
> > > 
> > > Signed-off-by: Waiman Long <longman@redhat.com>
> > > ---
> > >   kernel/sched/deadline.c | 8 +++++++-
> > >   1 file changed, 7 insertions(+), 1 deletion(-)
> > > 
> > > diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
> > > index be1b917dc8ce..0195f350d6d3 100644
> > > --- a/kernel/sched/deadline.c
> > > +++ b/kernel/sched/deadline.c
> > > @@ -3479,7 +3479,13 @@ static int dl_bw_manage(enum dl_bw_request req, int cpu, u64 dl_bw)
> > >   	} else {
> > >   		unsigned long cap = dl_bw_capacity(cpu);
> > > -		overflow = __dl_overflow(dl_b, cap, 0, dl_bw);
> > > +		/*
> > > +		 * In the unlikely case of 0 capacity (e.g. a sched domain
> > > +		 * with no active CPUs), skip the overflow check as it will
> > > +		 * always return a false positive.
> > > +		 */
> > > +		if (likely(cap))
> > > +			overflow = __dl_overflow(dl_b, cap, 0, dl_bw);
> > The remaining total_bw that make this check fail should be the one
> > relative to the dl_server on the cpu that is going offline. Wonder if we
> > shouldn't rather clean that up (remove dl_server contribution) before we
> > get to this point during an hotplug operation. Need to think about it a
> > little more.
> static inline bool
> __dl_overflow(struct dl_bw *dl_b, unsigned long cap, u64 old_bw, u64 new_bw)
> {
>         return dl_b->bw != -1 &&
>                cap_scale(dl_b->bw, cap) < dl_b->total_bw - old_bw + new_bw;
> }
> 
> With a 0 cap, cap_scale(dl_b->bw, cap) will always be 0. As long as total_bw
> isn't 0 and bw isn't -1, the condition will be true.

Right, but I fear that by hiding the special corner case we would also
miss the cases where we have DEADLINE tasks with bandwidth on that
single CPU and then ignore it.

So, maybe something like the below?

---
 kernel/sched/deadline.c | 13 +++++++++++++
 1 file changed, 13 insertions(+)

diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index be1b917dc8ce..7884e566557c 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -130,11 +130,24 @@ static inline int dl_bw_cpus(int i)
 	if (cpumask_subset(rd->span, cpu_active_mask))
 		return cpumask_weight(rd->span);
 
+	/*
+	 * Hotplug extreme case in which the last remaining online CPU in a
+	 * root domain is going offline. We get here after that cpus has been
+	 * cleared from cpu_active_mask, so the loop below would return 0
+	 * CPUs, which of course doesn't make much sense. Return at least 1
+	 * CPU.
+	 */
+
+	if (cpumask_weight(rd->span) == 1)
+		return 1;
+
 	cpus = 0;
 
 	for_each_cpu_and(i, rd->span, cpu_active_mask)
 		cpus++;
 
+	WARN_ON(!cpus);
+
 	return cpus;
 }
---

That said though, I believe I just found an additional issue. With the
above the system doesn't crash (it did w/o it), but happily moves
DEADLINE tasks out of a domain with a single CPU going offline. Last
time I looked at this we were properly checking and failing the hotplug
operation, but it was indeed a while ago, so not sure yet what changed.
More staring.

Oh, so broken, yay. :)

Best,
Juri
Re: [PATCH] sched/deadline: Skip overflow check if 0 capacity
Posted by Waiman Long 2 weeks, 1 day ago
On 11/8/24 11:31 AM, Juri Lelli wrote:
> That said though, I believe I just found an additional issue. With the
> above the system doesn't crash (it did w/o it), but happily moves
> DEADLINE tasks out of a domain with a single CPU going offline. Last
> time I looked at this we were properly checking and failing the hotplug
> operation, but it was indeed a while ago, so not sure yet what changed.
> More staring.
>
> Oh, so broken, yay. 🙂

Is it the case that the null total_bw bug let the cpu offline succeeds 
in a 1-cpu partition with a DL task and cause it to crash? I am also OK 
if you adds the check for the presence of a DL task in the cpu and fails 
the offline check in this case.

Thanks,
Longman

Re: [PATCH] sched/deadline: Skip overflow check if 0 capacity
Posted by Waiman Long 2 weeks, 1 day ago
On 11/8/24 11:31 AM, Juri Lelli wrote:
> On 08/11/24 08:39, Waiman Long wrote:
>> On 11/8/24 8:25 AM, Juri Lelli wrote:
>>> On 07/11/24 23:29, Waiman Long wrote:
>>>> By properly setting up a 1-cpu sched domain (partition) with no
>>>> task, it was found that offlining that particular CPU failed because
>>>> dl_bw_check_overflow() in cpuset_cpu_inactive() returned -EBUSY. This
>>>> is due to the fact that dl_bw_capacity() return 0 as the sched domain
>>>> has no active CPU causing a false positive in the overflow check.
>>>>
>>>> Fix this corner case by skipping the __dl_overflow() check in
>>>> dl_bw_manage() when the returned capacity is 0.
>>>>
>>>> Signed-off-by: Waiman Long <longman@redhat.com>
>>>> ---
>>>>    kernel/sched/deadline.c | 8 +++++++-
>>>>    1 file changed, 7 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
>>>> index be1b917dc8ce..0195f350d6d3 100644
>>>> --- a/kernel/sched/deadline.c
>>>> +++ b/kernel/sched/deadline.c
>>>> @@ -3479,7 +3479,13 @@ static int dl_bw_manage(enum dl_bw_request req, int cpu, u64 dl_bw)
>>>>    	} else {
>>>>    		unsigned long cap = dl_bw_capacity(cpu);
>>>> -		overflow = __dl_overflow(dl_b, cap, 0, dl_bw);
>>>> +		/*
>>>> +		 * In the unlikely case of 0 capacity (e.g. a sched domain
>>>> +		 * with no active CPUs), skip the overflow check as it will
>>>> +		 * always return a false positive.
>>>> +		 */
>>>> +		if (likely(cap))
>>>> +			overflow = __dl_overflow(dl_b, cap, 0, dl_bw);
>>> The remaining total_bw that make this check fail should be the one
>>> relative to the dl_server on the cpu that is going offline. Wonder if we
>>> shouldn't rather clean that up (remove dl_server contribution) before we
>>> get to this point during an hotplug operation. Need to think about it a
>>> little more.
>> static inline bool
>> __dl_overflow(struct dl_bw *dl_b, unsigned long cap, u64 old_bw, u64 new_bw)
>> {
>>          return dl_b->bw != -1 &&
>>                 cap_scale(dl_b->bw, cap) < dl_b->total_bw - old_bw + new_bw;
>> }
>>
>> With a 0 cap, cap_scale(dl_b->bw, cap) will always be 0. As long as total_bw
>> isn't 0 and bw isn't -1, the condition will be true.
> Right, but I fear that by hiding the special corner case we would also
> miss the cases where we have DEADLINE tasks with bandwidth on that
> single CPU and then ignore it.
>
> So, maybe something like the below?
>
> ---
>   kernel/sched/deadline.c | 13 +++++++++++++
>   1 file changed, 13 insertions(+)
>
> diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
> index be1b917dc8ce..7884e566557c 100644
> --- a/kernel/sched/deadline.c
> +++ b/kernel/sched/deadline.c
> @@ -130,11 +130,24 @@ static inline int dl_bw_cpus(int i)
>   	if (cpumask_subset(rd->span, cpu_active_mask))
>   		return cpumask_weight(rd->span);
>   
> +	/*
> +	 * Hotplug extreme case in which the last remaining online CPU in a
> +	 * root domain is going offline. We get here after that cpus has been
> +	 * cleared from cpu_active_mask, so the loop below would return 0
> +	 * CPUs, which of course doesn't make much sense. Return at least 1
> +	 * CPU.
> +	 */
> +
> +	if (cpumask_weight(rd->span) == 1)
> +		return 1;
> +
>   	cpus = 0;
>   
>   	for_each_cpu_and(i, rd->span, cpu_active_mask)
>   		cpus++;
>   
> +	WARN_ON(!cpus);
> +
>   	return cpus;
>   }
> ---

This patch looks good to me.

With this patch and my cpuset patches applied, the test_cpuset_prs.sh 
test passed without any error. You can add the following tags if you 
send this patch out.

Reported-by: Waiman Long <longman@redhat.com>
Tested-by: Waiman Long <longman@redhat.com

> That said though, I believe I just found an additional issue. With the
> above the system doesn't crash (it did w/o it), but happily moves
> DEADLINE tasks out of a domain with a single CPU going offline. Last
> time I looked at this we were properly checking and failing the hotplug
> operation, but it was indeed a while ago, so not sure yet what changed.
> More staring.
>
> Oh, so broken, yay. :)

The cpuset code will move tasks in a partition with no effective CPUs to 
its parent.

Cheers,
Longman