Currently, the BPF scheduler can return a CPU that is marked as possible
in the system configurations, but this doesn't guarantee that the CPU is
actually present or online at the time. This behavior can lead to
scenarios where the scheduler attempts to assign tasks to CPUs that are
not available, causing the fallback mechanism to activate and
potentially leading to an uneven load distribution across the system.
By default, when a "not possible" CPU is returned, sched_ext gracefully
exits the BPF scheduler.
static bool ops_cpu_valid(s32 cpu, const char *where)
{
	if (likely(cpu >= 0 && cpu < nr_cpu_ids && cpu_possible(cpu))) {
		return true;
	} else {
		scx_ops_error("invalid CPU %d%s%s", cpu,
			      where ? " " : "", where ?: "");
		return false;
	}
}
On POWER, a system's cpu_present and cpu_possible masks can differ. CPUs
that are possible but not yet present can be hot-added later, and once
added they are also set in the cpu_present mask.
Looks like cpu_present() is a better check.
# tail -n +1 /sys/devices/system/cpu/{possible,present,online,offline}
==> /sys/devices/system/cpu/possible <==
0-63
==> /sys/devices/system/cpu/present <==
0-31
==> /sys/devices/system/cpu/online <==
0-31
==> /sys/devices/system/cpu/offline <==
32-63
diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index 03da2cecb547..ca36596176c5 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -1333,7 +1333,7 @@ static void wait_ops_state(struct task_struct *p, unsigned long opss)
  */
 static bool ops_cpu_valid(s32 cpu, const char *where)
 {
-	if (likely(cpu >= 0 && cpu < nr_cpu_ids && cpu_possible(cpu))) {
+	if (likely(cpu >= 0 && cpu < nr_cpu_ids && cpu_present(cpu))) {
 		return true;
 	} else {
 		scx_ops_error("invalid CPU %d%s%s", cpu,
Note: With this change, when the BPF scheduler erroneously assigns a task
to an offline CPU, sched_ext does not abort. Instead, the core scheduler
compensates by picking a fallback CPU from the same node as the task's
previous CPU, which can sometimes overload certain CPUs.

Would a cpu_online(cpu) check be a better alternative?
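For concreteness, the stricter variant being asked about would look
something like this (an untested sketch, not a submitted patch):

static bool ops_cpu_valid(s32 cpu, const char *where)
{
	/* Reject CPUs that are not currently online, not merely
	 * non-possible ones. */
	if (likely(cpu >= 0 && cpu < nr_cpu_ids && cpu_online(cpu))) {
		return true;
	} else {
		scx_ops_error("invalid CPU %d%s%s", cpu,
			      where ? " " : "", where ?: "");
		return false;
	}
}

One caveat with such a check is that, without synchronization against
hotplug, the CPU could still go offline immediately after the check
returns true.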
Hello, Vishal.
On Sun, Jul 14, 2024 at 12:44:24AM +0530, Vishal Chourasia wrote:
> Currently, the BPF scheduler can return a CPU that is marked as possible
> in the system configurations, but this doesn't guarantee that the CPU is
> actually present or online at the time. This behavior can lead to
> scenarios where the scheduler attempts to assign tasks to CPUs that are
> not available, causing the fallback mechanism to activate and
> potentially leading to an uneven load distribution across the system.
ops.select_cpu() is allowed to return any CPU and then the scheduler will
pick a fallback CPU. This is mostly because that's how
sched_class->select_task_rq() behaves. Here, SCX is just inheriting the
behavior.
Dispatching to foreign local DSQ using SCX_DSQ_LOCAL_ON also does
auto-fallback. This is because it's difficult for the BPF scheduler to
strongly synchronize its dispatch operation against CPU hotplug operations.
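(For illustration, this is the kind of operation in question; a minimal
sketch of a dispatch callback, where pick_next_task_somehow() and
pick_target_cpu() are hypothetical helpers standing in for scheduler
policy:)

void BPF_STRUCT_OPS(sketch_dispatch, s32 cpu, struct task_struct *prev)
{
	struct task_struct *p = pick_next_task_somehow();	/* hypothetical */
	s32 target;

	if (!p)
		return;
	target = pick_target_cpu(p);	/* hypothetical */

	/*
	 * Insert @p into @target's local DSQ. If @target is being
	 * hotplugged off concurrently, sched_ext falls back to a
	 * nearby online CPU instead of aborting the scheduler.
	 */
	scx_bpf_dispatch(p, SCX_DSQ_LOCAL_ON | target, SCX_SLICE_DFL, 0);
}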
> By default, when a "not possible" CPU is returned, sched_ext gracefully
> exits the BPF scheduler.
>
> static bool ops_cpu_valid(s32 cpu, const char *where)
> {
> 	if (likely(cpu >= 0 && cpu < nr_cpu_ids && cpu_possible(cpu))) {
> 		return true;
> 	} else {
> 		scx_ops_error("invalid CPU %d%s%s", cpu,
> 			      where ? " " : "", where ?: "");
> 		return false;
> 	}
> }
>
> On POWER, a system's cpu_present and cpu_possible masks can differ. CPUs
> that are possible but not yet present can be hot-added later, and once
> added they are also set in the cpu_present mask.
>
> Looks like cpu_present() is a better check.
We can consider tightening each path separately but I'm not sure making
ops_cpu_valid() more strict is a good idea. For example, there's no reason
to abort because a scheduler is calling scx_bpf_dsq_nr_queued() on an
offline CPU especially given that the kfunc is allowed from any context
without any synchronization. It can create aborting bugs which are really
difficult to reproduce.
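(For instance, a sketch of the kind of unsynchronized query meant above,
peeking at another CPU's local DSQ from an arbitrary context such as a BPF
timer; nothing prevents @cpu from being offline here:)

static s32 local_dsq_depth(s32 cpu)
{
	/* SCX_DSQ_LOCAL_ON | cpu names @cpu's local DSQ. */
	return scx_bpf_dsq_nr_queued(SCX_DSQ_LOCAL_ON | cpu);
}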
Thanks.
--
tejun
On Sun, Jul 14, 2024 at 07:17:32PM -1000, Tejun Heo wrote:
> Hello, Vishal.
>
> On Sun, Jul 14, 2024 at 12:44:24AM +0530, Vishal Chourasia wrote:
> > Currently, the BPF scheduler can return a CPU that is marked as possible
> > in the system configurations, but this doesn't guarantee that the CPU is
> > actually present or online at the time. This behavior can lead to
> > scenarios where the scheduler attempts to assign tasks to CPUs that are
> > not available, causing the fallback mechanism to activate and
> > potentially leading to an uneven load distribution across the system.
>
> ops.select_cpu() is allowed to return any CPU and then the scheduler will
> pick a fallback CPU. This is mostly because that's how
> sched_class->select_task_rq() behaves. Here, SCX is just inheriting the
> behavior.
>
> Dispatching to foreign local DSQ using SCX_DSQ_LOCAL_ON also does
> auto-fallback. This is because it's difficult for the BPF scheduler to
> strongly synchronize its dispatch operation against CPU hotplug operations.
>
> > By default, when a "not possible" CPU is returned, sched_ext gracefully
> > exits the BPF scheduler.
> >
> > static bool ops_cpu_valid(s32 cpu, const char *where)
> > {
> > 	if (likely(cpu >= 0 && cpu < nr_cpu_ids && cpu_possible(cpu))) {
> > 		return true;
> > 	} else {
> > 		scx_ops_error("invalid CPU %d%s%s", cpu,
> > 			      where ? " " : "", where ?: "");
> > 		return false;
> > 	}
> > }
> >
> > On POWER, a system's cpu_present and cpu_possible masks can differ. CPUs
> > that are possible but not yet present can be hot-added later, and once
> > added they are also set in the cpu_present mask.
> >
> > Looks like cpu_present() is a better check.
>
> We can consider tightening each path separately but I'm not sure making
What do you mean by "each path separately"?
> ops_cpu_valid() more strict is a good idea. For example, there's no reason
> to abort because a scheduler is calling scx_bpf_dsq_nr_queued() on an
> offline CPU especially given that the kfunc is allowed from any context
> without any synchronization. It can create aborting bugs which are really
> difficult to reproduce.
I agree, I wouldn't want to kick the BPF scheduler out for things that
can be handled. If an invalid CPU was returned by any sched_class, it's
best to handle it, because we don't have any other option.
However, the case of the BPF scheduler is different; we shouldn't need
to handle corner cases but instead immediately flag such cases.
Consider this: if a BPF scheduler is returning a non-present CPU in
select_cpu, the corresponding task will get scheduled on a CPU (using
the fallback mechanism) that may not be the best placement, causing
inconsistent behavior. And there will be no red flags reported, making it
difficult to catch. My point is that sched_ext should be much stricter
towards the BPF scheduler.
Note: There is still the case where an offline CPU is returned by the BPF
scheduler. If sched_ext could catch that and handle it separately by
calling scx_bpf_select_cpu_dfl(), that would be preferable.
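(For reference, a sketch of how a BPF scheduler itself leans on the default
placement logic in its select_cpu callback, following the pattern used by
the in-tree example schedulers:)

s32 BPF_STRUCT_OPS(sketch_select_cpu, struct task_struct *p,
		   s32 prev_cpu, u64 wake_flags)
{
	bool is_idle = false;
	s32 cpu;

	/* Default placement: prefers an idle CPU near @prev_cpu. */
	cpu = scx_bpf_select_cpu_dfl(p, prev_cpu, wake_flags, &is_idle);
	if (is_idle)
		scx_bpf_dispatch(p, SCX_DSQ_LOCAL, SCX_SLICE_DFL, 0);
	return cpu;
}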
>
> Thanks.
>
> --
> tejun
Hello, Vishal.

On Tue, Jul 16, 2024 at 12:19:16PM +0530, Vishal Chourasia wrote:
...
> However, the case of the BPF scheduler is different; we shouldn't need
> to handle corner cases but instead immediately flag such cases.

I'm not convinced of this. There's a tension here and I don't think either
end of the spectrum is the right solution. Please see below.

> Consider this: if a BPF scheduler is returning a non-present CPU in
> select_cpu, the corresponding task will get scheduled on a CPU (using
> the fallback mechanism) that may not be the best placement, causing
> inconsistent behavior. And there will be no red flags reported, making it
> difficult to catch. My point is that sched_ext should be much stricter
> towards the BPF scheduler.

While flagging any deviation as failure and aborting sounds simple and
clean on the surface, I don't think it's that clear cut. There already are
edge conditions where ext or core scheduler code overrides sched_class
decisions and it's not straightforward to get synchronization against e.g.
CPU hotplug watertight from the BPF scheduler. So, we can end up with
aborting a scheduler once in a blue moon for a condition which can only
occur during hotplug and be easily worked around without any noticeable
impact. I don't think that's what we want.

That's not to say that the current situation is great because, as you
pointed out, it's possible to be systematically buggy and fly under the
radar, although I have to say that I've never seen this particular part
being a problem but YMMV.

Currently, error handling is binary. Either it's all okay or the scheduler
dies, but I think things like select_cpu() returning an offline CPU likely
need a bit more nuance. i.e. if it happens once around CPU hotplug, who
cares? But if a scheduler is consistently returning an invalid CPU, that
certainly is a problem and it may not be easy to notice. One way to go
about it could be collecting stats for these events and letting the BPF
scheduler decide what to do about them.

Thanks.

--
tejun
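(A sketch of what that last suggestion might look like on the
BPF-scheduler side; the event naming and the validation against the online
mask are assumptions for illustration, not an existing interface:)

/* Hypothetical per-CPU event counters for "scheduler picked an
 * invalid CPU" style events. */
enum sketch_stat_idx { STAT_INVALID_CPU, NR_STATS };

struct {
	__uint(type, BPF_MAP_TYPE_PERCPU_ARRAY);
	__uint(max_entries, NR_STATS);
	__type(key, u32);
	__type(value, u64);
} stats SEC(".maps");

static void stat_inc(u32 idx)
{
	u64 *cnt = bpf_map_lookup_elem(&stats, &idx);

	if (cnt)
		(*cnt)++;
}

/* In select_cpu, the scheduler could count picks that are not
 * currently online instead of relying on the kernel to abort. */
static bool cpu_looks_online(s32 cpu)
{
	const struct cpumask *online = scx_bpf_get_online_cpumask();
	bool ret = cpu >= 0 && bpf_cpumask_test_cpu(cpu, online);

	scx_bpf_put_cpumask(online);
	return ret;
}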
On Tue, Jul 16, 2024 at 11:44:51AM -1000, Tejun Heo wrote:
> Hello, Vishal.
>
> On Tue, Jul 16, 2024 at 12:19:16PM +0530, Vishal Chourasia wrote:
> ...
> > However, the case of the BPF scheduler is different; we shouldn't need
> > to handle corner cases but instead immediately flag such cases.
>
> I'm not convinced of this. There's a tension here and I don't think either
> end of the spectrum is the right solution. Please see below.
>
> > Consider this: if a BPF scheduler is returning a non-present CPU in
> > select_cpu, the corresponding task will get scheduled on a CPU (using
> > the fallback mechanism) that may not be the best placement, causing
> > inconsistent behavior. And there will be no red flags reported, making it
> > difficult to catch. My point is that sched_ext should be much stricter
> > towards the BPF scheduler.
>
> While flagging any deviation as failure and aborting sounds simple and
> clean on the surface, I don't think it's that clear cut. There already are
> edge conditions where ext or core scheduler code overrides sched_class
> decisions and it's not straightforward to get synchronization against e.g.
> CPU hotplug watertight from the BPF scheduler. So, we can end up with
> aborting a scheduler once in a blue moon for a condition which can only
> occur during hotplug and be easily worked around without any noticeable
> impact. I don't think that's what we want.
>
> That's not to say that the current situation is great because, as you
> pointed out, it's possible to be systematically buggy and fly under the
> radar, although I have to say that I've never seen this particular part
> being a problem but YMMV.
>
> Currently, error handling is binary. Either it's all okay or the scheduler
> dies, but I think things like select_cpu() returning an offline CPU likely
> need a bit more nuance. i.e. if it happens once around CPU hotplug, who
> cares? But if a scheduler is consistently returning an invalid CPU, that
> certainly is a problem and it may not be easy to notice. One way to go
> about it could be collecting stats for these events and letting the BPF
> scheduler decide what to do about them.
>
> Thanks.
>
> --
> tejun

Thanks for the replies.

--
vishal.c