scx_bpf_cpuperf_set() can be used to set a performance target level on
any CPU. However, it doesn't correctly acquire the corresponding rq
lock, which may lead to unsafe behavior and trigger the following
warning, due to the lockdep_assert_rq_held() check:
[ 51.713737] WARNING: CPU: 3 PID: 3899 at kernel/sched/sched.h:1512 scx_bpf_cpuperf_set+0x1a0/0x1e0
...
[ 51.713836] Call trace:
[ 51.713837] scx_bpf_cpuperf_set+0x1a0/0x1e0 (P)
[ 51.713839] bpf_prog_62d35beb9301601f_bpfland_init+0x168/0x440
[ 51.713841] bpf__sched_ext_ops_init+0x54/0x8c
[ 51.713843] scx_ops_enable.constprop.0+0x2c0/0x10f0
[ 51.713845] bpf_scx_reg+0x18/0x30
[ 51.713847] bpf_struct_ops_link_create+0x154/0x1b0
[ 51.713849] __sys_bpf+0x1934/0x22a0
Fix by properly acquiring the rq lock.
Fixes: d86adb4fc0655 ("sched_ext: Add cpuperf support")
Signed-off-by: Andrea Righi <arighi@nvidia.com>
---
kernel/sched/ext.c | 16 +++++++++++++---
1 file changed, 13 insertions(+), 3 deletions(-)
diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index 3d20228052217..53b97b32dfa43 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -7114,12 +7114,22 @@ __bpf_kfunc void scx_bpf_cpuperf_set(s32 cpu, u32 perf)
 
 	if (ops_cpu_valid(cpu, NULL)) {
 		struct rq *rq = cpu_rq(cpu);
+		struct rq_flags rf;
+		bool rq_unlocked;
+
+		preempt_disable();
+		rq_unlocked = (rq != this_rq()) || scx_kf_allowed_if_unlocked();
+		if (rq_unlocked) {
+			rq_lock_irqsave(rq, &rf);
+			update_rq_clock(rq);
+		}
 
 		rq->scx.cpuperf_target = perf;
+		cpufreq_update_util(rq, 0);
 
-		rcu_read_lock_sched_notrace();
-		cpufreq_update_util(cpu_rq(cpu), 0);
-		rcu_read_unlock_sched_notrace();
+		if (rq_unlocked)
+			rq_unlock_irqrestore(rq, &rf);
+		preempt_enable();
 	}
 }
 
--
2.49.0
Hello, Andrea.
On Tue, Mar 25, 2025 at 03:00:21PM +0100, Andrea Righi wrote:
> @@ -7114,12 +7114,22 @@ __bpf_kfunc void scx_bpf_cpuperf_set(s32 cpu, u32 perf)
>
>  	if (ops_cpu_valid(cpu, NULL)) {
>  		struct rq *rq = cpu_rq(cpu);
> +		struct rq_flags rf;
> +		bool rq_unlocked;
> +
> +		preempt_disable();
> +		rq_unlocked = (rq != this_rq()) || scx_kf_allowed_if_unlocked();
> +		if (rq_unlocked) {
> +			rq_lock_irqsave(rq, &rf);
I don't think this is correct:
- This is double-locking regardless of the locking order and thus can lead
to ABBA deadlocks.
- There's no guarantee that the locked rq is this_rq(). e.g. In wakeup path,
the locked rq is on the CPU that the wakeup is targeting, not this_rq().
Hmm... this is a bit tricky. SCX_CALL_OP*() always knows whether the rq is
locked or not. We might as well pass it the currently locked rq and remember
that in a percpu variable, so that scx_bpf_*() can always tell whether and
which cpu is rq-locked currently. If unlocked, we can grab the rq lock. If
the target cpu is not the locked one, we can either fail the operation (and
trigger ops error) or bounce it to an irq work.
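Very roughly, something along these lines -- an untested sketch just to
illustrate the idea; scx_callback_enter()/exit() and scx_locked_rq are
made-up names, and the usual checks on @perf are omitted:

	static DEFINE_PER_CPU(struct rq *, scx_locked_rq);

	/* called by SCX_CALL_OP*() around the BPF callback invocation */
	static inline void scx_callback_enter(struct rq *rq)
	{
		this_cpu_write(scx_locked_rq, rq);	/* NULL if no rq is locked */
	}

	static inline void scx_callback_exit(void)
	{
		this_cpu_write(scx_locked_rq, NULL);
	}

	__bpf_kfunc void scx_bpf_cpuperf_set(s32 cpu, u32 perf)
	{
		if (ops_cpu_valid(cpu, NULL)) {
			struct rq *rq = cpu_rq(cpu), *locked_rq;
			struct rq_flags rf;

			preempt_disable();
			locked_rq = __this_cpu_read(scx_locked_rq);
			if (!locked_rq) {
				/* caller holds no rq lock, safe to take @rq's */
				rq_lock_irqsave(rq, &rf);
				update_rq_clock(rq);
			} else if (locked_rq != rq) {
				/* would double-lock: fail the op or bounce to irq_work */
				preempt_enable();
				return;
			}

			rq->scx.cpuperf_target = perf;
			cpufreq_update_util(rq, 0);

			if (!locked_rq)
				rq_unlock_irqrestore(rq, &rf);
			preempt_enable();
		}
	}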
Thanks.
--
tejun
On Wed, Mar 26, 2025 at 02:24:16PM -1000, Tejun Heo wrote:
> Hello, Andrea.
>
> On Tue, Mar 25, 2025 at 03:00:21PM +0100, Andrea Righi wrote:
> > @@ -7114,12 +7114,22 @@ __bpf_kfunc void scx_bpf_cpuperf_set(s32 cpu, u32 perf)
> >
> >  	if (ops_cpu_valid(cpu, NULL)) {
> >  		struct rq *rq = cpu_rq(cpu);
> > +		struct rq_flags rf;
> > +		bool rq_unlocked;
> > +
> > +		preempt_disable();
> > +		rq_unlocked = (rq != this_rq()) || scx_kf_allowed_if_unlocked();
> > +		if (rq_unlocked) {
> > +			rq_lock_irqsave(rq, &rf);
>
> I don't think this is correct:
>
> - This is double-locking regardless of the locking order and thus can lead
> to ABBA deadlocks.
>
> - There's no guarantee that the locked rq is this_rq(). e.g. In wakeup path,
> the locked rq is on the CPU that the wakeup is targeting, not this_rq().
>
> Hmm... this is a bit tricky. SCX_CALL_OP*() always knows whether the rq is
> locked or not. We might as well pass it the currently locked rq and remember
> that in a percpu variable, so that scx_bpf_*() can always tell whether and
> which cpu is rq-locked currently. If unlocked, we can grab the rq lock. If
> the target cpu is not the locked one, we can either fail the operation (and
> trigger ops error) or bounce it to an irq work.
Hm... that's right, it looks like this requires a bit more work than
expected, but saving the currently locked rq might be helpful also for
other kfuncs, I'll take a look at this.
Thanks!
-Andrea
On Thu, Mar 27, 2025 at 08:56:12AM +0100, Andrea Righi wrote:
> On Wed, Mar 26, 2025 at 02:24:16PM -1000, Tejun Heo wrote:
> > Hello, Andrea.
> >
> > On Tue, Mar 25, 2025 at 03:00:21PM +0100, Andrea Righi wrote:
> > > @@ -7114,12 +7114,22 @@ __bpf_kfunc void scx_bpf_cpuperf_set(s32 cpu, u32 perf)
> > >
> > >  	if (ops_cpu_valid(cpu, NULL)) {
> > >  		struct rq *rq = cpu_rq(cpu);
> > > +		struct rq_flags rf;
> > > +		bool rq_unlocked;
> > > +
> > > +		preempt_disable();
> > > +		rq_unlocked = (rq != this_rq()) || scx_kf_allowed_if_unlocked();
> > > +		if (rq_unlocked) {
> > > +			rq_lock_irqsave(rq, &rf);
> >
> > I don't think this is correct:
> >
> > - This is double-locking regardless of the locking order and thus can lead
> > to ABBA deadlocks.
> >
> > - There's no guarantee that the locked rq is this_rq(). e.g. In wakeup path,
> > the locked rq is on the CPU that the wakeup is targeting, not this_rq().
> >
> > Hmm... this is a bit tricky. SCX_CALL_OP*() always knows whether the rq is
> > locked or not. We might as well pass it the currently locked rq and remember
> > that in a percpu variable, so that scx_bpf_*() can always tell whether and
> > which cpu is rq-locked currently. If unlocked, we can grab the rq lock. If
> > the target cpu is not the locked one, we can either fail the operation (and
> > trigger ops error) or bounce it to an irq work.
>
> Hm... that's right, it looks like this requires a bit more work than
> expected, but saving the currently locked rq might be helpful also for
> other kfuncs, I'll take a look at this.
What if we lock the rq in the scx_kf_allowed_if_unlocked() case, and for
all the other cases we ignore locking if rq == this_rq(). If we need to
operate on a different rq than the current one we could either defer the
work or just trigger an ops error. Something like:
	if (scx_kf_allowed_if_unlocked()) {
		rq_lock_irqsave(rq, &rf);
		update_rq_clock(rq);
	} else if (rq != this_rq()) {
		// defer work or ops error
		return;
	}

	lockdep_assert_rq_held(rq);
	rq->scx.cpuperf_target = perf;
	cpufreq_update_util(rq, 0);

	if (scx_kf_allowed_if_unlocked())
		rq_unlock_irqrestore(rq, &rf);
AFAICS all the current scx schedulers call scx_bpf_cpuperf_set() from
ops.running(), ops.tick() or ops.init(), so even with the ops error we
should cover all the existing cases.
The only unsupported scenario is calling scx_bpf_cpuperf_set() from
ops.enqueue() / ops.select_cpu(), but maybe we could add the deferred work
later to handle that if needed.
Thanks,
-Andrea
Hello,
On Thu, Mar 27, 2025 at 10:53:39AM +0100, Andrea Righi wrote:
...
> > Hm... that's right, it looks like this requires a bit more work than
> > expected, but saving the currently locked rq might be helpful also for
> > other kfuncs, I'll take a look at this.
>
> What if we lock the rq in the scx_kf_allowed_if_unlocked() case, and for
> all the other cases we ignore locking if rq == this_rq(). If we need to
> operate on a different rq than the current one we could either defer the
> work or just trigger an ops error. Something like:
>
> 	if (scx_kf_allowed_if_unlocked()) {
> 		rq_lock_irqsave(rq, &rf);
> 		update_rq_clock(rq);
> 	} else if (rq != this_rq()) {
> 		// defer work or ops error
> 		return;
> 	}
>
> 	lockdep_assert_rq_held(rq);
> 	rq->scx.cpuperf_target = perf;
> 	cpufreq_update_util(rq, 0);
>
> 	if (scx_kf_allowed_if_unlocked())
> 		rq_unlock_irqrestore(rq, &rf);
>
> AFAICS all the current scx schedulers call scx_bpf_cpuperf_set() from
> ops.running(), ops.tick() or ops.init(), so even with the ops error we
> should cover all the existing cases.
>
> The only unsupported scenario is calling scx_bpf_cpuperf_set() from
> ops.enqueue() / ops.select_cpu(), but maybe we could add the deferred work
> later to handle that if needed.
balance_one() can be called from a sibling CPU when core sched is enabled,
so ttwu isn't the only path where this_rq() test wouldn't work. Even if we
plug all the existing holes and make it work, it feels a bit too fragile to
me. It's something which can easily break inadvertently and cause subtle
failures.
If we don't want to do locked rq tracking, we can always use
schedule_deferred() when any rq is locked too. That's a bit more expensive
tho.
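
For illustration only (this is not the existing schedule_deferred()
machinery, and all names here are hypothetical), bouncing the update to an
irq_work on the target CPU could look roughly like this:

	static DEFINE_PER_CPU(u32, cpuperf_deferred_perf);

	static void cpuperf_irq_workfn(struct irq_work *work)
	{
		struct rq *rq = this_rq();	/* runs on the target CPU */
		struct rq_flags rf;

		rq_lock_irqsave(rq, &rf);
		update_rq_clock(rq);
		rq->scx.cpuperf_target = READ_ONCE(*this_cpu_ptr(&cpuperf_deferred_perf));
		cpufreq_update_util(rq, 0);
		rq_unlock_irqrestore(rq, &rf);
	}

	static DEFINE_PER_CPU(struct irq_work, cpuperf_irq_work) =
		IRQ_WORK_INIT_HARD(cpuperf_irq_workfn);

	/* called from scx_bpf_cpuperf_set() when some rq is already locked */
	static void cpuperf_set_deferred(s32 cpu, u32 perf)
	{
		/* last writer wins if updates race; fine for a target hint */
		WRITE_ONCE(per_cpu(cpuperf_deferred_perf, cpu), perf);
		irq_work_queue_on(&per_cpu(cpuperf_irq_work, cpu), cpu);
	}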
Thanks.
--
tejun
On Thu, Mar 27, 2025 at 07:09:25AM -1000, Tejun Heo wrote:
> Hello,
>
> On Thu, Mar 27, 2025 at 10:53:39AM +0100, Andrea Righi wrote:
> ...
> > > Hm... that's right, it looks like this requires a bit more work than
> > > expected, but saving the currently locked rq might be helpful also for
> > > other kfuncs, I'll take a look at this.
> >
> > What if we lock the rq in the scx_kf_allowed_if_unlocked() case, and for
> > all the other cases we ignore locking if rq == this_rq(). If we need to
> > operate on a different rq than the current one we could either defer the
> > work or just trigger an ops error. Something like:
> >
> > 	if (scx_kf_allowed_if_unlocked()) {
> > 		rq_lock_irqsave(rq, &rf);
> > 		update_rq_clock(rq);
> > 	} else if (rq != this_rq()) {
> > 		// defer work or ops error
> > 		return;
> > 	}
> >
> > 	lockdep_assert_rq_held(rq);
> > 	rq->scx.cpuperf_target = perf;
> > 	cpufreq_update_util(rq, 0);
> >
> > 	if (scx_kf_allowed_if_unlocked())
> > 		rq_unlock_irqrestore(rq, &rf);
> >
> > AFAICS all the current scx schedulers call scx_bpf_cpuperf_set() from
> > ops.running(), ops.tick() or ops.init(), so even with the ops error we
> > should cover all the existing cases.
> >
> > The only unsupported scenario is calling scx_bpf_cpuperf_set() from
> > ops.enqueue() / ops.select_cpu(), but maybe we could add the deferred work
> > later to handle that if needed.
>
> balance_one() can be called from a sibling CPU when core sched is enabled,
> so ttwu isn't the only path where this_rq() test wouldn't work. Even if we
> plug all the existing holes and make it work, it feels a bit too fragile to
> me. It's something which can easily break inadvertently and cause subtle
> failures.
>
> If we don't want to do locked rq tracking, we can always use
> schedule_deferred() when any rq is locked too. That's a bit more expensive
> tho.
Yeah, I'm a bit worried that locked rq tracking might introduce overhead to
all the scx callbacks, just to address this issue.
Perhaps schedule_deferred() is a reasonable compromise and we can limit the
overhead just to scx_bpf_cpuperf_set().
-Andrea
Hello,

On Thu, Mar 27, 2025 at 06:15:09PM +0100, Andrea Righi wrote:
> > If we don't want to do locked rq tracking, we can always use
> > schedule_deferred() when any rq is locked too. That's a bit more expensive
> > tho.
>
> Yeah, I'm a bit worried that locked rq tracking might introduce overhead to
> all the scx callbacks, just to address this issue.

All operations are already wrapped with SCX_CALL_OP() and updating per-cpu
state (kf flags). It's unlikely that another percpu variable update is going
to be noticeable.

Thanks.

--
tejun
On Thu, Mar 27, 2025 at 07:19:03AM -1000, Tejun Heo wrote:
> Hello,
>
> On Thu, Mar 27, 2025 at 06:15:09PM +0100, Andrea Righi wrote:
> > > If we don't want to do locked rq tracking, we can always use
> > > schedule_deferred() when any rq is locked too. That's a bit more expensive
> > > tho.
> >
> > Yeah, I'm a bit worried that locked rq tracking might introduce overhead to
> > all the scx callbacks, just to address this issue.
>
> All operations are already wrapped with SCX_CALL_OP() and updating per-cpu
> state (kf flags). It's unlikely that another percpu variable update is going
> to be noticeable.

Ack, I'll explore the locked rq tracking way then.

Thanks,
-Andrea
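For reference, the call-site half of the tracking idea could be wired up
roughly as follows -- a hypothetical sketch assuming the existing
SCX_CALL_OP(mask, op, ...) wrapper and the scx_callback_enter()/exit()
helpers sketched earlier in the thread, not the actual macro:

	#define SCX_CALL_OP_LOCKED(mask, op, locked_rq, args...)		\
	do {									\
		scx_callback_enter(locked_rq);	/* NULL if no rq lock held */	\
		SCX_CALL_OP((mask), op, ##args);	/* existing wrapper */	\
		scx_callback_exit();						\
	} while (0)

	/* e.g. from a path that already holds @rq's lock: */
	/*	SCX_CALL_OP_LOCKED(SCX_KF_REST, running, rq, p);	*/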