SCX_KICK_WAIT causes kick_cpus_irq_workfn() to busy-wait using
smp_cond_load_acquire() until the target CPU's current SCX task has been
context-switched out (its kick_sync counter advanced).
If multiple CPUs each issue SCX_KICK_WAIT targeting one another
concurrently — e.g. CPU A waits for CPU B, B waits for CPU C, C waits for
CPU A — all CPUs can end up wedged inside smp_cond_load_acquire()
simultaneously. Because each victim CPU is spinning in hardirq/irq_work
context, it cannot reschedule, so no kick_sync counter ever advances and
the system deadlocks.
Fix this by serializing access to the wait loop behind a global raw
spinlock (scx_kick_wait_lock). Only one CPU at a time may execute the
wait loop; any other CPU that has SCX_KICK_WAIT work to do and fails to
acquire the lock records itself in scx_kick_wait_pending and returns.
When the active waiter finishes and releases the lock, it replays the
pending set by re-queuing each pending CPU's kick_cpus_irq_work, ensuring
no wait request is silently dropped.
This is deliberately a coarse serialization: multiple simultaneous wait
operations now run sequentially, increasing latency. In exchange,
deadlocks are impossible regardless of the cycle length (A->B->C->...->A).
Also clear scx_kick_wait_pending in free_kick_syncs() so that any stale
bits left by a CPU that deferred just as the scheduler exited are reset
before the next scheduler instance loads.
Fixes: 90e55164dad4 ("sched_ext: Implement SCX_KICK_WAIT")
Signed-off-by: Christian Loehle <christian.loehle@arm.com>
---
kernel/sched/ext.c | 45 +++++++++++++++++++++++++++++++++++++++++++--
1 file changed, 43 insertions(+), 2 deletions(-)
diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index 26a6ac2f8826..b63ae13d0486 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -89,6 +89,19 @@ struct scx_kick_syncs {
static DEFINE_PER_CPU(struct scx_kick_syncs __rcu *, scx_kick_syncs);
+/*
+ * Serialize %SCX_KICK_WAIT processing across CPUs to avoid wait cycles.
+ * Callers failing to acquire @scx_kick_wait_lock defer by recording
+ * themselves in @scx_kick_wait_pending and are retriggered when the active
+ * waiter completes.
+ *
+ * Lock ordering: @scx_kick_wait_lock is always acquired before
+ * @scx_kick_wait_pending_lock; the two are never taken in the opposite order.
+ */
+static DEFINE_RAW_SPINLOCK(scx_kick_wait_lock);
+static DEFINE_RAW_SPINLOCK(scx_kick_wait_pending_lock);
+static cpumask_t scx_kick_wait_pending;
+
/*
* Direct dispatch marker.
*
@@ -4279,6 +4292,13 @@ static void free_kick_syncs(void)
if (to_free)
kvfree_rcu(to_free, rcu);
}
+
+ /*
+ * Clear any CPUs that were waiting for the lock when the scheduler
+ * exited. Their irq_work has already returned so no in-flight
+ * waiter can observe the stale bits on the next enable.
+ */
+ cpumask_clear(&scx_kick_wait_pending);
}
static void scx_disable_workfn(struct kthread_work *work)
@@ -5647,8 +5667,9 @@ static void kick_cpus_irq_workfn(struct irq_work *irq_work)
struct rq *this_rq = this_rq();
struct scx_rq *this_scx = &this_rq->scx;
struct scx_kick_syncs __rcu *ksyncs_pcpu = __this_cpu_read(scx_kick_syncs);
- bool should_wait = false;
+ bool should_wait = !cpumask_empty(this_scx->cpus_to_wait);
unsigned long *ksyncs;
+ s32 this_cpu = cpu_of(this_rq);
s32 cpu;
if (unlikely(!ksyncs_pcpu)) {
@@ -5672,6 +5693,17 @@ static void kick_cpus_irq_workfn(struct irq_work *irq_work)
if (!should_wait)
return;
+ if (!raw_spin_trylock(&scx_kick_wait_lock)) {
+ raw_spin_lock(&scx_kick_wait_pending_lock);
+ cpumask_set_cpu(this_cpu, &scx_kick_wait_pending);
+ raw_spin_unlock(&scx_kick_wait_pending_lock);
+ return;
+ }
+
+ raw_spin_lock(&scx_kick_wait_pending_lock);
+ cpumask_clear_cpu(this_cpu, &scx_kick_wait_pending);
+ raw_spin_unlock(&scx_kick_wait_pending_lock);
+
for_each_cpu(cpu, this_scx->cpus_to_wait) {
unsigned long *wait_kick_sync = &cpu_rq(cpu)->scx.kick_sync;
@@ -5686,11 +5718,20 @@ static void kick_cpus_irq_workfn(struct irq_work *irq_work)
* task is picked subsequently. The latter is necessary to break
* the wait when $cpu is taken by a higher sched class.
*/
- if (cpu != cpu_of(this_rq))
+ if (cpu != this_cpu)
smp_cond_load_acquire(wait_kick_sync, VAL != ksyncs[cpu]);
cpumask_clear_cpu(cpu, this_scx->cpus_to_wait);
}
+
+ raw_spin_unlock(&scx_kick_wait_lock);
+
+ raw_spin_lock(&scx_kick_wait_pending_lock);
+ for_each_cpu(cpu, &scx_kick_wait_pending) {
+ cpumask_clear_cpu(cpu, &scx_kick_wait_pending);
+ irq_work_queue(&cpu_rq(cpu)->scx.kick_cpus_irq_work);
+ }
+ raw_spin_unlock(&scx_kick_wait_pending_lock);
}
/**
--
2.34.1
Hello,

On Mon, Mar 16, 2026 at 10:02:48AM +0000, Christian Loehle wrote:
> @@ -5686,11 +5718,20 @@ static void kick_cpus_irq_workfn(struct irq_work *irq_work)
> 	 * task is picked subsequently. The latter is necessary to break
> 	 * the wait when $cpu is taken by a higher sched class.
> 	 */
> -	if (cpu != cpu_of(this_rq))
> +	if (cpu != this_cpu)
> 		smp_cond_load_acquire(wait_kick_sync, VAL != ksyncs[cpu]);

Given that irq_work is executed at the end of IRQ handling, we can just
reschedule the irq work when the condition is not met (or separate that out
into its own irq_work). That way, I think we can avoid the global lock.

Thanks.

--
tejun
On 3/16/26 17:46, Tejun Heo wrote:
> Hello,
>
> On Mon, Mar 16, 2026 at 10:02:48AM +0000, Christian Loehle wrote:
>> @@ -5686,11 +5718,20 @@ static void kick_cpus_irq_workfn(struct irq_work *irq_work)
>> 	 * task is picked subsequently. The latter is necessary to break
>> 	 * the wait when $cpu is taken by a higher sched class.
>> 	 */
>> -	if (cpu != cpu_of(this_rq))
>> +	if (cpu != this_cpu)
>> 		smp_cond_load_acquire(wait_kick_sync, VAL != ksyncs[cpu]);
>
> Given that irq_work is executed at the end of IRQ handling, we can just
> reschedule the irq work when the condition is not met (or separate that out
> into its own irq_work). That way, I think we can avoid the global lock.

I'll go poke at it some more, but I think it's not guaranteed that B actually
advances kick_sync if A keeps kicking. At least not if the handling is in
HARD irqwork? Or what would the separated out irq work do differently?
On 3/16/26 22:26, Christian Loehle wrote:
> On 3/16/26 17:46, Tejun Heo wrote:
>> Hello,
>>
>> On Mon, Mar 16, 2026 at 10:02:48AM +0000, Christian Loehle wrote:
>>> @@ -5686,11 +5718,20 @@ static void kick_cpus_irq_workfn(struct irq_work *irq_work)
>>> 	 * task is picked subsequently. The latter is necessary to break
>>> 	 * the wait when $cpu is taken by a higher sched class.
>>> 	 */
>>> -	if (cpu != cpu_of(this_rq))
>>> +	if (cpu != this_cpu)
>>> 		smp_cond_load_acquire(wait_kick_sync, VAL != ksyncs[cpu]);
>>
>> Given that irq_work is executed at the end of IRQ handling, we can just
>> reschedule the irq work when the condition is not met (or separate that out
>> into its own irq_work). That way, I think we can avoid the global lock.
>>
> I'll go poke at it some more, but I think it's not guaranteed that B actually
> advances kick_sync if A keeps kicking. At least not if the handling is in
> HARD irqwork? Or what would the separated out irq work do differently?

So in my particular example I do the SCX_KICK_WAIT in ops.enqueue(), which is
fair, but I don't think we can delay calling that until we've advanced our
local kick_sync, and if we don't we end up in the deadlock, even if e.g. we
separate out the retry (and make that lazy), because then the local CPU is
able to continuously issue new kicks (which will have to be handled by the
non-retry path) without advancing its own kick_sync.
The closest thing to that I can get working is separating out the
SCX_KICK_WAIT entirely and making that lazy. In practice, though, that would
realistically make the SCX_KICK_WAIT latency a lot higher than with the
global lock; is that what you had in mind? Or am I missing something here?
On 3/17/26 08:23, Christian Loehle wrote:
> On 3/16/26 22:26, Christian Loehle wrote:
>> On 3/16/26 17:46, Tejun Heo wrote:
>>> Hello,
>>>
>>> On Mon, Mar 16, 2026 at 10:02:48AM +0000, Christian Loehle wrote:
>>>> @@ -5686,11 +5718,20 @@ static void kick_cpus_irq_workfn(struct irq_work *irq_work)
>>>> 	 * task is picked subsequently. The latter is necessary to break
>>>> 	 * the wait when $cpu is taken by a higher sched class.
>>>> 	 */
>>>> -	if (cpu != cpu_of(this_rq))
>>>> +	if (cpu != this_cpu)
>>>> 		smp_cond_load_acquire(wait_kick_sync, VAL != ksyncs[cpu]);
>>>
>>> Given that irq_work is executed at the end of IRQ handling, we can just
>>> reschedule the irq work when the condition is not met (or separate that out
>>> into its own irq_work). That way, I think we can avoid the global lock.
>>>
>> I'll go poke at it some more, but I think it's not guaranteed that B actually
>> advances kick_sync if A keeps kicking. At least not if the handling is in
>> HARD irqwork? Or what would the separated out irq work do differently?
>
> So in my particular example I do the SCX_KICK_WAIT in ops.enqueue(), which is
> fair, but I don't think we can delay calling that until we've advanced our
> local kick_sync and if we don't we end up in the deadlock, even if e.g. we
> separate out the retry (and make that lazy), because then the local CPU is
> able to continuously issue new kicks (which will have to be handled by the
> non-retry path) without advancing its own kick_sync.
> The closest thing to that I can get working is separating out the
> SCX_KICK_WAIT entirely and make that lazy. In practice though that would
> realistically make the SCX_KICK_WAIT latency most likely a lot higher than
> with the global lock, is that what you had in mind?
> Or am I missing something here?

Something like the attached, for completeness.
Hi Christian,
On Mon, Mar 16, 2026 at 10:02:48AM +0000, Christian Loehle wrote:
> SCX_KICK_WAIT causes kick_cpus_irq_workfn() to busy-wait using
> smp_cond_load_acquire() until the target CPU's current SCX task has been
> context-switched out (its kick_sync counter advanced).
>
> If multiple CPUs each issue SCX_KICK_WAIT targeting one another
> concurrently — e.g. CPU A waits for CPU B, B waits for CPU C, C waits for
> CPU A — all CPUs can end up wedged inside smp_cond_load_acquire()
> simultaneously. Because each victim CPU is spinning in hardirq/irq_work
> context, it cannot reschedule, so no kick_sync counter ever advances and
> the system deadlocks.
>
> Fix this by serializing access to the wait loop behind a global raw
> spinlock (scx_kick_wait_lock). Only one CPU at a time may execute the
> wait loop; any other CPU that has SCX_KICK_WAIT work to do and fails to
> acquire the lock records itself in scx_kick_wait_pending and returns.
> When the active waiter finishes and releases the lock, it replays the
> pending set by re-queuing each pending CPU's kick_cpus_irq_work, ensuring
> no wait request is silently dropped.
>
> This is deliberately a coarse serialization: multiple simultaneous wait
> operations now run sequentially, increasing latency. In exchange,
> deadlocks are impossible regardless of the cycle length (A->B->C->...->A).
>
> Also clear scx_kick_wait_pending in free_kick_syncs() so that any stale
> bits left by a CPU that deferred just as the scheduler exited are reset
> before the next scheduler instance loads.
>
> Fixes: 90e55164dad4 ("sched_ext: Implement SCX_KICK_WAIT")
> Signed-off-by: Christian Loehle <christian.loehle@arm.com>
> ---
> kernel/sched/ext.c | 45 +++++++++++++++++++++++++++++++++++++++++++--
> 1 file changed, 43 insertions(+), 2 deletions(-)
>
> diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
> index 26a6ac2f8826..b63ae13d0486 100644
> --- a/kernel/sched/ext.c
> +++ b/kernel/sched/ext.c
> @@ -89,6 +89,19 @@ struct scx_kick_syncs {
>
> static DEFINE_PER_CPU(struct scx_kick_syncs __rcu *, scx_kick_syncs);
>
> +/*
> + * Serialize %SCX_KICK_WAIT processing across CPUs to avoid wait cycles.
> + * Callers failing to acquire @scx_kick_wait_lock defer by recording
> + * themselves in @scx_kick_wait_pending and are retriggered when the active
> + * waiter completes.
> + *
> + * Lock ordering: @scx_kick_wait_lock is always acquired before
> + * @scx_kick_wait_pending_lock; the two are never taken in the opposite order.
> + */
> +static DEFINE_RAW_SPINLOCK(scx_kick_wait_lock);
> +static DEFINE_RAW_SPINLOCK(scx_kick_wait_pending_lock);
> +static cpumask_t scx_kick_wait_pending;
> +
> /*
> * Direct dispatch marker.
> *
> @@ -4279,6 +4292,13 @@ static void free_kick_syncs(void)
> if (to_free)
> kvfree_rcu(to_free, rcu);
> }
> +
> + /*
> + * Clear any CPUs that were waiting for the lock when the scheduler
> + * exited. Their irq_work has already returned so no in-flight
> + * waiter can observe the stale bits on the next enable.
> + */
> + cpumask_clear(&scx_kick_wait_pending);
Do we need a raw_spin_lock/unlock(&scx_kick_wait_pending_lock) here to make
sure we're not racing with cpumask_set_cpu()/cpumask_clear_cpu()?
Probably it's not that relevant at this point, but I'd keep the locking for
correctness.
Thanks,
-Andrea
> }
>
> static void scx_disable_workfn(struct kthread_work *work)
> @@ -5647,8 +5667,9 @@ static void kick_cpus_irq_workfn(struct irq_work *irq_work)
> struct rq *this_rq = this_rq();
> struct scx_rq *this_scx = &this_rq->scx;
> struct scx_kick_syncs __rcu *ksyncs_pcpu = __this_cpu_read(scx_kick_syncs);
> - bool should_wait = false;
> + bool should_wait = !cpumask_empty(this_scx->cpus_to_wait);
> unsigned long *ksyncs;
> + s32 this_cpu = cpu_of(this_rq);
> s32 cpu;
>
> if (unlikely(!ksyncs_pcpu)) {
> @@ -5672,6 +5693,17 @@ static void kick_cpus_irq_workfn(struct irq_work *irq_work)
> if (!should_wait)
> return;
>
> + if (!raw_spin_trylock(&scx_kick_wait_lock)) {
> + raw_spin_lock(&scx_kick_wait_pending_lock);
> + cpumask_set_cpu(this_cpu, &scx_kick_wait_pending);
> + raw_spin_unlock(&scx_kick_wait_pending_lock);
> + return;
> + }
> +
> + raw_spin_lock(&scx_kick_wait_pending_lock);
> + cpumask_clear_cpu(this_cpu, &scx_kick_wait_pending);
> + raw_spin_unlock(&scx_kick_wait_pending_lock);
> +
> for_each_cpu(cpu, this_scx->cpus_to_wait) {
> unsigned long *wait_kick_sync = &cpu_rq(cpu)->scx.kick_sync;
>
> @@ -5686,11 +5718,20 @@ static void kick_cpus_irq_workfn(struct irq_work *irq_work)
> * task is picked subsequently. The latter is necessary to break
> * the wait when $cpu is taken by a higher sched class.
> */
> - if (cpu != cpu_of(this_rq))
> + if (cpu != this_cpu)
> smp_cond_load_acquire(wait_kick_sync, VAL != ksyncs[cpu]);
>
> cpumask_clear_cpu(cpu, this_scx->cpus_to_wait);
> }
> +
> + raw_spin_unlock(&scx_kick_wait_lock);
> +
> + raw_spin_lock(&scx_kick_wait_pending_lock);
> + for_each_cpu(cpu, &scx_kick_wait_pending) {
> + cpumask_clear_cpu(cpu, &scx_kick_wait_pending);
> + irq_work_queue(&cpu_rq(cpu)->scx.kick_cpus_irq_work);
> + }
> + raw_spin_unlock(&scx_kick_wait_pending_lock);
> }
>
> /**
> --
> 2.34.1
>
On 3/16/26 10:49, Andrea Righi wrote:
> Hi Christian,
>
> On Mon, Mar 16, 2026 at 10:02:48AM +0000, Christian Loehle wrote:
>> SCX_KICK_WAIT causes kick_cpus_irq_workfn() to busy-wait using
>> smp_cond_load_acquire() until the target CPU's current SCX task has been
>> context-switched out (its kick_sync counter advanced).
>>
>> If multiple CPUs each issue SCX_KICK_WAIT targeting one another
>> concurrently — e.g. CPU A waits for CPU B, B waits for CPU C, C waits for
>> CPU A — all CPUs can end up wedged inside smp_cond_load_acquire()
>> simultaneously. Because each victim CPU is spinning in hardirq/irq_work
>> context, it cannot reschedule, so no kick_sync counter ever advances and
>> the system deadlocks.
>>
>> Fix this by serializing access to the wait loop behind a global raw
>> spinlock (scx_kick_wait_lock). Only one CPU at a time may execute the
>> wait loop; any other CPU that has SCX_KICK_WAIT work to do and fails to
>> acquire the lock records itself in scx_kick_wait_pending and returns.
>> When the active waiter finishes and releases the lock, it replays the
>> pending set by re-queuing each pending CPU's kick_cpus_irq_work, ensuring
>> no wait request is silently dropped.
>>
>> This is deliberately a coarse serialization: multiple simultaneous wait
>> operations now run sequentially, increasing latency. In exchange,
>> deadlocks are impossible regardless of the cycle length (A->B->C->...->A).
>>
>> Also clear scx_kick_wait_pending in free_kick_syncs() so that any stale
>> bits left by a CPU that deferred just as the scheduler exited are reset
>> before the next scheduler instance loads.
>>
>> Fixes: 90e55164dad4 ("sched_ext: Implement SCX_KICK_WAIT")
>> Signed-off-by: Christian Loehle <christian.loehle@arm.com>
>> ---
>> kernel/sched/ext.c | 45 +++++++++++++++++++++++++++++++++++++++++++--
>> 1 file changed, 43 insertions(+), 2 deletions(-)
>>
>> diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
>> index 26a6ac2f8826..b63ae13d0486 100644
>> --- a/kernel/sched/ext.c
>> +++ b/kernel/sched/ext.c
>> @@ -89,6 +89,19 @@ struct scx_kick_syncs {
>>
>> static DEFINE_PER_CPU(struct scx_kick_syncs __rcu *, scx_kick_syncs);
>>
>> +/*
>> + * Serialize %SCX_KICK_WAIT processing across CPUs to avoid wait cycles.
>> + * Callers failing to acquire @scx_kick_wait_lock defer by recording
>> + * themselves in @scx_kick_wait_pending and are retriggered when the active
>> + * waiter completes.
>> + *
>> + * Lock ordering: @scx_kick_wait_lock is always acquired before
>> + * @scx_kick_wait_pending_lock; the two are never taken in the opposite order.
>> + */
>> +static DEFINE_RAW_SPINLOCK(scx_kick_wait_lock);
>> +static DEFINE_RAW_SPINLOCK(scx_kick_wait_pending_lock);
>> +static cpumask_t scx_kick_wait_pending;
>> +
>> /*
>> * Direct dispatch marker.
>> *
>> @@ -4279,6 +4292,13 @@ static void free_kick_syncs(void)
>> if (to_free)
>> kvfree_rcu(to_free, rcu);
>> }
>> +
>> + /*
>> + * Clear any CPUs that were waiting for the lock when the scheduler
>> + * exited. Their irq_work has already returned so no in-flight
>> + * waiter can observe the stale bits on the next enable.
>> + */
>> + cpumask_clear(&scx_kick_wait_pending);
>
> Do we need a raw_spin_lock/unlock(&scx_kick_wait_pending_lock) here to make
> sure we're not racing with cpumask_set_cpu()/cpumask_clear_cpu()?
> Probably it's not that relevant at this point, but I'd keep the locking for
> correctness.
Of course, thanks. Noted for v2!
Are you fine with the approach, i.e. hitting it with the sledgehammer of global
serialization?
I have something more complex in mind too, but yeah, we'd need to at least either
let scx_bpf_kick_cpu() fail / -ERETRY or restrict kicking/kicked CPUs and introduce
a whole lot of infra, which seems a bit overkill for an apparently barely used
interface and also would be nasty to backport.
On Mon, Mar 16, 2026 at 11:12:43AM +0000, Christian Loehle wrote:
> On 3/16/26 10:49, Andrea Righi wrote:
> > Hi Christian,
> >
> > On Mon, Mar 16, 2026 at 10:02:48AM +0000, Christian Loehle wrote:
> >> SCX_KICK_WAIT causes kick_cpus_irq_workfn() to busy-wait using
> >> smp_cond_load_acquire() until the target CPU's current SCX task has been
> >> context-switched out (its kick_sync counter advanced).
> >>
> >> If multiple CPUs each issue SCX_KICK_WAIT targeting one another
> >> concurrently — e.g. CPU A waits for CPU B, B waits for CPU C, C waits for
> >> CPU A — all CPUs can end up wedged inside smp_cond_load_acquire()
> >> simultaneously. Because each victim CPU is spinning in hardirq/irq_work
> >> context, it cannot reschedule, so no kick_sync counter ever advances and
> >> the system deadlocks.
> >>
> >> Fix this by serializing access to the wait loop behind a global raw
> >> spinlock (scx_kick_wait_lock). Only one CPU at a time may execute the
> >> wait loop; any other CPU that has SCX_KICK_WAIT work to do and fails to
> >> acquire the lock records itself in scx_kick_wait_pending and returns.
> >> When the active waiter finishes and releases the lock, it replays the
> >> pending set by re-queuing each pending CPU's kick_cpus_irq_work, ensuring
> >> no wait request is silently dropped.
> >>
> >> This is deliberately a coarse serialization: multiple simultaneous wait
> >> operations now run sequentially, increasing latency. In exchange,
> >> deadlocks are impossible regardless of the cycle length (A->B->C->...->A).
> >>
> >> Also clear scx_kick_wait_pending in free_kick_syncs() so that any stale
> >> bits left by a CPU that deferred just as the scheduler exited are reset
> >> before the next scheduler instance loads.
> >>
> >> Fixes: 90e55164dad4 ("sched_ext: Implement SCX_KICK_WAIT")
> >> Signed-off-by: Christian Loehle <christian.loehle@arm.com>
> >> ---
> >> kernel/sched/ext.c | 45 +++++++++++++++++++++++++++++++++++++++++++--
> >> 1 file changed, 43 insertions(+), 2 deletions(-)
> >>
> >> diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
> >> index 26a6ac2f8826..b63ae13d0486 100644
> >> --- a/kernel/sched/ext.c
> >> +++ b/kernel/sched/ext.c
> >> @@ -89,6 +89,19 @@ struct scx_kick_syncs {
> >>
> >> static DEFINE_PER_CPU(struct scx_kick_syncs __rcu *, scx_kick_syncs);
> >>
> >> +/*
> >> + * Serialize %SCX_KICK_WAIT processing across CPUs to avoid wait cycles.
> >> + * Callers failing to acquire @scx_kick_wait_lock defer by recording
> >> + * themselves in @scx_kick_wait_pending and are retriggered when the active
> >> + * waiter completes.
> >> + *
> >> + * Lock ordering: @scx_kick_wait_lock is always acquired before
> >> + * @scx_kick_wait_pending_lock; the two are never taken in the opposite order.
> >> + */
> >> +static DEFINE_RAW_SPINLOCK(scx_kick_wait_lock);
> >> +static DEFINE_RAW_SPINLOCK(scx_kick_wait_pending_lock);
> >> +static cpumask_t scx_kick_wait_pending;
> >> +
> >> /*
> >> * Direct dispatch marker.
> >> *
> >> @@ -4279,6 +4292,13 @@ static void free_kick_syncs(void)
> >> if (to_free)
> >> kvfree_rcu(to_free, rcu);
> >> }
> >> +
> >> + /*
> >> + * Clear any CPUs that were waiting for the lock when the scheduler
> >> + * exited. Their irq_work has already returned so no in-flight
> >> + * waiter can observe the stale bits on the next enable.
> >> + */
> >> + cpumask_clear(&scx_kick_wait_pending);
> >
> > Do we need a raw_spin_lock/unlock(&scx_kick_wait_pending_lock) here to make
> > sure we're not racing with cpumask_set_cpu()/cpumask_clear_cpu()?
> > Probably it's not that relevant at this point, but I'd keep the locking for
> > correctness.
>
> Of course, thanks. Noted for v2!
>
> Are you fine with the approach, i.e. hitting it with the sledgehammer of global
> serialization?
> I have something more complex in mind too, but yeah, we'd need to at least either
> let scx_bpf_kick_cpu() fail / -ERETRY or restrict kicking/kicked CPUs and introduce
> a whole lot of infra, which seems a bit overkill for an apparently barely used
> interface and also would be nasty to backport.
Yes, the current approach looks reasonable to me, I think the potential
latency increase (assuming there's any noticeable increase) is totally
acceptable in order to fix the deadlock.
Thanks,
-Andrea