[PATCH] sched/proxy_exec: Limit find_proxy_task() chain depth to prevent CPU hang

soolaugust@gmail.com posted 1 patch 5 hours ago
From: zhidao su <suzhidao@xiaomi.com>

find_proxy_task() follows the blocked_on chain with:

  for (p = donor; task_is_blocked(p); p = owner)

The existing WARN_ON(owner == p) only detects immediate self-loops
(a task waiting on a mutex it already owns). It does not detect
multi-task cycles: if tasks A and B form a cycle where A waits on
B's mutex and B waits on A's mutex, the chain traversal loops forever
between A and B, hanging the CPU indefinitely while holding rq->lock.

The scenario is real under PE: mutex-blocked tasks are kept on the
runqueue (try_to_block_task() with should_block=false), so both A and
B remain selectable by pick_next_task(). When A is selected as donor,
find_proxy_task() follows A->mutex_B->owner=B->mutex_A->owner=A->...
with no termination condition for cycles.

rt-mutex guards against the same problem with max_lock_depth (default 1024),
printing a warning and returning -EDEADLK when the chain grows too deep.

Add a chain_depth counter capped at MAX_PROXY_CHAIN_DEPTH=64. When the cap
is exceeded, emit a WARN_ONCE and call proxy_resched_idle() to briefly
schedule idle, consistent with how the function handles other unresolvable
states (e.g., owner migrating, curr_in_chain bailouts). This keeps the
kernel healthy without spinning; resolving the deadlock itself remains the
caller's problem.

Tested with a built-in boot-param test (pe_cycle_test) that creates two
kthreads on CPU 0 each holding one kernel mutex while trying to acquire
the other, forming an A->B->A deadlock cycle.

With this fix:

  [  111.758150] sched/pe: proxy chain depth exceeded 64, possible deadlock cycle involving pid 120
  [  111.758150] WARNING: CPU: 0 PID: 119 at kernel/sched/core.c:7339 __schedule+0x1e6e/0x1e80
  ...
  [  112.694277] pe_cycle_test: still alive after 1s (CPU not hung)

Without this fix, the NMI watchdog (nmi_watchdog=1, watchdog_thresh=15)
fires a hard LOCKUP on CPU 0 with RIP in do_raw_spin_lock, called from
__schedule, confirming the CPU spins inside find_proxy_task() holding
rq->lock with no forward progress:

  [  109.951781] watchdog: CPU0: Watchdog detected hard LOCKUP on cpu 0
  [  109.951781] RIP: 0010:do_raw_spin_lock+0x3e/0xb0
  [  109.951781] Call Trace:
  [  109.951781]  __schedule+0x11e7/0x1e10
  [  109.951781]  schedule_preempt_disabled+0x18/0x30
  [  109.951781]  __mutex_lock+0x6f0/0xac0
  [  109.951781]  pe_test_thread_a+0x9c/0xe0

Fixes: 7de9d4f94638 ("sched: Start blocked_on chain processing in find_proxy_task()")
Signed-off-by: zhidao su <suzhidao@xiaomi.com>
---
 kernel/sched/core.c | 17 +++++++++++++++++
 1 file changed, 17 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 3f3425c6b2f2..bafb59432f7f 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -7310,6 +7310,17 @@ DEFINE_LOCK_GUARD_1(blocked_on_lock, struct blocked_on_lock,
  * Returns the task that is going to be used as execution context (the one
  * that is actually going to be run on cpu_of(rq)).
  */
+/*
+ * Limit proxy chain traversal depth to avoid infinite loops in pathological
+ * cases (e.g., A waits for B's mutex while B waits for A's mutex). The
+ * existing WARN_ON(owner == p) only catches immediate self-loops; multi-task
+ * cycles like A->B->A are not detected without a depth counter.
+ *
+ * rt-mutex uses a similar guard (max_lock_depth = 1024). We use a smaller
+ * limit since proxy chains are expected to be short in practice.
+ */
+#define MAX_PROXY_CHAIN_DEPTH	64
+
 static struct task_struct *
 find_proxy_task(struct rq *rq, struct task_struct *donor, struct rq_flags *rf)
 	__must_hold(__rq_lockp(rq))
@@ -7318,11 +7329,17 @@ find_proxy_task(struct rq *rq, struct task_struct *donor, struct rq_flags *rf)
 	struct task_struct *owner = NULL;
 	bool curr_in_chain = false;
 	int this_cpu = cpu_of(rq);
+	int chain_depth = 0;
 	struct task_struct *p;
 	int owner_cpu;
 
 	/* Follow blocked_on chain. */
 	for (p = donor; task_is_blocked(p); p = owner) {
+		if (++chain_depth > MAX_PROXY_CHAIN_DEPTH) {
+			WARN_ONCE(1, "sched/pe: proxy chain depth exceeded %d, possible deadlock cycle involving pid %d\n",
+				  MAX_PROXY_CHAIN_DEPTH, p->pid);
+			return proxy_resched_idle(rq);
+		}
 		/* copy the entire blocked_on structure */
 		raw_spin_lock(&p->blocked_lock);
 		bo = p->blocked_on;
-- 
2.43.0