From nobody Fri Apr 3 08:34:57 2026
Date: Tue, 24 Mar 2026 19:13:16 +0000
In-Reply-To: <20260324191337.1841376-1-jstultz@google.com>
Precedence: bulk
X-Mailing-List: linux-kernel@vger.kernel.org
Mime-Version: 1.0
References:
<20260324191337.1841376-1-jstultz@google.com>
X-Mailer: git-send-email 2.53.0.1018.g2bb0e51243-goog
Message-ID: <20260324191337.1841376-2-jstultz@google.com>
Subject: [PATCH v26 01/10] sched: Make class_schedulers avoid pushing current, and get rid of proxy_tag_curr()
From: John Stultz
To: LKML
Cc: John Stultz, K Prateek Nayak, Peter Zijlstra, Joel Fernandes, Qais Yousef, Ingo Molnar, Juri Lelli, Vincent Guittot, Dietmar Eggemann, Valentin Schneider, Steven Rostedt, Ben Segall, Zimuzo Ezeozue, Mel Gorman, Will Deacon, Waiman Long, Boqun Feng, "Paul E. McKenney", Metin Kaya, Xuewen Yan, Thomas Gleixner, Daniel Lezcano, Suleiman Souhlal, kuyo chang, hupu, kernel-team@android.com
Content-Type: text/plain; charset="utf-8"

With proxy-execution, the scheduler selects the donor, but for blocked
donors we end up running the lock owner instead. This caused some
complexity: the class schedulers remove the task they pick from their
pushable-task lists, which prevents the donor from being migrated, but
nothing prevented rq->curr from being migrated when rq->curr != rq->donor.

This was hacked around by calling proxy_tag_curr() on the rq->curr task
whenever we were running something other than the donor. proxy_tag_curr()
did a dequeue/enqueue pair on the rq->curr task, allowing the class
schedulers to remove it from their pushable lists.

The dequeue/enqueue pair was wasteful, and additionally K Prateek
highlighted that we didn't properly undo things when we stopped proxying,
leaving the lock owner off the pushable list.

After some alternative approaches were considered, Peter suggested having
the RT/DL classes simply avoid migrating a task that is task_on_cpu(). So
rework pick_next_pushable_dl_task() and the RT pick_next_pushable_task()
so that they skip over pushable tasks that are on_cpu, and then drop all
of the proxy_tag_curr() logic.
Fixes: be39617e38e0 ("sched: Fix proxy/current (push,pull)ability")
Reported-by: K Prateek Nayak
Closes: https://lore.kernel.org/lkml/e735cae0-2cc9-4bae-b761-fcb082ed3e94@amd.com/
Suggested-by: Peter Zijlstra
Signed-off-by: John Stultz
---
v26:
* Fix issue Juri noticed by using a separate iterator value in
  pick_next_pushable_dl_task()
Cc: Joel Fernandes
Cc: Qais Yousef
Cc: Ingo Molnar
Cc: Peter Zijlstra
Cc: Juri Lelli
Cc: Vincent Guittot
Cc: Dietmar Eggemann
Cc: Valentin Schneider
Cc: Steven Rostedt
Cc: Ben Segall
Cc: Zimuzo Ezeozue
Cc: Mel Gorman
Cc: Will Deacon
Cc: Waiman Long
Cc: Boqun Feng
Cc: "Paul E. McKenney"
Cc: Metin Kaya
Cc: Xuewen Yan
Cc: K Prateek Nayak
Cc: Thomas Gleixner
Cc: Daniel Lezcano
Cc: Suleiman Souhlal
Cc: kuyo chang
Cc: hupu
Cc: kernel-team@android.com
---
 kernel/sched/core.c     | 24 ------------------------
 kernel/sched/deadline.c | 18 ++++++++++++++++--
 kernel/sched/rt.c       | 15 ++++++++++++---
 3 files changed, 28 insertions(+), 29 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 496dff740dcaf..92b1807c05a4e 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6705,23 +6705,6 @@ find_proxy_task(struct rq *rq, struct task_struct *donor, struct rq_flags *rf)
 }
 #endif /* SCHED_PROXY_EXEC */
 
-static inline void proxy_tag_curr(struct rq *rq, struct task_struct *owner)
-{
-	if (!sched_proxy_exec())
-		return;
-	/*
-	 * pick_next_task() calls set_next_task() on the chosen task
-	 * at some point, which ensures it is not push/pullable.
-	 * However, the chosen/donor task *and* the mutex owner form an
-	 * atomic pair wrt push/pull.
-	 *
-	 * Make sure owner we run is not pushable. Unfortunately we can
-	 * only deal with that by means of a dequeue/enqueue cycle. :-/
-	 */
-	dequeue_task(rq, owner, DEQUEUE_NOCLOCK | DEQUEUE_SAVE);
-	enqueue_task(rq, owner, ENQUEUE_NOCLOCK | ENQUEUE_RESTORE);
-}
-
 /*
  * __schedule() is the main scheduler function.
 *
@@ -6874,9 +6857,6 @@ static void __sched notrace __schedule(int sched_mode)
 	 */
 	RCU_INIT_POINTER(rq->curr, next);
 
-	if (!task_current_donor(rq, next))
-		proxy_tag_curr(rq, next);
-
 	/*
 	 * The membarrier system call requires each architecture
 	 * to have a full memory barrier after updating
@@ -6910,10 +6890,6 @@ static void __sched notrace __schedule(int sched_mode)
 		/* Also unlocks the rq: */
 		rq = context_switch(rq, prev, next, &rf);
 	} else {
-		/* In case next was already curr but just got blocked_donor */
-		if (!task_current_donor(rq, next))
-			proxy_tag_curr(rq, next);
-
 		rq_unpin_lock(rq, &rf);
 		__balance_callbacks(rq, NULL);
 		raw_spin_rq_unlock_irq(rq);
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index d08b004293234..52c524f5ba4dd 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -2801,12 +2801,26 @@ static int find_later_rq(struct task_struct *task)
 
 static struct task_struct *pick_next_pushable_dl_task(struct rq *rq)
 {
-	struct task_struct *p;
+	struct task_struct *i, *p = NULL;
+	struct rb_node *next_node;
 
 	if (!has_pushable_dl_tasks(rq))
 		return NULL;
 
-	p = __node_2_pdl(rb_first_cached(&rq->dl.pushable_dl_tasks_root));
+	next_node = rb_first_cached(&rq->dl.pushable_dl_tasks_root);
+	while (next_node) {
+		i = __node_2_pdl(next_node);
+		/* make sure task isn't on_cpu (possible with proxy-exec) */
+		if (!task_on_cpu(rq, i)) {
+			p = i;
+			break;
+		}
+
+		next_node = rb_next(next_node);
+	}
+
+	if (!p)
+		return NULL;
 
 	WARN_ON_ONCE(rq->cpu != task_cpu(p));
 	WARN_ON_ONCE(task_current(rq, p));
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index f69e1f16d9238..61569b622d1a3 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -1853,13 +1853,22 @@ static int find_lowest_rq(struct task_struct *task)
 
 static struct task_struct *pick_next_pushable_task(struct rq *rq)
 {
-	struct task_struct *p;
+	struct plist_head *head = &rq->rt.pushable_tasks;
+	struct task_struct *i, *p = NULL;
 
 	if (!has_pushable_tasks(rq))
 		return NULL;
 
-	p = plist_first_entry(&rq->rt.pushable_tasks,
-			      struct task_struct, pushable_tasks);
+	plist_for_each_entry(i, head, pushable_tasks) {
+		/* make sure task isn't on_cpu (possible with proxy-exec) */
+		if (!task_on_cpu(rq, i)) {
+			p = i;
+			break;
+		}
+	}
+
+	if (!p)
+		return NULL;
 
 	BUG_ON(rq->cpu != task_cpu(p));
 	BUG_ON(task_current(rq, p));
-- 
2.53.0.1018.g2bb0e51243-goog

From nobody Fri Apr 3 08:34:57 2026
Date: Tue, 24 Mar 2026 19:13:17 +0000
In-Reply-To: <20260324191337.1841376-1-jstultz@google.com>
References: <20260324191337.1841376-1-jstultz@google.com>
Message-ID: <20260324191337.1841376-3-jstultz@google.com>
Subject: [PATCH v26 02/10] sched: Minimise repeated sched_proxy_exec() checking
From: John Stultz
To: LKML
Cc: John Stultz, K Prateek Nayak, Peter Zijlstra, Joel Fernandes, Qais Yousef, Ingo Molnar, Juri Lelli, Vincent Guittot, Dietmar Eggemann, Valentin Schneider, Steven Rostedt, Ben Segall, Zimuzo Ezeozue, Mel Gorman, Will Deacon, Waiman Long, Boqun Feng, "Paul E. McKenney", Metin Kaya, Xuewen Yan, Thomas Gleixner, Daniel Lezcano, Suleiman Souhlal, kuyo chang, hupu, kernel-team@android.com
Content-Type: text/plain; charset="utf-8"

Peter noted:

  Compilers are really bad (as in they utterly refuse) at optimizing
  the static branch things (even when marked with __pure), and will
  happily emit multiple identical checks in a row.

So pull out the one obvious sched_proxy_exec() branch in __schedule()
and remove some of the 'implicit' ones in that path.

Reviewed-by: K Prateek Nayak
Suggested-by: Peter Zijlstra
Signed-off-by: John Stultz
---
Cc: Joel Fernandes
Cc: Qais Yousef
Cc: Ingo Molnar
Cc: Peter Zijlstra
Cc: Juri Lelli
Cc: Vincent Guittot
Cc: Dietmar Eggemann
Cc: Valentin Schneider
Cc: Steven Rostedt
Cc: Ben Segall
Cc: Zimuzo Ezeozue
Cc: Mel Gorman
Cc: Will Deacon
Cc: Waiman Long
Cc: Boqun Feng
Cc: "Paul E. McKenney"
Cc: Metin Kaya
Cc: Xuewen Yan
Cc: K Prateek Nayak
Cc: Thomas Gleixner
Cc: Daniel Lezcano
Cc: Suleiman Souhlal
Cc: kuyo chang
Cc: hupu
Cc: kernel-team@android.com
---
 kernel/sched/core.c | 20 +++++++++-----------
 1 file changed, 9 insertions(+), 11 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 92b1807c05a4e..dc044a405f83b 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6600,11 +6600,7 @@ find_proxy_task(struct rq *rq, struct task_struct *donor, struct rq_flags *rf)
 	struct mutex *mutex;
 
 	/* Follow blocked_on chain. */
-	for (p = donor; task_is_blocked(p); p = owner) {
-		mutex = p->blocked_on;
-		/* Something changed in the chain, so pick again */
-		if (!mutex)
-			return NULL;
+	for (p = donor; (mutex = p->blocked_on); p = owner) {
 		/*
 		 * By taking mutex->wait_lock we hold off concurrent mutex_unlock()
 		 * and ensure @owner sticks around.
@@ -6835,12 +6831,14 @@ static void __sched notrace __schedule(int sched_mode)
 	next = pick_next_task(rq, rq->donor, &rf);
 	rq_set_donor(rq, next);
 	rq->next_class = next->sched_class;
-	if (unlikely(task_is_blocked(next))) {
-		next = find_proxy_task(rq, next, &rf);
-		if (!next)
-			goto pick_again;
-		if (next == rq->idle)
-			goto keep_resched;
+	if (sched_proxy_exec()) {
+		if (unlikely(next->blocked_on)) {
+			next = find_proxy_task(rq, next, &rf);
+			if (!next)
+				goto pick_again;
+			if (next == rq->idle)
+				goto keep_resched;
+		}
 	}
 picked:
 	clear_tsk_need_resched(prev);
-- 
2.53.0.1018.g2bb0e51243-goog

From nobody Fri Apr 3 08:34:57 2026
Date: Tue, 24 Mar 2026 19:13:18 +0000
In-Reply-To: <20260324191337.1841376-1-jstultz@google.com>
References: <20260324191337.1841376-1-jstultz@google.com>
Message-ID: <20260324191337.1841376-4-jstultz@google.com>
Subject: [PATCH v26 03/10] sched: Fix potentially missing balancing with Proxy Exec
From: John Stultz
To: LKML
Cc: John Stultz, K Prateek Nayak, Peter Zijlstra, Joel Fernandes, Qais Yousef, Ingo Molnar, Juri Lelli, Vincent Guittot, Dietmar Eggemann, Valentin Schneider, Steven Rostedt, Ben Segall, Zimuzo Ezeozue, Mel Gorman, Will Deacon, Waiman Long, Boqun Feng, "Paul E. McKenney", Metin Kaya, Xuewen Yan, Thomas Gleixner, Daniel Lezcano, Suleiman Souhlal, kuyo chang, hupu, kernel-team@android.com
Content-Type: text/plain; charset="utf-8"

K Prateek pointed out that with Proxy Exec we may have cases where we
context switch in __schedule() while the donor remains the same. This
could cause balancing issues, since the put_prev_set_next() logic
short-circuits when prev == next. With proxy-exec, prev is the previous
donor and next is the next donor. Should the donor remain the same while
a different task is picked to actually run, the shortcut will have
skipped enqueueing the sched class balance callback.

So, if we are context switching, add logic to catch the same-donor case
and trigger the put_prev/set_next calls, ensuring the balance callbacks
get enqueued.

Reported-by: K Prateek Nayak
Closes: https://lore.kernel.org/lkml/20ea3670-c30a-433b-a07f-c4ff98ae2379@amd.com/
Suggested-by: Peter Zijlstra
Signed-off-by: John Stultz
---
Cc: Joel Fernandes
Cc: Qais Yousef
Cc: Ingo Molnar
Cc: Peter Zijlstra
Cc: Juri Lelli
Cc: Vincent Guittot
Cc: Dietmar Eggemann
Cc: Valentin Schneider
Cc: Steven Rostedt
Cc: Ben Segall
Cc: Zimuzo Ezeozue
Cc: Mel Gorman
Cc: Will Deacon
Cc: Waiman Long
Cc: Boqun Feng
Cc: "Paul E. McKenney"
Cc: Metin Kaya
Cc: Xuewen Yan
Cc: K Prateek Nayak
Cc: Thomas Gleixner
Cc: Daniel Lezcano
Cc: Suleiman Souhlal
Cc: kuyo chang
Cc: hupu
Cc: kernel-team@android.com
---
 kernel/sched/core.c | 24 +++++++++++++++++++++++-
 1 file changed, 23 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index dc044a405f83b..610e48cdb66a9 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6829,9 +6829,11 @@ static void __sched notrace __schedule(int sched_mode)
 
 pick_again:
 	next = pick_next_task(rq, rq->donor, &rf);
-	rq_set_donor(rq, next);
 	rq->next_class = next->sched_class;
 	if (sched_proxy_exec()) {
+		struct task_struct *prev_donor = rq->donor;
+
+		rq_set_donor(rq, next);
 		if (unlikely(next->blocked_on)) {
 			next = find_proxy_task(rq, next, &rf);
 			if (!next)
@@ -6839,7 +6841,27 @@ static void __sched notrace __schedule(int sched_mode)
 			if (next == rq->idle)
 				goto keep_resched;
 		}
+		if (rq->donor == prev_donor && prev != next) {
+			struct task_struct *donor = rq->donor;
+			/*
+			 * When transitioning like:
+			 *
+			 *        prev    next
+			 * donor:  B       B
+			 * curr:   A       B or C
+			 *
+			 * then put_prev_set_next_task() will not have done
+			 * anything, since B == B. However, A might have
+			 * missed a RT/DL balance opportunity due to being
+			 * on_cpu.
+			 */
+			donor->sched_class->put_prev_task(rq, donor, donor);
+			donor->sched_class->set_next_task(rq, donor, true);
+		}
+	} else {
+		rq_set_donor(rq, next);
 	}
+
 picked:
 	clear_tsk_need_resched(prev);
 	clear_preempt_need_resched();
-- 
2.53.0.1018.g2bb0e51243-goog

From nobody Fri Apr 3 08:34:57 2026
Date: Tue, 24 Mar 2026 19:13:19 +0000
In-Reply-To: <20260324191337.1841376-1-jstultz@google.com>
References: <20260324191337.1841376-1-jstultz@google.com>
Message-ID: <20260324191337.1841376-5-jstultz@google.com>
Subject: [PATCH v26 04/10] locking: Add task::blocked_lock to serialize blocked_on state
From: John Stultz
To: LKML
Cc: John Stultz, K Prateek Nayak, Joel Fernandes, Qais Yousef, Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann, Valentin Schneider, Steven Rostedt, Ben Segall, Zimuzo Ezeozue, Mel Gorman, Will Deacon, Waiman Long, Boqun Feng, "Paul E. McKenney", Metin Kaya, Xuewen Yan, Thomas Gleixner, Daniel Lezcano, Suleiman Souhlal, kuyo chang, hupu, kernel-team@android.com
Content-Type: text/plain; charset="utf-8"

So far we have been able to use the mutex::wait_lock for serializing
the blocked_on state, but when we move to proxying across runqueues we
will need to add more state, and a way to serialize changes to that
state in contexts where we don't hold the mutex::wait_lock.

So introduce task::blocked_lock, which nests under the mutex::wait_lock
in the locking order, and rework the locking to use it.

Signed-off-by: John Stultz
Reviewed-by: K Prateek Nayak
---
v15:
* Split back out into later in the series
v16:
* Fixups to mark tasks unblocked before sleeping in
  mutex_optimistic_spin()
* Rework to use guard() as suggested by Peter
v19:
* Rework logic for PREEMPT_RT issues reported by K Prateek Nayak
v21:
* After recently thinking more on ww_mutex code, I reworked the
  blocked_lock usage in mutex lock to avoid having to take nested
  locks in the ww_mutex paths, as I was concerned the lock ordering
  constraints weren't as strong as I had previously thought.
v22:
* Added some extra spaces to avoid dense code blocks, suggested by
  K Prateek
v23:
* Move get_task_blocked_on() to kernel/locking/mutex.h as requested
  by PeterZ
Cc: Joel Fernandes
Cc: Qais Yousef
Cc: Ingo Molnar
Cc: Peter Zijlstra
Cc: Juri Lelli
Cc: Vincent Guittot
Cc: Dietmar Eggemann
Cc: Valentin Schneider
Cc: Steven Rostedt
Cc: Ben Segall
Cc: Zimuzo Ezeozue
Cc: Mel Gorman
Cc: Will Deacon
Cc: Waiman Long
Cc: Boqun Feng
Cc: "Paul E. McKenney"
Cc: Metin Kaya
Cc: Xuewen Yan
Cc: K Prateek Nayak
Cc: Thomas Gleixner
Cc: Daniel Lezcano
Cc: Suleiman Souhlal
Cc: kuyo chang
Cc: hupu
Cc: kernel-team@android.com
---
 include/linux/sched.h        | 48 +++++++++++++-----------------
 init/init_task.c             |  1 +
 kernel/fork.c                |  1 +
 kernel/locking/mutex-debug.c |  4 +--
 kernel/locking/mutex.c       | 40 +++++++++++++++++++-----------
 kernel/locking/mutex.h       |  6 +++++
 kernel/locking/ww_mutex.h    |  4 +--
 kernel/sched/core.c          |  4 ++-
 8 files changed, 58 insertions(+), 50 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 5a5d3dbc9cdf3..2eef9bc6daaab 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1238,6 +1238,7 @@ struct task_struct {
 #endif
 
 	struct mutex			*blocked_on;	/* lock we're blocked on */
+	raw_spinlock_t			blocked_lock;
 
 #ifdef CONFIG_DETECT_HUNG_TASK_BLOCKER
 	/*
@@ -2181,57 +2182,42 @@ extern int __cond_resched_rwlock_write(rwlock_t *lock) __must_hold(lock);
 #ifndef CONFIG_PREEMPT_RT
 static inline struct mutex *__get_task_blocked_on(struct task_struct *p)
 {
-	struct mutex *m = p->blocked_on;
-
-	if (m)
-		lockdep_assert_held_once(&m->wait_lock);
-	return m;
+	lockdep_assert_held_once(&p->blocked_lock);
+	return p->blocked_on;
 }
 
 static inline void __set_task_blocked_on(struct task_struct *p, struct mutex *m)
 {
-	struct mutex *blocked_on = READ_ONCE(p->blocked_on);
-
 	WARN_ON_ONCE(!m);
 	/* The task should only be setting itself as blocked */
 	WARN_ON_ONCE(p != current);
-	/* Currently we serialize blocked_on under the mutex::wait_lock */
-	lockdep_assert_held_once(&m->wait_lock);
+	/* Currently we serialize blocked_on under the task::blocked_lock */
+	lockdep_assert_held_once(&p->blocked_lock);
 	/*
 	 * Check ensure we don't overwrite existing mutex value
 	 * with a different mutex. Note, setting it to the same
 	 * lock repeatedly is ok.
	 */
-	WARN_ON_ONCE(blocked_on && blocked_on != m);
-	WRITE_ONCE(p->blocked_on, m);
-}
-
-static inline void set_task_blocked_on(struct task_struct *p, struct mutex *m)
-{
-	guard(raw_spinlock_irqsave)(&m->wait_lock);
-	__set_task_blocked_on(p, m);
+	WARN_ON_ONCE(p->blocked_on && p->blocked_on != m);
+	p->blocked_on = m;
 }
 
 static inline void __clear_task_blocked_on(struct task_struct *p, struct mutex *m)
 {
-	if (m) {
-		struct mutex *blocked_on = READ_ONCE(p->blocked_on);
-
-		/* Currently we serialize blocked_on under the mutex::wait_lock */
-		lockdep_assert_held_once(&m->wait_lock);
-		/*
-		 * There may be cases where we re-clear already cleared
-		 * blocked_on relationships, but make sure we are not
-		 * clearing the relationship with a different lock.
-		 */
-		WARN_ON_ONCE(blocked_on && blocked_on != m);
-	}
-	WRITE_ONCE(p->blocked_on, NULL);
+	/* Currently we serialize blocked_on under the task::blocked_lock */
+	lockdep_assert_held_once(&p->blocked_lock);
+	/*
+	 * There may be cases where we re-clear already cleared
+	 * blocked_on relationships, but make sure we are not
+	 * clearing the relationship with a different lock.
+	 */
+	WARN_ON_ONCE(m && p->blocked_on && p->blocked_on != m);
+	p->blocked_on = NULL;
 }
 
 static inline void clear_task_blocked_on(struct task_struct *p, struct mutex *m)
 {
-	guard(raw_spinlock_irqsave)(&m->wait_lock);
+	guard(raw_spinlock_irqsave)(&p->blocked_lock);
 	__clear_task_blocked_on(p, m);
 }
 #else
diff --git a/init/init_task.c b/init/init_task.c
index 5c838757fc10e..b5f48ebdc2b6e 100644
--- a/init/init_task.c
+++ b/init/init_task.c
@@ -169,6 +169,7 @@ struct task_struct init_task __aligned(L1_CACHE_BYTES) = {
 	.journal_info	= NULL,
 	INIT_CPU_TIMERS(init_task)
 	.pi_lock	= __RAW_SPIN_LOCK_UNLOCKED(init_task.pi_lock),
+	.blocked_lock	= __RAW_SPIN_LOCK_UNLOCKED(init_task.blocked_lock),
 	.timer_slack_ns	= 50000, /* 50 usec default slack */
 	.thread_pid	= &init_struct_pid,
 	.thread_node	= LIST_HEAD_INIT(init_signals.thread_head),
diff --git a/kernel/fork.c b/kernel/fork.c
index bc2bf58b93b65..079802cb61002 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2076,6 +2076,7 @@ __latent_entropy struct task_struct *copy_process(
 	ftrace_graph_init_task(p);
 
 	rt_mutex_init_task(p);
+	raw_spin_lock_init(&p->blocked_lock);
 
 	lockdep_assert_irqs_enabled();
 #ifdef CONFIG_PROVE_LOCKING
diff --git a/kernel/locking/mutex-debug.c b/kernel/locking/mutex-debug.c
index 2c6b02d4699be..cc6aa9c6e9813 100644
--- a/kernel/locking/mutex-debug.c
+++ b/kernel/locking/mutex-debug.c
@@ -54,13 +54,13 @@ void debug_mutex_add_waiter(struct mutex *lock, struct mutex_waiter *waiter,
 	lockdep_assert_held(&lock->wait_lock);
 
 	/* Current thread can't be already blocked (since it's executing!)
*/ - DEBUG_LOCKS_WARN_ON(__get_task_blocked_on(task)); + DEBUG_LOCKS_WARN_ON(get_task_blocked_on(task)); } =20 void debug_mutex_remove_waiter(struct mutex *lock, struct mutex_waiter *wa= iter, struct task_struct *task) { - struct mutex *blocked_on =3D __get_task_blocked_on(task); + struct mutex *blocked_on =3D get_task_blocked_on(task); =20 DEBUG_LOCKS_WARN_ON(list_empty(&waiter->list)); DEBUG_LOCKS_WARN_ON(waiter->task !=3D task); diff --git a/kernel/locking/mutex.c b/kernel/locking/mutex.c index 2a1d165b3167e..4aa79bcab08c7 100644 --- a/kernel/locking/mutex.c +++ b/kernel/locking/mutex.c @@ -656,6 +656,7 @@ __mutex_lock_common(struct mutex *lock, unsigned int st= ate, unsigned int subclas goto err_early_kill; } =20 + raw_spin_lock(¤t->blocked_lock); __set_task_blocked_on(current, lock); set_current_state(state); trace_contention_begin(lock, LCB_F_MUTEX); @@ -669,8 +670,9 @@ __mutex_lock_common(struct mutex *lock, unsigned int st= ate, unsigned int subclas * the handoff. */ if (__mutex_trylock(lock)) - goto acquired; + break; =20 + raw_spin_unlock(¤t->blocked_lock); /* * Check for signals and kill conditions while holding * wait_lock. This ensures the lock cancellation is ordered @@ -693,12 +695,14 @@ __mutex_lock_common(struct mutex *lock, unsigned int = state, unsigned int subclas =20 first =3D __mutex_waiter_is_first(lock, &waiter); =20 + raw_spin_lock_irqsave(&lock->wait_lock, flags); + raw_spin_lock(¤t->blocked_lock); /* * As we likely have been woken up by task * that has cleared our blocked_on state, re-set * it to the lock we are trying to acquire. 
*/ - set_task_blocked_on(current, lock); + __set_task_blocked_on(current, lock); set_current_state(state); /* * Here we order against unlock; we must either see it change @@ -709,25 +713,33 @@ __mutex_lock_common(struct mutex *lock, unsigned int = state, unsigned int subclas break; =20 if (first) { - trace_contention_begin(lock, LCB_F_MUTEX | LCB_F_SPIN); + bool opt_acquired; + /* * mutex_optimistic_spin() can call schedule(), so - * clear blocked on so we don't become unselectable + * we need to release these locks before calling it, + * and clear blocked on so we don't become unselectable * to run. */ - clear_task_blocked_on(current, lock); - if (mutex_optimistic_spin(lock, ww_ctx, &waiter)) + __clear_task_blocked_on(current, lock); + raw_spin_unlock(¤t->blocked_lock); + raw_spin_unlock_irqrestore(&lock->wait_lock, flags); + + trace_contention_begin(lock, LCB_F_MUTEX | LCB_F_SPIN); + opt_acquired =3D mutex_optimistic_spin(lock, ww_ctx, &waiter); + + raw_spin_lock_irqsave(&lock->wait_lock, flags); + raw_spin_lock(¤t->blocked_lock); + __set_task_blocked_on(current, lock); + + if (opt_acquired) break; - set_task_blocked_on(current, lock); trace_contention_begin(lock, LCB_F_MUTEX); } - - raw_spin_lock_irqsave(&lock->wait_lock, flags); } - raw_spin_lock_irqsave(&lock->wait_lock, flags); -acquired: __clear_task_blocked_on(current, lock); __set_current_state(TASK_RUNNING); + raw_spin_unlock(¤t->blocked_lock); =20 if (ww_ctx) { /* @@ -756,11 +768,11 @@ __mutex_lock_common(struct mutex *lock, unsigned int = state, unsigned int subclas return 0; =20 err: - __clear_task_blocked_on(current, lock); + clear_task_blocked_on(current, lock); __set_current_state(TASK_RUNNING); __mutex_remove_waiter(lock, &waiter); err_early_kill: - WARN_ON(__get_task_blocked_on(current)); + WARN_ON(get_task_blocked_on(current)); trace_contention_end(lock, ret); raw_spin_unlock_irqrestore_wake(&lock->wait_lock, flags, &wake_q); debug_mutex_free_waiter(&waiter); @@ -971,7 +983,7 @@ static noinline 
void __sched __mutex_unlock_slowpath(st= ruct mutex *lock, unsigne next =3D waiter->task; =20 debug_mutex_wake_waiter(lock, waiter); - __clear_task_blocked_on(next, lock); + clear_task_blocked_on(next, lock); wake_q_add(&wake_q, next); } =20 diff --git a/kernel/locking/mutex.h b/kernel/locking/mutex.h index 9ad4da8cea004..7a8ba13fee949 100644 --- a/kernel/locking/mutex.h +++ b/kernel/locking/mutex.h @@ -47,6 +47,12 @@ static inline struct task_struct *__mutex_owner(struct m= utex *lock) return (struct task_struct *)(atomic_long_read(&lock->owner) & ~MUTEX_FLA= GS); } =20 +static inline struct mutex *get_task_blocked_on(struct task_struct *p) +{ + guard(raw_spinlock_irqsave)(&p->blocked_lock); + return __get_task_blocked_on(p); +} + #ifdef CONFIG_DEBUG_MUTEXES extern void debug_mutex_lock_common(struct mutex *lock, struct mutex_waiter *waiter); diff --git a/kernel/locking/ww_mutex.h b/kernel/locking/ww_mutex.h index 31a785afee6c0..e4a81790ea7dd 100644 --- a/kernel/locking/ww_mutex.h +++ b/kernel/locking/ww_mutex.h @@ -289,7 +289,7 @@ __ww_mutex_die(struct MUTEX *lock, struct MUTEX_WAITER = *waiter, * blocked_on pointer. Otherwise we can see circular * blocked_on relationships that can't resolve. */ - __clear_task_blocked_on(waiter->task, lock); + clear_task_blocked_on(waiter->task, lock); wake_q_add(wake_q, waiter->task); } =20 @@ -347,7 +347,7 @@ static bool __ww_mutex_wound(struct MUTEX *lock, * are waking the mutex owner, who may be currently * blocked on a different mutex. 
*/ - __clear_task_blocked_on(owner, NULL); + clear_task_blocked_on(owner, NULL); wake_q_add(wake_q, owner); } return true; diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 610e48cdb66a9..7187c63174cd2 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -6587,6 +6587,7 @@ static struct task_struct *proxy_deactivate(struct rq= *rq, struct task_struct *d * p->pi_lock * rq->lock * mutex->wait_lock + * p->blocked_lock * * Returns the task that is going to be used as execution context (the one * that is actually going to be run on cpu_of(rq)). @@ -6606,8 +6607,9 @@ find_proxy_task(struct rq *rq, struct task_struct *do= nor, struct rq_flags *rf) * and ensure @owner sticks around. */ guard(raw_spinlock)(&mutex->wait_lock); + guard(raw_spinlock)(&p->blocked_lock); =20 - /* Check again that p is blocked with wait_lock held */ + /* Check again that p is blocked with blocked_lock held */ if (mutex !=3D __get_task_blocked_on(p)) { /* * Something changed in the blocked_on chain and --=20 2.53.0.1018.g2bb0e51243-goog From nobody Fri Apr 3 08:34:57 2026 Received: from mail-pj1-f74.google.com (mail-pj1-f74.google.com [209.85.216.74]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 6D26D3BC691 for ; Tue, 24 Mar 2026 19:13:49 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=209.85.216.74 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1774379630; cv=none; b=lOVA97O6D2pVwiDu4ayn8bE5dSnYjK6CS3QS3WKaY5BUDl+2AVhtzXznDY8J7kEQIc9/BJnRIZc9KefWoev4ZcV6a+fIyMo2SPY8xL6bz0kA0OJeiEEZ7W8hoctUmptp8Q6H/YWyeKFuyShVGHQnn9wEdIY/pmwRB4QmJPEF8Nw= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1774379630; c=relaxed/simple; bh=msTCyA1PkdHuWuPlFQDkzCVw5Fo1Q6YWL1Ro+s/waj0=; h=Date:In-Reply-To:Mime-Version:References:Message-ID:Subject:From: To:Cc:Content-Type; 
Date: Tue, 24 Mar 2026 19:13:20 +0000 In-Reply-To: <20260324191337.1841376-1-jstultz@google.com> Mime-Version: 1.0 References: <20260324191337.1841376-1-jstultz@google.com> X-Mailer: git-send-email 2.53.0.1018.g2bb0e51243-goog Message-ID: <20260324191337.1841376-6-jstultz@google.com> Subject: [PATCH v26 05/10] sched: Fix modifying donor->blocked on without proper locking From: John Stultz To: LKML Cc: John Stultz , K Prateek Nayak , Joel Fernandes , Qais Yousef , Ingo Molnar , Peter Zijlstra , Juri Lelli , Vincent Guittot , Dietmar Eggemann , Valentin Schneider , Steven Rostedt , Ben Segall , Zimuzo Ezeozue , Mel Gorman , Will Deacon , Waiman Long , Boqun Feng , "Paul E. 
McKenney" , Metin Kaya , Xuewen Yan , Thomas Gleixner , Daniel Lezcano , Suleiman Souhlal , kuyo chang , hupu , kernel-team@android.com Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" Introduce an action enum in find_proxy_task() which allows us to handle work needed to be done outside the mutex.wait_lock and task.blocked_lock guard scopes. This ensures proper locking when we clear the donor's blocked_on pointer in proxy_deactivate(), and the switch statement will be useful as we add more cases to handle later in this series. Reviewed-by: K Prateek Nayak Signed-off-by: John Stultz --- v23: * Split out from earlier patch. v24: * Minor re-ordering local variables to keep with style as suggested by K Prateek Cc: Joel Fernandes Cc: Qais Yousef Cc: Ingo Molnar Cc: Peter Zijlstra Cc: Juri Lelli Cc: Vincent Guittot Cc: Dietmar Eggemann Cc: Valentin Schneider Cc: Steven Rostedt Cc: Ben Segall Cc: Zimuzo Ezeozue Cc: Mel Gorman Cc: Will Deacon Cc: Waiman Long Cc: Boqun Feng Cc: "Paul E. McKenney" Cc: Metin Kaya Cc: Xuewen Yan Cc: K Prateek Nayak Cc: Thomas Gleixner Cc: Daniel Lezcano Cc: Suleiman Souhlal Cc: kuyo chang Cc: hupu Cc: kernel-team@android.com --- kernel/sched/core.c | 16 +++++++++++++--- 1 file changed, 13 insertions(+), 3 deletions(-) diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 7187c63174cd2..c43e7926fda51 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -6571,7 +6571,7 @@ static struct task_struct *proxy_deactivate(struct rq= *rq, struct task_struct *d * as unblocked, as we aren't doing proxy-migrations * yet (more logic will be needed then). 
*/ - donor->blocked_on =3D NULL; + clear_task_blocked_on(donor, NULL); } return NULL; } @@ -6595,6 +6595,7 @@ static struct task_struct *proxy_deactivate(struct rq= *rq, struct task_struct *d static struct task_struct * find_proxy_task(struct rq *rq, struct task_struct *donor, struct rq_flags = *rf) { + enum { FOUND, DEACTIVATE_DONOR } action =3D FOUND; struct task_struct *owner =3D NULL; int this_cpu =3D cpu_of(rq); struct task_struct *p; @@ -6628,12 +6629,14 @@ find_proxy_task(struct rq *rq, struct task_struct *= donor, struct rq_flags *rf) =20 if (!READ_ONCE(owner->on_rq) || owner->se.sched_delayed) { /* XXX Don't handle blocked owners/delayed dequeue yet */ - return proxy_deactivate(rq, donor); + action =3D DEACTIVATE_DONOR; + break; } =20 if (task_cpu(owner) !=3D this_cpu) { /* XXX Don't handle migrations yet */ - return proxy_deactivate(rq, donor); + action =3D DEACTIVATE_DONOR; + break; } =20 if (task_on_rq_migrating(owner)) { @@ -6691,6 +6694,13 @@ find_proxy_task(struct rq *rq, struct task_struct *d= onor, struct rq_flags *rf) */ } =20 + /* Handle actions we need to do outside of the guard() scope */ + switch (action) { + case DEACTIVATE_DONOR: + return proxy_deactivate(rq, donor); + case FOUND: + /* fallthrough */; + } WARN_ON_ONCE(owner && !owner->on_rq); return owner; } --=20 2.53.0.1018.g2bb0e51243-goog From nobody Fri Apr 3 08:34:57 2026 Received: from mail-pj1-f73.google.com (mail-pj1-f73.google.com [209.85.216.73]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id D96443BE65E for ; Tue, 24 Mar 2026 19:13:50 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=209.85.216.73 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1774379632; cv=none; 
Date: Tue, 24 Mar 2026 19:13:21 +0000 In-Reply-To: <20260324191337.1841376-1-jstultz@google.com> Mime-Version: 1.0 References: <20260324191337.1841376-1-jstultz@google.com> X-Mailer: git-send-email 2.53.0.1018.g2bb0e51243-goog Message-ID: <20260324191337.1841376-7-jstultz@google.com> Subject: [PATCH v26 06/10] sched/locking: Add special p->blocked_on==PROXY_WAKING value for proxy return-migration From: John Stultz To: LKML Cc: John Stultz , K Prateek Nayak , Joel Fernandes , Qais Yousef , Ingo Molnar , Peter Zijlstra , Juri Lelli , Vincent Guittot , Dietmar Eggemann , Valentin Schneider , Steven 
Rostedt , Ben Segall , Zimuzo Ezeozue , Mel Gorman , Will Deacon , Waiman Long , Boqun Feng , "Paul E. McKenney" , Metin Kaya , Xuewen Yan , Thomas Gleixner , Daniel Lezcano , Suleiman Souhlal , kuyo chang , hupu , kernel-team@android.com Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" As we add functionality to proxy execution, we may migrate a donor task to a runqueue where it can't run due to cpu affinity. Thus, we must be careful to ensure we return-migrate the task back to a cpu in its cpumask when it becomes unblocked. Peter helpfully provided the following example with pictures: "Suppose we have a ww_mutex cycle: ,-+-* Mutex-1 <-. Task-A ---' | | ,-- Task-B `-> Mutex-2 *-+-' Where Task-A holds Mutex-1 and tries to acquire Mutex-2, and where Task-B holds Mutex-2 and tries to acquire Mutex-1. Then the blocked_on->owner chain will go in circles. Task-A -> Mutex-2 ^ | | v Mutex-1 <- Task-B We need two things: - find_proxy_task() to stop iterating the circle; - the woken task to 'unblock' and run, such that it can back-off and re-try the transaction. Now, the current code [without this patch] does: __clear_task_blocked_on(); wake_q_add(); And surely clearing ->blocked_on is sufficient to break the cycle. Suppose it is Task-B that is made to back-off, then we have: Task-A -> Mutex-2 -> Task-B (no further blocked_on) and it would attempt to run Task-B. Or worse, it could directly pick Task-B and run it, without ever getting into find_proxy_task(). Now, here is a problem because Task-B might not be runnable on the CPU it is currently on; and because !task_is_blocked() we don't get into the proxy paths, so nobody is going to fix this up. Ideally we would have dequeued Task-B alongside of clearing ->blocked_on, but alas, [the lock ordering prevents us from getting the task_rq_lock() and] spoils things." Thus we need more than just a binary concept of the task being blocked on a mutex or not. 
So allow setting blocked_on to PROXY_WAKING as a special value which specifies the task is no longer blocked, but needs to be evaluated for return migration *before* it can be run. This will then be used in a later patch to handle proxy return-migration. Reviewed-by: K Prateek Nayak Signed-off-by: John Stultz --- v15: * Split blocked_on_state into its own patch later in the series, as the tri-state isn't necessary until we deal with proxy/return migrations v16: * Handle case where task in the chain is being set as BO_WAKING by another cpu (usually via ww_mutex die code). Make sure we release the rq lock so the wakeup can complete. * Rework to use guard() in find_proxy_task() as suggested by Peter v18: * Add initialization of blocked_on_state for init_task v19: * PREEMPT_RT build fixups and rework suggested by K Prateek Nayak v20: * Simplify one of the blocked_on_state changes to avoid extra PREEMPT_RT conditionals v21: * Slight reworks due to avoiding nested blocked_lock locking * Be consistent in use of blocked_on_state helper functions * Rework calls to proxy_deactivate() to do proper locking around blocked_on_state changes that we were cheating in previous versions. * Minor cleanups, some comment improvements v22: * Re-order blocked_on_state helpers to try to make it clearer that set_task_blocked_on() and clear_task_blocked_on() are the main enter/exit states and the blocked_on_state helpers help manage the transition states within. Per feedback from K Prateek Nayak. * Rework blocked_on_state to be defined within CONFIG_SCHED_PROXY_EXEC as suggested by K Prateek Nayak. * Reworked empty stub functions to just take one line as suggested by K Prateek * Avoid using gotos out of a guard() scope, as highlighted by K Prateek, and instead rework logic to break and switch() on an action value. v23: * Big rework to use PROXY_WAKING instead of blocked_on_state as suggested by Peter. 
* Reworked commit message to include Peter's nice diagrams and example for why this extra state is necessary. Cc: Joel Fernandes Cc: Qais Yousef Cc: Ingo Molnar Cc: Peter Zijlstra Cc: Juri Lelli Cc: Vincent Guittot Cc: Dietmar Eggemann Cc: Valentin Schneider Cc: Steven Rostedt Cc: Ben Segall Cc: Zimuzo Ezeozue Cc: Mel Gorman Cc: Will Deacon Cc: Waiman Long Cc: Boqun Feng Cc: "Paul E. McKenney" Cc: Metin Kaya Cc: Xuewen Yan Cc: K Prateek Nayak Cc: Thomas Gleixner Cc: Daniel Lezcano Cc: Suleiman Souhlal Cc: kuyo chang Cc: hupu Cc: kernel-team@android.com --- include/linux/sched.h | 51 +++++++++++++++++++++++++++++++++++++-- kernel/locking/mutex.c | 2 +- kernel/locking/ww_mutex.h | 16 ++++++------ kernel/sched/core.c | 16 ++++++++++++ 4 files changed, 74 insertions(+), 11 deletions(-) diff --git a/include/linux/sched.h b/include/linux/sched.h index 2eef9bc6daaab..8ec3b6d7d718b 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -2180,10 +2180,20 @@ extern int __cond_resched_rwlock_write(rwlock_t *lo= ck) __must_hold(lock); }) =20 #ifndef CONFIG_PREEMPT_RT + +/* + * With proxy exec, if a task has been proxy-migrated, it may be a donor + * on a cpu that it can't actually run on. Thus we need a special state + * to denote that the task is being woken, but that it needs to be + * evaluated for return-migration before it is run. So if the task is + * blocked_on PROXY_WAKING, return migrate it before running it. + */ +#define PROXY_WAKING ((struct mutex *)(-1L)) + static inline struct mutex *__get_task_blocked_on(struct task_struct *p) { lockdep_assert_held_once(&p->blocked_lock); - return p->blocked_on; + return p->blocked_on =3D=3D PROXY_WAKING ? 
NULL : p->blocked_on; } =20 static inline void __set_task_blocked_on(struct task_struct *p, struct mut= ex *m) @@ -2211,7 +2221,7 @@ static inline void __clear_task_blocked_on(struct tas= k_struct *p, struct mutex * * blocked_on relationships, but make sure we are not * clearing the relationship with a different lock. */ - WARN_ON_ONCE(m && p->blocked_on && p->blocked_on !=3D m); + WARN_ON_ONCE(m && p->blocked_on && p->blocked_on !=3D m && p->blocked_on = !=3D PROXY_WAKING); p->blocked_on =3D NULL; } =20 @@ -2220,6 +2230,35 @@ static inline void clear_task_blocked_on(struct task= _struct *p, struct mutex *m) guard(raw_spinlock_irqsave)(&p->blocked_lock); __clear_task_blocked_on(p, m); } + +static inline void __set_task_blocked_on_waking(struct task_struct *p, str= uct mutex *m) +{ + /* Currently we serialize blocked_on under the task::blocked_lock */ + lockdep_assert_held_once(&p->blocked_lock); + + if (!sched_proxy_exec()) { + __clear_task_blocked_on(p, m); + return; + } + + /* Don't set PROXY_WAKING if blocked_on was already cleared */ + if (!p->blocked_on) + return; + /* + * There may be cases where we set PROXY_WAKING on tasks that were + * already set to waking, but make sure we are not changing + * the relationship with a different lock. 
+ */ + WARN_ON_ONCE(m && p->blocked_on !=3D m && p->blocked_on !=3D PROXY_WAKING= ); + p->blocked_on =3D PROXY_WAKING; +} + +static inline void set_task_blocked_on_waking(struct task_struct *p, struc= t mutex *m) +{ + guard(raw_spinlock_irqsave)(&p->blocked_lock); + __set_task_blocked_on_waking(p, m); +} + #else static inline void __clear_task_blocked_on(struct task_struct *p, struct r= t_mutex *m) { @@ -2228,6 +2267,14 @@ static inline void __clear_task_blocked_on(struct ta= sk_struct *p, struct rt_mute static inline void clear_task_blocked_on(struct task_struct *p, struct rt_= mutex *m) { } + +static inline void __set_task_blocked_on_waking(struct task_struct *p, str= uct rt_mutex *m) +{ +} + +static inline void set_task_blocked_on_waking(struct task_struct *p, struc= t rt_mutex *m) +{ +} #endif /* !CONFIG_PREEMPT_RT */ =20 static __always_inline bool need_resched(void) diff --git a/kernel/locking/mutex.c b/kernel/locking/mutex.c index 4aa79bcab08c7..7d359647156df 100644 --- a/kernel/locking/mutex.c +++ b/kernel/locking/mutex.c @@ -983,7 +983,7 @@ static noinline void __sched __mutex_unlock_slowpath(st= ruct mutex *lock, unsigne next =3D waiter->task; =20 debug_mutex_wake_waiter(lock, waiter); - clear_task_blocked_on(next, lock); + set_task_blocked_on_waking(next, lock); wake_q_add(&wake_q, next); } =20 diff --git a/kernel/locking/ww_mutex.h b/kernel/locking/ww_mutex.h index e4a81790ea7dd..5cd9dfa4b31e6 100644 --- a/kernel/locking/ww_mutex.h +++ b/kernel/locking/ww_mutex.h @@ -285,11 +285,11 @@ __ww_mutex_die(struct MUTEX *lock, struct MUTEX_WAITE= R *waiter, debug_mutex_wake_waiter(lock, waiter); #endif /* - * When waking up the task to die, be sure to clear the - * blocked_on pointer. Otherwise we can see circular - * blocked_on relationships that can't resolve. + * When waking up the task to die, be sure to set the + * blocked_on to PROXY_WAKING. Otherwise we can see + * circular blocked_on relationships that can't resolve. 
*/ - clear_task_blocked_on(waiter->task, lock); + set_task_blocked_on_waking(waiter->task, lock); wake_q_add(wake_q, waiter->task); } =20 @@ -339,15 +339,15 @@ static bool __ww_mutex_wound(struct MUTEX *lock, */ if (owner !=3D current) { /* - * When waking up the task to wound, be sure to clear the - * blocked_on pointer. Otherwise we can see circular - * blocked_on relationships that can't resolve. + * When waking up the task to wound, be sure to set the + * blocked_on to PROXY_WAKING. Otherwise we can see + * circular blocked_on relationships that can't resolve. * * NOTE: We pass NULL here instead of lock, because we * are waking the mutex owner, who may be currently * blocked on a different mutex. */ - clear_task_blocked_on(owner, NULL); + set_task_blocked_on_waking(owner, NULL); wake_q_add(wake_q, owner); } return true; diff --git a/kernel/sched/core.c b/kernel/sched/core.c index c43e7926fda51..aa2e7287235e3 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -4242,6 +4242,13 @@ int try_to_wake_up(struct task_struct *p, unsigned i= nt state, int wake_flags) ttwu_queue(p, cpu, wake_flags); } out: + /* + * For now, if we've been woken up, clear the task->blocked_on + * regardless if it was set to a mutex or PROXY_WAKING so the + * task can run. We will need to be more careful later when + * properly handling proxy migration + */ + clear_task_blocked_on(p, NULL); if (success) ttwu_stat(p, task_cpu(p), wake_flags); =20 @@ -6603,6 +6610,10 @@ find_proxy_task(struct rq *rq, struct task_struct *d= onor, struct rq_flags *rf) =20 /* Follow blocked_on chain. */ for (p =3D donor; (mutex =3D p->blocked_on); p =3D owner) { + /* if its PROXY_WAKING, resched_idle so ttwu can complete */ + if (mutex =3D=3D PROXY_WAKING) + return proxy_resched_idle(rq); + /* * By taking mutex->wait_lock we hold off concurrent mutex_unlock() * and ensure @owner sticks around. 
@@ -6623,6 +6634,11 @@ find_proxy_task(struct rq *rq, struct task_struct *d= onor, struct rq_flags *rf) =20 owner =3D __mutex_owner(mutex); if (!owner) { + /* + * If there is no owner, clear blocked_on + * and return p so it can run and try to + * acquire the lock + */ __clear_task_blocked_on(p, mutex); return p; } --=20 2.53.0.1018.g2bb0e51243-goog From nobody Fri Apr 3 08:34:57 2026 Received: from mail-pl1-f201.google.com (mail-pl1-f201.google.com [209.85.214.201]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 8A7A33BFE34 for ; Tue, 24 Mar 2026 19:13:52 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=209.85.214.201 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1774379634; cv=none; b=ESxXtSTWCGjKIPMOaI2FDhFWvQKBFQxsDtbu04f91c503w0C1Y4rhsB58+T1M1YWhf2NnO6czdeEFq1Kj1fAUlMOPaHXrTAuyIEuv1fXcrKy/QoLs3GYjzGrSnVZjfcR158zlDQK9kTYIcnpsZ1VTQfDJIn+FHIzN31vM/02zGM= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1774379634; c=relaxed/simple; bh=/eZJFY5uhia92AI7Ho19mC2qaMCeVxXvpv/NzMKp2sA=; h=Date:In-Reply-To:Mime-Version:References:Message-ID:Subject:From: To:Cc:Content-Type; b=OSNSwtn5PdIoCvC0PspEu7gA2IFKXBFNdW9IHl9BQ4hfwGmUAy/0PBS9jpn4orvSBrPVqGIOOWT045T+3jgAdt45kVoRbAu1C46XLwy2IMjQcPdn4oVf6Kfm2vplD2M6TjIpR3JB5I0yo0zfUcnaXaJ5tUikkMsirDSiG5FsLr4= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=reject dis=none) header.from=google.com; spf=pass smtp.mailfrom=flex--jstultz.bounces.google.com; dkim=pass (2048-bit key) header.d=google.com header.i=@google.com header.b=tDxx9Q9R; arc=none smtp.client-ip=209.85.214.201 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=reject dis=none) header.from=google.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=flex--jstultz.bounces.google.com Authentication-Results: 
Date: Tue, 24 Mar 2026 19:13:22 +0000
In-Reply-To: <20260324191337.1841376-1-jstultz@google.com>
References: <20260324191337.1841376-1-jstultz@google.com>
Message-ID: <20260324191337.1841376-8-jstultz@google.com>
Subject: [PATCH v26 07/10] sched: Add assert_balance_callbacks_empty helper
From: John Stultz
To: LKML
Cc: John Stultz, Peter Zijlstra, K Prateek Nayak, Joel Fernandes,
 Qais Yousef, Ingo Molnar, Juri Lelli, Vincent Guittot,
 Dietmar Eggemann, Valentin Schneider, Steven Rostedt, Ben Segall,
 Zimuzo Ezeozue, Mel Gorman, Will Deacon, Waiman Long, Boqun Feng,
 "Paul E. McKenney", Metin Kaya, Xuewen Yan, Thomas Gleixner,
 Daniel Lezcano, Suleiman Souhlal, kuyo chang, hupu,
 kernel-team@android.com

With proxy-exec utilizing pick-again logic, we can end up having
balance callbacks set by the previous pick_next_task() call left on
the list. So pull the warning out into a helper function, and make
sure we check it when we pick again.

Suggested-by: Peter Zijlstra
Reviewed-by: K Prateek Nayak
Signed-off-by: John Stultz
---
v24:
* Use IS_ENABLED() as suggested by K Prateek

Cc: Joel Fernandes
Cc: Qais Yousef
Cc: Ingo Molnar
Cc: Peter Zijlstra
Cc: Juri Lelli
Cc: Vincent Guittot
Cc: Dietmar Eggemann
Cc: Valentin Schneider
Cc: Steven Rostedt
Cc: Ben Segall
Cc: Zimuzo Ezeozue
Cc: Mel Gorman
Cc: Will Deacon
Cc: Waiman Long
Cc: Boqun Feng
Cc: "Paul E.
McKenney"
Cc: Metin Kaya
Cc: Xuewen Yan
Cc: K Prateek Nayak
Cc: Thomas Gleixner
Cc: Daniel Lezcano
Cc: Suleiman Souhlal
Cc: kuyo chang
Cc: hupu
Cc: kernel-team@android.com
---
 kernel/sched/core.c  | 1 +
 kernel/sched/sched.h | 9 ++++++++-
 2 files changed, 9 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index aa2e7287235e3..b316b6015ffea 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6856,6 +6856,7 @@ static void __sched notrace __schedule(int sched_mode)
 	}

 pick_again:
+	assert_balance_callbacks_empty(rq);
 	next = pick_next_task(rq, rq->donor, &rf);
 	rq->next_class = next->sched_class;
 	if (sched_proxy_exec()) {
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 43bbf0693cca4..2a0236d745832 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1853,6 +1853,13 @@ static inline void scx_rq_clock_update(struct rq *rq, u64 clock) {}
 static inline void scx_rq_clock_invalidate(struct rq *rq) {}
 #endif /* !CONFIG_SCHED_CLASS_EXT */

+static inline void assert_balance_callbacks_empty(struct rq *rq)
+{
+	WARN_ON_ONCE(IS_ENABLED(CONFIG_PROVE_LOCKING) &&
+		     rq->balance_callback &&
+		     rq->balance_callback != &balance_push_callback);
+}
+
 /*
  * Lockdep annotation that avoids accidental unlocks; it's like a
  * sticky/continuous lockdep_assert_held().
@@ -1869,7 +1876,7 @@ static inline void rq_pin_lock(struct rq *rq, struct rq_flags *rf)

 	rq->clock_update_flags &= (RQCF_REQ_SKIP|RQCF_ACT_SKIP);
 	rf->clock_update_flags = 0;
-	WARN_ON_ONCE(rq->balance_callback && rq->balance_callback != &balance_push_callback);
+	assert_balance_callbacks_empty(rq);
 }

 static inline void rq_unpin_lock(struct rq *rq, struct rq_flags *rf)
--
2.53.0.1018.g2bb0e51243-goog

From nobody Fri Apr 3 08:34:57 2026
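The helper above leans on IS_ENABLED() so the comparison compiles away entirely when CONFIG_PROVE_LOCKING is off. A rough userspace model of the same shape — the config macro, warn_on_once() counter, and rq layout here are all stand-ins for this sketch, not the kernel's — behaves like this:

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-in for IS_ENABLED(CONFIG_PROVE_LOCKING); assumed on here. */
#define IS_ENABLED_PROVE_LOCKING 1

static int warn_count;

/* Simplified WARN_ON_ONCE(): records only the first hit. */
static bool warn_on_once(bool cond)
{
        static bool warned;

        if (cond && !warned) {
                warned = true;
                warn_count++;
        }
        return cond;
}

struct balance_callback { struct balance_callback *next; };
static struct balance_callback balance_push_callback;

struct rq { struct balance_callback *balance_callback; };

/*
 * Mirrors the patch's helper: warn (once) if any callback other than
 * the special balance_push_callback is still queued on the rq.
 */
static void assert_balance_callbacks_empty(struct rq *rq)
{
        warn_on_once(IS_ENABLED_PROVE_LOCKING &&
                     rq->balance_callback &&
                     rq->balance_callback != &balance_push_callback);
}
```

When the config stand-in is 0, the `&&` chain is constant-false and an optimizing compiler drops the whole check, which is the point of the IS_ENABLED() idiom.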
Date: Tue, 24 Mar 2026 19:13:23 +0000
In-Reply-To: <20260324191337.1841376-1-jstultz@google.com>
References: <20260324191337.1841376-1-jstultz@google.com>
Message-ID: <20260324191337.1841376-9-jstultz@google.com>
Subject: [PATCH v26 08/10] sched: Add logic to zap balance callbacks if we pick again
From: John Stultz
To: LKML
Cc: John Stultz, K Prateek Nayak, Joel Fernandes, Qais Yousef,
 Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
 Dietmar Eggemann, Valentin Schneider, Steven Rostedt, Ben Segall,
 Zimuzo Ezeozue, Mel Gorman, Will Deacon, Waiman Long, Boqun Feng,
 "Paul E. McKenney", Metin Kaya, Xuewen Yan, Thomas Gleixner,
 Daniel Lezcano, Suleiman Souhlal, kuyo chang, hupu,
 kernel-team@android.com

With proxy-exec, a task is selected to run via pick_next_task(), and
then, if it is a mutex-blocked task, we call find_proxy_task() to find
a runnable owner. If the runnable owner is on another cpu, we will
need to migrate the selected donor task away, after which the
pick_again path can call pick_next_task() to choose something else.

However, in the first call to pick_next_task(), we may have had a
balance_callback set up by the class scheduler. After we pick again,
it's possible pick_next_task_fair() will be called, which calls
sched_balance_newidle() and sched_balance_rq(). This will throw a
warning:

[    8.796467] rq->balance_callback && rq->balance_callback != &balance_push_callback
[    8.796467] WARNING: CPU: 32 PID: 458 at kernel/sched/sched.h:1750 sched_balance_rq+0xe92/0x1250
...
[    8.796467] Call Trace:
[    8.796467]  ? __warn.cold+0xb2/0x14e
[    8.796467]  ? sched_balance_rq+0xe92/0x1250
[    8.796467]  ? report_bug+0x107/0x1a0
[    8.796467]  ? handle_bug+0x54/0x90
[    8.796467]  ? exc_invalid_op+0x17/0x70
[    8.796467]  ? asm_exc_invalid_op+0x1a/0x20
[    8.796467]  ? sched_balance_rq+0xe92/0x1250
[    8.796467]  sched_balance_newidle+0x295/0x820
[    8.796467]  pick_next_task_fair+0x51/0x3f0
[    8.796467]  __schedule+0x23a/0x14b0
[    8.796467]  ? lock_release+0x16d/0x2e0
[    8.796467]  schedule+0x3d/0x150
[    8.796467]  worker_thread+0xb5/0x350
[    8.796467]  ? __pfx_worker_thread+0x10/0x10
[    8.796467]  kthread+0xee/0x120
[    8.796467]  ? __pfx_kthread+0x10/0x10
[    8.796467]  ret_from_fork+0x31/0x50
[    8.796467]  ? __pfx_kthread+0x10/0x10
[    8.796467]  ret_from_fork_asm+0x1a/0x30
[    8.796467]

This is because if an RT task was originally picked, it will set up
the rq->balance_callback with push_rt_tasks() via set_next_task_rt().
Once the task is migrated away and we pick again, we haven't processed
any balance callbacks, so rq->balance_callback is not in the same
state as it was the first time pick_next_task() was called.

To handle this, add a zap_balance_callbacks() helper function which
cleans up the balance callbacks without running them. This should be
ok, as we are effectively undoing the state set in the first call to
pick_next_task(), and when we pick again, the new callback can be
configured for the donor task actually selected.
Reviewed-by: K Prateek Nayak
Signed-off-by: John Stultz
---
v20:
* Tweaked to avoid build issues with different configs
v22:
* Spelling fix suggested by K Prateek
* Collapsed the stub implementation to one line as suggested by K Prateek
* Zap callbacks when we resched idle, as suggested by K Prateek
v24:
* Don't conditionalize function on CONFIG_SCHED_PROXY_EXEC as the
  callers will be optimized out if that is unset, and the dead
  function will be removed, as suggested by K Prateek

Cc: Joel Fernandes
Cc: Qais Yousef
Cc: Ingo Molnar
Cc: Peter Zijlstra
Cc: Juri Lelli
Cc: Vincent Guittot
Cc: Dietmar Eggemann
Cc: Valentin Schneider
Cc: Steven Rostedt
Cc: Ben Segall
Cc: Zimuzo Ezeozue
Cc: Mel Gorman
Cc: Will Deacon
Cc: Waiman Long
Cc: Boqun Feng
Cc: "Paul E. McKenney"
Cc: Metin Kaya
Cc: Xuewen Yan
Cc: K Prateek Nayak
Cc: Thomas Gleixner
Cc: Daniel Lezcano
Cc: Suleiman Souhlal
Cc: kuyo chang
Cc: hupu
Cc: kernel-team@android.com
---
 kernel/sched/core.c | 36 ++++++++++++++++++++++++++++++++++--
 1 file changed, 34 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index b316b6015ffea..4ed24ef590f73 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4920,6 +4920,34 @@ static inline void finish_task(struct task_struct *prev)
 	smp_store_release(&prev->on_cpu, 0);
 }

+/*
+ * Only called from __schedule context.
+ *
+ * There are some cases where we are going to re-do the action
+ * that added the balance callbacks. We may not be in a state
+ * where we can run them, so just zap them so they can be
+ * properly re-added on the next time around. This is similar
+ * handling to running the callbacks, except we just don't call
+ * them.
+ */
+static void zap_balance_callbacks(struct rq *rq)
+{
+	struct balance_callback *next, *head;
+	bool found = false;
+
+	lockdep_assert_rq_held(rq);
+
+	head = rq->balance_callback;
+	while (head) {
+		if (head == &balance_push_callback)
+			found = true;
+		next = head->next;
+		head->next = NULL;
+		head = next;
+	}
+	rq->balance_callback = found ? &balance_push_callback : NULL;
+}
+
 static void do_balance_callbacks(struct rq *rq, struct balance_callback *head)
 {
 	void (*func)(struct rq *rq);
@@ -6865,10 +6893,14 @@ static void __sched notrace __schedule(int sched_mode)
 		rq_set_donor(rq, next);
 		if (unlikely(next->blocked_on)) {
 			next = find_proxy_task(rq, next, &rf);
-			if (!next)
+			if (!next) {
+				zap_balance_callbacks(rq);
 				goto pick_again;
-			if (next == rq->idle)
+			}
+			if (next == rq->idle) {
+				zap_balance_callbacks(rq);
 				goto keep_resched;
+			}
 		}
 		if (rq->donor == prev_donor && prev != next) {
 			struct task_struct *donor = rq->donor;
--
2.53.0.1018.g2bb0e51243-goog

From nobody Fri Apr 3 08:34:57 2026
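The zap logic is a plain singly-linked-list unlink that preserves one special node. A standalone model of the same walk (toy types standing in for the kernel's; lockdep_assert_rq_held() omitted):

```c
#include <assert.h>
#include <stddef.h>

struct balance_callback { struct balance_callback *next; };

/* Stand-in for the kernel's special push callback that must survive. */
static struct balance_callback balance_push_callback;

struct rq { struct balance_callback *balance_callback; };

/*
 * Mirrors the patch: unlink every queued callback without running
 * it, but leave balance_push_callback installed if it was present.
 */
static void zap_balance_callbacks(struct rq *rq)
{
        struct balance_callback *next, *head = rq->balance_callback;
        int found = 0;

        while (head) {
                if (head == &balance_push_callback)
                        found = 1;
                next = head->next;
                head->next = NULL;   /* detach node without invoking it */
                head = next;
        }
        rq->balance_callback = found ? &balance_push_callback : NULL;
}
```

The `found` flag is the subtle part: balance_push_callback marks a CPU going down, so that one callback must stay queued even though everything else is discarded.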
Date: Tue, 24 Mar 2026 19:13:24 +0000
In-Reply-To: <20260324191337.1841376-1-jstultz@google.com>
References: <20260324191337.1841376-1-jstultz@google.com>
Message-ID: <20260324191337.1841376-10-jstultz@google.com>
Subject: [PATCH v26 09/10] sched: Move attach_one_task and attach_task helpers to sched.h
From: John Stultz
To: LKML
Cc: John Stultz, K Prateek Nayak, Joel Fernandes, Qais Yousef,
 Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
 Dietmar Eggemann, Valentin Schneider, Steven Rostedt, Ben Segall,
 Zimuzo Ezeozue, Mel Gorman, Will Deacon, Waiman Long, Boqun Feng,
 "Paul E.
McKenney", Metin Kaya, Xuewen Yan, Thomas Gleixner, Daniel Lezcano,
 Suleiman Souhlal, kuyo chang, hupu, kernel-team@android.com

The fair scheduler locally introduced attach_one_task() and
attach_task() helpers, but these could be generically useful, so move
this code to sched.h so we can use them elsewhere.

One minor tweak is made to utilize guard(rq_lock)(rq) to simplify the
function.

Suggested-by: K Prateek Nayak
Reviewed-by: K Prateek Nayak
Signed-off-by: John Stultz
---
v26:
* Folded in switch to use guard(rq_lock)(rq) as suggested by K Prateek

Cc: Joel Fernandes
Cc: Qais Yousef
Cc: Ingo Molnar
Cc: Peter Zijlstra
Cc: Juri Lelli
Cc: Vincent Guittot
Cc: Dietmar Eggemann
Cc: Valentin Schneider
Cc: Steven Rostedt
Cc: Ben Segall
Cc: Zimuzo Ezeozue
Cc: Mel Gorman
Cc: Will Deacon
Cc: Waiman Long
Cc: Boqun Feng
Cc: "Paul E. McKenney"
Cc: Metin Kaya
Cc: Xuewen Yan
Cc: K Prateek Nayak
Cc: Thomas Gleixner
Cc: Daniel Lezcano
Cc: Suleiman Souhlal
Cc: kuyo chang
Cc: hupu
Cc: kernel-team@android.com
---
 kernel/sched/fair.c  | 26 --------------------------
 kernel/sched/sched.h | 23 +++++++++++++++++++++++
 2 files changed, 23 insertions(+), 26 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index bf948db905ed1..53da01a251487 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9784,32 +9784,6 @@ static int detach_tasks(struct lb_env *env)
 	return detached;
 }

-/*
- * attach_task() -- attach the task detached by detach_task() to its new rq.
- */
-static void attach_task(struct rq *rq, struct task_struct *p)
-{
-	lockdep_assert_rq_held(rq);
-
-	WARN_ON_ONCE(task_rq(p) != rq);
-	activate_task(rq, p, ENQUEUE_NOCLOCK);
-	wakeup_preempt(rq, p, 0);
-}
-
-/*
- * attach_one_task() -- attaches the task returned from detach_one_task() to
- * its new rq.
- */
-static void attach_one_task(struct rq *rq, struct task_struct *p)
-{
-	struct rq_flags rf;
-
-	rq_lock(rq, &rf);
-	update_rq_clock(rq);
-	attach_task(rq, p);
-	rq_unlock(rq, &rf);
-}
-
 /*
  * attach_tasks() -- attaches all tasks detached by detach_tasks() to their
  * new rq.
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 2a0236d745832..d4def70df05a6 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -3008,6 +3008,29 @@ extern void deactivate_task(struct rq *rq, struct task_struct *p, int flags);

 extern void wakeup_preempt(struct rq *rq, struct task_struct *p, int flags);

+/*
+ * attach_task() -- attach the task detached by detach_task() to its new rq.
+ */
+static inline void attach_task(struct rq *rq, struct task_struct *p)
+{
+	lockdep_assert_rq_held(rq);
+
+	WARN_ON_ONCE(task_rq(p) != rq);
+	activate_task(rq, p, ENQUEUE_NOCLOCK);
+	wakeup_preempt(rq, p, 0);
+}
+
+/*
+ * attach_one_task() -- attaches the task returned from detach_one_task() to
+ * its new rq.
+ */
+static inline void attach_one_task(struct rq *rq, struct task_struct *p)
+{
+	guard(rq_lock)(rq);
+	update_rq_clock(rq);
+	attach_task(rq, p);
+}
+
 #ifdef CONFIG_PREEMPT_RT
 # define SCHED_NR_MIGRATE_BREAK 8
 #else
--
2.53.0.1018.g2bb0e51243-goog

From nobody Fri Apr 3 08:34:57 2026
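guard(rq_lock)(rq) above comes from the kernel's scope-based cleanup machinery, so the rq lock is dropped automatically when attach_one_task() returns, which is what let the explicit rq_unlock() be deleted. A toy userspace analogue using the GCC/Clang `cleanup` attribute — the flag-based "lock" and all names here are illustrative stand-ins, not kernel code:

```c
#include <assert.h>

static int rq_locked; /* toy lock state for the sketch */

struct rq_guard { int *lock; };

static void rq_guard_release(struct rq_guard *g)
{
        *g->lock = 0; /* auto-"unlock" when the guard leaves scope */
}

/* Rough analogue of guard(rq_lock)(rq): acquire now, release at scope exit. */
#define RQ_GUARD(lockp)                                                  \
        struct rq_guard _g __attribute__((cleanup(rq_guard_release))) = \
                { .lock = (lockp) };                                     \
        *(lockp) = 1

static int attach_one_task_sketch(void)
{
        RQ_GUARD(&rq_locked);
        assert(rq_locked == 1); /* "rq lock" held for the whole body */
        return 0;               /* guard releases on any return path */
}
```

The design win is the same as in the patch: every early return releases the lock, so the body cannot leak it.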
Date: Tue, 24 Mar 2026 19:13:25 +0000
In-Reply-To: <20260324191337.1841376-1-jstultz@google.com>
References: <20260324191337.1841376-1-jstultz@google.com>
Message-ID: <20260324191337.1841376-11-jstultz@google.com>
Subject: [PATCH v26 10/10] sched: Handle blocked-waiter migration (and return migration)
From: John Stultz
To: LKML
Cc: John Stultz, Joel Fernandes, Qais Yousef, Ingo Molnar,
 Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
 Valentin Schneider, Steven Rostedt, Ben Segall, Zimuzo Ezeozue,
 Mel Gorman, Will Deacon, Waiman Long, Boqun Feng,
 "Paul E. McKenney", Metin Kaya, Xuewen Yan, K Prateek Nayak,
 Thomas Gleixner, Daniel Lezcano, Suleiman Souhlal, kuyo chang,
 hupu, kernel-team@android.com

Add logic to handle migrating a blocked waiter to a remote cpu where
the lock owner is runnable. Additionally, as the blocked task may not
be able to run on the remote cpu, add logic to handle return migration
once the waiting task is given the mutex.

Because tasks may get migrated to where they cannot run, also modify
the scheduling classes to avoid sched-class migrations on
mutex-blocked tasks, leaving find_proxy_task() and related logic to do
the migrations and return migrations.

This was split out from the larger proxy patch, and significantly
reworked.
Credits for the original patch go to:
  Peter Zijlstra (Intel)
  Juri Lelli
  Valentin Schneider
  Connor O'Brien

Signed-off-by: John Stultz
---
v6:
* Integrated sched_proxy_exec() check in proxy_return_migration()
* Minor cleanups to diff
* Unpin the rq before calling __balance_callbacks()
* Tweak proxy migrate to migrate deeper task in chain, to avoid tasks
  ping-ponging between rqs
v7:
* Fixup for unused function arguments
* Switch from that_rq -> target_rq, other minor tweaks, and typo fixes
  suggested by Metin Kaya
* Switch back to doing return migration in the ttwu path, which avoids
  nasty lock juggling and performance issues
* Fixes for UP builds
v8:
* More simplifications from Metin Kaya
* Fixes for null owner case, including doing return migration
* Cleanup proxy_needs_return logic
v9:
* Narrow logic in ttwu that sets BO_RUNNABLE, to avoid missed return
  migrations
* Switch to using zap_balance_callbacks rather than running them when
  we are dropping rq locks for proxy migration
* Drop task_is_blocked check in sched_submit_work as suggested by
  Metin (may re-add later if this causes trouble)
* Do return migration when we're not on wake_cpu. This avoids bad task
  placement caused by proxy migrations raised by Xuewen Yan
* Fix to call set_next_task(rq->curr) prior to dropping rq lock to
  avoid rq->curr getting migrated before we have actually switched
  from it
* Cleanup to re-use proxy_resched_idle() instead of open coding it in
  proxy_migrate_task()
* Fix return migration not to use DEQUEUE_SLEEP, so that we properly
  see the task as task_on_rq_migrating() after it is dequeued but
  before set_task_cpu() has been called on it
* Fix to broaden find_proxy_task() checks to avoid race where a task
  is dequeued off the rq due to return migration, but set_task_cpu()
  and the enqueue on another rq happened after we checked
  task_cpu(owner). This ensures we don't proxy using a task that is
  not actually on our runqueue.
* Cleanup to avoid the locked BO_WAKING->BO_RUNNABLE transition in
  try_to_wake_up() if proxy execution isn't enabled
* Cleanup to improve comment in proxy_migrate_task() explaining the
  set_next_task(rq->curr) logic
* Cleanup deadline.c change to stylistically match rt.c change
* Numerous cleanups suggested by Metin
v10:
* Drop WARN_ON(task_is_blocked(p)) in ttwu current case
v11:
* Include proxy_set_task_cpu from later in the series to this change
  so we can use it, rather than reworking logic later in the series
* Fix problem with return migration, where affinity was changed and
  wake_cpu was left outside the affinity mask
* Avoid reading the owner's cpu twice (as it might change in between)
  to avoid occasional migration-to-same-cpu edge cases
* Add extra WARN_ON checks for wake_cpu and return migration edge
  cases
* Typo fix from Metin
v13:
* As we set ret, return it, not just NULL (pulling this change in from
  later patch)
* Avoid deadlock between try_to_wake_up() and find_proxy_task() when
  blocked_on cycle with ww_mutex is trying a mid-chain wakeup.
* Tweaks to use new __set_blocked_on_runnable() helper
* Potential fix for incorrectly updated task->dl_server issues
* Minor comment improvements
* Add logic to handle missed wakeups, in that case doing return
  migration from the find_proxy_task() path
* Minor cleanups
v14:
* Improve edge cases where we wouldn't set the task as BO_RUNNABLE
v15:
* Added comment to better describe proxy_needs_return() as suggested
  by Qais
* Build fixes for !CONFIG_SMP reported by Maciej Żenczykowski
* Adds fix for re-evaluating proxy_needs_return when
  sched_proxy_exec() is disabled, reported and diagnosed by kuyo chang
v16:
* Larger rework of needs_return logic in find_proxy_task, in order to
  avoid problems with cpu hotplug
* Rework to use guard() as suggested by Peter
v18:
* Integrate optimization suggested by Suleiman to do the checks for
  sleeping owners before checking if the task_cpu is this_cpu, so that
  we can avoid needlessly proxy-migrating tasks to only then dequeue
  them. Also check if migrating last.
* Improve comments around guard locking
* Include tweak to ttwu_runnable() as suggested by hupu
* Rework the logic releasing the rq->donor reference before letting go
  of the rq lock. Just use rq->idle.
* Go back to doing return migration on BO_WAKING owners, as I was
  hitting some softlockups caused by running tasks not making it out
  of BO_WAKING.
v19:
* Fixed proxy_force_return() logic for !SMP cases
v21:
* Reworked donor deactivation for unhandled sleeping owners
* Commit message tweaks
v22:
* Add comments around zap_balance_callbacks in proxy_migration logic
* Rework logic to avoid gotos out of guard() scopes, and instead use
  break and switch() on the action value, as suggested by K Prateek
* K Prateek suggested simplifications around putting the donor and
  setting idle as the next task in the migration paths, which I
  further simplified to using proxy_resched_idle()
* Comment improvements
* Dropped curr != donor check in pick_next_task_fair(), suggested by
  K Prateek
v23:
* Rework to use the PROXY_WAKING approach suggested by Peter
* Drop unnecessarily setting wake_cpu when affinity changes, as
  noticed by Peter
* Split out the ttwu() logic changes into a later separate patch, as
  suggested by Peter
v24:
* Numerous fixes for rq clock handling, pointed out by K Prateek
* Slight tweak to where put_task() is called, suggested by K Prateek
v25:
* Use WF_TTWU in proxy_force_return(), suggested by K Prateek
* Drop get/put_task_struct() in proxy_force_return(), suggested by
  K Prateek
* Use attach_one_task() to reduce repetitive logic, as suggested by
  K Prateek
v26:
* Add context analysis fixups suggested by Peter
* Add proxy_release/reacquire_rq_lock helpers suggested by Peter
* Rework comments as suggested by Peter
* Rework logic to use scoped_guard (task_rq_lock, p), suggested by
  Peter
* Move the proxy_resched_idle() call up earlier, before the rq
  release in proxy_force_return(), as suggested by K Prateek
* If needed, mark the task PROXY_WAKING if try_to_block_task() fails
  due to a signal, as noted by K Prateek

Cc: Joel Fernandes
Cc: Qais Yousef
Cc: Ingo Molnar
Cc: Peter Zijlstra
Cc: Juri Lelli
Cc: Vincent Guittot
Cc: Dietmar Eggemann
Cc: Valentin Schneider
Cc: Steven Rostedt
Cc: Ben Segall
Cc: Zimuzo Ezeozue
Cc: Mel Gorman
Cc: Will Deacon
Cc: Waiman Long
Cc: Boqun Feng
Cc: "Paul E. McKenney"
Cc: Metin Kaya
Cc: Xuewen Yan
Cc: K Prateek Nayak
Cc: Thomas Gleixner
Cc: Daniel Lezcano
Cc: Suleiman Souhlal
Cc: kuyo chang
Cc: hupu
Cc: kernel-team@android.com
---
 kernel/sched/core.c | 225 ++++++++++++++++++++++++++++++++++++++------
 1 file changed, 197 insertions(+), 28 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 4ed24ef590f73..49e4528450083 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3643,6 +3643,23 @@ void update_rq_avg_idle(struct rq *rq)
 		rq->idle_stamp = 0;
 }
 
+#ifdef CONFIG_SCHED_PROXY_EXEC
+static inline void proxy_set_task_cpu(struct task_struct *p, int cpu)
+{
+	unsigned int wake_cpu;
+
+	/*
+	 * Since we are enqueuing a blocked task on a cpu it may
+	 * not be able to run on, preserve wake_cpu when we
+	 * __set_task_cpu so we can return the task to where it
+	 * was previously runnable.
+	 */
+	wake_cpu = p->wake_cpu;
+	__set_task_cpu(p, cpu);
+	p->wake_cpu = wake_cpu;
+}
+#endif /* CONFIG_SCHED_PROXY_EXEC */
+
 static void
 ttwu_do_activate(struct rq *rq, struct task_struct *p, int wake_flags,
 		 struct rq_flags *rf)
@@ -4242,13 +4259,6 @@ int try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
 		ttwu_queue(p, cpu, wake_flags);
 	}
 out:
-	/*
-	 * For now, if we've been woken up, clear the task->blocked_on
-	 * regardless if it was set to a mutex or PROXY_WAKING so the
-	 * task can run. We will need to be more careful later when
-	 * properly handling proxy migration
-	 */
-	clear_task_blocked_on(p, NULL);
 	if (success)
 		ttwu_stat(p, task_cpu(p), wake_flags);
 
@@ -6533,6 +6543,8 @@ static bool try_to_block_task(struct rq *rq, struct task_struct *p,
 	if (signal_pending_state(task_state, p)) {
 		WRITE_ONCE(p->__state, TASK_RUNNING);
 		*task_state_p = TASK_RUNNING;
+		set_task_blocked_on_waking(p, NULL);
+
 		return false;
 	}
 
@@ -6578,7 +6590,7 @@ static inline struct task_struct *proxy_resched_idle(struct rq *rq)
 	return rq->idle;
 }
 
-static bool __proxy_deactivate(struct rq *rq, struct task_struct *donor)
+static bool proxy_deactivate(struct rq *rq, struct task_struct *donor)
 {
 	unsigned long state = READ_ONCE(donor->__state);
 
@@ -6598,17 +6610,140 @@ static bool __proxy_deactivate(struct rq *rq, struct task_struct *donor)
 	return try_to_block_task(rq, donor, &state, true);
 }
 
-static struct task_struct *proxy_deactivate(struct rq *rq, struct task_struct *donor)
+static inline void proxy_release_rq_lock(struct rq *rq, struct rq_flags *rf)
+	__releases(__rq_lockp(rq))
+{
+	/*
+	 * The class scheduler may have queued a balance callback
+	 * from pick_next_task() called earlier.
+	 *
+	 * So here we have to zap callbacks before unlocking the rq
+	 * as another CPU may jump in and call sched_balance_rq
+	 * which can trip the warning in rq_pin_lock() if we
+	 * leave callbacks set.
+	 *
+	 * After we later reacquire the rq lock, we will force __schedule()
+	 * to pick_again, so the callbacks will get re-established.
+	 */
+	zap_balance_callbacks(rq);
+	rq_unpin_lock(rq, rf);
+	raw_spin_rq_unlock(rq);
+}
+
+static inline void proxy_reacquire_rq_lock(struct rq *rq, struct rq_flags *rf)
+__acquires(__rq_lockp(rq))
+{
+	raw_spin_rq_lock(rq);
+	rq_repin_lock(rq, rf);
+	update_rq_clock(rq);
+}
+
+/*
+ * If the blocked-on relationship crosses CPUs, migrate @p to the
+ * owner's CPU.
+ *
+ * This is because we must respect the CPU affinity of execution
+ * contexts (owner) but we can ignore affinity for scheduling
+ * contexts (@p). So we have to move scheduling contexts towards
+ * potential execution contexts.
+ *
+ * Note: The owner can disappear, but simply migrate to @target_cpu
+ * and leave that CPU to sort things out.
+ */
+static void proxy_migrate_task(struct rq *rq, struct rq_flags *rf,
+			       struct task_struct *p, int target_cpu)
+	__must_hold(__rq_lockp(rq))
+{
+	struct rq *target_rq = cpu_rq(target_cpu);
+
+	lockdep_assert_rq_held(rq);
+	WARN_ON(p == rq->curr);
+	/*
+	 * Since we are migrating a blocked donor, it could be rq->donor,
+	 * and we want to make sure there aren't any references from this
+	 * rq to it before we drop the lock. This avoids another cpu
+	 * jumping in and grabbing the rq lock and referencing rq->donor
+	 * or cfs_rq->curr, etc after we have migrated it to another cpu,
+	 * and before we pick_again in __schedule.
+	 *
+	 * So call proxy_resched_idle() to drop the rq->donor references
+	 * before we release the lock.
+	 */
+	proxy_resched_idle(rq);
+
+	deactivate_task(rq, p, DEQUEUE_NOCLOCK);
+	proxy_set_task_cpu(p, target_cpu);
+
+	proxy_release_rq_lock(rq, rf);
+
+	attach_one_task(target_rq, p);
+
+	proxy_reacquire_rq_lock(rq, rf);
+}
+
+static void proxy_force_return(struct rq *rq, struct rq_flags *rf,
+			       struct task_struct *p)
+	__must_hold(__rq_lockp(rq))
 {
-	if (!__proxy_deactivate(rq, donor)) {
+	struct rq *task_rq, *target_rq = NULL;
+	int cpu, wake_flag = WF_TTWU;
+
+	lockdep_assert_rq_held(rq);
+	WARN_ON(p == rq->curr);
+
+	if (p == rq->donor)
+		proxy_resched_idle(rq);
+
+	proxy_release_rq_lock(rq, rf);
+	/*
+	 * We drop the rq lock, and re-grab task_rq_lock to get
+	 * the pi_lock (needed for select_task_rq) as well.
+	 */
+	scoped_guard (task_rq_lock, p) {
+		task_rq = scope.rq;
+
 		/*
-		 * XXX: For now, if deactivation failed, set donor
-		 * as unblocked, as we aren't doing proxy-migrations
-		 * yet (more logic will be needed then).
+		 * Since we let go of the rq lock, the task may have been
+		 * woken or migrated to another rq before we got the
+		 * task_rq_lock. So re-check we're on the same RQ. If
+		 * not, the task has already been migrated and that CPU
+		 * will handle any further migrations.
 		 */
-		clear_task_blocked_on(donor, NULL);
+		if (task_rq != rq)
+			break;
+
+		/*
+		 * Similarly, if we've been dequeued, someone else will
+		 * wake us
+		 */
+		if (!task_on_rq_queued(p))
+			break;
+
+		/*
+		 * Since we should only be calling here from __schedule()
+		 * -> find_proxy_task(), no one else should have
+		 * assigned current out from under us. But check and warn
+		 * if we see this, then bail.
+		 */
+		if (task_current(task_rq, p) || task_on_cpu(task_rq, p)) {
+			WARN_ONCE(1, "%s rq: %i current/on_cpu task %s %d on_cpu: %i\n",
+				  __func__, cpu_of(task_rq),
+				  p->comm, p->pid, p->on_cpu);
+			break;
+		}
+
+		update_rq_clock(task_rq);
+		deactivate_task(task_rq, p, DEQUEUE_NOCLOCK);
+		cpu = select_task_rq(p, p->wake_cpu, &wake_flag);
+		set_task_cpu(p, cpu);
+		target_rq = cpu_rq(cpu);
+		clear_task_blocked_on(p, NULL);
 	}
-	return NULL;
+
+	if (target_rq)
+		attach_one_task(target_rq, p);
+
+	proxy_reacquire_rq_lock(rq, rf);
 }
 
 /*
@@ -6629,18 +6764,27 @@ static struct task_struct *proxy_deactivate(struct rq *rq, struct task_struct *d
  */
 static struct task_struct *
 find_proxy_task(struct rq *rq, struct task_struct *donor, struct rq_flags *rf)
+	__must_hold(__rq_lockp(rq))
 {
-	enum { FOUND, DEACTIVATE_DONOR } action = FOUND;
+	enum { FOUND, DEACTIVATE_DONOR, MIGRATE, NEEDS_RETURN } action = FOUND;
 	struct task_struct *owner = NULL;
+	bool curr_in_chain = false;
 	int this_cpu = cpu_of(rq);
 	struct task_struct *p;
 	struct mutex *mutex;
+	int owner_cpu;
 
 	/* Follow blocked_on chain. */
 	for (p = donor; (mutex = p->blocked_on); p = owner) {
-		/* if its PROXY_WAKING, resched_idle so ttwu can complete */
-		if (mutex == PROXY_WAKING)
-			return proxy_resched_idle(rq);
+		/* if it's PROXY_WAKING, do return migration or run if current */
+		if (mutex == PROXY_WAKING) {
+			if (task_current(rq, p)) {
+				clear_task_blocked_on(p, PROXY_WAKING);
+				return p;
+			}
+			action = NEEDS_RETURN;
+			break;
+		}
 
 		/*
		 * By taking mutex->wait_lock we hold off concurrent mutex_unlock()
@@ -6660,26 +6804,41 @@ find_proxy_task(struct rq *rq, struct task_struct *donor, struct rq_flags *rf)
 			return NULL;
 		}
 
+		if (task_current(rq, p))
+			curr_in_chain = true;
+
 		owner = __mutex_owner(mutex);
 		if (!owner) {
 			/*
-			 * If there is no owner, clear blocked_on
-			 * and return p so it can run and try to
-			 * acquire the lock
+			 * If there is no owner, either clear blocked_on
+			 * and return p (if it is current and safe to
+			 * just run on this rq), or return-migrate the task.
 			 */
-			__clear_task_blocked_on(p, mutex);
-			return p;
+			if (task_current(rq, p)) {
+				__clear_task_blocked_on(p, NULL);
+				return p;
+			}
+			action = NEEDS_RETURN;
+			break;
 		}
 
 		if (!READ_ONCE(owner->on_rq) || owner->se.sched_delayed) {
 			/* XXX Don't handle blocked owners/delayed dequeue yet */
+			if (curr_in_chain)
+				return proxy_resched_idle(rq);
 			action = DEACTIVATE_DONOR;
 			break;
 		}
 
-		if (task_cpu(owner) != this_cpu) {
-			/* XXX Don't handle migrations yet */
-			action = DEACTIVATE_DONOR;
+		owner_cpu = task_cpu(owner);
+		if (owner_cpu != this_cpu) {
+			/*
+			 * @owner can disappear, simply migrate to @owner_cpu
+			 * and leave that CPU to sort things out.
+			 */
+			if (curr_in_chain)
+				return proxy_resched_idle(rq);
+			action = MIGRATE;
 			break;
 		}
 
@@ -6741,7 +6900,17 @@ find_proxy_task(struct rq *rq, struct task_struct *donor, struct rq_flags *rf)
 	/* Handle actions we need to do outside of the guard() scope */
 	switch (action) {
 	case DEACTIVATE_DONOR:
-		return proxy_deactivate(rq, donor);
+		if (proxy_deactivate(rq, donor))
+			return NULL;
+		/* If deactivate fails, force return */
+		p = donor;
+		fallthrough;
+	case NEEDS_RETURN:
+		proxy_force_return(rq, rf, p);
+		return NULL;
+	case MIGRATE:
+		proxy_migrate_task(rq, rf, p, owner_cpu);
+		return NULL;
 	case FOUND:
 		/* fallthrough */;
 	}
-- 
2.53.0.1018.g2bb0e51243-goog