From nobody Sat Feb 7 11:31:03 2026
Date: Mon, 25 Nov 2024 11:51:55 -0800
In-Reply-To: <20241125195204.2374458-1-jstultz@google.com>
Message-ID: <20241125195204.2374458-2-jstultz@google.com>
Subject: [RFC][PATCH v14 1/7] sched: Add CONFIG_SCHED_PROXY_EXEC & boot argument to enable/disable
From: John Stultz
To: LKML

Add a CONFIG_SCHED_PROXY_EXEC option, along with a boot argument
sched_proxy_exec= that can be used to disable the feature at boot
time if CONFIG_SCHED_PROXY_EXEC was enabled.

Cc: Joel Fernandes Cc: Qais Yousef Cc: Ingo Molnar Cc: Peter Zijlstra Cc: Juri Lelli Cc: Vincent Guittot Cc: Dietmar Eggemann Cc: Valentin Schneider Cc: Steven Rostedt Cc: Ben Segall Cc: Zimuzo Ezeozue Cc: Mel Gorman Cc: Will Deacon Cc: Waiman Long Cc: Boqun Feng Cc: "Paul E. McKenney" Cc: Metin Kaya Cc: Xuewen Yan Cc: K Prateek Nayak Cc: Thomas Gleixner Cc: Daniel Lezcano Cc: kernel-team@android.com
Tested-by: K Prateek Nayak
Signed-off-by: John Stultz
---
v7:
* Switch to CONFIG_SCHED_PROXY_EXEC/sched_proxy_exec= as suggested by
  Metin Kaya.
* Switch boot arg from =disable/enable to use kstrtobool(), which
  supports =yes|no|1|0|true|false|on|off, as also suggested by Metin
  Kaya, and print a message when a boot argument is used.
v8:
* Move CONFIG_SCHED_PROXY_EXEC under Scheduler Features as suggested
  by Metin
* Minor rework reordering with split sched contexts patch
v12:
* Rework for selected -> donor renaming
v14:
* Depend on !PREEMPT_RT to avoid build issues for now
---
 .../admin-guide/kernel-parameters.txt |  5 ++++
 include/linux/sched.h                 | 13 +++++++++
 init/Kconfig                          |  9 ++++++
 kernel/sched/core.c                   | 29 +++++++++++++++++++
 kernel/sched/sched.h                  | 12 ++++++++
 5 files changed, 68 insertions(+)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index a4736ee87b1aa..761e858c62b92 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -5955,6 +5955,11 @@
 	sa1100ir	[NET]
 			See drivers/net/irda/sa1100_ir.c.
 
+	sched_proxy_exec=	[KNL]
+			Enables or disables "proxy execution" style
+			solution to mutex-based priority inversion.
+			Format:
+
 	sched_verbose	[KNL,EARLY] Enables verbose scheduler debug messages.
 
 	schedstats=	[KNL,X86] Enable or disable scheduled statistics.
diff --git a/include/linux/sched.h b/include/linux/sched.h
index f0e9e00d3cf52..24e338ac34d7b 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1610,6 +1610,19 @@ struct task_struct {
 	 */
 };
 
+#ifdef CONFIG_SCHED_PROXY_EXEC
+DECLARE_STATIC_KEY_TRUE(__sched_proxy_exec);
+static inline bool sched_proxy_exec(void)
+{
+	return static_branch_likely(&__sched_proxy_exec);
+}
+#else
+static inline bool sched_proxy_exec(void)
+{
+	return false;
+}
+#endif
+
 #define TASK_REPORT_IDLE	(TASK_REPORT + 1)
 #define TASK_REPORT_MAX		(TASK_REPORT_IDLE << 1)
 
diff --git a/init/Kconfig b/init/Kconfig
index b07f238f3badb..364bd2065b1f1 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -863,6 +863,15 @@ config UCLAMP_BUCKETS_COUNT
 
 	  If in doubt, use the default value.
 
+config SCHED_PROXY_EXEC
+	bool "Proxy Execution"
+	default n
+	# Avoid some build failures w/ PREEMPT_RT until it can be fixed
+	depends on !PREEMPT_RT
+	help
+	  This option enables proxy execution, a mechanism for mutex-owning
+	  tasks to inherit the scheduling context of higher priority waiters.
+
 endmenu
 
 #
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 95e40895a5190..d712e177d3b75 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -119,6 +119,35 @@ EXPORT_TRACEPOINT_SYMBOL_GPL(sched_compute_energy_tp);
 
 DEFINE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues);
 
+#ifdef CONFIG_SCHED_PROXY_EXEC
+DEFINE_STATIC_KEY_TRUE(__sched_proxy_exec);
+static int __init setup_proxy_exec(char *str)
+{
+	bool proxy_enable;
+
+	if (kstrtobool(str, &proxy_enable)) {
+		pr_warn("Unable to parse sched_proxy_exec=\n");
+		return 0;
+	}
+
+	if (proxy_enable) {
+		pr_info("sched_proxy_exec enabled via boot arg\n");
+		static_branch_enable(&__sched_proxy_exec);
+	} else {
+		pr_info("sched_proxy_exec disabled via boot arg\n");
+		static_branch_disable(&__sched_proxy_exec);
+	}
+	return 1;
+}
+#else
+static int __init setup_proxy_exec(char *str)
+{
+	pr_warn("CONFIG_SCHED_PROXY_EXEC=n, so it cannot be enabled or disabled at boot time\n");
+	return 0;
+}
+#endif
+__setup("sched_proxy_exec=", setup_proxy_exec);
+
 #ifdef CONFIG_SCHED_DEBUG
 /*
  * Debugging: various feature bits
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 76f5f53a645fc..24eae02ddc7f6 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1148,10 +1148,15 @@ struct rq {
 	 */
 	unsigned int		nr_uninterruptible;
 
+#ifdef CONFIG_SCHED_PROXY_EXEC
+	struct task_struct __rcu	*donor;	/* Scheduling context */
+	struct task_struct __rcu	*curr;	/* Execution context */
+#else
 	union {
 		struct task_struct __rcu *donor;  /* Scheduler context */
 		struct task_struct __rcu *curr;   /* Execution context */
 	};
+#endif
 	struct sched_dl_entity	*dl_server;
 	struct task_struct	*idle;
 	struct task_struct	*stop;
@@ -1348,10 +1353,17 @@ DECLARE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues);
 
 #define cpu_curr(cpu)		(cpu_rq(cpu)->curr)
 #define raw_rq()		raw_cpu_ptr(&runqueues)
 
+#ifdef CONFIG_SCHED_PROXY_EXEC
+static inline void rq_set_donor(struct rq *rq, struct task_struct *t)
+{
+	rcu_assign_pointer(rq->donor, t);
+}
+#else
 static inline void rq_set_donor(struct rq *rq, struct task_struct *t)
 {
 	/* Do nothing */
 }
+#endif
 
 #ifdef CONFIG_SCHED_CORE
 static inline struct cpumask *sched_group_span(struct sched_group *sg);
-- 
2.47.0.371.ga323438b13-goog
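For readers trying out the boot-time toggle added above: sched_proxy_exec= accepts the usual kstrtobool() spellings listed in the v7 changelog. The standalone userspace sketch below only mirrors that accepted-value set; parse_bool() is a hypothetical stand-in, not a kernel API, and the program is just an illustration of the parsing semantics, not kernel code.

  #include <stdbool.h>
  #include <stdio.h>
  #include <string.h>

  /* Userspace sketch of kstrtobool()-style parsing for sched_proxy_exec=.
   * Accepts yes|no|1|0|true|false|on|off, per the v7 changelog above. */
  static int parse_bool(const char *s, bool *res)
  {
          if (!strcmp(s, "1") || !strcmp(s, "yes") ||
              !strcmp(s, "true") || !strcmp(s, "on")) {
                  *res = true;
                  return 0;
          }
          if (!strcmp(s, "0") || !strcmp(s, "no") ||
              !strcmp(s, "false") || !strcmp(s, "off")) {
                  *res = false;
                  return 0;
          }
          return -1;      /* mirrors the "Unable to parse" warning path */
  }

  int main(void)
  {
          const char *args[] = { "on", "off", "1", "bogus" };
          bool enable;

          for (unsigned i = 0; i < sizeof(args) / sizeof(args[0]); i++) {
                  if (parse_bool(args[i], &enable))
                          printf("sched_proxy_exec=%s: unable to parse\n", args[i]);
                  else
                          printf("sched_proxy_exec=%s: proxy exec %s\n",
                                 args[i], enable ? "enabled" : "disabled");
          }
          return 0;
  }

With CONFIG_SCHED_PROXY_EXEC=y built in, booting with sched_proxy_exec=off would take the static_branch_disable() path shown in the patch; with the option compiled out, the stub just warns that the feature cannot be toggled.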
From nobody Sat Feb 7 11:31:03 2026
Date: Mon, 25 Nov 2024 11:51:56 -0800
In-Reply-To: <20241125195204.2374458-1-jstultz@google.com>
Message-ID: <20241125195204.2374458-3-jstultz@google.com>
Subject: [RFC][PATCH v14 2/7] locking/mutex: Rework task_struct::blocked_on
From: John Stultz
To: LKML

From: Peter Zijlstra

Track the blocked-on relation for mutexes, to allow following this
relation at schedule time.

   task
     | blocked-on
     v
   mutex
     | owner
     v
   task

Also add a blocked_on_state value so we can distinguish when a task
is blocked_on a mutex, but is either blocked, waking up, or runnable
(such that it can try to acquire the lock it's blocked on).

This avoids some of the subtle & racy games where the blocked_on
state gets cleared, only to have it re-added by the
mutex_lock_slowpath call when it tries to acquire the lock on
wakeup.

Also add blocked_lock to the task_struct so we can safely serialize
the blocked-on state.

Finally add wrappers that are useful to provide correctness checks.
Folded in from a patch by: Valentin Schneider

This all will be used for tracking blocked-task/mutex chains with the
proxy-execution patch in a similar fashion to how priority
inheritance is done with rt_mutexes.

Cc: Joel Fernandes Cc: Qais Yousef Cc: Ingo Molnar Cc: Peter Zijlstra Cc: Juri Lelli Cc: Vincent Guittot Cc: Dietmar Eggemann Cc: Valentin Schneider Cc: Steven Rostedt Cc: Ben Segall Cc: Zimuzo Ezeozue Cc: Mel Gorman Cc: Will Deacon Cc: Waiman Long Cc: Boqun Feng Cc: "Paul E. McKenney" Cc: Metin Kaya Cc: Xuewen Yan Cc: K Prateek Nayak Cc: Thomas Gleixner Cc: Daniel Lezcano Cc: kernel-team@android.com
Signed-off-by: Peter Zijlstra (Intel)
[minor changes while rebasing]
Signed-off-by: Juri Lelli
Signed-off-by: Peter Zijlstra (Intel)
Signed-off-by: Connor O'Brien
[jstultz: Fix blocked_on tracking in __mutex_lock_common in error paths]
Signed-off-by: John Stultz
---
v2:
* Fixed blocked_on tracking in error paths that was causing crashes
v4:
* Ensure we clear blocked_on when waking ww_mutexes to die or wound.
  This is critical so we don't get circular blocked_on relationships
  that can't be resolved.
v5:
* Fix potential bug where the skip_wait path might clear blocked_on
  when that path never set it
* Slight tweaks to where we set blocked_on to make it consistent,
  along with extra WARN_ON correctness checking
* Minor comment changes
v7:
* Minor commit message change suggested by Metin Kaya
* Fix WARN_ON conditionals in unlock path (as blocked_on might
  already be cleared), found while looking at issue Metin Kaya
  raised.
* Minor tweaks to be consistent in what we do under the blocked_on
  lock, also tweaked variable name to avoid confusion with label, and
  comment typos, as suggested by Metin Kaya
* Minor tweak for CONFIG_SCHED_PROXY_EXEC name change
* Moved unused block of code to later in the series, as suggested by
  Metin Kaya
* Switch to a tri-state to be able to distinguish from waking and
  runnable so we can later safely do return migration from ttwu
* Folded together with related blocked_on changes
v8:
* Fix issue leaving task BO_BLOCKED when calling into optimistic
  spinning path.
* Include helper to better handle BO_BLOCKED->BO_WAKING transitions
v9:
* Typo fixup pointed out by Metin
* Cleanup BO_WAKING->BO_RUNNABLE transitions for the !proxy case
* Many cleanups and simplifications suggested by Metin
v11:
* Whitespace fixup pointed out by Metin
v13:
* Refactor set_blocked_on helpers to clean things up a bit
v14:
* Small build fixup with PREEMPT_RT
---
 include/linux/sched.h        | 66 ++++++++++++++++++++++++++++++++----
 init/init_task.c             |  1 +
 kernel/fork.c                |  4 +--
 kernel/locking/mutex-debug.c |  9 ++---
 kernel/locking/mutex.c       | 40 ++++++++++++++++++----
 kernel/locking/ww_mutex.h    | 24 +++++++++++--
 kernel/sched/core.c          |  1 +
 7 files changed, 125 insertions(+), 20 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 24e338ac34d7b..0ad8033f8c2b9 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -34,6 +34,7 @@
 #include
 #include
 #include
+#include
 #include
 #include
 #include
@@ -775,6 +776,12 @@ struct kmap_ctrl {
 #endif
 };
 
+enum blocked_on_state {
+	BO_RUNNABLE,
+	BO_BLOCKED,
+	BO_WAKING,
+};
+
 struct task_struct {
 #ifdef CONFIG_THREAD_INFO_IN_TASK
 	/*
@@ -1195,10 +1202,9 @@ struct task_struct {
 	struct rt_mutex_waiter		*pi_blocked_on;
 #endif
 
-#ifdef CONFIG_DEBUG_MUTEXES
-	/* Mutex deadlock detection: */
-	struct mutex_waiter		*blocked_on;
-#endif
+	enum blocked_on_state		blocked_on_state;
+	struct mutex			*blocked_on;	/* lock we're blocked on */
+	raw_spinlock_t			blocked_lock;
 
 #ifdef CONFIG_DEBUG_ATOMIC_SLEEP
 	int				non_block_count;
@@ -2118,6 +2124,56 @@ extern int __cond_resched_rwlock_write(rwlock_t *lock);
 	__cond_resched_rwlock_write(lock);				\
 })
 
+static inline void __set_blocked_on_runnable(struct task_struct *p)
+{
+	lockdep_assert_held(&p->blocked_lock);
+
+	if (p->blocked_on_state == BO_WAKING)
+		p->blocked_on_state = BO_RUNNABLE;
+}
+
+static inline void set_blocked_on_runnable(struct task_struct *p)
+{
+	unsigned long flags;
+
+	if (!sched_proxy_exec())
+		return;
+
+	raw_spin_lock_irqsave(&p->blocked_lock, flags);
+	__set_blocked_on_runnable(p);
+	raw_spin_unlock_irqrestore(&p->blocked_lock, flags);
+}
+
+static inline void set_blocked_on_waking(struct task_struct *p)
+{
+	lockdep_assert_held(&p->blocked_lock);
+
+	if (p->blocked_on_state == BO_BLOCKED)
+		p->blocked_on_state = BO_WAKING;
+}
+
+static inline void set_task_blocked_on(struct task_struct *p, struct mutex *m)
+{
+	lockdep_assert_held(&p->blocked_lock);
+
+	/*
+	 * Check we are clearing values to NULL or setting NULL
+	 * to values to ensure we don't overwrite existing mutex
+	 * values or clear already cleared values
+	 */
+	WARN_ON((!m && !p->blocked_on) || (m && p->blocked_on));
+
+	p->blocked_on = m;
+	p->blocked_on_state = m ? BO_BLOCKED : BO_RUNNABLE;
+}
+
+static inline struct mutex *get_task_blocked_on(struct task_struct *p)
+{
+	lockdep_assert_held(&p->blocked_lock);
+
+	return p->blocked_on;
+}
+
 static __always_inline bool need_resched(void)
 {
 	return unlikely(tif_need_resched());
@@ -2157,8 +2213,6 @@ extern bool sched_task_on_rq(struct task_struct *p);
 extern unsigned long get_wchan(struct task_struct *p);
 extern struct task_struct *cpu_curr_snapshot(int cpu);
 
-#include
-
 /*
  * In order to reduce various lock holder preemption latencies provide an
  * interface to see if a vCPU is currently running or not.
diff --git a/init/init_task.c b/init/init_task.c
index e557f622bd906..7e29d86153d9f 100644
--- a/init/init_task.c
+++ b/init/init_task.c
@@ -140,6 +140,7 @@ struct task_struct init_task __aligned(L1_CACHE_BYTES) = {
 	.journal_info	= NULL,
 	INIT_CPU_TIMERS(init_task)
 	.pi_lock	= __RAW_SPIN_LOCK_UNLOCKED(init_task.pi_lock),
+	.blocked_lock	= __RAW_SPIN_LOCK_UNLOCKED(init_task.blocked_lock),
 	.timer_slack_ns	= 50000, /* 50 usec default slack */
 	.thread_pid	= &init_struct_pid,
 	.thread_node	= LIST_HEAD_INIT(init_signals.thread_head),
diff --git a/kernel/fork.c b/kernel/fork.c
index f253e81d0c28e..160bead843afb 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2231,6 +2231,7 @@ __latent_entropy struct task_struct *copy_process(
 	ftrace_graph_init_task(p);
 
 	rt_mutex_init_task(p);
+	raw_spin_lock_init(&p->blocked_lock);
 
 	lockdep_assert_irqs_enabled();
#ifdef CONFIG_PROVE_LOCKING
@@ -2329,9 +2330,8 @@ __latent_entropy struct task_struct *copy_process(
 	lockdep_init_task(p);
 #endif
 
-#ifdef CONFIG_DEBUG_MUTEXES
+	p->blocked_on_state = BO_RUNNABLE;
 	p->blocked_on = NULL; /* not blocked yet */
-#endif
 #ifdef CONFIG_BCACHE
 	p->sequential_io	= 0;
 	p->sequential_io_avg	= 0;
diff --git a/kernel/locking/mutex-debug.c b/kernel/locking/mutex-debug.c
index 6e6f6071cfa27..1d8cff71f65e1 100644
--- a/kernel/locking/mutex-debug.c
+++ b/kernel/locking/mutex-debug.c
@@ -53,17 +53,18 @@ void debug_mutex_add_waiter(struct mutex *lock, struct mutex_waiter *waiter,
 {
 	lockdep_assert_held(&lock->wait_lock);
 
-	/* Mark the current thread as blocked on the lock: */
-	task->blocked_on = waiter;
+	/* Current thread can't be already blocked (since it's executing!) */
+	DEBUG_LOCKS_WARN_ON(get_task_blocked_on(task));
 }
 
 void debug_mutex_remove_waiter(struct mutex *lock, struct mutex_waiter *waiter,
 			 struct task_struct *task)
 {
+	struct mutex *blocked_on = get_task_blocked_on(task);
+
 	DEBUG_LOCKS_WARN_ON(list_empty(&waiter->list));
 	DEBUG_LOCKS_WARN_ON(waiter->task != task);
-	DEBUG_LOCKS_WARN_ON(task->blocked_on != waiter);
-	task->blocked_on = NULL;
+	DEBUG_LOCKS_WARN_ON(blocked_on && blocked_on != lock);
 
 	INIT_LIST_HEAD(&waiter->list);
 	waiter->task = NULL;
diff --git a/kernel/locking/mutex.c b/kernel/locking/mutex.c
index 3302e52f0c967..8f5d3fe6c1029 100644
--- a/kernel/locking/mutex.c
+++ b/kernel/locking/mutex.c
@@ -597,6 +597,7 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
 	}
 
 	raw_spin_lock_irqsave(&lock->wait_lock, flags);
+	raw_spin_lock(&current->blocked_lock);
 	/*
 	 * After waiting to acquire the wait_lock, try again.
 	 */
@@ -627,6 +628,7 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
 		goto err_early_kill;
 	}
 
+	set_task_blocked_on(current, lock);
 	set_current_state(state);
 	trace_contention_begin(lock, LCB_F_MUTEX);
 	for (;;) {
@@ -639,7 +641,7 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
 		 * the handoff.
 		 */
 		if (__mutex_trylock(lock))
-			goto acquired;
+			break; /* acquired */;
 
 		/*
 		 * Check for signals and kill conditions while holding
@@ -657,6 +659,7 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
 			goto err;
 		}
 
+		raw_spin_unlock(&current->blocked_lock);
 		raw_spin_unlock_irqrestore(&lock->wait_lock, flags);
 		/* Make sure we do wakeups before calling schedule */
 		wake_up_q(&wake_q);
@@ -666,6 +669,13 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
 
 		first = __mutex_waiter_is_first(lock, &waiter);
 
+		raw_spin_lock_irqsave(&lock->wait_lock, flags);
+		raw_spin_lock(&current->blocked_lock);
+
+		/*
+		 * Re-set blocked_on_state as unlock path set it to WAKING/RUNNABLE
+		 */
+		current->blocked_on_state = BO_BLOCKED;
 		set_current_state(state);
 		/*
 		 * Here we order against unlock; we must either see it change
@@ -676,16 +686,26 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
 			break;
 
 		if (first) {
+			bool opt_acquired;
+
+			/*
+			 * mutex_optimistic_spin() can schedule, so we need to
+			 * release these locks before calling it.
+			 */
+			current->blocked_on_state = BO_RUNNABLE;
+			raw_spin_unlock(&current->blocked_lock);
+			raw_spin_unlock_irqrestore(&lock->wait_lock, flags);
 			trace_contention_begin(lock, LCB_F_MUTEX | LCB_F_SPIN);
-			if (mutex_optimistic_spin(lock, ww_ctx, &waiter))
+			opt_acquired = mutex_optimistic_spin(lock, ww_ctx, &waiter);
+			raw_spin_lock_irqsave(&lock->wait_lock, flags);
+			raw_spin_lock(&current->blocked_lock);
+			current->blocked_on_state = BO_BLOCKED;
+			if (opt_acquired)
 				break;
 			trace_contention_begin(lock, LCB_F_MUTEX);
 		}
-
-		raw_spin_lock_irqsave(&lock->wait_lock, flags);
 	}
-	raw_spin_lock_irqsave(&lock->wait_lock, flags);
-acquired:
+	set_task_blocked_on(current, NULL);
 	__set_current_state(TASK_RUNNING);
 
 	if (ww_ctx) {
@@ -710,16 +730,20 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
 	if (ww_ctx)
 		ww_mutex_lock_acquired(ww, ww_ctx);
 
+	raw_spin_unlock(&current->blocked_lock);
 	raw_spin_unlock_irqrestore(&lock->wait_lock, flags);
 	wake_up_q(&wake_q);
 	preempt_enable();
 	return 0;
 
 err:
+	set_task_blocked_on(current, NULL);
 	__set_current_state(TASK_RUNNING);
 	__mutex_remove_waiter(lock, &waiter);
 err_early_kill:
+	WARN_ON(get_task_blocked_on(current));
 	trace_contention_end(lock, ret);
+	raw_spin_unlock(&current->blocked_lock);
 	raw_spin_unlock_irqrestore(&lock->wait_lock, flags);
 	debug_mutex_free_waiter(&waiter);
 	mutex_release(&lock->dep_map, ip);
@@ -928,8 +952,12 @@ static noinline void __sched __mutex_unlock_slowpath(struct mutex *lock, unsigne
 
 		next = waiter->task;
 
+		raw_spin_lock(&next->blocked_lock);
 		debug_mutex_wake_waiter(lock, waiter);
+		WARN_ON(get_task_blocked_on(next) != lock);
+		set_blocked_on_waking(next);
 		wake_q_add(&wake_q, next);
+		raw_spin_unlock(&next->blocked_lock);
 	}
 
 	if (owner & MUTEX_FLAG_HANDOFF)
diff --git a/kernel/locking/ww_mutex.h b/kernel/locking/ww_mutex.h
index 37f025a096c9d..d9ff2022eef6f 100644
--- a/kernel/locking/ww_mutex.h
+++ b/kernel/locking/ww_mutex.h
@@ -281,10 +281,21 @@ __ww_mutex_die(struct MUTEX *lock, struct MUTEX_WAITER *waiter,
 		return false;
 
 	if (waiter->ww_ctx->acquired > 0 && __ww_ctx_less(waiter->ww_ctx, ww_ctx)) {
+		/* nested as we should hold current->blocked_lock already */
+		raw_spin_lock_nested(&waiter->task->blocked_lock, SINGLE_DEPTH_NESTING);
 #ifndef WW_RT
 		debug_mutex_wake_waiter(lock, waiter);
+		/*
+		 * When waking up the task to die, be sure to set the
+		 * blocked_on_state to WAKING. Otherwise we can see
+		 * circular blocked_on relationships that can't
+		 * resolve.
+		 */
+		WARN_ON(get_task_blocked_on(waiter->task) != lock);
 #endif
+		set_blocked_on_waking(waiter->task);
 		wake_q_add(wake_q, waiter->task);
+		raw_spin_unlock(&waiter->task->blocked_lock);
 	}
 
 	return true;
@@ -331,9 +342,18 @@ static bool __ww_mutex_wound(struct MUTEX *lock,
 	 * it's wounded in __ww_mutex_check_kill() or has a
 	 * wakeup pending to re-read the wounded state.
 	 */
-	if (owner != current)
+	if (owner != current) {
+		/* nested as we should hold current->blocked_lock already */
+		raw_spin_lock_nested(&owner->blocked_lock, SINGLE_DEPTH_NESTING);
+		/*
+		 * When waking up the task to wound, be sure to set the
+		 * blocked_on_state flag. Otherwise we can see circular
+		 * blocked_on relationships that can't resolve.
+		 */
+		set_blocked_on_waking(owner);
 		wake_q_add(wake_q, owner);
-
+		raw_spin_unlock(&owner->blocked_lock);
+	}
 	return true;
 }
 
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index d712e177d3b75..f8714050b6d0d 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4350,6 +4350,7 @@ int try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
 		ttwu_queue(p, cpu, wake_flags);
 	}
 out:
+	set_blocked_on_runnable(p);
 	if (success)
 		ttwu_stat(p, task_cpu(p), wake_flags);
 
-- 
2.47.0.371.ga323438b13-goog
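The blocked_on tri-state introduced by this patch can be pictured with a small standalone sketch. This is not kernel code: the struct task and struct mutex below are toy stand-ins, and only the BO_BLOCKED -> BO_WAKING -> BO_RUNNABLE transition rules mirror the helpers added above.

  #include <assert.h>
  #include <stddef.h>
  #include <stdio.h>

  /* Toy model of the blocked_on tri-state described in the commit message. */
  enum blocked_on_state { BO_RUNNABLE, BO_BLOCKED, BO_WAKING };

  struct mutex { const char *name; };
  struct task {
          const char *comm;
          enum blocked_on_state blocked_on_state;
          struct mutex *blocked_on;
  };

  /* mutex lock slowpath: record what the task is blocked on */
  static void task_block_on(struct task *t, struct mutex *m)
  {
          assert(t->blocked_on == NULL);  /* mirrors the WARN_ON in the patch */
          t->blocked_on = m;
          t->blocked_on_state = BO_BLOCKED;
  }

  /* unlock / ww die / ww wound path: mark the waiter as being woken */
  static void task_set_waking(struct task *t)
  {
          if (t->blocked_on_state == BO_BLOCKED)
                  t->blocked_on_state = BO_WAKING;
  }

  /* wakeup path: the task may now try to acquire the lock again */
  static void task_set_runnable(struct task *t)
  {
          if (t->blocked_on_state == BO_WAKING)
                  t->blocked_on_state = BO_RUNNABLE;
  }

  int main(void)
  {
          struct mutex m = { "m" };
          struct task t = { "waiter", BO_RUNNABLE, NULL };

          task_block_on(&t, &m);
          task_set_waking(&t);
          task_set_runnable(&t);
          printf("%s: state=%d (BO_RUNNABLE=%d), still tracking lock %s\n",
                 t.comm, t.blocked_on_state, BO_RUNNABLE, t.blocked_on->name);
          return 0;
  }

Note how blocked_on itself stays set until the lock is actually taken or the wait is aborted; only the state field cycles, which is what lets later patches follow the chain safely.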
From nobody Sat Feb 7 11:31:03 2026
Date: Mon, 25 Nov 2024 11:51:57 -0800
In-Reply-To: <20241125195204.2374458-1-jstultz@google.com>
Message-ID: <20241125195204.2374458-4-jstultz@google.com>
Subject: [RFC][PATCH v14 3/7] sched: Fix runtime accounting w/ split exec & sched contexts
From: John Stultz
To: LKML

The idea here is that we want to charge the scheduler-context task's
vruntime but charge the execution-context task's sum_exec_runtime.

This way cputime accounting goes against the task actually running,
but vruntime accounting goes against the rq->donor task, so we get
proper fairness.
Cc: Joel Fernandes Cc: Qais Yousef Cc: Ingo Molnar Cc: Peter Zijlstra Cc: Juri Lelli Cc: Vincent Guittot Cc: Dietmar Eggemann Cc: Valentin Schneider Cc: Steven Rostedt Cc: Ben Segall Cc: Zimuzo Ezeozue Cc: Mel Gorman Cc: Will Deacon Cc: Waiman Long Cc: Boqun Feng Cc: "Paul E. McKenney" Cc: Metin Kaya Cc: Xuewen Yan Cc: K Prateek Nayak Cc: Thomas Gleixner Cc: Daniel Lezcano Cc: kernel-team@android.com
Signed-off-by: John Stultz
---
 kernel/sched/fair.c | 21 ++++++++++++++++-----
 1 file changed, 16 insertions(+), 5 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index fbdca89c677f4..ebde314e151f1 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1131,22 +1131,33 @@ static void update_tg_load_avg(struct cfs_rq)
 }
 #endif /* CONFIG_SMP */
 
-static s64 update_curr_se(struct rq *rq, struct sched_entity *curr)
+static s64 update_curr_se(struct rq *rq, struct sched_entity *se)
 {
 	u64 now = rq_clock_task(rq);
 	s64 delta_exec;
 
-	delta_exec = now - curr->exec_start;
+	delta_exec = now - se->exec_start;
 	if (unlikely(delta_exec <= 0))
 		return delta_exec;
 
-	curr->exec_start = now;
-	curr->sum_exec_runtime += delta_exec;
+	se->exec_start = now;
+	if (entity_is_task(se)) {
+		struct task_struct *running = rq->curr;
+		/*
+		 * If se is a task, we account the time against the running
+		 * task, as w/ proxy-exec they may not be the same.
+		 */
+		running->se.exec_start = now;
+		running->se.sum_exec_runtime += delta_exec;
+	} else {
+		/* If not task, account the time against se */
+		se->sum_exec_runtime += delta_exec;
+	}
 
 	if (schedstat_enabled()) {
 		struct sched_statistics *stats;
 
-		stats = __schedstats_from_se(curr);
+		stats = __schedstats_from_se(se);
 		__schedstat_set(stats->exec_max, max(delta_exec, stats->exec_max));
 	}
-- 
2.47.0.371.ga323438b13-goog
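The accounting split in the commit message can be shown numerically with a toy userspace sketch. The struct, the field names, and the 1024-based weight scaling below are illustrative stand-ins, not the kernel's types or math; only the idea of "exec time to the execution context, vruntime to the scheduling context" mirrors the patch.

  #include <stdint.h>
  #include <stdio.h>

  /* Toy model: curr (execution context) gets sum_exec_runtime,
   * donor (scheduling context) gets weighted vruntime. */
  struct entity {
          const char *comm;
          uint64_t sum_exec_runtime;      /* ns actually executed */
          uint64_t vruntime;              /* weighted virtual runtime */
          uint64_t weight;
  };

  static void account_delta(struct entity *curr, struct entity *donor,
                            uint64_t delta_ns)
  {
          /* cputime goes to whoever was physically on the CPU */
          curr->sum_exec_runtime += delta_ns;
          /* fairness (vruntime) goes to the donor's scheduling context */
          donor->vruntime += delta_ns * 1024 / donor->weight;
  }

  int main(void)
  {
          struct entity owner = { "mutex-owner (curr)", 0, 0, 1024 };
          struct entity donor = { "blocked waiter (donor)", 0, 0, 2048 };

          account_delta(&owner, &donor, 1000000);   /* one 1ms tick */
          printf("%s: exec=%llu ns\n", owner.comm,
                 (unsigned long long)owner.sum_exec_runtime);
          printf("%s: vruntime=%llu\n", donor.comm,
                 (unsigned long long)donor.vruntime);
          return 0;
  }

The point of the split: the proxy burns real CPU time and is billed for it, while fairness decisions keep tracking the donor whose scheduling context was actually selected.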
From nobody Sat Feb 7 11:31:03 2026
Date: Mon, 25 Nov 2024 11:51:58 -0800
In-Reply-To: <20241125195204.2374458-1-jstultz@google.com>
Message-ID: <20241125195204.2374458-5-jstultz@google.com>
Subject: [RFC][PATCH v14 4/7] sched: Fix psi_dequeue for Proxy Execution
From: John Stultz
To: LKML

Currently, if the sleep flag is set, psi_dequeue() doesn't change any
of the psi_flags. This is because psi_switch_task() will clear
TSK_ONCPU as well as other potential flags (TSK_RUNNING), and the
assumption is that a voluntary sleep always consists of a task being
dequeued, followed shortly thereafter by a psi_sched_switch() call.
Proxy Execution changes this expectation, as mutex-blocked tasks that
would normally sleep stay on the runqueue. In the case where the
mutex-owning task goes to sleep, we will then deactivate the blocked
task as well.

However, in that case, the mutex-blocked task will have had its
TSK_ONCPU cleared when it was switched off the cpu, but it will stay
TSK_RUNNING. Then, when we later dequeue it because of a sleeping
owner, psi_dequeue() won't change any state (leaving it TSK_RUNNING),
since the dequeue is a sleeping one and psi_dequeue() incorrectly
expects a psi_task_switch() call to immediately follow.

Later on, when it gets re-enqueued and psi_flags are set for
TSK_RUNNING, we hit an error as the task is already TSK_RUNNING:

  psi: inconsistent task state!

To resolve this, extend the logic in psi_dequeue() so that if the
sleep flag is set, we also check if psi_flags have TSK_ONCPU set
(meaning the psi_task_switch() is imminent) before we do the shortcut
return. If TSK_ONCPU is not set, that means we've already switched
away, and this psi_dequeue() call needs to clear the flags.

Cc: Joel Fernandes Cc: Qais Yousef Cc: Ingo Molnar Cc: Peter Zijlstra Cc: Juri Lelli Cc: Vincent Guittot Cc: Dietmar Eggemann Cc: Valentin Schneider Cc: Steven Rostedt Cc: Ben Segall Cc: Zimuzo Ezeozue Cc: Mel Gorman Cc: Will Deacon Cc: Waiman Long Cc: Boqun Feng Cc: "Paul E. McKenney" Cc: Metin Kaya Cc: Xuewen Yan Cc: K Prateek Nayak Cc: Thomas Gleixner Cc: Daniel Lezcano Cc: kernel-team@android.com
Signed-off-by: John Stultz
---
v13:
* Reworked for collision
---
 kernel/sched/stats.h | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/stats.h b/kernel/sched/stats.h
index 8ee0add5a48a8..c313fe76a7723 100644
--- a/kernel/sched/stats.h
+++ b/kernel/sched/stats.h
@@ -176,8 +176,12 @@ static inline void psi_dequeue(struct task_struct *p, int flags)
 	 * avoid walking all ancestors twice, psi_task_switch() handles
 	 * TSK_RUNNING and TSK_IOWAIT for us when it moves TSK_ONCPU.
 	 * Do nothing here.
+	 * In the SCHED_PROXY_EXECUTION case we may do sleeping
+	 * dequeues that are not followed by a task switch, so check
+	 * TSK_ONCPU is set to ensure the task switch is imminent.
+	 * Otherwise clear the flags as usual.
 	 */
-	if (flags & DEQUEUE_SLEEP)
+	if ((flags & DEQUEUE_SLEEP) && (p->psi_flags & TSK_ONCPU))
 		return;
 
 	/*
-- 
2.47.0.371.ga323438b13-goog
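The shortcut condition above is easiest to see with a tiny flag sketch. This is not the kernel's psi code: the flag values and the helper name are made up for illustration, and only the shape of the check mirrors the patch.

  #include <stdbool.h>
  #include <stdio.h>

  /* Illustrative flag values; the real ones live in the psi code. */
  #define TSK_RUNNING     (1 << 0)
  #define TSK_ONCPU       (1 << 1)
  #define DEQUEUE_SLEEP   (1 << 0)

  /* True when psi state clearing can be skipped, because a
   * psi_task_switch() for this task is about to run anyway. */
  static bool psi_dequeue_can_skip(unsigned int psi_flags, int flags)
  {
          return (flags & DEQUEUE_SLEEP) && (psi_flags & TSK_ONCPU);
  }

  int main(void)
  {
          /* Ordinary voluntary sleep: task is still on the CPU, skip. */
          printf("on-cpu sleep: skip=%d\n",
                 psi_dequeue_can_skip(TSK_RUNNING | TSK_ONCPU, DEQUEUE_SLEEP));
          /* Proxy-exec sleeping-owner dequeue: the task was switched out
           * earlier, TSK_ONCPU is gone, so the flags must be cleared here. */
          printf("off-cpu sleep: skip=%d\n",
                 psi_dequeue_can_skip(TSK_RUNNING, DEQUEUE_SLEEP));
          return 0;
  }

The second case is exactly the "sleeping dequeue with no task switch to follow" situation the commit message describes; skipping there is what used to trigger the "psi: inconsistent task state!" error.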
From nobody Sat Feb 7 11:31:03 2026
Date: Mon, 25 Nov 2024 11:51:59 -0800
In-Reply-To: <20241125195204.2374458-1-jstultz@google.com>
Message-ID: <20241125195204.2374458-6-jstultz@google.com>
Subject: [RFC][PATCH v14 5/7] sched: Add an initial sketch of the find_proxy_task() function
From: John Stultz
To: LKML

Add a find_proxy_task() function which doesn't do much.

When we select a blocked task to run, we will just deactivate it and
pick again. The exception being if it has become unblocked after
find_proxy_task() was called.

Greatly simplified from patch by:
  Peter Zijlstra (Intel)
  Juri Lelli
  Valentin Schneider
  Connor O'Brien

Cc: Joel Fernandes Cc: Qais Yousef Cc: Ingo Molnar Cc: Peter Zijlstra Cc: Juri Lelli Cc: Vincent Guittot Cc: Dietmar Eggemann Cc: Valentin Schneider Cc: Steven Rostedt Cc: Ben Segall Cc: Zimuzo Ezeozue Cc: Mel Gorman Cc: Will Deacon Cc: Waiman Long Cc: Boqun Feng Cc: "Paul E. McKenney" Cc: Metin Kaya Cc: Xuewen Yan Cc: K Prateek Nayak Cc: Thomas Gleixner Cc: Daniel Lezcano Cc: kernel-team@android.com
[jstultz: Split out from larger proxy patch and simplified for review
 and testing.]
Signed-off-by: John Stultz
---
v5:
* Split out from larger proxy patch
v7:
* Fixed unused function arguments, spelling nits, and tweaks for
  clarity, pointed out by Metin Kaya
* Fix build warning
  Reported-by: kernel test robot
  Closes: https://lore.kernel.org/oe-kbuild-all/202311081028.yDLmCWgr-lkp@intel.com/
v8:
* Fixed case where we might return a blocked task from
  find_proxy_task()
* Continued tweaks to handle avoiding returning blocked tasks
v9:
* Add zap_balance_callbacks helper to unwind balance_callbacks
  when we will re-call pick_next_task() again.
* Add extra comment suggested by Metin
* Typo fixes from Metin
* Moved adding proxy_resched_idle earlier in the series, as suggested
  by Metin
* Fix to call proxy_resched_idle() *prior* to deactivating next, to
  avoid crashes caused by stale references to next
* s/PROXY/SCHED_PROXY_EXEC/ as suggested by Metin
* Number of tweaks and cleanups suggested by Metin
* Simplify proxy_deactivate as suggested by Metin
v11:
* Tweaks for earlier simplification in try_to_deactivate_task
v13:
* Rename "next" to "donor" in find_proxy_task() for clarity
* Similarly use "donor" instead of next in proxy_deactivate
* Refactor/simplify proxy_resched_idle
* Moved up a needed fix from later in the series
---
 kernel/sched/core.c  | 129 ++++++++++++++++++++++++++++++++++++++++++-
 kernel/sched/rt.c    |  15 ++++-
 kernel/sched/sched.h |  10 +++-
 3 files changed, 148 insertions(+), 6 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index f8714050b6d0d..b492506d33415 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5052,6 +5052,34 @@ static void do_balance_callbacks(struct rq *rq, struct balance_callback *head)
 	}
 }
 
+/*
+ * Only called from __schedule context
+ *
+ * There are some cases where we are going to re-do the action
+ * that added the balance callbacks. We may not be in a state
+ * where we can run them, so just zap them so they can be
+ * properly re-added on the next time around. This is similar
+ * handling to running the callbacks, except we just don't call
+ * them.
+ */
+static void zap_balance_callbacks(struct rq *rq)
+{
+	struct balance_callback *next, *head;
+	bool found = false;
+
+	lockdep_assert_rq_held(rq);
+
+	head = rq->balance_callback;
+	while (head) {
+		if (head == &balance_push_callback)
+			found = true;
+		next = head->next;
+		head->next = NULL;
+		head = next;
+	}
+	rq->balance_callback = found ? &balance_push_callback : NULL;
+}
+
 static void balance_push(struct rq *rq);
 
 /*
@@ -6592,7 +6620,7 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
  * Otherwise marks the task's __state as RUNNING
  */
 static bool try_to_block_task(struct rq *rq, struct task_struct *p,
-			      unsigned long task_state)
+			      unsigned long task_state, bool deactivate_cond)
 {
 	int flags = DEQUEUE_NOCLOCK;
 
@@ -6601,6 +6629,9 @@ static bool try_to_block_task(struct rq *rq, struct task_struct *p,
 		return false;
 	}
 
+	if (!deactivate_cond)
+		return false;
+
 	p->sched_contributes_to_load =
 		(task_state & TASK_UNINTERRUPTIBLE) &&
 		!(task_state & TASK_NOLOAD) &&
@@ -6624,6 +6655,88 @@ static bool try_to_block_task(struct rq *rq, struct task_struct *p,
 	return true;
 }
 
+#ifdef CONFIG_SCHED_PROXY_EXEC
+
+static inline struct task_struct *
+proxy_resched_idle(struct rq *rq)
+{
+	put_prev_task(rq, rq->donor);
+	rq_set_donor(rq, rq->idle);
+	set_next_task(rq, rq->idle);
+	set_tsk_need_resched(rq->idle);
+	return rq->idle;
+}
+
+static bool proxy_deactivate(struct rq *rq, struct task_struct *donor)
+{
+	unsigned long state = READ_ONCE(donor->__state);
+
+	/* Don't deactivate if the state has been changed to TASK_RUNNING */
+	if (state == TASK_RUNNING)
+		return false;
+	/*
+	 * Because we got donor from pick_next_task, it is *crucial*
+	 * that we call proxy_resched_idle before we deactivate it.
+	 * As once we deactivate donor, donor->on_rq is set to zero,
+	 * which allows ttwu to immediately try to wake the task on
+	 * another rq. So we cannot use *any* references to donor
+	 * after that point. So things like cfs_rq->curr or rq->donor
+	 * need to be changed from next *before* we deactivate.
+	 */
+	proxy_resched_idle(rq);
+	return try_to_block_task(rq, donor, state, true);
+}
+
+/*
+ * Initial simple proxy that just returns the task if it's waking
+ * or deactivates the blocked task so we can pick something that
+ * isn't blocked.
+ */
+static struct task_struct *
+find_proxy_task(struct rq *rq, struct task_struct *donor, struct rq_flags *rf)
+{
+	struct task_struct *p = donor;
+	struct mutex *mutex;
+
+	mutex = p->blocked_on;
+	/* Something changed in the chain, so pick again */
+	if (!mutex)
+		return NULL;
+	/*
+	 * By taking mutex->wait_lock we hold off concurrent mutex_unlock()
+	 * and ensure @owner sticks around.
+	 */
+	raw_spin_lock(&mutex->wait_lock);
+	raw_spin_lock(&p->blocked_lock);
+
+	/* Check again that p is blocked with blocked_lock held */
+	if (!task_is_blocked(p) || mutex != get_task_blocked_on(p)) {
+		/*
+		 * Something changed in the blocked_on chain and
+		 * we don't know if only at this level. So, let's
+		 * just bail out completely and let __schedule
+		 * figure things out (pick_again loop).
+		 */
+		goto out;
+	}
+	if (!proxy_deactivate(rq, donor))
+		/* XXX: This hack won't work when we get to migrations */
+		donor->blocked_on_state = BO_RUNNABLE;
+
+out:
+	raw_spin_unlock(&p->blocked_lock);
+	raw_spin_unlock(&mutex->wait_lock);
+	return NULL;
+}
+#else /* SCHED_PROXY_EXEC */
+static struct task_struct *
+find_proxy_task(struct rq *rq, struct task_struct *donor, struct rq_flags *rf)
+{
+	WARN_ONCE(1, "This should never be called in the !SCHED_PROXY_EXEC case\n");
+	return donor;
+}
+#endif /* SCHED_PROXY_EXEC */
+
 /*
  * __schedule() is the main scheduler function.
  *
@@ -6732,12 +6845,22 @@ static void __sched notrace __schedule(int sched_mode)
 			goto picked;
 		}
 	} else if (!preempt && prev_state) {
-		block = try_to_block_task(rq, prev, prev_state);
+		block = try_to_block_task(rq, prev, prev_state,
+					  !task_is_blocked(prev));
 		switch_count = &prev->nvcsw;
 	}
 
-	next = pick_next_task(rq, prev, &rf);
+pick_again:
+	next = pick_next_task(rq, rq->donor, &rf);
 	rq_set_donor(rq, next);
+	if (unlikely(task_is_blocked(next))) {
+		next = find_proxy_task(rq, next, &rf);
+		if (!next) {
+			/* zap the balance_callbacks before picking again */
+			zap_balance_callbacks(rq);
+			goto pick_again;
+		}
+	}
 picked:
 	clear_tsk_need_resched(prev);
 	clear_preempt_need_resched();
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index bd66a46b06aca..fa4d9bf76ad49 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -1479,8 +1479,19 @@ enqueue_task_rt(struct rq *rq, struct task_struct *p, int flags)
 
 	enqueue_rt_entity(rt_se, flags);
 
-	if (!task_current(rq, p) && p->nr_cpus_allowed > 1)
-		enqueue_pushable_task(rq, p);
+	/*
+	 * Current can't be pushed away. Selected is tied to current,
+	 * so don't push it either.
+	 */
+	if (task_current(rq, p) || task_current_donor(rq, p))
+		return;
+	/*
+	 * Pinned tasks can't be pushed.
+	 */
+	if (p->nr_cpus_allowed == 1)
+		return;
+
+	enqueue_pushable_task(rq, p);
 }
 
 static bool dequeue_task_rt(struct rq *rq, struct task_struct *p, int flags)
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 24eae02ddc7f6..f560d1d1a7a0c 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2272,6 +2272,14 @@ static inline int task_current_donor(struct rq *rq, struct task_struct *p)
 	return rq->donor == p;
 }
 
+static inline bool task_is_blocked(struct task_struct *p)
+{
+	if (!sched_proxy_exec())
+		return false;
+
+	return !!p->blocked_on && p->blocked_on_state != BO_RUNNABLE;
+}
+
 static inline int task_on_cpu(struct rq *rq, struct task_struct *p)
 {
 #ifdef CONFIG_SMP
@@ -2481,7 +2489,7 @@ static inline void put_prev_set_next_task(struct rq *rq,
 					  struct task_struct *prev,
 					  struct task_struct *next)
 {
-	WARN_ON_ONCE(rq->curr != prev);
+	WARN_ON_ONCE(rq->donor != prev);
 
 	__put_prev_set_next_dl_server(rq, prev, next);
 
-- 
2.47.0.371.ga323438b13-goog
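The "deactivate it and pick again" behavior this patch sketches can be modeled with a few lines of standalone C. The data structures below are toy stand-ins (not the kernel's rq or task_struct); only the pick_again loop shape mirrors the __schedule() change above.

  #include <stdbool.h>
  #include <stdio.h>

  /* Toy model of the pick_again loop: if the picked task is still
   * blocked on a mutex, drop it from the runnable set and pick again. */
  struct toy_task {
          const char *comm;
          bool blocked;           /* stands in for task_is_blocked() */
          bool queued;
  };

  static struct toy_task *pick_next(struct toy_task *rq, int nr)
  {
          for (int i = 0; i < nr; i++)
                  if (rq[i].queued)
                          return &rq[i];
          return NULL;
  }

  int main(void)
  {
          struct toy_task rq[] = {
                  { "blocked-waiter", true,  true },
                  { "runnable-task",  false, true },
          };
          struct toy_task *next;

          for (;;) {                      /* pick_again: */
                  next = pick_next(rq, 2);
                  if (next && next->blocked) {
                          /* "deactivate it and pick again" */
                          next->queued = false;
                          continue;
                  }
                  break;
          }
          printf("picked: %s\n", next ? next->comm : "idle");
          return 0;
  }

Later patches replace the simple deactivation with an actual walk of the blocked_on chain to find a runnable proxy; this patch only establishes the retry loop and its locking.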
Dqzg==
Date: Mon, 25 Nov 2024 11:52:00 -0800
In-Reply-To: <20241125195204.2374458-1-jstultz@google.com>
Mime-Version: 1.0
References: <20241125195204.2374458-1-jstultz@google.com>
Message-ID: <20241125195204.2374458-7-jstultz@google.com>
Subject: [RFC][PATCH v14 6/7] sched: Fix proxy/current (push,pull)ability
From: John Stultz
To: LKML
Cc: Valentin Schneider , Joel Fernandes , Qais Yousef , Ingo Molnar , Peter Zijlstra , Juri Lelli , Vincent Guittot , Dietmar Eggemann , Valentin Schneider , Steven Rostedt , Ben Segall , Zimuzo Ezeozue , Mel Gorman , Will Deacon , Waiman Long , Boqun Feng , "Paul E. McKenney" , Metin Kaya , Xuewen Yan , K Prateek Nayak , Thomas Gleixner , Daniel Lezcano , kernel-team@android.com, "Connor O'Brien" , John Stultz
Content-Transfer-Encoding: quoted-printable
Content-Type: text/plain; charset="utf-8"

From: Valentin Schneider

Proxy execution forms atomic pairs of tasks: The waiting donor task
(scheduling context) and a proxy (execution context). The donor task,
along with the rest of the blocked chain, follows the proxy wrt CPU
placement.

They can be the same task, in which case push/pull doesn't need any
modification. When they are different, however, FIFO1 & FIFO42:

                 ,->  RT42
                 |     | blocked-on
                 |     v
   blocked_donor |   mutex
                 |     | owner
                 |     v
                 `--  RT1

               RT1
               RT42

         CPU0            CPU1
          ^                ^
          |                |
      overloaded      !overloaded
     rq prio = 42     rq prio = 0

RT1 is eligible to be pushed to CPU1, but should that happen it will
"carry" RT42 along. Clearly here neither RT1 nor RT42 should be
considered push/pullable.

Unfortunately, only the donor task is usually dequeued from the rq, and
the proxy'ed execution context (rq->curr) remains on the rq. This can
cause RT1 to be selected for migration by logic like the rt
pushable_list.

Thus, add a dequeue/enqueue cycle on the proxy task before __schedule
returns, which allows the sched class logic to avoid adding the now
current task to the pushable_list.
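As a rough illustration only (not kernel code, and not part of this
patch), the pushability rule described above can be modelled in a few
lines of userspace C. The types and the task_is_pushable() helper below
are simplified stand-ins invented for this sketch; they merely mirror
the rule that neither the execution context (rq->curr) nor the donor,
nor a pinned task, should land on the pushable list:

	/* Toy userspace model of the pushability rule, not kernel code. */
	#include <stdbool.h>
	#include <stdio.h>

	struct task { const char *name; int nr_cpus_allowed; };
	struct rq   { struct task *curr; struct task *donor; };

	static bool task_is_pushable(const struct rq *rq, const struct task *p)
	{
		if (p == rq->curr || p == rq->donor)	/* atomic proxy pair */
			return false;
		if (p->nr_cpus_allowed == 1)		/* pinned task */
			return false;
		return true;
	}

	int main(void)
	{
		struct task rt1  = { "RT1",  4 };	/* proxy / execution ctx  */
		struct task rt42 = { "RT42", 4 };	/* donor / scheduling ctx */
		struct task rt5  = { "RT5",  4 };	/* unrelated runnable     */
		struct rq rq0 = { .curr = &rt1, .donor = &rt42 };

		printf("RT1 pushable:  %d\n", task_is_pushable(&rq0, &rt1));  /* 0 */
		printf("RT42 pushable: %d\n", task_is_pushable(&rq0, &rt42)); /* 0 */
		printf("RT5 pushable:  %d\n", task_is_pushable(&rq0, &rt5));  /* 1 */
		return 0;
	}

Running it prints 0 for RT1 and RT42 and 1 for RT5, which is the
behaviour the enqueue_task_rt() change earlier in the series aims for;
the dequeue/enqueue cycle added by this patch exists so that rule gets
re-evaluated once the proxy becomes rq->curr.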
Furthermore, tasks becoming blocked on a mutex don't need an explicit dequeue/enqueue cycle to be made (push/pull)able: they have to be running to block on a mutex, thus they will eventually hit put_prev_task(). Cc: Joel Fernandes Cc: Qais Yousef Cc: Ingo Molnar Cc: Peter Zijlstra Cc: Juri Lelli Cc: Vincent Guittot Cc: Dietmar Eggemann Cc: Valentin Schneider Cc: Steven Rostedt Cc: Ben Segall Cc: Zimuzo Ezeozue Cc: Mel Gorman Cc: Will Deacon Cc: Waiman Long Cc: Boqun Feng Cc: "Paul E. McKenney" Cc: Metin Kaya Cc: Xuewen Yan Cc: K Prateek Nayak Cc: Thomas Gleixner Cc: Daniel Lezcano Cc: kernel-team@android.com Signed-off-by: Valentin Schneider Signed-off-by: Connor O'Brien Signed-off-by: John Stultz --- v3: * Tweaked comments & commit message v5: * Minor simplifications to utilize the fix earlier in the patch series. * Rework the wording of the commit message to match selected/ proxy terminology and expand a bit to make it more clear how it works. v6: * Dropped now-unused proxied value, to be re-added later in the series when it is used, as caught by Dietmar v7: * Unused function argument fixup * Commit message nit pointed out by Metin Kaya * Dropped unproven unlikely() and use sched_proxy_exec() in proxy_tag_curr, suggested by Metin Kaya v8: * More cleanups and typo fixes suggested by Metin Kaya v11: * Cleanup of comimt message suggested by Metin v12: * Rework for rq_selected -> rq->donor renaming --- kernel/sched/core.c | 25 +++++++++++++++++++++++++ 1 file changed, 25 insertions(+) diff --git a/kernel/sched/core.c b/kernel/sched/core.c index b492506d33415..a18523355fb18 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -6737,6 +6737,23 @@ find_proxy_task(struct rq *rq, struct task_struct *d= onor, struct rq_flags *rf) } #endif /* SCHED_PROXY_EXEC */ =20 +static inline void proxy_tag_curr(struct rq *rq, struct task_struct *owner) +{ + if (!sched_proxy_exec()) + return; + /* + * pick_next_task() calls set_next_task() on the chosen task + * at some point, which ensures it is not push/pullable. + * However, the chosen/donor task *and* the mutex owner form an + * atomic pair wrt push/pull. + * + * Make sure owner we run is not pushable. Unfortunately we can + * only deal with that by means of a dequeue/enqueue cycle. :-/ + */ + dequeue_task(rq, owner, DEQUEUE_NOCLOCK | DEQUEUE_SAVE); + enqueue_task(rq, owner, ENQUEUE_NOCLOCK | ENQUEUE_RESTORE); +} + /* * __schedule() is the main scheduler function. * @@ -6875,6 +6892,10 @@ static void __sched notrace __schedule(int sched_mod= e) * changes to task_struct made by pick_next_task(). 
*/ RCU_INIT_POINTER(rq->curr, next); + + if (!task_current_donor(rq, next)) + proxy_tag_curr(rq, next); + /* * The membarrier system call requires each architecture * to have a full memory barrier after updating @@ -6908,6 +6929,10 @@ static void __sched notrace __schedule(int sched_mod= e) /* Also unlocks the rq: */ rq =3D context_switch(rq, prev, next, &rf); } else { + /* In case next was already curr but just got blocked_donor */ + if (!task_current_donor(rq, next)) + proxy_tag_curr(rq, next); + rq_unpin_lock(rq, &rf); __balance_callbacks(rq); raw_spin_rq_unlock_irq(rq); --=20 2.47.0.371.ga323438b13-goog From nobody Sat Feb 7 11:31:03 2026 Received: from mail-pg1-f201.google.com (mail-pg1-f201.google.com [209.85.215.201]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 49BC61CD210 for ; Mon, 25 Nov 2024 19:52:19 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=209.85.215.201 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1732564341; cv=none; b=Efi2STIZjqk6KS6YzPNFB0HT9yK8IZtvEC3kqRqwf2age4YmEnt1jJnEifTdoU7iTAWLfrEVI4CcskDmQOKIycqmP+zGFltYv4NPQ6nWKetzHUJEUqDtMnkQqE+ZR+KsUfU5VIfRJWlJs75NxQMFHo++o1KiaEF78POHQUhttkc= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1732564341; c=relaxed/simple; bh=J0V+SYDTlsKBx4a0oSxVfq4AwcX1ZjFOzABscMrTpFY=; h=Date:In-Reply-To:Mime-Version:References:Message-ID:Subject:From: To:Cc:Content-Type; b=md9vI6SKqfZawWtMrF+AMSbiwpXhSjs04Sj+JkaXhW97jxOXEhwiVu/VbXfQLstZciA5PXYqiV9F4Oh1z7sqi6jn6dS5iUHm6cJ5Q4Lm8ph5PfqKjsyZgH4ENaivxDAIThRl4jC472UaXYm6yGRVQIGxWS3fD0rcQtIDPLvYzNk= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=reject dis=none) header.from=google.com; spf=pass smtp.mailfrom=flex--jstultz.bounces.google.com; dkim=pass (2048-bit key) header.d=google.com header.i=@google.com header.b=Z+FpwR5E; arc=none smtp.client-ip=209.85.215.201 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=reject dis=none) header.from=google.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=flex--jstultz.bounces.google.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=google.com header.i=@google.com header.b="Z+FpwR5E" Received: by mail-pg1-f201.google.com with SMTP id 41be03b00d2f7-7fc2dc5861eso1265379a12.0 for ; Mon, 25 Nov 2024 11:52:19 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20230601; t=1732564339; x=1733169139; darn=vger.kernel.org; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:from:to:cc:subject:date:message-id:reply-to; bh=XOtAH6cdJdFL4IxlyXePH2qxBmoamy+ophSAltPJL/Y=; b=Z+FpwR5ES4w5XtzTxSkSDCMfYhQ/91zL1bVm799uQ0aUYrJFkSsUXz2pd1aPkG4Bjm yGM1H7YQsohl1LXOSP3U+13pxE4Nk/DOag14aRPNV6w7fI99AJ8UWAUid4yQyPvVflUu Htt92+2kfLS0DTPQ4vwWQ9eQnyJMGbfb3K/LKnydDrxa+343vzFaa3+cgCOpHpUW9nj7 Rle3wH3OOfAcb6r24QMrwHs7xRppfsSR7mCMlxGWZSzs71O2FU4ohYzIloNuHyhDu1OY gGFJ91AQ87Zj89xQXXUYlrhcCeUyDTDxp67NHYC2TYNEwL3Yq30K1IriTmFmlmYRWGpY G12w== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1732564339; x=1733169139; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:x-gm-message-state:from:to:cc:subject:date:message-id:reply-to; bh=XOtAH6cdJdFL4IxlyXePH2qxBmoamy+ophSAltPJL/Y=; b=QkGTpYmmrW5vz2LQoa6CWuPhaVDIZWAlUm2xP5aGQehia1IpKCoEzfXw4RXGBrqd+Y 
OGYoIPz3QN5tPPN/qbeSkFlwUPNeYDoeHv5YlJBJuuudDl+9CDuRpbqa0wYgZUHvKTx7 KKawWcEdVcA4ce/wCVhI0XbCq9KIltATpbxsSIS6g8Ly2LpDrN9wjYQI9Q6dtHcxtD7K ziTB+r/0vZKpkMXZjRnuo3XK4EHaqhXYR+R6D2RUsCnCaCABhfwW0B/wmaGgvNRlhXze pU22UIHKE0DhckIl37LtbWeex/kf50IjBsA2EbguAuen9nRfvPibis//OoIk98bElap8 KKVQ== X-Gm-Message-State: AOJu0YygQv8YrTRF3q9oE0G2/X/8JSi/kziWopFWWM5NJ/BA2asgSFMp V25tqWIUMZyolTRtU6uzlXXlPVj4zbs4ltQzF8YAnmF60sYQ0zUCVsQGKZZlUAVYloHVCEAKwzu jwLmkkIWiCdneBB/OyYsJ3kgoMS/+DLPhGfOaAizuQpF22tTXCQ0Y4HIBXGbKM/f/aKRH8Jzu8V pQxV7mfDbSnzKVOQK7std3EinAEtjoUtyas5mAUsZKVYVe X-Google-Smtp-Source: AGHT+IH6+pgLBwSnl6sWAv3xpjSy90jPbSII/bpFkF6vlEVfFOs6RTZCVP2LrSmVpuKdyXbOBieIMm/YV5i+ X-Received: from pgbcx11.prod.google.com ([2002:a05:6a02:220b:b0:7fc:2823:d6c4]) (user=jstultz job=prod-delivery.src-stubby-dispatcher) by 2002:a05:6a20:748e:b0:1db:e3f6:55f with SMTP id adf61e73a8af0-1e09e420b79mr22740426637.18.1732564338393; Mon, 25 Nov 2024 11:52:18 -0800 (PST) Date: Mon, 25 Nov 2024 11:52:01 -0800 In-Reply-To: <20241125195204.2374458-1-jstultz@google.com> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: Mime-Version: 1.0 References: <20241125195204.2374458-1-jstultz@google.com> X-Mailer: git-send-email 2.47.0.371.ga323438b13-goog Message-ID: <20241125195204.2374458-8-jstultz@google.com> Subject: [RFC][PATCH v14 7/7] sched: Start blocked_on chain processing in find_proxy_task() From: John Stultz To: LKML Cc: Peter Zijlstra , Joel Fernandes , Qais Yousef , Ingo Molnar , Juri Lelli , Vincent Guittot , Dietmar Eggemann , Valentin Schneider , Steven Rostedt , Ben Segall , Zimuzo Ezeozue , Mel Gorman , Will Deacon , Waiman Long , Boqun Feng , "Paul E. McKenney" , Metin Kaya , Xuewen Yan , K Prateek Nayak , Thomas Gleixner , Daniel Lezcano , kernel-team@android.com, Valentin Schneider , "Connor O'Brien" , John Stultz Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" From: Peter Zijlstra Start to flesh out the real find_proxy_task() implementation, but avoid the migration cases for now, in those cases just deactivate the donor task and pick again. To ensure the donor task or other blocked tasks in the chain aren't migrated away while we're running the proxy, also tweak the CFS logic to avoid migrating donor or mutex blocked tasks. Cc: Joel Fernandes Cc: Qais Yousef Cc: Ingo Molnar Cc: Peter Zijlstra Cc: Juri Lelli Cc: Vincent Guittot Cc: Dietmar Eggemann Cc: Valentin Schneider Cc: Steven Rostedt Cc: Ben Segall Cc: Zimuzo Ezeozue Cc: Mel Gorman Cc: Will Deacon Cc: Waiman Long Cc: Boqun Feng Cc: "Paul E. McKenney" Cc: Metin Kaya Cc: Xuewen Yan Cc: K Prateek Nayak Cc: Thomas Gleixner Cc: Daniel Lezcano Cc: kernel-team@android.com Signed-off-by: Peter Zijlstra (Intel) Signed-off-by: Juri Lelli Signed-off-by: Valentin Schneider Signed-off-by: Connor O'Brien [jstultz: This change was split out from the larger proxy patch] Signed-off-by: John Stultz --- v5: * Split this out from larger proxy patch v7: * Minor refactoring of core find_proxy_task() function * Minor spelling and corrections suggested by Metin Kaya * Dropped an added BUG_ON that was frequently tripped v8: * Fix issue if proxy_deactivate fails, we don't leave task BO_BLOCKED * Switch to WARN_ON from BUG_ON checks v9: * Improve comments suggested by Metin * Minor cleanups v11: * Previously we checked next=3D=3Drq->idle && prev=3D=3Drq->idle, but I think we only really care if next=3D=3Drq->idle from find_proxy_task, as we will still want to resched regardless of what prev was. 
v12: * Commit message rework for selected -> donor rewording v13: * Address new delayed dequeue condition (deactivate donor for now) * Next to donor renaming in find_proxy_task * Improved comments for find_proxy_task * Rework for proxy_deactivate cleanup v14: * Fix build error from __mutex_owner() with CONFIG_PREEMPT_RT --- kernel/locking/mutex.h | 3 +- kernel/sched/core.c | 164 ++++++++++++++++++++++++++++++++++------- kernel/sched/fair.c | 10 ++- 3 files changed, 148 insertions(+), 29 deletions(-) diff --git a/kernel/locking/mutex.h b/kernel/locking/mutex.h index cbff35b9b7ae3..2e8080a9bee37 100644 --- a/kernel/locking/mutex.h +++ b/kernel/locking/mutex.h @@ -6,7 +6,7 @@ * * Copyright (C) 2004, 2005, 2006 Red Hat, Inc., Ingo Molnar */ - +#ifndef CONFIG_PREEMPT_RT /* * This is the control structure for tasks blocked on mutex, which resides * on the blocked task's kernel stack: @@ -70,3 +70,4 @@ extern void debug_mutex_init(struct mutex *lock, const ch= ar *name, # define debug_mutex_unlock(lock) do { } while (0) # define debug_mutex_init(lock, name, key) do { } while (0) #endif /* !CONFIG_DEBUG_MUTEXES */ +#endif /* CONFIG_PREEMPT_RT */ diff --git a/kernel/sched/core.c b/kernel/sched/core.c index a18523355fb18..dec9fabb7e105 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -96,6 +96,7 @@ #include "../workqueue_internal.h" #include "../../io_uring/io-wq.h" #include "../smpboot.h" +#include "../locking/mutex.h" =20 EXPORT_TRACEPOINT_SYMBOL_GPL(ipi_send_cpu); EXPORT_TRACEPOINT_SYMBOL_GPL(ipi_send_cpumask); @@ -2941,8 +2942,15 @@ static int affine_move_task(struct rq *rq, struct ta= sk_struct *p, struct rq_flag struct set_affinity_pending my_pending =3D { }, *pending =3D NULL; bool stop_pending, complete =3D false; =20 - /* Can the task run on the task's current CPU? If so, we're done */ - if (cpumask_test_cpu(task_cpu(p), &p->cpus_mask)) { + /* + * Can the task run on the task's current CPU? If so, we're done + * + * We are also done if the task is the current donor, boosting a lock- + * holding proxy, (and potentially has been migrated outside its + * current or previous affinity mask) + */ + if (cpumask_test_cpu(task_cpu(p), &p->cpus_mask) || + (task_current_donor(rq, p) && !task_current(rq, p))) { struct task_struct *push_task =3D NULL; =20 if ((flags & SCA_MIGRATE_ENABLE) && @@ -6688,41 +6696,139 @@ static bool proxy_deactivate(struct rq *rq, struct= task_struct *donor) } =20 /* - * Initial simple proxy that just returns the task if it's waking - * or deactivates the blocked task so we can pick something that - * isn't blocked. + * Find runnable lock owner to proxy for mutex blocked donor + * + * Follow the blocked-on relation: + * task->blocked_on -> mutex->owner -> task... + * + * Lock order: + * + * p->pi_lock + * rq->lock + * mutex->wait_lock + * p->blocked_lock + * + * Returns the task that is going to be used as execution context (the one + * that is actually going to be run on cpu_of(rq)). */ static struct task_struct * find_proxy_task(struct rq *rq, struct task_struct *donor, struct rq_flags = *rf) { - struct task_struct *p =3D donor; + struct task_struct *owner =3D NULL; + struct task_struct *ret =3D NULL; + int this_cpu =3D cpu_of(rq); + struct task_struct *p; struct mutex *mutex; =20 - mutex =3D p->blocked_on; - /* Something changed in the chain, so pick again */ - if (!mutex) - return NULL; - /* - * By taking mutex->wait_lock we hold off concurrent mutex_unlock() - * and ensure @owner sticks around. 
- */ - raw_spin_lock(&mutex->wait_lock); - raw_spin_lock(&p->blocked_lock); + /* Follow blocked_on chain. */ + for (p =3D donor; task_is_blocked(p); p =3D owner) { + mutex =3D p->blocked_on; + /* Something changed in the chain, so pick again */ + if (!mutex) + return NULL; + /* + * By taking mutex->wait_lock we hold off concurrent mutex_unlock() + * and ensure @owner sticks around. + */ + raw_spin_lock(&mutex->wait_lock); + raw_spin_lock(&p->blocked_lock); + + /* Check again that p is blocked with blocked_lock held */ + if (mutex !=3D get_task_blocked_on(p)) { + /* + * Something changed in the blocked_on chain and + * we don't know if only at this level. So, let's + * just bail out completely and let __schedule + * figure things out (pick_again loop). + */ + goto out; + } + + owner =3D __mutex_owner(mutex); + if (!owner) { + p->blocked_on_state =3D BO_RUNNABLE; + ret =3D p; + goto out; + } + + if (task_cpu(owner) !=3D this_cpu) { + /* XXX Don't handle migrations yet */ + if (!proxy_deactivate(rq, donor)) + goto deactivate_failed; + goto out; + } + + if (task_on_rq_migrating(owner)) { + /* + * One of the chain of mutex owners is currently migrating to this + * CPU, but has not yet been enqueued because we are holding the + * rq lock. As a simple solution, just schedule rq->idle to give + * the migration a chance to complete. Much like the migrate_task + * case we should end up back in find_proxy_task(), this time + * hopefully with all relevant tasks already enqueued. + */ + raw_spin_unlock(&p->blocked_lock); + raw_spin_unlock(&mutex->wait_lock); + return proxy_resched_idle(rq); + } + + if (!owner->on_rq) { + /* XXX Don't handle blocked owners yet */ + if (!proxy_deactivate(rq, donor)) + goto deactivate_failed; + goto out; + } + + if (owner->se.sched_delayed) { + /* XXX Don't handle delayed dequeue yet */ + if (!proxy_deactivate(rq, donor)) + goto deactivate_failed; + goto out; + } + + if (owner =3D=3D p) { + /* + * It's possible we interleave with mutex_unlock like: + * + * lock(&rq->lock); + * find_proxy_task() + * mutex_unlock() + * lock(&wait_lock); + * donor(owner) =3D current->blocked_donor; + * unlock(&wait_lock); + * + * wake_up_q(); + * ... + * ttwu_runnable() + * __task_rq_lock() + * lock(&wait_lock); + * owner =3D=3D p + * + * Which leaves us to finish the ttwu_runnable() and make it go. + * + * So schedule rq->idle so that ttwu_runnable can get the rq lock + * and mark owner as running. + */ + raw_spin_unlock(&p->blocked_lock); + raw_spin_unlock(&mutex->wait_lock); + return proxy_resched_idle(rq); + } =20 - /* Check again that p is blocked with blocked_lock held */ - if (!task_is_blocked(p) || mutex !=3D get_task_blocked_on(p)) { /* - * Something changed in the blocked_on chain and - * we don't know if only at this level. So, let's - * just bail out completely and let __schedule - * figure things out (pick_again loop). + * OK, now we're absolutely sure @owner is on this + * rq, therefore holding @rq->lock is sufficient to + * guarantee its existence, as per ttwu_remote(). 
*/ - goto out; + raw_spin_unlock(&p->blocked_lock); + raw_spin_unlock(&mutex->wait_lock); } - if (!proxy_deactivate(rq, donor)) - /* XXX: This hack won't work when we get to migrations */ - donor->blocked_on_state =3D BO_RUNNABLE; =20 + WARN_ON_ONCE(owner && !owner->on_rq); + return owner; + +deactivate_failed: + /* XXX: This hack won't work when we get to migrations */ + donor->blocked_on_state =3D BO_RUNNABLE; out: raw_spin_unlock(&p->blocked_lock); raw_spin_unlock(&mutex->wait_lock); @@ -6807,6 +6913,7 @@ static void __sched notrace __schedule(int sched_mode) struct rq_flags rf; struct rq *rq; int cpu; + bool preserve_need_resched =3D false; =20 cpu =3D smp_processor_id(); rq =3D cpu_rq(cpu); @@ -6877,9 +6984,12 @@ static void __sched notrace __schedule(int sched_mod= e) zap_balance_callbacks(rq); goto pick_again; } + if (next =3D=3D rq->idle) + preserve_need_resched =3D true; } picked: - clear_tsk_need_resched(prev); + if (!preserve_need_resched) + clear_tsk_need_resched(prev); clear_preempt_need_resched(); #ifdef CONFIG_SCHED_DEBUG rq->last_seen_need_resched_ns =3D 0; diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index ebde314e151f1..cc126cfcdac06 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -9379,6 +9379,7 @@ int can_migrate_task(struct task_struct *p, struct lb= _env *env) * 2) cannot be migrated to this CPU due to cpus_ptr, or * 3) running (obviously), or * 4) are cache-hot on their current CPU. + * 5) are blocked on mutexes (if SCHED_PROXY_EXEC is enabled) */ if (throttled_lb_pair(task_group(p), env->src_cpu, env->dst_cpu)) return 0; @@ -9387,6 +9388,9 @@ int can_migrate_task(struct task_struct *p, struct lb= _env *env) if (kthread_is_per_cpu(p)) return 0; =20 + if (task_is_blocked(p)) + return 0; + if (!cpumask_test_cpu(env->dst_cpu, p->cpus_ptr)) { int cpu; =20 @@ -9423,7 +9427,8 @@ int can_migrate_task(struct task_struct *p, struct lb= _env *env) /* Record that we found at least one task that could run on dst_cpu */ env->flags &=3D ~LBF_ALL_PINNED; =20 - if (task_on_cpu(env->src_rq, p)) { + if (task_on_cpu(env->src_rq, p) || + task_current_donor(env->src_rq, p)) { schedstat_inc(p->stats.nr_failed_migrations_running); return 0; } @@ -9462,6 +9467,9 @@ static void detach_task(struct task_struct *p, struct= lb_env *env) { lockdep_assert_rq_held(env->src_rq); =20 + WARN_ON(task_current(env->src_rq, p)); + WARN_ON(task_current_donor(env->src_rq, p)); + deactivate_task(env->src_rq, p, DEQUEUE_NOCLOCK); set_task_cpu(p, env->dst_cpu); } --=20 2.47.0.371.ga323438b13-goog