From nobody Sun Nov 24 09:01:04 2024
Date: Tue, 5 Nov 2024 18:56:41 -0800
In-Reply-To: <20241106025656.2326794-1-jstultz@google.com>
Message-ID: <20241106025656.2326794-2-jstultz@google.com>
Subject: [RFC][PATCH v13 1/7] sched: Add CONFIG_SCHED_PROXY_EXEC & boot argument to enable/disable
From: John Stultz
To: LKML

Add a CONFIG_SCHED_PROXY_EXEC option, along with a boot argument
sched_proxy_exec= that can be used to disable the feature at boot
time if CONFIG_SCHED_PROXY_EXEC was enabled.

Cc: Joel Fernandes
Cc: Qais Yousef
Cc: Ingo Molnar
Cc: Peter Zijlstra
Cc: Juri Lelli
Cc: Vincent Guittot
Cc: Dietmar Eggemann
Cc: Valentin Schneider
Cc: Steven Rostedt
Cc: Ben Segall
Cc: Zimuzo Ezeozue
Cc: Mel Gorman
Cc: Will Deacon
Cc: Waiman Long
Cc: Boqun Feng
Cc: "Paul E. McKenney"
Cc: Metin Kaya
Cc: Xuewen Yan
Cc: K Prateek Nayak
Cc: Thomas Gleixner
Cc: Daniel Lezcano
Cc: kernel-team@android.com
Tested-by: K Prateek Nayak
Signed-off-by: John Stultz
---
v7:
* Switch to CONFIG_SCHED_PROXY_EXEC/sched_proxy_exec= as suggested by
  Metin Kaya.
* Switch boot arg from =disable/enable to use kstrtobool(), which
  supports =yes|no|1|0|true|false|on|off, as also suggested by Metin
  Kaya, and print a message when a boot argument is used.
v8:
* Move CONFIG_SCHED_PROXY_EXEC under Scheduler Features, as suggested
  by Metin.
* Minor rework reordering with the split sched contexts patch.
v12:
* Rework for selected -> donor renaming.
---
 .../admin-guide/kernel-parameters.txt |  5 ++++
 include/linux/sched.h                 | 13 +++++++++
 init/Kconfig                          |  7 +++++
 kernel/sched/core.c                   | 29 +++++++++++++++++++
 kernel/sched/sched.h                  | 12 ++++++++
 5 files changed, 66 insertions(+)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 1518343bbe22..5152935b54b6 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -5944,6 +5944,11 @@
 	sa1100ir	[NET]
 			See drivers/net/irda/sa1100_ir.c.
 
+	sched_proxy_exec=	[KNL]
+			Enables or disables "proxy execution" style
+			solution to mutex-based priority inversion.
+			Format:
+
 	sched_verbose	[KNL,EARLY] Enables verbose scheduler debug messages.
 
 	schedstats=	[KNL,X86] Enable or disable scheduled statistics.
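[Editorial aside, not part of the patch: the setup_proxy_exec() handler added in
the kernel/sched/core.c hunk further below parses its argument with kstrtobool().
The standalone userspace sketch here only approximates the accepted spellings
listed in the changelog (yes|no|1|0|true|false|on|off); parse_bool() and the
sample strings are illustrative assumptions, not kernel code. The patch's diff
continues after the sketch.]

/* Userspace approximation of kstrtobool()-style boolean boot-arg parsing. */
#include <stdbool.h>
#include <stdio.h>

static int parse_bool(const char *s, bool *res)
{
	if (!s || !s[0])
		return -1;
	switch (s[0]) {
	case '1': case 'y': case 'Y': case 't': case 'T':
		*res = true;
		return 0;
	case '0': case 'n': case 'N': case 'f': case 'F':
		*res = false;
		return 0;
	case 'o': case 'O':
		if (s[1] == 'n' || s[1] == 'N') {
			*res = true;
			return 0;
		}
		if (s[1] == 'f' || s[1] == 'F') {
			*res = false;
			return 0;
		}
		return -1;
	}
	return -1;
}

int main(void)
{
	const char *args[] = { "1", "off", "true", "bogus" };
	bool enable;

	for (int i = 0; i < 4; i++) {
		if (parse_bool(args[i], &enable))
			printf("sched_proxy_exec=%s: unparseable, ignored\n", args[i]);
		else
			printf("sched_proxy_exec=%s -> %s\n", args[i],
			       enable ? "enabled" : "disabled");
	}
	return 0;
}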
diff --git a/include/linux/sched.h b/include/linux/sched.h index a76e3d074a2a..2a47228a4808 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -1609,6 +1609,19 @@ struct task_struct { */ }; =20 +#ifdef CONFIG_SCHED_PROXY_EXEC +DECLARE_STATIC_KEY_TRUE(__sched_proxy_exec); +static inline bool sched_proxy_exec(void) +{ + return static_branch_likely(&__sched_proxy_exec); +} +#else +static inline bool sched_proxy_exec(void) +{ + return false; +} +#endif + #define TASK_REPORT_IDLE (TASK_REPORT + 1) #define TASK_REPORT_MAX (TASK_REPORT_IDLE << 1) =20 diff --git a/init/Kconfig b/init/Kconfig index c521e1421ad4..a91c51850731 100644 --- a/init/Kconfig +++ b/init/Kconfig @@ -863,6 +863,13 @@ config UCLAMP_BUCKETS_COUNT =20 If in doubt, use the default value. =20 +config SCHED_PROXY_EXEC + bool "Proxy Execution" + default n + help + This option enables proxy execution, a mechanism for mutex-owning + tasks to inherit the scheduling context of higher priority waiters. + endmenu =20 # diff --git a/kernel/sched/core.c b/kernel/sched/core.c index c57a79e34911..731ebd8614a9 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -119,6 +119,35 @@ EXPORT_TRACEPOINT_SYMBOL_GPL(sched_compute_energy_tp); =20 DEFINE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues); =20 +#ifdef CONFIG_SCHED_PROXY_EXEC +DEFINE_STATIC_KEY_TRUE(__sched_proxy_exec); +static int __init setup_proxy_exec(char *str) +{ + bool proxy_enable; + + if (kstrtobool(str, &proxy_enable)) { + pr_warn("Unable to parse sched_proxy_exec=3D\n"); + return 0; + } + + if (proxy_enable) { + pr_info("sched_proxy_exec enabled via boot arg\n"); + static_branch_enable(&__sched_proxy_exec); + } else { + pr_info("sched_proxy_exec disabled via boot arg\n"); + static_branch_disable(&__sched_proxy_exec); + } + return 1; +} +#else +static int __init setup_proxy_exec(char *str) +{ + pr_warn("CONFIG_SCHED_PROXY_EXEC=3Dn, so it cannot be enabled or disabled= at boot time\n"); + return 0; +} +#endif +__setup("sched_proxy_exec=3D", setup_proxy_exec); + #ifdef CONFIG_SCHED_DEBUG /* * Debugging: various feature bits diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index e51bf5a344d3..258db6ef8c70 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -1147,10 +1147,15 @@ struct rq { */ unsigned int nr_uninterruptible; =20 +#ifdef CONFIG_SCHED_PROXY_EXEC + struct task_struct __rcu *donor; /* Scheduling context */ + struct task_struct __rcu *curr; /* Execution context */ +#else union { struct task_struct __rcu *donor; /* Scheduler context */ struct task_struct __rcu *curr; /* Execution context */ }; +#endif struct sched_dl_entity *dl_server; struct task_struct *idle; struct task_struct *stop; @@ -1347,10 +1352,17 @@ DECLARE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues= ); #define cpu_curr(cpu) (cpu_rq(cpu)->curr) #define raw_rq() raw_cpu_ptr(&runqueues) =20 +#ifdef CONFIG_SCHED_PROXY_EXEC +static inline void rq_set_donor(struct rq *rq, struct task_struct *t) +{ + rcu_assign_pointer(rq->donor, t); +} +#else static inline void rq_set_donor(struct rq *rq, struct task_struct *t) { /* Do nothing */ } +#endif =20 #ifdef CONFIG_SCHED_CORE static inline struct cpumask *sched_group_span(struct sched_group *sg); --=20 2.47.0.199.ga7371fff76-goog From nobody Sun Nov 24 09:01:04 2024 Received: from mail-yw1-f202.google.com (mail-yw1-f202.google.com [209.85.128.202]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 4837A190054 for ; Wed, 6 Nov 
Date: Tue, 5 Nov 2024 18:56:42 -0800
In-Reply-To: <20241106025656.2326794-1-jstultz@google.com>
Message-ID: <20241106025656.2326794-3-jstultz@google.com>
Subject: [RFC][PATCH v13 2/7] locking/mutex: Rework task_struct::blocked_on
From: John Stultz
To: LKML

From: Peter Zijlstra

Track the blocked-on relation for mutexes, to allow following this
relation at schedule time.

    task
      | blocked-on
      v
    mutex
      | owner
      v
    task

Also add a blocked_on_state value so we can distinguish when a task is
blocked_on a mutex, but is either blocked, waking up, or runnable (such
that it can try to acquire the lock it is blocked on). This avoids some
of the subtle & racy games where the blocked_on state gets cleared,
only to have it re-added by the mutex_lock_slowpath call when it tries
to acquire the lock on wakeup.

Also add blocked_lock to the task_struct so we can safely serialize the
blocked-on state.

Finally add wrappers that are useful to provide correctness checks.
Folded in from a patch by: Valentin Schneider

This all will be used for tracking blocked-task/mutex chains with the
proxy-execution patch in a similar fashion to how priority inheritance
is done with rt_mutexes.

Cc: Joel Fernandes
Cc: Qais Yousef
Cc: Ingo Molnar
Cc: Peter Zijlstra
Cc: Juri Lelli
Cc: Vincent Guittot
Cc: Dietmar Eggemann
Cc: Valentin Schneider
Cc: Steven Rostedt
Cc: Ben Segall
Cc: Zimuzo Ezeozue
Cc: Mel Gorman
Cc: Will Deacon
Cc: Waiman Long
Cc: Boqun Feng
Cc: "Paul E. McKenney"
Cc: Metin Kaya
Cc: Xuewen Yan
Cc: K Prateek Nayak
Cc: Thomas Gleixner
Cc: Daniel Lezcano
Cc: kernel-team@android.com
Signed-off-by: Peter Zijlstra (Intel)
[minor changes while rebasing]
Signed-off-by: Juri Lelli
Signed-off-by: Peter Zijlstra (Intel)
Signed-off-by: Connor O'Brien
[jstultz: Fix blocked_on tracking in __mutex_lock_common in error paths]
Signed-off-by: John Stultz
---
v2:
* Fixed blocked_on tracking in error paths that was causing crashes
v4:
* Ensure we clear blocked_on when waking ww_mutexes to die or wound.
  This is critical so we don't get circular blocked_on relationships
  that can't be resolved.
v5:
* Fix potential bug where the skip_wait path might clear blocked_on
  when that path never set it
* Slight tweaks to where we set blocked_on to make it consistent,
  along with extra WARN_ON correctness checking
* Minor comment changes
v7:
* Minor commit message change suggested by Metin Kaya
* Fix WARN_ON conditionals in unlock path (as blocked_on might already
  be cleared), found while looking at issue Metin Kaya raised.
* Minor tweaks to be consistent in what we do under the blocked_on lock, also tweaked variable name to avoid confusion with label, and comment typos, as suggested by Metin Kaya * Minor tweak for CONFIG_SCHED_PROXY_EXEC name change * Moved unused block of code to later in the series, as suggested by Metin Kaya * Switch to a tri-state to be able to distinguish from waking and runnable so we can later safely do return migration from ttwu * Folded together with related blocked_on changes v8: * Fix issue leaving task BO_BLOCKED when calling into optimistic spinning path. * Include helper to better handle BO_BLOCKED->BO_WAKING transitions v9: * Typo fixup pointed out by Metin * Cleanup BO_WAKING->BO_RUNNABLE transitions for the !proxy case * Many cleanups and simplifications suggested by Metin v11: * Whitespace fixup pointed out by Metin v13: * Refactor set_blocked_on helpers clean things up a bit --- include/linux/sched.h | 66 ++++++++++++++++++++++++++++++++---- init/init_task.c | 1 + kernel/fork.c | 4 +-- kernel/locking/mutex-debug.c | 9 ++--- kernel/locking/mutex.c | 40 ++++++++++++++++++---- kernel/locking/ww_mutex.h | 24 +++++++++++-- kernel/sched/core.c | 1 + 7 files changed, 125 insertions(+), 20 deletions(-) diff --git a/include/linux/sched.h b/include/linux/sched.h index 2a47228a4808..abe23791de30 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -34,6 +34,7 @@ #include #include #include +#include #include #include #include @@ -775,6 +776,12 @@ struct kmap_ctrl { #endif }; =20 +enum blocked_on_state { + BO_RUNNABLE, + BO_BLOCKED, + BO_WAKING, +}; + struct task_struct { #ifdef CONFIG_THREAD_INFO_IN_TASK /* @@ -1195,10 +1202,9 @@ struct task_struct { struct rt_mutex_waiter *pi_blocked_on; #endif =20 -#ifdef CONFIG_DEBUG_MUTEXES - /* Mutex deadlock detection: */ - struct mutex_waiter *blocked_on; -#endif + enum blocked_on_state blocked_on_state; + struct mutex *blocked_on; /* lock we're blocked on */ + raw_spinlock_t blocked_lock; =20 #ifdef CONFIG_DEBUG_ATOMIC_SLEEP int non_block_count; @@ -2116,6 +2122,56 @@ extern int __cond_resched_rwlock_write(rwlock_t *loc= k); __cond_resched_rwlock_write(lock); \ }) =20 +static inline void __set_blocked_on_runnable(struct task_struct *p) +{ + lockdep_assert_held(&p->blocked_lock); + + if (p->blocked_on_state =3D=3D BO_WAKING) + p->blocked_on_state =3D BO_RUNNABLE; +} + +static inline void set_blocked_on_runnable(struct task_struct *p) +{ + unsigned long flags; + + if (!sched_proxy_exec()) + return; + + raw_spin_lock_irqsave(&p->blocked_lock, flags); + __set_blocked_on_runnable(p); + raw_spin_unlock_irqrestore(&p->blocked_lock, flags); +} + +static inline void set_blocked_on_waking(struct task_struct *p) +{ + lockdep_assert_held(&p->blocked_lock); + + if (p->blocked_on_state =3D=3D BO_BLOCKED) + p->blocked_on_state =3D BO_WAKING; +} + +static inline void set_task_blocked_on(struct task_struct *p, struct mutex= *m) +{ + lockdep_assert_held(&p->blocked_lock); + + /* + * Check we are clearing values to NULL or setting NULL + * to values to ensure we don't overwrite existing mutex + * values or clear already cleared values + */ + WARN_ON((!m && !p->blocked_on) || (m && p->blocked_on)); + + p->blocked_on =3D m; + p->blocked_on_state =3D m ? 
BO_BLOCKED : BO_RUNNABLE; +} + +static inline struct mutex *get_task_blocked_on(struct task_struct *p) +{ + lockdep_assert_held(&p->blocked_lock); + + return p->blocked_on; +} + static __always_inline bool need_resched(void) { return unlikely(tif_need_resched()); @@ -2155,8 +2211,6 @@ extern bool sched_task_on_rq(struct task_struct *p); extern unsigned long get_wchan(struct task_struct *p); extern struct task_struct *cpu_curr_snapshot(int cpu); =20 -#include - /* * In order to reduce various lock holder preemption latencies provide an * interface to see if a vCPU is currently running or not. diff --git a/init/init_task.c b/init/init_task.c index 136a8231355a..c2da9cce36bd 100644 --- a/init/init_task.c +++ b/init/init_task.c @@ -139,6 +139,7 @@ struct task_struct init_task __aligned(L1_CACHE_BYTES) = =3D { .journal_info =3D NULL, INIT_CPU_TIMERS(init_task) .pi_lock =3D __RAW_SPIN_LOCK_UNLOCKED(init_task.pi_lock), + .blocked_lock =3D __RAW_SPIN_LOCK_UNLOCKED(init_task.blocked_lock), .timer_slack_ns =3D 50000, /* 50 usec default slack */ .thread_pid =3D &init_struct_pid, .thread_node =3D LIST_HEAD_INIT(init_signals.thread_head), diff --git a/kernel/fork.c b/kernel/fork.c index 7d950e93f080..99ebc8269140 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -2229,6 +2229,7 @@ __latent_entropy struct task_struct *copy_process( ftrace_graph_init_task(p); =20 rt_mutex_init_task(p); + raw_spin_lock_init(&p->blocked_lock); =20 lockdep_assert_irqs_enabled(); #ifdef CONFIG_PROVE_LOCKING @@ -2326,9 +2327,8 @@ __latent_entropy struct task_struct *copy_process( lockdep_init_task(p); #endif =20 -#ifdef CONFIG_DEBUG_MUTEXES + p->blocked_on_state =3D BO_RUNNABLE; p->blocked_on =3D NULL; /* not blocked yet */ -#endif #ifdef CONFIG_BCACHE p->sequential_io =3D 0; p->sequential_io_avg =3D 0; diff --git a/kernel/locking/mutex-debug.c b/kernel/locking/mutex-debug.c index 6e6f6071cfa2..1d8cff71f65e 100644 --- a/kernel/locking/mutex-debug.c +++ b/kernel/locking/mutex-debug.c @@ -53,17 +53,18 @@ void debug_mutex_add_waiter(struct mutex *lock, struct = mutex_waiter *waiter, { lockdep_assert_held(&lock->wait_lock); =20 - /* Mark the current thread as blocked on the lock: */ - task->blocked_on =3D waiter; + /* Current thread can't be already blocked (since it's executing!) */ + DEBUG_LOCKS_WARN_ON(get_task_blocked_on(task)); } =20 void debug_mutex_remove_waiter(struct mutex *lock, struct mutex_waiter *wa= iter, struct task_struct *task) { + struct mutex *blocked_on =3D get_task_blocked_on(task); + DEBUG_LOCKS_WARN_ON(list_empty(&waiter->list)); DEBUG_LOCKS_WARN_ON(waiter->task !=3D task); - DEBUG_LOCKS_WARN_ON(task->blocked_on !=3D waiter); - task->blocked_on =3D NULL; + DEBUG_LOCKS_WARN_ON(blocked_on && blocked_on !=3D lock); =20 INIT_LIST_HEAD(&waiter->list); waiter->task =3D NULL; diff --git a/kernel/locking/mutex.c b/kernel/locking/mutex.c index 3302e52f0c96..8f5d3fe6c102 100644 --- a/kernel/locking/mutex.c +++ b/kernel/locking/mutex.c @@ -597,6 +597,7 @@ __mutex_lock_common(struct mutex *lock, unsigned int st= ate, unsigned int subclas } =20 raw_spin_lock_irqsave(&lock->wait_lock, flags); + raw_spin_lock(¤t->blocked_lock); /* * After waiting to acquire the wait_lock, try again. 
*/ @@ -627,6 +628,7 @@ __mutex_lock_common(struct mutex *lock, unsigned int st= ate, unsigned int subclas goto err_early_kill; } =20 + set_task_blocked_on(current, lock); set_current_state(state); trace_contention_begin(lock, LCB_F_MUTEX); for (;;) { @@ -639,7 +641,7 @@ __mutex_lock_common(struct mutex *lock, unsigned int st= ate, unsigned int subclas * the handoff. */ if (__mutex_trylock(lock)) - goto acquired; + break; /* acquired */; =20 /* * Check for signals and kill conditions while holding @@ -657,6 +659,7 @@ __mutex_lock_common(struct mutex *lock, unsigned int st= ate, unsigned int subclas goto err; } =20 + raw_spin_unlock(¤t->blocked_lock); raw_spin_unlock_irqrestore(&lock->wait_lock, flags); /* Make sure we do wakeups before calling schedule */ wake_up_q(&wake_q); @@ -666,6 +669,13 @@ __mutex_lock_common(struct mutex *lock, unsigned int s= tate, unsigned int subclas =20 first =3D __mutex_waiter_is_first(lock, &waiter); =20 + raw_spin_lock_irqsave(&lock->wait_lock, flags); + raw_spin_lock(¤t->blocked_lock); + + /* + * Re-set blocked_on_state as unlock path set it to WAKING/RUNNABLE + */ + current->blocked_on_state =3D BO_BLOCKED; set_current_state(state); /* * Here we order against unlock; we must either see it change @@ -676,16 +686,26 @@ __mutex_lock_common(struct mutex *lock, unsigned int = state, unsigned int subclas break; =20 if (first) { + bool opt_acquired; + + /* + * mutex_optimistic_spin() can schedule, so we need to + * release these locks before calling it. + */ + current->blocked_on_state =3D BO_RUNNABLE; + raw_spin_unlock(¤t->blocked_lock); + raw_spin_unlock_irqrestore(&lock->wait_lock, flags); trace_contention_begin(lock, LCB_F_MUTEX | LCB_F_SPIN); - if (mutex_optimistic_spin(lock, ww_ctx, &waiter)) + opt_acquired =3D mutex_optimistic_spin(lock, ww_ctx, &waiter); + raw_spin_lock_irqsave(&lock->wait_lock, flags); + raw_spin_lock(¤t->blocked_lock); + current->blocked_on_state =3D BO_BLOCKED; + if (opt_acquired) break; trace_contention_begin(lock, LCB_F_MUTEX); } - - raw_spin_lock_irqsave(&lock->wait_lock, flags); } - raw_spin_lock_irqsave(&lock->wait_lock, flags); -acquired: + set_task_blocked_on(current, NULL); __set_current_state(TASK_RUNNING); =20 if (ww_ctx) { @@ -710,16 +730,20 @@ __mutex_lock_common(struct mutex *lock, unsigned int = state, unsigned int subclas if (ww_ctx) ww_mutex_lock_acquired(ww, ww_ctx); =20 + raw_spin_unlock(¤t->blocked_lock); raw_spin_unlock_irqrestore(&lock->wait_lock, flags); wake_up_q(&wake_q); preempt_enable(); return 0; =20 err: + set_task_blocked_on(current, NULL); __set_current_state(TASK_RUNNING); __mutex_remove_waiter(lock, &waiter); err_early_kill: + WARN_ON(get_task_blocked_on(current)); trace_contention_end(lock, ret); + raw_spin_unlock(¤t->blocked_lock); raw_spin_unlock_irqrestore(&lock->wait_lock, flags); debug_mutex_free_waiter(&waiter); mutex_release(&lock->dep_map, ip); @@ -928,8 +952,12 @@ static noinline void __sched __mutex_unlock_slowpath(s= truct mutex *lock, unsigne =20 next =3D waiter->task; =20 + raw_spin_lock(&next->blocked_lock); debug_mutex_wake_waiter(lock, waiter); + WARN_ON(get_task_blocked_on(next) !=3D lock); + set_blocked_on_waking(next); wake_q_add(&wake_q, next); + raw_spin_unlock(&next->blocked_lock); } =20 if (owner & MUTEX_FLAG_HANDOFF) diff --git a/kernel/locking/ww_mutex.h b/kernel/locking/ww_mutex.h index 37f025a096c9..d1bec62c6ef0 100644 --- a/kernel/locking/ww_mutex.h +++ b/kernel/locking/ww_mutex.h @@ -281,10 +281,21 @@ __ww_mutex_die(struct MUTEX *lock, struct MUTEX_WAITE= R *waiter, return 
false; =20 if (waiter->ww_ctx->acquired > 0 && __ww_ctx_less(waiter->ww_ctx, ww_ctx)= ) { + /* nested as we should hold current->blocked_lock already */ + raw_spin_lock_nested(&waiter->task->blocked_lock, SINGLE_DEPTH_NESTING); #ifndef WW_RT debug_mutex_wake_waiter(lock, waiter); #endif + /* + * When waking up the task to die, be sure to set the + * blocked_on_state to WAKING. Otherwise we can see + * circular blocked_on relationships that can't + * resolve. + */ + WARN_ON(get_task_blocked_on(waiter->task) !=3D lock); + set_blocked_on_waking(waiter->task); wake_q_add(wake_q, waiter->task); + raw_spin_unlock(&waiter->task->blocked_lock); } =20 return true; @@ -331,9 +342,18 @@ static bool __ww_mutex_wound(struct MUTEX *lock, * it's wounded in __ww_mutex_check_kill() or has a * wakeup pending to re-read the wounded state. */ - if (owner !=3D current) + if (owner !=3D current) { + /* nested as we should hold current->blocked_lock already */ + raw_spin_lock_nested(&owner->blocked_lock, SINGLE_DEPTH_NESTING); + /* + * When waking up the task to wound, be sure to set the + * blocked_on_state flag. Otherwise we can see circular + * blocked_on relationships that can't resolve. + */ + set_blocked_on_waking(owner); wake_q_add(wake_q, owner); - + raw_spin_unlock(&owner->blocked_lock); + } return true; } =20 diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 731ebd8614a9..f040feed9df3 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -4309,6 +4309,7 @@ int try_to_wake_up(struct task_struct *p, unsigned in= t state, int wake_flags) ttwu_queue(p, cpu, wake_flags); } out: + set_blocked_on_runnable(p); if (success) ttwu_stat(p, task_cpu(p), wake_flags); =20 --=20 2.47.0.199.ga7371fff76-goog From nobody Sun Nov 24 09:01:04 2024 Received: from mail-yw1-f202.google.com (mail-yw1-f202.google.com [209.85.128.202]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 3294A190052 for ; Wed, 6 Nov 2024 02:57:08 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=209.85.128.202 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1730861829; cv=none; b=W3TaZl2D5F2J0AivM/lwkfAYgyaWC0tmAR32SMyQnLbU82c3k7Lgu1OpPoHrloqPwHFaqTun+XDRVVkNjNMTOITeHCaF3FBT4vYBq+b1AXw4tdQFN1SHNAZGrf9wrz07RcroRwLyME9H6YIIat5427jvYB1NXZF6euRAWKQpn4E= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1730861829; c=relaxed/simple; bh=epXx3nKHnvZo7ttPmlz+gQBJAvkWxrkiXzHmixCGH4E=; h=Date:In-Reply-To:Mime-Version:References:Message-ID:Subject:From: To:Cc:Content-Type; b=aW4a01DoFad3oszDmO6UaWcl/SHv6hjAv/fGx+qIgfOAm3RxxDUrBPl39YdMa3j0DyBq4vRxIMUQIN2HTVZKI4JVmJmJKyQFKFqwiV6VV6GLphWl0Y7Q33YeZAFtY/4YUaJCZi+hhOzt+3Fm4GzCvZVdmBcSM3ihCyu4XMUrti8= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=reject dis=none) header.from=google.com; spf=pass smtp.mailfrom=flex--jstultz.bounces.google.com; dkim=pass (2048-bit key) header.d=google.com header.i=@google.com header.b=cZjjSYC+; arc=none smtp.client-ip=209.85.128.202 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=reject dis=none) header.from=google.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=flex--jstultz.bounces.google.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=google.com header.i=@google.com header.b="cZjjSYC+" Received: by mail-yw1-f202.google.com with SMTP id 
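[Editorial aside on the previous patch (2/7), before the next message begins:
the blocked_on_state tri-state is easiest to read as a small state machine.
The userspace sketch below models the BO_BLOCKED -> BO_WAKING -> BO_RUNNABLE
transitions guarded by set_blocked_on_waking() and set_blocked_on_runnable();
it drops the blocked_lock serialization, and everything other than the BO_*
state names is an illustrative assumption, not kernel code.]

#include <stdio.h>

enum bo_state { BO_RUNNABLE, BO_BLOCKED, BO_WAKING };

static const char *bo_name(enum bo_state s)
{
	static const char *names[] = { "BO_RUNNABLE", "BO_BLOCKED", "BO_WAKING" };
	return names[s];
}

/* mutex unlock / ww_mutex die-or-wound path: a blocked waiter is woken */
static void set_blocked_on_waking(enum bo_state *s)
{
	if (*s == BO_BLOCKED)
		*s = BO_WAKING;
}

/* try_to_wake_up() tail: a waking task may now retry the lock */
static void set_blocked_on_runnable(enum bo_state *s)
{
	if (*s == BO_WAKING)
		*s = BO_RUNNABLE;
}

int main(void)
{
	enum bo_state s = BO_BLOCKED;	/* set when queued on the mutex wait list */

	set_blocked_on_runnable(&s);	/* no effect: the waiter was never woken */
	printf("after spurious wakeup attempt: %s\n", bo_name(s));

	set_blocked_on_waking(&s);	/* unlock path wakes the waiter */
	set_blocked_on_runnable(&s);	/* ttwu completes the transition */
	printf("after unlock + wakeup:        %s\n", bo_name(s));
	return 0;
}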
Date: Tue, 5 Nov 2024 18:56:43 -0800
In-Reply-To: <20241106025656.2326794-1-jstultz@google.com>
Message-ID: <20241106025656.2326794-4-jstultz@google.com>
Subject: [RFC][PATCH v13 3/7] sched: Fix runtime accounting w/ split exec & sched contexts
From: John Stultz
To: LKML

The idea here is that we want to charge the scheduler-context task's
vruntime but charge the execution-context task's sum_exec_runtime.
This way cputime accounting goes against the task actually running,
while vruntime accounting goes against the rq->donor task, so we get
proper fairness.
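[Editorial aside, not part of the patch: a toy userspace model of the
accounting split described above, where the execution context is charged
sum_exec_runtime and the scheduler context (the donor) is charged vruntime.
The struct and function names are illustrative assumptions; the real change
is the update_curr_se() hunk in kernel/sched/fair.c below.]

#include <stdint.h>
#include <stdio.h>

struct toy_task {
	const char *name;
	uint64_t sum_exec_runtime;	/* cputime: who actually ran */
	uint64_t vruntime;		/* fairness: whose slice was spent */
};

static void toy_update_curr(struct toy_task *curr, struct toy_task *donor,
			    uint64_t delta_exec)
{
	/* Execution context pays in cputime... */
	curr->sum_exec_runtime += delta_exec;
	/* ...while the scheduler context pays in vruntime (weight 1 here). */
	donor->vruntime += delta_exec;
}

int main(void)
{
	struct toy_task owner  = { "lock owner (proxy)",     0, 0 };
	struct toy_task waiter = { "blocked waiter (donor)", 0, 0 };

	toy_update_curr(&owner, &waiter, 1000);	/* 1000ns tick while proxying */

	printf("%s: sum_exec_runtime=%llu\n", owner.name,
	       (unsigned long long)owner.sum_exec_runtime);
	printf("%s: vruntime=%llu\n", waiter.name,
	       (unsigned long long)waiter.vruntime);
	return 0;
}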
Cc: Joel Fernandes Cc: Qais Yousef Cc: Ingo Molnar Cc: Peter Zijlstra Cc: Juri Lelli Cc: Vincent Guittot Cc: Dietmar Eggemann Cc: Valentin Schneider Cc: Steven Rostedt Cc: Ben Segall Cc: Zimuzo Ezeozue Cc: Mel Gorman Cc: Will Deacon Cc: Waiman Long Cc: Boqun Feng Cc: "Paul E. McKenney" Cc: Metin Kaya Cc: Xuewen Yan Cc: K Prateek Nayak Cc: Thomas Gleixner Cc: Daniel Lezcano Cc: kernel-team@android.com Signed-off-by: John Stultz --- kernel/sched/fair.c | 21 ++++++++++++++++----- 1 file changed, 16 insertions(+), 5 deletions(-) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 6512258dc71f..42043310adfe 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -1131,22 +1131,33 @@ static void update_tg_load_avg(struct cfs_rq *cfs_r= q) } #endif /* CONFIG_SMP */ =20 -static s64 update_curr_se(struct rq *rq, struct sched_entity *curr) +static s64 update_curr_se(struct rq *rq, struct sched_entity *se) { u64 now =3D rq_clock_task(rq); s64 delta_exec; =20 - delta_exec =3D now - curr->exec_start; + delta_exec =3D now - se->exec_start; if (unlikely(delta_exec <=3D 0)) return delta_exec; =20 - curr->exec_start =3D now; - curr->sum_exec_runtime +=3D delta_exec; + se->exec_start =3D now; + if (entity_is_task(se)) { + struct task_struct *running =3D rq->curr; + /* + * If se is a task, we account the time against the running + * task, as w/ proxy-exec they may not be the same. + */ + running->se.exec_start =3D now; + running->se.sum_exec_runtime +=3D delta_exec; + } else { + /* If not task, account the time against se */ + se->sum_exec_runtime +=3D delta_exec; + } =20 if (schedstat_enabled()) { struct sched_statistics *stats; =20 - stats =3D __schedstats_from_se(curr); + stats =3D __schedstats_from_se(se); __schedstat_set(stats->exec_max, max(delta_exec, stats->exec_max)); } --=20 2.47.0.199.ga7371fff76-goog From nobody Sun Nov 24 09:01:04 2024 Received: from mail-pg1-f202.google.com (mail-pg1-f202.google.com [209.85.215.202]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 9DEC7190676 for ; Wed, 6 Nov 2024 02:57:10 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=209.85.215.202 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1730861832; cv=none; b=PuJgf0XA9CP03YYLG7OB16WxlXIssqbgKB6R4AnWdYNHDyC+VOX3LqNhsKQ9t3X62ZwJCEgd14px8icd9OQbN9Og4dMZLw3y7HZq7GCUyNNRpUsh/Skgd6BzisT3cvUK0qPf5ALXp/5ZuQmEPyiGy5biJr25kGmIoZSABuTNUzM= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1730861832; c=relaxed/simple; bh=zu4AqCl3xwWVEQVzRYde3/jWSmMCwe3WXBKhHvYFIHg=; h=Date:In-Reply-To:Mime-Version:References:Message-ID:Subject:From: To:Cc:Content-Type; b=d9A5flPlVZd7iajD7afp/9yOQ1QEXOA2JIY89lFVnM/Sh7jekHFNh7ispTKPpBFJn7gewpM2qEVfo7QvGh2GTH2vEaV7iroUpxHErQL16VxKb0apMsp27N5pOWfwcftEZjPtkHsP2TOeUA4kgYZnrHbfHAjLfUsodZKqtSGX7Oo= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=reject dis=none) header.from=google.com; spf=pass smtp.mailfrom=flex--jstultz.bounces.google.com; dkim=pass (2048-bit key) header.d=google.com header.i=@google.com header.b=H2iIpNxc; arc=none smtp.client-ip=209.85.215.202 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=reject dis=none) header.from=google.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=flex--jstultz.bounces.google.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=google.com 
Date: Tue, 5 Nov 2024 18:56:44 -0800
In-Reply-To: <20241106025656.2326794-1-jstultz@google.com>
Message-ID: <20241106025656.2326794-5-jstultz@google.com>
Subject: [RFC][PATCH v13 4/7] sched: Fix psi_dequeue for Proxy Execution
From: John Stultz
To: LKML

Currently, if the sleep flag is set, psi_dequeue() doesn't change any
of the psi_flags. This is because psi_task_switch() will clear
TSK_ONCPU as well as other potential flags (TSK_RUNNING), and the
assumption is that a voluntary sleep always consists of a task being
dequeued, followed shortly thereafter by a psi_sched_switch() call.

Proxy Execution changes this expectation, as mutex-blocked tasks that
would normally sleep stay on the runqueue.
In the case where the mutex-owning task goes to sleep, we will then
deactivate the blocked task as well. However, in that case the
mutex-blocked task will have had its TSK_ONCPU cleared when it was
switched off the cpu, but it will stay TSK_RUNNING. Then when we later
dequeue it because of a sleeping owner, psi_dequeue() won't change any
state (leaving it TSK_RUNNING), since it is a sleeping dequeue and
psi_dequeue() incorrectly expects a psi_task_switch() call to
immediately follow. Later on, when it gets re-enqueued and psi_flags
are set for TSK_RUNNING, we hit an error as the task is already
TSK_RUNNING:

  psi: inconsistent task state!

To resolve this, extend the logic in psi_dequeue() so that if the
sleep flag is set, we also check if psi_flags have TSK_ONCPU set
(meaning the psi_task_switch() is imminent) before we do the shortcut
return. If TSK_ONCPU is not set, that means we've already switched
away, and this psi_dequeue() call needs to clear the flags.

Cc: Joel Fernandes
Cc: Qais Yousef
Cc: Ingo Molnar
Cc: Peter Zijlstra
Cc: Juri Lelli
Cc: Vincent Guittot
Cc: Dietmar Eggemann
Cc: Valentin Schneider
Cc: Steven Rostedt
Cc: Ben Segall
Cc: Zimuzo Ezeozue
Cc: Mel Gorman
Cc: Will Deacon
Cc: Waiman Long
Cc: Boqun Feng
Cc: "Paul E. McKenney"
Cc: Metin Kaya
Cc: Xuewen Yan
Cc: K Prateek Nayak
Cc: Thomas Gleixner
Cc: Daniel Lezcano
Cc: kernel-team@android.com
Signed-off-by: John Stultz
---
v13:
* Reworked for collision
---
 kernel/sched/stats.h | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/stats.h b/kernel/sched/stats.h
index 8ee0add5a48a..c313fe76a772 100644
--- a/kernel/sched/stats.h
+++ b/kernel/sched/stats.h
@@ -176,8 +176,12 @@ static inline void psi_dequeue(struct task_struct *p, int flags)
 	 * avoid walking all ancestors twice, psi_task_switch() handles
 	 * TSK_RUNNING and TSK_IOWAIT for us when it moves TSK_ONCPU.
 	 * Do nothing here.
+	 * In the SCHED_PROXY_EXECUTION case we may do sleeping
+	 * dequeues that are not followed by a task switch, so check
+	 * TSK_ONCPU is set to ensure the task switch is imminent.
+	 * Otherwise clear the flags as usual.
	 */
-	if (flags & DEQUEUE_SLEEP)
+	if ((flags & DEQUEUE_SLEEP) && (p->psi_flags & TSK_ONCPU))
 		return;
 
 	/*
-- 
2.47.0.199.ga7371fff76-goog
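[Editorial aside on the patch above: the new condition reads as "only skip
clearing PSI flags on a sleeping dequeue if a task switch is still imminent,
i.e. the task is still marked TSK_ONCPU". The userspace sketch below models
just that predicate; the flag values and the helper name are assumptions for
illustration, not the kernel's actual PSI definitions.]

#include <stdbool.h>
#include <stdio.h>

#define DEQUEUE_SLEEP	0x1
#define TSK_RUNNING	0x1
#define TSK_ONCPU	0x2

/* Return true when psi_dequeue() may skip clearing flags: a task switch
 * is imminent only if the task is still marked on-CPU. */
static bool psi_dequeue_may_shortcut(int dequeue_flags, int psi_flags)
{
	return (dequeue_flags & DEQUEUE_SLEEP) && (psi_flags & TSK_ONCPU);
}

int main(void)
{
	/* Ordinary voluntary sleep: still on-CPU, a switch will follow. */
	printf("on-cpu sleep : shortcut=%d\n",
	       psi_dequeue_may_shortcut(DEQUEUE_SLEEP, TSK_RUNNING | TSK_ONCPU));
	/* Proxy-exec sleeping-owner case: already switched away. */
	printf("off-cpu sleep: shortcut=%d\n",
	       psi_dequeue_may_shortcut(DEQUEUE_SLEEP, TSK_RUNNING));
	return 0;
}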
From nobody Sun Nov 24 09:01:04 2024
Date: Tue, 5 Nov 2024 18:56:45 -0800
In-Reply-To: <20241106025656.2326794-1-jstultz@google.com>
Message-ID: <20241106025656.2326794-6-jstultz@google.com>
Subject: [RFC][PATCH v13 5/7] sched: Add an initial sketch of the find_proxy_task() function
From: John Stultz
To: LKML

Add a find_proxy_task() function which doesn't do much.

When we select a blocked task to run, we will just deactivate it and
pick again. The exception being if it has become unblocked after
find_proxy_task() was called.

Greatly simplified from patch by:
  Peter Zijlstra (Intel)
  Juri Lelli
  Valentin Schneider
  Connor O'Brien

Cc: Joel Fernandes
Cc: Qais Yousef
Cc: Ingo Molnar
Cc: Peter Zijlstra
Cc: Juri Lelli
Cc: Vincent Guittot
Cc: Dietmar Eggemann
Cc: Valentin Schneider
Cc: Steven Rostedt
Cc: Ben Segall
Cc: Zimuzo Ezeozue
Cc: Mel Gorman
Cc: Will Deacon
Cc: Waiman Long
Cc: Boqun Feng
Cc: "Paul E. McKenney"
Cc: Metin Kaya
Cc: Xuewen Yan
Cc: K Prateek Nayak
Cc: Thomas Gleixner
Cc: Daniel Lezcano
Cc: kernel-team@android.com
[jstultz: Split out from larger proxy patch and simplified for review
 and testing.]
Signed-off-by: John Stultz
---
v5:
* Split out from larger proxy patch
v7:
* Fixed unused function arguments, spelling nits, and tweaks for
  clarity, pointed out by Metin Kaya
* Fix build warning
  Reported-by: kernel test robot
  Closes: https://lore.kernel.org/oe-kbuild-all/202311081028.yDLmCWgr-lkp@intel.com/
v8:
* Fixed case where we might return a blocked task from find_proxy_task()
* Continued tweaks to handle avoiding returning blocked tasks
v9:
* Add zap_balance_callbacks helper to unwind balance_callbacks when we
  will re-call pick_next_task() again.
* Add extra comment suggested by Metin * Typo fixes from Metin * Moved adding proxy_resched_idle earlier in the series, as suggested by Metin * Fix to call proxy_resched_idle() *prior* to deactivating next, to avoid crashes caused by stale references to next * s/PROXY/SCHED_PROXY_EXEC/ as suggested by Metin * Number of tweaks and cleanups suggested by Metin * Simplify proxy_deactivate as suggested by Metin v11: * Tweaks for earlier simplification in try_to_deactivate_task v13: * Rename rename "next" to "donor" in find_proxy_task() for clarity * Similarly use "donor" instead of next in proxy_deactivate * Refactor/simplify proxy_resched_idle * Moved up a needed fix from later in the series --- kernel/sched/core.c | 129 ++++++++++++++++++++++++++++++++++++++++++- kernel/sched/rt.c | 15 ++++- kernel/sched/sched.h | 10 +++- 3 files changed, 148 insertions(+), 6 deletions(-) diff --git a/kernel/sched/core.c b/kernel/sched/core.c index f040feed9df3..4e2c51c477b0 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -5010,6 +5010,34 @@ static void do_balance_callbacks(struct rq *rq, stru= ct balance_callback *head) } } =20 +/* + * Only called from __schedule context + * + * There are some cases where we are going to re-do the action + * that added the balance callbacks. We may not be in a state + * where we can run them, so just zap them so they can be + * properly re-added on the next time around. This is similar + * handling to running the callbacks, except we just don't call + * them. + */ +static void zap_balance_callbacks(struct rq *rq) +{ + struct balance_callback *next, *head; + bool found =3D false; + + lockdep_assert_rq_held(rq); + + head =3D rq->balance_callback; + while (head) { + if (head =3D=3D &balance_push_callback) + found =3D true; + next =3D head->next; + head->next =3D NULL; + head =3D next; + } + rq->balance_callback =3D found ? &balance_push_callback : NULL; +} + static void balance_push(struct rq *rq); =20 /* @@ -6543,7 +6571,7 @@ pick_next_task(struct rq *rq, struct task_struct *pre= v, struct rq_flags *rf) * Otherwise marks the task's __state as RUNNING */ static bool try_to_block_task(struct rq *rq, struct task_struct *p, - unsigned long task_state) + unsigned long task_state, bool deactivate_cond) { int flags =3D DEQUEUE_NOCLOCK; =20 @@ -6552,6 +6580,9 @@ static bool try_to_block_task(struct rq *rq, struct t= ask_struct *p, return false; } =20 + if (!deactivate_cond) + return false; + p->sched_contributes_to_load =3D (task_state & TASK_UNINTERRUPTIBLE) && !(task_state & TASK_NOLOAD) && @@ -6575,6 +6606,88 @@ static bool try_to_block_task(struct rq *rq, struct = task_struct *p, return true; } =20 +#ifdef CONFIG_SCHED_PROXY_EXEC + +static inline struct task_struct * +proxy_resched_idle(struct rq *rq) +{ + put_prev_task(rq, rq->donor); + rq_set_donor(rq, rq->idle); + set_next_task(rq, rq->idle); + set_tsk_need_resched(rq->idle); + return rq->idle; +} + +static bool proxy_deactivate(struct rq *rq, struct task_struct *donor) +{ + unsigned long state =3D READ_ONCE(donor->__state); + + /* Don't deactivate if the state has been changed to TASK_RUNNING */ + if (state =3D=3D TASK_RUNNING) + return false; + /* + * Because we got donor from pick_next_task, it is *crucial* + * that we call proxy_resched_idle before we deactivate it. + * As once we deactivate donor, donor->on_rq is set to zero, + * which allows ttwu to immediately try to wake the task on + * another rq. So we cannot use *any* references to donor + * after that point. 
So things like cfs_rq->curr or rq->donor + * need to be changed from next *before* we deactivate. + */ + proxy_resched_idle(rq); + return try_to_block_task(rq, donor, state, true); +} + +/* + * Initial simple proxy that just returns the task if it's waking + * or deactivates the blocked task so we can pick something that + * isn't blocked. + */ +static struct task_struct * +find_proxy_task(struct rq *rq, struct task_struct *donor, struct rq_flags = *rf) +{ + struct task_struct *p =3D donor; + struct mutex *mutex; + + mutex =3D p->blocked_on; + /* Something changed in the chain, so pick again */ + if (!mutex) + return NULL; + /* + * By taking mutex->wait_lock we hold off concurrent mutex_unlock() + * and ensure @owner sticks around. + */ + raw_spin_lock(&mutex->wait_lock); + raw_spin_lock(&p->blocked_lock); + + /* Check again that p is blocked with blocked_lock held */ + if (!task_is_blocked(p) || mutex !=3D get_task_blocked_on(p)) { + /* + * Something changed in the blocked_on chain and + * we don't know if only at this level. So, let's + * just bail out completely and let __schedule + * figure things out (pick_again loop). + */ + goto out; + } + if (!proxy_deactivate(rq, donor)) + /* XXX: This hack won't work when we get to migrations */ + donor->blocked_on_state =3D BO_RUNNABLE; + +out: + raw_spin_unlock(&p->blocked_lock); + raw_spin_unlock(&mutex->wait_lock); + return NULL; +} +#else /* SCHED_PROXY_EXEC */ +static struct task_struct * +find_proxy_task(struct rq *rq, struct task_struct *donor, struct rq_flags = *rf) +{ + WARN_ONCE(1, "This should never be called in the !SCHED_PROXY_EXEC case\n= "); + return donor; +} +#endif /* SCHED_PROXY_EXEC */ + /* * __schedule() is the main scheduler function. * @@ -6683,12 +6796,22 @@ static void __sched notrace __schedule(int sched_mo= de) goto picked; } } else if (!preempt && prev_state) { - block =3D try_to_block_task(rq, prev, prev_state); + block =3D try_to_block_task(rq, prev, prev_state, + !task_is_blocked(prev)); switch_count =3D &prev->nvcsw; } =20 - next =3D pick_next_task(rq, prev, &rf); +pick_again: + next =3D pick_next_task(rq, rq->donor, &rf); rq_set_donor(rq, next); + if (unlikely(task_is_blocked(next))) { + next =3D find_proxy_task(rq, next, &rf); + if (!next) { + /* zap the balance_callbacks before picking again */ + zap_balance_callbacks(rq); + goto pick_again; + } + } picked: clear_tsk_need_resched(prev); clear_preempt_need_resched(); diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c index bd66a46b06ac..fa4d9bf76ad4 100644 --- a/kernel/sched/rt.c +++ b/kernel/sched/rt.c @@ -1479,8 +1479,19 @@ enqueue_task_rt(struct rq *rq, struct task_struct *p= , int flags) =20 enqueue_rt_entity(rt_se, flags); =20 - if (!task_current(rq, p) && p->nr_cpus_allowed > 1) - enqueue_pushable_task(rq, p); + /* + * Current can't be pushed away. Selected is tied to current, + * so don't push it either. + */ + if (task_current(rq, p) || task_current_donor(rq, p)) + return; + /* + * Pinned tasks can't be pushed. 
+ */ + if (p->nr_cpus_allowed =3D=3D 1) + return; + + enqueue_pushable_task(rq, p); } =20 static bool dequeue_task_rt(struct rq *rq, struct task_struct *p, int flag= s) diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index 258db6ef8c70..529d4f34ea7b 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -2271,6 +2271,14 @@ static inline int task_current_donor(struct rq *rq, = struct task_struct *p) return rq->donor =3D=3D p; } =20 +static inline bool task_is_blocked(struct task_struct *p) +{ + if (!sched_proxy_exec()) + return false; + + return !!p->blocked_on && p->blocked_on_state !=3D BO_RUNNABLE; +} + static inline int task_on_cpu(struct rq *rq, struct task_struct *p) { #ifdef CONFIG_SMP @@ -2480,7 +2488,7 @@ static inline void put_prev_set_next_task(struct rq *= rq, struct task_struct *prev, struct task_struct *next) { - WARN_ON_ONCE(rq->curr !=3D prev); + WARN_ON_ONCE(rq->donor !=3D prev); =20 __put_prev_set_next_dl_server(rq, prev, next); =20 --=20 2.47.0.199.ga7371fff76-goog From nobody Sun Nov 24 09:01:04 2024 Received: from mail-pg1-f201.google.com (mail-pg1-f201.google.com [209.85.215.201]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 991BB1917D7 for ; Wed, 6 Nov 2024 02:57:14 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=209.85.215.201 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1730861836; cv=none; b=cs1lakzTJuPGnlgjrKs9aZSEsqAyeiyXV0BblqrTaKTC11nCRVeipXE0+EZdNEzVBV8kkCLSj1yyC+vngXO3ppDdSqDSAkzlZ3cLAodh218HWkVavBk5hYWzZsDk2BF+liY5SwsDARHNEOizOqZYLElsFIrQ8IHmv0c5b4BIb7Q= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1730861836; c=relaxed/simple; bh=mlkvhwb0uPalS0AiigOgKNA1RMt/aZBZzzXvsd1pagg=; h=Date:In-Reply-To:Mime-Version:References:Message-ID:Subject:From: To:Cc:Content-Type; b=B7yHTwDdOrCazD6L9YQWm127nF92e0iE9WY6c9nKt1PQgjoHVPJA4zj38Ekq08IABF+AypZNYct6uk7cLZ8fwsqtO/W40p8mY8B3gVZlvleI5vtfUwTQjGXqQK5LPju+RdPGBThvgcYcOBvMKZ5M+Y5d2unEKrrEfw8cphF/0qA= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=reject dis=none) header.from=google.com; spf=pass smtp.mailfrom=flex--jstultz.bounces.google.com; dkim=pass (2048-bit key) header.d=google.com header.i=@google.com header.b=WV5MRFO7; arc=none smtp.client-ip=209.85.215.201 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=reject dis=none) header.from=google.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=flex--jstultz.bounces.google.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=google.com header.i=@google.com header.b="WV5MRFO7" Received: by mail-pg1-f201.google.com with SMTP id 41be03b00d2f7-7ea6efcd658so1683406a12.3 for ; Tue, 05 Nov 2024 18:57:14 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20230601; t=1730861834; x=1731466634; darn=vger.kernel.org; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:from:to:cc:subject:date:message-id:reply-to; bh=0RdZU9JqmApFBfR2JA2IyOtnzpnkWYvffqeBWmVivMg=; b=WV5MRFO7VOd1qziPtbVzbbsuxPCI5SK4TIfeig3fOBmT25Ggfv/xBh5Nb6Dbaz6WEt /VwmAX4yLKuzf/bNEFMuufM62e2wFd5B2QDyLEo+AR9U98eA5sqLENFqLraC6aqILWD1 aqSaUpu3SuGDJK4nabzbNeZEE1xca0CNOCMCc4DeKhsyYJrfHL9/7KKSageivEG90tHT 1Qf8uJ5DtAr7DY7F9WWsE3I3KGx/5YL/NoL5sENScD2FxdLvebZflkckDtONbZFT1ufL 7qou87ci9NTZkykBYryU8sb5f4r5MkmUkGmjax3yZCTEJebQLtDJx7pe9MSF9YjrPwHN 
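[Editorial aside on the previous patch (5/7), before the next message begins:
the initial find_proxy_task() behaves like a "pick, and if the pick is still
mutex-blocked, deactivate it and pick again" loop. The userspace sketch below
models that loop under those assumptions; the toy runqueue, task fields, and
helper names are illustrative only, not the scheduler's data structures.]

#include <stdbool.h>
#include <stdio.h>

enum bo_state { BO_RUNNABLE, BO_BLOCKED, BO_WAKING };

struct toy_task {
	const char *name;
	bool on_rq;
	enum bo_state blocked_on_state;
};

static struct toy_task *toy_pick_next(struct toy_task **rq, int n)
{
	for (int i = 0; i < n; i++)
		if (rq[i]->on_rq)
			return rq[i];
	return NULL;
}

int main(void)
{
	struct toy_task blocked = { "waiter", true, BO_BLOCKED  };
	struct toy_task owner   = { "owner",  true, BO_RUNNABLE };
	struct toy_task *rq[]   = { &blocked, &owner };
	struct toy_task *next;

pick_again:
	next = toy_pick_next(rq, 2);
	if (next && next->blocked_on_state != BO_RUNNABLE) {
		printf("picked %s, still blocked: deactivate and pick again\n",
		       next->name);
		next->on_rq = false;	/* proxy_deactivate() in the real code */
		goto pick_again;
	}
	printf("running %s\n", next ? next->name : "idle");
	return 0;
}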
Date: Tue, 5 Nov 2024 18:56:46 -0800 In-Reply-To: <20241106025656.2326794-1-jstultz@google.com> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: Mime-Version: 1.0 References: <20241106025656.2326794-1-jstultz@google.com> X-Mailer: git-send-email 2.47.0.199.ga7371fff76-goog Message-ID: <20241106025656.2326794-7-jstultz@google.com> Subject: [RFC][PATCH v13 6/7] sched: Fix proxy/current (push,pull)ability From: John Stultz To: LKML Cc: Valentin Schneider , Joel Fernandes , Qais Yousef , Ingo Molnar , Peter Zijlstra , Juri Lelli , Vincent Guittot , Dietmar Eggemann , Valentin Schneider , Steven Rostedt , Ben Segall , Zimuzo Ezeozue , Mel Gorman , Will Deacon , Waiman Long , Boqun Feng , "Paul E. McKenney" , Metin Kaya , Xuewen Yan , K Prateek Nayak , Thomas Gleixner , Daniel Lezcano , kernel-team@android.com, "Connor O'Brien" , John Stultz Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" From: Valentin Schneider Proxy execution forms atomic pairs of tasks: The waiting donor task (scheduling context) and a proxy (execution context). The donor task, along with the rest of the blocked chain, follows the proxy wrt CPU placement. They can be the same task, in which case push/pull doesn't need any modification. When they are different, however, FIFO1 & FIFO42:

                ,->  RT42
                |     | blocked-on
                |     v
 blocked_donor  |   mutex
                |     | owner
                |     v
                `--  RT1

            RT1
            RT42

     CPU0            CPU1
      ^                ^
      |                |
  overloaded       !overloaded
  rq prio =3D 42    rq prio =3D 0

RT1 is eligible to be pushed to CPU1, but should that happen it will "carry" RT42 along. Clearly here neither RT1 nor RT42 must be seen as push/pullable. Unfortunately, only the donor task is usually dequeued from the rq, and the proxy'ed execution context (rq->curr) remains on the rq. This can cause RT1 to be selected for migration by logic like the rt pushable_list. Thus, add a dequeue/enqueue cycle on the proxy task before __schedule returns, which allows the sched class logic to avoid adding the now current task to the pushable_list.
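
[Illustrative aside, not part of the posted patch: in sched-class terms, the rule described above amounts to a pushability check along these lines. task_current(), task_current_donor() and nr_cpus_allowed are the helpers/fields the series actually uses (see the enqueue_task_rt() hunk earlier in the series); the wrapper name below is made up for illustration.]

	static bool rt_task_is_pushable(struct rq *rq, struct task_struct *p)
	{
		/* rq->curr and rq->donor form an atomic pair: push neither */
		if (task_current(rq, p) || task_current_donor(rq, p))
			return false;

		/* pinned tasks cannot be pushed anywhere else */
		if (p->nr_cpus_allowed == 1)
			return false;

		return true;
	}
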
Furthermore, tasks becoming blocked on a mutex don't need an explicit dequeue/enqueue cycle to be made (push/pull)able: they have to be running to block on a mutex, thus they will eventually hit put_prev_task(). Cc: Joel Fernandes Cc: Qais Yousef Cc: Ingo Molnar Cc: Peter Zijlstra Cc: Juri Lelli Cc: Vincent Guittot Cc: Dietmar Eggemann Cc: Valentin Schneider Cc: Steven Rostedt Cc: Ben Segall Cc: Zimuzo Ezeozue Cc: Mel Gorman Cc: Will Deacon Cc: Waiman Long Cc: Boqun Feng Cc: "Paul E. McKenney" Cc: Metin Kaya Cc: Xuewen Yan Cc: K Prateek Nayak Cc: Thomas Gleixner Cc: Daniel Lezcano Cc: kernel-team@android.com Signed-off-by: Valentin Schneider Signed-off-by: Connor O'Brien Signed-off-by: John Stultz --- v3: * Tweaked comments & commit message v5: * Minor simplifications to utilize the fix earlier in the patch series. * Rework the wording of the commit message to match selected/ proxy terminology and expand a bit to make it more clear how it works. v6: * Dropped now-unused proxied value, to be re-added later in the series when it is used, as caught by Dietmar v7: * Unused function argument fixup * Commit message nit pointed out by Metin Kaya * Dropped unproven unlikely() and use sched_proxy_exec() in proxy_tag_curr, suggested by Metin Kaya v8: * More cleanups and typo fixes suggested by Metin Kaya v11: * Cleanup of commit message suggested by Metin v12: * Rework for rq_selected -> rq->donor renaming --- kernel/sched/core.c | 25 +++++++++++++++++++++++++ 1 file changed, 25 insertions(+) diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 4e2c51c477b0..42ea651d1469 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -6688,6 +6688,23 @@ find_proxy_task(struct rq *rq, struct task_struct *d= onor, struct rq_flags *rf) } #endif /* SCHED_PROXY_EXEC */ =20 +static inline void proxy_tag_curr(struct rq *rq, struct task_struct *owner) +{ + if (!sched_proxy_exec()) + return; + /* + * pick_next_task() calls set_next_task() on the chosen task + * at some point, which ensures it is not push/pullable. + * However, the chosen/donor task *and* the mutex owner form an + * atomic pair wrt push/pull. + * + * Make sure owner we run is not pushable. Unfortunately we can + * only deal with that by means of a dequeue/enqueue cycle. :-/ + */ + dequeue_task(rq, owner, DEQUEUE_NOCLOCK | DEQUEUE_SAVE); + enqueue_task(rq, owner, ENQUEUE_NOCLOCK | ENQUEUE_RESTORE); +} + /* * __schedule() is the main scheduler function. * @@ -6826,6 +6843,10 @@ static void __sched notrace __schedule(int sched_mod= e) * changes to task_struct made by pick_next_task().
*/ RCU_INIT_POINTER(rq->curr, next); + + if (!task_current_donor(rq, next)) + proxy_tag_curr(rq, next); + /* * The membarrier system call requires each architecture * to have a full memory barrier after updating @@ -6859,6 +6880,10 @@ static void __sched notrace __schedule(int sched_mod= e) /* Also unlocks the rq: */ rq =3D context_switch(rq, prev, next, &rf); } else { + /* In case next was already curr but just got blocked_donor */ + if (!task_current_donor(rq, next)) + proxy_tag_curr(rq, next); + rq_unpin_lock(rq, &rf); __balance_callbacks(rq); raw_spin_rq_unlock_irq(rq); --=20 2.47.0.199.ga7371fff76-goog From nobody Sun Nov 24 09:01:04 2024
Date: Tue, 5 Nov 2024 18:56:47 -0800 In-Reply-To: <20241106025656.2326794-1-jstultz@google.com> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: Mime-Version: 1.0 References: <20241106025656.2326794-1-jstultz@google.com> X-Mailer: git-send-email 2.47.0.199.ga7371fff76-goog Message-ID: <20241106025656.2326794-8-jstultz@google.com> Subject: [RFC][PATCH v13 7/7] sched: Start blocked_on chain processing in find_proxy_task() From: John Stultz To: LKML Cc: Peter Zijlstra , Joel Fernandes , Qais Yousef , Ingo Molnar , Juri Lelli , Vincent Guittot , Dietmar Eggemann , Valentin Schneider , Steven Rostedt , Ben Segall , Zimuzo Ezeozue , Mel Gorman , Will Deacon , Waiman Long , Boqun Feng , "Paul E. McKenney" , Metin Kaya , Xuewen Yan , K Prateek Nayak , Thomas Gleixner , Daniel Lezcano , kernel-team@android.com, Valentin Schneider , "Connor O'Brien" , John Stultz Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" From: Peter Zijlstra Start to flesh out the real find_proxy_task() implementation, but avoid the migration cases for now; in those cases, just deactivate the donor task and pick again. To ensure the donor task or other blocked tasks in the chain aren't migrated away while we're running the proxy, also tweak the CFS logic to avoid migrating donor or mutex-blocked tasks. Cc: Joel Fernandes Cc: Qais Yousef Cc: Ingo Molnar Cc: Peter Zijlstra Cc: Juri Lelli Cc: Vincent Guittot Cc: Dietmar Eggemann Cc: Valentin Schneider Cc: Steven Rostedt Cc: Ben Segall Cc: Zimuzo Ezeozue Cc: Mel Gorman Cc: Will Deacon Cc: Waiman Long Cc: Boqun Feng Cc: "Paul E. McKenney" Cc: Metin Kaya Cc: Xuewen Yan Cc: K Prateek Nayak Cc: Thomas Gleixner Cc: Daniel Lezcano Cc: kernel-team@android.com Signed-off-by: Peter Zijlstra (Intel) Signed-off-by: Juri Lelli Signed-off-by: Valentin Schneider Signed-off-by: Connor O'Brien [jstultz: This change was split out from the larger proxy patch] Signed-off-by: John Stultz --- v5: * Split this out from larger proxy patch v7: * Minor refactoring of core find_proxy_task() function * Minor spelling and corrections suggested by Metin Kaya * Dropped an added BUG_ON that was frequently tripped v8: * Fix issue so that if proxy_deactivate fails, we don't leave the task BO_BLOCKED * Switch to WARN_ON from BUG_ON checks v9: * Improve comments suggested by Metin * Minor cleanups v11: * Previously we checked next=3D=3Drq->idle && prev=3D=3Drq->idle, but I think we only really care if next=3D=3Drq->idle from find_proxy_task, as we will still want to resched regardless of what prev was.
v12: * Commit message rework for selected -> donor rewording v13: * Address new delayed dequeue condition (deactivate donor for now) * Next to donor renaming in find_proxy_task * Improved comments for find_proxy_task * Rework for proxy_deactivate cleanup --- kernel/sched/core.c | 164 ++++++++++++++++++++++++++++++++++++-------- kernel/sched/fair.c | 10 ++- 2 files changed, 146 insertions(+), 28 deletions(-) diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 42ea651d1469..932f49765ddf 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -96,6 +96,7 @@ #include "../workqueue_internal.h" #include "../../io_uring/io-wq.h" #include "../smpboot.h" +#include "../locking/mutex.h" =20 EXPORT_TRACEPOINT_SYMBOL_GPL(ipi_send_cpu); EXPORT_TRACEPOINT_SYMBOL_GPL(ipi_send_cpumask); @@ -2900,8 +2901,15 @@ static int affine_move_task(struct rq *rq, struct ta= sk_struct *p, struct rq_flag struct set_affinity_pending my_pending =3D { }, *pending =3D NULL; bool stop_pending, complete =3D false; =20 - /* Can the task run on the task's current CPU? If so, we're done */ - if (cpumask_test_cpu(task_cpu(p), &p->cpus_mask)) { + /* + * Can the task run on the task's current CPU? If so, we're done + * + * We are also done if the task is the current donor, boosting a lock- + * holding proxy, (and potentially has been migrated outside its + * current or previous affinity mask) + */ + if (cpumask_test_cpu(task_cpu(p), &p->cpus_mask) || + (task_current_donor(rq, p) && !task_current(rq, p))) { struct task_struct *push_task =3D NULL; =20 if ((flags & SCA_MIGRATE_ENABLE) && @@ -6639,41 +6647,139 @@ static bool proxy_deactivate(struct rq *rq, struct= task_struct *donor) } =20 /* - * Initial simple proxy that just returns the task if it's waking - * or deactivates the blocked task so we can pick something that - * isn't blocked. + * Find runnable lock owner to proxy for mutex blocked donor + * + * Follow the blocked-on relation: + * task->blocked_on -> mutex->owner -> task... + * + * Lock order: + * + * p->pi_lock + * rq->lock + * mutex->wait_lock + * p->blocked_lock + * + * Returns the task that is going to be used as execution context (the one + * that is actually going to be run on cpu_of(rq)). */ static struct task_struct * find_proxy_task(struct rq *rq, struct task_struct *donor, struct rq_flags = *rf) { - struct task_struct *p =3D donor; + struct task_struct *owner =3D NULL; + struct task_struct *ret =3D NULL; + int this_cpu =3D cpu_of(rq); + struct task_struct *p; struct mutex *mutex; =20 - mutex =3D p->blocked_on; - /* Something changed in the chain, so pick again */ - if (!mutex) - return NULL; - /* - * By taking mutex->wait_lock we hold off concurrent mutex_unlock() - * and ensure @owner sticks around. - */ - raw_spin_lock(&mutex->wait_lock); - raw_spin_lock(&p->blocked_lock); + /* Follow blocked_on chain. */ + for (p =3D donor; task_is_blocked(p); p =3D owner) { + mutex =3D p->blocked_on; + /* Something changed in the chain, so pick again */ + if (!mutex) + return NULL; + /* + * By taking mutex->wait_lock we hold off concurrent mutex_unlock() + * and ensure @owner sticks around. + */ + raw_spin_lock(&mutex->wait_lock); + raw_spin_lock(&p->blocked_lock); + + /* Check again that p is blocked with blocked_lock held */ + if (mutex !=3D get_task_blocked_on(p)) { + /* + * Something changed in the blocked_on chain and + * we don't know if only at this level. So, let's + * just bail out completely and let __schedule + * figure things out (pick_again loop). 
+ */ + goto out; + } + + owner =3D __mutex_owner(mutex); + if (!owner) { + p->blocked_on_state =3D BO_RUNNABLE; + ret =3D p; + goto out; + } + + if (task_cpu(owner) !=3D this_cpu) { + /* XXX Don't handle migrations yet */ + if (!proxy_deactivate(rq, donor)) + goto deactivate_failed; + goto out; + } + + if (task_on_rq_migrating(owner)) { + /* + * One of the chain of mutex owners is currently migrating to this + * CPU, but has not yet been enqueued because we are holding the + * rq lock. As a simple solution, just schedule rq->idle to give + * the migration a chance to complete. Much like the migrate_task + * case we should end up back in find_proxy_task(), this time + * hopefully with all relevant tasks already enqueued. + */ + raw_spin_unlock(&p->blocked_lock); + raw_spin_unlock(&mutex->wait_lock); + return proxy_resched_idle(rq); + } + + if (!owner->on_rq) { + /* XXX Don't handle blocked owners yet */ + if (!proxy_deactivate(rq, donor)) + goto deactivate_failed; + goto out; + } + + if (owner->se.sched_delayed) { + /* XXX Don't handle delayed dequeue yet */ + if (!proxy_deactivate(rq, donor)) + goto deactivate_failed; + goto out; + } + + if (owner =3D=3D p) { + /* + * It's possible we interleave with mutex_unlock like: + * + * lock(&rq->lock); + * find_proxy_task() + * mutex_unlock() + * lock(&wait_lock); + * donor(owner) =3D current->blocked_donor; + * unlock(&wait_lock); + * + * wake_up_q(); + * ... + * ttwu_runnable() + * __task_rq_lock() + * lock(&wait_lock); + * owner =3D=3D p + * + * Which leaves us to finish the ttwu_runnable() and make it go. + * + * So schedule rq->idle so that ttwu_runnable can get the rq lock + * and mark owner as running. + */ + raw_spin_unlock(&p->blocked_lock); + raw_spin_unlock(&mutex->wait_lock); + return proxy_resched_idle(rq); + } =20 - /* Check again that p is blocked with blocked_lock held */ - if (!task_is_blocked(p) || mutex !=3D get_task_blocked_on(p)) { /* - * Something changed in the blocked_on chain and - * we don't know if only at this level. So, let's - * just bail out completely and let __schedule - * figure things out (pick_again loop). + * OK, now we're absolutely sure @owner is on this + * rq, therefore holding @rq->lock is sufficient to + * guarantee its existence, as per ttwu_remote(). 
*/ - goto out; + raw_spin_unlock(&p->blocked_lock); + raw_spin_unlock(&mutex->wait_lock); } - if (!proxy_deactivate(rq, donor)) - /* XXX: This hack won't work when we get to migrations */ - donor->blocked_on_state =3D BO_RUNNABLE; =20 + WARN_ON_ONCE(owner && !owner->on_rq); + return owner; + +deactivate_failed: + /* XXX: This hack won't work when we get to migrations */ + donor->blocked_on_state =3D BO_RUNNABLE; out: raw_spin_unlock(&p->blocked_lock); raw_spin_unlock(&mutex->wait_lock); @@ -6758,6 +6864,7 @@ static void __sched notrace __schedule(int sched_mode) struct rq_flags rf; struct rq *rq; int cpu; + bool preserve_need_resched =3D false; =20 cpu =3D smp_processor_id(); rq =3D cpu_rq(cpu); @@ -6828,9 +6935,12 @@ static void __sched notrace __schedule(int sched_mod= e) zap_balance_callbacks(rq); goto pick_again; } + if (next =3D=3D rq->idle) + preserve_need_resched =3D true; } picked: - clear_tsk_need_resched(prev); + if (!preserve_need_resched) + clear_tsk_need_resched(prev); clear_preempt_need_resched(); #ifdef CONFIG_SCHED_DEBUG rq->last_seen_need_resched_ns =3D 0; diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 42043310adfe..9625c46aed1e 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -9372,6 +9372,7 @@ int can_migrate_task(struct task_struct *p, struct lb= _env *env) * 2) cannot be migrated to this CPU due to cpus_ptr, or * 3) running (obviously), or * 4) are cache-hot on their current CPU. + * 5) are blocked on mutexes (if SCHED_PROXY_EXEC is enabled) */ if (throttled_lb_pair(task_group(p), env->src_cpu, env->dst_cpu)) return 0; @@ -9380,6 +9381,9 @@ int can_migrate_task(struct task_struct *p, struct lb= _env *env) if (kthread_is_per_cpu(p)) return 0; =20 + if (task_is_blocked(p)) + return 0; + if (!cpumask_test_cpu(env->dst_cpu, p->cpus_ptr)) { int cpu; =20 @@ -9416,7 +9420,8 @@ int can_migrate_task(struct task_struct *p, struct lb= _env *env) /* Record that we found at least one task that could run on dst_cpu */ env->flags &=3D ~LBF_ALL_PINNED; =20 - if (task_on_cpu(env->src_rq, p)) { + if (task_on_cpu(env->src_rq, p) || + task_current_donor(env->src_rq, p)) { schedstat_inc(p->stats.nr_failed_migrations_running); return 0; } @@ -9455,6 +9460,9 @@ static void detach_task(struct task_struct *p, struct= lb_env *env) { lockdep_assert_rq_held(env->src_rq); =20 + WARN_ON(task_current(env->src_rq, p)); + WARN_ON(task_current_donor(env->src_rq, p)); + deactivate_task(env->src_rq, p, DEQUEUE_NOCLOCK); set_task_cpu(p, env->dst_cpu); } --=20 2.47.0.199.ga7371fff76-goog
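
[Illustrative summary, not part of the posted series: with the locking, re-checking under mutex->wait_lock/blocked_lock, and the idle-resched special cases (migrating owner, owner =3D=3D p) stripped out, the chain walk this patch adds to find_proxy_task() behaves roughly like the sketch below. All helpers (task_is_blocked(), __mutex_owner(), proxy_deactivate(), the blocked_on fields) are the ones introduced or used by the series; only the _sketch name is invented.]

	static struct task_struct *
	find_proxy_task_sketch(struct rq *rq, struct task_struct *donor)
	{
		struct task_struct *owner, *p;

		for (p = donor; task_is_blocked(p); p = owner) {
			struct mutex *mutex = p->blocked_on;

			/* Chain changed under us: let __schedule() pick again */
			if (!mutex)
				return NULL;

			owner = __mutex_owner(mutex);
			if (!owner) {
				/* Mutex was released: p is runnable again, so re-pick */
				p->blocked_on_state = BO_RUNNABLE;
				return NULL;
			}

			if (task_cpu(owner) != cpu_of(rq) || !owner->on_rq) {
				/* Migration/blocked-owner cases are not handled yet */
				proxy_deactivate(rq, donor);
				return NULL;
			}
			/* Owner is runnable on this CPU: walk one level up the chain */
		}

		/* p is a runnable lock owner: run it on behalf of the donor */
		return p;
	}
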