From: Aaron Lu
To: Valentin Schneider, Ben Segall, K Prateek Nayak, Peter Zijlstra, Chengming Zhou, Josh Don, Ingo Molnar, Vincent Guittot, Xi Wang
Cc: linux-kernel@vger.kernel.org, Juri Lelli, Dietmar Eggemann, Steven Rostedt, Mel Gorman, Chuyi Zhou, Jan Kiszka, Florian Bezdeka
Subject: [PATCH v2 1/5] sched/fair: Add related data structure for task based throttle
Date: Wed, 18 Jun 2025 16:19:36 +0800
Message-Id: <20250618081940.621-2-ziqianlu@bytedance.com>
In-Reply-To: <20250618081940.621-1-ziqianlu@bytedance.com>

From: Valentin Schneider

Add the data structures needed by the new task-based throttle
functionality.

Reviewed-by: Chengming Zhou
Signed-off-by: Valentin Schneider
Signed-off-by: Aaron Lu
Tested-by: K Prateek Nayak
---
 include/linux/sched.h |  5 +++++
 kernel/sched/core.c   |  3 +++
 kernel/sched/fair.c   | 13 +++++++++++++
 kernel/sched/sched.h  |  3 +++
 4 files changed, 24 insertions(+)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index eec6b225e9d14..95f2453b17222 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -874,6 +874,11 @@ struct task_struct {

 #ifdef CONFIG_CGROUP_SCHED
 	struct task_group		*sched_task_group;
+#ifdef CONFIG_CFS_BANDWIDTH
+	struct callback_head		sched_throttle_work;
+	struct list_head		throttle_node;
+	bool				throttled;
+#endif
 #endif

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index a03c3c1d7f50a..afb4da893c02f 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4446,6 +4446,9 @@ static void __sched_fork(unsigned long clone_flags, struct task_struct *p)

 #ifdef CONFIG_FAIR_GROUP_SCHED
 	p->se.cfs_rq = NULL;
+#ifdef CONFIG_CFS_BANDWIDTH
+	init_cfs_throttle_work(p);
+#endif
 #endif

 #ifdef CONFIG_SCHEDSTATS
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 6b17d3da034af..f7b8597bc95ac 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5757,6 +5757,18 @@ static inline int throttled_lb_pair(struct task_group *tg,
 		throttled_hierarchy(dest_cfs_rq);
 }

+static void throttle_cfs_rq_work(struct callback_head *work)
+{
+}
+
+void init_cfs_throttle_work(struct task_struct *p)
+{
+	init_task_work(&p->sched_throttle_work, throttle_cfs_rq_work);
+	/* Protect against double add, see throttle_cfs_rq() and throttle_cfs_rq_work() */
+	p->sched_throttle_work.next = &p->sched_throttle_work;
+	INIT_LIST_HEAD(&p->throttle_node);
+}
+
 static int tg_unthrottle_up(struct task_group *tg, void *data)
 {
 	struct rq *rq = data;
@@ -6488,6 +6500,7 @@ static void init_cfs_rq_runtime(struct cfs_rq *cfs_rq)
 	cfs_rq->runtime_enabled = 0;
 	INIT_LIST_HEAD(&cfs_rq->throttled_list);
 	INIT_LIST_HEAD(&cfs_rq->throttled_csd_list);
+	INIT_LIST_HEAD(&cfs_rq->throttled_limbo_list);
 }

 void start_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index c323d015486cf..fa78c0c3efe39 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -728,6 +728,7 @@ struct cfs_rq {
 	int			throttle_count;
 	struct list_head	throttled_list;
 	struct list_head	throttled_csd_list;
+	struct list_head	throttled_limbo_list;
 #endif /* CONFIG_CFS_BANDWIDTH */
 #endif /* CONFIG_FAIR_GROUP_SCHED */
 };
@@ -2627,6 +2628,8 @@ extern bool sched_rt_bandwidth_account(struct rt_rq *rt_rq);

 extern void init_dl_entity(struct sched_dl_entity *dl_se);

+extern void init_cfs_throttle_work(struct task_struct *p);
+
 #define BW_SHIFT		20
 #define BW_UNIT			(1 << BW_SHIFT)
 #define RATIO_SHIFT		8
--
2.39.5
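[Editor's illustration] The init_cfs_throttle_work() added above uses a self-referencing `next` pointer as a "not queued" sentinel to guard against double-adding the work. A minimal self-contained C sketch of that pattern, with simplified stand-in types (this is not the kernel's actual task_work implementation):

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-in for the kernel's struct callback_head. */
struct callback_head {
	struct callback_head *next;
	void (*func)(struct callback_head *);
};

/* Not queued: next points back at the node itself. */
static void work_init(struct callback_head *w, void (*fn)(struct callback_head *))
{
	w->func = fn;
	w->next = w;
}

static int work_is_queued(const struct callback_head *w)
{
	return w->next != w;
}

/* Queue only once; a second attempt is a no-op. */
static int work_add(struct callback_head **head, struct callback_head *w)
{
	if (work_is_queued(w))
		return 0;
	w->next = *head;	/* may be NULL for an empty list */
	*head = w;
	return 1;
}

static void demo_fn(struct callback_head *w) { (void)w; }

int main(void)
{
	struct callback_head *head = NULL, w;

	work_init(&w, demo_fn);
	assert(!work_is_queued(&w));
	assert(work_add(&head, &w) == 1);	/* queued */
	assert(work_add(&head, &w) == 0);	/* double add rejected */
	return 0;
}
```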
From: Aaron Lu
Subject: [PATCH v2 2/5] sched/fair: Implement throttle task work and related helpers
Date: Wed, 18 Jun 2025 16:19:37 +0800
Message-Id: <20250618081940.621-3-ziqianlu@bytedance.com>
In-Reply-To: <20250618081940.621-1-ziqianlu@bytedance.com>

From: Valentin Schneider

Implement the throttle_cfs_rq_work() task work, which is executed on the
task's return-to-user path: there the task is dequeued and marked as
throttled.

Signed-off-by: Valentin Schneider
Signed-off-by: Aaron Lu
Reviewed-by: Chengming Zhou
Tested-by: K Prateek Nayak
---
 kernel/sched/fair.c | 65 +++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 65 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index f7b8597bc95ac..8226120b8771a 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5757,8 +5757,51 @@ static inline int throttled_lb_pair(struct task_group *tg,
 		throttled_hierarchy(dest_cfs_rq);
 }

+static inline bool task_is_throttled(struct task_struct *p)
+{
+	return p->throttled;
+}
+
+static bool dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags);
 static void throttle_cfs_rq_work(struct callback_head *work)
 {
+	struct task_struct *p = container_of(work, struct task_struct, sched_throttle_work);
+	struct sched_entity *se;
+	struct cfs_rq *cfs_rq;
+	struct rq *rq;
+
+	WARN_ON_ONCE(p != current);
+	p->sched_throttle_work.next = &p->sched_throttle_work;
+
+	/*
+	 * If task is exiting, then there won't be a return to userspace, so we
+	 * don't have to bother with any of this.
+	 */
+	if ((p->flags & PF_EXITING))
+		return;
+
+	scoped_guard(task_rq_lock, p) {
+		se = &p->se;
+		cfs_rq = cfs_rq_of(se);
+
+		/* Raced, forget */
+		if (p->sched_class != &fair_sched_class)
+			return;
+
+		/*
+		 * If not in limbo, then either replenish has happened or this
+		 * task got migrated out of the throttled cfs_rq, move along.
+		 */
+		if (!cfs_rq->throttle_count)
+			return;
+		rq = scope.rq;
+		update_rq_clock(rq);
+		WARN_ON_ONCE(p->throttled || !list_empty(&p->throttle_node));
+		dequeue_task_fair(rq, p, DEQUEUE_SLEEP | DEQUEUE_SPECIAL);
+		list_add(&p->throttle_node, &cfs_rq->throttled_limbo_list);
+		p->throttled = true;
+		resched_curr(rq);
+	}
 }

 void init_cfs_throttle_work(struct task_struct *p)
@@ -5798,6 +5841,26 @@ static int tg_unthrottle_up(struct task_group *tg, void *data)
 	return 0;
 }

+static inline bool task_has_throttle_work(struct task_struct *p)
+{
+	return p->sched_throttle_work.next != &p->sched_throttle_work;
+}
+
+static inline void task_throttle_setup_work(struct task_struct *p)
+{
+	if (task_has_throttle_work(p))
+		return;
+
+	/*
+	 * Kthreads and exiting tasks don't return to userspace, so adding the
+	 * work is pointless
+	 */
+	if ((p->flags & (PF_EXITING | PF_KTHREAD)))
+		return;
+
+	task_work_add(p, &p->sched_throttle_work, TWA_RESUME);
+}
+
 static int tg_throttle_down(struct task_group *tg, void *data)
 {
 	struct rq *rq = data;
@@ -6668,6 +6731,8 @@ static bool check_cfs_rq_runtime(struct cfs_rq *cfs_rq) { return false; }
 static void check_enqueue_throttle(struct cfs_rq *cfs_rq) {}
 static inline void sync_throttle(struct task_group *tg, int cpu) {}
 static __always_inline void return_cfs_rq_runtime(struct cfs_rq *cfs_rq) {}
+static void task_throttle_setup_work(struct task_struct *p) {}
+static bool task_is_throttled(struct task_struct *p) { return false; }

 static inline int cfs_rq_throttled(struct cfs_rq *cfs_rq)
 {
--
2.39.5
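[Editor's illustration] throttle_cfs_rq_work() above recovers the owning task_struct from the embedded callback_head via container_of(). A self-contained C sketch of that recovery; the struct names here are hypothetical demo types, and the macro expansion follows the common kernel definition:

```c
#include <assert.h>
#include <stddef.h>

#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

struct callback_head {
	struct callback_head *next;
	void (*func)(struct callback_head *);
};

/* Hypothetical stand-in for task_struct with an embedded work node. */
struct task {
	int pid;
	struct callback_head sched_throttle_work;
};

static void throttle_work_fn(struct callback_head *work)
{
	/* Recover the task that embeds this callback_head. */
	struct task *p = container_of(work, struct task, sched_throttle_work);

	assert(p->pid == 42);
}

int main(void)
{
	struct task t = { .pid = 42 };

	t.sched_throttle_work.func = throttle_work_fn;
	/* The work runner only sees the callback_head pointer. */
	t.sched_throttle_work.func(&t.sched_throttle_work);
	return 0;
}
```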
From: Aaron Lu
Subject: [PATCH v2 3/5] sched/fair: Switch to task based throttle model
Date: Wed, 18 Jun 2025 16:19:38 +0800
Message-Id: <20250618081940.621-4-ziqianlu@bytedance.com>
In-Reply-To: <20250618081940.621-1-ziqianlu@bytedance.com>

From: Valentin Schneider

In the current throttle model, when a cfs_rq is throttled, its entity is
dequeued from the cpu's rq, so the tasks attached to it cannot run, which
achieves the throttle target. This has a drawback though: assume a task
is a reader of percpu_rwsem and is waiting. When it gets woken, it cannot
run until its task group's next period comes, which can be a relatively
long time. The waiting writer has to wait that much longer as a result,
and more readers build up behind it, eventually triggering a task hung
report.

To improve this situation, change the throttle model to be task based,
i.e. when a cfs_rq is throttled, record its throttled status but do not
remove it from the cpu's rq. Instead, for tasks that belong to this
cfs_rq, when they get picked, add a task work to them so that when they
return to user, they can be dequeued there. In this way, throttled tasks
will not hold any kernel resources. On unthrottle, enqueue those tasks
back so they can continue to run.

A throttled cfs_rq's leaf_cfs_rq_list is handled differently now: since a
task can be enqueued to a throttled cfs_rq and get to run, to not break
the assert_list_leaf_cfs_rq() in enqueue_task_fair(), always add the
cfs_rq to the leaf cfs_rq list when it has its first entity enqueued and
delete it from the leaf cfs_rq list when it has no tasks enqueued.
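[Editor's illustration] A minimal userspace C sketch of the model just described, using plain singly linked lists and hypothetical names rather than kernel code: throttled tasks are parked on a per-cfs_rq limbo list when they hit their return-to-user dequeue, and are re-enqueued on unthrottle.

```c
#include <stdio.h>

struct task {
	const char *name;
	int throttled;
	struct task *next;	/* limbo-list linkage */
};

/* Hypothetical stand-in for a cfs_rq with its limbo list. */
struct cfs_rq {
	int throttle_count;
	struct task *limbo;	/* tasks dequeued after throttle */
};

/* Models the return-to-user dequeue once the cfs_rq is throttled. */
static void throttle_task(struct cfs_rq *rq, struct task *p)
{
	p->throttled = 1;
	p->next = rq->limbo;
	rq->limbo = p;
}

/* On unthrottle, every parked task becomes runnable again. */
static void unthrottle(struct cfs_rq *rq)
{
	while (rq->limbo) {
		struct task *p = rq->limbo;

		rq->limbo = p->next;
		p->throttled = 0;
		printf("re-enqueue %s\n", p->name);
	}
}

int main(void)
{
	struct cfs_rq rq = { .throttle_count = 1 };
	struct task a = { .name = "a" }, b = { .name = "b" };

	throttle_task(&rq, &a);
	throttle_task(&rq, &b);
	rq.throttle_count = 0;
	unthrottle(&rq);
	return 0;
}
```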
Suggested-by: Chengming Zhou # tag on pick
Signed-off-by: Valentin Schneider
Signed-off-by: Aaron Lu
Tested-by: K Prateek Nayak
---
 kernel/sched/fair.c | 325 +++++++++++++++++++++-----------------------
 1 file changed, 153 insertions(+), 172 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 8226120b8771a..59b372ffae18c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5291,18 +5291,17 @@ enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)

 	if (cfs_rq->nr_queued == 1) {
 		check_enqueue_throttle(cfs_rq);
-		if (!throttled_hierarchy(cfs_rq)) {
-			list_add_leaf_cfs_rq(cfs_rq);
-		} else {
+		list_add_leaf_cfs_rq(cfs_rq);
 #ifdef CONFIG_CFS_BANDWIDTH
+		if (throttled_hierarchy(cfs_rq)) {
 			struct rq *rq = rq_of(cfs_rq);

 			if (cfs_rq_throttled(cfs_rq) && !cfs_rq->throttled_clock)
 				cfs_rq->throttled_clock = rq_clock(rq);
 			if (!cfs_rq->throttled_clock_self)
 				cfs_rq->throttled_clock_self = rq_clock(rq);
-#endif
 		}
+#endif
 	}
 }

@@ -5341,8 +5340,6 @@ static void set_delayed(struct sched_entity *se)
 		struct cfs_rq *cfs_rq = cfs_rq_of(se);

 		cfs_rq->h_nr_runnable--;
-		if (cfs_rq_throttled(cfs_rq))
-			break;
 	}
 }

@@ -5363,8 +5360,6 @@ static void clear_delayed(struct sched_entity *se)
 		struct cfs_rq *cfs_rq = cfs_rq_of(se);

 		cfs_rq->h_nr_runnable++;
-		if (cfs_rq_throttled(cfs_rq))
-			break;
 	}
 }

@@ -5450,8 +5445,11 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 	if (flags & DEQUEUE_DELAYED)
 		finish_delayed_dequeue_entity(se);

-	if (cfs_rq->nr_queued == 0)
+	if (cfs_rq->nr_queued == 0) {
 		update_idle_cfs_rq_clock_pelt(cfs_rq);
+		if (throttled_hierarchy(cfs_rq))
+			list_del_leaf_cfs_rq(cfs_rq);
+	}

 	return true;
 }
@@ -5799,6 +5797,10 @@ static void throttle_cfs_rq_work(struct callback_head *work)
 		WARN_ON_ONCE(p->throttled || !list_empty(&p->throttle_node));
 		dequeue_task_fair(rq, p, DEQUEUE_SLEEP | DEQUEUE_SPECIAL);
 		list_add(&p->throttle_node, &cfs_rq->throttled_limbo_list);
+		/*
+		 * Must not set throttled before dequeue or dequeue will
+		 * mistakenly regard this task as an already throttled one.
+		 */
 		p->throttled = true;
 		resched_curr(rq);
 	}
@@ -5812,32 +5814,116 @@ void init_cfs_throttle_work(struct task_struct *p)
 	INIT_LIST_HEAD(&p->throttle_node);
 }

+/*
+ * Task is throttled and someone wants to dequeue it again:
+ * it could be sched/core when core needs to do things like
+ * task affinity change, task group change, task sched class
+ * change etc. and in these cases, DEQUEUE_SLEEP is not set;
+ * or the task is blocked after throttled due to freezer etc.
+ * and in these cases, DEQUEUE_SLEEP is set.
+ */
+static void detach_task_cfs_rq(struct task_struct *p);
+static void dequeue_throttled_task(struct task_struct *p, int flags)
+{
+	WARN_ON_ONCE(p->se.on_rq);
+	list_del_init(&p->throttle_node);
+
+	/* task blocked after throttled */
+	if (flags & DEQUEUE_SLEEP) {
+		p->throttled = false;
+		return;
+	}
+
+	/*
+	 * task is migrating off its old cfs_rq, detach
+	 * the task's load from its old cfs_rq.
+	 */
+	if (task_on_rq_migrating(p))
+		detach_task_cfs_rq(p);
+}
+
+static bool enqueue_throttled_task(struct task_struct *p)
+{
+	struct cfs_rq *cfs_rq = cfs_rq_of(&p->se);
+
+	/*
+	 * If the throttled task is enqueued to a throttled cfs_rq,
+	 * take the fast path by directly putting the task on the
+	 * target cfs_rq's limbo list, except when p is current because
+	 * the following race can cause p's group_node to be left on
+	 * rq's cfs_tasks list when it's throttled:
+	 *
+	 * cpuX                         cpuY
+	 * taskA ret2user
+	 * throttle_cfs_rq_work()       sched_move_task(taskA)
+	 *   task_rq_lock acquired
+	 *   dequeue_task_fair(taskA)
+	 *   task_rq_lock released
+	 *                                task_rq_lock acquired
+	 *                                task_current_donor(taskA) == true
+	 *                                task_on_rq_queued(taskA) == true
+	 *                                dequeue_task(taskA)
+	 *                                put_prev_task(taskA)
+	 *                                sched_change_group()
+	 *                                enqueue_task(taskA) -> taskA's new cfs_rq
+	 *                                                       is throttled, go
+	 *                                                       fast path and skip
+	 *                                                       actual enqueue
+	 *                                set_next_task(taskA)
+	 *                                  __set_next_task_fair(taskA)
+	 *                                    list_move(&se->group_node, &rq->cfs_tasks); // bug
+	 * schedule()
+	 *
+	 * And in the above race case, the task's current cfs_rq is in the same
+	 * rq as its previous cfs_rq because sched_move_task() doesn't migrate
+	 * task so we can use its current cfs_rq to derive rq and test if the
+	 * task is current.
+	 */
+	if (throttled_hierarchy(cfs_rq) &&
+	    !task_current_donor(rq_of(cfs_rq), p)) {
+		list_add(&p->throttle_node, &cfs_rq->throttled_limbo_list);
+		return true;
+	}

+	/* we can't take the fast path, do an actual enqueue */
+	p->throttled = false;
+	return false;
+}
+
+static void enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags);
 static int tg_unthrottle_up(struct task_group *tg, void *data)
 {
 	struct rq *rq = data;
 	struct cfs_rq *cfs_rq = tg->cfs_rq[cpu_of(rq)];
+	struct task_struct *p, *tmp;

-	cfs_rq->throttle_count--;
-	if (!cfs_rq->throttle_count) {
-		cfs_rq->throttled_clock_pelt_time += rq_clock_pelt(rq) -
-					     cfs_rq->throttled_clock_pelt;
+	if (--cfs_rq->throttle_count)
+		return 0;

-		/* Add cfs_rq with load or one or more already running entities to the list */
-		if (!cfs_rq_is_decayed(cfs_rq))
-			list_add_leaf_cfs_rq(cfs_rq);
+	cfs_rq->throttled_clock_pelt_time += rq_clock_pelt(rq) -
+				     cfs_rq->throttled_clock_pelt;

-		if (cfs_rq->throttled_clock_self) {
-			u64 delta = rq_clock(rq) - cfs_rq->throttled_clock_self;
+	if (cfs_rq->throttled_clock_self) {
+		u64 delta = rq_clock(rq) - cfs_rq->throttled_clock_self;

-			cfs_rq->throttled_clock_self = 0;
+		cfs_rq->throttled_clock_self = 0;

-			if (WARN_ON_ONCE((s64)delta < 0))
-				delta = 0;
+		if (WARN_ON_ONCE((s64)delta < 0))
+			delta = 0;

-			cfs_rq->throttled_clock_self_time += delta;
-		}
+		cfs_rq->throttled_clock_self_time += delta;
 	}

+	/* Re-enqueue the tasks that have been throttled at this level. */
+	list_for_each_entry_safe(p, tmp, &cfs_rq->throttled_limbo_list, throttle_node) {
+		list_del_init(&p->throttle_node);
+		enqueue_task_fair(rq_of(cfs_rq), p, ENQUEUE_WAKEUP);
+	}
+
+	/* Add cfs_rq with load or one or more already running entities to the list */
+	if (!cfs_rq_is_decayed(cfs_rq))
+		list_add_leaf_cfs_rq(cfs_rq);
+
 	return 0;
 }

@@ -5866,17 +5952,19 @@ static int tg_throttle_down(struct task_group *tg, void *data)
 	struct rq *rq = data;
 	struct cfs_rq *cfs_rq = tg->cfs_rq[cpu_of(rq)];

+	if (cfs_rq->throttle_count++)
+		return 0;
+
 	/* group is entering throttled state, stop time */
-	if (!cfs_rq->throttle_count) {
-		cfs_rq->throttled_clock_pelt = rq_clock_pelt(rq);
-		list_del_leaf_cfs_rq(cfs_rq);
+	cfs_rq->throttled_clock_pelt = rq_clock_pelt(rq);

-		WARN_ON_ONCE(cfs_rq->throttled_clock_self);
-		if (cfs_rq->nr_queued)
-			cfs_rq->throttled_clock_self = rq_clock(rq);
-	}
-	cfs_rq->throttle_count++;
+	WARN_ON_ONCE(cfs_rq->throttled_clock_self);
+	if (cfs_rq->nr_queued)
+		cfs_rq->throttled_clock_self = rq_clock(rq);
+	else
+		list_del_leaf_cfs_rq(cfs_rq);

+	WARN_ON_ONCE(!list_empty(&cfs_rq->throttled_limbo_list));
 	return 0;
 }

@@ -5884,9 +5972,7 @@ static bool throttle_cfs_rq(struct cfs_rq *cfs_rq)
 {
 	struct rq *rq = rq_of(cfs_rq);
 	struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg);
-	struct sched_entity *se;
-	long queued_delta, runnable_delta, idle_delta, dequeue = 1;
-	long rq_h_nr_queued = rq->cfs.h_nr_queued;
+	int dequeue = 1;

 	raw_spin_lock(&cfs_b->lock);
 	/* This will start the period timer if necessary */
@@ -5909,72 +5995,11 @@ static bool throttle_cfs_rq(struct cfs_rq *cfs_rq)
 	if (!dequeue)
 		return false;  /* Throttle no longer required. */

-	se = cfs_rq->tg->se[cpu_of(rq_of(cfs_rq))];
-
 	/* freeze hierarchy runnable averages while throttled */
 	rcu_read_lock();
 	walk_tg_tree_from(cfs_rq->tg, tg_throttle_down, tg_nop, (void *)rq);
 	rcu_read_unlock();

-	queued_delta = cfs_rq->h_nr_queued;
-	runnable_delta = cfs_rq->h_nr_runnable;
-	idle_delta = cfs_rq->h_nr_idle;
-	for_each_sched_entity(se) {
-		struct cfs_rq *qcfs_rq = cfs_rq_of(se);
-		int flags;
-
-		/* throttled entity or throttle-on-deactivate */
-		if (!se->on_rq)
-			goto done;
-
-		/*
-		 * Abuse SPECIAL to avoid delayed dequeue in this instance.
-		 * This avoids teaching dequeue_entities() about throttled
-		 * entities and keeps things relatively simple.
-		 */
-		flags = DEQUEUE_SLEEP | DEQUEUE_SPECIAL;
-		if (se->sched_delayed)
-			flags |= DEQUEUE_DELAYED;
-		dequeue_entity(qcfs_rq, se, flags);
-
-		if (cfs_rq_is_idle(group_cfs_rq(se)))
-			idle_delta = cfs_rq->h_nr_queued;
-
-		qcfs_rq->h_nr_queued -= queued_delta;
-		qcfs_rq->h_nr_runnable -= runnable_delta;
-		qcfs_rq->h_nr_idle -= idle_delta;
-
-		if (qcfs_rq->load.weight) {
-			/* Avoid re-evaluating load for this entity: */
-			se = parent_entity(se);
-			break;
-		}
-	}
-
-	for_each_sched_entity(se) {
-		struct cfs_rq *qcfs_rq = cfs_rq_of(se);
-		/* throttled entity or throttle-on-deactivate */
-		if (!se->on_rq)
-			goto done;
-
-		update_load_avg(qcfs_rq, se, 0);
-		se_update_runnable(se);
-
-		if (cfs_rq_is_idle(group_cfs_rq(se)))
-			idle_delta = cfs_rq->h_nr_queued;
-
-		qcfs_rq->h_nr_queued -= queued_delta;
-		qcfs_rq->h_nr_runnable -= runnable_delta;
-		qcfs_rq->h_nr_idle -= idle_delta;
-	}
-
-	/* At this point se is NULL and we are at root level*/
-	sub_nr_running(rq, queued_delta);
-
-	/* Stop the fair server if throttling resulted in no runnable tasks */
-	if (rq_h_nr_queued && !rq->cfs.h_nr_queued)
-		dl_server_stop(&rq->fair_server);
-done:
 	/*
 	 * Note: distribution will already see us throttled via the
 	 * throttled-list.  rq->lock protects completion.
@@ -5990,9 +6015,20 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
 {
 	struct rq *rq = rq_of(cfs_rq);
 	struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg);
-	struct sched_entity *se;
-	long queued_delta, runnable_delta, idle_delta;
-	long rq_h_nr_queued = rq->cfs.h_nr_queued;
+	struct sched_entity *se = cfs_rq->tg->se[cpu_of(rq)];
+
+	/*
+	 * It's possible we are called with !runtime_remaining due to things
+	 * like user changed quota setting (see tg_set_cfs_bandwidth()) or async
+	 * unthrottled us with a positive runtime_remaining but other still
+	 * running entities consumed that runtime before we reached here.
+	 *
+	 * Anyway, we can't unthrottle this cfs_rq without any runtime remaining
+	 * because any enqueue in tg_unthrottle_up() will immediately trigger a
+	 * throttle, which is not supposed to happen on the unthrottle path.
+	 */
+	if (cfs_rq->runtime_enabled && cfs_rq->runtime_remaining <= 0)
+		return;

 	se = cfs_rq->tg->se[cpu_of(rq)];

@@ -6022,62 +6058,8 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
 			if (list_add_leaf_cfs_rq(cfs_rq_of(se)))
 				break;
 		}
-		goto unthrottle_throttle;
-	}
-
-	queued_delta = cfs_rq->h_nr_queued;
-	runnable_delta = cfs_rq->h_nr_runnable;
-	idle_delta = cfs_rq->h_nr_idle;
-	for_each_sched_entity(se) {
-		struct cfs_rq *qcfs_rq = cfs_rq_of(se);
-
-		/* Handle any unfinished DELAY_DEQUEUE business first. */
-		if (se->sched_delayed) {
-			int flags = DEQUEUE_SLEEP | DEQUEUE_DELAYED;
-
-			dequeue_entity(qcfs_rq, se, flags);
-		} else if (se->on_rq)
-			break;
-		enqueue_entity(qcfs_rq, se, ENQUEUE_WAKEUP);
-
-		if (cfs_rq_is_idle(group_cfs_rq(se)))
-			idle_delta = cfs_rq->h_nr_queued;
-
-		qcfs_rq->h_nr_queued += queued_delta;
-		qcfs_rq->h_nr_runnable += runnable_delta;
-		qcfs_rq->h_nr_idle += idle_delta;
-
-		/* end evaluation on encountering a throttled cfs_rq */
-		if (cfs_rq_throttled(qcfs_rq))
-			goto unthrottle_throttle;
 	}

-	for_each_sched_entity(se) {
-		struct cfs_rq *qcfs_rq = cfs_rq_of(se);
-
-		update_load_avg(qcfs_rq, se, UPDATE_TG);
-		se_update_runnable(se);
-
-		if (cfs_rq_is_idle(group_cfs_rq(se)))
-			idle_delta = cfs_rq->h_nr_queued;
-
-		qcfs_rq->h_nr_queued += queued_delta;
-		qcfs_rq->h_nr_runnable += runnable_delta;
-		qcfs_rq->h_nr_idle += idle_delta;
-
-		/* end evaluation on encountering a throttled cfs_rq */
-		if (cfs_rq_throttled(qcfs_rq))
-			goto unthrottle_throttle;
-	}
-
-	/* Start the fair server if un-throttling resulted in new runnable tasks */
-	if (!rq_h_nr_queued && rq->cfs.h_nr_queued)
-		dl_server_start(&rq->fair_server);
-
-	/* At this point se is NULL and we are at root level*/
-	add_nr_running(rq, queued_delta);
-
-unthrottle_throttle:
 	assert_list_leaf_cfs_rq(rq);

 	/* Determine whether we need to wake up potentially idle CPU: */
@@ -6733,6 +6715,8 @@ static inline void sync_throttle(struct task_group *tg, int cpu) {}
 static __always_inline void return_cfs_rq_runtime(struct cfs_rq *cfs_rq) {}
 static void task_throttle_setup_work(struct task_struct *p) {}
 static bool task_is_throttled(struct task_struct *p) { return false; }
+static void dequeue_throttled_task(struct task_struct *p, int flags) {}
+static bool enqueue_throttled_task(struct task_struct *p) { return false; }

 static inline int cfs_rq_throttled(struct cfs_rq *cfs_rq)
 {
@@ -6925,6 +6909,9 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 	int rq_h_nr_queued = rq->cfs.h_nr_queued;
 	u64 slice = 0;

+	if (unlikely(task_is_throttled(p) && enqueue_throttled_task(p)))
+		return;
+
 	/*
 	 * The code below (indirectly) updates schedutil which looks at
 	 * the cfs_rq utilization to select a frequency.
@@ -6977,10 +6964,6 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 		if (cfs_rq_is_idle(cfs_rq))
 			h_nr_idle = 1;

-		/* end evaluation on encountering a throttled cfs_rq */
-		if (cfs_rq_throttled(cfs_rq))
-			goto enqueue_throttle;
-
 		flags = ENQUEUE_WAKEUP;
 	}

@@ -7002,10 +6985,6 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)

 		if (cfs_rq_is_idle(cfs_rq))
 			h_nr_idle = 1;
-
-		/* end evaluation on encountering a throttled cfs_rq */
-		if (cfs_rq_throttled(cfs_rq))
-			goto enqueue_throttle;
 	}

 	if (!rq_h_nr_queued && rq->cfs.h_nr_queued) {
@@ -7035,7 +7014,6 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 	if (!task_new)
 		check_update_overutilized_status(rq);

-enqueue_throttle:
 	assert_list_leaf_cfs_rq(rq);

 	hrtick_update(rq);
@@ -7091,10 +7069,6 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
 		if (cfs_rq_is_idle(cfs_rq))
 			h_nr_idle = h_nr_queued;

-		/* end evaluation on encountering a throttled cfs_rq */
-		if (cfs_rq_throttled(cfs_rq))
-			return 0;
-
 		/* Don't dequeue parent if it has other entities besides us */
 		if (cfs_rq->load.weight) {
 			slice = cfs_rq_min_slice(cfs_rq);
@@ -7131,10 +7105,6 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)

 		if (cfs_rq_is_idle(cfs_rq))
 			h_nr_idle = h_nr_queued;
-
-		/* end evaluation on encountering a throttled cfs_rq */
-		if (cfs_rq_throttled(cfs_rq))
-			return 0;
 	}

 	sub_nr_running(rq, h_nr_queued);
@@ -7171,6 +7141,11 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
 */
 static bool dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 {
+	if (unlikely(task_is_throttled(p))) {
+		dequeue_throttled_task(p, flags);
+		return true;
+	}
+
 	if (!p->se.sched_delayed)
 		util_est_dequeue(&rq->cfs, p);

@@ -8836,19 +8811,22 @@ static struct task_struct *pick_task_fair(struct rq *rq)
 {
 	struct sched_entity *se;
 	struct cfs_rq *cfs_rq;
+	struct task_struct *p;
+	bool throttled;

 again:
 	cfs_rq = &rq->cfs;
 	if (!cfs_rq->nr_queued)
 		return NULL;

+	throttled = false;
+
 	do {
 		/* Might not have done put_prev_entity() */
 		if (cfs_rq->curr && cfs_rq->curr->on_rq)
 			update_curr(cfs_rq);

-		if (unlikely(check_cfs_rq_runtime(cfs_rq)))
-			goto again;
+		throttled |= check_cfs_rq_runtime(cfs_rq);

 		se = pick_next_entity(rq, cfs_rq);
 		if (!se)
@@ -8856,7 +8834,10 @@ static struct task_struct *pick_task_fair(struct rq *rq)
 		cfs_rq = group_cfs_rq(se);
 	} while (cfs_rq);

-	return task_of(se);
+	p = task_of(se);
+	if (unlikely(throttled))
+		task_throttle_setup_work(p);
+	return p;
 }

 static void __set_next_task_fair(struct rq *rq, struct task_struct *p, bool first);
--
2.39.5
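[Editor's illustration] tg_throttle_down()/tg_unthrottle_up() in the patch above use throttle_count as a per-cfs_rq nesting counter: only the 0-to-1 transition freezes the clock and only the 1-to-0 transition re-enqueues the limbo tasks. A minimal C sketch of that counting discipline, with hypothetical demo names:

```c
#include <assert.h>

struct cfs_rq_demo {
	int throttle_count;
	int frozen;	/* 1 while the hierarchy is throttled */
};

/* Mirrors tg_throttle_down(): only the first throttle does real work. */
static void throttle_down(struct cfs_rq_demo *rq)
{
	if (rq->throttle_count++)
		return;		/* already throttled via another ancestor */
	rq->frozen = 1;
}

/* Mirrors tg_unthrottle_up(): only the last unthrottle does real work. */
static void unthrottle_up(struct cfs_rq_demo *rq)
{
	if (--rq->throttle_count)
		return;		/* still throttled via another ancestor */
	rq->frozen = 0;		/* here the kernel re-enqueues limbo tasks */
}

int main(void)
{
	struct cfs_rq_demo rq = { 0 };

	throttle_down(&rq);	/* parent group throttled */
	throttle_down(&rq);	/* nested group throttled too */
	assert(rq.frozen);
	unthrottle_up(&rq);
	assert(rq.frozen);	/* still frozen: one level remains */
	unthrottle_up(&rq);
	assert(!rq.frozen);
	return 0;
}
```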
From: Aaron Lu
Subject: [PATCH v2 4/5] sched/fair: Task based throttle time accounting
Date: Wed, 18 Jun 2025 16:19:39 +0800
Message-Id: <20250618081940.621-5-ziqianlu@bytedance.com>
In-Reply-To: <20250618081940.621-1-ziqianlu@bytedance.com>

With the task based throttle model, the previous way of checking a
cfs_rq's nr_queued to decide whether throttled time should be accounted
no longer works as expected: e.g. when a cfs_rq that has a single task is
throttled, that task could later block in kernel mode instead of being
dequeued onto the limbo list, and accounting this period as throttled
time is not accurate.

Rework throttle time accounting for a cfs_rq as follows:
- start accounting when the first task gets throttled in its hierarchy;
- stop accounting on unthrottle.

Suggested-by: Chengming Zhou # accounting mechanism
Co-developed-by: K Prateek Nayak # simplify implementation
Signed-off-by: K Prateek Nayak
Signed-off-by: Aaron Lu
Tested-by: K Prateek Nayak
---
 kernel/sched/fair.c  | 41 +++++++++++++++++++++++------------------
 kernel/sched/sched.h |  1 +
 2 files changed, 24 insertions(+), 18 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 59b372ffae18c..aab3ce4073582 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5292,16 +5292,6 @@ enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 	if (cfs_rq->nr_queued == 1) {
 		check_enqueue_throttle(cfs_rq);
 		list_add_leaf_cfs_rq(cfs_rq);
-#ifdef CONFIG_CFS_BANDWIDTH
-		if (throttled_hierarchy(cfs_rq)) {
-			struct rq *rq = rq_of(cfs_rq);
-
-			if (cfs_rq_throttled(cfs_rq) && !cfs_rq->throttled_clock)
-				cfs_rq->throttled_clock = rq_clock(rq);
-			if (!cfs_rq->throttled_clock_self)
-				cfs_rq->throttled_clock_self = rq_clock(rq);
-		}
-#endif
 	}
 }

@@ -5387,7 +5377,7 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 	 * DELAY_DEQUEUE relies on spurious wakeups, special task
 	 * states must not suffer spurious wakeups, excempt them.
 	 */
-	if (flags & DEQUEUE_SPECIAL)
+	if (flags & (DEQUEUE_SPECIAL | DEQUEUE_THROTTLE))
 		delay = false;

 	WARN_ON_ONCE(delay && se->sched_delayed);
@@ -5795,7 +5785,7 @@ static void throttle_cfs_rq_work(struct callback_head *work)
 		rq = scope.rq;
 		update_rq_clock(rq);
 		WARN_ON_ONCE(p->throttled || !list_empty(&p->throttle_node));
-		dequeue_task_fair(rq, p, DEQUEUE_SLEEP | DEQUEUE_SPECIAL);
+		dequeue_task_fair(rq, p, DEQUEUE_SLEEP | DEQUEUE_THROTTLE);
 		list_add(&p->throttle_node, &cfs_rq->throttled_limbo_list);
 		/*
 		 * Must not set throttled before dequeue or dequeue will
@@ -5947,6 +5937,17 @@ static inline void task_throttle_setup_work(struct task_struct *p)
 	task_work_add(p, &p->sched_throttle_work, TWA_RESUME);
 }

+static void record_throttle_clock(struct cfs_rq *cfs_rq)
+{
+	struct rq *rq = rq_of(cfs_rq);
+
+	if (cfs_rq_throttled(cfs_rq) && !cfs_rq->throttled_clock)
+		cfs_rq->throttled_clock = rq_clock(rq);
+
+	if (!cfs_rq->throttled_clock_self)
+		cfs_rq->throttled_clock_self = rq_clock(rq);
+}
+
 static int tg_throttle_down(struct task_group *tg, void *data)
 {
 	struct rq *rq = data;
@@ -5958,12 +5959,10 @@ static int tg_throttle_down(struct task_group *tg, void *data)
 	/* group is entering throttled state, stop time */
 	cfs_rq->throttled_clock_pelt = rq_clock_pelt(rq);

-	WARN_ON_ONCE(cfs_rq->throttled_clock_self);
-	if (cfs_rq->nr_queued)
-		cfs_rq->throttled_clock_self = rq_clock(rq);
-	else
+	if (!cfs_rq->nr_queued)
 		list_del_leaf_cfs_rq(cfs_rq);

+	WARN_ON_ONCE(cfs_rq->throttled_clock_self);
 	WARN_ON_ONCE(!list_empty(&cfs_rq->throttled_limbo_list));
 	return 0;
 }
@@ -6006,8 +6005,6 @@ static bool throttle_cfs_rq(struct cfs_rq *cfs_rq)
 	 */
 	cfs_rq->throttled = 1;
 	WARN_ON_ONCE(cfs_rq->throttled_clock);
-	if (cfs_rq->nr_queued)
-		cfs_rq->throttled_clock = rq_clock(rq);
 	return true;
 }

@@ -6717,6 +6714,7 @@ static void task_throttle_setup_work(struct task_struct *p) {}
 static bool task_is_throttled(struct task_struct *p) { return false; }
 static void dequeue_throttled_task(struct task_struct *p, int flags) {}
 static bool enqueue_throttled_task(struct task_struct *p) { return false; }
+static void record_throttle_clock(struct cfs_rq *cfs_rq) {}

 static inline int cfs_rq_throttled(struct cfs_rq *cfs_rq)
 {
@@ -7036,6 +7034,7 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
 	int rq_h_nr_queued = rq->cfs.h_nr_queued;
 	bool task_sleep = flags & DEQUEUE_SLEEP;
 	bool task_delayed = flags & DEQUEUE_DELAYED;
+	bool task_throttled = flags & DEQUEUE_THROTTLE;
 	struct task_struct *p = NULL;
 	int h_nr_idle = 0;
 	int h_nr_queued = 0;
@@ -7069,6 +7068,9 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
 		if (cfs_rq_is_idle(cfs_rq))
 			h_nr_idle = h_nr_queued;

+		if (throttled_hierarchy(cfs_rq) && task_throttled)
+			record_throttle_clock(cfs_rq);
+
 		/* Don't dequeue parent if it has other entities besides us */
 		if (cfs_rq->load.weight) {
 			slice = cfs_rq_min_slice(cfs_rq);
@@ -7105,6 +7107,9 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)

 		if (cfs_rq_is_idle(cfs_rq))
 			h_nr_idle = h_nr_queued;
+
+		if (throttled_hierarchy(cfs_rq) && task_throttled)
+			record_throttle_clock(cfs_rq);
 	}

 	sub_nr_running(rq, h_nr_queued);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index fa78c0c3efe39..f2a07537d3c12 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2312,6 +2312,7 @@ extern const u32 sched_prio_to_wmult[40];
 #define DEQUEUE_SPECIAL		0x10
 #define DEQUEUE_MIGRATING	0x100 /* Matches ENQUEUE_MIGRATING */
 #define DEQUEUE_DELAYED		0x200 /* Matches ENQUEUE_DELAYED */
+#define DEQUEUE_THROTTLE	0x800

 #define ENQUEUE_WAKEUP		0x01
 #define ENQUEUE_RESTORE		0x02
--
2.39.5
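[Editor's illustration] A compact userspace C sketch of the accounting rule in the patch above, using hypothetical names and a plain counter standing in for rq_clock(): the clock is latched when the first task of the hierarchy is throttled and the delta is accumulated on unthrottle.

```c
#include <stdio.h>

static unsigned long long now;	/* stand-in for rq_clock(), advanced by the demo */

struct cfs_rq_acct {
	unsigned long long throttled_clock;	/* 0 means "not latched" */
	unsigned long long throttled_time;
};

/* Called when a task of this hierarchy is dequeued as throttled. */
static void record_throttle_clock(struct cfs_rq_acct *cfs_rq)
{
	if (!cfs_rq->throttled_clock)
		cfs_rq->throttled_clock = now;	/* first throttled task starts the clock */
}

/* Called on unthrottle: accumulate the delta and reset the latch. */
static void account_unthrottle(struct cfs_rq_acct *cfs_rq)
{
	if (cfs_rq->throttled_clock) {
		cfs_rq->throttled_time += now - cfs_rq->throttled_clock;
		cfs_rq->throttled_clock = 0;
	}
}

int main(void)
{
	struct cfs_rq_acct rq = { 0 };

	now = 100; record_throttle_clock(&rq);	/* first task throttled */
	now = 150; record_throttle_clock(&rq);	/* second task: no effect */
	now = 300; account_unthrottle(&rq);
	printf("throttled for %llu\n", rq.throttled_time);	/* prints 200 */
	return 0;
}
```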
From: Aaron Lu
Subject: [PATCH v2 5/5] sched/fair: Get rid of throttled_lb_pair()
Date: Wed, 18 Jun 2025 16:19:40 +0800
Message-Id: <20250618081940.621-6-ziqianlu@bytedance.com>
In-Reply-To: <20250618081940.621-1-ziqianlu@bytedance.com>

Now that throttled tasks are dequeued and cannot stay on the rq's
cfs_tasks list, load balance no longer needs to take special care of
these throttled tasks.

Suggested-by: K Prateek Nayak
Signed-off-by: Aaron Lu
Tested-by: K Prateek Nayak
---
 kernel/sched/fair.c | 33 +++------------------------------
 1 file changed, 3 insertions(+), 30 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index aab3ce4073582..d869c8b51c5a6 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5728,23 +5728,6 @@ static inline int throttled_hierarchy(struct cfs_rq *cfs_rq)
 	return cfs_bandwidth_used() && cfs_rq->throttle_count;
 }

-/*
- * Ensure that neither of the group entities corresponding to src_cpu or
- * dest_cpu are members of a throttled hierarchy when performing group
- * load-balance operations.
- */
-static inline int throttled_lb_pair(struct task_group *tg,
-				    int src_cpu, int dest_cpu)
-{
-	struct cfs_rq *src_cfs_rq, *dest_cfs_rq;
-
-	src_cfs_rq = tg->cfs_rq[src_cpu];
-	dest_cfs_rq = tg->cfs_rq[dest_cpu];
-
-	return throttled_hierarchy(src_cfs_rq) ||
-	       throttled_hierarchy(dest_cfs_rq);
-}
-
 static inline bool task_is_throttled(struct task_struct *p)
 {
 	return p->throttled;
@@ -6726,12 +6709,6 @@ static inline int throttled_hierarchy(struct cfs_rq *cfs_rq)
 	return 0;
 }

-static inline int throttled_lb_pair(struct task_group *tg,
-				    int src_cpu, int dest_cpu)
-{
-	return 0;
-}
-
 #ifdef CONFIG_FAIR_GROUP_SCHED
 void init_cfs_bandwidth(struct cfs_bandwidth *cfs_b, struct cfs_bandwidth *parent) {}
 static void init_cfs_rq_runtime(struct cfs_rq *cfs_rq) {}
@@ -9369,17 +9346,13 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
 	/*
 	 * We do not migrate tasks that are:
 	 * 1) delayed dequeued unless we migrate load, or
-	 * 2) throttled_lb_pair, or
-	 * 3) cannot be migrated to this CPU due to cpus_ptr, or
-	 * 4) running (obviously), or
-	 * 5) are cache-hot on their current CPU.
+	 * 2) cannot be migrated to this CPU due to cpus_ptr, or
+	 * 3) running (obviously), or
+	 * 4) are cache-hot on their current CPU.
 	 */
 	if ((p->se.sched_delayed) && (env->migration_type != migrate_load))
 		return 0;

-	if (throttled_lb_pair(task_group(p), env->src_cpu, env->dst_cpu))
-		return 0;
-
 	/*
 	 * We want to prioritize the migration of eligible tasks.
 	 * For ineligible tasks we soft-limit them and only allow
--
2.39.5