From nobody Tue Oct 7 03:50:38 2025
From: Aaron Lu
To: Valentin Schneider, Ben Segall, K Prateek Nayak, Peter Zijlstra,
	Chengming Zhou, Josh Don, Ingo Molnar, Vincent Guittot, Xi Wang
Cc: linux-kernel@vger.kernel.org, Juri Lelli, Dietmar Eggemann,
	Steven Rostedt, Mel Gorman, Chuyi Zhou, Jan Kiszka,
	Florian Bezdeka, Songtang Liu
Subject: [PATCH v3 1/5] sched/fair: Add related data structure for task based throttle
Date: Tue, 15 Jul 2025 15:16:54 +0800
Message-Id: <20250715071658.267-2-ziqianlu@bytedance.com>
X-Mailer: git-send-email 2.39.5
In-Reply-To: <20250715071658.267-1-ziqianlu@bytedance.com>
References: <20250715071658.267-1-ziqianlu@bytedance.com>

From: Valentin Schneider

Add the data structures needed by the new task based throttle
functionality.

Tested-by: K Prateek Nayak
Reviewed-by: Chengming Zhou
Signed-off-by: Valentin Schneider
Signed-off-by: Aaron Lu
Tested-by: Matteo Martelli
Tested-by: Valentin Schneider
---
 include/linux/sched.h |  5 +++++
 kernel/sched/core.c   |  3 +++
 kernel/sched/fair.c   | 13 +++++++++++++
 kernel/sched/sched.h  |  3 +++
 4 files changed, 24 insertions(+)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 55921385927d8..ec4b54540c244 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -883,6 +883,11 @@ struct task_struct {
 
 #ifdef CONFIG_CGROUP_SCHED
 	struct task_group		*sched_task_group;
+#ifdef CONFIG_CFS_BANDWIDTH
+	struct callback_head		sched_throttle_work;
+	struct list_head		throttle_node;
+	bool				throttled;
+#endif
 #endif
 
 
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 2f8caa9db78d5..410acc7435e86 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4446,6 +4446,9 @@ static void __sched_fork(unsigned long clone_flags, struct task_struct *p)
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
 	p->se.cfs_rq = NULL;
+#ifdef CONFIG_CFS_BANDWIDTH
+	init_cfs_throttle_work(p);
+#endif
 #endif
 
 #ifdef CONFIG_SCHEDSTATS
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 20a845697c1dc..c072e87c5bd9f 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5742,6 +5742,18 @@ static inline int throttled_lb_pair(struct task_group *tg,
 		throttled_hierarchy(dest_cfs_rq);
 }
 
+static void throttle_cfs_rq_work(struct callback_head *work)
+{
+}
+
+void init_cfs_throttle_work(struct task_struct *p)
+{
+	init_task_work(&p->sched_throttle_work, throttle_cfs_rq_work);
+	/* Protect against double add, see throttle_cfs_rq() and throttle_cfs_rq_work() */
+	p->sched_throttle_work.next = &p->sched_throttle_work;
+	INIT_LIST_HEAD(&p->throttle_node);
+}
+
 static int tg_unthrottle_up(struct task_group *tg, void *data)
 {
 	struct rq *rq = data;
@@ -6466,6 +6478,7 @@ static void init_cfs_rq_runtime(struct cfs_rq *cfs_rq)
 	cfs_rq->runtime_enabled = 0;
 	INIT_LIST_HEAD(&cfs_rq->throttled_list);
 	INIT_LIST_HEAD(&cfs_rq->throttled_csd_list);
+	INIT_LIST_HEAD(&cfs_rq->throttled_limbo_list);
 }
 
 void start_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 105190b180203..b0c9559992d8a 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -741,6 +741,7 @@ struct cfs_rq {
 	int			throttle_count;
 	struct list_head	throttled_list;
 	struct list_head	throttled_csd_list;
+	struct list_head	throttled_limbo_list;
 #endif /* CONFIG_CFS_BANDWIDTH */
 #endif /* CONFIG_FAIR_GROUP_SCHED */
 };
@@ -2640,6 +2641,8 @@ extern bool sched_rt_bandwidth_account(struct rt_rq *rt_rq);
 
 extern void init_dl_entity(struct sched_dl_entity *dl_se);
 
+extern void init_cfs_throttle_work(struct task_struct *p);
+
 #define BW_SHIFT		20
 #define BW_UNIT			(1 << BW_SHIFT)
 #define RATIO_SHIFT		8
-- 
2.39.5

From nobody Tue Oct 7 03:50:38 2025
From: Aaron Lu
To: Valentin Schneider, Ben Segall, K Prateek Nayak, Peter Zijlstra,
	Chengming Zhou, Josh Don, Ingo Molnar, Vincent Guittot, Xi Wang
Cc: linux-kernel@vger.kernel.org, Juri Lelli, Dietmar Eggemann,
	Steven Rostedt, Mel Gorman, Chuyi Zhou, Jan Kiszka,
	Florian Bezdeka, Songtang Liu
Subject: [PATCH v3 2/5] sched/fair: Implement throttle task work and related helpers
Date: Tue, 15 Jul 2025 15:16:55 +0800
Message-Id: <20250715071658.267-3-ziqianlu@bytedance.com>
X-Mailer: git-send-email 2.39.5
In-Reply-To: <20250715071658.267-1-ziqianlu@bytedance.com>
References: <20250715071658.267-1-ziqianlu@bytedance.com>

From: Valentin Schneider

Implement the throttle_cfs_rq_work() task work, which gets executed on
the task's ret2user path, where the task is dequeued and marked as
throttled.
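For context, this is the generic task_work pattern the patch builds on --
a minimal kernel-style sketch, not the patch's code. struct my_data,
my_ret2user_work_fn() and queue_my_work() are made-up names; only
init_task_work(), task_work_add() and TWA_RESUME are the real API. The
embedded callback_head plus container_of() mirrors what the series does
with sched_throttle_work in task_struct:

	#include <linux/task_work.h>
	#include <linux/sched.h>

	struct my_data {
		/* embedded callback, like sched_throttle_work in task_struct */
		struct callback_head my_ret2user_work;
		int payload;
	};

	/* Runs in the task's own context just before it re-enters user mode. */
	static void my_ret2user_work_fn(struct callback_head *work)
	{
		struct my_data *d = container_of(work, struct my_data, my_ret2user_work);

		pr_info("ret2user work: payload=%d\n", d->payload);
	}

	static int queue_my_work(struct task_struct *p, struct my_data *d)
	{
		init_task_work(&d->my_ret2user_work, my_ret2user_work_fn);
		/* TWA_RESUME: no signal; run at the next user-mode transition */
		return task_work_add(p, &d->my_ret2user_work, TWA_RESUME);
	}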
Tested-by: K Prateek Nayak
Reviewed-by: Chengming Zhou
Signed-off-by: Valentin Schneider
Signed-off-by: Aaron Lu
Tested-by: Matteo Martelli
Tested-by: Valentin Schneider
---
 kernel/sched/fair.c | 65 +++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 65 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c072e87c5bd9f..54c2a4df6a5d1 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5742,8 +5742,51 @@ static inline int throttled_lb_pair(struct task_group *tg,
 		throttled_hierarchy(dest_cfs_rq);
 }
 
+static inline bool task_is_throttled(struct task_struct *p)
+{
+	return p->throttled;
+}
+
+static bool dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags);
 static void throttle_cfs_rq_work(struct callback_head *work)
 {
+	struct task_struct *p = container_of(work, struct task_struct, sched_throttle_work);
+	struct sched_entity *se;
+	struct cfs_rq *cfs_rq;
+	struct rq *rq;
+
+	WARN_ON_ONCE(p != current);
+	p->sched_throttle_work.next = &p->sched_throttle_work;
+
+	/*
+	 * If task is exiting, then there won't be a return to userspace, so we
+	 * don't have to bother with any of this.
+	 */
+	if ((p->flags & PF_EXITING))
+		return;
+
+	scoped_guard(task_rq_lock, p) {
+		se = &p->se;
+		cfs_rq = cfs_rq_of(se);
+
+		/* Raced, forget */
+		if (p->sched_class != &fair_sched_class)
+			return;
+
+		/*
+		 * If not in limbo, then either replenish has happened or this
+		 * task got migrated out of the throttled cfs_rq, move along.
+		 */
+		if (!cfs_rq->throttle_count)
+			return;
+		rq = scope.rq;
+		update_rq_clock(rq);
+		WARN_ON_ONCE(p->throttled || !list_empty(&p->throttle_node));
+		dequeue_task_fair(rq, p, DEQUEUE_SLEEP | DEQUEUE_SPECIAL);
+		list_add(&p->throttle_node, &cfs_rq->throttled_limbo_list);
+		p->throttled = true;
+		resched_curr(rq);
+	}
 }
 
 void init_cfs_throttle_work(struct task_struct *p)
@@ -5783,6 +5826,26 @@ static int tg_unthrottle_up(struct task_group *tg, void *data)
 	return 0;
 }
 
+static inline bool task_has_throttle_work(struct task_struct *p)
+{
+	return p->sched_throttle_work.next != &p->sched_throttle_work;
+}
+
+static inline void task_throttle_setup_work(struct task_struct *p)
+{
+	if (task_has_throttle_work(p))
+		return;
+
+	/*
+	 * Kthreads and exiting tasks don't return to userspace, so adding the
+	 * work is pointless
+	 */
+	if ((p->flags & (PF_EXITING | PF_KTHREAD)))
+		return;
+
+	task_work_add(p, &p->sched_throttle_work, TWA_RESUME);
+}
+
 static int tg_throttle_down(struct task_group *tg, void *data)
 {
 	struct rq *rq = data;
@@ -6646,6 +6709,8 @@ static bool check_cfs_rq_runtime(struct cfs_rq *cfs_rq) { return false; }
 static void check_enqueue_throttle(struct cfs_rq *cfs_rq) {}
 static inline void sync_throttle(struct task_group *tg, int cpu) {}
 static __always_inline void return_cfs_rq_runtime(struct cfs_rq *cfs_rq) {}
+static void task_throttle_setup_work(struct task_struct *p) {}
+static bool task_is_throttled(struct task_struct *p) { return false; }
 
 static inline int cfs_rq_throttled(struct cfs_rq *cfs_rq)
 {
-- 
2.39.5

From nobody Tue Oct 7 03:50:38 2025
From: Aaron Lu
To: Valentin Schneider, Ben Segall, K Prateek Nayak, Peter Zijlstra,
	Chengming Zhou, Josh Don, Ingo Molnar, Vincent Guittot, Xi Wang
Cc: linux-kernel@vger.kernel.org, Juri Lelli, Dietmar Eggemann,
	Steven Rostedt, Mel Gorman, Chuyi Zhou, Jan Kiszka,
	Florian Bezdeka, Songtang Liu
Subject: [PATCH v3 3/5] sched/fair: Switch to task based throttle model
Date: Tue, 15 Jul 2025 15:16:56 +0800
Message-Id: <20250715071658.267-4-ziqianlu@bytedance.com>
X-Mailer: git-send-email 2.39.5
In-Reply-To: <20250715071658.267-1-ziqianlu@bytedance.com>
References: <20250715071658.267-1-ziqianlu@bytedance.com>

From: Valentin Schneider

In the current throttle model, when a cfs_rq is throttled, its entity
is dequeued from the cpu's rq, making the tasks attached to it unable
to run, thus achieving the throttle target.

This has a drawback though: assume a task is a reader of a percpu_rwsem
and is waiting. When it gets woken, it cannot run until its task
group's next period comes, which can be a relatively long time. The
waiting writer has to wait longer because of this, and it also makes
further readers build up and eventually trigger a task hung.

To improve this situation, change the throttle model to task based,
i.e. when a cfs_rq is throttled, record its throttled status but do not
remove it from the cpu's rq. Instead, for tasks that belong to this
cfs_rq, when they get picked, add a task work to them so that when they
return to user space, they can be dequeued there. In this way,
throttled tasks will not hold any kernel resources. On unthrottle,
enqueue those tasks back so they can continue to run.

A throttled cfs_rq's PELT clock is handled differently now: previously
the cfs_rq's PELT clock was stopped once it entered the throttled
state, but since tasks (in kernel mode) can now continue to run, change
the behaviour to stop the PELT clock only when the throttled cfs_rq has
no tasks left.
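The limbo-list mechanics described above are easy to model outside the
kernel. A toy user-space sketch, illustrative only: plain singly-linked
lists stand in for the real rq/cfs_rq machinery, and an explicit
toy_throttle() call stands in for the deferred return-to-user dequeue;
all names here are made up:

	#include <stdio.h>
	#include <stdbool.h>

	struct toy_task {
		const char *name;
		bool throttled;
		struct toy_task *next;	/* run-queue or limbo-list linkage */
	};

	static struct toy_task *run_queue;	/* runnable tasks */
	static struct toy_task *limbo_list;	/* throttled tasks, off the rq */

	static void push(struct toy_task **list, struct toy_task *t)
	{
		t->next = *list;
		*list = t;
	}

	static struct toy_task *pop(struct toy_task **list)
	{
		struct toy_task *t = *list;

		if (t)
			*list = t->next;
		return t;
	}

	/* Throttle: tasks leave the run queue for the limbo list at a safe
	 * point (the kernel defers this to each task's return to user). */
	static void toy_throttle(void)
	{
		struct toy_task *t;

		while ((t = pop(&run_queue))) {
			t->throttled = true;
			push(&limbo_list, t);
		}
	}

	/* Unthrottle: re-enqueue every limbo task so it can run again. */
	static void toy_unthrottle(void)
	{
		struct toy_task *t;

		while ((t = pop(&limbo_list))) {
			t->throttled = false;
			push(&run_queue, t);
		}
	}

	int main(void)
	{
		struct toy_task a = { .name = "a" }, b = { .name = "b" };

		push(&run_queue, &a);
		push(&run_queue, &b);
		toy_throttle();
		printf("a throttled: %d\n", a.throttled);	/* 1 */
		toy_unthrottle();
		printf("a throttled: %d\n", a.throttled);	/* 0 */
		return 0;
	}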
Tested-by: K Prateek Nayak
Suggested-by: Chengming Zhou # tag on pick
Signed-off-by: Valentin Schneider
Signed-off-by: Aaron Lu
Tested-by: Chen Yu
Tested-by: Matteo Martelli
Tested-by: Valentin Schneider
---
 kernel/sched/fair.c  | 336 ++++++++++++++++++++++---------------------
 kernel/sched/pelt.h  |   4 +-
 kernel/sched/sched.h |   3 +-
 3 files changed, 176 insertions(+), 167 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 54c2a4df6a5d1..0eeea7f2e693d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5285,18 +5285,23 @@ enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 
 	if (cfs_rq->nr_queued == 1) {
 		check_enqueue_throttle(cfs_rq);
-		if (!throttled_hierarchy(cfs_rq)) {
-			list_add_leaf_cfs_rq(cfs_rq);
-		} else {
+		list_add_leaf_cfs_rq(cfs_rq);
 #ifdef CONFIG_CFS_BANDWIDTH
+		if (throttled_hierarchy(cfs_rq)) {
 			struct rq *rq = rq_of(cfs_rq);
 
 			if (cfs_rq_throttled(cfs_rq) && !cfs_rq->throttled_clock)
 				cfs_rq->throttled_clock = rq_clock(rq);
 			if (!cfs_rq->throttled_clock_self)
 				cfs_rq->throttled_clock_self = rq_clock(rq);
-#endif
+
+			if (cfs_rq->pelt_clock_throttled) {
+				cfs_rq->throttled_clock_pelt_time += rq_clock_pelt(rq) -
+					cfs_rq->throttled_clock_pelt;
+				cfs_rq->pelt_clock_throttled = 0;
+			}
 		}
+#endif
 	}
 }
@@ -5335,8 +5340,6 @@ static void set_delayed(struct sched_entity *se)
 		struct cfs_rq *cfs_rq = cfs_rq_of(se);
 
 		cfs_rq->h_nr_runnable--;
-		if (cfs_rq_throttled(cfs_rq))
-			break;
 	}
 }
@@ -5357,8 +5360,6 @@ static void clear_delayed(struct sched_entity *se)
 		struct cfs_rq *cfs_rq = cfs_rq_of(se);
 
 		cfs_rq->h_nr_runnable++;
-		if (cfs_rq_throttled(cfs_rq))
-			break;
 	}
 }
@@ -5444,8 +5445,18 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 	if (flags & DEQUEUE_DELAYED)
 		finish_delayed_dequeue_entity(se);
 
-	if (cfs_rq->nr_queued == 0)
+	if (cfs_rq->nr_queued == 0) {
 		update_idle_cfs_rq_clock_pelt(cfs_rq);
+#ifdef CONFIG_CFS_BANDWIDTH
+		if (throttled_hierarchy(cfs_rq)) {
+			struct rq *rq = rq_of(cfs_rq);
+
+			list_del_leaf_cfs_rq(cfs_rq);
+			cfs_rq->throttled_clock_pelt = rq_clock_pelt(rq);
+			cfs_rq->pelt_clock_throttled = 1;
+		}
+#endif
+	}
 
 	return true;
 }
@@ -5784,6 +5795,10 @@ static void throttle_cfs_rq_work(struct callback_head *work)
 		WARN_ON_ONCE(p->throttled || !list_empty(&p->throttle_node));
 		dequeue_task_fair(rq, p, DEQUEUE_SLEEP | DEQUEUE_SPECIAL);
 		list_add(&p->throttle_node, &cfs_rq->throttled_limbo_list);
+		/*
+		 * Must not set throttled before dequeue or dequeue will
+		 * mistakenly regard this task as an already throttled one.
+		 */
 		p->throttled = true;
 		resched_curr(rq);
 	}
@@ -5797,32 +5812,119 @@ void init_cfs_throttle_work(struct task_struct *p)
 	INIT_LIST_HEAD(&p->throttle_node);
 }
 
+/*
+ * Task is throttled and someone wants to dequeue it again:
+ * it could be sched/core when core needs to do things like
+ * task affinity change, task group change, task sched class
+ * change etc. and in these cases, DEQUEUE_SLEEP is not set;
+ * or the task is blocked after throttled due to freezer etc.
+ * and in these cases, DEQUEUE_SLEEP is set.
+ */
+static void detach_task_cfs_rq(struct task_struct *p);
+static void dequeue_throttled_task(struct task_struct *p, int flags)
+{
+	WARN_ON_ONCE(p->se.on_rq);
+	list_del_init(&p->throttle_node);
+
+	/* task blocked after throttled */
+	if (flags & DEQUEUE_SLEEP) {
+		p->throttled = false;
+		return;
+	}
+
+	/*
+	 * task is migrating off its old cfs_rq, detach
+	 * the task's load from its old cfs_rq.
+	 */
+	if (task_on_rq_migrating(p))
+		detach_task_cfs_rq(p);
+}
+
+static bool enqueue_throttled_task(struct task_struct *p)
+{
+	struct cfs_rq *cfs_rq = cfs_rq_of(&p->se);
+
+	/*
+	 * If the throttled task is enqueued to a throttled cfs_rq,
+	 * take the fast path by directly putting the task on the
+	 * target cfs_rq's limbo list, except when p is current because
+	 * the following race can cause p's group_node to be left in
+	 * rq's cfs_tasks list when it's throttled:
+	 *
+	 * cpuX                         cpuY
+	 * taskA ret2user
+	 * throttle_cfs_rq_work()       sched_move_task(taskA)
+	 *   task_rq_lock acquired
+	 *   dequeue_task_fair(taskA)
+	 *   task_rq_lock released
+	 *                                task_rq_lock acquired
+	 *                                task_current_donor(taskA) == true
+	 *                                task_on_rq_queued(taskA) == true
+	 *                                dequeue_task(taskA)
+	 *                                put_prev_task(taskA)
+	 *                                sched_change_group()
+	 *                                enqueue_task(taskA) -> taskA's new cfs_rq
+	 *                                                       is throttled, go
+	 *                                                       fast path and skip
+	 *                                                       actual enqueue
+	 *                                set_next_task(taskA)
+	 *                                  __set_next_task_fair(taskA)
+	 *                                    list_move(&se->group_node, &rq->cfs_tasks); // bug
+	 * schedule()
+	 *
+	 * And in the above race case, the task's current cfs_rq is in the same
	 * rq as its previous cfs_rq because sched_move_task() doesn't migrate
+	 * task so we can use its current cfs_rq to derive rq and test if the
+	 * task is current.
+	 */
+	if (throttled_hierarchy(cfs_rq) &&
+	    !task_current_donor(rq_of(cfs_rq), p)) {
+		list_add(&p->throttle_node, &cfs_rq->throttled_limbo_list);
+		return true;
+	}
+
+	/* we can't take the fast path, do an actual enqueue */
+	p->throttled = false;
+	return false;
+}
+
+static void enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags);
 static int tg_unthrottle_up(struct task_group *tg, void *data)
 {
 	struct rq *rq = data;
 	struct cfs_rq *cfs_rq = tg->cfs_rq[cpu_of(rq)];
+	struct task_struct *p, *tmp;
+
+	if (--cfs_rq->throttle_count)
+		return 0;
 
-	cfs_rq->throttle_count--;
-	if (!cfs_rq->throttle_count) {
+	if (cfs_rq->pelt_clock_throttled) {
 		cfs_rq->throttled_clock_pelt_time += rq_clock_pelt(rq) -
 			cfs_rq->throttled_clock_pelt;
+		cfs_rq->pelt_clock_throttled = 0;
+	}
 
-		/* Add cfs_rq with load or one or more already running entities to the list */
-		if (!cfs_rq_is_decayed(cfs_rq))
-			list_add_leaf_cfs_rq(cfs_rq);
+	if (cfs_rq->throttled_clock_self) {
+		u64 delta = rq_clock(rq) - cfs_rq->throttled_clock_self;
 
-		if (cfs_rq->throttled_clock_self) {
-			u64 delta = rq_clock(rq) - cfs_rq->throttled_clock_self;
+		cfs_rq->throttled_clock_self = 0;
 
-			cfs_rq->throttled_clock_self = 0;
+		if (WARN_ON_ONCE((s64)delta < 0))
+			delta = 0;
 
-			if (WARN_ON_ONCE((s64)delta < 0))
-				delta = 0;
+		cfs_rq->throttled_clock_self_time += delta;
+	}
 
-			cfs_rq->throttled_clock_self_time += delta;
-		}
+	/* Re-enqueue the tasks that have been throttled at this level. */
+	list_for_each_entry_safe(p, tmp, &cfs_rq->throttled_limbo_list, throttle_node) {
+		list_del_init(&p->throttle_node);
+		enqueue_task_fair(rq_of(cfs_rq), p, ENQUEUE_WAKEUP);
 	}
 
+	/* Add cfs_rq with load or one or more already running entities to the list */
+	if (!cfs_rq_is_decayed(cfs_rq))
+		list_add_leaf_cfs_rq(cfs_rq);
+
 	return 0;
 }
@@ -5851,17 +5953,25 @@ static int tg_throttle_down(struct task_group *tg, void *data)
 	struct rq *rq = data;
 	struct cfs_rq *cfs_rq = tg->cfs_rq[cpu_of(rq)];
 
+	if (cfs_rq->throttle_count++)
+		return 0;
+
 	/* group is entering throttled state, stop time */
-	if (!cfs_rq->throttle_count) {
-		cfs_rq->throttled_clock_pelt = rq_clock_pelt(rq);
+	WARN_ON_ONCE(cfs_rq->throttled_clock_self);
+	if (cfs_rq->nr_queued)
+		cfs_rq->throttled_clock_self = rq_clock(rq);
+	else {
+		/*
+		 * For cfs_rqs that still have entities enqueued, PELT clock
+		 * stop happens at dequeue time when all entities are dequeued.
+		 */
 		list_del_leaf_cfs_rq(cfs_rq);
-
-		WARN_ON_ONCE(cfs_rq->throttled_clock_self);
-		if (cfs_rq->nr_queued)
-			cfs_rq->throttled_clock_self = rq_clock(rq);
+		cfs_rq->throttled_clock_pelt = rq_clock_pelt(rq);
+		cfs_rq->pelt_clock_throttled = 1;
 	}
-	cfs_rq->throttle_count++;
 
+	WARN_ON_ONCE(!list_empty(&cfs_rq->throttled_limbo_list));
 	return 0;
 }
@@ -5869,8 +5979,7 @@ static bool throttle_cfs_rq(struct cfs_rq *cfs_rq)
 {
 	struct rq *rq = rq_of(cfs_rq);
 	struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg);
-	struct sched_entity *se;
-	long queued_delta, runnable_delta, idle_delta, dequeue = 1;
+	int dequeue = 1;
 
 	raw_spin_lock(&cfs_b->lock);
 	/* This will start the period timer if necessary */
@@ -5893,68 +6002,11 @@ static bool throttle_cfs_rq(struct cfs_rq *cfs_rq)
 	if (!dequeue)
 		return false;  /* Throttle no longer required. */
 
-	se = cfs_rq->tg->se[cpu_of(rq_of(cfs_rq))];
-
 	/* freeze hierarchy runnable averages while throttled */
 	rcu_read_lock();
 	walk_tg_tree_from(cfs_rq->tg, tg_throttle_down, tg_nop, (void *)rq);
 	rcu_read_unlock();
 
-	queued_delta = cfs_rq->h_nr_queued;
-	runnable_delta = cfs_rq->h_nr_runnable;
-	idle_delta = cfs_rq->h_nr_idle;
-	for_each_sched_entity(se) {
-		struct cfs_rq *qcfs_rq = cfs_rq_of(se);
-		int flags;
-
-		/* throttled entity or throttle-on-deactivate */
-		if (!se->on_rq)
-			goto done;
-
-		/*
-		 * Abuse SPECIAL to avoid delayed dequeue in this instance.
-		 * This avoids teaching dequeue_entities() about throttled
-		 * entities and keeps things relatively simple.
-		 */
-		flags = DEQUEUE_SLEEP | DEQUEUE_SPECIAL;
-		if (se->sched_delayed)
-			flags |= DEQUEUE_DELAYED;
-		dequeue_entity(qcfs_rq, se, flags);
-
-		if (cfs_rq_is_idle(group_cfs_rq(se)))
-			idle_delta = cfs_rq->h_nr_queued;
-
-		qcfs_rq->h_nr_queued -= queued_delta;
-		qcfs_rq->h_nr_runnable -= runnable_delta;
-		qcfs_rq->h_nr_idle -= idle_delta;
-
-		if (qcfs_rq->load.weight) {
-			/* Avoid re-evaluating load for this entity: */
-			se = parent_entity(se);
-			break;
-		}
-	}
-
-	for_each_sched_entity(se) {
-		struct cfs_rq *qcfs_rq = cfs_rq_of(se);
-		/* throttled entity or throttle-on-deactivate */
-		if (!se->on_rq)
-			goto done;
-
-		update_load_avg(qcfs_rq, se, 0);
-		se_update_runnable(se);
-
-		if (cfs_rq_is_idle(group_cfs_rq(se)))
-			idle_delta = cfs_rq->h_nr_queued;
-
-		qcfs_rq->h_nr_queued -= queued_delta;
-		qcfs_rq->h_nr_runnable -= runnable_delta;
-		qcfs_rq->h_nr_idle -= idle_delta;
-	}
-
-	/* At this point se is NULL and we are at root level */
-	sub_nr_running(rq, queued_delta);
-done:
 	/*
 	 * Note: distribution will already see us throttled via the
 	 * throttled-list.  rq->lock protects completion.
@@ -5970,9 +6022,20 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
 {
 	struct rq *rq = rq_of(cfs_rq);
 	struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg);
-	struct sched_entity *se;
-	long queued_delta, runnable_delta, idle_delta;
-	long rq_h_nr_queued = rq->cfs.h_nr_queued;
+	struct sched_entity *se = cfs_rq->tg->se[cpu_of(rq)];
+
+	/*
+	 * It's possible we are called with !runtime_remaining due to things
+	 * like user changed quota setting(see tg_set_cfs_bandwidth()) or async
+	 * unthrottled us with a positive runtime_remaining but other still
+	 * running entities consumed those runtime before we reached here.
+	 *
+	 * Anyway, we can't unthrottle this cfs_rq without any runtime remaining
+	 * because any enqueue in tg_unthrottle_up() will immediately trigger a
+	 * throttle, which is not supposed to happen on unthrottle path.
+	 */
+	if (cfs_rq->runtime_enabled && cfs_rq->runtime_remaining <= 0)
+		return;
 
 	se = cfs_rq->tg->se[cpu_of(rq)];
 
@@ -6002,62 +6065,8 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
 			if (list_add_leaf_cfs_rq(cfs_rq_of(se)))
 				break;
 		}
-		goto unthrottle_throttle;
 	}
 
-	queued_delta = cfs_rq->h_nr_queued;
-	runnable_delta = cfs_rq->h_nr_runnable;
-	idle_delta = cfs_rq->h_nr_idle;
-	for_each_sched_entity(se) {
-		struct cfs_rq *qcfs_rq = cfs_rq_of(se);
-
-		/* Handle any unfinished DELAY_DEQUEUE business first. */
-		if (se->sched_delayed) {
-			int flags = DEQUEUE_SLEEP | DEQUEUE_DELAYED;
-
-			dequeue_entity(qcfs_rq, se, flags);
-		} else if (se->on_rq)
-			break;
-		enqueue_entity(qcfs_rq, se, ENQUEUE_WAKEUP);
-
-		if (cfs_rq_is_idle(group_cfs_rq(se)))
-			idle_delta = cfs_rq->h_nr_queued;
-
-		qcfs_rq->h_nr_queued += queued_delta;
-		qcfs_rq->h_nr_runnable += runnable_delta;
-		qcfs_rq->h_nr_idle += idle_delta;
-
-		/* end evaluation on encountering a throttled cfs_rq */
-		if (cfs_rq_throttled(qcfs_rq))
-			goto unthrottle_throttle;
-	}
-
-	for_each_sched_entity(se) {
-		struct cfs_rq *qcfs_rq = cfs_rq_of(se);
-
-		update_load_avg(qcfs_rq, se, UPDATE_TG);
-		se_update_runnable(se);
-
-		if (cfs_rq_is_idle(group_cfs_rq(se)))
-			idle_delta = cfs_rq->h_nr_queued;
-
-		qcfs_rq->h_nr_queued += queued_delta;
-		qcfs_rq->h_nr_runnable += runnable_delta;
-		qcfs_rq->h_nr_idle += idle_delta;
-
-		/* end evaluation on encountering a throttled cfs_rq */
-		if (cfs_rq_throttled(qcfs_rq))
-			goto unthrottle_throttle;
-	}
-
-	/* Start the fair server if un-throttling resulted in new runnable tasks */
-	if (!rq_h_nr_queued && rq->cfs.h_nr_queued)
-		dl_server_start(&rq->fair_server);
-
-	/* At this point se is NULL and we are at root level */
-	add_nr_running(rq, queued_delta);
-
-unthrottle_throttle:
 	assert_list_leaf_cfs_rq(rq);
 
 	/* Determine whether we need to wake up potentially idle CPU: */
@@ -6711,6 +6720,8 @@ static inline void sync_throttle(struct task_group *tg, int cpu) {}
 static __always_inline void return_cfs_rq_runtime(struct cfs_rq *cfs_rq) {}
 static void task_throttle_setup_work(struct task_struct *p) {}
 static bool task_is_throttled(struct task_struct *p) { return false; }
+static void dequeue_throttled_task(struct task_struct *p, int flags) {}
+static bool enqueue_throttled_task(struct task_struct *p) { return false; }
 
 static inline int cfs_rq_throttled(struct cfs_rq *cfs_rq)
 {
@@ -6903,6 +6914,9 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 	int rq_h_nr_queued = rq->cfs.h_nr_queued;
 	u64 slice = 0;
 
+	if (unlikely(task_is_throttled(p) && enqueue_throttled_task(p)))
+		return;
+
 	/*
 	 * The code below (indirectly) updates schedutil which looks at
 	 * the cfs_rq utilization to select a frequency.
@@ -6955,10 +6969,6 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 		if (cfs_rq_is_idle(cfs_rq))
 			h_nr_idle = 1;
 
-		/* end evaluation on encountering a throttled cfs_rq */
-		if (cfs_rq_throttled(cfs_rq))
-			goto enqueue_throttle;
-
 		flags = ENQUEUE_WAKEUP;
 	}
@@ -6980,10 +6990,6 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 
 		if (cfs_rq_is_idle(cfs_rq))
 			h_nr_idle = 1;
-
-		/* end evaluation on encountering a throttled cfs_rq */
-		if (cfs_rq_throttled(cfs_rq))
-			goto enqueue_throttle;
 	}
 
 	if (!rq_h_nr_queued && rq->cfs.h_nr_queued) {
@@ -7013,7 +7019,6 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 	if (!task_new)
 		check_update_overutilized_status(rq);
 
-enqueue_throttle:
 	assert_list_leaf_cfs_rq(rq);
 
 	hrtick_update(rq);
@@ -7068,10 +7073,6 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
 		if (cfs_rq_is_idle(cfs_rq))
 			h_nr_idle = h_nr_queued;
 
-		/* end evaluation on encountering a throttled cfs_rq */
-		if (cfs_rq_throttled(cfs_rq))
-			return 0;
-
 		/* Don't dequeue parent if it has other entities besides us */
 		if (cfs_rq->load.weight) {
 			slice = cfs_rq_min_slice(cfs_rq);
@@ -7108,10 +7109,6 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
 
 		if (cfs_rq_is_idle(cfs_rq))
 			h_nr_idle = h_nr_queued;
-
-		/* end evaluation on encountering a throttled cfs_rq */
-		if (cfs_rq_throttled(cfs_rq))
-			return 0;
 	}
 
 	sub_nr_running(rq, h_nr_queued);
@@ -7145,6 +7142,11 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
 */
 static bool dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 {
+	if (unlikely(task_is_throttled(p))) {
+		dequeue_throttled_task(p, flags);
+		return true;
+	}
+
 	if (!p->se.sched_delayed)
 		util_est_dequeue(&rq->cfs, p);
 
@@ -8813,19 +8815,22 @@ static struct task_struct *pick_task_fair(struct rq *rq)
 {
 	struct sched_entity *se;
 	struct cfs_rq *cfs_rq;
+	struct task_struct *p;
+	bool throttled;
 
again:
 	cfs_rq = &rq->cfs;
 	if (!cfs_rq->nr_queued)
 		return NULL;
 
+	throttled = false;
+
 	do {
 		/* Might not have done put_prev_entity() */
 		if (cfs_rq->curr && cfs_rq->curr->on_rq)
 			update_curr(cfs_rq);
 
-		if (unlikely(check_cfs_rq_runtime(cfs_rq)))
-			goto again;
+		throttled |= check_cfs_rq_runtime(cfs_rq);
 
 		se = pick_next_entity(rq, cfs_rq);
 		if (!se)
@@ -8833,7 +8838,10 @@ static struct task_struct *pick_task_fair(struct rq *rq)
 		cfs_rq = group_cfs_rq(se);
 	} while (cfs_rq);
 
-	return task_of(se);
+	p = task_of(se);
+	if (unlikely(throttled))
+		task_throttle_setup_work(p);
+	return p;
 }
 
 static void __set_next_task_fair(struct rq *rq, struct task_struct *p, bool first);
diff --git a/kernel/sched/pelt.h b/kernel/sched/pelt.h
index 62c3fa543c0f2..f921302dc40fb 100644
--- a/kernel/sched/pelt.h
+++ b/kernel/sched/pelt.h
@@ -162,7 +162,7 @@ static inline void update_idle_cfs_rq_clock_pelt(struct cfs_rq *cfs_rq)
 {
 	u64 throttled;
 
-	if (unlikely(cfs_rq->throttle_count))
+	if (unlikely(cfs_rq->pelt_clock_throttled))
 		throttled = U64_MAX;
 	else
 		throttled = cfs_rq->throttled_clock_pelt_time;
@@ -173,7 +173,7 @@ static inline void update_idle_cfs_rq_clock_pelt(struct cfs_rq *cfs_rq)
 /* rq->task_clock normalized against any time this cfs_rq has spent throttled */
 static inline u64 cfs_rq_clock_pelt(struct cfs_rq *cfs_rq)
 {
-	if (unlikely(cfs_rq->throttle_count))
+	if (unlikely(cfs_rq->pelt_clock_throttled))
 		return cfs_rq->throttled_clock_pelt - cfs_rq->throttled_clock_pelt_time;
 
 	return rq_clock_pelt(rq_of(cfs_rq)) - cfs_rq->throttled_clock_pelt_time;
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index b0c9559992d8a..fc697d4bf6685 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -737,7 +737,8 @@ struct cfs_rq {
 	u64			throttled_clock_pelt_time;
 	u64			throttled_clock_self;
 	u64			throttled_clock_self_time;
-	int			throttled;
+	int			throttled:1;
+	int			pelt_clock_throttled:1;
 	int			throttle_count;
 	struct list_head	throttled_list;
 	struct list_head	throttled_csd_list;
-- 
2.39.5

From nobody Tue Oct 7 03:50:38 2025
From: Aaron Lu
To: Valentin Schneider, Ben Segall, K Prateek Nayak, Peter Zijlstra,
	Chengming Zhou, Josh Don, Ingo Molnar, Vincent Guittot, Xi Wang
Cc: linux-kernel@vger.kernel.org, Juri Lelli, Dietmar Eggemann,
	Steven Rostedt, Mel Gorman, Chuyi Zhou, Jan Kiszka,
	Florian Bezdeka, Songtang Liu
Subject: [PATCH v3 4/5] sched/fair: Task based throttle time accounting
Date: Tue, 15 Jul 2025 15:16:57 +0800
Message-Id: <20250715071658.267-5-ziqianlu@bytedance.com>
X-Mailer: git-send-email 2.39.5
In-Reply-To: <20250715071658.267-1-ziqianlu@bytedance.com>
References: <20250715071658.267-1-ziqianlu@bytedance.com>

With the task based throttle model, the previous way of checking a
cfs_rq's nr_queued to decide whether throttled time should be accounted
doesn't work as expected: e.g. when a cfs_rq that has a single task is
throttled, that task could later block in kernel mode instead of being
dequeued onto the limbo list, and accounting this time as throttled is
not accurate.

Rework throttle time accounting for a cfs_rq as follows:

- start accounting when the first task gets throttled in its hierarchy;
- stop accounting on unthrottle.
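The accounting rule itself can be sanity-checked in isolation. A toy
sketch with made-up clock values and made-up names, no relation to the
kernel's rq clocks: the window is armed by the first throttle-dequeue
and closed at unthrottle:

	#include <stdio.h>

	typedef unsigned long long u64;

	static u64 throttled_clock;	/* 0: not accounting */
	static u64 throttled_time;

	/* Called when a task in the throttled hierarchy gets dequeued;
	 * only the first one starts the clock. */
	static void toy_task_throttled(u64 now)
	{
		if (!throttled_clock)
			throttled_clock = now;
	}

	/* Called on unthrottle: close the accounting window. */
	static void toy_unthrottled(u64 now)
	{
		if (throttled_clock) {
			throttled_time += now - throttled_clock;
			throttled_clock = 0;
		}
	}

	int main(void)
	{
		toy_task_throttled(1000);
		toy_task_throttled(2500);	/* ignored, clock already running */
		toy_unthrottled(5000);
		printf("throttled time: %llu\n", throttled_time);	/* 4000 */
		return 0;
	}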
Tested-by: K Prateek Nayak
Suggested-by: Chengming Zhou # accounting mechanism
Co-developed-by: K Prateek Nayak # simplify implementation
Signed-off-by: K Prateek Nayak
Signed-off-by: Aaron Lu
Tested-by: Matteo Martelli
Tested-by: Valentin Schneider
---
 kernel/sched/fair.c  | 56 ++++++++++++++++++++++++--------------------
 kernel/sched/sched.h |  1 +
 2 files changed, 32 insertions(+), 25 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 0eeea7f2e693d..6f534fbe89bcf 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5287,19 +5287,12 @@ enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 		check_enqueue_throttle(cfs_rq);
 		list_add_leaf_cfs_rq(cfs_rq);
 #ifdef CONFIG_CFS_BANDWIDTH
-		if (throttled_hierarchy(cfs_rq)) {
+		if (cfs_rq->pelt_clock_throttled) {
 			struct rq *rq = rq_of(cfs_rq);
 
-			if (cfs_rq_throttled(cfs_rq) && !cfs_rq->throttled_clock)
-				cfs_rq->throttled_clock = rq_clock(rq);
-			if (!cfs_rq->throttled_clock_self)
-				cfs_rq->throttled_clock_self = rq_clock(rq);
-
-			if (cfs_rq->pelt_clock_throttled) {
-				cfs_rq->throttled_clock_pelt_time += rq_clock_pelt(rq) -
-					cfs_rq->throttled_clock_pelt;
-				cfs_rq->pelt_clock_throttled = 0;
-			}
+			cfs_rq->throttled_clock_pelt_time += rq_clock_pelt(rq) -
+				cfs_rq->throttled_clock_pelt;
+			cfs_rq->pelt_clock_throttled = 0;
 		}
 #endif
 	}
@@ -5387,7 +5380,7 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 	 * DELAY_DEQUEUE relies on spurious wakeups, special task
 	 * states must not suffer spurious wakeups, excempt them.
 	 */
-	if (flags & DEQUEUE_SPECIAL)
+	if (flags & (DEQUEUE_SPECIAL | DEQUEUE_THROTTLE))
 		delay = false;
 
 	WARN_ON_ONCE(delay && se->sched_delayed);
@@ -5793,7 +5786,7 @@ static void throttle_cfs_rq_work(struct callback_head *work)
 		rq = scope.rq;
 		update_rq_clock(rq);
 		WARN_ON_ONCE(p->throttled || !list_empty(&p->throttle_node));
-		dequeue_task_fair(rq, p, DEQUEUE_SLEEP | DEQUEUE_SPECIAL);
+		dequeue_task_fair(rq, p, DEQUEUE_SLEEP | DEQUEUE_THROTTLE);
 		list_add(&p->throttle_node, &cfs_rq->throttled_limbo_list);
 		/*
 		 * Must not set throttled before dequeue or dequeue will
@@ -5948,6 +5941,17 @@ static inline void task_throttle_setup_work(struct task_struct *p)
 	task_work_add(p, &p->sched_throttle_work, TWA_RESUME);
 }
 
+static void record_throttle_clock(struct cfs_rq *cfs_rq)
+{
+	struct rq *rq = rq_of(cfs_rq);
+
+	if (cfs_rq_throttled(cfs_rq) && !cfs_rq->throttled_clock)
+		cfs_rq->throttled_clock = rq_clock(rq);
+
+	if (!cfs_rq->throttled_clock_self)
+		cfs_rq->throttled_clock_self = rq_clock(rq);
+}
+
 static int tg_throttle_down(struct task_group *tg, void *data)
 {
 	struct rq *rq = data;
@@ -5956,21 +5960,17 @@ static int tg_throttle_down(struct task_group *tg, void *data)
 	if (cfs_rq->throttle_count++)
 		return 0;
 
-
-	/* group is entering throttled state, stop time */
-	WARN_ON_ONCE(cfs_rq->throttled_clock_self);
-	if (cfs_rq->nr_queued)
-		cfs_rq->throttled_clock_self = rq_clock(rq);
-	else {
-		/*
-		 * For cfs_rqs that still have entities enqueued, PELT clock
-		 * stop happens at dequeue time when all entities are dequeued.
-		 */
+	/*
+	 * For cfs_rqs that still have entities enqueued, PELT clock
+	 * stop happens at dequeue time when all entities are dequeued.
+	 */
+	if (!cfs_rq->nr_queued) {
 		list_del_leaf_cfs_rq(cfs_rq);
 		cfs_rq->throttled_clock_pelt = rq_clock_pelt(rq);
 		cfs_rq->pelt_clock_throttled = 1;
 	}
 
+	WARN_ON_ONCE(cfs_rq->throttled_clock_self);
 	WARN_ON_ONCE(!list_empty(&cfs_rq->throttled_limbo_list));
 	return 0;
 }
@@ -6013,8 +6013,6 @@ static bool throttle_cfs_rq(struct cfs_rq *cfs_rq)
 	 */
 	cfs_rq->throttled = 1;
 	WARN_ON_ONCE(cfs_rq->throttled_clock);
-	if (cfs_rq->nr_queued)
-		cfs_rq->throttled_clock = rq_clock(rq);
 	return true;
 }
@@ -6722,6 +6720,7 @@ static void task_throttle_setup_work(struct task_struct *p) {}
 static bool task_is_throttled(struct task_struct *p) { return false; }
 static void dequeue_throttled_task(struct task_struct *p, int flags) {}
 static bool enqueue_throttled_task(struct task_struct *p) { return false; }
+static void record_throttle_clock(struct cfs_rq *cfs_rq) {}
 
 static inline int cfs_rq_throttled(struct cfs_rq *cfs_rq)
 {
@@ -7040,6 +7039,7 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
 	bool was_sched_idle = sched_idle_rq(rq);
 	bool task_sleep = flags & DEQUEUE_SLEEP;
 	bool task_delayed = flags & DEQUEUE_DELAYED;
+	bool task_throttled = flags & DEQUEUE_THROTTLE;
 	struct task_struct *p = NULL;
 	int h_nr_idle = 0;
 	int h_nr_queued = 0;
@@ -7073,6 +7073,9 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
 		if (cfs_rq_is_idle(cfs_rq))
 			h_nr_idle = h_nr_queued;
 
+		if (throttled_hierarchy(cfs_rq) && task_throttled)
+			record_throttle_clock(cfs_rq);
+
 		/* Don't dequeue parent if it has other entities besides us */
 		if (cfs_rq->load.weight) {
 			slice = cfs_rq_min_slice(cfs_rq);
@@ -7109,6 +7112,9 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
 
 		if (cfs_rq_is_idle(cfs_rq))
 			h_nr_idle = h_nr_queued;
+
+		if (throttled_hierarchy(cfs_rq) && task_throttled)
+			record_throttle_clock(cfs_rq);
 	}
 
 	sub_nr_running(rq, h_nr_queued);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index fc697d4bf6685..dbe52e18b93a0 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2326,6 +2326,7 @@ extern const u32 sched_prio_to_wmult[40];
 #define DEQUEUE_SPECIAL		0x10
 #define DEQUEUE_MIGRATING	0x100 /* Matches ENQUEUE_MIGRATING */
 #define DEQUEUE_DELAYED		0x200 /* Matches ENQUEUE_DELAYED */
+#define DEQUEUE_THROTTLE	0x800
 
 #define ENQUEUE_WAKEUP		0x01
 #define ENQUEUE_RESTORE		0x02
-- 
2.39.5

From nobody Tue Oct 7 03:50:38 2025
From: Aaron Lu
To: Valentin Schneider, Ben Segall, K Prateek Nayak, Peter Zijlstra,
	Chengming Zhou, Josh Don, Ingo Molnar, Vincent Guittot, Xi Wang
Cc: linux-kernel@vger.kernel.org, Juri Lelli, Dietmar Eggemann,
	Steven Rostedt, Mel Gorman, Chuyi Zhou, Jan Kiszka,
	Florian Bezdeka, Songtang Liu
Subject: [PATCH v3 5/5] sched/fair: Get rid of throttled_lb_pair()
Date: Tue, 15 Jul 2025 15:16:58 +0800
Message-Id: <20250715071658.267-6-ziqianlu@bytedance.com>
X-Mailer: git-send-email 2.39.5
In-Reply-To: <20250715071658.267-1-ziqianlu@bytedance.com>
References: <20250715071658.267-1-ziqianlu@bytedance.com>

Now that throttled tasks are dequeued and cannot stay on the rq's
cfs_tasks list, there is no need to take special care of these
throttled tasks anymore in load balance.

Tested-by: K Prateek Nayak
Suggested-by: K Prateek Nayak
Signed-off-by: Aaron Lu
Tested-by: Matteo Martelli
Tested-by: Valentin Schneider
---
 kernel/sched/fair.c | 33 +++------------------------------
 1 file changed, 3 insertions(+), 30 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 6f534fbe89bcf..af33d107d8034 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5729,23 +5729,6 @@ static inline int throttled_hierarchy(struct cfs_rq *cfs_rq)
 	return cfs_bandwidth_used() && cfs_rq->throttle_count;
 }
 
-/*
- * Ensure that neither of the group entities corresponding to src_cpu or
- * dest_cpu are members of a throttled hierarchy when performing group
- * load-balance operations.
- */
-static inline int throttled_lb_pair(struct task_group *tg,
-				    int src_cpu, int dest_cpu)
-{
-	struct cfs_rq *src_cfs_rq, *dest_cfs_rq;
-
-	src_cfs_rq = tg->cfs_rq[src_cpu];
-	dest_cfs_rq = tg->cfs_rq[dest_cpu];
-
-	return throttled_hierarchy(src_cfs_rq) ||
-	       throttled_hierarchy(dest_cfs_rq);
-}
-
 static inline bool task_is_throttled(struct task_struct *p)
 {
 	return p->throttled;
@@ -6732,12 +6715,6 @@ static inline int throttled_hierarchy(struct cfs_rq *cfs_rq)
 	return 0;
 }
 
-static inline int throttled_lb_pair(struct task_group *tg,
-				    int src_cpu, int dest_cpu)
-{
-	return 0;
-}
-
 #ifdef CONFIG_FAIR_GROUP_SCHED
 void init_cfs_bandwidth(struct cfs_bandwidth *cfs_b, struct cfs_bandwidth *parent) {}
 static void init_cfs_rq_runtime(struct cfs_rq *cfs_rq) {}
@@ -9374,17 +9351,13 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
 	/*
 	 * We do not migrate tasks that are:
 	 * 1) delayed dequeued unless we migrate load, or
-	 * 2) throttled_lb_pair, or
-	 * 3) cannot be migrated to this CPU due to cpus_ptr, or
-	 * 4) running (obviously), or
-	 * 5) are cache-hot on their current CPU.
+	 * 2) cannot be migrated to this CPU due to cpus_ptr, or
+	 * 3) running (obviously), or
+	 * 4) are cache-hot on their current CPU.
	 */
 	if ((p->se.sched_delayed) && (env->migration_type != migrate_load))
 		return 0;
 
-	if (throttled_lb_pair(task_group(p), env->src_cpu, env->dst_cpu))
-		return 0;
-
 	/*
 	 * We want to prioritize the migration of eligible tasks.
 	 * For ineligible tasks we soft-limit them and only allow
-- 
2.39.5