From nobody Mon Dec 1 21:29:59 2025
From: Fernand Sieber
To: ,
CC: , , , , , , , , , , , , , , , ,
Subject: [PATCH v2] sched/fair: Force idle aware load balancing
Date: Mon, 1 Dec 2025 14:42:22 +0200
Message-ID: <20251201124223.247107-1-sieberf@amazon.com>
X-Mailer: git-send-email 2.43.0
In-Reply-To:
<20251127202719.963766-1-sieberf@amazon.com>
References:
Precedence: bulk
X-Mailing-List: linux-kernel@vger.kernel.org
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"

Consider force idle wasted capacity when computing whether a group is
idle or overloaded. We use a rather crude mechanism based on the current
force idle state of the rq; it may be preferable to use a decaying
average, similar to other load metrics, to avoid jitter.

If the busiest group has force idle, make it a task migration. This way
we will try to move one task regardless of the load. There are still
subsequent checks later on to verify that this doesn't cause more force
idle on the destination.

===
rev1->rev2:
* addressed feedback about asym scheduling
* removed redundant force idle check for idle cpus
* removed migrate_task override for LB with force idle (no perf gains)
===

Testing

Testing is aimed at measuring perceived guest noise on a hypervisor
system in time-shared scenarios. The setup is a system where the load is
nearing 100%, which should leave no steal time. The system has 64 CPUs
and 8 VMs, each VM using core scheduling with 8 vCPUs, time shared. 7 VMs
run stressors (`stress-ng --cpu 0`) while the last VM runs the hwlat
tracer with a width of 100ms, a period of 300ms, and a threshold of
100us. Each VM runs a cookied non-vCPU VMM process that adds a light
level of noise, which forces some level of load balancing.

The test scenario is run 10x60s and the average noise is measured. At
baseline, we measure about 1.20% noise (computed from hwlat breaches).
With the proposed patch, the noise drops to 0.63%.
Signed-off-by: Fernand Sieber
---
 kernel/sched/fair.c  | 67 ++++++++++++++++++++++++++++++++++++++++----
 kernel/sched/sched.h | 12 ++++++++
 2 files changed, 73 insertions(+), 6 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 5b752324270b..c4ef8aaf1142 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9932,6 +9932,10 @@ struct sg_lb_stats {
 	unsigned int nr_numa_running;
 	unsigned int nr_preferred_running;
 #endif
+#ifdef CONFIG_SCHED_CORE
+	unsigned int forceidle_weight;
+	unsigned long forceidle_capacity;
+#endif
 };
 
 /*
@@ -10120,6 +10124,29 @@ static inline int sg_imbalanced(struct sched_group *group)
 	return group->sgc->imbalance;
 }
 
+
+#ifdef CONFIG_SCHED_CORE
+static inline unsigned int sgs_available_weight(struct sg_lb_stats *sgs)
+{
+	return sgs->group_weight - sgs->forceidle_weight;
+}
+
+static inline unsigned long sgs_available_capacity(struct sg_lb_stats *sgs)
+{
+	return sgs->group_capacity - sgs->forceidle_capacity;
+}
+#else
+static inline unsigned int sgs_available_weight(struct sg_lb_stats *sgs)
+{
+	return sgs->group_weight;
+}
+
+static inline unsigned long sgs_available_capacity(struct sg_lb_stats *sgs)
+{
+	return sgs->group_capacity;
+}
+#endif /* CONFIG_SCHED_CORE */
+
 /*
  * group_has_capacity returns true if the group has spare capacity that could
  * be used by some tasks.
@@ -10135,14 +10162,14 @@ static inline int sg_imbalanced(struct sched_group *group)
 static inline bool
 group_has_capacity(unsigned int imbalance_pct, struct sg_lb_stats *sgs)
 {
-	if (sgs->sum_nr_running < sgs->group_weight)
+	if (sgs->sum_nr_running < sgs_available_weight(sgs))
 		return true;
 
-	if ((sgs->group_capacity * imbalance_pct) <
+	if ((sgs_available_capacity(sgs) * imbalance_pct) <
 			(sgs->group_runnable * 100))
 		return false;
 
-	if ((sgs->group_capacity * 100) >
+	if ((sgs_available_capacity(sgs) * 100) >
 			(sgs->group_util * imbalance_pct))
 		return true;
 
@@ -10160,14 +10187,14 @@ group_has_capacity(unsigned int imbalance_pct, struct sg_lb_stats *sgs)
 static inline bool
 group_is_overloaded(unsigned int imbalance_pct, struct sg_lb_stats *sgs)
 {
-	if (sgs->sum_nr_running <= sgs->group_weight)
+	if (sgs->sum_nr_running <= sgs_available_weight(sgs))
 		return false;
 
-	if ((sgs->group_capacity * 100) <
+	if ((sgs_available_capacity(sgs) * 100) <
 			(sgs->group_util * imbalance_pct))
 		return true;
 
-	if ((sgs->group_capacity * imbalance_pct) <
+	if ((sgs_available_capacity(sgs) * imbalance_pct) <
 			(sgs->group_runnable * 100))
 		return true;
 
@@ -10336,6 +10363,30 @@ sched_reduced_capacity(struct rq *rq, struct sched_domain *sd)
 	return check_cpu_capacity(rq, sd);
 }
 
+#ifdef CONFIG_SCHED_CORE
+static inline void
+update_forceidle_capacity(struct sched_domain *sd,
+			  struct sg_lb_stats *sgs,
+			  struct rq *rq)
+{
+	/*
+	 * Ignore force idle if we are balancing within the SMT mask
+	 */
+	if (sd->flags & SD_SHARE_CPUCAPACITY)
+		return;
+
+	if (rq_in_forceidle(rq)) {
+		sgs->forceidle_weight++;
+		sgs->forceidle_capacity += rq->cpu_capacity;
+	}
+}
+#else
+static inline void
+update_forceidle_capacity(struct sched_domain *sd,
+			  struct sg_lb_stats *sgs,
+			  struct rq *rq) {}
+#endif /* !CONFIG_SCHED_CORE */
+
 /**
  * update_sg_lb_stats - Update sched_group's statistics for load balancing.
  * @env: The load balancing environment.
@@ -10371,6 +10422,8 @@ static inline void update_sg_lb_stats(struct lb_env *env,
 		nr_running = rq->nr_running;
 		sgs->sum_nr_running += nr_running;
 
+		update_forceidle_capacity(env->sd, sgs, rq);
+
 		if (cpu_overutilized(i))
 			*sg_overutilized = 1;
 
@@ -10691,6 +10744,8 @@ static inline void update_sg_wakeup_stats(struct sched_domain *sd,
 		nr_running = rq->nr_running - local;
 		sgs->sum_nr_running += nr_running;
 
+		update_forceidle_capacity(sd, sgs, rq);
+
 		/*
 		 * No need to call idle_cpu_without() if nr_running is not 0
 		 */
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index adfb6e3409d7..fdee101b1a66 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1468,6 +1468,13 @@ static inline bool sched_core_enqueued(struct task_struct *p)
 	return !RB_EMPTY_NODE(&p->core_node);
 }
 
+static inline bool rq_in_forceidle(struct rq *rq)
+{
+	return rq->core->core_forceidle_count > 0 &&
+	       rq->nr_running &&
+	       rq->curr == rq->idle;
+}
+
 extern void sched_core_enqueue(struct rq *rq, struct task_struct *p);
 extern void sched_core_dequeue(struct rq *rq, struct task_struct *p, int flags);
 
@@ -1513,6 +1520,11 @@ static inline bool sched_group_cookie_match(struct rq *rq,
 	return true;
 }
 
+static inline bool rq_in_forceidle(struct rq *rq)
+{
+	return false;
+}
+
 #endif /* !CONFIG_SCHED_CORE */
 
 #ifdef CONFIG_RT_GROUP_SCHED
-- 
2.43.0

Amazon Development Centre (South Africa) (Proprietary) Limited
29 Gogosoa Street, Observatory, Cape Town, Western Cape, 7925, South Africa
Registration Number: 2004 / 034463 / 07