From: Fernand Sieber
Subject: [PATCH] sched/fair: Force idle aware load balancing
Date: Thu, 27 Nov 2025 22:27:17 +0200
Message-ID: <20251127202719.963766-1-sieberf@amazon.com>
X-Mailer: git-send-email 2.43.0

Consider capacity wasted to force idle when computing whether a group
is idle or overloaded. We use a rather crude mechanism based on the
current force idle state of the rq; it may be preferable to use a
decaying average, similar to other load metrics, to avoid jitter.

If the busiest group has force idle, make the balance a task
migration. This way we try to move one task regardless of the load.
Subsequent checks still verify that this does not cause more force
idle on the destination.
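To make the capacity scaling concrete, below is a standalone userspace
sketch (not kernel code) of the spare capacity check in
group_has_capacity() as modified by this patch. The struct, the helper
name and all numbers (capacity 4096 for 4 CPUs, utilization 2800,
imbalance_pct 117) are invented for illustration:

  /*
   * Standalone illustration (not kernel code) of the rescaled spare
   * capacity check; values are made up for the example.
   */
  #include <stdbool.h>
  #include <stdio.h>

  struct sg_stats {
  	unsigned long group_capacity;	/* aggregate capacity of the group */
  	unsigned long group_util;	/* aggregate utilization */
  	unsigned int group_weight;	/* number of CPUs in the group */
  	unsigned int forceidle_weight;	/* CPUs currently force idled */
  };

  /* Mirrors the patched check: capacity is discounted by the fraction
   * of CPUs lost to force idle. */
  static bool has_spare_capacity(unsigned int imbalance_pct,
  			       const struct sg_stats *sgs)
  {
  	return (sgs->group_capacity * 100 *
  		(sgs->group_weight - sgs->forceidle_weight)) >
  	       (sgs->group_util * imbalance_pct * sgs->group_weight);
  }

  int main(void)
  {
  	struct sg_stats sgs = {
  		.group_capacity   = 4096,	/* 4 CPUs x 1024 */
  		.group_util       = 2800,
  		.group_weight     = 4,
  		.forceidle_weight = 0,
  	};

  	/* 409600 * 4 > 327600 * 4 -> spare capacity */
  	printf("no force idle:  %d\n", has_spare_capacity(117, &sgs));

  	/* 409600 * 3 < 327600 * 4 -> no spare capacity */
  	sgs.forceidle_weight = 1;
  	printf("one force idle: %d\n", has_spare_capacity(117, &sgs));

  	return 0;
  }

With one of four CPUs force idled, the group only advertises three
CPUs' worth of capacity, so it stops reporting spare capacity at a
proportionally lower utilization and is more likely to be picked as
the busiest group.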
=== Testing

Testing aims at measuring perceived guest noise on a hypervisor system
in time shared scenarios. The setup puts the system under a load
nearing 100%, which should allow no steal time. The system has 64 CPUs
and runs 8 VMs, each VM using core scheduling with 8 vCPUs, time
shared. 7 VMs run stressors (`stress-ng --cpu 0`) while the last VM
runs the hwlat tracer with a width of 100ms, a period of 300ms, and a
threshold of 100us. Each VM also runs a cookied non-vCPU VMM process
that adds a light level of noise, which forces some level of load
balancing.

The test scenario is run 10x60s and the average noise is measured. At
baseline, we measure about 1.20% of noise (computed from hwlat
breaches). With the proposed patch, the noise drops to 0.63%.

Signed-off-by: Fernand Sieber

---
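Note: as mentioned above, a decaying average may be preferable to the
instantaneous force idle state. One hypothetical shape for such a
metric is sketched below; the function name, the 7/8 decay factor and
the 0..1024 fixed point scale are assumptions for illustration, not
part of this patch:

  #include <stdbool.h>

  /* Decay the running force idle average toward the current sample:
   * the new sample is weighted 1/8 against 7/8 of history. */
  static inline unsigned long
  forceidle_avg_update(unsigned long avg, bool forceidle_now)
  {
  	return (avg * 7 + (forceidle_now ? 1024 : 0)) / 8;
  }

Such an average would replace the raw rq_in_forceidle() test when
accumulating forceidle_weight, smoothing out transient force idle
spells.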
 kernel/sched/fair.c  | 40 +++++++++++++++++++++++++++-------------
 kernel/sched/sched.h | 12 ++++++++++++
 2 files changed, 39 insertions(+), 13 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 5b752324270b..ab8c9aa09107 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9932,6 +9932,7 @@ struct sg_lb_stats {
 	unsigned int nr_numa_running;
 	unsigned int nr_preferred_running;
 #endif
+	unsigned int forceidle_weight;
 };
 
 /*
@@ -10135,15 +10136,15 @@ static inline int sg_imbalanced(struct sched_group *group)
 static inline bool
 group_has_capacity(unsigned int imbalance_pct, struct sg_lb_stats *sgs)
 {
-	if (sgs->sum_nr_running < sgs->group_weight)
+	if (sgs->sum_nr_running < (sgs->group_weight - sgs->forceidle_weight))
 		return true;
 
-	if ((sgs->group_capacity * imbalance_pct) <
-	    (sgs->group_runnable * 100))
+	if ((sgs->group_capacity * imbalance_pct * (sgs->group_weight - sgs->forceidle_weight)) <
+	    (sgs->group_runnable * 100 * sgs->group_weight))
 		return false;
 
-	if ((sgs->group_capacity * 100) >
-	    (sgs->group_util * imbalance_pct))
+	if ((sgs->group_capacity * 100 * (sgs->group_weight - sgs->forceidle_weight)) >
+	    (sgs->group_util * imbalance_pct * sgs->group_weight))
 		return true;
 
 	return false;
@@ -10160,15 +10161,15 @@ group_has_capacity(unsigned int imbalance_pct, struct sg_lb_stats *sgs)
 static inline bool
 group_is_overloaded(unsigned int imbalance_pct, struct sg_lb_stats *sgs)
 {
-	if (sgs->sum_nr_running <= sgs->group_weight)
+	if (sgs->sum_nr_running <= (sgs->group_weight - sgs->forceidle_weight))
 		return false;
 
-	if ((sgs->group_capacity * 100) <
-	    (sgs->group_util * imbalance_pct))
+	if ((sgs->group_capacity * 100 * (sgs->group_weight - sgs->forceidle_weight)) <
+	    (sgs->group_util * imbalance_pct * sgs->group_weight))
 		return true;
 
-	if ((sgs->group_capacity * imbalance_pct) <
-	    (sgs->group_runnable * 100))
+	if ((sgs->group_capacity * imbalance_pct * (sgs->group_weight - sgs->forceidle_weight)) <
+	    (sgs->group_runnable * 100 * sgs->group_weight))
 		return true;
 
 	return false;
@@ -10371,13 +10372,19 @@ static inline void update_sg_lb_stats(struct lb_env *env,
 		nr_running = rq->nr_running;
 		sgs->sum_nr_running += nr_running;
 
+		/*
+		 * Ignore force idle if we are balancing within the SMT mask
+		 */
+		if (rq_in_forceidle(rq) && !(env->sd->flags & SD_SHARE_CPUCAPACITY))
+			sgs->forceidle_weight++;
+
 		if (cpu_overutilized(i))
 			*sg_overutilized = 1;
 
 		/*
 		 * No need to call idle_cpu() if nr_running is not 0
 		 */
-		if (!nr_running && idle_cpu(i)) {
+		if (!rq_in_forceidle(rq) && !nr_running && idle_cpu(i)) {
 			sgs->idle_cpus++;
 			/* Idle cpu can't have misfit task */
 			continue;
@@ -10691,10 +10698,16 @@ static inline void update_sg_wakeup_stats(struct sched_domain *sd,
 		nr_running = rq->nr_running - local;
 		sgs->sum_nr_running += nr_running;
 
+		/*
+		 * Ignore force idle if we are balancing within the SMT mask
+		 */
+		if (rq_in_forceidle(rq) && !(sd->flags & SD_SHARE_CPUCAPACITY))
+			sgs->forceidle_weight++;
+
 		/*
 		 * No need to call idle_cpu_without() if nr_running is not 0
 		 */
-		if (!nr_running && idle_cpu_without(i, p))
+		if (!rq_in_forceidle(rq) && !nr_running && idle_cpu_without(i, p))
 			sgs->idle_cpus++;
 
 		/* Check if task fits in the CPU */
@@ -11123,7 +11136,8 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
 		return;
 	}
 
-	if (busiest->group_type == group_smt_balance) {
+	if (busiest->group_type == group_smt_balance ||
+	    busiest->forceidle_weight) {
 		/* Reduce number of tasks sharing CPU capacity */
 		env->migration_type = migrate_task;
 		env->imbalance = 1;
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index adfb6e3409d7..fdee101b1a66 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1468,6 +1468,13 @@ static inline bool sched_core_enqueued(struct task_struct *p)
 	return !RB_EMPTY_NODE(&p->core_node);
 }
 
+static inline bool rq_in_forceidle(struct rq *rq)
+{
+	return rq->core->core_forceidle_count > 0 &&
+	       rq->nr_running &&
+	       rq->curr == rq->idle;
+}
+
 extern void sched_core_enqueue(struct rq *rq, struct task_struct *p);
 extern void sched_core_dequeue(struct rq *rq, struct task_struct *p, int flags);
 
@@ -1513,6 +1520,11 @@ static inline bool sched_group_cookie_match(struct rq *rq,
 	return true;
 }
 
+static inline bool rq_in_forceidle(struct rq *rq)
+{
+	return false;
+}
+
 #endif /* !CONFIG_SCHED_CORE */
 
 #ifdef CONFIG_RT_GROUP_SCHED
-- 
2.43.0