From nobody Tue Jun 23 09:09:16 2026
From: Tianchen Ding
To: Zefan Li, Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Tejun Heo, Johannes Weiner,
	Tianchen Ding, Michael Wang, Cruz Zhao, Masahiro Yamada,
	Nathan Chancellor, Kees Cook, Andrew Morton, Vlastimil Babka,
	"Gustavo A. R. Silva", Arnd Bergmann, Miguel Ojeda, Chris Down,
	Vipin Sharma, Daniel Borkmann
Cc: linux-kernel@vger.kernel.org, cgroups@vger.kernel.org
Subject: [RFC PATCH v2 1/4] sched, cpuset: Introduce infrastructure of group balancer
Date: Tue, 8 Mar 2022 17:26:26 +0800
Message-Id: <20220308092629.40431-2-dtcccc@linux.alibaba.com>
In-Reply-To: <20220308092629.40431-1-dtcccc@linux.alibaba.com>
References: <20220308092629.40431-1-dtcccc@linux.alibaba.com>
MIME-Version: 1.0
Content-Transfer-Encoding: quoted-printable
Content-Type: text/plain; charset="utf-8"

Introduce the Kconfig option and structs for the group balancer, and add
a few cpuset and debugfs interfaces to read and write its parameters.

In detail:

cgroup:
  cpuset.gb.period_us:
    Controls the work period. Write a positive number to enable the
    group balancer for all descendants EXCEPT the cgroup itself. Write 0
    to disable it. The maximum is 1000000 (1s). Can only be written
    after the partition info is set.
  cpuset.gb.partition:
    Partition info used by the group balancer. Tasks in child cgroups
    (EXCEPT the cgroup itself) will try to settle in one of the given
    partitions when gb is enabled.

/sys/kernel/debug/sched/group_balancer/:
  settle_period_ms:
    Default settle period for each group.
  settle_period_max_ms:
    Max settle period for each group.
  global_settle_period_ms:
    Global settle period for all groups. A settle operation for a group
    must satisfy both the group's own period and the global period.

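The write rules for cpuset.gb.period_us listed above can be modelled in
plain userspace C. This is only an illustrative sketch, not the kernel
code: struct gb_period_model, its has_partition flag and both helper
functions are hypothetical names made up for this example. It assumes
the period is stored in nanoseconds and that a write is rejected until
partition info exists, as the description states.

```c
#include <errno.h>
#include <stdint.h>

#define GB_NSEC_PER_USEC  1000ULL
#define GB_MAX_PERIOD_US  1000000ULL	/* 1s upper bound from the description */

/* Hypothetical stand-in for the per-cgroup state; not a kernel struct. */
struct gb_period_model {
	uint64_t gb_period_ns;	/* 0 means the group balancer is disabled */
	int has_partition;	/* non-zero once partition info has been set */
};

/* Model of a write to cpuset.gb.period_us; returns 0 or a negative errno. */
static int gb_period_model_write(struct gb_period_model *m, uint64_t period_us)
{
	if (period_us > GB_MAX_PERIOD_US)
		return -EINVAL;
	/* The period can only be written after partition info is set. */
	if (!m->has_partition)
		return -EINVAL;
	m->gb_period_ns = period_us * GB_NSEC_PER_USEC;
	return 0;
}

/* Model of a read from cpuset.gb.period_us: report the period in us. */
static uint64_t gb_period_model_read_us(const struct gb_period_model *m)
{
	return m->gb_period_ns / GB_NSEC_PER_USEC;
}
```

With this model, writing 200000 before any partition exists fails with
-EINVAL, the same write succeeds once has_partition is set, and writing
0 afterwards disables the balancer again.
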
Signed-off-by: Tianchen Ding --- include/linux/sched/gb.h | 29 +++++++++++ init/Kconfig | 12 +++++ kernel/cgroup/cpuset.c | 109 +++++++++++++++++++++++++++++++++++++++ kernel/sched/Makefile | 1 + kernel/sched/debug.c | 10 +++- kernel/sched/gb.c | 14 +++++ kernel/sched/sched.h | 4 ++ 7 files changed, 178 insertions(+), 1 deletion(-) create mode 100644 include/linux/sched/gb.h create mode 100644 kernel/sched/gb.c diff --git a/include/linux/sched/gb.h b/include/linux/sched/gb.h new file mode 100644 index 000000000000..63c14289a748 --- /dev/null +++ b/include/linux/sched/gb.h @@ -0,0 +1,29 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +/* + * Group Balancer header + * + * Copyright (C) 2021 Alibaba, Inc., Michael Wang + */ +#ifndef _LINUX_SCHED_GB_H +#define _LINUX_SCHED_GB_H + +#include + +struct gb_info { + /* ---for ancestor as controller--- */ + /* Period(ns) for task work in task tick, 0 means disabled. */ + u64 gb_period; + + /* ---for descendants who scheduled by group balancer--- */ + /* Period for settling to specific partition. */ + unsigned long settle_period; + + /* Stamp to record next settling. */ + unsigned long settle_next; +}; + +#ifdef CONFIG_GROUP_BALANCER +extern unsigned int sysctl_gb_settle_period; +#endif + +#endif diff --git a/init/Kconfig b/init/Kconfig index 8b79eda8e1d4..8f05e5e7e299 100644 --- a/init/Kconfig +++ b/init/Kconfig @@ -1043,6 +1043,18 @@ config RT_GROUP_SCHED realtime bandwidth for them. See Documentation/scheduler/sched-rt-group.rst for more information. =20 +config GROUP_BALANCER + bool "Group balancer support for SCHED_OTHER" + depends on (CGROUP_SCHED && SMP) + default CGROUP_SCHED + help + This feature allow you to do load balance in group mode. In other + word, balance groups of tasks among groups of CPUs. + + This can reduce the conflict between task groups and gain benefit + from hot cache and affinity domain, usually for the cases when there + are multiple apps sharing the same CPUs. 
+ endif #CGROUP_SCHED =20 config UCLAMP_TASK_GROUP diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c index ef88cc366bb8..0349f3f64e3d 100644 --- a/kernel/cgroup/cpuset.c +++ b/kernel/cgroup/cpuset.c @@ -65,6 +65,7 @@ #include #include #include +#include =20 DEFINE_STATIC_KEY_FALSE(cpusets_pre_enable_key); DEFINE_STATIC_KEY_FALSE(cpusets_enabled_key); @@ -170,6 +171,14 @@ struct cpuset { =20 /* Handle for cpuset.cpus.partition */ struct cgroup_file partition_file; + +#ifdef CONFIG_GROUP_BALANCER + /* Partition info and configs of group balancer */ + struct gb_info gi; + + /* Point to the controller (in ancestor cgroup) who has partition info */ + struct gb_info *control_gi; +#endif }; =20 /* @@ -215,6 +224,24 @@ static inline struct cpuset *parent_cs(struct cpuset *= cs) return css_cs(cs->css.parent); } =20 +#ifdef CONFIG_GROUP_BALANCER +static inline struct gb_info *cs_gi(struct cpuset *cs) +{ + return &cs->gi; +} +static struct gb_info *css_gi(struct cgroup_subsys_state *css, bool get_co= ntroller) +{ + struct cpuset *cs =3D css_cs(css); + + return get_controller ? 
cs->control_gi : &cs->gi; +} +#else +static inline struct gb_info *cs_gi(struct cpuset *cs) +{ + return NULL; +} +#endif /* CONFIG_GROUP_BALANCER */ + /* bits in struct cpuset flags field */ typedef enum { CS_ONLINE, @@ -2365,6 +2392,7 @@ typedef enum { FILE_MEMORY_PRESSURE, FILE_SPREAD_PAGE, FILE_SPREAD_SLAB, + FILE_GB_CPULIST, } cpuset_filetype_t; =20 static int cpuset_write_u64(struct cgroup_subsys_state *css, struct cftype= *cft, @@ -2510,6 +2538,69 @@ static ssize_t cpuset_write_resmask(struct kernfs_op= en_file *of, return retval ?: nbytes; } =20 +#ifdef CONFIG_GROUP_BALANCER +static inline void init_gb(struct cpuset *cs) +{ + struct gb_info *gi =3D cs_gi(cs); + + gi->settle_period =3D msecs_to_jiffies(sysctl_gb_settle_period); + gi->settle_next =3D jiffies + gi->settle_period; +} + +static u64 gb_period_read_u64(struct cgroup_subsys_state *css, struct cfty= pe *cft) +{ + struct gb_info *gi =3D css_gi(css, false); + + return gi->gb_period / NSEC_PER_USEC; +} + +#define MAX_GB_PERIOD USEC_PER_SEC /* 1s */ + +static int gb_period_write_u64(struct cgroup_subsys_state *css, struct cft= ype *cft, u64 val) +{ + struct cpuset *cs =3D css_cs(css); + struct gb_info *gi =3D cs_gi(cs); + + if (val > MAX_GB_PERIOD) + return -EINVAL; + + percpu_down_write(&cpuset_rwsem); + + gi->gb_period =3D val * NSEC_PER_USEC; + + percpu_up_write(&cpuset_rwsem); + return 0; +} + +static void gb_partition_read(struct seq_file *sf, struct gb_info *gi) +{ + seq_putc(sf, '\n'); +} + +static ssize_t gb_partition_write(struct kernfs_open_file *of, char *buf, + size_t nbytes, loff_t off) +{ + struct cgroup_subsys_state *css =3D of_css(of); + struct cpuset *cs =3D css_cs(css); + int retval =3D -ENODEV; + + cpus_read_lock(); + percpu_down_write(&cpuset_rwsem); + if (!is_cpuset_online(cs)) + goto out_unlock; + + retval =3D 0; + +out_unlock: + percpu_up_write(&cpuset_rwsem); + cpus_read_unlock(); + return retval ?: nbytes; +} +#else +static inline void init_gb(struct cpuset *cs) { } +static 
inline void gb_partition_read(struct seq_file *sf, struct gb_info *= gi) { } +#endif /* CONFIG_GROUP_BALANCER */ + /* * These ascii lists should be read in a single call, by using a user * buffer large enough to hold the entire map. If read in smaller @@ -2542,6 +2633,9 @@ static int cpuset_common_seq_show(struct seq_file *sf= , void *v) case FILE_SUBPARTS_CPULIST: seq_printf(sf, "%*pbl\n", cpumask_pr_args(cs->subparts_cpus)); break; + case FILE_GB_CPULIST: + gb_partition_read(sf, cs_gi(cs)); + break; default: ret =3D -EINVAL; } @@ -2803,7 +2897,21 @@ static struct cftype dfl_files[] =3D { .private =3D FILE_SUBPARTS_CPULIST, .flags =3D CFTYPE_DEBUG, }, +#ifdef CONFIG_GROUP_BALANCER + { + .name =3D "gb.period_us", + .read_u64 =3D gb_period_read_u64, + .write_u64 =3D gb_period_write_u64, + }, =20 + { + .name =3D "gb.partition", + .seq_show =3D cpuset_common_seq_show, + .write =3D gb_partition_write, + .max_write_len =3D (100U + 6 * NR_CPUS), + .private =3D FILE_GB_CPULIST, + }, +#endif { } /* terminate */ }; =20 @@ -2870,6 +2978,7 @@ static int cpuset_css_online(struct cgroup_subsys_sta= te *css) cs->effective_mems =3D parent->effective_mems; cs->use_parent_ecpus =3D true; parent->child_ecpus_count++; + init_gb(cs); } spin_unlock_irq(&callback_lock); =20 diff --git a/kernel/sched/Makefile b/kernel/sched/Makefile index c83b37af155b..cb5b86bb161e 100644 --- a/kernel/sched/Makefile +++ b/kernel/sched/Makefile @@ -40,3 +40,4 @@ obj-$(CONFIG_MEMBARRIER) +=3D membarrier.o obj-$(CONFIG_CPU_ISOLATION) +=3D isolation.o obj-$(CONFIG_PSI) +=3D psi.o obj-$(CONFIG_SCHED_CORE) +=3D core_sched.o +obj-$(CONFIG_GROUP_BALANCER) +=3D gb.o diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c index 102d6f70e84d..1800bdfe1d61 100644 --- a/kernel/sched/debug.c +++ b/kernel/sched/debug.c @@ -299,7 +299,7 @@ static struct dentry *debugfs_sched; =20 static __init int sched_init_debug(void) { - struct dentry __maybe_unused *numa; + struct dentry __maybe_unused *numa, *gb; =20 debugfs_sched 
=3D debugfs_create_dir("sched", NULL); =20 @@ -336,6 +336,14 @@ static __init int sched_init_debug(void) debugfs_create_u32("scan_size_mb", 0644, numa, &sysctl_numa_balancing_sca= n_size); #endif =20 +#ifdef CONFIG_GROUP_BALANCER + gb =3D debugfs_create_dir("group_balancer", debugfs_sched); + + debugfs_create_u32("settle_period_ms", 0644, gb, &sysctl_gb_settle_period= ); + debugfs_create_u32("settle_period_max_ms", 0644, gb, &sysctl_gb_settle_pe= riod_max); + debugfs_create_u32("global_settle_period_ms", 0644, gb, &sysctl_gb_global= _settle_period); +#endif + debugfs_create_file("debug", 0444, debugfs_sched, NULL, &sched_debug_fops= ); =20 return 0; diff --git a/kernel/sched/gb.c b/kernel/sched/gb.c new file mode 100644 index 000000000000..4a2876ec1cca --- /dev/null +++ b/kernel/sched/gb.c @@ -0,0 +1,14 @@ +// SPDX-License-Identifier: GPL-2.0-only +/* + * Group Balancer code + * + * Copyright (C) 2021 Alibaba, Inc., Michael Wang + */ +#include + +#include "sched.h" + +/* Params about settle period. 
(ms) */ +unsigned int sysctl_gb_settle_period =3D 200; +unsigned int sysctl_gb_settle_period_max =3D 10000; +unsigned int sysctl_gb_global_settle_period =3D 200; diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index ac1f9470c617..d6555aacdf78 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -2433,6 +2433,10 @@ extern unsigned int sysctl_numa_balancing_scan_delay; extern unsigned int sysctl_numa_balancing_scan_period_min; extern unsigned int sysctl_numa_balancing_scan_period_max; extern unsigned int sysctl_numa_balancing_scan_size; + +extern unsigned int sysctl_gb_settle_period; +extern unsigned int sysctl_gb_settle_period_max; +extern unsigned int sysctl_gb_global_settle_period; #endif =20 #ifdef CONFIG_SCHED_HRTICK --=20 2.27.0 From nobody Tue Jun 23 09:09:16 2026 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 0F4CBC433EF for ; Tue, 8 Mar 2022 09:27:11 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1345355AbiCHJ1u (ORCPT ); Tue, 8 Mar 2022 04:27:50 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:53952 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1345338AbiCHJ1p (ORCPT ); Tue, 8 Mar 2022 04:27:45 -0500 Received: from out30-57.freemail.mail.aliyun.com (out30-57.freemail.mail.aliyun.com [115.124.30.57]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id A01C941312; Tue, 8 Mar 2022 01:26:48 -0800 (PST) X-Alimail-AntiSpam: AC=PASS;BC=-1|-1;BR=01201311R171e4;CH=green;DM=||false|;DS=||;FP=0|-1|-1|-1|0|-1|-1|-1;HT=e01e04400;MF=dtcccc@linux.alibaba.com;NM=1;PH=DS;RN=28;SR=0;TI=SMTPD_---0V6e-tkD_1646731601; Received: from localhost.localdomain(mailfrom:dtcccc@linux.alibaba.com fp:SMTPD_---0V6e-tkD_1646731601) by smtp.aliyun-inc.com(127.0.0.1); Tue, 08 Mar 2022 
17:26:43 +0800
From: Tianchen Ding
To: Zefan Li, Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Tejun Heo, Johannes Weiner,
	Tianchen Ding, Michael Wang, Cruz Zhao, Masahiro Yamada,
	Nathan Chancellor, Kees Cook, Andrew Morton, Vlastimil Babka,
	"Gustavo A. R. Silva", Arnd Bergmann, Miguel Ojeda, Chris Down,
	Vipin Sharma, Daniel Borkmann
Cc: linux-kernel@vger.kernel.org, cgroups@vger.kernel.org
Subject: [RFC PATCH v2 2/4] cpuset: Handle input of partition info for group balancer
Date: Tue, 8 Mar 2022 17:26:27 +0800
Message-Id: <20220308092629.40431-3-dtcccc@linux.alibaba.com>
In-Reply-To: <20220308092629.40431-1-dtcccc@linux.alibaba.com>
References: <20220308092629.40431-1-dtcccc@linux.alibaba.com>
MIME-Version: 1.0
Content-Transfer-Encoding: quoted-printable
Content-Type: text/plain; charset="utf-8"

Partition info is the key parameter of the group balancer. It represents
the division of the effective cpus. Cpus are divided into several parts,
and the group balancer selects one part for each cgroup (this is called
settling) and tries to gather all tasks of that cgroup onto the cpus of
the selected part.

For example, suppose the effective cpulist is "0-63". A valid partition
info input is "0-15;16-31;32-47;48-63;". This divides the cpus into 4
parts of 16 cpus each. Intersections between parts or conflicts with the
effective cpus are not allowed, e.g., "0-40;30-63" or "0-31;32-70;".

Once the partition info is set, the corresponding cgroup becomes a
controller. All of its descendants are managed by the group balancer,
according to the parameters set in this controller cgroup.

A cgroup can become a controller only when none of its ancestors or
descendants is a controller.

To unset a controller, write an empty string to it
(i.e., echo > cpuset.gb.partition). This clears its partition info and
resets the relationship between itself and its descendants.

Setting, modifying, or unsetting partition info must be done while the
group balancer is disabled (i.e., gb.period_us = 0).

    ROOT_CG
       |
       |
       |_________
       |         |
      CG_A      CG_B(controller, gb.partition = "...", gb.period_us > 0)
                 |
                 |
                 |________________
                 |                |
                CG_C(gb active)  CG_D(gb active)
                (prefer=0)       (prefer=3)

For example, if we write the valid partition info
"0-15;16-31;32-47;48-63;" and a positive period to CG_B, the group
balancer is enabled for all descendants of CG_B. All tasks in CG_C will
tend to run on the cpus of part0 (0-15), and all tasks in CG_D will tend
to run on part3 (48-63).

After the partition info is set, CG_B becomes a controller, so setting
partition info in ROOT_CG, CG_C or CG_D is not allowed until CG_B is
disabled and unset. CG_A can still be set.

Signed-off-by: Tianchen Ding
---
 include/linux/sched/gb.h |  29 +++++
 kernel/cgroup/cpuset.c   | 245 +++++++++++++++++++++++++++++++++++++--
 2 files changed, 267 insertions(+), 7 deletions(-)

diff --git a/include/linux/sched/gb.h b/include/linux/sched/gb.h
index 63c14289a748..b03a2b4ef4b7 100644
--- a/include/linux/sched/gb.h
+++ b/include/linux/sched/gb.h
@@ -8,9 +8,30 @@
 #define _LINUX_SCHED_GB_H
 
 #include
+#include
+
+#define CPU_PART_MAX 32
+struct gb_part {
+	int id;
+	struct cpumask cpus;
+};
+
+struct gb_part_info {
+	int nr_part;
+	struct rcu_head rcu_head;
+	struct gb_part parts[CPU_PART_MAX];
+	int ctop[NR_CPUS];
+};
 
 struct gb_info {
 	/* ---for ancestor as controller--- */
+	/*
+	 * Partition info. Non-NULL means this cgroup acting as a controller.
+	 * While otherwise, NULL means not a controller.
+	 * (Maybe a descendant of a controller, or not managed by gb at all.)
+	 */
+	struct gb_part_info *part_info;
+
 	/* Period(ns) for task work in task tick, 0 means disabled.
*/ u64 gb_period; =20 @@ -22,6 +43,14 @@ struct gb_info { unsigned long settle_next; }; =20 +#define for_each_gbpart(i, pi) \ + for (i =3D 0; i < (pi)->nr_part; i++) + +static inline struct cpumask *part_cpus(struct gb_part_info *pi, int id) +{ + return &pi->parts[id].cpus; +} + #ifdef CONFIG_GROUP_BALANCER extern unsigned int sysctl_gb_settle_period; #endif diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c index 0349f3f64e3d..4b456b379b87 100644 --- a/kernel/cgroup/cpuset.c +++ b/kernel/cgroup/cpuset.c @@ -683,13 +683,24 @@ static int validate_change(struct cpuset *cur, struct= cpuset *trial) if (cur =3D=3D &top_cpuset) goto out; =20 + ret =3D -EINVAL; + +#ifdef CONFIG_GROUP_BALANCER + /* + * If group balancer part_info is set, this cgroup acts as a controller. + * Not allow to change cpumask until unset it + * by writing empty string to cpuset.gb.partition. + */ + if (cs_gi(cur)->part_info) + goto out; +#endif + par =3D parent_cs(cur); =20 /* * If either I or some sibling (!=3D me) is exclusive, we can't * overlap */ - ret =3D -EINVAL; cpuset_for_each_child(c, css, par) { if ((is_cpu_exclusive(trial) || is_cpu_exclusive(c)) && c !=3D cur && @@ -2539,12 +2550,77 @@ static ssize_t cpuset_write_resmask(struct kernfs_o= pen_file *of, } =20 #ifdef CONFIG_GROUP_BALANCER -static inline void init_gb(struct cpuset *cs) +static void free_gb_part_info(struct rcu_head *rcu_head) +{ + struct gb_part_info *pi =3D container_of(rcu_head, struct gb_part_info, r= cu_head); + + kfree(pi); +} + +static inline void update_child_gb_controller(struct cpuset *cs, struct gb= _info *gi) +{ + struct cpuset *cp; + struct cgroup_subsys_state *pos_css; + + rcu_read_lock(); + cpuset_for_each_descendant_pre(cp, pos_css, cs) { + if (cp =3D=3D cs) + continue; + + cp->control_gi =3D gi; + } + rcu_read_unlock(); +} + +static inline void update_gb_part_info(struct cpuset *cs, struct gb_part_i= nfo *new) { struct gb_info *gi =3D cs_gi(cs); + struct gb_part_info *old; + + if (gi->part_info 
&& !new) { + /* + * We are clearing partition info. + * This cgroup is no longer a controller. + * Reset all descendants. + */ + update_child_gb_controller(cs, NULL); + } + + old =3D xchg(&gi->part_info, new); + + if (old) { + call_rcu(&old->rcu_head, free_gb_part_info); + } else if (new) { + /* + * This cgroup is newly becoming a controller. + * Set relationship between cs and descendants. + */ + update_child_gb_controller(cs, gi); + } +} + +static inline void init_gb(struct cpuset *cs) +{ + struct gb_info *gi =3D cs_gi(cs), *gi_iter; + struct cgroup_subsys_state *css; =20 gi->settle_period =3D msecs_to_jiffies(sysctl_gb_settle_period); gi->settle_next =3D jiffies + gi->settle_period; + + /* Search upwards to find any existing controller. */ + for (css =3D cs->css.parent; css; css =3D css->parent) { + gi_iter =3D css_gi(css, false); + if (gi_iter->part_info) { + cs->control_gi =3D gi_iter; + break; + } + } +} + +static inline void remove_gb(struct cpuset *cs) +{ + if (cs_gi(cs)->part_info) + update_gb_part_info(cs, NULL); } =20 static u64 gb_period_read_u64(struct cgroup_subsys_state *css, struct cfty= pe *cft) @@ -2560,28 +2636,156 @@ static int gb_period_write_u64(struct cgroup_subsy= s_state *css, struct cftype *c { struct cpuset *cs =3D css_cs(css); struct gb_info *gi =3D cs_gi(cs); + int retval =3D -EINVAL; =20 if (val > MAX_GB_PERIOD) - return -EINVAL; + return retval; =20 percpu_down_write(&cpuset_rwsem); =20 + /* + * Cannot enable group balancer on cgroups + * whose partition info not set. 
+ */ + if (!gi->part_info) + goto out_unlock; + gi->gb_period =3D val * NSEC_PER_USEC; + retval =3D 0; =20 +out_unlock: percpu_up_write(&cpuset_rwsem); - return 0; + return retval; } =20 static void gb_partition_read(struct seq_file *sf, struct gb_info *gi) { + struct gb_part_info *pi =3D gi->part_info; + int i; + + if (!pi) + return; + + for_each_gbpart(i, pi) + seq_printf(sf, "%*pbl;", cpumask_pr_args(part_cpus(pi, i))); + seq_putc(sf, '\n'); } =20 +static void build_gb_partition(struct cpumask *cpus_allowed, struct gb_par= t_info *pi, int id) +{ + int i; + struct gb_part *part =3D &pi->parts[id]; + + part->id =3D id; + cpumask_copy(&part->cpus, cpus_allowed); + + for_each_cpu(i, &part->cpus) + pi->ctop[i] =3D id; +} + +static int __gb_partition_write(struct cpuset *cs, char *buf, size_t nbyte= s) +{ + struct gb_part_info *new_pi; + bool should_stop =3D false; + int ret, retval =3D -EINVAL, id =3D 0; + char *start, *end; + cpumask_var_t summask, cpus_allowed; + + /* + * Write empty string to clear partition info. + * Then this cgroup is not a controller. 
+ */ + if (nbytes < 2) { + update_gb_part_info(cs, NULL); + retval =3D 0; + goto out; + } + + retval =3D -ENOMEM; + if (!zalloc_cpumask_var(&summask, GFP_KERNEL)) + goto out; + if (!zalloc_cpumask_var(&cpus_allowed, GFP_KERNEL)) + goto out_free_cpumask; + new_pi =3D kzalloc(sizeof(*new_pi), GFP_KERNEL); + if (!new_pi) + goto out_free_cpumask2; + + buf =3D strstrip(buf); + memset(new_pi->ctop, -1, sizeof(int) * num_possible_cpus()); + start =3D buf; + end =3D strchr(start, ';'); + retval =3D -EINVAL; + + /* Handle user input in format of "cpulist1;cpulist2;...;cpulistN;" */ + for (;;) { + if (!end) + should_stop =3D true; + else + *end =3D '\0'; + + if (*start =3D=3D '\0') + goto next; + + if (new_pi->nr_part >=3D CPU_PART_MAX) { + pr_warn("part number should be no larger than %d\n", CPU_PART_MAX); + goto out_free_pi; + } + + ret =3D cpulist_parse(start, cpus_allowed); + if (ret || cpumask_empty(cpus_allowed)) { + pr_warn("invalid cpulist: %s\n", start); + goto out_free_pi; + } + + /* There should not be intersections betweem partitions. */ + if (cpumask_intersects(summask, cpus_allowed)) { + pr_warn("%*pbl intersect with others\n", cpumask_pr_args(cpus_allowed)); + goto out_free_pi; + } + + cpumask_or(summask, summask, cpus_allowed); + + build_gb_partition(cpus_allowed, new_pi, id); + id++; + new_pi->nr_part++; +next: + if (should_stop) + break; + + start =3D end + 1; + end =3D strchr(start, ';'); + } + + /* + * Check whether the input is valid. + * Should not conflict with effective_cpus. 
+ */ + if (!cpumask_subset(summask, cs->effective_cpus) || new_pi->nr_part < 2) { + pr_warn("invalid cpulist\n"); + goto out_free_pi; + } + + update_gb_part_info(cs, new_pi); + + retval =3D 0; + goto out_free_cpumask2; + +out_free_pi: + kfree(new_pi); +out_free_cpumask2: + free_cpumask_var(cpus_allowed); +out_free_cpumask: + free_cpumask_var(summask); +out: + return retval; +} + static ssize_t gb_partition_write(struct kernfs_open_file *of, char *buf, size_t nbytes, loff_t off) { - struct cgroup_subsys_state *css =3D of_css(of); - struct cpuset *cs =3D css_cs(css); + struct cgroup_subsys_state *css =3D of_css(of), *pos_css; + struct cpuset *cs =3D css_cs(css), *cp; int retval =3D -ENODEV; =20 cpus_read_lock(); @@ -2589,7 +2793,31 @@ static ssize_t gb_partition_write(struct kernfs_open= _file *of, char *buf, if (!is_cpuset_online(cs)) goto out_unlock; =20 - retval =3D 0; + retval =3D -EINVAL; + /* Cannot change gb partition during enabling. */ + if (cs_gi(cs)->gb_period) + goto out_unlock; + + /* + * Cannot set gb partitons on cgroup whose + * ancestor or descendant has been set. 
+ */ + if (css_gi(css, true)) + goto out_unlock; + + rcu_read_lock(); + cpuset_for_each_descendant_pre(cp, pos_css, cs) { + if (cp =3D=3D cs) + continue; + + if (cs_gi(cp)->part_info) { + rcu_read_unlock(); + goto out_unlock; + } + } + rcu_read_unlock(); + + retval =3D __gb_partition_write(cs, buf, nbytes); =20 out_unlock: percpu_up_write(&cpuset_rwsem); @@ -2598,6 +2826,7 @@ static ssize_t gb_partition_write(struct kernfs_open_= file *of, char *buf, } #else static inline void init_gb(struct cpuset *cs) { } +static inline void remove_gb(struct cpuset *cs) { } static inline void gb_partition_read(struct seq_file *sf, struct gb_info *= gi) { } #endif /* CONFIG_GROUP_BALANCER */ =20 @@ -3051,6 +3280,8 @@ static void cpuset_css_offline(struct cgroup_subsys_s= tate *css) parent->child_ecpus_count--; } =20 + remove_gb(cs); + cpuset_dec(); clear_bit(CS_ONLINE, &cs->flags); =20 --=20 2.27.0 From nobody Tue Jun 23 09:09:16 2026 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id A039EC433EF for ; Tue, 8 Mar 2022 09:27:14 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1345360AbiCHJ2I (ORCPT ); Tue, 8 Mar 2022 04:28:08 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:54170 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1345358AbiCHJ1w (ORCPT ); Tue, 8 Mar 2022 04:27:52 -0500 Received: from out30-132.freemail.mail.aliyun.com (out30-132.freemail.mail.aliyun.com [115.124.30.132]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id B7CC34131E; Tue, 8 Mar 2022 01:26:51 -0800 (PST) X-Alimail-AntiSpam: AC=PASS;BC=-1|-1;BR=01201311R141e4;CH=green;DM=||false|;DS=||;FP=0|-1|-1|-1|0|-1|-1|-1;HT=e01e01424;MF=dtcccc@linux.alibaba.com;NM=1;PH=DS;RN=28;SR=0;TI=SMTPD_---0V6e-tl1_1646731604; Received: from 
localhost.localdomain(mailfrom:dtcccc@linux.alibaba.com fp:SMTPD_---0V6e-tl1_1646731604) by smtp.aliyun-inc.com(127.0.0.1); Tue, 08 Mar 2022 17:26:46 +0800
From: Tianchen Ding
To: Zefan Li, Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Tejun Heo, Johannes Weiner,
	Tianchen Ding, Michael Wang, Cruz Zhao, Masahiro Yamada,
	Nathan Chancellor, Kees Cook, Andrew Morton, Vlastimil Babka,
	"Gustavo A. R. Silva", Arnd Bergmann, Miguel Ojeda, Chris Down,
	Vipin Sharma, Daniel Borkmann
Cc: linux-kernel@vger.kernel.org, cgroups@vger.kernel.org
Subject: [RFC PATCH v2 3/4] sched: Introduce group balancer
Date: Tue, 8 Mar 2022 17:26:28 +0800
Message-Id: <20220308092629.40431-4-dtcccc@linux.alibaba.com>
In-Reply-To: <20220308092629.40431-1-dtcccc@linux.alibaba.com>
References: <20220308092629.40431-1-dtcccc@linux.alibaba.com>
MIME-Version: 1.0
Content-Transfer-Encoding: quoted-printable
Content-Type: text/plain; charset="utf-8"

From: Michael Wang

Modern platforms are growing fast in CPU count, and multiple apps
sharing one box is very common. Apps used to have exclusive cpu
settings, but nowadays things are changing.

To achieve better utilization of CPU resources, multiple apps are
starting to share the CPUs. The CPU resources are usually overcommitted,
since the apps' workloads fluctuate.

This introduces performance problems in share mode compared with
exclusive mode. For example, with cgroups A, B and C deployed in
exclusive mode, it will be:

  CPU_X (100%)      CPU_Y (100%)      CPU_Z (50%)
  T_1_CG_A          T_1_CG_B          T_1_CG_C
  T_2_CG_A          T_2_CG_B          T_2_CG_C
  T_3_CG_A          T_3_CG_B
  T_4_CG_A          T_4_CG_B

while in share mode it will be:

  CPU_X (100%)      CPU_Y (75%)       CPU_Z (75%)
  T_1_CG_A          T_2_CG_A          T_1_CG_B
  T_2_CG_B          T_3_CG_B          T_2_CG_C
  T_4_CG_B          T_4_CG_A          T_3_CG_A
  T_1_CG_C

As we can see, conflicts between groups over CPU resources now happen
all over the CPUs.

Testing with sysbench-memory shows a 30+% drop in share mode, and
redis-benchmark shows a 10+% drop too, compared to exclusive mode.

However, despite the performance drop, in the real world we still
prefer share mode. The fluctuating workload can make exclusive mode
very inefficient in CPU utilization. For example, in the next period,
when CG_A becomes 'idle', exclusive mode will look like:

  CPU_X (0%)        CPU_Y (100%)      CPU_Z (50%)
                    T_1_CG_B          T_1_CG_C
                    T_2_CG_B          T_2_CG_C
                    T_3_CG_B
                    T_4_CG_B

while share mode will look like:

  CPU_X (50%)       CPU_Y (50%)       CPU_Z (50%)
  T_2_CG_B          T_1_CG_C          T_3_CG_B
  T_4_CG_B          T_1_CG_B          T_2_CG_C

CPU_X is totally wasted in exclusive mode, so the resource efficiency
is really poor.

Thus what we need is a way to ease conflicts in share mode and make
groups as exclusive as possible, to gain both performance and resource
efficiency.

The main idea of the group balancer is to fulfill this requirement by
balancing groups of tasks among groups of CPUs. Consider this a dynamic
demi-exclusive mode. A task triggers work to settle its group into a
proper partition (the one with minimum predicted load), then tries to
migrate itself into it, so that groups gradually settle into the most
exclusive partitions.

Just like balancing tasks among CPUs, with GB a user can now put CPUs
X, Y and Z into three partitions and balance groups A, B and C across
these partitions, to make them as exclusive as possible.

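The "minimum predicted load" choice mentioned above can be illustrated
with a small userspace sketch. This is an assumption about the
selection rule only, not the implementation in kernel/sched/gb.c:
struct part_model and pick_min_load_part() are hypothetical names, and
only the predict_load field name is borrowed from struct gb_part.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified view of one partition: only its predicted load. */
struct part_model {
	uint64_t predict_load;
};

/*
 * Return the index of the partition with the smallest predicted load,
 * or -1 when there are no partitions. Ties keep the lowest index.
 */
static int pick_min_load_part(const struct part_model *parts, size_t nr_part)
{
	int best = -1;
	size_t i;

	for (i = 0; i < nr_part; i++) {
		if (best < 0 || parts[i].predict_load < parts[best].predict_load)
			best = (int)i;
	}
	return best;
}
```

A real settle step would also have to honour the settle periods and
update the predicted load after each move; the sketch leaves both out.
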
GB can be seen as an optimization policy built on load balancing: it
obeys the main idea of load balancing and makes adjustments on top of
it.

How to use:

First, make sure the children of your cgroup enable both the "cpu" and
"cpuset" subsystems, because the group balancer gathers load info from
the task group and partition info from the cpuset group, so tasks
should stay in the same cgroup for both subsystems. Do this recursively
if necessary:

  echo "+cpu +cpuset" > $CGROUP_PATH/cgroup.subtree_control

To create partitions, for example run:

  echo "0-15;16-31;32-47;48-63;" > $CGROUP_PATH/cpuset.gb.partition

This creates 4 partitions containing CPUs 0-15, 16-31, 32-47 and 48-63
respectively.

Then enable GB for your cgroup:

  echo 200000 > $CGROUP_PATH/cpuset.gb.period_us

This enables GB for all descendants of $CGROUP_PATH, EXCEPT itself.

Testing Results:

To amplify the differences, we test on an ARM platform with 128 CPUs
and create 8 partitions according to cluster info.

Since we pick benchmarks that benefit from exclusive mode, this is more
of a functional test than a performance test, showing that GB helps win
back the performance.

Create 8 cgroups, each running 'sysbench memory --threads=16 run'. The
output of share mode is:

  events/s (eps): 4939865.8892
  events/s (eps): 4699033.0351
  events/s (eps): 4373262.0563
  events/s (eps): 3534852.1000
  events/s (eps): 4724359.4354
  events/s (eps): 3438985.1082
  events/s (eps): 3600268.9196
  events/s (eps): 3782130.8202

the output of gb mode is:

  events/s (eps): 4919287.0608
  events/s (eps): 5926525.9995
  events/s (eps): 4933459.3272
  events/s (eps): 5184040.1349
  events/s (eps): 5030940.1116
  events/s (eps): 5773255.0246
  events/s (eps): 4649109.0129
  events/s (eps): 5683217.7641

Create 4 cgroups, each running redis-server with 16 io threads, with 4
redis-benchmark instances per server. The average rps is:

                share mode    gb mode       gain

  PING_INLINE   45903.94      46529.9       1.36%
  PING_MBULK    48342.58      50881.99      5.25%
  SET           38681.95      42108.17      8.86%
  GET           46774.09      51067.99      9.18%
  INCR          46092.24      50543.98      9.66%
  LPUSH         41723.35      45464.73      8.97%
  RPUSH         42722.77      47667.76      11.57%
  LPOP          41010.75      45077.65      9.92%
  RPOP          43198.33      47248.05      9.37%
  SADD          44750.16      50253.79      12.30%
  HSET          44352         47940.12      8.09%
  SPOP          47436.64      51658.99      8.90%
  ZADD          43124.02      45992.96      6.65%
  ZPOPMIN       46854.7       51561.52      10.05%
  LPUSH         41723.35      45464.73      8.97%
  LRANGE_100    22411.47      23311.32      4.02%
  LRANGE_300    11323.8       11585.06      2.31%
  LRANGE_500    7516.12       7577.76       0.82%
  LRANGE_600    6632.1        6737.31       1.59%
  MSET          27945.01      29401.3       5.21%

Co-developed-by: Cruz Zhao
Signed-off-by: Cruz Zhao
Signed-off-by: Michael Wang
Co-developed-by: Tianchen Ding
Signed-off-by: Tianchen Ding
---
 include/linux/cpuset.h   |   5 +
 include/linux/sched.h    |   5 +
 include/linux/sched/gb.h |  10 +
 kernel/cgroup/cpuset.c   |  39 ++++
 kernel/sched/core.c      |   5 +
 kernel/sched/fair.c      |  26 ++-
 kernel/sched/gb.c        | 448 +++++++++++++++++++++++++++++++++++++++
 kernel/sched/sched.h     |  10 +
 8 files changed, 547 insertions(+), 1 deletion(-)

diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h
index d58e0476ee8e..3be2ab42bb98 100644
--- a/include/linux/cpuset.h
+++ b/include/linux/cpuset.h
@@ -178,6 +178,11 @@ static inline void
set_mems_allowed(nodemask_t nodemask)
 	task_unlock(current);
 }
 
+#ifdef CONFIG_GROUP_BALANCER
+struct gb_info *task_gi(struct task_struct *p, bool get_controller);
+cpumask_var_t *gi_cpus(struct gb_info *gi);
+#endif
+
 #else /* !CONFIG_CPUSETS */
 
 static inline bool cpusets_enabled(void) { return false; }
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 5e0b5b4a4c8f..fca655770925 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1289,6 +1289,11 @@ struct task_struct {
 	unsigned long numa_pages_migrated;
 #endif /* CONFIG_NUMA_BALANCING */
 
+#ifdef CONFIG_GROUP_BALANCER
+	u64 gb_stamp;
+	struct callback_head gb_work;
+#endif
+
 #ifdef CONFIG_RSEQ
 	struct rseq __user *rseq;
 	u32 rseq_sig;
diff --git a/include/linux/sched/gb.h b/include/linux/sched/gb.h
index b03a2b4ef4b7..7af91662b740 100644
--- a/include/linux/sched/gb.h
+++ b/include/linux/sched/gb.h
@@ -9,10 +9,14 @@
 
 #include
 #include
+#include
+#include
 
 #define CPU_PART_MAX 32
 struct gb_part {
 	int id;
+	unsigned int mgrt_on;
+	u64 predict_load;
 	struct cpumask cpus;
 };
 
@@ -41,6 +45,12 @@ struct gb_info {
 
 	/* Stamp to record next settling. */
 	unsigned long settle_next;
+
+	/*
+	 * Record preferred partition (in part_info of controller) of this cgroup.
+	 * Default -1.
+	 */
+	int gb_prefer;
 };
 
 #define for_each_gbpart(i, pi) \
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 4b456b379b87..de13c22c1921 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -235,6 +235,20 @@ static struct gb_info *css_gi(struct cgroup_subsys_state *css, bool get_controll
 
 	return get_controller ? cs->control_gi : &cs->gi;
 }
+
+struct gb_info *task_gi(struct task_struct *p, bool get_controller)
+{
+	struct cpuset *cs = task_cs(p);
+
+	return cs ?
css_gi(&cs->css, get_controller) : NULL;
+}
+
+cpumask_var_t *gi_cpus(struct gb_info *gi)
+{
+	struct cpuset *cs = container_of(gi, struct cpuset, gi);
+
+	return &cs->effective_cpus;
+}
 #else
 static inline struct gb_info *cs_gi(struct cpuset *cs)
 {
@@ -2572,6 +2586,21 @@ static inline void update_child_gb_controller(struct cpuset *cs, struct gb_info
 	rcu_read_unlock();
 }
 
+static inline void reset_child_gb_prefer(struct cpuset *cs)
+{
+	struct cpuset *cp;
+	struct cgroup_subsys_state *pos_css;
+
+	rcu_read_lock();
+	cpuset_for_each_descendant_pre(cp, pos_css, cs) {
+		if (cp == cs)
+			continue;
+
+		cs_gi(cp)->gb_prefer = -1;
+	}
+	rcu_read_unlock();
+}
+
 static inline void update_gb_part_info(struct cpuset *cs, struct gb_part_info *new)
 {
 	struct gb_info *gi = cs_gi(cs);
@@ -2606,6 +2635,7 @@ static inline void init_gb(struct cpuset *cs)
 
 	gi->settle_period = msecs_to_jiffies(sysctl_gb_settle_period);
 	gi->settle_next = jiffies + gi->settle_period;
+	gi->gb_prefer = -1;
 
 	/* Search upwards to find any existing controller. */
 	for (css = cs->css.parent; css; css = css->parent) {
@@ -2637,6 +2667,7 @@ static int gb_period_write_u64(struct cgroup_subsys_state *css, struct cftype *c
 	struct cpuset *cs = css_cs(css);
 	struct gb_info *gi = cs_gi(cs);
 	int retval = -EINVAL;
+	bool reset;
 
 	if (val > MAX_GB_PERIOD)
 		return retval;
@@ -2650,9 +2681,17 @@ static int gb_period_write_u64(struct cgroup_subsys_state *css, struct cftype *c
 	if (!gi->part_info)
 		goto out_unlock;
 
+	/* gb_period = 0 means disabling group balancer.
*/
+	reset = gi->gb_period && !val;
+
 	gi->gb_period = val * NSEC_PER_USEC;
 	retval = 0;
 
+	if (reset) {
+		synchronize_rcu();
+		reset_child_gb_prefer(cs);
+	}
+
 out_unlock:
 	percpu_up_write(&cpuset_rwsem);
 	return retval;
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 5e5365755bdd..d57d6210bcfc 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4484,6 +4484,11 @@ int sched_fork(unsigned long clone_flags, struct task_struct *p)
 #ifdef CONFIG_SMP
 	plist_node_init(&p->pushable_tasks, MAX_PRIO);
 	RB_CLEAR_NODE(&p->pushable_dl_tasks);
+#endif
+#ifdef CONFIG_GROUP_BALANCER
+	p->gb_stamp = 0;
+	p->gb_work.next = &p->gb_work;
+	init_task_work(&p->gb_work, group_balancing_work);
 #endif
 	return 0;
 }
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 10c8dc9b7494..c1bd3f62e39c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1609,6 +1609,9 @@ static void update_numa_stats(struct task_numa_env *env,
 			    !cpumask_test_cpu(cpu, env->p->cpus_ptr))
 				continue;
 
+			if (group_hot(env->p, env->src_cpu, cpu) == 0)
+				continue;
+
 			if (ns->idle_cpu == -1)
 				ns->idle_cpu = cpu;
 
@@ -1944,6 +1947,9 @@ static void task_numa_find_cpu(struct task_numa_env *env,
 		if (!cpumask_test_cpu(cpu, env->p->cpus_ptr))
 			continue;
 
+		if (group_hot(env->p, env->src_cpu, cpu) == 0)
+			continue;
+
 		env->dst_cpu = cpu;
 		if (task_numa_compare(env, taskimp, groupimp, maymove))
 			break;
@@ -6050,8 +6056,11 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p,
 		target = wake_affine_weight(sd, p, this_cpu, prev_cpu, sync);
 
 	schedstat_inc(p->stats.nr_wakeups_affine_attempts);
-	if (target == nr_cpumask_bits)
+	if (target == nr_cpumask_bits) {
+		if (group_hot(p, prev_cpu, this_cpu) == 1)
+			return this_cpu;
 		return prev_cpu;
+	}
 
 	schedstat_inc(sd->ttwu_move_affine);
 	schedstat_inc(p->stats.nr_wakeups_affine);
@@ -7846,6 +7855,10 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
 		return 0;
 	}
 
+	tsk_cache_hot = group_hot(p, env->src_cpu, env->dst_cpu);
+	if (tsk_cache_hot != -1)
+		return tsk_cache_hot;
+
 	/*
 	 * Aggressive migration if:
 	 * 1) active balance
@@ -8266,6 +8279,15 @@ static unsigned long task_h_load(struct task_struct *p)
 	return div64_ul(p->se.avg.load_avg * cfs_rq->h_load,
 			cfs_rq_load_avg(cfs_rq) + 1);
 }
+
+#ifdef CONFIG_GROUP_BALANCER
+unsigned long cfs_h_load(struct cfs_rq *cfs_rq)
+{
+	update_cfs_rq_h_load(cfs_rq);
+	return cfs_rq->h_load;
+}
+#endif
+
 #else
 static bool __update_blocked_fair(struct rq *rq, bool *done)
 {
@@ -11213,6 +11235,8 @@ static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
 	if (static_branch_unlikely(&sched_numa_balancing))
 		task_tick_numa(rq, curr);
 
+	task_tick_gb(rq, curr);
+
 	update_misfit_status(curr, rq);
 	update_overutilized_status(task_rq(curr));
 
diff --git a/kernel/sched/gb.c b/kernel/sched/gb.c
index 4a2876ec1cca..f7da96253ad0 100644
--- a/kernel/sched/gb.c
+++ b/kernel/sched/gb.c
@@ -8,7 +8,455 @@
 
 #include "sched.h"
 
+#define DECAY_PERIOD HZ
+
 /* Params about settle period.
(ms) */
 unsigned int sysctl_gb_settle_period = 200;
 unsigned int sysctl_gb_settle_period_max = 10000;
 unsigned int sysctl_gb_global_settle_period = 200;
+
+static unsigned long global_settle_next;
+static unsigned long predict_period_stamp;
+DEFINE_SPINLOCK(settle_lock);
+
+static inline int ctop(struct gb_part_info *pi, int cpu)
+{
+	return pi->ctop[cpu];
+}
+
+static u64 tg_load_of_part(struct gb_part_info *pi, struct task_group *tg, int id)
+{
+	int i;
+	u64 load = 0;
+
+	for_each_cpu(i, part_cpus(pi, id))
+		load += cfs_h_load(tg->cfs_rq[i]);
+
+	return load;
+}
+
+static u64 load_of_part(struct gb_part_info *pi, int id)
+{
+	int i;
+	u64 load = 0;
+
+	for_each_cpu(i, part_cpus(pi, id))
+		load += cpu_rq(i)->cfs.avg.load_avg;
+
+	return load;
+}
+
+static inline int part_mgrt_lock(struct gb_part_info *pi, int src, int dst)
+{
+	struct gb_part *src_part, *dst_part;
+
+	if (src != -1) {
+		src_part = &pi->parts[src];
+		if (READ_ONCE(src_part->mgrt_on))
+			return 0;
+	}
+
+	if (dst != -1) {
+		dst_part = &pi->parts[dst];
+		if (READ_ONCE(dst_part->mgrt_on))
+			return 0;
+	}
+
+	if (src != -1 && xchg(&src_part->mgrt_on, 1))
+		return 0;
+
+	if (dst != -1 && xchg(&dst_part->mgrt_on, 1)) {
+		WRITE_ONCE(src_part->mgrt_on, 0);
+		return 0;
+	}
+
+	return 1;
+}
+
+static inline void part_mgrt_unlock(struct gb_part_info *pi, int src, int dst)
+{
+	struct gb_part *src_part, *dst_part;
+
+	if (src != -1) {
+		src_part = &pi->parts[src];
+		WRITE_ONCE(src_part->mgrt_on, 0);
+	}
+
+	if (dst != -1) {
+		dst_part = &pi->parts[dst];
+		WRITE_ONCE(dst_part->mgrt_on, 0);
+	}
+}
+
+static u64 cap_of_part(struct gb_part_info *pi, int id)
+{
+	int i;
+	u64 cap = 0;
+
+	for_each_cpu(i, part_cpus(pi, id))
+		cap += cpu_rq(i)->cpu_capacity;
+
+	return cap;
+}
+
+static inline void predict_load_add(struct gb_part_info *pi, int id, u64 load)
+{
+	pi->parts[id].predict_load += load;
+}
+
+static void predict_load_decay(struct gb_part_info *pi)
+{
+	int i,
fact, passed;
+
+	passed = jiffies - global_settle_next + predict_period_stamp;
+	predict_period_stamp = passed % DECAY_PERIOD;
+	fact = passed / DECAY_PERIOD;
+
+	if (!fact)
+		return;
+
+	for_each_gbpart(i, pi) {
+		struct gb_part *gp = &pi->parts[i];
+
+		/*
+		 * Decay NICE_0_LOAD into zero after 10 seconds
+		 */
+		if (fact > 10)
+			gp->predict_load = 0;
+		else
+			gp->predict_load >>= fact;
+	}
+}
+
+static int try_to_settle(struct gb_part_info *pi, struct gb_info *gi, struct task_group *tg)
+{
+	int i, src, dst, ret;
+	u64 mgrt_load, tg_load, min_load, src_load, dst_load;
+	cpumask_var_t *effective_cpus = gi_cpus(gi);
+
+	src = dst = -1;
+	min_load = U64_MAX;
+	tg_load = 0;
+	for_each_gbpart(i, pi) {
+		u64 mgrt, load;
+
+		/* DO NOT settle in parts out of effective_cpus. */
+		if (!cpumask_intersects(part_cpus(pi, i), effective_cpus[0]))
+			continue;
+
+		mgrt = tg_load_of_part(pi, tg, i);
+		load = load_of_part(pi, i);
+		/* load after migration */
+		if (load > mgrt)
+			load -= mgrt;
+		else
+			load = 0;
+
+		/*
+		 * Try to find out the partition contain
+		 * minimum load, and the load of this task
+		 * group is excluded on comparison.
+		 *
+		 * This help to prevent that a partition
+		 * full of the tasks from this task group was
+		 * considered as busy.
+		 *
+		 * As for the prediction load, the partition
+		 * this group preferred will be excluded, since
+		 * these prediction load could be introduced by
+		 * itself.
+		 *
+		 * This is not precise, but it serves the idea
+		 * to prefer a partition as long as possible,
+		 * to save the cost of resettle as much as
+		 * possible.
+		 */
+		if (i == gi->gb_prefer) {
+			src = i;
+			src_load = load + mgrt;
+			mgrt_load = mgrt;
+		} else
+			load += pi->parts[i].predict_load;
+
+		if (load < min_load) {
+			dst = i;
+			min_load = load;
+			dst_load = load + mgrt;
+		}
+
+		tg_load += mgrt;
+	}
+
+	if (!tg_load)
+		return 0;
+
+	ret = 0;
+	gi->settle_period *= 2;
+	if (src == -1) {
+		/* First settle */
+		gi->gb_prefer = dst;
+		predict_load_add(pi, dst, tg_load);
+		ret = 1;
+	} else if (src != dst) {
+		/* Resettle will cost, be careful */
+		long dst_imb, src_imb, dst_cap, src_cap;
+
+		src_cap = cap_of_part(pi, src);
+		dst_cap = cap_of_part(pi, dst);
+
+		/*
+		 *   src_load        dst_load
+		 * ------------ vs ---------
+		 * src_capacity    dst_capacity
+		 *
+		 * Should not cause further imbalancing after
+		 * resettle.
+		 */
+		src_imb = abs(src_load * dst_cap - dst_load * src_cap);
+		dst_imb = abs((src_load - mgrt_load) * dst_cap - (dst_load + mgrt_load) * src_cap);
+
+		if (dst_imb <= src_imb) {
+			gi->gb_prefer = dst;
+			predict_load_add(pi, dst, tg_load);
+			gi->settle_period = msecs_to_jiffies(sysctl_gb_settle_period) * 2;
+			ret = 1;
+		}
+	}
+
+	if (gi->settle_period > msecs_to_jiffies(sysctl_gb_settle_period_max))
+		gi->settle_period = msecs_to_jiffies(sysctl_gb_settle_period_max);
+
+	return ret;
+}
+
+/*
+ * group_hot() will tell us which cpu is contained in the
+ * preferred CPU partition of the task group of a task.
+ *
+ * return 1 if prefer dst_cpu
+ * return 0 if prefer src_cpu
+ * return -1 if prefer either or neither
+ */
+int group_hot(struct task_struct *p, int src_cpu, int dst_cpu)
+{
+	int ret = -1;
+	struct task_group *tg;
+	struct gb_info *gi, *control_gi;
+	struct gb_part_info *pi;
+
+	rcu_read_lock();
+
+	control_gi = task_gi(p, true);
+
+	if (!control_gi || !control_gi->gb_period)
+		goto out_unlock;
+
+	gi = task_gi(p, false);
+	pi = control_gi->part_info;
+	tg = task_group(p);
+	if (gi->gb_prefer != -1 && ctop(pi, src_cpu) != ctop(pi, dst_cpu))
+		ret = (gi->gb_prefer == ctop(pi, dst_cpu));
+
+out_unlock:
+	rcu_read_unlock();
+	return ret;
+}
+
+void task_tick_gb(struct rq *rq, struct task_struct *curr)
+{
+	struct callback_head *work = &curr->gb_work;
+	struct gb_info *gi, *control_gi;
+	struct gb_part_info *pi;
+	u64 period, now;
+
+	if ((curr->flags & (PF_EXITING | PF_KTHREAD)) || work->next != work)
+		return;
+
+	rcu_read_lock();
+
+	control_gi = task_gi(curr, true);
+
+	if (!control_gi || !control_gi->gb_period)
+		goto unlock;
+
+	gi = task_gi(curr, false);
+	pi = control_gi->part_info;
+
+	/* Save it when already satisfied.
*/
+	if (gi->gb_prefer != -1 &&
+	    gi->gb_prefer == ctop(pi, task_cpu(curr)))
+		goto unlock;
+
+	now = curr->se.sum_exec_runtime;
+	period = control_gi->gb_period;
+
+	if (now > curr->gb_stamp + period) {
+		curr->gb_stamp = now;
+		task_work_add(curr, work, TWA_RESUME);
+	}
+
+unlock:
+	rcu_read_unlock();
+}
+
+void group_balancing_work(struct callback_head *work)
+{
+	int cpu, this_cpu, this_part, this_prefer, best_cpu;
+	struct task_group *this_tg;
+	struct gb_info *this_gi, *control_gi;
+	struct gb_part_info *pi;
+	struct task_struct *best_task;
+	struct cpumask cpus;
+
+	SCHED_WARN_ON(current != container_of(work, struct task_struct, gb_work));
+
+	work->next = work;
+	if (current->flags & PF_EXITING)
+		return;
+
+	rcu_read_lock();
+	/*
+	 * We build group balancer on "cpuset" subsys, and gather load info from
+	 * "cpu" subsys. So we need to ensure these two subsys belonging to
+	 * the same cgroup.
+	 */
+	if (task_cgroup(current, cpuset_cgrp_id) != task_group(current)->css.cgroup)
+		goto unlock;
+
+	control_gi = task_gi(current, true);
+
+	if (!control_gi || !control_gi->gb_period)
+		goto unlock;
+
+	this_gi = task_gi(current, false);
+	pi = control_gi->part_info;
+	this_tg = task_group(current);
+
+	/*
+	 * Settle task group one-by-one help prevent the
+	 * situation when multiple group try to settle the
+	 * same partition at the same time.
+	 *
+	 * However, when bunch of groups trying to settle at
+	 * the same time, there are no guarantee on the
+	 * fairness, some of them may get more chances and
+	 * settle sooner than the others.
+	 *
+	 * So one trick here is to grow the cg_settle_period
+	 * of settled group, to make sure they yield the
+	 * next chances to others.
+	 *
+	 * Another trick here is about prediction, as settle
+	 * group will followed by bunch of task migration,
+	 * the current load of CPU partition can't imply it's
+	 * busyness in future, and we may pick a busy one in
+	 * the end.
+	 *
+	 * Thus we maintain the predict load after each settle,
+	 * so next try will be able to do the prediction and
+	 * avoid to pick those which is already busy enough.
+	 */
+	if (spin_trylock(&settle_lock)) {
+		if (time_after(jiffies, global_settle_next) &&
+		    time_after(jiffies, this_gi->settle_next)) {
+			predict_load_decay(pi);
+
+			global_settle_next = jiffies;
+			if (try_to_settle(pi, this_gi, this_tg))
+				global_settle_next +=
+					msecs_to_jiffies(sysctl_gb_global_settle_period);
+
+			this_gi->settle_next = jiffies + this_gi->settle_period;
+		}
+		spin_unlock(&settle_lock);
+	}
+
+	this_cpu = task_cpu(current);
+	this_prefer = this_gi->gb_prefer;
+	this_part = ctop(pi, this_cpu);
+	if (this_prefer == -1 ||
+	    this_part == this_prefer ||
+	    !part_mgrt_lock(pi, this_part, this_prefer)) {
+		goto unlock;
+	}
+
+	cpumask_copy(&cpus, part_cpus(pi, this_prefer));
+
+	/*
+	 * We arrived here when current task A don't prefer
+	 * it's current CPU, but prefer CPUs of partition Y.
+	 *
+	 * In other word, if task A could run on CPUs of
+	 * partition Y, it have good chance to reduce conflict
+	 * with the tasks from other groups, and share hot
+	 * cache with the tasks from the same group.
+	 *
+	 * So here is the main logical of group balancer to
+	 * achieve it's purpose, make sure groups of tasks
+	 * are balanced into groups of CPUs.
+	 *
+	 * If A's current CPU belong to CPU partition X,
+	 * try to find a CPU from partition Y, which is
+	 * running a task prefer partition X, and swap them.
+	 *
+	 * Otherwise, or if can't find such CPU and task,
+	 * just find an idle CPU from partition Y, and do
+	 * migration.
+	 *
+	 * Ideally the migration and swap work will finally
+	 * put the tasks into right places, but the wakeup
+	 * stuff can easily break that by locate an idle CPU
+	 * out of the range.
+	 *
+	 * However, since the whole idea is to gain cache
+	 * benefit and reduce conflict between groups, if
+	 * there are enough idle CPU out there then every
+	 * thing just fine, so let it go.
+	 */
+
+	best_cpu = -1;
+	best_task = NULL;
+	for_each_cpu_and(cpu, &cpus, current->cpus_ptr) {
+		struct task_struct *p;
+
+		WARN_ON(cpu == this_cpu);
+
+		if (available_idle_cpu(cpu)) {
+			best_cpu = cpu;
+			continue;
+		}
+
+		if (this_part == -1)
+			continue;
+
+		p = rcu_dereference(cpu_rq(cpu)->curr);
+
+		if (!p || (p->flags & PF_EXITING) || is_idle_task(p)
+		    || p == current || task_gi(p, true) != control_gi)
+			continue;
+
+		if (task_cpu(p) == cpu &&
+		    this_part == task_gi(p, false)->gb_prefer &&
+		    cpumask_test_cpu(this_cpu, p->cpus_ptr)) {
+			get_task_struct(p);
+			best_task = p;
+			best_cpu = cpu;
+			break;
+		}
+	}
+
+	rcu_read_unlock();
+
+	if (best_task) {
+		migrate_swap(current, best_task, best_cpu, this_cpu);
+		put_task_struct(best_task);
+	} else if (best_cpu != -1) {
+		migrate_task_to(current, best_cpu);
+	}
+
+	part_mgrt_unlock(pi, this_part, this_prefer);
+	return;
+
+unlock:
+	rcu_read_unlock();
+}
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index d6555aacdf78..97c621bcded3 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -3148,3 +3148,13 @@ extern int sched_dynamic_mode(const char *str);
 extern void sched_dynamic_update(int mode);
 #endif
 
+#ifdef CONFIG_GROUP_BALANCER
+extern unsigned long cfs_h_load(struct cfs_rq *cfs_rq);
+extern int group_hot(struct task_struct *p, int src_cpu, int dst_cpu);
+extern void task_tick_gb(struct rq *rq, struct task_struct *curr);
+extern void group_balancing_work(struct callback_head *work);
+#else
+static inline int group_hot(struct task_struct *p, int src_cpu, int dst_cpu) { return -1; }
+static inline void task_tick_gb(struct rq *rq, struct task_struct *curr) {};
+static inline void group_balancing_work(struct callback_head *work) {};
+#endif
-- 
2.27.0

From nobody
Tue Jun 23 09:09:16 2026 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 049EBC433EF for ; Tue, 8 Mar 2022 09:27:18 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1345369AbiCHJ2L (ORCPT ); Tue, 8 Mar 2022 04:28:11 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:54190 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1345366AbiCHJ1w (ORCPT ); Tue, 8 Mar 2022 04:27:52 -0500 Received: from out30-130.freemail.mail.aliyun.com (out30-130.freemail.mail.aliyun.com [115.124.30.130]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 03D3F4132F; Tue, 8 Mar 2022 01:26:53 -0800 (PST) X-Alimail-AntiSpam: AC=PASS;BC=-1|-1;BR=01201311R121e4;CH=green;DM=||false|;DS=||;FP=0|-1|-1|-1|0|-1|-1|-1;HT=e01e04395;MF=dtcccc@linux.alibaba.com;NM=1;PH=DS;RN=28;SR=0;TI=SMTPD_---0V6e-tlT_1646731606; Received: from localhost.localdomain(mailfrom:dtcccc@linux.alibaba.com fp:SMTPD_---0V6e-tlT_1646731606) by smtp.aliyun-inc.com(127.0.0.1); Tue, 08 Mar 2022 17:26:48 +0800 From: Tianchen Ding To: Zefan Li , Ingo Molnar , Peter Zijlstra , Juri Lelli , Vincent Guittot , Dietmar Eggemann , Steven Rostedt , Ben Segall , Mel Gorman , Daniel Bristot de Oliveira , Tejun Heo , Johannes Weiner , Tianchen Ding , Michael Wang , Cruz Zhao , Masahiro Yamada , Nathan Chancellor , Kees Cook , Andrew Morton , Vlastimil Babka , "Gustavo A. R. 
Silva", Arnd Bergmann, Miguel Ojeda, Chris Down, Vipin Sharma, Daniel Borkmann
Cc: linux-kernel@vger.kernel.org, cgroups@vger.kernel.org
Subject: [RFC PATCH v2 4/4] cpuset, gb: Add stat for group balancer
Date: Tue, 8 Mar 2022 17:26:29 +0800
Message-Id: <20220308092629.40431-5-dtcccc@linux.alibaba.com>
X-Mailer: git-send-email 2.27.0
In-Reply-To: <20220308092629.40431-1-dtcccc@linux.alibaba.com>
References: <20220308092629.40431-1-dtcccc@linux.alibaba.com>
MIME-Version: 1.0
Content-Transfer-Encoding: quoted-printable
Precedence: bulk
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org
Content-Type: text/plain; charset="utf-8"

When the group balancer is enabled by:

  echo 200000 > $CGROUP_PATH/cpuset.gb.period_us

you can then check:

  $CPU_CGROUP_PATH/childX/cpuset.gb.stat

which gives output such as:

  PART-0 0-15 1008 1086 *
  PART-1 16-31 0 2
  PART-2 32-47 0 0
  PART-3 48-63 0 1024

Each line shows the partition ID, followed by its CPU range, the load of the group, the load of the partition, and a star marking the preferred partition.

Signed-off-by: Tianchen Ding
---
 include/linux/sched/gb.h |  2 ++
 kernel/cgroup/cpuset.c   | 24 ++++++++++++++++++++++++
 kernel/sched/gb.c        | 25 +++++++++++++++++++++++++
 3 files changed, 51 insertions(+)

diff --git a/include/linux/sched/gb.h b/include/linux/sched/gb.h
index 7af91662b740..ec5a97d8160a 100644
--- a/include/linux/sched/gb.h
+++ b/include/linux/sched/gb.h
@@ -63,6 +63,8 @@ static inline struct cpumask *part_cpus(struct gb_part_info *pi, int id)
 
 #ifdef CONFIG_GROUP_BALANCER
 extern unsigned int sysctl_gb_settle_period;
+int gb_stat_show(struct seq_file *sf, struct cgroup_subsys_state *css,
+		 struct gb_info *gi, struct gb_part_info *pi);
 #endif
 
 #endif
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index de13c22c1921..035606e8fa95 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -2863,6 +2863,24 @@ static ssize_t gb_partition_write(struct kernfs_open_file *of, char *buf,
 	cpus_read_unlock();
 	return retval ?: nbytes;
 }
+
+static int
gb_stat_seq_show(struct seq_file *sf, void *v)
+{
+	struct cgroup_subsys_state *css = seq_css(sf);
+	struct gb_info *control_gi;
+	int retval = -EINVAL;
+
+	rcu_read_lock();
+	control_gi = css_gi(css, true);
+	if (!control_gi || !control_gi->gb_period)
+		goto out_unlock;
+
+	retval = gb_stat_show(sf, css, css_gi(css, false), control_gi->part_info);
+
+out_unlock:
+	rcu_read_unlock();
+	return retval;
+}
 #else
 static inline void init_gb(struct cpuset *cs) { }
 static inline void remove_gb(struct cpuset *cs) { }
@@ -3179,6 +3197,12 @@ static struct cftype dfl_files[] = {
 		.max_write_len = (100U + 6 * NR_CPUS),
 		.private = FILE_GB_CPULIST,
 	},
+
+	{
+		.name = "gb.stat",
+		.seq_show = gb_stat_seq_show,
+		.flags = CFTYPE_NOT_ON_ROOT,
+	},
 #endif
 	{ }	/* terminate */
 };
diff --git a/kernel/sched/gb.c b/kernel/sched/gb.c
index f7da96253ad0..8ae1db83b587 100644
--- a/kernel/sched/gb.c
+++ b/kernel/sched/gb.c
@@ -46,6 +46,31 @@ static u64 load_of_part(struct gb_part_info *pi, int id)
 	return load;
 }
 
+int gb_stat_show(struct seq_file *sf, struct cgroup_subsys_state *css,
+		 struct gb_info *gi, struct gb_part_info *pi)
+{
+	struct cgroup_subsys_state *tg_css;
+	struct task_group *tg;
+	int i;
+
+	tg_css = cgroup_e_css(css->cgroup, &cpu_cgrp_subsys);
+	/* Make sure that "cpu" and "cpuset" subsys belonging to the same cgroup. */
+	if (tg_css->cgroup != css->cgroup)
+		return -EINVAL;
+	tg = container_of(tg_css, struct task_group, css);
+
+	for_each_gbpart(i, pi) {
+		seq_printf(sf, "PART-%d ", i);
+		seq_printf(sf, "%*pbl ", cpumask_pr_args(part_cpus(pi, i)));
+		seq_printf(sf, "%llu ", tg_load_of_part(pi, tg, i));
+		seq_printf(sf, "%llu ", load_of_part(pi, i));
+		if (gi->gb_prefer == i)
+			seq_puts(sf, " *");
+		seq_putc(sf, '\n');
+	}
+	return 0;
+}
+
 static inline int part_mgrt_lock(struct gb_part_info *pi, int src, int dst)
 {
 	struct gb_part *src_part, *dst_part;
-- 
2.27.0