From nobody Wed Dec 31 03:44:19 2025 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 8B270C4332F for ; Thu, 9 Nov 2023 00:25:47 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S231455AbjKIAZd (ORCPT ); Wed, 8 Nov 2023 19:25:33 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:48166 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S229473AbjKIAZa (ORCPT ); Wed, 8 Nov 2023 19:25:30 -0500 Received: from mail-pg1-x542.google.com (mail-pg1-x542.google.com [IPv6:2607:f8b0:4864:20::542]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 65622268D; Wed, 8 Nov 2023 16:25:28 -0800 (PST) Received: by mail-pg1-x542.google.com with SMTP id 41be03b00d2f7-5bdc185c449so219684a12.0; Wed, 08 Nov 2023 16:25:28 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20230601; t=1699489528; x=1700094328; darn=vger.kernel.org; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:from:to:cc:subject:date :message-id:reply-to; bh=x+vLMcQqPa9/uA9h+Ooe3B/NT+SKXukU8inBObWmrPM=; b=DMtTE/fxZKcO0Pu4KdZbzOyX4CRAGb7abzyRvRYcIS2rhep/xGRNBVn10foSlKfPOC g2Z2vcQ8WOuZLVzknRhmP10XPPrf3ARuHeF3zFOjq36ZlukrT4eOBxwWA0ujx86yRfsr cpD1f5kH+/yzkGFQHRHPaoqQ/NPpYLd+wfcwF0vvDRO1n7ZBx8CW4gNUlxSlSvdjKnWn fxX3/tASSkkap531Gx4TxVTeSommGgZJttLtCIUylNqXJT6aYBgvPg5GB6cSRqj5CgfN rtRLb4oOc/ssnpRf9ErrRtTdZ6mcN3lUuRrZkX7jgtXetx4Z95o5vGuJYgSha6DCcGr1 obIQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1699489528; x=1700094328; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=x+vLMcQqPa9/uA9h+Ooe3B/NT+SKXukU8inBObWmrPM=; b=DaulEHck6Sljv37jBF4U4VDdydWg7AtuwwO9ZKl+kxJl8m/52gQ/yw5wQneNvEPfVh M7rLjoYIdTLFWVvR/PAqOTQ12C1i0kJQE8lPLZaYgSE7Z1s0uJvD8fWXQRyEKj9NSGeA 3i+eJp7Jp8H2fC8DpWqNNkfK6pfkGBJ/tQNrUyhQUEFQzG0k0eZnaslVC4tpHqIMrj0D boLPG4v9anBceZ7J26WPWMqwKkBciKesrs2B4OgB8OQU7/iSNeuU0D+4hiWdvQMygw52 2vl0cvHzUjj3gfbKI/KtVxvPrK33Nu1vXaE+MJKuu3v2goQoU+oSD1CVQtwxIa38m9vk kLvQ== X-Gm-Message-State: AOJu0YxViwmZlFPYwwjYLzzv1BOcq0JCu7tlufZUIY8bK50rr7jsCi+C QhTZtMc07E/l24Ic2EW/5qdNUgQU+lJl X-Google-Smtp-Source: AGHT+IFJa3XqHfr7w43VvBJ++JO55x37+essItDkdmOarWpBCb4W6tU0UNUrxGBma9nAsrqSWjotag== X-Received: by 2002:a05:6a20:7f93:b0:180:f0ed:2992 with SMTP id d19-20020a056a207f9300b00180f0ed2992mr2006580pzj.51.1699489527665; Wed, 08 Nov 2023 16:25:27 -0800 (PST) Received: from fedora.mshome.net (pool-173-79-56-208.washdc.fios.verizon.net. [173.79.56.208]) by smtp.gmail.com with ESMTPSA id b10-20020a170902a9ca00b001bc21222e34sm2219073plr.285.2023.11.08.16.25.25 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Wed, 08 Nov 2023 16:25:27 -0800 (PST) From: Gregory Price X-Google-Original-From: Gregory Price To: linux-kernel@vger.kernel.org Cc: linux-cxl@vger.kernel.org, linux-mm@kvack.org, cgroups@vger.kernel.org, linux-doc@vger.kernel.org, ying.huang@intel.com, akpm@linux-foundation.org, mhocko@kernel.org, tj@kernel.org, lizefan.x@bytedance.com, hannes@cmpxchg.org, corbet@lwn.net, roman.gushchin@linux.dev, shakeelb@google.com, muchun.song@linux.dev, Gregory Price Subject: [RFC PATCH v4 1/3] mm/memcontrol: implement memcg.interleave_weights Date: Wed, 8 Nov 2023 19:25:15 -0500 Message-Id: <20231109002517.106829-2-gregory.price@memverge.com> X-Mailer: git-send-email 2.39.1 In-Reply-To: <20231109002517.106829-1-gregory.price@memverge.com> References: <20231109002517.106829-1-gregory.price@memverge.com> MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Type: text/plain; charset="utf-8" Create an RCU-protected array of unsigned char[MAX_NUMNODES] where interleave weights can be stored. The intent of these weights are to be used by mempolicy to implement weighted interleave for bandwidth optimization. Node weights assigned via cgroup/memory.interleave_weights Example: Set a 3:1 weighting ratio for nodes 0 and 1 respectively. echo 0:3 > cgroup/memory.interleave_weights echo 1:1 > cgroup/memory.interleave_weights Example output: cat cgroup/memory.interleave_weights 0:3,1:1 Child cgroups inherit parent interleave weights and may override them. To revert weights to inheriting from the parent, write "-1:0" Example: echo -1:0 > cgroup/memory.interleave_weights Signed-off-by: Gregory Price --- include/linux/memcontrol.h | 31 +++++++ mm/memcontrol.c | 172 +++++++++++++++++++++++++++++++++++++ 2 files changed, 203 insertions(+) diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index e4e24da16d2c..338a9dcda446 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -21,6 +21,8 @@ #include #include #include +#include +#include =20 struct mem_cgroup; struct obj_cgroup; @@ -167,6 +169,15 @@ struct mem_cgroup_thresholds { struct mem_cgroup_threshold_ary *spare; }; =20 +/* For mempolicy information */ +struct mem_cgroup_mempolicy { + /* + * When interleaving is applied, do allocations on each node by the + * weight value. Size is always MAX_NUMNODES. Protected by RCU. + */ + unsigned char *il_weights; +}; + /* * Remember four most recent foreign writebacks with dirty pages in this * cgroup. Inode sharing is expected to be uncommon and, even if we miss @@ -265,6 +276,12 @@ struct mem_cgroup { /* thresholds for mem+swap usage. RCU-protected */ struct mem_cgroup_thresholds memsw_thresholds; =20 + /* protect the mempolicy settings */ + struct mutex mempolicy_lock; + + /* mempolicy defaults for tasks */ + struct mem_cgroup_mempolicy mempolicy; + /* For oom notifier event fd */ struct list_head oom_notify; =20 @@ -1159,6 +1176,12 @@ unsigned long mem_cgroup_soft_limit_reclaim(pg_data_= t *pgdat, int order, gfp_t gfp_mask, unsigned long *total_scanned); =20 + +unsigned char mem_cgroup_get_il_weight(unsigned int nid); + +unsigned int mem_cgroup_get_il_weights(nodemask_t *nodes, + unsigned char *weights); + #else /* CONFIG_MEMCG */ =20 #define MEM_CGROUP_ID_SHIFT 0 @@ -1591,6 +1614,14 @@ unsigned long mem_cgroup_soft_limit_reclaim(pg_data_= t *pgdat, int order, { return 0; } + +static unsigned char mem_cgroup_get_il_weight(unsigned int nid) { return 0= ; } + +static unsigned int mem_cgroup_get_il_weights(nodemask_t *nodes, + unsigned char *weights) +{ + return 0; +} #endif /* CONFIG_MEMCG */ =20 static inline void __inc_lruvec_kmem_state(void *p, enum node_stat_item id= x) diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 5b009b233ab8..67e8c1767471 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -5319,6 +5319,7 @@ static struct mem_cgroup *mem_cgroup_alloc(void) INIT_WORK(&memcg->high_work, high_work_func); INIT_LIST_HEAD(&memcg->oom_notify); mutex_init(&memcg->thresholds_lock); + mutex_init(&memcg->mempolicy_lock); spin_lock_init(&memcg->move_lock); vmpressure_init(&memcg->vmpressure); INIT_LIST_HEAD(&memcg->event_list); @@ -7896,6 +7897,176 @@ static struct cftype zswap_files[] =3D { }; #endif /* CONFIG_MEMCG_KMEM && CONFIG_ZSWAP */ =20 +unsigned char mem_cgroup_get_il_weight(unsigned int nid) +{ + struct mem_cgroup *memcg; + unsigned char weight =3D 0; + unsigned char *weights; + + rcu_read_lock(); + memcg =3D mem_cgroup_from_task(current); + while (!mem_cgroup_is_root(memcg)) { + weights =3D rcu_dereference(memcg->mempolicy.il_weights); + if (weights) { + weight =3D weights[nid]; + break; + } + memcg =3D parent_mem_cgroup(memcg); + } + rcu_read_unlock(); + + return weight; +} + +unsigned int mem_cgroup_get_il_weights(nodemask_t *nodes, + unsigned char *weights) +{ + struct mem_cgroup *memcg; + unsigned char *memcg_weights; + unsigned int nid; + unsigned int total =3D 0; + unsigned char weight; + + rcu_read_lock(); + memcg =3D mem_cgroup_from_task(current); + while (memcg && !mem_cgroup_is_root(memcg)) { + memcg_weights =3D rcu_dereference(memcg->mempolicy.il_weights); + if (!memcg_weights) { + memcg =3D parent_mem_cgroup(memcg); + continue; + } + + for_each_node_mask(nid, *nodes) { + weight =3D memcg_weights[nid]; + weights[nid] =3D weight ? weight : 1; + total +=3D weights[nid]; + } + break; + } + rcu_read_unlock(); + + return total; +} + +static int mpol_ilw_show(struct seq_file *m, void *v) +{ + struct mem_cgroup *memcg; + unsigned char *weights; + unsigned int nid; + unsigned int count =3D 0; + + rcu_read_lock(); + memcg =3D mem_cgroup_from_seq(m); + + while (memcg && !mem_cgroup_is_root(memcg)) { + weights =3D rcu_dereference(memcg->mempolicy.il_weights); + if (weights) + break; + memcg =3D parent_mem_cgroup(memcg); + } + for_each_node(nid) { + seq_printf(m, "%s%d:%d", (count++ ? "," : ""), nid, + weights ? weights[nid] : 1); + } + seq_putc(m, '\n'); + rcu_read_unlock(); + + return 0; +} + +static ssize_t mpol_ilw_write(struct kernfs_open_file *of, char *buf, + size_t nbytes, loff_t off) +{ + struct mem_cgroup *memcg =3D mem_cgroup_from_css(of_css(of)); + struct mem_cgroup *pmcg; + unsigned char *new_weights =3D NULL, *old_weights =3D NULL; + int node; + unsigned char weight; + ssize_t ret; + char *sep =3D memchr(buf, ':', nbytes); + bool parent_weights =3D false; + + if (!sep || sep =3D=3D buf || sep =3D=3D (buf + nbytes - 1)) + return -EINVAL; + *sep =3D '\0'; + + ret =3D kstrtoint(buf, 10, &node); + if (ret) + return ret; + + ret =3D kstrtou8(sep + 1, 10, &weight); + if (ret) + return ret; + + /* + * if value is -1:0, clear weights and set pointer to NULL + * this allows the parent cgroup settings to take over + */ + if (node =3D=3D -1 && weight =3D=3D 0) + goto set_weights; + else if (node < 0) + return -EINVAL; + else if (node >=3D MAX_NUMNODES || weight =3D=3D 0) + return -EINVAL; + + new_weights =3D kzalloc(sizeof(unsigned char)*MAX_NUMNODES, GFP_KERNEL); + if (!new_weights) + return -ENOMEM; +set_weights: + /* acquire mutex and readlock so we can read from parents if needed */ + mutex_lock(&memcg->mempolicy_lock); + rcu_read_lock(); + old_weights =3D rcu_dereference(memcg->mempolicy.il_weights); + + /* If we're clearing the weights, don't bother looking at old ones */ + if (!new_weights) + goto swap_weights; + + /* Check for parent weights to inherit */ + pmcg =3D memcg; + while (!old_weights) { + pmcg =3D parent_mem_cgroup(pmcg); + + if (!pmcg || mem_cgroup_is_root(pmcg)) + break; + old_weights =3D rcu_dereference(pmcg->mempolicy.il_weights); + parent_weights =3D true; + } + + /* Copy the old weights or default all nodes to 1 */ + if (old_weights) + memcpy(new_weights, old_weights, + sizeof(unsigned char)*MAX_NUMNODES); + else + memset(new_weights, 1, + sizeof(unsigned char)*MAX_NUMNODES); + new_weights[node] =3D weight; + +swap_weights: + rcu_assign_pointer(memcg->mempolicy.il_weights, new_weights); + + rcu_read_unlock(); + synchronize_rcu(); + + /* If we are inheriting weights from the parent, do not free */ + if (old_weights && !parent_weights) + kfree(old_weights); + + mutex_unlock(&memcg->mempolicy_lock); + + return nbytes; +} + +static struct cftype mempolicy_files[] =3D { + { + .name =3D "interleave_weights", + .flags =3D CFTYPE_NOT_ON_ROOT, + .seq_show =3D mpol_ilw_show, + .write =3D mpol_ilw_write, + }, + { } /* terminate */ +}; + static int __init mem_cgroup_swap_init(void) { if (mem_cgroup_disabled()) @@ -7906,6 +8077,7 @@ static int __init mem_cgroup_swap_init(void) #if defined(CONFIG_MEMCG_KMEM) && defined(CONFIG_ZSWAP) WARN_ON(cgroup_add_dfl_cftypes(&memory_cgrp_subsys, zswap_files)); #endif + WARN_ON(cgroup_add_dfl_cftypes(&memory_cgrp_subsys, mempolicy_files)); return 0; } subsys_initcall(mem_cgroup_swap_init); --=20 2.39.1