From: YeeLi
To: akpm@linux-foundation.org, david@kernel.org, dan.j.williams@intel.com,
 ying.huang@linux.alibaba.com, linux-mm@kvack.org, joshua.hahnjy@gmail.com
Cc: linux-kernel@vger.kernel.org, Jonathan.Cameron@huawei.com,
 linux-cxl@vger.kernel.org, dave.jiang@intel.com, yeeli
Subject: [PATCH] mm/mempolicy: add sysfs interface to override NUMA node bandwidth
Date: Thu, 12 Mar 2026 17:12:07 +0800
Message-Id: <20260312091207.2016518-1-seven.yi.lee@gmail.com>
X-Mailer: git-send-email 2.34.1

From: yeeli

Automatic tuning for weighted interleaving [1] provides real benefits on
systems with CXL support. However, platforms that lack HMAT or CDAT
information cannot make use of this feature.

If the bandwidth reported by firmware or the device deviates from the
actual measured bandwidth, administrators also lack a clear way to adjust
the per-node weight values.

This patch introduces an optional Kconfig option,
CONFIG_NUMA_BW_MANUAL_OVERRIDE (default n), which exposes node bandwidth
R/W sysfs attributes under:

  /sys/kernel/mm/mempolicy/weighted_interleave/bw_nodeN

The sysfs files are created and removed dynamically on node hotplug
events, in sync with the existing weighted_interleave/nodeN attributes.

Userspace can write a single bandwidth value (in MB/s) to override both
read_bandwidth and write_bandwidth for the corresponding NUMA node. The
value is then propagated to the internal node_bw_table via
mempolicy_set_node_perf().

This interface is intended for debugging and experimentation only.

[1] Link: https://lkml.kernel.org/r/20250505182328.4148265-1-joshua.hahnjy@gmail.com

Signed-off-by: yeeli
---
 mm/Kconfig     |  20 +++++++
 mm/mempolicy.c | 148 +++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 168 insertions(+)

diff --git a/mm/Kconfig b/mm/Kconfig
index bd0ea5454af8..40554df18edc 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -1441,6 +1441,26 @@ config NUMA_EMU
 	  into virtual nodes when booted with "numa=fake=N", where N is the
 	  number of nodes. This is only useful for debugging.
 
+config NUMA_BW_MANUAL_OVERRIDE
+	bool "Allow manual override of per-NUMA-node bandwidth for weighted interleave"
+	depends on NUMA && SYSFS
+	default n
+	help
+	  This option exposes writable sysfs attributes under
+	  /sys/kernel/mm/mempolicy/weighted_interleave/bw_nodeN, allowing
+	  userspace to manually set read/write bandwidth values for each NUMA node.
+
+	  These values update the internal node_bw_table and can influence
+	  weighted interleave auto-tuning (if enabled).
+
+	  WARNING: This is intended for debugging, development, or platforms
+	  with incorrect HMAT/CDAT firmware data. Overriding hardware-reported
+	  bandwidth can lead to suboptimal performance, instability, or
+	  incorrect resource allocation decisions.
+
+	  Say N unless you are actively developing or debugging bandwidth-aware
+	  memory policies.
+
 config ARCH_HAS_USER_SHADOW_STACK
 	bool
 	help
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 68a98ba57882..0b7f42491748 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -226,6 +226,7 @@ int mempolicy_set_node_perf(unsigned int node, struct access_coordinate *coords)
 
 	bw_val = min(coords->read_bandwidth, coords->write_bandwidth);
 	new_bw = kcalloc(nr_node_ids, sizeof(unsigned int), GFP_KERNEL);
+
 	if (!new_bw)
 		return -ENOMEM;
 
@@ -3614,6 +3615,9 @@ struct iw_node_attr {
 struct sysfs_wi_group {
 	struct kobject wi_kobj;
 	struct mutex kobj_lock;
+#ifdef CONFIG_NUMA_BW_MANUAL_OVERRIDE
+	struct iw_node_attr *bw_attrs[MAX_NUMNODES];
+#endif
 	struct iw_node_attr *nattrs[];
 };
 
@@ -3855,6 +3859,128 @@ static int sysfs_wi_node_add(int nid)
 	return ret;
 }
 
+#ifdef CONFIG_NUMA_BW_MANUAL_OVERRIDE
+static ssize_t bw_node_show(struct kobject *kobj,
+			    struct kobj_attribute *attr,
+			    char *buf)
+{
+	struct iw_node_attr *node_attr;
+
+	node_attr = container_of(attr, struct iw_node_attr, kobj_attr);
+
+	/* No bandwidth table yet, e.g. a system without CDAT or HMAT data */
+	if (!node_bw_table)
+		return sysfs_emit(buf, "N/A\n");
+
+	if (!node_bw_table[node_attr->nid])
+		return sysfs_emit(buf, "0\n");
+
+	return sysfs_emit(buf, "%u(MB/s)\n", node_bw_table[node_attr->nid]);
+}
+
+static ssize_t bw_node_store(struct kobject *kobj,
+			     struct kobj_attribute *attr,
+			     const char *buf, size_t count)
+{
+	struct iw_node_attr *node_attr;
+	unsigned long val = 0;
+	int ret;
+	struct access_coordinate coords = {
+		.read_bandwidth = 0,
+		.write_bandwidth = 0,
+	};
+
+	node_attr = container_of(attr, struct iw_node_attr, kobj_attr);
+
+	ret = kstrtoul(buf, 0, &val);
+
+	coords.read_bandwidth = val;
+	coords.write_bandwidth = val;
+
+	if (ret)
+		return ret;
+
+	if (val > UINT_MAX)
+		return -EINVAL;
+
+	ret = mempolicy_set_node_perf(node_attr->nid, &coords);
+	if (ret)
+		return ret;
+
+	return count;
+}
+
+static int sysfs_bw_node_add(int nid)
+{
+	int ret;
+	char *name;
+	struct iw_node_attr *new_attr;
+
+	if (nid < 0 || nid >= nr_node_ids) {
+		pr_err("invalid node id: %d\n", nid);
+		return -EINVAL;
+	}
+
+	new_attr = kzalloc(sizeof(*new_attr), GFP_KERNEL);
+	if (!new_attr)
+		return -ENOMEM;
+
+	name = kasprintf(GFP_KERNEL, "bw_node%d", nid);
+	if (!name) {
+		kfree(new_attr);
+		return -ENOMEM;
+	}
+
+	sysfs_attr_init(&new_attr->kobj_attr.attr);
+	new_attr->kobj_attr.attr.name = name;
+	new_attr->kobj_attr.attr.mode = 0644;
+	new_attr->kobj_attr.show = bw_node_show;
+	new_attr->kobj_attr.store = bw_node_store;
+	new_attr->nid = nid;
+
+	mutex_lock(&wi_group->kobj_lock);
+	if (wi_group->bw_attrs[nid]) {
+		mutex_unlock(&wi_group->kobj_lock);
+		ret = -EEXIST;
+		goto out;
+	}
+
+	ret = sysfs_create_file(&wi_group->wi_kobj, &new_attr->kobj_attr.attr);
+
+	if (ret) {
+		mutex_unlock(&wi_group->kobj_lock);
+		goto out;
+	}
+	wi_group->bw_attrs[nid] = new_attr;
+	mutex_unlock(&wi_group->kobj_lock);
+	return 0;
+
+out:
+	kfree(new_attr->kobj_attr.attr.name);
+	kfree(new_attr);
+	return ret;
+}
+
+static void sysfs_bw_node_delete(int nid)
+{
+	struct iw_node_attr *attr;
+
+	if (nid < 0 || nid >= nr_node_ids)
+		return;
+
+	mutex_lock(&wi_group->kobj_lock);
+	attr = wi_group->bw_attrs[nid];
+
+	if (attr) {
+		sysfs_remove_file(&wi_group->wi_kobj, &attr->kobj_attr.attr);
+		kfree(attr->kobj_attr.attr.name);
+		kfree(attr);
+		wi_group->bw_attrs[nid] = NULL;
+	}
+	mutex_unlock(&wi_group->kobj_lock);
+}
+#endif
+
 static int wi_node_notifier(struct notifier_block *nb,
 			    unsigned long action, void *data)
 {
@@ -3868,9 +3994,22 @@ static int wi_node_notifier(struct notifier_block *nb,
 		if (err)
 			pr_err("failed to add sysfs for node%d during hotplug: %d\n",
 			       nid, err);
+
+#ifdef CONFIG_NUMA_BW_MANUAL_OVERRIDE
+		err = sysfs_bw_node_add(nid);
+		if (err)
+			pr_err("failed to add sysfs bw_node%d: %d\n",
+			       nid, err);
+#endif
 		break;
+
 	case NODE_REMOVED_LAST_MEMORY:
 		sysfs_wi_node_delete(nid);
+
+#ifdef CONFIG_NUMA_BW_MANUAL_OVERRIDE
+		sysfs_bw_node_delete(nid);
+#endif
+
 		break;
 	}
 
@@ -3906,6 +4045,15 @@ static int __init add_weighted_interleave_group(struct kobject *mempolicy_kobj)
 			       nid, err);
 			goto err_cleanup_kobj;
 		}
+
+#ifdef CONFIG_NUMA_BW_MANUAL_OVERRIDE
+		err = sysfs_bw_node_add(nid);
+		if (err) {
+			pr_err("failed to add sysfs bw_node%d during init: %d\n", nid, err);
+			goto err_cleanup_kobj;
+		}
+#endif
+
 	}
 
 	hotplug_node_notifier(wi_node_notifier, DEFAULT_CALLBACK_PRI);
-- 
2.34.1
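
For anyone who wants to try the bw_nodeN attribute from userspace, here is
a minimal sketch of how a test tool might drive it. It assumes this patch
is applied and CONFIG_NUMA_BW_MANUAL_OVERRIDE=y. The helper names
(bw_node_path, bw_node_write, bw_node_parse) are hypothetical and not part
of the patch. The parser only mirrors the three strings bw_node_show()
prints ("N/A", "0", "<value>(MB/s)").

```c
#include <ctype.h>
#include <errno.h>
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>

/* Directory named in the commit message; bw_nodeN only exists with the patch. */
#define WI_SYSFS_DIR "/sys/kernel/mm/mempolicy/weighted_interleave"

/* Hypothetical helper: build "<dir>/bw_node<nid>"; 0 on success, -1 if truncated. */
int bw_node_path(char *buf, size_t len, const char *dir, int nid)
{
	int n = snprintf(buf, len, "%s/bw_node%d", dir, nid);

	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/* Hypothetical helper: write one value in MB/s; returns 0 or a negative errno. */
int bw_node_write(const char *dir, int nid, unsigned int mbps)
{
	char path[256];
	FILE *f;
	int ret = 0;

	if (bw_node_path(path, sizeof(path), dir, nid))
		return -ENAMETOOLONG;

	f = fopen(path, "w");
	if (!f)
		return -errno;

	/* The store side parses with kstrtoul(), which tolerates the newline. */
	if (fprintf(f, "%u\n", mbps) < 0)
		ret = -EIO;
	if (fclose(f) != 0 && !ret)
		ret = -errno;
	return ret;
}

/*
 * Hypothetical helper: parse the text printed by bw_node_show().
 * Returns 0 and stores the value, or -1 for "N/A" and malformed input.
 */
int bw_node_parse(const char *text, unsigned int *mbps)
{
	char *end;
	unsigned long val;

	if (!isdigit((unsigned char)text[0]))
		return -1;

	errno = 0;
	val = strtoul(text, &end, 10);
	if (errno || val > UINT_MAX)
		return -1;

	*mbps = (unsigned int)val;
	return 0;
}
```

For example, bw_node_write(WI_SYSFS_DIR, 1, 30000) would request 30000 MB/s
for node 1. The attribute is created with mode 0644, so the write normally
needs root. Because the value is applied to both read_bandwidth and
write_bandwidth, there is no way to set the two independently through this
file.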