From nobody Tue Apr 7 05:58:52 2026
From: Rakie Kim <rakie.kim@sk.com>
To: akpm@linux-foundation.org
Cc: gourry@gourry.net, linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	linux-cxl@vger.kernel.org, ziy@nvidia.com, matthew.brost@intel.com,
	joshua.hahnjy@gmail.com, byungchul@sk.com, ying.huang@linux.alibaba.com,
	apopple@nvidia.com, david@kernel.org, lorenzo.stoakes@oracle.com,
	Liam.Howlett@oracle.com, vbabka@suse.cz, rppt@kernel.org,
	surenb@google.com, mhocko@suse.com, dave@stgolabs.net,
	jonathan.cameron@huawei.com, dave.jiang@intel.com,
	alison.schofield@intel.com, vishal.l.verma@intel.com,
	ira.weiny@intel.com, dan.j.williams@intel.com, kernel_team@skhynix.com,
	honggyu.kim@sk.com, yunjeong.mun@sk.com, rakie.kim@sk.com
Subject: [RFC PATCH 4/4] mm/mempolicy: enhance weighted interleave with
 socket-aware locality
Date: Mon, 16 Mar 2026 14:12:52 +0900
Message-ID: <20260316051258.246-5-rakie.kim@sk.com>
X-Mailer: git-send-email 2.52.0.windows.1
In-Reply-To: <20260316051258.246-1-rakie.kim@sk.com>
References: <20260316051258.246-1-rakie.kim@sk.com>
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"

Flat weighted interleave applies one global weight vector regardless of
where a task runs. On multi-socket systems this ignores inter-socket
interconnect costs and can steer allocations to remote sockets even when
local capacity exists, degrading effective bandwidth and increasing
latency.

Consider a dual-socket system:

      node0            node1
    +-------+        +-------+
    | CPU0  |--------| CPU1  |
    +-------+        +-------+
    | DRAM0 |        | DRAM1 |
    +---+---+        +---+---+
        |                |
    +---+---+        +---+---+
    | CXL0  |        | CXL1  |
    +-------+        +-------+
      node2            node3

Local device capabilities (GB/s) versus cross-socket effective bandwidth:

           0     1     2     3
     0   300   150   100    50
     1   150   300    50   100

A reasonable global weight vector reflecting device capabilities is:

     node0=3 node1=3 node2=1 node3=1

However, applying it flat to all sources yields the effective map:

           0   1   2   3
     0     3   3   1   1
     1     3   3   1   1

This does not account for the interconnect penalty (e.g., node0->node1
drops 300->150, node0->node3 drops 100->50) and thus permits cross-socket
allocations that underutilize local bandwidth.

This patch makes weighted interleave socket-aware. Before weighting is
applied, the candidate nodes are restricted to the current socket; only
if no eligible local nodes remain does the policy fall back to the wider
set.
The resulting effective map becomes:

           0   1   2   3
     0     3   0   1   0
     1     0   3   0   1

Now tasks running on node0 prefer DRAM0(3) and CXL0(1), while tasks on
node1 prefer DRAM1(3) and CXL1(1). This aligns allocation with actual
effective bandwidth, preserves NUMA locality, and reduces cross-socket
traffic.

Signed-off-by: Rakie Kim <rakie.kim@sk.com>
---
 mm/mempolicy.c | 94 +++++++++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 90 insertions(+), 4 deletions(-)

diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index a3f0fde6c626..541853ac08bc 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -117,6 +117,7 @@
 #include
 #include
 #include
+#include

 #include "internal.h"

@@ -2134,17 +2135,87 @@ bool apply_policy_zone(struct mempolicy *policy, enum zone_type zone)
 	return zone >= dynamic_policy_zone;
 }

+/**
+ * policy_resolve_package_nodes - Restrict policy nodes to the current package
+ * @policy: Target mempolicy whose user-selected nodes are in @policy->nodes.
+ * @mask: Output nodemask. On success, contains policy->nodes limited to
+ *        the package that should be used for the allocation.
+ *
+ * This helper combines two constraints to decide where within a
+ * socket/package memory may be allocated:
+ *
+ *   1) The caller's package: derived via mp_get_package_nodes(numa_node_id()).
+ *   2) The user's preselected set @policy->nodes (cpusets/mempolicy).
+ *
+ * The function obtains the nodemask of the current CPU's package and
+ * intersects it with @policy->nodes. If the intersection is empty (e.g. the
+ * user excluded every node of the current package), it falls back to the
+ * first node in @policy->nodes, derives that node's package, and intersects
+ * again. If the fallback also yields an empty set, @mask stays empty and a
+ * non-zero error is returned.
+ *
+ * Examples (packages: P0={CPU:0, MEM:2}, P1={CPU:1, MEM:3}):
+ *   - policy->nodes = {0,1,2,3}
+ *       on P0: mask = {0,2}; on P1: mask = {1,3}.
+ *   - policy->nodes = {0,1,3}
+ *       on P0: mask = {0} (only node 0 from P0 is allowed).
+ *   - policy->nodes = {1,2,3}
+ *       on P0: mask = {2} (only node 2 from P0 is allowed).
+ *   - policy->nodes = {1,3}
+ *       on P0: current package (P0) & policy = NULL -> fall back to
+ *              first_node(policy) = 1, package(1) = P1, mask = {1,3}.
+ *              (The user effectively opted out of P0.)
+ *
+ * Return:
+ *   0 on success with @mask set as above;
+ *   -EINVAL if @policy or @mask is NULL, or if no eligible node remains;
+ *   Propagated error from mp_get_package_nodes() on failure.
+ */
+static int policy_resolve_package_nodes(struct mempolicy *policy, nodemask_t *mask)
+{
+	unsigned int node, ret = 0;
+	nodemask_t package_mask;
+
+	if (!policy || !mask)
+		return -EINVAL;
+
+	nodes_clear(*mask);
+
+	node = numa_node_id();
+	ret = mp_get_package_nodes(node, &package_mask);
+	if (!ret) {
+		nodes_and(*mask, package_mask, policy->nodes);
+
+		if (nodes_empty(*mask)) {
+			node = first_node(policy->nodes);
+			ret = mp_get_package_nodes(node, &package_mask);
+			if (ret)
+				goto out;
+			nodes_and(*mask, package_mask, policy->nodes);
+			if (nodes_empty(*mask))
+				ret = -EINVAL;
+		}
+	}
+
+out:
+	return ret;
+}
+
 static unsigned int weighted_interleave_nodes(struct mempolicy *policy)
 {
 	unsigned int node;
 	unsigned int cpuset_mems_cookie;
+	nodemask_t mask;

 retry:
 	/* to prevent miscount use tsk->mems_allowed_seq to detect rebind */
 	cpuset_mems_cookie = read_mems_allowed_begin();
 	node = current->il_prev;
-	if (!current->il_weight || !node_isset(node, policy->nodes)) {
-		node = next_node_in(node, policy->nodes);
+
+	if (policy_resolve_package_nodes(policy, &mask))
+		mask = policy->nodes;
+
+	if (!current->il_weight || !node_isset(node, mask)) {
+		node = next_node_in(node, mask);
 		if (read_mems_allowed_retry(cpuset_mems_cookie))
 			goto retry;
 		if (node == MAX_NUMNODES)
@@ -2237,6 +2308,21 @@ static unsigned int read_once_policy_nodemask(struct mempolicy *pol,
 	return nodes_weight(*mask);
 }

+static unsigned int read_once_policy_package_nodemask(struct mempolicy *pol,
+						      nodemask_t *mask)
+{
+	nodemask_t package_mask;
+
+	barrier();
+	if (policy_resolve_package_nodes(pol, &package_mask))
+		memcpy(mask, &pol->nodes, sizeof(nodemask_t));
+	else
+		memcpy(mask, &package_mask, sizeof(nodemask_t));
+	barrier();
+
+	return nodes_weight(*mask);
+}
+
 static unsigned int weighted_interleave_nid(struct mempolicy *pol, pgoff_t ilx)
 {
 	struct weighted_interleave_state *state;
@@ -2247,7 +2333,7 @@ static unsigned int weighted_interleave_nid(struct mempolicy *pol, pgoff_t ilx)
 	u8 weight;
 	int nid = 0;

-	nr_nodes = read_once_policy_nodemask(pol, &nodemask);
+	nr_nodes = read_once_policy_package_nodemask(pol, &nodemask);
 	if (!nr_nodes)
 		return numa_node_id();

@@ -2691,7 +2777,7 @@ static unsigned long alloc_pages_bulk_weighted_interleave(gfp_t gfp,
 	/* read the nodes onto the stack, retry if done during rebind */
 	do {
 		cpuset_mems_cookie = read_mems_allowed_begin();
-		nnodes = read_once_policy_nodemask(pol, &nodes);
+		nnodes = read_once_policy_package_nodemask(pol, &nodes);
 	} while (read_mems_allowed_retry(cpuset_mems_cookie));

 	/* if the nodemask has become invalid, we cannot do anything */
-- 
2.34.1