From nobody Tue Apr 7 05:58:52 2026
From: Rakie Kim <rakie.kim@sk.com>
To: akpm@linux-foundation.org
Cc: gourry@gourry.net, linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	linux-cxl@vger.kernel.org, ziy@nvidia.com, matthew.brost@intel.com,
	joshua.hahnjy@gmail.com, byungchul@sk.com, ying.huang@linux.alibaba.com,
	apopple@nvidia.com, david@kernel.org, lorenzo.stoakes@oracle.com,
	Liam.Howlett@oracle.com, vbabka@suse.cz, rppt@kernel.org,
	surenb@google.com, mhocko@suse.com, dave@stgolabs.net,
	jonathan.cameron@huawei.com, dave.jiang@intel.com,
	alison.schofield@intel.com, vishal.l.verma@intel.com,
	ira.weiny@intel.com, dan.j.williams@intel.com, kernel_team@skhynix.com,
	honggyu.kim@sk.com, yunjeong.mun@sk.com, rakie.kim@sk.com
Subject: [RFC PATCH 4/4] mm/mempolicy: enhance weighted interleave with
 socket-aware locality
Date: Mon, 16 Mar 2026 14:12:52 +0900
Message-ID: <20260316051258.246-5-rakie.kim@sk.com>
X-Mailer: git-send-email 2.52.0.windows.1
In-Reply-To: <20260316051258.246-1-rakie.kim@sk.com>
References: <20260316051258.246-1-rakie.kim@sk.com>
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"

Flat weighted interleave applies one global weight vector regardless of
where a task runs. On multi-socket systems this ignores inter-socket
interconnect costs and can steer allocations to remote sockets even when
local capacity exists, degrading effective bandwidth and increasing
latency.

Consider a dual-socket system:

      node0            node1
    +-------+        +-------+
    | CPU0  |--------| CPU1  |
    +-------+        +-------+
    | DRAM0 |        | DRAM1 |
    +---+---+        +---+---+
        |                |
    +---+---+        +---+---+
    | CXL0  |        | CXL1  |
    +-------+        +-------+
      node2            node3

Local device capabilities (GB/s) versus cross-socket effective bandwidth:

           0     1     2     3
     0   300   150   100    50
     1   150   300    50   100

A reasonable global weight vector reflecting device capabilities is:

     node0=3 node1=3 node2=1 node3=1

However, applying it flat to all sources yields the effective map:

           0   1   2   3
     0     3   3   1   1
     1     3   3   1   1

This does not account for the interconnect penalty (e.g., node0->node1
drops 300->150, node0->node3 drops 100->50) and thus permits cross-socket
allocations that underutilize local bandwidth.

This patch makes weighted interleave socket-aware. Before weighting is
applied, the candidate nodes are restricted to the current socket; only
if no eligible local nodes remain does the policy fall back to the wider
set.
The resulting effective map becomes:

           0   1   2   3
     0     3   0   1   0
     1     0   3   0   1

Now tasks running on node0 prefer DRAM0(3) and CXL0(1), while tasks on
node1 prefer DRAM1(3) and CXL1(1). This aligns allocation with actual
effective bandwidth, preserves NUMA locality, and reduces cross-socket
traffic.

Signed-off-by: Rakie Kim <rakie.kim@sk.com>
---
 mm/mempolicy.c | 94 +++++++++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 90 insertions(+), 4 deletions(-)

diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index a3f0fde6c626..541853ac08bc 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -117,6 +117,7 @@
 #include
 #include
 #include
+#include

 #include "internal.h"

@@ -2134,17 +2135,87 @@ bool apply_policy_zone(struct mempolicy *policy, enum zone_type zone)
 	return zone >= dynamic_policy_zone;
 }

+/**
+ * policy_resolve_package_nodes - Restrict policy nodes to the current package
+ * @policy: Target mempolicy whose user-selected nodes are in @policy->nodes.
+ * @mask: Output nodemask. On success, contains policy->nodes limited to
+ *        the package that should be used for the allocation.
+ *
+ * This helper combines two constraints to decide where within a
+ * socket/package memory may be allocated:
+ *
+ *   1) The caller's package: derived via mp_get_package_nodes(numa_node_id()).
+ *   2) The user's preselected set @policy->nodes (cpusets/mempolicy).
+ *
+ * The function obtains the nodemask of the current CPU's package and
+ * intersects it with @policy->nodes. If the intersection is empty (e.g. the
+ * user excluded every node of the current package), it falls back to the
+ * first node in @policy->nodes, derives that node's package, and intersects
+ * again. If the fallback also yields an empty set, @mask stays empty and a
+ * non-zero error is returned.
+ *
+ * Examples (packages: P0={CPU:0, MEM:2}, P1={CPU:1, MEM:3}):
+ *   - policy->nodes = {0,1,2,3}
+ *       on P0: mask = {0,2}; on P1: mask = {1,3}.
+ *   - policy->nodes = {0,1,3}
+ *       on P0: mask = {0} (only node 0 from P0 is allowed).
+ *   - policy->nodes = {1,2,3}
+ *       on P0: mask = {2} (only node 2 from P0 is allowed).
+ *   - policy->nodes = {1,3}
+ *       on P0: current package (P0) & policy = NULL -> fall back to
+ *              first_node(policy) = 1, package(1) = P1, mask = {1,3}.
+ *              (The user effectively opted out of P0.)
+ *
+ * Return:
+ *   0 on success with @mask set as above;
+ *   -EINVAL if @policy or @mask is NULL, or if no eligible node remains;
+ *   Propagated error from mp_get_package_nodes() on failure.
+ */
+static int policy_resolve_package_nodes(struct mempolicy *policy, nodemask_t *mask)
+{
+	unsigned int node, ret = 0;
+	nodemask_t package_mask;
+
+	if (!policy || !mask)
+		return -EINVAL;
+
+	nodes_clear(*mask);
+
+	node = numa_node_id();
+	ret = mp_get_package_nodes(node, &package_mask);
+	if (!ret) {
+		nodes_and(*mask, package_mask, policy->nodes);
+
+		if (nodes_empty(*mask)) {
+			node = first_node(policy->nodes);
+			ret = mp_get_package_nodes(node, &package_mask);
+			if (ret)
+				goto out;
+			nodes_and(*mask, package_mask, policy->nodes);
+			if (nodes_empty(*mask))
+				ret = -EINVAL;
+		}
+	}
+
+out:
+	return ret;
+}
+
 static unsigned int weighted_interleave_nodes(struct mempolicy *policy)
 {
 	unsigned int node;
 	unsigned int cpuset_mems_cookie;
+	nodemask_t mask;

 retry:
 	/* to prevent miscount use tsk->mems_allowed_seq to detect rebind */
 	cpuset_mems_cookie = read_mems_allowed_begin();
 	node = current->il_prev;
-	if (!current->il_weight || !node_isset(node, policy->nodes)) {
-		node = next_node_in(node, policy->nodes);
+
+	if (policy_resolve_package_nodes(policy, &mask))
+		mask = policy->nodes;
+
+	if (!current->il_weight || !node_isset(node, mask)) {
+		node = next_node_in(node, mask);
 		if (read_mems_allowed_retry(cpuset_mems_cookie))
 			goto retry;
 		if (node == MAX_NUMNODES)
@@ -2237,6 +2308,21 @@ static unsigned int read_once_policy_nodemask(struct mempolicy *pol,
 	return nodes_weight(*mask);
 }

+static unsigned int read_once_policy_package_nodemask(struct mempolicy *pol,
+						      nodemask_t *mask)
+{
+	nodemask_t package_mask;
+
+	barrier();
+	if (policy_resolve_package_nodes(pol, &package_mask))
+		memcpy(mask, &pol->nodes, sizeof(nodemask_t));
+	else
+		memcpy(mask, &package_mask, sizeof(nodemask_t));
+	barrier();
+
+	return nodes_weight(*mask);
+}
+
 static unsigned int weighted_interleave_nid(struct mempolicy *pol, pgoff_t ilx)
 {
 	struct weighted_interleave_state *state;
@@ -2247,7 +2333,7 @@ static unsigned int weighted_interleave_nid(struct mempolicy *pol, pgoff_t ilx)
 	u8 weight;
 	int nid = 0;

-	nr_nodes = read_once_policy_nodemask(pol, &nodemask);
+	nr_nodes = read_once_policy_package_nodemask(pol, &nodemask);
 	if (!nr_nodes)
 		return numa_node_id();

@@ -2691,7 +2777,7 @@ static unsigned long alloc_pages_bulk_weighted_interleave(gfp_t gfp,
 	/* read the nodes onto the stack, retry if done during rebind */
 	do {
 		cpuset_mems_cookie = read_mems_allowed_begin();
-		nnodes = read_once_policy_nodemask(pol, &nodes);
+		nnodes = read_once_policy_package_nodemask(pol, &nodes);
 	} while (read_mems_allowed_retry(cpuset_mems_cookie));

 	/* if the nodemask has become invalid, we cannot do anything */
-- 
2.34.1