From nobody Sun Feb 8 06:56:34 2026
Date: Wed, 14 Jan 2026 20:53:02 +0000
In-Reply-To: <20260114205305.2869796-1-bingjiao@google.com>
References: <20260114070053.2446770-1-bingjiao@google.com> <20260114205305.2869796-1-bingjiao@google.com>
Message-ID: <20260114205305.2869796-2-bingjiao@google.com>
Subject: [PATCH v9 1/2] mm/vmscan: fix demotion targets checks in reclaim/demotion
From: Bing Jiao
To: bingjiao@google.com
Cc: Andrew Morton, David Hildenbrand, Lorenzo Stoakes, "Liam R. Howlett",
 Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
 Axel Rasmussen, Yuanchu Xie, Wei Xu, Johannes Weiner, Qi Zheng,
 Shakeel Butt, Gregory Price, Joshua Hahn, muchun.song@linux.dev,
 roman.gushchin@linux.dev, tj@kernel.org, longman@redhat.com,
 chenridong@huaweicloud.com, linux-mm@kvack.org,
 linux-kernel@vger.kernel.org, cgroups@vger.kernel.org

Fix two bugs in demote_folio_list() and can_demote() caused by incorrect
demotion target checks against cpuset.mems_effective in reclaim/demotion.

Commit 7d709f49babc ("vmscan,cgroup: apply mems_effective to reclaim")
introduced the cpuset.mems_effective check and applied it to can_demote().
However:

1. It does not apply this check in demote_folio_list(), which leads to
   situations where pages are demoted to nodes that are explicitly
   excluded from the task's cpuset.mems.

2. In can_demote(), it checks only the nodes in the immediate next
   demotion hierarchy, not all allowed demotion targets. This can cause
   pages to never be demoted if the nodes in the next demotion hierarchy
   are not set in mems_effective.

These bugs break the resource isolation provided by cpuset.mems. This is
visible from userspace because, on multi-tier memory systems, pages can
either fail to be demoted entirely or be demoted to nodes that are not
allowed.

To address these bugs, update cpuset_node_allowed() and
mem_cgroup_node_allowed() to return effective_mems, allowing a direct
logical-AND operation against the demotion targets. Also update
can_demote() and demote_folio_list() accordingly.

Bug 1 reproduction:

Assume a system with 4 nodes, where nodes 0-1 are top-tier and nodes 2-3
are far-tier memory. All nodes have equal capacity. Test script:

  echo 1 > /sys/kernel/mm/numa/demotion_enabled
  mkdir /sys/fs/cgroup/test
  echo +cpuset > /sys/fs/cgroup/cgroup.subtree_control
  echo "0-2" > /sys/fs/cgroup/test/cpuset.mems
  echo $$ > /sys/fs/cgroup/test/cgroup.procs
  swapoff -a

  # Expectation: Should respect the node 0-2 limit.
  # Observation: Node 3 shows significant allocation (MemFree drops).
  stress-ng --oomable --vm 1 --vm-bytes 150% --mbind 0,1

Bug 2 reproduction:

Assume a system with 6 nodes, where nodes 0-2 are top-tier, node 3 is a
far-tier node, and nodes 4-5 are the farthest-tier nodes. All nodes have
equal capacity. Test script:

  echo 1 > /sys/kernel/mm/numa/demotion_enabled
  mkdir /sys/fs/cgroup/test
  echo +cpuset > /sys/fs/cgroup/cgroup.subtree_control
  echo "0-2,4-5" > /sys/fs/cgroup/test/cpuset.mems
  echo $$ > /sys/fs/cgroup/test/cgroup.procs
  swapoff -a

  # Expectation: Pages are demoted to nodes 4-5.
  # Observation: No pages are demoted before OOM.
  stress-ng --oomable --vm 1 --vm-bytes 150% --mbind 0,1,2

Fixes: 7d709f49babc ("vmscan,cgroup: apply mems_effective to reclaim")
Signed-off-by: Bing Jiao
Cc:
Acked-by: Shakeel Butt
---
v7 -> v9: Minor updates in demote_folio_list() for better code logic.
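The difference between the old and new can_demote() checks can be illustrated outside the kernel. The sketch below is a hypothetical userspace model, not kernel code: plain bitmasks stand in for nodemask_t, and next_demotion_node_model() / all_demotion_targets_model() hand-code the bug-2 topology described above (nodes 0-2 top-tier, node 3 far-tier, nodes 4-5 farthest-tier).

```c
#include <assert.h>

/* Hypothetical model: a nodemask is just a bitmask of node ids. */
typedef unsigned long nodemask_t;
#define NODE(n) (1UL << (n))

/* Immediate next hop in the demotion path; -1 models NUMA_NO_NODE. */
static int next_demotion_node_model(int node)
{
	return (node <= 2) ? 3 : (node == 3 ? 4 : -1);
}

/* Full set of demotion targets reachable from @node. */
static nodemask_t all_demotion_targets_model(int node)
{
	if (node <= 2)
		return NODE(3) | NODE(4) | NODE(5);
	return (node == 3) ? (NODE(4) | NODE(5)) : 0;
}

/* Buggy check: only the immediate next hop is tested against
 * effective_mems, so allowed farther tiers are never considered. */
static int can_demote_old(int node, nodemask_t effective_mems)
{
	int next = next_demotion_node_model(node);

	return next >= 0 && (effective_mems & NODE(next)) != 0;
}

/* Fixed check: AND the whole target set against effective_mems and
 * demote if anything survives the filter. */
static int can_demote_new(int node, nodemask_t effective_mems)
{
	return (all_demotion_targets_model(node) & effective_mems) != 0;
}
```

With cpuset.mems = "0-2,4-5" (the bug-2 script), the old check sees only node 3, which is disallowed, and refuses to demote; the new check finds nodes 4-5 in the intersection and proceeds.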
 include/linux/cpuset.h     |  6 ++---
 include/linux/memcontrol.h |  6 ++---
 kernel/cgroup/cpuset.c     | 54 +++++++++++++++++++++++++-------------
 mm/memcontrol.c            | 16 +++++++++--
 mm/vmscan.c                | 34 +++++++++++++++---------
 5 files changed, 78 insertions(+), 38 deletions(-)

diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h
index a98d3330385c..631577384677 100644
--- a/include/linux/cpuset.h
+++ b/include/linux/cpuset.h
@@ -174,7 +174,7 @@ static inline void set_mems_allowed(nodemask_t nodemask)
 	task_unlock(current);
 }
 
-extern bool cpuset_node_allowed(struct cgroup *cgroup, int nid);
+extern void cpuset_nodes_allowed(struct cgroup *cgroup, nodemask_t *mask);
 
 #else /* !CONFIG_CPUSETS */
 static inline bool cpusets_enabled(void) { return false; }
@@ -301,9 +301,9 @@ static inline bool read_mems_allowed_retry(unsigned int seq)
 	return false;
 }
 
-static inline bool cpuset_node_allowed(struct cgroup *cgroup, int nid)
+static inline void cpuset_nodes_allowed(struct cgroup *cgroup, nodemask_t *mask)
 {
-	return true;
+	nodes_copy(*mask, node_states[N_MEMORY]);
 }
 
 #endif /* !CONFIG_CPUSETS */
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 0651865a4564..412db7663357 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -1744,7 +1744,7 @@ static inline void count_objcg_events(struct obj_cgroup *objcg,
 	rcu_read_unlock();
 }
 
-bool mem_cgroup_node_allowed(struct mem_cgroup *memcg, int nid);
+void mem_cgroup_node_filter_allowed(struct mem_cgroup *memcg, nodemask_t *mask);
 
 void mem_cgroup_show_protected_memory(struct mem_cgroup *memcg);
 
@@ -1815,9 +1815,9 @@ static inline ino_t page_cgroup_ino(struct page *page)
 	return 0;
 }
 
-static inline bool mem_cgroup_node_allowed(struct mem_cgroup *memcg, int nid)
+static inline void mem_cgroup_node_filter_allowed(struct mem_cgroup *memcg,
+						  nodemask_t *mask)
 {
-	return true;
 }
 
 static inline void mem_cgroup_show_protected_memory(struct mem_cgroup *memcg)
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 6e6eb09b8db6..289fb1a72550 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -4416,40 +4416,58 @@ bool cpuset_current_node_allowed(int node, gfp_t gfp_mask)
 	return allowed;
 }
 
-bool cpuset_node_allowed(struct cgroup *cgroup, int nid)
+/**
+ * cpuset_nodes_allowed - return effective_mems mask from a cgroup cpuset.
+ * @cgroup: pointer to struct cgroup.
+ * @mask: pointer to struct nodemask_t to be returned.
+ *
+ * Returns effective_mems mask from a cgroup cpuset if it is cgroup v2 and
+ * has the cpuset subsys. Otherwise, returns node_states[N_MEMORY].
+ *
+ * This function intentionally avoids taking the cpuset_mutex or callback_lock
+ * when accessing effective_mems. This is because the obtained effective_mems
+ * is stale immediately after the query anyway (e.g., effective_mems is updated
+ * immediately after releasing the lock but before returning).
+ *
+ * As a result, the returned @mask may be empty because cs->effective_mems can
+ * be rebound during this call. Besides, nodes in @mask are not guaranteed to
+ * be online due to hotplug. Callers should check the mask for validity on
+ * return based on its subsequent use.
+ **/
+void cpuset_nodes_allowed(struct cgroup *cgroup, nodemask_t *mask)
 {
 	struct cgroup_subsys_state *css;
 	struct cpuset *cs;
-	bool allowed;
 
 	/*
 	 * In v1, mem_cgroup and cpuset are unlikely in the same hierarchy
 	 * and mems_allowed is likely to be empty even if we could get to it,
-	 * so return true to avoid taking a global lock on the empty check.
+	 * so return directly to avoid taking a global lock on the empty check.
 	 */
-	if (!cpuset_v2())
-		return true;
+	if (!cgroup || !cpuset_v2()) {
+		nodes_copy(*mask, node_states[N_MEMORY]);
+		return;
+	}
 
 	css = cgroup_get_e_css(cgroup, &cpuset_cgrp_subsys);
-	if (!css)
-		return true;
+	if (!css) {
+		nodes_copy(*mask, node_states[N_MEMORY]);
+		return;
+	}
 
 	/*
-	 * Normally, accessing effective_mems would require the cpuset_mutex
-	 * or callback_lock - but node_isset is atomic and the reference
-	 * taken via cgroup_get_e_css is sufficient to protect css.
-	 *
-	 * Since this interface is intended for use by migration paths, we
-	 * relax locking here to avoid taking global locks - while accepting
-	 * there may be rare scenarios where the result may be innaccurate.
+	 * The reference taken via cgroup_get_e_css is sufficient to
+	 * protect css, but it does not imply safe accesses to effective_mems.
 	 *
-	 * Reclaim and migration are subject to these same race conditions, and
-	 * cannot make strong isolation guarantees, so this is acceptable.
+	 * Normally, accessing effective_mems would require the cpuset_mutex
+	 * or callback_lock - but the correctness of this information is stale
+	 * immediately after the query anyway. We do not acquire the lock
+	 * during this process to save lock contention in exchange for racing
+	 * against mems_allowed rebinds.
	 */
 	cs = container_of(css, struct cpuset, css);
-	allowed = node_isset(nid, cs->effective_mems);
+	nodes_copy(*mask, cs->effective_mems);
 	css_put(css);
-
-	return allowed;
 }
 
 /**
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 86f43b7e5f71..702c3db624a0 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5624,9 +5624,21 @@ subsys_initcall(mem_cgroup_swap_init);
 #endif /* CONFIG_SWAP */
 
-bool mem_cgroup_node_allowed(struct mem_cgroup *memcg, int nid)
+void mem_cgroup_node_filter_allowed(struct mem_cgroup *memcg, nodemask_t *mask)
 {
-	return memcg ? cpuset_node_allowed(memcg->css.cgroup, nid) : true;
+	nodemask_t allowed;
+
+	if (!memcg)
+		return;
+
+	/*
+	 * Since this interface is intended for use by migration paths, and
+	 * reclaim and migration are subject to race conditions such as changes
+	 * in effective_mems and hot-unplugging of nodes, an inaccurate allowed
+	 * mask is acceptable.
+	 */
+	cpuset_nodes_allowed(memcg->css.cgroup, &allowed);
+	nodes_and(*mask, *mask, allowed);
 }
 
 void mem_cgroup_show_protected_memory(struct mem_cgroup *memcg)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 670fe9fae5ba..5ea1dd2b8cce 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -344,19 +344,21 @@ static void flush_reclaim_state(struct scan_control *sc)
 static bool can_demote(int nid, struct scan_control *sc,
 		       struct mem_cgroup *memcg)
 {
-	int demotion_nid;
+	struct pglist_data *pgdat = NODE_DATA(nid);
+	nodemask_t allowed_mask;
 
-	if (!numa_demotion_enabled)
+	if (!pgdat || !numa_demotion_enabled)
 		return false;
 	if (sc && sc->no_demotion)
 		return false;
 
-	demotion_nid = next_demotion_node(nid);
-	if (demotion_nid == NUMA_NO_NODE)
+	node_get_allowed_targets(pgdat, &allowed_mask);
+	if (nodes_empty(allowed_mask))
 		return false;
 
-	/* If demotion node isn't in the cgroup's mems_allowed, fall back */
-	return mem_cgroup_node_allowed(memcg, demotion_nid);
+	/* Filter out nodes that are not in the cgroup's mems_allowed. */
+	mem_cgroup_node_filter_allowed(memcg, &allowed_mask);
+	return !nodes_empty(allowed_mask);
 }
 
 static inline bool can_reclaim_anon_pages(struct mem_cgroup *memcg,
@@ -1019,9 +1021,10 @@ static struct folio *alloc_demote_folio(struct folio *src,
  * Folios which are not demoted are left on @demote_folios.
  */
 static unsigned int demote_folio_list(struct list_head *demote_folios,
-				      struct pglist_data *pgdat)
+				      struct pglist_data *pgdat,
+				      struct mem_cgroup *memcg)
 {
-	int target_nid = next_demotion_node(pgdat->node_id);
+	int target_nid;
 	unsigned int nr_succeeded;
 	nodemask_t allowed_mask;
 
@@ -1033,7 +1036,6 @@ static unsigned int demote_folio_list(struct list_head *demote_folios,
 		 */
 		.gfp_mask = (GFP_HIGHUSER_MOVABLE & ~__GFP_RECLAIM) |
 			__GFP_NOMEMALLOC | GFP_NOWAIT,
-		.nid = target_nid,
 		.nmask = &allowed_mask,
 		.reason = MR_DEMOTION,
 	};
@@ -1041,10 +1043,18 @@ static unsigned int demote_folio_list(struct list_head *demote_folios,
 	if (list_empty(demote_folios))
 		return 0;
 
-	if (target_nid == NUMA_NO_NODE)
+	node_get_allowed_targets(pgdat, &allowed_mask);
+	mem_cgroup_node_filter_allowed(memcg, &allowed_mask);
+	if (nodes_empty(allowed_mask))
 		return 0;
 
-	node_get_allowed_targets(pgdat, &allowed_mask);
+	target_nid = next_demotion_node(pgdat->node_id);
+	if (target_nid == NUMA_NO_NODE)
+		/* No lower-tier nodes or nodes were hot-unplugged. */
+		return 0;
+
+	if (!node_isset(target_nid, allowed_mask))
+		target_nid = node_random(&allowed_mask);
+	mtc.nid = target_nid;
 
 	/* Demotion ignores all cpuset and mempolicy settings */
 	migrate_pages(demote_folios, alloc_demote_folio, NULL,
@@ -1566,7 +1576,7 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
 	/* 'folio_list' is always empty here */
 
 	/* Migrate folios selected for demotion */
-	nr_demoted = demote_folio_list(&demote_folios, pgdat);
+	nr_demoted = demote_folio_list(&demote_folios, pgdat, memcg);
 	nr_reclaimed += nr_demoted;
 	stat->nr_demoted += nr_demoted;
 	/* Folios that could not be demoted are still in @demote_folios */
-- 
2.52.0.457.g6b5491de43-goog

From nobody Sun Feb 8 06:56:34 2026
Date: Wed, 14 Jan 2026 20:53:03 +0000
In-Reply-To: <20260114205305.2869796-1-bingjiao@google.com>
References: <20260114070053.2446770-1-bingjiao@google.com> <20260114205305.2869796-1-bingjiao@google.com>
Message-ID: <20260114205305.2869796-3-bingjiao@google.com>
Subject: [PATCH v9 2/2] mm/vmscan: select the closest preferred node in demote_folio_list()
From: Bing Jiao
To: bingjiao@google.com
Cc: Andrew Morton, David Hildenbrand, Lorenzo Stoakes, "Liam R. Howlett",
 Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
 Axel Rasmussen, Yuanchu Xie, Wei Xu, Johannes Weiner, Qi Zheng,
 Shakeel Butt, Gregory Price, Joshua Hahn, muchun.song@linux.dev,
 roman.gushchin@linux.dev, tj@kernel.org, longman@redhat.com,
 chenridong@huaweicloud.com, linux-mm@kvack.org,
 linux-kernel@vger.kernel.org, cgroups@vger.kernel.org

The preferred demotion node (migration_target_control.nid) should be the
one closest to the source node to minimize migration latency. Currently,
a discrepancy exists where demote_folio_list() randomly selects an
allowed node if the preferred node from next_demotion_node() is not set
in mems_effective.
To address this, update next_demotion_node() to select the preferred
target from the allowed nodes, and to fall back to the closest demotion
target when none of the preferred nodes is in mems_effective. This
ensures that the preferred demotion target is consistently the closest
available node to the source node.

Signed-off-by: Bing Jiao
Acked-by: Shakeel Butt
---
v7 -> v8: Fix bugs in v7. Remove the while loop of getting the preferred
node via next_demotion_node(). Use find_next_best_node() to find the
closest demotion target.
v8 -> v9: Move allowed node checks and identification of the closest
demotion target into next_demotion_node() for better function splitting.

 include/linux/memory-tiers.h |  6 +++---
 mm/memory-tiers.c            | 21 ++++++++++++++++-----
 mm/vmscan.c                  |  5 ++---
 3 files changed, 21 insertions(+), 11 deletions(-)

diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h
index 7a805796fcfd..96987d9d95a8 100644
--- a/include/linux/memory-tiers.h
+++ b/include/linux/memory-tiers.h
@@ -53,11 +53,11 @@ struct memory_dev_type *mt_find_alloc_memory_type(int adist,
 					struct list_head *memory_types);
 void mt_put_memory_types(struct list_head *memory_types);
 #ifdef CONFIG_MIGRATION
-int next_demotion_node(int node);
+int next_demotion_node(int node, const nodemask_t *allowed_mask);
 void node_get_allowed_targets(pg_data_t *pgdat, nodemask_t *targets);
 bool node_is_toptier(int node);
 #else
-static inline int next_demotion_node(int node)
+static inline int next_demotion_node(int node, const nodemask_t *allowed_mask)
 {
 	return NUMA_NO_NODE;
 }
@@ -101,7 +101,7 @@ static inline void clear_node_memory_type(int node, struct memory_dev_type *memt)
 
 }
 
-static inline int next_demotion_node(int node)
+static inline int next_demotion_node(int node, const nodemask_t *allowed_mask)
 {
 	return NUMA_NO_NODE;
 }
diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c
index 864811fff409..2d6c3754e6a8 100644
--- a/mm/memory-tiers.c
+++ b/mm/memory-tiers.c
@@ -320,16 +320,17 @@ void node_get_allowed_targets(pg_data_t *pgdat, nodemask_t *targets)
 /**
  * next_demotion_node() - Get the next node in the demotion path
  * @node: The starting node to lookup the next node
+ * @allowed_mask: The pointer to the allowed node mask
  *
  * Return: node id for next memory node in the demotion path hierarchy
 * from @node; NUMA_NO_NODE if @node is terminal. This does not keep
 * @node online or guarantee that it *continues* to be the next demotion
 * target.
 */
-int next_demotion_node(int node)
+int next_demotion_node(int node, const nodemask_t *allowed_mask)
 {
 	struct demotion_nodes *nd;
-	int target;
+	nodemask_t mask;
 
 	if (!node_demotion)
 		return NUMA_NO_NODE;
@@ -344,6 +345,10 @@ int next_demotion_node(int node)
 	 * node_demotion[] reads need to be consistent.
 	 */
 	rcu_read_lock();
+	/* Filter out nodes that are not in allowed_mask. */
+	nodes_and(mask, nd->preferred, *allowed_mask);
+	rcu_read_unlock();
+
 	/*
 	 * If there are multiple target nodes, just select one
 	 * target node randomly.
@@ -356,10 +361,16 @@ int next_demotion_node(int node)
 	 * caching issue, which seems more complicated. So selecting
 	 * target node randomly seems better until now.
 	 */
-	target = node_random(&nd->preferred);
-	rcu_read_unlock();
+	if (!nodes_empty(mask))
+		return node_random(&mask);
 
-	return target;
+	/*
+	 * Preferred nodes are not in allowed_mask. Flip the bits in
+	 * allowed_mask to form a used-node mask. Then, use it to get the
+	 * closest demotion target.
+	 */
+	nodes_complement(mask, *allowed_mask);
+	return find_next_best_node(node, &mask);
 }
 
 static void disable_all_demotion_targets(void)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 5ea1dd2b8cce..7a631de46064 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1048,12 +1048,11 @@ static unsigned int demote_folio_list(struct list_head *demote_folios,
 	if (nodes_empty(allowed_mask))
 		return 0;
 
-	target_nid = next_demotion_node(pgdat->node_id);
+	target_nid = next_demotion_node(pgdat->node_id, &allowed_mask);
 	if (target_nid == NUMA_NO_NODE)
 		/* No lower-tier nodes or nodes were hot-unplugged. */
 		return 0;
 
-	if (!node_isset(target_nid, allowed_mask))
-		target_nid = node_random(&allowed_mask);
+
 	mtc.nid = target_nid;
 
 	/* Demotion ignores all cpuset and mempolicy settings */
-- 
2.52.0.457.g6b5491de43-goog
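The selection policy this patch puts into next_demotion_node() can be modelled in userspace. Everything below is a hypothetical sketch, not kernel code: bitmasks stand in for nodemask_t, node_distance_model() and closest_node() are stand-ins for the kernel's NUMA distance table and find_next_best_node(), and where the kernel picks randomly within the preferred set, the model just takes the closest member.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical model: a nodemask is just a bitmask of node ids. */
typedef unsigned long nodemask_t;
#define NODE(n) (1UL << (n))
#define NUMA_NO_NODE (-1)
#define MAX_NODES 8

/* Stand-in distance table: a larger id gap means a farther node. */
static int node_distance_model(int from, int to)
{
	return abs(to - from);
}

/* Closest node in @mask to @from; models find_next_best_node(). */
static int closest_node(int from, nodemask_t mask)
{
	int best = NUMA_NO_NODE, n;

	for (n = 0; n < MAX_NODES; n++)
		if ((mask & NODE(n)) &&
		    (best == NUMA_NO_NODE ||
		     node_distance_model(from, n) < node_distance_model(from, best)))
			best = n;
	return best;
}

/* Selection logic of the reworked next_demotion_node(): pick from
 * preferred AND allowed when that intersection is non-empty (the kernel
 * picks randomly there; any member is valid), otherwise fall back to
 * the closest allowed target. */
static int pick_target(int from, nodemask_t preferred, nodemask_t allowed)
{
	nodemask_t both = preferred & allowed;

	if (both)
		return closest_node(from, both);
	return closest_node(from, allowed);
}
```

For the bug-2 topology, demoting from node 0 with preferred = {3} and allowed = {4,5} now falls back to node 4, the closest allowed target, instead of a random allowed node.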